当前位置：首页 > news >正文

深度学习N2周：构建词典

news 2025/6/6 6:58:45

🍨 本文为🔗365天深度学习训练营中的学习记录博客
🍖 原作者：K同学啊

本周任务：使用N1周的.txt文件构建词典，停用词请自定义

1.导入数据

from torchtext.vocab import build_vocab_from_iterator
from collections import Counter
from torchtext.data.utils import get_tokenizer
import jieba, re, torchdata = ["我是K同学啊！","我是一个深度学习博主，","这是我的365天深度学习训练营教案，","你可以通过百度、微信搜索关键字【K同学啊】找到我"
]

2.设置分词器

# 中文分词方法
tokenizer = jieba.lcut# 加载自定义词典
jieba.load_userdict("D:/learn/my_dict.txt")

jieba是一个由python编写的中文分词库，用于对中文文本进行分词，其安装方法：

cmd命令：pip install jieba

my_dict.txt文件内容如下：

3.清除标点符号与停用词

在使用jieba进行分词时，可以通过去除标点符号来减少分词结果种的噪音。

# 去除标点符号的函数
def remove_punctuation(text):return re.sub(r'[^\w\s]', '', text)

在使用jieba进行分词时，可以通过去除停用词（即没有具体含义、对文本语义没有影响的词汇，如“的”、“是”、“这”）来减少分词结果中的噪音。标点符号与停用词的去除通常有助于提高文本分类任务的结果。

# 假设我们有一个停用词表，内容如下：
stopwords = set(["的", "这", "是"])# 去除停用词的函数
def remove_stopwords(words):return [word for word in words if word not in stopwords]

4.设置迭代器

# 定义一个迭代器来返回文本数据中的词汇
def yield_tokens(data_iter):for text in data_iter:# 去除标点符号text = remove_punctuation(text)# 分词并生成词汇text = tokenizer(text)# 去除停用词text = remove_stopwords(text)yield text

5.构建词典

# 使用 build_vocab_from_iterator 来构建词汇表
vocab = build_vocab_from_iterator(yield_tokens(data), specials=["<unk>"])# 将未知的词汇索引为0
vocab.set_default_index(vocab["<unk>"])

.build_vocab_from_iterator()函数详解

作用：从一个可迭代对象中统计token的频次，并返回一个vocab（词汇字典）

6.文本数字化

# 打印词汇表中的内容
print("词典大小:", len(vocab))
print("词典内部映射:", vocab.get_stoi())text = "这是我的365天深度学习训练营教案"
words = remove_stopwords(jieba.lcut(text))
print("\n")
print("jieba分词后的文本: ", jieba.lcut(text))
print("去除停用词后的文本: ", remove_stopwords(jieba.lcut(text)))
print("数字化后的文本: ", [vocab[word] for word in words])

结果：

查了很多资料，依旧没有解决这个问题。但通过看别人的帖子，理解了算法的原理。