
Word Embedding Basics

news 2025/7/3 13:36:13

I. Preface

I have been studying NLP recently, and this post records what I have learned about word embedding techniques.

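II. Bag of words

1. Theory

Bag of words counts how many times each word occurs in a sentence or document. It comes in two variants: a deduplicated one, which only records whether a word occurs, and a non-deduplicated one, which records how many times it occurs.

a. Principle and example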

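chinese_docs = [
    "我爱自然语言处理",                # I love natural language processing
    "自然语言处理很有趣",              # Natural language processing is fun
    "我爱学习人工智能",                # I love studying artificial intelligence
    "人工智能和自然语言处理都很有趣",  # AI and natural language processing are both fun
]
Tokenizing each sentence gives the following (tokens shown in sorted order):
['处理', '我', '爱', '自然语言']
['处理', '很', '有趣', '自然语言']
['人工智能', '学习', '我', '爱']
['人工智能', '和', '处理', '很', '有趣', '自然语言', '都']
Next, merge all the tokens and remove duplicates to get the vocabulary, the list of every distinct word:
['人工智能', '和', '处理', '学习', '很', '我', '有趣', '爱', '自然语言', '都']
The length of this vocabulary is the length of every vector. For each token in a sentence, find its position in the vocabulary and add one at that position.
"我爱自然语言处理" maps to the vector [0, 0, 1, 0, 0, 1, 0, 1, 1, 0]
In the deduplicated variant, a word that occurs several times in one sentence after tokenization is counted only once.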

Deduplicated result

"我爱自然语言处理" is tokenized into ['处理', '我', '爱', '自然语言'], and the corresponding vector is:

人工智能	和	处理	学习	很	我	有趣	爱	自然语言	都
0	0	1	0	0	1	0	1	1	0

Non-deduplicated example

"我爱自然语言处理，自然语言很有趣" is tokenized into ['处理', '很', '我', '有趣', '爱', '自然语言', '自然语言'] (punctuation omitted), so 自然语言 is counted twice:

人工智能	和	处理	学习	很	我	有趣	爱	自然语言	都
0	0	1	0	1	1	1	1	2	0
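The two variants differ only in whether repeated tokens accumulate. A minimal sketch, assuming the sentence has already been tokenized (the token list below is written by hand rather than produced by jieba, and `bow_vector` is a hypothetical helper name):

```python
from collections import Counter

def bow_vector(tokens, vocabulary, binary=False):
    """Return a bag-of-words vector for one tokenized sentence.

    binary=False: non-deduplicated variant (raw counts).
    binary=True:  deduplicated variant (1 if the word occurs at all).
    """
    counts = Counter(tokens)  # missing words count as 0
    if binary:
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]

vocabulary = ['人工智能', '和', '处理', '学习', '很', '我', '有趣', '爱', '自然语言', '都']
# Hand-written tokens for "我爱自然语言处理，自然语言很有趣" (punctuation dropped)
tokens = ['我', '爱', '自然语言', '处理', '自然语言', '很', '有趣']

count_vec = bow_vector(tokens, vocabulary)
binary_vec = bow_vector(tokens, vocabulary, binary=True)
print(count_vec)   # [0, 0, 1, 0, 1, 1, 1, 1, 2, 0]
print(binary_vec)  # [0, 0, 1, 0, 1, 1, 1, 1, 1, 0]
```

Only the 自然语言 position differs: 2 in the count vector, 1 in the deduplicated one.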

2. Code implementation

import jieba


def chinese_tokenize(text):
    """Tokenize Chinese text with jieba, dropping whitespace-only tokens."""
    return [word for word in jieba.cut(text) if word.strip()]


def build_chinese_vocabulary(documents):
    """Build the vocabulary: every distinct token across all documents."""
    vocabulary = set()
    for doc in documents:
        tokens = chinese_tokenize(doc)
        vocabulary.update(tokens)
    return sorted(vocabulary)  # sorted() returns a list


def chinese_bow_vectorize(document, vocabulary):
    """Build the (non-deduplicated) bag-of-words vector for one document."""
    tokens = chinese_tokenize(document)
    vector = [0] * len(vocabulary)
    for token in tokens:
        if token in vocabulary:
            index = vocabulary.index(token)
            vector[index] += 1
    return vector


# Example documents
chinese_docs = [
    "我爱自然语言处理",
    "自然语言处理很有趣",
    "我爱学习人工智能",
    "人工智能和自然语言处理都很有趣",
]

# Build the vocabulary
chinese_vocab = build_chinese_vocabulary(chinese_docs)
print("Vocabulary:", chinese_vocab)

# Build a bag-of-words vector for each document
for doc in chinese_docs:
    vector = chinese_bow_vectorize(doc, chinese_vocab)
    print(f"Document: '{doc}'")
    print("Bag-of-words vector:", vector)

Output:

Vocabulary: ['人工智能', '和', '处理', '学习', '很', '我', '有趣', '爱', '自然语言', '都']
Document: '我爱自然语言处理'
Bag-of-words vector: [0, 0, 1, 0, 0, 1, 0, 1, 1, 0]
Document: '自然语言处理很有趣'
Bag-of-words vector: [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]
Document: '我爱学习人工智能'
Bag-of-words vector: [1, 0, 0, 1, 0, 1, 0, 1, 0, 0]
Document: '人工智能和自然语言处理都很有趣'
Bag-of-words vector: [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]

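III. TF-IDF

1. TF theory

Bag of words uses raw counts. In a long document some words occur many times simply because the document is long, so raw counts are not comparable across documents. To fix this, term frequency (TF) divides the number of times a word occurs in a sentence by the total number of words in that sentence.

Principle and example

"我爱自然语言处理" is tokenized into ['处理', '我', '爱', '自然语言'], four words in total. Each occurs once, so each has a term frequency of 1/4.
The corresponding vector is: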

人工智能	和	处理	学习	很	我	有趣	爱	自然语言	都
0	0	0.25	0	0	0.25	0	0.25	0.25	0
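The table can be reproduced in a few lines. This is a minimal sketch that assumes the sentence is already tokenized; `term_frequency` is a hypothetical helper name, not part of any library:

```python
from collections import Counter

def term_frequency(tokens, vocabulary):
    """Count of each vocabulary word divided by the number of tokens."""
    total = len(tokens)
    if total == 0:
        return [0.0] * len(vocabulary)  # avoid dividing by zero on empty input
    counts = Counter(tokens)
    return [counts[word] / total for word in vocabulary]

vocabulary = ['人工智能', '和', '处理', '学习', '很', '我', '有趣', '爱', '自然语言', '都']
tokens = ['我', '爱', '自然语言', '处理']  # hand-written tokens for "我爱自然语言处理"

tf = term_frequency(tokens, vocabulary)
print(tf)  # [0.0, 0.0, 0.25, 0.0, 0.0, 0.25, 0.0, 0.25, 0.25, 0.0]
```

Because every vector is divided by its own document length, its entries sum to 1, which makes short and long documents comparable.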

2. Code implementation

import jieba


def split_document(text):
    """Tokenize with jieba (punctuation is kept as a token)."""
    tokenize = list(jieba.cut(text))
    return tokenize


def build_vector(documents):
    """Build the sorted vocabulary from all documents."""
    vector = set()
    for text in documents:
        tokenize = split_document(text)
        vector.update(tokenize)
    return sorted(vector)  # sorted() returns a list


# Term frequency: the count vector divided by the number of tokens in the document
def vector_tf(documents):
    vector = build_vector(documents)
    vector_len = len(vector)
    for text in documents:
        tokenize = split_document(text)
        bag_of_vector = [0] * vector_len
        for token in tokenize:
            index = vector.index(token)
            bag_of_vector[index] += 1
        # Normalize the counts by document length to get term frequencies
        bag_of_vector = [count / len(tokenize) for count in bag_of_vector]
        print('Document: ', text)
        print('TF vector: ', bag_of_vector)


if __name__ == "__main__":
    chinese_docs = [
        "我爱自然语言处理，自然语言很有意思",
        "自然语言处理很有趣",
        "我爱学习人工智能",
        "人工智能和自然语言处理都很有趣",
    ]
    vector_tf(chinese_docs)

Output (the vocabulary has 12 entries here because 有意思 and the comma "，" are also tokens):

Document:  我爱自然语言处理，自然语言很有意思
TF vector:  [0.0, 0.0, 0.125, 0.0, 0.125, 0.125, 0.125, 0.0, 0.125, 0.25, 0.0, 0.125]
Document:  自然语言处理很有趣
TF vector:  [0.0, 0.0, 0.25, 0.0, 0.25, 0.0, 0.0, 0.25, 0.0, 0.25, 0.0, 0.0]
Document:  我爱学习人工智能
TF vector:  [0.25, 0.0, 0.0, 0.25, 0.0, 0.25, 0.0, 0.0, 0.25, 0.0, 0.0, 0.0]
Document:  人工智能和自然语言处理都很有趣
TF vector:  [0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.0, 0.14285714285714285, 0.0, 0.0, 0.14285714285714285, 0.0, 0.14285714285714285, 0.14285714285714285, 0.0]

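3. IDF theory

We also want to know how rare a word is: the rarer a word, the more informative it is. Rarity is measured by how many documents the word appears in: the total number of documents divided by the number of documents that contain the word.

Principle and example

    "我爱自然语言处理，自然语言很有趣","自然语言处理很有趣","我爱学习人工智能","人工智能和自然语言处理都很有趣"

There are 4 documents and 自然语言 appears in 3 of them, so its rarity is 4 / 3. The raw ratio has a problem: it becomes extreme when the corpus is very small or very large, so we dampen it with a logarithm (the natural log here). Formula: IDF(word) = log(total documents / documents containing the word). For 自然语言 this gives log(4 / 3) ≈ 0.2877.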

人工智能	和	处理	学习	很	我	有趣	爱	自然语言	都
0.693	1.386	0.287	1.386	0.287	0.693	0.287	0.693	0.287	1.386
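As a rough check of these values, the sketch below computes IDF with the standard library only. The tokenized documents are written by hand (punctuation dropped), so they are an assumption rather than actual jieba output, and `inverse_document_frequency` is a hypothetical helper name:

```python
import math

def inverse_document_frequency(tokenized_docs, vocabulary):
    """IDF(word) = ln(number of documents / number of documents containing word)."""
    n_docs = len(tokenized_docs)
    idf = {}
    for word in vocabulary:
        # Document frequency: in how many documents the word appears at least once
        doc_freq = sum(1 for doc in tokenized_docs if word in doc)
        idf[word] = math.log(n_docs / doc_freq)
    return idf

tokenized_docs = [
    ['我', '爱', '自然语言', '处理', '自然语言', '很', '有趣'],
    ['自然语言', '处理', '很', '有趣'],
    ['我', '爱', '学习', '人工智能'],
    ['人工智能', '和', '自然语言', '处理', '都', '很', '有趣'],
]
# The vocabulary is built from the documents, so every word has doc_freq >= 1
vocabulary = sorted({word for doc in tokenized_docs for word in doc})

idf = inverse_document_frequency(tokenized_docs, vocabulary)
for word in vocabulary:
    print(word, idf[word])
```

Note that 自然语言 occurs twice in the first document but still contributes only one to its document frequency.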

4. Code implementation

import jieba
import numpy as np


def split_document(text):
    tokenize = list(jieba.cut(text))
    return tokenize


def build_vector(documents):
    vector = set()
    for text in documents:
        tokenize = split_document(text)
        vector.update(tokenize)
    return sorted(vector)  # sorted() returns a list


# TF: the count vector divided by the number of tokens in the document
def vector_tf(documents):
    vector = build_vector(documents)
    vector_len = len(vector)
    for text in documents:
        tokenize = split_document(text)
        bag_of_vector = [0] * vector_len
        for token in tokenize:
            index = vector.index(token)
            bag_of_vector[index] += 1
        bag_of_vector = np.array(bag_of_vector)
        tokenizer_sum = np.sum(bag_of_vector)
        bag_of_vector = bag_of_vector / tokenizer_sum
        bag_of_vector = bag_of_vector.tolist()
        print('Document: ', text)
        print('TF vector: ', bag_of_vector)


# IDF: log(number of documents / number of documents containing the word)
def vector_idf(documents):
    vector = build_vector(documents)
    vector_len = len(vector)
    all_vector = []
    for text in documents:
        tokenize = split_document(text)
        bag_of_vector = [0] * vector_len
        for token in tokenize:
            index = vector.index(token)
            bag_of_vector[index] += 1
        all_vector.append(bag_of_vector)
    all_vector = np.array(all_vector)
    all_vector_shape = all_vector.shape
    vector_rate = [0] * all_vector_shape[1]
    for i in range(all_vector_shape[1]):
        dim_value = all_vector[..., i]  # column i: counts of word i in every document
        mask = dim_value > 0            # which documents contain word i
        num = sum(mask)                 # document frequency of word i
        vector_rate[i] = np.log(all_vector_shape[0] / num)
    print(vector)
    print(vector_rate)


if __name__ == "__main__":
    chinese_docs = [
        "我爱自然语言处理，自然语言很有趣",
        "自然语言处理很有趣",
        "我爱学习人工智能",
        "人工智能和自然语言处理都很有趣",
    ]
    # vector_tf(chinese_docs)
    vector_idf(chinese_docs)

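5. TF-IDF

TF-IDF is simply TF * IDF. It can be used for keyword extraction: within a document, the words with the highest scores are the keywords.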
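To illustrate keyword extraction, here is a compact sketch that scores every word of every document and ranks them. It assumes hand-tokenized input (not jieba output) and uses the unsmoothed natural-log IDF; `tf_idf_scores` is a hypothetical helper name:

```python
import math
from collections import Counter

def tf_idf_scores(tokenized_docs):
    """Return one {word: tf * idf} dict per document."""
    n_docs = len(tokenized_docs)
    # Document frequency: set(doc) makes each document count a word once
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    scores = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

tokenized_docs = [
    ['我', '爱', '自然语言', '处理', '自然语言', '很', '有趣'],
    ['自然语言', '处理', '很', '有趣'],
    ['我', '爱', '学习', '人工智能'],
    ['人工智能', '和', '自然语言', '处理', '都', '很', '有趣'],
]

scores = tf_idf_scores(tokenized_docs)
for doc, doc_scores in zip(tokenized_docs, scores):
    ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
    print(''.join(doc), '->', ranked[:2])
```

In the third document 学习 appears in no other document, so it gets the highest score there. In such a tiny corpus several words often tie, so the ranking is only illustrative.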

6. Code implementation

import jieba
import numpy as np


def split_document(text):
    tokenize = list(jieba.cut(text))
    return tokenize


def build_vector(documents):
    vector = set()
    for text in documents:
        tokenize = split_document(text)
        vector.update(tokenize)
    return sorted(vector)  # sorted() returns a list


# TF: the count vector divided by the number of tokens in the document
def vector_tf(documents):
    vector = build_vector(documents)
    vector_len = len(vector)
    all_vector = []
    for text in documents:
        tokenize = split_document(text)
        bag_of_vector = [0] * vector_len
        for token in tokenize:
            index = vector.index(token)
            bag_of_vector[index] += 1
        bag_of_vector = np.array(bag_of_vector)
        tokenizer_sum = np.sum(bag_of_vector)
        bag_of_vector = bag_of_vector / tokenizer_sum
        bag_of_vector = bag_of_vector.tolist()
        all_vector.append(bag_of_vector)
        print('Document: ', text)
        print('TF vector: ', bag_of_vector)
    return all_vector


# IDF: log(number of documents / number of documents containing the word)
def vector_idf(documents):
    vector = build_vector(documents)
    vector_len = len(vector)
    all_vector = []
    for text in documents:
        tokenize = split_document(text)
        bag_of_vector = [0] * vector_len
        for token in tokenize:
            index = vector.index(token)
            bag_of_vector[index] += 1
        all_vector.append(bag_of_vector)
    all_vector = np.array(all_vector)
    all_vector_shape = all_vector.shape
    vector_rate = [0] * all_vector_shape[1]
    for i in range(all_vector_shape[1]):
        dim_value = all_vector[..., i]  # column i: counts of word i in every document
        mask = dim_value > 0            # which documents contain word i
        num = sum(mask)                 # document frequency of word i
        vector_rate[i] = np.log(all_vector_shape[0] / num)
    print(vector)
    print(vector_rate)
    return vector_rate


# TF-IDF: each row of the TF matrix multiplied elementwise by the IDF vector
def tf_idf(vector1, vector2):
    vector1 = np.array(vector1)  # shape (documents, vocabulary)
    vector2 = np.array(vector2)  # shape (vocabulary,), broadcast over rows
    tf_idf_out = vector1 * vector2
    print('=' * 30)
    print(tf_idf_out)


if __name__ == "__main__":
    chinese_docs = [
        "我爱自然语言处理，自然语言很有趣",
        "自然语言处理很有趣",
        "我爱学习人工智能",
        "人工智能和自然语言处理都很有趣",
    ]
    vector1 = vector_tf(chinese_docs)
    vector2 = vector_idf(chinese_docs)
    tf_idf(vector1, vector2)

IV. Word embedding

1. Background

The two methods above do not capture the meaning of a sentence. Word embedding is meant to address that.

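2. word2vec

First prepare a corpus and represent every word in its sentences as a vector. Training then predicts the middle word from the words around it. Take the sentence 美女很漂亮 ("the beautiful woman is very pretty"): to predict 很 ("very"), the preceding word is 美女 and the following word is 漂亮.
The vectors are randomly initialized at the start, and a neural network is then trained. Training updates both the embedding matrix (the table holding one vector per vocabulary word) and the weights of the network. Note that what we want to keep is the embedding matrix, not the network weights; the matrix is what we use afterwards. To compare two sentences, we look up the vectors of their words in the embedding matrix and compute the cosine similarity. This method is called CBOW.

When the middle word is used to predict the words on both sides instead, the method is called skip-gram. Everything else is the same as above.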
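The difference between the two training setups is easiest to see in the training pairs they produce. The sketch below only generates the pairs, it does not train a network; `cbow_pairs`, `skipgram_pairs` and the `window` parameter are hypothetical names chosen for illustration:

```python
def cbow_pairs(tokens, window=1):
    """CBOW: (context words, center word), the context predicts the center."""
    pairs = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, center))
    return pairs

def skipgram_pairs(tokens, window=1):
    """Skip-gram: (center word, context word), the center predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ['美女', '很', '漂亮']
print(cbow_pairs(tokens))
# [(['很'], '美女'), (['美女', '漂亮'], '很'), (['很'], '漂亮')]
print(skipgram_pairs(tokens))
# [('美女', '很'), ('很', '美女'), ('很', '漂亮'), ('漂亮', '很')]
```

The pair (['美女', '漂亮'], '很') is exactly the example from the text: the two neighbors are the input and 很 is the target.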

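3. Drawbacks

This method ignores word order. For example, 我爱你 ("I love you") and 你爱我 ("you love me") contain the same words after tokenization, so the two sentences end up with the same word vectors, which is clearly unreasonable. It also cannot handle polysemy: 苹果 ("apple") in 苹果手机 (an Apple phone) and 苹果 the fruit mean different things, yet they share one vector, which is also unreasonable. The embedding matrix is prepared in advance, so word vectors cannot be computed dynamically from context.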
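The word-order problem can be demonstrated with a toy example. The sketch below assumes sentence vectors are formed by averaging static word vectors, which is one common choice and not necessarily what the author used; the three-dimensional vectors are made-up values, not trained embeddings:

```python
import math

# Hypothetical static embedding table: one fixed vector per word
embedding = {
    '我': [0.2, 0.7, 0.1],
    '爱': [0.9, 0.1, 0.3],
    '你': [0.4, 0.5, 0.8],
}

def sentence_vector(tokens):
    """Average the word vectors; the order of the tokens has no effect."""
    dim = len(next(iter(embedding.values())))
    return [sum(embedding[t][d] for t in tokens) / len(tokens) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = sentence_vector(['我', '爱', '你'])  # "I love you"
v2 = sentence_vector(['你', '爱', '我'])  # "you love me"
print(cosine(v1, v2))  # approximately 1.0: the two sentences are indistinguishable
```

The same table also returns one vector for 苹果 regardless of the surrounding words, which is the polysemy problem described above.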

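V. Computing word vectors dynamically from context: Transformer attention

1. The BGE embedding model: different vectors for different contexts

BGE is trained with contrastive learning. First comes data preparation: each sample is paired with one positive sample and three negative samples. The positive sample is semantically similar to the sample; the negative samples are not. Then we compute the loss: the similarity between the sample and its positive should be as high as possible, and the similarity between the sample and each negative as low as possible. In other words, the ratio S_pos / (S_pos + S_neg1 + S_neg2 + S_neg3) should be as large as possible. After training we keep the network parameters. The key is data preparation: the quality of the positive and negative samples matters a great deal.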
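One common way to turn this ratio into a loss is a softmax-style (InfoNCE-like) objective: each S is taken as exp(similarity / temperature) and the loss is the negative log of the ratio. The sketch below follows that assumption; it is not confirmed to be the exact BGE recipe, and the `temperature` value and similarity numbers are made up for illustration:

```python
import math

def contrastive_loss(sim_pos, sims_neg, temperature=0.05):
    """-log( S_pos / (S_pos + sum of S_neg) ), with S = exp(similarity / temperature)."""
    logits = [sim_pos / temperature] + [s / temperature for s in sims_neg]
    # Log-sum-exp with the maximum subtracted for numerical stability
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]

# Hypothetical cosine similarities between a sample and its 1 positive / 3 negatives
good = contrastive_loss(0.9, [0.1, 0.2, 0.0])  # positive is clearly the closest
bad = contrastive_loss(0.3, [0.8, 0.7, 0.6])   # negatives are closer than the positive
print(good, bad)
```

The loss is near zero when the positive dominates the denominator and grows when any negative is more similar than the positive, which matches the "ratio as large as possible" goal.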


http://www.xdnf.cn/news/522559.html