Text sentiment classification with N-Gram, a precursor of language models, and the Bag-of-Words (BoW) text representation

news 2025/7/4 11:08:09

N-Gram, a precursor of language models, and Bag-of-Words, a simple text representation

1. Introduction to text representation models

(1) Bag-of-Words (BoW)

What is it?

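Definition: represents a text as a vector of word frequencies, ignoring word order and grammar and recording only how many times each word occurs.
Example:
- Sentence 1: I love cats and cats love me.
- Sentence 2: Dogs love me too.
- Vocabulary (case-insensitive): ["I", "love", "cats", "and", "me", "dogs", "too"]
- BoW vectors:
  Sentence 1: [1, 2, 2, 1, 1, 0, 0]
  Sentence 2: [0, 1, 0, 0, 1, 1, 1]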
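
The vectors in this example can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper names `tokenize` and `bow_vector` are made up for illustration, and the tokenizer (lowercasing, alphabetic runs only) is an assumption, not what any particular library does.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic runs only (drops punctuation)."""
    return re.findall(r"[a-z]+", text.lower())

def bow_vector(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]

# Vocabulary from the example, lowercased to match the tokenizer
vocab = ["i", "love", "cats", "and", "me", "dogs", "too"]

vec1 = bow_vector("I love cats and cats love me.", vocab)
vec2 = bow_vector("Dogs love me too.", vocab)
print(vec1)  # [1, 2, 2, 1, 1, 0, 0]
print(vec2)  # [0, 1, 0, 0, 1, 1, 1]
```

Each position in the vector belongs to one vocabulary word, so the vector length always equals the vocabulary size.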

Why is it needed?

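Simple and efficient: well suited to early text classification tasks (such as spam detection and sentiment analysis).
Highly interpretable: word frequencies directly reflect the topic of a text.
Limitations:
- Ignores word order: "cat eats fish" and "fish eats cat" have identical bag-of-words vectors.
- High-dimensional and sparse: the vector dimension equals the vocabulary size, so it explodes as the vocabulary grows.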
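
The word-order limitation is easy to check directly. The sketch below uses whitespace tokenization and two hypothetical helpers, `bag` and `bigrams`, to show that the two sentences share one bag of words while their adjacent word pairs differ.

```python
from collections import Counter

def bag(text):
    """Bag of words as a word -> count mapping (whitespace tokenization)."""
    return Counter(text.lower().split())

def bigrams(text):
    """Ordered list of adjacent word pairs."""
    words = text.lower().split()
    return list(zip(words, words[1:]))

a, b = "cat eats fish", "fish eats cat"
same_bag = bag(a) == bag(b)
same_bigrams = bigrams(a) == bigrams(b)
print(same_bag)      # True: the bags cannot tell the sentences apart
print(same_bigrams)  # False: the word pairs preserve local order
```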

(2) N-Gram

What is it?

Definition: splits a text into fragments of N consecutive words (or characters), capturing local context.
Examples:
- Sentence: "I love cats"
- Bigrams (2-grams, N=2): ["I love", "love cats"]
- Trigrams (3-grams, N=3): ["I love cats"]
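
Extracting N-Grams amounts to sliding a window of length N over a sequence. A minimal sketch, assuming whitespace tokenization; the function name `ngrams` is chosen here for illustration. The same function covers word-level and character-level N-Grams.

```python
def ngrams(items, n):
    """Return all contiguous length-n windows over a sequence."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

words = "I love cats".split()
word_bigrams = [" ".join(g) for g in ngrams(words, 2)]
word_trigrams = [" ".join(g) for g in ngrams(words, 3)]
char_trigrams = ["".join(g) for g in ngrams("cats", 3)]

print(word_bigrams)   # ['I love', 'love cats']
print(word_trigrams)  # ['I love cats']
print(char_trigrams)  # ['cat', 'ats']
```

A sequence of length L yields L - N + 1 N-Grams, and none at all when it is shorter than N.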

Why is it needed?

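Captures local word order: finer-grained than BoW, and able to represent phrases.
Models context: a language model predicts the next word from N-Gram probabilities estimated from counts.
Limitations:
- Data sparsity: long N-Grams may never appear in the training set.
- Cannot model long-range dependencies (such as paragraph-level relations).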
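
To make the language-model idea and the sparsity problem concrete, here is a small sketch of a bigram model using maximum-likelihood estimates, P(word | prev) = count(prev, word) / count(prev). The toy corpus, the `<s>` and `</s>` boundary markers, and the function name `bigram_prob` are assumptions made for this illustration.

```python
from collections import Counter

# Toy corpus (made up for illustration)
corpus = [
    "i love cats",
    "i love dogs",
    "cats love me",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    # <s> and </s> mark the sentence boundaries
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigram_counts.update(tokens[:-1])             # every token that has a successor
    bigram_counts.update(zip(tokens, tokens[1:]))  # adjacent pairs

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("i", "love"))     # 1.0: "i" is always followed by "love"
print(bigram_prob("love", "cats"))  # one of three continuations of "love"
print(bigram_prob("love", "fish"))  # 0.0: this bigram never occurs in the corpus
```

The zero probability for the unseen pair is the data sparsity problem in miniature; practical models usually apply some form of smoothing to avoid it.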

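2. Hands-on project: text classification with BoW and N-Gram

Task goal

Classify the sentiment of movie reviews (positive/negative) using BoW and Bigram features, and compare the results.

Code implementation

Environment setup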

pip install numpy scikit-learn nltk

Dataset

A simple custom dataset is used here (a real project could use the IMDB dataset):

# Custom data: 0 = negative, 1 = positive
texts = [
    "I hate this movie",           # 0
    "This film is terrible",       # 0
    "I love this wonderful film",  # 1
    "What a great movie",          # 1
]
labels = [0, 0, 1, 1]

Step 1: Bag-of-Words feature extraction

from sklearn.feature_extraction.text import CountVectorizer

# Create the BoW vectorizer
bow_vectorizer = CountVectorizer()
bow_features = bow_vectorizer.fit_transform(texts)

print("BoW vocabulary:", bow_vectorizer.get_feature_names_out())
print("BoW feature matrix:\n", bow_features.toarray())

Output:

BoW vocabulary: ['film' 'great' 'hate' 'is' 'love' 'movie' 'terrible' 'this' 'what' 'wonderful']
BoW feature matrix:
 [[0 0 1 0 0 1 0 1 0 0]
 [1 0 0 1 0 0 1 1 0 0]
 [1 0 0 0 1 0 0 1 0 1]
 [0 1 0 0 0 1 0 0 1 0]]

Step 2: Bigram feature extraction

from sklearn.feature_extraction.text import CountVectorizer

# Create the Bigram vectorizer (N=2)
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_features = bigram_vectorizer.fit_transform(texts)

print("Bigram vocabulary:", bigram_vectorizer.get_feature_names_out())
print("Bigram feature matrix:\n", bigram_features.toarray())

Output:

Bigram vocabulary: ['film is' 'great movie' 'hate this' 'is terrible' 'love this' 'this film'
 'this movie' 'this wonderful' 'what great' 'wonderful film']
Bigram feature matrix:
 [[0 0 1 0 0 0 1 0 0 0]
 [1 0 0 1 0 1 0 0 0 0]
 [0 0 0 0 1 0 0 1 0 1]
 [0 1 0 0 0 0 0 0 1 0]]

Step 3: train the classification models

from sklearn.naive_bayes import MultinomialNB

# Demo only: the dataset is tiny, so the models are trained and evaluated on
# the same data. A real project must hold out a separate test set.
X_train_bow, X_test_bow = bow_features, bow_features
X_train_bigram, X_test_bigram = bigram_features, bigram_features
y_train, y_test = labels, labels

# Train the BoW model
model_bow = MultinomialNB()
model_bow.fit(X_train_bow, y_train)
print("BoW model accuracy:", model_bow.score(X_test_bow, y_test))

# Train the Bigram model
model_bigram = MultinomialNB()
model_bigram.fit(X_train_bigram, y_train)
print("Bigram model accuracy:", model_bigram.score(X_test_bigram, y_test))

Output:

BoW model accuracy: 1.0
Bigram model accuracy: 1.0

Extended version: a larger custom dataset with an 80/20 train/test split

# Custom data: 0 = negative, 1 = positive
texts = [
    "I hate this movie",  # 0
    "This film is terrible",  # 0
    "I love this wonderful film",  # 1
    "What a great movie",  # 1
    "I dislike this film",  # 0
    "This movie is amazing",  # 1
    "I enjoy this film",  # 1
    "This film is awful",  # 0
    "I adore this movie",  # 1
    "This film is fantastic",  # 1
    "I loathe this movie",  # 0
    "This movie is boring",  # 0
    "I appreciate this film",  # 1
    "This film is dreadful",  # 0
    "I cherish this movie",  # 1
    "This film is mediocre",  # 0
    "I detest this movie",  # 0
    "This film is superb",  # 1
    "I value this film",  # 1
    "This movie is subpar",  # 0
    "I respect this film",  # 1
    "This film is excellent",  # 1
    "I abhor this movie",  # 0
    "This film is lackluster",  # 0
    "I admire this film",  # 1
    "This movie is unsatisfactory",  # 0
    "I relish this film",  # 1
    "This film is remarkable",  # 1
    "I scorn this movie",  # 0
    "This film is outstanding",  # 1
    "I disapprove of this film",  # 0
    "This movie is unremarkable",  # 0
    "I treasure this film",  # 1
    "This film is commendable",  # 1
    "I find this movie distasteful",  # 0
    "This film is praiseworthy",  # 1
    "I think this movie is substandard",  # 0
    "This film is noteworthy",  # 1
    "I consider this movie to be poor",  # 0
    "This film is exceptional",  # 1
    "I feel this movie is inadequate",  # 0
    "This film is extraordinary",  # 1
    "I regard this movie as unsatisfactory",  # 0
    "This film is phenomenal",  # 1
    "I perceive this movie as disappointing",  # 0
    "This film is stellar",  # 1
    "I think this movie is mediocre",  # 0
]
labels = [
    0, 0, 1, 1, 0, 1, 1, 0, 1, 1,
    0, 0, 1, 0, 1, 0, 0, 1, 1, 0,
    1, 1, 0, 0, 1, 0, 1, 1, 0, 1,
    0, 0, 1, 1, 0, 1, 0, 1, 0, 1,
    0, 1, 0, 1, 0, 1, 0,
]
print("Number of texts:", len(texts))
print("Number of labels:", len(labels))

# Import the required libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Create the BoW vectorizer
bow_vectorizer = CountVectorizer()
bow_features = bow_vectorizer.fit_transform(texts)
print("BoW vocabulary:", bow_vectorizer.get_feature_names_out())
print("BoW feature matrix:\n", bow_features.toarray())

# Create the Bigram vectorizer (N=2)
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_features = bigram_vectorizer.fit_transform(texts)
print("Bigram vocabulary:", bigram_vectorizer.get_feature_names_out())
print("Bigram feature matrix:\n", bigram_features.toarray())

# Split into training and test sets: the first 80% of the samples are used
# for training, the remaining 20% for testing
train_ratio = 0.8
train_len = int(len(texts) * train_ratio)
X_train_bow, X_test_bow = bow_features[:train_len], bow_features[train_len:]
X_train_bigram, X_test_bigram = bigram_features[:train_len], bigram_features[train_len:]
y_train, y_test = labels[:train_len], labels[train_len:]

# Train the BoW model
model_bow = MultinomialNB()
model_bow.fit(X_train_bow, y_train)
print("BoW model accuracy:", model_bow.score(X_test_bow, y_test))

# Train the Bigram model
model_bigram = MultinomialNB()
model_bigram.fit(X_train_bigram, y_train)
print("Bigram model accuracy:", model_bigram.score(X_test_bigram, y_test))
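
The script above fits the vectorizers on all texts before splitting and takes the last 20% of an ordered list as the test set. One possible alternative, sketched here under assumptions, is to shuffle with a stratified `train_test_split` and wrap the vectorizer and classifier in a pipeline so that the vocabulary is learned from the training portion only. The twelve reviews are a subset of the dataset, and `test_size=0.25`, `random_state=0`, and `ngram_range=(1, 2)` are arbitrary illustrative choices.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A small subset of the reviews (0 = negative, 1 = positive)
texts = [
    "I hate this movie", "This film is terrible",
    "I love this wonderful film", "What a great movie",
    "I dislike this film", "This movie is amazing",
    "I enjoy this film", "This film is awful",
    "I adore this movie", "This film is fantastic",
    "I loathe this movie", "This movie is boring",
]
labels = [0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0]

# Shuffle and split, keeping the class balance similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# The pipeline fits the vectorizer on the training texts only
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print("Held-out accuracy:", accuracy)
```

With so few samples the accuracy depends heavily on which reviews land in the test set, so a single number from this sketch should not be read as a comparison between BoW and Bigram features.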