使用Gemini, LangChain, Gradio打造一个书籍推荐系统 (第四部分)
第四部分:为每本书加上情绪标签
import pandas as pd
books = pd.read_csv("books_with_categories.csv")
from transformers import pipeline
classifier = pipeline("text-classification",model="j-hartmann/emotion-english-distilroberta-base",top_k = None,device = 0)
classifier("I love this!")
transformers 是 Hugging Face 提供的自然语言处理工具包。
pipeline 是它的一个便捷工具,可以快速调用预训练的模型进行各种 NLP 任务(如文本分类、生成、翻译等)。
pipeline(“text-classification”, …) 表示创建一个 文本分类任务 的 pipeline。
model=“j-hartmann/emotion-english-distilroberta-base” 指定所使用的预训练模型,这个模型专门用于 情绪分析(识别情绪,如喜悦、悲伤、愤怒等)。
top_k=None 表示返回 所有可能的分类及其概率分数。
device=0 表示使用 GPU 0 加速计算(如果有 GPU)。如果没有 GPU,可以改为 device=-1,表示使用 CPU。
将字符串 “I love this!” 输入到模型中,让模型对这段文本进行情绪分类预测。
返回结果是一个列表,里面包含每个可能的情绪类别及其概率。
Device set to use cuda:0
[[{'label': 'joy', 'score': 0.9771687984466553},{'label': 'surprise', 'score': 0.00852868054062128},{'label': 'neutral', 'score': 0.005764591973274946},{'label': 'anger', 'score': 0.004419785924255848},{'label': 'sadness', 'score': 0.0020923891570419073},{'label': 'disgust', 'score': 0.001611991785466671},{'label': 'fear', 'score': 0.00041385178337804973}]]
books["description"][0]
A NOVEL THAT READERS and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives. John Ames is a preacher, the son of a preacher and the grandson (both maternal and paternal) of preachers. It’s 1956 in Gilead, Iowa, towards the end of the Reverend Ames’s life, and he is absorbed in recording his family’s story, a legacy for the young son he will never see grow up. Haunted by his grandfather’s presence, John tells of the rift between his grandfather and his father: the elder, an angry visionary who fought for the abolitionist cause, and his son, an ardent pacifist. He is troubled, too, by his prodigal namesake, Jack (John Ames) Boughton, his best friend’s lost son who returns to Gilead searching for forgiveness and redemption. Told in John Ames’s joyous, rambling voice that finds beauty, humour and truth in the smallest of life’s details, Gilead is a song of celebration and acceptance of the best and the worst the world has to offer. At its heart is a tale of the sacred bonds between fathers and sons, pitch-perfect in style and story, set to dazzle critics and readers alike.
classifier(books["description"][0])
[[{'label': 'fear', 'score': 0.6548413634300232},{'label': 'neutral', 'score': 0.16985207796096802},{'label': 'sadness', 'score': 0.11640888452529907},{'label': 'surprise', 'score': 0.02070062793791294},{'label': 'disgust', 'score': 0.019100705161690712},{'label': 'joy', 'score': 0.015161297284066677},{'label': 'anger', 'score': 0.003935146611183882}]]
classifier(books["description"][0].split("."))
books[“description”] 表示获取 books 表格中的 description(描述)这一列,返回一个 Series。
[0] 表示取这一列中的第一个元素,也就是第 0 行的 description。
.split(“.”),这是 字符串的 split() 方法,按 句号 . 拆分文本。
它会将描述文本按照句子分开,得到一个句子列表。
调用之前创建的 情绪分类模型。
它会将句子列表中的每个句子作为输入,分别进行情绪分类。
返回的是 每个句子的情绪预测结果列表,每个结果都是一组情绪及其分数。
[[{'label': 'surprise', 'score': 0.7296026945114136},{'label': 'neutral', 'score': 0.1403856873512268},{'label': 'fear', 'score': 0.06816219538450241},{'label': 'joy', 'score': 0.04794241115450859},{'label': 'anger', 'score': 0.009156348183751106},{'label': 'disgust', 'score': 0.00262847519479692},{'label': 'sadness', 'score': 0.0021221605129539967}],[{'label': 'neutral', 'score': 0.44937071204185486},{'label': 'disgust', 'score': 0.2735914885997772},{'label': 'joy', 'score': 0.10908304899930954},{'label': 'sadness', 'score': 0.09362724423408508},{'label': 'anger', 'score': 0.040478333830833435},{'label': 'surprise', 'score': 0.02697017975151539},{'label': 'fear', 'score': 0.006879060063511133}],[{'label': 'neutral', 'score': 0.6462162137031555},{'label': 'sadness', 'score': 0.2427332103252411},{'label': 'disgust', 'score': 0.04342261329293251},{'label': 'surprise', 'score': 0.028300540521740913},{'label': 'joy', 'score': 0.014211442321538925},{'label': 'fear', 'score': 0.014084079302847385},{'label': 'anger', 'score': 0.01103188470005989}],[{'label': 'fear', 'score': 0.928167998790741},{'label': 'anger', 'score': 0.03219102695584297},{'label': 'neutral', 'score': 0.012808729894459248},{'label': 'sadness', 'score': 0.008756889030337334},{'label': 'surprise', 'score': 0.008597911335527897},{'label': 'disgust', 'score': 0.008431846275925636},{'label': 'joy', 'score': 0.001045582932420075}],[{'label': 'sadness', 'score': 0.9671575427055359},{'label': 'neutral', 'score': 0.015104170888662338},{'label': 'disgust', 'score': 0.006480592768639326},{'label': 'fear', 'score': 0.005393994972109795},{'label': 'surprise', 'score': 0.0022869433742016554},{'label': 'anger', 'score': 0.0018428893527016044},{'label': 'joy', 'score': 0.0017338789766654372}],[{'label': 'joy', 'score': 0.9327971935272217},{'label': 'disgust', 'score': 0.03771771863102913},{'label': 'neutral', 'score': 0.01589190773665905},{'label': 'sadness', 'score': 0.006444551516324282},{'label': 'anger', 'score': 0.005025018472224474},{'label': 'surprise', 'score': 0.0015812073834240437},{'label': 'fear', 'score': 0.0005423100665211678}],[{'label': 'joy', 'score': 0.6528703570365906},{'label': 'neutral', 'score': 0.25427502393722534},{'label': 'surprise', 'score': 0.0680830255150795},{'label': 'sadness', 'score': 0.009908979758620262},{'label': 'disgust', 'score': 0.006512209307402372},{'label': 'anger', 'score': 0.00482131028547883},{'label': 'fear', 'score': 0.0035290152300149202}],[{'label': 'neutral', 'score': 0.5494765639305115},{'label': 'sadness', 'score': 0.1116902083158493},{'label': 'disgust', 'score': 0.10400670021772385},{'label': 'surprise', 'score': 0.07876556366682053},{'label': 'anger', 'score': 0.0641336739063263},{'label': 'fear', 'score': 0.05136282742023468},{'label': 'joy', 'score': 0.040564440190792084}]]
sentences = books["description"][0].split(".")
predictions = classifier(sentences)
sentences[0]
A NOVEL THAT READERS and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives
predictions[0]
[{'label': 'surprise', 'score': 0.7296026945114136},{'label': 'neutral', 'score': 0.1403856873512268},{'label': 'fear', 'score': 0.06816219538450241},{'label': 'joy', 'score': 0.04794241115450859},{'label': 'anger', 'score': 0.009156348183751106},{'label': 'disgust', 'score': 0.00262847519479692},{'label': 'sadness', 'score': 0.0021221605129539967}]
sentences[3]
Haunted by his grandfather’s presence, John tells of the rift between his grandfather and his father: the elder, an angry visionary who fought for the abolitionist cause, and his son, an ardent pacifist
predictions[3]
[{'label': 'fear', 'score': 0.928167998790741},{'label': 'anger', 'score': 0.03219102695584297},{'label': 'neutral', 'score': 0.012808729894459248},{'label': 'sadness', 'score': 0.008756889030337334},{'label': 'surprise', 'score': 0.008597911335527897},{'label': 'disgust', 'score': 0.008431846275925636},{'label': 'joy', 'score': 0.001045582932420075}]
predictions
[[{'label': 'surprise', 'score': 0.7296026945114136},{'label': 'neutral', 'score': 0.1403856873512268},{'label': 'fear', 'score': 0.06816219538450241},{'label': 'joy', 'score': 0.04794241115450859},{'label': 'anger', 'score': 0.009156348183751106},{'label': 'disgust', 'score': 0.00262847519479692},{'label': 'sadness', 'score': 0.0021221605129539967}],[{'label': 'neutral', 'score': 0.44937071204185486},{'label': 'disgust', 'score': 0.2735914885997772},{'label': 'joy', 'score': 0.10908304899930954},{'label': 'sadness', 'score': 0.09362724423408508},{'label': 'anger', 'score': 0.040478333830833435},{'label': 'surprise', 'score': 0.02697017975151539},{'label': 'fear', 'score': 0.006879060063511133}],[{'label': 'neutral', 'score': 0.6462162137031555},{'label': 'sadness', 'score': 0.2427332103252411},{'label': 'disgust', 'score': 0.04342261329293251},{'label': 'surprise', 'score': 0.028300540521740913},{'label': 'joy', 'score': 0.014211442321538925},{'label': 'fear', 'score': 0.014084079302847385},{'label': 'anger', 'score': 0.01103188470005989}],[{'label': 'fear', 'score': 0.928167998790741},{'label': 'anger', 'score': 0.03219102695584297},{'label': 'neutral', 'score': 0.012808729894459248},{'label': 'sadness', 'score': 0.008756889030337334},{'label': 'surprise', 'score': 0.008597911335527897},{'label': 'disgust', 'score': 0.008431846275925636},{'label': 'joy', 'score': 0.001045582932420075}],[{'label': 'sadness', 'score': 0.9671575427055359},{'label': 'neutral', 'score': 0.015104170888662338},{'label': 'disgust', 'score': 0.006480592768639326},{'label': 'fear', 'score': 0.005393994972109795},{'label': 'surprise', 'score': 0.0022869433742016554},{'label': 'anger', 'score': 0.0018428893527016044},{'label': 'joy', 'score': 0.0017338789766654372}],[{'label': 'joy', 'score': 0.9327971935272217},{'label': 'disgust', 'score': 0.03771771863102913},{'label': 'neutral', 'score': 0.01589190773665905},{'label': 'sadness', 'score': 0.006444551516324282},{'label': 'anger', 'score': 0.005025018472224474},{'label': 'surprise', 'score': 0.0015812073834240437},{'label': 'fear', 'score': 0.0005423100665211678}],[{'label': 'joy', 'score': 0.6528703570365906},{'label': 'neutral', 'score': 0.25427502393722534},{'label': 'surprise', 'score': 0.0680830255150795},{'label': 'sadness', 'score': 0.009908979758620262},{'label': 'disgust', 'score': 0.006512209307402372},{'label': 'anger', 'score': 0.00482131028547883},{'label': 'fear', 'score': 0.0035290152300149202}],[{'label': 'neutral', 'score': 0.5494765639305115},{'label': 'sadness', 'score': 0.1116902083158493},{'label': 'disgust', 'score': 0.10400670021772385},{'label': 'surprise', 'score': 0.07876556366682053},{'label': 'anger', 'score': 0.0641336739063263},{'label': 'fear', 'score': 0.05136282742023468},{'label': 'joy', 'score': 0.040564440190792084}]]
sorted(predictions[0], key=lambda x: x["label"])
predictions[0] 是第一个预测结果,形式是一个 字典列表(list of dictionaries),每个字典表示一个标签(情绪)及其得分
sorted() 是 Python 内置的 排序函数,用于对列表中的元素排序,返回一个新的排序后的列表。
排序需要一个 排序依据(key),所以这里用了 key=lambda x: x[“label”]。
key 参数用于告诉 sorted() 按什么排序。
lambda x: x[“label”] 是一个 匿名函数(lambda表达式),表示“取 x(每个字典)的 ‘label’ 值”。
也就是,排序会按照每个标签的 字母顺序 进行。
[{'label': 'anger', 'score': 0.009156348183751106},{'label': 'disgust', 'score': 0.00262847519479692},{'label': 'fear', 'score': 0.06816219538450241},{'label': 'joy', 'score': 0.04794241115450859},{'label': 'neutral', 'score': 0.1403856873512268},{'label': 'sadness', 'score': 0.0021221605129539967},{'label': 'surprise', 'score': 0.7296026945114136}]
import numpy as npemotion_labels = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]
isbn = []
emotion_scores = {label: [] for label in emotion_labels}def calculate_max_emotion_scores(predictions):per_emotion_scores = {label: [] for label in emotion_labels}for prediction in predictions:sorted_predictions = sorted(prediction, key=lambda x: x["label"])for index, label in enumerate(emotion_labels):per_emotion_scores[label].append(sorted_predictions[index]["score"])return {label: np.max(scores) for label, scores in per_emotion_scores.items()}
定义了我们关注的情绪类别。
isbn 用于存放书籍编号
emotion_scores 是一个字典,用于存放每个情绪类别的分数列表。初始化为:
{‘anger’: [], ‘disgust’: [], ‘fear’: [], ‘joy’: [], ‘sadness’: [], ‘surprise’: [], ‘neutral’: []}
定义了一个函数,输入参数是 predictions(通常是一个列表,包含多段文本的情绪预测结果,每段是多个情绪及其分数的列表)。
同样初始化一个字典,用于存储 当前这组预测 中,每种情绪对应的分数。
每个 prediction 是一个情绪字典列表。
将每个 prediction 按 label 字母顺序排序,以便后续按照 emotion_labels 顺序索引。
通过 index 索引到排序后的 sorted_predictions,取出 score。
将该分数加到 per_emotion_scores[label] 对应的列表中。
对每个情绪的分数列表,取最大值 np.max(scores)。
for i in range(10):isbn.append(books["isbn13"][i])sentences = books["description"][i].split(".")predictions = classifier(sentences)max_scores = calculate_max_emotion_scores(predictions)for label in emotion_labels:emotion_scores[label].append(max_scores[label])
把第 i 本书的 ISBN 号码(isbn13 列中的值)添加到 isbn 列表中,方便后续关联。
将第 i 本书的 description(简介文本)按句号 . 分割成多个句子列表。
把所有句子 sentences 送入 classifier(模型),进行情绪分类。
结果 predictions 是一个列表,每个元素是一个句子的预测结果,通常像这样:
[
[{‘label’: ‘joy’, ‘score’: 0.85}, {‘label’: ‘sadness’, ‘score’: 0.10}, {‘label’: ‘anger’, ‘score’: 0.05}],
[{‘label’: ‘joy’, ‘score’: 0.65}, {‘label’: ‘sadness’, ‘score’: 0.30}, {‘label’: ‘anger’, ‘score’: 0.05}],
…
]
调用我们之前写的函数 calculate_max_emotion_scores(),统计每种情绪的最大分数。
遍历我们定义好的情绪标签(emotion_labels),把每种情绪的最大分数添加到 emotion_scores 字典的相应列表中。
emotion_scores
{'anger': [np.float64(0.0641336739063263),np.float64(0.6126185059547424),np.float64(0.0641336739063263),np.float64(0.35148391127586365),np.float64(0.08141247183084488),np.float64(0.2322249710559845),np.float64(0.5381842255592346),np.float64(0.0641336739063263),np.float64(0.30067017674446106),np.float64(0.0641336739063263)],'disgust': [np.float64(0.2735914885997772),np.float64(0.3482844829559326),np.float64(0.10400670021772385),np.float64(0.15072263777256012),np.float64(0.1844954937696457),np.float64(0.727174699306488),np.float64(0.15585491061210632),np.float64(0.10400670021772385),np.float64(0.2794813811779022),np.float64(0.1779276728630066)],'fear': [np.float64(0.928167998790741),np.float64(0.9425278306007385),np.float64(0.9723208546638489),np.float64(0.36070623993873596),np.float64(0.09504325687885284),np.float64(0.05136282742023468),np.float64(0.7474286556243896),np.float64(0.4044959247112274),np.float64(0.9155241250991821),np.float64(0.05136282742023468)],'joy': [np.float64(0.9327971935272217),np.float64(0.7044215202331543),np.float64(0.7672368884086609),np.float64(0.25188079476356506),np.float64(0.040564440190792084),np.float64(0.04337584972381592),np.float64(0.8725654482841492),np.float64(0.040564440190792084),np.float64(0.040564440190792084),np.float64(0.040564440190792084)],'sadness': [np.float64(0.6462162137031555),np.float64(0.887939453125),np.float64(0.5494765639305115),np.float64(0.732685387134552),np.float64(0.8843895196914673),np.float64(0.6213927268981934),np.float64(0.7121942639350891),np.float64(0.5494765639305115),np.float64(0.8402896523475647),np.float64(0.8603722453117371)],'surprise': [np.float64(0.9671575427055359),np.float64(0.1116902083158493),np.float64(0.1116902083158493),np.float64(0.1116902083158493),np.float64(0.4758807122707367),np.float64(0.1116902083158493),np.float64(0.40800026059150696),np.float64(0.820281982421875),np.float64(0.35446029901504517),np.float64(0.1116902083158493)],'neutral': [np.float64(0.7296026945114136),np.float64(0.2525450885295868),np.float64(0.07876556366682053),np.float64(0.07876556366682053),np.float64(0.07876556366682053),np.float64(0.27190276980400085),np.float64(0.07876556366682053),np.float64(0.23448744416236877),np.float64(0.13561409711837769),np.float64(0.07876556366682053)]}
from tqdm import tqdmemotion_labels = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]
isbn = []
emotion_scores = {label: [] for label in emotion_labels}for i in tqdm(range(len(books))):isbn.append(books["isbn13"][i])sentences = books["description"][i].split(".")predictions = classifier(sentences)max_scores = calculate_max_emotion_scores(predictions)for label in emotion_labels:emotion_scores[label].append(max_scores[label])
tqdm 是一个非常流行的进度条库,用来美化循环进度显示。这样可以直观看到代码运行到哪了,特别是处理大量数据时非常有用。
列出了情绪分类器支持的情绪类别,这些标签对应模型输出结果中的 label 字段。
isbn:用于存储每本书的 ISBN 编号。
emotion_scores:用于存储每个情绪标签的最高分数,每个标签对应一个列表。
初始化后的样子:
{
‘anger’: [],
‘disgust’: [],
‘fear’: [],
‘joy’: [],
‘sadness’: [],
‘surprise’: [],
‘neutral’: []
}
用 tqdm 包裹的 for 循环会显示漂亮的进度条。
range(len(books)) 意味着从第一本书到最后一本书,逐行遍历整个 books DataFrame。
把当前书的 ISBN 编号添加到 isbn 列表中。
把当前书的 description 按句号 . 切分为一个句子列表。
把句子列表 sentences 输入到 classifier(情绪分类模型)。
模型返回每个句子的情绪预测结果(每个句子会有多个情绪分数)。
调用之前定义的 calculate_max_emotion_scores 函数,统计当前书中每个情绪类别的 最高分数。
遍历每个情绪标签。
把当前书的每个情绪分数保存到 emotion_scores 字典对应的列表中。
这样,循环跑完后:
isbn 列表里会有所有书的 ISBN。
emotion_scores 字典会有所有书的情绪分数(每种情绪对应一个分数列表)。
emotions_df = pd.DataFrame(emotion_scores)
emotions_df["isbn13"] = isbn
emotions_df
books = pd.merge(books, emotions_df, on = "isbn13")
books
pd.merge() 是 Pandas 的一个函数,用来 合并(Join)两个 DataFrame,类似 SQL 中的 JOIN 操作。
合并的两个表:
左表(books):包含了书籍的基本信息,比如 isbn13、title、author、description、simple_categories 等。
右表(emotions_df):包含了情绪分析后的结果,比如每本书的 anger、joy、sadness、fear 等分数。
on = “isbn13”:
告诉 Pandas 用哪一列作为合并的依据。
因为每本书都有唯一的 ISBN 编号,所以选择 isbn13 作为合并键。
books.to_csv("books_with_emotions.csv", index = False)