
Qwen3 Embedding Test

news 2025/6/15 14:48:41

Environment Setup

uv init
uv venv
.venv/Scripts/activate
uv pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
uv pip install transformers

Initialization

import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

This part imports PyTorch and the Hugging Face Transformers library in preparation for using the Qwen3-Embedding model.

def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        print("left_padding")
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

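Two key functions are defined here:

last_token_pool: extracts the representation of the last valid token from the model's output hidden states. It handles two padding cases:
- Left padding (detected when the last position of every sequence is a real token): take the vector at the last position directly
- Right padding: compute the actual length of each sequence, then take the vector at the corresponding position
get_detailed_instruct: builds an instruction-formatted query string by combining the task description and the query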
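To make the two padding cases concrete, here is a minimal sketch of the same indexing logic written with plain Python lists instead of tensors. The function name last_token_pool_lists and the toy values are made up for illustration and are not part of the Qwen3-Embedding code.

```python
from typing import List

Vector = List[float]


def last_token_pool_lists(hidden: List[List[Vector]], mask: List[List[int]]) -> List[Vector]:
    """Pick the hidden vector of the last real (non-padding) token of each sequence."""
    # Left padding is assumed when the final position of every row is a real token.
    left_padding = all(row[-1] == 1 for row in mask)
    if left_padding:
        return [seq[-1] for seq in hidden]
    # Right padding: the last real token sits at index (number of real tokens - 1).
    return [seq[sum(row) - 1] for seq, row in zip(hidden, mask)]


# Two sequences, three positions, hidden size 1 (toy values; 0.0 marks padding).
right_hidden = [[[1.0], [2.0], [0.0]], [[1.0], [2.0], [3.0]]]
right_mask = [[1, 1, 0], [1, 1, 1]]
left_hidden = [[[0.0], [1.0], [2.0]], [[1.0], [2.0], [3.0]]]
left_mask = [[0, 1, 1], [1, 1, 1]]

pooled_right = last_token_pool_lists(right_hidden, right_mask)
pooled_left = last_token_pool_lists(left_hidden, left_mask)
print(pooled_right)  # [[2.0], [3.0]]
print(pooled_left)   # [[2.0], [3.0]]
```

Both layouts yield the same pooled vectors, which is why the function can support either padding side.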

Example

task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents
input_texts

['Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:What is the capital of China?',
 'Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:Explain gravity',
 'The capital of China is Beijing.',
 'Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.']

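This part prepares the test data:

Defines a retrieval task description
Creates two instruction-formatted queries
Prepares two documents as retrieval targets
Merges the queries and documents into a single input list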

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3-Embedding-0.6B', padding_side='left')
model = AutoModel.from_pretrained('Qwen/Qwen3-Embedding-0.6B')

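This loads the Qwen3-Embedding-0.6B model and its tokenizer:

padding_side='left' enables left padding, which matters later when extracting the representation of the last token
The model is loaded on the CPU by default (append .cuda() to the loaded model for GPU acceleration)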

max_length = 8192
# Tokenize the input texts
batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=max_length,
    return_tensors="pt",
)
batch_dict.keys()

dict_keys(['input_ids', 'attention_mask'])

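This part tokenizes the input texts:

Sets the maximum length to 8192, the value used in the official example (the model card lists a longer supported context of 32K)
Enables padding and truncation so that all sequences in the batch have the same length
Returns the result as PyTorch tensors
The output shows that batch_dict contains two keys, 'input_ids' and 'attention_mask'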
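What padding and truncation do to a batch can be sketched without the real tokenizer. The helper below is a simplified stand-in written for illustration only: pad_batch, the pad id 0, and the toy token ids are assumptions, and the real tokenizer additionally handles special tokens and vocabulary lookup.

```python
from typing import Dict, List


def pad_batch(token_ids: List[List[int]], max_length: int,
              pad_id: int = 0, padding_side: str = "left") -> Dict[str, List[List[int]]]:
    """Truncate each sequence to max_length, then pad to the longest one in the batch."""
    truncated = [ids[:max_length] for ids in token_ids]
    target = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        pad = [pad_id] * (target - len(ids))
        ones = [1] * len(ids)    # 1 marks a real token
        zeros = [0] * len(pad)   # 0 marks padding
        if padding_side == "left":
            input_ids.append(pad + ids)
            attention_mask.append(zeros + ones)
        else:
            input_ids.append(ids + pad)
            attention_mask.append(ones + zeros)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy token ids: the first sequence is too long, the second is short.
batch = pad_batch([[11, 12, 13, 14, 15], [21, 22]], max_length=4)
print(batch["input_ids"])       # [[11, 12, 13, 14], [0, 0, 21, 22]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [0, 0, 1, 1]]
```

With left padding, the last column of the attention mask is all ones, which is exactly the condition last_token_pool checks.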

batch_dict.to(model.device)
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
embeddings.shape

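This part runs model inference and extracts the embeddings:

Moves the input data to the device the model is on
Runs the model to obtain its outputs
Uses last_token_pool to extract a representation vector for each sequence
The output prints "left_padding", confirming that left padding was used
The embeddings have shape [4, 1024]: 4 input texts, each with an embedding dimension of 1024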

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T)
scores

tensor([[0.7646, 0.1414],
        [0.1355, 0.6000]], grad_fn=<MmBackward0>)

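The last part computes the similarity scores:

Applies L2 normalization to the embeddings so that each has length 1
Computes the dot product between the query embeddings (the first 2) and the document embeddings (the last 2), giving a similarity matrix; since the vectors are unit length, each dot product is a cosine similarity
The result shows that query 1 has a similarity of 0.7646 (high) with document 1 and 0.1414 (low) with document 2
Query 2 has a similarity of 0.1355 (low) with document 1 and 0.6000 (high) with document 2

This shows that the model matches each query to its relevant document: the "capital of China" query scores high against the document that mentions Beijing, and the "explain gravity" query scores high against the document that describes gravity.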
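The normalize-then-dot-product step can be checked by hand on small vectors. The following sketch uses plain Python and made-up 2-dimensional vectors (not real model embeddings); l2_normalize and similarity_matrix are illustrative names.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(queries: List[List[float]], documents: List[List[float]]) -> List[List[float]]:
    """Dot products of unit vectors, i.e. cosine similarity; rows are queries, columns are documents."""
    q = [l2_normalize(v) for v in queries]
    d = [l2_normalize(v) for v in documents]
    return [[sum(a * b for a, b in zip(qv, dv)) for dv in d] for qv in q]


# Toy vectors chosen so the arithmetic is easy to follow.
toy_queries = [[3.0, 4.0], [1.0, 0.0]]
toy_documents = [[6.0, 8.0], [0.0, 2.0]]
toy_scores = similarity_matrix(toy_queries, toy_documents)
print([[round(s, 4) for s in row] for row in toy_scores])  # [[1.0, 0.8], [0.6, 0.0]]
```

The first query and first document point in the same direction, so their score is 1.0 even though their lengths differ; normalization removes the effect of vector magnitude.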

Another Example

task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'How does photosynthesis work?'),
    get_detailed_instruct(task, 'What are the benefits of exercise?')
]
# No need to add instruction for retrieval documents
documents = [
    "Photosynthesis is the process by which green plants and some other organisms use sunlight to synthesize foods with the help of chlorophyll. During this process, plants convert light energy into chemical energy, absorb carbon dioxide and release oxygen.",
    "Regular exercise offers numerous benefits including improved cardiovascular health, stronger muscles and bones, better weight management, enhanced mental health, reduced risk of chronic diseases, improved sleep quality, and increased energy levels."
]
input_texts = queries + documents
max_length = 8192
# Tokenize the input texts
batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=max_length,
    return_tensors="pt",
)
batch_dict.to(model.device)
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T)
scores

left_padding
tensor([[0.6436, 0.1156],
        [0.2306, 0.7621]], grad_fn=<MmBackward0>)
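In a retrieval setting, the score matrix is usually reduced to a best match per query. As a small sketch, the helper below (best_document_per_query is an illustrative name, not part of any library) takes the two score matrices printed in this post, copied in as plain lists, and returns the index of the highest-scoring document for each query.

```python
from typing import List


def best_document_per_query(scores: List[List[float]]) -> List[int]:
    """For each query row, return the column index of the highest score."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Score matrices copied from the two runs above (rows: queries, columns: documents).
first_run = [[0.7646, 0.1414], [0.1355, 0.6000]]
second_run = [[0.6436, 0.1156], [0.2306, 0.7621]]

print(best_document_per_query(first_run))   # [0, 1]
print(best_document_per_query(second_run))  # [0, 1]
```

In both runs each query selects the document written to answer it.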

Reference: https://github.com/QwenLM/Qwen3-Embedding?tab=readme-ov-file
