
[AI] Evaluating Qwen3-4B-FP8 with evalscope on Ubuntu 22.04

Install evalscope

mkdir evalscope
cd evalscope/
python3 -m venv venv
source venv/bin/activate
pip install 'evalscope[app,perf]' -U -i https://mirrors.aliyun.com/pypi/simple --trusted-host mirrors.aliyun.com
pip install tiktoken omegaconf -i https://mirrors.aliyun.com/pypi/simple --trusted-host mirrors.aliyun.com
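
A quick import check confirms the install landed in this venv (a minimal sketch; nothing here is evalscope-specific beyond the package name):

# Sanity check: confirm evalscope is importable and report its version.
from importlib.metadata import version
import evalscope  # raises ImportError if the install failed

print(version("evalscope"))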

Download general evaluation datasets (not used in this run)

pip install -U modelscope
modelscope download --dataset modelscope/mmlu --local_dir /data/ai/evalscope_data/mmlu
modelscope download --dataset modelscope/gsm8k --local_dir /data/ai/evalscope_data/gsm8k
modelscope download --dataset modelscope/human_eval --local_dir /data/ai/evalscope_data/human_eval
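
To spot-check that a downloaded dataset is usable, it can be loaded through modelscope's MsDataset (a rough sketch; the subset and split names below are illustrative assumptions, not a definitive list):

# Hypothetical spot-check of the locally downloaded MMLU data.
from modelscope.msdatasets import MsDataset

ds = MsDataset.load(
    'modelscope/mmlu',
    subset_name='abstract_algebra',  # assumed subset name, for illustration
    split='test',
)
print(next(iter(ds)))  # eyeball one record's schema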

Deploy Qwen3-4B-FP8 locally

modelscope download --model Qwen/Qwen3-4B-FP8
vllm serve /home/yeqiang/.cache/modelscope/hub/models/Qwen/Qwen3-4B-FP8 --served-model-name Qwen3-4B-FP8 --port 8000 --dtype auto --gpu-memory-utilization 0.8 --max-model-len 40960 --tensor-parallel-size 1
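
Before pointing evalscope at the service, a single request against the OpenAI-compatible endpoint confirms it is serving (a minimal sketch using requests; the prompt is arbitrary):

# Smoke-test the vLLM OpenAI-compatible chat endpoint.
import requests

resp = requests.post(
    'http://localhost:8000/v1/chat/completions',
    json={
        'model': 'Qwen3-4B-FP8',  # must match --served-model-name above
        'messages': [{'role': 'user', 'content': 'Say hello in one sentence.'}],
        'max_tokens': 128,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()['choices'][0]['message']['content'])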

Write the evaluation script (using the EvalScope-Qwen3-Test dataset)

eval_qwen3_mmlu.py

from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen3-4B-FP8',
    api_url='http://localhost:8000/v1/chat/completions',
    eval_type='service',
    datasets=['data_collection'],
    dataset_args={
        'data_collection': {
            'dataset_id': 'modelscope/EvalScope-Qwen3-Test',
            'filters': {'remove_until': '</think>'},  # strip the reasoning content
        }
    },
    eval_batch_size=128,
    generation_config={
        'max_tokens': 30000,  # max generated tokens; keep large to avoid truncated output
        'temperature': 0.6,   # sampling temperature (recommended in the Qwen report)
        'top_p': 0.95,        # top-p sampling (recommended in the Qwen report)
        'top_k': 20,          # top-k sampling (recommended in the Qwen report)
        'n': 1,               # number of completions per request
    },
    timeout=60000,  # request timeout
    stream=True,    # use streaming output
    limit=100,      # evaluate only 100 samples for this test
)

run_task(task_cfg=task_cfg)
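
The remove_until filter exists because Qwen3 emits its chain of thought inside <think>...</think>; only the text after the closing tag should be scored. A rough illustration of the intended effect in plain Python (not evalscope's actual implementation):

# Illustrative only: drop everything up to and including '</think>'.
def remove_until(text: str, marker: str = '</think>') -> str:
    before, sep, rest = text.partition(marker)
    return rest if sep else text  # no marker: leave the text unchanged

raw = '<think>Work through the options...</think>The answer is B.'
print(remove_until(raw))  # -> 'The answer is B.'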

Run the evaluation

(venv) yeqiang@yeqiang-Default-string:/data/ai/evalscope$ python eval_qwen3_mmlu.py 
2025-05-06 22:44:04,363 - evalscope - INFO - Args: Task config is provided with TaskConfig type.
ANTLR runtime and generated code versions disagree: 4.9.3!=4.7.2
ANTLR runtime and generated code versions disagree: 4.9.3!=4.7.2
ANTLR runtime and generated code versions disagree: 4.9.3!=4.7.2
ANTLR runtime and generated code versions disagree: 4.9.3!=4.7.2
2025-05-06 22:44:06,473 - evalscope - INFO - Loading dataset from modelscope: > dataset_name: modelscope/EvalScope-Qwen3-Test
Downloading Dataset to directory: /home/yeqiang/.cache/modelscope/hub/datasets/modelscope/EvalScope-Qwen3-Test
2025-05-06 22:44:08,753 - evalscope - INFO - Dump task config to ./outputs/20250506_224404/configs/task_config_7d0e13.yaml
2025-05-06 22:44:08,755 - evalscope - INFO - {"model": "Qwen3-4B-FP8","model_id": "Qwen3-4B-FP8","model_args": {"revision": "master","precision": "torch.float16"},"model_task": "text_generation","template_type": null,"chat_template": null,"datasets": ["data_collection"],"dataset_args": {"data_collection": {"dataset_id": "modelscope/EvalScope-Qwen3-Test","filters": {"remove_until": "</think>"}}},"dataset_dir": "/home/yeqiang/.cache/modelscope/hub/datasets","dataset_hub": "modelscope","generation_config": {"max_tokens": 30000,"temperature": 0.6,"top_p": 0.95,"top_k": 20,"n": 1},"eval_type": "service","eval_backend": "Native","eval_config": null,"stage": "all","limit": 100,"eval_batch_size": 128,"mem_cache": false,"use_cache": null,"work_dir": "./outputs/20250506_224404","outputs": null,"debug": false,"dry_run": false,"seed": 42,"api_url": "http://localhost:8000/v1/chat/completions","api_key": "EMPTY","timeout": 60000,"stream": true,"judge_strategy": "auto","judge_worker_num": 1,"judge_model_args": {}
}
Getting answers:  33%|█████████████████████████████████████████████████████████████████████| 33/100 [02:28<07:44,  6.93s/it]

nvidia-smi

vLLM service status

Report

2025-05-06 23:13:37,099 - evalscope - INFO - subset_level Report:
+-------------+-------------------------+-----------------+----------------------------+---------------+-------+
|  task_type  |         metric          |  dataset_name   |        subset_name         | average_score | count |
+-------------+-------------------------+-----------------+----------------------------+---------------+-------+
|    exam     |     AverageAccuracy     |    mmlu_pro     |           health           |    0.6667     |   9   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |            math            |      1.0      |   7   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |        engineering         |    0.6667     |   6   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |         chemistry          |      0.5      |   6   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |         psychology         |    0.6667     |   6   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |          biology           |      0.8      |   5   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |            law             |      0.2      |   5   |
| instruction | prompt_level_strict_acc |     ifeval      |          default           |     0.75      |   4   |
| instruction |  inst_level_strict_acc  |     ifeval      |          default           |     0.75      |   4   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |          physics           |     0.75      |   4   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |           other            |      0.5      |   4   |
| instruction | prompt_level_loose_acc  |     ifeval      |          default           |      1.0      |   4   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |      computer science      |      1.0      |   4   |
| instruction |  inst_level_loose_acc   |     ifeval      |          default           |      1.0      |   4   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |          business          |    0.6667     |   3   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |          history           |    0.6667     |   3   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |         philosophy         |    0.6667     |   3   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |         prehistory         |      1.0      |   2   |
|    exam     |     AverageAccuracy     |    mmlu_pro     |         economics          |      0.5      |   2   |
|    exam     |     AverageAccuracy     |      ceval      |     education_science      |      1.0      |   1   |
|    exam     |     AverageAccuracy     |      ceval      |            law             |      0.0      |   1   |
|    exam     |     AverageAccuracy     |      ceval      |       tax_accountant       |      0.0      |   1   |
|    exam     |     AverageAccuracy     |      iquiz      |             EQ             |      1.0      |   1   |
|    exam     |     AverageAccuracy     |      ceval      |    high_school_biology     |      1.0      |   1   |
|    code     |         Pass@1          | live_code_bench |           v5_v6            |      0.0      |   1   |
|    exam     |     AverageAccuracy     |      ceval      |       basic_medicine       |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |          anatomy           |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |  college_computer_science  |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |    college_mathematics     |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |      abstract_algebra      |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |  high_school_mathematics   |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    | high_school_macroeconomics |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |   high_school_chemistry    |      0.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |    high_school_biology     |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |     conceptual_physics     |      0.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    | high_school_world_history  |      0.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |       miscellaneous        |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |      medical_genetics      |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |          virology          |      0.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |      security_studies      |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |   professional_medicine    |      0.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |      moral_scenarios       |      1.0      |   1   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |      world_religions       |      0.0      |   1   |
|  knowledge  |      AveragePass@1      |      gpqa       |        gpqa_diamond        |      1.0      |   1   |
|    math     |      AveragePass@1      |    math_500     |          Level 3           |      1.0      |   1   |
|    math     |      AveragePass@1      |    math_500     |          Level 5           |      1.0      |   1   |
+-------------+-------------------------+-----------------+----------------------------+---------------+-------+
2025-05-06 23:13:37,099 - evalscope - INFO - dataset_level Report:
+-------------+-------------------------+-----------------+---------------+-------+
|  task_type  |         metric          |  dataset_name   | average_score | count |
+-------------+-------------------------+-----------------+---------------+-------+
|    exam     |     AverageAccuracy     |    mmlu_pro     |    0.6716     |  67   |
|    exam     |     AverageAccuracy     |   mmlu_redux    |    0.6842     |  19   |
|    exam     |     AverageAccuracy     |      ceval      |      0.6      |   5   |
| instruction | prompt_level_loose_acc  |     ifeval      |      1.0      |   4   |
| instruction | prompt_level_strict_acc |     ifeval      |     0.75      |   4   |
| instruction |  inst_level_loose_acc   |     ifeval      |      1.0      |   4   |
| instruction |  inst_level_strict_acc  |     ifeval      |     0.75      |   4   |
|    math     |      AveragePass@1      |    math_500     |      1.0      |   2   |
|    code     |         Pass@1          | live_code_bench |      0.0      |   1   |
|    exam     |     AverageAccuracy     |      iquiz      |      1.0      |   1   |
|  knowledge  |      AveragePass@1      |      gpqa       |      1.0      |   1   |
+-------------+-------------------------+-----------------+---------------+-------+
2025-05-06 23:13:37,099 - evalscope - INFO - task_level Report:
+-------------+-------------------------+---------------+-------+
|  task_type  |         metric          | average_score | count |
+-------------+-------------------------+---------------+-------+
|    exam     |     AverageAccuracy     |    0.6739     |  92   |
| instruction |  inst_level_loose_acc   |      1.0      |   4   |
| instruction |  inst_level_strict_acc  |     0.75      |   4   |
| instruction | prompt_level_loose_acc  |      1.0      |   4   |
| instruction | prompt_level_strict_acc |     0.75      |   4   |
|    math     |      AveragePass@1      |      1.0      |   2   |
|    code     |         Pass@1          |      0.0      |   1   |
|  knowledge  |      AveragePass@1      |      1.0      |   1   |
+-------------+-------------------------+---------------+-------+
2025-05-06 23:13:37,100 - evalscope - INFO - tag_level Report:
+------+-------------------------+---------------+-------+
| tags |         metric          | average_score | count |
+------+-------------------------+---------------+-------+
|  en  |     AverageAccuracy     |    0.6744     |  86   |
|  zh  |     AverageAccuracy     |    0.6667     |   6   |
|  en  |  inst_level_strict_acc  |     0.75      |   4   |
|  en  |  inst_level_loose_acc   |      1.0      |   4   |
|  en  | prompt_level_loose_acc  |      1.0      |   4   |
|  en  | prompt_level_strict_acc |     0.75      |   4   |
|  en  |      AveragePass@1      |      1.0      |   3   |
|  en  |         Pass@1          |      0.0      |   1   |
+------+-------------------------+---------------+-------+
2025-05-06 23:13:37,100 - evalscope - INFO - category_level Report:
+-----------+--------------+-------------------------+---------------+-------+
| category0 |  category1   |         metric          | average_score | count |
+-----------+--------------+-------------------------+---------------+-------+
|   Qwen3   |   English    |     AverageAccuracy     |    0.6744     |  86   |
|   Qwen3   |   Chinese    |     AverageAccuracy     |    0.6667     |   6   |
|   Qwen3   |   English    |  inst_level_loose_acc   |      1.0      |   4   |
|   Qwen3   |   English    |  inst_level_strict_acc  |     0.75      |   4   |
|   Qwen3   |   English    | prompt_level_strict_acc |     0.75      |   4   |
|   Qwen3   |   English    | prompt_level_loose_acc  |      1.0      |   4   |
|   Qwen3   | Math&Science |      AveragePass@1      |      1.0      |   3   |
|   Qwen3   |     Code     |         Pass@1          |      0.0      |   1   |
+-----------+--------------+-------------------------+---------------+-------+
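
The task-level numbers are count-weighted averages of the dataset-level rows; for instance, the exam accuracy of 0.6739 over 92 samples can be reproduced from the dataset-level report:

# Reproduce task-level exam accuracy from the dataset-level report above.
exam_rows = [
    (0.6716, 67),  # mmlu_pro
    (0.6842, 19),  # mmlu_redux
    (0.6, 5),      # ceval
    (1.0, 1),      # iquiz
]
total = sum(count for _, count in exam_rows)
weighted = sum(score * count for score, count in exam_rows) / total
print(f'{weighted:.4f} over {total} samples')  # -> 0.6739 over 92 samples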

References

https://evalscope.readthedocs.io/zh-cn/latest/best_practice/qwen3.html

