
Qwen2.5-VL: Technical Walkthrough and a Feasibility Check for Document Parsing

news 2025/7/13 3:43:38

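Overview

While chatting in a group, I saw a member float an idea: could the multimodal large model Qwen2.5-VL be used for document parsing?

Judging from the screenshot he posted, the results looked quite good.

Figure: screenshot shared by the group member

So I looked into it further to check whether the approach is actually feasible.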

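After reading this article you should be able to answer the following questions:

Which modalities does "multimodal" refer to?
What can Qwen2.5-VL do?
Which parameter sizes does Qwen2.5-VL come in?
What are the application scenarios for Qwen2.5-VL?
Once a model accepts additional input modalities, does its original text ability improve or degrade?
Is Qwen2.5-VL reliable for document parsing?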

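Qwen2.5-VL technical analysis

Qwen2.5-VL is a series of multimodal large language models developed by the Qwen team at Alibaba Cloud, released on 2025-01-28.

Open-source repository: https://github.com/QwenLM/Qwen2.5-VL

The analysis below follows its technical report.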

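1. Main contributions of Qwen2.5-VL

1. Window attention in the visual encoder to improve inference efficiency;

2. Dynamic FPS sampling, so that videos sampled at different frame rates can be understood;

3. An upgraded MRoPE in the temporal dimension, supporting more complex temporal sequences;

4. A pre-training dataset scaled up from 1.2 trillion tokens to 4.1 trillion tokens.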
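
Contributions 2 and 3 work together: frames can be sampled at different rates while the temporal position IDs stay tied to real time. The toy sketch below illustrates that idea only. The helper names and the `ids_per_second` constant are hypothetical and are not taken from the Qwen2.5-VL code.

```python
def sample_timestamps(duration_s: float, fps: float) -> list[float]:
    """Timestamps (in seconds) of frames sampled at `fps` from a clip."""
    n_frames = int(duration_s * fps)
    return [i / fps for i in range(n_frames)]


def temporal_ids(timestamps: list[float], ids_per_second: int = 2) -> list[int]:
    """Map absolute timestamps to integer temporal position IDs.

    `ids_per_second` is an illustrative constant, not a documented value.
    """
    return [round(t * ids_per_second) for t in timestamps]


if __name__ == "__main__":
    # Same 4-second clip, sampled densely and sparsely
    for fps in (2.0, 0.5):
        ts = sample_timestamps(4.0, fps)
        print(f"fps={fps}: timestamps={ts}, ids={temporal_ids(ts)}")
```

At the lower sampling rate the IDs advance by 4 instead of 1, so the gap between consecutive IDs still reflects the two seconds of real time between frames.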

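2. Key features of Qwen2.5-VL

1. Document parsing: handles handwriting, tables, charts, chemical formulas, sheet music and other document content

2. Object grounding: supports advanced spatial reasoning, covering detection, pointing, counting and related tasks

3. Video understanding: can understand very long videos and extract event segments

4. Agent capability: has a degree of reasoning and decision-making ability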

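3. Qwen2.5-VL architecture

The architecture of Qwen2.5-VL is shown in the figure below:

Figure: Qwen2.5-VL architecture

At a high level, the model pairs a vision encoder with a language-model decoder, following the standard Encoder-Decoder pattern.

The Vision Encoder accepts images and videos of varying resolution. Once the visual data is encoded, it is fed into the LM Decoder together with the prompt, and the decoder produces the text output.

Qwen2.5-VL is therefore positioned as a visual understanding model: it accepts three input modalities (image, video, text) and outputs text only.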

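4. Visual encoder characteristics

The visual encoder encodes 2D visual data so that its features align with the text embeddings. Its main characteristics are:

Window attention with a maximum window size of 112×112
2D rotary position embedding (RoPE) to capture spatial relationships in two dimensions, extended to 3D for video
RMSNorm for normalization, with SwiGLU as the activation function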
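
To get a feel for why window attention helps, here is a back-of-the-envelope comparison of attention pair counts. It assumes a 14×14-pixel patch size (taken from the technical report, not stated above), so a 112×112 window holds 8×8 = 64 patches. The function names are hypothetical, and the count ignores any layers that still use full attention.

```python
import math

PATCH = 14    # assumed patch size in pixels
WINDOW = 112  # maximum window size in pixels


def patch_grid(height: int, width: int) -> tuple[int, int]:
    """Number of patch rows and columns for an image of the given size."""
    return math.ceil(height / PATCH), math.ceil(width / PATCH)


def full_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs when every patch attends to every patch."""
    rows, cols = patch_grid(height, width)
    n = rows * cols
    return n * n


def window_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs when patches attend only within their own window."""
    rows, cols = patch_grid(height, width)
    side = WINDOW // PATCH  # patches per window side
    pairs = 0
    for r0 in range(0, rows, side):
        for c0 in range(0, cols, side):
            # Windows at the border may be smaller than side x side
            n = min(side, rows - r0) * min(side, cols - c0)
            pairs += n * n
    return pairs


if __name__ == "__main__":
    h, w = 1120, 1120
    full = full_attention_pairs(h, w)
    windowed = window_attention_pairs(h, w)
    print(f"full: {full}, windowed: {windowed}, ratio: {full / windowed:.0f}x")
```

Full attention grows quadratically with the number of patches, while the windowed count grows linearly, which is where the efficiency gain comes from.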

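5. Pre-training dataset

High-quality data is essential for training large models. After standard data cleaning, the team therefore built a dedicated four-stage scoring system, consisting of:

1. Text quality assessment
2. Image-text relevance assessment
3. Image-text complementarity assessment
4. Information density balance

The dataset types include:

Document parsing data
OCR data
Video data
Agent data

The document parsing data is normalized into an HTML format.

This means that any further processing of the parsed output has to follow the HTML conventions the model defines.

Figure: HTML-formatted data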
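
As a minimal sketch of what such downstream processing can look like, the snippet below pulls the tag name, bounding box and text out of this kind of HTML using only the standard library. The sample HTML string is made up for illustration. The `data-bbox` attribute with four space-separated integers matches what the script later in this article parses.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class BBoxCollector(HTMLParser):
    """Collect tag, bbox and text for every element carrying data-bbox."""

    def __init__(self):
        super().__init__()
        self.items = []   # one dict per element with a data-bbox attribute
        self._stack = []  # per open element: index into items, or None

    def handle_starttag(self, tag, attrs):
        bbox = dict(attrs).get("data-bbox")
        index = None
        if bbox is not None:
            coords = tuple(int(v) for v in bbox.split())
            self.items.append({"tag": tag, "bbox": coords, "text": ""})
            index = len(self.items) - 1
        if tag not in VOID_TAGS:
            self._stack.append(index)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Attribute text to the innermost open element that has a bbox
        for index in reversed(self._stack):
            if index is not None:
                self.items[index]["text"] += data
                break


def extract_bboxes(html: str) -> list[tuple]:
    """Return (tag, bbox, text) tuples in document order."""
    parser = BBoxCollector()
    parser.feed(html)
    parser.close()
    return [(i["tag"], i["bbox"], i["text"].strip()) for i in parser.items]


if __name__ == "__main__":
    sample = (
        '<h2 data-bbox="40 30 400 70">Invoice</h2>\n'
        '<p data-bbox="40 90 560 130">Total: <b>42</b> USD</p>\n'
    )
    for row in extract_bboxes(sample):
        print(row)
```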

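6. Pre-training process

Pre-training consists of three stages:

Stage 1: only the Vision Transformer (ViT) is trained, to improve its alignment with the language model.
Stage 2: all model parameters are unfrozen and trained on diverse multimodal image data, to strengthen the handling of complex visual information.
Stage 3: the sequence length is increased further for the video and agent data, to strengthen reasoning over longer sequences.

The data volume and composition of the three stages are shown in the table below:

Table: training data volume and composition at each stage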

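7. Post-training process

Post-training alignment uses a two-stage optimization scheme: supervised fine-tuning (SFT) followed by direct preference optimization (DPO).

The SFT stage uses a dataset made up of 50% text-only data and 50% multimodal data, aiming to strengthen instruction following across modalities.

The DPO stage focuses on image-text and text-only data, using preference data to align the model with human preferences. Each sample is processed only once to keep optimization efficient.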
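
For readers unfamiliar with DPO, the per-pair loss from the standard DPO formulation fits in a few lines. This is a generic sketch, not Qwen's training code. The default `beta` and the log-probability values in the example are placeholders.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)), written in a numerically stable form
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2)
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy favors the chosen answer more than the reference does: lower loss
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The loss falls below log(2) only when the policy widens the gap between the preferred and rejected answer relative to the reference model, which is the sense in which preference data steers the model.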

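8. Experimental results

The comparison with SOTA models is shown in the table below:

Table: performance of Qwen2.5-VL and state-of-the-art models

In short, Qwen2.5-VL leads on some datasets, but only by a small margin. Overall it performs about on par with other mainstream models.

Performance on pure-text tasks is shown in the table below:

Table: performance of 70B+ instruct models and Qwen2.5-VL on pure-text tasks

This table also compares the text ability of the text-only model Qwen2.5-72B with that of Qwen2.5-VL at the same parameter scale.

On half of the datasets Qwen2.5-VL's text ability improves; on the other half it degrades.

This suggests that adding vision capability does not automatically improve language ability.

It is like a person who practices both calligraphy and sports: with attention split across more subjects, their calligraphy may end up weaker than that of someone who practices only calligraphy, even though both spend the same total time in class.

The experiments section also reports comparison results for OCR understanding, video understanding and agent tasks, which are skipped here.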

Qwen2.5-VL deployment test

With the technical background covered, let's test how the model performs in practice.

Qwen2.5-VL comes in the following four model sizes:

Qwen2.5-VL-3B
Qwen2.5-VL-7B
Qwen2.5-VL-32B
Qwen2.5-VL-72B

Below I try two approaches: local deployment and API calls.

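1. Local deployment

The official repository provides a gradio-based web UI for running the model locally.

After downloading the repository code, install the dependencies: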

pip install -r requirements_web_demo.txt

My test machine has a GPU with 6 GB of VRAM, so I edited web_demo_mm.py and set DEFAULT_CKPT_PATH to the smallest model, Qwen/Qwen2.5-VL-3B-Instruct.

Launch it:
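
A rough weight-memory estimate shows why the 3B checkpoint is the only realistic candidate for a 6 GB card. The sketch uses the nominal parameter counts from the model names and assumes 2 bytes per parameter (bf16/fp16). It ignores activations, the KV cache and framework overhead, so real usage is higher.

```python
# Nominal parameter counts in billions, read off the model names
NOMINAL_PARAMS_B = {
    "Qwen2.5-VL-3B": 3,
    "Qwen2.5-VL-7B": 7,
    "Qwen2.5-VL-32B": 32,
    "Qwen2.5-VL-72B": 72,
}


def weight_gib(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate size of the weights alone, in GiB."""
    return params_billion * 1e9 * bytes_per_param / 2**30


if __name__ == "__main__":
    for name, size in NOMINAL_PARAMS_B.items():
        print(f"{name}: ~{weight_gib(size):.1f} GiB of weights at 16-bit")
```

Even the 3B model's weights come to roughly 5.6 GiB at 16-bit precision, which leaves almost no headroom on a 6 GB card.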

python web_demo_mm.py

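Open http://127.0.0.1:7860/ to reach the web UI.

Upload a document image and type an instruction.

Figure: web UI demo

In my test the model did complete the parsing task, but it was painfully slow.

The official documentation seems well aware that inference is slow, and strongly recommends FlashAttention-2 for acceleration.

Figure: from the official repository's README.md

To enable the acceleration, first install flash-attn: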

pip install flash-attn --no-build-isolation

Then run:

python web_demo_mm.py --flash-attn2

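FlashAttention-2 requires a fairly recent CUDA version (CUDA 11.6 or above, according to DeepSeek).

My test environment has an older CUDA version, and the installation did not succeed.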
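
Before attempting the install, it can save time to check the local CUDA version. The sketch below compares a version string against the 11.6 minimum quoted above; treat that threshold as an assumption and check the flash-attn README for the current requirement. `torch.version.cuda` is read only if PyTorch is installed, and it reports the CUDA version PyTorch was built against, which may differ from the installed toolkit.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a string such as '11.8' into a tuple of ints such as (11, 8)."""
    return tuple(int(part) for part in text.strip().split(".") if part.isdigit())


def cuda_at_least(version: str, minimum: tuple[int, int] = (11, 6)) -> bool:
    """True if `version` is at least `minimum` (major, minor)."""
    return parse_version(version)[:2] >= minimum


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; cannot read torch.version.cuda")
    else:
        cuda = torch.version.cuda  # None for CPU-only builds
        if cuda is None:
            print("This PyTorch build has no CUDA support")
        else:
            print(f"CUDA {cuda}, meets assumed minimum: {cuda_at_least(cuda)}")
```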

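2. API calls

To call the API, first apply for an API key on the official site.

Alibaba Cloud Bailian console: https://bailian.console.aliyun.com

Every model on Alibaba Cloud comes with a free trial quota, which is quite generous.

Figure: billing details and trial quota

The cookbooks in the official repository contain demos for various scenarios, but they are all Jupyter notebooks, which is rather inconvenient to use.

So I extracted the API parsing part into a standalone script and, following the documentation, added a streaming output mode. The code is as follows: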

import os
import base64
import re
import requests
from openai import OpenAI
from io import BytesIO
from PIL import Image, ImageDraw, ImageFont
from qwen_vl_utils import smart_resize
from bs4 import BeautifulSoup, Tag
from pathlib import Path


def encode_image(image_path):
    """Read a local image file and return its content as a base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def inference_with_api(
    image_path,
    prompt,
    sys_prompt="You are a helpful assistant.",
    model_id="qwen2.5-vl-72b-instruct",
    min_pixels=512 * 28 * 28,
    max_pixels=2048 * 28 * 28,
    stream=True,
):
    # Note: min_pixels / max_pixels are accepted here but are not sent to the
    # API in this script; they only matter for smart_resize in the main block.
    base64_image = encode_image(image_path)
    client = OpenAI(
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    )
    messages = [
        {"role": "system", "content": [{"type": "text", "text": sys_prompt}]},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
                {"type": "text", "text": prompt},
            ],
        },
    ]
    completion = client.chat.completions.create(
        model=model_id, messages=messages, stream=stream
    )
    if stream:
        # Return the chunk iterator and let the caller consume it. Using
        # `yield` here would turn the whole function into a generator and
        # break the non-streaming branch below.
        return completion
    return completion.choices[0].message.content


def draw_bbox(image_path, resized_width, resized_height, full_predict, output_path="output.png"):
    if image_path.startswith("http"):
        response = requests.get(image_path)
        image = Image.open(BytesIO(response.content))
    else:
        image = Image.open(image_path)
    original_width = image.width
    original_height = image.height

    # Parse the provided HTML content
    soup = BeautifulSoup(full_predict, "html.parser")
    # Extract all elements that have a 'data-bbox' attribute
    elements_with_bbox = soup.find_all(attrs={"data-bbox": True})
    # Skip <ol> containers; their <li> children and all other elements are kept
    filtered_elements = [el for el in elements_with_bbox if el.name != "ol"]

    try:
        font = ImageFont.truetype("msyh.ttc", 20)  # Microsoft YaHei (Windows)
    except OSError:
        # Fallback when the font file is missing; may not render CJK text
        font = ImageFont.load_default()
    draw = ImageDraw.Draw(image)

    # Scaling factors between the resized model input and the original image
    scale_x = resized_width / original_width
    scale_y = resized_height / original_height

    # Draw bounding boxes and text for each element
    for element in filtered_elements:
        bbox_str = element["data-bbox"]
        text = element.get_text(strip=True)
        x1, y1, x2, y2 = map(int, bbox_str.split())

        # Map coordinates from the resized space back to the original image
        x1_resized = int(x1 / scale_x)
        y1_resized = int(y1 / scale_y)
        x2_resized = int(x2 / scale_x)
        y2_resized = int(y2 / scale_y)
        if x1_resized > x2_resized:
            x1_resized, x2_resized = x2_resized, x1_resized
        if y1_resized > y2_resized:
            y1_resized, y2_resized = y2_resized, y1_resized

        # Draw bounding box
        draw.rectangle([x1_resized, y1_resized, x2_resized, y2_resized], outline="red", width=2)
        # Draw associated text
        draw.text((x1_resized, y2_resized), text, fill="black", font=font)

    # Save the image
    image.save(output_path)


def clean_and_format_html(full_predict):
    soup = BeautifulSoup(full_predict, "html.parser")

    # Regex matching 'color' declarations inside style attributes
    color_pattern = re.compile(r"\bcolor:[^;]+;?")

    # Strip color declarations from style attributes
    for tag in soup.find_all(style=True):
        original_style = tag.get("style", "")
        new_style = color_pattern.sub("", original_style)
        if not new_style.strip():
            del tag["style"]
        else:
            new_style = new_style.rstrip(";")
            tag["style"] = new_style

    # Remove data-bbox and data-polygon attributes
    for attr in ["data-bbox", "data-polygon"]:
        for tag in soup.find_all(attrs={attr: True}):
            del tag[attr]

    # Collapse specific class names into "formula"
    classes_to_update = ["formula.machine_printed", "formula.handwritten"]
    for tag in soup.find_all(class_=True):
        if isinstance(tag, Tag) and "class" in tag.attrs:
            new_classes = [
                cls if cls not in classes_to_update else "formula"
                for cls in tag.get("class", [])
            ]
            tag["class"] = list(dict.fromkeys(new_classes))

    # Empty divs with class "image caption" and rename the class to "image"
    for div in soup.find_all("div", class_="image caption"):
        div.clear()
        div["class"] = ["image"]

    # Empty tags with these classes and drop their format attribute
    classes_to_clean = ["music sheet", "chemical formula", "chart"]
    for class_name in classes_to_clean:
        for tag in soup.find_all(class_=class_name):
            if isinstance(tag, Tag):
                tag.clear()
                if "format" in tag.attrs:
                    del tag["format"]

    # Use <body> if present; otherwise create one and move the content into it
    body = soup.body
    if body is None:
        body = soup.new_tag("body")
        # Iterate over a copy: append() detaches each element from soup.contents
        for element in list(soup.contents):
            if isinstance(element, Tag):
                body.append(element)

    # Build the output HTML string
    output = []
    for child in body.children:
        if isinstance(child, Tag):
            output.append(str(child))
            output.append("\n")
        elif isinstance(child, str) and not child.strip():
            continue
    complete_html = f"<html><body>\n{''.join(output)}</body></html>"
    return complete_html


if __name__ == "__main__":
    # How to obtain an API key:
    # https://help.aliyun.com/zh/model-studio/obtain-api-key-app-id-and-workspace-id
    # An existing environment variable takes precedence over the placeholder.
    os.environ.setdefault("DASHSCOPE_API_KEY", "your-api-key")

    system_prompt = (
        "You are an AI specialized in recognizing and extracting text from images. "
        "Your mission is to analyze the image document and generate the result in "
        "QwenVL Document Parser HTML format using specified tags while maintaining "
        "user privacy and data integrity."
    )
    prompt = "QwenVL HTML "
    # Local image path (encode_image and Image.open below do not handle URLs)
    img_url = "D:/Code/Qwen2.5-VL/test_img.jpg"

    # Choose the model
    # model_id = "qwen2.5-vl-3b-instruct"
    # model_id = "qwen2.5-vl-7b-instruct"
    # model_id = "qwen2.5-vl-32b-instruct"
    model_id = "qwen2.5-vl-72b-instruct"
    stream = True  # whether to use streaming output

    # Create the result directory and derive the output paths
    result_dir = Path("result")
    result_dir.mkdir(exist_ok=True)
    output_path = str(result_dir / f"{model_id}_result.png")
    html_output_path = str(result_dir / f"{model_id}_result.html")

    min_pixels = 512 * 28 * 28
    max_pixels = 2048 * 28 * 28
    image = Image.open(img_url)
    width, height = image.size
    input_height, input_width = smart_resize(
        height, width, min_pixels=min_pixels, max_pixels=max_pixels
    )

    response = inference_with_api(
        img_url,
        prompt,
        sys_prompt=system_prompt,
        model_id=model_id,
        min_pixels=min_pixels,
        max_pixels=max_pixels,
        stream=stream,
    )

    # Post-process according to the output mode
    if stream:
        full_content = ""
        for chunk in response:
            if not chunk.choices or chunk.choices[0].delta.content is None:
                continue
            full_content += chunk.choices[0].delta.content
            print(chunk.choices[0].delta.content, end="", flush=True)
        print()
    else:
        full_content = response
        print(full_content)

    if not isinstance(full_content, str):
        raise TypeError(f"Expected str, got {type(full_content)}")

    # Visualize the bounding boxes
    draw_bbox(img_url, input_width, input_height, full_content, output_path)

    # Clean and format the HTML (optional: this removes coordinates and similar data)
    # full_content = clean_and_format_html(full_content)

    # Save the HTML content
    with open(html_output_path, "w", encoding="utf-8") as f:
        f.write(full_content)
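
The script relies on `smart_resize` to learn what resolution the model actually sees, because the `data-bbox` coordinates are expressed in that resized space. The following is a simplified re-implementation written for illustration, assuming the helper rounds both sides to multiples of 28 while keeping the pixel count between `min_pixels` and `max_pixels`. It is not the library's code and may differ in edge cases. `to_original` is a hypothetical helper that mirrors the scaling done in `draw_bbox`.

```python
import math


def approx_smart_resize(
    height: int,
    width: int,
    factor: int = 28,
    min_pixels: int = 512 * 28 * 28,
    max_pixels: int = 2048 * 28 * 28,
) -> tuple[int, int]:
    """Approximate (height, width) of the image as the model would see it."""
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Shrink, rounding down so the result stays under the budget
        beta = math.sqrt(height * width / max_pixels)
        h = math.floor(height / beta / factor) * factor
        w = math.floor(width / beta / factor) * factor
    elif h * w < min_pixels:
        # Enlarge, rounding up so the result reaches the minimum
        beta = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * beta / factor) * factor
        w = math.ceil(width * beta / factor) * factor
    return h, w


def to_original(bbox, resized_size, original_size):
    """Map an (x1, y1, x2, y2) box from resized space to original pixels.

    Both sizes are given as (width, height).
    """
    resized_w, resized_h = resized_size
    original_w, original_h = original_size
    x1, y1, x2, y2 = bbox
    return (
        int(x1 * original_w / resized_w),
        int(y1 * original_h / resized_h),
        int(x2 * original_w / resized_w),
        int(y2 * original_h / resized_h),
    )


if __name__ == "__main__":
    h, w = approx_smart_resize(3000, 2000)
    print("model input size (h, w):", h, w)
    print(to_original((100, 200, 300, 400), (w, h), (2000, 3000)))
```

With the defaults used in the script, the pixel budget runs from 512 × 28 × 28 = 401,408 to 2048 × 28 × 28 = 1,605,632 pixels, so a large scan is downscaled before the model sees it and every returned box has to be scaled back up.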