
A Survey of the Qwen2.5-VL Vision-Language Model for Image Understanding

news 2025/7/15 17:42:12

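Here we evaluate multimodal large models for image understanding, image-to-text generation, and similar processing tasks. The models under evaluation are the Qwen2.5-VL series from the Qwen (Tongyi Qianwen) family.

Given the server's GPU size and memory limits, we chose the smaller models for deployment and testing.

Qwen/Qwen2.5-VL-3B-Instruct: 3 billion parameters; the model occupies about 7.5 GB of memory.

Qwen/Qwen2.5-VL-7B-Instruct: 7 billion parameters; the model occupies about 17 GB of memory.

And so on.

We pick Qwen/Qwen2.5-VL-3B-Instruct, the one with the fewest parameters and the smallest memory footprint, for deployment and testing.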
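As a rough sanity check on the memory figures above, weight memory can be approximated as parameter count times bytes per parameter. The sketch below is only a lower-bound estimate under stated assumptions: it uses the nominal parameter counts (3 billion and 7 billion), 2 bytes per parameter for float16, and a hypothetical helper name, `estimate_weight_memory_gb`. The footprints reported in this post are higher, which is expected because the estimate ignores activations, the KV cache, and any parameters beyond the nominal count.

```python
def estimate_weight_memory_gb(num_params, bytes_per_param=2):
    """Rough lower bound on weight memory in GB (10**9 bytes).

    bytes_per_param=2 corresponds to float16/bfloat16 weights.
    """
    return num_params * bytes_per_param / 1e9


# Nominal parameter counts taken from the model names (an approximation).
for name, params in [("3B", 3e9), ("7B", 7e9)]:
    gb = estimate_weight_memory_gb(params)
    print(f"{name}: at least about {gb:.1f} GB of float16 weights")
```

In practice, leave headroom above this estimate when deciding whether a model fits on a given GPU.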

The models can be browsed and downloaded from the ModelScope community.

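1. Introduction to Qwen2.5-VL-3B-Instruct

Key enhancements:

Visual understanding: Qwen2.5-VL is not only proficient at recognizing common objects such as flowers, birds, fish, and insects, but is also highly capable of analyzing text, charts, icons, graphics, and layouts within images.
Agentic capability: Qwen2.5-VL can act directly as a visual agent that reasons and dynamically directs tool use, including operating computers and phones.
Understanding long videos and capturing events: Qwen2.5-VL can comprehend videos longer than 1 hour, and it gains the ability to capture events by pinpointing the relevant video segments.
Visual localization in multiple formats: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it provides stable JSON output for coordinates and attributes.
Structured output: for data such as scanned invoices, forms, and tables, Qwen2.5-VL supports structured output of their contents, which is valuable in finance, commerce, and similar fields.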
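To make the "stable JSON output" point concrete, here is a minimal sketch of how a caller might parse a grounding reply. It assumes the reply is a JSON list of objects with a `bbox_2d` field (`[x1, y1, x2, y2]`) and a `label` field, possibly wrapped in a Markdown code fence. That layout is an assumption for illustration and should be checked against real model output. `parse_grounding_output` is a hypothetical helper, not part of any library.

```python
import json

# Markdown fence marker, built programmatically to keep this example simple.
FENCE = "`" * 3


def parse_grounding_output(text):
    """Parse a grounding reply into a list of {"label", "box"} dicts."""
    cleaned = text.strip()
    # Strip an optional Markdown code fence around the JSON payload.
    if cleaned.startswith(FENCE):
        cleaned = cleaned[len(FENCE):]
        if cleaned.endswith(FENCE):
            cleaned = cleaned[:-len(FENCE)]
        # Drop a leading language tag such as "json".
        if cleaned.lstrip().startswith("json"):
            cleaned = cleaned.lstrip()[len("json"):]
    boxes = []
    for item in json.loads(cleaned):
        x1, y1, x2, y2 = item["bbox_2d"]  # assumed field name
        boxes.append({"label": item.get("label", ""), "box": (x1, y1, x2, y2)})
    return boxes


# Hand-written sample reply, not real model output.
sample = FENCE + 'json\n[{"bbox_2d": [10, 20, 110, 220], "label": "puppy"}]\n' + FENCE
print(parse_grounding_output(sample))
```

A production caller would also want to handle malformed JSON, since generated text is not guaranteed to parse.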

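2. Quick inference deployment

The scenario here is to run Qwen2.5-VL-3B-Instruct on an input image and return a description of the image together with any text recognized in it. We also measure the memory the model uses and the time it takes to return a result. The implementation is as follows:

2.1 server.py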

from fastapi import FastAPI, File, UploadFile, HTTPException
from fastapi.responses import JSONResponse
from PIL import Image
import io
import uvicorn
import time
import traceback
import torch
from modelscope import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
from modelscope import snapshot_download

app = FastAPI(title="Qwen2.5-VL-3B-Instruct API", description="Image description API")


# Model initialization
@app.on_event("startup")
async def load_model():
    global model, processor
    try:
        # Download the model
        model_dir = snapshot_download('Qwen/Qwen2.5-VL-3B-Instruct')
        print(f"Model directory: {model_dir}")
        # Load the model and the processor
        model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
            model_dir,
            torch_dtype=torch.float16,
            device_map="auto",
            low_cpu_mem_usage=True,  # reduce CPU memory usage
        )
        processor = AutoProcessor.from_pretrained(model_dir)
        print("Model loaded successfully")
    except Exception as e:
        print(f"Model loading failed: {e}")
        # No request exists at startup, so abort startup instead of raising HTTPException
        raise RuntimeError("Model loading failed") from e


@app.post("/describe-image/", response_model=dict)
async def describe_image(file: UploadFile = File(...)):
    try:
        # Read the image
        contents = await file.read()
        image = Image.open(io.BytesIO(contents)).convert("RGB")
        # Validate the image size
        if min(image.size) < 32:
            raise ValueError(f"Image too small: {image.size}")
        # Resize the image
        original_size = image.size
        image = image.resize((224, 224), Image.Resampling.BICUBIC)
        # Print image info
        print(f"Image name: {file.filename}")
        print(f"Original image size: {original_size}")
        print(f"Resized image size: {image.size}")
        # Build the message structure
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image},
                    # Prompt kept in Chinese: "Describe the picture in detail in Chinese.
                    # If the picture contains text, recognize it and output it."
                    {"type": "text", "text": "使用中文详细描述图片.如果图片中有文字，请将文字识别并输出."},
                ],
            }
        ]
        # Prepare the inputs
        start_time = time.time()
        text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        image_inputs, video_inputs = process_vision_info(messages)
        # Check the image inputs
        if not image_inputs:
            raise ValueError("Image input processing failed")
        # Validate the token count
        tokenized_input = processor.tokenizer(text)
        token_count = len(tokenized_input["input_ids"])
        print(f"Token count: {token_count}")
        if token_count > 4096:  # adjust to the model's actual limit
            raise ValueError(f"Input token count ({token_count}) exceeds the model's maximum")
        inputs = processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            padding=True,
            return_tensors="pt",
        )
        # Move to the GPU
        inputs = inputs.to(model.device)
        # Run inference
        with torch.no_grad():
            generated_ids = model.generate(**inputs, max_new_tokens=1024)
        # Process the output: drop the prompt tokens from each generated sequence
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        output_text = processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )
        # Release cached GPU memory after inference
        torch.cuda.empty_cache()
        inference_time = time.time() - start_time
        return {
            "description": output_text[0],
            "inference_time": inference_time,
            "model": "Qwen2.5-VL-3B-Instruct",
        }
    except Exception as e:
        print(traceback.format_exc())
        raise HTTPException(status_code=500, detail=f"Error while processing the image: {str(e)}")


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
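The server resizes every upload to a fixed 224x224, which stretches non-square images. As a hedged alternative, the sketch below computes a target size that preserves the aspect ratio and snaps each side to a multiple of 28. The factor of 28 is an assumption based on the 14-pixel patches merged 2x2 in the Qwen2-VL family; `fit_size`, `max_side`, and `factor` are illustrative names, not part of the original code or of any library.

```python
def fit_size(width, height, max_side=448, factor=28):
    """Return (new_width, new_height) with the aspect ratio roughly preserved.

    The longer side is capped at max_side, and each side is rounded to the
    nearest multiple of `factor`, with a minimum of one factor.
    """
    # Only shrink; never upscale beyond the original resolution.
    scale = min(1.0, max_side / max(width, height))
    new_w = max(factor, round(width * scale / factor) * factor)
    new_h = max(factor, round(height * scale / factor) * factor)
    return new_w, new_h


# Example: a 1000x500 image keeps its 2:1 shape instead of becoming square.
print(fit_size(1000, 500))
```

The result could be passed to `image.resize(...)` in place of the hard-coded `(224, 224)`. A larger `max_side` means more visual tokens, so memory use and latency rise with it.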

2.2 client.py

import requests
import json
from pathlib import Path

# API endpoint
url = "http://localhost:8000/describe-image/"

# Image file path
image_path = "/home/ubuntu/dev/vl/qwen_test/000000004016.png"  # replace with your image path
#image_path = "puppy.png"

# Send the request
with open(image_path, "rb") as f:
    files = {"file": f}
    response = requests.post(url, files=files)

# Handle the response
if response.status_code == 200:
    result = response.json()
    print("Image description:")
    print(result["description"])
    print("Time taken:", result['inference_time'])
else:
    print(f"Request failed, status code: {response.status_code}")
    print(f"Error message: {response.text}")

# Read images from a folder
# Image directory
image_dir = Path("./images/")
for image_path in image_dir.rglob('*'):
    if image_path.is_file() and image_path.suffix.lower() in ['.jpg', '.jpeg', '.png', '.bmp']:
        print(f"Processing image: {image_path.name}")
        # Send the request
        with open(image_path, "rb") as f:
            files = {"file": f}
            response = requests.post(url, files=files)
        # Handle the response
        if response.status_code == 200:
            result = response.json()
            print(f"  Image description: {result['description']}")
            print(f"  Inference time: {result['inference_time']:.2f} s")
        else:
            print(f"  Request failed, status code: {response.status_code}")
            print(f"  Error message: {response.text}")
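Since part of the goal is to measure response time, it can help to aggregate the per-image `inference_time` values rather than read them one by one. The following is a minimal sketch; `summarize_latencies` is a hypothetical helper, and the sample timings are simply the two inference times quoted in the results section of this post.

```python
from statistics import mean


def summarize_latencies(seconds):
    """Return count, mean, min, and max for a list of timings in seconds."""
    if not seconds:
        raise ValueError("no timings collected")
    return {
        "count": len(seconds),
        "mean": mean(seconds),
        "min": min(seconds),
        "max": max(seconds),
    }


# In the folder loop, append result["inference_time"] to a list, then summarize it.
timings = [4.50, 4.18]
summary = summarize_latencies(timings)
print(f"{summary['count']} images, mean {summary['mean']:.2f} s, "
      f"min {summary['min']:.2f} s, max {summary['max']:.2f} s")
```

Note that `inference_time` is measured inside the server, so it excludes upload and network time. Timing the `requests.post` call on the client would capture the end-to-end latency.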

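2.3 Performance comparison

The code example uses Qwen/Qwen2.5-VL-3B-Instruct. You can swap in Qwen/Qwen2.5-VL-7B-Instruct, compare the results of the two runs, and choose based on output quality and resource consumption.

Model	Deployment	CPU/GPU memory	Description quality	Response time	Production-ready?
Qwen/Qwen2.5-VL-3B-Instruct	Plain transformers deployment	-	Fairly accurate	6 s	Yes
Qwen/Qwen2.5-VL-3B-Instruct	Plain transformers deployment with float16, the recommended acceleration and memory-saving settings, and image resizing	auto, 8-10 GB	Fairly accurate	3-5 s	Yes
Qwen/Qwen2.5-VL-7B-Instruct	Plain transformers deployment with float16, the recommended acceleration and memory-saving settings, and image resizing	17 GB	Slightly more accurate than 3B	3-5 s	Yes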
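One way to make the 3B versus 7B comparison easier to repeat is to read the model ID from configuration instead of editing the server source for each run. This is only a sketch: the environment variable name `QWEN_VL_MODEL` and the helper `resolve_model_id` are hypothetical and do not appear in the original code.

```python
import os

DEFAULT_MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"


def resolve_model_id(env=None):
    """Pick the model ID from QWEN_VL_MODEL (hypothetical), else the 3B default."""
    env = os.environ if env is None else env
    # Treat an unset or blank variable as "use the default".
    return env.get("QWEN_VL_MODEL", "").strip() or DEFAULT_MODEL_ID


print(resolve_model_id({}))
print(resolve_model_id({"QWEN_VL_MODEL": "Qwen/Qwen2.5-VL-7B-Instruct"}))
```

The server could then pass `resolve_model_id()` to `snapshot_download` and return the same value in the `model` field of the response, so each timing can be matched to the model that produced it.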

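2.4 Sample results

Input image

Processing image: puppy.png

a. Model Qwen2.5-VL-3B-Instruct

Image description (translated from the model's Chinese output): There is a puppy in the picture, standing on snowy ground. Its coat is a mix of black, brown, and white, and it looks very cute. Its ears are perked up and its eyes are large and round, giving it a very curious look. The puppy's nose is wet, as if it has just come out of the snow. The background is snow-covered ground, with some trees and a wooden bench visible in the distance. The whole scene conveys a peaceful and beautiful winter feeling.

The text in the picture is "a puppy playing in the snow", which the model also translated into Chinese ("一只小狗在雪地里玩耍").

Inference time: 4.50 s

b. Model Qwen2.5-VL-7B-Instruct

Image description (translated from the model's Chinese output): This picture shows a puppy playing in the snow. Its coat is mainly black and white, with some brown patches on its face. There is a thin layer of snow on its head, as if it has just crawled out of a snowdrift. Some trees and a wooden bench can be seen in the background, suggesting an outdoor park or garden. At the bottom of the picture there is a partially obscured line of text, from which "ed caption: a puppy playing in the snow" can be made out; the model rendered it in Chinese as "带字幕：一只小狗在雪地里玩耍" ("with caption: a puppy playing in the snow").

Inference time: 4.18 s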

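3. Conclusion

The Qwen2.5-VL vision-language model gives quite satisfactory results for image understanding and image-to-text generation. On a server with 2 GPUs, responses come back in 3-5 s, which is acceptable for production use. A model with more parameters handles complex image understanding somewhat better, but the choice should depend on your own business scenario and serving resources. This model performs better than the CLIP model covered in the previous post and is more flexible to apply. So whenever a generative vision-language model is an option, prefer it.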
