当前位置：首页 > backend >正文

【原创】MiniCPM-V 4.5模型测试 pk gemini2.5pro 本地8G显卡

backend 2025/8/30 11:21:32

MiniCPM-V 4.5

MiniCPM-V 4.5是MiniCPM-V系列中最新、功能最强的型号。该型号基于Qwen3-8B和SigLIP2-400M，共有8B参数。它比以前的MiniCPM-V和MiniCPM-o型号表现出显着的性能改进，并引入了新的有用特性。MiniCPM-V 4.5的显着特点包括：

🔥最先进的视觉语言能力。MiniCPM-V 4.5在OpenCompass上的平均得分为77.0，这是对8个流行基准测试的综合评估。仅8B参数，它就超越了广泛使用的专有模型，如GPT-4o-update、双子座2.0专业版，以及强大的开源模型，如Qwen2.5-VL72B视觉语言能力，使其成为30B参数下性能最高的MLLM。
🎬高效的高刷新率和长视频理解。MiniCPM-V 4.5现在可以实现96倍的视频令牌压缩率，其中6个448x448视频帧可以联合压缩成64个视频令牌（大多数MLLM通常为1,536个令牌）。这意味着该模型可以在不增加LLM推理成本的情况下感知更多的视频帧。这带来了视频-MME、LVBench、MLVU、MotionBench、FavorBench等上最先进的高刷新率（高达10FPS）视频理解和长视频理解能力。，高效。
⚙️可控混合快速/深度思考。MiniCPM-V 4.5支持快速思考，以实现具有竞争力的高效频繁使用，也支持深度思考，以解决更复杂的问题。为了覆盖不同用户场景下的效率和性能权衡，这种快速/深度思考模式可以以高度受控的方式切换。
💪强大的OCR，文档解析和其他。基于LLaVA-UHD架构，MiniCPM-V 4.5可以处理任何长宽比和高达180万像素（例如1344x1344）的高分辨率图像，使用的视觉标记比大多数MLLM少4倍。该模型实现了OCRBench上的领先性能，超过了专有型号，如GPT-4o-update和双子座2.5RLAIF-V and VisCPM techniques, it features trustworthy behaviors, outperforming GPT-4o-latest on MMHal-Bench, and supports multilingual capabilities in more than 30 languages.

hf测试

The video showcases a group of individuals dressed in turquoise protective suits and gas masks, navigating various sections of an ancient cave or underground structure. They begin by examining a large stone slab with intricate carvings, which has yellowish liquid pooling around it. The setting is dark but illuminated by their headlamps as they move through the space.
As they progress upward on a spiral staircase integrated into the rock formation, more people are seen ascending behind them. Upon reaching higher levels, water starts dripping from above, creating small pools at lower elevations where some team members stand waist-deep to examine submerged bones arranged neatly within circular niches carved into the wall.The exploration continues deeper into the cave-like environment, revealing walls adorned with numerous statues set into alcoves. Water cascades down from openings high above, contributing to the damp conditions below. One individual uses tools like pickaxes and brushes amidst scattered artifacts such as animal skulls and coils of black rope while others survey the surroundings cautiously under the dim light provided by flashlights.In another segment, two team members kneel before a rectangular object embedded in the floor of what appears to be a tunnel lined with concentric rings etched onto its surface. This scene highlights both collaborative investigation efforts among the suited explorers and detailed documentation techniques employed during this archaeological expedition inside these enigmatic subterranean chambers.视频展示了一群人身着青绿色防护服，戴着防毒面具，在一个古老洞穴或地下建筑的不同区域探索。他们先是查看一块刻有复杂图案的大石板，石板周围有黄色液体汇聚。环境昏暗，但他们走动时，头灯照亮了四周。他们沿着融入岩层的螺旋楼梯向上行进，身后还有更多人在攀登。到达更高处后，水开始从上方滴落，在地势较低处形成小水洼，一些队员站在齐腰深的水中，查看整齐排列在墙壁圆形壁龛里的水下骨头。探索继续向类似洞穴的环境深处推进，只见墙壁壁龛中装饰着众多雕像。水从高处开口倾泻而下，加剧了下方的潮湿。一名队员在散落的文物（如动物头骨和黑色绳索卷）中使用鹤嘴锄和刷子等工具，其他人则在手电筒微弱的光线中小心翼翼地查看周围环境。在另一个片段中，两名队员跪在嵌入地板的一个矩形物体前，这里似乎是一条隧道，隧道表面刻有同心圆环。这一场景既凸显了身着防护服的探险队员之间的协作调查，也展现了在这些神秘地下洞穴中进行考古探险时所采用的详细记录方法。

谷歌gemini 2.5 pro分析的

Based on the analysis of the video, here is a description of its content:This video appears to be an AI-generated animation depicting a team of explorers in green hazmat suits investigating a series of surreal and mysterious underground locations. The overall tone is one of sci-fi horror and mysterious discovery, similar to a "found footage" or exploration log.Here's a breakdown of the scenes:Scene 1 (0-3 seconds): The team is inside what looks like the massive rib cage of a colossal creature. They gather around a central stone-like structure as water begins to pour from above.Scene 2 (4-7 seconds): The setting shifts to a large, water-filled cavern with strange, concentric circles carved into the ceiling. The team wades through the water, examining a collection of bones and symbols laid out on the cavern floor.Scene 3 (8-11 seconds): The explorers are now in a circular, tomb-like chamber with statues in niches along the wall. In the center lies a broken sarcophagus, a metallic bull skull, and other mysterious artifacts. The ceiling above is broken open.Scene 4 (12-15 seconds): The perspective changes to a first-person view, looking down a long, ribbed, and seemingly organic tunnel. At the far end, two other team members are examining a black, monolithic object, resembling a server or a mysterious artifact.The video creates a strong atmosphere of suspense and otherworldly exploration, suggesting a journey deep into an ancient, non-human structure filled with unsettling secrets.基于对该视频的分析，以下是对其内容的描述：这段视频似乎是一段由人工智能生成的动画，描绘了一队身着绿色防护服的探险者对一系列超现实且神秘的地下场所展开调查。整体基调带有科幻恐怖色彩以及神秘探索的感觉，类似“伪纪录片”或探险日志。以下是各场景的详细描述：场景1（0 - 3秒）：探险队身处一个看起来像是某种巨大生物的巨大胸腔内。他们围聚在一个类似中央石质的结构周围，此时水开始从上方倾泻而下。场景2（4 - 7秒）：场景转换到一个巨大的、充满水的洞穴，洞顶刻有奇怪的同心圆。队员们在水中蹚行，查看散落在洞穴地面上的一堆骨头和符号。场景3（8 - 11秒）：探险者们此时位于一个圆形的、类似墓室的房间，墙壁壁龛里有雕像。房间中央放置着一个破碎的石棺、一个金属牛头骨以及其他神秘的文物。上方的天花板已破开。场景4（12 - 15秒）：视角切换为第一人称，顺着一条长长的、带有肋骨状且看似有机的隧道望去。在隧道远端，另外两名队员正在查看一个黑色的、类似服务器或神秘文物的巨石状物体。该视频营造出强烈的悬念氛围以及超凡脱俗的探索感，仿佛是深入一个古老的、非人类建造的结构中，那里充斥着令人不安的秘密。

本地llama.cpp

>>> Step2: 逐帧分析...分析帧 1/16...分析帧 2/16...分析帧 3/16...分析帧 4/16...分析帧 5/16...分析帧 6/16...分析帧 7/16...分析帧 8/16...分析帧 9/16...分析帧 10/16...分析帧 11/16...分析帧 12/16...分析帧 13/16...分析帧 14/16...分析帧 15/16...分析帧 16/16...
>>> Step3: 汇总...
>>> 最终输出：
```json
{"timeline": [{"time": "00:00","description": "洞穴内，多名身穿绿色防护服、戴着防毒面具的人在活动，一人攀爬岩壁，水流从上方滴落。"},{"time": "00:05","description": "洞穴中央有水流直泻而下，四名穿着绿色防护服的人手持手电筒站在水中，洞内昏暗。"},{"time": "00:10","description": "大型管道内，两名绿色防护服人员蹲在管道尽头对黑色矩形物体，两人手持照明设备。"},{"time": "00:15","description": "洞穴中喷水，团队成员在操作设备，背景为弧形石质洞壁。"},{"time": "00:20","description": "四名绿色防护服人员站在水中，水中有大量骨骼，洞顶为圆拱形，墙壁有沉积层。"},{"time": "00:25","description": "五名蓝色防护服人员在水中，一人手持手电筒照亮区域，洞底有黑色垫子、骨骼和物品。"},{"time": "00:30","description": "团队在洞穴中涉水前行，一人手持光源，洞顶有同心圆纹理。"},{"time": "00:35","description": "四名绿色防护服人员在洞穴中，中央有水流，手持手电筒照亮，洞壁粗糙。"},{"time": "00:40","description": "洞窟内有石块、铜牛头、绳索等，洞顶破损有蓝色光斑，洞壁有佛像。"},{"time": "00:45","description": "四名人员在清理残骸，洞顶有大洞透蓝光，洞壁有佛像，有石板、线团、牛头骨。"},{"time": "00:50","description": "多名人员在洞窟内清理文物碎片，洞壁有佛像龛，环境潮湿昏暗。"},{"time": "00:55","description": "多名人员在洞穴中清理，洞壁有佛像，顶部破损，环境潮湿尘土。"},{"time": "01:00","description": "圆形管道内，两名绿色防护服人员蹲在管道中央，前方有黑色金属箱体，一人手持相机。"},{"time": "01:05","description": "圆形隧道，两名绿色防护服人员朝隧道深处前进，中间两人操作黑色金属箱。"},{"time": "01:10","description": "三名绿色防护服人员在隧道内操作黑色设备，隧道中央有设备，一人手持对讲机。"},{"time": "01:15","description": "三名绿色防护服人员在隧道内，中间一人蹲在设备前，右侧人员手持对讲机。"}],"summary": "该视频展示了一个考古团队在洞穴中进行探索和清理的全过程。团队成员身穿绿色或蓝色防护服，佩戴防毒面具，使用手电筒等工具，面对黑暗、潮湿、危险的环境，进行水下作业、管道探索、佛像清理、文物发掘等工作。洞穴内部结构复杂，有水流、喷水、喷雾、水柱等现象，洞顶有弧形、圆形、破损等特征，洞壁有层理、岩层、同心圆、佛像等纹饰。团队成员操作设备，清理残骸，发现石块、铜牛头、黑色绳索、佛像、骨骼、金属线团等文物，环境阴暗潮湿，充满未知与紧张感。"
}
```
⚠️ 输出不是标准 JSON

import os, subprocess, glob, json, base64
from openai import OpenAIVIDEO_PATH = r"1.mp4"
FRAME_DIR = "frames"
FPS = 0.2
API_URL = "http://127.0.0.1:8003/v1"
MODEL_NAME = "minicpm-v4.5"def extract_frames(video_path, frame_dir, fps=1):os.makedirs(frame_dir, exist_ok=True)subprocess.run(["ffmpeg", "-i", video_path, "-vf", f"fps={fps}", f"{frame_dir}/frame_%04d.jpg", "-y"])def encode_image_to_base64(path):with open(path, "rb") as f:return base64.b64encode(f.read()).decode("utf-8")def analyze_single_frame(client, frame_path, seconds):time_str = f"{seconds//60:02d}:{seconds%60:02d}"img_b64 = encode_image_to_base64(frame_path)messages = [{"role": "system", "content": "你是一个视频逐帧分析助手"},{"role": "user", "content": [{"type": "text", "text": f"请描述这张图像（视频时间 {time_str}），输出 JSON 格式：{{\"time\": \"{time_str}\", \"description\": 描述}}。只输出 JSON，不要多余文字。"},{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}}]}]resp = client.chat.completions.create(model=MODEL_NAME, messages=messages)return resp.choices[0].message.contentdef merge_results(client, frame_results):messages = [{"role": "system", "content": "你是一个视频总结助手"},{"role": "user", "content": f"以下是逐帧描述：\n{json.dumps(frame_results, ensure_ascii=False, indent=2)}\n请汇总成 JSON 格式：{{\"timeline\": [...], \"summary\": 总结}}。只输出 JSON，不要多余文字。"}]resp = client.chat.completions.create(model=MODEL_NAME, messages=messages)return resp.choices[0].message.contentif __name__ == "__main__":print(">>> Step1: 抽帧中...")extract_frames(VIDEO_PATH, FRAME_DIR, FPS)frames = sorted(glob.glob(f"{FRAME_DIR}/*.jpg"))client = OpenAI(base_url=API_URL, api_key="none")print(">>> Step2: 逐帧分析...")frame_results = []for i, f in enumerate(frames):print(f"  分析帧 {i+1}/{len(frames)}...")result = analyze_single_frame(client, f, i*int(1/FPS))try:frame_results.append(json.loads(result))except:print("⚠️ 单帧结果不是标准 JSON:", result)print(">>> Step3: 汇总...")final_result = merge_results(client, frame_results)print(">>> 最终输出：")print(final_result)try:data = json.loads(final_result)with open("video_analysis.json", "w", encoding="utf-8") as f:json.dump(data, f, ensure_ascii=False, indent=2)print(">>> JSON 已保存到 video_analysis.json")except:print("⚠️ 输出不是标准 JSON")

查看全文

http://www.xdnf.cn/news/19079.html