
A bombshell for the open-source community: Moonshot AI open-sources its latest model, K2

web 2025/7/12 12:41:21

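1. Model Introduction

Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 billion activated parameters and 1 trillion total parameters. Trained with the Muon optimizer, it performs strongly on frontier knowledge, reasoning, and coding tasks, and has been carefully optimized for agentic capabilities.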

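Key Features

Large-scale training: a 1T-parameter MoE model pre-trained on 15.5T tokens, with training remaining stable throughout
MuonClip optimizer: applies the Muon optimizer at an unprecedented scale and introduces new optimization techniques to resolve the instabilities that arise when scaling up
Agentic intelligence: designed specifically for tool use, reasoning, and autonomous problem-solving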

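Model Variants

Kimi-K2-Base: the foundation model, a strong starting point for researchers and developers who want full control over fine-tuning and custom solutions.
Kimi-K2-Instruct: the post-trained model, best suited to drop-in, general-purpose chat and agentic experiences. It is a reflex-grade model that does not use long thinking.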

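2. Model Summary


Architecture	Mixture-of-Experts (MoE)
Total Parameters	1T
Activated Parameters	32B
Number of Layers (dense layer included)	61
Number of Dense Layers	1
Attention Hidden Dimension	7168
MoE Hidden Dimension (per expert)	2048
Number of Attention Heads	64
Number of Experts	384
Selected Experts per Token	8
Number of Shared Experts	1
Vocabulary Size	160K
Context Length	128K
Attention Mechanism	MLA (Multi-head Latent Attention)
Activation Function	SwiGLU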
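
To make the sparsity figures in the table concrete, here is a minimal sketch of top-k expert selection for a single token. Only the counts (384 experts, 8 selected per token, 32B of 1T parameters active) come from the table. The random router logits, the softmax normalisation of the selected weights, and the name `route_token` are illustrative assumptions, not a description of Kimi K2's actual router.

```python
import math
import random

NUM_EXPERTS = 384   # routed experts (from the table)
TOP_K = 8           # experts selected per token (from the table)


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts.

    Weights are a softmax over the selected logits only (an assumption made
    for illustration).
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in scores
selected = route_token(logits)

routed_fraction = TOP_K / NUM_EXPERTS     # share of routed experts used per token
active_param_fraction = 32e9 / 1e12       # activated / total parameters
print(f"experts used per token: {routed_fraction:.2%}")
print(f"parameters active per token: {active_param_fraction:.1%}")
```

The two fractions differ because the activated-parameter count also includes components that run for every token, such as attention, the dense layer, and the shared expert.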

3. Evaluation Results

Instruction model evaluation results

Benchmark	Metric	Kimi K2 Instruct	DeepSeek-V3-0324	Qwen3-235B-A22B (non-thinking)	Claude Sonnet 4 (w/o extended thinking)	Claude Opus 4 (w/o extended thinking)	GPT-4.1	Gemini 2.5 Flash Preview (05-20)
Coding Tasks
LiveCodeBench v6 (Aug 24 - May 25)	Pass@1	53.7	46.9	37.0	48.5	47.4	44.7	44.7
OJBench	Pass@1	27.1	24.0	11.3	15.3	19.6	19.5	19.5
MultiPL-E	Pass@1	85.7	83.1	78.2	88.6	89.6	86.7	85.6
SWE-bench Verified (Agentless Coding)	Single Patch	51.8	36.6	39.4	50.2	53.0	40.8	32.6
SWE-bench Verified (Agentic Coding)	Single Attempt (Acc)	65.8	38.8	34.4	72.7*	72.5*	54.6	—
SWE-bench Verified (Agentic Coding)	Multiple Attempts (Acc)	71.6	—	—	80.2	79.4*	—	—
SWE-bench Multilingual (Agentic Coding)	Single Attempt (Acc)	47.3	25.8	20.9	51.0	—	31.5	—
TerminalBench	Inhouse Framework (Acc)	30.0	—	—	35.5	43.2	8.3	—
TerminalBench	Acc	25.0	16.3	6.6	—	—	30.3	16.8
Aider-Polyglot	Acc	60.0	55.1	61.8	56.4	70.7	52.4	44.0
Tool Use Tasks
Tau2 retail	Avg@4	70.6	69.1	57.0	75.0	81.8	74.8	64.3
Tau2 airline	Avg@4	56.5	39.0	26.5	55.5	60.0	54.5	42.5
Tau2 telecom	Avg@4	65.8	32.5	22.1	45.2	57.0	38.6	16.9
AceBench	Acc	76.5	72.7	70.5	76.2	75.6	80.1	74.5
Math & STEM Tasks
AIME 2024	Avg@64	69.6	59.4*	40.1*	43.4	48.2	46.5	61.3
AIME 2025	Avg@64	49.5	46.7	24.7*	33.1*	33.9*	37.0	46.6
MATH-500	Acc	97.4	94.0*	91.2*	94.0	94.4	92.4	95.4
HMMT 2025	Avg@32	38.8	27.5	11.9	15.9	15.9	19.4	34.7
CNMO 2024	Avg@16	74.3	74.7	48.6	60.4	57.6	56.6	75.0
PolyMath-en	Avg@4	65.1	59.5	51.9	52.8	49.8	54.0	49.9
ZebraLogic	Acc	89.0	84.0	37.7*	73.7	59.3	58.5	57.9
AutoLogi	Acc	89.5	88.9	83.3	89.8	86.1	88.2	84.1
GPQA-Diamond	Avg@8	75.1	68.4*	62.9*	70.0*	74.9*	66.3	68.2
SuperGPQA	Acc	57.2	53.7	50.2	55.7	56.5	50.8	49.6
Humanity's Last Exam (Text Only)	-	4.7	5.2	5.7	5.8	7.1	3.7	5.6
General Tasks
MMLU	EM	89.5	89.4	87.0	91.5	92.9	90.4	90.1
MMLU-Redux	EM	92.7	90.5	89.2	93.6	94.2	92.4	90.6
MMLU-Pro	EM	81.1	81.2*	77.3	83.7	86.6	81.8	79.4
IFEval	Prompt Strict	89.8	81.1	83.2*	87.6	87.4	88.0	84.3
Multi-Challenge	Acc	54.1	31.4	34.0	46.8	49.0	36.4	39.5
SimpleQA	Correct	31.0	27.7	13.2	15.9	22.8	42.3	23.3
Livebench	Pass@1	76.4	72.4	67.6	74.8	74.6	69.8	67.8

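• Bold denotes the global best; underline denotes the best among open-source models.
• Data points marked with * are taken directly from the model's technical report or blog.
• All metrics except SWE-bench Verified (Agentless) are evaluated with an 8k output token length. SWE-bench Verified (Agentless) is limited to a 16k output token length.
• Kimi K2 achieves a 65.8% single-attempt patch pass rate on SWE-bench Verified with bash/editor tools and no test-time compute. Under the same conditions it achieves a 47.3% single-attempt pass rate on SWE-bench Multilingual. We also report a SWE-bench Verified result that uses parallel test-time compute (71.6%), obtained by sampling multiple sequences and selecting the best one with an internal scoring model.
• To keep the evaluation stable, we use avg@k on AIME, HMMT, CNMO, PolyMath-en, GPQA-Diamond, EvalPlus, and Tau2.
• Some data points are omitted because the evaluation cost was prohibitive.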
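
The notes above mention avg@k without defining it. A common reading, assumed here rather than confirmed by the source, is that each problem is attempted k times and the reported score is the mean accuracy over all attempts. The sketch below follows that reading; the function name `avg_at_k` and the toy results are hypothetical.

```python
from statistics import mean


def avg_at_k(results_per_problem):
    """Mean accuracy (in percent) over k sampled attempts per problem.

    results_per_problem: one list of booleans per problem, each holding the
    pass/fail outcome of the k attempts on that problem.
    """
    per_problem = [
        mean(1.0 if passed else 0.0 for passed in attempts)
        for attempts in results_per_problem
    ]
    return 100.0 * mean(per_problem)


# Toy data (not real benchmark output): two problems, k = 4 attempts each.
toy_results = [
    [True, True, False, True],     # solved in 3 of 4 attempts
    [False, False, True, False],   # solved in 1 of 4 attempts
]
score = avg_at_k(toy_results)
print(score)
```

Averaging over several attempts reduces the variance that a single sampled run would show on small benchmarks such as AIME.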

Base model evaluation results

Benchmark	Metric	Shot	Kimi K2 Base	Deepseek-V3-Base	Qwen2.5-72B	Llama 4 Maverick
General Tasks
MMLU	EM	5-shot	87.8	87.1	86.1	84.9
MMLU-pro	EM	5-shot	69.2	60.6	62.8	63.5
MMLU-redux-2.0	EM	5-shot	90.2	89.5	87.8	88.2
SimpleQA	Correct	5-shot	35.3	26.5	10.3	23.7
TriviaQA	EM	5-shot	85.1	84.1	76.0	79.3
GPQA-Diamond	Avg@8	5-shot	48.1	50.5	40.8	49.4
SuperGPQA	EM	5-shot	44.7	39.2	34.2	38.8
Code Tasks
LiveCodeBench v6	Pass@1	1-shot	26.3	22.9	21.1	25.1
EvalPlus	Pass@1	-	80.3	65.6	66.0	65.5
Mathematics Tasks
MATH	EM	4-shot	70.2	60.1	61.0	63.0
GSM8k	EM	8-shot	92.1	91.7	90.4	86.3
Chinese Tasks
C-Eval	EM	5-shot	92.5	90.0	90.9	80.9
CSimpleQA	Correct	5-shot	77.6	72.1	50.5	53.5

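• In this work we evaluate only open-source pretrained models. The base checkpoint of Qwen3-235B-A22B had not been open-sourced at the time of our study, so we report results for Qwen2.5-72B instead.
• All models are evaluated with the same evaluation protocol.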

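4. Deployment

[!NOTE]
You can access the Kimi K2 API at https://platform.moonshot.ai, where we provide OpenAI/Anthropic-compatible API endpoints.

For the Anthropic-compatible API, the temperature is mapped as real_temperature = request_temperature * 0.6, for better compatibility with existing applications.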
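
The mapping above is simple arithmetic, and the sketch below spells it out in both directions. The helper names are hypothetical; only the 0.6 factor comes from the text. One consequence is that a request temperature of 1.0 on the Anthropic-compatible endpoint yields an effective temperature of 0.6, the value recommended later in this article for Kimi-K2-Instruct.

```python
ANTHROPIC_TEMPERATURE_SCALE = 0.6  # factor stated in the deployment note


def real_temperature(request_temperature: float) -> float:
    """Effective sampling temperature for a given request temperature."""
    return request_temperature * ANTHROPIC_TEMPERATURE_SCALE


def request_temperature_for(target_real_temperature: float) -> float:
    """Request temperature to send in order to get a target effective value."""
    return target_real_temperature / ANTHROPIC_TEMPERATURE_SCALE


print(real_temperature(1.0))
print(request_temperature_for(0.6))
```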

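Our model checkpoints are stored in block-fp8 format and are available on Hugging Face.

We currently recommend running Kimi-K2 on the following inference engines:

vLLM
SGLang
KTransformers
TensorRT-LLM

For deployment examples with vLLM and SGLang, see the Model Deployment Guide.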

5. Model Usage

Chat Completion

Once the local inference service is running, you can interact with it through the chat endpoint:

from openai import OpenAI


def simple_chat(client: OpenAI, model_name: str):
    messages = [
        {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
        {"role": "user", "content": [{"type": "text", "text": "Please give a brief self-introduction."}]},
    ]
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        stream=False,
        temperature=0.6,
        max_tokens=256,
    )
    print(response.choices[0].message.content)

[!NOTE]
The recommended temperature for Kimi-K2-Instruct is temperature = 0.6.
Unless you have special requirements, the system prompt above is a good default.
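
As an illustration of these defaults, the sketch below assembles a chat-completions request body with the recommended temperature and system prompt, using only the standard library. The model name "kimi-k2-instruct", the address http://localhost:8000/v1, and the helper names are assumptions for this example; substitute whatever your inference server actually exposes.

```python
import json
import urllib.request

DEFAULT_SYSTEM_PROMPT = "You are Kimi, an AI assistant created by Moonshot AI."
RECOMMENDED_TEMPERATURE = 0.6


def build_chat_request(model_name: str, user_text: str, max_tokens: int = 256) -> dict:
    """Assemble a chat-completions request body with the recommended defaults."""
    return {
        "model": model_name,
        "messages": [
            {"role": "system", "content": DEFAULT_SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
        ],
        "temperature": RECOMMENDED_TEMPERATURE,
        "max_tokens": max_tokens,
        "stream": False,
    }


def post_chat(base_url: str, payload: dict) -> str:
    """POST the payload to an OpenAI-compatible server and return the reply text."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# "kimi-k2-instruct" is a placeholder model name, not one confirmed by the source.
payload = build_chat_request("kimi-k2-instruct", "Please give a brief self-introduction.")
# With a server running at the assumed address:
# print(post_chat("http://localhost:8000/v1", payload))
```

Keeping the payload construction separate from the network call makes the defaults easy to inspect and reuse across clients.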

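Tool Calling

Kimi-K2-Instruct has strong tool-calling capabilities.
To enable them, pass the list of available tools in every request; the model then decides on its own when and how to call them.

The following example shows an end-to-end weather tool call: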

import json

from openai import OpenAI


# Your tool implementation
def get_weather(city: str) -> dict:
    return {"weather": "Sunny"}


# Tool schema definition
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Retrieve current weather information. Call this when the user asks about the weather.",
        "parameters": {
            "type": "object",
            "required": ["city"],
            "properties": {
                "city": {"type": "string", "description": "Name of the city"}
            }
        }
    }
}]

# Map tool names to their implementations
tool_map = {
    "get_weather": get_weather
}


def tool_call_with_client(client: OpenAI, model_name: str):
    messages = [
        {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
        {"role": "user", "content": "What's the weather like in Beijing today? Use the tool to check."}
    ]
    finish_reason = None
    # Keep querying until the model stops requesting tool calls
    while finish_reason is None or finish_reason == "tool_calls":
        completion = client.chat.completions.create(
            model=model_name,
            messages=messages,
            temperature=0.6,
            tools=tools,          # tool list defined above
            tool_choice="auto"
        )
        choice = completion.choices[0]
        finish_reason = choice.finish_reason
        if finish_reason == "tool_calls":
            messages.append(choice.message)
            for tool_call in choice.message.tool_calls:
                tool_call_name = tool_call.function.name
                tool_call_arguments = json.loads(tool_call.function.arguments)
                tool_function = tool_map[tool_call_name]
                tool_result = tool_function(**tool_call_arguments)
                print("tool_result:", tool_result)
                # Return the result to the model as a "tool" message
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "name": tool_call_name,
                    "content": json.dumps(tool_result)
                })
    print("-" * 100)
    print(choice.message.content)
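
The example above assumes the model always names a registered tool and sends valid JSON arguments. The sketch below isolates the dispatch step so it can be exercised without a server, and returns an error payload to the model instead of raising. The helper `run_tool_call` and the shape of the error payload are assumptions for illustration, not part of the Kimi K2 API.

```python
import json


def get_weather(city: str) -> dict:
    # Stub mirroring the weather tool in the example above
    return {"weather": "Sunny"}


tool_map = {"get_weather": get_weather}


def run_tool_call(call_id: str, name: str, arguments_json: str, tools: dict) -> dict:
    """Run one requested tool call and wrap the outcome as a "tool" message."""
    if name not in tools:
        result = {"error": f"unknown tool: {name}"}
    else:
        try:
            arguments = json.loads(arguments_json)
            result = tools[name](**arguments)
        except (json.JSONDecodeError, TypeError) as exc:
            # Malformed JSON, non-object arguments, or wrong parameter names
            result = {"error": f"tool call failed: {exc}"}
    return {
        "role": "tool",
        "tool_call_id": call_id,
        "name": name,
        "content": json.dumps(result),
    }


ok = run_tool_call("call_1", "get_weather", '{"city": "Beijing"}', tool_map)
missing = run_tool_call("call_2", "get_time", "{}", tool_map)
malformed = run_tool_call("call_3", "get_weather", "not json", tool_map)
print(ok["content"])
```

Reporting failures back as tool messages lets the model retry or explain the problem, rather than ending the conversation loop with an exception.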