Unsloth in Practice: A Guide to Efficient Fine-Tuning of DeepSeek-R1 (Part 2)

java 2025/7/15 10:42:03

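Reading guide

This series is split into two parts because of its length:

Part 1 covers environment setup: the Unsloth framework, downloading and importing the base model, downloading, loading and cleaning the dataset, and registering a SwanLab account.

Part 2 (this article) covers hands-on training: LoRA fine-tuning, full fine-tuning, and continued pretraining.

I. LoRA Fine-Tuning in Practice

Once the data is ready, fine-tuning can begin. We first run a small test on a limited amount of data, and move on to the full dataset only after the pipeline runs end to end.

Step 1: Inject the LoRA parameters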

https://docs.unsloth.ai/get-started/fine-tuning-guide/lora-hyperparameters-guide

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,           # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,  # Best to choose alpha = rank or rank*2
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,   # We support rank stabilized LoRA
    loftq_config = None,  # And LoftQ
)

Unsloth 2025.6.8 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.

Step 2: Set the fine-tuning parameters

from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = combined_dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 30,
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none",
    ),
)

SFTTrainer is a trainer designed for instruction fine-tuning that wraps the Hugging Face Trainer. SFTConfig is its dedicated configuration class and plays a role similar to TrainingArguments. The core SFTConfig parameters are:

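Parameter	Meaning
`dataset_text_field="text"`	Name of the dataset field used for training, e.g. `text` or `prompt`
`per_device_train_batch_size=2`	Batch size of 2 on each GPU
`gradient_accumulation_steps=4`	Accumulate gradients over 4 micro-batches before each parameter update (effective batch size = 2 × 4 = 8)
`warmup_steps=5`	Warm up over the first 5 steps (the learning rate is raised gradually)
`max_steps=30`	Train for at most 30 steps (suitable for debugging or quick experiments)
`learning_rate=2e-4`	Initial learning rate (a higher value is acceptable for short runs)
`logging_steps=1`	Log after every training step
`optim="adamw_8bit"`	Use the 8-bit AdamW optimizer (saves memory; supported by Unsloth)
`weight_decay=0.01`	Weight decay, used to reduce overfitting
`lr_scheduler_type="linear"`	Linear learning-rate scheduler (decays linearly from the peak value)
`seed=3407`	Fixed random seed for reproducibility
`report_to="none"`	Do not report to logging platforms such as WandB or TensorBoard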
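To make the interaction of `warmup_steps`, `max_steps` and `lr_scheduler_type="linear"` concrete, here is a minimal sketch of linear warmup followed by linear decay. It follows the commonly used linear-with-warmup rule and is only an illustration, not the trainer's own code; `linear_lr` is a hypothetical helper name.

```python
def linear_lr(step: int, base_lr: float = 2e-4,
              warmup_steps: int = 5, max_steps: int = 30) -> float:
    """Learning rate at a given optimizer step (0-based) under linear warmup + linear decay."""
    if step < warmup_steps:
        # Warmup: rise linearly from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from base_lr to 0 at max_steps
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)


if __name__ == "__main__":
    for s in (0, 1, 5, 10, 29, 30):
        print(s, f"{linear_lr(s):.2e}")
```

With the values from the table, the rate peaks at step 5 and reaches zero at step 30, which is why a short 30-step run spends a noticeable share of its budget warming up.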

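The basic training loop is therefore:

Take one micro-batch (2 samples) from combined_dataset and compute its gradients
Repeat this 4 times (gradient_accumulation_steps=4)
Use the accumulated gradients for a single parameter update (equivalent to one update with a larger batch)
Repeat the whole procedure until max_steps=30 is reached

This run therefore consumes 2 * 4 * 30 = 240 training samples.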
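The equivalence between accumulated micro-batches and one larger batch can be checked on a toy problem. The sketch below uses a one-parameter model `y = w * x` with mean squared error; the model, data and helper names are illustrative assumptions and have nothing to do with the actual trainer internals.

```python
def grad_mse(w: float, batch: list[tuple[float, float]]) -> float:
    """Gradient of the mean squared error of y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w: float, data: list[tuple[float, float]],
                     micro_batch_size: int, accumulation_steps: int) -> float:
    """Average the gradients of several micro-batches, as gradient accumulation does."""
    total = 0.0
    for i in range(accumulation_steps):
        micro = data[i * micro_batch_size:(i + 1) * micro_batch_size]
        total += grad_mse(w, micro) / accumulation_steps
    return total


data = [(float(x), 3.0 * x + 1.0) for x in range(1, 9)]  # 8 toy samples
w = 0.5

big_batch = grad_mse(w, data)                 # one batch of 8
accumulated = accumulated_grad(w, data, 2, 4) # 4 micro-batches of 2

samples_seen = 2 * 4 * 30                     # batch size x accumulation x max_steps
print(big_batch, accumulated, samples_seen)
```

Both gradients agree up to floating-point error, so accumulation trades time for memory without changing the update in this simple setting.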

Step 3: Set up SwanLab

Instantiate SwanLabCallback:

from swanlab.integration.transformers import SwanLabCallback

# Instantiate SwanLabCallback
swanlab_callback = SwanLabCallback(
    project="trl_integration",
    experiment_name="DeepSeek-R1-Distill-Qwen-1.5B-SFT",
    description="测试swanlab和trl的集成",  # "testing the SwanLab and TRL integration"
    config={"framework": "🤗TRL"},
)

Then take the TRL trainer you are using (SFTTrainer, PPOTrainer, GRPOTrainer, etc.) and pass the swanlab_callback instance in its callbacks argument:

from trl import SFTConfig, SFTTrainer
...
trainer = SFTTrainer(
    ...
    # Pass the callback via the callbacks argument
    callbacks=[swanlab_callback],
)

The complete code is:

from trl import SFTTrainer, SFTConfig
from swanlab.integration.transformers import SwanLabCallback

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = combined_dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 30,
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none",
    ),
    callbacks=[swanlab_callback],
)

Unsloth: Tokenizing ["text"] (num_proc=16): 100%|█| 5000/5000 [00:07<00:00, 703.

GPU memory usage at this point:

# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 3060 Laptop GPU. Max memory = 5.676 GB.
3.609 GB of memory reserved.

Step 4: Run the fine-tuning

trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,000 | Num Epochs = 1 | Total steps = 30
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 36,929,536/1,814,017,536 (2.04% trained)
swanlab: Tracking run with swanlab version 0.6.4
swanlab: Run data will be saved locally in /home/sam/MyWorkSpace/notebook/炼丹指南/swanlog/run-20250702_204318-0e8cd89d
swanlab: 👋 Hi CulinaryAlchemist, welcome to swanlab!
swanlab: Syncing run DeepSeek-R1-Distill-Qwen-1.5B-SFT to the cloud
swanlab: 🏠 View project at https://swanlab.cn/@CulinaryAlchemist/trl_integration
swanlab: 🚀 View run at https://swanlab.cn/@CulinaryAlchemist/trl_integration/runs/ah9pc66lt4sahsepni8qy

trainer_stats

TrainOutput(global_step=30, training_loss=1.396385904153188, metrics={'train_runtime': 310.6789, 'train_samples_per_second': 0.773, 'train_steps_per_second': 0.097, 'total_flos': 5744717262397440.0, 'train_loss': 1.396385904153188})

GPU memory usage during fine-tuning

# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

310.6789 seconds used for training.
5.18 minutes used for training.
Peak reserved memory = 4.641 GB.
Peak reserved memory for training = 1.032 GB.
Peak reserved memory % of max memory = 81.765 %.
Peak reserved memory for training % of max memory = 18.182 %.

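Note that when training finishes, the model object in memory already holds the updated weights (the trained LoRA adapters stay attached), so the fine-tuned model can be called directly without merging the weights manually: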

FastLanguageModel.for_inference(model)

PeftModelForCausalLM((base_model): LoraModel((model): Qwen2ForCausalLM((model): Qwen2Model((embed_tokens): Embedding(151936, 1536, padding_idx=151654)(layers): ModuleList((0-27): 28 x Qwen2DecoderLayer((self_attn): Qwen2Attention((q_proj): lora.Linear((base_layer): Linear(in_features=1536, out_features=1536, bias=True)(lora_dropout): ModuleDict((default): Identity())(lora_A): ModuleDict((default): Linear(in_features=1536, out_features=32, bias=False))(lora_B): ModuleDict((default): Linear(in_features=32, out_features=1536, bias=False))(lora_embedding_A): ParameterDict()(lora_embedding_B): ParameterDict()(lora_magnitude_vector): ModuleDict())(k_proj): lora.Linear((base_layer): Linear(in_features=1536, out_features=256, bias=True)(lora_dropout): ModuleDict((default): Identity())(lora_A): ModuleDict((default): Linear(in_features=1536, out_features=32, bias=False))(lora_B): ModuleDict((default): Linear(in_features=32, out_features=256, bias=False))(lora_embedding_A): ParameterDict()(lora_embedding_B): ParameterDict()(lora_magnitude_vector): ModuleDict())(v_proj): lora.Linear((base_layer): Linear(in_features=1536, out_features=256, bias=True)(lora_dropout): ModuleDict((default): Identity())(lora_A): ModuleDict((default): Linear(in_features=1536, out_features=32, bias=False))(lora_B): ModuleDict((default): Linear(in_features=32, out_features=256, bias=False))(lora_embedding_A): ParameterDict()(lora_embedding_B): ParameterDict()(lora_magnitude_vector): ModuleDict())(o_proj): lora.Linear((base_layer): Linear(in_features=1536, out_features=1536, bias=False)(lora_dropout): ModuleDict((default): Identity())(lora_A): ModuleDict((default): Linear(in_features=1536, out_features=32, bias=False))(lora_B): ModuleDict((default): Linear(in_features=32, out_features=1536, bias=False))(lora_embedding_A): ParameterDict()(lora_embedding_B): ParameterDict()(lora_magnitude_vector): ModuleDict())(rotary_emb): LlamaRotaryEmbedding())(mlp): Qwen2MLP((gate_proj): lora.Linear((base_layer): 
Linear(in_features=1536, out_features=8960, bias=False)(lora_dropout): ModuleDict((default): Identity())(lora_A): ModuleDict((default): Linear(in_features=1536, out_features=32, bias=False))(lora_B): ModuleDict((default): Linear(in_features=32, out_features=8960, bias=False))(lora_embedding_A): ParameterDict()(lora_embedding_B): ParameterDict()(lora_magnitude_vector): ModuleDict())(up_proj): lora.Linear((base_layer): Linear(in_features=1536, out_features=8960, bias=False)(lora_dropout): ModuleDict((default): Identity())(lora_A): ModuleDict((default): Linear(in_features=1536, out_features=32, bias=False))(lora_B): ModuleDict((default): Linear(in_features=32, out_features=8960, bias=False))(lora_embedding_A): ParameterDict()(lora_embedding_B): ParameterDict()(lora_magnitude_vector): ModuleDict())(down_proj): lora.Linear((base_layer): Linear(in_features=8960, out_features=1536, bias=False)(lora_dropout): ModuleDict((default): Identity())(lora_A): ModuleDict((default): Linear(in_features=8960, out_features=32, bias=False))(lora_B): ModuleDict((default): Linear(in_features=32, out_features=1536, bias=False))(lora_embedding_A): ParameterDict()(lora_embedding_B): ParameterDict()(lora_magnitude_vector): ModuleDict())(act_fn): SiLU())(input_layernorm): Qwen2RMSNorm((1536,), eps=1e-06)(post_attention_layernorm): Qwen2RMSNorm((1536,), eps=1e-06)))(norm): Qwen2RMSNorm((1536,), eps=1e-06)(rotary_emb): LlamaRotaryEmbedding())(lm_head): Linear(in_features=1536, out_features=151936, bias=False)))
)
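The layer shapes printed above are enough to check the trainable-parameter count by hand. A LoRA adapter on a linear layer adds `rank * (in_features + out_features)` parameters (matrices A and B). The sketch below hard-codes the shapes from the dump; `lora_trainable_params` is a hypothetical helper, not an Unsloth or PEFT function.

```python
# (in_features, out_features) of each adapted projection, copied from the model dump
PROJ_SHAPES = {
    "q_proj": (1536, 1536),
    "k_proj": (1536, 256),
    "v_proj": (1536, 256),
    "o_proj": (1536, 1536),
    "gate_proj": (1536, 8960),
    "up_proj": (1536, 8960),
    "down_proj": (8960, 1536),
}
NUM_LAYERS = 28  # "28 x Qwen2DecoderLayer" in the dump


def lora_trainable_params(rank: int) -> int:
    """Total LoRA parameters: rank * (fan_in + fan_out) per projection, per layer."""
    per_layer = sum(rank * (fan_in + fan_out)
                    for fan_in, fan_out in PROJ_SHAPES.values())
    return per_layer * NUM_LAYERS


print(lora_trainable_params(32))  # 36929536
print(lora_trainable_params(16))  # 18464768
```

The dump shows adapters with `out_features=32`, i.e. rank 32, and the rank-32 count matches the "Trainable parameters = 36,929,536" line in the Step 4 log. The rank-16 count matches the 18,464,768 reported in the Step 6 log.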

Step 5: Test the model in a chat

messages = [
    {"role" : "user", "content" : "解方程 (x + 2)^2 = 0."}  # "Solve the equation (x + 2)^2 = 0."
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True,
)

inputs = tokenizer(text, return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=False,
)
response = tokenizer.batch_decode(outputs)

The following generation flags are not valid and may be ignored: ['cache_implementation']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

response

['<｜begin▁of▁sentence｜><｜begin▁of▁sentence｜><｜User｜>解方程 (x + 2)^2 = 0.<｜Assistant｜><think>\n嗯，我现在要解这个方程：(x + 2)^2 = 0。好，首先，我记得解二次方程通常需要把方程转化为标准形式，然后使用求根公式或者因式分解来解。但这里方程看起来有点特别，因为左边已经是平方了，所以可能不需要那么麻烦。\n\n首先，我应该回忆一下平方等于零的情况。如果一个数的平方等于零，那么这个数本身必须是零。对吗？所以，我应该把括号里的部分设为零，这样就能解出x的值了。也就是说，(x + 2)^2 = 0的话，里面的括号部分x + 2也必须等于零。这样的话，解就是x + 2 = 0，解得x = -2。是不是这样？嗯，对的，这个方法应该没错。\n\n不过，我是不是应该检查一下我的步骤是否正确？比如，如果我把方程展开的话，左边是(x + 2)^2，展开后是x² + 4x +4，所以方程变成x² + 4x +4 = 0。然后用求根公式解这个方程的话，判别式Δ = b² - 4ac = 16 - 16 = 0，说明这个方程有一个双重根，也就是x = (-b)/(2a) = (-4)/(2*1) = -2。嗯，结果和之前的方法一致，都是x = -2。所以这个解是对的。\n\n或者，我可以代入x = -2回去看看，方程左边是(-2 + 2)^2 = (0)^2 = 0，右边也是0，两边相等，说明x = -2确实是解。所以这个解是正确的。\n\n不过，我是不是应该考虑有没有其他可能的解呢？比如，如果方程左边是平方，是否有其他值也会导致平方为零？比如负数？但平方的结果都是非负的，所以只有当括号里的数为零的时候，平方才能等于零。所以没有其他解。因此，x只能等于-2。\n\n嗯，可能还有其他的解吗？比如，如果方程是(x + 2)^2 = 0，那么解只能是x = -2，因为任何其他的数平方后都不会等于零。比如说，如果x + 2是零，那么x必须是-2，所以没有其他可能的解。\n\n所以，总结一下，解这个方程的过程是：\n\n1. 观察方程的结构，左边已经是平方的形式，所以直接令括号内的表达式等于零。\n2. 解这个线性方程，得到x = -2。\n3. 验证解的正确性，代入原方程，结果正确。\n4. 检查是否有其他可能的解，发现没有，因为平方等于零只有当括号内为零时才成立。\n\n所以，最终解是x = -2。\n\n**答案**\n方程的解是\\boxed{-2}\n</think>\n\n\n\n首先，我们观察到方程 \\((x + 2)^2 = 0\\) 已经是平方的形式，因此我们需要令括号内的表达式等于零。\n\n1. 令 \\(x + 2 = 0\\)，解得 \\(x = -2\\)。\n\n2. 验证解的正确性：将 \\(x = -2\\) 代入原方程，左边为 \\((-2 + 2)^2 = 0\\)，右边为0，两边相等，说明解正确。\n\n3. 检查是否有其他可能的解：平方等于零的数只有零，因此没有其他解。\n\n最终解为 \\(x = -2\\)。\n\n\\[\n\\boxed{-2}\n\\]<｜end▁of▁sentence｜>']
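The decoded string mixes special tokens, the `<think>` reasoning and the final answer. When checking results it is often convenient to separate them. The following is a small sketch under the assumption that the output follows the layout shown above; `split_reasoning` is a hypothetical helper and not part of Unsloth or Transformers.

```python
def split_reasoning(decoded: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a decoded DeepSeek-R1 style completion."""
    eos = "<｜end▁of▁sentence｜>"
    # Keep only the assistant turn and drop the end-of-sentence marker
    text = decoded.split("<｜Assistant｜>", 1)[-1].replace(eos, "")
    if "</think>" in text:
        reasoning, answer = text.split("</think>", 1)
    else:
        # No closing tag: treat everything as the answer
        reasoning, answer = "", text
    reasoning = reasoning.replace("<think>", "").strip()
    return reasoning, answer.strip()


sample = ("<｜begin▁of▁sentence｜><｜User｜>q<｜Assistant｜>"
          "<think>\nsteps\n</think>\n\nx = -2<｜end▁of▁sentence｜>")
reasoning, answer = split_reasoning(sample)
print(repr(reasoning), repr(answer))
```

In the notebook it could be called as `split_reasoning(response[0])`.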

Step 6: Fine-tune on the full dataset

Note that the model has already been trained on a small amount of data; the run below continues from that trained model.

With the minimum viable experiment done, train for one full epoch:

from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = combined_dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 2, # Use GA to mimic batch size!
        warmup_steps = 5,
        num_train_epochs = 1, # Set this for 1 full training run.
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none",
        # output_dir="outputs", # output directory for training results
    ),
    callbacks=[swanlab_callback],
)

trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,000 | Num Epochs = 1 | Total steps = 625
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 2 x 1) = 8
 "-____-"     Trainable parameters = 18,464,768/1,795,552,768 (1.03% trained)
swanlab: Tracking run with swanlab version 0.6.4
swanlab: Run data will be saved locally in /home/sam/MyWorkSpace/notebook/炼丹指南/swanlog/run-20250703_091536-0e8cd89d
swanlab: 👋 Hi CulinaryAlchemist, welcome to swanlab!
swanlab: Syncing run DeepSeek-R1-Distill-Qwen-1.5B-SFT to the cloud
swanlab: 🏠 View project at https://swanlab.cn/@CulinaryAlchemist/trl_integration
swanlab: 🚀 View run at https://swanlab.cn/@CulinaryAlchemist/trl_integration/runs/s7igpey8zrvv5mqoiukvz

trainer_stats

TrainOutput(global_step=625, training_loss=1.3668213096618653, metrics={'train_runtime': 8672.3949, 'train_samples_per_second': 0.577, 'train_steps_per_second': 0.072, 'total_flos': 1.5498959703926784e+17, 'train_loss': 1.3668213096618653})
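The "Total steps" value in the log follows from the dataset size and the effective batch size. Below is a minimal sketch of that arithmetic; rounding up per epoch is an assumption that happens to match the logs in this article, and other library versions may round differently. `total_optimizer_steps` is a hypothetical helper.

```python
import math


def total_optimizer_steps(num_examples: int, per_device_batch_size: int,
                          grad_accum_steps: int, num_epochs: int = 1,
                          num_gpus: int = 1) -> int:
    """Optimizer steps = ceil(examples / effective batch) * epochs."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_gpus
    return math.ceil(num_examples / effective_batch) * num_epochs


print(total_optimizer_steps(5000, 4, 2))       # 625, the one-epoch LoRA run
print(total_optimizer_steps(674, 2, 4))        # 85, the full fine-tuning run
print(total_optimizer_steps(674, 2, 4, 5))     # 425, the 5-epoch instruction run
print(total_optimizer_steps(6, 2, 4, 70))      # 70, the continued-pretraining run
```

At about 0.072 steps per second, as reported in the output above, 625 steps take roughly 8,700 seconds, which agrees with the recorded train_runtime.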

Training charts recorded by SwanLab:

Note: this run does not actually converge. The point here is only to show how to read the charts.

[Figure: SwanLab training charts]

After training finishes, chat with the model again:

messages = [
    {"role" : "user", "content" : "提问XXX"}  # placeholder: put your question here
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True,
)

inputs = tokenizer(text, return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=max_seq_length,
    temperature = 0.6,
    top_p = 0.95,
    use_cache=False,
)
response = tokenizer.batch_decode(outputs)

The following generation flags are not valid and may be ignored: ['cache_implementation']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

You can now check whether response matches what you expect after training.

Note: the inference parameters temperature = 0.6 and top_p = 0.95 affect the quality of the output.

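Step 7: Save the model

Once fine-tuning is done, save the model. Because the LoRA adapters were trained in 16-bit precision, merge and save the model in 16-bit format to fully preserve its current quality:

model.save_pretrained_merged(save_directory = "DeepSeekR1-1.5B-finetuned-fp16", tokenizer = tokenizer, save_method = "merged_16bit")

Alternatively, merge to 4-bit quantized weights, which saves storage and suits low-resource inference:

model.save_pretrained_merged(save_directory = "DeepSeekR1-1.5B-finetuned-4bit", tokenizer = tokenizer, save_method="merged_4bit")

Export to GGUF format, which works with frameworks such as Ollama and supports CPU inference:

# Saved with Q8_0 quantization by default (a balance of speed and accuracy)
model.save_pretrained_gguf("DeepSeekR1-1.5B-Q8_0", tokenizer)

# Custom quantization method (e.g. q4_k_m, q5_k_m)
model.save_pretrained_gguf("DeepSeekR1-1.5B-q4_k_m", tokenizer, quantization_method="q4_k_m")

Save only the LoRA adapters, for cases where you plan to continue fine-tuning or storage is limited:

model.save_pretrained("lora_model")  # save the adapter weights
tokenizer.save_pretrained("lora_model")  # save the tokenizer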

II. Full Fine-Tuning

Use with caution! Full fine-tuning is very VRAM-hungry and can easily cause catastrophic forgetting of the model's existing capabilities.

import torch
from unsloth import FastLanguageModel

Step 1: Load the pretrained model

# 1. Load the pretrained model
model, tokenizer = FastLanguageModel.from_pretrained(
    "./deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    load_in_4bit = False,
    load_in_8bit = False,
    device_map = "auto",
    max_seq_length = 2048,
    dtype = None,
    # use_cache = False,
    full_finetuning = True,      # enable full fine-tuning
)

==((====))==  Unsloth 2025.6.8: Fast Qwen2 patching. Transformers: 4.53.0.
   \\   /|    NVIDIA GeForce RTX 3060 Laptop GPU. Num GPUs = 1. Max memory: 5.676 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 8.6. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using bfloat16 full finetuning which cuts memory usage by 50%.
./deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B does not have a padding token! Will use pad_token = <|vision_pad|>.

Step 2: Load the dataset

from datasets import load_from_disk

# 2. Load the dataset
train_dataset = load_from_disk("cleaned_dataset_v4.0.0")

def formatting_prompts_func(examples):
    texts = examples["text"]
    return {"text" : texts}

# Apply the formatting function
dataset = train_dataset.map(formatting_prompts_func, batched=True)

type(dataset)

datasets.arrow_dataset.Dataset

len(dataset['text'])

dataset['text'][0]

'以下是一个任务说明，配有提供更多背景信息的输入。\n请写出一个恰当的回答来完成该任务。\n在回答之前，请仔细思考问题，并按步骤进行推理，确保回答逻辑清晰且准确。\n\n### Instruction:\n您是一位具有高级电气系统分析、机械动力学和运动控制规划知识的工程专家。\n请回答以下电气机械运动领域的技术问题。\n\n### Question:\n输送机械动力电机选择，首推哪类？\n\n\n### Response:\n<think>\n\n\n1. **明确动力场景类型**\n从用户的问题描述可知，这是属于输送线运动场景，我需要找到适配该场景最佳的电机型号\n\n2. **明确动力场景特征**  \n输送线系统需要**稳定的速度控制**（低速平稳性、低速无爬行）和**高负载适应性**（持续运行、频繁启停、抗冲击能力），同时可能涉及多段输送线的**同步协调控制**。  \n\n3. **拆解核心需求指标**  \n- **速度稳定性**：低速运行无爬行，速度波动小（如±0.1%以内）  \n- **负载适应性**：高扭矩输出能力，适应不同负载变化（如空载→满载切换）  \n- **环境可靠性**：抗振动、防尘、抗电磁干扰（如工厂车间环境）  \n- **通信与同步**：支持工业总线协议（如EtherCAT/CANopen），实现多轴同步控制  \n\n4. **匹配电机技术特性**  \n- **力矩波动控制**：时代超群伺服电机采用**多级力矩波动抑制技术**，减少低速运行时的速度波动，确保输送带平稳运行。  \n- **高可靠性设计**：  \n  - **双成PCB制版**：提升电路抗干扰能力，适应工厂复杂电磁环境  \n  - **TI芯片方案**：高性能运算与稳定性，确保长时间运行不丢步  \n- **通信与同步能力**：  \n  - 支持**EtherCAT/CANopen**总线协议，实现与PLC的高效通信  \n  - 低通信延迟，满足多段输送线的同步控制需求  \n\n5. **排除非适配方案**  \n- **普通异步电机+变频器**：  \n  - 速度稳定性差（低速易爬行），动态响应慢  \n  - 依赖机械减速机构，增加维护成本  \n- **步进电机**：  \n  - 开环控制，易丢步，不适合高负载场景  \n  - 高速性能差，无法满足高速输送线需求  \n\n6. **最终结论**  \n时代超群交流伺服电机凭借**多级力矩波动控制**、**高可靠性设计**（双成PCB+TI芯片）以及**工业总线通信能力**，能够完美匹配输送线对**速度稳定性、高负载适应性和同步控制**的需求，是工业输送系统的理想选择。  \n\n\n\n</think>\n\n\n\n选择时代超群交流伺服电机。\n<｜end▁of▁sentence｜>'

Step 3: Set up SwanLab

# 3. Log the training run with SwanLab
from swanlab.integration.transformers import SwanLabCallback
import swanlab

swanlab.login(api_key="xxx", save=True)

import os
os.environ["SWANLAB_PROJECT"] = "trl_integration"

# Instantiate SwanLabCallback
swanlab_callback = SwanLabCallback(
    project="trl_integration",
    experiment_name="DS-R1-1.5B-FFT",
    description="垂直领域微调实验",  # "vertical-domain fine-tuning experiment"
    config={"framework": "🤗TRL"},
)

Step 4: Configure the training arguments

# 4. Configure the training arguments
from transformers import TrainingArguments

training_args = TrainingArguments(
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,
    warmup_steps = 10,
    num_train_epochs = 1,
    # max_steps = 60,
    learning_rate = 2e-5,
    fp16 = not torch.cuda.is_bf16_supported(),
    bf16 = torch.cuda.is_bf16_supported(),
    logging_steps = 1,
    output_dir = "outputs",
    optim = "adamw_8bit",
    save_strategy = "steps",
    save_steps = 20,
)

Step 5: Create the trainer

# 5. Create the trainer
from trl import SFTTrainer

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",  # must match the "text" column produced in Step 2
    max_seq_length = 2048,
    args = training_args,
    callbacks=[swanlab_callback],
)

GPU memory usage at this point:

[Figure: screenshot of GPU memory usage]

Step 6: Start full fine-tuning

# 6. Start full fine-tuning
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 674 | Num Epochs = 1 | Total steps = 85
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 1,777,088,000/1,777,088,000 (100.00% trained)
swanlab: Tracking run with swanlab version 0.6.4
swanlab: Run data will be saved locally in /home/sam/MyWorkSpace/notebook/炼丹指南/swanlog/run-20250705_162507-a3b1799d
swanlab: 👋 Hi CulinaryAlchemist, welcome to swanlab!
swanlab: Syncing run outputs to the cloud
swanlab: 🏠 View project at https://swanlab.cn/@CulinaryAlchemist/trl_integration
swanlab: 🚀 View run at https://swanlab.cn/@CulinaryAlchemist/trl_integration/runs/pm76opzwtwps100h3glsh

Step 7: Save the fine-tuned model

# 7. Save the fine-tuned model
new_model_local = "DeepSeekR1-1.5B-finetuned-fp16-lab06"
model.save_pretrained(new_model_local)
tokenizer.save_pretrained(new_model_local)
model.save_pretrained_merged(new_model_local, tokenizer, save_method = "merged_16bit",)

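III. Continued Pretraining in Practice

Why continue pretraining?

Better model performance

Continued pretraining can further improve a model on a specific task or domain. By introducing new data or adjusting the training strategy, the model learns richer language patterns and knowledge, which improves the quality, accuracy and relevance of what it generates.

Domain adaptation

A general-purpose pretrained model may perform poorly in some specialized fields. Continued pretraining lets the model absorb domain-specific terminology, knowledge and context, making it more useful in vertical scenarios such as medicine, law and finance.

Data freshness

Language and knowledge keep evolving, and a model trained earlier may not reflect the latest information. Continued pretraining updates the model's knowledge so that its output stays consistent with the current state of the world.

Task customization

Different tasks place very different demands on a model. Continued pretraining can optimize the model for a specific task, for example by adjusting generation length, style or logical structure to better fit the actual application.

Resource efficiency

Training a large model from scratch is extremely expensive. Continued pretraining starts from an existing model, which greatly reduces compute and time while still improving performance.

Step 1: Import the base model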

# Run this to disable CCE since it is not supported for CPT.
# Keep the comment on its own line: anything after the value on a %env line becomes part of the value.
%env UNSLOTH_RETURN_LOGITS=1

env: UNSLOTH_RETURN_LOGITS=1

import unsloth
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "./deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
/home/sam/anaconda3/envs/myunsloth/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.6.8: Fast Qwen2 patching. Transformers: 4.53.0.
   \\   /|    NVIDIA GeForce RTX 3060 Laptop GPU. Num GPUs = 1. Max memory: 5.676 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 8.6. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
./deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B does not have a padding token! Will use pad_token = <|vision_pad|>.

Create the LoRA adapters:

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head",], # Add for continual pretraining
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 2507,
    use_rslora = True,   # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth: Offloading input_embeddings to disk to save VRAM
Unsloth: Offloading output_embeddings to disk to save VRAM
Unsloth 2025.6.8 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
Unsloth: Training embed_tokens in mixed precision to save VRAM
Unsloth: Training lm_head in mixed precision to save VRAM

Step 2: Configure SwanLab

import swanlab

swanlab.login(api_key="xxx", save=True)

from swanlab.integration.transformers import SwanLabCallback

# Instantiate SwanLabCallback
swanlab_callback = SwanLabCallback(
    project="trl_integration",
    experiment_name="DeepSeek-R1-Distill-Qwen-1.5B-SFT",
    description="测试swanlab和trl的集成",  # "testing the SwanLab and TRL integration"
    config={"framework": "🤗TRL"},
)

Step 3: Prepare the domain data

Append EOS_TOKEN to the end of every sample; otherwise the model never learns where to stop and generation runs on indefinitely.

Pretraining data has no required format. As an example, the domain data here consists of motor model information and selection strategies (experience) for different scenarios:

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN

cpt_prompt = """### question：{}
### answer：{}
"""

# '答案' ("answer") is a placeholder for the real answer text
domain_data = [
    {'q': '在机械臂的 x、y 轴运动场景中，应选择哪种电机？机械臂的 x、y 轴运动需要高精度位置控制和快速响应能力。', 'a': '答案'},
    {'q': '输送线的动力电机选型应优先考虑什么类型？', 'a': '答案'},
    {'q': '机械臂执行器的运动电机应如何选型？', 'a': '答案'},
    {'q': 'RGV 行走的动力电机应选择哪种型号？', 'a': '答案'},
    {'q': 'AGV 行走的动力电机应如何选型？', 'a': '答案'},
    {'q': 'AGV 及 RGV 的其他运动机构动力电机应如何选型？', 'a': '答案'},
]

dataset = []
for item in domain_data:
    dataset.append(cpt_prompt.format(item['q'], item['a']) + EOS_TOKEN)

dataset

['### question：在机械臂的 x、y 轴运动场景中，应选择哪种电机？机械臂的 x、y 轴运动需要高精度位置控制和快速响应能力。\n### answer：答案\n<｜end▁of▁sentence｜>','### question：输送线的动力电机选型应优先考虑什么类型？\n### answer：答案\n<｜end▁of▁sentence｜>','### question：机械臂执行器的运动电机应如何选型？\n### answer：答案\n<｜end▁of▁sentence｜>','### question：RGV 行走的动力电机应选择哪种型号？\n### answer：答案\n<｜end▁of▁sentence｜>','### question：AGV 行走的动力电机应如何选型？\n### answer：答案\n<｜end▁of▁sentence｜>','### question：AGV 及 RGV 的其他运动机构动力电机应如何选型？\n### answer：答案\n<｜end▁of▁sentence｜>']

# Save the dataset
from datasets import Dataset
import pandas as pd

mydata = pd.Series(dataset)
mydata.name = "text"
mydataset = Dataset.from_pandas(pd.DataFrame(mydata))
mydataset.save_to_disk("cleaned_dataset_cpt")

Saving the dataset (1/1 shards): 100%|███| 6/6 [00:00<00:00, 1372.18 examples/s]

Step 4: Continued pretraining

from datasets import load_from_disk
mydataset = load_from_disk("cleaned_dataset_cpt")

from transformers import TrainingArguments
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = mydataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        # Use warmup_ratio and num_train_epochs for longer runs!
        # max_steps = 120,
        # warmup_steps = 10,
        warmup_ratio = 0.1,
        num_train_epochs = 70,
        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 2507,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
    callbacks=[swanlab_callback],
)

Unsloth: Tokenizing ["text"]: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 1413.97 examples/s]

trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 6 | Num Epochs = 70 | Total steps = 70
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 485,212,160/1,500,000,000 (32.35% trained)
Unsloth: Setting lr = 1.00e-05 instead of 5.00e-05 for embed_tokens.
Unsloth: Setting lr = 1.00e-05 instead of 5.00e-05 for lm_head.
swanlab: Tracking run with swanlab version 0.6.4
swanlab: Run data will be saved locally in /home/sam/MyWorkSpace/notebook/炼丹指南/swanlog/run-20250707_115425-93550cbb
swanlab: 👋 Hi CulinaryAlchemist, welcome to swanlab!
swanlab: Syncing run DeepSeek-R1-Distill-Qwen-1.5B-SFT to the cloud
swanlab: 🏠 View project at https://swanlab.cn/@CulinaryAlchemist/trl_integration
swanlab: 🚀 View run at https://swanlab.cn/@CulinaryAlchemist/trl_integration/runs/pnbjvxicu8gif1hdonqic

Step 5: Instruction fine-tuning

Fine-tuning on chain-of-thought completion tasks

from datasets import load_from_disk
train_dataset = load_from_disk("cleaned_dataset_v4.0.0")

from transformers import TrainingArguments
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        # Use num_train_epochs and warmup_ratio for longer runs!
        # max_steps = 120,
        # warmup_steps = 10,
        warmup_ratio = 0.1,
        num_train_epochs = 5,
        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.00,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
    callbacks=[swanlab_callback],
)

Unsloth: Tokenizing ["text"]: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 674/674 [00:00<00:00, 5038.11 examples/s]

trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 674 | Num Epochs = 5 | Total steps = 425
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 485,212,160/1,500,000,000 (32.35% trained)
Unsloth: Setting lr = 1.00e-05 instead of 5.00e-05 for embed_tokens.
Unsloth: Setting lr = 1.00e-05 instead of 5.00e-05 for lm_head.
Unsloth: Will smartly offload gradients to save VRAM!

Step 6: Test the model in a chat

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Qwen2ForCausalLM(
      (model): Qwen2Model(
        (embed_tokens): ModulesToSaveWrapper(
          (original_module): Embedding(151936, 1536)
          (modules_to_save): ModuleDict(
            (default): Embedding(151936, 1536)
          )
        )
        (layers): ModuleList(
          (0-27): 28 x Qwen2DecoderLayer(
            (self_attn): Qwen2Attention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=1536, out_features=1536, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=1536, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=1536, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=1536, out_features=256, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=1536, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=256, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (v_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=1536, out_features=256, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=1536, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=256, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (o_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=1536, out_features=1536, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=1536, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=1536, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (rotary_emb): LlamaRotaryEmbedding()
            )
            (mlp): Qwen2MLP(
              (gate_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=1536, out_features=8960, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=1536, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=8960, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (up_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=1536, out_features=8960, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=1536, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=8960, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (down_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=8960, out_features=1536, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=8960, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=1536, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (act_fn): SiLU()
            )
            (input_layernorm): Qwen2RMSNorm((1536,), eps=1e-06)
            (post_attention_layernorm): Qwen2RMSNorm((1536,), eps=1e-06)
          )
        )
        (norm): Qwen2RMSNorm((1536,), eps=1e-06)
        (rotary_emb): LlamaRotaryEmbedding()
      )
      (lm_head): ModulesToSaveWrapper(
        (original_module): Linear(in_features=1536, out_features=151936, bias=False)
        (modules_to_save): ModuleDict(
          (default): Linear(in_features=1536, out_features=151936, bias=False)
        )
      )
    )
  )
)
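The adapter shapes in the printout above are enough to work out how many LoRA parameters were injected. The following is a minimal sketch of that arithmetic, assuming the shapes and the rank of 16 are read correctly from the printout; the names `PROJECTION_SHAPES` and `lora_params` are illustrative and not part of any library. It counts only the `lora_A`/`lora_B` matrices and leaves out the full copies of `embed_tokens` and `lm_head` held under `modules_to_save`.

```python
# (in_features, out_features) of each adapted projection, as read from the printout.
PROJECTION_SHAPES = {
    "q_proj": (1536, 1536),
    "k_proj": (1536, 256),
    "v_proj": (1536, 256),
    "o_proj": (1536, 1536),
    "gate_proj": (1536, 8960),
    "up_proj": (1536, 8960),
    "down_proj": (8960, 1536),
}
RANK = 16        # out_features of lora_A / in_features of lora_B
NUM_LAYERS = 28  # "(0-27): 28 x Qwen2DecoderLayer"


def lora_params(in_features, out_features, rank):
    """lora_A holds in_features * rank weights, lora_B holds rank * out_features (no bias)."""
    return in_features * rank + rank * out_features


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTION_SHAPES.values())
total = per_layer * NUM_LAYERS
print(per_layer, total)  # prints 659456 18464768
```

The total is small relative to the roughly 1.5B base weights, which is why the adapter-only part of the model is cheap to train.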

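def test_model(question, temperature=0.6, top_p=0.95):
    # Prompt template, kept in Chinese so that it matches the template used for training.
    # Gloss: "Below is a task description paired with an input that gives more context.
    # Write an appropriate response. Think carefully and reason step by step first.
    # Instruction: you are an engineering expert in electrical system analysis,
    # mechanical dynamics and motion control planning; answer the following
    # technical question in the electromechanical motion domain."
    train_prompt_style = """以下是一个任务说明，配有提供更多背景信息的输入。
请写出一个恰当的回答来完成该任务。
在回答之前，请仔细思考问题，并按步骤进行推理，确保回答逻辑清晰且准确。

### Instruction:
您是一位具有高级电气系统分析、机械动力学和运动控制规划知识的工程专家。
请回答以下电气机械运动领域的技术问题。

### Question:
{}

### Response:
<think>{}
"""
    # Fill in the question and leave the reasoning slot empty for the model to complete.
    inputs = tokenizer([train_prompt_style.format(question, '')], return_tensors="pt").to("cuda")
    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=max_seq_length,
        temperature=temperature,
        top_p=top_p,
        use_cache=False,
    )
    response = tokenizer.batch_decode(outputs)
    # print(response)
    # Print only the part after the "### Response:" marker.
    print(response[0].split("### Response:")[1])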

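# Question: "Which motor model should be chosen to drive RGV travel?"
test_model("RGV 行走的动力电机应选择哪种型号？", temperature = 0.5, top_p = 0.75) # temperature uses the recommended value; top_p controls diversity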

The following generation flags are not valid and may be ignored: ['cache_implementation']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
<think>
(reasoning process)
</think>
(answer)
<｜end▁of▁sentence｜>
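Because the output interleaves the reasoning and the final answer, it is often convenient to separate the two before evaluating results. A minimal sketch follows, assuming the output follows the `<think> ... </think>` layout shown above; `split_reasoning` is a hypothetical helper, not part of Unsloth or transformers.

```python
EOS_TOKEN = "<｜end▁of▁sentence｜>"  # end-of-sequence marker as it appears in the output above


def split_reasoning(generated):
    """Split generated text into (reasoning, answer) around the closing </think> tag."""
    text = generated.replace(EOS_TOKEN, "")
    if "</think>" not in text:
        # No reasoning block found: treat everything as the answer.
        return "", text.strip()
    reasoning, answer = text.split("</think>", 1)
    reasoning = reasoning.replace("<think>", "", 1)
    return reasoning.strip(), answer.strip()


sample = "<think>\nreasoning process\n</think>answer.\n<｜end▁of▁sentence｜>"
reasoning, answer = split_reasoning(sample)
print(reasoning)
print(answer)
```

Inside `test_model`, the same helper could be applied to the text after the `### Response:` marker instead of printing it directly.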

Step 7: Save the fine-tuned model

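# 7. Save the fine-tuned model
new_model_local = "DeepSeekR1-1.5B-finetuned-fp16-lab06"
# Save the adapter weights and the tokenizer
model.save_pretrained(new_model_local)
tokenizer.save_pretrained(new_model_local)
# Merge the adapters into the base weights and save the result in 16-bit precision
model.save_pretrained_merged(new_model_local, tokenizer, save_method = "merged_16bit",)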

Appendix: Notes on key parameters (work in progress)

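Parameter	Description
dtype	Numeric precision of the model weights
load_in_4bit	Load the model with 4-bit quantization
max_seq_length	Maximum sequence length in tokens. In this article it is also passed as max_new_tokens, so it caps the output length and avoids redundant text. Examples: tweet generation, 128 tokens (fits platform limits); report summaries, 512 tokens (covers the key points).
temperature	A low temperature (0.1-0.5) gives more deterministic output; a high temperature (0.7-1.0) gives more random output. Examples: medical diagnosis answers, 0.2 (rigor required); poetry generation, 0.9 (encourages creativity).
top_p	Dynamically keeps the smallest set of highest-probability tokens whose cumulative probability reaches p (for example, 0.9 keeps the tokens covering the top 90% of probability mass), controlling diversity by limiting the candidate set. The smaller top_p is, the fewer candidates remain and the more deterministic and concentrated on high-probability tokens the output becomes. The larger top_p is, the more candidates remain and the more diverse the output becomes, with low-probability but interesting tokens more likely to appear. Examples: code generation, 0.7 (balances diversity and accuracy); advertising copy, 0.95 (high diversity, allows some creative drift).
top_k	Limits the size of the candidate set considered for each new token: the model samples only from the k highest-probability tokens. k is a fixed number; with top_k set to 20, the model always chooses among the 20 most likely tokens regardless of context. Examples: fact checking, k=3 (only the most relevant evidence is needed); research surveys, k=10 (broad coverage of the literature).
learning_rate	Learning rate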
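The interaction of temperature, top_k and top_p described in the table can be illustrated on a toy distribution. The sketch below is a simplified pure-Python illustration, not the implementation used by transformers (which operates on logit tensors); the function names are made up for this example.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=0, top_p=1.0):
    """Apply top-k, then top-p, and return {token_index: renormalized probability}."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]  # keep only the k most likely tokens
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:  # stop once the kept mass reaches p
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


probs = [0.5, 0.3, 0.15, 0.05]
print(filter_top_k_top_p(probs, top_p=0.75))  # keeps tokens 0 and 1
print(filter_top_k_top_p(probs, top_k=3))     # keeps tokens 0, 1 and 2
print(softmax_with_temperature([2.0, 1.0, 0.0], temperature=0.5))
```

With top_p=0.75 only the first two tokens survive because their combined probability (0.8) already exceeds the threshold, which matches the "smaller top_p, fewer candidates" behavior described in the table.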

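Appendix: Lessons from experiments (work in progress)

1. Controlling the "heat" of training

Start by warming the cauldron over a gentle fire (learning rate 5e-5). Once the medicine starts to release (the loss drops), switch to a strong fire (increase the batch size). After seven rounds of refining (7 epochs), cool down to set the shape (FP16 conversion), finally yielding the nine-turn elixir (a high-precision model).

Start with a small number of epochs (1, 3, 5, 7), observe how training converges (for example, the loss stabilizing around 0.5), then adjust iteratively.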

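2. Loss does not decrease during training

Model. A small model has limited capacity to fit the data, so the loss looks much the same whether or not training has converged.
Learning rate. It is typically tuned between 2e-5 and 2e-4. A higher learning rate learns faster but can lead to overfitting and forgetting of the model's existing capabilities; a lower learning rate learns in finer steps but takes longer to train.
Batch size. If it is too small, the loss fluctuates heavily and training struggles to converge; if it is too large, gradient averaging slows convergence in the early phase.
Dataset. (1) The data is not shuffled, so the model picks up bias during learning; (2) there is too much noise, so the model struggles to learn useful information; (3) there are too few samples, so the data carries too little information for the model to learn the underlying patterns.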
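When diagnosing a flat loss it helps to know what learning rate is actually in effect at each step, since warmup and decay mean it is rarely the configured peak value. The sketch below reproduces the usual linear warmup plus linear decay rule in plain Python, using the values from Step 2 (learning_rate = 2e-4, warmup_steps = 5, max_steps = 30) as defaults; `linear_schedule_lr` is an illustrative name, and the exact values inside the trainer may differ slightly.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, max_steps=30):
    """Learning rate at a given optimizer step under linear warmup then linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at max_steps.
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)


for step in (0, 1, 5, 15, 30):
    print(step, linear_schedule_lr(step))
```

With only 30 steps, the learning rate spends most of the run well below its peak, which is worth keeping in mind before concluding that the peak value itself is too low.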

3. Tuning inference parameters

Adjust temperature and top_p according to the observed inference quality.

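4. Balancing memory and efficiency

Keep the batch size small (1-2 is common)

An overly large batch size forces Unsloth to offload activations frequently, which slows training down rather than speeding it up.

On a single GPU, a small batch combined with gradient accumulation (gradient_accumulation_steps) is usually the better choice.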
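The reason a small batch with gradient accumulation can stand in for a large batch is that averaging the gradients of equally sized micro-batches gives the same result as one gradient over the combined batch. A minimal sketch on a toy one-parameter regression, with made-up data, shows the equivalence; the sizes mirror the per_device_train_batch_size = 2 and gradient_accumulation_steps = 4 settings used earlier.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Toy data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

# One gradient over the full batch of 8 samples.
full_batch_grad = grad_mse(w, xs, ys)

# Four micro-batches of 2 samples, each gradient scaled by 1 / accumulation_steps.
micro_batch_size = 2
accumulation_steps = 4
accumulated = 0.0
for k in range(accumulation_steps):
    start = k * micro_batch_size
    end = start + micro_batch_size
    accumulated += grad_mse(w, xs[start:end], ys[start:end]) / accumulation_steps

effective_batch_size = micro_batch_size * accumulation_steps
print(effective_batch_size, full_batch_grad, accumulated)
```

Only one micro-batch of activations has to be held in memory at a time, which is where the memory saving comes from.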

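Debug in stages

First get the pipeline running with a small model and a short sequence length, and watch GPU load and logs to confirm when offloading happens and how much GPU memory is used.

Once a stable training configuration is found, scale up gradually to a larger model or longer sequences.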

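Make full use of the paged optimizer

For full-parameter fine-tuning, the optimizer state takes a substantial amount of memory; paging it out to the CPU can significantly reduce peak GPU memory usage.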
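A rough estimate makes the size of the optimizer state concrete. The sketch below assumes an Adam-style optimizer that keeps two state values per trainable parameter and a round figure of 1.5 billion parameters; it ignores the extra bookkeeping that 8-bit optimizers store, so the numbers are approximate. In Hugging Face training arguments, paged variants are selected with values such as optim = "paged_adamw_8bit".

```python
def adam_state_gib(num_params, bytes_per_state):
    """Memory for Adam's two per-parameter states (first and second moment), in GiB."""
    return 2 * num_params * bytes_per_state / 1024**3


num_params = 1_500_000_000  # assumed round figure for a 1.5B-parameter model

fp32_states = adam_state_gib(num_params, 4)  # 32-bit optimizer states
int8_states = adam_state_gib(num_params, 1)  # 8-bit optimizer states

print(round(fp32_states, 2), round(int8_states, 2))  # prints 11.18 2.79
```

Even for this small model the 32-bit states alone exceed the memory of many consumer GPUs, which is why 8-bit and paged optimizers matter for full fine-tuning.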

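5. Folk remedies

Remedy 1: A popular but unverified hypothesis is that code training improves reasoning ability.

Code training benefits tool-assisted mathematical reasoning. In the two-stage training setup, code training alone already significantly improves the ability to solve GSM8K and MATH problems with Python, and the second stage of math training improves performance further. In the one-stage setup, mixing code tokens and math tokens effectively mitigates the catastrophic forgetting seen in two-stage training and yields synergy between coding and tool-assisted mathematical reasoning.

Code training also improves mathematical reasoning without tool use. In the two-stage setup, the initial code training stage already brings a moderate gain. It also makes the subsequent math training more efficient, ultimately leading to the best performance. However, mixing code tokens and math tokens in one-stage training hurts mathematical reasoning without tool use. One conjecture is that the DeepSeek-LLM 1.3B model, given its limited scale, lacks the capacity to fully absorb code and math data at the same time.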