
day 37

ds 2025/8/28 18:29:00

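Saving and loading models

Saving only the model parameters

- Principle: save the model's weight parameters, not the code that defines its structure. Before loading, you must define a model class that matches the one used in training.

- Advantages: the file is small (parameters only) and more portable. Because it holds only tensors keyed by parameter name, it can be loaded into any model class with a matching structure, which you define yourself.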

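import torch

# Save the model parameters
torch.save(model.state_dict(), "model_weights.pth")

# Load the parameters (the model structure must be defined first)
model = MLP()  # instantiate the same model structure used in training
model.load_state_dict(torch.load("model_weights.pth"))
# model.eval()  # switch to inference mode (optional)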
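
As a self-contained check of the save/load steps above, here is a minimal sketch. The MLP class, its layer sizes, and the helper roundtrip_state_dict are hypothetical stand-ins (the article does not show its own MLP), and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import torch
import torch.nn as nn


class MLP(nn.Module):
    """Hypothetical two-layer network standing in for the article's MLP."""

    def __init__(self, in_dim=4, hidden_dim=8, out_dim=3):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


def roundtrip_state_dict(model, path):
    """Save model's parameters to path, then load them into a fresh MLP."""
    torch.save(model.state_dict(), path)
    restored = MLP()  # the structure must match the saved parameters
    restored.load_state_dict(torch.load(path, map_location="cpu"))
    restored.eval()
    return restored


trained = MLP()
with tempfile.TemporaryDirectory() as tmp_dir:
    restored = roundtrip_state_dict(trained, os.path.join(tmp_dir, "model_weights.pth"))

# The restored model should reproduce the original model's outputs.
x = torch.randn(2, 4)
with torch.no_grad():
    same_output = torch.allclose(trained(x), restored(x))
print("outputs match:", same_output)
```

If the class passed to load_state_dict had different layer names or shapes, the call would raise an error instead of loading silently, which is why the structure has to match.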

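Saving the model together with its weights

- Principle: save the model object itself, structure and parameters, in one file.

- Advantage: no need to instantiate the model class before loading.

- Disadvantages: the file is larger, and loading depends on the training-time code environment. The model is pickled, so the original class definitions must still be importable; custom layers are a common cause of load errors.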

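# Save the entire model
torch.save(model, "full_model.pth")

# Load the model (no need to instantiate the class first, but the environment must match)
# PyTorch 2.6+ defaults to weights_only=True, which rejects pickled model objects,
# so weights_only=False is needed here. Only load files you trust.
model = torch.load("full_model.pth", weights_only=False)
model.eval()  # switch to inference mode (optional)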
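
The whole-model approach can be tried end to end with the sketch below. It is an illustration under assumptions: nn.Sequential replaces the article's model so that the pickled class is importable from torch itself, and the layer sizes are arbitrary.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Assumption: a built-in container is used instead of a custom class,
# so unpickling does not depend on any user-defined code.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "full_model.pth")
    torch.save(model, path)  # pickles the whole module object
    # weights_only=False allows arbitrary pickled objects; use it only on trusted files.
    loaded = torch.load(path, weights_only=False)

loaded.eval()
print(type(loaded).__name__)
```

With a custom class, the same load would fail in a process where that class cannot be imported, which is the environment dependence listed among the disadvantages.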

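Saving the training state (resuming from a checkpoint)

- Principle: save the complete training state, including model parameters, optimizer state (learning rate, momentum), the current epoch, and the loss, so that training can continue after an interruption.

- When to use: long-running training jobs (for example distributed training, or when compute may be interrupted).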

# Save the training state
checkpoint = {
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "epoch": epoch,
    "loss": best_loss,
}
torch.save(checkpoint, "checkpoint.pth")

# Load and resume training
model = MLP()
optimizer = torch.optim.Adam(model.parameters())
checkpoint = torch.load("checkpoint.pth")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1  # start from the next epoch
best_loss = checkpoint["loss"]

# Continue the training loop
for epoch in range(start_epoch, num_epochs):
    train(model, optimizer, ...)
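
To see the checkpoint pattern run from start to finish, here is a small sketch on synthetic data. Everything beyond the checkpoint keys is an assumption made for illustration: the helpers make_model_and_optimizer and train_one_epoch, the nn.Linear model, the random data, and the epoch counts. It also stores the last loss rather than a tracked best loss.

```python
import os
import tempfile

import torch
import torch.nn as nn


def make_model_and_optimizer():
    # Hypothetical tiny model; the article's MLP would take its place.
    model = nn.Linear(3, 1)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    return model, optimizer


def train_one_epoch(model, optimizer, inputs, targets):
    """Run one full-batch gradient step and return the loss as a float."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
inputs, targets = torch.randn(16, 3), torch.randn(16, 1)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoint.pth")

    # First run: train for 3 epochs, then save as if the job were interrupted.
    model, optimizer = make_model_and_optimizer()
    for epoch in range(3):
        loss = train_one_epoch(model, optimizer, inputs, targets)
    torch.save(
        {
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "epoch": epoch,
            "loss": loss,
        },
        path,
    )

    # Second run: rebuild the objects, then restore their state.
    resumed_model, resumed_optimizer = make_model_and_optimizer()
    checkpoint = torch.load(path, map_location="cpu")
    resumed_model.load_state_dict(checkpoint["model_state_dict"])
    resumed_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    start_epoch = checkpoint["epoch"] + 1

# Continue from the epoch after the last one that completed.
for epoch in range(start_epoch, 5):
    loss = train_one_epoch(resumed_model, resumed_optimizer, inputs, targets)
print("resumed at epoch", start_epoch)
```

Restoring the optimizer matters for Adam in particular, since its per-parameter moment estimates would otherwise restart from zero.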

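Early stopping

- Normal case: training loss and test loss fall together and eventually level off.

- Overfitting: training loss keeps falling, but test loss starts to rise (or stops falling) at some point.

If the monitored validation metric stops improving, training can be terminated early so that the model does not overfit the training set. What is monitored is a validation-set metric. This strategy is called early stopping. In the snippet below, the test-set loss plays that role.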

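# Inside the training loop, after test_loss has been computed
if test_loss.item() < best_test_loss:  # current test loss is lower than the best so far
    best_test_loss = test_loss.item()  # update the best loss
    best_epoch = epoch + 1  # update the best epoch
    counter = 0  # reset the counter
    # Save the best model
    torch.save(model.state_dict(), 'best_model.pth')
else:
    counter += 1
    if counter >= patience:
        print(f"Early stopping triggered at epoch {epoch+1}: test loss has not improved for {patience} epochs.")
        print(f"Best test loss was at epoch {best_epoch}, with a value of {best_test_loss:.4f}")
        early_stopped = True
        break  # terminate the training loop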

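Logic:

- First, initialise a counter, counter.

- Every 200 training epochs, run one check: compare the current loss with the best loss so far.

- If the current loss is lower, record it as the new best, reset the counter, and save the model parameters.

- If the current loss is higher or equal, increment the counter by 1.

- If the counter reaches the maximum tolerated threshold, patience, stop training.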
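
The counter logic in the steps above can be traced without any training at all. The sketch below is a simplified illustration, not the article's training loop: the function early_stopping_epoch and the list of losses are hypothetical, each list entry stands for one check, and the model-saving step is left as a comment.

```python
def early_stopping_epoch(losses, patience):
    """Return (stop_epoch, best_epoch, best_loss), with 1-based epochs.

    stop_epoch is None if patience is never exhausted.
    """
    best_loss = float("inf")
    best_epoch = 0
    counter = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            # Improvement: record it and reset the counter.
            best_loss, best_epoch, counter = loss, epoch, 0
            # A real loop would save the model parameters here.
        else:
            # No improvement (higher or equal): count it.
            counter += 1
            if counter >= patience:
                return epoch, best_epoch, best_loss
    return None, best_epoch, best_loss


test_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.50]
stop_epoch, best_epoch, best_loss = early_stopping_epoch(test_losses, patience=3)
print(stop_epoch, best_epoch, best_loss)  # 6 3 0.6
```

Note that the final value 0.50 is never reached: with patience=3 the loop stops at the sixth check, which shows the trade-off of choosing a small patience.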

@浙大疏锦行
