
Building a small AI model testing project from scratch (hands-on with pandas + sklearn)

news 2025/9/1 18:44:36

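This article starts from a user behavior log and walks through feature processing, model training, evaluation, visualization of the predictions, and test verification. It is a good practice project for AI test engineers and data analysis beginners.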

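💡 Project background

In an AI project, algorithm engineers design the model, while AI test engineers verify that its predictions are accurate, fair, and robust. For example:

In which countries are predictions most error-prone?
During which hours does the model tend to misjudge?
Does prediction bias affect decisions?

To that end, I built a small project that covers this verification loop end to end.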

📁 Project structure

ai-model-testing-demo/
├── user_events.csv          # simulated behavior data
├── model_test.py            # main logic (training, prediction, analysis)
└── model_predictions.csv    # model output

Sample rows from user_events.csv:

user_id,product_id,action,timestamp,country
1006,P001,click,2025-06-24 09:57,CN
1006,P003,view,2025-06-24 10:15,IN
1010,P001,click,2025-06-24 10:28,US
1007,P001,click,2025-06-24 12:44,IN
1007,P002,click,2025-06-24 10:18,CN
1001,P001,view,2025-06-24 12:53,US
1003,P002,buy,2025-06-24 10:42,IN
1003,P001,click,2025-06-24 11:03,CN
1005,P003,view,2025-06-24 11:11,CN

model_test.py:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import matplotlib
import matplotlib.pyplot as plt

# SimHei is only needed when chart labels contain Chinese characters
matplotlib.rcParams['font.family'] = 'SimHei'
matplotlib.rcParams['axes.unicode_minus'] = False

# -------------------------------
# 1. Load and preprocess the data
# -------------------------------
df = pd.read_csv("user_events.csv")  # replace with the path to your own data file
df["timestamp"] = pd.to_datetime(df["timestamp"])  # parse the timestamp strings
df["hour"] = df["timestamp"].dt.hour  # use the hour of day as a feature

# -------------------------------
# 2. Build the target variable (label)
# -------------------------------
df["label"] = (df["action"] == "buy").astype(int)  # 'buy' is 1, everything else is 0

# -------------------------------
# 3. Feature selection and encoding
# -------------------------------
features = ["product_id", "country", "hour"]
X = df[features]
y = df["label"]

# One-hot encode the categorical columns (the numeric hour column is left as-is)
X_encoded = pd.get_dummies(X)

# -------------------------------
# 4. Split into training and test sets
# -------------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=42
)

# -------------------------------
# 5. Train the model
# -------------------------------
model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)

# -------------------------------
# 6. Evaluate the model
# -------------------------------
y_pred = model.predict(X_test)
print("\n📋 Classification report:")
# labels=[0, 1] keeps the report valid even if the test split contains only one class
print(classification_report(y_test, y_pred, labels=[0, 1],
                            target_names=["Not Buy", "Buy"], zero_division=0))

# -------------------------------
# 7. Build the prediction DataFrame
# -------------------------------
df_pred = X_test.copy()
df_pred["y_true"] = y_test.values
df_pred["y_pred"] = y_pred

# Take the original hour and country values from the unencoded features
# (hour is already in X_test; country was one-hot encoded)
df_pred["hour"] = X.loc[df_pred.index, "hour"]
df_pred["country"] = X.loc[df_pred.index, "country"]

# -------------------------------
# 8. Analyze model behavior by country
# -------------------------------
print("\n🌎 Mean of true and predicted labels per country (actual vs. predicted buy rate):")
print(df_pred.groupby("country")[["y_true", "y_pred"]].mean())

# -------------------------------
# 9. Analyze model behavior by hour
# -------------------------------
hour_buy_rate = df_pred[df_pred["y_pred"] == 1]["hour"].value_counts().sort_index()
print("\n⏰ Number of samples predicted as buy, per hour:")
print(hour_buy_rate)

# Optional: plot the hourly distribution of predicted buys
plt.figure(figsize=(8, 4))
hour_buy_rate.plot(kind="bar", color="skyblue")
plt.title("Predicted purchases per hour")
plt.xlabel("Hour")
plt.ylabel("Count")
plt.grid(axis="y")
plt.tight_layout()
plt.show()

# -------------------------------
# 10. Save the predictions
# -------------------------------
df_pred.to_csv("model_predictions.csv", index=False)
print("\n✅ Predictions saved to model_predictions.csv")

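📊 Dataset description (user_events.csv)

We use simulated user behavior log data with the following fields:

Field	Description
user_id	User ID
product_id	Product ID
action	User action (click/view/buy)
timestamp	Time of the action
country	Country (CN/US/IN)

The goal is to predict whether a user event is a "buy", based on the product, the country, and the hour of the event.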
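Before training anything, a tester may want to check that the input file actually matches the field list above. The following is a minimal sketch of such a check; the function name `validate_events` and the constants are illustrative assumptions, not part of the original project.

```python
import io
import pandas as pd

# Small inline sample in the same format as user_events.csv.
SAMPLE_CSV = """user_id,product_id,action,timestamp,country
1006,P001,click,2025-06-24 09:57,CN
1003,P002,buy,2025-06-24 10:42,IN
1001,P001,view,2025-06-24 12:53,US
"""

REQUIRED_COLUMNS = ["user_id", "product_id", "action", "timestamp", "country"]
ALLOWED_ACTIONS = {"click", "view", "buy"}
ALLOWED_COUNTRIES = {"CN", "US", "IN"}


def validate_events(df):
    """Hypothetical helper: return a list of problems (empty list = data looks sane)."""
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems  # the remaining checks need these columns
    if df[REQUIRED_COLUMNS].isna().any().any():
        problems.append("null values found")
    bad_actions = set(df["action"].dropna()) - ALLOWED_ACTIONS
    if bad_actions:
        problems.append(f"unexpected actions: {sorted(bad_actions)}")
    bad_countries = set(df["country"].dropna()) - ALLOWED_COUNTRIES
    if bad_countries:
        problems.append(f"unexpected countries: {sorted(bad_countries)}")
    # errors="coerce" turns unparseable timestamps into NaT so they can be detected.
    parsed = pd.to_datetime(df["timestamp"], errors="coerce")
    if parsed.isna().any():
        problems.append("unparseable timestamps")
    return problems


events = pd.read_csv(io.StringIO(SAMPLE_CSV))
print(validate_events(events))
```

A check like this catches a misspelled header (for example `ser_id` instead of `user_id`) before it turns into a confusing error further down the pipeline.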

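🧪 The model testing workflow in detail

✅ 1. Data preprocessing and feature engineering

Extract the hour feature from the timestamp
One-hot encode product_id and country
Build the label column: buy is 1, everything else is 0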

df["label"] = (df["action"] == "buy").astype(int)
df["hour"] = pd.to_datetime(df["timestamp"]).dt.hour
X = pd.get_dummies(df[["product_id", "country", "hour"]])
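One practical pitfall with `pd.get_dummies` is that the resulting columns depend on which categories happen to appear in the data. The sketch below, using made-up rows, shows one way to align new data with the training columns via `reindex`; this alignment step is an addition for illustration and is not part of the original script.

```python
import pandas as pd

# Made-up training rows covering all products and countries.
train = pd.DataFrame({
    "product_id": ["P001", "P002", "P003"],
    "country": ["CN", "IN", "US"],
    "hour": [9, 10, 12],
})
# New data that happens to contain only one product and one country.
new = pd.DataFrame({"product_id": ["P001"], "country": ["CN"], "hour": [11]})

X_train = pd.get_dummies(train)
X_new = pd.get_dummies(new)  # only 3 columns: hour, product_id_P001, country_CN

# Reindex to the training columns; categories absent from `new` become all-zero columns.
X_new_aligned = X_new.reindex(columns=X_train.columns, fill_value=0)

print(list(X_train.columns))
print(X_new_aligned.shape)
```

Without the alignment, a model trained on seven columns would reject the three-column frame at prediction time.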

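✅ 2. Model training and evaluation (decision tree)

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hold out 20% of the rows for testing, as in model_test.py
X_train, X_test, y_train, y_test = train_test_split(X, df["label"], test_size=0.2, random_state=42)

model = DecisionTreeClassifier(max_depth=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("\n📋 Classification report:")
# labels=[0, 1] keeps the report valid even if the test split contains only one class
print(classification_report(y_test, y_pred, labels=[0, 1], target_names=["Not Buy", "Buy"], zero_division=0))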

Sample output:

[Figure: classification report printed by the script]
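Because "buy" events are usually much rarer than clicks and views, a plain random split can leave the test set with very few or no "buy" rows, which makes the report hard to interpret. One possible remedy is a stratified split. The sketch below uses made-up labels and the `stratify` parameter of `train_test_split`; the original script does not use it.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 4 "buy" rows out of 20.
y = pd.Series([1] * 4 + [0] * 16)
X = pd.DataFrame({"hour": range(20)})

# stratify=y keeps the buy / not-buy ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print("train buy rate:", y_tr.mean())
print("test buy rate:", y_te.mean())
```

Stratification needs at least two rows of each class, so it only helps once the dataset is larger than a handful of events.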

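✅ 3. Model behavior analysis (by country and by hour)

Next, we check whether the model's behavior is biased along certain dimensions (country / time).

🇨🇳 Analysis by country:

# Take country from the original df, since X no longer has it after one-hot encoding
df_pred["country"] = df.loc[df_pred.index, "country"]
print(df_pred.groupby("country")[["y_true", "y_pred"]].mean())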

Sample output:

[Figure: per-country means of y_true and y_pred]

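In this output, 42% of the CN test samples are actual "buy" events, while the model labels 50% of them as "buy". Note that y_pred holds hard 0/1 labels, so its mean is the share of samples predicted as "buy", not a probability.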
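Comparing the two means per country shows whether the model over- or under-predicts purchases in a group, but it can hide individual errors: a group can have matching rates and still be wrong on half of its rows. A rough sketch on made-up predictions that adds per-country accuracy next to the rate gap (the column names `rate_gap` and `accuracy` are illustrative):

```python
import pandas as pd

# Made-up predictions; in the project these columns come from df_pred.
df_pred = pd.DataFrame({
    "country": ["CN", "CN", "CN", "CN", "US", "US", "IN", "IN"],
    "y_true":  [1, 0, 1, 0, 0, 0, 1, 0],
    "y_pred":  [1, 1, 0, 0, 1, 0, 1, 0],
})

summary = df_pred.groupby("country").agg(
    n=("y_true", "size"),
    actual_buy_rate=("y_true", "mean"),
    predicted_buy_rate=("y_pred", "mean"),
)
# Positive gap: the model predicts more buys than actually happened.
summary["rate_gap"] = summary["predicted_buy_rate"] - summary["actual_buy_rate"]

# Share of rows per country where the prediction matches the true label.
correct = df_pred["y_true"] == df_pred["y_pred"]
summary["accuracy"] = correct.groupby(df_pred["country"]).mean()

print(summary)
```

In this toy data CN has a rate gap of 0 but an accuracy of only 0.5, which is exactly the case that the mean comparison alone would miss.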

⏰ Analysis by hour:

df_pred["hour"] = X.loc[df_pred.index, "hour"]
hour_buy_rate = df_pred[df_pred["y_pred"] == 1]["hour"].value_counts().sort_index()
print(hour_buy_rate)

Output:

[Figure: number of samples predicted as buy, per hour]
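Raw counts of predicted buys depend on how many samples fall into each hour, so a busy hour can look "buy-heavy" simply because it has more traffic. As a possible refinement, the sketch below (made-up data) divides by the number of samples per hour to get a predicted buy rate:

```python
import pandas as pd

# Made-up predictions: hour 9 has twice as many samples as the other hours.
df_pred = pd.DataFrame({
    "hour":   [9, 9, 9, 9, 10, 10, 12, 12],
    "y_pred": [1, 1, 0, 0, 1, 0, 1, 1],
})

# Named aggregation: number of samples and number of predicted buys per hour.
by_hour = df_pred.groupby("hour")["y_pred"].agg(total="size", predicted_buy="sum")
by_hour["predicted_buy_rate"] = by_hour["predicted_buy"] / by_hour["total"]

print(by_hour)
```

Here hours 9 and 12 both have two predicted buys, yet the rate is 0.5 at hour 9 and 1.0 at hour 12.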

📊 Visualizing the prediction distribution (matplotlib)

import matplotlib.pyplot as plt
plt.rcParams["font.family"] = "SimHei"  # only needed when chart labels contain Chinese characters

hour_buy_rate.plot(kind="bar", color="skyblue")
plt.title("Predicted purchases per hour")
plt.xlabel("Hour")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

Output:
[Figure: bar chart of predicted purchases per hour]

📁 Saving the results
The final predictions are written to a CSV file:

df_pred.to_csv("model_predictions.csv", index=False)

Saved file:

hour,product_id_P001,product_id_P002,product_id_P003,country_CN,country_IN,country_US,y_true,y_pred,country
9,False,True,False,False,True,False,0,1,IN
12,False,False,True,False,False,True,0,1,US
10,False,True,False,False,False,True,0,0,US
8,False,True,False,False,False,True,0,0,US
9,True,False,False,False,False,True,0,1,US
12,True,False,False,True,False,False,0,1,CN
10,True,False,False,False,False,True,0,1,US
9,False,False,True,True,False,False,0,0,CN
10,True,False,False,True,False,False,1,1,CN
12,False,False,True,False,False,True,0,1,US
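The saved file can itself be used as a test artifact. As a rough sketch, the snippet below takes the y_true, y_pred, and country columns of the ten rows shown above, runs a few sanity checks, and derives a confusion matrix with scikit-learn; the inline CSV stands in for reading model_predictions.csv from disk.

```python
import io
import pandas as pd
from sklearn.metrics import confusion_matrix

# y_true / y_pred / country copied from the ten saved rows shown above.
SAVED = """y_true,y_pred,country
0,1,IN
0,1,US
0,0,US
0,0,US
0,1,US
0,1,CN
0,1,US
0,0,CN
1,1,CN
0,1,US
"""
saved = pd.read_csv(io.StringIO(SAVED))

# Sanity checks on the output file: binary labels only, no missing values.
assert set(saved["y_true"]).issubset({0, 1})
assert set(saved["y_pred"]).issubset({0, 1})
assert not saved.isna().any().any()

# labels=[0, 1] fixes the matrix layout to [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(
    saved["y_true"], saved["y_pred"], labels=[0, 1]
).ravel()

precision = tp / (tp + fp)  # share of predicted buys that were real buys
recall = tp / (tp + fn)     # share of real buys that the model found

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For these ten rows the model finds the single real purchase but also flags six non-purchases as "buy", which is the kind of over-prediction a tester would want to report.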