当前位置：首页 > ds >正文

机器学习——集成学习基础

ds 2025/8/29 2:33:20

一、鸢尾花数据训练模型

1. 使用鸢尾花数据分别训练集成模型：AdaBoost模型，Gradient Boosting模型

2. 对别两个集成模型的准确率以及报告

3. 两个模型的预测结果进行可视化需要进行降维处理，两个图像显示在同一个坐标系中

代码展示：

from sklearn.datasets import load_iris
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifierplt.rcParams["font.sans-serif"] = ["SimHei"]
plt.rcParams["axes.unicode_minus"] = Falsedf = load_iris()X = df.data
y = df.targetx_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)adaboost = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),n_estimators=100,learning_rate=1.0,random_state=42
)gradient = GradientBoostingClassifier(n_estimators=100,learning_rate=1.0,max_depth=3,random_state=42
)pca = PCA(n_components=2)pca_x = pca.fit_transform(X)x_train_pca,x_test_pca,y_train_pca,y_test_pca = train_test_split(pca_x,y,test_size=0.3,random_state=42)rf_pca1 = AdaBoostClassifier(n_estimators=100,random_state=42)
rf_pca2 = GradientBoostingClassifier(n_estimators=100,random_state=42)
rf_pca1.fit(x_train_pca,y_train_pca)
rf_pca2.fit(x_train_pca,y_train_pca)x_min,x_max = pca_x[:,0].min()-1,pca_x[:,0].max()+1
y_min,y_max = pca_x[:,1].min()-1,pca_x[:,1].max()+1xx,yy = np.meshgrid(np.arange(x_min,x_max,0.01),np.arange(y_min,y_max,0.01))z1 = rf_pca1.predict(np.c_[xx.ravel(),yy.ravel()])
z2 = rf_pca2.predict(np.c_[xx.ravel(),yy.ravel()])plt.figure(figsize=(14,6))plt.subplot(121)
z1 = z1.reshape(xx.shape)
plt.contourf(xx,yy,z1,alpha=0.5)plt.scatter(pca_x[:,0],pca_x[:,1],s=20,c=y,edgecolors="k")plt.title("AdaBoost决策边界")
plt.xlabel("主要成分1")
plt.ylabel("主要成分2")plt.subplot(122)
z2 = z2.reshape(xx.shape)
plt.contourf(xx,yy,z2,alpha=0.5)plt.scatter(pca_x[:,0],pca_x[:,1],s=20,c=y,edgecolors="k")plt.title("Gradient Boosting决策边界")
plt.xlabel("主要成分1")
plt.ylabel("主要成分2")plt.tight_layout()
plt.show()

结果展示：

二、随机森林鸢尾花分类

基础应用 - 鸢尾花分类

‌任务目标‌：

使用随机森林对鸢尾花数据集进行分类，并分析特征重要性

‌数据集‌：

sklearn.datasets.load_iris()

‌要求步骤‌：

加载鸢尾花数据集并划分训练集/测试集(70%/30%)
创建随机森林分类器(设置n_estimators=100, max_depth=3)
训练模型并在测试集上评估准确率
输出分类报告和混淆矩阵
可视化特征重要性
(选做)尝试调整n_estimators和max_depth观察准确率变化

代码展示：

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as pltplt.rcParams["font.sans-serif"] = ["SimHei"]
plt.rcParams["axes.unicode_minus"] = Falseiris = load_iris()X = iris.data
y = iris.targetx_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)random = RandomForestClassifier(n_estimators=100,max_depth=3
)random.fit(x_train,y_train)y_pred = random.predict(x_test)
print("Random的准确率：",accuracy_score(y_test,y_pred))print("分类报告：")
print(classification_report(y_test,y_pred,target_names=iris.target_names))cm = confusion_matrix(y_test,y_pred)disp = ConfusionMatrixDisplay(confusion_matrix=cm,display_labels=iris.target_names)
disp.plot(cmap=plt.cm.Blues)
plt.title("随机森林分类器的混淆矩阵")
plt.show()print("特征的重要性：")
for i,j in zip(iris.feature_names,random.feature_importances_):print(f"{i}:{j:.4f}")pca = PCA(n_components=2)
pca_x = pca.fit_transform(X)x_train_pca,x_test_pca,y_train_pca,y_test_pca = train_test_split(pca_x,y,test_size=0.3,random_state=42)rf_pca = RandomForestClassifier(n_estimators=100,random_state=42)
rf_pca.fit(x_train_pca,y_train_pca)x_min,x_max = pca_x[:,0].min()-1,pca_x[:,0].max()+1
y_min,y_max = pca_x[:,1].min()-1,pca_x[:,1].max()+1xx,yy = np.meshgrid(np.arange(x_min,x_max,0.01),np.arange(y_min,y_max,0.01))z = rf_pca.predict(np.c_[xx.ravel(),yy.ravel()])plt.figure(figsize=(10,6))
z = z.reshape(xx.shape)
plt.contourf(xx,yy,z,alpha=0.5)plt.scatter(pca_x[:,0],pca_x[:,1],s=20,c=y,edgecolors="k")plt.title("随机森林决策边界的可视化")
plt.xlabel("主要成分1")
plt.ylabel("主要成分2")
plt.show()

结果展示：

Random的准确率： 1.0
分类报告：precision    recall  f1-score   supportsetosa       1.00      1.00      1.00        19versicolor       1.00      1.00      1.00        13virginica       1.00      1.00      1.00        13accuracy                           1.00        45macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45特征的重要性：
sepal length (cm):0.0745
sepal width (cm):0.0142
petal length (cm):0.4432
petal width (cm):0.4681

三、信用卡欺诈检测

信用卡欺诈检测

‌任务目标‌：

使用随机森林处理类别不平衡的信用卡欺诈检测问题

‌数据集‌：

Kaggle信用卡欺诈数据集【Credit Card Fraud Detection】

‌要求步骤‌：

加载信用卡交易数据(注意数据高度不平衡)
标准化Amount特征，Time特征可删除
使用分层抽样划分训练集/测试集
创建随机森林分类器(class_weight='balanced')
评估模型(使用精确率、召回率、F1、AUC-ROC)
(选做)使用SMOTE过采样处理类别不平衡

代码展示：

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as npdf = pd.read_csv("./data/creditcard.csv",encoding="utf-8")df = df.drop("Time",axis=1)transfer = StandardScaler()
df["Amount"] = transfer.fit_transform(df[["Amount"]])X = df.drop("Class",axis=1)
y = df["Class"]x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42,stratify=y)rf = RandomForestClassifier(n_estimators=100,class_weight='balanced',random_state=42
)rf.fit(x_train,y_train)y_pred = rf.predict(x_test)
y_prob = rf.predict_proba(x_test)[:,1]print("分类报告：")
print(classification_report(y_test,y_pred,target_names=["0","1"]))cm = confusion_matrix(y_test,y_pred)
tn,fp,fn,tp = cm.ravel()
print("精确率：",tp/(tp+fp))
print("召回率：",tp/(tp+fn))
print("F1-score",(2*tp)/(2*tp+fn+fp))
print("AUC指标：",roc_auc_score(y_test,y_pred))fpr,tpr,_ = roc_curve(y_test,y_prob)plt.figure(figsize=(8,9))
plt.plot(fpr,tpr,color="b")
plt.plot([0,1],[0,1],color="r",linestyle="--")
plt.xlabel("fpr")
plt.ylabel("tpr")
plt.grid()
plt.show()

结果展示：

分类报告：precision    recall  f1-score   support0       1.00      1.00      1.00     852951       0.97      0.70      0.82       148accuracy                           1.00     85443macro avg       0.99      0.85      0.91     85443
weighted avg       1.00      1.00      1.00     85443精确率： 0.9719626168224299
召回率： 0.7027027027027027
F1-score 0.8156862745098039
AUC指标： 0.8513337653263792