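DAY 23: Pipelines

Contents

      • DAY 23: Pipelines
        • 1. Transformers and estimators
        • 2. Pipeline engineering
        • 3. The ColumnTransformer and Pipeline classes
        • Homework: put all the logic in order and see whether it can be turned into a general-purpose pipeline for any machine learning task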

from catboost import CatBoostClassifier
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.manifold import TSNE
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random as rnd
import seaborn as sns
import time
import umap
import warnings
import xgboost as xgb

warnings.filterwarnings('ignore')
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
# Baseline: preprocess by hand, then train a default random forest.
data = pd.read_csv(r'data.csv')
list_discrete = data.select_dtypes(include=['object']).columns.tolist()

# Ordinal encoding for 'Home Ownership'
home_ownership_mapping = {'Own Home': 1, 'Rent': 2, 'Have Mortgage': 3, 'Home Mortgage': 4}
data['Home Ownership'] = data['Home Ownership'].map(home_ownership_mapping)

# Ordinal encoding for 'Years in current job'
years_in_job_mapping = {'< 1 year': 1, '1 year': 2, '2 years': 3, '3 years': 4, '4 years': 5,
                        '5 years': 6, '6 years': 7, '7 years': 8, '8 years': 9, '9 years': 10,
                        '10+ years': 11}
data['Years in current job'] = data['Years in current job'].map(years_in_job_mapping)

# One-hot encoding for 'Purpose'; the new dummy columns are found by comparing
# against a fresh copy of the raw data, then cast to int.
data = pd.get_dummies(data, columns=['Purpose'])
data2 = pd.read_csv(r'data.csv')
list_new = []
for i in data.columns:
    if i not in data2.columns:
        list_new.append(i)
for i in list_new:
    data[i] = data[i].astype(int)

# Binary encoding for 'Term', renamed to 'Long Term'
term_mapping = {'Short Term': 0, 'Long Term': 1}
data['Term'] = data['Term'].map(term_mapping)
data.rename(columns={'Term': 'Long Term'}, inplace=True)

# Fill missing values in numeric columns with the column median
list_continuous = data.select_dtypes(include=['int64', 'float64']).columns.tolist()
for i in list_continuous:
    median_value = data[i].median()
    data[i] = data[i].fillna(median_value)

# Separate features and label, then split 80/20
X = data.drop(['Credit Default'], axis=1)
Y = data['Credit Default']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

print('Random forest with default parameters (train set -> test set)')
start_time = time.time()
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, Y_train)
rf_pred = rf_model.predict(X_test)
end_time = time.time()
print(f'Training and prediction time: {end_time - start_time:.4f} s')
print('Default random forest: classification report on the test set:')
print(classification_report(Y_test, rf_pred))
print('Default random forest: confusion matrix on the test set:')
print(confusion_matrix(Y_test, rf_pred))

Random forest with default parameters (train set -> test set)
Training and prediction time: 1.0814 s
Default random forest: classification report on the test set:
              precision    recall  f1-score   support

           0       0.76      0.97      0.85      1059
           1       0.78      0.27      0.40       441

    accuracy                           0.76      1500
   macro avg       0.77      0.62      0.63      1500
weighted avg       0.77      0.76      0.72      1500

Default random forest: confusion matrix on the test set:
[[1025   34]
 [ 321  120]]

# Pipeline version: reload the raw data and split BEFORE any preprocessing.
data = pd.read_csv(r'data.csv')
print('Raw data loaded, shape:', data.shape)

Raw data loaded, shape: (7500, 18)

X = data.drop(['Credit Default'], axis=1)
Y = data['Credit Default']
print('Features and labels separated')
print('Shape of features X:', X.shape)
print('Shape of labels Y:', Y.shape)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
print('Train/test split done (before preprocessing)')
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
print('Y_train shape:', Y_train.shape)
print('Y_test shape:', Y_test.shape)

Features and labels separated
Shape of features X: (7500, 17)
Shape of labels Y: (7500,)
Train/test split done (before preprocessing)
X_train shape: (6000, 17)
X_test shape: (1500, 17)
Y_train shape: (6000,)
Y_test shape: (1500,)

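
The split above is done before any preprocessing. A minimal sketch of why that order can matter, using made-up numbers rather than the credit data: an imputer fitted on the training rows alone learns a different fill value than one that has also seen the test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy columns (made up for illustration); np.nan marks a missing value.
train_col = np.array([[1.0], [2.0], [np.nan], [3.0]])
test_col = np.array([[100.0], [np.nan]])

# Fit on the training rows only, then reuse the learned median on the test rows.
imputer = SimpleImputer(strategy='median')
imputer.fit(train_col)
test_filled = imputer.transform(test_col)

# For contrast: fitting on train + test lets the test value 100.0 shift the median.
leaky = SimpleImputer(strategy='median').fit(np.vstack([train_col, test_col]))

print(imputer.statistics_)  # [2.]
print(leaky.statistics_)    # [2.5]
```

Inside a Pipeline this happens automatically: `fit` only ever sees the data passed to it.
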
1. Transformers and estimators
2. Pipeline engineering
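
As a rough illustration of the two concepts in this heading (toy numbers, not the credit dataset): in scikit-learn, a transformer learns statistics in `fit` and applies them in `transform`, while a predictive estimator learns in `fit` and returns labels from `predict`. The arrays below are assumptions made up for the sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy data (made up): one feature, binary label.
X_toy_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy_train = np.array([0, 0, 1, 1])
X_toy_test = np.array([[2.5]])

# Transformer: fit() learns mean and scale, transform() applies them.
scaler = StandardScaler()
X_toy_train_scaled = scaler.fit_transform(X_toy_train)
X_toy_test_scaled = scaler.transform(X_toy_test)  # reuses the training statistics

# Estimator: fit() learns the model, predict() returns labels.
clf = LogisticRegression()
clf.fit(X_toy_train_scaled, y_toy_train)
toy_pred = clf.predict(X_toy_test_scaled)

print(scaler.mean_)        # mean learned from the training data
print(X_toy_test_scaled)   # 2.5 equals the training mean, so it scales to 0
print(toy_pred.shape)
```

A Pipeline chains any number of transformers and ends with one estimator, which is the structure used in the rest of this post.
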
object_cols = X.select_dtypes(include=['object']).columns.tolist()
numeric_cols = X.select_dtypes(exclude=['object']).columns.tolist()

# Ordinal features: impute with the mode, then encode using an explicit category order
ordinal_features = ['Home Ownership', 'Years in current job', 'Term']
ordinal_categories = [
    ['Own Home', 'Rent', 'Have Mortgage', 'Home Mortgage'],
    ['< 1 year', '1 year', '2 years', '3 years', '4 years', '5 years',
     '6 years', '7 years', '8 years', '9 years', '10+ years'],
    ['Short Term', 'Long Term'],
]
ordinal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder(categories=ordinal_categories,
                               handle_unknown='use_encoded_value', unknown_value=-1)),
])
print('Ordinal feature Pipeline defined')

# Nominal features: impute with the mode, then one-hot encode
nominal_features = ['Purpose']
nominal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])
print('Nominal feature Pipeline defined')

# Continuous features: every remaining column; impute with the mode, then standardize
continuous_features = [f for f in X.columns if f not in ordinal_features + nominal_features]
continuous_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('scaler', StandardScaler()),
])
print('Continuous feature Pipeline defined')

Ordinal feature Pipeline defined
Nominal feature Pipeline defined
Continuous feature Pipeline defined

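
To make the `OrdinalEncoder` settings above concrete, here is a small sketch on a made-up column. It reuses the 'Term' category order from this section; the value 'Medium Term' is hypothetical and only serves to show what `unknown_value=-1` does.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column (made up); the category order mirrors the 'Term' feature.
toy_term = pd.DataFrame({'Term': ['Long Term', 'Short Term', 'Long Term']})

encoder = OrdinalEncoder(
    categories=[['Short Term', 'Long Term']],  # position in this list becomes the code
    handle_unknown='use_encoded_value',
    unknown_value=-1,
)
encoded = encoder.fit_transform(toy_term)

# A value that is not listed in `categories` is mapped to unknown_value at transform time.
unseen = encoder.transform(pd.DataFrame({'Term': ['Medium Term']}))

print(encoded.ravel())  # [1. 0. 1.]
print(unseen.ravel())   # [-1.]
```

Note that the codes start at 0, whereas the hand-written mappings in the baseline start at 1 for 'Home Ownership' and 'Years in current job'.
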
3. The ColumnTransformer and Pipeline classes
# Route each group of columns to its own transformer; any unlisted column passes through unchanged
preprocessor = ColumnTransformer(
    transformers=[
        ('ordinal', ordinal_transformer, ordinal_features),
        ('nominal', nominal_transformer, nominal_features),
        ('continuous', continuous_transformer, continuous_features),
    ],
    remainder='passthrough',
)
print('ColumnTransformer (preprocessor) defined')

ColumnTransformer (preprocessor) defined

# Chain the preprocessor and the model into one object
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42)),
])
print('Full Pipeline defined')

Full Pipeline defined

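print('Random forest with default parameters (train set -> test set)')
start_time = time.time()
pipeline.fit(X_train, Y_train)             # fits the preprocessor and the model on the training set
pipeline_pred = pipeline.predict(X_test)   # applies the fitted preprocessing, then predicts
end_time = time.time()
print(f'Training and prediction time: {end_time - start_time:.4f} s')
print('Default random forest: classification report on the test set:')
print(classification_report(Y_test, pipeline_pred))
print('Default random forest: confusion matrix on the test set:')
print(confusion_matrix(Y_test, pipeline_pred))

Random forest with default parameters (train set -> test set)
Training and prediction time: 1.1147 s
Default random forest: classification report on the test set:
              precision    recall  f1-score   support

           0       0.77      0.97      0.86      1059
           1       0.83      0.30      0.44       441

    accuracy                           0.78      1500
   macro avg       0.80      0.64      0.65      1500
weighted avg       0.79      0.78      0.74      1500

Default random forest: confusion matrix on the test set:
[[1031   28]
 [ 308  133]]
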
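
Because the whole workflow is one object, it can be handed to other scikit-learn utilities. The sketch below is not part of the original run; it uses synthetic data from `make_classification` (an assumption, so the scores say nothing about the credit dataset) to show cross-validation refitting the preprocessing inside each fold, and `set_params` reaching a step's parameter through the `step__parameter` naming.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration only).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

demo_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=20, random_state=42)),
])

# Each fold fits the scaler and the model on that fold's training part only.
scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5)
print(scores.mean())

# Parameters of a step are addressed as '<step name>__<parameter>'.
demo_pipeline.set_params(classifier__n_estimators=50)
print(demo_pipeline.named_steps['classifier'].n_estimators)  # 50
```
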
Homework: put all the logic in order and see whether it can be turned into a general-purpose pipeline for any machine learning task
from catboost import CatBoostClassifier
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.manifold import TSNE
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random as rnd
import seaborn as sns
import time
import umap
import warnings
import xgboost as xgb

warnings.filterwarnings('ignore')
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

# 1. Load the raw data
data = pd.read_csv(r'data.csv')
print('Raw data loaded, shape:', data.shape)

# 2. Separate features and label
X = data.drop(['Credit Default'], axis=1)
Y = data['Credit Default']
print('Features and labels separated')
print('Shape of features X:', X.shape)
print('Shape of labels Y:', Y.shape)

# 3. Split before any preprocessing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
print('Train/test split done (before preprocessing)')
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
print('Y_train shape:', Y_train.shape)
print('Y_test shape:', Y_test.shape)

# 4. Define one sub-pipeline per feature type
object_cols = X.select_dtypes(include=['object']).columns.tolist()
numeric_cols = X.select_dtypes(exclude=['object']).columns.tolist()

ordinal_features = ['Home Ownership', 'Years in current job', 'Term']
ordinal_categories = [
    ['Own Home', 'Rent', 'Have Mortgage', 'Home Mortgage'],
    ['< 1 year', '1 year', '2 years', '3 years', '4 years', '5 years',
     '6 years', '7 years', '8 years', '9 years', '10+ years'],
    ['Short Term', 'Long Term'],
]
ordinal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder(categories=ordinal_categories,
                               handle_unknown='use_encoded_value', unknown_value=-1)),
])
print('Ordinal feature Pipeline defined')

nominal_features = ['Purpose']
nominal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])
print('Nominal feature Pipeline defined')

continuous_features = [f for f in X.columns if f not in ordinal_features + nominal_features]
continuous_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('scaler', StandardScaler()),
])
print('Continuous feature Pipeline defined')

# 5. Combine the sub-pipelines with a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('ordinal', ordinal_transformer, ordinal_features),
        ('nominal', nominal_transformer, nominal_features),
        ('continuous', continuous_transformer, continuous_features),
    ],
    remainder='passthrough',
)
print('ColumnTransformer (preprocessor) defined')

# 6. Chain the preprocessor and the model
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42)),
])
print('Full Pipeline defined')

# 7. Fit on the training set, predict on the test set, and evaluate
print('Random forest with default parameters (train set -> test set)')
start_time = time.time()
pipeline.fit(X_train, Y_train)
pipeline_pred = pipeline.predict(X_test)
end_time = time.time()
print(f'Training and prediction time: {end_time - start_time:.4f} s')
print('Default random forest: classification report on the test set:')
print(classification_report(Y_test, pipeline_pred))
print('Default random forest: confusion matrix on the test set:')
print(confusion_matrix(Y_test, pipeline_pred))

Raw data loaded, shape: (7500, 18)
Features and labels separated
Shape of features X: (7500, 17)
Shape of labels Y: (7500,)
Train/test split done (before preprocessing)
X_train shape: (6000, 17)
X_test shape: (1500, 17)
Y_train shape: (6000,)
Y_test shape: (1500,)
Ordinal feature Pipeline defined
Nominal feature Pipeline defined
Continuous feature Pipeline defined
ColumnTransformer (preprocessor) defined
Full Pipeline defined
Random forest with default parameters (train set -> test set)
Training and prediction time: 1.1326 s
Default random forest: classification report on the test set:
              precision    recall  f1-score   support

           0       0.77      0.97      0.86      1059
           1       0.83      0.30      0.44       441

    accuracy                           0.78      1500
   macro avg       0.80      0.64      0.65      1500
weighted avg       0.79      0.78      0.74      1500

Default random forest: confusion matrix on the test set:
[[1031   28]
 [ 308  133]]

@浙大疏锦行

http://www.xdnf.cn/news/12695.html
