2.1.3
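import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Load the data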
file_path = 'finance数据集.csv'
data = pd.__________(file_path)
---
data = pd.read_csv(file_path)
# Identify the numeric columns for the box plot
numeric_cols = data.select_dtypes(include=['float64', 'int64']).__________
---
numeric_cols = data.select_dtypes(include=['float64', 'int64']).columns
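Code explanation:
select_dtypes filters columns by data type.
A box plot shows whether the data distribution contains outliers.
To get the numeric column names, use .columns (type col, then press the Tab key to autocomplete).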
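The column selection described above can be tried on a small frame. This is a minimal sketch: the toy data and its column names are hypothetical, and `include="number"` is shown as an alternative to the explicit `['float64', 'int64']` list used in the exercise.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 40, 31],
    "income": [3200.0, 5100.5, 4000.0],
    "city": ["A", "B", "C"],
})

# include="number" matches every numeric dtype (int32, float32, ...),
# whereas ['float64', 'int64'] matches only those two.
numeric_cols = df.select_dtypes(include="number").columns
print(list(numeric_cols))

# A box plot of these columns could then be drawn with
# df[numeric_cols].plot(kind="box"), which requires matplotlib.
```

The non-numeric `city` column is excluded, so only the columns that make sense in a box plot remain.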
# Handle outliers using the IQR
Q1 = data[numeric_cols].__________(0.25)
Q3 = data[numeric_cols].quantile(0.75)
---
Q1 = data[numeric_cols].quantile(0.25)
Code explanation: think of it as a box whose lower and upper edges are Q1 and Q3.
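The exercise does not show how `data_cleaned` is derived from the quartiles. One common approach is sketched here under stated assumptions: the 1.5 * IQR fences and the choice to drop whole rows are assumptions, not taken from the exercise, and the toy data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data with one obvious outlier in "income".
data = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50],
    "income": [30, 32, 34, 36, 38, 500],
})
numeric_cols = data.select_dtypes(include="number").columns

Q1 = data[numeric_cols].quantile(0.25)
Q3 = data[numeric_cols].quantile(0.75)
IQR = Q3 - Q1  # height of the "box"

# Assumption: a value is an outlier if it lies more than 1.5 * IQR
# outside the box; rows containing any such value are dropped.
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
in_range = (data[numeric_cols] >= lower) & (data[numeric_cols] <= upper)
data_cleaned = data[in_range.all(axis=1)]
print(len(data), len(data_cleaned))
```

Clipping values to the fences instead of dropping rows is another option; which one fits depends on how much data can be spared.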
# Check for duplicate rows
duplicates = data_cleaned.duplicated()
num_duplicates = duplicates.sum()
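The duplicate check can be illustrated on a tiny frame. This is a sketch with hypothetical data; the `drop_duplicates` follow-up is an assumption about what one might do next, since the exercise only counts the duplicates.

```python
import pandas as pd

# Hypothetical toy frame; the second and third rows are identical.
data_cleaned = pd.DataFrame({"a": [1, 2, 2, 3], "b": [9, 8, 8, 7]})

duplicates = data_cleaned.duplicated()  # True for each repeat of an earlier row
num_duplicates = duplicates.sum()       # each True counts as 1
print(num_duplicates)

# Possible follow-up (an assumption, not part of the exercise):
# drop the repeats and keep the first occurrence of each row.
deduplicated = data_cleaned.drop_duplicates()
```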
# Normalize the numeric columns with min-max scaling
scaler = MinMaxScaler()
data_cleaned[numeric_cols] = scaler.__________(data_cleaned[numeric_cols])
---
data_cleaned[numeric_cols] = scaler.fit_transform(data_cleaned[numeric_cols])
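Code explanation: fit_transform learns each column's minimum and maximum, then rescales the values to the range [0, 1].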
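To see what the scaling step does, the sketch below compares `MinMaxScaler` with the same formula written by hand. The toy data and the names `scaled` and `manual` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data.
df = pd.DataFrame({"x": [0.0, 5.0, 10.0], "y": [100.0, 150.0, 200.0]})

scaler = MinMaxScaler()  # default feature_range is (0, 1)
# fit_transform returns a NumPy array, so the column names are restored here.
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# The same result computed by hand: (value - min) / (max - min), per column.
manual = (df - df.min()) / (df.max() - df.min())
print(scaled)
```

Assigning the result back with `data_cleaned[numeric_cols] = ...`, as the exercise does, keeps the original column names without the extra `DataFrame` wrapper.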
# Set SeriousDlqin2yrs as the target variable
target_variable = '__________'
---
target_variable = 'SeriousDlqin2yrs'
# Define the features and the target
X = data_cleaned.drop(columns=[__________])
y = data_cleaned[__________]
---
X = data_cleaned.drop(columns=[target_variable])
y = data_cleaned[target_variable]
Code explanation: target_variable
usually refers to the target variable (also called the dependent variable, label, or response variable).
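The feature/target separation can be sketched on a small frame. Only the column name `SeriousDlqin2yrs` comes from the exercise; the other columns and values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; only the target column name comes from the exercise.
data_cleaned = pd.DataFrame({
    "age": [0.2, 0.5, 0.9],
    "income": [0.1, 0.4, 0.8],
    "SeriousDlqin2yrs": [0, 1, 0],
})
target_variable = "SeriousDlqin2yrs"

X = data_cleaned.drop(columns=[target_variable])  # every column except the target
y = data_cleaned[target_variable]                 # the target as a Series
print(list(X.columns), y.name)
```

`drop` returns a new frame, so `data_cleaned` itself still contains the target column afterwards.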
# Split the data
X_train, X_test, y_train, y_test = __________(X, y, test_size=0.2, random_state=42)
---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Code explanation: train_test_split splits the dataset into a training set and a test set.
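The effect of `test_size=0.2` is easiest to see with a round number of rows. This is a minimal sketch on hypothetical data with 10 rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, one feature, a binary label.
X = pd.DataFrame({"feature": range(10)})
y = pd.Series([0, 1] * 5, name="label")

# test_size=0.2 reserves 20% of the rows for testing;
# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```

Rows are shuffled before splitting, and `X` and `y` are split with the same row order, so each feature row stays paired with its label.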
# Save the cleaned data to CSV
cleaned_file_path = '2.1.3_cleaned_data.csv'
data_cleaned.to_csv(cleaned_file_path, index=False)
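What `index=False` changes can be checked without touching the disk. This sketch writes to an in-memory buffer instead of a file; the toy data and the names `buffer`, `csv_text`, and `restored` are hypothetical.

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

buffer = io.StringIO()          # in-memory stand-in for a file on disk
df.to_csv(buffer, index=False)  # index=False omits the row-index column
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back gives the same columns, with no extra index column.
restored = pd.read_csv(io.StringIO(csv_text))
```

Without `index=False`, the row index is written as an unnamed first column, which reappears as `Unnamed: 0` when the file is read back.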