
Week R9: Alzheimer's Disease Diagnosis (Optimized Feature Selection Version)

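Contents

  • 1. Importing the Data
  • 2. Data Processing
    • 2.1 Proportion of Diagnosed Patients
    • 2.2 Correlation Analysis
    • 2.3 Age and Diagnosis
  • 3. Feature Selection
  • 4. Building the Dataset
    • 4.1 Splitting and Standardizing the Dataset
    • 4.2 Building the Data Loaders
  • 5. Building the Model
  • 6. Training Setup
    • 6.1 Training Function
    • 6.2 Test Function
    • 6.3 Hyperparameters
  • 7. Model Training
  • 8. Model Evaluation
    • 8.1 Result Plots
    • 8.2 Confusion Matrix
  • 9. Summary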

  • 🍨 This post is a learning log from the 🔗365天深度学习训练营 (365-Day Deep Learning Training Camp)
  • 🍖 Original author: K同学啊

1. Importing the Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

plt.rcParams["font.sans-serif"] = ["Microsoft YaHei"]  # font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly

data_df = pd.read_csv("alzheimers_disease_data.csv")

data_df.head()
   PatientID  Age  Gender  Ethnicity  EducationLevel        BMI  Smoking  AlcoholConsumption  PhysicalActivity  DietQuality  ...  MemoryComplaints  BehavioralProblems       ADL  Confusion  Disorientation  PersonalityChanges  DifficultyCompletingTasks  Forgetfulness  Diagnosis  DoctorInCharge
0       4751   73       0          0               2  22.927749        0           13.297218          6.327112     1.347214  ...                 0                   0  1.725883          0               0                   0                          1              0          0       XXXConfid
1       4752   89       0          0               0  26.827681        0            4.542524          7.619885     0.518767  ...                 0                   0  2.592424          0               0                   0                          0              1          0       XXXConfid
2       4753   73       0          3               1  17.795882        0           19.555085          7.844988     1.826335  ...                 0                   0  7.119548          0               1                   0                          1              0          0       XXXConfid
3       4754   74       1          0               1  33.800817        1           12.209266          8.428001     7.435604  ...                 0                   1  6.481226          0               0                   0                          0              0          0       XXXConfid
4       4755   89       0          0               0  20.716974        0           18.454356          6.310461     0.795498  ...                 0                   0  0.014691          0               0                   1                          1              0          0       XXXConfid

5 rows × 35 columns

# Rename the column labels to Chinese
data_df.rename(columns={
    "Age": "年龄",
    "Gender": "性别",
    "Ethnicity": "种族",
    "EducationLevel": "教育水平",
    "BMI": "身体质量指数(BMI)",
    "Smoking": "吸烟状况",
    "AlcoholConsumption": "酒精摄入量",
    "PhysicalActivity": "体育活动时间",
    "DietQuality": "饮食质量评分",
    "SleepQuality": "睡眠质量评分",
    "FamilyHistoryAlzheimers": "家族阿尔茨海默病史",
    "CardiovascularDisease": "心血管疾病",
    "Diabetes": "糖尿病",
    "Depression": "抑郁症史",
    "HeadInjury": "头部受伤",
    "Hypertension": "高血压",
    "SystolicBP": "收缩压",
    "DiastolicBP": "舒张压",
    "CholesterolTotal": "胆固醇总量",
    "CholesterolLDL": "低密度脂蛋白胆固醇(LDL)",
    "CholesterolHDL": "高密度脂蛋白胆固醇(HDL)",
    "CholesterolTriglycerides": "甘油三酯",
    "MMSE": "简易精神状态检查(MMSE)得分",
    "FunctionalAssessment": "功能评估得分",
    "MemoryComplaints": "记忆抱怨",
    "BehavioralProblems": "行为问题",
    "ADL": "日常生活活动(ADL)得分",
    "Confusion": "混乱与定向障碍",
    "Disorientation": "迷失方向",
    "PersonalityChanges": "人格变化",
    "DifficultyCompletingTasks": "完成任务困难",
    "Forgetfulness": "健忘",
    "Diagnosis": "诊断状态",
    "DoctorInCharge": "主诊医生",
}, inplace=True)

data_df.columns
Index(['PatientID', '年龄', '性别', '种族', '教育水平', '身体质量指数(BMI)', '吸烟状况', '酒精摄入量','体育活动时间', '饮食质量评分', '睡眠质量评分', '家族阿尔茨海默病史', '心血管疾病', '糖尿病', '抑郁症史','头部受伤', '高血压', '收缩压', '舒张压', '胆固醇总量', '低密度脂蛋白胆固醇(LDL)','高密度脂蛋白胆固醇(HDL)', '甘油三酯', '简易精神状态检查(MMSE)得分', '功能评估得分', '记忆抱怨', '行为问题','日常生活活动(ADL)得分', '混乱与定向障碍', '迷失方向', '人格变化', '完成任务困难', '健忘', '诊断状态','主诊医生'],dtype='object')

2. Data Processing

data_df.isnull().sum()
PatientID           0
年龄                  0
性别                  0
种族                  0
教育水平                0
身体质量指数(BMI)         0
吸烟状况                0
酒精摄入量               0
体育活动时间              0
饮食质量评分              0
睡眠质量评分              0
家族阿尔茨海默病史           0
心血管疾病               0
糖尿病                 0
抑郁症史                0
头部受伤                0
高血压                 0
收缩压                 0
舒张压                 0
胆固醇总量               0
低密度脂蛋白胆固醇(LDL)      0
高密度脂蛋白胆固醇(HDL)      0
甘油三酯                0
简易精神状态检查(MMSE)得分    0
功能评估得分              0
记忆抱怨                0
行为问题                0
日常生活活动(ADL)得分       0
混乱与定向障碍             0
迷失方向                0
人格变化                0
完成任务困难              0
健忘                  0
诊断状态                0
主诊医生                0
dtype: int64
from sklearn.preprocessing import LabelEncoder

# Create a LabelEncoder instance
label_encoder = LabelEncoder()

# Label-encode the non-numeric column
data_df['主诊医生'] = label_encoder.fit_transform(data_df['主诊医生'])

data_df.head()
   PatientID  年龄  性别  种族  教育水平  身体质量指数(BMI)  吸烟状况  酒精摄入量  体育活动时间  饮食质量评分  ...  记忆抱怨  行为问题  日常生活活动(ADL)得分  混乱与定向障碍  迷失方向  人格变化  完成任务困难  健忘  诊断状态  主诊医生
0       4751    73     0     0         2          22.927749         0   13.297218      6.327112      1.347214  ...         0         0               1.725883               0         0         0             1     0         0         0
1       4752    89     0     0         0          26.827681         0    4.542524      7.619885      0.518767  ...         0         0               2.592424               0         0         0             0     1         0         0
2       4753    73     0     3         1          17.795882         0   19.555085      7.844988      1.826335  ...         0         0               7.119548               0         1         0             1     0         0         0
3       4754    74     1     0         1          33.800817         1   12.209266      8.428001      7.435604  ...         0         1               6.481226               0         0         0             0     0         0         0
4       4755    89     0     0         0          20.716974         0   18.454356      6.310461      0.795498  ...         0         0               0.014691               0         0         1             1     0         0         0

5 rows × 35 columns
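
To make the label-encoding step concrete, here is a minimal sketch on a toy DataFrame. The frame and the column names `doctor` and `group` are made up for illustration; only `LabelEncoder` comes from the code above. A column that holds one repeated string, which is what the preview rows show for 主诊医生, is mapped to 0 everywhere.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical): one constant text column and one with two values
toy = pd.DataFrame({
    "doctor": ["XXXConfid", "XXXConfid", "XXXConfid"],
    "group": ["b", "a", "b"],
})

encoder = LabelEncoder()
toy["doctor"] = encoder.fit_transform(toy["doctor"])  # a single class maps to 0
toy["group"] = encoder.fit_transform(toy["group"])    # classes are sorted: a -> 0, b -> 1

print(toy)
```

A column with only one distinct value carries no information for a classifier, which is consistent with it being left out of the feature matrix later on.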

2.1 Proportion of Diagnosed Patients

# Count patients with and without a diagnosis
counts = data_df["诊断状态"].value_counts()

# Convert the counts to percentages
sizes = counts / counts.sum() * 100

# Draw a donut chart
fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(sizes, labels=sizes.index, autopct='%1.2f%%', startangle=90, wedgeprops=dict(width=0.3))

# Title: proportion diagnosed (1 = diagnosed, 0 = not diagnosed)
plt.title("患病占比(1患病,0没有患病)")
plt.show()
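
The percentage calculation can also be written with the `normalize` argument of `value_counts`. The sketch below uses a small made-up label series (the name `labels` and its values are assumptions, not the real data) to show that the two forms agree.

```python
import pandas as pd

# Hypothetical labels: 1 = diagnosed, 0 = not diagnosed
labels = pd.Series([0, 0, 1, 0, 1, 0, 0, 1], name="diagnosis")

counts = labels.value_counts()
manual = counts / counts.sum() * 100                # same formula as in the code above
direct = labels.value_counts(normalize=True) * 100  # built-in equivalent

print(direct.round(2))
```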


2.2 Correlation Analysis

plt.figure(figsize=(40, 35))
sns.heatmap(data_df.corr(), annot=True, fmt=".2f")
plt.show()
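
A 35 × 35 heatmap is hard to read at a glance, so it can help to pull out only the column of correlations with the label and sort it. This is a sketch on synthetic data, not the author's analysis; the column names `related`, `noise`, and `target` are invented for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
target = rng.integers(0, 2, size=n)

# Synthetic features (hypothetical): one tied to the target, one pure noise
toy = pd.DataFrame({
    "related": target + rng.normal(0, 0.5, size=n),
    "noise": rng.normal(0, 1, size=n),
    "target": target,
})

# Correlation of every feature with the target, strongest first
target_corr = toy.corr()["target"].drop("target")
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```

On the real data the same pattern would be `data_df.corr()["诊断状态"]`, assuming all columns are numeric as they are after the label-encoding step.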


2.3 Age and Diagnosis

data_df['年龄'].min(), data_df['年龄'].max()

(np.int64(60), np.int64(90))

# Count diagnosed and total patients at each age
grouped = data_df.groupby('年龄').agg({'诊断状态': ['sum', 'size']})  # sum = number diagnosed, size = group size
grouped.columns = ['患病', '总人数']
grouped['不患病'] = grouped['总人数'] - grouped['患病']  # number not diagnosed

# Set the plot style
sns.set(style="whitegrid")

plt.figure(figsize=(12, 5))

# x-axis labels (the ages)
x = grouped.index.astype(str)  # convert ages to strings for display

# Draw the bars
plt.bar(x, grouped["不患病"], 0.35, label="不患病", color='skyblue')
plt.bar(x, grouped["患病"], 0.35, label="患病", color='salmon')

# Title and axis labels
plt.title("患病年龄分布", fontproperties='Microsoft YaHei')
plt.xlabel("年龄", fontproperties='Microsoft YaHei')
plt.ylabel("人数", fontproperties='Microsoft YaHei')

# Use the same font for the legend
plt.legend(prop={'family': 'Microsoft YaHei'})

# Show the figure
plt.tight_layout()
plt.show()
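
The `groupby(...).agg(['sum', 'size'])` pattern works because the label is 0/1: the sum counts diagnosed patients and the size counts everyone in the group. A small sketch with made-up records (English column names are used here only for readability) also derives the diagnosis rate per age, which the counts alone do not show.

```python
import pandas as pd

# Hypothetical records: age and a 0/1 diagnosis flag
toy = pd.DataFrame({
    "age": [60, 60, 60, 61, 61, 62],
    "diagnosis": [1, 0, 0, 1, 1, 0],
})

grouped = toy.groupby("age").agg({"diagnosis": ["sum", "size"]})
grouped.columns = ["diagnosed", "total"]
grouped["healthy"] = grouped["total"] - grouped["diagnosed"]
grouped["rate"] = grouped["diagnosed"] / grouped["total"]  # share diagnosed at each age

print(grouped)
```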


3. Feature Selection

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

data = data_df.copy()

X = data_df.iloc[:, 1:-2]  # every column except PatientID, the label, and the doctor column
y = data_df.iloc[:, -2]    # 诊断状态 (diagnosis) is the label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Create and fit the model
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
pred = tree.predict(X_test)

reporter = classification_report(y_test, pred)
print(reporter)
              precision    recall  f1-score   support

           0       0.91      0.92      0.92       277
           1       0.85      0.84      0.85       153

    accuracy                           0.89       430
   macro avg       0.88      0.88      0.88       430
weighted avg       0.89      0.89      0.89       430
# Plot the feature importances of the fitted decision tree
feature_importances = tree.feature_importances_
features_rf = pd.DataFrame({'特征': X.columns, '重要度': feature_importances})
features_rf.sort_values(by='重要度', ascending=False, inplace=True)
plt.figure(figsize=(20, 10))
sns.barplot(x='重要度', y='特征', data=features_rf)
plt.xlabel('重要度')
plt.ylabel('特征')
plt.title('决策树特征重要度图')  # the estimator here is a decision tree, not a random forest
plt.show()


from sklearn.feature_selection import RFE

# Use RFE (recursive feature elimination) to select features
rfe_selector = RFE(estimator=tree, n_features_to_select=20)  # keep 20 features
rfe_selector.fit(X, y)
X_new = rfe_selector.transform(X)
feature_names = np.array(X.columns)
selected_feature_names = feature_names[rfe_selector.support_]
print(selected_feature_names)

['年龄' '种族' '教育水平' '身体质量指数(BMI)' '酒精摄入量' '体育活动时间' '饮食质量评分' '睡眠质量评分' '心血管疾病'
 '收缩压' '舒张压' '胆固醇总量' '低密度脂蛋白胆固醇(LDL)' '高密度脂蛋白胆固醇(HDL)' '甘油三酯'
 '简易精神状态检查(MMSE)得分' '功能评估得分' '记忆抱怨' '行为问题' '日常生活活动(ADL)得分']
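
RFE repeatedly fits the estimator, drops the least important feature, and refits until the requested number of features remains. The sketch below shows this on synthetic data from `make_classification`; the data, the choice of 4 features, and the `random_state` values are assumptions made for the example rather than settings from this post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical): 10 features, 3 of them informative
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)

selector = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=4)
selector.fit(X_toy, y_toy)

print(selector.support_)  # boolean mask: True for the 4 features that were kept
print(selector.ranking_)  # 1 for kept features; larger numbers were dropped earlier

X_selected = selector.transform(X_toy)
print(X_selected.shape)
```

`support_` is the same mask used above to index `feature_names`, and `ranking_` gives the elimination order, which can be useful when deciding whether 20 is the right number of features to keep.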

4. Building the Dataset

4.1 Splitting and Standardizing the Dataset

feature_selection = [
    '年龄', '种族', '教育水平', '身体质量指数(BMI)', '酒精摄入量', '体育活动时间', '饮食质量评分', '睡眠质量评分', '心血管疾病',
    '收缩压', '舒张压', '胆固醇总量', '低密度脂蛋白胆固醇(LDL)', '高密度脂蛋白胆固醇(HDL)', '甘油三酯',
    '简易精神状态检查(MMSE)得分', '功能评估得分', '记忆抱怨', '行为问题', '日常生活活动(ADL)得分',
]

X = data_df[feature_selection]

# Standardize. Standardization is really meant for continuous data and does not suit categorical data;
# since 种族 (ethnicity) is the only categorical feature here, I am taking a small shortcut.
sc = StandardScaler()
X = sc.fit_transform(X)

X = torch.tensor(np.array(X), dtype=torch.float32)
y = torch.tensor(np.array(y), dtype=torch.long)

# Split into training and test sets again
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, y_train.shape

(torch.Size([1719, 20]), torch.Size([1719]))
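
The code above calls `fit_transform` on the full feature matrix before `train_test_split`, so the scaling statistics also include the test rows. A common alternative is to split first and fit the scaler on the training rows only, as Section 3 does. The sketch below illustrates that order on random data; the arrays `X_toy` and `y_toy` are hypothetical stand-ins, and this is one possible arrangement rather than the author's pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(200, 3))  # hypothetical features
y_toy = rng.integers(0, 2, size=200)                   # hypothetical labels

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # statistics come from the training rows only
X_te_scaled = scaler.transform(X_te)      # test rows reuse those statistics

print(X_tr_scaled.mean(axis=0).round(6))  # close to 0 for every column
print(X_te_scaled.mean(axis=0).round(3))  # near 0, but generally not exactly 0
```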

4.2 Building the Data Loaders

batch_size = 32

train_dl = DataLoader(
    TensorDataset(X_train, y_train),
    batch_size=batch_size,
    shuffle=True
)

test_dl = DataLoader(
    TensorDataset(X_test, y_test),
    batch_size=batch_size,
    shuffle=False
)
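
As a quick sanity check on what a loader like `train_dl` yields, the sketch below wraps random tensors with the same shapes as `X_train` and `y_train` (1719 rows, 20 features). The tensors `X_fake` and `y_fake` are stand-ins, not the real data. With a batch size of 32 and the default `drop_last=False`, the last batch is smaller than the others.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in tensors with the same shapes as X_train / y_train above
X_fake = torch.randn(1719, 20)
y_fake = torch.randint(0, 2, (1719,))

loader = DataLoader(TensorDataset(X_fake, y_fake), batch_size=32, shuffle=False)

batch_sizes = [xb.shape[0] for xb, _ in loader]
print(len(loader), batch_sizes[0], batch_sizes[-1])
```

This matters for the training function below, which divides the summed loss by the number of batches and the number of correct predictions by the dataset size.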

5. Building the Model

class Rnn_Model(nn.Module):
    def __init__(self):
        super().__init__()

        # RNN layer
        self.rnn = nn.RNN(input_size=20, hidden_size=200, num_layers=1, batch_first=True)

        self.fc1 = nn.Linear(200, 50)
        self.fc2 = nn.Linear(50, 2)

    def forward(self, x):
        x, hidden1 = self.rnn(x)
        x = self.fc1(x)
        x = self.fc2(x)
        return x

# The dataset is small, so the CPU is sufficient
device = "cpu"

model = Rnn_Model().to(device)
model

Rnn_Model(
  (rnn): RNN(20, 200, batch_first=True)
  (fc1): Linear(in_features=200, out_features=50, bias=True)
  (fc2): Linear(in_features=50, out_features=2, bias=True)
)

model(torch.randn(32, 20)).shape

torch.Size([32, 2])
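
One detail behind the shape check above: `nn.RNN` accepts a 2-D tensor as a single unbatched sequence, so `torch.randn(32, 20)` is read as 32 time steps rather than 32 independent samples. The sketch below is not part of the original code. It compares that behavior with adding a length-1 sequence dimension through `unsqueeze(1)`, so that each row is processed independently; which variant suits tabular data is a design choice left open here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=20, hidden_size=200, num_layers=1, batch_first=True)
x = torch.randn(32, 20)

with torch.no_grad():
    out_2d, h_2d = rnn(x)               # one unbatched sequence of 32 steps
    out_3d, h_3d = rnn(x.unsqueeze(1))  # 32 independent sequences of length 1

print(out_2d.shape, h_2d.shape)  # output per step, final hidden state
print(out_3d.shape, h_3d.shape)

# The first row matches in both cases because the hidden state starts at zero;
# later rows differ because the 2-D call carries the hidden state from row to row.
same_first = torch.allclose(out_2d[0], out_3d[0, 0], atol=1e-5)
same_second = torch.allclose(out_2d[1], out_3d[1, 0], atol=1e-5)
print(same_first, same_second)
```

With the 3-D input, the per-sample features would be taken from `out_3d[:, -1, :]` before the linear layers so that the final output keeps the `(batch, 2)` shape.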

6. Training Setup

6.1 Training Function

def train(data, model, loss_fn, opt):
    size = len(data.dataset)
    batch_num = len(data)

    train_loss, train_acc = 0.0, 0.0

    for X, y in data:
        X, y = X.to(device), y.to(device)

        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        opt.zero_grad()  # clear the gradients
        loss.backward()  # compute the gradients
        opt.step()       # update the parameters

        train_loss += loss.item()
        train_acc += (pred.argmax(1) == y).type(torch.float).sum().item()

    train_loss /= batch_num
    train_acc /= size

    return train_acc, train_loss

6.2 Test Function

def test(data, model, loss_fn):
    size = len(data.dataset)
    batch_num = len(data)

    test_loss, test_acc = 0.0, 0.0

    with torch.no_grad():
        for X, y in data:
            X, y = X.to(device), y.to(device)

            pred = model(X)
            loss = loss_fn(pred, y)

            test_loss += loss.item()
            test_acc += (pred.argmax(1) == y).type(torch.float).sum().item()

    test_loss /= batch_num
    test_acc /= size

    return test_acc, test_loss

6.3 Hyperparameters

loss_fn = nn.CrossEntropyLoss()  # loss function
learn_lr = 1e-4                  # learning rate
optimizer = torch.optim.Adam(model.parameters(), lr=learn_lr)
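
`nn.CrossEntropyLoss` expects raw, unnormalized scores (logits) and integer class labels, which is why the model ends in a plain linear layer with no softmax and why `y` was converted to `torch.long`. The sketch below checks this on made-up logits by comparing the loss with an explicit log-softmax followed by negative log-likelihood; the tensors are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 2)            # raw model outputs for 4 samples, 2 classes
targets = torch.tensor([0, 1, 1, 0])  # integer class labels (dtype long)

loss_ce = nn.CrossEntropyLoss()(logits, targets)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(loss_ce.item(), loss_manual.item())
```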

7. Model Training

train_acc = []
train_loss = []
test_acc = []
test_loss = []

epoches = 50

for i in range(epoches):
    model.train()
    epoch_train_acc, epoch_train_loss = train(train_dl, model, loss_fn, optimizer)

    model.eval()
    epoch_test_acc, epoch_test_loss = test(test_dl, model, loss_fn)

    train_acc.append(epoch_train_acc)
    train_loss.append(epoch_train_loss)
    test_acc.append(epoch_test_acc)
    test_loss.append(epoch_test_loss)

    # Print the metrics for this epoch
    template = ('Epoch:{:2d}, Train_acc:{:.1f}%, Train_loss:{:.3f}, Test_acc:{:.1f}%, Test_loss:{:.3f}')
    print(template.format(i + 1, epoch_train_acc*100, epoch_train_loss, epoch_test_acc*100, epoch_test_loss))

print("Done")

Epoch: 1, Train_acc:57.9%, Train_loss:0.675, Test_acc:66.0%, Test_loss:0.608
Epoch: 2, Train_acc:67.2%, Train_loss:0.589, Test_acc:68.8%, Test_loss:0.556
Epoch: 3, Train_acc:75.1%, Train_loss:0.540, Test_acc:75.1%, Test_loss:0.506
Epoch: 4, Train_acc:79.1%, Train_loss:0.485, Test_acc:82.1%, Test_loss:0.460
Epoch: 5, Train_acc:83.0%, Train_loss:0.442, Test_acc:81.4%, Test_loss:0.427
Epoch: 6, Train_acc:83.5%, Train_loss:0.411, Test_acc:84.2%, Test_loss:0.407
Epoch: 7, Train_acc:83.3%, Train_loss:0.395, Test_acc:82.8%, Test_loss:0.400
Epoch: 8, Train_acc:84.1%, Train_loss:0.383, Test_acc:84.0%, Test_loss:0.396
Epoch: 9, Train_acc:84.1%, Train_loss:0.380, Test_acc:84.0%, Test_loss:0.394
Epoch:10, Train_acc:83.9%, Train_loss:0.375, Test_acc:84.0%, Test_loss:0.395
Epoch:11, Train_acc:84.5%, Train_loss:0.375, Test_acc:84.4%, Test_loss:0.396
Epoch:12, Train_acc:84.5%, Train_loss:0.374, Test_acc:83.5%, Test_loss:0.399
Epoch:13, Train_acc:83.7%, Train_loss:0.373, Test_acc:83.0%, Test_loss:0.401
Epoch:14, Train_acc:84.3%, Train_loss:0.372, Test_acc:84.0%, Test_loss:0.402
Epoch:15, Train_acc:84.1%, Train_loss:0.375, Test_acc:83.3%, Test_loss:0.400
Epoch:16, Train_acc:84.6%, Train_loss:0.370, Test_acc:83.0%, Test_loss:0.404
Epoch:17, Train_acc:84.2%, Train_loss:0.371, Test_acc:83.0%, Test_loss:0.406
Epoch:18, Train_acc:84.3%, Train_loss:0.377, Test_acc:83.3%, Test_loss:0.401
Epoch:19, Train_acc:84.8%, Train_loss:0.371, Test_acc:83.0%, Test_loss:0.402
Epoch:20, Train_acc:84.8%, Train_loss:0.372, Test_acc:83.5%, Test_loss:0.402
Epoch:21, Train_acc:84.9%, Train_loss:0.374, Test_acc:83.7%, Test_loss:0.399
Epoch:22, Train_acc:85.2%, Train_loss:0.369, Test_acc:83.7%, Test_loss:0.401
Epoch:23, Train_acc:84.7%, Train_loss:0.374, Test_acc:84.4%, Test_loss:0.401
Epoch:24, Train_acc:84.2%, Train_loss:0.371, Test_acc:84.2%, Test_loss:0.398
Epoch:25, Train_acc:84.3%, Train_loss:0.370, Test_acc:83.7%, Test_loss:0.399
Epoch:26, Train_acc:84.8%, Train_loss:0.373, Test_acc:83.7%, Test_loss:0.398
Epoch:27, Train_acc:84.6%, Train_loss:0.373, Test_acc:83.7%, Test_loss:0.395
Epoch:28, Train_acc:85.1%, Train_loss:0.372, Test_acc:83.5%, Test_loss:0.397
Epoch:29, Train_acc:84.4%, Train_loss:0.373, Test_acc:84.4%, Test_loss:0.399
Epoch:30, Train_acc:85.0%, Train_loss:0.371, Test_acc:83.7%, Test_loss:0.401
Epoch:31, Train_acc:84.7%, Train_loss:0.372, Test_acc:83.7%, Test_loss:0.401
Epoch:32, Train_acc:84.5%, Train_loss:0.372, Test_acc:84.0%, Test_loss:0.400
Epoch:33, Train_acc:84.4%, Train_loss:0.369, Test_acc:83.5%, Test_loss:0.397
Epoch:34, Train_acc:84.7%, Train_loss:0.369, Test_acc:83.7%, Test_loss:0.401
Epoch:35, Train_acc:84.6%, Train_loss:0.372, Test_acc:83.3%, Test_loss:0.396
Epoch:36, Train_acc:84.8%, Train_loss:0.370, Test_acc:83.3%, Test_loss:0.396
Epoch:37, Train_acc:84.8%, Train_loss:0.371, Test_acc:83.5%, Test_loss:0.399
Epoch:38, Train_acc:84.9%, Train_loss:0.369, Test_acc:83.5%, Test_loss:0.398
Epoch:39, Train_acc:85.0%, Train_loss:0.370, Test_acc:83.0%, Test_loss:0.395
Epoch:40, Train_acc:83.8%, Train_loss:0.371, Test_acc:83.7%, Test_loss:0.394
Epoch:41, Train_acc:84.6%, Train_loss:0.370, Test_acc:83.7%, Test_loss:0.394
Epoch:42, Train_acc:85.1%, Train_loss:0.370, Test_acc:84.2%, Test_loss:0.392
Epoch:43, Train_acc:84.4%, Train_loss:0.371, Test_acc:84.0%, Test_loss:0.393
Epoch:44, Train_acc:84.5%, Train_loss:0.372, Test_acc:84.7%, Test_loss:0.396
Epoch:45, Train_acc:85.3%, Train_loss:0.372, Test_acc:84.2%, Test_loss:0.396
Epoch:46, Train_acc:85.0%, Train_loss:0.368, Test_acc:84.4%, Test_loss:0.397
Epoch:47, Train_acc:85.0%, Train_loss:0.372, Test_acc:84.0%, Test_loss:0.395
Epoch:48, Train_acc:84.5%, Train_loss:0.370, Test_acc:84.4%, Test_loss:0.394
Epoch:49, Train_acc:85.1%, Train_loss:0.368, Test_acc:84.2%, Test_loss:0.400
Epoch:50, Train_acc:84.9%, Train_loss:0.370, Test_acc:84.2%, Test_loss:0.397
Done

8. Model Evaluation

8.1 Result Plots

import matplotlib.pyplot as plt
# Hide warnings
import warnings
warnings.filterwarnings("ignore")  # ignore warning messages
from datetime import datetime
current_time = datetime.now()  # get the current time

epochs_range = range(epoches)

plt.figure(figsize=(12, 3))
plt.subplot(1, 2, 1)

plt.plot(epochs_range, train_acc, label='Training Accuracy')
plt.plot(epochs_range, test_acc, label='Test Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Test Accuracy')
plt.xlabel(current_time)  # include the timestamp when checking in; otherwise the code screenshot is invalid

plt.subplot(1, 2, 2)
plt.plot(epochs_range, train_loss, label='Training Loss')
plt.plot(epochs_range, test_loss, label='Test Loss')
plt.legend(loc='upper right')
plt.title('Training and Test Loss')
plt.show()


8.2 Confusion Matrix

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

pred = model(X_test.to(device)).argmax(1).cpu().numpy()

# Compute the confusion matrix
cm = confusion_matrix(y_test, pred)

# Plot it as a heatmap
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
# Title ("Confusion Matrix") and axis labels
plt.title("混淆矩阵")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")

plt.tight_layout()  # fit the layout automatically
plt.show()
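
The four cells of a binary confusion matrix are enough to recover accuracy, precision, and recall. The sketch below does this for a handful of made-up labels and predictions (not the model's actual output), relying on scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions (1 = diagnosed, 0 = not diagnosed)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

cm_toy = confusion_matrix(y_true, y_pred)  # rows = true labels, columns = predictions
tn, fp, fn, tp = cm_toy.ravel()

accuracy = (tp + tn) / cm_toy.sum()
precision = tp / (tp + fp)  # of those predicted diagnosed, how many really are
recall = tp / (tp + fn)     # of those really diagnosed, how many were found

print(cm_toy)
print(accuracy, precision, recall)
```

For a diagnostic task, recall on class 1 is often the number to watch, since a false negative means a missed patient.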


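9. Summary

This week I refined last week's Alzheimer's disease diagnosis model by adding RFE (recursive feature elimination) as the feature selection method. Working through it in practice also gave me a better understanding of the model and of how to apply this kind of feature selection.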
