当前位置：首页 > java >正文

Kaggle 竞赛入门指南

java 2025/8/6 3:07:01

一、账号注册

https://www.kaggle.com/account/login?phase=emailRegister

在这里插入图片描述

使用邮箱进行注册，但是通常会因为网络问题导致验证码无法正常显示，此时需要安装插件 Header Editor，将网页URL进行重定向，让验证码显示出来。

在这里插入图片描述

如图，到这里已经注册成功了。

二、竞赛入门

在这里插入图片描述

官方提供了一些入门级竞赛项目，非常容易上手，我们可以从这些赛事入手。

三、房价预测竞赛

例如，我们可以从 House Prices - Advanced Regression Techniques 这个赛事开始。

1、背景与任务

在这里插入图片描述

根据 Overview 页面的介绍，我们可以得知：购房者理想房屋描述与真实价格差异往往由多个因素决定，赛事提供了79个解释变量包含了住宅几乎各方面特征，希望参赛者能够使用这些特征进行最终房价预测。

2、数据集

每个数据集的每个字段在 Data 页面都有具体说明：

在这里插入图片描述

3、Baseline

在 Code 页面中，House Prices Prediction using TFDF 是一个能够完成赛事任务的基础实现方案，同时也可以看到众多参赛者分享的 Notebook：

在这里插入图片描述

四、跑通Baseline

1、创建一个notebook

在这里插入图片描述

可以直接复制一份 Baseline 的 Notebook，也可以手动创建一个新的 Notebook：

在这里插入图片描述

为了方便演示，这里直接复制一份 Baseline：

在这里插入图片描述

可以看到，该 Notebook 还自动引入了竞赛数据集，现在所有事情都可以在云端完成，无需本地搭建开发环境，非常方便。

2、一键执行代码

在这里插入图片描述

3、提交预测结果

所有代码执行完毕之后，在 output 目录下生成 submission.csv 结果文件，该文件就是对房价进行预测后的结果文件。

在这里插入图片描述

此时可以直接展开 Submit to competition 提交预测文件：

在这里插入图片描述

4、得分与排行榜

在赛事的 Leaderboard 页面可以查看自己的得分与名次：

在这里插入图片描述

五、Baseline拆解

Step1、导入依赖

import tensorflow as tf
import tensorflow_decision_forests as tfdf
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt# Comment this if the data visualisations doesn't work on your side
%matplotlib inline

在这里插入图片描述

TensorFlow v2.11.0
TensorFlow Decision Forests v1.2.0

Step2、加载数据集

train_file_path = "../input/house-prices-advanced-regression-techniques/train.csv"
dataset_df = pd.read_csv(train_file_path)
print("Full train dataset shape is {}".format(dataset_df.shape))

Full train dataset shape is (1460, 81)

在这里插入图片描述

Step3、查看房价分布

print(dataset_df['SalePrice'].describe())
plt.figure(figsize=(9, 8))
sns.distplot(dataset_df['SalePrice'], color='g', bins=100, hist_kws={'alpha': 0.4});

count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

在这里插入图片描述

Step4、数值数据分布

list(set(dataset_df.dtypes.tolist()))df_num = dataset_df.select_dtypes(include = ['float64', 'int64'])
df_num.head()df_num.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8);

在这里插入图片描述

Step5、划分数据集

import numpy as npdef split_dataset(dataset, test_ratio=0.30):test_indices = np.random.rand(len(dataset)) < test_ratioreturn dataset[~test_indices], dataset[test_indices]train_ds_pd, valid_ds_pd = split_dataset(dataset_df)
print("{} examples in training, {} examples in testing.".format(len(train_ds_pd), len(valid_ds_pd)))

label = 'SalePrice'
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label, task = tfdf.keras.Task.REGRESSION)
valid_ds = tfdf.keras.pd_dataframe_to_tf_dataset(valid_ds_pd, label=label, task = tfdf.keras.Task.REGRESSION)

Step6、选择随机森林模型

rf = tfdf.keras.RandomForestModel(task = tfdf.keras.Task.REGRESSION)
rf.compile(metrics=["mse"]) # Optional, you can use this to include a list of eval metrics

Step7、训练模型

rf.fit(x=train_ds)

Step8、评估模型

import matplotlib.pyplot as plt
logs = rf.make_inspector().training_logs()
plt.plot([log.num_trees for log in logs], [log.evaluation.rmse for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("RMSE (out-of-bag)")
plt.show()

在这里插入图片描述

inspector = rf.make_inspector()
inspector.evaluation()

Evaluation(num_examples=1019, accuracy=None, loss=None, rmse=29893.939666232716, ndcg=None, aucs=None, auuc=None, qini=None)

evaluation = rf.evaluate(x=valid_ds,return_dict=True)for name, value in evaluation.items():print(f"{name}: {value:.4f}")

1/1 [==============================] - 1s 747ms/step - loss: 0.0000e+00 - mse: 698953728.0000
loss: 0.0000
mse: 698953728.0000

Step9、特征重要性

print(f"Available variable importances:")
for importance in inspector.variable_importances().keys():print("\t", importance)

Available variable importances:SUM_SCORENUM_AS_ROOTNUM_NODESINV_MEAN_MIN_DEPTH

inspector.variable_importances()["NUM_AS_ROOT"]

[("OverallQual" (1; #62), 106.0),("Neighborhood" (4; #59), 53.0),("ExterQual" (4; #22), 52.0),("GarageCars" (1; #32), 22.0),("GrLivArea" (1; #38), 17.0),("YearBuilt" (1; #76), 16.0),("KitchenQual" (4; #44), 10.0),("TotalBsmtSF" (1; #73), 10.0),("BsmtQual" (4; #14), 9.0),("1stFlrSF" (1; #0), 2.0),("GarageArea" (1; #31), 1.0),("GarageFinish" (4; #34), 1.0),("GarageYrBlt" (1; #37), 1.0)]

plt.figure(figsize=(12, 4))# Mean decrease in AUC of the class 1 vs the others.
variable_importance_metric = "NUM_AS_ROOT"
variable_importances = inspector.variable_importances()[variable_importance_metric]# Extract the feature name and importance values.
#
# `variable_importances` is a list of <feature, importance> tuples.
feature_names = [vi[0].name for vi in variable_importances]
feature_importances = [vi[1] for vi in variable_importances]
# The feature are ordered in decreasing importance value.
feature_ranks = range(len(feature_names))bar = plt.barh(feature_ranks, feature_importances, label=[str(x) for x in feature_ranks])
plt.yticks(feature_ranks, feature_names)
plt.gca().invert_yaxis()# TODO: Replace with "plt.bar_label()" when available.
# Label each bar with values
for importance, patch in zip(feature_importances, bar.patches):plt.text(patch.get_x() + patch.get_width(), patch.get_y(), f"{importance:.4f}", va="top")plt.xlabel(variable_importance_metric)
plt.title("NUM AS ROOT of the class 1 vs the others")
plt.tight_layout()
plt.show()

在这里插入图片描述

Step10、保存预测结果

test_file_path = "../input/house-prices-advanced-regression-techniques/test.csv"
test_data = pd.read_csv(test_file_path)
ids = test_data.pop('Id')test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_data,task = tfdf.keras.Task.REGRESSION)preds = rf.predict(test_ds)
output = pd.DataFrame({'Id': ids,'SalePrice': preds.squeeze()})output.head()

sample_submission_df = pd.read_csv('../input/house-prices-advanced-regression-techniques/sample_submission.csv')
sample_submission_df['SalePrice'] = rf.predict(test_ds)
sample_submission_df.to_csv('/kaggle/working/submission.csv', index=False)
sample_submission_df.head()