Python Pandas.cut: Function Walkthrough and Practical Tutorial

news 2025/7/28 19:48:07

Summary

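pandas.cut is one of the most frequently used functions in the Pandas data analysis library. Its core job is to split continuous numeric data at specified break points (bins) and convert it into discrete interval categories (categorical data). This operation comes up constantly in data preprocessing, feature engineering, and data visualization: for example, grouping users by age, turning exam scores into grades, or dividing sales figures into tiers. This article works through pandas.cut from basic usage to advanced techniques so that you can apply it confidently in your own analysis projects.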

1. Core Functionality and Parameters

First, let's look at the official signature of pandas.cut and go through its parameters one by one.

pandas.cut(
    x,
    bins,
    right: bool = True,
    labels=None,
    retbins: bool = False,
    precision: int = 3,
    include_lowest: bool = False,
    duplicates: str = 'raise',
    ordered: bool = True,
)

1.1 `x`

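Type: a one-dimensional array-like, such as a pandas.Series, a numpy.ndarray, or a Python list.
Description: The input data to be binned. cut determines, for each element of x, which interval that element falls into.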
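
To make the accepted input types concrete, here is a minimal sketch that passes the same values as a list, a NumPy array, and a Series. The data and variable names are illustrative assumptions, not part of the original article. It also shows that the return type depends on the input: a Series comes back as a Series, while other array-likes come back as a Categorical.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed for this sketch)
values = [1, 7, 5, 4, 6, 3]

# The same one-dimensional data in three different containers
from_list = pd.cut(values, bins=3)
from_array = pd.cut(np.array(values), bins=3)
from_series = pd.cut(pd.Series(values), bins=3)

print(type(from_list).__name__)    # Categorical
print(type(from_array).__name__)   # Categorical
print(type(from_series).__name__)  # Series (with category dtype)

# Two-dimensional input is rejected, because x must be one-dimensional
rejected_2d = False
try:
    pd.cut(np.array([[1, 2], [3, 4]]), bins=2)
except ValueError as exc:
    rejected_2d = True
    print("2-D input rejected:", exc)
```

If the data lives in a DataFrame, pass a single column (a Series) rather than the whole frame.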

1.2 `bins`

Type: int, sequence of scalars (such as a list or NumPy array), or pandas.IntervalIndex.
Description: This is the most important parameter of cut; it defines the binning rule.

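Integer: If bins is an integer, cut generates that many equal-width intervals between the minimum and maximum of x. For example, bins=5 splits the data into 5 buckets of equal width. To make sure the extreme values are included, pandas extends the range of x slightly (by 0.1%).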

import pandas as pd
import numpy as np
data = pd.Series(np.random.randn(20) * 100)
# Split the data into 10 equal-width intervals
categories = pd.cut(data, 10)
print(categories.value_counts())
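
To see which edges an integer bins value actually produces, one option is to ask for them with retbins=True. The sketch below uses made-up scores (an assumption for illustration) so that the result is deterministic; it shows that the interior edges are evenly spaced and that the lowest edge sits slightly below the data minimum so that the minimum itself is included.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed for this sketch), ranging from 0 to 100
scores = pd.Series([0, 10, 25, 40, 50, 75, 100])

# retbins=True returns the computed bin edges alongside the categories
categories, edges = pd.cut(scores, bins=5, retbins=True)

print(edges)           # 6 edges define 5 bins
print(np.diff(edges))  # widths: all equal except the slightly wider first bin
print(categories.cat.categories)
```

The returned edges can be passed back in as an explicit bins sequence, which is handy when the same binning has to be applied to a second dataset.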

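Sequence of scalars: This is the most common approach: a list or array specifies the interval edges by hand. For example, bins=[0, 18, 40, 60, 100] produces the intervals (0, 18], (18, 40], (40, 60], (60, 100]. By default each interval is closed on the right, and values that fall outside the outermost edges are returned as NaN.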

ages = pd.Series([1, 15, 25, 35, 45, 55, 65, 99])
age_bins = [0, 18, 35, 60, 100]
# Bin the ages using the custom edges
age_categories = pd.cut(ages, bins=age_bins)
print(age_categories)
# Output:
# 0      (0, 18]
# 1      (0, 18]
# 2     (18, 35]
# 3     (18, 35]
# 4     (35, 60]
# 5     (35, 60]
# 6    (60, 100]
# 7    (60, 100]
# dtype: category