news 2025/6/6 12:29:47

Chapter 1: Pandas Basics

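Check the Pandas version (the tutorial states that it uses version 1.0.0 throughout, although the output recorded below is 0.25.1)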

import pandas as pd
import numpy as np

pd.__version__

'0.25.1'

I. Reading and Writing Files

1. Reading

(a) CSV format

df = pd.read_csv('data/table.csv')
df.head()    # head() returns the first n rows; n defaults to 5

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	S_1	C_1	1101	M	street_1	173	63	34.0	A+
1	S_1	C_1	1102	F	street_2	192	73	32.5	B+
2	S_1	C_1	1103	M	street_2	186	82	87.2	B+
3	S_1	C_1	1104	F	street_2	167	81	80.4	B-
4	S_1	C_1	1105	F	street_4	159	64	84.8	B+

(b) TXT format

df_txt = pd.read_table('data/table.txt')  # the sep parameter sets the delimiter; the default is '\t' (tab)
df_txt

	col1	col2	col3	col4
0	2	a	1.4	apple
1	3	b	3.4	banana
2	6	c	2.5	orange
3	5	d	3.2	lemon
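
As a minimal sketch of the sep parameter, the snippet below parses a small in-memory text with two different delimiters. The sample text and the variable names are made up for illustration, and io.StringIO stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical sample data; StringIO stands in for a file such as data/table.txt.
tab_text = "col1\tcol2\n2\ta\n3\tb\n"
semi_text = "col1;col2\n2;a\n3;b\n"

# read_table splits on tabs by default.
df_tab = pd.read_table(io.StringIO(tab_text))

# Any other delimiter is passed through sep.
df_semi = pd.read_csv(io.StringIO(semi_text), sep=";")

# Both calls parse to the same table.
print(df_tab.equals(df_semi))
```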

(c) XLS or XLSX format

# requires the xlrd package (newer pandas versions read .xlsx files with openpyxl instead)
df_excel = pd.read_excel('data/table.xlsx')
df_excel.head()

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	S_1	C_1	1101	M	street_1	173	63	34.0	A+
1	S_1	C_1	1102	F	street_2	192	73	32.5	B+
2	S_1	C_1	1103	M	street_2	186	82	87.2	B+
3	S_1	C_1	1104	F	street_2	167	81	80.4	B-
4	S_1	C_1	1105	F	street_4	159	64	84.8	B+

2. Writing

(a) CSV format

names = ['Bob','Jessica','Mary','John','Mel']
births = [968,155,77,578,973]
DataSet = list(zip(names,births))   # zip pairs the two lists element by element
DataSet
df = pd.DataFrame(data = DataSet ,columns=['Names','Births'])   # build a DataFrame from the pairs
df
df.to_csv('data/new_table.csv')

(b) XLS or XLSX format

# requires openpyxl
df.to_excel('data/new_table2.xlsx', sheet_name='Sheet1')
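
One detail worth knowing when writing: to_csv also writes the row index by default, so reading the file back yields an extra unnamed column. A small sketch, using in-memory buffers instead of file paths (the buffers are an assumption made so the example is self-contained):

```python
import io

import pandas as pd

df = pd.DataFrame({"Names": ["Bob", "Jessica"], "Births": [968, 155]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
round_trip = pd.read_csv(without_index)
print(round_trip.columns.tolist())
```

Alternatively, the index column can be restored on reading with index_col=0.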

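II. Basic Data Structures

1. Series

(a) Creating a Series (similar to a one-dimensional array)

The most commonly used attributes of a Series are values, index, name, and dtype.

s = pd.Series(np.random.randn(5),index=['a','b','c','d','e'],name='一个Series',dtype='float64')  # a Series of draws from the standard normal distribution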
s

a   -0.377312
b   -0.229854
c   -1.465434
d   -0.840431
e    0.527155
Name: 一个Series, dtype: float64

(b) Accessing Series attributes

s.values

array([-0.37731161, -0.22985384, -1.4654339 , -0.84043107,  0.52715519])

s.name

'一个Series'

s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

s.dtype

dtype('float64')

(c) Retrieving a single element

Indexing is discussed in detail in Chapter 2; this is only a first look.

s['a']

-0.3773116135477631

(d) Calling methods

s.mean()   # mean

-0.4771750456994931

A Series has a large number of attributes and methods available:

print([attr for attr in dir(s) if not attr.startswith('_')])

['T', 'a', 'abs', 'add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'all', 'any', 'append', 'apply', 'argmax', 'argmin', 'argsort', 'array', 'as_matrix', 'asfreq', 'asof', 'astype', 'at', 'at_time', 'autocorr', 'axes', 'b', 'base', 'between', 'between_time', 'bfill', 'bool', 'c', 'clip', 'clip_lower', 'clip_upper', 'combine', 'combine_first', 'compound', 'compress', 'copy', 'corr', 'count', 'cov', 'cummax', 'cummin', 'cumprod', 'cumsum', 'd', 'data', 'describe', 'diff', 'div', 'divide', 'divmod', 'dot', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'dtype', 'dtypes', 'duplicated', 'e', 'empty', 'eq', 'equals', 'ewm', 'expanding', 'explode', 'factorize', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'flags', 'floordiv', 'from_array', 'ftype', 'ftypes', 'ge', 'get', 'get_dtype_counts', 'get_ftype_counts', 'get_values', 'groupby', 'gt', 'hasnans', 'head', 'hist', 'iat', 'idxmax', 'idxmin', 'iloc', 'imag', 'index', 'infer_objects', 'interpolate', 'is_monotonic', 'is_monotonic_decreasing', 'is_monotonic_increasing', 'is_unique', 'isin', 'isna', 'isnull', 'item', 'items', 'itemsize', 'iteritems', 'ix', 'keys', 'kurt', 'kurtosis', 'last', 'last_valid_index', 'le', 'loc', 'lt', 'mad', 'map', 'mask', 'max', 'mean', 'median', 'memory_usage', 'min', 'mod', 'mode', 'mul', 'multiply', 'name', 'nbytes', 'ndim', 'ne', 'nlargest', 'nonzero', 'notna', 'notnull', 'nsmallest', 'nunique', 'pct_change', 'pipe', 'plot', 'pop', 'pow', 'prod', 'product', 'ptp', 'put', 'quantile', 'radd', 'rank', 'ravel', 'rdiv', 'rdivmod', 'real', 'reindex', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'repeat', 'replace', 'resample', 'reset_index', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'round', 'rpow', 'rsub', 'rtruediv', 'sample', 'searchsorted', 'sem', 'set_axis', 'shape', 'shift', 'size', 'skew', 'slice_shift', 'sort_index', 'sort_values', 'squeeze', 'std', 'strides', 'sub', 'subtract', 'sum', 'swapaxes', 'swaplevel', 'tail', 'take', 'to_clipboard', 
'to_csv', 'to_dense', 'to_dict', 'to_excel', 'to_frame', 'to_hdf', 'to_json', 'to_latex', 'to_list', 'to_msgpack', 'to_numpy', 'to_period', 'to_pickle', 'to_sparse', 'to_sql', 'to_string', 'to_timestamp', 'to_xarray', 'transform', 'transpose', 'truediv', 'truncate', 'tshift', 'tz_convert', 'tz_localize', 'unique', 'unstack', 'update', 'value_counts', 'values', 'var', 'view', 'where', 'xs']

2. DataFrame

(a) Creating a DataFrame

df = pd.DataFrame({'col1':list('abcde'),'col2':range(5,10),'col3':[2,2.5,3.6,4.6,5.8]},index=list('一二三四五'))   # DataFrame({column name: values}, index=row labels)
df

	col1	col2	col3
一	a	5	2.0
二	b	6	2.5
三	c	7	3.6
四	d	8	4.6
五	e	9	5.8

(b) Extracting one column of a DataFrame as a Series

df['col1']

一    a
二    b
三    c
四    d
五    e
Name: col1, dtype: object

type(df)  # type of the whole object

pandas.core.frame.DataFrame

type(df['col1'])  # a single column taken from a DataFrame is a Series

pandas.core.series.Series

(c) Renaming rows or columns

df.rename(index={'一':'one','五':'five'},columns={'col1':'new_col1','col3':'new_col3'})  # index/columns={old name: new name}

	new_col1	col2	new_col3
one	a	5	2.0
二	b	6	2.5
三	c	7	3.6
四	d	8	4.6
five	e	9	5.8

(d) Accessing attributes and calling methods

df.index   # the index, i.e. the row labels

Index(['一', '二', '三', '四', '五'], dtype='object')

df.columns  # the column names

Index(['col1', 'col2', 'col3'], dtype='object')

df.values   # the DataFrame's values, as a NumPy array

array([['a', 5, 2.0],['b', 6, 2.5],['c', 7, 3.6],['d', 8, 4.6],['e', 9, 5.8]], dtype=object)

df.shape  # dimensions of the DataFrame: (rows, columns)

(5, 3)

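df.mean(numeric_only=True)  # essentially an aggregation, covered in detail in Chapter 3 (computed per column); numeric_only=True skips the string column col1, which pandas 2.0 and later no longer drop silently

col2    7.0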
col3    3.7
dtype: float64

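(e) Index alignment

This is a very powerful feature of Pandas, and not understanding it can sometimes cause trouble.

df1 = pd.DataFrame({'A':[1,2,3]},index=[1,2,3])
df2 = pd.DataFrame({'A':[1,2,3]},index=[3,2,1])
df1-df2  # rows are matched by index label before subtracting, so the result is not all zeros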

	A
1	-2
2	0
3	2
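
To make the alignment rule concrete, here is a small sketch that contrasts label-based and position-based subtraction and shows what happens when a label exists on only one side. The third frame, df3, is a hypothetical addition for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3]}, index=[1, 2, 3])
df2 = pd.DataFrame({"A": [1, 2, 3]}, index=[3, 2, 1])

# Label-based: rows are matched by index before subtracting.
by_label = df1 - df2

# Position-based: drop to NumPy arrays so no alignment happens.
by_position = df1["A"].to_numpy() - df2["A"].to_numpy()

# Labels present on only one side produce NaN in the result.
df3 = pd.DataFrame({"A": [10, 20]}, index=[1, 4])
partial = df1 - df3

print(by_label["A"].tolist())
print(by_position.tolist())
print(partial)
```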

(f) Deleting and adding columns

To delete, you can use the drop function, del, or pop.

df.drop(index='五',columns='col1')  # returns a new DataFrame; with inplace=True the original is changed directly

	col2	col3
一	5	2.0
二	6	2.5
三	7	3.6
四	8	4.6

df['col1']=[1,2,3,4,5]
del df['col1']
df

	col2	col3
一	5	2.0
二	6	2.5
三	7	3.6
四	8	4.6
五	9	5.8

The pop method operates directly on the original DataFrame and returns the removed column, much like pop on Python's built-in containers.

df['col1']=[1,2,3,4,5]
df.pop('col1')

一    1
二    2
三    3
四    4
五    5
Name: col1, dtype: int64

df

	col2	col3
一	5	2.0
二	6	2.5
三	7	3.6
四	8	4.6
五	9	5.8

You can add a new column by direct assignment, or use the assign method.

df1['B']=list('abc')
df1

	A	B
1	1	a
2	2	b
3	3	c

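df1.assign(C=pd.Series(list('def')))  # the new Series has index 0 to 2 and is aligned to df1's index 1 to 3, so row 3 gets NaN

However, assign does not modify the original DataFrame.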

df1

	A	B
1	1	a
2	2	b
3	3	c
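
The following sketch shows the two behaviours of assign side by side: a list is placed by position, a Series is aligned on the index first, and in both cases the original frame is left untouched. The variable names from_list and from_series are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": list("abc")}, index=[1, 2, 3])

# A plain list is assigned by position.
from_list = df1.assign(C=list("def"))

# A Series is aligned on the index first: labels 1 and 2 match, 3 does not.
from_series = df1.assign(C=pd.Series(list("def")))

# assign returns a new DataFrame; df1 still has only A and B.
print(from_list["C"].tolist())
print(from_series["C"].tolist())
print(df1.columns.tolist())
```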

(g) Selecting columns by dtype

df.select_dtypes(include=['number']).head()

	col2	col3
一	5	2.0
二	6	2.5
三	7	3.6
四	8	4.6
五	9	5.8

df.select_dtypes(include=['float']).head()

	col3
一	2.0
二	2.5
三	3.6
四	4.6
五	5.8
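
select_dtypes also takes an exclude argument, which is convenient for separating numeric from non-numeric columns. A brief sketch on a three-row frame shaped like the one above (the data here is a shortened, illustrative copy):

```python
import pandas as pd

df = pd.DataFrame({"col1": list("abc"), "col2": [5, 6, 7], "col3": [2.0, 2.5, 3.6]})

# include keeps the matching dtypes, exclude drops them.
numeric_cols = df.select_dtypes(include=["number"]).columns.tolist()
non_numeric_cols = df.select_dtypes(exclude=["number"]).columns.tolist()

print(numeric_cols, non_numeric_cols)
```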

(h) Converting a Series to a DataFrame

s = df.mean()
s.name='to_DataFrame'
s

col2    7.0
col3    3.7
Name: to_DataFrame, dtype: float64

s.to_frame()

	to_DataFrame
col2	7.0
col3	3.7

Use the T attribute to transpose.

s.to_frame().T

	col2	col3
to_DataFrame	7.0	3.7

III. Commonly Used Basic Functions

From here on, including all later chapters, we will use this fictional dataset.

df = pd.read_csv('data/table.csv')

1. head and tail

df.head()  # first five rows

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	S_1	C_1	1101	M	street_1	173	63	34.0	A+
1	S_1	C_1	1102	F	street_2	192	73	32.5	B+
2	S_1	C_1	1103	M	street_2	186	82	87.2	B+
3	S_1	C_1	1104	F	street_2	167	81	80.4	B-
4	S_1	C_1	1105	F	street_4	159	64	84.8	B+

df.tail()  # last five rows

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
30	S_2	C_4	2401	F	street_2	192	62	45.3	A
31	S_2	C_4	2402	M	street_7	166	82	48.7	B
32	S_2	C_4	2403	F	street_6	158	60	59.7	B+
33	S_2	C_4	2404	F	street_2	160	84	67.7	B
34	S_2	C_4	2405	F	street_6	193	54	47.6	B

The n parameter sets how many rows are displayed.

df.head(3)

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	S_1	C_1	1101	M	street_1	173	63	34.0	A+
1	S_1	C_1	1102	F	street_2	192	73	32.5	B+
2	S_1	C_1	1103	M	street_2	186	82	87.2	B+

2. unique and nunique

nunique returns the number of unique values.

df['Physics'].nunique()   # number of unique values

unique returns all the unique values.

df['Physics'].unique()

array(['A+', 'B+', 'B-', 'A-', 'B', 'A', 'C'], dtype=object)

3. count and value_counts

count returns the number of non-missing elements.

df['Physics'].count()

value_counts returns how many times each unique value occurs.

df['Physics'].value_counts()

B+    9
B     8
B-    6
A     4
A+    3
A-    3
C     2
Name: Physics, dtype: int64
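
The difference between these counting functions only shows up when values are missing. A minimal sketch on a made-up Series that contains one NaN:

```python
import numpy as np
import pandas as pd

grades = pd.Series(["A", "B", "B", np.nan, "A", "B"])  # hypothetical data

n_all = grades.size          # 6: every element, missing or not
n_valid = grades.count()     # 5: non-missing only
n_unique = grades.nunique()  # 2: NaN is not counted by default

# value_counts skips NaN unless dropna=False; normalize=True gives proportions.
counts_with_nan = grades.value_counts(dropna=False)
shares = grades.value_counts(normalize=True)

print(counts_with_nan)
print(shares)
```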

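4. describe and info

info reports which columns exist, how many non-missing values each has, and each column's dtype. (The output recorded below was captured on the five-row DataFrame from section II, not on table.csv.)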

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 一 to 五
Data columns (total 2 columns):
col2    5 non-null int64
col3    5 non-null float64
dtypes: float64(1), int64(1)
memory usage: 280.0+ bytes

By default, describe computes summary statistics for the numeric columns.

df.describe()

	ID	Height	Weight	Math
count	35.00000	35.000000	35.000000	35.000000
mean	1803.00000	174.142857	74.657143	61.351429
std	536.87741	13.541098	12.895377	19.915164
min	1101.00000	155.000000	53.000000	31.500000
25%	1204.50000	161.000000	63.000000	47.400000
50%	2103.00000	173.000000	74.000000	61.700000
75%	2301.50000	187.500000	82.000000	77.100000
max	2405.00000	195.000000	100.000000	97.000000

You can choose the percentiles yourself.

df.describe(percentiles=[.05, .25, .75, .95])

	ID	Height	Weight	Math
count	35.00000	35.000000	35.000000	35.000000
mean	1803.00000	174.142857	74.657143	61.351429
std	536.87741	13.541098	12.895377	19.915164
min	1101.00000	155.000000	53.000000	31.500000
5%	1102.70000	157.000000	56.100000	32.640000
25%	1204.50000	161.000000	63.000000	47.400000
50%	2103.00000	173.000000	74.000000	61.700000
75%	2301.50000	187.500000	82.000000	77.100000
95%	2403.30000	193.300000	97.600000	90.040000
max	2405.00000	195.000000	100.000000	97.000000

describe also works on non-numeric data.

df['Physics'].describe()

count     35
unique     7
top       B+
freq       9
Name: Physics, dtype: object

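5. idxmax and nlargest

idxmax returns the index label of the maximum value, which is especially handy in some situations; idxmin works analogously.

df['Math'].idxmax()   # returns the index label of the maximum

nlargest returns the n largest values; nsmallest works analogously.

df['Math'].nlargest(3)  # returns the index labels together with their values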

5     97.0
28    95.5
11    87.7
Name: Math, dtype: float64
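
Because idxmax returns a label rather than a value, it combines naturally with loc to fetch the whole row. A short sketch on made-up data; the string index labels are hypothetical and chosen only to make it visible that a label comes back.

```python
import pandas as pd

scores = pd.DataFrame(
    {"ID": [1101, 1102, 1103], "Math": [34.0, 87.2, 80.4]},
    index=["x", "y", "z"],  # hypothetical labels
)

best_label = scores["Math"].idxmax()   # index label of the maximum, not the value
best_row = scores.loc[best_label]      # full row for that label
top_two = scores.nlargest(2, "Math")   # DataFrame version: rows with the largest Math

print(best_label, best_row["ID"])
print(top_two.index.tolist())
```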

6. clip and replace

clip and replace are two kinds of replacement functions.

clip truncates values that fall above or below given bounds.

df['Math'].head()

0    34.0
1    32.5
2    87.2
3    80.4
4    84.8
Name: Math, dtype: float64

df['Math'].clip(33,80).head()  # values below 33 become 33, values above 80 become 80, values in between are unchanged

0    34.0
1    33.0
2    80.0
3    80.0
4    80.0
Name: Math, dtype: float64

df['Math'].mad()  # mean absolute deviation (this method was removed in pandas 2.0)

16.924244897959188
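
On pandas versions without Series.mad, the same quantity can be computed from its definition. The sketch below does that on the first five Math scores shown above, so its result differs from the full-column value; it also shows that clip accepts a single bound.

```python
import pandas as pd

math = pd.Series([34.0, 32.5, 87.2, 80.4, 84.8], name="Math")

# Mean absolute deviation: average distance of each value from the mean.
mean_abs_dev = (math - math.mean()).abs().mean()

# clip also accepts a single bound through the lower/upper keywords.
floor_only = math.clip(lower=33)

print(round(mean_abs_dev, 3))
print(floor_only.tolist())
```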

replace substitutes specific values.

df['Address'].head()

0    street_1
1    street_2
2    street_2
3    street_2
4    street_4
Name: Address, dtype: object

df['Address'].replace(['street_1','street_2'],['one','two']).head()

0         one
1         two
2         two
3         two
4    street_4
Name: Address, dtype: object

With a nested dictionary, the replacement can be applied directly to the whole table, column by column.

df.replace({'Address':{'street_1':'one','street_2':'two'}}).head()

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	S_1	C_1	1101	M	one	173	63	34.0	A+
1	S_1	C_1	1102	F	two	192	73	32.5	B+
2	S_1	C_1	1103	M	two	186	82	87.2	B+
3	S_1	C_1	1104	F	two	167	81	80.4	B-
4	S_1	C_1	1105	F	street_4	159	64	84.8	B+

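7. The apply function

apply is a very flexible function; we will come back to it in Chapter 3.

For a Series, it applies the function to each value in turn:

df['Math'].apply(lambda x:str(x)+'!').head()  # a lambda expression or a named function both work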

0    34.0!
1    32.5!
2    87.2!
3    80.4!
4    84.8!
Name: Math, dtype: object

For a DataFrame, it applies the function to each column in turn:

df.apply(lambda x:x.apply(lambda x:str(x)+'!')).head()  # a slightly more involved example that helps show how apply works

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	S_1!	C_1!	1101!	M!	street_1!	173!	63!	34.0!	A+!
1	S_1!	C_1!	1102!	F!	street_2!	192!	73!	32.5!	B+!
2	S_1!	C_1!	1103!	M!	street_2!	186!	82!	87.2!	B+!
3	S_1!	C_1!	1104!	F!	street_2!	167!	81!	80.4!	B-!
4	S_1!	C_1!	1105!	F!	street_4!	159!	64!	84.8!	B+!
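
apply on a DataFrame can also walk over rows instead of columns by passing axis=1. A minimal sketch on two made-up rows; the weight-to-height ratio is an arbitrary calculation chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Height": [173, 192], "Weight": [63, 73]})

# Default axis=0: the function receives each column as a Series.
col_ranges = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series.
ratio = df.apply(lambda row: row["Weight"] / row["Height"], axis=1)

# The same row-wise result, vectorized; usually much faster than apply.
ratio_vectorized = df["Weight"] / df["Height"]

print(col_ranges.to_dict())
print(ratio.round(3).tolist())
```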

IV. Sorting

1. Sorting by index

df.set_index('Math').head()  # set_index sets the index; it is covered in detail in the next chapter

	School	Class	ID	Gender	Address	Height	Weight	Physics
Math
34.0	S_1	C_1	1101	M	street_1	173	63	A+
32.5	S_1	C_1	1102	F	street_2	192	73	B+
87.2	S_1	C_1	1103	M	street_2	186	82	B+
80.4	S_1	C_1	1104	F	street_2	167	81	B-
84.8	S_1	C_1	1105	F	street_4	159	64	B+

df.set_index('Math').sort_index().head()  # the ascending parameter controls the order; the default is True (ascending)

	School	Class	ID	Gender	Address	Height	Weight	Physics
Math
31.5	S_1	C_3	1301	M	street_4	161	68	B+
32.5	S_1	C_1	1102	F	street_2	192	73	B+
32.7	S_2	C_3	2302	M	street_5	171	88	A
33.8	S_1	C_2	1204	F	street_5	162	63	B
34.0	S_1	C_1	1101	M	street_1	173	63	A+

2. Sorting by values

df.sort_values(by='Class')

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	S_1	C_1	1101	M	street_1	173	63	34.0	A+
19	S_2	C_1	2105	M	street_4	170	81	34.2	A
18	S_2	C_1	2104	F	street_5	159	97	72.2	B+
16	S_2	C_1	2102	F	street_6	161	61	50.6	B+
15	S_2	C_1	2101	M	street_7	174	84	83.3	C
17	S_2	C_1	2103	M	street_4	157	61	52.5	B-
1	S_1	C_1	1102	F	street_2	192	73	32.5	B+
2	S_1	C_1	1103	M	street_2	186	82	87.2	B+
3	S_1	C_1	1104	F	street_2	167	81	80.4	B-
4	S_1	C_1	1105	F	street_4	159	64	84.8	B+
6	S_1	C_2	1202	F	street_4	176	94	63.5	B-
24	S_2	C_2	2205	F	street_7	183	76	85.4	B
23	S_2	C_2	2204	M	street_1	175	74	47.2	B-
22	S_2	C_2	2203	M	street_4	155	91	73.8	A+
21	S_2	C_2	2202	F	street_7	194	77	68.5	B+
5	S_1	C_2	1201	M	street_5	188	68	97.0	A-
20	S_2	C_2	2201	M	street_5	193	100	39.1	B
9	S_1	C_2	1205	F	street_6	167	63	68.4	B-
8	S_1	C_2	1204	F	street_5	162	63	33.8	B
7	S_1	C_2	1203	M	street_6	160	53	58.8	A+
25	S_2	C_3	2301	F	street_4	157	78	72.3	B+
13	S_1	C_3	1304	M	street_2	195	70	85.2	A
12	S_1	C_3	1303	M	street_7	188	82	49.7	B
11	S_1	C_3	1302	F	street_1	175	57	87.7	A-
10	S_1	C_3	1301	M	street_4	161	68	31.5	B+
14	S_1	C_3	1305	F	street_5	187	69	61.7	B-
26	S_2	C_3	2302	M	street_5	171	88	32.7	A
27	S_2	C_3	2303	F	street_7	190	99	65.9	C
28	S_2	C_3	2304	F	street_6	164	81	95.5	A-
29	S_2	C_3	2305	M	street_4	187	73	48.9	B
32	S_2	C_4	2403	F	street_6	158	60	59.7	B+
33	S_2	C_4	2404	F	street_2	160	84	67.7	B
30	S_2	C_4	2401	F	street_2	192	62	45.3	A
31	S_2	C_4	2402	M	street_7	166	82	48.7	B
34	S_2	C_4	2405	F	street_6	193	54	47.6	B

Sorting by several columns: rows are ordered by the first key, and rows that tie on the first key are ordered by the second.

df.sort_values(by=['Address','Height'])

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	S_1	C_1	1101	M	street_1	173	63	34.0	A+
11	S_1	C_3	1302	F	street_1	175	57	87.7	A-
23	S_2	C_2	2204	M	street_1	175	74	47.2	B-
33	S_2	C_4	2404	F	street_2	160	84	67.7	B
3	S_1	C_1	1104	F	street_2	167	81	80.4	B-
2	S_1	C_1	1103	M	street_2	186	82	87.2	B+
1	S_1	C_1	1102	F	street_2	192	73	32.5	B+
30	S_2	C_4	2401	F	street_2	192	62	45.3	A
13	S_1	C_3	1304	M	street_2	195	70	85.2	A
22	S_2	C_2	2203	M	street_4	155	91	73.8	A+
17	S_2	C_1	2103	M	street_4	157	61	52.5	B-
25	S_2	C_3	2301	F	street_4	157	78	72.3	B+
4	S_1	C_1	1105	F	street_4	159	64	84.8	B+
10	S_1	C_3	1301	M	street_4	161	68	31.5	B+
19	S_2	C_1	2105	M	street_4	170	81	34.2	A
6	S_1	C_2	1202	F	street_4	176	94	63.5	B-
29	S_2	C_3	2305	M	street_4	187	73	48.9	B
18	S_2	C_1	2104	F	street_5	159	97	72.2	B+
8	S_1	C_2	1204	F	street_5	162	63	33.8	B
26	S_2	C_3	2302	M	street_5	171	88	32.7	A
14	S_1	C_3	1305	F	street_5	187	69	61.7	B-
5	S_1	C_2	1201	M	street_5	188	68	97.0	A-
20	S_2	C_2	2201	M	street_5	193	100	39.1	B
32	S_2	C_4	2403	F	street_6	158	60	59.7	B+
7	S_1	C_2	1203	M	street_6	160	53	58.8	A+
16	S_2	C_1	2102	F	street_6	161	61	50.6	B+
28	S_2	C_3	2304	F	street_6	164	81	95.5	A-
9	S_1	C_2	1205	F	street_6	167	63	68.4	B-
34	S_2	C_4	2405	F	street_6	193	54	47.6	B
31	S_2	C_4	2402	M	street_7	166	82	48.7	B
15	S_2	C_1	2101	M	street_7	174	84	83.3	C
24	S_2	C_2	2205	F	street_7	183	76	85.4	B
12	S_1	C_3	1303	M	street_7	188	82	49.7	B
27	S_2	C_3	2303	F	street_7	190	99	65.9	C
21	S_2	C_2	2202	F	street_7	194	77	68.5	B+
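
When sorting by several keys, ascending may be a list with one entry per key. A small sketch on four made-up rows:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Address": ["street_2", "street_1", "street_2", "street_1"],
        "Height": [167, 173, 192, 175],
    }
)

# Address ascending, and within each address Height descending.
ordered = df.sort_values(by=["Address", "Height"], ascending=[True, False])
print(ordered)

# sort_index accepts ascending as well.
by_index_desc = ordered.sort_index(ascending=False)
print(by_index_desc.index.tolist())
```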

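V. Questions and Exercises

1. Questions

[Question 1] Some functions were left out of the Commonly Used Basic Functions section because they are fairly simple. They are listed below; explain what each one does and try using it.

sum/mean/median/mad/min/max/abs/std/var/quantile/cummax/cumsum/cumprod

sum / mean / median / mean absolute deviation / minimum / maximum / absolute value / standard deviation / variance / quantile / cumulative maximum / cumulative sum / cumulative product

[Question 2] What does df.mean(axis=1) mean? Is its result the same as that of df.mean()? Do the functions in Question 1 also have an axis parameter? How is it used?

df.mean(axis=1) takes the mean of each row, whereas df.mean() takes the mean of each column (axis=0 is the default), so the results differ. The functions in Question 1 also accept axis, except abs, which works element by element: axis=0 computes per column and axis=1 per row.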
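
The effect of axis is easiest to see on a tiny frame. A sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_means = df.mean()            # axis=0: one value per column
row_means = df.mean(axis=1)      # axis=1: one value per row
row_cumsum = df.cumsum(axis=1)   # running total across the columns of each row

print(col_means.tolist())
print(row_means.tolist())
print(row_cumsum["b"].tolist())
```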

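2. Exercises

[Exercise 1] Given a dataset containing the script of the TV series Game of Thrones, answer the following questions:

(a) How many characters appear in the whole dataset? 564

(b) Counting by cell (that is, simply treating each cell as one line), who spoke the most? tyrion lannister, 1760 lines

(c) Counting by words, who spoke the most words?

data=pd.read_csv('data/Game_of_Thrones_Script.csv')
N=data['Name'].nunique()   # (a) number of distinct characters
data['Name'].value_counts()   # (b) number of lines per character, most frequent first
data['Name'].describe()
data['words']=data['Sentence'].map(lambda x:len(x.split(' ')))   # number of words in each line
words1=data.groupby('Name')['words'].sum()   # group by Name and sum the words column; returns a Series (1)
words2=data.groupby('Name')[['words']].sum()   # the same sum, returned as a DataFrame (2)
words1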

Name
a voice             5
addam marbrand      9
aemon             615
aeron             161
aerson             45
                 ... 
young hodor        87
young lyanna       26
young man          19
young ned          48
young rodrik        5
Name: words, Length: 564, dtype: int64

words1[words1==26009]  # (1) filter the Series for the largest total
words2[words2['words']==26009]  # (2) the same filter on the DataFrame

	words
Name
tyrion lannister	26009
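
As a variation on the approach above, the sketch below avoids hard-coding the maximum by using idxmax, and counts words with str.split(), which splits on any run of whitespace. The miniature data is hypothetical and only borrows the two column names Name and Sentence.

```python
import pandas as pd

# Hypothetical miniature of the script data.
data = pd.DataFrame(
    {
        "Name": ["speaker_a", "speaker_b", "speaker_a", "speaker_c"],
        "Sentence": ["one two three four", "one two", "one  two three", "one"],
    }
)

# split(' ') counts an empty string between two consecutive spaces.
naive = data["Sentence"].map(lambda x: len(x.split(" ")))

# str.split() with no argument treats any run of whitespace as one separator.
data["words"] = data["Sentence"].str.split().str.len()

words_per_name = data.groupby("Name")["words"].sum()
top_speaker = words_per_name.idxmax()

print(naive.tolist(), data["words"].tolist())
print(top_speaker, words_per_name.max())
```

Whether the two counting rules differ on the real script depends on how the sentences are spaced, which this sketch does not check.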

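[Exercise 2] Given a dataset of Kobe Bryant's shots, answer the following questions:

(a) Which combination of action_type and combined_shot_type occurs most often?

(b) Across all recorded game_id values, which opponent was faced most often?

data=pd.read_csv('data/Kobe_data.csv',index_col='shot_id')
data
# index_col uses the given column as the row index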

	action_type	combined_shot_type	game_event_id	game_id	lat	loc_x	loc_y	lon	minutes_remaining	period	...	shot_made_flag	shot_type	shot_zone_area	shot_zone_basic	shot_zone_range	team_id	team_name	game_date	matchup	opponent
shot_id
1	Jump Shot	Jump Shot	10	20000012	33.9723	167	72	-118.1028	10	1	...	NaN	2PT Field Goal	Right Side(R)	Mid-Range	16-24 ft.	1610612747	Los Angeles Lakers	2000/10/31	LAL @ POR	POR
2	Jump Shot	Jump Shot	12	20000012	34.0443	-157	0	-118.4268	10	1	...	0.0	2PT Field Goal	Left Side(L)	Mid-Range	8-16 ft.	1610612747	Los Angeles Lakers	2000/10/31	LAL @ POR	POR
3	Jump Shot	Jump Shot	35	20000012	33.9093	-101	135	-118.3708	7	1	...	1.0	2PT Field Goal	Left Side Center(LC)	Mid-Range	16-24 ft.	1610612747	Los Angeles Lakers	2000/10/31	LAL @ POR	POR
4	Jump Shot	Jump Shot	43	20000012	33.8693	138	175	-118.1318	6	1	...	0.0	2PT Field Goal	Right Side Center(RC)	Mid-Range	16-24 ft.	1610612747	Los Angeles Lakers	2000/10/31	LAL @ POR	POR
5	Driving Dunk Shot	Dunk	155	20000012	34.0443	0	0	-118.2698	6	2	...	1.0	2PT Field Goal	Center(C)	Restricted Area	Less Than 8 ft.	1610612747	Los Angeles Lakers	2000/10/31	LAL @ POR	POR
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
30693	Jump Shot	Jump Shot	397	49900088	33.9963	1	48	-118.2688	6	4	...	0.0	2PT Field Goal	Center(C)	In The Paint (Non-RA)	Less Than 8 ft.	1610612747	Los Angeles Lakers	2000/6/19	LAL vs. IND	IND
30694	Tip Shot	Tip Shot	398	49900088	34.0443	0	0	-118.2698	6	4	...	NaN	2PT Field Goal	Center(C)	Restricted Area	Less Than 8 ft.	1610612747	Los Angeles Lakers	2000/6/19	LAL vs. IND	IND
30695	Running Jump Shot	Jump Shot	426	49900088	33.8783	-134	166	-118.4038	3	4	...	1.0	2PT Field Goal	Left Side Center(LC)	Mid-Range	16-24 ft.	1610612747	Los Angeles Lakers	2000/6/19	LAL vs. IND	IND
30696	Jump Shot	Jump Shot	448	49900088	33.7773	31	267	-118.2388	2	4	...	0.0	3PT Field Goal	Center(C)	Above the Break 3	24+ ft.	1610612747	Los Angeles Lakers	2000/6/19	LAL vs. IND	IND
30697	Jump Shot	Jump Shot	471	49900088	33.9723	1	72	-118.2688	0	4	...	0.0	2PT Field Goal	Center(C)	In The Paint (Non-RA)	Less Than 8 ft.	1610612747	Los Angeles Lakers	2000/6/19	LAL vs. IND	IND

30697 rows × 24 columns

uni=data.groupby('action_type')['combined_shot_type'].value_counts()
uni.max()

uni.idxmax()

('Jump Shot', 'Jump Shot')

data[['game_id','opponent']].drop_duplicates()['opponent'].value_counts().index[0]  # keep one row per (game_id, opponent) pair, then take the most frequent opponent

'SAS'
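
Part (b) asks about games rather than shots, which is why the pairs are deduplicated before counting. The sketch below shows on a hypothetical six-shot table that the two ways of counting can disagree; it also answers part (a) with groupby(...).size(). Only the column names are taken from the dataset above.

```python
import pandas as pd

# Hypothetical miniature: four shots in one game against POR, one shot in each of two games against SAS.
data = pd.DataFrame(
    {
        "game_id": [1, 1, 1, 1, 2, 3],
        "opponent": ["POR", "POR", "POR", "POR", "SAS", "SAS"],
        "action_type": ["Jump Shot", "Jump Shot", "Jump Shot", "Driving Dunk Shot", "Jump Shot", "Tip Shot"],
        "combined_shot_type": ["Jump Shot", "Jump Shot", "Jump Shot", "Dunk", "Jump Shot", "Tip Shot"],
    }
)

# (a) Size of each (action_type, combined_shot_type) group; idxmax gives the pair.
pair_counts = data.groupby(["action_type", "combined_shot_type"]).size()
top_pair = pair_counts.idxmax()

# (b) Counting rows counts shots, not games.
per_shot_top = data["opponent"].value_counts().idxmax()

# Keeping one row per game first counts each game once.
games = data[["game_id", "opponent"]].drop_duplicates()
per_game_top = games["opponent"].value_counts().idxmax()

print(top_pair, per_shot_top, per_game_top)
```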
