{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# pandas数据运算与数据统计\n",
"\n",
"## 1.算术运算\n",
"\n",
"Pandas 库中的算术运算是数据集之间根据索引进行运算。如果存在不同的索引对,运算的结果是相同索引数据和不同索引数据的并集,并且用 NaN 填充运算结果。\n",
"当进行对应索引列之间的加减乘除运算时,要注意对应索引列之间的类型是否相同,如果不同类型进行运算时会报类型转换错误,同时,要考虑到该类型是否能进行某种算术运算,比如字符串之间的加法运算是可以的,但是字符串之间是不存在减法、乘法和除法运算的,否则会报错。\n",
"\n",
"Pandas 库中的算术运算可以使用算术运算符,也可以使用算数运算函数,常用的算数运算函数如下表。\n",
"\n",
"| 算术运算函数 | 用法介绍(以DataFrame为例) | 描述 |\n",
"| :------------: | :-------------------------: | :------------------------- |\n",
"| add() | df1.add(df2) | df1与df2进行加法运算,df1+df2 |\n",
"| radd() | df1.radd(df2) | df2与df1进行加法运算,df2+df1 |\n",
"| sub() | df1.sub(df2) | 用df1减df2,df1-df2 |\n",
"| rsub() | df1.rsub(df2) | 用df2减df1,df2-df1 |\n",
"| mul() | df1.mul(df2) | df1与df2进行乘法运算,df1\\*df2 |\n",
"| rmul() | df1.rmul(df2) | df2与df1进行乘法运算,df2\\*df1 |\n",
"| div() | df1.div(df2) | 用df1除df2,df1/df2 |\n",
"| rdiv() | df1.rdiv(df2) | 用df2除df1,df2/df1 |\n",
"| truediv() | df1.truediv(df2) | 用df1除df2,df1/df2 |\n",
"| rtruediv() | df1.rtruediv(df2) | 用df2除df1,df2/df1 |\n",
"| floordiv() | df1.floordiv(df2) | 用df1除df2,取整除,df1//df2 |\n",
"| rfloordiv() | df1.rfloordiv(df2) | 用df2除df1,取整除,df2//df1 |\n",
"| mod() | df1.mod(df2) | 用df1除df2,取余数,df1%df2 |\n",
"| rmod() | df1.rmod(df2) | 用df2除df1,取余数,df2%df1 |\n",
"| pow() | df1.pow(df2) | 计算df1的df2次方,df1^df2 |\n",
"| rpow() | df1.rpow(df2) | 计算df2的df1次方,df2^df1 |\n",
"\n",
"在Pandas中,这些运算的用法和规则都相同,运算结果的数据结构也都相同,所以下面仅以加法举例,其他运算大家可自行尝试。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pip install pandas"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"df1 = pd.DataFrame(np.arange(16).reshape((4,4)), index=['i1', 'i2', 'i3', 'i4'], columns=['c1', 'c2', 'c3', 'c4'])\n",
"df2 = pd.DataFrame(np.eye(4), index=['i1', 'i2', 'i3', 'i4'], columns=['c1', 'c2', 'c3', 'c4'])\n",
"df3 = pd.DataFrame(np.arange(9).reshape((3,3)), index=['i1', 'i3', 'i5'], columns=['c2', 'c4', 'c6'])\n",
"s1 = pd.Series(np.logspace(1,4,4), index=['i1', 'i2', 'i3', 'i4'])\n",
"s2 = pd.Series(np.logspace(1,4,4), index=['c1', 'c2', 'c3', 'c4'])\n",
"s3 = pd.Series(np.logspace(1,2,2), index=['i1', 'c4'])\n",
"print('-'*10+'df1'+'-'*10)\n",
"print(df1)\n",
"print('-'*10+'df2'+'-'*10)\n",
"print(df2)\n",
"print('-'*10+'df3'+'-'*10)\n",
"print(df3)\n",
"print('-'*10+'df1+10'+'-'*10)\n",
"print(df1+10) # DataFrame与标量运算,每个元素与标量做相同运算,得到一个新的DataFrame\n",
"print('-'*10+'df1+df2'+'-'*10)\n",
"print(df1+df2) # 两个形状和对应的索引都一样的DataFrame相加,直接将对应位置的数据相加,得到一个新的DataFrame\n",
"print('-'*10+'df1.add(df3)'+'-'*10)\n",
"# 两个的形状或索引不完全一样DataFrame相加,只会将两个DataFrame中行索引和列索引对应的数据相加,\n",
"#生成一个形状能兼容两个DataFrame的新DataFrame,其他没有运算结果的位置填充空值(NaN)。\n",
"print(df1.add(df3)) \n",
"print('-'*10+'df1.add(s1, axis=0)'+'-'*10)\n",
"# Series的索引与DataFrame的行索引对应,可以将axis参数设置成0或'index',\n",
"# Series依次与DataFrame中的每一列数据进行运算,得到一个新的DataFrame。\n",
"print(df1.add(s1, axis=0)) \n",
"print('-'*10+'df1.add(s2)'+'-'*10)\n",
"# Series的索引与DataFrame的列索引对应,此时axis参数设置成0(默认值)或'columns',\n",
"# Series依次与DataFrame中的每一行数据进行运算,得到一个新的DataFrame。\n",
"print(df1.add(s2))\n",
"print('-'*10+'df1.add(s3)'+'-'*10)\n",
"# Series的行索引与DataFrame的行索引或列索引不完全相同,根据axis取值,\n",
"# 会得到一个形状能兼容DataFrame和Series的新DataFrame。\n",
"print(df1.add(s3, axis='index'))\n",
"print(df1.add(s3, axis='columns'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"images/ch8/12.png\" style=\"zoom:120%;\">\n",
"\n",
"\n",
"## 2. 关系运算与逻辑运算\n",
"\n",
"pandas中的关系运算与Numpy中类似,将返回一个由布尔值组成的Series或DataFrame对象,主要用于条件过滤,即根据条件筛选出符合要求的数据,\n",
"\n",
"pandas的逻辑运算同样主要用于条件过滤,但与Python基础语法中的逻辑运算存在一些差异。进行或(|)、与(&)、非(~)运算时,各个独立逻辑表达式需要用括号。\n",
"\n",
"有关通过比较运算和逻辑运算进行数据筛选,在‘DataFrame的基本操作’中已经举例介绍过,此处不再赘述。\n",
"\n",
"## 3. 常用统计分析方法\n",
"\n",
"pandas提供了一些数据统计分析的函数,这些函数使用于Series和DataFrame类型。下面以文件‘score_3.csv’中的学生成绩分析为例,介绍一些常用的方法。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"数据下载: \n",
"<a href=\"images/ch8/scoregroup.csv\" target=\"_blank\">scoregroup.csv</a> \n",
"<a href=\"images/ch8/score_3.csv\" target=\"_blank\">score_3.csv</a> \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pip install pandas\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>姓名</th>\n",
" <th>学号</th>\n",
" <th>高数</th>\n",
" <th>英语</th>\n",
" <th>Python</th>\n",
" <th>物理</th>\n",
" <th>Java</th>\n",
" <th>C语言</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>罗明</td>\n",
" <td>121701100510</td>\n",
" <td>95</td>\n",
" <td>85</td>\n",
" <td>96</td>\n",
" <td>88</td>\n",
" <td>85</td>\n",
" <td>95</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>朱佳</td>\n",
" <td>121701100511</td>\n",
" <td>85</td>\n",
" <td>86</td>\n",
" <td>90</td>\n",
" <td>70</td>\n",
" <td>88</td>\n",
" <td>85</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>王雪</td>\n",
" <td>121701100515</td>\n",
" <td>80</td>\n",
" <td>90</td>\n",
" <td>75</td>\n",
" <td>85</td>\n",
" <td>98</td>\n",
" <td>95</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>李思</td>\n",
" <td>121701100513</td>\n",
" <td>78</td>\n",
" <td>92</td>\n",
" <td>85</td>\n",
" <td>72</td>\n",
" <td>95</td>\n",
" <td>75</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>蒋维</td>\n",
" <td>121701100519</td>\n",
" <td>99</td>\n",
" <td>88</td>\n",
" <td>65</td>\n",
" <td>80</td>\n",
" <td>85</td>\n",
" <td>75</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 姓名 学号 高数 英语 Python 物理 Java C语言\n",
"0 罗明 121701100510 95 85 96 88 85 95\n",
"1 朱佳 121701100511 85 86 90 70 88 85\n",
"2 王雪 121701100515 80 90 75 85 98 95\n",
"3 李思 121701100513 78 92 85 72 95 75\n",
"4 蒋维 121701100519 99 88 65 80 85 75"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"score_df = pd.read_csv('images/ch8/score_3.csv')\n",
"score_df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"使用describe()方法,可以获得Series或DataFrame的统计变量。可方便观察这一系列数据的范围、大小、波动趋势等等,为后面的分析打下基础。函数原型为:\n",
"```python\n",
"DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)\n",
"```\n",
"\n",
"* percentiles:设定数值型特征的统计量,默认\\[.25, .5, .75\\],返回25%,50%,75%时候的数据,可修改\n",
"* include:默认只输出数值型数据的统计信息,设置参数为'all'则输入的所有列都在输出中,设置为O则只输出离散型变量的统计信息。\n",
"* exclude:与include参数相反,用于设置不输出哪些类型。默认不排除任何数据"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>学号</th>\n",
" <th>高数</th>\n",
" <th>英语</th>\n",
" <th>Python</th>\n",
" <th>物理</th>\n",
" <th>Java</th>\n",
" <th>C语言</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>5.000000e+00</td>\n",
" <td>5.0000</td>\n",
" <td>5.000000</td>\n",
" <td>5.000000</td>\n",
" <td>5.000000</td>\n",
" <td>5.000000</td>\n",
" <td>5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>1.217011e+11</td>\n",
" <td>87.4000</td>\n",
" <td>88.200000</td>\n",
" <td>82.200000</td>\n",
" <td>79.000000</td>\n",
" <td>90.200000</td>\n",
" <td>85.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>3.577709e+00</td>\n",
" <td>9.2358</td>\n",
" <td>2.863564</td>\n",
" <td>12.316655</td>\n",
" <td>7.874008</td>\n",
" <td>5.974948</td>\n",
" <td>10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>1.217011e+11</td>\n",
" <td>78.0000</td>\n",
" <td>85.000000</td>\n",
" <td>65.000000</td>\n",
" <td>70.000000</td>\n",
" <td>85.000000</td>\n",
" <td>75.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>1.217011e+11</td>\n",
" <td>80.0000</td>\n",
" <td>86.000000</td>\n",
" <td>75.000000</td>\n",
" <td>72.000000</td>\n",
" <td>85.000000</td>\n",
" <td>75.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>1.217011e+11</td>\n",
" <td>85.0000</td>\n",
" <td>88.000000</td>\n",
" <td>85.000000</td>\n",
" <td>80.000000</td>\n",
" <td>88.000000</td>\n",
" <td>85.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>1.217011e+11</td>\n",
" <td>95.0000</td>\n",
" <td>90.000000</td>\n",
" <td>90.000000</td>\n",
" <td>85.000000</td>\n",
" <td>95.000000</td>\n",
" <td>95.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>1.217011e+11</td>\n",
" <td>99.0000</td>\n",
" <td>92.000000</td>\n",
" <td>96.000000</td>\n",
" <td>88.000000</td>\n",
" <td>98.000000</td>\n",
" <td>95.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 学号 高数 英语 Python 物理 Java C语言\n",
"count 5.000000e+00 5.0000 5.000000 5.000000 5.000000 5.000000 5.0\n",
"mean 1.217011e+11 87.4000 88.200000 82.200000 79.000000 90.200000 85.0\n",
"std 3.577709e+00 9.2358 2.863564 12.316655 7.874008 5.974948 10.0\n",
"min 1.217011e+11 78.0000 85.000000 65.000000 70.000000 85.000000 75.0\n",
"25% 1.217011e+11 80.0000 86.000000 75.000000 72.000000 85.000000 75.0\n",
"50% 1.217011e+11 85.0000 88.000000 85.000000 80.000000 88.000000 85.0\n",
"75% 1.217011e+11 95.0000 90.000000 90.000000 85.000000 95.000000 95.0\n",
"max 1.217011e+11 99.0000 92.000000 96.000000 88.000000 98.000000 95.0"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"score_df.describe() # 默认只输出数值型数据的统计信息"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"score_df.describe(exclude=['学号']) # 输出所有列的统计信息"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Python</th>\n",
" <th>Java</th>\n",
" <th>C语言</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>5.000000</td>\n",
" <td>5.000000</td>\n",
" <td>5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>82.200000</td>\n",
" <td>90.200000</td>\n",
" <td>85.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>12.316655</td>\n",
" <td>5.974948</td>\n",
" <td>10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>65.000000</td>\n",
" <td>85.000000</td>\n",
" <td>75.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>30%</th>\n",
" <td>77.000000</td>\n",
" <td>85.600000</td>\n",
" <td>77.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>85.000000</td>\n",
" <td>88.000000</td>\n",
" <td>85.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>70%</th>\n",
" <td>89.000000</td>\n",
" <td>93.600000</td>\n",
" <td>93.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>90%</th>\n",
" <td>93.600000</td>\n",
" <td>96.800000</td>\n",
" <td>95.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>96.000000</td>\n",
" <td>98.000000</td>\n",
" <td>95.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Python Java C语言\n",
"count 5.000000 5.000000 5.0\n",
"mean 82.200000 90.200000 85.0\n",
"std 12.316655 5.974948 10.0\n",
"min 65.000000 85.000000 75.0\n",
"30% 77.000000 85.600000 77.0\n",
"50% 85.000000 88.000000 85.0\n",
"70% 89.000000 93.600000 93.0\n",
"90% 93.600000 96.800000 95.0\n",
"max 96.000000 98.000000 95.0"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"score_df[['Python', 'Java', 'C语言']].describe(percentiles=[0.3, 0.7, 0.9]) # 输出指定列的统计信息,并设置percenttiles"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"统计数据的含义如下:\n",
"\n",
"|名称|含义|名称|含义|\n",
"|:-----:|:-----|:-----:|:------|\n",
"|top|最常见的值|freq|标识最常见的值的出现频次|\n",
"|count|有效值数量|unipue|不同的值数量|\n",
"|std|标准差|min|最小值|\n",
"|25%|四分之一分位数|50%|二分之一分位数|\n",
"|75%|四分之三分位数|max|最大值|\n",
"|mean|均值||\n",
"\n",
"其他常用统计分析方法见下表。\n",
"\n",
"| 方法 | 功能 | 方法 | 功能 |\n",
"| :-----------: | :--------------------------- | :------: | :----------- |\n",
"| describe() | 一次性输出多个描述性统计指标 | count() | 非空元素计算 |\n",
"| min() | 最小值 | max() | 最大值 |\n",
"| idxmin() | 最小值的位置 | idxmax() | 最大值的位置 |\n",
"| quantile(0.1) | 10%分位数 | sum() | 求和 |\n",
"| mean() | 均值 | median() | 中位数 |\n",
"| mode() | 众数 | var() | 方差 |\n",
"| std() | 标准差 | mad() | 平均绝对偏差 |\n",
"| skew() | 偏度 | kurt() | 峰度 |\n",
"| abs() | 求绝对值 | prod() | 元素乘积 |\n",
"| cumsum() | 累计和 | cumprod() | 累计乘积 |\n",
"\n",
"大部分方法的功能与Numpy中的函数功能类似,下面结合例子简单介绍。"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Python总分: 411 Python平均分: 82.2\n",
"Python最高分: 罗明 96\n",
"Python成绩标准差: 12.31665539016173\n"
]
}
],
"source": [
"print('Python总分:', score_df['Python'].sum(), 'Python平均分:', score_df['Python'].mean()) # 求Python课程的总分和平均分\n",
"print('Python最高分:', score_df['姓名'][score_df['Python'].idxmax()], score_df['Python'].max()) # 求Python课程的总分和平均分\n",
"print('Python成绩标准差:', score_df['Python'].std()) # 默认ddof=1,无偏方差,与Numpy不同。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. 排序与排名\n",
"\n",
"pandas中的常用的排序方法由两个,分别是按索引排序方法sort_index()和按值排序方法sort_values()。函数原型为:\n",
"```python\n",
"DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None)\n",
"```\n",
"\n",
"* axis:0按照行名排序;1按照列名排序\n",
"* ascending:默认True升序排列;False降序排列\n",
"* inplace:默认False,否则排序之后的数据直接替换原来的数据框\n",
"* kind:默认quicksort,排序的方法\n",
"* na_position:缺失值默认排在最后{\"first\",\"last\"}\n",
"\n",
"```python\n",
"DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)\n",
"```\n",
"* axis:{0 or ‘index’, 1 or ‘columns’}, default 0,默认按照索引排序,即纵向排序,如果为1,则是横向排序 \n",
"* by:str or list of str;如果axis=0,那么by=\"列名\";如果axis=1,那么by=\"行名\"; \n",
"* ascending:布尔型,True则升序,可以是\\[True,False\\],即第一字段升序,第二个降序 \n",
"* inplace:布尔型,是否用排序后的数据框替换现有的数据框 \n",
"* kind:排序方法,{‘quicksort’, ‘mergesort’, ‘heapsort’}, default ‘quicksort’。 \n",
"* na_position : {‘first’, ‘last’}, default ‘last’,默认缺失值排在最后面 "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"score_df.sort_index(axis=1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"score_df.sort_values(by=['C语言','Python'], ascending=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://www.educoder.net/api/attachments/3517085?type=image/png\" style=\"zoom:50%;\">\n",
"\n",
"rank()方法可用于沿着某个轴(0或者1)计算对象的排名,并返回排名结果。函数原型为:\n",
"```python\n",
"DataFrame.rank(axis=0, method='average', numeric_only=NoDefault.no_default, na_option='keep', ascending=True, pct=False)\n",
"```\n",
"* axis:设置沿着哪个轴计算排名(0或者1)\n",
"* numeric_only:是否仅仅计算数字型的columns,布尔值\n",
"* na_option:NaN值是否参与排序及如何排序(‘keep’,‘top',’bottom')\n",
"* ascending:设定升序排还是降序排\n",
"* pct:是否以排名的百分比显示排名(所有排名与最大排名的百分比)\n",
"* method:取值可以为'average','first','min', 'max','dense'。average,在相等分组中,为各个值分配平均排名 (默认);min,使用整个分组的最小排名;max,使用整个分组的最大排名;first,按值在原始数据中的出现顺序分配排名;dense,类似 min,但是组之间的等级总是增加1。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"score_df.rank()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://www.educoder.net/api/attachments/3517314?type=image/png\" style=\"zoom:50%;\">"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"python_score_df = score_df[['姓名', '学号', 'Python']].copy() # 提取所有学生Python成绩的副本\n",
"python_score_df['排名'] = score_df['Python'].rank(method='min', ascending=False) # 增加一列python成绩排名\n",
"python_score_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. 计算相关系数\n",
"\n",
"corr()方法检查指定列之间的相关系数(变化趋势的方向以及程度),返回值各列之间的相关系数DataFrame。值范围-1到+1,0表示两个变量不相关,正值表示正相关,负值表示负相关,值越大相关性越强。函数原型为:\n",
"```python\n",
"DataFrame.corr(method='pearson', min_periods=1)\n",
"```\n",
"* method:可选值pearson、kendall或spearman。pearson即皮尔森相关性系数,针对线性数据的相关系数计算,针对非线性数据便会有误差;kendall即肯德尔秩相关系数,用于反映分类变量相关性的指标,即针对无序序列的相关系数,非正太分布的数据;spearman即斯皮尔曼等级相关系数,主要用于非线性的、非正态分布的数据的相关系数。\n",
"* min_periods:样本最少的数据量"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.set_option('display.unicode.east_asian_width', True) # 设置输出列对齐\n",
"score_df['学号'] = score_df['学号'].astype(str)\n",
"print(score_df)\n",
"score_df.corr()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. 数据分组统计\n",
"\n",
"在数据分析中,经常会遇到这样的情况:根据某一列(或多列)标签把数据划分为不同的组别,然后再对其进行数据分析。在 Pandas 中,要完成数据的分组操作,需要使用groupby()方法。其函数原型为:\n",
"```python\n",
"DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=NoDefault.no_default, observed=False, dropna=True)\n",
"```\n",
"* by:用于确定groupby的组。如果by是一个函数,它会在行位置索引的每个值上调用。如果传递了dict或Series,则Series或dict VALUES将用于确定组。\n",
"* axis:沿行 (0) 或列 (1) 拆分。\n",
"* level:如果轴是多级标签索引,则按一个或多个特定级别进行分组。\n",
"* as_index:对于聚合输出,返回具有组标签作为索引的对象。\n",
"* sort:对组键进行排序。 关闭此功能可获得更好的性能。 \n",
"* group_keys:当调用apply时,将组键添加到index以识别片段。\n",
"* squeeze:如果可能,降低返回类型的维数,否则返回一致的类型。\n",
"* observed:如果为True,仅显示分类分组的观察值;如果为 False,显示分类分组的所有值。\n",
"* dropna:如果为True,并且组键包含NA值,则NA值连同行/列将被删除;如果为False,NA值也将被视为组中的键。\n",
"\n",
"下面以文件“scoregroup.csv”为例讲解group()方法的使用,文件部分内容如下图。\n",
"<img src=\"https://www.educoder.net/api/attachments/3518471?type=image/png\" style=\"zoom:50%;\">"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"score_df = pd.read_csv('images/ch8/scoregroup.csv',dtype={'学号':str}, encoding='utf-8')\n",
"score_groups = score_df.groupby('学号') # 按学号分组\n",
"score_groups"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"返回的结果是一个DataFrameGroupBy对象,可通过调用groups属性查看分组结果,其值是一个字典,字典键是用来分类的数据,值是分类对应的行索引值。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"score_groups.groups"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"DataFrameGroupBy对象的get_group()方法可以获取分组之后指定组的数据。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"score_groups.get_group('0121701100521')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"score_groups_by2 = score_df.groupby(['学号', '姓名']) # 按学号和姓名两列分组\n",
"score_groups_by2.groups"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"score_groups_by2.get_group(('0121701100623', '孙伟'))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 按函数返回值分组,设置学号位行索引,按学号倒数3、4位班级号分组\n",
"score_groups_by_func = score_df.set_index('学号').groupby(lambda x:x[-4:-2]) \n",
"score_groups_by_func.groups"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在得到DataFrameGroupBy对象后,我们就可以根据需要进行相应的统计。\n",
"\n",
"* ### 在DataFrameGroupBy对象后直接加统计分析聚合函数"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"score_groups_by2 = score_df.groupby(['学号', '姓名']) # 按学号和姓名两列分组\n",
"score_groups_by2.sum() # 统计每个学生的总分"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"score_groups_by_course = score_df.groupby('课程名') # 按课程分组\n",
"score_groups_by_course.mean() # 统计每门课的平均分"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* ### 使用agg()方法对DataFrameGroupBy对象应用多个聚合函数\n",
"agg()的功能更加强大,除了可以向agg()函数中传入聚合函数外,也常用列表、字典等形式作为参数。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"score_groups_by_course = score_df.drop(['学号', '姓名'], axis=1).groupby('课程名') # 按课程分组\n",
"score_groups_by_course.agg(['mean', 'median', 'max', 'min']) # 统计每门课的平均分、中位数、最高分和最低分"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"score_groups_by_course = score_df.groupby('课程名') # 按课程分组\n",
" # 传字典,每列统计不同数据\n",
"score_groups_by_course.agg({'分数':['mean', 'sum'], '学号':['count']}) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* ### 使用apply()方法对DataFrameGroupBy对象应用自定义函数"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 统计指定分值区间内的人数\n",
"def count_by_interval(df, start=0, end=101):\n",
" return ((df['分数'] >= start) & (df['分数'] < end)).sum()\n",
"\n",
"score_groups_by_course = score_df.groupby('课程名') # 按课程分组\n",
" # 传字典,每列统计不同数据\n",
"score_groups_by_course.apply(count_by_interval, end=60) # 统计每门课的不及格人数"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* #### 使用filter()方法对DataFrameGroupBy对象进行数据筛选\n",
"根据定义的条件函数过滤数据,并返回一个新的数据集。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"score_groups_by2 = score_df.groupby(['学号', '姓名']) # 按学号和姓名两列分组\n",
"score_groups_by2.filter(lambda df:df['分数'].mean()>=85 ) # 返回平均分在85分以上的学生构成的新数据集"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}