{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2.3 Regression Analysis"
   ]
  },
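  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Regression analysis**: the study of how different variables are related to one another.   \n",
    "**Regression model**: a model that describes the relationship between variables."
   ]
  },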
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center><video src=\"http://files.momodel.cn/regression_intro.mp4\" controls=\"controls\" width=800px></video></center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.3.1 Basic Concepts of Regression Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center><video src=\"http://files.momodel.cn/regression_basic_concept.mp4\" controls=\"controls\" width=800px></video></center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Data**: The table below lists the carbon dioxide concentration measured at Mauna Loa every 5 years from 1970 to 2005, in parts per million (ppm).\n",
    "\n",
    "<table>\n",
    "    <h4 align=\"center\">CO<sub>2</sub> concentration at Mauna Loa every 5 years, 1970 to 2005</h4>\n",
    "<tbody>\n",
    "    <tr>\n",
    "        <th align=\"left\">Year $x$</th>\n",
    "        <td align=\"center\">1970</td>\n",
    "        <td align=\"center\">1975</td>\n",
    "        <td align=\"center\">1980</td>\n",
    "        <td align=\"center\">1985</td>\n",
    "        <td align=\"center\">1990</td>\n",
    "        <td align=\"center\">1995</td>\n",
    "        <td align=\"center\">2000</td>\n",
    "        <td align=\"center\">2005</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th align=\"left\">$CO_2$ (ppm) $y$</th>\n",
    "        <td align=\"center\">325.68</td>\n",
    "        <td align=\"center\">331.15</td>\n",
    "        <td align=\"center\">338.69</td>\n",
    "        <td align=\"center\">345.90</td>\n",
    "        <td align=\"center\">354.19</td>\n",
    "        <td align=\"center\">360.88</td>\n",
    "        <td align=\"center\">369.48</td>\n",
    "        <td align=\"center\">379.67</td>\n",
    "    </tr>\n",
    "</tbody>\n",
    "</table>\n",
    "\n",
    "\n",
    "**Goal**: Analyze the relationship between the year and the CO<sub>2</sub> concentration, and use it to predict the concentration in 2010."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "plt.rcParams['figure.dpi'] = 150\n",
    "\n",
    "# Observed data: year (x) and CO2 concentration in ppm (y)\n",
    "x = np.array([1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005])\n",
    "y = np.array([325.68, 331.15, 338.69, 345.90, 354.19, 360.88, 369.48, 379.67])\n",
    "\n",
    "# Scatter plot of the raw data\n",
    "fig = plt.figure()\n",
    "plt.xlabel(\"Year\")\n",
    "plt.ylabel(\"CO2 (ppm)\")\n",
    "plt.scatter(x, y, c='r')\n",
    "plt.show()"
   ]
  },
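  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The CO<sub>2</sub> concentration in this region rises slowly and steadily from year to year, so we use a simple **linear model** to describe the relationship between the year and the concentration: CO<sub>2</sub> concentration = $a$ × year + $b$. \n",
    "\n",
    "Let the year be $x$ and the CO<sub>2</sub> concentration be $y$; then $y = ax + b$.\n",
    "\n",
    "The values of $a$ and $b$ are determined from the data above. Once they are known, plugging in any year gives an estimate of the CO<sub>2</sub> concentration for that year.\n"
   ]
  },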
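  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sanity check before fitting, the slope can be approximated from the first and last data points alone. This is only a back-of-the-envelope estimate, not the least-squares value:\n",
    "\n",
    "$$a \\approx \\frac{379.67 - 325.68}{2005 - 1970} = \\frac{53.99}{35} \\approx 1.54 \\ \\text{ppm per year}$$\n",
    "\n",
    "A fitted slope that is far from this value would suggest a mistake in the calculation."
   ]
  },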
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.3.2 Computing the Parameters in Regression Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center><video src=\"http://files.momodel.cn/regression_solve_params.mp4\" controls=\"controls\" width=800px></video></center>"
   ]
  },
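  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The simplest linear regression is the **simple (one-variable) linear regression model**. It involves a single independent variable $x$ and a single dependent variable $y$, and assumes a linear relationship $y=ax+b$ between them.\n",
    "\n",
    "To solve for $a$ and $b$, we need a set of $(x,y)$ data points from which the two parameters are computed."
   ]
  },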
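  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In simple linear regression, the key question is how to choose $a$ and $b$ so that the error is minimized.\n",
    "\n",
    "The best-fit line $y=ax+b$ should lie close to all 8 sample points. Ideally every sample point would fall exactly on the line, but that is unrealistic. Instead, we make the points as close to the line as possible, where closeness is measured by the difference between the predicted value and the actual value."
   ]
  },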
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Think about it**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What are the predicted value, the actual value, and the residual?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Hands-on practice**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Write code to solve for $a$ and $b$ using the formulas given in the textbook."
   ]
  },
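  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The textbook formulas are not reproduced in this notebook. Assuming they are the standard ordinary least-squares result, a reminder is sketched here. For each data point the residual is $e_i = y_i - (a x_i + b)$, and the parameters are chosen to minimize the sum of squared residuals $\\sum_{i=1}^{n} e_i^2$. Setting the derivatives with respect to $a$ and $b$ to zero gives:\n",
    "\n",
    "$$a = \\frac{\\sum_{i=1}^{n} (x_i - \\bar{x})(y_i - \\bar{y})}{\\sum_{i=1}^{n} (x_i - \\bar{x})^2}, \\qquad b = \\bar{y} - a\\bar{x}$$\n",
    "\n",
    "where $\\bar{x}$ and $\\bar{y}$ are the sample means of $x$ and $y$. If the textbook writes the formula in a different but equivalent form, follow the textbook; both should give the same $a$ and $b$."
   ]
  },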
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def cal_a_b(x, y):\n",
    "    \"\"\"\n",
    "    Compute the linear coefficients relating x and y.\n",
    "    :param x: independent variable as a NumPy array\n",
    "    :param y: dependent variable as a NumPy array\n",
    "    :return: coefficients a and b\n",
    "    \"\"\"\n",
    "    # TODO: write the code that solves for the parameters a and b\n",
    "    return a, b\n",
    "\n",
    "a, b = cal_a_b(x, y)\n",
    "print(a, b)"
   ]
  },
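  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In summary: what is the resulting simple linear regression model for predicting the CO<sub>2</sub> concentration at Mauna Loa?  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot the fitted line using the solved parameters."
   ]
  },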
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Build the line y = ax + b\n",
    "x_predict = np.linspace(1965, 2010, 1000)\n",
    "y_predict = a * x_predict + b\n",
    "\n",
    "# Plot the data points and the fitted line\n",
    "fig = plt.figure()\n",
    "plt.xlabel(\"Year\")\n",
    "plt.ylabel(\"CO2 (ppm)\")\n",
    "plt.scatter(x, y, c='r')\n",
    "plt.plot(x_predict, y_predict, c='b')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, estimate the CO<sub>2</sub> concentration in this region for years before 1970 and after 2005."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# For example, predict the CO2 concentration in 2015\n",
    "a * 2015 + b"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fill in your final predictions in the table below:  \n",
    "\n",
    "<table>\n",
    "<tbody>\n",
    "    <tr>\n",
    "        <th align=\"left\">Year $x$</th>\n",
    "        <td align=\"center\">1960</td>\n",
    "        <td align=\"center\">1965</td>\n",
    "        <td align=\"center\">1970-2005</td>\n",
    "        <td align=\"center\">2010</td>\n",
    "        <td align=\"center\">2015</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th align=\"left\">$CO_2$ (ppm) $y$</th>\n",
    "        <td align=\"center\">  </td>\n",
    "        <td align=\"center\"> </td>\n",
    "        <td align=\"center\">existing data</td>\n",
    "        <td align=\"center\"> </td>\n",
    "        <td align=\"center\"> </td>\n",
    "    </tr>\n",
    "</tbody>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extended Content\n",
    "\n",
    "**1. Building a regression model with the sklearn toolkit**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also solve the problem above with the sklearn toolkit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the required packages\n",
    "import numpy as np\n",
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "# Define the data as column vectors (sklearn expects 2-D feature arrays)\n",
    "x = np.array([1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005]).reshape(-1,1)\n",
    "y = np.array([325.68, 331.15, 338.69, 345.90, 354.19, 360.88, 369.48, 379.67]).reshape(-1,1)\n",
    "\n",
    "# Build the model\n",
    "reg = LinearRegression()\n",
    "\n",
    "# Train the model on the data\n",
    "reg.fit(x, y)\n",
    "\n",
    "# Print the model parameters: slope (a) and intercept (b)\n",
    "print(reg.coef_)\n",
    "print(reg.intercept_)"
   ]
  },
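  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**2. Gradient descent**\n",
    "\n",
    "In the example above, different values of the parameters a and b produce different residuals. A single number that summarizes the residuals, such as the sum of their squares, is called the cost function.\n",
    "\n",
    "Our goal is to choose the parameters a and b so that the value of this cost function is as small as possible.\n",
    "\n",
    "Gradient descent is an algorithm for finding the minimum of a function, and we can use it to find the minimum of the cost function $J(\\theta_{0}, \\theta_{1})$. \n",
    "\n",
    "The idea behind gradient descent is as follows. We start from a randomly chosen combination of parameters $(\\theta_{0},\\theta_{1},\\ldots,\\theta_{n})$ and evaluate the cost function, then move to the nearby combination that decreases the cost the most. We repeat this until we reach a local minimum. Because we have not tried every parameter combination, we cannot be sure that this local minimum is also the global minimum, and different starting points may lead to different local minima. \n",
    " \n",
    " <img src=\"http://imgbed.momodel.cn//20200115014102.png\" width=500>"
   ]
  },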
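  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The gradient descent update rule is:\n",
    "\n",
    "<img src=\"http://imgbed.momodel.cn//20200115014016.png\" width=350>\n",
    " \n",
    "where $J$ is the cost function, $\\theta_{0},\\theta_{1}$ are the parameters to be solved for, and $\\alpha$ is the learning rate, which determines how large a step we take in the direction that decreases the cost function the fastest. "
   ]
  },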
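  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The update rule above is shown only as an image. As a reference sketch, and assuming the image shows the usual textbook form, the update for each parameter is:\n",
    "\n",
    "$$\\theta_j := \\theta_j - \\alpha \\frac{\\partial}{\\partial \\theta_j} J(\\theta_0, \\theta_1), \\qquad j = 0, 1$$\n",
    "\n",
    "For the line $y = ax + b$, take $\\theta_0 = b$ and $\\theta_1 = a$. If the cost is assumed to be the halved mean squared error over the $m$ data points (a common choice, not stated in this notebook),\n",
    "\n",
    "$$J(\\theta_0, \\theta_1) = \\frac{1}{2m} \\sum_{i=1}^{m} (\\theta_0 + \\theta_1 x_i - y_i)^2$$\n",
    "\n",
    "then the two partial derivatives are\n",
    "\n",
    "$$\\frac{\\partial J}{\\partial \\theta_0} = \\frac{1}{m} \\sum_{i=1}^{m} (\\theta_0 + \\theta_1 x_i - y_i), \\qquad \\frac{\\partial J}{\\partial \\theta_1} = \\frac{1}{m} \\sum_{i=1}^{m} (\\theta_0 + \\theta_1 x_i - y_i)\\, x_i$$\n",
    "\n",
    "Both parameters are updated simultaneously, each using the values from the previous step."
   ]
  },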
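  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For linear regression, the cost function curve is U-shaped.\n",
    "\n",
    "<img src=\"http://imgbed.momodel.cn//20200115000050.png\" width=300>\n",
    "\n",
    "Because the cost function curve is U-shaped, it has a single minimum, so gradient descent with a suitably small learning rate will reach the global minimum."
   ]
  },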
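  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Gradient descent is widely applicable: it can be used for classification problems as well as regression problems. The figure below shows a model learning step by step."
   ]
  },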
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://imgbed.momodel.cn/panel_49_animation.gif\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 探究莫纳罗亚山地区二氧化碳与温度之间的关系\n",
    "\n",
    "该地区 1970 年到 2005 年间每 5 年的二氧化碳浓度以及全球温度（相对于 1961 - 1990 年经过平滑处理的平均温度增长量）\n",
    "\n",
    "<table>\n",
    "<tbody>\n",
    "    <tr>\n",
    "        <th align=\"left\">$CO_2$(ppm) $x$</th>\n",
    "        <td align=\"center\">325.68</td>\n",
    "        <td align=\"center\">331.15</td>\n",
    "        <td align=\"center\">338.69</td> \n",
    "        <td align=\"center\">345.90</td>\n",
    "        <td align=\"center\">354.19</td>\n",
    "        <td align=\"center\">360.88</td>\n",
    "        <td align=\"center\">369.48</td>\n",
    "        <td align=\"center\">379.67</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th align=\"left\">温度 $y$ </th>\n",
    "        <td align=\"center\">-0.108</td>\n",
    "        <td align=\"center\">-0.082</td>\n",
    "        <td align=\"center\">0.015</td>\n",
    "        <td align=\"center\">0.080</td>\n",
    "        <td align=\"center\">0.149</td>\n",
    "        <td align=\"center\">0.240</td>\n",
    "        <td align=\"center\">0.370</td>\n",
    "        <td align=\"center\">0.420</td>\n",
    "\n",
    "    </tr>\n",
    "</tbody>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use the same method as above to solve for the parameters $a$ and $b$ and plot the fitted line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Data: CO2 concentration in ppm (x) and temperature increase (y)\n",
    "x = np.array([325.68, 331.15, 338.69, 345.90, 354.19, 360.88, 369.48, 379.67])\n",
    "y = np.array([-0.108, -0.082, 0.015, 0.080, 0.149, 0.24, 0.370, 0.420])\n",
    "\n",
    "# Compute the parameters a and b\n",
    "a, b = cal_a_b(x, y)\n",
    "\n",
    "# Build the line y = ax + b\n",
    "x_predict = np.linspace(325, 380, 1000)\n",
    "y_predict = a * x_predict + b\n",
    "\n",
    "# Plot the data points and the fitted line\n",
    "fig = plt.figure()\n",
    "plt.xlabel(\"CO2 (ppm)\")\n",
    "plt.ylabel(\"Temperature\")\n",
    "plt.scatter(x, y, c='r')\n",
    "plt.plot(x_predict, y_predict, c='b')\n",
    "plt.show()"
   ]
  },
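  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Questions and Exercises"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Celsius (℃) and Fahrenheit (℉) are two scales for measuring temperature. The table below gives several pairs of corresponding values; for example, 0℃ equals 32℉.\n"
   ]
  },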
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<table>\n",
    "    <h4 align=\"center\">Celsius and Fahrenheit readings at different temperatures</h4>\n",
    "<tbody>\n",
    "    <tr>\n",
    "        <th align=\"left\">Celsius (℃)</th>\n",
    "        <td align=\"center\">0</td>\n",
    "        <td align=\"center\">10</td>\n",
    "        <td align=\"center\">15</td>\n",
    "        <td align=\"center\">20</td>\n",
    "        <td align=\"center\">25</td>\n",
    "        <td align=\"center\">30</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "      <th align=\"left\">Fahrenheit (℉)</th>\n",
    "        <td align=\"center\">32</td>\n",
    "        <td align=\"center\">50</td>\n",
    "        <td align=\"center\">59</td>\n",
    "        <td align=\"center\">68</td>\n",
    "        <td align=\"center\">77</td>\n",
    "        <td align=\"center\">86</td>\n",
    "    </tr>\n",
    "</tbody>\n",
    "</table>"
   ]
  },
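  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Determine whether Celsius and Fahrenheit temperatures are linearly related. If they are, use linear regression to compute the regression equation relating them."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's look at a scatter plot of the Celsius and Fahrenheit values."
   ]
  },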
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Data: Celsius (x) and Fahrenheit (y) readings\n",
    "x = np.array([0, 10, 15, 20, 25, 30])\n",
    "y = np.array([32, 50, 59, 68, 77, 86])\n",
    "fig = plt.figure()\n",
    "plt.xlabel(\"Celsius\")\n",
    "plt.ylabel(\"Fahrenheit\")\n",
    "plt.scatter(x, y, c='r')\n",
    "plt.show()"
   ]
  },
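  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 1**: Looking at the plot above, are Celsius and Fahrenheit temperatures linearly related? If so, use the parameter-solving function we wrote above to quickly compute the coefficients."
   ]
  },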
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: write the code that solves for the coefficients\n",
    "a, b = \n",
    "print('Parameter a = {:g}, parameter b = {:g}'.format(a, b))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Build the line y = ax + b\n",
    "x_predict = np.linspace(0, 30, 1000)\n",
    "y_predict = a * x_predict + b\n",
    "\n",
    "# Plot the data points and the fitted line\n",
    "fig = plt.figure()\n",
    "plt.xlabel(\"Celsius\")\n",
    "plt.ylabel(\"Fahrenheit\")\n",
    "plt.scatter(x, y, c='r')\n",
    "plt.plot(x_predict, y_predict, c='b')\n",
    "plt.show()"
   ]
  },
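  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. Moore's law was proposed by Gordon Moore, one of the founders of Intel. Its basic statement is that, at constant price, the number of components that fit on an integrated circuit doubles roughly every 18 to 24 months, and performance doubles with it. The table below records the growth in transistor counts of Intel microprocessors from 1971 to 2004. Note that as transistors keep shrinking, the growth described by Moore's law will run into physical limits in the not-too-distant future."
   ]
  },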
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "|Microprocessor|Year introduced ($x$)|Transistor count ($y$)|$z = \\log_2 y$|\n",
    "|--|--|--|--|\n",
    "|4004|1971|2300|11.17|\n",
    "|8008|1972|2500|11.29|\n",
    "|8080|1974|4500|12.14|\n",
    "|8086|1978|29000|14.82|\n",
    "|Intel286|1982|134000|17.03|\n",
    "|Intel386 processor|1985|275000|18.07|\n",
    "|Intel486 processor|1989|1200000|20.19|\n",
    "|Intel Pentium processor|1993|3100000|21.56|\n",
    "|Intel Pentium Ⅱ processor|1997|7500000|22.84|\n",
    "|Intel Pentium Ⅲ processor|1999|9500000|23.18|\n",
    "|Intel Pentium 4 processor|2000|42000000|25.32|\n",
    "|Intel Itanium processor|2001|25000000|24.58|\n",
    "|Intel Itanium 2 processor|2003|220000000|27.72|\n",
    "|Intel Itanium 2 processor (9MB cache)|2004|592000000|29.14|"
   ]
  },
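  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Moore's law says that the transistor count grows exponentially with time. Such a relationship can be modeled with nonlinear regression, which is beyond the scope of this tutorial. Instead, we can take the base-2 logarithm of the transistor count (denoted $z$) and test Moore's law by checking whether $z$ is linearly related to time $x$. If the linear relationship holds, use linear regression to compute the best-fit line between them."
   ]
  },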
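  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A short derivation, added here as a reading aid rather than as part of the exercise, shows why a straight line in $z$ corresponds to Moore's law. If $z = \\log_2 y = ax + b$, then $y = 2^{ax+b}$, and advancing the year by $1/a$ doubles the count:\n",
    "\n",
    "$$y\\left(x + \\tfrac{1}{a}\\right) = 2^{a\\left(x + \\frac{1}{a}\\right) + b} = 2 \\cdot 2^{ax+b} = 2\\,y(x)$$\n",
    "\n",
    "So the doubling time is $T = 1/a$ years. Doubling every 18 to 24 months (1.5 to 2 years) would correspond to a slope $a$ between $1/2 = 0.5$ and $1/1.5 \\approx 0.67$ per year, which gives a range to compare the fitted slope against."
   ]
  },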
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Year of introduction\n",
    "x = np.array(\n",
    "    [1971, 1972, 1974, 1978, 1982, 1985, 1989, 1993, 1997, 1999, 2000, 2001,\n",
    "     2003, 2004])\n",
    "# Base-2 logarithm of the transistor count\n",
    "z = np.array(\n",
    "    [11.17, 11.29, 12.14, 14.82, 17.03, 18.07, 20.19, 21.56, 22.84, 23.18,\n",
    "     25.32, 24.58, 27.72, 29.14])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's plot $x$ against $z$ to see how they are related."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = plt.figure()\n",
    "plt.xlabel(\"Year\")\n",
    "plt.ylabel(\"log2(transistor count)\")\n",
    "plt.scatter(x, z, c='r')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 1**: Looking at the plot above, is $z$ linearly related to time $x$? If so, use the function we wrote above to solve for the coefficients."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: write the code that solves for the coefficients\n",
    "a, b = \n",
    "print('Parameter a = {:g}, parameter b = {:g}'.format(a, b))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Build the line z = ax + b\n",
    "x_predict = np.linspace(1970, 2005, 1000)\n",
    "z_predict = a * x_predict + b\n",
    "\n",
    "# Plot the data points and the fitted line\n",
    "fig = plt.figure()\n",
    "plt.xlabel(\"Year\")\n",
    "plt.ylabel(\"log2(transistor count)\")\n",
    "plt.scatter(x, z, c='r')\n",
    "plt.plot(x_predict, z_predict, c='b')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Further Reading\n",
    "\n",
    "1. [Whiteboard derivation of linear regression](https://www.bilibili.com/video/av31989606?from=search&seid=15463936019723788543)\n",
    "2. [sklearn linear regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}