master
/ 8.5.1 pandas文件读写.ipynb

8.5.1 pandas文件读写.ipynb @master

d487d71
 
 
 
cc3fd0d
 
 
d487d71
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
cc3fd0d
 
 
d487d71
 
 
 
 
 
 
 
 
 
cc3fd0d
d487d71
cc3fd0d
d487d71
 
 
 
 
cc3fd0d
d487d71
 
 
 
 
 
 
 
 
cc3fd0d
d487d71
 
 
 
cc3fd0d
d487d71
 
 
 
 
cc3fd0d
 
 
d487d71
 
 
 
 
 
 
 
 
 
 
 
 
 
 
cc3fd0d
d487d71
 
 
 
 
cc3fd0d
d487d71
 
 
 
 
 
 
 
 
 
 
cc3fd0d
 
 
d487d71
 
 
 
 
 
 
 
 
 
 
 
cc3fd0d
 
 
 
 
 
 
 
 
 
d487d71
 
cc3fd0d
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
d487d71
 
 
 
 
 
 
cc3fd0d
d487d71
 
 
cc3fd0d
 
 
 
 
 
 
 
 
 
 
 
 
 
d487d71
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "inputHidden": true
   },
   "source": [
    "# Pandas导入外部数据\n",
    "\n",
    "pandas支持方便快速的从多种格式的外部文件中读取数据形成DataFrame,或将DataFrame写入不同格式的外部文件。\n",
    "下表是Pandas官方手册上给出的一张表格,表格描述的是Pandas中对各种数据文件类型的读、写函数。\n",
    "\n",
    "| Format Type | Data Description      | Reader         | Writer          |\n",
    "| :-----------: | :---------------------: | :--------------: | :---------------: |\n",
    "| text        | CSV                   | read_csv       | to_csv          |\n",
    "| text        | Fixed-Width Text File | read_fwf       |                 |\n",
    "| text        | JSON                  | read_json      | to_json         |\n",
    "| text        | HTML                  | read_html      | to_html         |\n",
    "| text        | LaTeX                 |                | Styler.to_latex |\n",
    "| text        | XML                   | read_xml       | to_xml          |\n",
    "| text        | Local clipboard       | read_clipboard | to_clipboard    |\n",
    "| binary      | MS Excel              | read_excel     | to_excel        |\n",
    "| binary      | OpenDocument          | read_excel     |                 |\n",
    "| binary      | HDF5 Format           | read_hdf       | to_hdf          |\n",
    "| binary      | Feather Format        | read_feather   | to_feather      |\n",
    "| binary      | Parquet Format        | read_parquet   | to_parquet      |\n",
    "| binary      | ORC Format            | read_orc       |                 |\n",
    "| binary      | Stata                 | read_stata     | to_stata        |\n",
    "| binary      | SAS                   | read_sas       |                 |\n",
    "| binary      | SPSS                  | read_spss      |                 |\n",
    "| binary      | Python Pickle Format  | read_pickle    | to_pickle       |\n",
    "| SQL         | SQL                   | read_sql       | to_sql          |\n",
    "| SQL         | Google BigQuery       | read_gbq       | to_gbq          |\n",
    "\n",
    "本节仅介绍几种常用文件的读取方式,其他方法大家如需使用可自行查询[相关文档](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)。\n",
    "\n",
    "## 读取txt或csv文件\n",
    "```python\n",
    "pandas.read_csv(filepath_or_buffer, sep=NoDefault.no_default, delimiter=None, header='infer', names=NoDefault.no_default, index_col=None, usecols=None, squeeze=None, prefix=NoDefault.no_default, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=None, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal='.', lineterminator=None, quotechar='\"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, encoding_errors='strict', dialect=None, error_bad_lines=None, warn_bad_lines=None, on_bad_lines=None, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None, storage_options=None)\n",
    "```\n",
    "该方法参数较多,下面简单介绍一些常用参数,其他参数含义可自行查询文档。\n",
    "* filepath_or_buffer:读取的文件路径,URL(包含http,ftp,s3)链接等\n",
    "* sep:指定分隔符,默认逗号分隔。\n",
    "* delimiter:定界符,备选分隔符(如果指定该参数,则sep参数失效)\n",
    "* delim_whitespace:boolean, 是否指定空白字符作为分隔符,如果为Ture那么delimiter参数失效。\n",
    "* header:指定作为整个数据集列名的行,默认为第一行.如果数据集中没有列名,则需要设置为None\n",
    "* names:用于结果的列名列表。如果数据文件中没有列标题行,就需要设置header=None\n",
    "* index_col:指定数据集中的某列作为行索引\n",
    "* usecols:指定只读取文件中的某几列数据\n",
    "* dtype:设置每列数据的数据类型\n",
    "* skiprows:设置需要忽略的行数(从文件开始处算起),或需要跳过的行号列表(从0开始)\n",
    "* skipfooter:设置需要忽略的行数(从文件尾部处算起)\n",
    "* nrows:设置需要读取的行数(从文件头开始算起)\n",
    "* skip_blank_lines:如果为True,则跳过空行;否则记为NaN\n",
    "\n",
    "本节结合如下csv文件,仅讲解最简单用法,其他参数大家可结合数据文件自行尝试其效果。\n",
    "<img src=\"images/ch8/11.png\" style=\"zoom:100%;\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "inputHidden": true
   },
   "source": [
    "数据下载:  \n",
    "\n",
    "<a href=\"images/ch8/2020各手机参数对比.xls\" target=\"_blank\">2020各手机参数对比.xls</a>  \n",
    "\n",
    "<a href=\"images/ch8/wine point.csv\" target=\"_blank\">wine point.csv</a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "\n",
    "wine_reviews = pd.read_csv(\"images/ch8/wine point.csv\")\n",
    "\n",
    "print(wine_reviews.shape)   \n",
    "pd.set_option('display.max_columns', None)   # 显示所有列\n",
    "pd.set_option('display.max_rows', None)      # 显示所有行\n",
    "pd.set_option('display.width', None)         # 显示宽度是无限\n",
    "print(wine_reviews)                          # 返回DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 明确规定某列做行索引\n",
    "wine_reviews = pd.read_csv(\"images/ch8/wine point.csv\", index_col=6)\n",
    "print(wine_reviews)                          # 返回DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "inputHidden": true
   },
   "source": [
    "## 读取excel文件\n",
    "\n",
    "```python\n",
    "pandas.read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, squeeze=None, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, parse_dates=False, date_parser=None, thousands=None, decimal='.', comment=None, skipfooter=0, convert_float=None, mangle_dupe_cols=True, storage_options=None)\n",
    "```\n",
    "\n",
    "该函数大部分参数与read_csv()方法相同。其中sheet_name用于指定要读取的sheet,默认读取Excel文件中的第一个sheet。\n",
    "需要注意的是,使用该方法读取.xlsx文件需要借助openpyxl模块,读取.xls文件需要借助xlrd模块。\n",
    "\n",
    "本节结合文件“2020各手机参数对比.xls”,仅讲解最简单用法,其他参数大家可结合数据文件自行尝试其效果。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "phone_infos = pd.read_excel(\"images/ch8/2020各手机参数对比.xls\", skiprows=2) # 跳过第一行的表格标题\n",
    "# 读取/data/bigfiles/wine point.csv\n",
    "print(phone_infos.shape)   \n",
    "pd.set_option('display.max_columns', None)   # 显示所有列\n",
    "pd.set_option('display.max_rows', None)      # 显示所有行\n",
    "pd.set_option('display.width', None)         # 显示宽度是无限\n",
    "pd.set_option('display.unicode.east_asian_width', True) # 显示时列对齐\n",
    "print(phone_infos)                          # 返回DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "inputHidden": true
   },
   "source": [
    "### 读取HTML网页\n",
    "```python\n",
    "pandas.read_html(io, match='.+', flavor=None, header=None, index_col=None, skiprows=None, attrs=None, parse_dates=False, thousands=',', encoding=None, decimal='.', converters=None, na_values=None, keep_default_na=True, displayed_only=True)\n",
    "```\n",
    "该函数是将HTML的表格转换为DataFrame的一种快速方便的方法,不需要用爬虫获取站点的HTML。match参数通过正则表达式匹配需要的表格;flavor参数设置解析器,默认为lxml。\n",
    "\n",
    "本节以[ESPN网站的中超排名表](https://www.espn.com/soccer/team/_/id/21506/wuhan-three-towns)为例,讲解其基本用法。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pip install lxml"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pip install html5lib"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Looking in indexes: https://mirrors.aliyun.com/pypi/simple\n",
      "Collecting BeautifulSoup4\n",
      "  Downloading https://mirrors.aliyun.com/pypi/packages/b1/fe/e8c672695b37eecc5cbf43e1d0638d88d66ba3a44c4d321c796f4e59167f/beautifulsoup4-4.12.3-py3-none-any.whl (147 kB)\n",
      "\u001b[K     |████████████████████████████████| 147 kB 4.3 MB/s eta 0:00:01\n",
      "\u001b[?25hCollecting soupsieve>1.2\n",
      "  Downloading https://mirrors.aliyun.com/pypi/packages/49/37/673d6490efc51ec46d198c75903d99de59baffdd47aea3d071b80a9e4e89/soupsieve-2.4.1-py3-none-any.whl (36 kB)\n",
      "Installing collected packages: soupsieve, BeautifulSoup4\n",
      "Successfully installed BeautifulSoup4-4.12.3 soupsieve-2.4.1\n",
      "\u001b[33mWARNING: You are using pip version 21.1.3; however, version 24.0 is available.\n",
      "You should consider upgrading via the '/home/jovyan/work/.localenv/bin/python -m pip install --upgrade pip' command.\u001b[0m\n",
      "Note: you may need to restart the kernel to use updated packages.\n"
     ]
    }
   ],
   "source": [
    "pip install BeautifulSoup4\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "ename": "ValueError",
     "evalue": "No tables found",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[0;32m/tmp/ipykernel_149/3569134230.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mset_option\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'display.unicode.east_asian_width'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# 显示时列对齐\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      6\u001b[0m \u001b[0;31m# 函数会将每个table转化为一个DataFrame,返回由两个DataFrame构成的列表\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m \u001b[0mCSL_2022\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_html\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'https://www.taobao.com/'\u001b[0m\u001b[0;34m)\u001b[0m  \u001b[0;31m# 该页面只有1个table\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      8\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mCSL_2022\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mCSL_2022\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mCSL_2022\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/.virtualenvs/basenv/lib/python3.7/site-packages/pandas/util/_decorators.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m    309\u001b[0m                     \u001b[0mstacklevel\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mstacklevel\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    310\u001b[0m                 )\n\u001b[0;32m--> 311\u001b[0;31m             \u001b[0;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    312\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    313\u001b[0m         \u001b[0;32mreturn\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/.virtualenvs/basenv/lib/python3.7/site-packages/pandas/io/html.py\u001b[0m in \u001b[0;36mread_html\u001b[0;34m(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, thousands, encoding, decimal, converters, na_values, keep_default_na, displayed_only)\u001b[0m\n\u001b[1;32m   1111\u001b[0m         \u001b[0mna_values\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mna_values\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1112\u001b[0m         \u001b[0mkeep_default_na\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mkeep_default_na\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1113\u001b[0;31m         \u001b[0mdisplayed_only\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdisplayed_only\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   1114\u001b[0m     )\n",
      "\u001b[0;32m~/.virtualenvs/basenv/lib/python3.7/site-packages/pandas/io/html.py\u001b[0m in \u001b[0;36m_parse\u001b[0;34m(flavor, io, match, attrs, encoding, displayed_only, **kwargs)\u001b[0m\n\u001b[1;32m    924\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    925\u001b[0m         \u001b[0;32massert\u001b[0m \u001b[0mretained\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m  \u001b[0;31m# for mypy\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 926\u001b[0;31m         \u001b[0;32mraise\u001b[0m \u001b[0mretained\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    927\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    928\u001b[0m     \u001b[0mret\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/.virtualenvs/basenv/lib/python3.7/site-packages/pandas/io/html.py\u001b[0m in \u001b[0;36m_parse\u001b[0;34m(flavor, io, match, attrs, encoding, displayed_only, **kwargs)\u001b[0m\n\u001b[1;32m    904\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    905\u001b[0m         \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 906\u001b[0;31m             \u001b[0mtables\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mparse_tables\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    907\u001b[0m         \u001b[0;32mexcept\u001b[0m \u001b[0mValueError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mcaught\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    908\u001b[0m             \u001b[0;31m# if `io` is an io-like object, check if it's seekable\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/.virtualenvs/basenv/lib/python3.7/site-packages/pandas/io/html.py\u001b[0m in \u001b[0;36mparse_tables\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m    220\u001b[0m         \u001b[0mlist\u001b[0m \u001b[0mof\u001b[0m \u001b[0mparsed\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mheader\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mbody\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfooter\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0mtuples\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mtables\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    221\u001b[0m         \"\"\"\n\u001b[0;32m--> 222\u001b[0;31m         \u001b[0mtables\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_parse_tables\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_build_doc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmatch\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mattrs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    223\u001b[0m         \u001b[0;32mreturn\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_parse_thead_tbody_tfoot\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtable\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mtable\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mtables\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    224\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/.virtualenvs/basenv/lib/python3.7/site-packages/pandas/io/html.py\u001b[0m in \u001b[0;36m_parse_tables\u001b[0;34m(self, doc, match, attrs)\u001b[0m\n\u001b[1;32m    550\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    551\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mtables\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 552\u001b[0;31m             \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"No tables found\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    553\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    554\u001b[0m         \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mValueError\u001b[0m: No tables found"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "pd.set_option('display.max_columns', None)   # 显示所有列\n",
    "pd.set_option('display.max_rows', None)      # 显示所有行\n",
    "pd.set_option('display.width', None)         # 显示宽度是无限\n",
    "pd.set_option('display.unicode.east_asian_width', True) # 显示时列对齐\n",
    "# 函数会将每个table转化为一个DataFrame,返回由两个DataFrame构成的列表\n",
    "CSL_2022 = pd.read_html('https://www.taobao.com/')  # 该页面只有1个table \n",
    "print(type(CSL_2022), type(CSL_2022[0]))\n",
    "print(CSL_2022[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}