{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# NumPy读写文本文件\n",
"\n",
"NumPy提供了多种文本文件操作函数,用于从文本文件中读取数据形成数组,或将数组数据按相应的格式写入文本文件保存。\n",
"\n",
"## 读文本文件\n",
"\n",
"NumPy中用于读取文本文件的方法主要有loadtxt()和genfromtxt(),两者都可以用于读取txt或者csv文件。\n",
"\n",
"### 1. loadtxt()方法\n",
"\n",
"该方法用于从文本文件加载数据到二维数组中,文本文件中的每一行必须具有相同数量的值。函数原型为:\n",
"```python\n",
"numpy.loadtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, unpack=False, ndmin=0, encoding='bytes', max_rows=None, *, quotechar=None, like=None)\n",
"```\n",
"文件<a href=\"images/ch8/8.5 score.csv\" target=\"_blank\">8.5 score.csv</a>中保存学生成绩数据,其数据部分包括整数、浮点数和缺失数据(郑君 C 语言和 VB 成绩缺失),文件中数据见下图。下面以对这个文件的读取为例介绍几个主要参数的含义。\n",
"\n",
"<img src=\"images/ch8/6.png\" style=\"zoom:60%;\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ **fname:**要读取的文件、文件名、列表或生成器。其中,生成器必须返回字节或字符串,列表中或由生成器生成的字符串被视为行。如果是.gz或.bz2的压缩文件,则会先解压缩文件。\n",
"+ **dtype:**生成数组的数据类型,默认值是 dtype=float。设置 dtype=None 时,每个列的类型从每行的各列数据中迭代确定。函数依次检查各列数据是否可以转换为布尔值、整数、浮点数、复数和字符串,直到满足条件为止。但这种方法处理速度明显慢于明确设置 dtype 数据类型。\n",
"+ **delimiter:**用于分隔值的字符串,默认值为空格。用于定义按什么字符拆分数据行。\n",
"+ **encoding:**值为字符串,用于指定解码输入文件的编码类型,当“fname”是文件对象时不可使用此参数。默认值为“bytes”,此时启用向后兼容的方案,确保在可能的情况下接收字节数据,并将拉丁编码的字符串传给转换器。当值设置为“None”时,应用操作系统的默认编码,一般windows系统默认使用GBK编码。重写此值将可以接收 unicode 数组并将字符串作为输入传递给转换器。"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[['姓名' '学号' 'C语言' 'Java' 'Python' 'VB' 'C++' '总分']\n",
" ['朱佳' '0121701100511' '75.2' '93' '66' '85' '88' '407']\n",
" ['李思' '0121701100513' '86' '76' '96' '93' '67' '418']\n",
" ['郑君' '0121701100514' ' ' '98' '76' ' ' '89' '263']\n",
" ['王雪' '0121701100515' '99' '96' '91' '88' '86' '460']\n",
" ['罗明' '0121701100510' '95' '96' '85' '63' '91' '430']]\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"\n",
"file = 'images/ch8/8.5 score.csv'\n",
"data = np.loadtxt(file, dtype=str, delimiter=',', encoding='UTF-8') # 字符串类型,逗号分隔,utf-8编码\n",
"print(data)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ **skiprows:**值为整数。文件开头数据描述的存在可能阻碍数据处理,此时可以使用skiprows参数。参数的值为读取文件时跳过的行数,缺省值为 skip_header=0,表示不略过任何行。"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[['李思' '0121701100513' '86' '76' '96' '93' '67' '418']\n",
" ['郑君' '0121701100514' ' ' '98' '76' ' ' '89' '263']\n",
" ['王雪' '0121701100515' '99' '96' '91' '88' '86' '460']\n",
" ['罗明' '0121701100510' '95' '96' '85' '63' '91' '430']]\n"
]
}
],
"source": [
"import numpy as np\n",
" \n",
"file = 'images/ch8/8.5 score.csv'\n",
"data = np.loadtxt(file, dtype=str, delimiter=',', skiprows=2, encoding='utf-8') # 跳过标题行\n",
"print(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ **usecols:**整数序列,指明哪些列将被读取,序号从 0 开始。在某些情况下,只希望返回其中的几个列的数据,可以使用 usecols 参数选择要导入哪些列。此参数接受单个整数或对应于要导入的列的索引的整数序列。例如:“usecols = (1, 4, 5)”将读取第 2 列、第 5 列和第 6 列数据。"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[['75.2' '93' '66' '85' '88' '407']\n",
" ['86' '76' '96' '93' '67' '418']\n",
" [' ' '98' '76' ' ' '89' '263']\n",
" ['99' '96' '91' '88' '86' '460']\n",
" ['95' '96' '85' '63' '91' '430']]\n"
]
}
],
"source": [
"import numpy as np\n",
" \n",
"file = 'images/ch8/8.5 score.csv'\n",
"data = np.loadtxt(file, dtype=str, delimiter=',', skiprows=1, usecols=range(2,8), encoding='utf-8') # 仅读取成绩部分(2-7列)\n",
"print(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ **converters:**转换器,用于对数据进行预处理。值为函数名或包含键值对“列索引:函数名”的字典。"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
" \n",
"file = 'images/ch8/8.5 score.csv'\n",
"data = np.loadtxt(file, dtype=float, delimiter=',', skiprows=1, usecols=range(2,8), \n",
" converters=lambda s: float(s.strip() or 0), # 将所有列数据转换为浮点数,并处理空值为0.0\n",
" encoding='utf-8') \n",
"print(data)\n",
"data_1 = np.loadtxt(file, dtype=str, delimiter=',', skiprows=1, usecols=range(2,8), \n",
" converters={7:lambda s: round(float(s)/5, 2)}, # 将第7列转换为5门课的平均分,并保留最多2位小数\n",
" encoding='utf-8') \n",
"print(data_1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ **unpack:**布尔值,默认值为 False,当该值为 True 时,返回的数组将被转置,以便可以使用 x, y, z = genfromtxt(...) 解压参数。当 unpack 参数与记录数据类型一起使用时,每个字段都返回数组。"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
" \n",
"file = 'images/ch8/8.5 score.csv'\n",
"data = np.loadtxt(file, dtype=str, delimiter=',', unpack=True, encoding='utf-8') # 数组转置\n",
"print(data)\n",
"java_scores, python_scores = np.loadtxt(file, dtype=float, delimiter=',', skiprows=1, \n",
" usecols=(3, 4), unpack=True, # 仅读取java与python成绩,并赋值给不同对象\n",
" encoding='utf-8') \n",
"print(java_scores)\n",
"print(python_scores)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ **max_rows:**在跳过skiprows指定的行之后,最多读取max_rows行内容。默认值为None,表示读取所有行。"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
" \n",
"file = 'images/ch8/8.5 score.csv'\n",
"data = np.loadtxt(file, dtype=str, delimiter=',', max_rows=3, skiprows=1, encoding='utf-8') # 跳过标题行后,只读取3行\n",
"print(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ **comments:**字符串或字符串序列,用于指明注释开始的字符,默认值为'#'。注释标记可以出现在该行的任何地方,读取时注释符号后面的所有字符都会被忽略。\n",
"+ **ndmin:**整形值,用于指定返回的ndarray至少包含特定维度,值域 0/1/2,默认为 0。\n",
"+ **quotechar:**值为字符,定义用于表示引用项的开始和结束的字符。在一对quotechar包围的项中,出现的分隔符或注释字符将被忽略。默认值为None,表示禁用该功能。如果在一对quotechar包围的字段中发现两个连续的quotechar实例,则第一个实例将被视为转义字符。\n",
"\n",
"\n",
"### 2. genfromtxt()方法\n",
"\n",
"与loadtxt()类似,从文本文件加载数据到数组,并可按指定方式处理缺失值,面向的是结构化数组和缺失数据的处理。函数原型为:\n",
"```python\n",
"numpy.genfromtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, skip_header=0, skip_footer=0, converters=None, missing_values=None,filling_values=None, usecols=None, names=None, excludelist=None, deletechars=\"!#$%&'()*+, -./:;<=>?@[\\\\]^{|}~\", replace_space='_', autostrip=False, case_sensitive=True, defaultfmt='f%i', unpack=None, usemask=False, loose=True, invalid_raise=True, max_rows=None, encoding='bytes', *, ndmin=0, like=None)\n",
"```\n",
"该方法部分主要参数的含义和用法与 loadtxt() 方法的参数类似,下面对其独有的常用参数做简单介绍。\n",
"\n",
"+ **delimiter:**值为字符串、整数或序列。其值为字符串时,与loadtxt()方法中的delimiter参数作用相同。处理具有固定宽度的数据文件时,可用整数或整数序列确定每个字段的宽度。当所有列具有相同宽度时,值可设为单个整数;当各列宽度具有不同大小时,值可设为一个整数的序列。\n",
"+ **skip_header:**值为整数。文件开头数据描述的存在可能阻碍数据处理,此时可以使用 skip_header 参数。参数的值对应于在执行任何其他操作之前在文件开头跳过的行数,缺省值为 skip_header=0,表示不略过任何行。\n",
"+ **skip_footer:**与skip_header类似,可以使用 skip_footer 参数来跳过文件的最后若干行,缺省值为 skip_footer=0,表示不跳过任何行。"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[['朱佳' '0121701100511' '75.2' '93' '66' '85' '88' '407']\n",
" ['李思' '0121701100513' '86' '76' '96' '93' '67' '418']\n",
" ['郑君' '0121701100514' ' ' '98' '76' ' ' '89' '263']]\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"file = 'images/ch8/8.5 score.csv'\n",
"data = np.genfromtxt(file, dtype=str, delimiter=',', encoding='utf-8', # 字符串类型,逗号分隔,utf-8编码\n",
" skip_header=1, skip_footer=2) # 跳过第一行和最后2行\n",
"print(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ **missing_values:**设置哪些字符串被视为缺失数据。默认情况下,任何空字符串都被标记为缺失。但文件中也有可能有更复杂的字符串,比如 \"N/A\" 或 \"???\" 也代表丢失或无效的数据。missing_values 参数接受三种值:单个字符串或逗号分隔的字符串,该字符串将用作所有列缺失数据的标记;字符串序列,此时序列中每个项目都按顺序与列关联;字典,字典的值是字符串或字符串序列,相应的键可以是列索引(整数)或列名称(字符串),也可以使用特殊键 None 来定义适用于所有列的默认值。\n",
"+ **filling_values:**为缺失数据提供一个默认值。默认情况下,缺失值根据dtype参数确定(如int对应-1,string对应'???',float对应np.nan)。通过 filling_values参数,我们可以更好地控制缺失值的转换。filling_values参数接受三种值:单个值,所有列的默认值;序列,序列中每个项目都按顺序与列关联,作为相应列的默认值;字典类型,每个键可以是列索引或列名称,并且相应的值应该是单个对象做默认值,也可以使用特殊键None为所有列定义默认值。"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"file = 'images/ch8/8.5 score.csv'\n",
"data = np.genfromtxt(file, dtype=float, delimiter=',', skip_header=1, \n",
" usecols=range(2,8), filling_values=0, encoding='utf-8') # 仅保留成绩,转为浮点数,缺失值填充默认值0.0\n",
"print(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ **names:**值为None、True、字符串或序列之一,当值为“True”时,跳过文件开头的 skip_header 设定的行数后读取的第1行作为字段名。这行也可选被注释符号注释的行。如果 names 参数的值为序列或是被逗号分隔的字符串序列,那么这些字符串将被用于定义结构化类型的字段名。如果 names 参数的值为None,将使用原字段的数据类型作为字段名。\n",
"+ **defaultfmt:**值为字符串,定义初始化field names的格式。当 names = None 的时候,可通过此参数来设置字段名。\n",
"+ **deletechars:**给出一个字符串,将所有要从字段名称中删除的字符组合在一起。默认情况下,无效字符是\\~!@#$\\%^&\\*()-=+~\\\\|\\]\\}[{';: /?.>,<\n",
"+ **excludelist:**给出要排除的字段名称列表,如 return,file,print……。如果其中一个输入名称是该列表的一部分,则会附加一个下划线字符(‘_’)。\n",
"+ **case_sensitive:**字段名是否区分大小写(case_sensitive = True),转换为大写(case_sensitive = False 或 case_sensitive = ‘upper’)或小写(case_sensitive = ‘lower’)。\n",
"+ **replace_space:**为字段名中的空格设置替换的字符,默认为下划线。"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[(75.2, 93., 66., 85., 88., 407.) (86. , 76., 96., 93., 67., 418.)\n",
" ( 0. , 98., 76., 0., 89., 263.) (99. , 96., 91., 88., 86., 460.)\n",
" (95. , 96., 85., 63., 91., 430.)]\n",
"[('C语言', '<f8'), ('Java', '<f8'), ('Python', '<f8'), ('VB', '<f8'), ('C', '<f8'), ('总分', '<f8')]\n",
"[66. 96. 76. 91. 85.]\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"file = 'images/ch8/8.5 score.csv'\n",
"# 仅保留成绩,转为浮点数,缺失值填充默认值0.0,第一行做字段名\n",
"data = np.genfromtxt(file, dtype=float, delimiter=',', names=True, \n",
" usecols=range(2,8), filling_values=0, encoding='utf-8') \n",
"print(data)\n",
"print(data.dtype) # 输出每列的字段名称和类型,deletechars使用默认值,C++中的加号将被去除\n",
"print(data['Python']) # 可通过字段名索引并输出对应列数据"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ **usemask:**设置是否使用mask,默认值为False。为True时,使用‘–-’代替传统的nan,输出数组将成为 MaskedArray。我们可通过相关操作获取一个布尔数组,该数组中为True的位置表示缺少数据,否则为False。使用此数组,我们可跟踪丢失数据的位置。"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[75.2 93.0 66.0 85.0 88.0 407.0]\n",
" [86.0 76.0 96.0 93.0 67.0 418.0]\n",
" [-- 98.0 76.0 -- 89.0 263.0]\n",
" [99.0 96.0 91.0 88.0 86.0 460.0]\n",
" [95.0 96.0 85.0 63.0 91.0 430.0]]\n",
"[[False False False False False False]\n",
" [False False False False False False]\n",
" [ True False False True False False]\n",
" [False False False False False False]\n",
" [False False False False False False]]\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"file = 'images/ch8/8.5 score.csv'\n",
"# 跳过第一行,仅读取成绩列,使用mask\n",
"data = np.genfromtxt(file, dtype=float, delimiter=',', skip_header=1, usecols=range(2,8), usemask=True, encoding='utf-8') \n",
"print(data) # --替换nan\n",
"print(data.view(np.ma.MaskedArray).mask) # 得到布尔数组"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ **autostrip**:是否保留数据中前导或尾随的空白字符。默认值为False,保留前导或尾随的空白字符;设置为True时,自动去除前导或尾随的空白字符。\n",
"+ **loose:**是否针对invalid values报错,默认值为True,则不要针对无效值引发错误。\n",
"+ **invalid_raise:**默认值为True。如果为 True,则在某行检测到列数不一致时,将引发异常;如果为 False,则发出警告并跳过有问题的行。\n",
"\n",
"## 写文本文件\n",
"\n",
"NumPy中用于将数组写入文本文件的方法主要是savetxt()。\n",
"\n",
"### 1. savetxt()方法\n",
"\n",
"将数组数据按指定的格式写入文本文件,一般存储为txt或者csv格式。函数原型为:\n",
"```python\n",
"numpy.savetxt(fname, X, fmt='%.18e', delimiter=' ', newline='\\n', header='', footer='', comments='# ', encoding=None)\n",
"```\n",
"\n",
"+ **fname:**文件名或文件句柄。如果文件名结束.gz,文件将自动以压缩gzip格式保存。\n",
"+ **X:**一维或者二维数组,要保存到文本文件的数据。\n",
"+ **fmt:**字符串或字符串序列,用于设置数据写入文件时的格式。可以为单个格式化输出字符串、一个格式化输出字符串序列或一个多格式字符串。\n",
"+ **delimiter:**值为字符串,设置分隔列的字符串或字符,默认值为空格。\n",
"+ **newline:**值为字符串,设置字符串或字符分隔线,默认值为'\\n'。\n",
"+ **header:**值为字符串,设置将在文件开头写入的字符串。\n",
"+ **footer:**值为字符串,设置将写入文件末尾的字符串。\n",
"+ **comments:**值为字符串,设置附加到header和footer的字符串,标记其为注释。默认值为'#'。\n",
"+ **encoding:**值为字符串,用于指定编码输出文件的编码。\n",
"\n",
"### 实例:利用 NumPy 读写数据文件\n",
"\n",
"文件“<a href=\"images/ch8/8.5 scoreLoad.csv\" target=\"_blank\">8.5 scoreLoad.csv</a>”保存学生成绩数据,中文编码类型为utf-8,分隔符为英文逗号“,”,文件的内容如下,按要求完成以下操作:\n",
"1. 将文件读入数组,数据以字符串形式输出\n",
"2. 读取除学号和总分以外的数据到数组中\n",
"3. 返回只包含成绩数据的数组,数据转为浮点型\n",
"4. 将第3步读取的数组中的数据以浮点类型(保留两位小数)写入文本文件“scoresave.txt”中,用Tab作分隔符,编码为'utf-8。\n",
"5. 写入文件时,在文件头部增加字符串'成绩表',在文件尾部增加字符串'武汉理工大学',并设置注释字符为'@@'。\n",
"\n",
"\n",
"<img src=\"images/ch8/7.png\" style=\"zoom:60%;\">"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 75. 93. 66. 85. 88. 407.]\n",
" [ 86. 76. 96. 93. 67. 418.]\n",
" [ 88. 98. 76. 90. 89. 441.]\n",
" [ 99. 96. 91. 88. 86. 460.]\n",
" [ 95. 96. 85. 63. 91. 430.]]\n",
"文件'8.5 scoresave.txt'中的内容:\n",
"@@成绩表\n",
"75.00\t93.00\t66.00\t85.00\t88.00\t407.00\n",
"86.00\t76.00\t96.00\t93.00\t67.00\t418.00\n",
"88.00\t98.00\t76.00\t90.00\t89.00\t441.00\n",
"99.00\t96.00\t91.00\t88.00\t86.00\t460.00\n",
"95.00\t96.00\t85.00\t63.00\t91.00\t430.00\n",
"@@武汉理工大学\n",
"\n"
]
}
],
"source": [
"import numpy as np\n",
" \n",
"file = 'images/ch8/8.5 scoreLoad.csv'\n",
"# 文件无缺失数据,使用loadtxt()读文件到数组\n",
"data = np.loadtxt(file, float, delimiter=',', # 读取类型为浮点数,以英文逗号为分隔符 \n",
" usecols=(2, 3, 4, 5, 6, 7), # 仅读取成绩列 \n",
" skiprows=1, encoding='utf-8') # 跳过第一行,编码为utf-8 \n",
"print(data)\n",
"# 将成绩数组写入文件'8.5 scoresave.txt'\n",
"np.savetxt('images/ch8/8.5 scoresave.txt', data, # 数据data写入文件'8.5 scoresave.txt'\n",
" fmt=\"%.2f\", delimiter='\\t', # 设置写入格式为浮点数保留2位小数,分隔符为'\\t'\n",
" header='成绩表', # 文件头部写入'成绩表'\n",
" footer='武汉理工大学', # 文件尾部写入'武汉理工大学'\n",
" comments='@@', # 注释字符设置为'@@'\n",
" encoding='utf-8') # 编码为utf-8\n",
"\n",
"# 读取文件内容,观察写入的数据\n",
"print(\"文件'8.5 scoresave.txt'中的内容:\")\n",
"with open('images/ch8/8.5 scoresave.txt', 'r', encoding='utf-8') as f:\n",
" print(f.read())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}