{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Applications of Sets and Dictionaries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example: Looking Up Capitals"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"images/ch7/7.7 国家与首都.csv\" target=\"_blank\">7.7 国家与首都.csv</a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The file \"7.7 国家与首都.csv\" stores the names and capitals (or administrative centers) of most of the world's countries and regions, one country per line, with the country name and capital separated by an English comma. Read the file, store the data in a suitable data type, and write a program that implements a lookup: accept a country or region name entered by the user and output the corresponding capital or administrative center; when the input is invalid, output '输入错误' (input error). The program must accept input repeatedly and exit when the user enters a bare carriage return.\n",
"\n",
"The data in \"7.7 国家与首都.csv\" is formatted as follows:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"images/ch7/33.png\" style=\"zoom:50%;\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Analysis:** This problem can be solved with either a list or a dictionary, but a dictionary is indexed by key, which is far more efficient than searching a list, so a dictionary is recommended: store the data with country names as keys and capital names as values. Repeated input can be implemented with an infinite loop, and the condition for \"press Enter to exit\" is that the input is an empty string. When designing the program, reading the data from the file and performing the lookup can be split into two modules, each implemented as a separate function; the advantage is that each function then has simple logic and little code, making it easy to implement, debug, and maintain."
]
},
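{
"cell_type": "markdown",
"metadata": {},
"source": [
"The core of the lookup described above is the dictionary's get() method: capitals.get(country, '输入错误') returns the value stored under the key, or the default when the key is missing. A minimal sketch with hypothetical sample data (not taken from the CSV file):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical sample data; the real program builds this dict from the CSV file\n",
"capitals = {'中国': '北京', '日本': '东京'}\n",
"print(capitals.get('中国', '输入错误'))  # known key: prints 北京\n",
"print(capitals.get('火星', '输入错误'))  # unknown key: prints 输入错误"
]
},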
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def countries(filename):\n",
"    \"\"\"Read the given file, split each line at the comma, and return a\n",
"    dict mapping country name to capital.\"\"\"\n",
"    capitals = {}  # create an empty dict\n",
"    with open(filename, 'r', encoding='utf-8') as data:\n",
"        for x in data:  # iterate over the file line by line\n",
"            line = x.strip().split(',')  # split x into [country, capital]\n",
"            capitals[line[0]] = line[1]  # add entry: country -> capital\n",
"    return capitals  # return the dict just built\n",
"\n",
"\n",
"def query(capitals):\n",
"    \"\"\"Repeatedly read a country name from the user and print its capital;\n",
"    print '输入错误' (input error) when the name is unknown; exit when the\n",
"    input is empty.\"\"\"\n",
"    while True:  # infinite loop\n",
"        country = input()  # country name to look up\n",
"        if country == '':  # the user pressed Enter on an empty line\n",
"            break  # leave the loop, ending the program\n",
"        else:  # print the capital, or '输入错误' for an unknown name\n",
"            print(capitals.get(country, '输入错误'))\n",
"\n",
"\n",
"if __name__ == '__main__':\n",
"    file = 'images/ch7/7.7 国家与首都.csv'\n",
"    capital = countries(file)  # read the file into a dict\n",
"    query(capital)  # run the interactive capital lookup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Word Frequency Statistics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Count the 10 most frequent words in \"[7.8 宋词三百首.txt](images/ch7/7.8 宋词三百首.txt)\" and how many times each occurs.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Analysis**:\n",
"1. Words and their counts map naturally onto the key-value pairs of a dictionary: traverse \"7.8 宋词三百首.txt\", using each word as a key and its number of occurrences as the value to build a dictionary. Sort the dictionary in descending order by value and output the first 10 entries.\n",
"2. Unlike English text, Chinese text has no separators between the words of a sentence, so each sentence must first be segmented into words. This can be done with the lcut(txt) function of the third-party jieba library, where the parameter txt is a Chinese string. A third-party library must be installed before use:\n",
"```shell\n",
"pip install jieba\n",
"```\n",
"3. First define a function that reads the file, concatenates its contents into a single string, and returns that string.\n",
"\n",
"4. Define a word-frequency analysis function. Call file_string(file) to obtain the file contents as a string, then segment it with jieba's lcut(txt) to get a list of Chinese words. Traverse this list and build a dictionary keyed by the multi-character words (length greater than 1), adding 1 to a word's value each time the word occurs; this yields the frequency dictionary for all words. Finally, use sorted() to order the dictionary items in descending order by value and return the sorted result."
]
},
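{
"cell_type": "markdown",
"metadata": {},
"source": [
"The descending sort in step 4 can be sketched on a toy frequency dictionary (hypothetical counts, not the real data): sorted() with key=lambda x: x[1] orders the (word, count) pairs by count, and reverse=True makes the order descending."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical toy counts, standing in for the real word-frequency dict\n",
"counts = {'东风': 3, '明月': 5, '天涯': 1}\n",
"items = sorted(counts.items(), key=lambda x: x[1], reverse=True)\n",
"print(items)  # [('明月', 5), ('东风', 3), ('天涯', 1)]"
]
},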
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Building prefix dict from the default dictionary ...\n",
"Dumping model to file cache /tmp/jieba.cache\n",
"Loading model cost 0.819 seconds.\n",
"Prefix dict has been built successfully.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"东风 38\n",
"几道 30\n",
"斜阳 26\n",
"黄昏 26\n",
"吴文英 25\n",
"何处 24\n",
"天涯 24\n",
"相思 23\n",
"风雨 22\n",
"明月 22\n"
]
}
],
"source": [
"import jieba  # jieba segments Chinese sentences into words\n",
"\n",
"\n",
"def file_string(file):\n",
"    \"\"\"Read the named file and return its contents as one string.\"\"\"\n",
"    txt = ''  # start from an empty string\n",
"    with open(file, \"r\", encoding='utf-8') as data:  # open the file\n",
"        for line in data:  # iterate over the file object\n",
"            txt = txt + line.strip()  # concatenate the stripped lines\n",
"    return txt  # return the string\n",
"\n",
"\n",
"def cut_txt(txt):\n",
"    \"\"\"Segment the string txt with jieba; return a list of Chinese words.\"\"\"\n",
"    word_list = jieba.lcut(txt)  # lcut() splits the string into a word list\n",
"    return word_list\n",
"\n",
"\n",
"def text_analysis(word_list):\n",
"    \"\"\"Count how often each multi-character word occurs in word_list,\n",
"    building a dict keyed by word with the count as value; return the\n",
"    items as a list sorted in descending order of count.\"\"\"\n",
"    counts = {}  # create an empty dict, same as counts = dict()\n",
"    for word in word_list:  # iterate over the segmented words\n",
"        if len(word) > 1:  # skip single-character words\n",
"            counts[word] = counts.get(word, 0) + 1  # increment, default 0\n",
"    items = sorted(counts.items(), key=lambda x: x[1], reverse=True)  # sort\n",
"    return items  # word-frequency list in descending order\n",
"\n",
"\n",
"def print_words(items, n):\n",
"    for i in range(n):  # output the first n entries\n",
"        word, count = items[i]  # unpack the (word, count) tuple\n",
"        print(\"{0:<4}{1:>4}\".format(word, count))\n",
"        # word left-aligned in width 4; count right-aligned in width 4\n",
"\n",
"\n",
"if __name__ == '__main__':\n",
"    filename = \"images/ch7/7.8 宋词三百首.txt\"  # data file name\n",
"    text = file_string(filename)  # read the file into a string\n",
"    words_lst = cut_txt(text)  # segment the string into a word list\n",
"    ls = text_analysis(words_lst)  # compute the sorted word frequencies\n",
"    print_words(ls, 10)  # show the 10 most frequent words"
]
},
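{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a design note, the manual counting dict and sorted() call above can also be replaced by collections.Counter from the standard library, whose most_common(n) method returns the n most frequent items in descending order of count. A sketch with a hypothetical word list standing in for jieba's output:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"# Hypothetical word list, standing in for the output of jieba.lcut()\n",
"words = ['东风', '明月', '东风', '夜', '明月', '东风']\n",
"counts = Counter(w for w in words if len(w) > 1)  # skip single-character words\n",
"print(counts.most_common(2))  # [('东风', 3), ('明月', 2)]"
]
},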
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}