2350f2b
joyvan 6 years ago
5 changed file(s) with 316 addition(s) and 113 deletion(s).
1010 "cell_type": "markdown",
1111 "metadata": {},
1212 "source": [
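13 "A decision tree is a method that performs classification through a **tree structure**. Each node in the tree represents a test on one attribute of the sample being classified, each branch represents one outcome of that test, and each leaf node represents a classification result."
13 "A decision tree is a method that performs classification through a **tree structure**, reaching the final classification by reasoning layer by layer. A decision tree consists of the following elements:\n",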
14 "\n",
15 "<img src=\"http://imgbed.momodel.cn//20200110170450.png\" width=500>\n"
16 ]
17 },
18 {
19 "cell_type": "markdown",
20 "metadata": {},
21 "source": [
22 "What are the elements that make up a decision tree?"
23 ]
24 },
25 {
26 "cell_type": "markdown",
27 "metadata": {},
28 "source": [
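29 "The description above is rather abstract, so let's look at a concrete example: building a simple decision tree that predicts whether a loan applicant is able to repay a loan.\n",
30 "\n",
31 "A loan applicant has three main attributes: **whether they own property**, **whether they are married**, and **average monthly income**.\n",
32 "\n",
33 "Each internal node represents a test on an attribute, and each leaf node indicates whether the applicant is able to repay.\n",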
34 "<img src=\"http://imgbed.momodel.cn//20200110171836.png\" width=500>\n"
35 ]
36 },
37 {
38 "cell_type": "markdown",
39 "metadata": {},
40 "source": [
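41 "First, check whether the loan applicant owns property. If so, the applicant is able to repay the loan. Otherwise, check whether the applicant is married: if married, the applicant is able to repay. Otherwise, check the applicant's income: if the monthly income is below 4K yuan, the applicant is not able to repay the loan; otherwise the applicant is able to repay."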
42 ]
43 },
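The rule chain described in the cell above can be sketched as nested conditions. This is only an illustration of the figure's logic: the function name `can_repay` and its parameter names are hypothetical, and the income is assumed to be given in thousands of yuan.

```python
def can_repay(owns_house: bool, married: bool, monthly_income_k: float) -> bool:
    """Walk the loan decision tree from the root to a leaf.

    monthly_income_k is the average monthly income in thousands of yuan
    (an assumption made for this sketch).
    """
    # Root node: property ownership decides immediately.
    if owns_house:
        return True
    # Second level: marital status.
    if married:
        return True
    # Third level: income below 4K means "cannot repay".
    return monthly_income_k >= 4


# Two unmarried applicants without property, on either side of the 4K threshold.
print(can_repay(False, False, 3.5))
print(can_repay(False, False, 4.0))
```

Each `if` corresponds to one internal node, and each `return` corresponds to one leaf.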
44 {
45 "cell_type": "markdown",
46 "metadata": {},
47 "source": [
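48 "What is the decision process of a decision tree?\n",
49 "\n",
50 "Suppose there is a loan applicant A with a monthly income of 3K who is married and owns no property. Is this applicant able to repay the loan? \n",
51 "\n",
52 "Why does the tree above use “owns property” as the root node? Could “married” or “average monthly income” be used as the root node instead?"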
1453 ]
1554 },
1655 {
4887 "\n",
4988 "Based on the table above, draw the decision tree shown in the figure:\n",
5089 "\n",
51 "<img src=\"http://imgbed.momodel.cn/决策树11.png\" width=500px>"
52 ]
53 },
54 {
55 "cell_type": "markdown",
56 "metadata": {},
57 "source": [
58 "The first level is the weather condition, which takes three attribute values: rain (雨), cloudy (多云), and sunny (晴).\n",
90 "<img src=\"http://imgbed.momodel.cn//20200110172806.png\" width=500>"
91 ]
92 },
93 {
94 "cell_type": "markdown",
95 "metadata": {},
96 "source": [
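97 "The root node is the weather condition, which takes three attribute values: rain (雨), cloudy (多云), and sunny (晴).\n",
5998 "+ Cloudy: the sample subset is { 3, 7, 12, 13 }, which contains only the class “go to the amusement park”, so the visitor definitely goes. \n",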
6099 " \n",
61100 " \n",
81120 ]
82121 },
83122 {
84 "cell_type": "code",
85 "execution_count": null,
123 "cell_type": "markdown",
124 "metadata": {},
125 "source": [
126 "Load the data into a DataFrame:"
127 ]
128 },
129 {
130 "cell_type": "code",
131 "execution_count": 1,
86132 "metadata": {},
87133 "outputs": [],
88134 "source": [
94140 "import math\n",
95141 "from math import log\n",
96142 "import warnings\n",
97 "warnings.filterwarnings(\"ignore\")"
98 ]
99 },
100 {
101 "cell_type": "code",
102 "execution_count": null,
103 "metadata": {},
104 "outputs": [],
143 "warnings.filterwarnings(\"ignore\")\n"
144 ]
145 },
146 {
147 "cell_type": "code",
148 "execution_count": 2,
149 "metadata": {},
150 "outputs": [
151 {
152 "data": {
153 "text/html": [
154 "<div>\n",
155 "<style scoped>\n",
156 " .dataframe tbody tr th:only-of-type {\n",
157 " vertical-align: middle;\n",
158 " }\n",
159 "\n",
160 " .dataframe tbody tr th {\n",
161 " vertical-align: top;\n",
162 " }\n",
163 "\n",
164 " .dataframe thead th {\n",
165 " text-align: right;\n",
166 " }\n",
167 "</style>\n",
168 "<table border=\"1\" class=\"dataframe\">\n",
169 " <thead>\n",
170 " <tr style=\"text-align: right;\">\n",
171 " <th></th>\n",
172 " <th>天气</th>\n",
173 " <th>温度</th>\n",
174 " <th>湿度</th>\n",
175 " <th>是否有风</th>\n",
176 " <th>是否前往游乐场</th>\n",
177 " </tr>\n",
178 " </thead>\n",
179 " <tbody>\n",
180 " <tr>\n",
181 " <th>0</th>\n",
182 " <td>晴</td>\n",
183 " <td>&gt;26</td>\n",
184 " <td>&gt;75</td>\n",
185 " <td>否</td>\n",
186 " <td>0</td>\n",
187 " </tr>\n",
188 " <tr>\n",
189 " <th>1</th>\n",
190 " <td>晴</td>\n",
191 " <td>&lt;=26</td>\n",
192 " <td>&gt;75</td>\n",
193 " <td>是</td>\n",
194 " <td>0</td>\n",
195 " </tr>\n",
196 " <tr>\n",
197 " <th>2</th>\n",
198 " <td>多云</td>\n",
199 " <td>&gt;26</td>\n",
200 " <td>&gt;75</td>\n",
201 " <td>否</td>\n",
202 " <td>1</td>\n",
203 " </tr>\n",
204 " <tr>\n",
205 " <th>3</th>\n",
206 " <td>雨</td>\n",
207 " <td>&lt;=26</td>\n",
208 " <td>&gt;75</td>\n",
209 " <td>否</td>\n",
210 " <td>1</td>\n",
211 " </tr>\n",
212 " <tr>\n",
213 " <th>4</th>\n",
214 " <td>雨</td>\n",
215 " <td>&lt;=26</td>\n",
216 " <td>&gt;75</td>\n",
217 " <td>否</td>\n",
218 " <td>1</td>\n",
219 " </tr>\n",
220 " <tr>\n",
221 " <th>5</th>\n",
222 " <td>雨</td>\n",
223 " <td>&lt;=26</td>\n",
224 " <td>&lt;=75</td>\n",
225 " <td>是</td>\n",
226 " <td>0</td>\n",
227 " </tr>\n",
228 " <tr>\n",
229 " <th>6</th>\n",
230 " <td>多云</td>\n",
231 " <td>&lt;=26</td>\n",
232 " <td>&lt;=75</td>\n",
233 " <td>是</td>\n",
234 " <td>1</td>\n",
235 " </tr>\n",
236 " <tr>\n",
237 " <th>7</th>\n",
238 " <td>晴</td>\n",
239 " <td>&lt;=26</td>\n",
240 " <td>&gt;75</td>\n",
241 " <td>否</td>\n",
242 " <td>0</td>\n",
243 " </tr>\n",
244 " <tr>\n",
245 " <th>8</th>\n",
246 " <td>晴</td>\n",
247 " <td>&lt;=26</td>\n",
248 " <td>&lt;=75</td>\n",
249 " <td>否</td>\n",
250 " <td>1</td>\n",
251 " </tr>\n",
252 " <tr>\n",
253 " <th>9</th>\n",
254 " <td>雨</td>\n",
255 " <td>&lt;=26</td>\n",
256 " <td>&gt;75</td>\n",
257 " <td>否</td>\n",
258 " <td>1</td>\n",
259 " </tr>\n",
260 " <tr>\n",
261 " <th>10</th>\n",
262 " <td>晴</td>\n",
263 " <td>&lt;=26</td>\n",
264 " <td>&lt;=75</td>\n",
265 " <td>是</td>\n",
266 " <td>1</td>\n",
267 " </tr>\n",
268 " <tr>\n",
269 " <th>11</th>\n",
270 " <td>多云</td>\n",
271 " <td>&lt;=26</td>\n",
272 " <td>&gt;75</td>\n",
273 " <td>是</td>\n",
274 " <td>1</td>\n",
275 " </tr>\n",
276 " <tr>\n",
277 " <th>12</th>\n",
278 " <td>多云</td>\n",
279 " <td>&gt;26</td>\n",
280 " <td>&lt;=75</td>\n",
281 " <td>否</td>\n",
282 " <td>1</td>\n",
283 " </tr>\n",
284 " <tr>\n",
285 " <th>13</th>\n",
286 " <td>雨</td>\n",
287 " <td>&lt;=26</td>\n",
288 " <td>&gt;75</td>\n",
289 " <td>是</td>\n",
290 " <td>0</td>\n",
291 " </tr>\n",
292 " </tbody>\n",
293 "</table>\n",
294 "</div>"
295 ],
296 "text/plain": [
297 " 天气 温度 湿度 是否有风 是否前往游乐场\n",
298 "0 晴 >26 >75 否 0\n",
299 "1 晴 <=26 >75 是 0\n",
300 "2 多云 >26 >75 否 1\n",
301 "3 雨 <=26 >75 否 1\n",
302 "4 雨 <=26 >75 否 1\n",
303 "5 雨 <=26 <=75 是 0\n",
304 "6 多云 <=26 <=75 是 1\n",
305 "7 晴 <=26 >75 否 0\n",
306 "8 晴 <=26 <=75 否 1\n",
307 "9 雨 <=26 >75 否 1\n",
308 "10 晴 <=26 <=75 是 1\n",
309 "11 多云 <=26 >75 是 1\n",
310 "12 多云 >26 <=75 否 1\n",
311 "13 雨 <=26 >75 是 0"
312 ]
313 },
314 "execution_count": 2,
315 "metadata": {},
316 "output_type": "execute_result"
317 }
318 ],
105319 "source": [
106320 "# Raw data\n",
107321 "datasets = [\n",
147361 "source": [
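148362 "## 2.2.2 Building the Decision Tree \n",
149363 "\n",
150 "**Information gain** measures how much the complexity (uncertainty) of a sample set is reduced. \n",
151 "\n",
152 "**Information entropy** measures the amount of information. From the perspective of information theory, measuring information amounts to measuring how much uncertainty it carries. "
364 "What is **information gain**?\n",
365 "\n",
366 "What is **information entropy**?"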
153367 ]
154368 },
155369 {
182396 "    if ent == 0:\n",
183397 "        ent = 0\n",
184398 "    # Return the information entropy, rounded to 4 decimal places\n",
185 "    return round(ent, 4)"
186 ]
187 },
188 {
189 "cell_type": "markdown",
190 "metadata": {},
191 "source": [
192 "\n",
193 "\n",
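194 "Now we use **entropy** to build the decision tree. The 14 samples fall into two classes, “visitors come to the amusement park” (9 samples) and “visitors do not come to the amusement park” (5 samples), so K = 2.\n",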
399 "    return round(ent, 4)\n"
400 ]
401 },
402 {
403 "cell_type": "markdown",
404 "metadata": {},
405 "source": [
406 "\n",
407 "\n",
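408 "Now we use **information entropy** to build the decision tree. The 14 samples fall into two classes, “visitors come to the amusement park” (9 samples) and “visitors do not come to the amusement park” (5 samples), so K = 2.\n",
195409 "\n",
196410 "Let $p_1$ and $p_2$ denote the probabilities of “visitors come to the amusement park” and “visitors do not come to the amusement park”. Clearly $p_1=\\frac{9}{14}$ and $p_2=\\frac{5}{14}$, so the information entropy contained in these 14 samples is:\n",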
197411 "\n",
202416 "cell_type": "markdown",
203417 "metadata": {},
204418 "source": [
205 "We can filter the rows of a DataFrame by a condition as follows."
206 ]
207 },
208 {
209 "cell_type": "code",
210 "execution_count": null,
211 "metadata": {},
212 "outputs": [],
213 "source": [
214 "# Example: filter rows where 是否前往游乐场 == 0\n",
215 "df[df['是否前往游乐场']=='0']"
419 "We can filter the rows of a DataFrame by a condition as follows."
420 ]
421 },
422 {
423 "cell_type": "code",
424 "execution_count": null,
425 "metadata": {},
426 "outputs": [],
427 "source": [
428 "# Example: filter rows where 是否前往游乐场 == 0\n",
429 "df[df['是否前往游乐场']=='0']\n"
216430 ]
217431 },
218432 {
234448 "count_dict = {'前往':df[df['是否前往游乐场']=='1'].shape[0], '不前往':df[df['是否前往游乐场']=='0'].shape[0]}\n",
235449 "# Compute the information entropy\n",
236450 "entropy = calc_entropy(total_num, count_dict)\n",
237 "entropy"
451 "entropy\n"
238452 ]
239453 },
240454 {
278492 "outputs": [],
279493 "source": [
280494 "# Select the samples where the weather is sunny (晴) and the visitor went to the amusement park\n",
281 "df[(df['天气']=='晴') & (df['是否前往游乐场']=='1')]"
495 "df[(df['天气']=='晴') & (df['是否前往游乐场']=='1')]\n"
282496 ]
283497 },
284498 {
289503 "source": [
290504 "# Total number of sunny (晴) days\n",
291505 "total_num_sun = df[df['天气']=='晴'].shape[0]\n",
506 "\n",
292507 "# Counts of going / not going to the amusement park on sunny days\n",
293 "count_dict_sun = {'前往':df[(df['天气']=='晴') & (df['是否前往游乐场']=='1')].shape[0], \n",
294 " '不前往':df[(df['天气']=='晴') & (df['是否前往游乐场']=='0')].shape[0]}\n",
508 "count_dict_sun = {'前往':df[(df['天气']=='晴') & (df['是否前往游乐场']=='1')].shape[0],\n",
509 " '不前往':df[(df['天气']=='晴') & (df['是否前往游乐场']=='0')].shape[0]}\n",
295510 "print(count_dict_sun)\n",
511 "\n",
296512 "# Compute the information entropy for 天气 = 晴\n",
297513 "ent_sun = calc_entropy(total_num_sun, count_dict_sun)\n",
298514 "print('Information entropy for 天气 = 晴: %s' % ent_sun)\n"
306522 "source": [
307523 "# Total number of cloudy (多云) days\n",
308524 "total_num_cloud = df[df['天气']=='多云'].shape[0]\n",
525 "\n",
309526 "# Counts of going / not going to the amusement park on cloudy days\n",
310 "count_dict_cloud = {'前往':df[(df['天气']=='多云') & (df['是否前往游乐场']=='1')].shape[0], \n",
311 " '不前往':df[(df['天气']=='多云') & (df['是否前往游乐场']=='0')].shape[0]}\n",
527 "count_dict_cloud = {'前往':df[(df['天气']=='多云') & (df['是否前往游乐场']=='1')].shape[0],\n",
528 " '不前往':df[(df['天气']=='多云') & (df['是否前往游乐场']=='0')].shape[0]}\n",
312529 "print(count_dict_cloud)\n",
530 "\n",
313531 "# Compute the information entropy for 天气 = 多云\n",
314532 "ent_cloud = calc_entropy(total_num_cloud, count_dict_cloud)\n",
315 "print('Information entropy for 天气 = 多云: %s' % ent_cloud)"
533 "print('Information entropy for 天气 = 多云: %s' % ent_cloud)\n"
316534 ]
317535 },
318536 {
323541 "source": [
324542 "# Total number of rainy (雨) days\n",
325543 "total_num_rain = df[df['天气']=='雨'].shape[0]\n",
544 "\n",
326545 "# Counts of going / not going to the amusement park on rainy days\n",
327 "count_dict_rain = {'前往':df[(df['天气']=='雨') & (df['是否前往游乐场']=='1')].shape[0], \n",
328 " '不前往':df[(df['天气']=='雨') & (df['是否前往游乐场']=='0')].shape[0]}\n",
546 "count_dict_rain = {'前往':df[(df['天气']=='雨') & (df['是否前往游乐场']=='1')].shape[0],\n",
547 " '不前往':df[(df['天气']=='雨') & (df['是否前往游乐场']=='0')].shape[0]}\n",
329548 "print(count_dict_rain)\n",
549 "\n",
330550 "# Compute the information entropy for 天气 = 雨\n",
331551 "ent_rain = calc_entropy(total_num_rain, count_dict_rain)\n",
332 "print('Information entropy for 天气 = 雨: %s' % ent_rain)"
552 "print('Information entropy for 天气 = 雨: %s' % ent_rain)\n"
333553 ]
334554 },
335555 {
355575 "cell_type": "markdown",
356576 "metadata": {},
357577 "source": [
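358 "Suppose there are $K$ pieces of information that make up the sample set $D$, and let $p_k$ ($1 \\le k \\le K$) denote the probability that the $k$-th piece of information occurs. \n",
359 "The information entropy of these $K$ pieces of information is: \n",
360 "$$E(D)=-\\sum_{k=1}^{K}p_k \\log_{2} p_k$$\n",
361 "\n",
362 "Note that **all the $p_k$ sum to 1**."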
363 ]
364 },
365 {
366 "cell_type": "markdown",
367 "metadata": {},
368 "source": [
369578 "Compute the information gain using the formula above."
370579 ]
371580 },
376585 "outputs": [],
377586 "source": [
378587 "# Information gain\n",
379 "gain = entropy - (total_num_sun/total_num*ent_sun + \n",
380 " total_num_cloud/total_num*ent_cloud + \n",
588 "gain = entropy - (total_num_sun/total_num*ent_sun +\n",
589 " total_num_cloud/total_num*ent_cloud +\n",
381590 " total_num_rain/total_num*ent_rain)\n",
382 "gain"
591 "gain\n"
383592 ]
384593 },
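As a hand check of the cell above, the information gain of the weather attribute can be worked out from the counts in the 14-row table: sunny (晴) has 2 "go" and 3 "not go", cloudy (多云) has 4 "go" and 0 "not go", and rain (雨) has 3 "go" and 2 "not go". The subscripted names below are notation assumed for this sketch and do not come from the notebook.

```latex
\begin{aligned}
E(D) &= -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.9403 \\
E(D_{\text{sunny}}) &= -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} \approx 0.9710 \\
E(D_{\text{cloudy}}) &= 0 \\
E(D_{\text{rain}}) &= -\tfrac{3}{5}\log_2\tfrac{3}{5} - \tfrac{2}{5}\log_2\tfrac{2}{5} \approx 0.9710 \\
\mathrm{Gain}(D, \text{weather}) &= E(D) - \left(\tfrac{5}{14}E(D_{\text{sunny}}) + \tfrac{4}{14}E(D_{\text{cloudy}}) + \tfrac{5}{14}E(D_{\text{rain}})\right) \\
&\approx 0.9403 - \tfrac{10}{14}\cdot 0.9710 \approx 0.247
\end{aligned}
```

The cloudy branch contributes nothing to the weighted sum because its subset is already pure.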
385594 {
477686 "# Inspect the labels\n",
478687 "print(list(iris.target_names))\n",
479688 "# Inspect the features\n",
480 "print(iris.feature_names)"
689 "print(iris.feature_names)\n"
481690 ]
482691 },
483692 {
506715 "# Load the data\n",
507716 "X, y = load_iris(return_X_y=True)\n",
508717 "# Split into training and test sets\n",
509 "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)"
718 "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)\n"
510719 ]
511720 },
512721 {
527736 "# Initialize the model; try adjusting max_depth and observe how the model performs\n",
528737 "clf = tree.DecisionTreeClassifier(random_state=42, max_depth=2)\n",
529738 "# Train the model\n",
530 "clf = clf.fit(X_train, y_train)"
739 "clf = clf.fit(X_train, y_train)\n"
531740 ]
532741 },
533742 {
547756 "feature_names = ['萼片长度','萼片宽度','花瓣长度','花瓣宽度']\n",
548757 "target_names = ['山鸢尾', '杂色鸢尾', '维吉尼亚鸢尾']\n",
549758 "# Visualize the resulting decision tree\n",
550 "dot_data = tree.export_graphviz(clf, out_file=None, \n",
551 " feature_names=feature_names, \n",
552 " class_names=target_names, \n",
553 " filled=True, rounded=True, \n",
554 " special_characters=True) \n",
555 "graph = graphviz.Source(dot_data) \n",
556 "graph "
759 "dot_data = tree.export_graphviz(clf, out_file=None,\n",
760 " feature_names=feature_names,\n",
761 " class_names=target_names,\n",
762 " filled=True, rounded=True,\n",
763 " special_characters=True)\n",
764 "graph = graphviz.Source(dot_data)\n",
765 "graph\n"
557766 ]
558767 },
559768 {
571780 "source": [
572781 "from sklearn.metrics import accuracy_score\n",
573782 "y_test_predict = clf.predict(X_test)\n",
574 "accuracy_score(y_test,y_test_predict)"
783 "accuracy_score(y_test,y_test_predict)\n"
575784 ]
576785 },
577786 {
609818 "        # Read the contents line by line\n",
610819 "        for line in f.readlines():\n",
611820 "            contents += line\n",
612 "    return contents"
821 "    return contents\n"
613822 ]
614823 },
615824 {
678887 "    ent = -sum([(p / word_number) * log(p / word_number, 2) for p in\n",
679888 "                word_counter.values()])\n",
680889 "    print('Information entropy: %.2f' % ent)\n",
681 "    return ent"
682 ]
683 },
684 {
685 "cell_type": "code",
686 "execution_count": null,
687 "metadata": {},
688 "outputs": [],
689 "source": [
690 "ent = cal_essay_entropy(ch_essay)"
691 ]
692 },
693 {
694 "cell_type": "code",
695 "execution_count": null,
696 "metadata": {},
697 "outputs": [],
698 "source": [
699 "ent = cal_essay_entropy(en_essay, split_by = ' ')"
890 "    return ent\n"
891 ]
892 },
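Only the tail of `cal_essay_entropy` is visible in this diff, so the following is a minimal, self-contained sketch of the same idea rather than the notebook's implementation. The function name `token_entropy` is hypothetical, and treating each character as a token when `split_by` is `None` is an assumption made here.

```python
from collections import Counter
from math import log2


def token_entropy(text, split_by=None):
    """Information entropy (in bits) of the token distribution of a text."""
    # Assumption: split on split_by if given, otherwise use single characters.
    tokens = text.split(split_by) if split_by else list(text)
    tokens = [t for t in tokens if t.strip()]  # drop empty / whitespace tokens
    counts = Counter(tokens)
    total = sum(counts.values())
    # E = -sum(p * log2(p)) over the relative frequency p of each distinct token
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Two tokens with equal frequency carry exactly 1 bit per token.
print(token_entropy("a b a b", split_by=" "))
```

A text that repeats a single token has entropy 0, and entropy grows as the token frequencies become more evenly spread.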
893 {
894 "cell_type": "code",
895 "execution_count": null,
896 "metadata": {},
897 "outputs": [],
898 "source": [
899 "ent = cal_essay_entropy(ch_essay)\n"
900 ]
901 },
902 {
903 "cell_type": "code",
904 "execution_count": null,
905 "metadata": {},
906 "outputs": [],
907 "source": [
908 "ent = cal_essay_entropy(en_essay, split_by = ' ')\n"
700909 ]
701910 },
702911 {
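3232 "A loan applicant has three main attributes: **whether they own property**, **whether they are married**, and **average monthly income**.\n",
3333 "\n",
3434 "Each internal node represents a test on an attribute, and each leaf node indicates whether the applicant is able to repay.\n",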
35 "<img src=\"http://imgbed.momodel.cn/决策树实例2.png\"/>\n",
35 "<img src=\"http://imgbed.momodel.cn//20200110171836.png\" width=500>\n",
3636 "\n"
3737 ]
3838 },
8282 "\n",
8383 "Based on the table above, draw the decision tree shown in the figure:\n",
8484 "\n",
85 "<img src=\"http://imgbed.momodel.cn/决策树11.png\" width=500px>"
85 "<img src=\"http://imgbed.momodel.cn//20200110172806.png\" width=500>"
8686 ]
8787 },
8888 {
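112112 "1. Select an attribute;\n",
113113 "2. Partition the sample set based on that attribute;\n",
114114 "3. Repeat steps 1 and 2 until all samples in each resulting subset belong to the same class."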
115 ]
116 },
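The three steps listed above can be sketched as a short recursive function. This is an illustration under stated assumptions, not the notebook's code: the rows are copied from the 14-sample table, `build_tree` is a hypothetical name, and step 1 simply takes the next attribute in column order instead of scoring candidates.

```python
from collections import Counter

# Rows copied from the table: (天气, 温度, 湿度, 是否有风, 是否前往游乐场)
ROWS = [
    ("晴", ">26", ">75", "否", "0"), ("晴", "<=26", ">75", "是", "0"),
    ("多云", ">26", ">75", "否", "1"), ("雨", "<=26", ">75", "否", "1"),
    ("雨", "<=26", ">75", "否", "1"), ("雨", "<=26", "<=75", "是", "0"),
    ("多云", "<=26", "<=75", "是", "1"), ("晴", "<=26", ">75", "否", "0"),
    ("晴", "<=26", "<=75", "否", "1"), ("雨", "<=26", ">75", "否", "1"),
    ("晴", "<=26", "<=75", "是", "1"), ("多云", "<=26", ">75", "是", "1"),
    ("多云", ">26", "<=75", "否", "1"), ("雨", "<=26", ">75", "是", "0"),
]
ATTRS = ["天气", "温度", "湿度", "是否有风"]


def build_tree(rows, attr_indices):
    """Return a class label (leaf) or a nested dict {attribute: {value: subtree}}."""
    labels = [r[-1] for r in rows]
    # Stop when the subset is pure; fall back to a majority vote
    # if no attribute is left to split on.
    if len(set(labels)) == 1 or not attr_indices:
        return Counter(labels).most_common(1)[0][0]
    # Step 1: choose an attribute (here: simply the first remaining one).
    idx, rest = attr_indices[0], attr_indices[1:]
    # Step 2: partition the sample set by the values of that attribute.
    groups = {}
    for r in rows:
        groups.setdefault(r[idx], []).append(r)
    # Step 3: repeat steps 1 and 2 on every subset.
    return {ATTRS[idx]: {v: build_tree(g, rest) for v, g in groups.items()}}


tree = build_tree(ROWS, [0, 1, 2, 3])
print(tree["天气"]["多云"])  # the cloudy subset is pure, so this is a leaf label
```

Choosing attributes in a fixed order keeps the sketch short; a real construction would pick the attribute in step 1 with a criterion such as information gain.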
117 {
118 "cell_type": "markdown",
119 "metadata": {},
120 "source": [
121 "First, we load the data."
115122 ]
116123 },
117124 {
559566 "\n",
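560567 "The information gain of the other three weather features (temperature, humidity, and wind) can be computed in the same way. \n",
561568 "In general, the larger the information gain of a split, the greater the “purity” obtained by partitioning the sample set on it, and the more the uncertainty is reduced."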
562 ]
563 },
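To compare the four attributes side by side, the per-attribute information gain can be computed in one loop. The sketch below uses only the standard library instead of the notebook's DataFrame filters; the rows are copied from the 14-sample table, and the helper names `entropy` and `info_gain` are hypothetical.

```python
from collections import Counter
from math import log2

# Rows copied from the table: (天气, 温度, 湿度, 是否有风, 是否前往游乐场)
ROWS = [
    ("晴", ">26", ">75", "否", "0"), ("晴", "<=26", ">75", "是", "0"),
    ("多云", ">26", ">75", "否", "1"), ("雨", "<=26", ">75", "否", "1"),
    ("雨", "<=26", ">75", "否", "1"), ("雨", "<=26", "<=75", "是", "0"),
    ("多云", "<=26", "<=75", "是", "1"), ("晴", "<=26", ">75", "否", "0"),
    ("晴", "<=26", "<=75", "否", "1"), ("雨", "<=26", ">75", "否", "1"),
    ("晴", "<=26", "<=75", "是", "1"), ("多云", "<=26", ">75", "是", "1"),
    ("多云", ">26", "<=75", "否", "1"), ("雨", "<=26", ">75", "是", "0"),
]
ATTRS = ["天气", "温度", "湿度", "是否有风"]


def entropy(labels):
    """E(D) = -sum(p_k * log2(p_k)) over the classes present in labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def info_gain(rows, idx):
    """Entropy of the whole set minus the size-weighted entropy of each subset."""
    base = entropy([r[-1] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[idx], []).append(r[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


gains = {name: info_gain(ROWS, i) for i, name in enumerate(ATTRS)}
for name, gain in gains.items():
    print(name, round(gain, 4))
best = max(gains, key=gains.get)
print("attribute with the largest gain:", best)
```

On this table the weather attribute has the largest gain, which is consistent with it being the root node in the figure.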
564 {
565 "cell_type": "markdown",
566 "metadata": {},
567 "source": [
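568 "Suppose there are $K$ pieces of information that make up the sample set $D$, and let $p_k$ ($1 \\le k \\le K$) denote the probability that the $k$-th piece of information occurs. \n",
569 "The information entropy of these $K$ pieces of information is: \n",
570 "$$E(D)=-\\sum_{k=1}^{K}p_k \\log_{2} p_k$$\n",
571 "\n",
572 "Note that **all the $p_k$ sum to 1**."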
573569 ]
574570 },
575571 {
5858 "import numpy as np\n",
5959 "import matplotlib.pyplot as plt\n",
6060 "%matplotlib inline\n",
61 "!mkdir -p ~/.keras/datasets\n",
62 "!cp ./mnist.npz ~/.keras/datasets/mnist.npz\n",
6361 "\n",
6462 "x = np.array([1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005])\n",
6563 "y = np.array([325.68, 331.15, 338.69, 345.90, 354.19, 360.88, 369.48, 379.67])\n",
data_sample.jpg
Binary diff not shown
model.jpg
Binary diff not shown