| 10 | 10 |
"cell_type": "markdown",
|
| 11 | 11 |
"metadata": {},
|
| 12 | 12 |
"source": [
|
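| 13 | |
"A decision tree is a classification method based on a **tree structure**. Each internal node tests one attribute of the item being classified, each branch corresponds to one outcome of that test, and each leaf node represents a classification result."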
|
|
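13 |
"A decision tree is a classification method based on a **tree structure**: it reaches the final classification through a sequence of step-by-step decisions. A decision tree consists of the following elements:\n",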
|
|
14 |
"\n",
|
|
15 |
"<img src=\"http://imgbed.momodel.cn//20200110170450.png\" width=500>\n"
|
|
16 |
]
|
|
17 |
},
|
|
18 |
{
|
|
19 |
"cell_type": "markdown",
|
|
20 |
"metadata": {},
|
|
21 |
"source": [
|
|
22 |
"What elements make up a decision tree?"
|
|
23 |
]
|
|
24 |
},
|
|
25 |
{
|
|
26 |
"cell_type": "markdown",
|
|
27 |
"metadata": {},
|
|
28 |
"source": [
|
|
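29 |
"The description above is fairly abstract, so consider a concrete example: a simple decision tree that predicts whether a loan applicant is able to repay a loan.\n",
|
|
30 |
"\n",
|
|
31 |
"Each loan applicant has three main attributes: **owns property**, **is married**, and **average monthly income**.\n",
|
|
32 |
"\n",
|
|
33 |
"Each internal node tests a condition on one attribute, and each leaf node states whether the applicant is able to repay.\n",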
|
|
34 |
"<img src=\"http://imgbed.momodel.cn//20200110171836.png\" width=500>\n"
|
|
35 |
]
|
|
36 |
},
|
|
37 |
{
|
|
38 |
"cell_type": "markdown",
|
|
39 |
"metadata": {},
|
|
40 |
"source": [
|
|
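41 |
"First check whether the applicant owns property. If so, the applicant is considered able to repay. Otherwise, check whether the applicant is married: if married, the applicant is considered able to repay. Otherwise, check the monthly income: if it is below 4K yuan, the applicant is not able to repay; otherwise the applicant is able to repay."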
|
|
42 |
]
|
|
43 |
},
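The decision flow described above can be sketched as a chain of nested conditions. This is only an illustrative sketch: the function name `can_repay`, its parameter names, and the reading of "4K" as 4000 yuan are assumptions made here, not part of the notebook.

```python
def can_repay(owns_property: bool, is_married: bool, monthly_income: float) -> bool:
    """Walk the example loan decision tree from the root to a leaf."""
    # Root node: property owners are classified as able to repay.
    if owns_property:
        return True
    # Second level: married applicants are classified as able to repay.
    if is_married:
        return True
    # Third level: the decision depends on monthly income
    # (assumed threshold: 4K yuan = 4000).
    return monthly_income >= 4000


# A hypothetical applicant: monthly income 3K, married, no property.
print(can_repay(owns_property=False, is_married=True, monthly_income=3000))
```

Because the tree checks marriage before income, a married applicant without property reaches a leaf before the income test is ever evaluated.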
|
|
44 |
{
|
|
45 |
"cell_type": "markdown",
|
|
46 |
"metadata": {},
|
|
47 |
"source": [
|
|
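48 |
"How does a decision tree reach its decision?\n",
|
|
49 |
"\n",
|
|
50 |
"Suppose loan applicant A has a monthly income of 3K, is married, and owns no property. Is A able to repay the loan? \n",
|
|
51 |
"\n",
|
|
52 |
"In the figure above, why is 'owns property' used as the root node? Could 'is married' or 'average monthly income' serve as the root node instead?"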
|
| 14 | 53 |
]
|
| 15 | 54 |
},
|
| 16 | 55 |
{
|
|
| 48 | 87 |
"\n",
|
| 49 | 88 |
"Based on the table above, we draw the decision tree shown in the figure:\n",
|
| 50 | 89 |
"\n",
|
| 51 | |
"<img src=\"http://imgbed.momodel.cn/决策树11.png\" width=500px>"
|
| 52 | |
]
|
| 53 | |
},
|
| 54 | |
{
|
| 55 | |
"cell_type": "markdown",
|
| 56 | |
"metadata": {},
|
| 57 | |
"source": [
|
| 58 | |
"The first level is the weather (天气), which takes three values: rain (雨), overcast (多云), and sunny (晴).\n",
|
|
90 |
"<img src=\"http://imgbed.momodel.cn//20200110172806.png\" width=500>"
|
|
91 |
]
|
|
92 |
},
|
|
93 |
{
|
|
94 |
"cell_type": "markdown",
|
|
95 |
"metadata": {},
|
|
96 |
"source": [
|
|
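97 |
"The root node is the weather (天气), which takes three values: rain (雨), overcast (多云), and sunny (晴).\n",
|
| 59 | 98 |
"+ Overcast (多云): the sample subset is { 3, 7, 12, 13 }, which contains only the class 'goes to the amusement park', so the visitor always goes. \n",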
|
| 60 | 99 |
" \n",
|
| 61 | 100 |
" \n",
|
|
| 81 | 120 |
]
|
| 82 | 121 |
},
|
| 83 | 122 |
{
|
| 84 | |
"cell_type": "code",
|
| 85 | |
"execution_count": null,
|
|
123 |
"cell_type": "markdown",
|
|
124 |
"metadata": {},
|
|
125 |
"source": [
|
|
126 |
"Load the data into a DataFrame:"
|
|
127 |
]
|
|
128 |
},
|
|
129 |
{
|
|
130 |
"cell_type": "code",
|
|
131 |
"execution_count": 1,
|
| 86 | 132 |
"metadata": {},
|
| 87 | 133 |
"outputs": [],
|
| 88 | 134 |
"source": [
|
|
| 94 | 140 |
"import math\n",
|
| 95 | 141 |
"from math import log\n",
|
| 96 | 142 |
"import warnings\n",
|
| 97 | |
"warnings.filterwarnings(\"ignore\")"
|
| 98 | |
]
|
| 99 | |
},
|
| 100 | |
{
|
| 101 | |
"cell_type": "code",
|
| 102 | |
"execution_count": null,
|
| 103 | |
"metadata": {},
|
| 104 | |
"outputs": [],
|
|
143 |
"warnings.filterwarnings(\"ignore\")\n"
|
|
144 |
]
|
|
145 |
},
|
|
146 |
{
|
|
147 |
"cell_type": "code",
|
|
148 |
"execution_count": 2,
|
|
149 |
"metadata": {},
|
|
150 |
"outputs": [
|
|
151 |
{
|
|
152 |
"data": {
|
|
153 |
"text/html": [
|
|
154 |
"<div>\n",
|
|
155 |
"<style scoped>\n",
|
|
156 |
" .dataframe tbody tr th:only-of-type {\n",
|
|
157 |
" vertical-align: middle;\n",
|
|
158 |
" }\n",
|
|
159 |
"\n",
|
|
160 |
" .dataframe tbody tr th {\n",
|
|
161 |
" vertical-align: top;\n",
|
|
162 |
" }\n",
|
|
163 |
"\n",
|
|
164 |
" .dataframe thead th {\n",
|
|
165 |
" text-align: right;\n",
|
|
166 |
" }\n",
|
|
167 |
"</style>\n",
|
|
168 |
"<table border=\"1\" class=\"dataframe\">\n",
|
|
169 |
" <thead>\n",
|
|
170 |
" <tr style=\"text-align: right;\">\n",
|
|
171 |
" <th></th>\n",
|
|
172 |
" <th>天气</th>\n",
|
|
173 |
" <th>温度</th>\n",
|
|
174 |
" <th>湿度</th>\n",
|
|
175 |
" <th>是否有风</th>\n",
|
|
176 |
" <th>是否前往游乐场</th>\n",
|
|
177 |
" </tr>\n",
|
|
178 |
" </thead>\n",
|
|
179 |
" <tbody>\n",
|
|
180 |
" <tr>\n",
|
|
181 |
" <th>0</th>\n",
|
|
182 |
" <td>晴</td>\n",
|
|
183 |
" <td>>26</td>\n",
|
|
184 |
" <td>>75</td>\n",
|
|
185 |
" <td>否</td>\n",
|
|
186 |
" <td>0</td>\n",
|
|
187 |
" </tr>\n",
|
|
188 |
" <tr>\n",
|
|
189 |
" <th>1</th>\n",
|
|
190 |
" <td>晴</td>\n",
|
|
191 |
" <td><=26</td>\n",
|
|
192 |
" <td>>75</td>\n",
|
|
193 |
" <td>是</td>\n",
|
|
194 |
" <td>0</td>\n",
|
|
195 |
" </tr>\n",
|
|
196 |
" <tr>\n",
|
|
197 |
" <th>2</th>\n",
|
|
198 |
" <td>多云</td>\n",
|
|
199 |
" <td>>26</td>\n",
|
|
200 |
" <td>>75</td>\n",
|
|
201 |
" <td>否</td>\n",
|
|
202 |
" <td>1</td>\n",
|
|
203 |
" </tr>\n",
|
|
204 |
" <tr>\n",
|
|
205 |
" <th>3</th>\n",
|
|
206 |
" <td>雨</td>\n",
|
|
207 |
" <td><=26</td>\n",
|
|
208 |
" <td>>75</td>\n",
|
|
209 |
" <td>否</td>\n",
|
|
210 |
" <td>1</td>\n",
|
|
211 |
" </tr>\n",
|
|
212 |
" <tr>\n",
|
|
213 |
" <th>4</th>\n",
|
|
214 |
" <td>雨</td>\n",
|
|
215 |
" <td><=26</td>\n",
|
|
216 |
" <td>>75</td>\n",
|
|
217 |
" <td>否</td>\n",
|
|
218 |
" <td>1</td>\n",
|
|
219 |
" </tr>\n",
|
|
220 |
" <tr>\n",
|
|
221 |
" <th>5</th>\n",
|
|
222 |
" <td>雨</td>\n",
|
|
223 |
" <td><=26</td>\n",
|
|
224 |
" <td><=75</td>\n",
|
|
225 |
" <td>是</td>\n",
|
|
226 |
" <td>0</td>\n",
|
|
227 |
" </tr>\n",
|
|
228 |
" <tr>\n",
|
|
229 |
" <th>6</th>\n",
|
|
230 |
" <td>多云</td>\n",
|
|
231 |
" <td><=26</td>\n",
|
|
232 |
" <td><=75</td>\n",
|
|
233 |
" <td>是</td>\n",
|
|
234 |
" <td>1</td>\n",
|
|
235 |
" </tr>\n",
|
|
236 |
" <tr>\n",
|
|
237 |
" <th>7</th>\n",
|
|
238 |
" <td>晴</td>\n",
|
|
239 |
" <td><=26</td>\n",
|
|
240 |
" <td>>75</td>\n",
|
|
241 |
" <td>否</td>\n",
|
|
242 |
" <td>0</td>\n",
|
|
243 |
" </tr>\n",
|
|
244 |
" <tr>\n",
|
|
245 |
" <th>8</th>\n",
|
|
246 |
" <td>晴</td>\n",
|
|
247 |
" <td><=26</td>\n",
|
|
248 |
" <td><=75</td>\n",
|
|
249 |
" <td>否</td>\n",
|
|
250 |
" <td>1</td>\n",
|
|
251 |
" </tr>\n",
|
|
252 |
" <tr>\n",
|
|
253 |
" <th>9</th>\n",
|
|
254 |
" <td>雨</td>\n",
|
|
255 |
" <td><=26</td>\n",
|
|
256 |
" <td>>75</td>\n",
|
|
257 |
" <td>否</td>\n",
|
|
258 |
" <td>1</td>\n",
|
|
259 |
" </tr>\n",
|
|
260 |
" <tr>\n",
|
|
261 |
" <th>10</th>\n",
|
|
262 |
" <td>晴</td>\n",
|
|
263 |
" <td><=26</td>\n",
|
|
264 |
" <td><=75</td>\n",
|
|
265 |
" <td>是</td>\n",
|
|
266 |
" <td>1</td>\n",
|
|
267 |
" </tr>\n",
|
|
268 |
" <tr>\n",
|
|
269 |
" <th>11</th>\n",
|
|
270 |
" <td>多云</td>\n",
|
|
271 |
" <td><=26</td>\n",
|
|
272 |
" <td>>75</td>\n",
|
|
273 |
" <td>是</td>\n",
|
|
274 |
" <td>1</td>\n",
|
|
275 |
" </tr>\n",
|
|
276 |
" <tr>\n",
|
|
277 |
" <th>12</th>\n",
|
|
278 |
" <td>多云</td>\n",
|
|
279 |
" <td>>26</td>\n",
|
|
280 |
" <td><=75</td>\n",
|
|
281 |
" <td>否</td>\n",
|
|
282 |
" <td>1</td>\n",
|
|
283 |
" </tr>\n",
|
|
284 |
" <tr>\n",
|
|
285 |
" <th>13</th>\n",
|
|
286 |
" <td>雨</td>\n",
|
|
287 |
" <td><=26</td>\n",
|
|
288 |
" <td>>75</td>\n",
|
|
289 |
" <td>是</td>\n",
|
|
290 |
" <td>0</td>\n",
|
|
291 |
" </tr>\n",
|
|
292 |
" </tbody>\n",
|
|
293 |
"</table>\n",
|
|
294 |
"</div>"
|
|
295 |
],
|
|
296 |
"text/plain": [
|
|
297 |
" 天气 温度 湿度 是否有风 是否前往游乐场\n",
|
|
298 |
"0 晴 >26 >75 否 0\n",
|
|
299 |
"1 晴 <=26 >75 是 0\n",
|
|
300 |
"2 多云 >26 >75 否 1\n",
|
|
301 |
"3 雨 <=26 >75 否 1\n",
|
|
302 |
"4 雨 <=26 >75 否 1\n",
|
|
303 |
"5 雨 <=26 <=75 是 0\n",
|
|
304 |
"6 多云 <=26 <=75 是 1\n",
|
|
305 |
"7 晴 <=26 >75 否 0\n",
|
|
306 |
"8 晴 <=26 <=75 否 1\n",
|
|
307 |
"9 雨 <=26 >75 否 1\n",
|
|
308 |
"10 晴 <=26 <=75 是 1\n",
|
|
309 |
"11 多云 <=26 >75 是 1\n",
|
|
310 |
"12 多云 >26 <=75 否 1\n",
|
|
311 |
"13 雨 <=26 >75 是 0"
|
|
312 |
]
|
|
313 |
},
|
|
314 |
"execution_count": 2,
|
|
315 |
"metadata": {},
|
|
316 |
"output_type": "execute_result"
|
|
317 |
}
|
|
318 |
],
|
| 105 | 319 |
"source": [
|
| 106 | 320 |
"# Raw data\n",
|
| 107 | 321 |
"datasets = [\n",
|
|
| 147 | 361 |
"source": [
|
| 148 | 362 |
"## 2.2.2 Building the decision tree \n",
|
| 149 | 363 |
"\n",
|
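| 150 | |
"**Information gain** measures how much the complexity (uncertainty) of a sample set is reduced. \n",
|
| 151 | |
"\n",
|
| 152 | |
"**Information entropy** measures the amount of information. From the viewpoint of information theory, measuring information amounts to quantifying its uncertainty. "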
|
|
364 |
"What is **information gain**?\n",
|
|
365 |
"\n",
|
|
366 |
"What is **information entropy**?"
|
| 153 | 367 |
]
|
| 154 | 368 |
},
|
| 155 | 369 |
{
|
|
| 182 | 396 |
"    if ent == 0:  # normalize a possible -0.0 to 0\n",
|
| 183 | 397 |
"        ent = 0\n",
|
| 184 | 398 |
"    # Return the entropy rounded to 4 decimal places\n",
|
| 185 | |
"    return round(ent, 4)"
|
| 186 | |
]
|
| 187 | |
},
|
| 188 | |
{
|
| 189 | |
"cell_type": "markdown",
|
| 190 | |
"metadata": {},
|
| 191 | |
"source": [
|
| 192 | |
"\n",
|
| 193 | |
"\n",
|
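| 194 | |
"Now we use **entropy** to build the decision tree. The 14 samples fall into two classes, 'visitor goes to the amusement park' (9 samples) and 'visitor does not go to the amusement park' (5 samples), so K = 2.\n",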
|
|
399 |
"    return round(ent, 4)\n"
|
|
400 |
]
|
|
401 |
},
|
|
402 |
{
|
|
403 |
"cell_type": "markdown",
|
|
404 |
"metadata": {},
|
|
405 |
"source": [
|
|
406 |
"\n",
|
|
407 |
"\n",
|
|
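408 |
"Now we use **information entropy** to build the decision tree. The 14 samples fall into two classes, 'visitor goes to the amusement park' (9 samples) and 'visitor does not go to the amusement park' (5 samples), so K = 2.\n",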
|
| 195 | 409 |
"\n",
|
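| 196 | 410 |
"Let $p_1$ and $p_2$ denote the probabilities of 'visitor goes to the amusement park' and 'visitor does not go to the amusement park', respectively. Clearly $p_1=\\frac{9}{14}$ and $p_2=\\frac{5}{14}$, so the information entropy of these 14 samples is:\n",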
|
| 197 | 411 |
"\n",
|
|
| 202 | 416 |
"cell_type": "markdown",
|
| 203 | 417 |
"metadata": {},
|
| 204 | 418 |
"source": [
|
| 205 | |
"We can filter the rows of a DataFrame by a condition as follows."
|
| 206 | |
]
|
| 207 | |
},
|
| 208 | |
{
|
| 209 | |
"cell_type": "code",
|
| 210 | |
"execution_count": null,
|
| 211 | |
"metadata": {},
|
| 212 | |
"outputs": [],
|
| 213 | |
"source": [
|
| 214 | |
"# Example: select the rows where 是否前往游乐场 == '0'\n",
|
| 215 | |
"df[df['是否前往游乐场']=='0']"
|
|
419 |
"We can filter the rows of a DataFrame by a condition as follows."
|
|
420 |
]
|
|
421 |
},
|
|
422 |
{
|
|
423 |
"cell_type": "code",
|
|
424 |
"execution_count": null,
|
|
425 |
"metadata": {},
|
|
426 |
"outputs": [],
|
|
427 |
"source": [
|
|
428 |
"# Example: select the rows where 是否前往游乐场 == '0'\n",
|
|
429 |
"df[df['是否前往游乐场']=='0']\n"
|
| 216 | 430 |
]
|
| 217 | 431 |
},
|
| 218 | 432 |
{
|
|
| 234 | 448 |
"count_dict = {'前往':df[df['是否前往游乐场']=='1'].shape[0], '不前往':df[df['是否前往游乐场']=='0'].shape[0]}\n",
|
| 235 | 449 |
"# Compute the information entropy\n",
|
| 236 | 450 |
"entropy = calc_entropy(total_num, count_dict)\n",
|
| 237 | |
"entropy"
|
|
451 |
"entropy\n"
|
| 238 | 452 |
]
|
| 239 | 453 |
},
|
| 240 | 454 |
{
|
|
| 278 | 492 |
"outputs": [],
|
| 279 | 493 |
"source": [
|
| 280 | 494 |
"# Select the samples where the weather is sunny (晴) and the visitor goes to the amusement park\n",
|
| 281 | |
"df[(df['天气']=='晴') & (df['是否前往游乐场']=='1')]"
|
|
495 |
"df[(df['天气']=='晴') & (df['是否前往游乐场']=='1')]\n"
|
| 282 | 496 |
]
|
| 283 | 497 |
},
|
| 284 | 498 |
{
|
|
| 289 | 503 |
"source": [
|
| 290 | 504 |
"# Total number of sunny (晴) days\n",
|
| 291 | 505 |
"total_num_sun = df[df['天气']=='晴'].shape[0]\n",
|
|
506 |
"\n",
|
| 292 | 507 |
"# Number of samples that go / do not go to the amusement park when the weather is sunny (晴)\n",
|
| 293 | |
"count_dict_sun = {'前往':df[(df['天气']=='晴') & (df['是否前往游乐场']=='1')].shape[0], \n",
|
| 294 | |
" '不前往':df[(df['天气']=='晴') & (df['是否前往游乐场']=='0')].shape[0]}\n",
|
|
508 |
"count_dict_sun = {'前往':df[(df['天气']=='晴') & (df['是否前往游乐场']=='1')].shape[0],\n",
|
|
509 |
" '不前往':df[(df['天气']=='晴') & (df['是否前往游乐场']=='0')].shape[0]}\n",
|
| 295 | 510 |
"print(count_dict_sun)\n",
|
|
511 |
"\n",
|
| 296 | 512 |
"# Entropy of the subset where the weather is sunny (晴)\n",
|
| 297 | 513 |
"ent_sun = calc_entropy(total_num_sun, count_dict_sun)\n",
|
| 298 | 514 |
"print('Entropy for weather = sunny (晴): %s' % ent_sun)\n"
|
|
| 306 | 522 |
"source": [
|
| 307 | 523 |
"# Total number of overcast (多云) days\n",
|
| 308 | 524 |
"total_num_cloud = df[df['天气']=='多云'].shape[0]\n",
|
|
525 |
"\n",
|
| 309 | 526 |
"# Number of samples that go / do not go to the amusement park when the weather is overcast (多云)\n",
|
| 310 | |
"count_dict_cloud = {'前往':df[(df['天气']=='多云') & (df['是否前往游乐场']=='1')].shape[0], \n",
|
| 311 | |
" '不前往':df[(df['天气']=='多云') & (df['是否前往游乐场']=='0')].shape[0]}\n",
|
|
527 |
"count_dict_cloud = {'前往':df[(df['天气']=='多云') & (df['是否前往游乐场']=='1')].shape[0],\n",
|
|
528 |
" '不前往':df[(df['天气']=='多云') & (df['是否前往游乐场']=='0')].shape[0]}\n",
|
| 312 | 529 |
"print(count_dict_cloud)\n",
|
|
530 |
"\n",
|
| 313 | 531 |
"# Entropy of the subset where the weather is overcast (多云)\n",
|
| 314 | 532 |
"ent_cloud = calc_entropy(total_num_cloud, count_dict_cloud)\n",
|
| 315 | |
"print('Entropy for weather = overcast (多云): %s' % ent_cloud)"
|
|
533 |
"print('Entropy for weather = overcast (多云): %s' % ent_cloud)\n"
|
| 316 | 534 |
]
|
| 317 | 535 |
},
|
| 318 | 536 |
{
|
|
| 323 | 541 |
"source": [
|
| 324 | 542 |
"# Total number of rainy (雨) days\n",
|
| 325 | 543 |
"total_num_rain = df[df['天气']=='雨'].shape[0]\n",
|
|
544 |
"\n",
|
| 326 | 545 |
"# Number of samples that go / do not go to the amusement park when the weather is rainy (雨)\n",
|
| 327 | |
"count_dict_rain = {'前往':df[(df['天气']=='雨') & (df['是否前往游乐场']=='1')].shape[0], \n",
|
| 328 | |
" '不前往':df[(df['天气']=='雨') & (df['是否前往游乐场']=='0')].shape[0]}\n",
|
|
546 |
"count_dict_rain = {'前往':df[(df['天气']=='雨') & (df['是否前往游乐场']=='1')].shape[0],\n",
|
|
547 |
" '不前往':df[(df['天气']=='雨') & (df['是否前往游乐场']=='0')].shape[0]}\n",
|
| 329 | 548 |
"print(count_dict_rain)\n",
|
|
549 |
"\n",
|
| 330 | 550 |
"# Entropy of the subset where the weather is rainy (雨)\n",
|
| 331 | 551 |
"ent_rain = calc_entropy(total_num_rain, count_dict_rain)\n",
|
| 332 | |
"print('Entropy for weather = rain (雨): %s' % ent_rain)"
|
|
552 |
"print('Entropy for weather = rain (雨): %s' % ent_rain)\n"
|
| 333 | 553 |
]
|
| 334 | 554 |
},
|
| 335 | 555 |
{
|
|
| 355 | 575 |
"cell_type": "markdown",
|
| 356 | 576 |
"metadata": {},
|
| 357 | 577 |
"source": [
|
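| 358 | |
"Suppose a sample set $D$ consists of $K$ messages, and let $p_k$ ($1 \\le k \\le K$) denote the probability that the $k$-th message occurs. \n",
|
| 359 | |
"The information entropy of these $K$ messages is: \n",
|
| 360 | |
"$$E(D)=-\\sum_{k=1}^{K}p_k \\log_{2} p_k$$\n",
|
| 361 | |
"\n",
|
| 362 | |
"Note that **all the $p_k$ sum to 1**."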
|
| 363 | |
]
|
| 364 | |
},
|
| 365 | |
{
|
| 366 | |
"cell_type": "markdown",
|
| 367 | |
"metadata": {},
|
| 368 | |
"source": [
|
| 369 | 578 |
"Use the formula above to compute the information gain."
|
| 370 | 579 |
]
|
| 371 | 580 |
},
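As a cross-check on the step-by-step cells, the information gain of the 天气 attribute can also be computed in a few lines without pandas. This is a minimal sketch, not the notebook's own `calc_entropy`: the helper names `entropy` and `information_gain` are hypothetical, and the two lists are copied by hand from the 14-row table shown earlier, with the labels written as integers rather than the strings the notebook compares against.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of `labels` minus the weighted entropy after splitting on `values`."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average of the entropy of each subset produced by the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# 天气 and 是否前往游乐场 columns, copied from the 14-row table (assumed transcription).
weather = ['晴', '晴', '多云', '雨', '雨', '雨', '多云',
           '晴', '晴', '雨', '晴', '多云', '多云', '雨']
went = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]

print(round(entropy(went), 4))
print(round(information_gain(weather, went), 4))
```

With 9 positive and 5 negative samples the overall entropy is roughly 0.94, and splitting on 天气 gives a gain of roughly 0.247; the overcast subset contributes nothing to the remainder because it contains a single class.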
|
|
| 376 | 585 |
"outputs": [],
|
| 377 | 586 |
"source": [
|
| 378 | 587 |
"# Information gain\n",
|
| 379 | |
"gain = entropy - (total_num_sun/total_num*ent_sun + \n",
|
| 380 | |
" total_num_cloud/total_num*ent_cloud + \n",
|
|
588 |
"gain = entropy - (total_num_sun/total_num*ent_sun +\n",
|
|
589 |
" total_num_cloud/total_num*ent_cloud +\n",
|
| 381 | 590 |
" total_num_rain/total_num*ent_rain)\n",
|
| 382 | |
"gain"
|
|
591 |
"gain\n"
|
| 383 | 592 |
]
|
| 384 | 593 |
},
|
| 385 | 594 |
{
|
|
| 477 | 686 |
"# Show the class labels\n",
|
| 478 | 687 |
"print(list(iris.target_names))\n",
|
| 479 | 688 |
"# Show the feature names\n",
|
| 480 | |
"print(iris.feature_names)"
|
|
689 |
"print(iris.feature_names)\n"
|
| 481 | 690 |
]
|
| 482 | 691 |
},
|
| 483 | 692 |
{
|
|
| 506 | 715 |
"# Load the data\n",
|
| 507 | 716 |
"X, y = load_iris(return_X_y=True)\n",
|
| 508 | 717 |
"# Split into training and test sets\n",
|
| 509 | |
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)"
|
|
718 |
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)\n"
|
| 510 | 719 |
]
|
| 511 | 720 |
},
|
| 512 | 721 |
{
|
|
| 527 | 736 |
"# Initialize the model; try different max_depth values to see how the model's performance changes\n",
|
| 528 | 737 |
"clf = tree.DecisionTreeClassifier(random_state=42, max_depth=2)\n",
|
| 529 | 738 |
"# Train the model\n",
|
| 530 | |
"clf = clf.fit(X_train, y_train)"
|
|
739 |
"clf = clf.fit(X_train, y_train)\n"
|
| 531 | 740 |
]
|
| 532 | 741 |
},
|
| 533 | 742 |
{
|
|
| 547 | 756 |
"feature_names = ['萼片长度','萼片宽度','花瓣长度','花瓣宽度']\n",
|
| 548 | 757 |
"target_names = ['山鸢尾', '杂色鸢尾', '维吉尼亚鸢尾']\n",
|
| 549 | 758 |
"# Visualize the fitted decision tree\n",
|
| 550 | |
"dot_data = tree.export_graphviz(clf, out_file=None, \n",
|
| 551 | |
" feature_names=feature_names, \n",
|
| 552 | |
" class_names=target_names, \n",
|
| 553 | |
" filled=True, rounded=True, \n",
|
| 554 | |
" special_characters=True) \n",
|
| 555 | |
"graph = graphviz.Source(dot_data) \n",
|
| 556 | |
"graph "
|
|
759 |
"dot_data = tree.export_graphviz(clf, out_file=None,\n",
|
|
760 |
" feature_names=feature_names,\n",
|
|
761 |
" class_names=target_names,\n",
|
|
762 |
" filled=True, rounded=True,\n",
|
|
763 |
" special_characters=True)\n",
|
|
764 |
"graph = graphviz.Source(dot_data)\n",
|
|
765 |
"graph\n"
|
| 557 | 766 |
]
|
| 558 | 767 |
},
|
| 559 | 768 |
{
|
|
| 571 | 780 |
"source": [
|
| 572 | 781 |
"from sklearn.metrics import accuracy_score\n",
|
| 573 | 782 |
"y_test_predict = clf.predict(X_test)\n",
|
| 574 | |
"accuracy_score(y_test,y_test_predict)"
|
|
783 |
"accuracy_score(y_test,y_test_predict)\n"
|
| 575 | 784 |
]
|
| 576 | 785 |
},
|
| 577 | 786 |
{
|
|
| 609 | 818 |
"        # Read the file line by line and append each line\n",
|
| 610 | 819 |
"        for line in f.readlines():\n",
|
| 611 | 820 |
"            contents += line\n",
|
| 612 | |
"    return contents"
|
|
821 |
"    return contents\n"
|
| 613 | 822 |
]
|
| 614 | 823 |
},
|
| 615 | 824 |
{
|
|
| 678 | 887 |
"    ent = -sum([(p / word_number) * log(p / word_number, 2) for p in\n",
|
| 679 | 888 |
"                word_counter.values()])\n",
|
| 680 | 889 |
"    print('Information entropy: %.2f' % ent)\n",
|
| 681 | |
"    return ent"
|
| 682 | |
]
|
| 683 | |
},
|
| 684 | |
{
|
| 685 | |
"cell_type": "code",
|
| 686 | |
"execution_count": null,
|
| 687 | |
"metadata": {},
|
| 688 | |
"outputs": [],
|
| 689 | |
"source": [
|
| 690 | |
"ent = cal_essay_entropy(ch_essay)"
|
| 691 | |
]
|
| 692 | |
},
|
| 693 | |
{
|
| 694 | |
"cell_type": "code",
|
| 695 | |
"execution_count": null,
|
| 696 | |
"metadata": {},
|
| 697 | |
"outputs": [],
|
| 698 | |
"source": [
|
| 699 | |
"ent = cal_essay_entropy(en_essay, split_by = ' ')"
|
|
890 |
"    return ent\n"
|
|
891 |
]
|
|
892 |
},
|
|
893 |
{
|
|
894 |
"cell_type": "code",
|
|
895 |
"execution_count": null,
|
|
896 |
"metadata": {},
|
|
897 |
"outputs": [],
|
|
898 |
"source": [
|
|
899 |
"ent = cal_essay_entropy(ch_essay)\n"
|
|
900 |
]
|
|
901 |
},
|
|
902 |
{
|
|
903 |
"cell_type": "code",
|
|
904 |
"execution_count": null,
|
|
905 |
"metadata": {},
|
|
906 |
"outputs": [],
|
|
907 |
"source": [
|
|
908 |
"ent = cal_essay_entropy(en_essay, split_by = ' ')\n"
|
| 700 | 909 |
]
|
| 701 | 910 |
},
|
| 702 | 911 |
{
|