Principle and algorithm of decision tree
A decision tree is basically a summary of our past experience. I have prepared a small training set for the basketball example: when we want to go out to play basketball, we usually judge according to the weather, temperature, humidity and whether it is windy, and finally reach a conclusion: play basketball, or not?

The picture above shows a typical decision tree. Making a decision tree involves two stages: construction and pruning.

Construction generates a complete decision tree. Simply put, construction is the process of choosing which attributes to use as nodes, so three kinds of nodes appear during construction (a small code sketch of such a tree follows the list):

Root node: the topmost, first node of the tree. In the picture above, "weather" is the root node;

Internal nodes: the nodes in the middle of the tree, such as "temperature", "humidity" and "windy";

Leaf node: the bottom node of the tree, which gives the result of the decision.
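As a concrete illustration, here is a minimal Python sketch of such a tree written as nested conditionals. The branch values are assumptions made for illustration (the real splits are defined by the figure), but the structure shows where the root node, internal nodes and leaf nodes sit.

```python
# Hypothetical basketball decision tree written as nested if/else branches.
# The concrete split values are assumed; only the structure matters here.
def play_basketball(weather, temperature, humidity, windy):
    """Return True if the decision is to go out and play."""
    if weather == "sunny":                 # root node: weather
        if humidity == "high":             # internal node: humidity
            return False                   # leaf node: don't play
        return True                        # leaf node: play
    elif weather == "rainy":
        if windy:                          # internal node: windy
            return False
        return True
    else:                                  # cloudy
        return temperature != "high"       # internal node: temperature

print(play_basketball("sunny", "mild", "high", False))  # False
```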

Pruning is to slim down the decision tree and prevent over-fitting. It comes in two forms: "pre-pruning" and "post-pruning".

Pre-pruning means pruning while the decision tree is being constructed. Each node is evaluated during construction; if splitting a node cannot improve accuracy on the validation set, splitting it is pointless, so the current node is treated as a leaf node and is not split further.

Post-pruning means pruning after the decision tree has been generated. It usually starts from the leaf nodes and evaluates each node step by step upward. If there is little difference in classification accuracy between cutting a node's subtree and keeping it, or if cutting the subtree improves accuracy on the validation set, then the subtree can be pruned.
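To make the idea concrete, here is a rough Python sketch of validation-set based post-pruning. The Node class and the bottom-up traversal are illustrative assumptions, not the exact procedure from the original lecture.

```python
# Sketch: replace a subtree with a leaf whenever that does not hurt
# (or improves) accuracy on a held-out validation set.
class Node:
    def __init__(self, attribute=None, children=None, label=None):
        self.attribute = attribute      # split attribute (internal nodes)
        self.children = children or {}  # attribute value -> child Node
        self.label = label              # majority class at this node

    @property
    def is_leaf(self):
        return not self.children


def predict(node, row):
    while not node.is_leaf:
        # fall back to the node's majority label for unseen attribute values
        node = node.children.get(row[node.attribute], Node(label=node.label))
    return node.label


def accuracy(node, rows):
    return sum(predict(node, r) == r["label"] for r in rows) / len(rows)


def post_prune(node, validation_rows):
    if node.is_leaf:
        return node
    # prune the children first (bottom-up evaluation)
    node.children = {v: post_prune(c, validation_rows)
                     for v, c in node.children.items()}
    as_leaf = Node(label=node.label)    # candidate: cut this subtree
    if accuracy(as_leaf, validation_rows) >= accuracy(node, validation_rows):
        return as_leaf                  # pruning does not hurt validation accuracy
    return node
```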

Both under-fitting and over-fitting will lead to classification errors.

One cause of over-fitting is a small training set. If the decision tree selects too many attributes, it can classify the training samples "perfectly", but characteristics that only some of the training data happen to have are then treated as characteristics of all data. Since they are not, the decision tree makes mistakes when classifying real data; in other words, the model's "generalization ability" is poor.

Information entropy is defined as Entropy(t) = -Σi p(i|t) · log2 p(i|t), where p(i|t) is the probability that a sample at node t belongs to class i, and log2 is the logarithm with base 2. The point here is not the formula itself, but that there is a measure that reflects the uncertainty of information: the greater the uncertainty, the greater the information content and the higher the information entropy.
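As a small sketch, the entropy of a set of class labels can be computed directly from this formula:

```python
import math

def entropy(labels):
    """Information entropy: -sum over classes of p(i) * log2 p(i)."""
    n = len(labels)
    probabilities = [labels.count(c) / n for c in set(labels)]
    return sum(-p * math.log2(p) for p in probabilities)

print(entropy(["yes", "no"]))          # 1.0 -> maximum uncertainty for two classes
print(entropy(["yes", "yes", "yes"]))  # 0.0 -> a pure node, no uncertainty
```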

The ID3 algorithm selects attributes by information gain: a good split increases purity and reduces information entropy. Information gain is the information entropy of the parent node minus the weighted sum of the information entropies of the child nodes, each child weighted by the fraction of samples it receives.

Written out, Gain(D, a) = Entropy(D) - Σi (|Di| / |D|) · Entropy(Di), where D is the parent node, Di are the child nodes, and a in Gain(D, a) is the attribute chosen to split node D.
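Building on the entropy helper sketched above, a minimal information-gain computation for a categorical attribute could look like this (the `rows` / `target` data layout is an assumption for illustration):

```python
def information_gain(rows, attribute, target="label"):
    """ID3 information gain: Entropy(D) - sum_i |Di|/|D| * Entropy(Di)."""
    parent_entropy = entropy([r[target] for r in rows])
    weighted_children = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        weighted_children += len(subset) / len(rows) * entropy(subset)
    return parent_entropy - weighted_children
```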

ID3 tends to favour attributes with many distinct values when computing information gain. To avoid this problem, C4.5 uses the information gain rate to select attributes: information gain rate = information gain / attribute entropy (the detailed formula is omitted here).

When an attribute has many values, it effectively splits the data into many parts. Although the information gain becomes larger, for C4.5 the attribute entropy also becomes larger, so the overall information gain rate does not.
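Under the same assumed data layout, the gain rate is simply the information gain divided by the entropy of the attribute's own value distribution (the "attribute entropy"), reusing the helpers above:

```python
def gain_ratio(rows, attribute, target="label"):
    """C4.5 information gain rate = information gain / attribute entropy."""
    attribute_entropy = entropy([r[attribute] for r in rows])  # entropy of the attribute values
    if attribute_entropy == 0:
        return 0.0  # a single-valued attribute cannot split anything
    return information_gain(rows, attribute, target) / attribute_entropy
```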

A decision tree constructed by ID3 easily over-fits. C4.5 therefore applies pessimistic error pruning (PEP) after the decision tree is constructed, which improves the generalization ability of the tree.

Pessimistic pruning is one of the post-pruning techniques. It recursively estimates the classification error rate of each internal node, compares the error rate before and after pruning that node, and decides whether to prune. Unlike validation-based pruning, this method does not require a separate test data set.
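As a hedged sketch of the usual PEP criterion (a 0.5 continuity correction per leaf, pruning when the collapsed node is within one standard error of the subtree), the per-node test can be written as follows; the exact criterion is an assumption based on the common formulation, not spelled out in the original text:

```python
import math

def should_prune_pep(subtree_errors, num_leaves, node_errors, num_samples):
    """Pessimistic error pruning test for a single internal node.

    subtree_errors: training errors made by the subtree's leaves
    num_leaves:     number of leaves in the subtree
    node_errors:    training errors if the node is collapsed to one leaf
    num_samples:    training samples reaching the node
    """
    e_subtree = subtree_errors + 0.5 * num_leaves   # corrected subtree error
    e_leaf = node_errors + 0.5                      # corrected error of the single leaf
    std_err = math.sqrt(e_subtree * (num_samples - e_subtree) / num_samples)
    # prune if the leaf is no worse than the subtree plus one standard error
    return e_leaf <= e_subtree + std_err
```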

C4.5 can handle continuous attributes by discretizing them. For example, the "humidity" attribute in the basketball example is not split into "high / medium" categories but is handled by its numeric value, so any humidity value can occur. How is the threshold chosen? C4.5 selects the threshold whose partition yields the highest information gain.
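Reusing the entropy helper from above, a simple way to pick that threshold is to try the midpoints between consecutive observed values and keep the one with the highest information gain (again a sketch, with an assumed row layout):

```python
def best_threshold(rows, attribute, target="label"):
    """Choose the numeric threshold whose binary split has the highest information gain."""
    parent_entropy = entropy([r[target] for r in rows])
    values = sorted(set(r[attribute] for r in rows))
    best_gain, best_t = -1.0, None
    for low, high in zip(values, values[1:]):
        t = (low + high) / 2                       # candidate threshold: midpoint
        left = [r[target] for r in rows if r[attribute] <= t]
        right = [r[target] for r in rows if r[attribute] > t]
        gain = parent_entropy - (len(left) / len(rows) * entropy(left)
                                 + len(right) / len(rows) * entropy(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain
```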

C4.5 can also handle data sets with missing values.


Let's use the following example to walk through the decision-tree process. Assuming the data on good apples is as follows, use the ID3 algorithm to build a decision tree for good apples.

The information gain of "red" is 1, and the information gain of "big" is 0.

So "red" is chosen as the root node; "big" provides no information gain, so it is not used for splitting and is pruned away.
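The original apple table is not reproduced here, but a four-sample dataset consistent with those gains (red perfectly determines the label, "big" is independent of it) reproduces the result with the helpers above; the rows themselves are an assumption for illustration:

```python
apples = [
    {"red": "yes", "big": "yes", "label": "good"},
    {"red": "yes", "big": "no",  "label": "good"},
    {"red": "no",  "big": "yes", "label": "bad"},
    {"red": "no",  "big": "no",  "label": "bad"},
]

print(information_gain(apples, "red"))  # 1.0 -> "red" becomes the root node
print(information_gain(apples, "big"))  # 0.0 -> "big" adds nothing and is dropped
```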

Source: Data Analysis in Practice (45 Lectures), Lecture 17, Decision Tree (I): Do you want to play basketball? The decision tree will tell you.