Reference:
Decision tree (classification tree, regression tree)
Decision tree: the diagrams in this blog are really nice and easy to understand, haha.
Detailed explanation of decision tree
A decision tree is a supervised learning algorithm that is commonly used for both classification and regression. This article only discusses the classification problem.
A decision tree model is a tree structure used for classification and regression. It consists of nodes and directed edges: usually one root node, several internal nodes, and several leaf nodes. Decision making starts at the root node, compares the sample to be classified against the feature tested at that node, and follows the branch that matches the comparison result, repeating the process until a leaf node is reached; the leaf node gives the final decision.
In short, a decision tree is a multi-class classification model that makes decisions with a tree structure.
In order to find the optimal splitting feature, we first need some background from information theory:
Purity:
You can think of building a decision tree as the process of searching for pure splits. Purity can be expressed mathematically; another way to describe it is that a pure split minimizes the variation of the target variable within each branch.
Uncertainty of information:
In information theory, the outcome of a discrete random event is uncertain. To measure this uncertainty, Claude Shannon, the father of information theory, introduced the concept of information entropy.
The greater the uncertainty, the greater the amount of information and the higher the information entropy.
The greater the information entropy, the lower the purity. When all the classes in a set are mixed evenly, the information entropy is largest and the purity is lowest.
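As a minimal sketch (my own code, not taken from the referenced posts), the entropy of a set of labels can be computed from the class proportions as H = -Σ p_k · log2(p_k):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H = -sum(p_k * log2(p_k)) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# An evenly mixed set has the highest entropy (lowest purity); a pure set has entropy 0.
print(entropy(["yes", "no", "yes", "no"]))    # 1.0
print(entropy(["yes", "yes", "yes", "yes"]))  # -0.0
```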
There are three classic impurity measures: information gain (the ID3 algorithm), information gain ratio (the C4.5 algorithm), and the Gini index (the CART algorithm).
Information gain:
Information gain measures how much a split improves purity, i.e., how much it reduces information entropy. It is computed as the information entropy of the parent node minus the weighted sum of the information entropies of the child nodes, where each child is weighted by its share of the parent's samples.
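Continuing the sketch above (the function name is illustrative, not from the original post), the information gain of splitting on one discrete feature could look like this:

```python
def information_gain(labels, feature_values):
    """Entropy of the parent minus the weighted entropy of the children
    produced by splitting on the values of one feature."""
    labels = np.asarray(labels)
    feature_values = np.asarray(feature_values)
    weighted_child_entropy = 0.0
    for v in np.unique(feature_values):
        child = labels[feature_values == v]
        weighted_child_entropy += len(child) / len(labels) * entropy(child)
    return entropy(labels) - weighted_child_entropy
```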
Information gain ratio
Information gain ratio = information gain / attribute entropy, where the attribute entropy is the entropy of the feature's own value distribution (also called the split information).
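A hedged sketch of the gain ratio, reusing the two helpers above; the zero check is my own guard rather than something stated in the post:

```python
def gain_ratio(labels, feature_values):
    """Information gain divided by the entropy of the feature's own values
    (the attribute entropy, or split information)."""
    attribute_entropy = entropy(feature_values)
    if attribute_entropy == 0:
        return 0.0  # the feature has a single value, so splitting on it is useless
    return information_gain(labels, feature_values) / attribute_entropy
```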
Gini index
Gini index (Gini impurity): the probability that a sample drawn at random from the set is misclassified when it is labeled at random according to the set's class distribution.
That is, Gini index (Gini impurity) = probability of a sample being selected × probability of that sample being misclassified, summed over all classes: Gini = Σ p_k (1 - p_k) = 1 - Σ p_k².
The Gini index has the same character as information entropy: it measures the uncertainty of a random variable;
the larger the Gini value, the higher the uncertainty of the data;
the smaller the Gini value, the lower the uncertainty of the data;
Gini = 0 means that all samples in the data set belong to the same class.
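A small illustrative implementation of the Gini impurity, in the same NumPy setup as the sketches above:

```python
def gini(labels):
    """Gini impurity: 1 - sum(p_k^2); zero when all samples share one class."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(["yes", "no", "yes", "no"]))      # 0.5, evenly mixed
print(gini(["yes", "yes", "yes", "yes"]))    # 0.0, pure
```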
Detailed reference: machine learning-Gini index
The ID3 algorithm is based on Occam's razor (prefer the simpler of two models that do the job): the smaller the decision tree, the better.
The core of the ID3 algorithm is to select the splitting feature at each node of the decision tree according to its information gain, and then construct the tree recursively. The algorithm uses a top-down greedy search to traverse the space of possible decision trees.
Specific methods:
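As a rough illustration of ID3's greedy criterion (the function and variable names are mine, reusing the information_gain helper above), the feature chosen at each node is simply the one whose split yields the largest information gain:

```python
def best_feature_id3(data, labels):
    """Return the column index whose split gives the largest information gain
    (ID3's greedy selection). `data` is a 2-D array of discrete feature values."""
    data = np.asarray(data)
    gains = [information_gain(labels, data[:, j]) for j in range(data.shape[1])]
    return int(np.argmax(gains))
```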
Limitations of ID3: information gain favors features that take many distinct values, a bias that the next algorithm addresses.
C4.5 is similar to ID3, but its major feature is that it overcomes ID3's bias toward features with many values by introducing the information gain ratio as the splitting criterion.
C4.5 is implemented as an improvement on top of ID3.
The information gain ratio tends to favor features with low attribute entropy (the smaller the denominator, the larger the ratio), so C4.5 does not simply split on the feature with the largest gain ratio. Instead it uses a heuristic: from the candidate splitting features, first keep those whose information gain is above average, and then choose the one with the highest gain ratio.
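A possible sketch of that heuristic, reusing the gain and gain-ratio helpers above (the function name is illustrative):

```python
def best_feature_c45(data, labels):
    """C4.5 heuristic: restrict to features with above-average information gain,
    then pick the one with the highest gain ratio among them."""
    data = np.asarray(data)
    gains = [information_gain(labels, data[:, j]) for j in range(data.shape[1])]
    candidates = [j for j, g in enumerate(gains) if g >= np.mean(gains)]
    return max(candidates, key=lambda j: gain_ratio(labels, data[:, j]))
```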
C4.5 limitations:
The decision trees generated by ID3 and C4.5 tend to have many branches and grow large. The binary splits used by the CART algorithm keep the tree smaller and make tree construction more efficient.
CART (Classification And Regression Tree) can be used for both classification and regression; this part focuses on its classification tree. Unlike ID3 and C4.5, CART assumes that the decision tree is a binary tree: each internal node tests whether a feature takes a given value, the left branch is the "yes" branch and the right branch is the "no" branch. Such a decision tree recursively splits on one feature at a time and partitions the input space (that is, the feature space) into a finite number of cells.
CART's classification tree uses the Gini index to select the optimal splitting feature and the optimal cut point for that feature, as sketched below.
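Here is a rough illustration (my own code, reusing the gini helper above): CART enumerates candidate binary tests of the form "feature j equals value v" and keeps the split with the lowest weighted Gini impurity.

```python
def weighted_gini(labels, mask):
    """Gini impurity of a binary split defined by a boolean mask."""
    labels = np.asarray(labels)
    left, right = labels[mask], labels[~mask]
    return len(left) / len(labels) * gini(left) + len(right) / len(labels) * gini(right)

def best_cart_split(data, labels):
    """Try every (feature, value) binary test and keep the one whose
    split has the lowest weighted Gini impurity."""
    data = np.asarray(data)
    best = None
    for j in range(data.shape[1]):
        for v in np.unique(data[:, j]):
            mask = data[:, j] == v
            if mask.all() or not mask.any():
                continue  # degenerate split that does not separate the samples
            score = weighted_gini(labels, mask)
            if best is None or score < best[0]:
                best = (score, j, v)
    return best  # (weighted Gini, feature index, split value)
```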
Pruning "slims down" the decision tree. The goal of this step is to keep good classification results without an excessive number of tests; its purpose is to prevent over-fitting.
Over-fitting: the model fits the training data "too well", so it becomes "rigid" in actual use and makes classification errors on new data.
Under-fitting: the model does not fit even the training data well.
Pruning methods: pre-pruning (stop splitting while the tree is being grown) and post-pruning (grow the full tree first, then cut back branches that do not help).
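As an illustration rather than part of the original post, scikit-learn's DecisionTreeClassifier supports both styles: pre-pruning through limits such as max_depth and min_samples_leaf, and post-pruning through cost-complexity pruning (ccp_alpha):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth early with depth / leaf-size limits.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then cut it back with cost-complexity pruning.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
post_pruned.fit(X_train, y_train)

print(pre_pruned.score(X_test, y_test), post_pruned.score(X_test, y_test))
```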
Reference: machine learning decision tree (1) -ID3, C4.5, CART (very detailed)
More models will continue to be added and updated...