=== finding best splits in decision trees === Notation: We let TTTF stand for a population of four, three instances of which are classified as T and one as F, etc. -- Examples of Gini purity -- Population Gini purity ---------------------------------- TTT 0 TF 1 - (1/4+1/4) = 1/2 TTF 1 - (1/9+4/9) = 4/9 TTTF 1 - (1/16+9/16) = 3/8 TTTTF 1 - (1/25+16/25)= 8/25 --- Example of choosing best split (wrt. Gini purity) Let us have a population 1: T, 2: T, 3: F, 4: T which we can split in two ways: A: (1: T, 2: T) and (3: F, 4: T) B: (1: T, 2: T, 3: F) and (4: T) B might seem best, in that one component is totally pure, and the other (TTF) is "more pure" than the FT component of A. However, A has the highest information gain, as demonstrated by the below calculation. Intuitively, this is because the totally pure component of A (TT) is larger than the totally pure component of B. --------------------------------------------------------------------------- A B -------------------------------------------------- Purity 1/2 x 0 + 1/2 x 1/2 = 1/4 x 0 + 3/4 x 4/9 1/4 1/3 Information-gain 3/8 - 1/4 = 3/8 - 1/3 = 1/8 > 1/24 === ==== Information-content 1/2 log(2) + 1/2 log(2) = 1/4 log(4) + 3/4 log(4/3) = 1 2 - 3/4 log(3) = 0.811 Information-gain-ratio 0.125 > 0.051 ===== ===== --------------------------------------------------------------------------- -- motivating need to take information content into account Let us have a continuous population where N items are less than a given threshold W N items are greater than W 3/4 of the items less than W are classified as T 1/4 of the items less than W are classified as T We can split in two ways: A binary, along W B fine-grained, making all classes pure --------------------------------------------------------------------------- A B -------------------------------------------------- Purity 1/2 x 3/8 + 1/2 x 3/8 = 3/8 0 Information-gain 1/2 - 3/8 = 1/2 - 0 = 1/8 < 1/2 Information-content 1/2 log(2) + 1/2 log(2) = log(2n) 1 Information-gain-ratio 1/8 1/(2 log(2n)) === ============= ------------------------------------------------------------------- So for sufficiently large n, option A is to be preferred, as the extra purity in B is not worth the more fine-grained split.