GINI
Gini impurity is the splitting criterion used by CART (Classification and Regression Trees). It measures how often a randomly chosen sample from a node would be misclassified if we labeled it at random according to the distribution of classes in that node. A pure node (all samples in one class) has a Gini score of 0. The closer the score is to 0, the purer the node.
The formula is:
$$\text{Gini} = 1 - \sum_{i=1}^{C} p_i^2$$
Where:
- $C$ = number of classes
- $p_i$ = proportion (frequency) of class $i$ among the data at that node
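The misclassification reading and the formula describe the same quantity. A short derivation, assuming the sample's true class and the label assigned to it are drawn independently from the node's class distribution (here $p_i$ is the proportion of class $i$ and $C$ the number of classes):

```latex
\begin{aligned}
P(\text{misclassified}) &= \sum_{i=1}^{C} p_i \,(1 - p_i) \\
                        &= \sum_{i=1}^{C} p_i - \sum_{i=1}^{C} p_i^2 \\
                        &= 1 - \sum_{i=1}^{C} p_i^2
\end{aligned}
```

The last step uses the fact that the class proportions sum to 1.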
Interpretation:
- If all samples at the node belong to one class → Gini = 0 (pure node).
- The higher the Gini value, the more mixed the classes are at that node. With two classes the maximum is 0.5, reached when the node is split evenly.
Example:
Suppose a node has 10 samples: 7 positive and 3 negative.
$$\text{Gini} = 1 - (0.7^2 + 0.3^2) = 1 - (0.49 + 0.09) = 0.42$$
This shows the node is somewhat impure, since both classes are present.
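The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the function name `gini_impurity` and its input (a list of per-class sample counts) are assumptions made for this example.

```python
def gini_impurity(class_counts):
    """Gini impurity 1 - sum(p_i^2) for a node, given per-class sample counts."""
    total = sum(class_counts)
    if total == 0:
        raise ValueError("a node must contain at least one sample")
    # p_i is the share of class i among the samples at this node.
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


print(round(gini_impurity([7, 3]), 2))   # 7 positive, 3 negative -> 0.42
print(round(gini_impurity([10, 0]), 2))  # pure node -> 0.0
print(round(gini_impurity([5, 5]), 2))   # evenly mixed two-class node -> 0.5
```

Working from counts rather than raw labels keeps the function independent of how the classes are named.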
Toy dataset
We’ll predict Play (Yes/No) from three features: Outlook, Humidity, Wind.
ID | Outlook | Humidity | Wind | Play |
---|---|---|---|---|
1 | Sunny | High | Weak | No |
2 | Sunny | High | Strong | No |
3 | Overcast | High | Weak | Yes |
4 | Rain | High | Weak | Yes |
5 | Rain | Normal | Weak | Yes |
6 | Rain | Normal | Strong | No |
7 | Overcast | Normal | Strong | Yes |
8 | Sunny | Normal | Weak | Yes |
9 | Sunny | Normal | Strong | Yes |
10 | Overcast | High | Strong | Yes |
Totals: Yes = 7, No = 3.
Gini(node) = $1 - \sum_i p_i^2$
Root Gini = $1 - (0.7^2 + 0.3^2) = 0.42$
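To check the totals and the root impurity, the table can be encoded directly in Python. This is a sketch under assumptions: the tuple layout and the names `ROWS` and `gini_from_labels` are chosen for illustration and are not part of the original walkthrough.

```python
from collections import Counter

# (ID, Outlook, Humidity, Wind, Play), copied from the toy dataset table.
ROWS = [
    (1, "Sunny", "High", "Weak", "No"),
    (2, "Sunny", "High", "Strong", "No"),
    (3, "Overcast", "High", "Weak", "Yes"),
    (4, "Rain", "High", "Weak", "Yes"),
    (5, "Rain", "Normal", "Weak", "Yes"),
    (6, "Rain", "Normal", "Strong", "No"),
    (7, "Overcast", "Normal", "Strong", "Yes"),
    (8, "Sunny", "Normal", "Weak", "Yes"),
    (9, "Sunny", "Normal", "Strong", "Yes"),
    (10, "Overcast", "High", "Strong", "Yes"),
]


def gini_from_labels(labels):
    """Gini impurity of a node, computed from its list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


play = [row[4] for row in ROWS]   # the Play column
label_counts = Counter(play)      # class totals at the root
root_gini = gini_from_labels(play)

print("Yes =", label_counts["Yes"], "No =", label_counts["No"])
print("Root Gini =", round(root_gini, 2))
```

Running it should report 7 Yes, 3 No, and a root Gini of 0.42, matching the hand calculation.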
Step 1 — Evaluate first split (try every feature)
We compute the weighted Gini after splitting on each feature and pick the smallest.
Split by Outlook (Sunny / Overcast / Rain)
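- Sunny: {1,2,8,9} → Yes=2, No=2 → Gini = $1 - (0.5^2 + 0.5^2) = 0.5$
- Overcast: {3,7,10} → Yes=3, No=0 → Gini = 0
- Rain: {4,5,6} → Yes=2, No=1 → Gini = $1 - ((2/3)^2 + (1/3)^2) \approx 0.444$
Weighted Gini = $\frac{4}{10}(0.5) + \frac{3}{10}(0) + \frac{3}{10}(0.444) \approx 0.333$
Impurity reduction = $0.42 - 0.333 \approx 0.087$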
Split by Humidity (High / Normal)
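- High: {1,2,3,4,10} → Yes=3, No=2 → Gini = $1 - (0.6^2 + 0.4^2) = 0.48$
- Normal: {5,6,7,8,9} → Yes=4, No=1 → Gini = $1 - (0.8^2 + 0.2^2) = 0.32$
Weighted Gini = $\frac{5}{10}(0.48) + \frac{5}{10}(0.32) = 0.40$
Reduction = $0.42 - 0.40 = 0.02$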
Split by Wind (Weak / Strong)
- Weak: {1,3,4,5,8} → Yes=4, No=1 → Gini = 0.32
- Strong: {2,6,7,9,10} → Yes=3, No=2 → Gini = 0.48
Weighted Gini = $\frac{5}{10}(0.32) + \frac{5}{10}(0.48) = 0.40$
Reduction = $0.42 - 0.40 = 0.02$
Best first split: Outlook (lowest weighted Gini ≈ 0.333).
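Step 1 can be reproduced with a short script. This is a minimal sketch, assuming the table is encoded as a list of dictionaries; the names `DATA`, `gini`, and `weighted_gini` are illustrative and not taken from any library.

```python
from collections import Counter, defaultdict

COLUMNS = ("Outlook", "Humidity", "Wind", "Play")
RAW = [
    ("Sunny", "High", "Weak", "No"),          # ID 1
    ("Sunny", "High", "Strong", "No"),        # ID 2
    ("Overcast", "High", "Weak", "Yes"),      # ID 3
    ("Rain", "High", "Weak", "Yes"),          # ID 4
    ("Rain", "Normal", "Weak", "Yes"),        # ID 5
    ("Rain", "Normal", "Strong", "No"),       # ID 6
    ("Overcast", "Normal", "Strong", "Yes"),  # ID 7
    ("Sunny", "Normal", "Weak", "Yes"),       # ID 8
    ("Sunny", "Normal", "Strong", "Yes"),     # ID 9
    ("Overcast", "High", "Strong", "Yes"),    # ID 10
]
DATA = [dict(zip(COLUMNS, row)) for row in RAW]


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_gini(rows, feature, target="Play"):
    """Average child impurity after splitting on `feature`, weighted by child size."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())


root = gini([row["Play"] for row in DATA])
scores = {f: weighted_gini(DATA, f) for f in ("Outlook", "Humidity", "Wind")}
for feature, score in scores.items():
    print(f"{feature}: weighted Gini = {score:.3f}, reduction = {root - score:.3f}")

# The best split is the one with the lowest weighted Gini.
best = min(scores, key=scores.get)
print("Best first split:", best)
```

The script creates one child per category value, as the walkthrough does. Minimizing weighted Gini and maximizing impurity reduction select the same feature, because the root impurity is the same for every candidate.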
Current (partial) tree:
- Root: Outlook
  - Overcast → leaf Yes (pure)
  - Sunny → mixed (needs splitting)
  - Rain → mixed (needs splitting)
Step 2 — Split the Rain branch
Rain subset: {4,5,6} → Yes=2, No=1 → Gini ≈ 0.444
Try Humidity within Rain:
- High: {4} → Yes=1 → Gini = 0
- Normal: {5,6} → Yes=1, No=1 → Gini = 0.5
Weighted = $\frac{1}{3}(0) + \frac{2}{3}(0.5) \approx 0.333$
Try Wind within Rain:
- Weak: {4,5} → Yes=2 → Gini = 0
- Strong: {6} → No=1 → Gini = 0
Weighted = 0
Best split for Rain: Wind (perfectly pure).
Rain branch becomes:
- Rain → Wind
  - Weak → Yes
  - Strong → No
Step 3 — Split the Sunny branch
Sunny subset: {1,2,8,9} → Yes=2, No=2 → Gini = 0.5
Try Humidity within Sunny:
- High: {1,2} → No=2 → Gini = 0
- Normal: {8,9} → Yes=2 → Gini = 0
Weighted = 0
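(No other split can beat 0. For comparison, Wind within Sunny gives Weak: {1,8} → Yes=1, No=1 and Strong: {2,9} → Yes=1, No=1, for a weighted Gini of 0.5.)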
Best split for Sunny: Humidity (perfectly pure).
Sunny branch becomes:
- Sunny → Humidity
  - High → No
  - Normal → Yes
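Final Tree
- Root: Outlook
  - Overcast → Yes
  - Sunny → Humidity
    - High → No
    - Normal → Yes
  - Rain → Wind
    - Weak → Yes
    - Strong → No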
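The three steps can be folded into one recursive procedure. This is a sketch under stated assumptions, not a reference CART implementation: it follows the walkthrough by creating one child per category value, whereas standard CART builds binary splits. The names `build_tree`, `predict`, and `partition`, and the nested-dictionary tree format, are hypothetical choices made for this example.

```python
from collections import Counter, defaultdict

COLUMNS = ("Outlook", "Humidity", "Wind", "Play")
RAW = [
    ("Sunny", "High", "Weak", "No"), ("Sunny", "High", "Strong", "No"),
    ("Overcast", "High", "Weak", "Yes"), ("Rain", "High", "Weak", "Yes"),
    ("Rain", "Normal", "Weak", "Yes"), ("Rain", "Normal", "Strong", "No"),
    ("Overcast", "Normal", "Strong", "Yes"), ("Sunny", "Normal", "Weak", "Yes"),
    ("Sunny", "Normal", "Strong", "Yes"), ("Overcast", "High", "Strong", "Yes"),
]
DATA = [dict(zip(COLUMNS, row)) for row in RAW]


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def partition(rows, feature):
    """Group rows by their value of `feature`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row)
    return groups


def weighted_gini(rows, feature):
    """Size-weighted average impurity of the children produced by `feature`."""
    return sum(
        len(group) / len(rows) * gini([r["Play"] for r in group])
        for group in partition(rows, feature).values()
    )


def build_tree(rows, features):
    """Return a class label (leaf) or {feature: {value: subtree}} (internal node)."""
    labels = [r["Play"] for r in rows]
    # Stop on a pure node or when no features remain; predict the majority class.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    best = min(features, key=lambda f: weighted_gini(rows, f))
    remaining = [f for f in features if f != best]
    return {best: {value: build_tree(subset, remaining)
                   for value, subset in partition(rows, best).items()}}


def predict(tree, sample):
    """Walk the tree; raises KeyError for a category value unseen in training."""
    while isinstance(tree, dict):
        feature = next(iter(tree))
        tree = tree[feature][sample[feature]]
    return tree


tree = build_tree(DATA, ["Outlook", "Humidity", "Wind"])
print(tree)
print(predict(tree, {"Outlook": "Rain", "Humidity": "High", "Wind": "Strong"}))
```

On this dataset the sketch should select Outlook at the root, Humidity under Sunny, and Wind under Rain, and it classifies all ten training rows correctly.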