GINI INDEX

The Gini index is the impurity measure used by CART (Classification and Regression Trees). It measures how often a randomly chosen sample from a node would be misclassified if it were labeled at random according to the distribution of classes in that node. A pure node (all samples in one class) has a Gini score of 0. The closer the score is to 0, the purer the node.

The formula is:

Gini(node) = 1 - \sum_{c=1}^{C} \big( p(c \mid node) \big)^2

Where:

  • C = number of classes

  • p(c \mid node) = proportion (frequency) of class c among the data at that node

Interpretation:

  • If all samples at the node belong to one class → Gini = 0 (pure node).

  • The higher the Gini value, the more mixed the classes are at that node.

Example:
Suppose a node has 10 samples: 7 positive and 3 negative.

p(positive) = 0.7, p(negative) = 0.3

Gini = 1 - (0.7^2 + 0.3^2) = 1 - (0.49 + 0.09) = 0.42

This shows the node is somewhat impure, since both classes are present.
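The same arithmetic can be checked with a few lines of Python. This is only a minimal sketch: the function name `gini` and the choice to treat an empty node as pure are assumptions made for illustration, not part of the original walkthrough.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumption: an empty node is treated as pure
    counts = Counter(labels)
    # 1 minus the sum of squared class proportions
    return 1.0 - sum((k / n) ** 2 for k in counts.values())


# The node from the example: 7 positive and 3 negative samples.
node = ["positive"] * 7 + ["negative"] * 3
print(round(gini(node), 2))  # 0.42
```

A node holding a single class returns 0, and a 50/50 two-class node returns 0.5, the largest value possible with two classes.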


Toy dataset

We’ll predict Play (Yes/No) from three features: Outlook, Humidity, Wind.

ID  Outlook   Humidity  Wind    Play
1   Sunny     High      Weak    No
2   Sunny     High      Strong  No
3   Overcast  High      Weak    Yes
4   Rain      High      Weak    Yes
5   Rain      Normal    Weak    Yes
6   Rain      Normal    Strong  No
7   Overcast  Normal    Strong  Yes
8   Sunny     Normal    Weak    Yes
9   Sunny     Normal    Strong  Yes
10  Overcast  High      Strong  Yes

Totals: Yes = 7, No = 3.

Gini(node) = 1 - \sum_c p(c \mid node)^2

Root Gini = 1 - (0.7^2 + 0.3^2) = 0.42
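To make the root calculation reproducible, the dataset can be written out in Python and the class totals and root Gini recomputed. The tuple layout and the names `TABLE`, `gini`, `class_counts` and `root_gini` are illustrative assumptions, not something the original defines.

```python
from collections import Counter

# Toy dataset as (ID, Outlook, Humidity, Wind, Play) tuples.
TABLE = [
    (1, "Sunny", "High", "Weak", "No"),
    (2, "Sunny", "High", "Strong", "No"),
    (3, "Overcast", "High", "Weak", "Yes"),
    (4, "Rain", "High", "Weak", "Yes"),
    (5, "Rain", "Normal", "Weak", "Yes"),
    (6, "Rain", "Normal", "Strong", "No"),
    (7, "Overcast", "Normal", "Strong", "Yes"),
    (8, "Sunny", "Normal", "Weak", "Yes"),
    (9, "Sunny", "Normal", "Strong", "Yes"),
    (10, "Overcast", "High", "Strong", "Yes"),
]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())


play = [row[4] for row in TABLE]   # the target column
class_counts = Counter(play)       # Yes = 7, No = 3
root_gini = gini(play)
print(class_counts["Yes"], class_counts["No"])  # 7 3
print(round(root_gini, 2))                      # 0.42
```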


Step 1 — Evaluate first split (try every feature)

We compute the weighted Gini after splitting on each feature and pick the smallest.
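Spelled out, the weighted Gini of a split is the size-weighted average of the child impurities. The symbols n (samples at the parent node), n_k (samples in child k) and K (number of children) are notation introduced here for illustration:

Gini_{split} = \sum_{k=1}^{K} \frac{n_k}{n} \, Gini(child_k)

Impurity reduction = Gini(parent) - Gini_{split}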

Split by Outlook (Sunny / Overcast / Rain)

  • Sunny: {1,2,8,9} → Yes=2, No=2 → Gini = 1 - (0.5^2 + 0.5^2) = 0.5

  • Overcast: {3,7,10} → Yes=3, No=0 → Gini = 0

  • Rain: {4,5,6} → Yes=2, No=1 → Gini = 1 - ((2/3)^2 + (1/3)^2) = 4/9 ≈ 0.444

Weighted Gini = (4/10)*0.50 + (3/10)*0 + (3/10)*0.444 ≈ 0.20 + 0 + 0.133 = 0.333

Impurity reduction = 0.42 - 0.333 = 0.087

Split by Humidity (High / Normal)

  • High: {1,2,3,4,10} → Yes=3, No=2 → Gini = 1 - (0.6^2 + 0.4^2) = 0.48

  • Normal: {5,6,7,8,9} → Yes=4, No=1 → Gini = 1 - (0.8^2 + 0.2^2) = 0.32

Weighted Gini = 0.5*0.48 + 0.5*0.32 = 0.40
Reduction = 0.42 - 0.40 = 0.02

Split by Wind (Weak / Strong)

  • Weak: {1,3,4,5,8} → Yes=4, No=1 → Gini = 0.32

  • Strong: {2,6,7,9,10} → Yes=3, No=2 → Gini = 0.48

Weighted Gini = 0.5*0.32 + 0.5*0.48 = 0.40
Reduction = 0.42 - 0.40 = 0.02

Best first split: Outlook (lowest weighted Gini ≈ 0.333).
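The comparison of the three candidate splits can be sketched in Python as follows. The helper names (`gini`, `weighted_gini`) and the dict-per-row layout are assumptions for illustration. Like the walkthrough, the sketch creates one branch per category value; CART implementations in common libraries typically use binary splits instead.

```python
from collections import Counter, defaultdict

COLUMNS = ("ID", "Outlook", "Humidity", "Wind", "Play")
TABLE = [
    (1, "Sunny", "High", "Weak", "No"),
    (2, "Sunny", "High", "Strong", "No"),
    (3, "Overcast", "High", "Weak", "Yes"),
    (4, "Rain", "High", "Weak", "Yes"),
    (5, "Rain", "Normal", "Weak", "Yes"),
    (6, "Rain", "Normal", "Strong", "No"),
    (7, "Overcast", "Normal", "Strong", "Yes"),
    (8, "Sunny", "Normal", "Weak", "Yes"),
    (9, "Sunny", "Normal", "Strong", "Yes"),
    (10, "Overcast", "High", "Strong", "Yes"),
]
DATA = [dict(zip(COLUMNS, row)) for row in TABLE]
FEATURES = ("Outlook", "Humidity", "Wind")


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())


def weighted_gini(rows, feature, target="Play"):
    """Size-weighted average Gini of the groups produced by `feature`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())


root = gini([row["Play"] for row in DATA])
for feature in FEATURES:
    w = weighted_gini(DATA, feature)
    print(f"{feature:<8} weighted={w:.3f} reduction={root - w:.3f}")

# The feature with the smallest weighted Gini is chosen for the split.
best = min(FEATURES, key=lambda f: weighted_gini(DATA, f))
print(best)  # Outlook
```

Minimizing the weighted Gini and maximizing the impurity reduction select the same feature, because the parent Gini is the same for every candidate.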

Current (partial) tree:

  • Root: Outlook

    • Overcast → leaf Yes (pure)

    • Sunny → mixed (needs splitting)

    • Rain → mixed (needs splitting)



Step 2 — Split the Rain branch

Rain subset: {4,5,6} → Yes=2, No=1 → Gini ≈ 0.444

Try Humidity within Rain:

  • High: {4} → Yes=1 → Gini = 0

  • Normal: {5,6} → Yes=1, No=1 → Gini = 0.5
    Weighted = (1/3)*0 + (2/3)*0.5 = 1/3 ≈ 0.333

Try Wind within Rain:

  • Weak: {4,5} → Yes=2 → Gini = 0

  • Strong: {6} → No=1 → Gini = 0
    Weighted = 0

Best split for Rain: Wind (perfectly pure).

Rain branch becomes:

  • Rain → Wind

    • Weak → Yes

    • Strong → No
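The Rain branch can be checked the same way by filtering the rows first and then scoring the two remaining features. As before, this is a sketch under assumed names (`DATA`, `gini`, `weighted_gini`, `rain`), not code from the original.

```python
from collections import Counter, defaultdict

COLUMNS = ("ID", "Outlook", "Humidity", "Wind", "Play")
TABLE = [
    (1, "Sunny", "High", "Weak", "No"),
    (2, "Sunny", "High", "Strong", "No"),
    (3, "Overcast", "High", "Weak", "Yes"),
    (4, "Rain", "High", "Weak", "Yes"),
    (5, "Rain", "Normal", "Weak", "Yes"),
    (6, "Rain", "Normal", "Strong", "No"),
    (7, "Overcast", "Normal", "Strong", "Yes"),
    (8, "Sunny", "Normal", "Weak", "Yes"),
    (9, "Sunny", "Normal", "Strong", "Yes"),
    (10, "Overcast", "High", "Strong", "Yes"),
]
DATA = [dict(zip(COLUMNS, row)) for row in TABLE]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())


def weighted_gini(rows, feature, target="Play"):
    """Size-weighted average Gini of the groups produced by `feature`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())


# Keep only the rows that went down the Rain branch (IDs 4, 5, 6).
rain = [row for row in DATA if row["Outlook"] == "Rain"]

# Outlook is already used on this path, so only two features remain.
for feature in ("Humidity", "Wind"):
    print(feature, round(weighted_gini(rain, feature), 3))
# Humidity 0.333
# Wind 0.0
```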


Step 3 — Split the Sunny branch

Sunny subset: {1,2,8,9} → Yes=2, No=2 → Gini = 0.5

Try Humidity within Sunny:

  • High: {1,2} → No=2 → Gini = 0

  • Normal: {8,9} → Yes=2 → Gini = 0
    Weighted = 0

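(The only other candidate, Wind, gives Weak: {1,8} → Yes=1, No=1 and Strong: {2,9} → Yes=1, No=1, for a weighted Gini of 0.5, so it cannot beat 0.)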

Best split for Sunny: Humidity (perfectly pure).

Sunny branch becomes:

  • Sunny → Humidity

    • High → No

    • Normal → Yes
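Steps 1 to 3 repeat the same procedure on smaller and smaller subsets, which suggests a recursive formulation. The sketch below is one possible way to write it, assuming multiway splits as in the walkthrough, a majority-vote leaf when no features remain, and ties broken by feature order. The function `build_tree` and the nested-dict representation are illustrative choices, not the author's implementation.

```python
from collections import Counter, defaultdict

COLUMNS = ("ID", "Outlook", "Humidity", "Wind", "Play")
TABLE = [
    (1, "Sunny", "High", "Weak", "No"),
    (2, "Sunny", "High", "Strong", "No"),
    (3, "Overcast", "High", "Weak", "Yes"),
    (4, "Rain", "High", "Weak", "Yes"),
    (5, "Rain", "Normal", "Weak", "Yes"),
    (6, "Rain", "Normal", "Strong", "No"),
    (7, "Overcast", "Normal", "Strong", "Yes"),
    (8, "Sunny", "Normal", "Weak", "Yes"),
    (9, "Sunny", "Normal", "Strong", "Yes"),
    (10, "Overcast", "High", "Strong", "Yes"),
]
DATA = [dict(zip(COLUMNS, row)) for row in TABLE]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())


def weighted_gini(rows, feature, target="Play"):
    """Size-weighted average Gini of the groups produced by `feature`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())


def build_tree(rows, features, target="Play"):
    """Return a class label (leaf) or {feature: {value: subtree}}."""
    labels = [row[target] for row in rows]
    # Stop when the node is pure or no features are left to split on.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Pick the feature with the smallest weighted Gini.
    best = min(features, key=lambda f: weighted_gini(rows, f, target))
    remaining = [f for f in features if f != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {best: branches}


tree = build_tree(DATA, ["Outlook", "Humidity", "Wind"])
print(tree)
```

On this dataset the result has Outlook at the root, an Overcast leaf predicting Yes, a Humidity split under Sunny, and a Wind split under Rain, matching the branches derived by hand.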

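Final Tree

  • Root: Outlook

    • Overcast → Yes

    • Sunny → Humidity

      • High → No

      • Normal → Yes

    • Rain → Wind

      • Weak → Yes

      • Strong → No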
