Decision Tree: Induction

Decision Tree - Classification



Induction Algorithm

  • Tree is constructed in a greedy, top-down, recursive, divide-and-conquer manner
  • Build one node at a time, starting with the root
  • At the start: use all data samples to determine the best attribute for the root
  • Based on that attribute, select data subset that goes to the children
  • For each child, repeat the process
  • Test attributes are selected based on a statistical measure (e.g., information gain)
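The statistical measure mentioned in the last bullet can be made concrete. Below is a minimal sketch of Shannon entropy and information gain; the function names and signatures are my own, not from the source:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(parent_labels, child_label_lists):
    """Entropy reduction from splitting a node into the given children."""
    n = len(parent_labels)
    weighted_child_entropy = sum(len(ch) / n * entropy(ch)
                                 for ch in child_label_lists)
    return entropy(parent_labels) - weighted_child_entropy
```

For example, splitting a 50/50 mixed node into two pure children yields `information_gain(["y", "y", "n", "n"], [["y", "y"], ["n", "n"]]) == 1.0`, the maximum possible gain for a binary class.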

Stopping Criteria - How do we know when to stop?

  • All samples for a given node belong to the same class
  • There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
  • There are no samples left

Hunt's Algorithm

Let Dnode be the set of training records that reach a node

General Procedure:

  • If Dnode contains records that all belong to the same class ynode, then node is a leaf node labeled as ynode
  • If Dnode is an empty set, then node is a leaf node labeled by the default class, ydefault
  • If Dnode contains records that belong to more than one class, use an attribute test to split the data into smaller subsets.
  • Recursively apply the procedure to each subset.
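The general procedure above can be sketched as a recursive Python function. All names are illustrative, and attribute selection is simplified to taking the first available attribute; a real implementation would pick the attribute by a statistical measure such as information gain:

```python
from collections import Counter

def hunt(records, attributes, default):
    """Hunt's algorithm sketch.
    records: list of (feature_dict, label) pairs reaching this node.
    Returns either a class label (leaf) or a nested dict {attr: {value: subtree}}.
    """
    if not records:                          # empty set -> leaf with default class
        return default
    labels = [y for _, y in records]
    if len(set(labels)) == 1:                # pure node -> leaf labeled with that class
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:                       # no attributes left -> majority vote
        return majority
    attr = attributes[0]                     # simplification: real trees pick by info gain
    remaining = [a for a in attributes if a != attr]
    tree = {attr: {}}
    for value in {x[attr] for x, _ in records}:
        subset = [(x, y) for x, y in records if x[attr] == value]
        # parent's majority class becomes the child's default (ydefault)
        tree[attr][value] = hunt(subset, remaining, majority)
    return tree
```

Calling `hunt` on a toy set like `[({"weather": "sunny"}, "buy"), ({"weather": "rain"}, "skip"), ({"weather": "sunny"}, "buy")]` with `attributes=["weather"]` produces the one-split tree `{"weather": {"sunny": "buy", "rain": "skip"}}`.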

Step-by-Step Procedure

Imagine you’re at a node in the tree with a group of training examples, which we’ll call Dnode.

  1. All records belong to one class → Stop

    • If all the examples in Dnode are from the same class, then this node becomes a leaf node.

    • The leaf is labeled with that class.

    • Example: If all customers in this group “buy” a product, the leaf is labeled “Buy.”

  2. No records left → Stop

    • If Dnode is empty (no examples reach this node), the algorithm creates a leaf node with a default class.

    • The default class is usually the majority class of the parent node.

  3. Mixed records → Split further

    • If Dnode contains a mix of different classes, the algorithm chooses an attribute test (a question like “Is Age > 30?” or “Is Weather = Sunny?”).

    • This test splits the data into smaller subsets.

    • Each subset becomes a new child node.

  4. Repeat the process

    • The algorithm is applied recursively to each child node, checking again: are all records the same class? Empty? Or mixed?

    • This continues until the stopping condition is reached (all leaves are pure or no data is left).
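The three stopping checks in steps 1-3 can also be shown in isolation. Here is a small sketch (the function name and return convention are my own) that decides what happens at a single node, given the class labels of the records that reach it and the parent's majority class:

```python
from collections import Counter

def node_action(labels, parent_majority):
    """Apply the step-by-step checks at one node.
    Returns ('leaf', class) when the node stops, or ('split', majority) when
    it must be partitioned further."""
    if not labels:                        # Step 2: no records left -> default class
        return ("leaf", parent_majority)
    if len(set(labels)) == 1:             # Step 1: all one class -> pure leaf
        return ("leaf", labels[0])
    # Step 3: mixed classes -> split further (majority kept as children's default)
    return ("split", Counter(labels).most_common(1)[0][0])

print(node_action(["Buy", "Buy"], "Buy"))           # ('leaf', 'Buy')
print(node_action([], "Buy"))                       # ('leaf', 'Buy')
print(node_action(["Buy", "NoBuy", "Buy"], "Buy"))  # ('split', 'Buy')
```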

Geometric Interpretation


Next up - splitting criteria

