Random Forest

Random Forest is a machine learning algorithm used mainly for classification and regression tasks. At a high level, it’s an ensemble method that builds multiple decision trees and combines their outputs to make a more accurate and stable prediction.

Let’s break it down clearly:


1. Core Idea

  • A single decision tree is very sensitive to its training data: it can overfit, meaning it learns the noise rather than the underlying pattern.

  • Random Forest solves this by:

    1. Creating many decision trees on random subsets of the data and features.

    2. Aggregating their predictions:

      • For classification: majority vote (most trees agree on the class)

      • For regression: average of all tree predictions

Think of it like asking a committee of experts instead of relying on a single person.
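
To make the committee analogy concrete, here is a minimal sketch comparing one decision tree against a forest of 100 trees on a synthetic dataset (the dataset parameters and n_estimators value are illustrative assumptions, not requirements):

```python
# Compare a single decision tree with a Random Forest "committee".
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100,
                                random_state=42).fit(X_train, y_train)

print("Single tree accuracy:", tree.score(X_test, y_test))
print("Random Forest accuracy:", forest.score(X_test, y_test))
```

On most runs the forest scores at least as well as the single tree, and its accuracy varies less across random splits.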


2. How It Works

  1. Bootstrapping (Random Sampling)
    Each tree is trained on a bootstrap sample of the training data: rows drawn at random with replacement, typically as many rows as the original set.

  2. Random Feature Selection
    When splitting nodes in a tree, it randomly selects a subset of features instead of using all features. This increases diversity among trees.

  3. Tree Building
    Each tree grows fully (or until a stopping condition), making its own predictions.

  4. Aggregation (a from-scratch sketch of all four steps follows this list)

    • Classification: pick the class predicted by the most trees

    • Regression: take the average of all tree outputs
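
The four steps map directly to code. Below is a from-scratch sketch built on scikit-learn's DecisionTreeClassifier; n_trees=25 and max_features='sqrt' are illustrative assumptions, and X, y are assumed to be NumPy arrays with integer class labels:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=25, max_features="sqrt", seed=0):
    """Steps 1-3: bootstrapping, random feature selection, tree building."""
    rng = np.random.default_rng(seed)
    trees = []
    n = len(X)
    for _ in range(n_trees):
        # Step 1: bootstrap sample -- n rows drawn with replacement
        idx = rng.integers(0, n, size=n)
        # Step 2 is delegated to the tree: max_features limits the
        # features considered at each split, decorrelating the trees
        tree = DecisionTreeClassifier(max_features=max_features,
                                      random_state=int(rng.integers(2**31)))
        # Step 3: grow the tree on its own bootstrap sample
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    """Step 4: aggregate by majority vote (assumes integer labels)."""
    votes = np.stack([tree.predict(X) for tree in trees]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

For regression, the same structure applies with DecisionTreeRegressor and a mean over tree outputs instead of a vote.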


3. Advantages

  • Handles large datasets well

  • Reduces overfitting compared to single decision trees

  • Works with both numerical and categorical data

  • Provides feature importance, which helps identify which variables matter most (see the example below)
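
As a short sketch of reading importances, the snippet below uses the Iris dataset; scikit-learn exposes impurity-based importances on a fitted model as the feature_importances_ attribute:

```python
# Rank features by impurity-based importance on the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

# Pair each feature name with its importance and sort descending
ranked = sorted(zip(data.feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```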


4. Disadvantages

  • Can be slower to train and predict with very large forests

  • Less interpretable than a single decision tree (harder to visualize)

  • Can perform poorly on heavily imbalanced datasets unless you compensate (see the next section)


5. Handling Class Imbalance

  1. Class weighting / cost-sensitive learning

    • Assign a higher weight to the minority class when building trees.

    • In scikit-learn: class_weight='balanced'.

  2. Resampling

    • Oversample the minority class (e.g., with SMOTE)

    • Undersample the majority class

    • Either approach can be combined with Random Forest to balance the data seen by each tree.
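
Here is a sketch of both options. It uses scikit-learn's class_weight='balanced' and, as an assumed extra dependency, SMOTE from the imbalanced-learn (imblearn) package; the 90/10 synthetic dataset is purely illustrative:

```python
# Two ways to handle class imbalance with a Random Forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE  # requires imbalanced-learn

# Illustrative dataset with a roughly 90/10 class split
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1],
                           random_state=0)

# Option 1: cost-sensitive learning via class weights
weighted = RandomForestClassifier(n_estimators=100,
                                  class_weight="balanced",
                                  random_state=0).fit(X, y)

# Option 2: oversample the minority class before fitting
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
resampled = RandomForestClassifier(n_estimators=100,
                                   random_state=0).fit(X_res, y_res)
```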
