Unlock Accurate Predictions by Harnessing the Power of Many Trees.
We know Decision Trees can classify data by asking questions. But sometimes, a single tree can be too sensitive to the specific training data it saw – it might overfit. What if we could build a whole forest of slightly different decision trees and let them vote on the final classification? That's the core idea behind Random Forest Classification!
Random Forest is a highly effective and widely used ensemble learning method. It leverages the power of multiple decision trees to create a model that is typically more accurate, robust, and less prone to overfitting than a single decision tree.
Main Technical Concept: Random Forest is a supervised ensemble learning algorithm that builds multiple decision trees during training. For classification, it outputs the class selected by the majority of the individual trees (majority voting).
The magic of Random Forest comes from introducing randomness in two key ways to ensure the trees in the forest are diverse (i.e., different from each other):

1. Bootstrap Sampling (Bagging): Each tree is trained on a random sample of the training data drawn with replacement.
2. Random Feature Subsetting: At each node split, only a random subset of the available features is considered when searching for the best split.
Image Credit: Hardik Vasa on Wikimedia Commons, CC BY-SA 4.0
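To make these two sources of randomness concrete, here is a minimal NumPy sketch of how one tree's bootstrap sample and a per-split feature subset could be drawn. The array shapes are illustrative assumptions, not the dataset used later in this post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 8 samples, 4 features (assumed shapes, purely for illustration)
X = rng.normal(size=(8, 4))

# 1) Bootstrap sampling (bagging): draw n samples WITH replacement for one tree
bootstrap_idx = rng.integers(0, len(X), size=len(X))
X_bootstrap = X[bootstrap_idx]
print("Bootstrap indices (duplicates expected):", bootstrap_idx)

# 2) Random feature subsetting: at each split, only a subset of features is
#    considered (sqrt(n_features) is a common default for classification)
n_features = X.shape[1]
subset_size = max(1, int(np.sqrt(n_features)))
feature_subset = rng.choice(n_features, size=subset_size, replace=False)
print("Features considered at this split:", feature_subset)
```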
Once the forest of diverse trees is built, classifying a new data point is straightforward:

1. The data point is passed down every tree in the forest.
2. Each tree makes its own class prediction (casts a vote).
3. The class receiving the majority of the votes becomes the forest's final prediction.
By averaging out the predictions (through voting) of many diverse, potentially slightly overfit trees, the Random Forest ensemble typically achieves lower variance and better generalization than any single tree could alone.
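As a rough sketch of that voting step, here is how a majority vote over a set of per-tree predictions could be computed (the votes below are a made-up example):

```python
import numpy as np

# Hypothetical class votes from 7 trees for one data point (0 = not purchased, 1 = purchased)
tree_votes = np.array([1, 0, 1, 1, 0, 1, 1])

# The class with the most votes wins
final_prediction = np.bincount(tree_votes).argmax()
print(final_prediction)  # 1 -> the forest predicts "purchased"
```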
Let's use the `Social_Network_Ads.csv` dataset example, predicting 'Purchased' based on 'Age' and 'EstimatedSalary'.
Key parameters we'll set:

- `n_estimators`: Number of trees in the forest (e.g., 100 is a good starting point).
- `criterion`: Splitting criterion ('gini' or 'entropy'; 'gini' is the default).
- `random_state`: For reproducibility.

```python
# 1. Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
# 2. Load dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
# Select relevant features and target
X = dataset[['Age', 'EstimatedSalary']].values
Y = dataset['Purchased'].values
# (Optional but good practice: Check for missing values)
# print(pd.DataFrame(X).isnull().sum())
# 3. Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=0)
# 4. Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 5. Initialize and train Random Forest Classifier
# Using 100 trees and entropy criterion for this example
classifier = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=0)
classifier.fit(X_train_scaled, y_train)
# 6. Predict results on the test set
y_pred = classifier.predict(X_test_scaled)
# 7. Evaluate the Model
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:\n', cm)
# Accuracy Score
acc = accuracy_score(y_test, y_pred)
print(f'\nAccuracy Score: {acc:.4f}') # Example result might be ~0.93
# Classification Report
report = classification_report(y_test, y_pred)
print('\nClassification Report:\n', report)
# Visualize Confusion Matrix (Optional)
plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Greens')
plt.title('Confusion Matrix - Random Forest')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
# plt.show()
```
Key hyperparameters to tune (a tuning sketch follows this list):

- `n_estimators`: Number of trees (more is often better up to a point; watch computation time).
- `max_depth`: Maximum depth of each tree (controls complexity, prevents overfitting).
- `min_samples_split`: Minimum samples needed to split an internal node.
- `min_samples_leaf`: Minimum samples required in a leaf node.
- `max_features`: Number/fraction of features considered at each split.
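As a rough illustration of tuning these, here is a minimal grid search sketch using scikit-learn's GridSearchCV. The grid values are arbitrary assumptions, and the sketch assumes the scaled training data from the example above is still in scope.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small, illustrative grid -- the values here are assumptions, not recommendations
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 5, 10],
    'min_samples_leaf': [1, 5],
    'max_features': ['sqrt', 0.5],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                 # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1,
)
grid.fit(X_train_scaled, y_train)  # reuses variables defined in the example above

print('Best parameters:', grid.best_params_)
print('Best CV accuracy:', grid.best_score_)
```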
Interview Question

Question 1: What are the two main sources of randomness that make the trees in a Random Forest different from each other?
1. Bootstrap Sampling (Bagging): Each tree is trained on a random sample of the original data drawn with replacement.
2. Random Feature Subsetting: At each node split, only a random subset of the available features is considered for finding the best split.
Question 2: How does a Random Forest classifier make its final prediction for a new data point?
The new data point is passed down every tree in the forest. Each tree makes an individual class prediction (casts a vote). The Random Forest's final prediction is the class that receives the majority of the votes from all the trees.
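In scikit-learn you can inspect this directly: a fitted forest exposes its individual trees via the `estimators_` attribute. (Note that scikit-learn actually aggregates the trees' predicted probabilities rather than hard votes, which for fully grown trees usually coincides with the majority vote.) The sketch below assumes `classifier` and `X_test_scaled` from the example above are in scope.

```python
import numpy as np

# One test point, reusing objects fitted in the example above
x = X_test_scaled[:1]

# Each fitted tree casts its own vote (the sub-trees return encoded class
# indices, which here are already 0/1 = not purchased / purchased)
tree_preds = np.array([tree.predict(x)[0] for tree in classifier.estimators_])
print('Votes per class:', np.bincount(tree_preds.astype(int)))

# The forest's prediction should match the winning class
print('Forest prediction:', classifier.predict(x)[0])
```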
Interview Question
Question 3: Why is Random Forest generally less prone to overfitting compared to a single Decision Tree?
Due to the combination of bagging and random feature selection, the individual trees in the forest are de-correlated (they learn different aspects and make different errors). While individual trees might overfit their specific data/feature subset, averaging their predictions (through majority voting) cancels out much of this noise and variance, leading to a more robust model that generalizes better.
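One way to see this empirically is to compare a single, fully grown decision tree with a forest on the same split. This sketch reuses the scaled train/test data from the example above; the exact scores will vary with the data.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# A single, fully grown tree vs. a forest of 100 trees (reusing the split from above)
tree = DecisionTreeClassifier(random_state=0).fit(X_train_scaled, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train_scaled, y_train)

for name, model in [('Single tree', tree), ('Random forest', forest)]:
    train_acc = accuracy_score(y_train, model.predict(X_train_scaled))
    test_acc = accuracy_score(y_test, model.predict(X_test_scaled))
    # A large train/test gap suggests overfitting; the forest's gap is usually smaller
    print(f'{name}: train={train_acc:.3f}, test={test_acc:.3f}')
```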
Question 4: What does the `n_estimators` parameter control in `RandomForestClassifier`?
It controls the number of decision trees that are built in the forest ensemble.
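A quick way to confirm this (assuming the training data from the example above is in scope): the fitted forest stores its trees in `estimators_`, whose length equals `n_estimators`.

```python
from sklearn.ensemble import RandomForestClassifier

# n_estimators=300 builds a forest of 300 decision trees
model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train_scaled, y_train)
print(len(model.estimators_))  # 300
```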
Interview Question
Question 5: Is feature scaling (like Standardization) mandatory for Random Forests? Why or why not?
It is not strictly mandatory because Decision Trees (and therefore Random Forests) make splits based on threshold values for individual features, which are not sensitive to the overall scale (e.g., splitting at Age > 30 works the same whether age is in years or months). However, it's often still considered good practice as it doesn't hurt performance, can sometimes offer minor benefits in certain implementations or edge cases, and ensures consistency if you compare RF with other models that *do* require scaling.
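You can check this empirically by fitting the same forest on raw and standardized features and comparing predictions; because standardization preserves the ordering of each feature's values, the two models should agree (or very nearly so). This sketch reuses the raw and scaled splits from the example above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Same configuration, fit once on raw features and once on standardized ones
rf_raw = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
rf_scaled = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train_scaled, y_train)

# Splits depend only on the ordering of feature values, so predictions should match
agreement = np.mean(rf_raw.predict(X_test) == rf_scaled.predict(X_test_scaled))
print('Fraction of identical predictions:', agreement)
```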
Question 6: Besides accuracy, what other useful information can you often extract from a trained Random Forest model?
You can extract feature importances. The model can estimate how much each feature contributed, on average, to reducing impurity (like Gini or Entropy) across all the splits in all the trees. This helps identify the most influential predictors.
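In scikit-learn these are exposed through the `feature_importances_` attribute of a fitted forest; a quick sketch using the `classifier` trained in the example above:

```python
import pandas as pd

# Impurity-based importances of the fitted forest (they sum to 1)
importances = pd.Series(
    classifier.feature_importances_,
    index=['Age', 'EstimatedSalary'],
).sort_values(ascending=False)
print(importances)
```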