
Regression Trees Explained Simply

Learn how decision trees predict numbers by splitting data.

Predicting Numbers with Decision Trees: Regression Trees

Imagine playing a game of "Guess the Number." You ask questions like "Is it bigger than 50?" or "Is it even?" to narrow down the possibilities. Decision Trees in machine learning work similarly!

They are flowchart-like structures that ask questions about your data's features to guide you to a final prediction. Decision trees can predict categories (like "Will this customer buy?" - Classification Tree) or continuous numbers (like "How many hours will someone play tennis?" - Regression Tree). Today, we'll focus on understanding Regression Trees.

What is a Decision Tree, Anyway?

Splitting Data Based on Questions

A decision tree learns by splitting the dataset into smaller and smaller subsets. At each step, it asks a question about one of the input features (e.g., "Is the weather Outlook Sunny?"). Based on the answer, the data goes down a specific branch.

The main goal when building a tree is to make the resulting groups (at the end of the branches) as "pure" or homogeneous as possible regarding the value we want to predict.

  • For Regression Trees, "homogeneous" means the numerical target values within a group are very close to each other (low variation).
  • For Classification Trees, it means most items in a group belong to the same category.

Know Your Tree Parts

```
Root Node (Start Here: Asks first question)
|
|---- IF [Condition 1 True] ---> Decision Node (Asks next question)
|                                |
|                                |---- IF [Condition 1.1 True]  ---> Leaf Node (Final Prediction 1)
|                                |
|                                |---- IF [Condition 1.1 False] ---> Leaf Node (Final Prediction 2)
|
|---- IF [Condition 1 False] --> Leaf Node (Final Prediction 3)

Subtree starts here ---> Decision Node + its branches & leaves
```

Basic Tree Terminology

  • Root Node: The top-level node where the first split happens.
  • Decision Node (Internal Node): A node that asks a question and splits the data further.
  • Leaf Node (Terminal Node): An end node that doesn't split anymore. It provides the final prediction for data reaching it.
  • Subtree: A section of the tree starting from a decision node.

How Regression Trees Choose the Best Split

Goal: Reduce the Spread (Standard Deviation)

How does the tree decide *which* question to ask at each step? For regression trees, a common method is to choose the split that results in the biggest reduction in the spread or variation of the target variable (the number we're trying to predict).

We often measure this spread using the Standard Deviation (SD). A low SD means the numbers in a group are very similar; a high SD means they are spread out. The tree wants to create groups (leaves) with the lowest possible SD.

Standard Deviation (s) ≈ typical distance of data points from their mean

(Formula: s = √[ Σ(value − mean)² / count ])
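As a quick check on that formula, here is a tiny Python snippet (the sample values are made up for illustration) that computes it step by step:

```python
import math

values = [26, 30, 28, 45, 47, 44]                 # made-up target values
mean = sum(values) / len(values)
squared_diffs = [(v - mean) ** 2 for v in values]
sd = math.sqrt(sum(squared_diffs) / len(values))  # √[ Σ(value − mean)² / count ]
print(round(sd, 2))                               # ≈ 8.79
```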

The method used is called Standard Deviation Reduction (SDR).

The SDR Calculation Steps

At any decision node, the algorithm considers all possible splits across all features:

  1. Measure Current Spread: Calculate the Standard Deviation (SD_parent) of the target variable for all data points currently in this node.
  2. Test Potential Splits: For every feature (e.g., 'Outlook', 'Temperature'):
    • Consider splitting based on its values (e.g., Outlook=Sunny vs. Outlook=Overcast vs. Outlook=Rainy).
    • For each potential split, calculate the SD of the target variable within each resulting child group (e.g., SD_sunny, SD_overcast, SD_rainy).
    • Calculate the Weighted Average SD for this split. This reflects the overall spread *after* the split.
      Weighted_SD = (Fraction_in_Child1 * SD_Child1) + (Fraction_in_Child2 * SD_Child2) + ...
  3. Calculate SDR: For each potential split, find the reduction:

    SDR = SD_Parent - Weighted_Average_SD_Children

  4. Choose Best Split: Select the feature and split value that yields the Maximum SDR. This is the split that makes the resulting groups most homogeneous (lowest combined spread).
  5. Repeat: Apply this entire process recursively to the new child nodes until a stopping condition is met.
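The steps above map almost directly onto code. Below is a minimal, illustrative Python sketch (the `sdr_for_split` helper and the toy data are made up for this example, not taken from any library):

```python
from statistics import pstdev   # population SD, matching the formula above

def sdr_for_split(targets, feature_values):
    """Standard Deviation Reduction when splitting `targets` by one categorical feature."""
    parent_sd = pstdev(targets)                   # Step 1: spread before the split
    n = len(targets)
    weighted_child_sd = 0.0
    for category in set(feature_values):          # Step 2: spread within each child group
        child = [t for t, f in zip(targets, feature_values) if f == category]
        weighted_child_sd += (len(child) / n) * pstdev(child)
    return parent_sd - weighted_child_sd          # Step 3: the reduction (SDR)

# Toy data (made up): target values and the matching 'Outlook' value for each row
hours   = [26, 30, 28, 45, 47, 44]
outlook = ["Sunny", "Sunny", "Sunny", "Overcast", "Overcast", "Overcast"]
print(round(sdr_for_split(hours, outlook), 2))    # ≈ 7.35, a large reduction
```

In a full tree builder, this function would be called for every candidate feature (Step 4 picks the maximum SDR) and then applied recursively to each child node (Step 5).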

When to Stop Splitting?

The tree stops growing branches (creating leaf nodes) when:

  • The Standard Deviation in a node is already very low (the data is homogeneous). Often checked using the Coefficient of Variation (CV) = `(SD / Mean) * 100%`. If CV is below a set limit (e.g., 10%), stop.
  • The node contains too few data points to split further reliably (e.g., less than 5 samples).
  • A pre-set maximum tree depth is reached.
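As a rough sketch of these rules in code (assuming the 10% CV limit and 5-sample minimum mentioned above; `max_depth=4` is an arbitrary default chosen only for illustration):

```python
from statistics import mean, pstdev

def should_stop(targets, depth, max_depth=4, min_samples=5, cv_limit=0.10):
    """Return True when a node should become a leaf instead of splitting further."""
    if len(targets) < min_samples:      # too few points to split reliably
        return True
    if depth >= max_depth:              # pre-set maximum depth reached
        return True
    m = mean(targets)
    cv = pstdev(targets) / m if m != 0 else 0.0   # coefficient of variation = SD / mean
    return cv < cv_limit                # data already homogeneous enough
```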

The Prediction in a Leaf Node

Once a data point reaches a leaf node, what's the prediction? For a regression tree, it's simple: the prediction is the average (mean) of the target variable for all the *training* data points that ended up in that leaf.

Example: Predicting Tennis Hours Played

Let's revisit the tennis example: predicting 'Hours Played' based on 'Outlook', 'Temperature', 'Humidity', 'Windy'.

Suppose at the Root Node (all 14 data points):

  • Mean Hours = 39.8
  • SD (Parent) = 9.32

Now, let's test splitting by 'Outlook':

  • Split Groups:
    • Outlook = Sunny (5 points), suppose SD = 10.87
    • Outlook = Overcast (4 points), suppose SD = 0 (perfectly consistent hours!)
    • Outlook = Rainy (5 points), suppose SD = 7.78
  • Weighted Average SD (Children):
    = (5/14 * 10.87) + (4/14 * 0) + (5/14 * 7.78)
    ≈ 3.88 + 0 + 2.78 = 6.66
  • SDR (Outlook):
    = SD_Parent - Weighted_SD_Children
    = 9.32 - 6.66 = 2.66

If we calculate SDR for splitting by Temperature, Humidity, and Windy, and find that 2.66 is the highest SDR, then 'Outlook' is chosen as the first split at the Root Node.
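You can verify this arithmetic with a couple of lines (the child SDs and group sizes below are the supposed values from this example, not measured from a real dataset):

```python
sd_parent = 9.32
children = [(5, 10.87), (4, 0.0), (5, 7.78)]     # (size, SD) for Sunny / Overcast / Rainy
n = sum(size for size, _ in children)            # 14 data points in total

weighted_sd = sum(size / n * child_sd for size, child_sd in children)
print(round(weighted_sd, 2))                     # 6.66
print(round(sd_parent - weighted_sd, 2))         # 2.66 -> SDR for the Outlook split
```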

Continuing the Tree

  • The 'Overcast' branch has SD = 0. It immediately becomes a Leaf Node. The prediction for any 'Overcast' weather would be the average hours played in that group (e.g., 45 hours).
  • For the 'Sunny' branch, we take only those 5 data points and repeat the SDR process, testing splits on Temperature, Humidity, and Windy to find the next best split for *that specific subgroup*.
  • We do the same for the 'Rainy' branch.
```
[Outlook?] (SD=9.32)  SDR -> Choose Outlook
|
|-- [Outlook=Sunny]    (SD=10.87, 5 points) -> [Windy?] (Check SDR for Temp, Hum, Wind) -> ...Leaf (Avg Sunny-Windy)
|
|-- [Outlook=Overcast] (SD=0,     4 points) ----> LEAF: Predict Avg_Overcast_Hours (e.g., 45) [STOP]
|
|-- [Outlook=Rainy]    (SD=7.78,  5 points) -> [Temp?]  (Check SDR for Temp, Hum, Wind) -> ...Leaf (Avg Rainy-Temp)
```

Example Tree Growth using SDR

Key Terms Recap

| Term | Definition |
| --- | --- |
| Decision Node | Where data splits based on a feature's condition. |
| Root Node | The very first split/decision node at the top. |
| Leaf Node | End node with the final prediction (average value for regression). |
| Subtree | A branch and its subsequent nodes/leaves. |
| Standard Deviation (SD) | Measures the spread or variation of numerical data. |
| Coefficient of Variation (CV) | Relative spread (SD / Mean). Used as a stopping criterion. |
| Standard Deviation Reduction (SDR) | The decrease in SD achieved by a split. Used to choose the best split. |

Common Misunderstandings

  • Regression vs. Classification Prediction: Don't forget regression trees predict an average number at the leaves, while classification trees predict a category label.
  • Best Split ≠ Most Categories: A feature isn't chosen just because it has many values. It's chosen if splitting on it reduces the target variable's variance the most (highest SDR).
  • Instability: Single decision trees can change significantly with small data changes. Ensemble methods like Random Forests (built from many trees) are often more robust.

Quick Practice Checks

| Problem | Solution Approach | Key Concept |
| --- | --- | --- |
| Calculate SD for: [20, 25, 22, 28, 25]. | Find the mean, find the squared differences from the mean, average them, take the square root. | Calculating SD. |
| Parent node (30 points, SD=12). Split A (10 points, SD=5). Split B (20 points, SD=8). Calculate weighted child SD. | Weighted SD = (10/30 * 5) + (20/30 * 8) ≈ 1.67 + 5.33 = 7.0 | Weighted average SD calculation. |
| SDR for Feature A split = 4.2. SDR for Feature B split = 3.8. Which feature is chosen? | Feature A. | Select split with Maximum SDR. |
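To check your answers, here is a quick sketch using Python's statistics module (pstdev computes the population SD used throughout this article):

```python
from statistics import pstdev

# Problem 1: SD of the five sample values
print(round(pstdev([20, 25, 22, 28, 25]), 2))    # ≈ 2.76

# Problem 2: weighted child SD for children of sizes 10 and 20
weighted = (10 / 30) * 5 + (20 / 30) * 8
print(round(weighted, 2))                        # 7.0
```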

Summary: Regression Trees (Part 1)

  • Regression Trees predict continuous numbers.
  • They work by recursively partitioning data based on input features.
  • Splits aim to create groups with low variance (low Standard Deviation) in the target variable.
  • The best split is chosen using Standard Deviation Reduction (SDR).
  • Splitting stops based on criteria like low variance (e.g., low CV), minimum sample size, or max depth.
  • Leaf nodes predict the average of the target variable for samples in that leaf.

Key Formulas/Concepts:

SD (s) ≈ Measure of Spread

SDR = SD_Before - Weighted_Avg_SD_After
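In practice you would rarely code SDR by hand. Here is a minimal scikit-learn sketch; note that DecisionTreeRegressor minimizes within-node variance via its default 'squared_error' criterion, which is closely related to (but not literally) SDR, and the toy data below is made up:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up data: two numeric features (say temperature, humidity) and hours played
X = np.array([[85, 85], [80, 90], [83, 78], [70, 96], [68, 80],
              [65, 70], [64, 65], [72, 95], [69, 70], [75, 80]])
y = np.array([25, 30, 46, 45, 52, 23, 43, 35, 38, 46])

# max_depth and min_samples_leaf act as stopping criteria
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=2)
tree.fit(X, y)

# The prediction is the mean target of the training samples in the matching leaf
print(tree.predict([[72, 85]]))
```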

Test Your Knowledge & Interview Prep


Question 1: What is the main difference between a Regression Tree and a Classification Tree in terms of their purpose and output?

Answer:

A Regression Tree is used to predict a continuous numerical value (like price, temperature, hours). Its leaf nodes typically output the average of the target values in that leaf. A Classification Tree is used to predict a discrete category or class label (like 'Yes'/'No', 'Spam'/'Not Spam'). Its leaf nodes typically output the most common class in that leaf.

Question 2: Explain the goal of using Standard Deviation Reduction (SDR) when deciding where to split a node in a Regression Tree.

Answer:

The goal of SDR is to find the split (based on a feature and value) that makes the resulting child nodes as homogeneous as possible in terms of the target variable. Homogeneous means the values are very similar, which corresponds to a low Standard Deviation. SDR quantifies how much the standard deviation decreases after a split compared to before. By choosing the split with the highest SDR, the algorithm picks the split that best separates the data into groups with less internal variation, leading towards more precise predictions at the leaves.


Question 3: If a leaf node in a Regression Tree is reached, how is the final prediction for a new data point falling into that leaf determined?

Answer:

The final prediction is typically the average (mean) of the target variable values for all the *training* data points that ended up in that specific leaf node during the tree's construction.

Question 4: Why is it generally necessary to have stopping criteria when building a decision tree? What could happen if you didn't stop splitting?

Answer:

Stopping criteria (like minimum samples per leaf, maximum depth, minimum SD reduction) are necessary to prevent overfitting. If the tree splits indefinitely until each leaf contains only one data point, it would perfectly memorize the training data (including noise) but would likely perform very poorly on new, unseen data because it hasn't learned the general underlying pattern.


Question 5: What do the Root Node and Decision Nodes represent in the overall structure and decision-making process of the tree?

Answer:

The Root Node is the starting point, representing the entire dataset and the first question (split) asked based on the feature providing the highest SDR initially. Decision Nodes are subsequent points in the tree where further questions are asked about features to progressively partition the data down specific branches based on the answers, guiding a data point towards a final leaf node prediction.