Understand how to evaluate your regression models accurately.
When we build regression models to predict values, we need a way to measure how well they actually fit the data. Two of the most common metrics for this are R-squared (R²) and Adjusted R-squared. They sound similar, but they tell slightly different stories, especially when dealing with multiple input features!
Understanding the difference is crucial for correctly evaluating your models and avoiding common pitfalls like thinking a complex model is great when it's actually just overfitting. Let's break them down.
R-squared, also known as the Coefficient of Determination, tells you the proportion (or percentage) of the variance in your dependent variable (Y, the thing you're predicting) that can be explained by the independent variable(s) (X, your inputs) included in the model.
Think of it like this: Your Y values naturally vary. How much of that variation does your model capture based on the X values? R² gives you that percentage.
R² essentially compares the errors made by your regression model to the errors you'd make if you simply guessed the average value of Y for every prediction.
R² = 1 - [ Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)² ]

where yᵢ is the actual value, ŷᵢ is the value predicted by the model, and ȳ is the mean (average) of the actual y values.

The numerator, Σ(yᵢ - ŷᵢ)², is the sum of squared residuals (the model's errors). The denominator, Σ(yᵢ - ȳ)², is the total sum of squares, representing the total variation in Y around its mean.
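To make the formula concrete, here is a minimal sketch with made-up numbers that computes R² directly from the two sums of squares and checks it against scikit-learn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values, purely for illustration
y_true = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y_hat = np.array([12.0, 18.0, 29.0, 43.0, 48.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # Σ(yᵢ - ŷᵢ)²: sum of squared residuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # Σ(yᵢ - ȳ)²: total sum of squares

r2_manual = 1 - ss_res / ss_tot
print(f"Manual R²:  {r2_manual:.4f}")                # 0.9780
print(f"sklearn R²: {r2_score(y_true, y_hat):.4f}")  # Matches the manual calculation
```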
Watch out! R² has a major drawback: it almost always increases (or stays the same) whenever you add *any* new independent variable to the model, even if that variable is completely useless and has no real relationship with the dependent variable!
Why? Because adding *any* variable gives the model slightly more flexibility to fit the training data, even if it's just fitting noise. This makes R² potentially misleading when comparing models with different numbers of predictors.
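You can watch this happen with a quick experiment. The sketch below (synthetic data; exact numbers will vary with the random seed) fits one model on a genuinely informative feature and a second model with an extra column of pure noise appended. The training R² of the second model will be at least as high:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic data: y depends on one real feature, plus noise
n = 100
x_real = rng.normal(size=(n, 1))
y = 3 * x_real[:, 0] + rng.normal(scale=2.0, size=n)

# Model 1: only the real feature
r2_real = LinearRegression().fit(x_real, y).score(x_real, y)

# Model 2: the real feature plus a completely useless random column
x_junk = np.hstack([x_real, rng.normal(size=(n, 1))])
r2_junk = LinearRegression().fit(x_junk, y).score(x_junk, y)

print(f"R² with real feature only:    {r2_real:.4f}")
print(f"R² after adding noise column: {r2_junk:.4f}")  # Slightly higher or equal, never lower
```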
To overcome the limitation of regular R², we use Adjusted R-squared. It modifies the R² value to account for the number of independent variables (predictors) included in the model relative to the number of data points.
Adjusted R² introduces a penalty for adding predictors that don't significantly improve the model's explanatory power.
Adjusted R² = 1 - [ (1 - R²) × (n - 1) / (n - k - 1) ]

where R² is the regular R-squared value, n is the number of data points (observations), and k is the number of independent variables (predictors) in the model.

The ratio (n - 1) / (n - k - 1) acts as the penalty factor. As k increases (more predictors), this ratio grows, making the subtraction term larger and thus reducing the Adjusted R² unless the improvement in R² is substantial enough to offset the penalty.
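A quick worked example (illustrative numbers): with n = 50 observations, a model with k = 5 predictors and R² = 0.85 gives Adjusted R² = 1 - 0.15 × (49/44) ≈ 0.833. If five more predictors nudge R² up to only 0.86 (k = 10), Adjusted R² = 1 - 0.14 × (49/39) ≈ 0.824. The small gain in R² does not offset the penalty, so Adjusted R² actually falls.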
| Feature | R-Squared (R²) | Adjusted R-Squared |
|---|---|---|
| Definition | Proportion of variance in Y explained by X(s). | Proportion of variance explained, adjusted for the number of predictors (k) and sample size (n). |
| Range | Typically 0 to 1. | Can be negative; never exceeds R². |
| Effect of adding predictors | Always increases or stays the same. | Increases only if the added predictor improves the model enough to offset the penalty; can decrease if the predictor is useless. |
| Main use case | Measures overall goodness-of-fit for a *single* model. | Comparing models with different numbers of predictors; assessing the usefulness of added predictors. |
| Overfitting indication | Can be misleadingly high in overfit models with many predictors. | Helps detect overfitting (if Adjusted R² is much lower than R², or decreases when adding predictors). |
Scikit-learn's r2_score function calculates R². You typically calculate Adjusted R² manually from that score, as in the example below:
```python
from sklearn.metrics import r2_score
import numpy as np

# Example placeholder values; replace with your own actual and predicted arrays
y_test = np.array([10, 20, 30, 40, 50])
y_pred = np.array([11, 18, 32, 38, 49])

k = 3            # Number of predictors used to generate the predictions
n = len(y_test)  # Number of samples

# Calculate R-squared
r2 = r2_score(y_test, y_pred)
print(f"R-squared (R²): {r2:.4f}")

# Calculate Adjusted R-squared (the denominator must be positive)
if n - k - 1 > 0:
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    print(f"Adjusted R-squared: {adj_r2:.4f}")
else:
    print("Adjusted R-squared: Cannot calculate (n - k - 1 must be positive)")
```
Interview Question
Question 1: What does R-squared (Coefficient of Determination) actually measure?
R-squared measures the proportion (or percentage) of the total variance in the dependent variable (Y) that is explained or accounted for by the independent variable(s) (X) included in the regression model. It indicates how well the model's predictions fit the actual data points compared to simply predicting the mean of Y.
Question 2: Why was Adjusted R-squared developed? What problem with R-squared does it address?
Adjusted R-squared was developed to address the limitation of R-squared, which is that R² tends to increase (or stay the same) every time a new predictor is added to the model, regardless of whether that predictor is actually useful. Adjusted R² penalizes the model for the number of predictors included, providing a more accurate measure of goodness-of-fit when comparing models with different numbers of features.
Interview Question
Question 3: You are comparing two models. Model A has 5 predictors and an R² of 0.85 / Adjusted R² of 0.83. Model B has 10 predictors and an R² of 0.87 / Adjusted R² of 0.80. Which model might you prefer and why?
You might prefer Model A. Although Model B has a slightly higher R², its Adjusted R² is lower than Model A's. This suggests that the additional 5 predictors in Model B did not add enough explanatory power to justify the increased complexity; they might be irrelevant or causing slight overfitting. Model A provides nearly the same explanatory power (high R²) with fewer predictors, making it potentially more parsimonious and robust (as indicated by the higher Adjusted R²).
Question 4: Is it possible for Adjusted R-squared to be negative?
Yes, Adjusted R-squared can be negative. This typically happens when the model fits the data very poorly (the regular R² is close to zero or even slightly negative, which can occur if the model fits worse than just predicting the mean) and the penalty for the number of predictors is large enough to push the adjusted value below zero. A negative Adjusted R² strongly indicates a very poor model fit.
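A quick illustration with made-up numbers: with R² = 0.05, n = 20, and k = 10, Adjusted R² = 1 - 0.95 × (19/9) ≈ -1.01, well below zero even though R² itself is positive.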
Interview Question
Question 5: Can you rely solely on R-squared or Adjusted R-squared to determine if a regression model is "good"? Why or why not?
No, you cannot rely solely on R² or Adjusted R². While they measure goodness-of-fit, they don't tell the whole story. A model could have a high R² but violate key regression assumptions (like linearity or homoscedasticity), making its coefficients unreliable. It also doesn't indicate if individual predictors are statistically significant or if the predictions are accurate enough for the specific business context (MAE/RMSE might be more relevant for that). Always check assumptions, residual plots, and consider other metrics alongside R²/Adjusted R².
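For context, the complementary error metrics mentioned above are easy to compute with scikit-learn (a sketch with placeholder arrays; substitute your own values):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Placeholder actual and predicted values
y_test = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y_pred = np.array([11.0, 18.0, 32.0, 38.0, 49.0])

mae = mean_absolute_error(y_test, y_pred)           # Average absolute error, in Y's units
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE penalizes large misses more heavily

print(f"MAE:  {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
```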