Learn how to remove less useful features step-by-step using P-values and Adjusted R².
When building a Multiple Linear Regression model, we often start by including many potential input features (independent variables). But are all of them truly useful? Sometimes, adding more features doesn't actually improve the model and can even make it worse (overfitting) or harder to understand. How do we find the best, simplest set of features?
Backward Elimination is a popular technique to help us with this. It's a stepwise regression method that starts with *all* potential features and systematically removes the *least useful* ones one by one, until only significant features remain.
Main Technical Concept: Backward elimination is a feature selection technique used primarily with Multiple Linear Regression. It starts with a full model (all predictors) and iteratively removes the least statistically significant predictor (usually based on its p-value) until all remaining predictors meet a chosen significance level.
Think of it like tidying up your toolbox: you start with everything, then remove the tools you never actually use, leaving only the essential ones that get the job done effectively.
Backward Elimination follows a clear, iterative process:
1. Choose a Significance Level to stay in the model, commonly SL = 0.05 (or 5%). This means we want features where we are 95% confident their relationship with the target isn't just due to random chance.
2. Fit the model with all potential predictors included.
3. Consider the predictor with the highest p-value. A p-value below SL (p < SL) suggests the predictor is statistically significant (it likely has a real effect on the target); a p-value above SL (p > SL) suggests it is not statistically significant (we can't be confident its effect isn't just random chance).
4. If that highest p-value is above SL, remove the predictor.
5. Refit the model without it.

You repeat steps 3-5, removing one variable at a time (the one with the highest insignificant p-value), until all variables left in the model are statistically significant according to your chosen threshold. A small hypothetical walkthrough of one removal decision is sketched below.
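To make the decision rule concrete, here is a tiny hypothetical walkthrough; the feature names and p-values below are invented purely for illustration:

```python
# Hypothetical p-values from one model fit (invented numbers, for illustration only)
p_values = {'x1': 0.001, 'x2': 0.602, 'x3': 0.031}
SL = 0.05

# Steps 3-4: find the predictor with the highest p-value and compare it to SL
worst = max(p_values, key=p_values.get)
if p_values[worst] > SL:
    print(f"Remove '{worst}' (p = {p_values[worst]:.3f}), then refit the model (step 5).")
else:
    print("All remaining predictors are significant; stop.")
```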
It's important to understand the roles these two metrics play:
- Rule (P-value): remove the predictor with the maximum p-value if that p-value is greater than the significance level (e.g., 0.05).
- Monitor (Adjusted R²): check Adjusted R² after each removal and ensure it doesn't drop drastically.
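As a minimal sketch of where those two numbers come from (using randomly generated data, not the startups dataset), a fitted `statsmodels` OLS model exposes both directly:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data purely for illustration: y really depends only on the first column
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(size=100)

model = sm.OLS(y_demo, sm.add_constant(X_demo)).fit()
print(model.pvalues)       # decision metric: one p-value per column (incl. the constant)
print(model.rsquared_adj)  # monitoring metric: Adjusted R-squared
```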
While Scikit-learn's `LinearRegression` doesn't directly provide p-values, the `statsmodels` library is excellent for this kind of statistical modeling and feature selection.
Let's assume we're using the `50_Startups.csv` dataset (with columns: R&D Spend, Administration, Marketing Spend, State, Profit).
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Load dataset
df = pd.read_csv('50_Startups.csv')
X = df.iloc[:, :-1].values  # Features (R&D, Admin, Marketing, State)
y = df.iloc[:, -1].values   # Target (Profit)

# One-hot encode the 'State' column (index 3), drop one dummy to avoid trap
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(drop='first'), [3])],
    remainder='passthrough'  # Keep other columns
)
X = np.array(ct.fit_transform(X), dtype=float)  # Ensure float type

# Add a column of ones for the intercept (b0) - statsmodels requires this explicitly
X = np.append(arr=np.ones((X.shape[0], 1)).astype(float), values=X, axis=1)

# Now X columns might be: [Constant, DummyState1, DummyState2, R&D, Admin, Marketing]
# We need to keep track of which column corresponds to which feature!
print("Shape of X after preprocessing:", X.shape)
```
```python
import statsmodels.api as sm

# Significance Level
SL = 0.05

def backward_elimination(x_data, y_data, significance_level):
    num_vars = x_data.shape[1]  # Number of columns (features + constant)
    # Start with all predictors
    current_x = x_data.copy()
    for i in range(num_vars):  # Iterate potentially num_vars times
        regressor_OLS = sm.OLS(endog=y_data, exog=current_x).fit()
        p_values = regressor_OLS.pvalues
        max_p_value = p_values.max()
        print(f"\nIteration {i+1}: Max p-value = {max_p_value:.4f}")
        print(f"Current Adj. R-squared: {regressor_OLS.rsquared_adj:.4f}")
        if max_p_value > significance_level:
            # Find the index of the feature with the highest p-value
            max_p_index = p_values.argmax()
            print(f"Removing feature at index {max_p_index} (p={max_p_value:.4f})")
            # Remove the corresponding column from current_x
            current_x = np.delete(current_x, max_p_index, axis=1)
        else:
            print("All remaining features are significant. Stopping.")
            break  # Exit loop if all remaining p-values <= SL

    print("\nFinal Model Summary:")
    final_regressor_OLS = sm.OLS(endog=y_data, exog=current_x).fit()
    print(final_regressor_OLS.summary())
    return current_x  # Return the matrix with only optimal features

# Apply the function (using the full dataset X, y for demonstration - normally use X_train, y_train)
print("Starting Backward Elimination...")
X_optimal_features = backward_elimination(X, y, SL)
print("\nShape of X with optimal features:", X_optimal_features.shape)

# You would then retrain your final LinearRegression model (from sklearn) using ONLY these optimal features
```
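A minimal sketch of that final retraining step, assuming a simple hold-out split is acceptable (it reuses `X_optimal_features` and `y` from above):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Split the reduced feature matrix; the constant column is harmless for sklearn,
# but you could also drop it and let LinearRegression fit its own intercept
X_train, X_test, y_train, y_test = train_test_split(
    X_optimal_features, y, test_size=0.2, random_state=0
)

final_model = LinearRegression()
final_model.fit(X_train, y_train)
print("Hold-out R^2:", r2_score(y_test, final_model.predict(X_test)))
```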
The output of `regressor_OLS.summary()` within the loop (or at the end) is key. It shows each feature, its coefficient, standard error, t-statistic, and importantly, the P>|t| column (this is the p-value). You watch this column to decide which feature to remove next.
Issue | Potential Cause & Solution | Prevention / Best Practice |
---|---|---|
A feature known to be important is eliminated. | Significance level (SL) might be too strict (too low); high correlation with another kept variable might mask its individual significance. Solution: Re-evaluate SL, check for multicollinearity (VIF scores), consider domain knowledge. | Don't rely solely on automated methods; combine with domain expertise. Check VIFs before starting. |
Adjusted R² drops significantly after removing a variable (even if p-value > SL). | The removed variable, while not meeting the strict p-value cutoff, still contributed meaningfully to explaining variance relative to model complexity. Solution: Consider keeping the variable if the drop in Adj. R² is substantial and the p-value wasn't excessively high. Re-assess SL. | Monitor Adjusted R² alongside p-values. Balance statistical significance with practical model performance. |
Final model performs poorly on test data despite good stats on training data. | The selection process might have overfit to the training data's specific characteristics; the chosen SL might not generalize well. Solution: Perform backward elimination within a cross-validation loop for more robust feature selection. Validate final model on a hold-out test set. | Use cross-validation. Don't make final decisions based only on training set metrics. |
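Since the table above recommends checking VIF scores before starting, here is a minimal sketch using `statsmodels` (it assumes `X` is the numeric matrix built earlier, with the constant in column 0):

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for each column of X; the constant column's VIF is normally ignored
for i in range(X.shape[1]):
    print(f"Column {i}: VIF = {variance_inflation_factor(X, i):.2f}")
```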
The `statsmodels` library in Python is very useful for this, as it provides detailed OLS summaries including p-values.

Interview Question
Question 1: What is the main goal of using Backward Elimination in Multiple Linear Regression?
The main goal is to simplify the model by removing independent variables that are not statistically significant predictors of the dependent variable. This aims to create a more parsimonious (simpler), interpretable model that potentially generalizes better by excluding irrelevant features.
Question 2: What metric is typically used at each step of Backward Elimination to decide which variable to remove?
The P-value associated with each predictor's coefficient is typically used. The predictor with the highest p-value *above* the chosen significance level (e.g., 0.05) is removed.
Interview Question
Question 3: Describe the iterative process of Backward Elimination.
1. Start with a model including all potential predictors.
2. Set a Significance Level (SL, e.g., 0.05).
3. Fit the model and find the predictor with the highest p-value.
4. If that highest p-value > SL, remove that predictor and go back to step 3 (refit the model).
5. If all remaining predictors have p-values ≤ SL, stop. The current set of predictors is the final selection.
Question 4: What role does Adjusted R-squared play during Backward Elimination?
Adjusted R-squared is primarily used as a monitoring metric. While the p-value dictates removal, checking the Adjusted R² after each step helps ensure that removing the statistically insignificant variable doesn't drastically harm the model's overall explanatory power relative to its complexity. A stable or slightly increasing Adjusted R² is generally a good sign during elimination.
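For reference, a small helper implementing the standard Adjusted R² formula (n = number of observations, p = number of predictors excluding the intercept) makes the complexity penalty explicit:

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1); penalizes extra predictors."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# The same raw R² looks worse once more predictors are involved
print(adjusted_r_squared(0.90, n=50, p=3))   # ~0.893
print(adjusted_r_squared(0.90, n=50, p=10))  # ~0.874
```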
Interview Question
Question 5: Why is a library like `statsmodels` often preferred over `scikit-learn`'s basic `LinearRegression` when performing Backward Elimination?
`statsmodels` (specifically its `OLS` - Ordinary Least Squares - model) provides a detailed statistical summary output after fitting, which directly includes the coefficients, standard errors, t-statistics, and crucially, the p-values for each predictor. Scikit-learn's `LinearRegression` focuses more on prediction and doesn't readily provide these detailed statistical significance measures needed for the standard Backward Elimination process.
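As a minimal illustration of that difference (synthetic data, not the startups dataset), fitting the same design matrix with both libraries shows that only `statsmodels` reports per-coefficient p-values out of the box:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(60, 2))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(size=60)

# statsmodels: coefficients AND their p-values
ols_fit = sm.OLS(y_toy, sm.add_constant(X_toy)).fit()
print("statsmodels p-values:", ols_fit.pvalues)

# scikit-learn: coefficients only, no built-in significance statistics
lr = LinearRegression().fit(X_toy, y_toy)
print("sklearn coefficients:", lr.coef_)
```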