
Bias, Variance, Overfitting, and Underfitting Explained

Mastering the trade-offs for better machine learning models

Overview

Key Learning Objectives

  • Understand the concepts of bias and variance in machine learning
  • Identify and differentiate between underfitting and overfitting
  • Learn how to diagnose model performance issues
  • Explore techniques to reduce overfitting

Prerequisites

  • Basic knowledge of supervised machine learning
  • Understanding of training and testing data concepts

Essential Terms

  • Bias
  • Variance
  • Overfitting
  • Underfitting
  • Noise

Understanding the Core Concepts

Bias

Definition: Bias is the error introduced by approximating a complex real-world problem with an overly simple model. High bias means the model makes strong assumptions about the data, potentially missing important patterns.

  • Typically leads to high error on both training and test datasets.
  • The model fails to capture the true underlying relationships in the data.
  • Underfitting is often a symptom of high bias.
  • Example: Trying to model a complex, curvy relationship using a simple straight line (linear regression).
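
The straight-line-on-a-curve example can be sketched in a few lines (a minimal illustration assuming NumPy and scikit-learn are available; the quadratic ground truth and noise level are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Quadratic ground truth plus noise: a straight line cannot capture the curvature.
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 0.5, 100)

model = LinearRegression().fit(X, y)
train_mse = mean_squared_error(y, model.predict(X))
print(f"Training MSE of the linear model: {train_mse:.2f}")
```

Even on the data it was trained on, the linear model's error stays large, which is the signature of high bias.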

Variance

Definition: Variance refers to the amount by which the model's learned function would change if it were trained on a different training dataset. High variance means the model is highly sensitive to the specific training data it saw.

  • Often results in low training error but high test error.
  • The model fits the training data too closely, including random noise.
  • Overfitting is often a symptom of high variance.
  • Example: Using a very high-degree polynomial to fit data that follows a simple trend, causing the model to wiggle excessively to hit every training point.
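
A hedged sketch of the same effect (degree and sample sizes are arbitrary; assumes scikit-learn): a degree-14 polynomial fit to 15 noisy points can pass through nearly every training point, yet generalizes poorly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
# Small noisy training sample from a simple sine trend.
X_train = np.sort(rng.uniform(0, 1, 15)).reshape(-1, 1)
y_train = np.sin(2 * np.pi * X_train.ravel()) + rng.normal(0, 0.2, 15)
X_test = np.sort(rng.uniform(0, 1, 50)).reshape(-1, 1)
y_test = np.sin(2 * np.pi * X_test.ravel()) + rng.normal(0, 0.2, 50)

# A degree-14 polynomial on 15 points can interpolate nearly every training point.
wiggly = make_pipeline(PolynomialFeatures(degree=14), LinearRegression()).fit(X_train, y_train)
train_mse = mean_squared_error(y_train, wiggly.predict(X_train))
test_mse = mean_squared_error(y_test, wiggly.predict(X_test))
print(f"train MSE: {train_mse:.4f}, test MSE: {test_mse:.4f}")
```

Training error collapses toward zero while test error stays much higher, which is the signature of high variance.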

Noise

Definition: Noise is the irreducible error in the data itself. It stems from inherent randomness, measurement errors, or unmodeled factors influencing the target variable.

  • This component of error cannot be eliminated by choosing a different model.
  • Our goal is to minimize the reducible error (Bias² + Variance), not the noise.

Underfitting

Definition: An underfit model is too simplistic. It fails to capture the underlying structure of the data, performing poorly on both the training data and unseen test data.

  • Characterized by high bias; variance is typically low, since a very simple model changes little from one training set to another.
  • Performance metrics (like accuracy or error) are poor across the board.
  • Indicates the model needs more complexity (e.g., more features, a more sophisticated algorithm).

Overfitting

Definition: An overfit model is too complex. It learns the training data extremely well, including noise and random fluctuations, but fails to generalize to new, unseen data.

  • Characterized by high variance and typically low bias (the model is flexible enough to fit the training data almost perfectly).
  • Performance is excellent on the training set but poor on the test set.
  • Indicates the model needs simplification or techniques to improve generalization.

Appropriate Fitting (Good Generalization)

Definition: This is the goal! An appropriately fit model captures the true underlying pattern in the data without fitting the noise. It performs well on both training and unseen test data.

  • Achieves a good balance: low bias and low variance.
  • Training error and test error are both low and relatively close to each other.

The Bias-Variance Trade-off Visualized

The following diagram illustrates how different model complexities affect fitting the data. We want a model that captures the underlying trend without fitting the noise (Good Fit).

[Diagram: underfitting (straight line), good fit (smooth curve), and overfitting (wiggly line) on the same sample data points. Image adapted from the Scikit-learn documentation.]

As shown:

  • The Underfitting model (like the linear line, degree 1) is too simple (High Bias).
  • The Overfitting model (like the high-degree polynomial, degree 15) is too complex and fits noise (High Variance).
  • The Good Fit model (like the quadratic curve, degree 4 in the example) captures the trend well (Low Bias, Low Variance).

This illustrates the trade-off: increasing model complexity generally decreases bias but increases variance. The sweet spot is finding the complexity that minimizes the *total* error on unseen data.
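
This sweet-spot search can be approximated with cross-validated error across polynomial degrees (a sketch in the spirit of the scikit-learn example; the cosine target and the degrees 1, 4, and 15 are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, 60).reshape(-1, 1)
y = np.cos(1.5 * np.pi * X.ravel()) + rng.normal(0, 0.1, 60)

# Estimate out-of-sample error for an underfit, well-fit, and overfit degree.
scores = {}
for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5).mean()
    scores[degree] = mse
    print(f"degree {degree:2d}: CV MSE = {mse:.3f}")
```

Degree 1 underfits and degree 15 overfits; the intermediate degree typically attains the lowest cross-validated error.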

Techniques to Combat Overfitting

When your model suffers from high variance (overfitting), consider these strategies:

  • Increase Training Data: More data provides a clearer picture of the underlying patterns and makes it harder for the model to memorize noise.
  • Reduce Model Complexity: Use a simpler model (e.g., fewer layers/neurons in a neural network, lower polynomial degree, fewer features).
  • Early Stopping: Monitor performance on a validation set during training and stop when performance starts to degrade, preventing the model from fitting noise in later epochs.
  • Regularization: Add a penalty term to the loss function for large weights (e.g., L1 Lasso, L2 Ridge). This discourages overly complex models.
  • Dropout: (Specifically for Neural Networks) Randomly "drop" (ignore) a fraction of neurons during each training iteration, forcing the network to learn more robust representations.
  • Cross-Validation: Use techniques like k-fold cross-validation to get a more reliable estimate of model performance on unseen data.
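
As one illustration of the regularization bullet above, the sketch below compares an unregularized degree-12 polynomial fit with an L2-penalized (Ridge) fit. The degree, alpha, and synthetic data are arbitrary choices, and scikit-learn is assumed:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(7)
X = rng.uniform(-1, 1, 40).reshape(-1, 1)
y = X.ravel() ** 3 + rng.normal(0, 0.1, 40)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Same features, with and without an L2 penalty on the weights.
plain = make_pipeline(PolynomialFeatures(12), LinearRegression()).fit(X_tr, y_tr)
ridged = make_pipeline(PolynomialFeatures(12), Ridge(alpha=1.0)).fit(X_tr, y_tr)

plain_norm = np.linalg.norm(plain.named_steps["linearregression"].coef_)
ridge_norm = np.linalg.norm(ridged.named_steps["ridge"].coef_)
print(f"coefficient norm: {plain_norm:.2f} (plain) vs {ridge_norm:.2f} (ridge)")
print(f"test MSE (plain): {mean_squared_error(y_te, plain.predict(X_te)):.4f}")
print(f"test MSE (ridge): {mean_squared_error(y_te, ridged.predict(X_te)):.4f}")
```

The penalty shrinks the weights, which tames the polynomial's wiggle; typically the regularized model's test error is lower even though its training error is slightly higher.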

Practice Problems

| Problem / Scenario | Diagnosis & Solution | Key Takeaway |
| --- | --- | --- |
| A linear regression model shows high error (e.g., RMSE) on both the training set and the test set. | The model is likely underfitting (high bias). Try a more complex model (e.g., polynomial regression, interaction features) or engineer better features. | High error on both train and test suggests underfitting (high bias). |
| A deep neural network achieves 99% accuracy on the training set but only 75% accuracy on the test set. | The model is likely overfitting (high variance). Try adding dropout or L2 regularization, getting more data, or reducing network size (fewer layers/neurons). | A large train/test gap suggests overfitting (high variance). |
| How does increasing the amount of training data generally affect bias and variance? | More data primarily reduces variance; it does little to change bias, which reflects the model's inherent simplicity or complexity. | More data fights high variance (overfitting). |
| Applying L2 regularization to a linear model increases its training error slightly but decreases its test error significantly. What happened? | The original model was likely overfitting. The penalty simplified the model (lower variance) at the cost of slightly higher bias (higher training error), yielding better generalization (lower test error). | Regularization trades a bit of bias for lower variance. |

Summary & Key Formula

Main Points Recap

  • Machine learning model errors stem from a combination of Bias, Variance, and irreducible Noise.
  • Underfitting = High Bias (model too simple).
  • Overfitting = High Variance (model too complex, fits noise).
  • The goal is a model with low bias and low variance, achieving good generalization on unseen data.
  • We manage the trade-off by adjusting model complexity, using regularization, gathering more data, and employing techniques like cross-validation and early stopping.

The Error Formula

Total Error ≈ Bias² + Variance + Noise

(Note: This is the bias-variance decomposition of the expected squared prediction error, stated conceptually.)
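
The decomposition can be checked empirically by refitting a deliberately simple model on many fresh training sets and examining its predictions at a single query point (a simulation sketch; the sine target, noise level, and query point are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
noise_sd = 0.3
x0 = 1.0  # query point at which to measure the decomposition

preds = []
for _ in range(2000):
    # Fresh training set each round: y = sin(x) + noise.
    X = rng.uniform(0, 2 * np.pi, 30)
    y = np.sin(X) + rng.normal(0, noise_sd, 30)
    # Deliberately simple model: a straight line (high bias at most points).
    w = np.polyfit(X, y, deg=1)
    preds.append(np.polyval(w, x0))

preds = np.array(preds)
bias_sq = (preds.mean() - np.sin(x0)) ** 2   # systematic miss, squared
variance = preds.var()                        # spread across training sets
noise = noise_sd ** 2                         # irreducible component
print(f"Bias² = {bias_sq:.4f}, Variance = {variance:.4f}, Noise = {noise:.4f}")
print(f"Expected squared error ≈ {bias_sq + variance + noise:.4f}")
```

With enough simulated training sets, `bias_sq + variance + noise` approximates the expected squared error of a freshly trained model's prediction at `x0`; only the first two terms can be reduced by changing the model.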


Test Your Understanding

Question 1: What is the fundamental difference between bias and variance in the context of machine learning models?

Answer:

Bias refers to the error from incorrect assumptions in the learning algorithm (oversimplification), leading the model to miss relevant relations. Variance refers to the error from sensitivity to small fluctuations in the training set, causing the model to fit random noise rather than the intended output.

Question 2: How can you typically diagnose if your model is underfitting versus overfitting by looking at its performance on the training and test sets?

Answer:

Underfitting (High Bias): The model performs poorly on BOTH the training set and the test set (high error / low accuracy on both).
Overfitting (High Variance): The model performs very well on the training set but poorly on the test set (low training error, high test error; large performance gap).

Question 3: Name two specific techniques primarily used to reduce overfitting in a neural network.

Answer:

Two common techniques are:
1. Dropout: Randomly deactivating neurons during training.
2. Regularization (L1/L2): Adding a penalty for large weights to the loss function.
(Other valid answers include Early Stopping, Data Augmentation, reducing network size).

Question 4: Explain the general impact of increasing model complexity (e.g., using a higher-degree polynomial) on bias and variance.

Answer:

Generally, increasing model complexity tends to decrease bias (as the model can fit more intricate patterns) but increase variance (as the model becomes more sensitive to the specific training data and noise).

Question 5: What is the significance of 'noise' or 'irreducible error' in the decomposition of a model's total error?

Answer:

Noise represents the lower bound on the error that any model can achieve for a given dataset. It's due to inherent randomness or unmodeled factors in the data itself. While we aim to minimize bias and variance, we cannot reduce this irreducible error through modeling choices.