Learn the theory and intuition behind this essential dimensionality reduction technique.
Modern datasets can be huge, not just in the number of rows (samples), but also in the number of columns (features or dimensions). Trying to analyze or build models with hundreds or thousands of features can be incredibly challenging due to the "Curse of Dimensionality". How can we make sense of such complex data?
Principal Component Analysis (PCA) is a fundamental and widely used dimensionality reduction technique. It helps us simplify complex datasets by transforming a large set of features into a smaller set of new features, called principal components, while retaining most of the original information (variance).
Main Technical Concept: PCA is an unsupervised feature extraction technique that finds a new coordinate system for the data. The axes of this new system (the principal components) are chosen such that the first axis captures the maximum variance in the data, the second captures the maximum remaining variance while being orthogonal (uncorrelated) to the first, and so on. By keeping only the first few principal components, we reduce dimensionality while preserving most of the data's variability.
Before diving into *how* PCA works, let's quickly recap why we need dimensionality reduction: high-dimensional data is hard to visualize, expensive to compute with, and prone to overfitting and the other problems grouped under the Curse of Dimensionality.
Imagine your data points plotted in space (even if it's a space with hundreds of dimensions!). PCA tries to find a new set of axes (directions) for this space with special properties:
1. The first axis (PC1) points in the direction of maximum variance in the data.
2. Each subsequent axis (PC2, PC3, ...) captures the maximum remaining variance while being orthogonal (uncorrelated) to all previous axes.
These new axes (PC1, PC2, PC3, ...) are linear combinations of the original features. The key idea is that the first few principal components often capture the vast majority of the original data's variability.
(Figure: principal component axes overlaid on a data scatter. Image credit: Uploadalt on Wikimedia Commons, CC0)
By discarding the later principal components (which capture very little variance), we can reduce the number of dimensions while retaining most of the important information contained in the data's spread.
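To make the "linear combination" idea concrete, here is a minimal sketch (assuming scikit-learn and its bundled Iris dataset, which are not part of this walkthrough): each row of `pca.components_` holds the weights that mix the original features into one principal component, and projecting onto those weight vectors reproduces `transform()`.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Illustrative only: the Iris dataset stands in for "your data"
X = load_iris().data                        # shape (150, 4)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_scaled)

# Each row of components_ is one principal component expressed as
# weights (loadings) on the original four features
print(pca.components_.shape)                # (2, 4)
print(pca.components_[0])                   # weights defining PC1

# Projecting the centered data onto those weight vectors reproduces transform()
scores_manual = (X_scaled - pca.mean_) @ pca.components_.T
print(np.allclose(scores_manual, pca.transform(X_scaled)))  # True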
The underlying math involves linear algebra (eigenvectors and eigenvalues), but let's focus on the conceptual steps:
1. Standardize the data: rescale each feature to zero mean and unit variance, z = (x - μ) / σ (where μ is the mean and σ is the standard deviation), so no feature dominates simply because of its scale.
2. Compute the Covariance Matrix (Σ): this matrix summarizes how the standardized features vary together.
3. Eigendecomposition: find vectors v (eigenvectors) and scalars λ (eigenvalues) such that multiplying the covariance matrix by an eigenvector is the same as scaling the eigenvector by its eigenvalue: Σ v = λ v.
   - Eigenvectors point in directions of variance (these directions become the principal components).
   - Eigenvalues tell you how *much* variance is in that direction.
4. Sort and project: order the eigenvectors by decreasing eigenvalue, keep the top k, and project the data onto them to obtain the reduced representation.
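To connect these steps to code, here is a minimal from-scratch sketch in NumPy (an illustrative implementation, not the scikit-learn routine used later): it standardizes the data, builds the covariance matrix, eigendecomposes it, and projects onto the top-k eigenvectors.
import numpy as np

def pca_from_scratch(X, k):
    # 1. Standardize: zero mean, unit variance per feature
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # 2. Covariance matrix of the standardized features
    cov = np.cov(X_std, rowvar=False)

    # 3. Eigendecomposition (eigh: the covariance matrix is symmetric)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # 4. Sort by decreasing eigenvalue and keep the top-k directions
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    components = eigenvectors[:, :k]

    # Project the data onto the principal components
    X_reduced = X_std @ components
    explained_ratio = eigenvalues[:k] / eigenvalues.sum()
    return X_reduced, explained_ratio

# Example with random data standing in for a real dataset
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
X_red, ratio = pca_from_scratch(X_demo, k=2)
print(X_red.shape, ratio)   # (200, 2) and the variance captured by PC1, PC2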
After calculating all principal components, how do you decide how many (k) to retain for your reduced dataset?
Two common approaches are to check the cumulative explained variance against a threshold (in scikit-learn this is exposed through the `explained_variance_ratio_` attribute) and to look for an "elbow" in a scree plot of the per-component variance.
(Figure: example scree plot. Image source: Data Science Plus)
The choice often involves a trade-off between dimensionality reduction and information loss (loss of variance).
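As a small illustration of this trade-off (a sketch assuming a standardized matrix `X_scaled` and a PCA fitted on all components), you can compute the smallest k that clears a chosen variance threshold directly from the cumulative explained variance; the scikit-learn walkthrough below does the same thing via `PCA(n_components=0.95)`.
import numpy as np
from sklearn.decomposition import PCA

# Assume X_scaled is your standardized feature matrix
pca_full = PCA().fit(X_scaled)               # fit with all components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

threshold = 0.95                             # keep 95% of the variance
k = int(np.argmax(cumulative >= threshold)) + 1
print(f"Keep {k} components to retain {cumulative[k - 1]:.1%} of the variance")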
Scikit-learn makes PCA implementation straightforward.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# --- Assume X is your original feature matrix (n_samples, n_features) ---
# df = pd.read_csv(...)
# X = df.drop('target_column', axis=1).values
# 1. Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# 2. Apply PCA
# Specify the number of components (k) or the variance ratio to keep
# Option A: Specify number of components (e.g., keep 2)
pca_k = PCA(n_components=2)
X_pca_k = pca_k.fit_transform(X_scaled)
# Option B: Specify variance ratio (e.g., keep 95% of variance)
pca_var = PCA(n_components=0.95)
X_pca_var = pca_var.fit_transform(X_scaled)
# 3. Analyze Explained Variance
print("Explained variance ratio (per component) for pca_k:", pca_k.explained_variance_ratio_)
print("Total variance explained by pca_k:", np.sum(pca_k.explained_variance_ratio_))
print(f"\nNumber of components chosen by pca_var (for 95% variance): {pca_var.n_components_}")
print("Explained variance ratio (per component) for pca_var:", pca_var.explained_variance_ratio_)
print("Total variance explained by pca_var:", np.sum(pca_var.explained_variance_ratio_))
# 4. Use the transformed data (e.g., X_pca_k or X_pca_var) for further modeling or visualization
print("\nShape of original data:", X_scaled.shape)
print("Shape after PCA (k=2):", X_pca_k.shape)
print("Shape after PCA (95% variance):", X_pca_var.shape)
# (Optional: Scree plot)
# pca_full = PCA().fit(X_scaled)
# plt.plot(np.cumsum(pca_full.explained_variance_ratio_))
# plt.xlabel('Number of Components')
# plt.ylabel('Cumulative Explained Variance')
# plt.grid(True)
# plt.show()
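One way to see the information-loss side of the trade-off (a short sketch reusing the variables above, not part of the original walkthrough) is to map the reduced data back to the original feature space with `inverse_transform` and measure the reconstruction error.
# Reconstruct the standardized data from the 2-component representation
X_reconstructed = pca_k.inverse_transform(X_pca_k)

# Mean squared reconstruction error: the variance PCA discarded
reconstruction_mse = np.mean((X_scaled - X_reconstructed) ** 2)
print("Reconstruction MSE with k=2 components:", reconstruction_mse)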
Interview Question
Question 1: What is the main goal of Principal Component Analysis (PCA)?
The main goal of PCA is to reduce the dimensionality (number of features) of a dataset while retaining as much of the original data's variability (information) as possible. It achieves this by transforming the data onto a new set of uncorrelated axes called principal components, ordered by the amount of variance they capture.
Question 2: Why is standardizing the data a crucial first step before applying PCA?
PCA finds components based on maximizing variance. If features have vastly different scales (e.g., meters vs. kilometers, or age vs. income), the feature(s) with the largest scale will dominate the variance calculation and thus heavily influence the principal components. Standardizing brings all features to a common scale (usually zero mean and unit variance), ensuring that PCA finds directions based on the underlying patterns rather than arbitrary feature scales.
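A quick way to see this effect (an illustrative sketch with made-up numbers, not data from the article) is to run PCA on two equally informative features measured on very different scales, with and without standardization.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
age = rng.normal(40, 10, size=500)             # values in the tens
income = rng.normal(60_000, 15_000, size=500)  # values in the tens of thousands
X_demo = np.column_stack([age, income])

# Without scaling, income's huge variance dominates PC1 almost entirely
print(PCA(n_components=2).fit(X_demo).explained_variance_ratio_)

# After standardization, both features contribute on an equal footing
X_demo_scaled = StandardScaler().fit_transform(X_demo)
print(PCA(n_components=2).fit(X_demo_scaled).explained_variance_ratio_)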
Interview Question
Question 3: What do the Eigenvalues and Eigenvectors represent in the context of PCA applied to a covariance matrix?
Eigenvectors represent the directions of the new axes (the principal components) in the feature space. They indicate the orientation along which the data varies.
Eigenvalues represent the magnitude of the variance captured along the direction of the corresponding eigenvector. A larger eigenvalue means more data variance lies along that principal component's direction.
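If you want to verify this numerically (a small sketch assuming a standardized matrix `X_scaled`, reusing names from the code above), scikit-learn's `explained_variance_` attribute matches the eigenvalues of the covariance matrix, sorted in descending order.
import numpy as np
from sklearn.decomposition import PCA

pca_check = PCA().fit(X_scaled)

# Eigenvalues of the covariance matrix, largest first
eigenvalues = np.sort(np.linalg.eigvalsh(np.cov(X_scaled, rowvar=False)))[::-1]

print(np.allclose(pca_check.explained_variance_, eigenvalues))  # True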
Question 4: How do you typically decide how many principal components (k) to keep after performing PCA?
Common methods include:
1. Explained Variance Threshold: Choose the minimum number of components (k) required to retain a certain percentage (e.g., 90%, 95%, 99%) of the total variance. This is checked using the cumulative sum of the `explained_variance_ratio_`.
2. Scree Plot: Plot the explained variance of each component (eigenvalues) in descending order and look for an "elbow" point where the variance explained by subsequent components drops off sharply. Keep the components before the elbow.
Interview Question
Question 5: What is a major disadvantage of using PCA in terms of model interpretation?
A major disadvantage is the loss of interpretability. The principal components are mathematical combinations of *all* original features. While PC1 might capture "overall size" or PC2 might capture "shape difference", they generally don't correspond directly to single, real-world input variables, making it harder to explain *why* the model makes a certain prediction based on the original features.
Question 6: Is PCA a feature selection or a feature extraction technique? Explain why.
PCA is a feature extraction technique. It doesn't select a subset of the original features; instead, it transforms the original features into a new, smaller set of artificial features (the principal components) that are linear combinations of the old ones.
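To make the contrast concrete, here is a brief sketch (using scikit-learn's `SelectKBest` purely as an example of feature selection; it is not part of this article's workflow): selection keeps two of the original columns unchanged, while PCA produces two brand-new columns mixed from all of the originals.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Feature selection: keeps a subset of the original columns as-is
# (note that SelectKBest is supervised, so it needs the labels y)
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Feature extraction: builds new columns from ALL original features
X_extracted = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

print(X_selected[:2])   # values taken straight from two original features
print(X_extracted[:2])  # values that exist in no original column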