Understand how data groups emerge by merging or splitting, visualized with dendrograms.
Imagine organizing items not just into separate boxes (like K-Means does), but creating a structure showing how groups merge into larger groups, like a family tree or an organizational chart. This is the idea behind Hierarchical Clustering, another powerful unsupervised learning technique for finding patterns and groups in data.
Unlike K-Means where you *must* specify the number of clusters beforehand, Hierarchical Clustering builds a hierarchy of clusters, allowing you to explore different levels of grouping afterward. Today, we'll focus on the most common approach: Agglomerative Hierarchical Clustering.
Main Technical Concept: Hierarchical Clustering creates a nested sequence of clusters. Agglomerative (bottom-up) clustering starts with each data point as its own cluster and iteratively merges the two closest clusters until only one cluster (containing all data) remains. The process creates a tree structure called a dendrogram.
There are two main strategies for hierarchical clustering:
- Agglomerative (bottom-up): start with every data point as its own cluster and repeatedly merge the two closest clusters.
- Divisive (top-down): start with all points in a single cluster and recursively split it into smaller clusters.

We will primarily discuss the Agglomerative method, as it's more widely used and implemented.
Here's how the bottom-up merging works:
1. Treat every data point as its own cluster (with n points, you start with n clusters).
2. Compute the distance between every pair of clusters.
3. Merge the two closest clusters into a single cluster.
4. Repeat steps 2–3 until only one cluster, containing all the data, remains.

This merging process naturally creates a hierarchy, which we can visualize.
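To make the merge sequence concrete, here is a minimal sketch (the five toy points are invented for illustration) that runs SciPy's agglomerative linkage on a tiny 1-D dataset and prints each recorded merge:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five toy points; SciPy expects a 2-D array of shape (n_samples, n_features)
points = np.array([[1.0], [1.2], [5.0], [5.1], [9.0]])

# Each row of the linkage matrix records one merge:
# [cluster index A, cluster index B, merge distance, size of the new cluster]
Z = linkage(points, method='ward')

for step, (a, b, dist, size) in enumerate(Z, start=1):
    print(f"Step {step}: merge clusters {int(a)} and {int(b)} "
          f"at distance {dist:.2f} -> new cluster of {int(size)} points")
```

Original points keep their indices 0–4; each newly formed cluster gets the next available index (5, 6, ...).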
To decide which clusters to merge at each step, we need two crucial components: a way to measure distance between individual points, and a way to measure distance between whole clusters (the linkage criterion).
First, we need a way to measure the distance or dissimilarity between individual data points. Common choices for numerical data include:
- Euclidean distance: the straight-line distance between two points (the usual default).
- Manhattan distance: the sum of the absolute differences along each feature.
Just like K-Means, Hierarchical Clustering using these metrics is sensitive to feature scales. Feature Scaling (Standardization/Normalization) is generally recommended before clustering!
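As a quick illustration of both points (the two metrics above plus the effect of scaling; the customer numbers are invented), the snippet below computes Euclidean and Manhattan distances before and after standardization:

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock
from sklearn.preprocessing import StandardScaler

# Two invented customers: [annual income in dollars, spending score 1-100]
a = np.array([15000.0, 39.0])
b = np.array([16000.0, 81.0])

print("Euclidean:", euclidean(a, b))   # ~1000.9 -- dominated by the income gap
print("Manhattan:", cityblock(a, b))   # 1000 + 42 = 1042 -- again mostly income

# After standardization, both features contribute on a comparable scale
X_demo = np.array([a, b, [30000.0, 5.0], [80000.0, 90.0]])
X_demo_scaled = StandardScaler().fit_transform(X_demo)
print("Euclidean (scaled):", euclidean(X_demo_scaled[0], X_demo_scaled[1]))
```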
Once we have multiple points in a cluster, how do we measure the distance between two clusters? This is defined by the Linkage Criterion:
- Ward's linkage: merge the pair of clusters that causes the smallest increase in total within-cluster variance.
- Complete (maximum) linkage: the distance between the two *farthest* points, one from each cluster.
- Average linkage: the average distance over all pairs of points, one from each cluster.
- Single (minimum) linkage: the distance between the two *closest* points, one from each cluster.
The choice of linkage method can significantly impact the final structure of the clusters!
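To see this for yourself, here is a small sketch (the toy blobs and the random seed are arbitrary) that clusters the same points with each linkage option and compares the resulting group sizes; single linkage in particular is prone to "chaining" when groups are connected by in-between points:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy data: two loose groups plus a few scattered in-between points
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
                   rng.normal(4, 0.5, size=(20, 2)),
                   rng.uniform(1, 3, size=(5, 2))])

# Same data, same K -- only the linkage criterion changes
for method in ['ward', 'complete', 'average', 'single']:
    labels = AgglomerativeClustering(n_clusters=2, linkage=method).fit_predict(X_toy)
    print(f"{method:>8} linkage -> cluster sizes: {np.bincount(labels)}")
```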
The entire merging process of agglomerative clustering can be beautifully visualized using a tree-like diagram called a Dendrogram.
Dendrograms are extremely useful because they show the entire hierarchy of merges, allowing us to *choose* the number of clusters *after* the algorithm has run!
Unlike K-Means, you don't *have* to specify K beforehand. You can decide by looking at the dendrogram:
- Find the longest vertical line that is not crossed by any horizontal merge line.
- Imagine a horizontal "cut" through that segment.
- The number of vertical lines the cut intersects is a good candidate for the number of clusters.

Essentially, you are "cutting" the tree at a height that represents a significant jump in the distance required to merge clusters, suggesting that the groups below that cut are relatively distinct.
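The same cut can be made in code: SciPy's `fcluster` turns a linkage matrix into flat cluster labels, given either a cut height or a target number of clusters. A minimal sketch (toy points and threshold chosen only for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0], [1.2], [5.0], [5.1], [9.0]])
Z = linkage(points, method='ward')

# Cut the tree at a chosen height: every merge above that distance is undone
labels_by_height = fcluster(Z, t=3.0, criterion='distance')

# Or ask directly for a fixed number of flat clusters
labels_by_count = fcluster(Z, t=3, criterion='maxclust')

print("Cut at distance 3.0:", labels_by_height)
print("Forced into 3 clusters:", labels_by_count)
```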
We typically use `scipy` to generate the dendrogram and `scikit-learn` to perform the actual clustering once K is chosen.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage  # Important imports!
from sklearn.preprocessing import StandardScaler  # Scaling is good practice

# Load dataset
dataset = pd.read_csv('Mall_Customers.csv')

# Select features: Annual Income (idx 3) and Spending Score (idx 4)
X = dataset.iloc[:, [3, 4]].values

# Optional but recommended: Feature Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Generate the linkage matrix using Ward's method
# 'ward' minimizes variance within clusters, often gives nice results
linked = linkage(X_scaled, method='ward')

# Plot the dendrogram
plt.figure(figsize=(15, 7))  # Make it wider
dendrogram(linked,
           orientation='top',
           # labels=dataset['CustomerID'].values,  # Optional: Label leaves if needed
           distance_sort='descending',
           show_leaf_counts=True)
plt.title('Hierarchical Clustering Dendrogram (Ward Linkage)')
plt.xlabel('Customers (or cluster indices)')
plt.ylabel('Distance (Ward)')
# plt.show()  # Uncomment to display plot

print("Dendrogram plot generated. Examine it to choose optimal K.")
```
Examine the plotted dendrogram. Find the longest vertical line that no horizontal merge line crosses, imagine a horizontal cut through it, and count how many vertical lines that cut intersects – that count is your likely optimal K (in the Mall Customers example, it's often visually determined to be 5).
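If you want a quick programmatic sanity check of that visual rule, one rough heuristic (a sketch, not a formal criterion) is to find the largest jump between consecutive merge distances in the linkage matrix; it reuses `linked` and `X_scaled` from the code above:

```python
import numpy as np

# Column 2 of the linkage matrix holds the distance at which each merge happened
merge_distances = linked[:, 2]
gaps = np.diff(merge_distances)  # jump in distance between consecutive merges

# After merge i (0-indexed), n - (i + 1) clusters remain, so cutting inside the
# largest gap (between merges i and i + 1) suggests K = n - (i + 1)
i = int(np.argmax(gaps))
suggested_k = X_scaled.shape[0] - (i + 1)
print(f"Largest distance jump occurs after merge {i} -> suggested K = {suggested_k}")
```

Whatever K the dendrogram (or this heuristic) suggests, the clustering itself is then run with scikit-learn: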
```python
from sklearn.cluster import AgglomerativeClustering

# Choose the number of clusters based on dendrogram analysis
optimal_k = 5

# Initialize and fit the Agglomerative Clustering model.
# linkage='ward' is the criterion used to merge clusters; it works with the
# (default) Euclidean distance metric. Note: recent scikit-learn versions
# replaced the old 'affinity' parameter with 'metric'.
hc_model = AgglomerativeClustering(n_clusters=optimal_k, linkage='ward')

# Fit the model and predict cluster labels for each data point
y_hc_labels = hc_model.fit_predict(X_scaled)  # Use scaled data

print(f"\nCluster labels for first 10 customers (K={optimal_k}):\n", y_hc_labels[:10])

# Visualize the clusters (using original or scaled data for plot axes)
plt.figure(figsize=(10, 7))
colors = ['#ef4444', '#3b82f6', '#16a34a', '#eab308', '#a855f7']  # Example colors
labels = ['Cluster 1', 'Cluster 2', 'Cluster 3', 'Cluster 4', 'Cluster 5']

for i in range(optimal_k):
    plt.scatter(X_scaled[y_hc_labels == i, 0], X_scaled[y_hc_labels == i, 1],
                s=100, c=colors[i], label=labels[i], alpha=0.7)

plt.title('Clusters of Customers (Hierarchical Clustering)')
plt.xlabel('Annual Income (Scaled)')
plt.ylabel('Spending Score (Scaled)')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
# plt.show()  # Uncomment to display plot

print("\nCluster visualization plot generated.")
```
Interview Question
Question 1: What is the main difference between K-Means clustering and Agglomerative Hierarchical Clustering?
The main difference lies in how clusters are formed and the need to specify K:
- K-Means: Requires you to specify the number of clusters (K) *beforehand*. It partitions data into exactly K clusters by iteratively assigning points to the nearest centroid and updating centroids.
- Agglomerative Hierarchical Clustering: Does *not* require K upfront. It builds a hierarchy by starting with individual points and merging the closest clusters step-by-step until only one remains. You choose K *afterwards* by looking at the resulting dendrogram.
Question 2: What is a dendrogram, and how is it used in hierarchical clustering?
A dendrogram is a tree diagram that visualizes the output of hierarchical clustering. It shows the sequence in which clusters were merged (in agglomerative clustering). The vertical axis typically represents the distance or dissimilarity at which merges occurred. It's used to understand the cluster structure and, importantly, to help decide on the optimal number of clusters by identifying a suitable height at which to "cut" the tree.
Interview Question
Question 3: Explain the concept of "Linkage Criterion" in hierarchical clustering and name two common methods.
The Linkage Criterion defines how the distance between two *clusters* (each potentially containing multiple points) is calculated, which determines which pair of clusters gets merged next. Common methods include:
1. Ward's Linkage: Merges the pair of clusters that results in the minimum increase in the total within-cluster variance. Often produces compact, spherical clusters.
2. Complete Linkage (Maximum): Defines cluster distance as the distance between the *farthest* points in the two clusters.
3. Average Linkage: Defines cluster distance as the *average* distance between all pairs of points, one from each cluster.
4. Single Linkage (Minimum): Defines cluster distance as the distance between the *closest* points in the two clusters.
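These definitions are easy to verify numerically. A small sketch (two toy clusters, invented for illustration) that derives single, complete, and average linkage distances from the pairwise point distances:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two toy clusters of 2-D points
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise Euclidean distances between the two clusters (2 x 2 matrix)
D = cdist(cluster_a, cluster_b)

print("Single linkage   (min):", D.min())    # closest pair: 3.0
print("Complete linkage (max):", D.max())    # farthest pair: 6.0
print("Average linkage (mean):", D.mean())   # mean of all 4 pairs: 4.5
```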
Question 4: How do you typically determine the optimal number of clusters (K) after performing agglomerative hierarchical clustering?
The most common way is by examining the dendrogram. You look for the largest "jump" in distance (longest vertical line that doesn't cross a horizontal merge line). You then draw a horizontal line across the dendrogram at a height that cuts this longest vertical segment. The number of vertical lines intersected by your horizontal cut suggests the optimal number of clusters (K).
Interview Question
Question 5: What is a major disadvantage of hierarchical clustering compared to K-Means, especially for large datasets?
A major disadvantage is its computational complexity and memory requirement. Standard agglomerative clustering requires calculating and storing a distance matrix between all pairs of points, which can be very slow (typically O(n²) to O(n³) time) and memory-intensive (O(n²)) for large datasets of n samples, making it less scalable than K-Means (which is often closer to O(n·k·t), where k is the number of clusters and t the number of iterations).
Question 6: Is feature scaling important for hierarchical clustering? Why or why not?
Yes, it generally is important, especially if the algorithm uses distance metrics like Euclidean distance (which most common linkage methods rely on). If features are on different scales, the feature with the larger range will dominate the distance calculations, potentially leading to poor or misleading cluster structures. Scaling ensures all features contribute more equally based on their variance.