Case Study
K-Means from Scratch (Wikipedia, TF-IDF)
2025
- Python
- NumPy
- Clustering
- Machine Learning
Implemented k-means from scratch in NumPy and applied it to TF-IDF vectors of ~5.9k Wikipedia biographies to study clustering behavior under different initialization schemes and values of K.
Problem & Motivation:
Cluster unlabeled Wikipedia biography articles and analyze how initialization, the number of clusters K, and restarts across multiple seeds affect cluster quality and stability.
Data & Approach:
- Vectorized the core k-means steps (assign_clusters, revise_centroids, heterogeneity, main loop) over sparse TF-IDF features with L2 normalization; the first sketch below shows the idea.
- Compared plain random initialization against k-means++ and ran multiple seeds per setting, using heterogeneity to compare the local optima each run converged to (second sketch below).
- Swept K over several values (e.g., 2, 10, 25), plotting K vs. heterogeneity and applying an elbow-style heuristic to judge the tradeoff between fit and fragmentation (third sketch below).
- Interpreted clusters by inspecting each centroid's nearest documents and top TF-IDF terms to reveal topics (politicians, athletes, artists, academics, etc.), as in the final sketch below.
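The core loop, as a minimal dense-NumPy sketch: only the function names come from the write-up, the exact signatures are assumptions, and the project's actual code ran on sparse TF-IDF matrices rather than dense arrays.

```python
import numpy as np

def assign_clusters(data, centroids):
    # Squared Euclidean distance from every row to every centroid via the
    # expansion ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, fully vectorized.
    dists = (
        (data ** 2).sum(axis=1)[:, None]
        - 2 * data @ centroids.T
        + (centroids ** 2).sum(axis=1)[None, :]
    )
    return dists.argmin(axis=1)

def revise_centroids(data, k, assignments, centroids):
    # New centroid = mean of its assigned points; keep the old centroid if a
    # cluster went empty (possible under a poor initialization).
    new_centroids = centroids.copy()
    for i in range(k):
        members = data[assignments == i]
        if len(members):
            new_centroids[i] = members.mean(axis=0)
    return new_centroids

def heterogeneity(data, centroids, assignments):
    # Sum of squared distances of each point to its assigned centroid. With
    # L2-normalized rows, distance to a fixed centroid falls exactly as
    # cosine similarity rises, which is why Euclidean k-means works here.
    diffs = data - centroids[assignments]
    return float((diffs ** 2).sum())

def kmeans(data, k, initial_centroids, max_iter=100):
    centroids = initial_centroids.copy()
    assignments = None
    for _ in range(max_iter):
        new_assignments = assign_clusters(data, centroids)
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break  # converged: no point changed cluster
        assignments = new_assignments
        centroids = revise_centroids(data, k, assignments, centroids)
    return centroids, assignments
```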
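Second, k-means++ seeding plus multi-seed restarts, reusing kmeans and heterogeneity from the sketch above; this is the standard technique, not the project's exact code.

```python
import numpy as np

def kmeanspp_init(data, k, rng):
    # First centroid uniformly at random; each subsequent one is a data point
    # drawn with probability proportional to its squared distance from the
    # nearest centroid chosen so far.
    n = len(data)
    centroids = np.empty((k, data.shape[1]))
    centroids[0] = data[rng.integers(n)]
    d2 = ((data - centroids[0]) ** 2).sum(axis=1)
    for i in range(1, k):
        centroids[i] = data[rng.choice(n, p=d2 / d2.sum())]
        d2 = np.minimum(d2, ((data - centroids[i]) ** 2).sum(axis=1))
    return centroids

def best_of_restarts(data, k, seeds, smart_init=True):
    # Run k-means once per seed and keep the run with lowest heterogeneity.
    best = None
    for seed in seeds:
        rng = np.random.default_rng(seed)
        if smart_init:
            start = kmeanspp_init(data, k, rng)
        else:  # plain random init: k distinct data points as centroids
            start = data[rng.choice(len(data), size=k, replace=False)]
        centroids, assignments = kmeans(data, k, start)
        h = heterogeneity(data, centroids, assignments)
        if best is None or h < best[0]:
            best = (h, centroids, assignments)
    return best
```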
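Third, the K sweep reduces to a small loop over best_of_restarts. A sketch assuming data is the L2-normalized TF-IDF matrix from above; the grid of K values and seed count are illustrative.

```python
import matplotlib.pyplot as plt

k_values = [2, 10, 25, 50, 100]  # illustrative grid around the values in the write-up
hs = [best_of_restarts(data, k, seeds=range(5))[0] for k in k_values]

# Heterogeneity always falls as K grows; the "elbow" where the drop flattens
# is the heuristic sweet spot between fit and over-fragmentation.
plt.plot(k_values, hs, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("heterogeneity")
plt.show()
```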
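Finally, cluster interpretation as a sketch: centroids comes from a best_of_restarts run, while vocabulary (term strings indexed like the TF-IDF columns) and titles (article names indexed like the rows) are hypothetical names for data the project would hold alongside the matrix.

```python
import numpy as np

def top_terms(centroid, vocabulary, n=8):
    # The heaviest TF-IDF weights in a centroid are its characteristic terms.
    return [vocabulary[i] for i in np.argsort(centroid)[::-1][:n]]

def nearest_docs(data, centroid, titles, n=5):
    # With L2-normalized rows, the smallest Euclidean distances to the
    # centroid pick out the most representative biographies.
    d2 = ((data - centroid) ** 2).sum(axis=1)
    return [titles[i] for i in np.argsort(d2)[:n]]

for j, centroid in enumerate(centroids):
    print(f"cluster {j}: {top_terms(centroid, vocabulary)}")
    print(f"  closest bios: {nearest_docs(data, centroid, titles)}")
```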
Results:
- k-means++ initialization combined with multi-seed restarts consistently achieved lower heterogeneity than naive random initialization, indicating convergence to better local optima.
- Larger K values broke broad clusters (e.g., generic ‘researchers’ or ‘politicians’) into more coherent subgroups like footballers, golfers, orchestra conductors, and visual artists.
- Cluster inspections showed reasonably pure topic groups for moderate K, with nearest-neighbor bios and top words matching intuitive themes.
Limitations:
Standard k-means assumes roughly spherical clusters in Euclidean space and requires K to be chosen up front; results are also tied to the TF-IDF bag-of-words representation, with no semantic embeddings.