Case Study

K-Means from Scratch (Wikipedia, TF-IDF)

2025
  • Python
  • NumPy
  • Clustering
  • Machine Learning

Implemented k-means from scratch in NumPy and applied it to TF-IDF vectors of ~5.9k Wikipedia biographies to study clustering behavior under different initializations and values of K.

Problem & Motivation:

Cluster unlabeled Wikipedia biography articles and analyze how initialization, number of clusters K, and multiple runs affect cluster quality and stability.

Data & Approach:

  • Vectorized the core k-means steps (assign_clusters, revise_centroids, heterogeneity, main loop) over sparse TF-IDF features with L2 normalization.
  • Compared vanilla random initialization vs k-means++ and ran multiple seeds per setting, using heterogeneity to quantify local minima.
  • Swept over different K values (e.g., 2, 10, 25), plotting K vs heterogeneity and using an elbow-style heuristic to judge tradeoffs.
  • Interpreted clusters by inspecting nearest documents and top TF-IDF terms per centroid to reveal topics (politicians, athletes, artists, academics, etc.).
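The core steps named above can be sketched in plain NumPy. This is a minimal dense-array version for illustration (the project ran on sparse TF-IDF matrices; the synthetic data and exact signatures here are assumptions, though the function names match the writeup):

```python
import numpy as np

def assign_clusters(data, centroids):
    # Squared Euclidean distance from every point to every centroid,
    # then pick the nearest centroid's index per point.
    dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def revise_centroids(data, k, assignments):
    # New centroid = mean of the points currently assigned to that cluster.
    centroids = np.zeros((k, data.shape[1]))
    for i in range(k):
        members = data[assignments == i]
        # Guard against empty clusters by re-seeding from a data point.
        centroids[i] = members.mean(axis=0) if len(members) else data[i % len(data)]
    return centroids

def heterogeneity(data, centroids, assignments):
    # Objective: sum of squared distances of each point to its centroid.
    return float(((data - centroids[assignments]) ** 2).sum())

def kmeans(data, k, initial_centroids, max_iter=100):
    centroids = initial_centroids.copy()
    assignments = None
    for _ in range(max_iter):
        new_assignments = assign_clusters(data, centroids)
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break  # converged: no point changed cluster
        assignments = new_assignments
        centroids = revise_centroids(data, k, assignments)
    return centroids, assignments

# Synthetic stand-in for the TF-IDF rows, L2-normalized as in the project.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5))
data /= np.linalg.norm(data, axis=1, keepdims=True)
init = data[rng.choice(len(data), 3, replace=False)]
centroids, labels = kmeans(data, 3, init)
```

Each Lloyd iteration (assign, then revise) can only lower the heterogeneity objective, which is why tracking it across iterations is a cheap convergence check.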
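The k-means++ side of the initialization comparison can be sketched as a standalone seeding routine (the helper name `smart_initialize` and the demo data are assumptions; the sampling rule is standard k-means++):

```python
import numpy as np

def smart_initialize(data, k, seed=None):
    # k-means++ seeding: first centroid uniform at random; each later
    # centroid is a data point sampled with probability proportional to
    # its squared distance from the nearest centroid chosen so far.
    rng = np.random.default_rng(seed)
    centroids = np.empty((k, data.shape[1]))
    centroids[0] = data[rng.integers(len(data))]
    # Squared distance of every point to the nearest chosen centroid.
    d2 = ((data - centroids[0]) ** 2).sum(axis=1)
    for i in range(1, k):
        probs = d2 / d2.sum()
        centroids[i] = data[rng.choice(len(data), p=probs)]
        d2 = np.minimum(d2, ((data - centroids[i]) ** 2).sum(axis=1))
    return centroids
```

Running the main loop once per seed with this initializer and keeping the run with the lowest final heterogeneity is the multi-seed restart strategy described above.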
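The K sweep amounts to running the algorithm at several K values and watching where the heterogeneity curve flattens. A compact self-contained sketch (the inline `lloyd` helper and the three-blob synthetic data are assumptions for illustration):

```python
import numpy as np

def lloyd(data, k, rng, n_iter=50):
    # Compact Lloyd iteration used only for this sweep sketch;
    # returns the final heterogeneity (sum of squared distances).
    centroids = data[rng.choice(len(data), k, replace=False)]
    for _ in range(n_iter):
        d2 = ((data[:, None] - centroids[None]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for i in range(k):
            if (labels == i).any():
                centroids[i] = data[labels == i].mean(axis=0)
    d2 = ((data[:, None] - centroids[None]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    return float(((data - centroids[labels]) ** 2).sum())

rng = np.random.default_rng(0)
# Synthetic stand-in: three well-separated blobs, so the elbow sits near k=3.
data = np.concatenate([rng.normal(loc=c, scale=0.1, size=(60, 4))
                       for c in (0.0, 2.0, 4.0)])
for k in (2, 3, 5, 10):
    print(k, round(lloyd(data, k, rng), 2))
```

Heterogeneity always decreases as K grows, so the elbow heuristic looks for the K past which the marginal drop becomes small rather than for a minimum.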
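The centroid-inspection step reduces to ranking each centroid's TF-IDF weights and mapping the top dimensions back to words. A small sketch (the vocabulary and centroid values here are hypothetical, not the real Wikipedia data):

```python
import numpy as np

def top_terms(centroid, vocabulary, n=3):
    # The highest-weight TF-IDF dimensions of a centroid point to the
    # words that characterize its cluster.
    order = np.argsort(centroid)[::-1][:n]
    return [vocabulary[i] for i in order]

# Tiny illustrative vocabulary and centroids (hypothetical values).
vocab = ["election", "senate", "goal", "league", "season", "party"]
centroids = np.array([
    [0.40, 0.35, 0.01, 0.02, 0.03, 0.30],   # politics-flavored cluster
    [0.02, 0.01, 0.38, 0.33, 0.29, 0.01],   # sports-flavored cluster
])
print(top_terms(centroids[0], vocab))  # → ['election', 'senate', 'party']
print(top_terms(centroids[1], vocab))  # → ['goal', 'league', 'season']
```

The same distance computation used for assignment also yields each centroid's nearest documents, which is how the representative biographies per cluster were pulled.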

Results:

  • k-means++ plus multi-seed restarts consistently achieved lower heterogeneity than naive random init, indicating better local optima.
  • Larger K values broke broad clusters (e.g., generic ‘researchers’ or ‘politicians’) into more coherent subgroups like footballers, golfers, orchestra conductors, and visual artists.
  • Cluster inspections showed reasonably pure topic groups for moderate K, with nearest-neighbor bios and top words matching intuitive themes.

Limitations:

Uses standard k-means, so it assumes roughly spherical clusters in Euclidean space and requires K to be chosen up front; results are also tied to a TF-IDF bag-of-words representation (no semantic embeddings).