Case Study
K-Means from Scratch (Wikipedia, TF-IDF)
2025
- Python
- NumPy
- Clustering
- Machine Learning
Implemented k-means from scratch in NumPy and applied it to TF-IDF vectors of ~5.9k Wikipedia biographies to study clustering behavior under different initialization schemes and values of K.
Problem & Motivation:
Cluster unlabeled Wikipedia biography articles and analyze how initialization, the number of clusters K, and restarts across multiple seeds affect cluster quality and stability.
Data & Approach:
- Vectorized the core k-means steps (assign_clusters, revise_centroids, heterogeneity, main loop) over sparse TF-IDF features with L2 normalization; the first sketch below shows the idea.
- Compared plain random initialization against k-means++ and ran multiple seeds per setting, using heterogeneity to compare the local optima each run converged to (second sketch below).
- Swept K over several values (e.g., 2, 10, 25), plotting K vs. heterogeneity and applying an elbow-style heuristic to judge the tradeoff between fit and fragmentation (third sketch below).
- Interpreted clusters by inspecting each centroid's nearest documents and top TF-IDF terms to reveal topics (politicians, athletes, artists, academics, etc.), as in the final sketch below.
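The core loop, as a minimal dense-NumPy sketch: only the function names come from the write-up, the exact signatures are assumptions, and the project's actual code ran on sparse TF-IDF matrices rather than dense arrays.

```python
import numpy as np

def assign_clusters(data, centroids):
    # Squared Euclidean distance from every row to every centroid via the
    # expansion ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, fully vectorized.
    dists = (
        (data ** 2).sum(axis=1)[:, None]
        - 2 * data @ centroids.T
        + (centroids ** 2).sum(axis=1)[None, :]
    )
    return dists.argmin(axis=1)

def revise_centroids(data, k, assignments, centroids):
    # New centroid = mean of its assigned points; keep the old centroid if a
    # cluster went empty (possible under a poor initialization).
    new_centroids = centroids.copy()
    for i in range(k):
        members = data[assignments == i]
        if len(members):
            new_centroids[i] = members.mean(axis=0)
    return new_centroids

def heterogeneity(data, centroids, assignments):
    # Sum of squared distances of each point to its assigned centroid. With
    # L2-normalized rows, distance to a fixed centroid falls exactly as
    # cosine similarity rises, which is why Euclidean k-means works here.
    diffs = data - centroids[assignments]
    return float((diffs ** 2).sum())

def kmeans(data, k, initial_centroids, max_iter=100):
    centroids = initial_centroids.copy()
    assignments = None
    for _ in range(max_iter):
        new_assignments = assign_clusters(data, centroids)
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break  # converged: no point changed cluster
        assignments = new_assignments
        centroids = revise_centroids(data, k, assignments, centroids)
    return centroids, assignments
```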
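Second, k-means++ seeding plus multi-seed restarts, reusing kmeans and heterogeneity from the sketch above; this is the standard technique, not the project's exact code.

```python
import numpy as np

def kmeanspp_init(data, k, rng):
    # First centroid uniformly at random; each subsequent one is a data point
    # drawn with probability proportional to its squared distance from the
    # nearest centroid chosen so far.
    n = len(data)
    centroids = np.empty((k, data.shape[1]))
    centroids[0] = data[rng.integers(n)]
    d2 = ((data - centroids[0]) ** 2).sum(axis=1)
    for i in range(1, k):
        centroids[i] = data[rng.choice(n, p=d2 / d2.sum())]
        d2 = np.minimum(d2, ((data - centroids[i]) ** 2).sum(axis=1))
    return centroids

def best_of_restarts(data, k, seeds, smart_init=True):
    # Run k-means once per seed and keep the run with lowest heterogeneity.
    best = None
    for seed in seeds:
        rng = np.random.default_rng(seed)
        if smart_init:
            start = kmeanspp_init(data, k, rng)
        else:  # plain random init: k distinct data points as centroids
            start = data[rng.choice(len(data), size=k, replace=False)]
        centroids, assignments = kmeans(data, k, start)
        h = heterogeneity(data, centroids, assignments)
        if best is None or h < best[0]:
            best = (h, centroids, assignments)
    return best
```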
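Third, the K sweep reduces to a small loop over best_of_restarts. A sketch assuming data is the L2-normalized TF-IDF matrix from above; the grid of K values and seed count are illustrative.

```python
import matplotlib.pyplot as plt

k_values = [2, 10, 25, 50, 100]  # illustrative grid around the values in the write-up
hs = [best_of_restarts(data, k, seeds=range(5))[0] for k in k_values]

# Heterogeneity always falls as K grows; the "elbow" where the drop flattens
# is the heuristic sweet spot between fit and over-fragmentation.
plt.plot(k_values, hs, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("heterogeneity")
plt.show()
```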
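Finally, cluster interpretation as a sketch: centroids comes from a best_of_restarts run, while vocabulary (term strings indexed like the TF-IDF columns) and titles (article names indexed like the rows) are hypothetical names for data the project would hold alongside the matrix.

```python
import numpy as np

def top_terms(centroid, vocabulary, n=8):
    # The heaviest TF-IDF weights in a centroid are its characteristic terms.
    return [vocabulary[i] for i in np.argsort(centroid)[::-1][:n]]

def nearest_docs(data, centroid, titles, n=5):
    # With L2-normalized rows, the smallest Euclidean distances to the
    # centroid pick out the most representative biographies.
    d2 = ((data - centroid) ** 2).sum(axis=1)
    return [titles[i] for i in np.argsort(d2)[:n]]

for j, centroid in enumerate(centroids):
    print(f"cluster {j}: {top_terms(centroid, vocabulary)}")
    print(f"  closest bios: {nearest_docs(data, centroid, titles)}")
```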
Results:
- k-means++ initialization combined with multi-seed restarts consistently achieved lower heterogeneity than naive random initialization, indicating convergence to better local optima.
- Larger K values broke broad clusters (e.g., generic ‘researchers’ or ‘politicians’) into more coherent subgroups like footballers, golfers, orchestra conductors, and visual artists.
- Cluster inspections showed reasonably pure topic groups for moderate K, with nearest-neighbor bios and top words matching intuitive themes.
Limitations:
Standard k-means assumes roughly spherical clusters in Euclidean space and requires K to be chosen up front; results are also tied to the TF-IDF bag-of-words representation, with no semantic embeddings.