Case Study
Twitter Topic Modeling (NMF)
2025
- Python
- NMF/Topic Modeling
- Machine Learning
Modeled ~119k COVID-era tweets from April 30, 2020 with TF-IDF + NMF to discover latent topics, inspect the top words per topic, and analyze tweet-topic weights and outliers.
Problem & Motivation:
Discover latent themes in short COVID-related tweets without labels, measure how strongly each tweet loads onto each topic, and detect unusual outlier groups.
Data & Approach:
- Used scikit-learn TF-IDF (max_df=0.95) over pre-cleaned tweets (English only, lowercased, with URLs, punctuation, stopwords, and common COVID tokens removed).
- Fit NMF with k=5 (init='nndsvd') to get tweet-topic loadings and word-topic weights; wrote helpers to rank top words per topic.
- Assigned each tweet to its dominant topic via a row-wise argmax on the tweet-topic projection matrix, then counted topic popularity across the corpus.
- Refit NMF with k=3 for 3D visualization of tweets in topic space and flagged outliers with high Topic-2 weights, inspecting their raw text via .unique().
Results:
- 5-topic model produced interpretable themes (e.g., case/death counts, health/support, US/China/politics, apps to slow spread), with one topic clearly dominating tweet assignments.
- 3-topic model merged related themes but preserved a similar structure and revealed a tight outlier cluster dominated by app/self-reporting/symptom-tracking style tweets.
- Outlier analysis showed those tweets share rare but highly weighted terms, explaining why NMF isolates them as a distinct, high-Topic-2 group.
Limitations:
The model uses bag-of-words TF-IDF over a single day of tweets. Topics and outliers are sensitive to k, initialization, and short, sparse tweet text, so themes may drift across reruns or different time windows.