Case Study

Twitter Topic Modeling (NMF)

2025
  • Python
  • NMF/Topic Modeling
  • Machine Learning

Modeled ~119k COVID-era tweets from April 30, 2020 with TF-IDF + NMF to discover latent topics, inspect the top words per topic, and analyze tweet–topic weights and outliers.

Problem & Motivation:

Discover unsupervised themes in short COVID-related tweets and understand how strongly each tweet loads onto those topics, including detection of unusual outlier groups.

Data & Approach:

  • Used scikit-learn TF-IDF (max_df=0.95) over pre-cleaned tweets (English only, lowercased, with URLs, punctuation, stopwords, and common COVID tokens removed).
  • Fit NMF with k=5 (init='nndsvd') to get tweet-topic loadings and word-topic weights; wrote helpers to rank top words per topic.
  • Assigned each tweet to its dominant topic via argmax over its row of the tweet-topic (projection) matrix, then counted topic popularity across the corpus.
  • Refit NMF with k=3 for 3D visualization of tweets in topic space and flagged outliers with high Topic-2 weights, inspecting their raw text via .unique().

Results:

  • 5-topic model produced interpretable themes (e.g., case/death counts, health/support, US/China/politics, apps to slow spread), with one topic clearly dominating tweet assignments.
  • 3-topic model merged related themes but preserved a similar structure and revealed a tight outlier cluster dominated by app/self-reporting/symptom-tracking style tweets.
  • Outlier analysis showed those tweets share rare but highly weighted terms, explaining why NMF isolates them as a distinct, high-Topic-2 group.
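
One way the outlier inspection could look is sketched below. The loadings in `W3`, the example texts, and the fixed `threshold` are all made up for illustration. The write-up does not state the flagging rule, and treating "Topic 2" as zero-based column 2 is also an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-topic loadings (rows = tweets, columns = Topic 0..2).
W3 = np.array([
    [0.40, 0.05, 0.00],
    [0.35, 0.10, 0.01],
    [0.02, 0.50, 0.00],
    [0.00, 0.03, 0.90],
    [0.01, 0.00, 0.85],
    [0.30, 0.20, 0.02],
])

# Hypothetical raw tweet text aligned row-for-row with W3.
raw_text = pd.Series([
    "cases up again today",
    "deaths reported in county",
    "support our health workers",
    "Report your symptoms in the app",
    "Report your symptoms in the app",
    "more cases reported",
])

# Assumed rule: a tweet is an outlier if its Topic-2 weight exceeds a fixed cutoff.
threshold = 0.5
outlier_mask = W3[:, 2] > threshold

# .unique() collapses repeated (e.g. retweeted or templated) text before reading it.
outlier_texts = raw_text[outlier_mask].unique()
print(outlier_texts)
```

Deduplicating before reading matters here because a tight cluster often turns out to be many copies of the same templated message.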

Limitations:

The analysis uses bag-of-words TF-IDF over a single day of tweets. Topics and outliers are sensitive to k, initialization, and the short, sparse text of tweets, so themes may drift across reruns or different time windows.
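
A rough way to probe this sensitivity is to refit NMF under several random initializations and compare the top-term sets between runs. The sketch below was not part of the original analysis: the toy `docs`, the helpers `top_term_sets` and `best_match_jaccard`, and the switch to `init="random"` (so that the seed changes the starting point) are all assumptions.

```python
from itertools import combinations

import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical); the real study used ~119k tweets.
docs = [
    "new cases deaths reported today",
    "deaths cases rising county reported",
    "support health workers stay safe",
    "health workers need support masks",
    "download app report symptoms",
    "app track symptoms daily",
]
X = TfidfVectorizer(max_df=0.95).fit_transform(docs)


def top_term_sets(X, k, seed, n=3):
    """Fit NMF from a random start and return each topic's top-n term indices."""
    model = NMF(n_components=k, init="random", random_state=seed, max_iter=1000)
    model.fit(X)
    return [frozenset(np.argsort(row)[::-1][:n].tolist()) for row in model.components_]


def best_match_jaccard(topics_a, topics_b):
    """Mean, over topics in run A, of the best Jaccard overlap with any topic in run B."""
    scores = [max(len(s & t) / len(s | t) for t in topics_b) for s in topics_a]
    return float(np.mean(scores))


# Compare every pair of runs; values near 1.0 suggest stable topics.
runs = [top_term_sets(X, k=3, seed=s) for s in range(3)]
stability = [best_match_jaccard(a, b) for a, b in combinations(runs, 2)]
print(stability)
```

The same comparison could be repeated across values of k or across different days of tweets to see which themes persist.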