Twitter Topic Modeling (NMF) • Varun Panuganti

Problem & Motivation:

Discover themes in short COVID-related tweets without supervision and measure how strongly each tweet loads onto each topic, including detecting unusual outlier groups.

Data & Approach:

Vectorized pre-cleaned tweets with scikit-learn TF-IDF (max_df=0.95); cleaning kept English tweets only, lowercased them, and removed URLs, punctuation, stopwords, and common COVID tokens.
Fit NMF with k=5 (init='nndsvd') to get tweet-topic loadings and word-topic weights; wrote helpers to rank top words per topic.
Assigned each tweet to its dominant topic via argmax over its row of the tweet-topic loading (projection) matrix, then counted topic popularity across the corpus.
Refit NMF with k=3 to plot tweets in 3D topic space, flagged outliers with high Topic-2 weights, and inspected their distinct raw texts via .unique().

Results:

5-topic model produced interpretable themes (e.g., case/death counts, health/support, US/China/politics, apps to slow spread), with one topic clearly dominating tweet assignments.
3-topic model merged related themes but preserved a similar structure and revealed a tight outlier cluster dominated by app/self-reporting/symptom-tracking style tweets.
Outlier analysis showed those tweets share rare but highly weighted terms, explaining why NMF isolates them as a distinct, high-Topic-2 group.
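
One way the outlier flagging could look is sketched below. The write-up does not state the cutoff that was used, so the quantile rule, the flag_outliers helper, the synthetic loadings, and the reading of "Topic-2" as zero-based column index 2 are all assumptions made for illustration.

```python
import numpy as np


def flag_outliers(W, texts, topic, quantile=0.9):
    """Return the distinct texts whose loading on `topic` exceeds a quantile cutoff."""
    weights = W[:, topic]
    cutoff = np.quantile(weights, quantile)
    flagged = [text for text, w in zip(texts, weights) if w > cutoff]
    # Order-preserving de-duplication, similar in spirit to pandas .unique().
    return list(dict.fromkeys(flagged))


# Synthetic tweet-topic loadings (hypothetical values, not the project's output).
W = np.array([
    [0.30, 0.10, 0.02],
    [0.25, 0.12, 0.01],
    [0.05, 0.40, 0.03],
    [0.02, 0.35, 0.00],
    [0.01, 0.02, 0.90],
    [0.01, 0.02, 0.90],
    [0.20, 0.20, 0.04],
    [0.10, 0.30, 0.02],
    [0.15, 0.15, 0.01],
    [0.22, 0.08, 0.03],
])
texts = [
    "cases deaths count",
    "new cases reported",
    "support health workers",
    "mental health help",
    "download app report symptoms daily",
    "download app report symptoms daily",
    "cases and support",
    "health advice",
    "daily update",
    "deaths rise",
]

outliers = flag_outliers(W, texts, topic=2, quantile=0.8)
print(outliers)
```

In this toy input the two flagged rows carry the same text, so de-duplication collapses them into a single entry, which is the kind of repeated, templated tweet a tight outlier cluster tends to contain.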

Limitations:

The analysis uses bag-of-words TF-IDF over a single day of tweets. Topics and outliers are sensitive to k, initialization, and the short, sparse tweet text, so themes may drift across reruns or different time windows.
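
Sensitivity to initialization can be probed with a simple check like the one below, which was not part of the original analysis. It is a sketch under assumptions: the topic_stability helper, the toy corpus, the seeds, and the choice of cosine similarity with Hungarian matching are all illustrative. It fits NMF twice with random initialization, pairs up topics across the two runs, and reports their mean similarity.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# Toy stand-in corpus (hypothetical); real tweets would be used in practice.
tweets = [
    "cases deaths rise daily count",
    "deaths cases reported new count",
    "health workers support thanks",
    "support mental health help",
    "app track symptoms report daily",
    "symptoms app self report slow spread",
]


def topic_stability(X, k, seed_a, seed_b):
    """Mean cosine similarity between best-matched topics of two NMF runs."""
    topic_rows = []
    for seed in (seed_a, seed_b):
        model = NMF(n_components=k, init="random", random_state=seed, max_iter=1000)
        model.fit(X)
        # Scale each topic's word-weight row to unit length.
        topic_rows.append(normalize(model.components_))
    similarity = topic_rows[0] @ topic_rows[1].T
    # Hungarian matching: pair topics so that total similarity is maximal.
    rows, cols = linear_sum_assignment(-similarity)
    return float(similarity[rows, cols].mean())


X = TfidfVectorizer(max_df=0.95).fit_transform(tweets)
score = topic_stability(X, k=3, seed_a=0, seed_b=1)
print(round(score, 3))
```

A score near 1 suggests the two runs recovered nearly the same topics, while lower values indicate that themes shift with the seed. The same comparison could be run between models fit on different days of tweets, provided both use a shared vocabulary.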