Case Study
Sentiment Analysis (Amazon Reviews)
2025
- Python
- pandas
- scikit-learn
- Logistic Regression
Converted Amazon.com product reviews into word-count features and trained logistic regression models to predict whether each review is positive or negative.
Problem & Motivation:
Given food product reviews from Amazon, predict whether the sentiment of each review is positive (+1) or negative (-1) using the review text.
Data & Approach:
- Loaded the food_products.csv data into a pandas DataFrame and created a sentiment column (+1 or -1) derived from the rating values.
- Removed punctuation from the review text and used scikit-learn's CountVectorizer to build word-count features.
- Split the data into training, validation, and test sets with train_test_split.
- Trained a majority class classifier as a baseline and then fit a logistic regression sentiment model in scikit-learn.
- Computed validation accuracy with accuracy_score, built a confusion matrix, and inspected the most positive and most negative words using the model coefficients.
- Trained additional logistic regression models with different L2 regularization strengths and recorded their train and validation accuracies in a table.
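The steps above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code. The following are all assumptions made for the example: the column names review and rating, the rule that a rating of 4 or higher counts as positive, the toy rows standing in for food_products.csv, the split sizes, and the C values. scikit-learn's DummyClassifier is used here as one way to get a majority class baseline, and the vectorizer is fit on the training split only.

```python
import string

import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Toy rows standing in for food_products.csv (hypothetical columns and values).
positive = ["Great taste, best snack!", "Love it, great flavor.",
            "Best coffee, great price!", "Great and delicious."]
negative = ["Not good, bland taste.", "Bland and stale, not fresh.",
            "Terrible, bland flavor.", "Not worth it; bland."]
reviews = pd.DataFrame({
    "review": positive * 3 + negative * 2,
    "rating": [5] * 12 + [1] * 8,
})

# Assumed labeling rule: rating >= 4 is positive (+1), otherwise negative (-1).
reviews["sentiment"] = reviews["rating"].apply(lambda r: 1 if r >= 4 else -1)

# Strip punctuation before counting words.
strip_punct = str.maketrans("", "", string.punctuation)
reviews["review_clean"] = reviews["review"].apply(lambda t: t.translate(strip_punct))

# Two-stage split: 60% train, 20% validation, 20% test (assumed proportions).
train_val, test = train_test_split(
    reviews, test_size=0.2, random_state=0, stratify=reviews["sentiment"])
train, val = train_test_split(
    train_val, test_size=0.25, random_state=0, stratify=train_val["sentiment"])

# Build word-count features; the vocabulary is learned from the training split.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train["review_clean"])
X_val = vectorizer.transform(val["review_clean"])
y_train, y_val = train["sentiment"], val["sentiment"]

# Majority class baseline: always predicts the most frequent training label.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = accuracy_score(y_val, baseline.predict(X_val))

# Logistic regression sentiment model (scikit-learn's default penalty is L2).
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
val_pred = model.predict(X_val)
val_acc = accuracy_score(y_val, val_pred)
cm = confusion_matrix(y_val, val_pred, labels=[-1, 1])

# Sweep regularization strength. C is the inverse strength: smaller C means
# a stronger L2 penalty.
rows = []
for c in [0.01, 1.0, 100.0]:
    m = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
    rows.append({
        "C": c,
        "train_accuracy": accuracy_score(y_train, m.predict(X_train)),
        "validation_accuracy": accuracy_score(y_val, m.predict(X_val)),
    })
results = pd.DataFrame(rows)

print("baseline validation accuracy:", baseline_acc)
print("model validation accuracy:", val_acc)
print(cm)
print(results)
```

The test split is held out and not touched here; in a workflow like this it would be scored once, after a regularization strength has been chosen on the validation set.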
Results:
- The logistic regression sentiment model achieved higher validation accuracy than the majority class classifier.
- Words such as 'great' and 'best' had large positive coefficients, while words such as 'not' and 'bland' had large negative coefficients, matching how we expect people to write positive and negative reviews.
Limitations:
The project uses only word-count features and logistic regression; it does not include more advanced text processing or models beyond what the assignment required.