Case Study
U.S. Traffic, Pollution & Accidents — Data Analysis
2025
- Python
- pandas
- scikit-learn
- GeoPandas
- Data Cleaning
- Regression
- Classification
Merged 3 nationwide datasets (33M congestion rows, 46-col accidents, 22-col pollution) into a unified state-day panel; explored how congestion relates to accidents, seasonality, and emissions using statistical models & geospatial plots.
Problem & Motivation:
Understand how congestion, weather, and pollution interact: for example, which conditions create high accident risk, how congestion varies by month and state, and whether high-traffic areas contribute more to emissions.
Data & Approach:
- Cleaned & merged accidents, pollution, and congestion datasets by grouping to (date, state) and aggregating severity, delays, emissions, and accident counts.
- Ran exploratory data analysis (EDA): scatterplots, accident curves by weather condition, month-level trends, and state-level geospatial maps of congestion and pollutants.
- Trained models: linear regression for accidents, logistic regression for congestion buckets, and DecisionTree/RandomForest/GradientBoosting for pollutant prediction.
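The (date, state) aggregation and merge described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the project's actual pipeline: the column names (`severity`, `delay_min`, `no2_mean`) and the choice of left joins onto the congestion table are assumptions, since the source does not give the real schemas or join strategy.

```python
import pandas as pd

# Toy stand-ins for the three datasets. All column names here are
# hypothetical; the real schemas are not given in the case study.
accidents = pd.DataFrame({
    "date": ["2021-06-01", "2021-06-01", "2021-06-02"],
    "state": ["CA", "CA", "TX"],
    "severity": [2, 3, 4],
})
congestion = pd.DataFrame({
    "date": ["2021-06-01", "2021-06-01", "2021-06-02"],
    "state": ["CA", "CA", "TX"],
    "delay_min": [5.0, 15.0, 8.0],
})
pollution = pd.DataFrame({
    "date": ["2021-06-01", "2021-06-02"],
    "state": ["CA", "TX"],
    "no2_mean": [12.5, 9.0],
})

keys = ["date", "state"]

# Collapse each source to one row per (date, state).
acc_daily = accidents.groupby(keys, as_index=False).agg(
    accident_count=("severity", "size"),
    mean_severity=("severity", "mean"),
)
cong_daily = congestion.groupby(keys, as_index=False).agg(
    mean_delay_min=("delay_min", "mean"),
)
poll_daily = pollution.groupby(keys, as_index=False).agg(
    no2_mean=("no2_mean", "mean"),
)

# Assumed join strategy: keep every congestion state-day and attach
# accident and pollution aggregates where they exist.
panel = (
    cong_daily
    .merge(acc_daily, on=keys, how="left")
    .merge(poll_daily, on=keys, how="left")
)

# A state-day with no matching accident rows means zero accidents.
panel["accident_count"] = panel["accident_count"].fillna(0).astype(int)

print(panel)
```

Aggregating each source before merging keeps the join one-to-one on (date, state), which avoids row duplication when the raw tables have many records per state-day.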
Results:
- Accidents peak at medium congestion under clear weather; delay metrics are right-skewed and weakly correlated.
- Congestion is stable across states except in low-population regions (e.g., ND, ME); it peaks in late spring/summer.
- SO₂ and NO₂ show the strongest (but still weak) pollution–congestion alignment; GradientBoosting achieved the lowest RMSE for all pollutants.
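An RMSE comparison across the three tree-based regressors might look like the sketch below. It uses synthetic data and arbitrary hyperparameters, both of which are assumptions for illustration only, so it does not reproduce the figures or the ranking reported above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic features and target standing in for congestion features and a
# pollutant level; the real feature set is not specified in the case study.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.3, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hyperparameters are illustrative, not the ones used in the project.
models = {
    "DecisionTree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    "GradientBoosting": GradientBoostingRegressor(n_estimators=50, random_state=0),
}

rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # RMSE is the square root of the mean squared error on held-out data.
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, pred)))

# Print models from lowest to highest held-out RMSE.
for name, score in sorted(rmse.items(), key=lambda kv: kv[1]):
    print(f"{name}: RMSE = {score:.3f}")
```

In a setup like this, the same loop would be repeated once per pollutant target, with all models scored on the same held-out split so the RMSE values are comparable.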
Limitations:
State-level aggregation hides local patterns; weather bucketing simplifies rich categories; correlations can’t confirm causation.