Case Study

U.S. Traffic, Pollution & Accidents — Data Analysis

2025
  • Python
  • pandas
  • scikit-learn
  • GeoPandas
  • Data Cleaning
  • Regression
  • Classification

Merged three nationwide datasets (a 33M-row congestion table, a 46-column accidents table, and a 22-column pollution table) into a unified state-day panel, then explored how congestion relates to accidents, seasonality, and emissions using statistical models and geospatial plots.

Problem & Motivation:

Understand how congestion, weather, and pollution interact: for example, which conditions create high accident risk, how congestion varies by month and state, and whether high-traffic areas contribute more to emissions.

Data & Approach:

  • Cleaned the accidents, pollution, and congestion datasets, grouped each to (date, state) while aggregating severity, delays, emissions, and accident counts, then merged them on those keys.
  • Built exploratory plots: scatterplots, accident curves by weather condition, month-level trends, and state-level geospatial maps for congestion and pollutants.
  • Trained models: linear regression for accidents, logistic regression for congestion buckets, and DecisionTree/RandomForest/GradientBoosting for pollutant prediction.
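
The grouping-and-merging step above can be sketched roughly as follows. This is a minimal illustration on toy data, not the project's actual pipeline: the column names (`date`, `state`, `severity`, `delay_min`, `no2`, `so2`), the function name `build_state_day_panel`, and the choice of mean aggregates with a left join from congestion are all assumptions made for the example.

```python
import pandas as pd

KEYS = ["date", "state"]  # assumed key columns for the state-day panel


def build_state_day_panel(accidents, congestion, pollution):
    """Aggregate each table to (date, state), then join on those keys."""
    acc = (
        accidents.groupby(KEYS)
        .agg(
            accident_count=("severity", "count"),  # non-null severity rows
            mean_severity=("severity", "mean"),
        )
        .reset_index()
    )
    con = congestion.groupby(KEYS).agg(mean_delay=("delay_min", "mean")).reset_index()
    pol = (
        pollution.groupby(KEYS)
        .agg(mean_no2=("no2", "mean"), mean_so2=("so2", "mean"))
        .reset_index()
    )
    # Left joins keep every state-day that has congestion data.
    panel = con.merge(acc, on=KEYS, how="left").merge(pol, on=KEYS, how="left")
    # A state-day with no accident rows means zero recorded accidents.
    panel["accident_count"] = panel["accident_count"].fillna(0).astype(int)
    return panel


# Toy inputs (hypothetical values, for illustration only).
accidents = pd.DataFrame(
    {"date": ["2020-01-01", "2020-01-01", "2020-01-02"],
     "state": ["CA", "CA", "TX"],
     "severity": [2, 4, 3]}
)
congestion = pd.DataFrame(
    {"date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"],
     "state": ["CA", "CA", "TX", "CA"],
     "delay_min": [10.0, 20.0, 5.0, 8.0]}
)
pollution = pd.DataFrame(
    {"date": ["2020-01-01", "2020-01-02"],
     "state": ["CA", "TX"],
     "no2": [30.0, 12.0],
     "so2": [2.0, 1.0]}
)

panel = build_state_day_panel(accidents, congestion, pollution)
print(panel)
```

Joining from the congestion table is one possible design choice: it keeps state-days without accidents in the panel (as zero counts) rather than dropping them, while pollutant columns stay missing where no pollution reading exists.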

Results:

  • Accidents peak at medium congestion under clear weather; delay metrics are right-skewed and weakly correlated.
  • Congestion is stable across states except in low-population regions (e.g., ND, ME), and it peaks in late spring/summer.
  • SO₂ and NO₂ show the strongest (but still weak) pollution–congestion alignment; GradientBoosting achieved the lowest RMSE for all pollutants.
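
An RMSE comparison across the three tree-based regressors could look roughly like the sketch below. It uses synthetic data and default-ish hyperparameters, so it only illustrates the comparison procedure; the features, target, train/test split, and settings are assumptions, and the ranking it prints says nothing about the project's actual results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for (congestion features -> pollutant level).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.3, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "DecisionTree": DecisionTreeRegressor(random_state=0),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=0),
    "GradientBoosting": GradientBoostingRegressor(random_state=0),
}

rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # RMSE is the square root of the mean squared error on held-out data.
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, pred)))

best = min(rmse, key=rmse.get)
for name, value in sorted(rmse.items(), key=lambda kv: kv[1]):
    print(f"{name}: RMSE = {value:.3f}")
print("Lowest RMSE:", best)
```

In a real run this loop would be repeated once per pollutant, with the same held-out split used for every model so the RMSE values are comparable.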

Limitations:

State-level aggregation hides local patterns; weather bucketing simplifies rich categories; correlations can’t confirm causation.