Case Study
LLM Uncertainty Quantification
2025
- Python
- pandas
- Machine Learning
- Git/GitHub
Tool that runs multiple LLMs on the same dataset and reports per-model confidence, calibration via expected calibration error (ECE), and an aggregated ensemble prediction. Includes CSV uploads and a simple React + Node/Python workflow.
Problem & Motivation:
Single-model LLM outputs can be unstable, and teams often need a clearer read on confidence before trusting model decisions.
Data & Approach:
- Built a UI for uploading CSV prompts and entering a Hugging Face token.
- Backend (Node + Python) runs multiple models on the uploaded dataset, collects their confidence scores, and computes ECE (see the ECE sketch after this list).
- Combined the model outputs into a simple confidence-weighted ensemble and exported results as JSON/CSV (a minimal ensemble sketch follows below).
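
A minimal sketch of the ECE calculation the backend reports, assuming the common equal-width binning formulation; the function name and the 10-bin default are illustrative, not the project's exact code.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin predictions by confidence, then take the
    weighted average gap between each bin's confidence and accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        avg_conf = confidences[in_bin].mean()   # mean confidence in the bin
        accuracy = correct[in_bin].mean()       # fraction of correct predictions
        ece += in_bin.mean() * abs(accuracy - avg_conf)
    return ece
```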
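
And a sketch of one way to do the confidence-weighted ensemble, assuming each model returns a label and a confidence score; the exact weighting scheme used by the tool may differ.

```python
def confidence_weighted_ensemble(predictions, confidences):
    """Aggregate per-model labels by treating each model's confidence
    as a vote weight; return the winning label and its normalized weight."""
    votes = {}
    for label, conf in zip(predictions, confidences):
        votes[label] = votes.get(label, 0.0) + conf
    winner = max(votes, key=votes.get)
    return winner, votes[winner] / sum(votes.values())

# Three hypothetical models labelling the same row:
label, weight = confidence_weighted_ensemble(
    ["positive", "positive", "negative"], [0.90, 0.60, 0.70]
)
# label == "positive", weight ≈ 0.68
```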
Results:
- Ensemble predictions were more stable than any single model's output.
- ECE scores made it easy to compare how well calibrated each model's confidence was.
- Models can be added or removed through the UI without code changes.
Limitations:
Runs are slow when many models are selected, and model outputs can still be correlated because the underlying models are trained on similar data.