Case Study
LLM Uncertainty Quantification
2025
- Python
- pandas
- Machine Learning
- Git/GitHub
Tool that runs multiple LLMs on the same dataset and reports per-model confidence, calibration via expected calibration error (ECE), and an aggregated ensemble prediction. Includes CSV uploads and a simple React + Node/Python workflow.
Problem & Motivation:
Single-model LLM outputs can be unstable, and teams often need a clearer read on confidence before trusting model decisions.
Data & Approach:
- Built a UI for uploading CSV prompts and entering a Hugging Face token.
- Backend (Node + Python) runs multiple models on the uploaded dataset, collects their confidence scores, and computes ECE (see the ECE sketch after this list).
- Combined the model outputs into a simple confidence-weighted ensemble and exported results as JSON/CSV (a minimal ensemble sketch follows below).
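
A minimal sketch of the ECE calculation the backend reports, assuming the common equal-width binning formulation; the function name and the 10-bin default are illustrative, not the project's exact code.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin predictions by confidence, then take the
    weighted average gap between each bin's confidence and accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        avg_conf = confidences[in_bin].mean()   # mean confidence in the bin
        accuracy = correct[in_bin].mean()       # fraction of correct predictions
        ece += in_bin.mean() * abs(accuracy - avg_conf)
    return ece
```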
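
And a sketch of one way to do the confidence-weighted ensemble, assuming each model returns a label and a confidence score; the exact weighting scheme used by the tool may differ.

```python
def confidence_weighted_ensemble(predictions, confidences):
    """Aggregate per-model labels by treating each model's confidence
    as a vote weight; return the winning label and its normalized weight."""
    votes = {}
    for label, conf in zip(predictions, confidences):
        votes[label] = votes.get(label, 0.0) + conf
    winner = max(votes, key=votes.get)
    return winner, votes[winner] / sum(votes.values())

# Three hypothetical models labelling the same row:
label, weight = confidence_weighted_ensemble(
    ["positive", "positive", "negative"], [0.90, 0.60, 0.70]
)
# label == "positive", weight ≈ 0.68
```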
Results:
- Ensemble predictions were more stable than any single model's output.
- ECE scores made it easy to compare how well calibrated each model's confidence was.
- Models can be added or removed through the UI without code changes.
Limitations:
Runs are slow when many models are selected, and model outputs can still be correlated because the underlying models are trained on similar data.