LLM Inference
Ongoing research — 2026

BOLT: Budget-Optimal LLM Inference via Quantization, Adaptive Exits, and Test-Time Verification
The Pareto Frontier of Open Inference
Open-weight LLM deployment typically forces a three-way trade-off between memory, latency, and accuracy. While quantization and early-exit methods are usually studied in isolation, BOLT treats them as a single joint optimization problem: by co-tuning INT4 precision with adaptive layer-skipping, we recover the accuracy lost to the "reasoning tax" of compression through lightweight test-time verification.
The 3-Knob Stack: Efficiency without Collapse
Our research investigates the interplay between three distinct inference "knobs" across Qwen2.5-7B and 14B architectures:
- Adaptive Exits: A confidence-based controller (logit margin/entropy) that terminates computation once a hidden state stabilizes.
- Test-Time Verification: A budget-constrained reranking pipeline using self-consistency and LLM-as-judge to validate INT4 outputs.
- KV Compression: Dynamic token-dropping policies that preserve global attention mass for long-context stability.
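The exit rule behind the first knob can be sketched as a small pure-Python controller. The function name `should_exit`, the threshold values, and the calling convention are illustrative assumptions for exposition, not BOLT's actual implementation:

```python
import math

def should_exit(logits, margin_threshold=4.0, entropy_threshold=1.0):
    """Hypothetical exit rule: halt computation once the top-2 logit
    margin is wide AND the softmax entropy is low, i.e. the prediction
    has stabilized. Thresholds are illustrative, not tuned values."""
    top1, top2 = sorted(logits, reverse=True)[:2]
    margin = top1 - top2
    # Numerically stable softmax entropy.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    entropy = -sum((e / z) * math.log(e / z) for e in exps)
    return margin >= margin_threshold and entropy <= entropy_threshold
```

In a real decoder, such a check would run on the intermediate logits produced by an early-exit head at each candidate layer.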
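The cheapest component of the verification knob, self-consistency, reduces to a budgeted majority vote. This sketch assumes answers have already been extracted as strings; `self_consistency_vote` is a hypothetical helper, and the LLM-as-judge stage is omitted:

```python
from collections import Counter

def self_consistency_vote(answers, budget):
    """Majority vote over at most `budget` sampled answers. Returns the
    modal answer and its agreement rate, which a caller can threshold
    to decide whether an INT4 output needs further (judge) review."""
    used = answers[:budget]  # respect the verification budget
    answer, votes = Counter(used).most_common(1)[0]
    return answer, votes / len(used)
```

A low agreement rate would then trigger the more expensive LLM-as-judge pass rather than accepting the quantized model's output outright.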
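The KV-compression knob can be illustrated with a toy mass-preserving selection rule. Real policies operate per head over rolling attention statistics; `keep_by_attention_mass` and its threshold are purely a hypothetical sketch:

```python
def keep_by_attention_mass(attn_scores, mass=0.85):
    """Keep the smallest set of KV positions whose cumulative share of
    total attention reaches `mass`; the remaining tokens are dropped."""
    total = sum(attn_scores)
    ranked = sorted(range(len(attn_scores)),
                    key=lambda i: attn_scores[i], reverse=True)
    kept, acc = [], 0.0
    for i in ranked:
        kept.append(i)
        acc += attn_scores[i] / total
        if acc >= mass:
            break
    return sorted(kept)  # preserve positional order of surviving tokens
```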
Methodology & COLM-Next Benchmarking
Designed for reproducibility on a single-GPU (A100) budget, the project builds a multi-domain failure taxonomy across math (GSM8K), code (HumanEval), and long-context QA. We map the accuracy-vs-effective-compute curve, identifying the regimes where verification compensates for quantization noise and those where early-exit heuristics break complex reasoning chains.
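Mapping the accuracy-vs-effective-compute curve reduces to a standard Pareto filter over evaluated configurations. This sketch assumes each configuration is summarized as a `(compute_cost, accuracy)` pair in arbitrary units and is not the project's actual analysis code:

```python
def pareto_frontier(points):
    """Return the non-dominated (compute_cost, accuracy) pairs: a point
    survives if no cheaper-or-equal configuration achieves accuracy at
    least as high. Sorting by (cost, -accuracy) handles cost ties."""
    frontier = []
    best_acc = float("-inf")
    for cost, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if acc > best_acc:  # strictly improves on all cheaper configs
            frontier.append((cost, acc))
            best_acc = acc
    return frontier
```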
// Execution Plan:
Prototype: Colab (bitsandbytes NF4)
Production: HPC Job Arrays (A100)
Metrics: Pareto Frontiers + Calibration + Failure Breakdown
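The Colab prototype stage presumably corresponds to a standard bitsandbytes NF4 load through Transformers. The following is a minimal config sketch under that assumption; the checkpoint id and dtype choices are illustrative, not the project's actual setup:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 double-quantized 4-bit load; settings are illustrative defaults.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",  # assumed checkpoint id
    quantization_config=bnb_config,
    device_map="auto",
)
```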