2025-07-16, South Hall 2A
Deploying machine learning models for real-world tasks is expensive—especially for inference. Unlike training, inference isn’t a one-and-done deal; it’s a recurring cost that grows with every prediction you make. And naturally, if you want accurate results, you’re probably calling up the biggest, most powerful model in your arsenal. But here’s the problem: these large models are resource-hungry, and most inputs don’t even need their full power. In fact, research shows that small, efficient models can handle a good chunk of your tasks just fine.
So why bring a Ferrari to pick up groceries? Adaptive inference offers a smarter solution: instead of using one oversized model for everything, it dynamically selects which model to use based on task difficulty. For simple inputs, you call smaller, cheaper models. For harder tasks, you escalate to the big guns. The result? High accuracy without blowing your compute budget.
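To make the routing idea concrete, here is a minimal sketch of one common strategy: use the small model's own confidence to decide when to escalate. The function and model names, the threshold value, and the single-input assumption are all illustrative, not a specific system from the talk:

```python
import torch

def cascade_predict(x, small_model, large_model, threshold=0.9):
    """Confidence-threshold cascade: try the cheap model first and
    escalate to the large model only when it looks unsure.
    Assumes a single input (batch size 1) and models returning class logits."""
    probs = torch.softmax(small_model(x), dim=-1)
    confidence, prediction = probs.max(dim=-1)
    if confidence.item() >= threshold:
        return prediction                     # easy input: the cheap answer suffices
    return large_model(x).argmax(dim=-1)      # hard input: pay for the big model
```

Most of the difficulty hides in choosing the escalation signal and threshold, which is exactly the difficulty-estimation challenge covered below.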
This talk will cover:
- Why using one large model for all tasks is overkill (and expensive).
- How adaptive inference works and practical strategies for task routing.
- Challenges in estimating task difficulty and balancing latency with accuracy.
- Real-world examples of cost savings, from edge-to-cloud setups to large language model APIs.
To make this tangible, I’ll share how Agreement-Based Cascading (ABC) uses ensemble agreement for routing decisions. By letting models decide when they’re needed, ABC saves costs in edge-to-cloud deployments, reduces GPU rental and API bills, and outperforms state-of-the-art methods—all while staying intuitive and efficient.
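As a rough illustration of the agreement idea (a sketch of the concept, not the exact ABC implementation), routing can be as simple as checking whether a small ensemble votes the same way; the helper names and the unanimity requirement here are assumptions:

```python
from collections import Counter

def agreement_route(x, small_ensemble, large_model, min_agreement=1.0):
    """Escalate to the large model only when the cheap models disagree.
    Assumes each model returns class logits as a tensor for a single input."""
    votes = [m(x).argmax(dim=-1).item() for m in small_ensemble]
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_agreement:       # e.g. unanimous small models
        return label                              # consensus: no big model needed
    return large_model(x).argmax(dim=-1).item()   # disagreement: call the big model
```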
Whether you’re an ML engineer deploying models, a researcher curious about efficient inference, or just someone who loves learning how to save money while staying performant, this talk has something for you.
Intermediate
I am a second-year PhD student at Carnegie Mellon University, specializing in efficiency challenges in machine learning. My research focuses on optimizing ML inference through model routing, architecture design, and execution efficiency. I have developed and evaluated methods such as Agreement-Based Cascading (ABC), which routes tasks to smaller models when possible to reduce costs in edge-to-cloud and LLM API setups, and Bonsai, an inference-time pruning framework that produces smaller, faster models for constrained hardware.
I’ve presented my work at conferences, published research papers on ML efficiency, and contributed to open-source repositories on inference optimization.