2026-07-17, Chamber Hall A (S3A)
The hardships of building End-to-End Voice Assistants in the Wild
Robot Holmes is back in the mist-choked streets of MLington, but he isn’t working solo.
Meet Zintia, an intern from the Voice Assistant district. She’s helpful, hyper-efficient, and incredibly annoying, providing Holmes with data before he can lift a finger. But Zintia has a secret. The longer she’s on the case, the more of her "dark side" emerges. She’s not just hearing the truth; she’s deciding which parts Holmes is allowed to hear.
This is a story-driven, practical session for anyone tired of "Hello World" chatbots. We will move past the hype to look at what it actually takes to make End-to-End Voice Assistants work in the real world.
Our Investigation Includes:
- The Gear: How to use E2E speech models like gpt-realtime and integrate them into a production voice interface using FreeSWITCH and Pipecat.
- The Interrogation: Navigating the hardships of instruction-following and keeping the underlying LLMs on track through defined states and agentic flows.
- The Double-Cross: Identifying and mitigating "hidden agendas" - the hallucinations and safety guardrails that can make a voice assistant turn on its user.
Expect live demos, hard-won production lessons, a detective noir story and a blueprint for building voice agents that are fast, fluid, and (mostly) law-abiding.
Hey,
I'm Johannes, a Data Scientist who loves to tell educational stories about Machine Learning methods and AI. Preferably, I do this in Open Source communities.
I've been working with Computer Vision for more than 10 years, from designing my own Haar-Cascade face detection, through research on autonomous cars and helping people configure their photobooks automatically, all the way to understanding the needs of small and medium-sized enterprises in order to create tailored solutions for them.