The 500ms Dash—Nikhil Gupta, VAPI

A super-deep dive into ultra-low latency voice AI infrastructure, from the CTO of the voice platform powering over 44 million calls

Nikhil Gupta is the cofounder and CTO of VAPI. He’s been in the trenches building and scaling one of the biggest voice platforms in the world.

On this episode, he explains how VAPI aims to create a voice-default future where we talk to all our computers—and goes deep into every step of VAPI’s voice pipelines and the technical challenges along the way.

Some highlights:

☎️ Massive scale: VAPI has processed 44 million voice calls on its platform, growing from a COVID-era one-click Zoom meeting button into a full voice infrastructure company used by thousands of developers.

⚡️ Latency matters: Voice AI needs to respond within 500 milliseconds to feel natural to humans. That means cleaning the audio, detecting when the caller has finished speaking, transcribing speech to text, generating a response, converting it back to speech, and handling interruptions, all within a fraction of a second (see the latency-budget sketch after this list).

🗣️ Voice-first future: Nikhil is betting on a future where voice becomes our default interface with all computing systems.
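For a concrete sense of how tight that window is, here is a minimal Python sketch of a hypothetical per-stage latency budget. The stage names and millisecond figures are illustrative assumptions for this write-up, not VAPI's actual numbers.

```python
# Hypothetical latency budget for a single voice-agent turn.
# Stage names and timings are illustrative assumptions, not VAPI's real figures.

PIPELINE_BUDGET_MS = {
    "audio_cleanup": 30,        # denoise / echo-cancel the inbound audio
    "endpoint_detection": 100,  # decide that the caller has finished speaking
    "transcription": 100,       # speech-to-text on the finished utterance
    "llm_first_token": 150,     # time to the first token of the generated reply
    "text_to_speech": 100,      # time to the first audible audio back to the caller
}

TOTAL_BUDGET_MS = 500  # the "feels natural" target discussed in the episode


def check_budget(budget: dict[str, int], total_ms: int) -> None:
    """Print each stage and report whether the end-to-end target is met."""
    spent = 0
    for stage, ms in budget.items():
        spent += ms
        print(f"{stage:>20}: {ms:4d} ms  (cumulative {spent} ms)")
    status = "within" if spent <= total_ms else "over"
    print(f"Total: {spent} ms, {status} the {total_ms} ms target")


if __name__ == "__main__":
    check_budget(PIPELINE_BUDGET_MS, TOTAL_BUDGET_MS)
```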

If you’ve ever wondered how voice AI actually works, this is the episode for you.

Chapters

00:00 - Introducing VAPI

04:42 - Pivoting through COVID

05:42 - ChatGPT existential crisis

08:33 - Technical challenges of voice

12:42 - Anatomy of a voice call

14:46 - Knowing when someone is done speaking

18:37 - Routing to the fastest model

22:07 - Knowledge and context injection

26:47 - The text-to-speech bottleneck

31:14 - Handling interruptions gracefully

33:43 - The 500-millisecond barrier

36:56 - The DNS latency discovery

39:25 - Scaling the team and what's next
