Nikhil Gupta is the cofounder and CTO of VAPI. He’s been in the trenches building and scaling one of the biggest voice platforms in the world.
On this episode, he explains how VAPI aims to create a voice-default future where we talk to all our computers—and goes deep into every step of VAPI’s voice pipelines and the technical challenges along the way.
Some highlights:
☎️ Massive scale: VAPI has processed 44 million voice calls on their platform, growing from a COVID-era one-click Zoom meeting button to a full voice infrastructure company used by thousands of developers.
⚡️ Latency matters: Voice AI needs to respond within 500 milliseconds to feel natural to humans. That means cleaning audio, detecting when users are done speaking, transcribing text, generating responses, converting to speech, and handling interruptions—all within a fraction of a second.
🗣️ Voice-first future: Nikhil is betting on a future where voice becomes our default interface for all computing systems.
If you’ve ever wondered how voice AI actually works—this is the episode for you.
Chapters
00:00 - Introducing VAPI
04:42 - Pivoting through COVID
05:42 - ChatGPT existential crisis
08:33 - Technical challenges of voice
12:42 - Anatomy of a voice call
14:46 - Knowing when someone is done speaking
18:37 - Routing to the fastest model
22:07 - Knowledge and context injection
26:47 - The text-to-speech bottleneck
31:14 - Handling interruptions gracefully
33:43 - The 500-millisecond barrier
36:56 - The DNS latency discovery
39:25 - Scaling the team and what's next
Links
Stratechery by Ben Thompson - Recommended reading from Nikhil