Hi everyone,
One year ago, we launched on this platform with a vision to transform voice AI. Today, Vapi has grown to 100,000+ developers and recently raised our $20M Series A.
Voice AI has reached its tipping point. We're seeing hundreds of startups, agencies, and developers building innovative voice solutions for enterprises and SMBs on Vapi's platform.
AMA about building better voice products, promising use cases, voice models, and what's coming in 2025. I'll be around at 1PM PT to answer your questions!
@guilledhorno Awesome! Biggest thing we hear from customers is the need for better agent reliability. So we're investing a ton in Workflows, a way to design step-by-step conversation flows that stay on the rails :)
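For anyone curious what "staying on the rails" looks like mechanically, here's a rough Python sketch of the general idea. To be clear, this is a generic state-machine illustration, not Vapi's actual Workflows API; FLOW, run_step, and classify_intent are all made up for the example:

```python
# Hypothetical sketch of a step-by-step conversation flow (not Vapi's actual
# Workflows API): each step fixes what the agent says and which steps it may
# transition to, so the conversation can't wander off-script.

FLOW = {
    "greet":   {"say": "Hi! Are you calling to book or to cancel?", "next": ["book", "cancel"]},
    "book":    {"say": "What day works for you?",                   "next": ["confirm"]},
    "cancel":  {"say": "What's the name on the appointment?",       "next": ["confirm"]},
    "confirm": {"say": "Got it, you're all set. Anything else?",    "next": []},
}

def run_step(step_name, classify_intent):
    """Speak this step's prompt, then route only to an allowed next step."""
    step = FLOW[step_name]
    print(f"AGENT: {step['say']}")
    if not step["next"]:
        return None                          # terminal step: flow is done
    choice = classify_intent(step["next"])   # e.g. an LLM constrained to these labels
    return choice if choice in step["next"] else step_name  # off-script? re-ask
```

The reliability win is in that last line: the model only ever picks from an explicit set of next steps instead of free-running the whole conversation.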
Why is Siri so bad and what's involved in building a better alternative? What's hard about that problem?
@rajiv_ayyangar Thanks for the question! The current Siri stack is pre-generative.
To handle two-way, multi-turn interaction, you need to hold context across multiple turns, which means more complex and indeterminate inputs. Generative models are needed to handle that added complexity.
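To make the context-holding point concrete, here's a minimal sketch assuming an OpenAI-style chat messages format; call_llm is a placeholder for any chat-completion call, not a real SDK function:

```python
# Minimal sketch of multi-turn context: the full history is replayed to the
# generative model on every turn, so references like "it" or "that one"
# resolve against earlier turns. `call_llm` is a placeholder, not a real API.

history = [{"role": "system", "content": "You are a voice assistant."}]

def handle_turn(user_utterance, call_llm):
    history.append({"role": "user", "content": user_utterance})
    reply = call_llm(history)        # the model sees every prior turn
    history.append({"role": "assistant", "content": reply})
    return reply

# A pre-generative stack handles each utterance in isolation: with no history
# to condition on, "turn that one off too" has nothing to bind to.
```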
We'd expect Apple to come out with something in the next year, but there's a lot of risk that comes with non-deterministic models, so they're taking their time.
@rajiv_ayyangar Oh man. Siri. I sometimes have to use the "Bene Gesserit" voice from Dune to get Siri to turn out the lights. 😂
I've followed you and @nikhilro since you built @Superpowered , which I really admired for its craft and attention to UX. What prompted the pivot to AI voice infra?
@rajiv_ayyangar Haha the old days. To be honest, we burnt out. It was 3-4 years on Superpowered.
We grew it to profitability and it was a sustainable business, but we weren't growing fast. In general it's hard to build a unicorn as a B2C productivity tool; you have to go to enterprise. We could have either:
Picked a vertical and gone deep with note-taking (ex. healthcare scribes), then gone hard on enterprise.
Become another all-in-one team productivity platform, then gone hard on enterprise.
We didn't have the will to go down either of those paths, so we decided to pivot.
You've got lots of developers building on Vapi. What are a few of the products built on Vapi you're most excited about, and what do you think makes them special? (I'm guessing you work really closely with these teams so you get a privileged view of their iteration speed and style).
@rajiv_ayyangar We see everything from customer service to AI girlfriends haha. Generally I'm most excited by the ones that unlock voice for someone who couldn't access voice technology before.
Ex. we have customers serving tradespeople, helping them accept more inbound appointments after hours; others helping patients get their test results faster; etc.
Without these last-mile builders, this tech couldn't get out into the world.
Kinda niche UX question: As I'm using agents and apps in general, I find myself going to audio input / dictation more and more. I really wish there were more audio-in-text-out style interfaces, because it seems like the most efficient, highest-bandwidth interface for semantic info. Have you seen this anywhere?
For example, in ChatGPT advanced voice, it frustrates me to no end that I have to wait for a voice reply, rather than just reading the output.
@rajiv_ayyangar I do agree voice-in text-out will be the highest-bandwidth interface, and we are starting to see this with Apple Intelligence, but it's not quite real time yet.
I have seen some applications in drive-thru voice AI, like https://www.of.one/ where the user says what they want and the order form changes visually in real time. The order itself is a lot of context to hold in a person's memory (and it's annoying to have it all confirmed back to the user), so it makes sense here.
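To illustrate the voice-in, visual-out pattern, here's a purely hypothetical sketch (nothing to do with of.one's actual implementation; extract_items stands in for whatever NLU/LLM call parses the transcript):

```python
# Hypothetical voice-in, visual-out loop: each spoken utterance is folded into
# a running order dict, which is re-rendered on screen instead of being read
# back aloud. `extract_items` stands in for an NLU/LLM parse of the speech.

order = {}

def apply_utterance(utterance, extract_items):
    # extract_items returns (item, quantity_delta) pairs,
    # e.g. [("fries", 1)] for "add fries", [("fries", -1)] for "no fries"
    for item, delta in extract_items(utterance):
        order[item] = order.get(item, 0) + delta
        if order[item] <= 0:
            del order[item]
    render(order)

def render(current):
    print("--- Your order ---")
    for item, qty in current.items():
        print(f"{qty} x {item}")
```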
Other than that, I haven't seen enough innovation here. But I'd expect developments in Apple Intelligence to change this norm and drive the wave.
P.S. Wispr Flow!! https://wisprflow.ai/ Less of an interface, but a great dictation experience.
Do you have the same "hallucination" issues in AI voice as with text? Does that make it hard for large customers to adopt AI voice for deterministic use cases? If so, what 'breakthrough' is needed to overcome this?
Should we be concerned about "voice-likeness" and how models are storing our voice data for training? I'm thinking about voice actors and how they can secure (or enhance) their livelihood as models evolve and become more indistinguishable from real humans.
When using an agent with Twilio, the quality of the results often gets worse. How can we improve this? For example, in Hungarian the performance degrades significantly.
I was playing around with 11labs voice models yesterday, and as with OpenAI's voice models, I find myself having to record multiple takes of different text segments, since every take is slightly different and some takes have clear "voice flukes." To get a clean output I have to cherry-pick which iterations work best, then stitch them together in one session. I mainly use this for product presentation videos, so having a clean session without the "AI artifact" telltales is key, so people don't notice it's AI. I have to do the same with OpenAI's voice models; they also have artifacts in the output, though I actually prefer OpenAI's voice models over 11labs, at least for product videos. My question is: is there going to be a "cherry picker algorithm" out soon? It's kinda laborious to make nice clean outputs. How are you folks at Vapi thinking about this?