
At Sesame, our goal is to achieve “voice presence”—the magical quality that makes spoken interactions feel real, understood, and valued.
Hi everyone!
Sharing Sesame's Conversational Speech Model (CSM), a big step beyond typical text-to-speech. The goal is to achieve what Sesame calls "voice presence": making spoken interactions feel real, understood, and valued.
A PH version of this model's System Card is :)
😃 Emotional Context: It tries to understand and respond to the emotion in the conversation.
⏱️ Conversational Dynamics: It aims for natural timing, pauses, and intonation.
🧠 Contextual Awareness: It adapts its tone and style to the situation.
👤 Consistent Personality: It maintains a coherent, reliable presence across a conversation.
👂 Multimodal: It understands both text and audio input.
🗣️ End-to-End: It generates speech directly, in a single stage, for greater efficiency.
🔓 Open Source: Models will be released under Apache 2.0 License.
They've built a custom evaluation suite to measure these conversational aspects, because traditional metrics (like Word Error Rate) don't really capture how natural the speech sounds.
The model itself is based on the Llama architecture, but with a clever split-transformer design.
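To make the split-transformer idea concrete, here is a toy sketch of the generation loop, under the assumption (from Sesame's technical report) that a large Llama-style backbone predicts the first RVQ audio codebook from the interleaved text/audio context, while a much smaller decoder fills in the remaining codebook levels for each frame. The numeric "models", codebook sizes, and function names below are deterministic stand-ins for illustration only, not the real networks:

```python
# Toy sketch of a split-transformer, single-stage speech generation loop.
# Assumptions (hypothetical values, stand-in logic): 8 RVQ codebook levels,
# 1024 entries per codebook, and hash-like functions in place of real models.

NUM_CODEBOOKS = 8      # hypothetical number of RVQ levels per audio frame
CODEBOOK_SIZE = 1024   # hypothetical codebook vocabulary size

def backbone_predict(context: list[int]) -> int:
    """Stand-in for the large Llama-style backbone: maps the full
    interleaved text/audio token context to a level-0 audio code."""
    return sum(context) % CODEBOOK_SIZE

def decoder_predict(level0: int, level: int) -> int:
    """Stand-in for the small audio decoder: predicts codebook `level`
    conditioned on the backbone's level-0 code for this frame."""
    return (level0 * 31 + level * 7) % CODEBOOK_SIZE

def generate_frame(context: list[int]) -> list[int]:
    """One audio frame = one code per RVQ level.
    The backbone handles level 0; the decoder fills in levels 1..N-1."""
    level0 = backbone_predict(context)
    return [level0] + [
        decoder_predict(level0, level) for level in range(1, NUM_CODEBOOKS)
    ]

def generate(text_tokens: list[int], num_frames: int) -> list[list[int]]:
    """Single-stage autoregressive loop: each generated audio frame is
    fed straight back into the context, so later frames condition on
    earlier audio with no separate vocoder pass in between."""
    context = list(text_tokens)
    frames = []
    for _ in range(num_frames):
        frame = generate_frame(context)
        frames.append(frame)
        context.extend(frame)  # audio tokens rejoin the shared context
    return frames
```

The point of the split is efficiency: only the cheap decoder runs once per codebook level, while the expensive backbone runs once per frame, which is what makes single-stage, low-latency conversational generation plausible.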
You can try a demo to experience the conversational voice (it's magical, believe me).
Hunting credits to @sentry_co 🙌
@sentry_co @zaczuo It's amazing that this will be released under Apache 2!
Incredible. That was genuinely the most realistic, engaging conversation I have had with AI so far. Pretty damn breathtaking.
This is a pretty stunning release. There were moments where I had to remember that I was speaking with AI, everything seemed so perfectly nuanced. Then there were moments of absolute derailment with intonation where I needed no reminding. Overall, super exciting evolution here. I can't wait to see what's next from Sesame.