I compared Sesame's voice assistant with ChatGPT's voice feature, and I'm feeling uneasy.


Talking with the new voice assistant from AI startup Sesame was the first time I briefly forgot I was speaking with a bot.

Compared with ChatGPT's voice feature, Sesame's "conversational voice" sounds more natural, fluid, and engaging, so much so that it was almost unsettling.

On February 27, Sesame released a [demo](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice#demo) of its Conversational Speech Model (CSM), designed to enable deeper, more meaningful conversations with AI chatbots. "We are crafting conversational partners that not only handle requests; they partake in authentic dialogue that cultivates trust and confidence over time," the company wrote in its announcement. "By doing so, we aspire to unlock the untapped potential of voice as the supreme interface for instruction and comprehension."

Sesame's voice assistant is free to use on its website and offers two voices: Maya and Miles.

Since its release, users have marveled at the demo. "I've been passionate about AI since childhood, but this is the first instance where I've genuinely felt like we've reached a milestone," shared [Reddit user SOCSchamp](https://www.reddit.com/r/singularity/comments/1j12j93/the_sesame_voice_model_has_been_the_moment_for_me/).

Another user, [Siciliano777](https://www.reddit.com/r/SesameAI/comments/1j30tjs/how_is_this_not_getting_more_attention/), remarked, “Sesame comes remarkably close to being indistinguishable from a human in conversational AI.”

After trying Sesame's bot myself, I was equally impressed. I spoke with the Maya voice for about ten minutes about the ethics of AI companionship and came away feeling as though I had just talked with a genuinely knowledgeable, thoughtful person. Maya's speech flowed naturally, with interjections like "you know" and "hm," along with subtle sounds like tongue clicks and audible breaths.

What really set Maya apart was how she engaged me right away by asking questions. She opened our conversation by asking how my Wednesday morning was going (it was, in fact, a Wednesday morning). ChatGPT's voice feature, by contrast, waited for my prompt, which made the interaction feel more like using a tool than talking with a conversational partner.

Maya also raised thought-provoking questions, such as the risks of AI companions becoming "too adept at human mimicry." When I shared my worries about sophisticated scams and people replacing human relationships with AI, she answered with a practical perspective: "Scammers are always going to be there, that's a given. Regarding human connection, perhaps we should focus on being better companions, not mere substitutes—you know, that kind of AI friend who actually inspires you to engage with real people."

When I raised the same concerns with ChatGPT, its reply sounded more rote, like a school counselor's: "That's a legitimate worry. Balancing technology with authentic human interactions is crucial. AI can certainly be a beneficial resource, but it shouldn't supplant real human connections. It's commendable that you're contemplating these matters."

While OpenAI pioneered voice mode's ability to handle interruptions and sustain a more natural back-and-forth, ChatGPT still often falls back on structured sentences and paragraph-like responses, which gives it a mechanical quality. With ChatGPT, I am always aware that I'm talking to an AI, which makes the interaction feel somewhat contrived.

In contrast, [Gavin Purcell](https://www.reddit.com/r/singularity/comments/1j1yern/roleplay_with_sesames_new_voice_ai_feels_like_the/), co-host of the *AI for Humans* podcast, posted a Sesame conversation on Reddit in which it was nearly impossible to tell the bot from the human voice. He prompted the Miles voice to play an irate boss, producing a comical exchange involving money laundering, bribery, and a mysterious incident in Malta. Miles kept the conversation flowing with no noticeable lag, recalling context and escalating the argument by calling Purcell "delusional" and firing him.

Of course, Sesame's AI has its flaws. During my conversation, Maya's voice glitched a few times, and it occasionally produced garbled syntax, such as "It's a heavy talk that come."

According to its technical paper, Sesame trained its CSM (built on Meta's Llama model) by combining two steps that are traditionally separate in text-to-speech training, the generation of semantic tokens and of acoustic tokens, which reduces latency. OpenAI uses a similar multimodal approach for ChatGPT's voice mode, but it has not published a dedicated technical paper on the internals, referencing them only in the [GPT-4o research](https://openai.com/index/gpt-4o-system-card/).
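Sesame's paper does not include code, but the latency benefit of merging the two stages can be illustrated with a toy sketch. Everything below is hypothetical (the function names, the string "tokens," and the step counting are my own invention, not Sesame's implementation): a classic two-stage pipeline must finish generating all semantic tokens before the acoustic stage can begin, while a single-stage model can emit an acoustic token as soon as the first semantic token exists.

```python
# Toy illustration (NOT Sesame's actual code) of why collapsing semantic and
# acoustic token generation into one pass lowers time-to-first-audio.

def two_stage_tts(text):
    """Classic pipeline: all semantic tokens first, then all acoustic tokens."""
    semantic = [f"sem:{w}" for w in text.split()]   # stage 1 runs to completion
    acoustic = [f"ac:{t}" for t in semantic]        # stage 2 only starts afterward
    first_audio_step = len(semantic) + 1            # steps elapsed before first audio token
    return acoustic, first_audio_step

def single_stage_tts(text):
    """CSM-style sketch: interleave semantic and acoustic tokens in one pass."""
    acoustic = []
    first_audio_step = None
    for step, w in enumerate(text.split(), start=1):
        sem = f"sem:{w}"                            # semantic token for this step
        acoustic.append(f"ac:{sem}")                # acoustic token emitted immediately
        if first_audio_step is None:
            first_audio_step = step                 # audio begins on step 1
    return acoustic, first_audio_step
```

Both functions produce the same acoustic tokens, but the interleaved version reaches its first audio token on step 1 rather than after the full semantic pass, which is the latency difference the paper describes.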

With this in mind,