Sesame’s CTO reveals how they’re building real-time voice AI that talks like humans

The latest episode of Andreessen Horowitz’s AI + a16z features Sesame CTO Ankit Kumar discussing the technical foundations of the company’s voice technology with a16z partner Anjney Midha. The conversation offers a rare look at the engineering behind real-time conversational AI and explores how voice interfaces might change human-computer interaction as the technology moves from research labs into everyday applications.

The big picture: Sesame’s voice technology represents a significant advancement in AI-powered conversational interfaces, with the company taking the unusual step of open-sourcing key components of their underlying models.

  • Kumar and Midha explore the technical challenges involved in creating voice AI that can maintain natural conversation flow while balancing personality expression with computational efficiency.
  • The discussion highlights how multimodal AI systems must integrate speech recognition, natural language processing, and speech synthesis in real-time to create convincing voice interactions.

Key technical challenges: Developing real-time voice AI requires overcoming several complex engineering hurdles that balance performance with computational constraints.

  • Full-duplex conversation modeling, which lets the AI listen and speak at the same time, as humans do, is a particularly difficult problem that Sesame has addressed in their technology.
  • The team has implemented specific computational optimizations to achieve the low-latency interactions necessary for natural-feeling conversations without requiring excessive processing power.
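
To make the full-duplex idea concrete, here is a minimal toy sketch in Python. It is an illustration only and not Sesame’s implementation: the `Frame` type, the `user_speaking` flag, and the `run_full_duplex` function are hypothetical names invented for this example. The agent checks incoming audio on every frame, even while it is talking, so it can stop as soon as the user interrupts.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    # Hypothetical voice-activity flag for one short slice of incoming audio.
    user_speaking: bool


def run_full_duplex(frames, reply_frames):
    """Return the agent's state for each frame: 'listen', 'speak', or 'yield'."""
    states = []
    remaining = 0        # frames of the current reply still to be played
    heard_user = False   # True once the user has said something not yet answered
    for frame in frames:
        if frame.user_speaking:
            # Input is processed on every frame, even mid-reply.
            # This is the full-duplex part of the toy model.
            if remaining > 0:
                remaining = 0
                states.append("yield")   # user interrupted: stop talking at once
            else:
                states.append("listen")
            heard_user = True
        elif remaining > 0:
            remaining -= 1
            states.append("speak")
        elif heard_user:
            # The user went quiet, so start a reply of reply_frames frames.
            heard_user = False
            remaining = reply_frames - 1
            states.append("speak")
        else:
            states.append("listen")
    return states


# The user talks for two frames, pauses, then interrupts the reply at frame 5.
pattern = [True, True, False, False, False, True, False, False]
timeline = run_full_duplex([Frame(p) for p in pattern], reply_frames=4)
print(timeline)
```

A half-duplex system would ignore the microphone while speaking and finish its reply regardless. Real systems work on raw audio and learned models rather than a simple on/off flag, which is where the difficulty described above comes from.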

Why open-sourcing matters: Sesame’s decision to release key components of their model architecture reflects a strategic approach to advancing voice AI technology within the broader ecosystem.

  • Open-sourcing creates opportunities for community contributions while potentially accelerating adoption of their underlying technical approach.
  • The move suggests Sesame believes their competitive advantage lies in implementation and product experience rather than solely in proprietary model architecture.

In plain English: Sesame is building AI that can talk with people naturally in real-time, and they’re sharing some of their technical blueprints with the broader developer community rather than keeping everything proprietary.

Technical deep dives: The conversation explores advanced concepts in speech AI that explain how modern voice interfaces are evolving beyond simple command-response patterns.

  • Kumar breaks down how multimodal AI systems must integrate different types of intelligence – processing audio input, understanding language context, and generating natural-sounding speech – all while maintaining conversation flow.
  • The discussion addresses scaling laws in speech synthesis, examining how larger models affect voice quality and expressiveness compared to more optimized smaller models.

Where voice interfaces are heading: The conversation positions natural language as potentially the most intuitive user interface, capable of redefining how humans interact with technology.

  • Voice AI’s evolution toward more contextual understanding and human-like conversational abilities could make technology more accessible to people regardless of technical literacy.
  • The discussion suggests voice interfaces may eventually become the primary way people interact with digital systems, supplementing or replacing screen-based interfaces in many contexts.
Building the Next Generation of Conversational AI
