Mira Murati's Thinking Machines Lab has released its first AI model, positioning it as a direct challenge to OpenAI's GPT-4o Realtime and Google's Gemini Live. The startup argues that existing voice AI systems fundamentally misunderstand how humans interact with language.
The core difference is architectural. Thinking Machines' model ingests audio, video, and text simultaneously, in 200-millisecond chunks, rather than waiting for complete utterances before responding. This incremental, multimodal processing creates what the team calls true interactivity. OpenAI's voice mode, despite its real-time marketing, still operates on a question-and-answer framework that forces users into discrete conversational turns. Murati's model abandons that constraint entirely.
The timing matters. Voice interfaces represent the next major shift in how users access AI. OpenAI's GPT-4o Realtime impressed early users with latency under one second, but Thinking Machines identifies a deeper problem. Even fast responses don't capture natural conversation. Humans interrupt, overlap speech, and respond to incomplete thoughts. They don't wait for questions to end before beginning answers.
Thinking Machines' approach mirrors how humans actually listen and think. By processing audio streams in 200-millisecond blocks, the system can generate output while the user still speaks. This eliminates the artificial pause between input and response that currently defines voice AI interactions.
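To make that loop concrete, here is a minimal sketch of incremental chunked processing. Everything in it is assumed for illustration: the `StubModel` class and its `ingest` and `poll_output` methods are invented for this sketch, not Thinking Machines' actual interface.

```python
from collections import deque

CHUNK_MS = 200  # block size the article attributes to the model

class StubModel:
    """Stand-in for the real model; this interface is invented for the sketch."""
    def __init__(self):
        self._ready = deque()

    def ingest(self, chunk):
        # A real model would update internal state on every block;
        # the stub just queues a toy reply once it "hears" a question mark.
        if chunk.endswith("?"):
            self._ready.append("<reply audio>")

    def poll_output(self):
        # Drain whatever output is ready *now*, even mid-utterance.
        while self._ready:
            yield self._ready.popleft()

def stream_conversation(model, mic_chunks, play):
    for chunk in mic_chunks:             # each chunk covers CHUNK_MS of speech
        model.ingest(chunk)              # state updates as audio arrives
        for out in model.poll_output():
            play(out)                    # reply can overlap the user's speech

# Toy run: the reply appears before the last chunk has even been heard.
stream_conversation(StubModel(),
                    ["how", "are", "you?", "today"],
                    play=lambda out: print("model:", out))
```

The point of the sketch is the ordering: output can be emitted after any 200-millisecond block, not only after an end-of-utterance signal.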
The model processes all three modalities simultaneously, not sequentially. This parallel architecture differs fundamentally from pipeline systems that transcribe audio to text, reason over that text, and then synthesize speech: integration happens inside a single model from the start. The distinction reads roughly like the sketch below.
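This is a structural sketch only; every function and class in it is invented to contrast the two designs, not drawn from any vendor's API.

```python
# Cascade design: each stage blocks on the previous one, so latency
# accumulates and non-textual cues are lost at the transcription step.
def transcribe(audio):  return "user said: " + audio        # stub ASR
def respond(text):      return "reply to [" + text + "]"    # stub LLM
def synthesize(text):   return "<speech: " + text + ">"     # stub TTS

def cascade_respond(audio):
    return synthesize(respond(transcribe(audio)))

# Fused design: one model consumes all modalities in a single step,
# so nothing waits on an intermediate text transcript.
class FusedModel:
    def step(self, audio=None, video=None, text=None):
        present = [name for name, value in
                   (("audio", audio), ("video", video), ("text", text))
                   if value is not None]
        return "<speech conditioned on " + "+".join(present) + ">"

print(cascade_respond("hello"))                      # three chained stages
print(FusedModel().step(audio="hello", video="f0"))  # one integrated pass
```

The cascade's weakness is not just additive latency: anything that cannot survive the trip through text, such as tone or the timing of an interruption, never reaches the reasoning stage at all.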
Competition in voice AI is intensifying. Google and OpenAI both control integrated ecosystems that give them distribution advantages. Thinking Machines enters as a pure model play, betting that interactivity quality alone drives adoption. The startup must convince developers and users to choose its model without the distribution channels its rivals already control.