Current voice AI systems rely on either cascaded models (STT → LLM → TTS) or end-to-end speech-to-speech models, each with trade-offs in latency, control, and expressiveness. What do you see as the biggest technical bottleneck in moving toward a truly real-time, low-latency, emotionally adaptive speech model? Is it model architecture, dataset limitations, compute constraints, or something else entirely?
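
For context on the cascaded side of the question, here is a minimal sketch of where turn latency stacks up when the three stages run sequentially. The stage functions (`speech_to_text`, `llm_respond`, `text_to_speech`) are hypothetical stubs, and the per-stage delays are illustrative placeholders rather than measured benchmarks:

```python
# Illustrative sketch only: the stage functions are hypothetical stubs that
# simulate per-stage delays; they do not call any real STT/LLM/TTS API.
import time
from dataclasses import dataclass


@dataclass
class StageResult:
    output: object
    latency_ms: float


def timed(stage_fn, arg):
    """Run one pipeline stage and record its wall-clock latency."""
    start = time.perf_counter()
    output = stage_fn(arg)
    return StageResult(output, (time.perf_counter() - start) * 1000)


def speech_to_text(audio: bytes) -> str:
    time.sleep(0.15)  # placeholder: time for streaming STT to finalize a turn
    return "transcribed user turn"


def llm_respond(text: str) -> str:
    time.sleep(0.40)  # placeholder: time for the LLM to produce a reply
    return f"response to: {text}"


def text_to_speech(text: str) -> bytes:
    time.sleep(0.20)  # placeholder: time to the first synthesized audio chunk
    return b"\x00" * 16000


def cascaded_turn(audio: bytes) -> None:
    """One conversational turn through the cascaded STT -> LLM -> TTS pipeline."""
    stt = timed(speech_to_text, audio)
    llm = timed(llm_respond, stt.output)
    tts = timed(text_to_speech, llm.output)
    total = stt.latency_ms + llm.latency_ms + tts.latency_ms
    print(f"STT {stt.latency_ms:.0f} ms | LLM {llm.latency_ms:.0f} ms | "
          f"TTS {tts.latency_ms:.0f} ms | total {total:.0f} ms")


if __name__ == "__main__":
    cascaded_turn(b"\x00" * 32000)  # fake audio buffer standing in for user speech
```

The point of the sketch is that the stages serialize: the user-perceived response time is roughly the sum of the three stage latencies, which is part of what motivates streaming between stages or end-to-end speech-to-speech models.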