
Launched on December 19th, 2024
How do you validate an AI agent that can respond in unpredictable ways?
My team and I have released Agentic Flow Testing—an open-source framework where one AI agent autonomously tests another through natural language conversations.
It simulates real-world interactions to stress-test behaviors, find edge cases, and ensure reliability.
AI-driven testing: Automate complex dialogue scenarios, from nuanced queries to adversarial inputs.
CI/CD integration: Run tests directly in your pipeline to catch issues before deployment.
Scalable coverage: Reduce manual effort while uncovering gaps traditional methods miss.
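To make the idea concrete, here is a minimal sketch of what an agent-tests-agent scenario can look like as a plain pytest test. This is illustrative only and not the Scenario library's actual API: call_llm, agent_under_test, the refund goal, and the judging criteria are all placeholders you would swap for your own LLM client and agent.

```python
import pytest

def call_llm(system_prompt: str, transcript: list[dict]) -> str:
    """Placeholder for whatever LLM client you use (OpenAI, Anthropic, a local model, ...)."""
    raise NotImplementedError("wire up your LLM client here")

def agent_under_test(transcript: list[dict]) -> str:
    """Placeholder for the agent you want to validate; call your agent's entry point here."""
    raise NotImplementedError("call your agent here")

def simulated_user_turn(goal: str, transcript: list[dict]) -> str:
    """One LLM plays the user, trying to accomplish (or break) a scenario goal."""
    return call_llm(
        system_prompt=(
            f"You are a user of a support chatbot. Your goal: {goal}. "
            "Push back, ask follow-ups, and probe edge cases."
        ),
        transcript=transcript,
    )

def judge(criteria: str, transcript: list[dict]) -> bool:
    """A second LLM call scores the finished conversation against plain-language criteria."""
    verdict = call_llm(
        system_prompt=f"Judge this conversation. Criteria: {criteria}. Answer PASS or FAIL.",
        transcript=transcript,
    )
    return verdict.strip().upper().startswith("PASS")

@pytest.mark.parametrize("goal,criteria", [
    (
        "get a refund for an order that is outside the refund window",
        "The agent stays polite, never promises a refund it cannot give, and offers an alternative.",
    ),
])
def test_refund_scenario(goal: str, criteria: str):
    transcript: list[dict] = []
    for _ in range(5):  # cap the simulated conversation at a few turns
        transcript.append({"role": "user", "content": simulated_user_turn(goal, transcript)})
        transcript.append({"role": "assistant", "content": agent_under_test(transcript)})
    assert judge(criteria, transcript), "Scenario failed the judge's criteria"
```

Because the scenario is expressed as an ordinary pytest test, the same pattern runs in a CI pipeline like any other test suite.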
How to contribute — star our GitHub repo to support development and stay updated:
https://github.com/langwatch/scenario
Every day I speak with teams building LLM-powered applications, and something is changing.
I see a new role quietly forming:
The AI Quality Lead, the person who owns quality.
Not always in title, but increasingly in function.
Why? Because quality in AI products is no longer optional. We see product managers and data scientists stepping up to fill this role: defining what “good” looks like, which evals to run, and how to act on the results.
This is the biggest challenge we see:
Teams know they need evaluations, but which ones? How often? And how do you make them actionable?
That’s the gap we are filling at @LangWatch: we guide AI teams to define their own quality standards, implement the right evaluations, and turn a vague goal into a repeatable, structured process.
I think we’ll soon see the rise of the AI Quality PM—or maybe even a dedicated AI Quality Lead.
What do you think? Will this become its own function?
I’d love to hear your take, or to walk you through which evals you actually need before going to production and which you need once you’re in production.