Hey Product Hunt 👋
We’ve been building Handit for the last 9 months — it started as a tool to help us debug AI agents in production… and quickly became much more.
Handit is the open-source engine for self-improving AI agents.
It evaluates every decision your agent makes, generates better prompts or model calls, A/B tests the fix, and helps you ship what actually works — automatically.
We built it because AI agents are powerful, but they break in real life.
They hallucinate, drift, and degrade silently. Unlike observability tools that just flag issues, Handit actually fixes them and shows you the full impact.
It’s:
✅ Open source
✅ Fully traceable (every input/output, decision, tool call)
✅ Stack-agnostic (LangChain, RAGs, custom pipelines — it works anywhere)
We’re already seeing teams use it to:
- Catch hallucinations before users complain
- Auto-tune prompts using real evaluations aligned to their business KPIs
- Monitor + improve agent workflows without extra infra or extra time from their engineering teams.
We’d love to know:
How are you managing your agents in production today?
What would you want to auto-optimize or evaluate better?
💬 Drop any questions, ideas, or feedback below — we’re reading everything.
🙏 And if Handit resonates, we’d love your support!
Thanks for being here. Excited to keep building with you.
— Jose & the Handit team
Hi @jramr7
Great approach to a real challenge in production AI. Making agents more reliable and helping them improve over time is key. Moving from just observing to actually fixing issues feels like a smart and needed step.
A couple of questions came to mind:
1. You mention it auto-generates prompts and datasets to improve performance. Could you explain how that process works and how it learns from the agent’s decisions?
2. Since it’s stack-agnostic (LangChain, RAGs, and so on), what’s the typical learning curve or effort to integrate it into an existing agent that’s already in production?
Wishing you the best with the launch. And, Happy hacking!
Handit.ai
@pastuxso Hey! Love the questions.
How does it auto-generate prompts and datasets?
We evaluate a subset of all the decisions your agents make using the actual production logs—looking at accuracy, cost, latency, or even business KPIs, depending on what you want to evaluate (you can configure your own evaluators).
When drift or underperformance is detected, Handit automatically proposes fixes: this could be a better prompt, a rerouted model, or a more relevant few-shot dataset. These suggestions are based on patterns observed in failing vs. successful outputs.
Then Handit A/B tests them to see what really improves performance by running the same evaluators on the same subset of production data. Once a fix wins, it's ready to ship with one click. You stay in control; the system handles the grunt work.
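If it helps to picture that loop, here's a minimal sketch in Python. Everything in it (the Evaluator shape, propose_fix, the win rule) is illustrative pseudocode for the idea above, not Handit's actual API:

```python
# Illustrative sketch of the evaluate -> propose fix -> A/B test loop.
# Names and types here are hypothetical, not Handit's real API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Evaluator:
    name: str
    threshold: float                           # below this, we consider it underperforming
    score: Callable[[list[dict], str], float]  # (production_logs, prompt) -> 0..1

def improvement_loop(
    logs: list[dict],                               # sampled production traces
    evaluators: list[Evaluator],
    current_prompt: str,
    propose_fix: Callable[[list[dict], str], str],  # e.g. an LLM that rewrites the prompt
) -> str:
    # 1. Evaluate the current prompt on real production data.
    baseline = {e.name: e.score(logs, current_prompt) for e in evaluators}

    # 2. Only act when some metric drifts below its threshold.
    if all(baseline[e.name] >= e.threshold for e in evaluators):
        return current_prompt

    # 3. Propose a fix based on failing vs. successful outputs.
    candidate = propose_fix(logs, current_prompt)

    # 4. A/B test: same evaluators, same production data.
    challenger = {e.name: e.score(logs, candidate) for e in evaluators}

    # 5. The candidate wins only if it beats (or matches) every baseline metric;
    #    a human still approves the final one-click ship.
    wins = all(challenger[e.name] >= baseline[e.name] for e in evaluators)
    return candidate if wins else current_prompt
```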
We're also working on a self-improving memory, where all of these edge cases and errors are stored and can be retrieved as context for your prompts, so they never fail again on the cases where they've failed before.
What’s the integration effort like?
To be honest? Pretty low. We're stack-agnostic: LangChain, custom RAGs, plain Python, JS, raw HTTP calls, whatever. As long as you can send JSON logs (we have libraries and plain HTTP calls for that), you're in. Most teams get it running in under an hour on a call with our integrations team. From there, you'll start seeing evaluations and improvements almost immediately.
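To make "send JSON logs" concrete, here's roughly what logging one agent step over plain HTTP could look like. The endpoint URL, auth header, and payload fields below are just placeholders to show the shape, not the documented API, so check the docs/SDK for the real ones:

```python
# Shape-only example: the URL, headers, and field names are placeholders,
# not the real endpoint. The point is just "one small JSON log per agent step".
import json
import time
import urllib.request

def log_agent_step(api_key: str, payload: dict) -> None:
    req = urllib.request.Request(
        "https://api.example.com/v1/traces",  # placeholder endpoint
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # placeholder auth scheme
        },
        method="POST",
    )
    urllib.request.urlopen(req, timeout=5)

log_agent_step("MY_API_KEY", {
    "agent": "support-bot",
    "node": "retrieve_docs",  # which step or tool call this trace belongs to
    "input": {"query": "How do I reset my password?"},
    "output": {"documents": ["kb/reset-password.md"]},
    "latency_ms": 412,
    "timestamp": time.time(),
})
```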
Happy to show it in action if you’re curious! You can book a call with our integrations team here: https://calendly.com/cristhian-handit/30min
—José
Thanks so much for the detailed response.
It really helps clarify how you handle the auto-generation of improvements, especially the use of real production data and A/B testing to validate solutions. It sounds solid and results-driven.
Also great to hear that integration is pretty low-effort and stack-agnostic. That really lowers the barrier for teams already running agents in production.
Thanks again for sharing these insights.
Handit.ai
@pastuxso No problem at all! Hope to see you using Handit soon to make your AI improve itself automatically!
@jramr7 Curious—when Handit decides to generate a new prompt, how does that process begin? Does it rely on past performance logs, or do you use any evaluation rules or metrics?
Handit.ai
@reid_crooks Hey, great question! It relies on past performance logs + the evaluations attached to those logs + an enriched insight-generation step that figures out what went wrong and what could potentially fix it. Then an A/B testing system validates that those problems are actually solved against the same evaluators and production data. And finally you just get a better prompt, with a comparison of the actual metrics calculated by your evaluators. Here's what it looks like:
(And now we're implementing something that's still being tested, called self-improving memory: it stores all of these errors as context for your prompts, so we not only fix your prompts but also give them context on their past errors, so they never make them again!)
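(If you want to picture the memory idea, here's a toy sketch of the concept only, definitely not the code we're shipping: store each failure with the lesson learned, then pull the relevant lessons back into the prompt. The keyword matching below is just a stand-in for real retrieval.)

```python
# Toy sketch of the "self-improving memory" concept only; not Handit's implementation.
from dataclasses import dataclass, field

@dataclass
class FailureMemory:
    cases: list[dict] = field(default_factory=list)

    def record(self, user_input: str, bad_output: str, lesson: str) -> None:
        # Store each failed case together with the lesson extracted from evaluation.
        self.cases.append({"input": user_input, "output": bad_output, "lesson": lesson})

    def relevant_lessons(self, user_input: str, k: int = 3) -> list[str]:
        # Naive keyword overlap for illustration; real retrieval would use embeddings.
        words = set(user_input.lower().split())
        ranked = sorted(
            self.cases,
            key=lambda c: len(words & set(c["input"].lower().split())),
            reverse=True,
        )
        return [c["lesson"] for c in ranked[:k] if c["lesson"]]

def build_prompt(base_prompt: str, memory: FailureMemory, user_input: str) -> str:
    # Inject past-failure lessons as extra guardrails in the prompt.
    lessons = memory.relevant_lessons(user_input)
    if not lessons:
        return base_prompt
    bullets = "\n".join(f"- {lesson}" for lesson in lessons)
    return f"{base_prompt}\n\nAvoid these past mistakes:\n{bullets}"
```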
@jramr7 the self improving memory sounds super interesting. I’d love to chat more about the technical side of things. My twitter is @ReidCrooks9460 if you’re interested.
Handit.ai
@reid_crooks Sure, I'll be on X then!
Handit.ai
@sakshamverma_08 Ha! Never been asked this before, but it's pretty boring. We were building the MVP a year ago and thinking about a name while using it with our own AI teams. At some point we started asking people on our teams to hand their AI over to us so we could connect it to the dashboard, so we'd just say "hand it". Then we said, well, you can just handit to us and we'll solve all the issues you're having with your AI. And it stuck with us. :)
@jramr7 Ohh... so that's how it came to be. You know, you should mention that on your site. I haven't checked it yet, but it would be an interesting touch.
@jramr7 Can you support my organization on the day of our launch as well, and give feedback after launch? I'm sure, as a 'Maker', you understand how valuable that is.
Jupitrr AI
This is something new! Love the direction! Congrats on the launch @cristhian_camilo_gomez_neira @jramr7
Handit.ai
@cristhian_camilo_gomez_neira @lakshya_singh Thank you! We really think self-improving AI is going to be the next big leap in AI, which is why we open-sourced it: to get everyone in on the next wave!
Congrats on the launch! Love the flow visualization, pretty cool to see something like that!
Handit.ai
@fedjabosnic Thanks for the feedback! The tracing is great, but you have to see the prompts getting better on their own. That's some next-level stuff!