Simplify the testing process for LLMs, chatbots, and other apps powered by AI. BenchLLM is a free open-source tool that allows you to test hundreds of prompts and responses on the fly. Automate evaluations and benchmark models to build better and safer AI.
Hello Product Hunt!
We built BenchLLM to offer a more versatile open-source benchmarking tool for AI applications. It lets you measure the accuracy of your models, agents, or chains by validating their responses against any number of tests, with LLMs doing the evaluation.
BenchLLM is actively used at V7 to improve our own LLM applications, and we're now open-sourcing it under the MIT License to share it with the wider community.
You can use it to:
- Test your LLM's responses across any number of prompts (a minimal example is sketched after this list).
- Run continuous integration for chains like LangChain, agents like AutoGPT, or models like Llama or GPT-4.
- Eliminate flaky chains and build confidence in your code.
- Catch inaccurate responses and hallucinations in every version of your application.
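
As a reference for the first point above, here is roughly what a test looks like. This is a minimal sketch based on the patterns in our docs; `run_my_model` is a placeholder for your own chain, agent, or model call, and the exact decorator options are described in the repo:

```python
import benchllm


def run_my_model(prompt: str) -> str:
    # Placeholder: call your chain, agent, or model here.
    return "2"


# BenchLLM collects functions decorated with @benchllm.test and runs them
# against the YAML test files found in the given suite directory. Each test
# file declares an `input` and a list of `expected` answers.
@benchllm.test(suite=".")
def arithmetic(input: str) -> str:
    return run_my_model(input)
```

A matching test file would pair an input such as "What's 1+1? Reply with only the result." with `expected` answers like `["2"]`, and running `bench run` from the CLI executes the suite and evaluates the predictions.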
Key Features:
- Automated, LLM-based tests and evaluations over any number of prompts and predictions.
- Multiple evaluation methods: semantic similarity checks, string matching, and manual review.
- Caching of LLM responses to speed up testing and evaluation.
- A comprehensive API and CLI for running test suites and iterating faster (see the sketch below).
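
The same flow is also available programmatically if you'd rather stay in Python. The snippet below is a rough sketch assuming the `Test`, `Tester`, and `StringMatchEvaluator` names exported by the package; check the repo for the exact signatures:

```python
from benchllm import StringMatchEvaluator, Test, Tester

# Define tests in code rather than in YAML files.
tests = [
    Test(input="What's 1+1? Reply with only the result.", expected=["2"]),
]

# The Tester runs your model over every input and collects predictions.
# The lambda is a stand-in for your own chain, agent, or model call.
tester = Tester(lambda prompt: "2")
tester.add_tests(tests)
predictions = tester.run()

# Evaluators score the predictions; a semantic evaluator can be swapped in
# for LLM-based similarity checks instead of plain string matching.
evaluator = StringMatchEvaluator()
evaluator.load(predictions)
results = evaluator.run()
```

String matching is the cheapest check; the semantic evaluators ask an LLM to judge whether a prediction matches any of the expected answers.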
Here's a preview of a common use case in LLM testing and how popular models compare:
https://www.loom.com/share/173c1...
Visit our GitHub repo for examples, templates, and docs, or join our Discord to share feedback and contribute to the project!