
LLMs change fast: GPT-4 updates silently, models vanish, and prompts break. PromptPerf helps you stay ahead by testing a prompt across GPT-4o, GPT-4, and GPT-3.5, comparing outputs to your expected result using similarity scoring.

✅ 3 test cases per run, unlimited runs
✅ CSV export
✅ Built-in scoring

More models and batch runs coming soon. One feature per 100 users.

Built solo. Feedback welcome 🙏

promptperf.dev
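For anyone curious how the cross-model comparison works in principle, here's a minimal sketch. It assumes the OpenAI Python SDK and uses difflib's SequenceMatcher as a stand-in for PromptPerf's similarity scoring; the model list and score function are illustrative, not PromptPerf's actual internals.

```python
# Minimal sketch: run one prompt across several models and score each
# output against an expected result. difflib stands in for whatever
# similarity metric PromptPerf actually uses (an assumption here).
from difflib import SequenceMatcher

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODELS = ["gpt-4o", "gpt-4", "gpt-3.5-turbo"]  # the models named above

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; PromptPerf's scoring may differ."""
    return SequenceMatcher(None, a, b).ratio()

def compare_models(prompt: str, expected: str) -> dict[str, float]:
    """Score one prompt's output against `expected` on each model."""
    scores = {}
    for model in MODELS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        output = resp.choices[0].message.content or ""
        scores[model] = similarity(output, expected)
    return scores

if __name__ == "__main__":
    print(compare_models("Summarize: the cat sat on the mat.", "A cat sat on a mat."))
```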
Thank you, Ajay, for the review. The next phase is to add support for multiple models so prompts can be tested against each one and compared. After that comes the ability to auto-run prompts across multiple models and temperatures with 3, 5, or 10 repetitions, to check that the same prompt at the same temperature on the same model produces consistent results, giving you an accuracy/consistency score for your prompt. A rough sketch of that check is below.
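To make the planned consistency check concrete, here's one way it could work; this is a sketch under my own assumptions, not PromptPerf's implementation. Run the same prompt N times at a fixed temperature on a fixed model, score each output against the expected result, and report the mean (accuracy) and spread (consistency).

```python
# Sketch of the planned consistency check: N runs of the same prompt at the
# same temperature on the same model, each scored against an expected result.
# The scoring function and summary stats are illustrative assumptions.
from difflib import SequenceMatcher
from statistics import mean, pstdev

from openai import OpenAI

client = OpenAI()

def consistency_score(prompt: str, expected: str, model: str = "gpt-4o",
                      temperature: float = 0.7, runs: int = 5) -> dict:
    """Run the prompt `runs` times and summarize how stable the scores are."""
    scores = []
    for _ in range(runs):
        resp = client.chat.completions.create(
            model=model,
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}],
        )
        output = resp.choices[0].message.content or ""
        scores.append(SequenceMatcher(None, output, expected).ratio())
    return {
        "mean_score": mean(scores),  # accuracy: how close to expected on average
        "spread": pstdev(scores),    # consistency: lower = more repeatable
        "runs": runs,
    }
```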