Still manually tweaking AI prompts & praying for the best?
I've completed the proof of concept for my AI evaluation tool, successfully testing prompts across configurations (models x temperatures x runs). https://promptperf.dev/
Hi, I am building an AI prompt optimising tool that runs a prompt through multiple AI models, across various temperature settings, for X number of runs.
This saves the time of running all those tests by hand. For example, say a user wants to test 3 questions/prompts against an expected answer across 4 models, with 3 temperature options, and each combination run at least 10 times to ensure consistency.
3 x 4 x 3 x 10 = 360 API calls, plus reporting and analysis of each result. Now imagine repeating this for multiple prompts every time you iterate; depending on your requirements that can easily be hundreds or thousands of calls.
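To make the combinatorics concrete, here is a minimal sketch of what that grid looks like if you script it yourself. This is purely illustrative: `call_model` is a hypothetical placeholder for whatever SDK or API client you'd actually use, and the prompt/model names are made up. The point is only that the number of calls multiplies out to 360.

```python
from itertools import product

# Hypothetical values matching the scenario above:
prompts = ["prompt_1", "prompt_2", "prompt_3"]          # 3 prompts
models = ["model_a", "model_b", "model_c", "model_d"]   # 4 models
temperatures = [0.0, 0.7, 1.0]                          # 3 temperature settings
runs_per_config = 10                                    # 10 repeats for consistency

def call_model(model, prompt, temperature):
    # Placeholder for a real API call made through whatever client you use.
    return f"response from {model} at temperature {temperature}"

results = []
for prompt, model, temp, run in product(prompts, models, temperatures, range(runs_per_config)):
    results.append({
        "prompt": prompt,
        "model": model,
        "temperature": temp,
        "run": run,
        "output": call_model(model, prompt, temp),  # one API call per combination
    })

print(len(results))  # 3 * 4 * 3 * 10 = 360 calls to report on and analyse
```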
How It Works:
1. Define Your Prompt
Input the prompt you want to optimize—whether it's a question, instruction, or complex task for the AI.
2. Set Expected Output
Specify what an ideal response looks like—this becomes the benchmark against which all AI outputs are compared.
3. Select Models
Choose which AI models you want to evaluate from our supported options like GPT-4o, Claude, Gemini, and more.
4. Configure Settings
Set the temperature values and number of runs for each configuration to test various creativity vs. precision balances.
5. Run Tests
PromptPerf automatically executes all tests across your specified models, temperatures, and run counts.
6. Analyze Results
Review comprehensive performance data showing which configuration delivers the closest match to your expected output (a rough code sketch of this flow follows below).
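For a rough picture of steps 2 and 6, here is a small sketch of scoring outputs against an expected answer and ranking configurations. How PromptPerf actually measures "best match" isn't specified here; this example just uses a simple text-similarity ratio as an assumed stand-in, and the sample `results` list is made up for illustration.

```python
from difflib import SequenceMatcher
from collections import defaultdict

# Made-up sample results (in practice this would be the output of the grid run above).
results = [
    {"model": "model_a", "temperature": 0.0, "output": "Paris is the capital of France."},
    {"model": "model_a", "temperature": 1.0, "output": "It might be Paris, I think."},
    {"model": "model_b", "temperature": 0.0, "output": "The capital of France is Paris."},
]

expected = "The capital of France is Paris."

def match_score(output, expected):
    # Assumed scoring method: plain string similarity between 0.0 and 1.0.
    return SequenceMatcher(None, output, expected).ratio()

# Group scores by (model, temperature) configuration.
scores = defaultdict(list)
for r in results:
    scores[(r["model"], r["temperature"])].append(match_score(r["output"], expected))

# Average across runs so a lucky one-off doesn't win; highest mean = best configuration.
ranked = sorted(
    ((config, sum(s) / len(s)) for config, s in scores.items()),
    key=lambda item: item[1],
    reverse=True,
)
for (model, temp), avg in ranked:
    print(f"{model} @ temperature {temp}: average match {avg:.2f}")
```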
I am looking for feedback and early testers who would be interested. The proof of concept is done, and I am about 1-2 weeks away from an early access launch/MVP.
https://promptperf.dev/