Features
Measure & Evaluate
Know exactly how your agents perform. Test systematically. Improve continuously.
Evaluation System
How evaluations work
A complete system for testing, measuring, and improving agent quality.
Test Datasets
Create sets of inputs and expected outputs. Run your agent against them to measure accuracy.
Experiments
Compare different prompts, models, or settings side-by-side. See which performs best.
AI Judges
Use Claude to evaluate agent responses. Score for accuracy, helpfulness, tone, and safety.
Track Over Time
See how changes affect performance. Set alerts when quality drops below thresholds.
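To make the alerting idea concrete, the check behind a quality alert can be as small as the sketch below; the threshold, field names, and run shape are illustrative, not a Zentrr API.

```python
PASS_RATE_THRESHOLD = 0.90  # illustrative threshold, not a Zentrr default

def check_quality(run: dict) -> str | None:
    """Return an alert message if a run's pass rate falls below the threshold."""
    pass_rate = run["passed"] / run["total"]  # hypothetical run fields
    if pass_rate < PASS_RATE_THRESHOLD:
        return f"Pass rate {pass_rate:.1%} is below the {PASS_RATE_THRESHOLD:.0%} threshold"
    return None

print(check_quality({"passed": 42, "total": 50}))  # 84.0% -> alert fires
```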
Evaluation Dashboard
See performance at a glance
Track pass rates, score distributions, and trends over time. Identify issues before they impact users.
[Dashboard preview: Support Agent v2.4 with recent test runs]
Metrics
What you measure
Pass Rate
% of test cases that meet your criteria
Average Score
Weighted evaluation score across all tests (see the sketch after this list)
Per-Evaluator Breakdown
See scores by each evaluation criterion
Duration
How long each test run takes
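To make the first two metrics concrete, here is a minimal sketch of how a pass rate and a weighted average score fall out of a run's results; the result shape, evaluator names, and weights are illustrative, not Zentrr's schema.

```python
# Illustrative results: one entry per test case, with a pass flag and
# per-evaluator scores. All field names here are hypothetical.
results = [
    {"passed": True,  "scores": {"accuracy": 0.9, "tone": 0.8}},
    {"passed": False, "scores": {"accuracy": 0.4, "tone": 0.7}},
    {"passed": True,  "scores": {"accuracy": 1.0, "tone": 0.9}},
]
weights = {"accuracy": 0.7, "tone": 0.3}  # example weights, chosen by you

pass_rate = sum(r["passed"] for r in results) / len(results)
avg_score = sum(
    sum(weights[name] * score for name, score in r["scores"].items())
    for r in results
) / len(results)

print(f"Pass rate: {pass_rate:.0%}, average score: {avg_score:.2f}")  # 67%, 0.78
```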
What AI judges evaluate
Configure your judge to score responses on the criteria that matter to your use case. Add custom criteria for domain-specific evaluation; one possible configuration is sketched after the list below.
- Configurable evaluation criteria
- Detailed scoring explanations
- Custom criteria for your domain
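A minimal sketch of that kind of criteria configuration as plain Python data; the names, weights, and descriptions are illustrative rather than Zentrr's schema, and the last entry shows a hypothetical domain-specific criterion.

```python
# Hypothetical judge configuration: each criterion gets a description the
# judge scores against, plus a weight used in the overall score.
judge_criteria = [
    {"name": "accuracy",    "weight": 0.3, "description": "Response is factually correct and on-topic."},
    {"name": "helpfulness", "weight": 0.3, "description": "Response actually resolves the user's request."},
    {"name": "tone",        "weight": 0.2, "description": "Response is polite and matches brand voice."},
    {"name": "safety",      "weight": 0.1, "description": "Response avoids harmful or disallowed content."},
    # An example custom, domain-specific criterion:
    {"name": "policy",      "weight": 0.1, "description": "Refund offers follow the stated refund policy."},
]
```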
Evaluation Workflow
How to evaluate your agents
Create test dataset
Build a set of representative inputs and expected behaviors for your agent.
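A dataset can start as a plain list of input/expected pairs. A minimal sketch for a support agent; the shape is illustrative, not Zentrr's actual dataset format.

```python
# Hypothetical test dataset: each case pairs an input with the behavior
# you expect. The field names are illustrative.
dataset = [
    {
        "input": "I was charged twice for my subscription this month.",
        "expected": "Apologizes, confirms the duplicate charge, and offers a refund.",
    },
    {
        "input": "How do I export my data?",
        "expected": "Points to the export option under account settings.",
    },
]
```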
Run experiment
Your agent processes each test case and generates responses automatically.
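Conceptually, a run just maps your agent over the dataset and records each response. A sketch assuming a hypothetical `agent` callable:

```python
def run_experiment(agent, dataset):
    """Collect the agent's response for every case in the dataset."""
    return [{"case": case, "response": agent(case["input"])} for case in dataset]

# Trivial stand-in agent and a one-case dataset, just to show the shapes:
dataset = [{"input": "How do I export my data?", "expected": "Points to settings."}]
run_results = run_experiment(lambda prompt: f"(response to: {prompt})", dataset)
```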
AI judge scores
Claude evaluates each response against your configured criteria.
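At its core, an LLM-as-judge step is a scoring prompt. A minimal sketch using the Anthropic Python SDK; the prompt wording, scoring scale, and model choice are illustrative, not Zentrr's implementation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(case: dict, response: str) -> str:
    """Ask Claude to score one agent response against the expected behavior."""
    prompt = (
        "Score the agent response from 1-5 for accuracy, helpfulness, tone, "
        "and safety, and briefly explain each score.\n\n"
        f"Input: {case['input']}\n"
        f"Expected behavior: {case['expected']}\n"
        f"Agent response: {response}"
    )
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model choice
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text
```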
Review and improve
Identify failures, improve your agent, and re-run to verify improvements.
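Review often starts by pulling out the failing cases for triage. A short sketch over an illustrative scored-results shape (the field names are hypothetical):

```python
# Illustrative scored results; the field names are hypothetical.
scored = [
    {"case": {"input": "How do I export my data?"}, "response": "...", "passed": True},
    {"case": {"input": "I was charged twice."}, "response": "...", "passed": False},
]

failures = [r for r in scored if not r["passed"]]
for r in failures:
    print(f"FAILED: {r['case']['input']!r} -> {r['response']!r}")
```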
Ready to measure your agents?
Schedule a demo with our team to see how Zentrr helps you build agents you can trust.