Features

Measure & Evaluate

Know exactly how your agents perform. Test systematically. Improve continuously.

Evaluation System

How evaluations work

A complete system for testing, measuring, and improving agent quality.

Test Datasets

Create sets of inputs and expected outputs. Run your agent against them to measure accuracy.
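To make this concrete, here is a hypothetical sketch of what a small dataset might look like, with each input paired to the behavior you expect. The schema is illustrative only, not Zentrr's actual format:

```python
# Hypothetical dataset schema: each case pairs an input with an expected
# behavior the evaluation can check against. Zentrr's real format may differ.
test_cases = [
    {
        "input": "How do I reset my password?",
        "expected": "Directs the user to the password reset flow in Settings.",
    },
    {
        "input": "Cancel my subscription immediately.",
        "expected": "Confirms intent and explains the cancellation steps.",
    },
]
```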

Experiments

Compare different prompts, models, or settings side-by-side. See which performs best.
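One way to picture an experiment: the same test cases run against two configurations, with average scores compared at the end. A minimal sketch, where run_agent and judge_score are stand-in names rather than Zentrr's API:

```python
# Minimal side-by-side experiment sketch. run_agent and judge_score are
# placeholders for your agent call and evaluation step; names are illustrative.

def run_agent(prompt_version: str, user_input: str) -> str:
    """Stand-in for calling your agent with a given prompt version."""
    return f"[{prompt_version}] response to: {user_input}"

def judge_score(response: str, expected: str) -> float:
    """Stand-in for an evaluation step; returns a 0-100 score."""
    return 100.0 if expected.lower() in response.lower() else 0.0

cases = [{"input": "How do I reset my password?", "expected": "response"}]

for variant in ("prompt-v2.3", "prompt-v2.4"):
    scores = [judge_score(run_agent(variant, c["input"]), c["expected"])
              for c in cases]
    print(f"{variant}: average score {sum(scores) / len(scores):.1f}")
```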

AI Judges

Use Claude to evaluate agent responses. Score for accuracy, helpfulness, tone, and safety.

Track Over Time

See how changes affect performance. Set alerts when quality drops below thresholds.
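An alert rule boils down to a metric, a threshold, and a notification target. A hedged sketch with hypothetical field names; Zentrr's actual alerting configuration may differ:

```python
# Hypothetical alert rule: flag a run when quality drops below a threshold.
# Field names are illustrative; the real config may differ.
alert_rule = {
    "metric": "pass_rate",
    "condition": "below",
    "threshold": 90.0,             # percent
    "notify": ["#agent-quality"],  # e.g., a Slack channel
}

def should_alert(pass_rate: float, rule: dict) -> bool:
    return rule["condition"] == "below" and pass_rate < rule["threshold"]

print(should_alert(87.5, alert_rule))  # True: quality dropped below 90%
```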

Evaluation Dashboard

See performance at a glance

Track pass rates, score distributions, and trends over time. Identify issues before they impact users.

[Dashboard preview: Support Agent v2.4 at a 95.3% pass rate (+2.4% vs last week), with a score distribution across 150 test cases, per-evaluator scores (Accuracy 94%, Helpfulness 91%, Tone 97%, Safety 99%), a 30-day pass rate trend, and a list of recent test runs with their durations.]

Metrics

What you measure

Pass Rate

% of test cases that meet your criteria

Average Score

Weighted evaluation score across all tests

Per-Evaluator Breakdown

See scores by each evaluation criterion

Duration

How long each test run takes
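To show how the headline numbers relate, here is a minimal sketch of pass rate and weighted average score computed from per-evaluator results. The weights and pass threshold are illustrative; in practice you configure your own:

```python
# How the headline metrics combine, as a minimal sketch. Weights and the
# pass threshold below are illustrative, not Zentrr defaults.
results = [
    {"accuracy": 94, "helpfulness": 91, "tone": 97, "safety": 99},
    {"accuracy": 60, "helpfulness": 88, "tone": 95, "safety": 100},
]
weights = {"accuracy": 0.4, "helpfulness": 0.3, "tone": 0.2, "safety": 0.1}
PASS_THRESHOLD = 80  # minimum weighted score for a case to count as passing

def weighted_score(result: dict) -> float:
    return sum(result[name] * weight for name, weight in weights.items())

scores = [weighted_score(r) for r in results]
pass_rate = 100 * sum(s >= PASS_THRESHOLD for s in scores) / len(scores)
average_score = sum(scores) / len(scores)
print(f"pass rate: {pass_rate:.1f}%  average score: {average_score:.1f}")
```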

AI Judges

What AI judges evaluate

Configure your judge to score responses on the criteria that matter to your use case. Add custom criteria for domain-specific evaluation.

  • Configurable evaluation criteria
  • Detailed scoring explanations
  • Custom criteria for your domain

Built-in criteria:

  • Accuracy - Is the response factually correct?
  • Helpfulness - Does it actually solve the user's problem?
  • Tone - Is it professional and appropriate?
  • Safety - Does it avoid harmful content?
  • Relevance - Does it stay on topic?
  • Completeness - Does it address all parts of the question?
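Under the hood, an LLM-as-judge call can be as simple as a rubric prompt plus structured output. A hedged sketch using the Anthropic Python SDK; the rubric, model name, and JSON contract here are illustrative, not Zentrr's implementation:

```python
# A minimal LLM-as-judge sketch using the Anthropic Python SDK. The rubric,
# model name, and JSON contract are illustrative, not Zentrr's implementation.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_PROMPT = """Score the agent response on each criterion from 0 to 100.
Return only JSON, e.g. {{"accuracy": 94, "helpfulness": 91, "tone": 97,
"safety": 99, "explanation": "..."}}.

User input: {user_input}
Agent response: {response}"""

def judge(user_input: str, response: str) -> dict:
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative; use a current model
        max_tokens=500,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            user_input=user_input, response=response)}],
    )
    # Assumes the model followed the instruction to return raw JSON.
    return json.loads(message.content[0].text)
```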

Evaluation Workflow

How to evaluate your agents

1. Create test dataset

Build a set of representative inputs and expected behaviors for your agent.

2. Run experiment

Your agent processes each test case and generates responses automatically.

3. AI judge scores

Claude evaluates each response against your configured criteria.

4. Review and improve

Identify failures, improve your agent, and re-run to verify improvements.
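Putting the four steps together, the loop looks roughly like this. run_agent and judge are local stubs standing in for your agent and the judge call sketched earlier; every name here is illustrative:

```python
# The four-step loop, end to end. run_agent and judge are stubs standing in
# for your agent and the judge call above; names are illustrative.

def run_agent(user_input: str) -> str:
    return "Go to Settings > Security and click 'Reset password'."  # stub agent

def judge(user_input: str, response: str) -> dict:
    return {"accuracy": 95, "helpfulness": 92, "tone": 98, "safety": 100}  # stub

dataset = [{"input": "How do I reset my password?"}]           # 1. test dataset

failures = []
for case in dataset:
    response = run_agent(case["input"])                        # 2. run experiment
    scores = judge(case["input"], response)                    # 3. AI judge scores
    if min(scores.values()) < 80:
        failures.append((case, response, scores))              # 4. flag for review

print(f"{len(dataset) - len(failures)}/{len(dataset)} cases passed")
```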

Ready to measure your agents?

Schedule a demo with our team to see how Zentrr helps you build agents you can trust.