Features

Measure & Evaluate

Know exactly how your agents perform. Test systematically. Improve continuously.

Evaluation System

How evaluations work

A complete system for testing, measuring, and improving agent quality.

Test Datasets

Create sets of inputs and expected outputs. Run your agent against them to measure accuracy.
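To make this concrete, here is a hypothetical sketch of what a small dataset might look like, with each input paired to the behavior you expect. The schema is illustrative only, not Zentrr's actual format:

```python
# Hypothetical dataset schema: each case pairs an input with an expected
# behavior the evaluation can check against. Zentrr's real format may differ.
test_cases = [
    {
        "input": "How do I reset my password?",
        "expected": "Directs the user to the password reset flow in Settings.",
    },
    {
        "input": "Cancel my subscription immediately.",
        "expected": "Confirms intent and explains the cancellation steps.",
    },
]
```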

Experiments

Compare different prompts, models, or settings side-by-side. See which performs best.
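One way to picture an experiment: the same test cases run against two configurations, with average scores compared at the end. A minimal sketch, where run_agent and judge_score are stand-in names rather than Zentrr's API:

```python
# Minimal side-by-side experiment sketch. run_agent and judge_score are
# placeholders for your agent call and evaluation step; names are illustrative.

def run_agent(prompt_version: str, user_input: str) -> str:
    """Stand-in for calling your agent with a given prompt version."""
    return f"[{prompt_version}] response to: {user_input}"

def judge_score(response: str, expected: str) -> float:
    """Stand-in for an evaluation step; returns a 0-100 score."""
    return 100.0 if expected.lower() in response.lower() else 0.0

cases = [{"input": "How do I reset my password?", "expected": "response"}]

for variant in ("prompt-v2.3", "prompt-v2.4"):
    scores = [judge_score(run_agent(variant, c["input"]), c["expected"])
              for c in cases]
    print(f"{variant}: average score {sum(scores) / len(scores):.1f}")
```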

AI Judges

Use Claude to evaluate agent responses. Score for accuracy, helpfulness, tone, and safety.

Track Over Time

See how changes affect performance. Set alerts when quality drops below thresholds.
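An alert rule boils down to a metric, a threshold, and a notification target. A hedged sketch with hypothetical field names; Zentrr's actual alerting configuration may differ:

```python
# Hypothetical alert rule: flag a run when quality drops below a threshold.
# Field names are illustrative; the real config may differ.
alert_rule = {
    "metric": "pass_rate",
    "condition": "below",
    "threshold": 90.0,             # percent
    "notify": ["#agent-quality"],  # e.g., a Slack channel
}

def should_alert(pass_rate: float, rule: dict) -> bool:
    return rule["condition"] == "below" and pass_rate < rule["threshold"]

print(should_alert(87.5, alert_rule))  # True: quality dropped below 90%
```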

Evaluation Dashboard

See performance at a glance

Track pass rates, score distributions, and trends over time. Identify issues before they impact users.

[Dashboard preview: Support Agent v2.4 at a 95.3% pass rate (+2.4% vs last week), with a score distribution across 150 test cases, per-evaluator scores (Accuracy 94%, Helpfulness 91%, Tone 97%, Safety 99%), a 30-day pass rate trend, and a list of recent test runs with their durations.]

Metrics

What you measure

Pass Rate

% of test cases that meet your criteria

Average Score

Weighted evaluation score across all tests

Per-Evaluator Breakdown

See scores by each evaluation criterion

Duration

How long each test run takes
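To show how the headline numbers relate, here is a minimal sketch of pass rate and weighted average score computed from per-evaluator results. The weights and pass threshold are illustrative; in practice you configure your own:

```python
# How the headline metrics combine, as a minimal sketch. Weights and the
# pass threshold below are illustrative, not Zentrr defaults.
results = [
    {"accuracy": 94, "helpfulness": 91, "tone": 97, "safety": 99},
    {"accuracy": 60, "helpfulness": 88, "tone": 95, "safety": 100},
]
weights = {"accuracy": 0.4, "helpfulness": 0.3, "tone": 0.2, "safety": 0.1}
PASS_THRESHOLD = 80  # minimum weighted score for a case to count as passing

def weighted_score(result: dict) -> float:
    return sum(result[name] * weight for name, weight in weights.items())

scores = [weighted_score(r) for r in results]
pass_rate = 100 * sum(s >= PASS_THRESHOLD for s in scores) / len(scores)
average_score = sum(scores) / len(scores)
print(f"pass rate: {pass_rate:.1f}%  average score: {average_score:.1f}")
```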

AI Judges

What AI judges evaluate

Configure your judge to score responses on the criteria that matter to your use case. Add custom criteria for domain-specific evaluation.

  • Configurable evaluation criteria
  • Detailed scoring explanations
  • Custom criteria for your domain

Built-in criteria:

  • Accuracy - Is the response factually correct?
  • Helpfulness - Does it actually solve the user's problem?
  • Tone - Is it professional and appropriate?
  • Safety - Does it avoid harmful content?
  • Relevance - Does it stay on topic?
  • Completeness - Does it address all parts of the question?
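Under the hood, an LLM-as-judge call can be as simple as a rubric prompt plus structured output. A hedged sketch using the Anthropic Python SDK; the rubric, model name, and JSON contract here are illustrative, not Zentrr's implementation:

```python
# A minimal LLM-as-judge sketch using the Anthropic Python SDK. The rubric,
# model name, and JSON contract are illustrative, not Zentrr's implementation.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_PROMPT = """Score the agent response on each criterion from 0 to 100.
Return only JSON, e.g. {{"accuracy": 94, "helpfulness": 91, "tone": 97,
"safety": 99, "explanation": "..."}}.

User input: {user_input}
Agent response: {response}"""

def judge(user_input: str, response: str) -> dict:
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative; use a current model
        max_tokens=500,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            user_input=user_input, response=response)}],
    )
    # Assumes the model followed the instruction to return raw JSON.
    return json.loads(message.content[0].text)
```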

Evaluation Workflow

How to evaluate your agents

1. Create test dataset

Build a set of representative inputs and expected behaviors for your agent.

2. Run experiment

Your agent processes each test case and generates responses automatically.

3. AI judge scores

Claude evaluates each response against your configured criteria.

4. Review and improve

Identify failures, improve your agent, and re-run to verify improvements.
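Putting the four steps together, the loop looks roughly like this. run_agent and judge are local stubs standing in for your agent and the judge call sketched earlier; every name here is illustrative:

```python
# The four-step loop, end to end. run_agent and judge are stubs standing in
# for your agent and the judge call above; names are illustrative.

def run_agent(user_input: str) -> str:
    return "Go to Settings > Security and click 'Reset password'."  # stub agent

def judge(user_input: str, response: str) -> dict:
    return {"accuracy": 95, "helpfulness": 92, "tone": 98, "safety": 100}  # stub

dataset = [{"input": "How do I reset my password?"}]           # 1. test dataset

failures = []
for case in dataset:
    response = run_agent(case["input"])                        # 2. run experiment
    scores = judge(case["input"], response)                    # 3. AI judge scores
    if min(scores.values()) < 80:
        failures.append((case, response, scores))              # 4. flag for review

print(f"{len(dataset) - len(failures)}/{len(dataset)} cases passed")
```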

Ready to measure your agents?

Schedule a demo with our team to see how Zentrr helps you build agents you can trust.