Overview

A/B testing on Cekura lets you compare two (or more) runs of your agent against the same set of evaluators — so you can see whether a prompt change, model swap, or configuration tweak actually improves behavior before shipping it. The flow is simple:
  1. Run the same set of evaluators against Version A of your agent.
  2. Make your change (update the prompt, switch the model, adjust tools, etc.).
  3. Run the same evaluators against Version B.
  4. Select both runs on the Results page and compare metric-by-metric.
Always run both versions against the same evaluators with the same test profile. Otherwise you’re measuring noise from different test inputs rather than the change itself.

Prerequisites

Before running an A/B test, make sure you have:
  • An agent configured on Cekura — see the Testing Overview for the basic testing flow
  • A stable set of evaluators that pass reliably for your baseline — see the Suggested Testing Approach
  • A clear hypothesis about what you expect the change to improve (pass rate, latency, adherence to a specific metric, etc.)

Running an A/B Test

1. Run Version A (baseline)

Select your evaluators and run them against your current agent configuration. Wait for the run to complete and record the run name or ID.
2. Apply your change

Update the agent’s prompt, model, voice, tool configuration, or any other parameter you want to test. Change one variable at a time — otherwise you won’t know which change caused the difference.
3. Run Version B

Run the same evaluators against the updated agent. Use the same test profile, personalities, and frequency so the only difference between the two runs is your change.
4. Select both runs and compare

On the Results page, select the checkboxes next to both runs and click Compare. Cekura shows a side-by-side breakdown of every metric and evaluator across the two runs. Look for:
  • Overall pass rate — did B improve or regress?
  • Per-metric deltas — which specific metrics moved, and by how much?
  • Individual evaluator diffs — open any evaluator to see the transcripts side-by-side and read why one passed and the other failed.
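The comparison above can be sketched in code. This is a minimal sketch, not Cekura's export format — the `version_a`/`version_b` mappings from metric name to pass rate and the `metric_deltas` helper are hypothetical shapes for illustration:

```python
def metric_deltas(run_a: dict[str, float], run_b: dict[str, float]) -> dict[str, float]:
    """Per-metric change from run A (baseline) to run B, for metrics present in both."""
    return {m: round(run_b[m] - run_a[m], 3) for m in run_a if m in run_b}

# Hypothetical pass rates per metric from two runs of the same evaluators.
version_a = {"instruction_following": 0.80, "call_completion": 0.90, "latency_ok": 0.70}
version_b = {"instruction_following": 0.90, "call_completion": 0.85, "latency_ok": 0.70}

deltas = metric_deltas(version_a, version_b)
# instruction_following improved, call_completion regressed, latency_ok is unchanged —
# exactly the per-metric picture the compare view surfaces.
```

Reading deltas metric-by-metric like this is what keeps a regression on one metric from hiding behind an improvement on another.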

Best Practices

Change One Variable at a Time

If you change the prompt and swap the model in the same run, you can’t attribute any improvement (or regression) to either change. Test changes in isolation.

Use Frequency > 1 to De-Noise

Voice agents are non-deterministic. A single run can pass or fail by chance. Set frequency to 3–5 so each evaluator runs multiple times — the aggregate result is far more reliable than any single call. See Load Testing for how frequency works.
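Why frequency helps can be shown with a small sketch. The call outcomes below are hypothetical, not real evaluator output; the point is that an aggregate over repeats is stabler than any single call:

```python
def pass_rate(outcomes: list[bool]) -> float:
    """Fraction of passing calls across the repeats of one evaluator."""
    return sum(outcomes) / len(outcomes)

# Hypothetical results for one evaluator run at frequency 5.
calls = [True, True, False, True, True]  # one flaky failure among five repeats
rate = pass_rate(calls)
# rate is 0.8 — a single call would have reported an all-or-nothing 1.0 or 0.0,
# so one unlucky repeat would have looked like a hard regression.
```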

Keep the Test Set Fixed

Don’t add or remove evaluators between A and B. If you want to expand coverage, do that first, re-establish the baseline, and then run the A/B test.

Compare More Than Pass Rate

Two runs can have the same pass rate while differing dramatically on latency, talk ratio, or a specific workflow metric. The compare view surfaces all of these — don’t stop at the headline number.
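A toy example makes this concrete. The per-call records below are invented for illustration (the field names `passed` and `latency_ms` are assumptions, not Cekura's schema): two runs with identical pass rates can still diverge sharply on latency.

```python
from statistics import mean

# Hypothetical per-call records for two runs of the same evaluators.
run_a = [{"passed": True, "latency_ms": 820}, {"passed": False, "latency_ms": 790},
         {"passed": True, "latency_ms": 805}]
run_b = [{"passed": True, "latency_ms": 1450}, {"passed": False, "latency_ms": 1380},
         {"passed": True, "latency_ms": 1500}]

def summary(calls: list[dict]) -> dict[str, float]:
    """Headline pass rate plus mean latency for one run."""
    return {"pass_rate": sum(c["passed"] for c in calls) / len(calls),
            "mean_latency_ms": mean(c["latency_ms"] for c in calls)}

# Both runs pass 2 of 3 calls, but B is markedly slower — a regression that
# the headline pass rate alone would hide.
```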