
Overview

Building a robust testing suite for your AI voice agent doesn’t have to be overwhelming. We’ve found that many successful users follow a simple, iterative approach that yields high-quality test cases in just one cycle. Here is the workflow that helps many of our users derive significant value:
Step 1: Generate 10 test cases

Start by creating 10 diverse test cases that cover different scenarios your agent might encounter. What to include:
  • Common user requests (happy path scenarios)
  • Edge cases (unusual but valid requests)
  • Error conditions (invalid inputs, missing information)
  • Different user personalities and communication styles
Use Cekura’s AI-powered scenario generation to quickly create varied test cases based on your agent’s purpose.
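
If you prefer to draft your test cases outside the platform first, a plain data structure is enough to capture this mix. Here is a minimal sketch in Python; the field names (name, instructions, expected_outcome) are illustrative assumptions, not a required schema:

```python
# Illustrative sketch only: the field names are assumptions, not a fixed schema.
# Each test case pairs user-side instructions with a concrete expected outcome.
test_cases = [
    {
        "name": "Happy path: simple booking",
        "instructions": "Ask to book a table for 2 people tonight at 7 PM.",
        "expected_outcome": "Agent confirms a reservation for 2 at 7 PM tonight.",
    },
    {
        "name": "Edge case: large party",
        "instructions": "Ask to reserve a table for 12 people next Saturday.",
        "expected_outcome": "Agent explains the large-party policy or offers to escalate to staff.",
    },
    {
        "name": "Error condition: missing information",
        "instructions": "Ask to cancel a reservation but withhold the confirmation number at first.",
        "expected_outcome": "Agent asks for the missing details and only cancels once they are provided.",
    },
    # ... continue until you have 10 cases covering the categories above.
]

for case in test_cases:
    print(f'{case["name"]}: expects "{case["expected_outcome"]}"')
```
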
Step 2: Run them

Execute all 10 test cases against your agent. During the run:
  • Let each conversation complete naturally
This gives you a baseline understanding of how your agent performs across different scenarios.
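
In code, the run step can be as simple as a loop that executes each case and collects the results for review. A hedged sketch, where run_test_case is a hypothetical stand-in for whatever SDK method or API call your testing platform exposes:

```python
# Sketch of a run loop. run_test_case() is a hypothetical placeholder for the
# call your testing platform provides (SDK method, REST request, etc.).
def run_test_case(case: dict) -> dict:
    # A real implementation would start a simulated call, wait for the
    # conversation to complete naturally, and return the verdict + transcript.
    return {"name": case["name"], "passed": True, "transcript": "..."}

def run_suite(test_cases: list[dict]) -> list[dict]:
    results = []
    for case in test_cases:
        results.append(run_test_case(case))
    return results

# Example with two minimal cases; in practice, reuse your 10 drafted cases.
results = run_suite([{"name": "simple booking"}, {"name": "cancellation"}])
print(f"{sum(r['passed'] for r in results)}/{len(results)} passed")
```
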
Step 3: Review the failed calls

Analyze the conversations where your agent didn’t meet the expected outcome. What to look for:
  • Why did the conversation fail?
  • Did the agent misunderstand the request?
  • Was information missing or incorrect?
  • Did the agent handle edge cases poorly?
  • Were there technical issues (latency, interruptions)?
If a call is marked as a failure but you believe it should have passed, check these two things:
  1. Is the expected outcome prompt correct and clear?
    • If not: Edit the expected outcome prompt directly from inside the run
    • Re-evaluate the call until it passes
    • Hit Save to update the evaluator with the corrected expected outcome
  2. Did the testing agent follow the instructions provided?
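To make the first pass of this review faster, you can script the triage. The small sketch below assumes results shaped like the run-loop example above; the failure_reason field is an illustrative assumption, not a platform schema:

```python
# Sketch of a failure triage pass over run results.
results = [
    {"name": "Happy path: simple booking", "passed": True, "failure_reason": None},
    {"name": "Edge case: large party", "passed": False,
     "failure_reason": "Agent never mentioned the large-party policy."},
]

for r in (r for r in results if not r["passed"]):
    print(f"FAILED: {r['name']}")
    print(f"  reason: {r['failure_reason']}")
    # While reading the transcript, ask:
    # - Did the agent misunderstand the request?
    # - Is the expected-outcome prompt itself unclear? If so, edit it inside
    #   the run, re-evaluate until it passes, then save the evaluator.
```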

Why This Works

After just one iteration of this exercise, you will have 10 very good test cases you can always rely on. Here’s what makes this approach effective:

1. Real-World Validation

Your test cases are validated against actual agent behavior, not theoretical scenarios. You know exactly how your agent responds.

2. Failure-Driven Refinement

Failed calls help you:
  • Refine your agent’s prompts and logic
  • Identify missing features or capabilities
  • Improve error handling
  • Adjust expected outcomes to be more realistic

3. Regression Testing Foundation

Once refined, these 10 test cases become your regression test suite. Run them after every agent update to ensure you haven’t broken existing functionality.

4. Iterative Improvement

Each cycle of this workflow compounds your testing quality:
  • Cycle 1: Establish baseline, fix obvious issues
  • Cycle 2: Handle edge cases better
  • Cycle 3: Optimize performance and user experience

Expanding Your Test Suite

After your initial 10 test cases are solid, you can expand strategically:

Add Personality Variations

Test the same scenarios with different caller personalities and conditions (patient, frustrated, background noise)
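
As a rough illustration, the same base scenario can be fanned out across personas before loading it into your platform; the persona field below is an assumed label, not a required schema:

```python
# Sketch: reuse one base scenario across several caller personas.
base_scenario = {
    "instructions": "Cancel tomorrow's 7 PM reservation.",
    "expected_outcome": "Agent cancels the reservation and confirms the cancellation.",
}

personas = ["patient", "frustrated", "speaking over background noise"]

variants = [
    {**base_scenario, "name": f"Cancellation ({p} caller)", "persona": p}
    for p in personas
]

for v in variants:
    print(v["name"])
```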

Cover More Scenarios

Generate additional test cases for less common but important use cases

Test Profile Variations

Use different test profiles to validate identity verification flows

Stress Testing

Add load testing to ensure your agent performs under high traffic
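
If you want a rough local approximation before setting up formal load tests, you can fire several simulated calls concurrently. A sketch, where run_test_case is again a hypothetical placeholder for your platform’s call trigger:

```python
# Sketch of a simple concurrency check: launch several simulated calls at once.
from concurrent.futures import ThreadPoolExecutor

def run_test_case(case_name: str) -> bool:
    # Placeholder for starting a call and returning its pass/fail verdict.
    return True

case_names = [f"concurrent booking #{i}" for i in range(20)]

# Run up to 10 calls in parallel to see how the agent behaves under load.
with ThreadPoolExecutor(max_workers=10) as pool:
    outcomes = list(pool.map(run_test_case, case_names))

print(f"{sum(outcomes)}/{len(outcomes)} calls passed under load")
```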

Best Practices

Start Simple

Don’t try to cover every possible scenario on day one. Start with 10 good test cases and build from there.

Be Specific with Expected Outcomes

Vague expected outcomes make it hard to evaluate success. Instead of “Agent handles the request well,” use “Agent cancels the appointment and provides a confirmation number.”

Use Realistic Instructions

Your evaluator instructions should mimic how real users would interact with your agent. Avoid overly scripted or robotic instructions.

Review Passed Calls Too

Don’t only focus on failures. Review successful calls to understand what your agent does well and ensure the success wasn’t accidental.

Maintain Your Test Suite

As your agent evolves, update your test cases and expected outcomes to reflect new capabilities and requirements.

Example: Building Your First 10 Test Cases

Let’s say you’re testing a restaurant reservation AI agent. Here’s a balanced set of 10 test cases:
#    Scenario Type          Description
1    Happy Path             Make a reservation for 2 people tonight at 7 PM
2    Happy Path             Make a reservation for 4 people next Friday at 6:30 PM
3    Date Clarification     “I want to book a table for Saturday” (this Saturday or next?)
4    Time Unavailable       Request a time slot that’s fully booked
5    Modification           Change an existing reservation time
6    Cancellation           Cancel an existing reservation
7    Information Request    Ask about menu options or special dietary accommodations
8    Large Party            Request a reservation for 10+ people
9    Interrupted User       User with background noise and interruptions
10   Non-Native Speaker     User with a slower pace and an accent
This mix covers:
  • 40% standard scenarios (1, 2, 5, 6)
  • 30% clarification and error handling (3, 4, 7)
  • 10% edge cases (8)
  • 20% challenging conditions (9, 10)
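
If you keep your scenarios as data, a quick tally like the sketch below (the category labels are illustrative) confirms the mix and helps keep it balanced as the suite grows:

```python
# Sketch: tally the scenario mix to sanity-check coverage before running.
from collections import Counter

scenario_types = {
    1: "standard", 2: "standard", 5: "standard", 6: "standard",
    3: "clarification/error handling", 4: "clarification/error handling",
    7: "clarification/error handling",
    8: "edge case",
    9: "challenging conditions", 10: "challenging conditions",
}

mix = Counter(scenario_types.values())
total = len(scenario_types)
for category, count in mix.items():
    print(f"{category}: {count}/{total} ({100 * count // total}%)")
```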

Measuring Success

After running your workflow, you should aim for:
  • 70-80% pass rate on first run (realistic baseline)
  • 90-95% pass rate after refining based on failures
  • 95%+ pass rate as your long-term regression suite
Don’t aim for 100%: Real-world conversations are unpredictable. Some variability is normal and healthy. Focus on consistency in core functionality.
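
Here is a small sketch of how you might check a run against these targets; the verdict list is example data, not real results:

```python
# Sketch: compute the pass rate for a run and compare it to the targets above.
verdicts = [True, True, True, False, True, True, True, True, False, True]

pass_rate = 100 * sum(verdicts) / len(verdicts)
print(f"Pass rate: {pass_rate:.0f}%")

if pass_rate < 70:
    print("Below the expected first-run baseline; review failures before expanding.")
elif pass_rate < 95:
    print("Reasonable baseline; keep refining the failed cases.")
else:
    print("Stable enough to serve as the long-term regression suite.")
```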

Next Steps

Once you have your reliable 10 test cases:
  1. Schedule Regular Runs: Set up cron jobs to run your tests automatically
  2. Monitor Metrics: Track performance over time using metrics
  3. Iterate on Failures: Continuously refine your agent based on test results
  4. Expand Coverage: Gradually add more test cases for comprehensive coverage