Overview
Building a robust testing suite for your AI voice agent doesn’t have to be overwhelming. We’ve found that many successful users follow a simple, iterative approach that yields high-quality test cases in just one cycle.
The Recommended Workflow
Here is a workflow that many of our users follow to derive significant value:
1. Generate 10 test cases
Start by creating 10 diverse test cases that cover different scenarios your agent might encounter.
What to include:
- Common user requests (happy path scenarios)
- Edge cases (unusual but valid requests)
- Error conditions (invalid inputs, missing information)
- Different user personalities and communication styles
Use Cekura’s AI-powered scenario generation to quickly create varied test cases based on your agent’s purpose.
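Whether the cases are generated or written by hand, it helps to think of each one as a small structured record with a scenario, a persona, and an expected outcome. A minimal sketch in Python, using a hypothetical shape rather than Cekura’s actual schema:

```python
# Hypothetical test-case records -- illustrative only, not Cekura's actual schema.
test_cases = [
    {
        "name": "happy_path_simple_booking",
        "scenario": "Caller wants a table for 2 tonight at 7 PM.",
        "persona": "Polite and direct.",
        "expected_outcome": "Agent books the table and confirms date, time, and party size.",
    },
    {
        "name": "error_missing_information",
        "scenario": "Caller asks to book a table but never states a time.",
        "persona": "Vague and easily distracted.",
        "expected_outcome": "Agent asks a clarifying question and only books once a time is given.",
    },
    # ...eight more covering edge cases and different communication styles
]
```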
2. Run them
Execute all 10 test cases against your agent.
During the run:
- Let each conversation complete naturally
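If you trigger runs from a script rather than the dashboard, this step is conceptually a simple loop. A minimal sketch, where `run_test_case` is a hypothetical placeholder for however you start a test call (it is not a Cekura SDK function):

```python
# Minimal sketch of running a whole suite. `run_test_case` is a hypothetical
# placeholder for your own mechanism of starting a test conversation; it is
# not a Cekura SDK call.

def run_test_case(case: dict) -> dict:
    """Start one test conversation and return its result (wire up yourself)."""
    raise NotImplementedError

def run_suite(test_cases: list[dict]) -> list[dict]:
    results = []
    for case in test_cases:
        # Let each conversation complete naturally -- no early hang-ups.
        result = run_test_case(case)
        results.append({"name": case["name"], **result})
    return results
```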
3. Review the failed calls
Analyze the conversations where your agent didn’t meet the expected outcome.
What to look for:
- Why did the conversation fail?
- Did the agent misunderstand the request?
- Was information missing or incorrect?
- Did the agent handle edge cases poorly?
- Were there technical issues (latency, interruptions)?
If a call is marked as a failure but you believe it should be successful, check these two things (a small triage sketch follows this checklist):
- Is the expected outcome prompt correct and clear?
  - If not: Edit the expected outcome prompt directly from inside the run
  - Re-evaluate the call until it passes
  - Hit Save to update the evaluator with the corrected expected outcome
- Did the testing agent follow the instructions provided?
  - If not: Review our Evaluator Instructions Guide for best practices
  - Still having issues? Reach out to us at support@cekura.ai
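If you export run results for offline review, a few lines of scripting keep this triage systematic. A minimal sketch, assuming each result is a dict with passed, expected_outcome, and transcript fields; the field names are assumptions, not Cekura’s actual export format:

```python
# Print every failed call alongside its expected outcome so it can be checked
# against the two questions above. Field names ("passed", "expected_outcome",
# "transcript") are assumptions, not Cekura's export format.

def triage_failures(results: list[dict]) -> None:
    failed = [r for r in results if not r.get("passed", False)]
    print(f"{len(failed)} of {len(results)} calls failed\n")
    for r in failed:
        print(f"Case: {r.get('name', 'unnamed')}")
        print(f"Expected outcome: {r.get('expected_outcome', '')}")
        print(f"Transcript excerpt: {r.get('transcript', '')[:200]}")
        print("-" * 40)

# Toy example:
triage_failures([
    {"name": "cancel_reservation", "passed": False,
     "expected_outcome": "Agent cancels the reservation and confirms.",
     "transcript": "Caller: I'd like to cancel my booking... Agent: Which date was that for?"},
    {"name": "simple_booking", "passed": True,
     "expected_outcome": "Agent books a table for 2 and confirms.",
     "transcript": "Caller: Table for two tonight at seven..."},
])
```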
Why This Works
After just one iteration of this exercise, you will have 10 very good test cases you can always rely on. Here’s what makes this approach effective:
1. Real-World Validation
Your test cases are validated against actual agent behavior, not theoretical scenarios. You know exactly how your agent responds.
2. Failure-Driven Refinement
Failed calls help you:
- Refine your agent’s prompts and logic
- Identify missing features or capabilities
- Improve error handling
- Adjust expected outcomes to be more realistic
3. Regression Testing Foundation
Once refined, these 10 test cases become your regression test suite. Run them after every agent update to ensure you haven’t broken existing functionality.
4. Iterative Improvement
Each cycle of this workflow compounds your testing quality:
- Cycle 1: Establish baseline, fix obvious issues
- Cycle 2: Handle edge cases better
- Cycle 3: Optimize performance and user experience
Expanding Your Test Suite
After your initial 10 test cases are solid, you can expand strategically:
Add Personality Variations
Test the same scenarios with different user personalities and conditions (patient, frustrated, background noise)
Cover More Scenarios
Generate additional test cases for less common but important use cases
Test Profile Variations
Use different test profiles to validate identity verification flows
Stress Testing
Add load testing to ensure your agent performs under high traffic
Best Practices
Start Simple
Don’t try to cover every possible scenario on day one. Start with 10 good test cases and build from there.
Be Specific with Expected Outcomes
Vague expected outcomes make it hard to evaluate success. Instead of “Agent handles the request well,” use “Agent cancels the appointment and provides a confirmation number.”
Use Realistic Instructions
Your evaluator instructions should mimic how real users would interact with your agent. Avoid overly scripted or robotic instructions.
Review Passed Calls Too
Don’t only focus on failures. Review successful calls to understand what your agent does well and ensure the success wasn’t accidental.
Maintain Your Test Suite
As your agent evolves, update your test cases and expected outcomes to reflect new capabilities and requirements.
Example: Building Your First 10 Test Cases
Let’s say you’re testing a restaurant reservation AI agent. Here’s a balanced set of 10 test cases:
| # | Scenario Type | Description |
|---|---|---|
| 1 | Happy Path | Make a reservation for 2 people tonight at 7 PM |
| 2 | Happy Path | Make a reservation for 4 people next Friday at 6:30 PM |
| 3 | Date Clarification | “I want to book a table for Saturday” (this Saturday or next?) |
| 4 | Time Unavailable | Request a time slot that’s fully booked |
| 5 | Modification | Change an existing reservation time |
| 6 | Cancellation | Cancel an existing reservation |
| 7 | Information Request | Ask about menu options or special dietary accommodations |
| 8 | Large Party | Request reservation for 10+ people |
| 9 | Interrupted User | User with background noise and interruptions |
| 10 | Non-Native Speaker | User with slower pace and accent |
This mix gives you roughly:
- 40% standard scenarios (1, 2, 5, 6)
- 30% clarification and error handling (3, 4, 7)
- 10% edge cases (8)
- 20% challenging conditions (9, 10)
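If you keep the suite in version control, the same table can live next to your agent as plain data. A sketch of the full set with the category mix tallied; the field names and category labels are illustrative, not a required format:

```python
from collections import Counter

# The restaurant-reservation suite from the table above, expressed as data.
# Field names and category labels are illustrative only.
SUITE = [
    {"id": 1, "category": "standard", "scenario": "Reservation for 2 people tonight at 7 PM"},
    {"id": 2, "category": "standard", "scenario": "Reservation for 4 people next Friday at 6:30 PM"},
    {"id": 3, "category": "clarification", "scenario": "'Book a table for Saturday' (this Saturday or next?)"},
    {"id": 4, "category": "clarification", "scenario": "Request a fully booked time slot"},
    {"id": 5, "category": "standard", "scenario": "Change an existing reservation time"},
    {"id": 6, "category": "standard", "scenario": "Cancel an existing reservation"},
    {"id": 7, "category": "clarification", "scenario": "Ask about menu options or dietary accommodations"},
    {"id": 8, "category": "edge_case", "scenario": "Reservation for 10+ people"},
    {"id": 9, "category": "challenging", "scenario": "Background noise and interruptions"},
    {"id": 10, "category": "challenging", "scenario": "Slower pace and accent"},
]

# Sanity-check the mix: 4 standard, 3 clarification, 1 edge case, 2 challenging.
print(Counter(case["category"] for case in SUITE))
```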
Measuring Success
After running your workflow, you should aim for:
- 70-80% pass rate on first run (realistic baseline)
- 90-95% pass rate after refining based on failures
- 95%+ pass rate as your long-term regression suite
Don’t aim for 100%: Real-world conversations are unpredictable. Some variability is normal and healthy. Focus on consistency in core functionality.
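If the regression suite runs in CI, these targets translate directly into a gate. A minimal sketch, assuming you can extract a list of pass/fail booleans from a run (how you obtain them depends on your setup):

```python
# Fail the build when the suite drops below the chosen target.
# `passed_flags` is assumed to come from your own results export.

def check_pass_rate(passed_flags: list[bool], threshold: float = 0.95) -> None:
    rate = sum(passed_flags) / len(passed_flags)
    print(f"Pass rate: {rate:.0%} (target: {threshold:.0%})")
    if rate < threshold:
        raise SystemExit(f"Regression suite below threshold: {rate:.0%} < {threshold:.0%}")

# Example: 9 of 10 cases passed, checked against the post-refinement 90% target.
check_pass_rate([True] * 9 + [False], threshold=0.90)
```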
Next Steps
Once you have your reliable 10 test cases:
- Schedule Regular Runs: Set up cron jobs to run your tests automatically (see the sketch after this list)
- Monitor Metrics: Track performance over time using metrics
- Iterate on Failures: Continuously refine your agent based on test results
- Expand Coverage: Gradually add more test cases for comprehensive coverage
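One way to set up scheduled runs is a small wrapper script invoked by cron. The script name, paths, and schedule below are examples only; replace the body with however you trigger a run in your own setup:

```python
# Hypothetical nightly runner, e.g. saved as run_nightly_suite.py and scheduled
# with a crontab entry like (every night at 02:00; paths are examples only):
#   0 2 * * * /usr/bin/python3 /opt/agent-tests/run_nightly_suite.py >> /var/log/agent-tests.log 2>&1

import datetime

def main() -> None:
    started = datetime.datetime.now().isoformat(timespec="seconds")
    print(f"[{started}] starting nightly regression run")
    # Trigger the run here with your own runner, then gate on pass rate,
    # e.g. reusing run_suite() and check_pass_rate() from the sketches above.

if __name__ == "__main__":
    main()
```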