These are standard metrics, defined by Cekura, that apply across domains. They are organized into four categories based on which aspect of your voice agent they evaluate.
These metrics evaluate whether your agent provides correct and consistent information.
Expected Outcome
Result: Pass / Review Required / Failed | Cost: 0 credits
Evaluates whether the Main Agent achieved the goal specified in the evaluator’s expected outcome prompt. An LLM analyzes the conversation to determine if the intended outcome was reached.
Requirements: Set expected_outcome_prompt in your evaluator configuration describing what success looks like.
Interpretation:
Failed: Main Agent did not achieve the expected outcome
Hallucination
Result: True/False | Cost: 0.6 credits per call
Detects when the Main Agent provides information that contradicts or isn’t supported by the uploaded Knowledge Base files. A Knowledge Base is a collection of files uploaded to the agent containing reference information for fact-checking. An LLM compares Main Agent responses against the Knowledge Base content.
Requirements: Upload Knowledge Base files to your agent containing the source of truth.
Interpretation:
True: No hallucinations detected (Main Agent stayed factual)
False: Main Agent provided unsupported or contradictory information
Relevancy
Result: True/False | Cost: 0.2 credits per call
Evaluates whether the Main Agent’s responses were relevant and appropriate to the conversation context. An LLM analyzes each response to determine if it addressed what the Testing Agent was asking.
Interpretation:
True: Responses were relevant and on-topic
False: Main Agent gave off-topic or inappropriate responses
Response Consistency
Result: True/False | Cost: 0.2 credits per call
Detects inconsistencies in the Main Agent’s responses during a call. An LLM checks for two specific issues:
Testing Agent provides information (e.g., their name) and the Main Agent repeats it back incorrectly
Main Agent makes contradictory statements (e.g., says one thing early in the call, then contradicts it later)
Interpretation:
True: Main Agent maintained consistent information throughout
False: Inconsistencies or contradictions detected
Tool Call Success
Result: True/False | Cost: 0 credits
Checks whether any tool calls made by the Main Agent resulted in errors.
Interpretation:
True: All tool calls succeeded
False: One or more tool calls returned an error
Transcription Accuracy
Result: Score (0-100) | Cost: 0 credits (Runs) / 1.0 credits per minute (Call Logs)
Evaluates speech-to-text accuracy differently depending on the context:
For Call Logs: Uses two separate state-of-the-art transcription models to generate ground truth transcriptions. Compares these against the candidate transcript to find inconsistencies, with the score based on the number of errors found.
For Runs/Simulations: Compares the provider transcript (what the Main Agent’s TTS actually said) against the Cekura ground truth transcript (what the Main Agent should have said). This makes it the recommended metric for detecting TTS content fidelity errors. For example, if the agent’s TTS renders “2:30pm” as “two thirteen”, or drops a digit from a number, the WER metric will catch it. Errors in names, nouns, and numbers count as 1.0. Verb errors count as 0.5. Other words are ignored. Scoring: 5 = 0 errors, 4 = 1-3 errors, 3 = 4-6 errors, 2 = 7-12 errors, 1 = 13+ errors.
The explanation also includes the standard Word Error Rate (WER) percentage.
Interpretation: Higher scores indicate better transcription accuracy. Use this metric when you need to verify that the agent’s TTS is rendering content correctly (correct words, correct numbers), not just that speech is audible.
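For readers unfamiliar with the standard Word Error Rate mentioned above, it is conventionally the word-level edit distance between a reference and a hypothesis transcript, divided by the number of reference words. The following is a minimal sketch of that textbook definition, not Cekura's implementation; the function name `word_error_rate` and the lowercase, whitespace-based tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumption: simple lowercase + whitespace tokenization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of seven reference words.
wer = word_error_rate(
    "your appointment is at two thirty pm",
    "your appointment is at two thirteen pm",
)
print(f"WER: {wer:.1%}")
```

Note that plain WER weights every word equally, whereas the score described above weights names, nouns, and numbers more heavily than verbs and ignores other words.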
Voicemail Detection (Beta)
Result: True/False | Cost: 0.2 credits per call
Detects whether the call reached a voicemail system instead of a live person. An LLM analyzes the transcript for voicemail indicators (greeting messages, beeps, recording prompts).
Interpretation:
These metrics evaluate the flow and dynamics of the conversation.
AI Interrupting User
Result: Count (number of interruptions) | Cost: 0 credits
Counts how many times the Main Agent started speaking while the Testing Agent was still talking. An interruption is when one speaker begins talking while the other is still speaking. Uses Voice Activity Detection (VAD) to find the timestamps for each turn of each speaker and precisely detect overlapping speech.
Requirements: Stereo audio recording with separate channels for each speaker.
Interpretation: Lower is better. Frequent interruptions indicate the Main Agent isn’t properly waiting for the Testing Agent to finish speaking.
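To illustrate the overlap idea, here is a minimal sketch that counts interruptions from per-speaker speech segments. It assumes VAD has already produced `(start, end)` timestamps in seconds for each channel; the `Segment` type and the `count_interruptions` function are hypothetical names, and the actual detection logic used by Cekura may differ.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_s, end_s) of one stretch of speech


def count_interruptions(interrupter: List[Segment], speaker: List[Segment]) -> int:
    """Count interrupter segments that begin while the other party is mid-speech."""
    count = 0
    for start, _ in interrupter:
        # An interruption: this segment starts strictly inside a speaker segment.
        if any(s < start < e for s, e in speaker):
            count += 1
    return count


main_agent = [(0.0, 2.0), (4.5, 6.0), (9.0, 11.0)]
testing_agent = [(2.5, 5.0), (6.5, 8.5)]

# The Main Agent segment starting at 4.5s falls inside the Testing Agent's 2.5-5.0s turn.
print(count_interruptions(main_agent, testing_agent))
```

Swapping the two arguments gives the count in the opposite direction (Testing Agent interrupting the Main Agent).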
Stop Time After User Interruption (ms)
Result: Numeric (milliseconds) | Cost: 0 credits
Measures how long it takes for the Main Agent to stop speaking after the Testing Agent interrupts. Uses VAD to find the timestamps for each turn and detect when the Testing Agent starts speaking over the Main Agent.
Requirements: Stereo audio recording with separate channels for each speaker.
Interpretation: Lower is better. A responsive Main Agent should stop quickly (under 500ms) when interrupted.
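As a rough sketch of the measurement, the stop time for one interruption can be taken as the gap between the moment the Testing Agent starts talking over the Main Agent and the moment the Main Agent's segment ends. The function name `stop_times_ms` and the `(start, end)` segment representation in seconds are assumptions for illustration; how individual stop times are aggregated into the reported value is not specified here.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_s, end_s) of one stretch of speech


def stop_times_ms(main_agent: List[Segment], testing_agent: List[Segment]) -> List[int]:
    """For each Testing Agent interruption, ms until the Main Agent stops talking."""
    times = []
    for user_start, _ in testing_agent:
        for agent_start, agent_end in main_agent:
            # The Testing Agent began speaking while this Main Agent segment was active.
            if agent_start < user_start < agent_end:
                times.append(round((agent_end - user_start) * 1000))
                break
    return times


main_agent = [(0.0, 3.2), (6.0, 9.0)]
testing_agent = [(2.9, 5.0), (8.2, 10.0)]
print(stop_times_ms(main_agent, testing_agent))
```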
User Interrupting AI
Result: Count (number of interruptions) | Cost: 0 credits
Counts how many times the Testing Agent started speaking while the Main Agent was still talking. Uses VAD to find the timestamps for each turn and detect overlapping speech.
Requirements: Stereo audio recording with separate channels for each speaker.
Interpretation: High counts may indicate Testing Agent frustration, Main Agent speaking too long, or poor turn-taking.
Latency (in ms)
Result: Numeric (average milliseconds) | Cost: 0 credits
Measures the response time between the Testing Agent finishing speaking and the Main Agent starting its response. Latency is calculated for every Main Agent turn throughout the call. With stereo audio, VAD is used to find precise timestamps for each speaker turn.
We compute and display percentile statistics: P25, P50, P75, P90, P95, and P99.
Interpretation: Lower is better. Latency under 2000ms is generally good.
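The sketch below shows one way to summarize per-turn latencies into an average and the listed percentiles. It uses the nearest-rank percentile definition, which is an assumption; the interpolation method Cekura uses is not stated, so values on small samples may differ. The names `nearest_rank_percentile` and `latency_summary` are hypothetical.

```python
import math
from typing import Dict, List


def nearest_rank_percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_summary(latencies_ms: List[float]) -> Dict[str, float]:
    """Average plus P25/P50/P75/P90/P95/P99 over per-turn latencies."""
    if not latencies_ms:
        raise ValueError("at least one latency measurement is required")
    summary = {"mean": sum(latencies_ms) / len(latencies_ms)}
    for p in (25, 50, 75, 90, 95, 99):
        summary[f"P{p}"] = nearest_rank_percentile(latencies_ms, p)
    return summary


# Hypothetical per-turn latencies in milliseconds.
latencies = [800, 1200, 950, 2100, 1500, 700, 1100, 1300]
for name, value in latency_summary(latencies).items():
    print(f"{name}: {value}")
```

Looking at the upper percentiles alongside the average helps surface occasional slow turns that the average alone would hide.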
Unnecessary Repetition Count
Result: Count (number of repetitions) | Cost: 0.2 credits per call
Identifies instances where the Main Agent unnecessarily repeated information it had already provided. An LLM analyzes the conversation for redundant statements.
Interpretation: Lower is better. Repetition wastes time and can frustrate the Testing Agent.
Verbosity
Result: Score (1-5) | Cost: 0.2 credits per call
Evaluates how appropriately concise the Main Agent is across the conversation. An LLM classifies each Main Agent turn as Concise, Padded (filler, restated questions, repeated information, disclaimer spam), or Overloaded (more information, options, or follow-up questions than the user asked for), then aggregates to a 1-5 score. Necessary detail the user explicitly asked for is not penalised.
Interpretation:
5: Consistently concise and well-calibrated to user intent
4: Mostly concise; a few overlong turns
3: Mixed; several turns are too long
2: Verbosity is the dominant issue across most turns
1: Agent is bloated throughout
Detect Silence in Conversation
Result: True/False | Cost: 0 credits
Detects prolonged silence periods where both the Main Agent and Testing Agent are silent, which may indicate technical issues or agent problems.
Requirements: Audio recording.
Configuration: Set silence_duration in the metric configuration (default: 10 seconds).
Interpretation:
True: No problematic silence detected
False: Extended mutual silence exceeding the threshold was detected
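A minimal sketch of the mutual-silence check, assuming speech segments for both speakers are available as `(start, end)` timestamps in seconds. The function name `silence_check_passes` is hypothetical, only the `silence_duration` parameter and its 10-second default come from the description above, and this sketch considers gaps between speech only (not silence before the first or after the last segment).

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_s, end_s) of one stretch of speech


def silence_check_passes(
    main_agent: List[Segment],
    testing_agent: List[Segment],
    silence_duration: float = 10.0,
) -> bool:
    """Return True when no gap with both speakers silent exceeds silence_duration."""
    segments = sorted(main_agent + testing_agent)
    if not segments:
        return True  # assumption: with no speech at all there is no gap to measure

    covered_until = segments[0][1]
    for start, end in segments[1:]:
        # Nobody has been speaking since covered_until; measure that gap.
        if start - covered_until > silence_duration:
            return False
        covered_until = max(covered_until, end)
    return True


main_agent = [(0.0, 5.0), (18.0, 20.0)]
testing_agent = [(5.5, 7.0)]

# 11 seconds of mutual silence between 7.0s and 18.0s exceeds the 10-second default.
print(silence_check_passes(main_agent, testing_agent))
```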
Infrastructure Issues
Result: True/False | Cost: 0 credits
Detects when the Main Agent fails to respond within the configured timeout after the Testing Agent finishes their turn, indicating potential infrastructure or connectivity problems.
Requirements: Audio recording.
Configuration: Set infra_issues_timeout in the metric configuration (default: 10 seconds).
Interpretation:
True: No infrastructure issues detected
False: Main Agent failed to respond within the timeout after the Testing Agent finished speaking
Appropriate Termination by Main Agent
Result: True/False | Cost: 0.2 credits per call
Evaluates whether the Main Agent ended the call appropriately. An LLM analyzes whether the Main Agent wrapped up the conversation properly before ending.
Interpretation:
True: Call was ended appropriately by the Main Agent
False: Main Agent ended call abruptly or inappropriately
Appropriate Termination by Testing Agent
Result: True/False | Cost: 0.2 credits per call
Evaluates whether the Testing Agent ended the call early, which may indicate poor experience or unresolved issues. An LLM analyzes the conversation to determine if the call ended prematurely.
Interpretation:
These metrics evaluate the Testing Agent’s experience and satisfaction with the conversation.
CSAT
Result: Score (0-100) | Cost: 0.2 credits per call
Evaluates overall customer satisfaction based on two dimensions:
1. Customer Sentiment - Evaluates the Testing Agent’s (customer’s) tone throughout the call:
Positive: Clear expressions of gratitude like “Thank you so much for your help” = 5 points
Negative: Explicit frustration, harsh language, complaints = 1 point
2. Cooperation - Evaluates whether the Main Agent helped the Testing Agent:
Fully cooperative / No issues = 5 points
Somewhat uncooperative = 3 points
Refused to help / Obstructed = 1 point
The final score is the average of both dimensions, scaled to 0-100.
Interpretation: Higher is better. Scores above 70 indicate good customer satisfaction.
Dropoff Node
Result: Enum (one of your configured stages) | Cost: 0.2 credits per call
Identifies at which stage of the conversation the call dropped or ended. An LLM maps the conversation endpoint to one of your predefined stages.
Requirements: Configure dropoff_nodes on your agent with the conversation stages you want to track (e.g., “greeting”, “information_gathering”, “resolution”, “closing”).
Interpretation: Helps identify where in your conversation flow Testing Agents are dropping off, enabling targeted improvements.
Sentiment
Result: Enum (positive / neutral / negative) | Cost: 0.2 credits per call
Determines the Testing Agent’s overall sentiment toward the Main Agent based on the conversation transcript. An LLM analyzes the Testing Agent’s language, tone, and responses.
Classification criteria:
Positive: Only when the Testing Agent is clearly very grateful with phrases like “Thank you so much for your help”, “I really appreciate this”, “You’ve been so helpful”
Negative: Explicit frustration, harsh language, complaints, or aggressive tone
Neutral: Simple “thanks”, cooperative tone, matter-of-fact responses, or when sentiment is unclear
Note: Neutral is the default value when sentiment is uncertain.
Interpretation:
Positive: Testing Agent seemed very satisfied or grateful
Neutral: Testing Agent showed no strong emotion
Negative: Testing Agent seemed frustrated or dissatisfied
Topic of Call
Result: Enum (one of your configured topics) | Cost: 0.2 credits per call
Categorizes the call into one of your predefined topics. An LLM analyzes the conversation to determine the primary subject matter.
Requirements: Configure topic_nodes on your agent with the topics you want to track (e.g., “billing”, “technical_support”, “sales”, “general_inquiry”).
Interpretation: Helps understand call volume distribution across different topics for resource planning and analysis.
These metrics evaluate the audio and speech characteristics of the Main Agent.
Average Pitch (in Hz)
Result: Numeric (Hertz) | Cost: 0 credits
Measures the average pitch frequency of the Main Agent’s voice during the call using pitch extraction algorithms.
Requirements: Audio recording.
Gibberish Detection (Beta)
Result: True/False | Cost: 0.2 credits per minute
Detects nonsensical or garbled speech from the Main Agent. A multimodal LLM analyzes the Main Agent’s audio to identify unintelligible segments, nonsense sounds, or garbled speech.
Requirements: Stereo audio recording with separate channels.
Interpretation:
True: Speech was clear and intelligible
False: Gibberish or garbled speech detected
Letterwise Pronunciation
Result: True/False | Cost: 0.2 credits per call
Checks whether certain words were spelled out letter-by-letter correctly in the audio (e.g., spelling out “J-O-H-N” for a name). A multimodal LLM analyzes the Main Agent’s audio to verify spelling.
Requirements: Audio recording. Configure spelling_word_types on your agent specifying which types of words should be spelled out (e.g., “name”, “email”, “confirmation_code”).
Interpretation:
True: Every instance of every word of the selected category was correctly spelled out in the audio
False: Spelling errors detected or words not spelled out when required
Pronunciation Check (Beta)
Result: Score (0-100) | Cost: 0.2 credits per call
Evaluates pronunciation accuracy for specific words you define.
Requirements: Audio recording. Configure pronunciation_words on your agent as a list of word-phoneme pairs (e.g., [["Cekura", "suh-KYUR-uh"]]).
Interpretation: Higher scores indicate better pronunciation accuracy. Useful for brand names or technical terms.
Speaking Rate (Beta)
Result: True/False | Cost: 0.2 credits per call
Detects abrupt or unnatural changes in the Main Agent’s speaking rate during the call using an ML model.
Requirements: Audio recording. Currently supports English only.
Interpretation:
True: Speaking rate was consistent and natural
False: Unnatural speaking rate changes detected
Talk Ratio
Result: Numeric (0.0 to 1.0) | Cost: 0 credits
Calculates the ratio of Main Agent speaking time to total call duration. Uses VAD to find the timestamps for each turn of each speaker for accurate speaker separation.
Requirements: Stereo audio recording with separate channels.
Interpretation: A ratio around 0.4-0.6 is typical. Very high ratios may indicate the Main Agent is dominating the conversation; very low ratios may indicate the Main Agent isn’t being helpful enough.
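As an illustration of the ratio, the sketch below sums the Main Agent's speech segments and divides by the total call duration. The `talk_ratio` function and the `(start, end)` segment format in seconds are assumptions for this example, and it presumes the segments do not overlap each other.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_s, end_s) of one stretch of speech


def talk_ratio(main_agent: List[Segment], call_duration_s: float) -> float:
    """Main Agent speaking time divided by total call duration."""
    if call_duration_s <= 0:
        raise ValueError("call duration must be positive")
    speaking_s = sum(end - start for start, end in main_agent)
    return speaking_s / call_duration_s


# 25 seconds of Main Agent speech in a 60-second call.
main_agent = [(0.0, 10.0), (20.0, 35.0)]
print(f"{talk_ratio(main_agent, call_duration_s=60.0):.2f}")
```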
Voice Change Detection (Beta)
Result: True/False | Cost: 0.2 credits per call
Detects unexpected speaker changes during the Main Agent’s speaking turns using an ML model for voice analysis.
Requirements: Audio recording.
Interpretation:
True: Consistent speaker throughout Main Agent turns
False: Unexpected voice change detected (may indicate system issues)
Voice Tone + Clarity
Result: Score (0-100) | Cost: 0.2 credits per call
Evaluates the overall voice quality of the Main Agent’s audio using an ML model. Specifically analyzes clarity (how clear and understandable the voice is) and jitter (variations in audio timing that can affect quality) on the Main Agent’s audio channel.
Requirements: Audio recording.
Interpretation: Higher is better. Scores above 70 indicate good voice quality. Low scores may indicate audio issues, background noise, or voice synthesis problems.
Words Per Minute (WPM)
Result: Numeric (words per minute) | Cost: 0 credits
Calculates the Main Agent’s speaking speed based on transcript word count and speaking duration.
Requirements: Audio recording.
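The calculation can be sketched as transcript word count divided by speaking time in minutes. The function name `words_per_minute` and the whitespace-based word splitting are assumptions for illustration; the exact tokenization used in production is not specified.

```python
from typing import List


def words_per_minute(main_agent_utterances: List[str], speaking_duration_s: float) -> float:
    """Transcript word count divided by speaking duration, expressed per minute."""
    if speaking_duration_s <= 0:
        raise ValueError("speaking duration must be positive")
    # Assumption: words are whitespace-separated tokens.
    word_count = sum(len(utterance.split()) for utterance in main_agent_utterances)
    return word_count * 60 / speaking_duration_s


utterances = [
    "Hello, how can I help you today?",  # 7 words
    "Your appointment is confirmed.",    # 4 words
]
print(words_per_minute(utterances, speaking_duration_s=6.0))
```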