Pre-Defined Metrics

These are standard metrics, defined by Cekura, that apply across domains. They are organized into four categories based on which aspect of your voice agent they evaluate.

Accuracy Metrics

These metrics evaluate whether your agent provides correct and consistent information.

Expected Outcome

Result: Pass / Review Required / Failed | Cost: 0 credits

Evaluates whether the Main Agent achieved the goal specified in the evaluator’s expected outcome prompt. An LLM analyzes the conversation to determine if the intended outcome was reached.

Requirements: Set expected_outcome_prompt in your evaluator configuration describing what success looks like.

Interpretation:

Pass: Main Agent achieved the expected outcome
Review Required: Outcome unclear, manual review recommended
Failed: Main Agent did not achieve the expected outcome

Hallucination

Result: True/False | Cost: 0.6 credits per call

Detects when the Main Agent provides information that contradicts or isn’t supported by the uploaded Knowledge Base files. A Knowledge Base is a collection of files uploaded to the agent containing reference information for fact-checking. An LLM compares Main Agent responses against the Knowledge Base content.

Requirements: Upload Knowledge Base files to your agent containing the source of truth.

Interpretation:

True: No hallucinations detected (Main Agent stayed factual)
False: Main Agent provided unsupported or contradictory information

Relevancy

Result: True/False | Cost: 0.2 credits per call

Evaluates whether the Main Agent’s responses were relevant and appropriate to the conversation context. An LLM analyzes each response to determine if it addressed what the Testing Agent was asking.

Interpretation:

True: Responses were relevant and on-topic
False: Main Agent gave off-topic or inappropriate responses

Response Consistency

Result: True/False | Cost: 0.2 credits per call

Detects inconsistencies in the Main Agent’s responses during a call. An LLM checks for two specific issues:

Testing Agent provides information (e.g., their name) and the Main Agent repeats it back incorrectly
Main Agent makes contradictory statements (e.g., says one thing early in the call, then contradicts it later)

Interpretation:

True: Main Agent maintained consistent information throughout
False: Inconsistencies or contradictions detected

Tool Call Success

Result: True/False | Cost: 0 credits

Checks whether any tool calls made by the Main Agent resulted in errors.

Interpretation:

True: All tool calls succeeded
False: One or more tool calls returned an error

Transcription Accuracy

Result: Score (0-100) | Cost: 0 credits (Runs) / 1.0 credits per minute (Call Logs)

Evaluates speech-to-text accuracy differently depending on the context:

For Call Logs: Uses two separate state-of-the-art transcription models to generate ground truth transcriptions. Compares these against the candidate transcript to find inconsistencies, with the score based on the number of errors found. No text normalisation is applied before scoring; the raw transcript is sent directly to the LLM judge so it can assess exactly what was transcribed.

For Runs/Simulations: Measures how accurately the provider’s speech-to-text transcribed the Testing Agent (the simulated caller). It compares the provider transcript against Cekura’s ground-truth transcript of what the Testing Agent actually said. Because Cekura generates the Testing Agent’s speech, it knows the exact words spoken. Only the Testing Agent’s turns are scored: Cekura has no independent ground truth for the Main Agent’s turns (both transcripts there are just speech-to-text of the same audio), so scoring them would be circular. Use this to catch cases where the provider’s transcription mishears the caller, such as a wrong word, a dropped digit in a number, or a name transcribed incorrectly.

What is Word Error Rate (WER)? WER is the standard speech-recognition accuracy measure, the fraction of spoken words the speech-to-text got wrong:

WER = (substitutions + deletions + insertions) / words in reference
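
As a rough illustration of the formula above, the following Python sketch computes WER with a word-level edit distance. The function name word_error_rate and the plain whitespace tokenization are assumptions made for this example; it is not Cekura's implementation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / words in reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, word_error_rate("call me tomorrow", "call tomorrow") counts one deletion against three reference words.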

A WER of 0% means a perfect transcription; lower is better. The explanation always reports the WER percentage for the run.

How the score is calculated. The score is not the raw WER. Each transcription error is weighted by how much the mistranscribed word matters, and the score comes from the total weighted error count:

Word type	Weight per error
Names, common nouns, numbers	1.0
Negation flips (e.g. “can” vs “can’t”, a dropped “no”/“not”)	1.0
Verbs	0.5
Everything else (articles, pronouns, fillers, prepositions…)	0 toward the weighted count — treated as a minor variation (see below)

A distinct significant word that is mistranscribed multiple times is counted only once. This part-of-speech (POS) weighted approach works across common languages, including Spanish. The weighted error count maps to a base score:

Base score	Weighted errors
100%	0
80%	1–3
60%	4–6
40%	7–12
20%	13+

Minor variations. Every remaining difference (articles, pronouns, fillers, prepositions and other function words) is a minor variation. These don’t count toward the weighted error count, but each one applies a small absolute 0.1% penalty to the score. So a transcript with no significant errors and, say, 3 minor variations scores 99.7% instead of a flat 100%, and a call scored 60% on significant errors with 6 minor variations lands at 59.4%. Significant errors still dominate; the minor penalty only lets you tell a flawless transcript apart from one with a few small differences by looking at the score alone. This applies to both Runs and Call Logs.

Rendering-only differences are not errors. Differences that only change how a value is rendered are treated as equivalent and don’t count at all: spelled-out vs digit numbers, and spelled-out vs abbreviated units (e.g. “five kilowatts” vs “5 kW”, “two hundred thirty volt” vs “230 V”). Only a genuine change in the value or the unit (“20 watts” vs “20 volts”) is a real error. Backchannel acknowledgements (“uh-huh” vs “aha”) are treated as minor variations.

Reading the explanation. Errors are grouped into Major Variations (drive the base score) and Minor Variations (each applies the 0.1% penalty), so you can see exactly which mistranscriptions drove the score. Both sections always appear; None means there were no errors of that kind. A summary at the bottom shows the Metric Score and the WER percentage.

Interpretation: Higher scores indicate better transcription accuracy. Use this metric when you need to verify that the agent’s TTS is rendering content correctly (correct words, correct numbers), not just that speech is audible.
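
The scoring rules for this metric (weighted significant errors, base-score bands, and the 0.1% minor-variation penalty) can be sketched in Python as follows. The names below, the handling of fractional weighted counts, and the floor at 0 are assumptions made for illustration; the classification of each error is done by the LLM judge and is taken here as an input.

```python
# Weight per significant error, taken from the word-type table.
WEIGHTS = {"name": 1.0, "noun": 1.0, "number": 1.0, "negation": 1.0, "verb": 0.5}
MINOR_PENALTY = 0.1  # absolute percentage points per minor variation


def base_score(weighted_errors: float) -> int:
    """Map a weighted error count to its base score band."""
    if weighted_errors == 0:
        return 100
    # Assumption: a fractional count (e.g. 3.5) falls into the first band
    # whose upper bound it does not exceed.
    for upper_bound, score in ((3, 80), (6, 60), (12, 40)):
        if weighted_errors <= upper_bound:
            return score
    return 20


def transcription_score(significant_errors: dict[str, str], minor_variations: int) -> float:
    """significant_errors maps each distinct mistranscribed word to its word type."""
    # Keying by word means a word mistranscribed several times is counted once.
    weighted = sum(WEIGHTS[word_type] for word_type in significant_errors.values())
    score = base_score(weighted) - MINOR_PENALTY * minor_variations
    # Assumption: the score never drops below 0.
    return round(max(score, 0.0), 1)
```

With no significant errors and 3 minor variations, transcription_score({}, 3) returns 99.7, matching the example in the text.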

Voicemail Detection (Beta)

Result: True/False | Cost: 0.2 credits per call

Detects whether the call reached a voicemail system instead of a live person. An LLM analyzes the transcript for voicemail indicators (greeting messages, beeps, recording prompts).

Interpretation:

True: Call reached voicemail
False: Call connected to a live person

Conversation Quality Metrics

These metrics evaluate the flow and dynamics of the conversation.

AI Interrupting User

Result: Count (number of interruptions) | Cost: 0 credits

Counts how many times the Main Agent started speaking while the Testing Agent was still talking. An interruption is when one speaker begins talking while the other is still speaking. Uses Voice Activity Detection (VAD) to find the timestamps for each turn of each speaker and precisely detect overlapping speech.

Requirements: Stereo audio recording with separate channels for each speaker.

Interpretation: Lower is better. Frequent interruptions indicate the Main Agent isn’t properly waiting for the Testing Agent to finish speaking.
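
To make the overlap idea concrete, here is a minimal Python sketch that counts interruptions from per-speaker VAD segments. The Segment type, the function name, and the rule that any overlap counts (with no minimum duration) are assumptions for illustration, not a description of Cekura's detector.

```python
Segment = tuple[float, float]  # (start_ms, end_ms) of one stretch of speech


def count_interruptions(interrupter: list[Segment], interrupted: list[Segment]) -> int:
    """Count segments of `interrupter` that begin while `interrupted` is still speaking."""
    return sum(
        1
        for start, _ in interrupter
        if any(other_start < start < other_end for other_start, other_end in interrupted)
    )


# Hypothetical VAD output for one call, one list per audio channel.
main_agent = [(0, 1000), (2500, 4000), (6000, 7000)]
testing_agent = [(1200, 3000), (4500, 5500)]

ai_interrupting_user = count_interruptions(main_agent, testing_agent)
```

Swapping the two arguments gives the opposite direction, where the Testing Agent starts speaking over the Main Agent.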

Stop Time After User Interruption (ms)

Result: Numeric (milliseconds) | Cost: 0 credits

Measures how long it takes for the Main Agent to stop speaking after the Testing Agent interrupts. Uses VAD to find the timestamps for each turn and detect when the Testing Agent starts speaking over the Main Agent.

Requirements: Stereo audio recording with separate channels for each speaker.

Interpretation: Lower is better. A responsive Main Agent should stop quickly (under 500ms) when interrupted.

User Interrupting AI

Result: Count (number of interruptions) | Cost: 0 credits

Counts how many times the Testing Agent started speaking while the Main Agent was still talking. Uses VAD to find the timestamps for each turn and detect overlapping speech.

Requirements: Stereo audio recording with separate channels for each speaker.

Interpretation: High counts may indicate Testing Agent frustration, Main Agent speaking too long, or poor turn-taking.

Interruption Score

Result: Score (0-5, shown as a percentage on the Results page) | Cost: 0 credits

Normalizes how often the Main Agent interrupts the Testing Agent against the number of turns: 5 * (1 - interruptions / turns), clamped to 0-5. Uses VAD on stereo audio where available, otherwise transcript analysis. The Results page shows it as a percentage of the 5-point maximum (e.g. 5.0 → 100%).

Interpretation: Higher is better. 5 (100%) means no interruptions; lower scores indicate the Main Agent increasingly talks over the Testing Agent.
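
The formula and the percentage display can be worked through with a short Python sketch. The function names are illustrative, and raising an error for a call with zero turns is an assumption, since the text does not say how that case is handled.

```python
def interruption_score(interruptions: int, turns: int) -> float:
    """5 * (1 - interruptions / turns), clamped to the 0-5 range."""
    if turns <= 0:
        # Assumption: a call without turns has no defined score.
        raise ValueError("turns must be positive")
    raw = 5 * (1 - interruptions / turns)
    return max(0.0, min(5.0, raw))


def as_percentage(score: float) -> float:
    """Express a 0-5 score as a percentage of the 5-point maximum."""
    return score / 5 * 100
```

For instance, 2 interruptions across 10 turns gives a score of 4.0, which would display as 80%.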

Latency (in ms)

Result: Numeric (average milliseconds) | Cost: 0 credits

Measures the response time between the Testing Agent finishing speaking and the Main Agent starting its response. Latency is calculated for every Main Agent turn throughout the call. With stereo audio, VAD is used to find precise timestamps for each speaker turn.

We compute and display percentile statistics: P25, P50, P75, P90, P95, and P99.

Interpretation: Lower is better. Latency under 2000ms is generally good.
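
As an illustration of how per-turn latencies roll up into these statistics, the Python sketch below computes the mean and the listed percentiles. The linear-interpolation percentile method and the function names are assumptions; the text does not specify which percentile definition Cekura uses.

```python
import math
from statistics import fmean


def percentile(values: list[float], p: float) -> float:
    """Percentile with linear interpolation between the two closest ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (p / 100) * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def latency_summary(latencies_ms: list[float]) -> dict[str, float]:
    """Average latency plus the percentile statistics listed above."""
    summary = {"mean": fmean(latencies_ms)}
    for p in (25, 50, 75, 90, 95, 99):
        summary[f"P{p}"] = percentile(latencies_ms, p)
    return summary
```

Calling latency_summary with one latency value per Main Agent turn yields a dictionary keyed by "mean", "P25", "P50", and so on.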

Unnecessary Repetition Count

Result: Count (number of repetitions) | Cost: 0.2 credits per call

Identifies instances where the Main Agent unnecessarily repeated information it had already provided. An LLM analyzes the conversation for redundant statements.

Interpretation: Lower is better. Repetition wastes time and can frustrate the Testing Agent.

Unnecessary Repetition Score

Result: Score (0-5) | Cost: 0.2 credits per call

The 0-5 score companion to Unnecessary Repetition Count, normalizing repetitions against the number of turns: 5 * (1 - repetitions / turns), clamped to 0-5. An LLM flags when the Main Agent re-confirms the same, unchanged information two or more times.

Interpretation: Higher is better. 5 means no unnecessary repetition; lower scores mean the agent keeps re-confirming the same information.

Verbosity

Result: Score (1-5) | Cost: 0.2 credits per call

Evaluates how appropriately concise the Main Agent is across the conversation. An LLM classifies each Main Agent turn as Concise, Padded (filler, restated questions, repeated information, disclaimer spam), or Overloaded (more information, options, or follow-up questions than the user asked for), then aggregates to a 1-5 score. Necessary detail the user explicitly asked for is not penalised.

Interpretation:

5: Consistently concise and well-calibrated to user intent
4: Mostly concise; a few overlong turns
3: Mixed; several turns are too long
2: Verbosity is the dominant issue across most turns
1: Agent is bloated throughout

Detect Silence in Conversation

Result: True/False | Cost: 0 credits

Detects prolonged silence periods where both the Main Agent and Testing Agent are silent, which may indicate technical issues or agent problems.

Requirements: Audio recording.

Configuration: Set silence_duration in the metric configuration (default: 10 seconds).

Interpretation:

True: No problematic silence detected
False: Extended mutual silence exceeding the threshold was detected
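
A minimal Python sketch of this check, assuming speech segments for both speakers are already available as (start_ms, end_ms) pairs. The function name, the segment format, and the choice to also count silence at the very start and end of the call are assumptions for illustration only.

```python
Segment = tuple[float, float]  # (start_ms, end_ms) of speech from either speaker


def passes_silence_check(
    speech: list[Segment], call_end_ms: float, silence_duration_s: float = 10.0
) -> bool:
    """Return True when no stretch of mutual silence exceeds the threshold."""
    threshold_ms = silence_duration_s * 1000
    covered_until = 0.0  # latest point up to which someone has spoken
    for start, end in sorted(speech):
        if start - covered_until > threshold_ms:
            return False  # nobody spoke for longer than the threshold
        covered_until = max(covered_until, end)
    # Assumption: trailing silence before the call ends is checked as well.
    return call_end_ms - covered_until <= threshold_ms
```

The return value follows the interpretation above: True means no problematic silence, False means a gap longer than silence_duration was found.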

Infrastructure Issues

Result: True/False | Cost: 0 credits

Detects when the Main Agent fails to respond within the configured timeout after the Testing Agent finishes their turn, indicating potential infrastructure or connectivity problems.

Requirements: Audio recording.

Configuration: Set infra_issues_timeout in the metric configuration (default: 10 seconds).

Interpretation:

True: No infrastructure issues detected
False: Main Agent failed to respond within the timeout after the Testing Agent finished speaking

Appropriate Termination by Main Agent

Result: True/False | Cost: 0.2 credits per call

Evaluates whether the Main Agent ended the call appropriately. An LLM analyzes whether the Main Agent wrapped up the conversation properly before ending.

Interpretation:

True: Call was ended appropriately by the Main Agent
False: Main Agent ended call abruptly or inappropriately

Appropriate Termination by Testing Agent

Result: True/False | Cost: 0.2 credits per call

Evaluates whether the Testing Agent ended the call early, which may indicate poor experience or unresolved issues. An LLM analyzes the conversation to determine if the call ended prematurely.

Interpretation:

True: Call ended at a natural conclusion point
False: Testing Agent ended call early, suggesting dissatisfaction

Customer Experience Metrics

These metrics evaluate the Testing Agent’s experience and satisfaction with the conversation.

CSAT

Result: Score (0-100) | Cost: 0.2 credits per call

Evaluates overall customer satisfaction based on two dimensions:

1. Customer Sentiment - Evaluates the Testing Agent’s (customer’s) tone throughout the call:

Positive: Clear expressions of gratitude like “Thank you so much for your help” = 5 points
Neutral: Simple thanks, cooperative tone, matter-of-fact responses = 5 points
Negative: Explicit frustration, harsh language, complaints = 1 point

2. Cooperation - Evaluates whether the Main Agent helped the Testing Agent:

Fully cooperative / No issues = 5 points
Somewhat uncooperative = 3 points
Refused to help / Obstructed = 1 point

The final score is the average of both dimensions, scaled to 0-100.

Interpretation: Higher is better. Scores above 70 indicate good customer satisfaction.

Dropoff Node

Result: Enum (one of your configured stages) | Cost: 0.2 credits per call

Identifies at which stage of the conversation the call dropped or ended. An LLM maps the conversation endpoint to one of your predefined stages.

Requirements: Configure dropoff_nodes on your agent with the conversation stages you want to track (e.g., “greeting”, “information_gathering”, “resolution”, “closing”).

Interpretation: Helps identify where in your conversation flow Testing Agents are dropping off, enabling targeted improvements.

Sentiment

Result: Enum (positive / neutral / negative) | Cost: 0.2 credits per call

Determines the Testing Agent’s overall sentiment toward the Main Agent based on the conversation transcript. An LLM analyzes the Testing Agent’s language, tone, and responses.

Classification criteria:

Positive: Only when the Testing Agent is clearly very grateful with phrases like “Thank you so much for your help”, “I really appreciate this”, “You’ve been so helpful”
Negative: Explicit frustration, harsh language, complaints, or aggressive tone
Neutral: Simple “thanks”, cooperative tone, matter-of-fact responses, or when sentiment is unclear

Note: Neutral is the default value when sentiment is uncertain.

Interpretation:

Positive: Testing Agent seemed very satisfied or grateful
Neutral: Testing Agent showed no strong emotion
Negative: Testing Agent seemed frustrated or dissatisfied

Topic of Call

Result: Enum (one of your configured topics) | Cost: 0.2 credits per call

Categorizes the call into one of your predefined topics. An LLM analyzes the conversation to determine the primary subject matter.

Requirements: Configure topic_nodes on your agent with the topics you want to track (e.g., “billing”, “technical_support”, “sales”, “general_inquiry”).

Interpretation: Helps understand call volume distribution across different topics for resource planning and analysis.

Speech Quality Metrics

These metrics evaluate the audio and speech characteristics of the Main Agent.

Average Pitch (in Hz)

Result: Numeric (Hertz) | Cost: 0 credits

Measures the average pitch frequency of the Main Agent’s voice during the call using pitch extraction algorithms.

Requirements: Audio recording.

Gibberish Detection (Beta)

Result: True/False | Cost: 0.2 credits per minute

Detects nonsensical or garbled speech from the Main Agent. A multimodal LLM analyzes the Main Agent’s audio to identify unintelligible segments, nonsense sounds, or garbled speech.

Requirements: Stereo audio recording with separate channels.

Interpretation:

True: Speech was clear and intelligible
False: Gibberish or garbled speech detected

Letterwise Pronunciation

Result: True/False | Cost: 0.2 credits per call

Checks whether certain words were spelled out letter-by-letter correctly in the audio (e.g., spelling out “J-O-H-N” for a name). A multimodal LLM analyzes the Main Agent’s audio to verify spelling.

Requirements: Audio recording. Configure spelling_word_types on your agent specifying which types of words should be spelled out (e.g., “name”, “email”, “confirmation_code”).

Interpretation:

True: Every instance of every word of the selected category was correctly spelled out in the audio
False: Spelling errors detected or words not spelled out when required

Pronunciation Check (Beta)

Result: Score (0-100) | Cost: 0.2 credits per call

Evaluates pronunciation accuracy for specific words you define.

Requirements: Audio recording. Configure pronunciation_words on your agent as a list of word-phoneme pairs (e.g., [["Cekura", "suh-KYUR-uh"]]).

Interpretation: Higher scores indicate better pronunciation accuracy. Useful for brand names or technical terms.

Speaking Rate (Beta)

Result: True/False | Cost: 0.2 credits per call

Detects abrupt or unnatural changes in the Main Agent’s speaking rate during the call using an ML model.

Requirements: Audio recording. Currently supports English only.

Interpretation:

True: Speaking rate was consistent and natural
False: Unnatural speaking rate changes detected

Talk Ratio

Result: Numeric (0.0 to 1.0) | Cost: 0 credits

Calculates the ratio of Main Agent speaking time to total call duration. Uses VAD to find the timestamps for each turn of each speaker for accurate speaker separation.

Requirements: Stereo audio recording with separate channels.

Interpretation: A ratio around 0.4-0.6 is typical. Very high ratios may indicate the Main Agent is dominating the conversation; very low ratios may indicate the Main Agent isn’t being helpful enough.
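
The ratio itself is simple arithmetic once the Main Agent's speech segments are known. The sketch below assumes non-overlapping (start_ms, end_ms) segments from the Main Agent's channel; the function name and input format are illustrative.

```python
def talk_ratio(agent_segments: list[tuple[float, float]], call_duration_ms: float) -> float:
    """Main Agent speaking time divided by total call duration."""
    if call_duration_ms <= 0:
        raise ValueError("call_duration_ms must be positive")
    # Assumption: segments from the same channel do not overlap each other.
    speaking_ms = sum(end - start for start, end in agent_segments)
    return speaking_ms / call_duration_ms
```

A Main Agent that speaks for 5 seconds of a 10-second call has a talk ratio of 0.5.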

Voice Change Detection (Beta)

Result: True/False | Cost: 0.2 credits per call

Detects unexpected speaker changes during the Main Agent’s speaking turns using an ML model for voice analysis.

Requirements: Audio recording.

Interpretation:

True: Consistent speaker throughout Main Agent turns
False: Unexpected voice change detected (may indicate system issues)

Voice Tone + Clarity

Result: Score (0-100) | Cost: 0.2 credits per call

Evaluates the overall voice quality of the Main Agent’s audio using an ML model. Specifically analyzes clarity (how clear and understandable the voice is) and jitter (variations in audio timing that can affect quality) on the Main Agent’s audio channel.

Requirements: Audio recording.

Interpretation: Higher is better. Scores above 70 indicate good voice quality. Low scores may indicate audio issues, background noise, or voice synthesis problems.

Words Per Minute (WPM)

Result: Numeric (words per minute) | Cost: 0 credits

Calculates the Main Agent’s speaking speed based on transcript word count and speaking duration.

Requirements: Audio recording.
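
As a small worked example, the Python sketch below derives WPM from Main Agent transcript turns and speaking time. Counting words by whitespace splitting and the function name are assumptions for illustration.

```python
def words_per_minute(agent_turns: list[str], speaking_seconds: float) -> float:
    """Transcript word count divided by speaking duration in minutes."""
    if speaking_seconds <= 0:
        raise ValueError("speaking_seconds must be positive")
    # Assumption: words are counted by splitting each turn on whitespace.
    word_count = sum(len(turn.split()) for turn in agent_turns)
    return word_count * 60 / speaking_seconds
```

Ten words spoken over 4 seconds of Main Agent speech works out to 150 words per minute.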
