Skip to main content

Documentation Index

Fetch the complete documentation index at: https://docs.cekura.ai/llms.txt

Use this file to discover all available pages before exploring further.

LLM Judge Metrics allow you to evaluate your AI voice agent calls using natural language descriptions. Instead of writing code, you simply describe what constitutes success in plain English, and the system automatically evaluates each call against your criteria. This makes it easy to create custom evaluations without programming knowledge.
For the canonical reference of {{...}} variables you can use in your metric prompts — and which are available in Simulation vs. Observability — see Metric Variables.

What You Can Evaluate

LLM Judge Metrics are ideal for evaluating qualitative aspects of conversations that require understanding context and nuance:
  • Workflow Compliance: Check if agents followed specific steps or procedures
  • Communication Quality: Assess tone, clarity, professionalism, or empathy
  • Information Accuracy: Verify agents provided correct information or asked required questions
  • Customer Handling: Evaluate objection handling, de-escalation, or problem resolution
  • Policy Adherence: Ensure agents stayed within company policies and guidelines
  • Call Outcomes: Determine if desired outcomes were achieved (bookings, resolutions, etc.)

Benefits

  • No Coding Required: Write evaluations in natural language, no programming skills needed
  • Flexible & Adaptable: Easily modify criteria by updating your metric description
  • Context-Aware: Understands conversational context, not just keyword matching
  • Dynamic Variables: Use call-specific data (customer info, metadata) in your evaluations

Creating LLM Judge Metrics

Navigate to the Metrics section and select Create Metric.
  1. Name & Type: Give your metric a descriptive name (e.g., Correct End Call by Main Agent).
  2. Description (The Prompt): Write a natural language description of what constitutes success. This is what the LLM Judge will use to evaluate calls.
Use context variables to make the metric dynamic. For example, use {{metadata.instructions}} to reference specific scenario steps the agent was supposed to follow.You will see a list of context variables in the dashboard when creating a metric. See Metric Variables for a complete list.
Example Description:
Check if the Main Agent ended the call only after all steps in
{{metadata.instructions}} were completed by the Testing Agent.

Set Triggers

Define when the metric should run under the Evaluation Trigger section.
  • Always: Runs on every call (default).
  • Custom: Use logic to run metrics only in specific scenarios. You can write a trigger prompt in natural language or use Python code to define when the metric should run (e.g., return True only if the agent is attempting to book an appointment).

Testing Your Metrics

Before saving, validate your logic immediately within the builder.
1

Click Test Metric

Navigate to the test section within the metric builder.
2

Select Call IDs

Select a few past Call IDs from the list to test against.
3

Run the Test

Run the test to see if the metric passes/fails as expected on historical data.
4

Create Metric

If satisfied with the results, click Create Metric to save.

Audio Evaluation

When calling LLM Judge metrics from Python code, you can set audio=True to have the judge analyze the actual voice recording instead of (or in addition to) the transcript text. This is useful for evaluating speech delivery, pacing, tone, and other audio properties that the transcript alone cannot capture.
response = evaluate_llm_judge_metric(
    data,
    api_key,
    description="Did the agent speak clearly and at an appropriate pace?",
    eval_type="binary_qualitative",
    audio=True,
    audio_start_time=5.2,   # optional: clip start in seconds
    audio_end_time=12.8,    # optional: clip end in seconds
)
See Python Metric — Audio-Based Analysis for the full pattern.