LLM Judge Metric

LLM Judge Metrics allow you to evaluate your AI voice agent calls using natural language descriptions. Instead of writing code, you simply describe what constitutes success in plain English, and the system automatically evaluates each call against your criteria. This makes it easy to create custom evaluations without programming knowledge.

What You Can Evaluate

LLM Judge Metrics are ideal for evaluating qualitative aspects of conversations that require understanding context and nuance:

Workflow Compliance: Check if agents followed specific steps or procedures
Communication Quality: Assess tone, clarity, professionalism, or empathy
Information Accuracy: Verify agents provided correct information or asked required questions
Customer Handling: Evaluate objection handling, de-escalation, or problem resolution
Policy Adherence: Ensure agents stayed within company policies and guidelines
Call Outcomes: Determine if desired outcomes were achieved (bookings, resolutions, etc.)

Benefits

No Coding Required: Write evaluations in natural language, with no programming skills needed
Flexible & Adaptable: Easily modify criteria by updating your metric description
Context-Aware: Understands conversational context, not just keyword matching
Dynamic Variables: Use call-specific data (customer info, metadata) in your evaluations

Creating LLM Judge Metrics

Navigate to the Metrics section and select Create Metric.

Name & Type: Give your metric a descriptive name (e.g., Correct End Call by Main Agent).
Description (The Prompt): Write a natural language description of what constitutes success. This is what the LLM Judge will use to evaluate calls.

Use context variables to make the metric dynamic. For example, use {{metadata.instructions}} to reference specific scenario steps the agent was supposed to follow. You will see a list of context variables in the dashboard when creating a metric. See LLM Judge Available Variables for a complete list.

Example Description:

Check if the Main Agent ended the call only after all steps in
{{metadata.instructions}} were completed by the Testing Agent.
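
To make the idea of a context variable concrete, the sketch below shows one way a placeholder such as {{metadata.instructions}} could be filled in with call-specific data before the description reaches the judge. This is an illustration only: the helper `render_description`, the shape of the `context` dictionary, and the sample instructions are assumptions, not part of the platform's API, and the platform's actual substitution logic may differ.

```python
import re

# Hypothetical helper for illustration; not part of the Cekura API.
# Matches placeholders of the form {{dotted.path}}.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render_description(template: str, context: dict) -> str:
    """Replace each {{dotted.path}} with the matching value from context."""

    def lookup(match: re.Match) -> str:
        value = context
        for key in match.group(1).split("."):
            if not isinstance(value, dict) or key not in value:
                # Leave unknown variables untouched so they are easy to spot.
                return match.group(0)
            value = value[key]
        return str(value)

    return PLACEHOLDER.sub(lookup, template)


template = (
    "Check if the Main Agent ended the call only after all steps in "
    "{{metadata.instructions}} were completed by the Testing Agent."
)
# Assumed call context; real calls would carry their own metadata.
context = {"metadata": {"instructions": "1) greet 2) verify identity 3) book a slot"}}

rendered = render_description(template, context)
print(rendered)
```

Leaving unresolved placeholders in place, rather than replacing them with an empty string, makes a misspelled variable name visible in the rendered description.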

Set Triggers

Define when the metric should run under the Evaluation Trigger section.

Always: Runs on every call (default).
Custom: Use logic to run metrics only in specific scenarios. You can write a trigger prompt in natural language or use Python code to define when the metric should run (e.g., return True only if the agent is attempting to book an appointment).
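
As a rough sketch of what a Python trigger for the booking example might look like, the function below returns True only when an agent turn mentions booking. The function name `should_run`, the transcript format (a list of turns with `role` and `content` keys), and the phrase list are all assumptions made for illustration; check the metric builder for the inputs and entry point the platform actually provides.

```python
# Hypothetical trigger; the real entry point and inputs may differ.
# Assumed phrases that indicate the agent is attempting a booking.
BOOKING_PHRASES = (
    "book an appointment",
    "schedule an appointment",
    "book a slot",
)


def should_run(transcript: list[dict]) -> bool:
    """Return True only if an agent turn appears to attempt a booking."""
    for turn in transcript:
        # Only the agent's own turns count as a booking attempt.
        if turn.get("role") != "agent":
            continue
        text = turn.get("content", "").lower()
        if any(phrase in text for phrase in BOOKING_PHRASES):
            return True
    return False


example_call = [
    {"role": "customer", "content": "Hi, I need to see a doctor next week."},
    {"role": "agent", "content": "Sure, I can book an appointment for you."},
]
print(should_run(example_call))
```

Keyword matching like this is brittle compared with a natural language trigger prompt, so it suits cases where the condition is simple and predictable.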

Testing Your Metrics

Before saving, validate your logic immediately within the builder.

Click Test Metric

Navigate to the test section within the metric builder.

Select Call IDs

Select a few past Call IDs from the list to test against.

Run the Test

Run the test to see if the metric passes/fails as expected on historical data.

Create Metric

If satisfied with the results, click Create Metric to save.
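
When testing against historical calls, it can help to write down the result you expect for each Call ID before running the test, then compare. The sketch below shows that comparison in plain Python. The call IDs and results are made-up placeholders, and `find_mismatches` is a hypothetical helper, not a platform feature.

```python
def find_mismatches(expected: dict[str, bool], actual: dict[str, bool]) -> list[str]:
    """Return the call IDs whose result differs from the expectation or is missing."""
    return [call_id for call_id, want in expected.items() if actual.get(call_id) != want]


# Placeholder data: what you expect the metric to return for each call,
# and what the test run actually reported.
expected = {"call_001": True, "call_002": False, "call_003": True}
actual = {"call_001": True, "call_002": True, "call_003": True}

mismatches = find_mismatches(expected, actual)
print(mismatches)
```

Any call ID that shows up in the mismatch list is a prompt to revisit either the metric description or your expectation for that call.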

Related Documentation

LLM Judge Available Variables - Variables you can use in metric descriptions
Creating Good Metrics - Complete guide for building high-fidelity metrics
Python Metric - Write custom evaluation logic in Python
Pre-defined Metrics - Generic metrics provided by Cekura
