LLM Judge Metrics let you evaluate your AI voice agent calls using natural language descriptions. Instead of writing code, you describe what constitutes success in plain English, and the system automatically evaluates each call against your criteria, so you can create custom evaluations without programming knowledge.

What You Can Evaluate

LLM Judge Metrics are ideal for evaluating qualitative aspects of conversations that require understanding context and nuance:
  • Workflow Compliance: Check if agents followed specific steps or procedures
  • Communication Quality: Assess tone, clarity, professionalism, or empathy
  • Information Accuracy: Verify agents provided correct information or asked required questions
  • Customer Handling: Evaluate objection handling, de-escalation, or problem resolution
  • Policy Adherence: Ensure agents stayed within company policies and guidelines
  • Call Outcomes: Determine if desired outcomes were achieved (bookings, resolutions, etc.)

Benefits

  • No Coding Required: Write evaluations in natural language, no programming skills needed
  • Flexible & Adaptable: Easily modify criteria by updating your metric description
  • Context-Aware: Understands conversational context, not just keyword matching
  • Dynamic Variables: Use call-specific data (customer info, metadata) in your evaluations

Creating LLM Judge Metrics

Navigate to the Metrics section and select Create Metric.
  1. Name & Type: Give your metric a descriptive name (e.g., Correct End Call by Main Agent).
  2. Description (The Prompt): Write a natural language description of what constitutes success. This is what the LLM Judge will use to evaluate calls.
Use context variables to make the metric dynamic. For example, reference {{metadata.instructions}} to check against the specific scenario steps the agent was supposed to follow. The dashboard shows the available context variables when you create a metric; see LLM Judge Available Variables for the complete list.
Example Description:
Check if the Main Agent ended the call only after all steps in
{{metadata.instructions}} were completed by the Testing Agent.

Set Triggers

Define when the metric should run under the Evaluation Trigger section.
  • Always: Runs on every call (default).
  • Custom: Runs only in specific scenarios. Write a trigger prompt in natural language, or define the condition in Python code that returns True when the metric should run (e.g., return True only if the agent is attempting to book an appointment).
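As a rough illustration of a Python custom trigger, the sketch below returns True only when the call looks like a booking attempt. The function name, its single-transcript signature, and the keyword list are all assumptions for illustration; the actual inputs your trigger receives are shown in the dashboard.

```python
# Hypothetical custom-trigger sketch (signature and inputs are assumptions):
# return True only when this metric should run on the call.
def should_run_metric(transcript: str) -> bool:
    """Run the metric only if the agent appears to be booking an appointment."""
    booking_phrases = (
        "book an appointment",
        "schedule a visit",
        "set up a time",
    )
    text = transcript.lower()
    # Trigger when any booking phrase appears anywhere in the transcript.
    return any(phrase in text for phrase in booking_phrases)
```

A natural-language trigger prompt can express the same condition ("run only when the agent attempts to book an appointment"); code is useful when you want deterministic, keyword-level control.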

Testing Your Metrics

Before saving, validate your logic immediately within the builder.
  1. Click Test Metric: Navigate to the test section within the metric builder.
  2. Select Call IDs: Select a few past Call IDs from the list to test against.
  3. Run the Test: Run the test to see whether the metric passes or fails as expected on historical data.
  4. Create Metric: If satisfied with the results, click Create Metric to save.