Create Metric
Create a quality metric for evaluating agent conversations.
Authorizations
API key authentication. Include the API key in the header of each request.
Body
Name of the metric
Description of what this metric evaluates
Whether this metric evaluates audio content
The evaluation prompt used for this metric
ID of the project this metric belongs to.
External identifier for the assistant
Type of metric (llm_judge recommended; basic and custom_prompt are deprecated)
- basic: Basic (deprecated in favor of LLM Judge)
- custom_prompt: Custom Prompt (deprecated in favor of LLM Judge)
- custom_code: Custom Code
- llm_judge: LLM Judge
Allowed values: basic, custom_prompt, custom_code, llm_judge
Output shape of the evaluation score
- binary_workflow_adherence: Binary Workflow Adherence
- binary_qualitative: Binary Qualitative
- continuous_qualitative: Continuous Qualitative
- numeric: Numeric
- enum: Enum
Allowed values: binary_workflow_adherence, binary_qualitative, continuous_qualitative, numeric, enum
Order in which to display this metric in the UI
Per-metric configuration.
Some predefined metrics carry per-agent configuration, keyed by agent id: configuration[<key>] = {"<agent_id>": <value>}. On read, every agent attached to the metric is returned; on write, only the agent ids you include are updated.
Call-level keys (not per-agent):
- Detect Silence: silence_duration (int, seconds, default 10). Minimum mutual silence before flagging a failure.
- Infrastructure Issues: infra_issues_timeout (int, seconds, default 10). Maximum number of seconds the Main Agent may take to respond after the Testing Agent finishes speaking.
Per-agent keys (each value is {"<agent_id>": <value>}):
- Dropoff Node: nodes, shaped as {"<agent_id>": [{name, description}]}. Conversation stages used to classify where a call dropped off.
- Topic of Call: nodes, shaped as {"<agent_id>": [{name, description}]}. Topic categories used to classify each call.
- Pronunciation Check: pronunciation_words, shaped as {"<agent_id>": [[word, phonemes], ...]}. How specific words should be pronounced (e.g. {"42": [["Cekura", "suh-KYUR-uh"]]}).
- Letterwise Pronunciation: spelling_word_types, shaped as {"<agent_id>": ["name", "email", ...]}. Word categories the Main Agent must spell out letter by letter.
- Hallucination: hallucination_kb_files, shaped as {"<agent_id>": [file_id, ...]}. KnowledgeBaseFile IDs used as the source of truth for fact-checking.
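The write behavior described above (only the agent ids you include are updated) can be modeled locally with a short sketch. The helper `merge_per_agent` is hypothetical and is not part of the API; it only imitates the documented semantics. Storing call-level keys as plain values inside the configuration object is also an assumption, since the text states only that they are not per-agent.

```python
import copy

# Per-agent keys named in the reference above.
PER_AGENT_KEYS = {
    "nodes",
    "pronunciation_words",
    "spelling_word_types",
    "hallucination_kb_files",
}


def merge_per_agent(existing, update):
    """Hypothetical local model of a configuration write.

    For per-agent keys, only the agent ids present in `update` are
    overwritten; other agents keep their current values. Call-level
    keys are assumed to be plain values and are replaced outright.
    """
    merged = copy.deepcopy(existing)
    for key, value in update.items():
        if key in PER_AGENT_KEYS:
            merged.setdefault(key, {}).update(copy.deepcopy(value))
        else:
            merged[key] = value
    return merged


existing = {
    "silence_duration": 10,
    "spelling_word_types": {"42": ["name"], "43": ["email"]},
}
update = {
    "silence_duration": 15,
    "spelling_word_types": {"42": ["name", "email"]},
}
merged = merge_per_agent(existing, update)
```

In this sketch agent "43" keeps its value because it was not included in the update, while agent "42" and the call-level `silence_duration` are replaced.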
List of agent IDs to enable this project-level metric for. Only applicable when project is set.
Possible values for enum-type metrics (list of strings, e.g. ["resolved", "escalated", "abandoned"])
When enabled, this metric is automatically assigned to new agents created in the project.
Enable this metric for simulations.
Example: true or false
Enable this metric for observability.
Example: true or false
Enable sampling for this metric using project-level sample rate
When to run this metric.
- always: evaluate every call (default)
- automatic: system decides based on call content
- custom: only evaluate when the evaluation_trigger_prompt condition is met
Allowed values: always, automatic, custom
LLM prompt that decides whether to evaluate this call. Only used when evaluation_trigger=custom and trigger_type=llm_judge.
Example: "Did the agent offer a refund?"
How to evaluate the trigger condition. Only relevant when evaluation_trigger=custom.
- llm_judge: use evaluation_trigger_prompt (default)
- custom_code: use evaluation_trigger_custom_code
Allowed values: llm_judge, custom_code
Python code to evaluate the trigger condition. Only used when evaluation_trigger=custom and trigger_type=custom_code.
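The relationship between evaluation_trigger, trigger_type, and the two trigger fields can be checked on the client before sending a request. The following is a minimal sketch; the function name `check_trigger_fields` is hypothetical, and whether the server itself rejects an inconsistent combination is not stated in this reference. The defaults (`always` and `llm_judge`) are taken from the descriptions above.

```python
def check_trigger_fields(payload):
    """Return a list of human-readable problems with the trigger fields.

    This is a client-side lint only; it does not reproduce any
    server-side validation.
    """
    problems = []
    trigger = payload.get("evaluation_trigger", "always")
    if trigger not in ("always", "automatic", "custom"):
        problems.append(f"unknown evaluation_trigger: {trigger!r}")
        return problems
    if trigger != "custom":
        # The trigger prompt and trigger code are only used for custom triggers.
        return problems

    trigger_type = payload.get("trigger_type", "llm_judge")
    if trigger_type == "llm_judge":
        if not payload.get("evaluation_trigger_prompt"):
            problems.append("evaluation_trigger_prompt is empty for an llm_judge trigger")
    elif trigger_type == "custom_code":
        if not payload.get("evaluation_trigger_custom_code"):
            problems.append("evaluation_trigger_custom_code is empty for a custom_code trigger")
    else:
        problems.append(f"unknown trigger_type: {trigger_type!r}")
    return problems


ok = check_trigger_fields({
    "evaluation_trigger": "custom",
    "evaluation_trigger_prompt": "Did the agent offer a refund?",
})
missing_code = check_trigger_fields({
    "evaluation_trigger": "custom",
    "trigger_type": "custom_code",
})
```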
Python code that implements the metric evaluation. Required when type=custom_code. Must define a function evaluate(transcript, ...) -> bool | float | str.
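As an illustration of the required entry point, here is a minimal sketch of a custom_code metric that returns a bool. It assumes `transcript` can be treated as plain text, which this reference does not confirm, and it accepts the unspecified remaining parameters through `*args` and `**kwargs` rather than guessing their names.

```python
def evaluate(transcript, *args, **kwargs):
    """Return True when the conversation mentions a refund.

    Assumption: `transcript` is, or can be converted to, plain text.
    The additional parameters are not documented here, so they are
    accepted and ignored.
    """
    text = str(transcript).lower()
    return "refund" in text
```

A bool return value would suit a binary output shape; a float or str return would presumably pair with the numeric or enum shapes, though that mapping is an inference from the signature rather than something stated here.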
Response
Name of the metric
Description of what this metric evaluates
Whether this metric evaluates audio content
The evaluation prompt used for this metric
ID of the project this metric belongs to.
External identifier for the assistant
Type of metric (llm_judge recommended; basic and custom_prompt are deprecated)
- basic: Basic (deprecated in favor of LLM Judge)
- custom_prompt: Custom Prompt (deprecated in favor of LLM Judge)
- custom_code: Custom Code
- llm_judge: LLM Judge
Allowed values: basic, custom_prompt, custom_code, llm_judge
Output shape of the evaluation score
- binary_workflow_adherence: Binary Workflow Adherence
- binary_qualitative: Binary Qualitative
- continuous_qualitative: Continuous Qualitative
- numeric: Numeric
- enum: Enum
Allowed values: binary_workflow_adherence, binary_qualitative, continuous_qualitative, numeric, enum
Order in which to display this metric in the UI
Per-metric configuration.
Some predefined metrics carry per-agent configuration, keyed by agent id: configuration[<key>] = {"<agent_id>": <value>}. On read, every agent attached to the metric is returned; on write, only the agent ids you include are updated.
Call-level keys (not per-agent):
- Detect Silence: silence_duration (int, seconds, default 10). Minimum mutual silence before flagging a failure.
- Infrastructure Issues: infra_issues_timeout (int, seconds, default 10). Maximum number of seconds the Main Agent may take to respond after the Testing Agent finishes speaking.
Per-agent keys (each value is {"<agent_id>": <value>}):
- Dropoff Node: nodes, shaped as {"<agent_id>": [{name, description}]}. Conversation stages used to classify where a call dropped off.
- Topic of Call: nodes, shaped as {"<agent_id>": [{name, description}]}. Topic categories used to classify each call.
- Pronunciation Check: pronunciation_words, shaped as {"<agent_id>": [[word, phonemes], ...]}. How specific words should be pronounced (e.g. {"42": [["Cekura", "suh-KYUR-uh"]]}).
- Letterwise Pronunciation: spelling_word_types, shaped as {"<agent_id>": ["name", "email", ...]}. Word categories the Main Agent must spell out letter by letter.
- Hallucination: hallucination_kb_files, shaped as {"<agent_id>": [file_id, ...]}. KnowledgeBaseFile IDs used as the source of truth for fact-checking.
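Because a read returns every agent attached to the metric, a client usually needs to pick out the values for one agent. The sketch below shows one way to do that; `config_for_agent` is a hypothetical helper, and treating call-level keys as plain values in the returned configuration is an assumption. Agent ids are looked up as strings, matching the {"42": ...} example above.

```python
# Per-agent keys named in the reference above.
PER_AGENT_KEYS = (
    "nodes",
    "pronunciation_words",
    "spelling_word_types",
    "hallucination_kb_files",
)


def config_for_agent(configuration, agent_id):
    """Flatten a returned configuration to the view of a single agent.

    Per-agent keys are reduced to the entry for `agent_id` and omitted
    when that agent has no entry; call-level keys are copied as-is.
    """
    agent_key = str(agent_id)
    view = {}
    for key, value in configuration.items():
        if key in PER_AGENT_KEYS:
            if agent_key in value:
                view[key] = value[agent_key]
        else:
            view[key] = value
    return view


configuration = {
    "infra_issues_timeout": 10,
    "pronunciation_words": {"42": [["Cekura", "suh-KYUR-uh"]]},
    "spelling_word_types": {"7": ["name"]},
}
view = config_for_agent(configuration, 42)
```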
List of agent IDs to enable this project-level metric for. Only applicable when project is set.
Possible values for enum-type metrics (list of strings, e.g. ["resolved", "escalated", "abandoned"])
When enabled, this metric is automatically assigned to new agents created in the project.
Enable this metric for simulations.
Example: true or false
Enable this metric for observability.
Example: true or false
Enable sampling for this metric using project-level sample rate
When to run this metric.
- always: evaluate every call (default)
- automatic: system decides based on call content
- custom: only evaluate when the evaluation_trigger_prompt condition is met
Allowed values: always, automatic, custom
LLM prompt that decides whether to evaluate this call. Only used when evaluation_trigger=custom and trigger_type=llm_judge.
Example: "Did the agent offer a refund?"
How to evaluate the trigger condition. Only relevant when evaluation_trigger=custom.
- llm_judge: use evaluation_trigger_prompt (default)
- custom_code: use evaluation_trigger_custom_code
Allowed values: llm_judge, custom_code
Python code to evaluate the trigger condition. Only used when evaluation_trigger=custom and trigger_type=custom_code.
Python code that implements the metric evaluation. Required when type=custom_code. Must define a function evaluate(transcript, ...) -> bool | float | str.