Python Code Metrics let you write custom evaluation logic in Python to assess your AI agent's performance. This gives you complete control over the evaluation process and enables complex analysis that goes beyond simple prompt-based metrics.
Custom code metrics are executed in a secure Python environment with access to call data, including transcripts, metadata, and dynamic variables, exposed through the `data` dictionary. Your code must set the `_result` and `_explanation` output variables to report the evaluation outcome and the reasoning behind it.
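A bare-bones sketch of that contract (using the `transcript` field that appears in the examples below, with purely illustrative check logic) might look like this:

```python
# Minimal shape of a code metric: read from `data`, then set both output variables
transcript = data["transcript"]

_result = len(transcript) > 0
_explanation = "Transcript was present" if _result else "Transcript was empty"
```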
Here’s a simple example that checks if the agent mentioned a specific product:
```python
# Check if the agent mentioned "Premium Plan" in the conversation
transcript = data["transcript"].lower()

if "premium plan" in transcript:
    _result = True
    _explanation = "Agent successfully mentioned the Premium Plan during the conversation"
else:
    _result = False
    _explanation = "Agent did not mention the Premium Plan in the conversation"
```
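A small variation on the same idea, in case the product can be referred to in more than one way (the phrase list below is purely illustrative):

```python
# Check whether the agent used any of several accepted phrasings
transcript = data["transcript"].lower()
accepted_phrases = ["premium plan", "premium subscription", "premium tier"]  # illustrative phrase list

matched = [phrase for phrase in accepted_phrases if phrase in transcript]
if matched:
    _result = True
    _explanation = f"Agent mentioned the product (matched: {', '.join(matched)})"
else:
    _result = False
    _explanation = "Agent did not use any of the accepted product phrasings"
```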
You can access the results of other metrics evaluated for the same call directly by metric name using `data["<metric name>"]`. This lets you build composite evaluations that depend on the outcomes of other metrics.
Example usage:
```python
# Access metric results directly by name
customer_satisfaction = data["Customer Satisfaction"]
response_time = data["Response Time"]
product_knowledge = data["Product Knowledge"]

# Each metric result contains the evaluation outcome
if customer_satisfaction and response_time < 60:
    _result = "Excellent"
    _explanation = "Customer was satisfied and response time was fast"
else:
    _result = "Needs Improvement"
    _explanation = "Customer satisfaction or response time fell short"
```
Here’s a more complex example that parses the transcript and checks whether the agent’s responses were sufficiently detailed:
```python
# Get transcript data
transcript = data["transcript"]
call_duration = data["call_duration"]  # available in the call data, not used in this example

# Collect the agent's responses from the transcript
agent_responses = []
for line in transcript.split('\n'):
    if line.strip().startswith('Agent:'):
        response = line.replace('Agent:', '').strip()
        agent_responses.append(response)

# Calculate average response length
if agent_responses:
    avg_response_length = sum(len(response) for response in agent_responses) / len(agent_responses)

    # Check if responses are detailed enough (more than 50 characters on average)
    if avg_response_length > 50:
        _result = True
        _explanation = f"Agent provided detailed responses with average length of {avg_response_length:.1f} characters"
    else:
        _result = False
        _explanation = f"Agent responses were too brief with average length of {avg_response_length:.1f} characters"
else:
    _result = False
    _explanation = "No agent responses found in transcript"
```
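Note that this example assumes each turn of the transcript appears on its own line and that the agent's turns are prefixed with `Agent:`; if your transcripts use a different speaker label or structure, adjust the parsing accordingly.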
Here’s an example that combines multiple metric results to create a comprehensive evaluation:
```python
# Access metric results directly by name
try:
    satisfaction = data["Customer Satisfaction"]
    response_time = data["Response Time"]

    # Create a composite score based on multiple metrics
    if satisfaction and response_time < 60:
        _result = "Excellent"
        _explanation = f"Customer was satisfied ({satisfaction}) and response time was fast ({response_time}s)"
    elif satisfaction and response_time < 120:
        _result = "Good"
        _explanation = f"Customer was satisfied ({satisfaction}) but response time was moderate ({response_time}s)"
    else:
        _result = "Needs Improvement"
        _explanation = f"Either customer satisfaction ({satisfaction}) or response time ({response_time}s) needs improvement"
except KeyError:
    _result = "Incomplete"
    _explanation = "Required metrics (Customer Satisfaction, Response Time) not found in results"
```
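If you would rather weight the individual signals than branch on fixed thresholds, the same metric results can be folded into an internal score before being mapped back to a label. The sketch below makes the same assumptions as the example above (a boolean `Customer Satisfaction` metric and `Response Time` in seconds); the weights and cutoffs are illustrative only.

```python
# Combine metric results into a weighted internal score, then map it to a label
try:
    satisfaction = data["Customer Satisfaction"]   # boolean, as in the example above
    response_time = data["Response Time"]          # seconds, as in the example above

    # Illustrative weighting: satisfaction counts for 70%, speed for 30%
    speed_score = max(0.0, 1.0 - (response_time / 120.0))  # 1.0 at 0s, 0.0 at 120s or slower
    score = 0.7 * (1.0 if satisfaction else 0.0) + 0.3 * speed_score

    if score >= 0.8:
        _result = "Excellent"
    elif score >= 0.5:
        _result = "Good"
    else:
        _result = "Needs Improvement"
    _explanation = f"Weighted score {score:.2f} from satisfaction={satisfaction} and response_time={response_time}s"
except KeyError:
    _result = "Incomplete"
    _explanation = "Required metrics (Customer Satisfaction, Response Time) not found in results"
```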