Setup steps and authentication are in the Overview. This page covers triggering runs and reading their output.
Pick a mode
The right command depends on the transport:
| Mode | When to use |
|---|---|
| text | Fast functional check; no audio, scenarios run as text exchanges |
| voice | Outbound voice call via your provider (Vapi/Retell/etc.) |
| chirp | Cekura’s hosted voice runner |
| livekit_v2 / pipecat_v2 | Self-hosted LiveKit or Pipecat agents |
| sip | SIP-based provider (e.g. Plivo, Twilio SIP) |
| websocket | Custom WebSocket protocol |
| vapi_webrtc, retell_webrtc, elevenlabs | Provider WebRTC paths |
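If you script around the modes above, it can help to reject a mistyped mode before anything is triggered. The following is a minimal sketch: the mode names come from the table, but `RUN_MODES` and `validate_mode` are hypothetical helpers and are not part of the Cekura CLI or SDK.

```python
# Mode names are copied from the table above; the helper itself is a
# hypothetical illustration, not part of the Cekura CLI or SDK.
RUN_MODES = {
    "text": "Fast functional check; no audio, scenarios run as text exchanges",
    "voice": "Outbound voice call via your provider",
    "chirp": "Cekura's hosted voice runner",
    "livekit_v2": "Self-hosted LiveKit agents",
    "pipecat_v2": "Self-hosted Pipecat agents",
    "sip": "SIP-based provider",
    "websocket": "Custom WebSocket protocol",
    "vapi_webrtc": "Provider WebRTC path",
    "retell_webrtc": "Provider WebRTC path",
    "elevenlabs": "Provider WebRTC path",
}


def validate_mode(mode: str) -> str:
    """Return the normalized mode name, or raise ValueError if it is unknown."""
    normalized = mode.strip().lower()
    if normalized not in RUN_MODES:
        known = ", ".join(sorted(RUN_MODES))
        raise ValueError(f"unknown mode {mode!r}; expected one of: {known}")
    return normalized
```

Calling `validate_mode(" Text ")` returns `"text"`, while an unlisted name such as `"webrtc"` raises `ValueError` with the list of known modes.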
Trigger a run
- CLI
- SDK
Triggering a run returns a result_id.
Poll status
- CLI
- SDK
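Polling on a result_id usually follows the same shape whichever client you use. The sketch below shows a generic poll-until-done loop. It does not call the Cekura SDK: `fetch_status` is a hypothetical callable you would supply, and the terminal status names (`completed`, `failed`) are assumptions to check against the API Reference.

```python
import time
from typing import Callable

# Assumed terminal states; confirm the real values in the API Reference.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_result(
    fetch_status: Callable[[str], str],
    result_id: str,
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status(result_id) until a terminal status or the timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status(result_id)
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"result {result_id} still {status!r} after {timeout_s}s"
            )
        # Wait before asking again so the API is not hammered.
        sleep(interval_s)
```

The `sleep` parameter is injectable so the loop can be tested without waiting. In real use, `fetch_status` would wrap whichever CLI or SDK status call you rely on.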
Inspect a run
- CLI
- SDK
Live operations on in-progress runs
- CLI
- SDK
Vote on a metric result
Capture thumbs up/down feedback on a specific metric evaluation for a run, and optionally attach the expected value and free-text feedback. The run is marked as reviewed and the metric evaluation is updated. This feeds the labs / metric-review workflow.
- CLI
- SDK
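The note that follows says --expected-value is parsed as JSON when possible. A minimal sketch of that kind of JSON-first parsing is shown here. The helper name `parse_expected_value` is hypothetical, and falling back to the raw string for input that is not valid JSON is an assumption rather than documented CLI behavior.

```python
import json
from typing import Any


def parse_expected_value(raw: str) -> Any:
    """Decode raw as JSON if it is valid JSON; otherwise keep it as a string."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Assumed fallback: treat anything that is not JSON as plain text.
        return raw
```

With this rule, `5` becomes an integer, `true` becomes a boolean, and `"foo"` (with quotes) becomes the string `foo`. That is why numeric and boolean metrics end up with the right type.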
--expected-value is parsed as JSON when possible (5, true, "foo"), so numeric and boolean metrics get the right type.
Expected outcome
If a scenario has an expected_outcome, you can evaluate (or re-evaluate) whether the run met it, then thumbs-vote on the outcome verdict the same way you vote on metric results.
- SDK
Flag a critical-scenario verdict as wrong
- SDK
Improve the agent’s prompt from run failures
Iterate on the agent’s prompt using the failure pattern across recent runs.
- SDK
Re-evaluate without re-running
If you change a metric prompt and want to score existing runs against the new definition:
- CLI
- SDK
Promote a run into a test set
Take a passing run and freeze it as a regression dataset:
- CLI
- SDK
See also
Evaluators
The scenarios that runs execute.
Metrics
Define how runs are scored.
Calls
Production calls — same scoring engine, different input.
API Reference
Full field reference for runs, results, and run-mode payloads.