Setup steps and authentication are in the Overview. This page covers triggering runs and reading their output.
A run is one execution of a scenario against an agent. A result is the parent batch (one trigger → many runs, one per scenario × personality combination). The CLI and SDK expose both.
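To make the run/result relationship concrete, the sketch below enumerates the runs a single trigger would produce. It is illustrative only: the `expand_runs` helper and the personality IDs are hypothetical and are not part of the Cekura SDK.

```python
from itertools import product


def expand_runs(scenario_ids, personality_ids):
    """Return one (scenario_id, personality_id) pair per run in a result."""
    return list(product(scenario_ids, personality_ids))


# Hypothetical IDs: 3 scenarios x 2 personalities give 6 runs under one result.
runs = expand_runs([1, 2, 3], [10, 11])
print(len(runs))  # 6
```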

Pick a mode

The right command depends on the transport:
| Mode | When to use |
| --- | --- |
| text | Fast functional check: no audio, scenarios run as text exchanges |
| voice | Outbound voice call via your provider (Vapi/Retell/etc.) |
| chirp | Cekura’s hosted voice runner |
| livekit_v2 / pipecat_v2 | Self-hosted LiveKit or Pipecat agents |
| sip | SIP-based provider (e.g. Plivo, Twilio SIP) |
| websocket | Custom WebSocket protocol |
| vapi_webrtc, retell_webrtc, elevenlabs | Provider WebRTC paths |
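If you select the mode programmatically (for example from a CI variable), a small lookup can reject typos before anything is triggered. This is a minimal sketch: the mode names and descriptions come from the list above, while `MODES` and `describe_mode` are hypothetical helpers, not Cekura APIs.

```python
# Mode names as listed above; the descriptions are shortened from the same list.
MODES = {
    "text": "Fast functional check, no audio",
    "voice": "Outbound voice call via your provider",
    "chirp": "Cekura's hosted voice runner",
    "livekit_v2": "Self-hosted LiveKit agent",
    "pipecat_v2": "Self-hosted Pipecat agent",
    "sip": "SIP-based provider",
    "websocket": "Custom WebSocket protocol",
    "vapi_webrtc": "Provider WebRTC path",
    "retell_webrtc": "Provider WebRTC path",
    "elevenlabs": "Provider WebRTC path",
}


def describe_mode(mode):
    """Return the description for a known mode, or raise ValueError."""
    try:
        return MODES[mode]
    except KeyError:
        known = ", ".join(sorted(MODES))
        raise ValueError(f"unknown mode {mode!r}; expected one of: {known}") from None


print(describe_mode("text"))
```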

Trigger a run

# Text mode (fastest for iteration)
cekura scenarios run-text \
  --agent-id 123 \
  --scenario-ids 1,2,3

# Voice via Cekura's hosted runner
cekura scenarios run-chirp \
  --agent-id 123 \
  --scenario-ids 1,2,3

# Self-hosted LiveKit
cekura scenarios run-livekit-v2 --from-file run.json
Each command returns a JSON envelope with a top-level result_id.
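When scripting, you will usually want to pull `result_id` out of that envelope and pass it to the polling step. A minimal sketch, assuming the command's stdout has already been captured as a string: `extract_result_id` is a hypothetical helper, and every field in the sample other than the top-level `result_id` is invented for illustration.

```python
import json


def extract_result_id(stdout):
    """Parse a run command's JSON output and return its top-level result_id."""
    envelope = json.loads(stdout)
    if not isinstance(envelope, dict) or "result_id" not in envelope:
        raise KeyError("no top-level result_id in command output")
    return envelope["result_id"]


# Invented sample envelope; only the top-level result_id key is documented.
sample = '{"result_id": 42, "note": "placeholder field"}'
print(extract_result_id(sample))  # 42
```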

Poll status

cekura runs list --result-id <result-id> --format json
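To wait for completion, loop in a Bash script (the [[ ... =~ ... ]] test is Bash-specific):
# Poll every 5 seconds until no run is still pending or running
while true; do
  # Collect the distinct statuses across all runs in the result
  STATUSES=$(cekura runs list --result-id "$RESULT_ID" --format json | jq -r '.[].status' | sort -u)
  [[ "$STATUSES" =~ pending|running ]] || break
  sleep 5
done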
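The shell loop above has no upper bound on how long it waits. The sketch below shows the same polling idea in Python with a timeout. It shells out to the `cekura runs list` command shown above and assumes, as the `jq` filter does, that the output is a JSON array of objects with a `status` field. The `fetch_statuses` and `wait_for_result` helpers and the default timeout are hypothetical.

```python
import json
import subprocess
import time

IN_PROGRESS = {"pending", "running"}  # the statuses the shell loop waits on


def fetch_statuses(result_id):
    """Call the CLI and return the set of distinct run statuses."""
    out = subprocess.run(
        ["cekura", "runs", "list", "--result-id", str(result_id), "--format", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return {run["status"] for run in json.loads(out)}


def wait_for_result(result_id, fetch=fetch_statuses, interval=5.0, timeout=600.0):
    """Poll until no run is pending or running; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout
    while True:
        statuses = fetch(result_id)
        if not statuses & IN_PROGRESS:
            return statuses  # every run has left the in-progress states
        if time.monotonic() >= deadline:
            raise TimeoutError(f"result {result_id} still in progress: {sorted(statuses)}")
        time.sleep(interval)
```

Passing `fetch` as a parameter keeps the loop testable without the CLI installed: a stub that returns canned status sets can stand in for the real command.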

Inspect a run

cekura runs get 5544
cekura runs list --agent-id 123

Live operations on in-progress runs

# Get a live listen URL while a voice run is in progress
cekura runs listen-url 5544

# End an in-progress call
cekura runs end-call 5544

Vote on a metric result

Capture thumbs-up or thumbs-down feedback on a specific metric evaluation for a run, optionally attaching the expected value and free-text feedback. The command marks the run as reviewed and updates the metric evaluation. The vote feeds the labs / metric-review workflow.
cekura runs mark-metric-vote 5544 \
  --metric-id 55 \
  --thumbs-down \
  --expected-value 2 \
  --feedback "Agent missed the reschedule step"
--expected-value is parsed as JSON when possible (5, true, "foo"), so numeric / boolean metrics get the right type.
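The "parse as JSON when possible" behavior can be pictured with the sketch below. It illustrates the general technique and is not the CLI's actual implementation. In particular, falling back to the raw string when parsing fails is an assumption, and `parse_expected_value` is a hypothetical name.

```python
import json


def parse_expected_value(raw):
    """Parse raw as JSON when possible; otherwise keep it as a plain string."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Assumed fallback: treat input that is not valid JSON as a literal string.
        return raw


print(parse_expected_value("5"))      # 5
print(parse_expected_value("true"))   # True
print(parse_expected_value('"foo"'))  # foo
print(parse_expected_value("foo"))    # foo
```

Note that shells strip unescaped double quotes, so passing a JSON string literal on the command line generally needs outer single quotes, as in `'"foo"'`.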

Expected outcome

If a scenario has an expected_outcome, you can (re)evaluate whether the run met it, then thumbs-vote on the outcome verdict the same way you vote on metric results. The calls in this and the next two sections use the SDK client rather than the CLI.
# Re-score expected outcome for this run
client.runs.run_expected_outcome(run_id=5544)

# Thumbs vote on the outcome verdict
client.runs.mark_expected_outcome_vote(
    run_id=5544,
    thumbs_up=True,
    feedback="Agent did escalate to a human",
)

Flag a critical-scenario verdict as wrong

client.runs.mark_critical_scenario_wrong(run_id=5544, scenario_id=42)
client.runs.unmark_critical_scenario_wrong(run_id=5544, scenario_id=42)

Improve the agent’s prompt from run failures

Iterate on the agent’s prompt using the failure pattern across recent runs.
# Background job — returns a progress_id you can poll
job = client.runs.improve_prompt_bg(
    run_ids=[5544, 5545, 5546],
    agent=123,
    prompt="<current agent system prompt>",
)
client.runs.improve_prompt_progress(progress_id=job["progress_id"])

# One-shot synchronous variant (smaller batches)
client.runs.improve_prompt(run_ids=[5544], agent=123, prompt="<current>")

# Just the categorized failure issues, without rewriting the prompt
client.runs.improve_prompt_issues(run_ids=[5544, 5545], agent=123)

Re-evaluate without re-running

If you change a metric prompt and want to score existing runs against the new definition:
cekura runs rerun 987 --metric-ids 55,56

Promote a run into a test set

Take a passing run and freeze it as a regression dataset:
cekura test-sets create-from-run --run-id 5544 --name "regression-2024-04"

See also

Evaluators

The scenarios that runs execute.

Metrics

Define how runs are scored.

Calls

Production calls — same scoring engine, different input.

API Reference

Full field reference for runs, results, and run-mode payloads.