The Metric Lab Auto Improve button rewrites a metric’s prompt based on the feedback annotations on its test sets. This guide shows how to run that same optimiser on a recurring schedule using a Claude Code routine and the Cekura MCP, so that newly annotated test sets feed back into your metric prompts automatically.
Optimiser input is human feedback, not raw test sets. The optimiser reads the annotations and notes you’ve left on MetricReview rows. If no new feedback has been added since the last run, re-running will not change the prompt. Treat this as a follow-on to your existing labelling cadence.

Prerequisites

1. Connect Claude Code to the Cekura MCP

   Follow the MCP overview for the one-line claude mcp add setup. OAuth is recommended.

2. Have at least one metric with labelled test sets

   The optimiser needs calls you’ve Added to Lab and Annotated through the Metric Lab workflow. Annotations and feedback are what the optimiser learns from.

3. Note the metric IDs you want the routine to operate on

   Open each metric in the dashboard and copy the metric ID from the URL (/metrics/<id>). That’s the only ID you need; the optimiser uses every test set already in the metric’s Lab automatically.

The routine prompt

Paste this into Claude Code, filling in the metric IDs. It chains the MCP tools the optimiser needs end-to-end.
You are auto-optimising Cekura metric prompts based on accumulated test-set
feedback. Run this for each metric_id below.

Metrics to optimise: [12345, 12346]

For each metric_id:

1. Call `metric_reviews_process_feedbacks` with just that metric_id. The
   optimiser will use every test set already in the metric's Lab. Capture
   the returned `progress_id`.

2. Poll `metric_reviews_process_feedbacks_progress` with that progress_id
   roughly every 30s until status is "success" or "error". If "error",
   report the error for that metric, skip its remaining steps, and continue
   to the next metric.

3. On success, read these fields from the progress output:
   - `output.improved_metric_description` and `output.improved_evaluation_trigger`
   - `output.meta_harness.optimized_code` and `output.meta_harness.type`
   - `output.meta_harness.score`, `num_correct`, `num_total` (the score after
     optimisation against the labelled set)

   Call `metrics_retrieve` for the same metric_id to read the *current*
   `description`, `evaluation_trigger`, `type`, and `custom_code`. Produce a
   unified diff between current and proposed for each of those fields. Also
   report the post-optimisation score (e.g. `7/7`) and whether the metric
   type changed (e.g. `basic` → `custom_code`).

4. Do NOT call `metrics_partial_update` yet. Print the diff and the score
   and stop for this metric.

After processing every metric, summarise: which had proposed changes, which
were unchanged, which errored. Do not apply any changes.
Step 4 deliberately stops short of saving. The optimiser sometimes proposes large rewrites — keep a human in the loop for the first few runs before letting the routine call metrics_partial_update directly.
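
As a rough illustration of the comparison step 3 asks Claude Code to perform, the sketch below builds a unified diff per field with Python's standard `difflib`. It is not part of the Cekura MCP: the helper names (`FIELD_SOURCES`, `dig`, `field_diffs`) are hypothetical, and the mapping from proposed output fields to current metric fields is an assumption inferred from the field names in the prompt.

```python
import difflib

# Assumed mapping: current metric field -> path inside the progress `output`.
# Inferred from step 3 of the routine prompt, not from a published schema.
FIELD_SOURCES = {
    "description": ("improved_metric_description",),
    "evaluation_trigger": ("improved_evaluation_trigger",),
    "type": ("meta_harness", "type"),
    "custom_code": ("meta_harness", "optimized_code"),
}


def dig(data, path):
    """Follow a tuple of keys into nested dicts; return None if any is missing."""
    for key in path:
        if not isinstance(data, dict) or key not in data:
            return None
        data = data[key]
    return data


def field_diffs(current, output):
    """Return {field: unified diff text} for fields whose proposed value differs."""
    diffs = {}
    for field, path in FIELD_SOURCES.items():
        proposed = dig(output, path)
        before = current.get(field)
        if proposed is None or proposed == before:
            continue  # nothing proposed for this field, or unchanged
        lines = difflib.unified_diff(
            str(before or "").splitlines(),
            str(proposed).splitlines(),
            fromfile=f"current/{field}",
            tofile=f"proposed/{field}",
            lineterm="",
        )
        diffs[field] = "\n".join(lines)
    return diffs
```

An empty result corresponds to the "unchanged" bucket in the final summary; a result containing `type` corresponds to a metric type change such as `basic` to `custom_code`.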

Schedule it

Once the prompt produces clean diffs you’re comfortable with, schedule it with Claude Code’s /schedule slash command:
/schedule weekly on Monday at 09:00

<paste the routine prompt above>
Claude Code will run the routine on that schedule and surface the diffs in your inbox or Claude Code session each time it runs. See Claude Code Routines for the full slash-command reference.

A typical cadence:
  • Weekly if your team labels reviews regularly (recommended).
  • Daily only if you have an active labelling workflow producing dozens of new annotations per day.
  • Monthly if labelling is bursty (e.g. quarterly audits).

Auto-apply (advanced)

Once you trust the routine, swap step 4 for:
4. Call `metrics_partial_update` with the metric_id and every changed
   field from step 3 — typically some combination of `description`,
   `evaluation_trigger`, `type`, and `custom_code`. The optimiser may
   convert a `basic` metric into a `custom_code` metric, so pass both
   `type` and `custom_code` when they differ from current.

   Then call `metrics_run_reviews_create` for the same metric to re-score
   every linked test set against the new prompt. Report the before/after
   score delta from the run.
This applies the new prompt and immediately verifies it didn’t regress the labelled set. If the score delta is negative for any metric, revert manually from the Metric Lab UI.
metrics_partial_update is destructive in effect. It overwrites the live prompt that your production calls are evaluated against. Always run the read-only version of the routine for a week or two before enabling auto-apply.
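
To make the "every changed field" rule and the regression check concrete, here is a minimal sketch in Python. Both helpers (`build_update_payload`, `score_delta`) are hypothetical illustrations rather than Cekura APIs, and it assumes the proposed values have already been mapped onto the metric's own field names and that scores are available as `(num_correct, num_total)` pairs.

```python
# The four fields the auto-apply step may pass to metrics_partial_update.
UPDATABLE_FIELDS = ("description", "evaluation_trigger", "type", "custom_code")


def build_update_payload(current, proposed):
    """Keep only fields whose proposed value is set and differs from current."""
    return {
        name: proposed[name]
        for name in UPDATABLE_FIELDS
        if proposed.get(name) is not None and proposed[name] != current.get(name)
    }


def score_delta(before, after):
    """Change in fraction correct; each argument is (num_correct, num_total)."""
    before_rate = before[0] / before[1] if before[1] else 0.0
    after_rate = after[0] / after[1] if after[1] else 0.0
    return after_rate - before_rate
```

An empty payload means there is nothing to apply for that metric, and a negative `score_delta` is the signal to revert from the Metric Lab UI.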

How it maps to the Metric Lab UI

| Routine step | Equivalent in the Metric Lab UI |
| --- | --- |
| metric_reviews_process_feedbacks | Clicking Auto Improve |
| metric_reviews_process_feedbacks_progress | The progress panel polling that task |
| Reviewing the diff before saving | View Changes → diff view |
| metrics_partial_update | Save |
| metrics_run_reviews_create | Run (re-score the test set) |

Troubleshooting

The routine reports “no changes proposed” every run.
No new feedback has been added since the last run. The optimiser is deterministic on a fixed set of annotations.

The improved_metric_description looks identical to current, but the score went up.
The optimiser sometimes leaves the description untouched and instead converts the metric into a custom_code wrapper around an enhanced prompt. Check output.meta_harness.optimized_code and output.meta_harness.type; that’s where the real change lives. Step 3 above already diffs these fields.

Progress polling times out.
The optimiser can take several minutes for metrics with many test sets. Increase the polling interval or raise your routine’s timeout. Do not retry mid-flight, because that will start a second concurrent optimisation.

I want to optimise against a specific subset of test sets, not the whole Lab.
Pass test_set_ids explicitly in step 1 instead of omitting it. The optimiser will use exactly those IDs.
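
For the polling timeout case, the following sketch shows one way to wait on a single `progress_id` with an overall deadline, without ever starting a second optimisation. It is an assumption-laden illustration: `fetch_progress` is a hypothetical callable standing in for `metric_reviews_process_feedbacks_progress`, and the only status values relied on are the two named in the routine prompt, "success" and "error".

```python
import time


def wait_for_progress(fetch_progress, progress_id, interval=30.0, timeout=900.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll one existing progress_id until it finishes or the deadline passes.

    fetch_progress is a hypothetical callable that returns a dict with a
    "status" key. This function never re-triggers the optimiser; it only
    re-reads the same progress_id.
    """
    deadline = clock() + timeout
    while True:
        progress = fetch_progress(progress_id)
        if progress.get("status") in ("success", "error"):
            return progress
        # Give up before sleeping if the next poll would land past the deadline.
        if clock() + interval > deadline:
            raise TimeoutError(
                f"progress {progress_id} still running after {timeout}s"
            )
        sleep(interval)
```

On `TimeoutError`, the safer move is to report the metric as still running and poll the same `progress_id` again later, rather than calling the start tool a second time.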