The Metric Lab Auto Improve button rewrites a metric’s prompt based on the feedback annotations on its test sets. This guide shows how to run that same optimiser on a recurring schedule using a Claude Code routine and the Cekura MCP, so that newly annotated test sets feed back into your metric prompts automatically.
Optimiser input is human feedback, not raw test sets. The optimiser reads the annotations and notes you’ve left on MetricReview rows. If no new feedback has been added since the last run, re-running will not change the prompt. Treat this as a follow-on to your existing labelling cadence.
Prerequisites
Connect Claude Code to the Cekura MCP
Follow the MCP overview for the one-line claude mcp add setup. OAuth is recommended.
Have at least one metric with labelled test sets
The optimiser needs calls you’ve Added to Lab and Annotated through the Metric Lab workflow — annotations and feedback are what the optimiser learns from.
The routine prompt
Paste this into Claude Code, filling in the metric IDs. It chains the MCP tools the optimiser needs end-to-end.
Schedule it
Once the prompt produces clean diffs you’re comfortable with, schedule it with Claude Code’s /schedule slash command:
- Weekly if your team labels reviews regularly (recommended).
- Daily only if you have an active labelling workflow producing dozens of new annotations per day.
- Monthly if labelling is bursty (e.g. quarterly audits).
Auto-apply (advanced)
Once you trust the routine, swap step 4 for a step that calls metrics_partial_update to save the improved prompt.
metrics_partial_update is destructive in effect. It overwrites the live prompt your production agent evaluates against. Always run the read-only version of the routine for a week or two before enabling auto-apply.
How it maps to the Metric Lab UI
| Routine step | Equivalent in the Metric Lab UI |
|---|---|
| metric_reviews_process_feedbacks | Clicking Auto Improve |
| metric_reviews_process_feedbacks_progress | The progress panel polling that task |
| Reviewing the diff before saving | View Changes → diff view |
| metrics_partial_update | Save |
| metrics_run_reviews_create | Run (re-score the test set) |
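
Read top to bottom, the table describes a call order. The Python sketch below illustrates that order with a read-only default. Only the four tool names come from this guide. The `call_tool` helper, the argument names (`metric_id`, `description`), and the shape of the returned result are hypothetical and will need adjusting to the real tool schemas.

```python
from typing import Any, Callable, Optional

# Hypothetical bridge: forwards a tool name and arguments to the Cekura MCP.
ToolCaller = Callable[[str, dict], dict]


def run_routine(
    call_tool: ToolCaller,
    metric_id: int,
    test_set_ids: Optional[list] = None,
    apply: bool = False,
) -> dict:
    """Run one optimisation pass; write to the metric only when apply=True."""
    args: dict[str, Any] = {"metric_id": metric_id}  # assumed argument name
    if test_set_ids is not None:
        # Leaving the key out lets the optimiser use the whole Lab.
        args["test_set_ids"] = test_set_ids

    # Equivalent of clicking Auto Improve.
    call_tool("metric_reviews_process_feedbacks", args)
    # Equivalent of the progress panel. A single read here; a real routine polls.
    result = call_tool(
        "metric_reviews_process_feedbacks_progress", {"metric_id": metric_id}
    )

    if not apply:
        # Read-only mode: return the proposal so a human can review the diff.
        return {"applied": False, "proposal": result}

    # Equivalent of Save. This overwrites the live prompt, hence the gate.
    call_tool(
        "metrics_partial_update",
        {
            "metric_id": metric_id,
            # Assumed location of the proposed prompt in the result.
            "description": result.get("improved_metric_description"),
        },
    )
    # Equivalent of Run: re-score the test set against the new prompt.
    call_tool("metrics_run_reviews_create", {"metric_id": metric_id})
    return {"applied": True, "proposal": result}
```

Keeping `apply` off by default mirrors the advice to run the read-only version before enabling auto-apply.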
Troubleshooting
The routine reports “no changes proposed” every run. No new feedback has been added since the last run. The optimiser is deterministic on a fixed set of annotations.
The improved_metric_description looks identical to current, but the score went up. The optimiser sometimes leaves the description untouched and instead converts the metric into a custom_code wrapper around an enhanced prompt. Check output.meta_harness.optimized_code and output.meta_harness.type; that’s where the real change lives. Step 3 above already diffs these fields.
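
A comparison that covers the description as well as the meta_harness fields could look like the minimal Python sketch below. It assumes the caller has already copied the current and proposed values into two flat dicts with the hypothetical keys `description`, `optimized_code`, and `type`; how those values are nested in the real responses is not specified here.

```python
import difflib

# Hypothetical flat keys; map them from the real payloads before calling.
FIELDS = ("description", "optimized_code", "type")


def diff_fields(current: dict, proposed: dict) -> dict:
    """Return a unified diff per changed field; an empty dict means no change."""
    changes = {}
    for field in FIELDS:
        before = str(current.get(field) or "")
        after = str(proposed.get(field) or "")
        if before == after:
            continue
        changes[field] = "\n".join(
            difflib.unified_diff(
                before.splitlines(),
                after.splitlines(),
                fromfile=f"current/{field}",
                tofile=f"proposed/{field}",
                lineterm="",
            )
        )
    return changes
```

Because the function reports each field separately, an unchanged description alongside a changed `type` or `optimized_code` shows up as a change rather than as "no changes proposed".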
Progress polling times out. The optimiser can take several minutes for metrics with many test sets. Increase the polling interval or raise your routine’s timeout. Do not retry mid-flight, because a retry starts a second concurrent optimisation.
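
The polling behaviour described here can be sketched in Python as follows. The `fetch_progress` callable is a hypothetical stand-in for a call to metric_reviews_process_feedbacks_progress, and the `status` key with its `completed` and `failed` values is an assumption about the payload, not documented behaviour.

```python
import time
from typing import Callable


def wait_for_optimiser(
    fetch_progress: Callable[[], dict],
    interval_s: float = 15.0,
    timeout_s: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll one already-started task until it finishes or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        progress = fetch_progress()
        # Assumed payload shape; adapt the key and values to the real response.
        if progress.get("status") in ("completed", "failed"):
            return progress
        if time.monotonic() + interval_s > deadline:
            # Stop waiting, but never re-submit: the task may still be running.
            raise TimeoutError(
                "optimiser still running; raise timeout_s instead of retrying"
            )
        sleep(interval_s)
```

The function only ever reads progress. Starting the optimisation stays outside the loop, so a timeout cannot trigger a second concurrent run.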
I want to optimise against a specific subset of test sets, not the whole Lab. Pass test_set_ids explicitly in step 1 instead of omitting it. The optimiser will use exactly those IDs.