Overview

If your agent serves users in more than one language — or handles code-switching (e.g. Spanglish, Hinglish) — you can’t rely on English-only tests to tell you whether it actually works. Multilingual testing on Cekura uses language-specific personalities to drive evaluators in the target language and surface issues that only show up outside English, like:
  • ASR (speech-to-text) mis-transcriptions on non-English audio
  • TTS (text-to-speech) pronouncing names or numbers incorrectly in the target language
  • The agent silently falling back to English mid-conversation
  • Workflow metrics passing in one language but failing in another

How It Works

Multilingual testing is driven by two primitives:
  1. Personalities — each personality has a configured language (and optionally an accent, pace, or code-switching pattern). The evaluator speaks to your agent using that personality, in that language.
  2. Per-language evaluator runs — the same evaluator can be run against multiple personalities to verify that your agent handles the same workflow consistently across languages.
See the Personality docs for the full list of configuration options, and Creating Custom Personalities for how to add a language or accent that isn’t in the default set.
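As a sketch of how the two primitives combine, the snippet below plans one evaluator run per personality. All class names, fields, and the `plan_runs` helper are illustrative assumptions, not the actual Cekura schema or API:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical models for the two primitives; field names are
# assumptions, not the real Cekura schema.
@dataclass
class Personality:
    name: str
    language: str                  # e.g. "es-MX"
    accent: Optional[str] = None
    code_switching: bool = False

@dataclass
class Evaluator:
    name: str
    scenario: str                  # kept identical across languages
    expected_outcome: str

def plan_runs(evaluator: Evaluator,
              personalities: List[Personality]) -> List[Tuple[str, str, str]]:
    # One run per (evaluator, personality) pair: same workflow,
    # exercised once in each configured language.
    return [(evaluator.name, p.name, p.language) for p in personalities]

booking = Evaluator("book-appointment",
                    "Book a cleaning for next Tuesday",
                    "appointment_confirmed")
personalities = [
    Personality("English caller", "en-US"),
    Personality("Spanish caller", "es-MX"),
    Personality("Spanglish caller", "en-US", code_switching=True),
]
runs = plan_runs(booking, personalities)  # three runs, one per language variant
```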

Running a Multilingual Test

1. Confirm your agent supports the target languages

Check your agent’s STT, LLM, and TTS configuration for each language you want to test. A common failure mode is that the LLM prompt is multilingual but the STT is locked to en-US — the agent will never “hear” the non-English turns correctly.
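That en-US lock can be caught before any calls are placed. A minimal preflight sketch, assuming a made-up config shape (not a real Cekura or STT-vendor schema):

```python
# Hypothetical agent config; the shape is an assumption for illustration.
agent_config = {
    "stt": {"locales": ["en-US"]},                     # common failure: STT locked to English
    "llm": {"prompt_languages": ["en", "es"]},
    "tts": {"voices": {"en-US": "voice-a", "es-MX": "voice-b"}},
}

def missing_stt_locales(config, target_locales):
    # Flag every language you plan to test that the STT cannot hear.
    configured = set(config["stt"]["locales"])
    return [loc for loc in target_locales if loc not in configured]

gaps = missing_stt_locales(agent_config, ["en-US", "es-MX"])
# es-MX is in the LLM prompt languages but invisible to the STT
```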
2. Pick or create a personality per language

Use a built-in personality (e.g. Spanglish) or create a custom personality for the specific language, accent, or code-switching pattern you want to test. Common patterns:
  • Pure target language — caller speaks only Spanish for the whole call
  • Code-switching — caller mixes English and the target language within sentences
  • Accented English — caller speaks English with a non-native accent (tests ASR robustness)
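The three patterns can be written down as personality settings. Every field name below is an assumption for illustration, not an actual Cekura configuration option:

```python
# Hypothetical personality settings for the three common patterns;
# all field names here are assumptions, not real Cekura options.
patterns = {
    "pure_target":      {"language": "es-MX", "mix_language": None},
    "code_switching":   {"language": "es-MX", "mix_language": "en-US",
                         "switch_style": "within-sentence"},
    "accented_english": {"language": "en-US", "accent": "es",
                         "stresses": "ASR robustness"},
}
```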
3. Attach the personality to your evaluators

You have two options — pick whichever fits your workflow:
  • Duplicate the evaluator per language — take an evaluator that already passes in English, duplicate it, and assign the language-specific personality to each copy. Good when you want the per-language variants saved as distinct evaluators you can re-run independently.
  • Override personality at runtime — keep a single evaluator and override the personality when you run it. On the evaluators page, select the evaluators you want to run and use the Override Personality option to pick one or more personalities for this run. Cekura executes each selected evaluator against every selected personality. Good when you want to sweep the same workflow across many languages without creating N copies.
Either way, keep the scenario instructions and expected outcomes identical so you’re measuring the same workflow across languages.
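The runtime-override path amounts to a cross product: each selected evaluator is executed against each selected personality. A sketch with made-up evaluator and personality names:

```python
from itertools import product

# Runtime-override sweep: every selected evaluator runs against every
# selected personality. The names below are invented examples.
evaluators = ["book-appointment", "cancel-appointment"]
personalities = ["English caller", "Spanish caller", "Spanglish caller"]

run_matrix = list(product(evaluators, personalities))
# 2 evaluators x 3 personalities = 6 runs, with no duplicated evaluators
```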
4. Run and compare per language

Run each language’s evaluator set and compare pass rates side-by-side. Use the A/B Testing compare view to diff two language runs of the same evaluator.
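Outside the compare view, the same side-by-side comparison can be done on exported results. A sketch assuming a simple row shape per run (the rows and field names here are invented):

```python
from collections import defaultdict

# Invented result rows: one per executed (evaluator, personality) run.
results = [
    {"evaluator": "book-appointment", "language": "en-US", "passed": True},
    {"evaluator": "book-appointment", "language": "en-US", "passed": True},
    {"evaluator": "book-appointment", "language": "es-MX", "passed": True},
    {"evaluator": "book-appointment", "language": "es-MX", "passed": False},
]

def pass_rates_by_language(rows):
    # Group outcomes by language so per-language rates sit side by side.
    totals, passes = defaultdict(int), defaultdict(int)
    for r in rows:
        totals[r["language"]] += 1
        passes[r["language"]] += r["passed"]
    return {lang: passes[lang] / totals[lang] for lang in totals}

rates = pass_rates_by_language(results)
# en-US at 1.0 vs es-MX at 0.5: the gap flags a language-specific regression
```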

What to Watch For

Symptom → likely cause:
  • Expected outcome fails, transcript looks garbled → STT not configured for the target language
  • Agent responds in English even though the caller is speaking Spanish → LLM prompt isn’t instructing the agent to mirror the caller’s language
  • Names, dates, or numbers sound wrong on playback → TTS voice or language setting mismatched
  • Works in pure Spanish but fails on code-switching → agent is language-locking on the first turn; the prompt needs to handle mid-call switches

Best Practices

Hold the Workflow Constant, Vary the Language

When isolating language-related issues, keep scenario instructions and expected outcomes identical across language variants. If you change the workflow and the language at the same time, you won’t know which one caused a regression.
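This invariant is easy to check mechanically. A sketch assuming each language variant exposes its scenario text and expected outcome (the structure is an assumption, not an export format):

```python
# Guard that language variants differ ONLY in personality: scenario
# instructions and expected outcomes must match. Structure is invented.
variants = [
    {"personality": "English caller", "scenario": "Book a cleaning",
     "expected": "appointment_confirmed"},
    {"personality": "Spanish caller", "scenario": "Book a cleaning",
     "expected": "appointment_confirmed"},
]

def workflow_is_constant(variants):
    # Every variant must share a single (scenario, expected) pair.
    return len({(v["scenario"], v["expected"]) for v in variants}) == 1

ok = workflow_is_constant(variants)  # True: only the personality varies
```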

Test the Seams, Not Just the Middle

Most multilingual bugs live at transitions: the opening turn (before language is established), the handoff to a tool call, and the farewell. Make sure your evaluators exercise all three.

Include Code-Switching Explicitly

Many real users mix languages. If your product is used in markets where this is common, add at least one code-switching personality per language pair — don’t assume pure-language tests cover it.

Measure Per Language, Not Just Overall

A 90% pass rate that’s 99% in English and 70% in Spanish is hiding a real problem. Break down results per personality (and therefore per language) rather than relying on the aggregate number.
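The arithmetic behind that trap, with illustrative counts (an aggregate near 90% masking a 70% Spanish rate):

```python
# Illustrative (passed, total) counts per language; English volume
# dominates, so the aggregate hides the Spanish gap.
runs = {"en-US": (99, 100), "es-MX": (35, 50)}

overall = sum(p for p, _ in runs.values()) / sum(t for _, t in runs.values())
per_language = {lang: p / t for lang, (p, t) in runs.items()}
# overall is ~0.893, yet es-MX sits at 0.70 while en-US is at 0.99
```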