Overview

If your agent serves users in more than one language — or handles code-switching (e.g. Spanglish, Hinglish) — you can’t rely on English-only tests to tell you whether it actually works. Multilingual testing on Cekura uses language-specific personalities to drive evaluators in the target language and surface issues that only show up outside English, like:
  • ASR (speech-to-text) mis-transcriptions on non-English audio
  • TTS (text-to-speech) pronouncing names or numbers incorrectly in the target language
  • The agent silently falling back to English mid-conversation
  • Workflow metrics passing in one language but failing in another

How It Works

Multilingual testing is driven by two primitives:
  1. Personalities — each personality has a configured language (and optionally an accent, pace, or code-switching pattern). The evaluator speaks to your agent using that personality, in that language.
  2. Per-language evaluator runs — the same evaluator can be run against multiple personalities to verify that your agent handles the same workflow consistently across languages.
See the Personality docs for the full list of configuration options, and Creating Custom Personalities for how to add a language or accent that isn’t in the default set.
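As a sketch of how the two primitives combine, the snippet below plans one evaluator run per personality. All class names, fields, and the `plan_runs` helper are illustrative assumptions, not the actual Cekura schema or API:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical models for the two primitives; field names are
# assumptions, not the real Cekura schema.
@dataclass
class Personality:
    name: str
    language: str                  # e.g. "es-MX"
    accent: Optional[str] = None
    code_switching: bool = False

@dataclass
class Evaluator:
    name: str
    scenario: str                  # kept identical across languages
    expected_outcome: str

def plan_runs(evaluator: Evaluator,
              personalities: List[Personality]) -> List[Tuple[str, str, str]]:
    # One run per (evaluator, personality) pair: same workflow,
    # exercised once in each configured language.
    return [(evaluator.name, p.name, p.language) for p in personalities]

booking = Evaluator("book-appointment",
                    "Book a cleaning for next Tuesday",
                    "appointment_confirmed")
personalities = [
    Personality("English caller", "en-US"),
    Personality("Spanish caller", "es-MX"),
    Personality("Spanglish caller", "en-US", code_switching=True),
]
runs = plan_runs(booking, personalities)  # three runs, one per language variant
```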

Running a Multilingual Test

1. Confirm your agent supports the target languages

Check your agent’s STT, LLM, and TTS configuration for each language you want to test. A common failure mode is that the LLM prompt is multilingual but the STT is locked to en-US — the agent will never “hear” the non-English turns correctly.
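That en-US lock can be caught before any calls are placed. A minimal preflight sketch, assuming a made-up config shape (not a real Cekura or STT-vendor schema):

```python
# Hypothetical agent config; the shape is an assumption for illustration.
agent_config = {
    "stt": {"locales": ["en-US"]},                     # common failure: STT locked to English
    "llm": {"prompt_languages": ["en", "es"]},
    "tts": {"voices": {"en-US": "voice-a", "es-MX": "voice-b"}},
}

def missing_stt_locales(config, target_locales):
    # Flag every language you plan to test that the STT cannot hear.
    configured = set(config["stt"]["locales"])
    return [loc for loc in target_locales if loc not in configured]

gaps = missing_stt_locales(agent_config, ["en-US", "es-MX"])
# es-MX is in the LLM prompt languages but invisible to the STT
```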
2. Pick or create a personality per language

Use a built-in personality (e.g. Spanglish) or create a custom personality for the specific language, accent, or code-switching pattern you want to test. Common patterns:
  • Pure target language — caller speaks only Spanish for the whole call
  • Code-switching — caller mixes English and the target language within sentences
  • Accented English — caller speaks English with a non-native accent (tests ASR robustness)
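The three patterns can be written down as personality settings. Every field name below is an assumption for illustration, not an actual Cekura configuration option:

```python
# Hypothetical personality settings for the three common patterns;
# all field names here are assumptions, not real Cekura options.
patterns = {
    "pure_target":      {"language": "es-MX", "mix_language": None},
    "code_switching":   {"language": "es-MX", "mix_language": "en-US",
                         "switch_style": "within-sentence"},
    "accented_english": {"language": "en-US", "accent": "es",
                         "stresses": "ASR robustness"},
}
```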
3. Attach the personality to your evaluators

You have two options — pick whichever fits your workflow:
  • Duplicate the evaluator per language — take an evaluator that already passes in English, duplicate it, and assign the language-specific personality to each copy. Good when you want the per-language variants saved as distinct evaluators you can re-run independently.
  • Override personality at runtime — keep a single evaluator and override the personality when you run it. On the evaluators page, select the evaluators you want to run and use the Override Personality option to pick one or more personalities for this run. Cekura executes each selected evaluator against every selected personality. Good when you want to sweep the same workflow across many languages without creating N copies.
Either way, keep the scenario instructions and expected outcomes identical so you’re measuring the same workflow across languages.
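The runtime-override path amounts to a cross product: each selected evaluator is executed against each selected personality. A sketch with made-up evaluator and personality names:

```python
from itertools import product

# Runtime-override sweep: every selected evaluator runs against every
# selected personality. The names below are invented examples.
evaluators = ["book-appointment", "cancel-appointment"]
personalities = ["English caller", "Spanish caller", "Spanglish caller"]

run_matrix = list(product(evaluators, personalities))
# 2 evaluators x 3 personalities = 6 runs, with no duplicated evaluators
```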
4. Run and compare per language

Run each language’s evaluator set and compare pass rates side-by-side. Use the A/B Testing compare view to diff two language runs of the same evaluator.
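Outside the compare view, the same side-by-side comparison can be done on exported results. A sketch assuming a simple row shape per run (the rows and field names here are invented):

```python
from collections import defaultdict

# Invented result rows: one per executed (evaluator, personality) run.
results = [
    {"evaluator": "book-appointment", "language": "en-US", "passed": True},
    {"evaluator": "book-appointment", "language": "en-US", "passed": True},
    {"evaluator": "book-appointment", "language": "es-MX", "passed": True},
    {"evaluator": "book-appointment", "language": "es-MX", "passed": False},
]

def pass_rates_by_language(rows):
    # Group outcomes by language so per-language rates sit side by side.
    totals, passes = defaultdict(int), defaultdict(int)
    for r in rows:
        totals[r["language"]] += 1
        passes[r["language"]] += r["passed"]
    return {lang: passes[lang] / totals[lang] for lang in totals}

rates = pass_rates_by_language(results)
# en-US at 1.0 vs es-MX at 0.5: the gap flags a language-specific regression
```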

What to Watch For

Symptom → likely cause:
  • Expected outcome fails, transcript looks garbled → STT not configured for the target language
  • Agent responds in English even though the caller is speaking Spanish → LLM prompt isn’t instructing the agent to mirror the caller’s language
  • Names, dates, or numbers sound wrong on playback → TTS voice or language setting mismatched
  • Works in pure Spanish but fails on code-switching → agent is language-locking on the first turn; the prompt needs to handle mid-call switches

Best Practices

Hold the Workflow Constant, Vary the Language

When isolating language-related issues, keep scenario instructions and expected outcomes identical across language variants. If you change the workflow and the language at the same time, you won’t know which one caused a regression.
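This invariant is easy to check mechanically. A sketch assuming each language variant exposes its scenario text and expected outcome (the structure is an assumption, not an export format):

```python
# Guard that language variants differ ONLY in personality: scenario
# instructions and expected outcomes must match. Structure is invented.
variants = [
    {"personality": "English caller", "scenario": "Book a cleaning",
     "expected": "appointment_confirmed"},
    {"personality": "Spanish caller", "scenario": "Book a cleaning",
     "expected": "appointment_confirmed"},
]

def workflow_is_constant(variants):
    # Every variant must share a single (scenario, expected) pair.
    return len({(v["scenario"], v["expected"]) for v in variants}) == 1

ok = workflow_is_constant(variants)  # True: only the personality varies
```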

Test the Seams, Not Just the Middle

Most multilingual bugs live at transitions: the opening turn (before language is established), the handoff to a tool call, and the farewell. Make sure your evaluators exercise all three.

Include Code-Switching Explicitly

Many real users mix languages. If your product is used in markets where this is common, add at least one code-switching personality per language pair — don’t assume pure-language tests cover it.

Measure Per Language, Not Just Overall

A 90% pass rate that’s 99% in English and 70% in Spanish is hiding a real problem. Break down results per personality (and therefore per language) rather than relying on the aggregate number.
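The arithmetic behind that trap, with illustrative counts (an aggregate near 90% masking a 70% Spanish rate):

```python
# Illustrative (passed, total) counts per language; English volume
# dominates, so the aggregate hides the Spanish gap.
runs = {"en-US": (99, 100), "es-MX": (35, 50)}

overall = sum(p for p, _ in runs.values()) / sum(t for _, t in runs.values())
per_language = {lang: p / t for lang, (p, t) in runs.items()}
# overall is ~0.893, yet es-MX sits at 0.70 while en-US is at 0.99
```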