Product · 5 min read

Dialect is not a locale code.

Setting language: 'ar' tells your voice agent which script to write, not what a caller in Amman actually sounds like, or how weird it feels when the reply comes back in newsreader Arabic. What it takes to build a voice agent that speaks the way its callers do, and why the same problem exists in English.

Vorel Engineering · Engineering

There is a moment familiar to anyone who has deployed a voice agent in the Arabic-speaking world. The pipeline works. The transcription is accurate, the reasoning is sound, the answer is correct. The agent speaks, and the caller's tone shifts, because the answer came back in Modern Standard Arabic: the register of news broadcasts and government forms, a register in which no one on either end of that phone call has ever ordered a coffee.

Nothing is wrong in any sense a linter would catch. Everything is wrong in the sense that matters on a phone call. The business sounds like it outsourced its front desk to a television anchor. And the cause is usually a single line of configuration somewhere: language: 'ar'.

What a locale code actually tells you

A locale code answers the questions software traditionally asked: which script, which direction, which decimal separator, which translation file. It answers none of the questions a voice conversation asks. What greeting does a clinic receptionist in Amman actually use? Is the caller's 'Thursday' rendered with the word a Jordanian would say or the one a textbook would print? When the price is 45 dinars, does the agent say it the way it is written or the way anyone would actually say it across a counter?

Arabic makes the gap unmissable because the distance between the written standard and any spoken dialect is enormous: different function words, different negation, different everyday vocabulary. Levantine, Gulf, Egyptian, and Maghrebi callers all write the same standard and none of them speak it. But the gap is not an Arabic quirk. An English voice agent that says 'your appointment has been scheduled for the fourteenth of July at fourteen fifteen' has made the same mistake in a smaller font. Register is spoken language's whole personality, and locale codes carry none of it.

The written standard is the register of news broadcasts and government forms. No one on either end of the call has ever ordered a coffee in it.

Register is a product decision

The uncomfortable implication is that 'what should the agent sound like' cannot be answered by an engineer at integration time. It is a decision about the business: a hospital wants warmer and more careful; a car dealership wants quicker and more colloquial; both are in the same city, both are 'ar'. So the deployment has to carry a register choice the way it carries a brand voice: which dialect family, how colloquial within it, which forms of address, what level of formality shift when delivering bad news.
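As a rough illustration, the register choice could travel with the deployment as a small typed profile. This is only a sketch: the type, its field names, and the two example businesses are hypothetical and do not describe any real product's configuration.

```typescript
// Hypothetical profile shape; every name here is an illustrative assumption.
type DialectFamily = "levantine" | "gulf" | "egyptian" | "maghrebi";

interface RegisterProfile {
  locale: string;                          // still needed for script and direction
  dialect: DialectFamily;                  // which dialect family the agent speaks
  colloquialness: 1 | 2 | 3;               // 1 = close to the standard, 3 = fully colloquial
  formsOfAddress: "formal" | "familiar";   // how the caller is addressed
  softenBadNews: boolean;                  // shift to more careful phrasing for bad news
}

// Two businesses in the same city: same locale, different register.
const hospital: RegisterProfile = {
  locale: "ar",
  dialect: "levantine",
  colloquialness: 2,
  formsOfAddress: "formal",
  softenBadNews: true,
};

const dealership: RegisterProfile = {
  locale: "ar",
  dialect: "levantine",
  colloquialness: 3,
  formsOfAddress: "familiar",
  softenBadNews: false,
};

// True when the locale code cannot tell two profiles apart but the register can.
function sameLocaleDifferentRegister(a: RegisterProfile, b: RegisterProfile): boolean {
  const registerDiffers =
    a.dialect !== b.dialect ||
    a.colloquialness !== b.colloquialness ||
    a.formsOfAddress !== b.formsOfAddress ||
    a.softenBadNews !== b.softenBadNews;
  return a.locale === b.locale && registerDiffers;
}
```

Both profiles carry locale "ar", so anything keyed on the locale alone would treat them as the same agent.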

Mechanically, that choice has to reach every layer that produces speech. The response instructions carry dialect-specific phrasing guidance, not just 'reply in Arabic.' The examples the model sees are written in the target register, because models imitate what they are shown far more faithfully than what they are told. And the rendering rules (numbers, times, prices, phone numbers read back digit by digit) are per-register, because '2:15' has a spoken form in each register, not one universal form.
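A per-register rendering rule can be sketched with the English version of the problem, where the same clock time has an "as written" reading and a conversational one. The register names, the function, and the phrasing choices below are assumptions made for illustration; a real deployment would take its spoken forms from native speakers of the target dialect.

```typescript
// Illustrative only: English stand-in for per-register time rendering.
type Register = "asWritten" | "colloquial";

const ONES = [
  "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
  "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
  "seventeen", "eighteen", "nineteen",
];
const TENS = ["", "", "twenty", "thirty", "forty", "fifty"];

// Spell out an integer from 0 to 59.
function words(n: number): string {
  if (n < 20) return ONES[n];
  const unit = n % 10;
  return TENS[Math.floor(n / 10)] + (unit ? "-" + ONES[unit] : "");
}

// Minutes as read after an hour: "hundred"/"o'clock" handled by the caller.
function minuteWords(minute: number): string {
  return minute < 10 ? "oh " + words(minute) : words(minute);
}

function speakTime(hour24: number, minute: number, register: Register): string {
  if (!Number.isInteger(hour24) || hour24 < 0 || hour24 > 23 ||
      !Number.isInteger(minute) || minute < 0 || minute > 59) {
    throw new RangeError("expected an hour in 0-23 and a minute in 0-59");
  }
  if (register === "asWritten") {
    // Reads the 24-hour digits literally, the way a form would print them.
    return minute === 0
      ? `${words(hour24)} hundred`
      : `${words(hour24)} ${minuteWords(minute)}`;
  }
  // Colloquial: 12-hour clock, quarters and halves, part of day.
  const hour12 = words(hour24 % 12 || 12);
  const nextHour12 = words((hour24 + 1) % 12 || 12);
  const partOfDay =
    hour24 < 12 ? "in the morning" : hour24 < 18 ? "in the afternoon" : "in the evening";
  if (minute === 0) return `${hour12} o'clock ${partOfDay}`;
  if (minute === 15) return `quarter past ${hour12} ${partOfDay}`;
  if (minute === 30) return `half past ${hour12} ${partOfDay}`;
  if (minute === 45) return `quarter to ${nextHour12} ${partOfDay}`;
  return `${hour12} ${minuteWords(minute)} ${partOfDay}`;
}

console.log(speakTime(14, 15, "asWritten"));   // fourteen fifteen
console.log(speakTime(14, 15, "colloquial"));  // quarter past two in the afternoon
```

The point of the sketch is the signature: the register is an input to rendering, alongside the value being rendered.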

The voice has to match the words

There is a failure mode subtler than wrong-register text: right-register text in the wrong mouth. TTS voices carry an accent, and a Gulf-accented voice reading fluent Levantine text lands in an uncanny valley that is worse than either mismatch alone; callers cannot name what is off, but they hear that something is. The register decision and the voice selection are one decision. A deployment picks a voice and a register as a pair, validated together, and changing one re-opens the other.
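One way to make "picked as a pair, validated together" hard to bypass is to record sign-off on the pair itself rather than on the voice or the register separately. The sketch below assumes hypothetical identifiers and a hypothetical in-memory registry; it only illustrates the rule that changing either half invalidates the sign-off.

```typescript
// Hypothetical identifiers and registry; not a real TTS or vendor API.
interface VoiceRegisterPair {
  voiceId: string;     // which TTS voice speaks
  registerId: string;  // which register the text is written in
}

class PairRegistry {
  private readonly signedOff = new Set<string>();

  private static key(pair: VoiceRegisterPair): string {
    return `${pair.voiceId}::${pair.registerId}`;
  }

  // Called after native listeners have approved this exact combination.
  signOff(pair: VoiceRegisterPair): void {
    this.signedOff.add(PairRegistry.key(pair));
  }

  // Only an approved combination may ship; either half alone proves nothing.
  isDeployable(pair: VoiceRegisterPair): boolean {
    return this.signedOff.has(PairRegistry.key(pair));
  }
}

const registry = new PairRegistry();
const current: VoiceRegisterPair = {
  voiceId: "voice-levantine-01",
  registerId: "levantine-colloquial",
};
registry.signOff(current);

// Swapping the voice re-opens the decision: the new pair has no sign-off.
const swappedVoice: VoiceRegisterPair = { ...current, voiceId: "voice-gulf-01" };

console.log(registry.isDeployable(current));       // true
console.log(registry.isDeployable(swappedVoice));  // false
```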

This pairing also constrains the input side. Recognition tuned for the written standard degrades on dialect speech precisely on the words that mark the register: the everyday function words, the local names, the colloquial yes and no. If you are measuring transcription quality on standard-Arabic test sets and deploying into dialect phone calls, your accuracy number is measuring the wrong language. We wrote separately about when to trust the transcript; the short version here is that dialect is one more reason the answer is 'less than the benchmark says.'
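To see whether the benchmark is measuring the wrong language, the same recognizer can be scored separately on a standard-register test set and on dialect recordings. Below is a minimal word error rate sketch using word-level edit distance. It assumes whitespace tokenization and does no text normalization, both of which are simplifications, and the sample shape is hypothetical.

```typescript
// Word-level Levenshtein distance: substitutions, insertions, deletions.
function editDistance(ref: string[], hyp: string[]): number {
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost, // substitution or match
      );
    }
    prev = curr;
  }
  return prev[hyp.length];
}

// Simplifying assumption: words are separated by whitespace.
function tokenize(text: string): string[] {
  return text.trim().split(/\s+/).filter(Boolean);
}

// Hypothetical sample shape: a human reference and the recognizer's output.
interface Sample {
  reference: string;
  hypothesis: string;
}

// Corpus-level WER: total word edits divided by total reference words.
function wordErrorRate(samples: Sample[]): number {
  let edits = 0;
  let refWords = 0;
  for (const s of samples) {
    const ref = tokenize(s.reference);
    edits += editDistance(ref, tokenize(s.hypothesis));
    refWords += ref.length;
  }
  return refWords === 0 ? 0 : edits / refWords;
}
```

Running wordErrorRate once per test set, standard and dialect, and reporting the two numbers side by side makes the gap visible instead of averaging it away.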

Test with ears, not with review

None of this can be validated by translation review, because the failure is not inaccuracy but the wrong register, and a bilingual reviewer looking at text will pass sentences that a native listener would wince at over the phone. The only test that works is listening: native speakers of the target dialect, on real calls or realistic recordings, asked not 'is this correct?' but 'who does this sound like?' The answer you are looking for is a person, ideally a specific kind of person: the good receptionist. The answer that sends you back to work is an institution: the news, the bank machine, the embassy.

Questions for a vendor operating in dialect-diverse markets

If your callers speak a dialect (and if your callers speak Arabic, they do), the evaluation questions are concrete. Which dialect will the agent speak, as opposed to which language? Can two businesses in the same country sound different? How are prices and times rendered, in writing and out loud? Was the voice validated against the register, or picked from a dropdown? And who, with native ears, signed off? A vendor who has actually shipped in these markets will have opinions on every one of these, usually strong ones, usually with a story attached. A vendor who answers 'we support Arabic' is telling you the story will be yours.
