Engineering · 5 min read

When the model hedges, force the tool call.

Ask a voice agent 'do you have Thursday at 2?' and a surprising fraction of the time the model answers 'we should have availability then' without checking the calendar. That is a hallucination with good manners. Why prompting doesn't fix it, and the mechanical guardrail that does.

Vorel Engineering

Here is a failure mode you will not find on any vendor's demo call, because demos are rehearsed and this one only shows up in the wild. A caller asks a direct factual question: 'do you have anything Thursday at 2?' The agent has a perfectly good availability tool. The model does not call it. Instead it answers, warmly and fluently: 'Thursday at 2 should work, would you like me to book that?'

Nobody looked at a calendar. 'Should work' is not a lookup; it is a guess wearing the costume of one. If the slot is free, nobody ever notices. If it is not, the agent either fails at booking time in front of the caller, or worse, hands a human operator a commitment that the business cannot keep.

Why the model hedges

This is not a defect in any one model; we have observed it across every model family we have run in production. It is a byproduct of how assistants are trained. Human feedback rewards answers that are helpful, confident, and frictionless. Between 'let me check' (a pause, a tool call, work) and 'that should be fine' (instant, agreeable, usually true), the training gradient leans toward the second, because most of the time the hedge is not punished. Calendars are mostly open. Stock is mostly in. The model has learned what the world is usually like, and it answers from the prior.

The problem is that a business phone line is exactly the place where the prior is not good enough. The caller is not asking what calendars are usually like. They are asking about this calendar, this Thursday. An answer that is, say, 85% likely to be right becomes, on a phone call that ends in a commitment, a 15% defect rate on the core transaction.

'Should work' is not a lookup. It is a guess wearing the costume of one, and it is usually right, which is exactly why it survives testing and fails in production.

Why you can't prompt your way out

The first thing every team tries is instructions: 'NEVER state availability without calling check_availability first.' It helps. The hedge rate drops. It does not reach zero, and it will not, for a structural reason: a prompt instruction is one signal competing with the model's entire trained disposition, across every phrasing a caller can produce. You are negotiating with a probability distribution. You can shift it; you cannot cap it.

Worse, instruction pressure has side effects. Stack enough ALWAYS and NEVER clauses and the model starts calling tools on turns that did not need them, apologizing for checks nobody asked for, and burning your prompt budget on defensive boilerplate that has to be processed on every single turn. Reliability by exhortation is expensive, and it still leaks.

The mechanical fix

The fix that holds is to take the decision away from the model on exactly the turns where the decision is not a judgment call. The pipeline classifies the caller's turn before the main model pass. When the turn is a check-worthy factual question (availability, price, inventory, an account fact), the model is invoked with the tool choice constrained: the only legal output for this pass is a call to the relevant tool. The model still fills in the arguments, still handles the ambiguity of 'sometime Thursday afternoon-ish,' still does everything it is genuinely good at. The one thing it can no longer do is skip the lookup, because the API contract, not the prompt, forbids it.
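
As a rough sketch of that routing step, the snippet below classifies the caller's turn and returns a tool-choice constraint for the model pass. It is an illustration under stated assumptions, not our production code: the regex classifier is a toy stand-in for a real intent model, only check_availability is named in this post (the other tool names and intent labels are hypothetical), and the shape of the tool_choice dictionary is modeled loosely on the forced-tool option that several LLM APIs expose. Field names differ by provider, so check your provider's reference before copying it.

```python
import re
from typing import Optional

# Hypothetical mapping: check-worthy intent -> the only tool the model may call.
# Only "check_availability" appears in the article; the rest are placeholders.
FORCED_TOOL_BY_INTENT = {
    "availability": "check_availability",
    "price": "get_price",
    "inventory": "check_inventory",
}

# Toy classifier patterns. A real pipeline would use a trained intent model.
INTENT_PATTERNS = {
    "availability": re.compile(
        r"\b(available|availability|opening|slot|monday|tuesday|wednesday"
        r"|thursday|friday|saturday|sunday)\b"
    ),
    "price": re.compile(r"\b(price|cost|how much)\b"),
    "inventory": re.compile(r"\b(in stock|inventory)\b"),
}


def classify_intent(utterance: str) -> Optional[str]:
    """Return the first check-worthy intent that matches, or None."""
    text = utterance.lower()
    for intent, pattern in INTENT_PATTERNS.items():
        if pattern.search(text):
            return intent
    return None


def build_tool_choice(utterance: str) -> dict:
    """Decide, before the model runs, whether this pass must be a tool call."""
    intent = classify_intent(utterance)
    if intent is None:
        # Not a check-worthy question: leave the decision to the model.
        return {"type": "auto"}
    # Check-worthy: the only legal output is a call to the mapped tool.
    # The model still chooses the arguments (which Thursday, which time).
    return {"type": "tool", "name": FORCED_TOOL_BY_INTENT[intent]}


print(build_tool_choice("Do you have anything Thursday at 2?"))
print(build_tool_choice("Thanks, that's all for today."))
```

The returned dictionary would be passed along with the tool definitions on the model request. The point of the design is that the constraint is computed outside the model, so no phrasing from the caller can talk the model out of the lookup.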

This is the same shape of solution as an append-only transcript or a database constraint: move the invariant out of the layer that behaves probabilistically and into a layer that behaves mechanically. Prompts are for judgment. Constraints are for rules. Confusing the two is the root of most agent reliability failures we see.

There is a real tradeoff to engineer around: the classifier has false positives, and a forced tool call on a turn that did not need one costs you a tool-turn's worth of latency (we have a separate post on exactly how expensive that is). In practice the check-worthy intents are a small, well-defined set with high classifier precision, and the asymmetry of costs settles the argument. An unnecessary lookup costs the caller one second. A phantom commitment costs the business a customer.
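
The cost asymmetry can be made concrete with a small break-even calculation. The sketch below assumes a simple model: of the turns the classifier flags, a fraction equal to its precision truly needed a lookup, and an unforced model would hedge on some of those and be wrong on some of the hedges. Every number in the example call is illustrative, not a measurement. The 0.15 simply echoes the 85% example earlier in this post.

```python
def break_even_cost_ratio(
    precision: float, hedge_rate: float, wrong_when_hedged: float
) -> float:
    """How many times costlier than a wasted lookup a phantom commitment
    must be before forcing the tool call pays off, per flagged turn."""
    # Flagged turns that did not need a lookup (classifier false positives).
    wasted_lookups = 1.0 - precision
    # Flagged turns where an unforced model would have hedged and been wrong.
    phantom_commitments = precision * hedge_rate * wrong_when_hedged
    return wasted_lookups / phantom_commitments


# Illustrative inputs only: 95% precision, 10% hedge rate, 15% of hedges wrong.
ratio = break_even_cost_ratio(
    precision=0.95, hedge_rate=0.10, wrong_when_hedged=0.15
)
print(f"Forcing pays off once a phantom commitment costs more than "
      f"{ratio:.1f}x a wasted lookup")
```

Under those assumed inputs the break-even ratio is about 3.5, so forcing the call wins as soon as a broken commitment is a few times worse than a second of latency. Plug in your own measured rates; the function divides by zero if any of the three inputs is zero.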

Catch what still slips through

No intent list is exhaustive, so behind the forced check sits a second mechanism: every turn the agent speaks is graded after the fact against the evidence the tools actually returned during that turn. An answer that asserts a price, an hour, or an open slot with no supporting tool result in the turn's evidence gets flagged, counted, and surfaced for review. The forced tool call prevents the hedge on the turns you anticipated; the grader measures the ones you did not. The rate from the grader tells you where to extend the intent list. The loop closes.
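
A minimal sketch of that after-the-fact check might look like the following. It assumes a hypothetical Turn record holding the agent's spoken text and the tool results gathered during the turn, and it uses crude regexes to spot claims about a price, an hour, or an open slot. It only checks that a supporting tool was called at all; a fuller grader would also compare the spoken value against what the tool returned. All names and patterns here are illustrative.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Turn:
    agent_text: str
    # Evidence the tools returned during this turn, e.g. {"tool": ..., "result": ...}.
    tool_results: list = field(default_factory=list)


# Hypothetical claim detectors for the three claim types named in the article.
CLAIM_PATTERNS = {
    "price": re.compile(r"\$\s?\d+(?:\.\d{2})?"),
    "hour": re.compile(r"\b\d{1,2}(?::\d{2})?\s?(?:am|pm)\b", re.IGNORECASE),
    "slot": re.compile(r"\b(?:should work|is available|is open|is free)\b", re.IGNORECASE),
}

# Hypothetical mapping: which tools count as evidence for which claim type.
SUPPORTING_TOOLS = {
    "price": {"get_price"},
    "hour": {"check_availability", "get_hours"},
    "slot": {"check_availability"},
}


def grade_turn(turn: Turn) -> list:
    """Return the claim types asserted without a supporting tool result."""
    tools_called = {r["tool"] for r in turn.tool_results}
    return [
        claim
        for claim, pattern in CLAIM_PATTERNS.items()
        if pattern.search(turn.agent_text)
        and not (tools_called & SUPPORTING_TOOLS[claim])
    ]


def slip_through_rate(turns: list) -> float:
    """Fraction of turns with at least one unsupported claim."""
    if not turns:
        return 0.0
    return sum(1 for t in turns if grade_turn(t)) / len(turns)


hedged = Turn("Thursday at 2 should work, would you like me to book that?")
checked = Turn(
    "Yes, Thursday at 2 pm is open.",
    [{"tool": "check_availability", "result": {"free": True}}],
)
print(grade_turn(hedged))   # unsupported slot claim gets flagged
print(grade_turn(checked))  # supported by a tool result, nothing flagged
```

Grouping the flagged turns by claim type is one way to see which intent is missing from the forced-check list.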

The vendor question

If you are evaluating agents that will make commitments on your behalf, ask one question: 'What, mechanically, prevents your agent from answering an availability question without checking?' Listen for the layer the answer lives in. 'Our prompts instruct it to always check' means the answer is a probability. 'The model cannot produce a non-tool answer on that class of turn, and here is our grader's slip-through rate for last month' means someone has been burned by the hedge and built the fence. You want the vendor with the fence.

Keep reading
Engineering · May 24, 2026 · 10 min read

Grading every voice turn. The production loop that catches hallucinations before they ship.

A voice agent can hallucinate on one turn out of fifty and still feel pretty good in aggregate. The problem is that one turn quoting a wrong price, inventing a hold time, or claiming a feature that does not exist is the turn that gets escalated, screenshotted, and tweeted. We grade every agent turn in production against a small list of factual constraints, and route any unsupported claim straight into the operator queue. Here is the architecture, the prompt, and why a per-turn grader beats a weekly eval.

Engineering · July 1, 2026 · 5 min read

The tool-call tax: every lookup costs a second trip through the model.

A voice turn that calls a tool runs the LLM twice: once to decide what to look up, once to read the result and answer. That second trip, not the lookup itself, is why tool turns feel twice as slow as plain ones. Where the tax comes from and the five levers that actually reduce it.

Engineering · May 24, 2026 · 9 min read

Designing the audit row: what every voice agent should write back to your CRM.

The audit row is the single most important artifact a voice agent produces. It is not the transcript and it is not the reasoning trace. It is a structured record, written into the operator's system of record, that answers 'what happened on that call?' without anyone having to listen to it. Here is the field model we ship and the design choices behind it.
