When the model hedges, force the tool call.
Ask a voice agent 'do you have Thursday at 2?' and a surprising fraction of the time the model answers 'we should have availability then' without checking the calendar. That is a hallucination with good manners. Why prompting doesn't fix it, and the mechanical guardrail that does.
Here is a failure mode you will not find on any vendor's demo call, because demos are rehearsed and this one only shows up in the wild. A caller asks a direct factual question: 'do you have anything Thursday at 2?' The agent has a perfectly good availability tool. The model does not call it. Instead it answers, warmly and fluently: 'Thursday at 2 should work, would you like me to book that?'
Nobody looked at a calendar. 'Should work' is not a lookup; it is a guess wearing the costume of one. If the slot is free, nobody ever notices. If it is not, the agent either fails at booking time in front of the caller, or worse, hands a human operator a commitment that the business cannot keep.
Why the model hedges
This is not a defect in any one model; we have observed it across every model family we have run in production. It is a byproduct of how assistants are trained. Human feedback rewards answers that are helpful, confident, and frictionless. Between 'let me check' (a pause, a tool call, work) and 'that should be fine' (instant, agreeable, usually true), the training gradient leans toward the second, because most of the time the hedge is not punished. Calendars are mostly open. Stock is mostly in. The model has learned what the world is usually like, and it answers from the prior.
The problem is that a business phone line is exactly the place where the prior is not good enough. The caller is not asking what calendars are usually like. They are asking about this calendar, this Thursday. An answer that is 85% likely to be right is, on a phone call that ends in a commitment, a 15% defect rate on the core transaction.
'Should work' is not a lookup. It is a guess wearing the costume of one, and it is usually right, which is exactly why it survives testing and fails in production.
Why you can't prompt your way out
The first thing every team tries is instructions: 'NEVER state availability without calling check_availability first.' It helps. The hedge rate drops. It does not reach zero, and it will not, for a structural reason: a prompt instruction is one signal competing with the model's entire trained disposition, across every phrasing a caller can produce. You are negotiating with a probability distribution. You can shift it; you cannot cap it.
Worse, instruction pressure has side effects. Stack enough ALWAYS and NEVER clauses and the model starts calling tools on turns that did not need them, apologizing for checks nobody asked for, and burning your prompt budget on defensive boilerplate that also has to be processed on every single turn. Reliability by exhortation is expensive and it still leaks.
The mechanical fix
The fix that holds is to take the decision away from the model on exactly the turns where the decision is not a judgment call. The pipeline classifies the caller's turn before the main model pass. When the turn is a check-worthy factual question (availability, price, inventory, an account fact), the model is invoked with the tool choice constrained: the only legal output for this pass is a call to the relevant tool. The model still fills in the arguments, still handles the ambiguity of 'sometime Thursday afternoon-ish,' still does everything it is genuinely good at. The one thing it can no longer do is skip the lookup, because the API contract, not the prompt, forbids it.
This is the same shape of solution as an append-only transcript or a database constraint: move the invariant out of the layer that behaves probabilistically and into a layer that behaves mechanically. Prompts are for judgment. Constraints are for rules. Confusing the two is the root of most agent reliability failures we see.
There is a real tradeoff to engineer around: the classifier has false positives, and a forced tool call on a turn that did not need one costs you a tool-turn's worth of latency (we have a separate post on exactly how expensive that is). In practice the check-worthy intents are a small, well-defined set with high classifier precision, and the asymmetry of costs settles the argument. An unnecessary lookup costs the caller one second. A phantom commitment costs the business a customer.
Catch what still slips through
No intent list is exhaustive, so behind the forced check sits a second mechanism: every turn the agent speaks is graded after the fact against the evidence the tools actually returned during that turn. An answer that asserts a price, an hour, or an open slot with no supporting tool result in the turn's evidence gets flagged, counted, and surfaced for review. The forced tool call prevents the hedge on the turns you anticipated; the grader measures the ones you did not. The rate from the grader tells you where to extend the intent list. The loop closes.
The vendor question
If you are evaluating agents that will make commitments on your behalf, ask one question: 'What, mechanically, prevents your agent from answering an availability question without checking?' Listen for the layer the answer lives in. 'Our prompts instruct it to always check' means the answer is a probability. 'The model cannot produce a non-tool answer on that class of turn, and here is our grader's slip-through rate for last month' means someone has been burned by the hedge and built the fence. You want the vendor with the fence.
AI voice agent for business
How Vorel picks up the phone, reads your CRM, books the appointment, and writes the audit row before the caller hangs up.
Read the guide
