The tool-call tax: every lookup costs a second trip through the model.
A voice turn that calls a tool runs the LLM twice: once to decide what to look up, once to read the result and answer. That second trip, not the lookup itself, is why tool turns feel twice as slow as plain ones. This post covers where the tax comes from and the five levers that actually reduce it.
In an earlier post we made the case that voice latency is the LLM and everything else is a rounding error. This post is about the worst case of that rule: the turn where the agent has to look something up. Ask a production voice agent 'do you have anything Thursday afternoon?' and time it against 'what are your hours?' The tool turn is reliably about twice as slow, and the lookup itself, a single indexed query, accounts for almost none of the difference.
Anatomy of a tool turn
plain turn:
  context assembly            ~10 ms
  LLM pass (answer)           600-1200 ms
  TTS first chunk             120-240 ms
tool turn:
  context assembly            ~10 ms
  LLM pass 1 (decide+call)    500-900 ms    ← full prefill
  tool execution              80-400 ms
  LLM pass 2 (read+answer)    600-1200 ms   ← full prefill AGAIN
  TTS first chunk             120-240 ms
The two model passes are sequential by construction: the second pass cannot start until the tool result exists, and the tool cannot run until the first pass names it and fills in the arguments. Each pass pays the full cost of ingesting the conversation so far: the system prompt, the business context, the transcript, the tool definitions. On a long call that prefill is thousands of tokens, and you are paying to process it twice within a single caller-perceived pause.
That is the tool-call tax. It is not the database, it is not the network, and it is not the size of your tool catalog per se. It is the second sequential trip through the model, with a full re-read of context, sitting inside one silence.
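To make the arithmetic concrete, here is a minimal sketch that sums the stage ranges from the breakdown above. It treats the "~10 ms" context assembly figure as exactly 10 ms, which is an assumption for the sake of the sum, and the dictionary names are illustrative.

```python
# Stage budgets in milliseconds as (low, high), copied from the breakdown above.
# Context assembly is listed as "~10 ms"; it is treated as exactly 10 here.
PLAIN_TURN = {
    "context assembly": (10, 10),
    "LLM pass (answer)": (600, 1200),
    "TTS first chunk": (120, 240),
}
TOOL_TURN = {
    "context assembly": (10, 10),
    "LLM pass 1 (decide+call)": (500, 900),
    "tool execution": (80, 400),
    "LLM pass 2 (read+answer)": (600, 1200),
    "TTS first chunk": (120, 240),
}


def total(stages):
    """Sum the low ends and the high ends of every stage in a turn."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


plain = total(PLAIN_TURN)
tool = total(TOOL_TURN)
print(f"plain turn: {plain[0]}-{plain[1]} ms")
print(f"tool turn:  {tool[0]}-{tool[1]} ms")
print(f"ratio: {tool[0] / plain[0]:.1f}x to {tool[1] / plain[1]:.1f}x")
```

With these ranges the plain turn comes to 730-1450 ms and the tool turn to 1310-2750 ms, a ratio of roughly 1.8x to 1.9x.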
The levers, in order of impact
Cache the prefix. The system prompt, business context, and tool definitions do not change between pass one and pass two; only the tail of the conversation does. Most major inference providers now support prompt caching, and structuring your prompt so the static content sits first and the volatile content sits last means the second-pass prefill hits the cache and costs a fraction of the first. This is the single largest lever and it is also the least glamorous: it is prompt layout discipline, not a new model. We wrote up the cost side of this separately; the latency side is, if anything, the better half of the deal.
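The layout rule can be sketched without any provider API. In the example below each list element stands in for a span of tokens, and `build_prompt` and `shared_prefix_len` are hypothetical helpers; real caches match on token prefixes and their exact behavior is provider-specific, so treat this as an illustration of ordering only.

```python
def build_prompt(system_prompt, business_context, tool_definitions, transcript):
    """Static content first, volatile content last, so consecutive passes share a prefix."""
    static_prefix = [system_prompt, business_context, tool_definitions]
    return static_prefix + list(transcript)


def shared_prefix_len(a, b):
    """Count the leading segments that are identical in both prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


transcript = ["caller: do you have anything Thursday afternoon?"]
pass_one = build_prompt("SYSTEM", "BUSINESS", "TOOLS", transcript)
pass_two = build_prompt(
    "SYSTEM", "BUSINESS", "TOOLS",
    transcript + ["tool_call: check_availability", "tool_result: (slots)"],
)
good = shared_prefix_len(pass_one, pass_two)

# Counter-example: a volatile value placed first changes on every pass,
# so nothing after it can match a cached prefix.
bad_one = ["now=12:00:01"] + pass_one
bad_two = ["now=12:00:02"] + pass_two
bad = shared_prefix_len(bad_one, bad_two)

print(good, bad)
```

In the good layout all four segments of pass one are a prefix of pass two; in the bad layout the shared prefix is empty even though most of the content is identical.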
Parallelize independent tools. If the model wants the customer record and the schedule, those lookups do not depend on each other, and firing them sequentially buys you nothing but latency. The subtle part is that this is not purely a framework setting: the model has to emit both calls in one pass rather than asking, reading, and asking again. That is shaped by how the tools are described and how the agent is instructed. Get it right and a two-tool turn collapses from two taxes to one.
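On the execution side, running the calls from one pass concurrently is the easy half. A minimal sketch using `asyncio.gather`, assuming the model has already emitted both calls in a single pass; `fake_lookup` and the tool names are hypothetical stand-ins that sleep instead of querying anything.

```python
import asyncio

in_flight = 0
max_in_flight = 0


async def fake_lookup(name, delay_s):
    """Hypothetical read-only tool: sleeps instead of hitting a real backend."""
    global in_flight, max_in_flight
    in_flight += 1
    max_in_flight = max(max_in_flight, in_flight)
    await asyncio.sleep(delay_s)
    in_flight -= 1
    return {"tool": name, "ok": True}


async def run_tool_calls(calls):
    """Run every call from one model pass concurrently; results keep call order."""
    return await asyncio.gather(*(fake_lookup(name, delay) for name, delay in calls))


results = asyncio.run(
    run_tool_calls([("get_customer", 0.05), ("get_schedule", 0.08)])
)
print([r["tool"] for r in results], "max in flight:", max_in_flight)
```

The wall-clock cost is the slower lookup rather than the sum of both. None of this helps if the model asks for the second tool only after reading the first result, which is why the tool descriptions and agent instructions matter as much as the executor.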
Speak while the second pass runs. The first pass tells you everything you need to keep the caller oriented: which tool is being called, with which arguments. That is enough to say 'let me check Thursday for you' out loud, honestly, while pass two is still prefilling. Done right, the caller never experiences the tax at all; the holding line covers exactly the window where work is happening. Done wrong, it degenerates into the reflexive filler problem, which we have a whole separate post about. The difference is that here the filler is generated from the actual tool call, so it cannot fire when nothing is happening.
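The property that keeps this honest is that the holding line is a function of the tool call, so no tool call means no line. A minimal sketch of that mapping, with hypothetical tool names and templates; a real pipeline would hand the returned string to TTS while pass two runs.

```python
# Hypothetical tool names mapped to spoken templates.
HOLDING_LINES = {
    "check_availability": "Let me check {day} for you.",
    "get_customer": "One moment while I pull up your account.",
}


def holding_line(tool_call):
    """Return a spoken line derived from the actual tool call, or None if there is none."""
    if not tool_call:
        return None
    template = HOLDING_LINES.get(tool_call["name"])
    if template is None:
        return None
    try:
        return template.format(**tool_call.get("arguments", {}))
    except KeyError:
        # A required argument is missing: stay silent rather than say something wrong.
        return None


print(holding_line({"name": "check_availability", "arguments": {"day": "Thursday"}}))
print(holding_line(None))
```

Returning `None` for a plain turn or an unknown tool is the design choice that separates this from reflexive filler: the agent only speaks when there is real work to describe.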
Skip the second pass when the result speaks for itself. A successful booking does not need a frontier model to phrase 'you're booked for Thursday at 2:15.' When the tool result is structured and the response shape is known, a template renders it in microseconds and the caller cannot tell the difference. The second pass is for results that need judgment: ambiguity, partial failures, tradeoffs. Spending it on 'confirmation successful' is pure tax.
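One way to sketch the routing decision: render from a template when the result is a clean success with a known shape, and return nothing otherwise so the caller falls back to the second model pass. The tool name `book_appointment` and the `status`, `day`, and `time` fields are assumptions made for the example.

```python
def render_without_second_pass(tool_name, result):
    """Return a templated reply for clear-cut results, or None to fall back to the model."""
    if (
        tool_name == "book_appointment"
        and result.get("status") == "confirmed"
        and "day" in result
        and "time" in result
    ):
        return f"You're booked for {result['day']} at {result['time']}."
    # Ambiguity, partial failures, tradeoffs: leave these to pass two.
    return None


confirmed = render_without_second_pass(
    "book_appointment", {"status": "confirmed", "day": "Thursday", "time": "2:15"}
)
conflict = render_without_second_pass(
    "book_appointment", {"status": "conflict", "alternatives": ["2:45", "3:30"]}
)
print(confirmed)
print(conflict)
```

The conflict case returns `None` on purpose: offering alternatives is a judgment call, and that is what the second pass is for.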
Speculate on the obvious lookup. On some turns you know, before the model does, what it will need. A caller who has just been read two appointment slots and says 'the second one' is going to trigger a booking; a caller who opens with an account question is going to need their record. Kicking off the likely lookup while the model is still thinking converts the tool execution window to zero when the guess is right, and wastes a cheap read when it is wrong. This is the fussiest lever on the list: it needs guardrails so that only side-effect-free reads are ever speculated. Speculatively writing to a calendar is how you book phantom appointments.
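The guardrail can be sketched as an allowlist checked before any speculative call is started. Everything named here (`READ_ONLY_TOOLS`, `run_tool`, `speculate`, `resolve`, and the tool names) is hypothetical; the sleeps stand in for real lookups and for the model's first pass.

```python
import asyncio

READ_ONLY_TOOLS = {"get_customer", "get_schedule"}  # hypothetical allowlist
calls_made = []


async def run_tool(name, args):
    """Hypothetical tool executor: records the call and sleeps instead of querying."""
    calls_made.append(name)
    await asyncio.sleep(0.01)
    return {"tool": name, "args": args}


def speculate(name, args):
    """Start a likely lookup early; refuse anything not on the read-only allowlist."""
    if name not in READ_ONLY_TOOLS:
        return None
    return (name, args, asyncio.create_task(run_tool(name, args)))


async def resolve(speculation, actual_name, actual_args):
    """Reuse the speculative result on a correct guess, otherwise run the real call."""
    if speculation is not None:
        name, args, task = speculation
        if (name, args) == (actual_name, actual_args):
            return await task
        task.cancel()  # wrong guess: at worst a cheap read was wasted
    return await run_tool(actual_name, actual_args)


async def demo():
    guess = speculate("get_schedule", {"day": "Thursday"})
    await asyncio.sleep(0.02)  # stands in for the model's first pass
    hit = await resolve(guess, "get_schedule", {"day": "Thursday"})
    blocked = speculate("book_appointment", {"slot": 2})  # a write: never speculated
    return hit, blocked


hit, blocked = asyncio.run(demo())
print(hit, blocked, calls_made)
```

On a correct guess the lookup runs once and has already finished by the time the model names it. The write is rejected before any task is created, so it can never reach the calendar speculatively.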
The tax is the second sequential trip through the model with a full re-read of context. Every lever that matters either makes that trip cheaper, hides it behind honest speech, or deletes it.
What doesn't move it
Faster hardware helps both passes equally, which means it does not change the ratio: tool turns stay about twice as slow as plain turns, and the caller feels ratios, not milliseconds. Shaving transport overhead is worth single-digit milliseconds against a tax of roughly a second. And cutting your tool catalog from twenty tools to eight shortens the prefill a little but leaves the second trip fully intact; it is an optimization of the tax form, not the tax.
The question to ask a vendor
Ask for turn latency split by turn type: plain turns and tool turns, p50 and p95 for each. A single blended latency number hides the tax, because the easy majority of turns drags the average down while the turns that matter most to the caller (the ones where the agent is actually doing something with their data) sit out in the tail. A vendor who cannot produce the split has not instrumented their own pipeline deeply enough to have paid the tax down. A vendor who can will usually be happy to talk about it, because the work above is real engineering and everyone who has done it knows it.
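A minimal sketch of the split, using a nearest-rank percentile over logged `(turn_type, latency_ms)` pairs. The sample values are synthetic and chosen only to show the effect; they are not measurements.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def split_report(turns):
    """Group (turn_type, latency_ms) samples; report p50/p95 per type and blended."""
    groups = {"blended": [ms for _, ms in turns]}
    for turn_type, ms in turns:
        groups.setdefault(turn_type, []).append(ms)
    return {
        name: {"p50": percentile(vals, 50), "p95": percentile(vals, 95)}
        for name, vals in groups.items()
    }


# Synthetic samples for illustration only: eight plain turns, two tool turns.
turns = [("plain", 800)] * 8 + [("tool", 1900), ("tool", 2400)]
report = split_report(turns)
for name, stats in report.items():
    print(name, stats)
```

With this sample the blended p50 is 800 ms, identical to the plain-turn p50, while the tool-turn p50 is 1900 ms. The blended median says nothing about the turns that carry the tax.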
