Engineering · July 1, 2026 · 5 min read

The tool-call tax: every lookup costs a second trip through the model.

A voice turn that calls a tool runs the LLM twice: once to decide what to look up, once to read the result and answer. That second trip, not the lookup itself, is why tool turns feel twice as slow as plain ones. This post covers where the tax comes from and the five levers that actually reduce it.

Vorel Engineering · Last updated July 5, 2026

TL;DR

When a voice agent answers a question that needs a lookup, the model runs twice: a first pass that decides to call the tool, and a second pass that reads the result and composes the answer. The two passes are sequential and each one re-processes the conversation context, so a tool turn costs roughly double a plain turn even when the lookup itself takes 100 ms. The levers, in order of impact: cache the static prompt prefix so the second pass prefill is nearly free; fire independent tools in parallel instead of one per round-trip; let the agent speak a grounded holding line while the second pass runs; render simple tool results through a template instead of a second model pass; and speculatively start obvious lookups before the model asks for them. What does not help: faster GPUs alone, protocol golf, or shrinking the tool count without shrinking the round-trips.

In an earlier post we made the case that voice latency is the LLM and everything else is a rounding error. This post is about the worst case of that rule: the turn where the agent has to look something up. Ask a production voice agent 'do you have anything Thursday afternoon?' and time it against 'what are your hours?' The tool turn is reliably about twice as slow, and the lookup itself, a single indexed query, accounts for almost none of the difference.

Anatomy of a tool turn

plain turn:
  context assembly            ~10 ms
  LLM pass (answer)        600-1200 ms
  TTS first chunk           120-240 ms

tool turn:
  context assembly            ~10 ms
  LLM pass 1 (decide+call) 500-900 ms   ← full prefill
  tool execution             80-400 ms
  LLM pass 2 (read+answer) 600-1200 ms  ← full prefill AGAIN
  TTS first chunk           120-240 ms

The two model passes are sequential by construction: the second pass cannot start until the tool result exists, and the tool cannot run until the first pass names it and fills in the arguments. Each pass pays the full cost of ingesting the conversation so far: the system prompt, the business context, the transcript, the tool definitions. On a long call that prefill is thousands of tokens, and you are paying to process it twice within a single caller-perceived pause.
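
Summing the ranges in the breakdown above makes the ratio concrete. The Python sketch below assumes the stages run strictly one after another and takes the approximate 10 ms context-assembly figure at face value; both are simplifications for illustration.

```python
# Latency ranges in milliseconds, copied from the breakdown above.
PLAIN_TURN = {
    "context assembly": (10, 10),
    "LLM pass (answer)": (600, 1200),
    "TTS first chunk": (120, 240),
}
TOOL_TURN = {
    "context assembly": (10, 10),
    "LLM pass 1 (decide+call)": (500, 900),
    "tool execution": (80, 400),
    "LLM pass 2 (read+answer)": (600, 1200),
    "TTS first chunk": (120, 240),
}


def total(stages):
    """Sum the best-case and worst-case latency across sequential stages."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


plain_low, plain_high = total(PLAIN_TURN)
tool_low, tool_high = total(TOOL_TURN)

print(f"plain turn: {plain_low}-{plain_high} ms")
print(f"tool turn:  {tool_low}-{tool_high} ms")
print(f"ratio:      {tool_low / plain_low:.2f}x to {tool_high / plain_high:.2f}x")
```

With these figures the tool turn lands at roughly 1.8x to 1.9x the plain turn, and tool execution accounts for only about 6% to 15% of the tool turn's total.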

That is the tool-call tax. It is not the database, it is not the network, and it is not the size of your tool catalog per se. It is the second sequential trip through the model, with a full re-read of context, sitting inside one silence.

The levers, in order of impact

Cache the prefix. The system prompt, business context, and tool definitions do not change between pass one and pass two; only the tail of the conversation does. Most major inference providers now support prompt caching, and structuring your prompt so the static content sits first and the volatile content sits last means the second pass prefill hits cache and costs a fraction of the first. This is the single largest lever and it is also the least glamorous: it is prompt layout discipline, not a new model. We wrote up the cost side of this separately; the latency side is, if anything, the better half of the deal.
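
A minimal sketch of the layout rule, in Python. It models a prompt as a list of segments and counts how many leading segments pass one and pass two share. Real prompt caches match on token prefixes with provider-specific rules, so the segment list is only a stand-in; the function names and the `check_schedule` tool are hypothetical.

```python
def build_prompt(static_blocks, transcript, tool_result=None):
    """Assemble prompt segments with static content first, volatile content last."""
    segments = list(static_blocks)    # system prompt, business context, tool definitions
    segments.extend(transcript)       # grows turn by turn
    if tool_result is not None:
        segments.append(tool_result)  # only present on pass two
    return segments


def shared_prefix_len(a, b):
    """Count the leading segments that two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


STATIC = ["<system prompt>", "<business context>", "<tool definitions>"]
transcript = ["caller: do you have anything Thursday afternoon?"]

pass_one = build_prompt(STATIC, transcript)
pass_two = build_prompt(
    STATIC,
    transcript + ["assistant: call check_schedule(day='thursday')"],
    tool_result="tool: 2:15 pm and 4:00 pm are open",
)
good = shared_prefix_len(pass_one, pass_two)

# Counter-example: a volatile value at the front differs between passes,
# so no leading segment matches and nothing can be reused.
bad = shared_prefix_len(["now: 14:02:07"] + pass_one, ["now: 14:02:08"] + pass_two)

print(f"static-first layout: {good} of {len(pass_two)} segments reusable")
print(f"volatile-first layout: {bad} segments reusable")
```

In the static-first layout everything pass one saw is a prefix of pass two, so only the tool call and its result are new work. A single changing value at the top of the prompt throws that away.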

Parallelize independent tools. If the model wants the customer record and the schedule, those lookups do not depend on each other, and firing them sequentially buys you nothing but latency. The subtle part is that this is not purely a framework setting: the model has to emit both calls in one pass rather than asking, reading, and asking again. That is shaped by how the tools are described and how the agent is instructed. Get it right and a two-tool turn collapses from two taxes to one.
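
The execution half of this lever can be sketched with Python's `asyncio.gather`. The tool names, arguments, and return shapes below are hypothetical, the sleeps stand in for real I/O, and the sketch assumes the model has already emitted both calls in a single pass.

```python
import asyncio

events = []  # records start/finish order so the overlap is visible


async def fetch_customer(customer_id):
    """Hypothetical read-only lookup; the sleep stands in for I/O."""
    events.append("start customer")
    await asyncio.sleep(0.01)
    events.append("done customer")
    return {"id": customer_id, "name": "Sam"}


async def fetch_schedule(day):
    """Hypothetical read-only lookup; the sleep stands in for I/O."""
    events.append("start schedule")
    await asyncio.sleep(0.03)
    events.append("done schedule")
    return {"day": day, "open": ["2:15 pm", "4:00 pm"]}


TOOLS = {"fetch_customer": fetch_customer, "fetch_schedule": fetch_schedule}


async def run_tool_calls(calls):
    """Run every tool call from a single model pass concurrently."""
    pending = [TOOLS[call["name"]](**call["args"]) for call in calls]
    # gather returns results in the same order as the calls were given.
    return await asyncio.gather(*pending)


calls = [
    {"name": "fetch_customer", "args": {"customer_id": "c-42"}},
    {"name": "fetch_schedule", "args": {"day": "thursday"}},
]
results = asyncio.run(run_tool_calls(calls))
print(events)
```

Both lookups start before either finishes, so the tool window costs about as much as the slower lookup rather than the sum of the two. None of this helps if the model asks for the second tool only after reading the first result, which is why the tool descriptions and instructions matter as much as the executor.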

Speak while the second pass runs. The first pass tells you everything you need to keep the caller oriented: which tool is being called, with which arguments. That is enough to say 'let me check Thursday for you' out loud, honestly, while pass two is still prefilling. Done right, the caller never experiences the tax at all; the holding line covers exactly the window where work is happening. Done wrong, it degenerates into the reflexive filler problem, which we have a whole separate post about. The difference is that here the filler is generated from the actual tool call, so it cannot fire when nothing is happening.
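
One way to keep the holding line honest is to derive it only from the tool call that pass one produced. The Python sketch below does that; the tool names and phrasings are hypothetical, and a production version would presumably vary the wording.

```python
HOLDING_LINES = {
    # Hypothetical tool names and phrasings, keyed by the tool pass one chose.
    "check_schedule": "Let me check {day} for you.",
    "fetch_customer": "One moment while I pull up your account.",
}


def holding_line(tool_call):
    """Return a spoken holding line grounded in the tool call, or None."""
    if tool_call is None:
        return None  # no tool call means no work in flight, so say nothing
    template = HOLDING_LINES.get(tool_call["name"])
    if template is None:
        return None  # unknown tool: stay silent rather than improvise
    try:
        return template.format(**tool_call["args"])
    except KeyError:
        return None  # a required argument is missing: stay silent


print(holding_line({"name": "check_schedule", "args": {"day": "Thursday"}}))
print(holding_line(None))
```

Because the function returns `None` whenever there is no tool call, the line cannot fire on a plain turn. That is the property that separates it from reflexive filler.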

Skip the second pass when the result speaks for itself. A successful booking does not need a frontier model to phrase 'you're booked for Thursday at 2:15.' When the tool result is structured and the response shape is known, a template renders it in microseconds and the caller cannot tell the difference. The second pass is for results that need judgment: ambiguity, partial failures, tradeoffs. Spending it on 'confirmation successful' is pure tax.
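
A sketch of that routing decision in Python: render from a template when the tool and status are on a known list, and otherwise return `None` to signal that the second model pass is needed. The tool name, status values, and result fields are hypothetical.

```python
TEMPLATES = {
    # Hypothetical (tool, status) pairs whose response shape is fully known.
    ("book_appointment", "confirmed"): "You're booked for {day} at {time}.",
}


def render_or_defer(tool_name, result):
    """Return a templated reply, or None to signal that pass two is needed."""
    template = TEMPLATES.get((tool_name, result.get("status")))
    if template is None:
        return None  # ambiguity, partial failure, tradeoff: needs judgment
    try:
        return template.format(**result)
    except KeyError:
        return None  # result is missing a field the template expects


confirmed = {"status": "confirmed", "day": "Thursday", "time": "2:15"}
conflict = {"status": "conflict", "day": "Thursday", "alternatives": ["4:00"]}

print(render_or_defer("book_appointment", confirmed))
print(render_or_defer("book_appointment", conflict))
```

The default is deliberately the slow path: anything not explicitly listed falls through to the model, so a new status or a malformed result never gets a canned answer.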

Speculate on the obvious lookup. On some turns you know, before the model does, what it will need. A caller who has just been read two appointment slots and says 'the second one' is going to trigger a booking; a caller who opens with an account question is going to need their record. Kicking off the likely lookup while the model is still thinking converts the tool execution window to zero when the guess is right, and wastes a cheap read when it is wrong. This is the fussiest lever on the list: it needs guardrails so that only side-effect-free reads are ever speculated. Speculatively writing to a calendar is how you book phantom appointments.
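
The guardrail can be as simple as an allowlist checked before anything is started. The Python sketch below is one possible shape, with hypothetical tool names; the sleeps stand in for the lookup and for the first model pass.

```python
import asyncio

READ_ONLY_TOOLS = {"check_schedule"}  # hypothetical allowlist of side-effect-free reads


async def check_schedule(day):
    await asyncio.sleep(0.01)  # stand-in for the real query
    return {"day": day, "open": ["2:15 pm"]}


async def book_appointment(day, time):
    return {"status": "confirmed", "day": day, "time": time}  # a write: never speculated


TOOLS = {"check_schedule": check_schedule, "book_appointment": book_appointment}


def speculate(guess):
    """Start the guessed lookup early, but only if it is on the read-only allowlist."""
    if guess["name"] not in READ_ONLY_TOOLS:
        return None
    return asyncio.create_task(TOOLS[guess["name"]](**guess["args"]))


async def resolve(actual, guess, task):
    """Use the speculative result on a correct guess; otherwise run the real call."""
    if task is not None and actual == guess:
        return await task
    if task is not None:
        task.cancel()  # wrong guess: discard the cheap read
    return await TOOLS[actual["name"]](**actual["args"])


async def turn():
    guess = {"name": "check_schedule", "args": {"day": "thursday"}}
    task = speculate(guess)    # lookup starts while the model is still "thinking"
    await asyncio.sleep(0.02)  # stand-in for LLM pass one
    actual = {"name": "check_schedule", "args": {"day": "thursday"}}
    return await resolve(actual, guess, task)


result = asyncio.run(turn())
refused = speculate({"name": "book_appointment", "args": {"day": "thursday", "time": "2:15"}})
print(result, refused)
```

The speculative result is used only when the model's actual call matches the guess exactly, name and arguments both. A booking is refused at the allowlist check before any task is created.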

The tax is the second sequential trip through the model with a full re-read of context. Every lever that matters either makes that trip cheaper, hides it behind honest speech, or deletes it.

What doesn't move it

Faster hardware helps both passes equally, which means it does not change the ratio: tool turns stay roughly twice as slow as plain turns, and the caller feels ratios, not milliseconds. Shaving transport overhead is worth single-digit milliseconds against a tax measured in seconds. And cutting your tool catalog from twenty tools to eight shortens the prefill a little but leaves the second trip fully intact; it is an optimization of the tax form, not the tax.

The question to ask a vendor

Ask for turn latency split by turn type: plain turns and tool turns, p50 and p95 for each. A single blended latency number hides the tax, because the easy majority of turns drags the average down while the turns that matter most to the caller (the ones where the agent is actually doing something with their data) sit out in the tail. A vendor who cannot produce the split has not instrumented their own pipeline deeply enough to have paid the tax down. A vendor who can will usually be happy to talk about it, because the work above is real engineering and everyone who has done it knows it.
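
For teams instrumenting their own pipeline, the split is a few lines of Python. The sketch below uses `statistics.quantiles` from the standard library. The latency values are synthetic, made up purely to show the shape of the report, and the `type` and `latency_ms` field names are assumptions.

```python
from statistics import quantiles


def p50_p95(samples):
    """Return (p50, p95) using inclusive percentile interpolation."""
    cuts = quantiles(samples, n=100, method="inclusive")  # 99 cut points
    return cuts[49], cuts[94]


def split_report(turns):
    """Group latencies by turn type and report (p50, p95) for each group."""
    by_type = {}
    for turn in turns:
        by_type.setdefault(turn["type"], []).append(turn["latency_ms"])
    return {kind: p50_p95(values) for kind, values in by_type.items()}


# Synthetic data for illustration only: 80 plain turns and 20 tool turns.
turns = [{"type": "plain", "latency_ms": 800 + 5 * i} for i in range(80)]
turns += [{"type": "tool", "latency_ms": 1600 + 40 * i} for i in range(20)]

report = split_report(turns)
blended = p50_p95([turn["latency_ms"] for turn in turns])

for kind, (p50, p95) in report.items():
    print(f"{kind:>7}: p50={p50:.0f} ms  p95={p95:.0f} ms")
print(f"blended: p50={blended[0]:.0f} ms  p95={blended[1]:.0f} ms")
```

In this synthetic mix the blended p50 sits inside the plain-turn range, because plain turns are the majority, so the median alone says nothing about how tool turns behave.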
