Engineering · 5 min read

How do you test an agent that writes to your CRM?

You can't rehearse with an agent whose every conversation creates contacts, books slots, and posts notes into a live system of record. Mock the model and you've tested nothing; clone the CRM and you've tested a different system. This post describes the dry-run architecture that lets you exercise the real pipeline with the side effects intercepted at the last possible layer.

Vorel Engineering

Here is an awkward question for any team deploying a CX agent wired into a real system of record: when you want to know whether last week's prompt change made the agent worse, what do you actually do? You cannot just call it fifty times and try fifty bookings, because each of those calls writes contacts, appointments, and notes into the CRM your staff works out of. Testing the agent pollutes the thing the agent exists to keep clean.

The two workarounds that fail

The first instinct is to mock the expensive part: stub the model, replay canned responses, assert the pipeline glues them together. This produces green checkmarks and no information. Every behavior you actually worry about (does the agent call the right tool, does it hedge instead of checking, does it write the note to the right contact) lives inside the part you just mocked out. You have tested the plumbing and shipped the judgment untested.

The second instinct is a staging CRM: a sandbox account, seeded with fake data, and the agent pointed at it for test calls. Better, but now you are testing a different system. The sandbox has 40 tidy contacts; production has 12,000 messy ones, including three people with the same name and a customer whose phone number is attached to their ex-spouse's account. The failure modes that hurt in production are properties of production data, and a seeded sandbox contains none of them. Staging tells you the agent works in a world that does not exist.

Intercept at the last layer

The design that resolves this is dry-run mode, and the entire trick is where you put the switch. Everything runs for real: the real system prompt with the real business context, the real model, the real tool-selection logic, and, critically, real reads against the production CRM. The agent looks up the actual customer, checks the actual calendar, quotes the actual price list. The interception happens at exactly one layer: the integration client that performs writes. When a write call arrives in dry-run, the client does not send it. It records the full payload (endpoint, object, fields) as a 'would have written' entry, and returns a realistic synthetic success so the conversation proceeds exactly as it would have.
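A minimal sketch of that write interceptor, under stated assumptions: `CrmClient`, `WouldHaveWritten`, and the injected `send` callable are hypothetical names invented for illustration, not any real CRM SDK. Reads are not shown because they pass straight through to production; only the write path carries the conditional.

```python
import itertools
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class WouldHaveWritten:
    """One intercepted write: the payload that would have hit the CRM."""
    endpoint: str
    obj: str
    fields: dict[str, Any]


@dataclass
class CrmClient:
    """Hypothetical integration client. `send` stands in for the real external call."""
    send: Callable[[str, str, dict[str, Any]], dict[str, Any]]
    dry_run: bool = False
    ledger: list[WouldHaveWritten] = field(default_factory=list)
    _ids: Any = field(default_factory=lambda: itertools.count(1))

    def write(self, endpoint: str, obj: str, fields: dict[str, Any]) -> dict[str, Any]:
        if self.dry_run:
            # Record the full payload instead of sending it.
            self.ledger.append(WouldHaveWritten(endpoint, obj, dict(fields)))
            # Return a synthetic success (assumed response shape) so the
            # conversation proceeds as it would have in production.
            return {**fields, "status": "ok", "id": f"dryrun-{next(self._ids)}"}
        return self.send(endpoint, obj, fields)
```

With `dry_run=True`, a call such as `client.write('/appointments', 'appointment', {'slot': 'Thursday 2:15'})` adds one ledger entry and never invokes `send`. The synthetic response shape is an assumption; in practice it should mirror whatever the real CRM returns closely enough that the agent cannot tell the difference.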

Placement is everything. Intercept any higher (say, at the tool layer) and you fork the code path: the test exercises a branch that production never runs, and the two drift apart until dry-run passes mean nothing. The write interceptor has to be the same client production uses, at the last hop before the external call, with dry-run as a per-conversation context rather than a second code path. The honest way to say it: dry-run should differ from production by one conditional at one layer, and every line above that conditional should be identical.

The output of a dry-run conversation is a transcript plus a write ledger: say, three would-have-written entries, each with the exact payload that would have hit the CRM. That ledger is the test result. You can assert against it ('this conversation should produce one contact update and one booking, and the booking should be Thursday 2:15'), diff it across prompt versions, and show it to a nervous buyer as proof of exactly what the agent would do to their data before you flip anything live.
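Asserting on a ledger and diffing it across prompt versions can be sketched in a few lines. The entry shape (`endpoint`, `object`, `fields`), the helper names, and the sample payloads below are assumptions made up for the example, not a real ledger format.

```python
from collections import Counter


def summarize(ledger):
    """Count would-have-written entries per object type."""
    return Counter(entry["object"] for entry in ledger)


def diff_ledgers(old, new):
    """Return (removed, added) entries, comparing full payloads."""
    removed = [entry for entry in old if entry not in new]
    added = [entry for entry in new if entry not in old]
    return removed, added


# Ledgers from the same scripted conversation under two prompt versions.
ledger_v1 = [
    {"endpoint": "/contacts/c-1", "object": "contact", "fields": {"phone": "555-0100"}},
    {"endpoint": "/appointments", "object": "appointment", "fields": {"slot": "Thursday 2:15"}},
]
ledger_v2 = [
    {"endpoint": "/contacts/c-1", "object": "contact", "fields": {"phone": "555-0100"}},
    {"endpoint": "/appointments", "object": "appointment", "fields": {"slot": "Thursday 2:45"}},
]

# 'One contact update and one booking, and the booking should be Thursday 2:15.'
assert summarize(ledger_v1) == {"contact": 1, "appointment": 1}
booking = next(e for e in ledger_v1 if e["object"] == "appointment")
assert booking["fields"]["slot"] == "Thursday 2:15"

removed, added = diff_ledgers(ledger_v1, ledger_v2)
```

Here the diff isolates the one thing the prompt change altered: the booking moved from Thursday 2:15 to Thursday 2:45, while the contact update is untouched.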

Simulated callers and replay grading

Dry-run makes individual test conversations safe. Making them systematic takes two more pieces. The first is simulated callers: a second model plays the customer, driving the agent through scripted scenarios (the rescheduler, the vague caller, the one who changes their mind twice, the one who asks for a human). Each scenario runs against the real pipeline in dry-run and asserts on both the transcript and the write ledger. This is a regression suite for behavior: when a prompt change breaks the way the agent handles a mid-call correction, a scenario goes red before a caller ever hears the regression.
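A scenario runner along these lines might look like the sketch below. It is an illustration under assumptions: the `caller` and `agent` callables are scripted stand-ins so the example runs on its own, whereas in the setup described above the caller would be a second model and the agent would be the real pipeline in dry-run. `Scenario`, `run_scenario`, and the toy functions are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Scenario:
    name: str
    caller: Callable[[list], Optional[str]]  # next caller line, or None to hang up
    check: Callable[[list, list], bool]      # asserts on transcript and ledger


def run_scenario(scenario, agent, max_turns=10):
    """Drive `agent` with the scenario's caller, then apply its check."""
    transcript, ledger = [], []
    for _ in range(max_turns):
        utterance = scenario.caller(transcript)
        if utterance is None:
            break
        transcript.append(("caller", utterance))
        transcript.append(("agent", agent(transcript, ledger)))
    return scenario.check(transcript, ledger), transcript, ledger


def mind_changer(transcript):
    """Scripted stand-in for a simulated caller who corrects mid-call."""
    lines = ["Book me Thursday at 2:15.", "Actually, make it Friday at 10."]
    turn = len(transcript) // 2
    return lines[turn] if turn < len(lines) else None


def toy_agent(transcript, ledger):
    """Stand-in agent that honors the caller's latest request."""
    slot = "Friday 10:00" if "Friday" in transcript[-1][1] else "Thursday 2:15"
    ledger.append({"object": "appointment", "fields": {"slot": slot}})
    return f"Booked {slot}."


def regressed_agent(transcript, ledger):
    """Stand-in agent after a bad prompt change: ignores corrections."""
    ledger.append({"object": "appointment", "fields": {"slot": "Thursday 2:15"}})
    return "Booked Thursday 2:15."


scenario = Scenario(
    name="changes mind mid-call",
    caller=mind_changer,
    check=lambda t, ledger: bool(ledger) and ledger[-1]["fields"]["slot"] == "Friday 10:00",
)

passed, _, _ = run_scenario(scenario, toy_agent)
failed, _, _ = run_scenario(scenario, regressed_agent)
```

The same scenario passes against `toy_agent` and goes red against `regressed_agent`, which is the signal a behavioral regression suite exists to produce.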

The second piece is replay grading: taking transcripts of real production conversations and re-running graders (or re-running the agent itself, in dry-run, against the same inputs) to evaluate a change against traffic you actually received rather than traffic you imagined. Simulated scenarios cover the cases you thought of. Replays cover the cases your callers thought of. You need both, because the second list is always stranger.
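The re-run-the-agent variant of replay grading can be sketched as follows. Everything here is assumed for illustration: the recorded-call format (caller turns keyed by call id), the toy agents, and the deliberately crude grader that flags any dollar amount as an unsupported price claim.

```python
def replay(recorded_calls, agent, grader):
    """Re-run recorded caller turns through `agent` and grade each call."""
    results = {}
    for call_id, caller_turns in recorded_calls.items():
        transcript, ledger = [], []
        for utterance in caller_turns:
            transcript.append(("caller", utterance))
            transcript.append(("agent", agent(transcript, ledger)))
        results[call_id] = grader(transcript, ledger)
    return results


def regressions(baseline, candidate):
    """Calls that passed under the baseline agent and fail under the candidate."""
    return sorted(cid for cid in baseline if baseline[cid] and not candidate[cid])


# Toy stand-ins so the sketch runs on its own.
recorded_calls = {
    "call-1": ["Do you have anything Thursday?"],
    "call-2": ["What does a cleaning cost?"],
}


def baseline_agent(transcript, ledger):
    return "Let me check that for you."


def candidate_agent(transcript, ledger):
    if "cost" in transcript[-1][1]:
        return "It's $80."  # a price quoted without a lookup
    return "Let me check that for you."


def no_unsupported_price(transcript, ledger):
    """Crude grader: fail any call where the agent states a dollar amount."""
    return not any("$" in text for speaker, text in transcript if speaker == "agent")


before = replay(recorded_calls, baseline_agent, no_unsupported_price)
after = replay(recorded_calls, candidate_agent, no_unsupported_price)
broken = regressions(before, after)
```

One caveat with this approach: recorded caller turns were responses to the old agent, so once the candidate's replies diverge, later turns in a replayed call may no longer fit the conversation. Grading early turns, or re-running graders over the original transcripts, sidesteps that.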

Live is a gate, not a test

None of this eliminates supervised live testing; it repositions it. Live calls with a human watching are the final gate, run after dry-run scenarios pass, scoped to a handful of calls whose writes are expected and immediately reviewed. What the dry-run architecture eliminates is the indefensible middle where live production calls are the first place a change meets reality, and the CRM quietly accumulates the debris of every experiment.

The buyer version of this post

If a vendor wants write access to your system of record, their testing story is part of your security review, and it takes three questions to assess. Can they run a full conversation against your real data with writes intercepted, and show you the would-have-written ledger? Is dry-run the same code path as production, or a parallel one that drifts? And when they change a prompt, what re-runs automatically before the change reaches your phone line? A vendor with real answers has built the machinery in this post. A vendor who answers 'we test carefully in our sandbox' is telling you your production CRM is the regression suite.
