Engineering | July 3, 2026 | 5 min read

How do you test an agent that writes to your CRM?

You can't rehearse with an agent whose every conversation creates contacts, books slots, and posts notes into a live system of record. Mock the model and you've tested nothing; clone the CRM and you've tested a different system. The dry-run architecture lets you exercise the real pipeline with the side effects intercepted at the last possible layer.

Vorel Engineering | Engineering | Last updated July 5, 2026

TL;DR

An agent that reads and writes a live CRM cannot be tested the way you test a chatbot, because every test conversation leaves real rows behind. The two obvious workarounds both fail: mocking the model tests none of the behavior that matters, and pointing the agent at a staging CRM tests a system with different data, different edge cases, and different latencies than the one that will take real calls. The architecture that works is dry-run interception: run the entire real pipeline (real prompts, real model, real tool-selection, real CRM reads) and intercept only the writes, at the integration boundary, recording each one as a 'would have written' entry with full payload. Layer simulated callers on top for regression, replay grading on transcripts for continuous evaluation, and keep supervised live calls as the final gate rather than the first test.

Here is an awkward question for any team deploying a CX agent wired into a real system of record: when you want to know whether last week's prompt change made the agent worse, what do you actually do? You cannot just call it fifty times and try fifty bookings, because each of those calls writes contacts, appointments, and notes into the CRM your staff works out of. Testing the agent pollutes the thing the agent exists to keep clean.

The two workarounds that fail

The first instinct is to mock the expensive part: stub the model, replay canned responses, assert the pipeline glues them together. This produces green checkmarks and no information. Every behavior you actually worry about (does the agent call the right tool, does it hedge instead of checking, does it write the note to the right contact) lives inside the part you just mocked out. You have tested the plumbing and shipped the judgment untested.

The second instinct is a staging CRM: a sandbox account, seeded with fake data, and the agent pointed at it for test calls. Better, but now you are testing a different system. The sandbox has 40 tidy contacts; production has 12,000 messy ones, including three people with the same name and a customer whose phone number is attached to their ex-spouse's account. The failure modes that hurt in production are properties of production data, and a seeded sandbox contains none of them. Staging tells you the agent works in a world that does not exist.

Intercept at the last layer

The design that resolves this is dry-run mode, and the entire trick is where you put the switch. Everything runs for real: the real system prompt with the real business context, the real model, the real tool-selection logic, and, critically, real reads against the production CRM. The agent looks up the actual customer, checks the actual calendar, quotes the actual price list. The interception happens at exactly one layer: the integration client that performs writes. When a write call arrives in dry-run, the client does not send it. It records the full payload (endpoint, object, fields) as a 'would have written' entry, and returns a realistic synthetic success so the conversation proceeds exactly as it would have.
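A minimal sketch of that interception point, under stated assumptions: the names `CrmClient`, `ConversationContext`, `WouldHaveWritten`, and the `transport` callable are hypothetical and stand in for whatever integration client and HTTP layer a real deployment uses. The point it illustrates is that reads always go through, while writes in dry-run are recorded and answered with a synthetic success.

```python
from dataclasses import dataclass, field
from typing import Any, Callable
import uuid


@dataclass
class WouldHaveWritten:
    """One intercepted write: the full payload that would have hit the CRM."""
    endpoint: str
    object_type: str
    fields: dict[str, Any]


@dataclass
class ConversationContext:
    """Per-conversation state (hypothetical shape). dry_run is data, not a code path."""
    conversation_id: str
    dry_run: bool = False
    ledger: list[WouldHaveWritten] = field(default_factory=list)


class CrmClient:
    """The single client used by both production and dry-run conversations."""

    def __init__(self, transport: Callable[[str, str, dict], dict]):
        # transport(method, endpoint, payload) performs the real external call.
        self._transport = transport

    def read(self, ctx: ConversationContext, endpoint: str, params: dict) -> dict:
        # Reads are never intercepted: dry-run sees real production data.
        return self._transport("GET", endpoint, params)

    def write(self, ctx: ConversationContext, endpoint: str,
              object_type: str, fields: dict[str, Any]) -> dict:
        if ctx.dry_run:
            # The one conditional: record the payload instead of sending it.
            ctx.ledger.append(WouldHaveWritten(endpoint, object_type, dict(fields)))
            # Synthetic success so the conversation proceeds as it would live.
            return {**fields, "id": f"dryrun-{uuid.uuid4().hex[:8]}", "status": "ok"}
        return self._transport("POST", endpoint, {"object": object_type, **fields})
```

How realistic the synthetic response needs to be depends on what the agent does with it next. If later tool calls reference the returned ID, the interceptor has to return one the rest of the pipeline will accept.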

Placement is everything. Intercept any higher (say, at the tool layer) and you fork the code path: the test exercises a branch that production never runs, and the two drift apart until dry-run passes mean nothing. The write interceptor has to be the same client production uses, at the last hop before the external call, with dry-run as a per-conversation context rather than a second code path. The honest way to say it: dry-run should differ from production by one conditional at one layer, and every line above that conditional should be identical.

The output of a dry-run conversation is a transcript plus a write ledger: for example, three would-have-written entries, each with the exact payload that would have hit the CRM. That ledger is the test result. You can assert against it ('this conversation should produce one contact update and one booking, and the booking should be Thursday 2:15'), diff it across prompt versions, and show it to a nervous buyer as proof of exactly what the agent would do to their data before you flip anything live.
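Asserting on a ledger and diffing it across prompt versions might look roughly like this. The entry shape (`endpoint`, `object`, `fields`) and the sample values are assumptions made up for illustration; a real ledger would mirror whatever the CRM's write payloads contain.

```python
import json
from collections import Counter


def count_by_object(ledger):
    """How many would-have-written entries exist per object type."""
    return Counter(entry["object"] for entry in ledger)


def diff_ledgers(before, after):
    """Return (removed, added): entries present in only one of the two ledgers."""
    def key(entry):
        # Canonical JSON so that dict key order does not affect comparison.
        return json.dumps(entry, sort_keys=True)

    before_keys = {key(e) for e in before}
    after_keys = {key(e) for e in after}
    removed = [e for e in before if key(e) not in after_keys]
    added = [e for e in after if key(e) not in before_keys]
    return removed, added


# Hypothetical ledgers from the same scenario under two prompt versions.
ledger_v1 = [
    {"endpoint": "/contacts/42", "object": "contact", "fields": {"email": "a@example.com"}},
    {"endpoint": "/appointments", "object": "appointment", "fields": {"slot": "Thursday 2:15"}},
]
ledger_v2 = [
    {"endpoint": "/contacts/42", "object": "contact", "fields": {"email": "a@example.com"}},
    {"endpoint": "/appointments", "object": "appointment", "fields": {"slot": "Friday 2:15"}},
]

# The assertion from the text: one contact update, one booking, on Thursday 2:15.
assert count_by_object(ledger_v1) == {"contact": 1, "appointment": 1}
assert ledger_v1[1]["fields"]["slot"] == "Thursday 2:15"

removed, added = diff_ledgers(ledger_v1, ledger_v2)
```

Here the diff isolates the one write that changed between versions (the booking moved from Thursday to Friday), which is the kind of regression a transcript read-through can miss.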

Simulated callers and replay grading

Dry-run makes individual test conversations safe. Making them systematic takes two more pieces. The first is simulated callers: a second model plays the customer, driving the agent through scripted scenarios (the rescheduler, the vague caller, the one who changes their mind twice, the one who asks for a human). Each scenario runs against the real pipeline in dry-run and asserts on both the transcript and the write ledger. This is a regression suite for behavior: when a prompt change breaks the way the agent handles a mid-call correction, a scenario goes red before a caller ever hears the regression.
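A scenario runner can be sketched as a loop that alternates the simulated caller and the agent, then checks both outputs. Everything named here is hypothetical: `Scenario`, `run_scenario`, and the convention that `agent(transcript, ledger)` runs the real pipeline in dry-run and appends intercepted writes to `ledger`. In practice `caller` would wrap a second model; here it is just a callable.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Turn = tuple[str, str]  # (speaker, text)


@dataclass
class Scenario:
    name: str
    # Plays the customer: given the transcript so far, returns the next
    # utterance, or None to hang up.
    caller: Callable[[list[Turn]], Optional[str]]
    # Returns a list of failure messages; an empty list means the scenario passed.
    check: Callable[[list[Turn], list[dict]], list[str]]
    max_turns: int = 20  # guard against two models talking forever


@dataclass
class ScenarioResult:
    name: str
    transcript: list[Turn]
    ledger: list[dict]
    failures: list[str]


def run_scenario(scenario: Scenario,
                 agent: Callable[[list[Turn], list[dict]], str]) -> ScenarioResult:
    """Drive the agent with a simulated caller and check transcript and ledger."""
    transcript: list[Turn] = []
    ledger: list[dict] = []
    for _ in range(scenario.max_turns):
        utterance = scenario.caller(transcript)
        if utterance is None:
            break
        transcript.append(("caller", utterance))
        transcript.append(("agent", agent(transcript, ledger)))
    failures = scenario.check(transcript, ledger)
    return ScenarioResult(scenario.name, transcript, ledger, failures)
```

Because a model-driven caller is not deterministic, checks on the ledger (what was written) tend to be more stable than checks on exact transcript wording.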

The second piece is replay grading: taking transcripts of real production conversations and re-running graders (or re-running the agent itself, in dry-run, against the same inputs) to evaluate a change against traffic you actually received rather than traffic you imagined. Simulated scenarios cover the cases you thought of. Replays cover the cases your callers thought of. You need both, because the second list is always stranger.
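The re-run variant of replay grading can be sketched as follows. The record shape (`id`, `caller_turns`), the `agent(transcript, ledger)` convention, and the grader functions are all assumptions for illustration. The sketch also assumes the caller would have said the same things to the new agent, which stops being true once the agent's replies diverge from the original conversation.

```python
def rerun(record, agent):
    """Feed recorded caller turns to the agent in dry-run; collect transcript and ledger."""
    transcript, ledger = [], []
    for utterance in record["caller_turns"]:
        transcript.append(("caller", utterance))
        transcript.append(("agent", agent(transcript, ledger)))
    return {"id": record["id"], "transcript": transcript, "ledger": ledger}


def grade_all(records, agent, graders):
    """Map conversation id -> {grader name: passed} for one agent version."""
    grades = {}
    for record in records:
        result = rerun(record, agent)
        grades[record["id"]] = {name: bool(g(result)) for name, g in graders.items()}
    return grades


def regressions(baseline, candidate):
    """(conversation id, grader) pairs that passed on baseline but fail on candidate."""
    return sorted(
        (conv_id, name)
        for conv_id, grades in baseline.items()
        for name, passed in grades.items()
        if passed and not candidate.get(conv_id, {}).get(name, False)
    )
```

Reporting per-conversation regressions rather than only an aggregate pass rate keeps a change that fixes ten conversations and breaks three from looking like a clean win.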

Live is a gate, not a test

None of this eliminates supervised live testing; it repositions it. Live calls with a human watching are the final gate, run after dry-run scenarios pass, scoped to a handful of calls whose writes are expected and immediately reviewed. What the dry-run architecture eliminates is the indefensible middle where live production calls are the first place a change meets reality, and the CRM quietly accumulates the debris of every experiment.

The buyer version of this post

If a vendor wants write access to your system of record, their testing story is part of your security review, and it takes three questions to assess. Can they run a full conversation against your real data with writes intercepted, and show you the would-have-written ledger? Is dry-run the same code path as production, or a parallel one that drifts? And when they change a prompt, what re-runs automatically before the change reaches your phone line? A vendor with real answers has built the machinery in this post. A vendor who answers 'we test carefully in our sandbox' is telling you your production CRM is the regression suite.
