An Operator's Field Guide to Buying AI Agents

By Keith Sherman · 9 min read

If you operate a real business and you're being pitched AI agents right now, you are inside a marketplace that doesn't know how to grade itself. Every vendor demos a happy path. Every vendor has "logs." Every vendor will tell you their agent is "production-ready," and not one of them will define what that means. You are expected to evaluate a category of software for which the buying criteria do not yet exist.

This essay is a small attempt to fix that.

It is written for the operator — the SMB owner, the ops lead, the procurement person who has been asked to make a decision about an AI agent and who suspects, correctly, that the questions they were taught to ask about SaaS do not quite apply here. AI agents are not SaaS. They take actions on your behalf. They speak to your customers. They write things that bear your name. When they fail, they do not throw a "500 Internal Server Error" — they leave a voicemail you did not authorize, send an email you did not write, or quietly do nothing while your phone rings.

You need a different rubric.

What follows are five questions to ask any AI agent vendor before you sign a contract, along with what a real answer sounds like and what an evasive answer tells you. None of the five requires you to be technical. They do require you to be willing to make a vendor uncomfortable, which, given the stakes, is the lowest bar in this entire transaction.

1. Show me the scope declaration, not the system prompt.

Every AI agent vendor will, when asked how their agent "knows what to do," walk you through their system prompt. This is the wrong artifact.

A system prompt is a paragraph of natural-language instructions written to a language model. It is fragile, ambiguous, and effectively unauditable — the same words mean different things to different model versions, and the same model produces different behavior across temperature settings and time. A system prompt is to an AI agent what a sticky note is to a corporate policy. It might contain the right intent. It is not the document you build a business on.

What you want instead is a scope declaration: a structured, machine-readable description of what this agent is authorized to do, what data it can read, what actions it can take, who it can talk to on your behalf, and what it is forbidden from doing. The scope declaration is not the prompt the model sees. It is the contract the runtime enforces around the model.

The difference matters because scope declarations are testable. You can read one. You can ask "is this agent allowed to send email to people who are not existing customers?" and get a yes-or-no answer, not a probabilistic shrug. You can compare two vendors' scope declarations side by side and notice that one of them quietly allows the agent to update CRM records and the other doesn't.
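To make the distinction concrete, here is a minimal sketch in Python of a scope declaration and the runtime check that sits around the model. Every field name (allowed_actions, allowed_recipient_groups, prohibited) and the is_allowed helper are hypothetical, invented for illustration; they are not taken from any vendor's product or from a published schema.

```python
from typing import Optional

# Hypothetical scope declaration. In practice this would likely live in a
# JSON file; the field names here are illustrative assumptions.
SCOPE = {
    "agent": "front-desk-email",
    "allowed_actions": ["send_email", "read_calendar"],
    "allowed_recipient_groups": ["existing_customers"],
    "prohibited": ["update_crm_record", "quote_price"],
}


def is_allowed(scope: dict, action: str, recipient_group: Optional[str] = None) -> bool:
    """Return True only if the declaration explicitly permits the request."""
    if action in scope["prohibited"]:
        return False
    if action not in scope["allowed_actions"]:
        return False  # default deny: anything not listed is out of scope
    if recipient_group is not None:
        return recipient_group in scope["allowed_recipient_groups"]
    return True


# The runtime, not the model, answers the yes-or-no question.
print(is_allowed(SCOPE, "send_email", "existing_customers"))  # True
print(is_allowed(SCOPE, "send_email", "non_customers"))       # False
print(is_allowed(SCOPE, "update_crm_record"))                 # False
```

The point of the sketch is the default-deny shape: the answer comes from reading a list, so two declarations can be compared line by line.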

When you ask the vendor for the scope declaration and they hand you the system prompt, you have learned something important: they have not yet separated the two concepts in their own product. That's not a deal-breaker on day one of the category — almost no one has — but it tells you exactly where they are on the maturity curve.

A good answer: "Here's our scope declaration for your account. It's a JSON file. These are the actions the agent can take, these are the data sources it can read, these are the recipients it can send to, and here are the hard prohibitions. The model never sees this directly — the runtime enforces it."

A bad answer: "Our prompt is proprietary."

2. Walk me through one real production failure and how it was caught.

This is the single most diagnostic question in the rubric. Ask it directly. Watch what happens.

Every AI agent in production has failed at something. Voice agents have misheard names. Email agents have addressed customers by their last vendor's first name because of stale context. Scheduling agents have double-booked because they pulled an outdated calendar. This is not embarrassing — it is the actual physics of running language models against the real world. A vendor who has not yet experienced a production failure has not yet been in production.

The diagnostic value of this question is in how the failure was caught. There are three answer shapes, and each one tells you something different.

Answer shape A: "The customer called us." This is the answer of a vendor who has no production telemetry. Their failure detection is the customer noticing first. If this is the answer, you will be their telemetry. Pass.

Answer shape B: "Our monitoring flagged it." Better. Now ask the follow-up: what specifically did the monitoring flag? If the vendor describes generic metrics — uptime, latency, error rates — they are running SaaS monitoring on what is not SaaS. None of those metrics catch the failure modes that matter for agents (the agent did a thing, but it did the wrong thing, confidently, with no error code).

Answer shape C: "The agent's escalation trigger fired and routed the situation to a human before it propagated." This is the answer of a vendor who has actually thought about agents as a distinct category. They have a layer that asks, in real time, "is this agent currently in a situation it shouldn't be handling autonomously?" — and when the answer is yes, the agent stops and hands off. This is the only architecture that scales.

You are not looking for "we have never failed." You are looking for "here is the specific failure, here is the specific mechanism that caught it, and here is what we changed afterward." A vendor who can give you that story has lived in production. A vendor who can't, hasn't.

3. What does the completion record look like, and can I export it?

Every AI agent action should produce a completion record: a structured, signed artifact that says this agent did this thing, at this time, in this context, with this input, producing this output, with these tools, observed by this runtime. It is the receipt of what happened.

When you ask a vendor for an example completion record, you are testing two things at once. First, do they generate one at all? Many vendors do not — their "logs" are unstructured strings written to a file the customer never sees. Second, can you export it? If the completion records live only inside the vendor's dashboard and disappear when you churn, you do not own your own operational history. You are renting a memory of your business from a third party.

The reason this matters is not regulatory paranoia. It is downstream operations. A year from now, a customer is going to call you and ask why an AI agent said something to them that they have an issue with. Maybe it quoted a price you don't honor. Maybe it confirmed an appointment that never made it onto the calendar. Whatever the dispute, you need to be able to say exactly what happened, when, and on what basis. If your vendor cannot produce that record, the dispute is automatically your problem.

A good answer to this question includes the structure of the completion record (what fields it contains), the signing or attestation mechanism (so you know the record hasn't been altered after the fact), and a clear export path (CSV, API, or downloadable archive). A vendor who can't articulate the first two does not yet have completion records. They have logs. The difference is whether the record is structured enough to mean something six months from now.
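As a rough illustration of those three pieces (fields, signing, export), here is a sketch in Python using only the standard library. The field names, the demo key, and the sign_record and verify_record helpers are all assumptions made for this example. A real vendor would more likely use asymmetric signatures so you can verify a record without holding a secret; HMAC is used here only to keep the sketch short.

```python
import hashlib
import hmac
import json

# Assumption: the runtime holds a signing key. This value is a placeholder.
SIGNING_KEY = b"demo-key-not-for-production"


def sign_record(record: dict, key: bytes = SIGNING_KEY) -> dict:
    """Return a copy of the record with an HMAC-SHA256 signature attached."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {**record, "signature": signature}


def verify_record(signed: dict, key: bytes = SIGNING_KEY) -> bool:
    """Recompute the signature over every field except the signature itself."""
    body = {k: v for k, v in signed.items() if k != "signature"}
    expected = sign_record(body, key)["signature"]
    return hmac.compare_digest(expected, signed.get("signature", ""))


# Hypothetical record: who did what, when, with which input, output, and tools.
record = sign_record({
    "agent": "front-desk-voice",
    "action": "confirm_appointment",
    "timestamp": "2025-01-15T14:32:00Z",
    "input": "Caller asked to book Thursday at 10am",
    "output": "Confirmed Thursday 10:00",
    "tools": ["calendar.create_event"],
    "runtime": "runtime-1.4.2",
})

print(verify_record(record))    # True
tampered = {**record, "output": "Confirmed Friday 10:00"}
print(verify_record(tampered))  # False

# Export path: one JSON object per line is easy to archive and re-import.
export_line = json.dumps(record, sort_keys=True)
```

If a vendor's sample record cannot pass a check like verify_record after it leaves their dashboard, it is a log entry, not a receipt.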

4. What triggers an escalation to a human, and who controls those triggers?

Every responsibly deployed AI agent has an escalation path. The question is not whether one exists. The question is who decides when it fires, and the answer matters more than most operators realize.

There are three modes here, and the first two are failure modes.

Failure mode A: Escalation is implicit. The agent decides for itself when to escalate. This is fragile because the same model that's failing is the model deciding it's failing. You don't want the agent grading its own confidence.

Failure mode B: Escalation is hard-coded by the vendor. The vendor defines the triggers and the customer can't change them. This works until your operation has a quirk the vendor's defaults don't anticipate. Say you are a roofer with a single client who insists on a particular communication cadence; the vendor's general-case escalation triggers do not know about that client. Now what?

Mode C: Escalation is configurable by the operator. You, the business, define the triggers, in your terms, and the runtime enforces them. "If the caller mentions the word 'lawsuit,' stop and route to me." "If the proposal amount exceeds $50,000, require my approval before sending." "If the customer is on the do-not-engage list, transfer immediately."
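The three example triggers above can be sketched as plain data that a runtime evaluates before the agent acts. This is a minimal illustration in Python, not a description of any real product: the rule fields (field, op, value, route_to), the operator names, and the check_escalation helper are all hypothetical. The rules are data rather than code so that an operator could, in principle, edit them through a settings screen.

```python
# Hypothetical operator-editable trigger list. Each rule is plain data, so it
# could be stored as JSON and changed without a code deployment.
TRIGGERS = [
    {"name": "legal_language", "field": "transcript", "op": "contains",
     "value": "lawsuit", "route_to": "owner"},
    {"name": "large_proposal", "field": "proposal_amount", "op": "greater_than",
     "value": 50_000, "route_to": "owner_approval"},
    {"name": "do_not_engage", "field": "customer_flags", "op": "contains",
     "value": "do_not_engage", "route_to": "human_transfer"},
]

# The runtime owns a small, fixed set of comparison operators.
OPERATORS = {
    "contains": lambda actual, expected: expected in actual,
    "greater_than": lambda actual, expected: actual > expected,
}


def check_escalation(situation: dict, triggers: list = TRIGGERS):
    """Return the first rule that fires, or None if the agent may proceed."""
    for rule in triggers:
        if rule["field"] not in situation:
            continue  # rule does not apply to this situation
        actual = situation[rule["field"]]
        if isinstance(actual, str):
            actual = actual.lower()  # match trigger words case-insensitively
        if OPERATORS[rule["op"]](actual, rule["value"]):
            return rule
    return None


hit = check_escalation({"transcript": "I'm going to file a lawsuit."})
print(hit["route_to"] if hit else "proceed")  # owner
```

Note that the check runs outside the model: the agent is never asked whether it feels confident, which is what separates this from implicit escalation.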

Mode C is the only mode that survives contact with reality. Operators always know things about their business that the vendor does not. The escalation system needs to be configurable by the people who actually run the business, not pre-baked by people who don't.

Ask the vendor: "Can I add a new escalation trigger without writing code or filing a support ticket?" If the answer is no, you are buying their model of your business instead of bringing your own.

5. Can I take my deployment data with me if I switch vendors?

This is the question vendors least want to answer, which is exactly why it matters.

"Deployment data" includes: your scope declaration, your escalation configuration, your completion records, your agent's interaction history with your customers, and any business-specific context the agent has accumulated (entity recognition data, customer preferences, vocabulary specific to your operation). It is the operational memory of your business as it runs through this vendor.
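One way to turn that list into something checkable is to ask for a sample export and confirm that every category is present. The sketch below, in Python, assumes the export arrives as a single JSON-like bundle; the section names are hypothetical keys chosen to mirror the five categories above, not a published format.

```python
# The five categories named above, as hypothetical keys in an export bundle.
REQUIRED_SECTIONS = [
    "scope_declaration",
    "escalation_configuration",
    "completion_records",
    "interaction_history",
    "business_context",
]


def missing_sections(export: dict) -> list:
    """List required sections that are absent or empty in the export."""
    return [name for name in REQUIRED_SECTIONS if not export.get(name)]


# Illustrative sample: the history is empty and the context is absent.
sample_export = {
    "scope_declaration": {"allowed_actions": ["send_email"]},
    "escalation_configuration": [{"name": "legal_language"}],
    "completion_records": [{"action": "send_email"}],
    "interaction_history": [],
}

print(missing_sections(sample_export))
# ['interaction_history', 'business_context']
```

An empty result does not prove the export is usable elsewhere, but a non-empty one tells you which parts of your operational memory would stay behind.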

The question is not whether the vendor will delete your data when you leave — most reputable ones will. The question is whether you can take it with you. The two are very different. Deletion protects you. Portability empowers you. A vendor who deletes your data but won't export it has built a moat out of your operations.

The buyer-side risk here is straightforward. If you cannot move your deployment data, switching vendors means starting over. That cost — measured in weeks of re-onboarding, lost institutional memory, and customer-facing rough edges — is the lock-in. The longer you run with a vendor, the higher the lock-in gets, and the worse your negotiating position becomes when contract renewal comes around.

A good answer: "Here's the export format. Here's the API. Here's the schema documentation. Yes, you own your deployment data; we are operating it on your behalf."

A bad answer, in any phrasing: "Our system isn't really designed for that."

A few words about demos

Every vendor in this category will offer you a demo. The demo is not the product. The demo is the happy-path showcase: the curated, rehearsed, optimized version of what the agent does on its best day. The questions above exist because you cannot evaluate an AI agent by watching its highlight reel.

When you do see a demo, three small disciplines help.

First, ask the vendor to demo a failure path on purpose. Have them show you what happens when the agent doesn't know the answer, when the user goes off-script, when the context is incomplete. If they can only demo success, they have only built success.

Second, ask whether you can run the demo with your own data. A canned demo against the vendor's pre-loaded examples tells you almost nothing. A live demo against your phone numbers, your customer list, your vocabulary — that tells you whether the agent generalizes.

Third, note what the demo doesn't show you. Does it show the completion record? The escalation panel? The audit trail? The scope declaration? If the answer is no, those things may not exist. Demos optimize for what looks good on a sales call, which is the inverse of what looks good in production.

What you are actually evaluating

The five questions above all point at the same underlying thing: does this vendor treat agents as a distinct category of software with its own accountability requirements, or are they shipping a chatbot in a trench coat?

The category is new enough that the answer for most vendors is the latter. That is not a reason for despair. It is a reason for discipline. You are buying very early in a market that has not yet settled on what "good" looks like, and your willingness to ask sharp questions now will shape both your own outcomes and — at the margin — the standards the market eventually converges on.

The Agent Deployment Standard exists because we believe these questions should have shared, public answers — that operators should not each have to re-derive the buying criteria from first principles. ADS does not yet answer all of them, and some of what it specifies will change as more operators run more agents in more real conditions. But the project's bet is that an open standard, built with input from the people who actually run these systems, will compound into a rubric the whole market can use.

In the meantime: ask the five questions. Watch the answers. Trust the vendors who treat them as legitimate. Walk away from the ones who treat them as adversarial.

The hardest part of this market right now is that you are early. The easiest part is that being early gives you leverage. Use it.

---

The Agent Deployment Standard is published by SAIL Institute under an open license, and this essay introduces the thinking behind it. ADS v0.1 is available for review and contribution; read the specification at github.com/keithesherman-stack/ADS.