Nick Gaudio, Salesforce Expert of 8 Years

Nick Gaudio, Sweep Staff, May 20, 2026

When to Trust Salesforce MCP (And When Not To)

TL;DR

L2 is where Salesforce MCP is at home — multi-element metadata retrieval actually works most of the time. Calling validation rules, fetching fields, querying flows, deploying components from an IDE: this is genuinely productive territory.
The trouble is when L2 looks like L3 from the inside. Teams ask whole-org questions, get a confident (and possibly wrong) answer back, and don't realize their MCP setup was doing L2 work the whole time.
A more honest question is "does my MCP setup actually cover the question I'm asking, or is it guessing?" (Most teams don't actually know.)


Let me start with the part that gets totally undersold in most AI tooling conversations these days: Salesforce MCP works!

It works when a dev in Cursor asks "what's the validation rule on this object" and gets back the rule, accurately. It works when Claude in an IDE deploys a Lightning component without alt-tabbing to Workbench. It works when an agent fetches a flow's metadata, walks through the structure, and spots an obvious bug. It has real utility.

This is L2 — multi-element metadata retrieval. The model knows what to ask for, the MCP exposes the right tool, the API returns the right data, the answer comes back. Genuinely useful work. Probably the single biggest productivity gain Salesforce developers have had in the last five years.

The problem shows up when you look at what happens at the seams.

What L2 actually does well

Concretely, across any of the five flavors of Salesforce MCP server, these are the things you can trust:

Targeted metadata retrieval. "Show me the fields on Account." "What's the picklist for Stage?" "Pull the XML for this flow." All of these are direct API calls returning specific objects. The MCP fetches, the model reads, the answer is reliable.
Record-level operations. "Update this opportunity's stage." "Create a contact tied to this account." "Bulk-update these records." Anything that's a CRUD operation on records, with the right permissions, works.
Single-component code analysis. Drop an Apex class into context, ask "where's the bug." Same with a flow XML. The model reads the file and reasons over it directly — that's L1 territory technically, but MCP makes it easier to pull the file.
Deploys and tests from the IDE. Cursor or Claude Code calling the DX MCP can deploy components and run Apex tests. The action is well-bounded, the API returns clear success/failure, the loop closes.
Narrow SOQL. "Show me Closed Won opportunities from EMEA last quarter with no related case." A SOQL query the model can write correctly against objects it can describe.

This is real productivity. None of it is in dispute.

Where L2 looks like L3 from the inside

The trouble starts when someone asks an MCP-connected AI a question that sounds like L2 but is actually L3.

"What's the validation rule on Opportunity Stage?" is L2. There's a specific object to query, and the answer is one or two records.

"What updates Opportunity Stage automatically?" sounds similar. It isn't. The answer touches every workflow field update, every approval process, every flow, every Apex trigger, and every managed package automation that writes to that field. There is no API for that question. (We wrote a whole piece about why.)

But here's the thing: from inside the chat window, the second question looks the same as the first. You type it. The AI calls some tools. The AI returns an answer. The answer sounds confident.

The model didn't say "I can answer the first one but not the second." It guessed. It called the tools it could call, retrieved what it could retrieve, and gave you back a list. The list looks plausible. The list is also incomplete, and you have no way to tell how incomplete without checking by hand.

That's the L2/L3 trap. Same interface, same workflow, dramatically different reliability.
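
One way to picture the trap is as a coverage check that nobody runs. The sketch below is illustrative only: the source-type labels and tool names are made up, and the list of automation types just mirrors the ones named above, so it is not exhaustive. It compares what a question needs inspected against what the enabled tools can actually inspect.

```python
# Hypothetical labels for the kinds of automation that can write to a field.
# They mirror the examples in the text and are not an exhaustive list.
REQUIRED_SOURCES = {
    "workflow_field_update",
    "approval_process",
    "flow",
    "apex_trigger",
    "managed_package_automation",
}


def coverage_report(tool_sources: dict[str, set[str]]) -> dict[str, object]:
    """Compare what the enabled tools can inspect with what the question needs."""
    covered = set().union(*tool_sources.values()) & REQUIRED_SOURCES
    missing = REQUIRED_SOURCES - covered
    return {
        "covered": sorted(covered),
        "missing": sorted(missing),
        "complete": not missing,
    }


# Assumed setup: two tools (names invented), each able to inspect one source type.
enabled_tools = {
    "retrieve_flow": {"flow"},
    "get_apex_class": {"apex_trigger"},
}

report = coverage_report(enabled_tools)
print(report["missing"])
```

In this toy setup the AI would still return a plausible list of flows and triggers. The report is what tells you three whole categories were never looked at.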

Why teams overestimate their MCP coverage

Most teams I talk to don't fully know what's in their MCP setup. They installed it. It answers some questions. They started trusting it. They never sat down and audited which questions it actually covers.

A few patterns I see in the wild:

"We use the Salesforce DX MCP." OK, but the DX MCP has dozens of tools across orgs, metadata, data, users, and code-analyzer toolsets. Which toolsets did you enable? Which scopes? When was the last time you checked what's exposed vs. what's not?

"We built our own MCP." Cool. Who built it? What does it cover? Does it parse flow XML or just retrieve it? Does it walk Apex SymbolTable references or just return class definitions? Does it touch managed package metadata at all? Does the team that uses it daily know any of this?

"We use the hosted MCP from Salesforce." That covers 11+ endpoints currently in pilot — sobject-all, sobject-reads, invocable-actions, flows, data-360, prompt-builder, tableau-next. Useful set. Doesn't include dependency analysis. Doesn't include the Tooling API queries that would let you walk references. Doesn't include the Metadata API at the level needed for full component retrieval.

In every case, the set of questions the MCP can answer reliably is smaller than the set of questions the team actually asks. The gap is where confident-wrong answers come from.
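
For the DX MCP case, a first pass at that audit can be as small as reading your own client config. This is a minimal sketch under a few assumptions: that your client stores servers in an `mcpServers`-style JSON file, that the server is launched with a `--toolsets` argument, and that the toolset names match the ones listed above. The server name `salesforce-dx` and the sample config are made up, so check your own client and server docs before relying on any of it.

```python
import json

# Toolset names as listed in the text; treat them as an assumption about
# your server version, not an authoritative list.
KNOWN_TOOLSETS = {"orgs", "metadata", "data", "users", "code-analyzer"}


def _split(value: str) -> set[str]:
    """Split a comma-separated toolset list into a set of names."""
    return {name.strip() for name in value.split(",") if name.strip()}


def enabled_toolsets(config_text: str, server_name: str) -> set[str]:
    """Return the toolsets passed via `--toolsets` for one configured server."""
    config = json.loads(config_text)
    args = config["mcpServers"][server_name].get("args", [])
    for i, arg in enumerate(args):
        if arg == "--toolsets" and i + 1 < len(args):
            return _split(args[i + 1])
        if arg.startswith("--toolsets="):
            return _split(arg.split("=", 1)[1])
    return set()  # no explicit toolsets argument found


# Made-up sample config for illustration.
SAMPLE_CONFIG = """
{
  "mcpServers": {
    "salesforce-dx": {
      "command": "npx",
      "args": ["-y", "@salesforce/mcp", "--orgs", "DEFAULT_TARGET_ORG",
               "--toolsets", "orgs,metadata"]
    }
  }
}
"""

enabled = enabled_toolsets(SAMPLE_CONFIG, "salesforce-dx")
not_enabled = sorted(KNOWN_TOOLSETS - enabled)
print("enabled:", sorted(enabled))
print("not enabled:", not_enabled)
```

An empty result only means no explicit `--toolsets` argument was found. What the server exposes by default in that case is something to confirm in its documentation.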

The honest test

If you want to know which of your AI questions are getting reliable answers, the test is straightforward but uncomfortable.

Pick a question your team asked an MCP-connected AI recently — ideally one whose answer drove a decision. Now go validate it by hand. Open Setup, walk every component the AI mentioned, check that they actually do what the AI said. Then walk every other component you can think of that might also be relevant. The ones the AI didn't mention.

Two questions to ask yourself at the end:

Did the AI list everything?
Were the things it listed accurate?

If both answers are yes, you were in L2 territory and your MCP setup served you. If either answer is no, you were asking an L3 question with L2 tools, and you walked into the trap.
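
If you want to keep score across several rounds of this test, the two questions reduce to two set differences. The sketch below is a simplification: it treats "accurate" as "held up when checked by hand," and every component name in it is invented for illustration.

```python
def score_answer(ai_listed: set[str], verified_relevant: set[str]) -> dict[str, object]:
    """Score one AI answer against what a manual walk through Setup confirmed."""
    missed = verified_relevant - ai_listed  # relevant, but the AI never mentioned it
    wrong = ai_listed - verified_relevant   # mentioned, but did not hold up on inspection
    return {
        "listed_everything": not missed,
        "all_accurate": not wrong,
        "missed": sorted(missed),
        "wrong": sorted(wrong),
        "l2_territory": not missed and not wrong,
    }


# Invented example: the AI named two components; the manual check found a third.
ai_listed = {"Flow: Update Stage on Close", "Trigger: OpportunityTrigger"}
verified_relevant = {
    "Flow: Update Stage on Close",
    "Trigger: OpportunityTrigger",
    "Approval: Discount Approval",
}

result = score_answer(ai_listed, verified_relevant)
print(result["l2_territory"], result["missed"])
```

Here everything the AI listed was accurate, but one relevant component was missing, so this was an L3 question answered with L2 tools.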

Do this five times, with five different questions, and you'll have a real sense of where the line is for your specific setup. Most teams haven't done it once.

What L2 is actually best at

The fix isn't "stop using MCP." MCP is great at what it does. The fix is to use L2 tools for L2 questions and stop expecting them to scale up.

A useful default: anything that touches one specific component, one record, one file — L2 territory, trust the answer (and verify on anything material). Anything that touches the relationships between components, the dependencies, the whole-org behavior — L3 territory, and your MCP is guessing.

Sweep's MCP server exists for this distinction. The Sweep MCP doesn't replace your DX MCP or your hosted MCP — it sits alongside them, exposing the metadata graph as a tool. When the AI asks "what references this field," Sweep returns the full answer from the graph. When the AI asks "show me this record," your existing MCP returns it. Each tool serves its layer.
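
To make "exposing the metadata graph as a tool" concrete, here is a toy reverse-reference lookup. It is not Sweep's implementation, and the component names and edges are invented. It only illustrates why a prebuilt graph can answer "what references this field" in one lookup instead of a series of per-component fetches.

```python
from collections import defaultdict

# Toy edges: (component, field it references). All names are made up.
EDGES = [
    ("Flow:Close_Opportunity", "Opportunity.StageName"),
    ("ApexTrigger:OpportunityTrigger", "Opportunity.StageName"),
    ("ValidationRule:Require_Close_Reason", "Opportunity.StageName"),
    ("Flow:Close_Opportunity", "Opportunity.CloseDate"),
]


def build_reverse_index(edges):
    """Map each referenced field to the set of components that reference it."""
    index = defaultdict(set)
    for component, target in edges:
        index[target].add(component)
    return index


def what_references(index, field):
    """Return every known component that references the given field, sorted."""
    return sorted(index.get(field, set()))


index = build_reverse_index(EDGES)
print(what_references(index, "Opportunity.StageName"))
```

The lookup is only as complete as the edges behind it, which is the whole point: building those edges across every automation type is the L3 work that a per-component tool never does.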

For L2 work that's expanding into L3 — and most enterprise Salesforce work eventually is — that's the architecture that actually closes the loop.

The bottom line

MCP works. Use it. Trust it for L2 work. Verify the results on anything that drives a decision.

For L3 and L4 questions — the ones about the whole org and about change over time — assume your MCP-only setup is guessing until you can prove otherwise. The fix isn't a better prompt. It's a graph between the AI and the org.

If you've never sat down to map which of your questions are L2 versus L3, that's the first piece of work. Start with the framework. The rest follows from there.