- If you build an agentic content QA pipeline with Claude and MCP using a single session, the writer grades its own work. Scores cluster near the top of the rubric and rarely disagree with the draft they just wrote.
- The fix is a two-agent topology: the QA agent runs in a fresh Claude session with only the finished draft, a rubric MCP, and a source-of-truth MCP loaded. No writing prompt. No chain-of-thought. In practice, editor agreement rises sharply on the same drafts and rubric once context is isolated.
- A rubric MCP returns deterministic pass/fail per criterion (e.g.,
check_cpl_within_vertical_benchmark). It is not an LLM-as-judge prompt. The difference is auditability. - Cost runs to pennies per article at Claude Sonnet 4.6 API pricing in 2026 ($3/MTok input, $15/MTok output). A rounding error against one hallucinated CPL benchmark reaching a client.
- Cap rewrite attempts at 2. Anything legal-adjacent, novel-benchmark, or client-specific commitment escalates to a human editor, not another loop.
Questions this article answers:
- Why does my Claude QA agent score every draft 9 out of 10?
- What is a rubric MCP and how is it different from an LLM-as-judge prompt?
- What goes into a source-of-truth MCP for marketing content?
- How do I prevent infinite rewrite loops in an agentic QA pipeline?
- How do I measure whether my content QA pipeline actually works?
- What does the QA agent need to NOT see?
If Your QA Agent Scores Every Draft 9/10, It’s Grading Its Own Homework
If you build an agentic content QA pipeline with Claude and MCP and every draft comes back at 9/10, your pipeline is broken. Not your prompt. Your architecture. The same Claude session that wrote the draft is grading it, and the writing context already primed the evaluator with every justification it needs to approve itself.
We ran into this on vertical lead-gen articles where a hallucinated CPL benchmark (cost per lead, the price an advertiser pays for one form-fill or call) gets a client fired. The single-agent QA pass rubber-stamped drafts that human editors flagged for invented numbers and fabricated platform behavior. Same prompt. Same rubric.
The fix was not a better prompt. It was a two-agent topology with two MCP servers (Model Context Protocol, Anthropic’s open standard for giving Claude tool access), plus a state machine that routes to approve, rewrite, or escalate.
This article walks the build. Audience: marketing engineers running production content pipelines in 2026 who already wired Claude into drafting and now need a QA layer that won’t embarrass them.
Why a Single Claude Session Scores Its Own Drafts 9/10
A Claude session asked to evaluate its own writing has the writer’s reasoning sitting in its context window. The evaluator can read every justification for every choice before it scores anything. It cannot disagree with itself. The chain-of-thought, the brief, the style guide, the article concept doc, all of it is already in the conversation. When you ask “is this good?” at the end, you are asking a writer to grade work they just defended.
This is self-evaluation bias, and it is the load-bearing failure mode most “Claude + MCP for content” guides skip past. They treat QA as a prompt-engineering problem (“add a stricter rubric to your evaluator prompt”). It is not. No prompt fixes a poisoned context.
Teams running single-agent QA loops typically see scores cluster near the top of the rubric, with the agent agreeing with itself on almost every draft. Run those same drafts through a fresh-context evaluator and the scores spread out, the failures get specific, and editor agreement climbs. The architecture is the only thing that changed.
The takeaway: “LLM-as-judge” prompts inherit the same problem when they share context with the writer. The fix is structural.
The Two-Agent Topology: Writer and QA Agent in Isolated Contexts
The two-agent topology runs the writer and the evaluator in separate Claude sessions with no shared context. The QA agent sees only the finished text plus the tools it needs to verify claims. That is the entire architectural insight. Everything else is plumbing.
Here is the flow in plain prose:
- Writer agent. One Claude session. Full brief, client style guide, content-generation tools loaded. Produces a draft as a clean artifact (just the markdown, no chain-of-thought attached).
- Draft artifact handoff. The draft is saved to a file, an object store, or a message queue. The writer’s session ends.
- QA agent. A fresh Claude session. Loaded with the finished draft text, the rubric MCP, and the source-of-truth MCP. Nothing else. No writing prompt. No article concept. No notes from the writer.
- State machine. Reads the QA agent’s per-criterion results and routes the draft to approve, rewrite, or escalate.
- Rewrite agent (if triggered). Another fresh session, given only the flagged sections and the failed criteria. Returns a patched draft, which goes back to step 3.
Implementing context isolation in Claude Code vs. the API
In Claude Code, the cleanest implementation is sub-agents. Each sub-agent gets its own context and tool list. Configure the QA sub-agent with access to only the rubric and source-of-truth MCP servers. Do not give it the writer’s tools or system prompt.
On the API, use separate conversation IDs and explicitly build the QA agent’s message history from scratch. The system prompt for the QA agent should never mention the article’s purpose, the target keyword, or any reasoning from the writer. It should describe the QA agent’s job and the rubric, and nothing more.
What the QA agent must NOT see
- The writer’s system prompt
- The article brief or outline
- The target keyword or SEO instructions
- Any chain-of-thought, scratchpad, or planning notes from the writer
- The client style guide (the writer applied it; the QA agent checks the result, not the process)
The cost objection comes up here. A second Claude session feels expensive. At Claude Sonnet 4.6 API pricing in 2026 ($3 per million input tokens, $15 per million output tokens), a full two-agent QA pass on a 1,500-word article lands in the pennies-per-article range depending on prompt size and tool call volume. One hallucinated benchmark in a published client article costs more than running this pipeline for a year.

Build a Rubric MCP That Returns Pass/Fail Per Criterion
A rubric MCP exposes deterministic tool calls. Calls like check_claim_has_source(claim_id), check_cpl_within_vertical_benchmark(value, vertical), and check_platform_capability(platform, feature), each returning a boolean pass/fail plus a reason string. No composite scores. No “rate this 1 to 10.” The QA agent walks the rubric tool by tool, and every judgment is grounded in a tool call rather than a vibe.
This is the architectural line most guides miss. They conflate two different things:
| LLM-as-judge prompt | Rubric MCP |
|---|---|
| “Rate this draft 1 to 10 on accuracy, clarity, tone.” | check_claim_has_source(claim_id) returns {pass: false, reason: "no source linked"} |
| Composite score, no audit trail | Per-criterion result, fully logged |
| Non-deterministic across runs | Deterministic for the same input |
| Subjective | Auditable |
The rubric MCP forces structure. The agent cannot wave its hands. Every claim either passes a tool call or it doesn’t.
Example rubric for vertical lead-gen content
A working rubric for vertical lead-gen articles looks something like this:
check_claim_has_source: every standalone statistic must link to an external source.check_cpl_within_vertical_benchmark: any CPL number cited falls within a known range for the vertical.check_platform_behavior_verified: claims about platform features (Meta Lead Ads field support, Google Ads bidding behavior) match current platform docs.check_case_study_numbers_traceable: any named client outcome traces back to an approved internal source.check_client_commitment_flagged_for_legal: any sentence promising a specific result for a client is flagged for human review.
Each tool returns pass/fail plus a reason. The QA agent calls each one against the relevant section of the draft. The state machine reads the results.
The Source-of-Truth MCP Is the One That Saves Client Relationships
The highest-ROI MCP to wire first is the source-of-truth MCP, not the rubric. Point it at your client’s actual CRM exports, vertical pricing sheets, internal benchmark tables, and approved case-study numbers. Tone errors get caught by editors. Fabricated numbers get clients fired.
The QA failures we see in production are not grammar or tone. They are:
- Invented CPL ranges (“insurance CPLs run $18 to $45” when the client’s actual range is $32 to $78).
- Fabricated platform behavior (claiming Meta Lead Ads supports a field that does not exist, per Meta’s Lead Ads documentation).
- Made-up case-study numbers (a 40% lift that nobody can trace to a real account).
A tone evaluator catches none of these. A source-of-truth MCP catches all three.
Which data sources actually go into it
The source-of-truth MCP wraps the data the model should be fact-checking against. For a vertical lead-gen agency, that usually means:
- Client CRM exports (current CPL, conversion rate, close rate by vertical).
- Internal benchmark tables (refreshed quarterly).
- Platform feature matrices (what each ad platform actually supports today).
- Approved case study database (numbers cleared by legal and the client).
- Pricing sheets and rate cards.
Expose each as a tool call. lookup_cpl_benchmark(vertical, market_size) returns the current range. lookup_case_study_number(client_id, metric) returns the approved figure or null. The QA agent calls the tool for every verifiable number in the draft and compares.
Handling staleness
The source-of-truth MCP is only as good as the data behind it. Stale data is a failure mode. If the CPL benchmark table hasn’t been refreshed since Q1 and you’re publishing in Q4, the QA agent will confidently approve numbers that are now wrong.
Two defenses:
- Tag every data source with a
last_updatedtimestamp. The MCP returns it on every call. The QA agent flags any source older than a threshold (we use 90 days for CPL data) and escalates rather than approves. - When the MCP returns
no_data, that is an escalation, not a pass. Silence is not consent. If the agent can’t verify a claim, a human has to.
The State Machine: Approve, Rewrite, or Escalate
QA is a routing decision, not a binary score. The state machine reads per-criterion results from the rubric MCP and sends the draft to one of three destinations: auto-approve to the CMS, rewrite, or human escalation. Treating QA as a single number throws away the routing information that makes the pipeline useful.
The routing logic we run:
- All criteria pass: approve and push to the CMS.
- 1 to 3 criteria fail, all in rewriteable categories (tone, missing citation, weak claim phrasing): invoke the rewrite agent, then re-run QA.
- Any criterion fails in a non-rewriteable category (legal claim, novel benchmark, client-specific commitment, fabricated case study): escalate to a human editor with the full rubric trace.
- Rewrite attempts exceed the cap: escalate regardless of category.
Rewrite loop ceiling: 2 is the right cap
Without a cap, the rewrite agent can fail the same criterion three or four times in a row, rephrasing without fixing. We cap at 2 rewrite attempts before forced escalation. If the model couldn’t fix it in two tries, the problem is structural (missing source data, ambiguous criterion, claim that shouldn’t be in the draft at all), and a human needs to look.
Give the rewrite agent only the failed criterion and the relevant draft section. Not the full QA reasoning. Not the writer’s original prompt. Fresh context, narrow task.
Where human review still has to live
Three categories should always escalate:
- Legal-adjacent claims: anything about compliance, regulatory thresholds, or platform policy.
- Novel benchmarks: numbers the source-of-truth MCP doesn’t have a record of.
- Client-specific commitments: promises about outcomes for a named client.
The pipeline is not trying to replace editors. It’s trying to make sure the only drafts that reach editors are the ones that actually need editorial judgment, not the ones with three invented CPL numbers that any tool call would have caught.
Measuring Whether the Pipeline Works
Inter-rater agreement against a human editor is the only honest metric for whether your QA pipeline works. Agreements between QA agent and human editor divided by total reviewed drafts. Run it weekly on a sample. Anything else (“average rubric score,” “approval rate”) measures the pipeline’s confidence in itself, which is exactly the bias you started with.
The formula:
“` Inter-rater agreement = (drafts where QA agent and human editor reached the same approve/rewrite/escalate decision) ÷ (total drafts in sample) “`
The pattern most teams see, holding rubric and drafts constant and changing only architecture: single-agent topologies agree with editors on roughly half of contested drafts. Two-agent topologies with context isolation push that agreement materially higher, into the range where the pipeline is doing useful triage instead of laundering self-confidence. The single biggest gain comes from context isolation. Adding the source-of-truth MCP closes the remaining gap on numeric claims.
Failure modes once the pipeline is live
Watch for three patterns in production:
- Rubric overfitting. Agreement climbs above 95% and the escalation rate drops near zero. The rubric is now rubber-stamping a different way: too narrow to catch new failure types. Add criteria.
- Source-of-truth staleness. The agent confidently approves a CPL claim that was true last quarter. Audit data sources monthly and tighten the
last_updatedthreshold. - Infinite rewrite loops. A criterion keeps failing across multiple rewrites. The cap of 2 exists for this. Force escalation and look at why the rewrite agent couldn’t fix it.
Log every tool call the QA agent makes: criterion, input, result, reason string, MCP version, timestamp. For agency work, this is your audit trail. When a client’s compliance team asks how a claim got cleared, you point at the trace. Without logs, you have no answer.
If you’re also weighing how AI is reshaping the buy side of marketing, our breakdown of the Meta Ads MCP for Claude operator setup walks the same isolation principles applied to ad-account access.
Frequently Asked Questions
Why does my Claude QA agent score every draft 9 out of 10?
Because the QA agent is running in the same context as the writer, so it has access to every justification the writer made before scoring anything. This is self-evaluation bias. The fix is structural: run the QA agent in a fresh Claude session with no access to the writing prompt, brief, or chain-of-thought. Same model, fresh context, scores spread out and editor agreement rises.
What is a rubric MCP and how is it different from an LLM-as-judge prompt?
A rubric MCP is a Model Context Protocol server that exposes deterministic tool calls returning pass/fail per criterion, while an LLM-as-judge prompt asks one model to assign a subjective score. The MCP version is auditable, deterministic across runs, and forces the agent to ground each judgment in a tool call. The prompt version inherits the same biases as the writer and produces non-reproducible scores.
What goes into a source-of-truth MCP for marketing content?
A source-of-truth MCP wraps the data the QA agent fact-checks against: client CRM exports, vertical benchmark tables, platform feature matrices, approved case studies, and pricing sheets. Each is exposed as a tool call the agent can hit for every verifiable claim. Tag every record with a last_updated timestamp so the agent can escalate when data is stale instead of approving on outdated benchmarks.
How do I prevent infinite rewrite loops in an agentic QA pipeline?
Cap rewrite attempts at 2 and force escalation to a human editor after the second failed rewrite. If the model can’t fix a flagged criterion in two tries with a fresh-context rewrite agent, the problem is structural: missing source data, ambiguous criterion, or a claim that shouldn’t be in the draft. A human, not another loop, resolves it.
How do I measure whether my content QA pipeline actually works?
Use inter-rater agreement against a human editor: agreements between the QA agent and editor divided by total drafts in the sample. Run it weekly on a fresh sample. Average rubric scores and approval rates measure the pipeline’s confidence in itself, which is exactly the bias you’re trying to fix. Editor agreement is the only ground truth.
What does the QA agent need to NOT see?
The QA agent must not see the writer’s system prompt, the article brief or outline, the target keyword, any chain-of-thought from the writer, or the client style guide. It sees the finished draft text and the tools it needs (rubric MCP, source-of-truth MCP). Context isolation is the single highest-leverage change in the entire architecture, and the one that moves editor agreement most when teams measure before and after.
Talk to Elevarus About Your Agentic Content QA Pipeline
If you’re building this in-house and want a second set of eyes on the topology, rubric design, or source-of-truth MCP scope, or if you want help designing and operating one for your agency’s vertical content, book a free consultation. Not a sales pitch, a working conversation about where the architecture is load-bearing and where it can be lighter. The cost of getting this wrong is measured in client relationships, not Claude tokens.