category: tutorial

The Test Is the Product: How to Build Self-Refining SEO Pages

// the agent is not the magic. the evaluator is.

Jun 30, 2026 • 9 min read

> generate → test → refine → repeat

Everyone wants a self-improving AI system. Most people accidentally build a one-shot generator with a cron job taped to it. The difference is not the model. The difference is the test.

A real self-refining loop is brutally simple: generate → test → refine → repeat until threshold. If the test is cheap, fast, and correlated with the actual goal, you can improve the output without a human babysitting every attempt. If the test is vague, slow, or as hard as writing the thing, you do not have a loop. You have vibes in a trench coat.

The rule:

A self-improving system only exists when verification is easier than generation. If judging the output costs as much as creating it, the loop is dead before the agent starts.

Let's use an SEO page as the example

Say you run a product page that targets a search query like "AI invoice automation for agencies". You want an agent to keep improving that page: better title, cleaner intro, stronger examples, better internal links, tighter schema, more useful FAQs, fewer content gaps.

The naive version is: "AI, make this page rank better." That's not a loop. That's a prayer. Ranking takes weeks, search results move for reasons outside your page, and the model can easily optimize for the wrong thing. A loop needs a faster signal than "did Google like it eventually?"

The workable version is: define a page candidate as the unit of work, run it through a battery of cheap tests, feed the failure details back into the agent, and stop when the page crosses a threshold. Real ranking data still matters, but it is not the inner loop. It is the outer scoreboard.

Phase 0: define the unit of work

Before you build anything, answer two boring questions.

  • Output: one publishable SEO landing page for one query and one intent.
  • Good enough: the page satisfies the search intent, covers the required entities and questions, passes technical SEO checks, has a clear conversion path, and does not make unsupported claims.

That second answer matters. If "good" just means "feels premium" or "sounds like our brand," the machine has nothing stable to climb. You can still use AI, but the human is the test layer.
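One way to keep those two answers honest is to write them down as data before any agent exists. The sketch below is only an illustration: `PageSpec`, its field names, and `specProblems` are hypothetical, and the threshold of 85 simply mirrors the pass rule used later in this post.

```typescript
// Hypothetical shape for the unit of work; field names are illustrative.
type PageSpec = {
  query: string;               // the one search query this page targets
  intent: "informational" | "commercial" | "transactional";
  requiredEntities: string[];  // concepts the page must mention
  requiredQuestions: string[]; // questions the page must answer
  passThreshold: number;       // 0-100, same scale as the evaluator score
};

// Returns the reasons a spec is too vague to test against (empty = usable).
function specProblems(spec: PageSpec): string[] {
  const problems: string[] = [];
  if (spec.query.trim().length === 0) problems.push("query is empty");
  if (spec.requiredEntities.length === 0) problems.push("no required entities listed");
  if (spec.requiredQuestions.length === 0) problems.push("no required questions listed");
  if (spec.passThreshold < 0 || spec.passThreshold > 100) {
    problems.push("passThreshold must be between 0 and 100");
  }
  return problems;
}

const invoiceSpec: PageSpec = {
  query: "AI invoice automation for agencies",
  intent: "commercial",
  requiredEntities: ["invoice", "agency", "automation"],
  requiredQuestions: ["When does automation pay back?"],
  passThreshold: 85,
};
```

If `specProblems(invoiceSpec)` comes back non-empty, the definition of "good enough" is still too soft to automate.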

The testing layer is the gate

Most people start by designing the agent. Wrong order. Start by designing evaluate(page) → score + failures. If you cannot write that function, you cannot build the loop.

type EvalResult = {
  score: number; // 0-100
  pass: boolean;
  failures: Array<{
    test: string;
    severity: "low" | "medium" | "high";
    message: string;
    fixHint: string;
  }>;
};

async function evaluatePage(pageHtml: string, targetQuery: string): Promise<EvalResult> {
  return weightedScore([
    checkTitleAndMeta(pageHtml, targetQuery),
    checkIntentCoverage(pageHtml, targetQuery),
    checkEntityCoverage(pageHtml, targetQuery),
    checkInternalLinks(pageHtml),
    checkSchemaMarkup(pageHtml),
    checkClaimsHaveEvidence(pageHtml),
    checkReadability(pageHtml),
    checkConversionPath(pageHtml),
  ]);
}

Notice the shape: not just a number. The evaluator returns what failed and how to fix it. A score by itself is almost useless. A failed test with a concrete hint is fuel for the next attempt.
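The post never shows `weightedScore`, so here is one possible shape for it, as a sketch under assumptions. It assumes each check has already resolved to a hypothetical `CheckResult` (a weight, a 0-100 score, and its failures), and it applies the pass rule used in this post: a score of at least 85 with no high-severity failure.

```typescript
type Severity = "low" | "medium" | "high";

type TestFailure = {
  test: string;
  severity: Severity;
  message: string;
  fixHint: string;
};

type EvalResult = { score: number; pass: boolean; failures: TestFailure[] };

// Hypothetical per-check output; not defined in the original post.
type CheckResult = { weight: number; score: number; failures: TestFailure[] };

function weightedScore(checks: CheckResult[], threshold: number = 85): EvalResult {
  let totalWeight = 0;
  let weightedSum = 0;
  const failures: TestFailure[] = [];

  for (const check of checks) {
    totalWeight += check.weight;
    weightedSum += check.weight * check.score;
    for (const f of check.failures) failures.push(f);
  }

  // Normalize by total weight so the weights need not sum to exactly 1.
  const score = totalWeight > 0 ? weightedSum / totalWeight : 0;

  // A single high-severity failure blocks the pass, whatever the score.
  const hasHigh = failures.some((f) => f.severity === "high");

  return { score, pass: score >= threshold && !hasHigh, failures };
}
```

The design choice worth copying is the veto: a page can score 90 and still fail if one high-severity test is red.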

What can be tested cheaply?

For an SEO page, the inner loop should avoid pretending it can predict Google. Instead, test the things that are cheap and usually correlated with quality.

| Test | Signal | Why it works |
| --- | --- | --- |
| Intent coverage | Does the page answer the jobs implied by the query? | Useful pages usually satisfy the searcher's actual task. |
| Entity coverage | Are the expected concepts, tools, objections, and examples present? | Missing entities often reveal thin content. |
| Technical SEO | Title, meta, canonical, schema, headings, links, image alt text. | Cheap deterministic checks catch dumb mistakes. |
| Evidence | Are claims backed by sources, examples, or product facts? | Prevents the agent from inventing authority. |
| Conversion path | Is there a clear next step for the right visitor? | Traffic without action is just server cost. |

None of these perfectly equal "rank number one." But they are fast, inspectable, and directionally useful. That is the bar for an inner-loop metric.
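To make "cheap deterministic check" concrete, here is a rough sketch of what a `checkTitleAndMeta` could look like. Everything in it is an assumption: it uses regexes where a real build would use an HTML parser, the 60-character title limit is a rule of thumb rather than a documented cutoff, and the penalty points are arbitrary.

```typescript
type CheckFailure = {
  test: string;
  severity: "low" | "medium" | "high";
  message: string;
  fixHint: string;
};

// Simplification: regexes instead of a real HTML parser. The meta regex
// assumes name="description" comes before a double-quoted content attribute.
function extractTitle(html: string): string {
  const m = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return m ? m[1].trim() : "";
}

function extractMetaDescription(html: string): string {
  const m = /<meta\s+name="description"\s+content="([^"]*)"/i.exec(html);
  return m ? m[1].trim() : "";
}

function containsIgnoreCase(haystack: string, needle: string): boolean {
  return haystack.toLowerCase().indexOf(needle.toLowerCase()) !== -1;
}

// Assumed limit: a common rule of thumb, not a guaranteed cutoff.
const MAX_TITLE_LENGTH = 60;

function checkTitleAndMeta(
  html: string,
  targetQuery: string
): { score: number; failures: CheckFailure[] } {
  const failures: CheckFailure[] = [];
  const title = extractTitle(html);
  const description = extractMetaDescription(html);

  if (title === "") {
    failures.push({ test: "title", severity: "high",
      message: "Page has no <title>.",
      fixHint: "Add a title that includes the target query." });
  } else {
    if (!containsIgnoreCase(title, targetQuery)) {
      failures.push({ test: "title", severity: "medium",
        message: "Title does not contain the target query.",
        fixHint: "Work the target query into the title." });
    }
    if (title.length > MAX_TITLE_LENGTH) {
      failures.push({ test: "title", severity: "low",
        message: "Title is longer than " + MAX_TITLE_LENGTH + " characters.",
        fixHint: "Shorten the title." });
    }
  }
  if (description === "") {
    failures.push({ test: "meta", severity: "medium",
      message: "Page has no meta description.",
      fixHint: "Add a meta description that names the buyer." });
  }

  // Assumed penalty scheme: high = 40, medium = 20, low = 10 points.
  let score = 100;
  for (const f of failures) {
    score -= f.severity === "high" ? 40 : f.severity === "medium" ? 20 : 10;
  }
  return { score: Math.max(0, score), failures };
}
```

A check like this runs in microseconds and never argues with you, which is exactly what the inner loop wants.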

The verifier must be cheaper than the writer

If your agent spends two minutes generating a page and your evaluator spends ten minutes doing a deep market analysis, the economics are upside down. The test should be mostly deterministic code, retrieval, schema validation, lightweight LLM judging, and maybe a small SERP snapshot.

const score =
  0.20 * titleMetaScore +
  0.25 * intentCoverageScore +
  0.20 * entityCoverageScore +
  0.10 * technicalSeoScore +
  0.15 * evidenceScore +
  0.10 * conversionScore;

const pass = score >= 85 && failures.every(f => f.severity !== "high");

The exact weights are less important than the fact that the rules exist. Once the rules exist, the agent can fight them. That sounds bad, and sometimes it is. But it is also what makes iteration possible.

Goodhart will try to eat your loop

The moment a metric becomes the target, the agent will learn to hack it. If your evaluator rewards keyword usage, the agent will keyword-stuff. If it rewards word count, it will bloat the page. If it rewards internal links, it will turn every paragraph into blue spaghetti.

Hardening move:

Every positive metric needs an opposing guardrail. Reward entity coverage, but penalize repetition. Reward completeness, but penalize fluff. Reward conversion, but penalize intrusive CTAs that break the reading experience.

A good SEO loop has anti-gaming tests baked in: duplication checks, readability floors, claim verification, keyword stuffing penalties, and a final human review before publish if the page affects the brand or makes strong claims.
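Here is what one opposing guardrail might look like for keyword usage, as a minimal sketch. The function names are hypothetical, and the 3% density cap and the linear penalty are arbitrary assumptions, not SEO facts.

```typescript
// Lowercases the text and splits it into alphanumeric words.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);
}

// Share of all words on the page that belong to an occurrence of the phrase.
function phraseDensity(text: string, phrase: string): number {
  const words = tokenize(text);
  const target = tokenize(phrase);
  if (words.length === 0 || target.length === 0) return 0;

  let hits = 0;
  for (let i = 0; i + target.length <= words.length; i++) {
    let match = true;
    for (let j = 0; j < target.length; j++) {
      if (words[i + j] !== target[j]) {
        match = false;
        break;
      }
    }
    if (match) hits++;
  }
  return (hits * target.length) / words.length;
}

// Rewards presence of the phrase, but the score drops once density passes
// maxDensity. The 0.03 default is an assumed cap, chosen for illustration.
function guardedKeywordScore(
  text: string,
  phrase: string,
  maxDensity: number = 0.03
): number {
  const density = phraseDensity(text, phrase);
  if (density === 0) return 0;           // phrase never appears
  if (density <= maxDensity) return 100; // present and not overused
  // Linear penalty past the cap, floored at 0.
  return Math.max(0, 100 - ((density - maxDensity) / maxDensity) * 100);
}
```

With this shape, adding the phrase a tenth time stops paying and starts costing, so the agent has no reason to stuff.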

The three-layer architecture

I like thinking about these systems in three layers: testing, agent, and UI. In that order. The UI is last because a beautiful dashboard wrapped around a bad metric is just a slot machine with charts.

1. Testing layer

This is the product. It takes a candidate page and returns a score, a pass/fail verdict, and a list of actionable failures.

  • Rules engine for deterministic checks: headings, meta tags, links, schema, broken URLs.
  • Retrieval layer for source material: product docs, customer notes, competitor SERP pages, internal case studies.
  • LLM judge for the fuzzy but bounded questions: intent match, clarity, unsupported claims, objection handling.
  • Analytics adapter for slow outer-loop truth: impressions, clicks, conversions, scroll depth, demo requests.

2. AI agent

The agent should not receive "score: 72" and guess what happened. It should receive the exact failing tests and rewrite against them.

Attempt 2 failed:
- HIGH: Missing pricing objection section.
  Fix: Add a short section explaining when automation pays back and when it doesn't.
- MEDIUM: Title uses target keyword, but meta description does not include the buyer persona.
  Fix: Mention agencies or service businesses in the meta description.
- MEDIUM: Two claims about ROI have no evidence.
  Fix: Replace with sourced examples or soften the language.

That is actionable. The next generation step can be narrow: "rewrite only the objection section and meta description; do not touch sections that passed." This keeps the loop from oscillating and destroying its own good work.
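Turning evaluator output into that kind of report is a few lines of string formatting. A sketch, assuming the failure shape from the `EvalResult` type earlier in this post; `formatFailureReport` and the closing instruction line are hypothetical.

```typescript
type Severity = "low" | "medium" | "high";

type TestFailure = {
  test: string;
  severity: Severity;
  message: string;
  fixHint: string;
};

// Lower rank sorts first, so high-severity failures lead the report.
const severityRank: Record<Severity, number> = { high: 0, medium: 1, low: 2 };

function formatFailureReport(attempt: number, failures: TestFailure[]): string {
  // Copy before sorting so the caller's array is left untouched.
  const sorted = failures
    .slice()
    .sort((a, b) => severityRank[a.severity] - severityRank[b.severity]);

  const lines: string[] = ["Attempt " + attempt + " failed:"];
  for (const f of sorted) {
    lines.push("- " + f.severity.toUpperCase() + ": " + f.message);
    lines.push("  Fix: " + f.fixHint);
  }
  // Keep the next rewrite narrow so passing sections are not disturbed.
  lines.push("Rewrite only what the fixes above require; do not touch sections that passed.");
  return lines.join("\n");
}
```

The returned string goes straight into the next generation prompt, most severe failure first.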

3. User interface

The UI is where the human watches the loop think. Show the candidate page, the score, the diff from the previous attempt, and the reason each score moved. Most importantly, show which test drove the refinement. If you cannot explain why attempt four beat attempt three, you cannot trust the system.

The human should be able to approve, reject, edit the rubric, pin a section, add a source, or override a verdict. The loop should keep an audit trail: every candidate, every score, every failure, every prompt, every diff. This is how you debug the system after it ships a weird page at 2am.
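If the audit trail stores a score per test for every attempt, "why did attempt four beat attempt three" becomes a diff. A small sketch, assuming per-test scores are kept as a plain name-to-score object; `TestScores`, `ScoreDelta`, and `explainScoreChange` are hypothetical names.

```typescript
type TestScores = { [testName: string]: number };

type ScoreDelta = { test: string; before: number; after: number; delta: number };

// Lists the tests whose score moved between two attempts, biggest move first.
// Assumption: a test missing from the earlier attempt counts as 0.
function explainScoreChange(before: TestScores, after: TestScores): ScoreDelta[] {
  const deltas: ScoreDelta[] = [];
  for (const test of Object.keys(after)) {
    const previous = typeof before[test] === "number" ? before[test] : 0;
    const delta = after[test] - previous;
    if (delta !== 0) {
      deltas.push({ test: test, before: previous, after: after[test], delta: delta });
    }
  }
  // Sort by the size of the move, so the test that drove the change is first.
  deltas.sort((a, b) => Math.abs(b.delta) - Math.abs(a.delta));
  return deltas;
}
```

The first entry is the test that drove the refinement, which is the line the UI should show next to the diff.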

Inner loop vs. outer loop

Do not confuse fast proxy metrics with reality. The inner loop can improve page quality before publish. The outer loop still needs real user data after publish.

  • Inner loop: seconds to minutes. Checks quality, coverage, technical SEO, evidence, conversion clarity.
  • Outer loop: days to weeks. Watches impressions, rankings, click-through rate, signups, pipeline, revenue.

The mistake is letting slow reality become the only test. If you wait two weeks for every iteration, you do not have a self-refining loop. You have a marketing calendar. Use slow truth to recalibrate the rubric, not to run every step of the loop.
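One possible way to use slow truth for recalibration is to check, across pages that have already been published, whether each inner-loop test score moves with an outer-loop number such as click-through rate. The sketch below computes a plain Pearson correlation. Treating that correlation as the recalibration signal is an assumption, and with only a handful of pages the number will be noisy.

```typescript
// Pearson correlation between two equally long series.
// Returns 0 when there are fewer than two points or either series is constant.
function pearson(xs: number[], ys: number[]): number {
  const n = Math.min(xs.length, ys.length);
  if (n < 2) return 0;

  let sumX = 0;
  let sumY = 0;
  for (let i = 0; i < n; i++) {
    sumX += xs[i];
    sumY += ys[i];
  }
  const meanX = sumX / n;
  const meanY = sumY / n;

  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }

  const denom = Math.sqrt(varX * varY);
  return denom === 0 ? 0 : cov / denom;
}
```

A test whose scores show little or negative correlation with the outcome you care about is a candidate for a lower weight or a rewrite. That call still belongs to a human looking at the pages, not to the number alone.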

A minimal loop you can actually build

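// generatePage, evaluatePage, saveAttempt, requestHumanApproval, and
// escalateToHuman are assumed to be defined elsewhere.
async function refinePage(query: string, productFacts: unknown, maxIterations: number) {
  let previousPage: { html: string } | null = null; // no candidate before the first pass
  let failuresFromLastRun: EvalResult["failures"] = []; // nothing to fix yet
  const attempts = []; // audit trail handed to the human on escalation

  for (let i = 0; i < maxIterations; i++) {
    const candidate = await generatePage({
      query,
      productFacts,
      previousPage,
      failuresFromLastRun,
    });

    const result = await evaluatePage(candidate.html, query);
    attempts.push({ candidate, result });
    saveAttempt({ candidate, result });

    if (result.pass) {
      // The gate passed; a human still signs off before publish.
      return await requestHumanApproval(candidate, result);
    }

    // Feed the concrete failures into the next generation step.
    previousPage = candidate;
    failuresFromLastRun = result.failures;
  }

  return escalateToHuman({ reason: "No convergence", attempts });
}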

That is the whole machine. The hard part is not the loop. The hard part is making evaluatePage honest enough that passing it means something.

When this is not loop-able

Some problems should not be sold as self-improving loops. "Make our brand feel more premium" is probably human-in-the-loop. "Find a new positioning strategy" is probably human-in-the-loop. "Write a viral essay" is mostly not a loop unless you have a real distribution test and can afford the delay.

But "produce a page that passes technical SEO, covers known intent, includes required facts, avoids unsupported claims, and presents a clear CTA" is loop-able. It has tests. The tests are cheaper than the page. They give feedback immediately. That's enough to start.

The punchline

Self-improving AI is not magic. It is evaluation engineering with a generator attached. The agent is the flashy part, but the evaluator is the leverage. Build the test first, make the feedback actionable, show the loop's work, and only then let it repeat.

If the page gets better every cycle because the system can prove what improved, you have a loop. If it just rewrites itself until someone says "looks good," you have content roulette.

Mann Jadwani

GenAI Gremlin. I build things that shouldn't work, but somehow do. Currently breaking prod at 3am.