A Python schema auditor for SEO - feed it URLs, get back a presence matrix

A recurring question on any technical SEO audit: which of these URLs have which schema types live?

The big SEO platforms will answer it for you - for a price. Most of them gate schema auditing behind their top tier, somewhere between $200/month and $500/month per seat. For occasional audits this is fine. For weekly audits across a stable of clients, it adds up to thousands of dollars a year for a feature whose logic fits in 60 lines of Python.

So I wrote the 60 lines.

The script

Given a CSV of URLs and a list of schema types you care about (Product, AggregateRating, Person, Organization, BreadcrumbList, FAQPage, Article, etc.), the script:

Spins up a headless Firefox via Selenium.
Visits each URL and waits for the body to load.
Pulls every <script type="application/ld+json"> block from the rendered HTML.
JSON-parses each block and reads the @type field.
For each URL × schema-type cell, writes "Yes" or "No" in an output CSV.

That's it. The result is a presence matrix - one row per URL, one column per schema type - that you can sort, filter, and hand to a developer as a punch list.
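The output step can be sketched roughly as follows. This is a minimal illustration rather than the script's actual code: `write_presence_matrix` and the `found_by_url` mapping are hypothetical names, and it assumes the crawl has already produced the set of types seen on each URL.

```python
import csv


def write_presence_matrix(found_by_url, interested_types, out):
    """Write one row per URL and one Yes/No column per schema type.

    found_by_url: dict mapping URL -> set of @type strings seen on that page
    (a hypothetical shape; the crawl step is assumed to have produced it).
    out: any writable text file object.
    """
    writer = csv.writer(out)
    # Header row: the URL column followed by one column per schema type
    writer.writerow(["URL"] + list(interested_types))
    for url, found in found_by_url.items():
        writer.writerow(
            [url] + ["Yes" if t in found else "No" for t in interested_types]
        )
```

To write to disk, open the target with `open("matrix.csv", "w", newline="")` (the filename is a placeholder) and pass the file object as `out`.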

Why headless Firefox and not just `requests.get()`

The most common shortcut on a script like this is to use requests to fetch the page and BeautifulSoup to parse the HTML. It's faster, simpler, and avoids the WebDriver setup tax.

It also misses the schema on any site that injects JSON-LD via JavaScript - which, in 2024, covers a large share of modern sites. WordPress sites with Yoast or Rank Math are usually fine with requests because the JSON-LD lands in the static HTML. But JavaScript-heavy React/Vue/Next.js front ends often render schema client-side, after the initial HTML loads. On those pages a requests fetch can return zero schemas where Selenium returns five.

The cost of running headless Firefox is about 2-3 seconds per URL versus 200ms for a raw HTTP fetch. For a 500-URL audit that's a 20-minute job instead of a 2-minute job - a difference that disappears the moment you're crawling a site where requests would have lied about half the schema.
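To make the difference concrete, here is a hedged sketch of the two halves: a fetch that returns the DOM after scripts have run, and an extractor that works on any HTML string. It is not the script's own code. The function names are hypothetical, and the extractor uses the standard library's `html.parser` in place of BeautifulSoup so the parsing half has no third-party dependencies.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_json_ld = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self.blocks.append("")

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it
        if self._in_json_ld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_ld = False


def extract_json_ld(html):
    """Return the parsed JSON-LD documents found in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    docs = []
    for raw in parser.blocks:
        try:
            docs.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than abort the audit
    return docs


def fetch_rendered_html(url, timeout=10):
    """Load a URL in headless Firefox and return the DOM after scripts ran.

    Needs selenium and a local Firefox; imported lazily so the parsing
    helpers above work without them.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # Waits only for <body>; schema injected late may need a longer
        # or more specific wait condition.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )
        return driver.page_source
    finally:
        driver.quit()
```

Running `extract_json_ld` on the body of a plain `requests.get()` response and again on `fetch_rendered_html(url)` for the same URL is a quick way to see whether a given site needs the browser at all.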

The `@type` quirk

JSON-LD's @type field is often treated as a single string. In practice it's sometimes a string, sometimes an array (when a single object declares multiple types), and sometimes nested inside a @graph array of multiple objects.

The current version of the parser handles only the simplest of these:

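import json

for script in json_ld_scripts:
    # script.string is the raw text inside one <script type="application/ld+json"> tag
    data = json.loads(script.string)
    # A block is either one object or an array of objects; normalize to a list
    check_data = [data] if isinstance(data, dict) else data
    for item in check_data:
        item_type = item.get('@type')
        # Only a plain-string @type can match; arrays and @graph are not handled here
        if isinstance(item_type, str) and item_type in interested_types:
            found_schemas[item_type] = "Yes"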

This catches single-object blocks and array-of-objects blocks, as long as each @type is a plain string. It skips array-valued @type and doesn't recurse into @graph arrays. A future version should walk the @graph because some CMSes (Yoast in particular) declare everything inside one @graph rather than as individual blocks. That's the next iteration.
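One possible shape for that next iteration is a small recursive walker. This is a sketch and not part of the current script: `collect_types` is a hypothetical helper, and it deliberately recurses into every value rather than only @graph, which also picks up types nested inside other objects (an AggregateRating inside a Product, for example).

```python
def collect_types(node, found=None):
    """Return the set of every @type value in a parsed JSON-LD document."""
    if found is None:
        found = set()
    if isinstance(node, list):
        # Array of objects, including the contents of a @graph array
        for item in node:
            collect_types(item, found)
    elif isinstance(node, dict):
        declared = node.get("@type")
        if isinstance(declared, str):
            found.add(declared)
        elif isinstance(declared, list):
            # One object declaring several types at once
            found.update(t for t in declared if isinstance(t, str))
        # Recurse into every value, which covers @graph and nested objects
        for value in node.values():
            collect_types(value, found)
    return found
```

A presence check for one page then reduces to testing each type of interest for membership in the returned set.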

What an audit looks like in practice

For a 200-URL audit on a typical e-commerce site:

8 minutes of crawl time (with headless Firefox, sequential)
Output CSV with one row per URL and one Yes/No column per schema type of interest
Quick pivot in Excel to count which schemas appear on which page types

The actionable output is usually a list like "180 of 200 product pages have Product schema but only 12 have AggregateRating, and no FAQ pages have FAQPage schema at all." That's the punch list. That's what the audit produces.
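If Excel isn't handy, the same tally can be approximated in a few lines of Python. This is an illustrative sketch: `count_present` is a hypothetical helper, and it assumes the output CSV has a header row with a column named URL followed by one column per schema type.

```python
import csv
from collections import Counter


def count_present(matrix_csv, url_column="URL"):
    """Count the 'Yes' cells per schema column in a presence-matrix CSV.

    matrix_csv: an open text file (or any iterable of CSV lines).
    Returns (number_of_urls, Counter of schema type -> pages with 'Yes').
    """
    counts = Counter()
    total = 0
    for row in csv.DictReader(matrix_csv):
        total += 1
        for column, value in row.items():
            if column != url_column and value == "Yes":
                counts[column] += 1
    return total, counts
```

Splitting the tally by page type would need a rule for classifying URLs, such as a path prefix, which this sketch leaves out.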

What I would change

A few next-iteration improvements:

Parallelize the crawl. Sequential at 2-3 seconds per URL gets slow above 500 URLs. Selenium can run 4-6 instances in parallel on a modern laptop without much memory pressure. A future version should split the URL list across worker processes.
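A rough sketch of that split, using the standard library's `concurrent.futures`. Everything here is an assumption about how a future version might look: `audit_chunk` stands in for a function that drives its own browser over a list of URLs and returns a dict of URL to found types.

```python
from concurrent.futures import ProcessPoolExecutor


def split_round_robin(urls, n_chunks):
    """Deal URLs into n_chunks lists so each worker gets a similar load."""
    return [urls[i::n_chunks] for i in range(n_chunks)]


def audit_in_parallel(urls, audit_chunk, workers=4,
                      executor_cls=ProcessPoolExecutor):
    """Run audit_chunk over URL chunks in parallel and merge the results.

    audit_chunk is a hypothetical callable: it takes a list of URLs and
    returns a dict of URL -> found types. With the default process pool
    it must be a top-level function so it can be pickled.
    """
    # Drop empty chunks when there are fewer URLs than workers
    chunks = [c for c in split_round_robin(urls, workers) if c]
    merged = {}
    with executor_cls(max_workers=workers) as pool:
        for partial in pool.map(audit_chunk, chunks):
            merged.update(partial)
    return merged
```

On Windows and macOS, where worker processes are spawned rather than forked, the call to `audit_in_parallel` should sit under an `if __name__ == "__main__":` guard. Passing `ThreadPoolExecutor` as `executor_cls` is a lighter alternative, since most of the waiting happens in the browser rather than in Python.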

Recurse into @graph. Some sites declare every schema type inside a single @graph array. Right now the script misses those because it only looks at top-level @type. A short recursive helper would catch them.

Add property-level checks. "Has FAQPage" is the first question. "Has FAQPage with at least 3 valid Question entries that have non-empty acceptedAnswer.text" is the second. The current script answers the first; the next iteration should answer the second.
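As a sketch of what that second question could look like in code, assuming the usual schema.org layout where a FAQPage lists its Question objects under mainEntity. Both helper names are hypothetical, and "valid" here means only what the paragraph above says: a Question with non-empty acceptedAnswer.text.

```python
def count_valid_faq_questions(faq_page):
    """Count Question entries with a non-empty acceptedAnswer.text."""
    entries = faq_page.get("mainEntity") or []
    if isinstance(entries, dict):  # a single Question not wrapped in a list
        entries = [entries]
    valid = 0
    for entry in entries:
        if not isinstance(entry, dict) or entry.get("@type") != "Question":
            continue
        answer = entry.get("acceptedAnswer")
        if isinstance(answer, list):  # tolerate a list of answers
            answer = answer[0] if answer else None
        text = answer.get("text") if isinstance(answer, dict) else None
        if isinstance(text, str) and text.strip():
            valid += 1
    return valid


def faq_page_passes(faq_page, minimum=3):
    """True when the page has at least `minimum` valid Question entries."""
    return count_valid_faq_questions(faq_page) >= minimum
```

The output column for FAQPage could then carry the count of valid questions instead of a bare Yes/No.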

Export schema content, not just presence. Sometimes the question isn't "is FAQ schema present" but "what are the FAQ questions saying." Adding an optional flag that dumps the parsed JSON content per page would make this a content-auditing tool too, not just a presence checker.
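One way the optional flag might be wired up, sketched with `argparse` and JSON Lines output. The `--dump-json` flag name and both function names are made up for illustration; the current script has no such option.

```python
import argparse
import json


def build_parser():
    """Command-line options for a hypothetical future version."""
    parser = argparse.ArgumentParser(description="Schema presence audit")
    parser.add_argument(
        "--dump-json",
        metavar="PATH",
        default=None,
        help="also write each page's parsed JSON-LD to PATH as JSON Lines",
    )
    return parser


def dump_schema_content(url, documents, out):
    """Append one JSON Lines record holding a page's parsed JSON-LD.

    documents: the list of parsed JSON-LD blocks found on the page.
    out: any writable text file object.
    """
    record = {"url": url, "schemas": documents}
    out.write(json.dumps(record, ensure_ascii=False) + "\n")
```

One record per line keeps the dump easy to grep and to load page by page, which suits questions like "what are the FAQ questions saying" better than one large JSON file.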

But for the question I actually need answered most weeks - "what's missing and where?" - the current 60-line version pays its rent. Free for anyone to copy.

Source: github.com/schandler7171/portfolio-example-scripts/tree/main/SchemaChecker
