Screaming Frog at scale - a Python wrapper for crawls the GUI can't hold

A Python wrapper that batches a sitemap into Screaming Frog CLI runs, combines the exports into one spreadsheet, and produces a client-ready summary. Built because the GUI choked about 40k URLs into a 180k-URL site and the deadline was the next morning.

The setup: a client audit. Big site. The sitemap turned out to have over 180,000 URLs across a dozen subsections, nested sitemap indexes, the whole party. The plan was the standard SEO audit move - open Screaming Frog, paste in the sitemap, hit start, walk away.

Somewhere around 40,000 to 50,000 URLs in - which is roughly where the Screaming Frog GUI tends to start struggling in my experience - it began choking. Memory pressure. Then a crash. Then another with a slightly different configuration. By the third try I gave up trying to crawl the whole thing at once and started thinking about what Screaming Frog is actually good at and what it isn't.

What it's good at: doing the crawl. Hitting URLs, extracting metadata, reading status codes, building the internal link graph. There is genuinely nothing else on the market that does this part better.

What it isn't good at: holding tens of thousands of URLs of state in memory while the user interface tries to stay responsive. That is a UI problem, not a crawling problem.

The CLI doesn't have that problem.

The script in 100 words

Read a sitemap (recursively, because big sites have sitemap indexes pointing at sub-sitemaps). Save every URL to a master list. Split the list into batches of 1,000. Run the Screaming Frog CLI on each batch, exporting internal:all to CSV. Rename and stash each batch's CSV. Combine all the CSVs into one Excel workbook. Generate a summary text file with subdomain counts and notes on what was found. Run time: about as long as the GUI would have taken if it hadn't crashed, plus or minus, but without the babysitting.

The shape

Five phases, numbered 0 through 4, run in order, all from one python crawl.py invocation:

Phase 0 - download robots.txt and llms.txt. Two text files saved to the output folder. The robots.txt is obvious. The llms.txt is a newer, still-informal convention: a Markdown file at the site root that gives AI models a curated overview of the site and links to its key content - some sites publish it, most don't. Downloading both gives the audit deliverable a "what does this site say about itself to crawlers" baseline that the client may not even know exists.
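
A minimal sketch of what Phase 0 could look like, not the script's actual code. The function names, the "found"/"missing" labels, and the injectable fetch parameter are assumptions made for illustration; the fetcher is swappable so the logic can be exercised without touching the network.

```python
from pathlib import Path
from urllib.error import URLError
from urllib.request import urlopen


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Default fetcher: GET the URL and decode the body as UTF-8."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def save_crawler_files(base_url: str, out_dir: Path, fetch=fetch_text) -> dict:
    """Save robots.txt and llms.txt when present; report found/missing per file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    status = {}
    for name in ("robots.txt", "llms.txt"):
        url = f"{base_url.rstrip('/')}/{name}"
        try:
            text = fetch(url)
        except (URLError, OSError) as exc:
            # A 404 arrives as HTTPError, which is a subclass of URLError.
            print(f"{name}: not saved ({exc})")
            status[name] = "missing"
            continue
        (out_dir / name).write_text(text, encoding="utf-8")
        status[name] = "found"
    return status
```

The returned dict is the piece a summary step would need: whether each file exists, without re-reading the output folder.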

Phase 1 - parse the sitemap. The function recursively walks sitemap indexes. If the URL you give it returns a sitemapindex element, it queues each nested sitemap. If it returns a urlset, it harvests every <loc>. This is the difference between getting 8,000 URLs from a site's primary sitemap and getting the 180,000 URLs the site actually has across all its content types. On a big enough site, you may not even know how many URLs you're going to find until the recursion finishes.
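
The sitemap walk described above can be sketched with the standard library alone. This is an illustration under assumptions, not the script's own function: collect_urls, fetch_bytes, and the queue-based traversal are hypothetical names and choices, and gzipped sitemaps are not handled.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def fetch_bytes(url: str) -> bytes:
    """Default fetcher: return the raw response body."""
    with urlopen(url, timeout=30) as resp:
        return resp.read()


def local_name(tag: str) -> str:
    """Strip the XML namespace: '{ns}urlset' -> 'urlset'."""
    return tag.rsplit("}", 1)[-1]


def entry_locs(root: ET.Element) -> list[str]:
    """Return the <loc> of each <url> or <sitemap> entry directly under the root."""
    return [
        sub.text.strip()
        for entry in root
        for sub in entry
        if local_name(sub.tag) == "loc" and sub.text
    ]


def collect_urls(sitemap_url: str, fetch=fetch_bytes) -> list[str]:
    """Walk a sitemap or sitemap index; return page URLs in order, de-duplicated."""
    queue, seen_sitemaps, urls = [sitemap_url], set(), {}
    while queue:
        current = queue.pop(0)
        if current in seen_sitemaps:
            continue  # guard against indexes that reference each other
        seen_sitemaps.add(current)
        try:
            root = ET.fromstring(fetch(current))
        except (OSError, ET.ParseError) as exc:
            print(f"skipping {current}: {exc}")
            continue
        kind = local_name(root.tag)
        if kind == "sitemapindex":
            queue.extend(entry_locs(root))  # nested sitemaps get queued
        elif kind == "urlset":
            urls.update(dict.fromkeys(entry_locs(root)))  # page URLs get harvested
    return list(urls)
```

Using a dict as an ordered set keeps the first-seen order while dropping URLs that appear in more than one sub-sitemap.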

Phase 2 - batch the crawl. URLs get split into chunks of 1,000 and each chunk goes into the Screaming Frog CLI with --headless --crawl-list <file> --export-tabs internal:all. The CLI runs, exports the CSV, the script renames it to batchN_internal_all.csv (batch1_internal_all.csv for the first batch) and moves on. Each batch is independent - if batch 17 fails, batches 1 through 16 are already on disk and only 17 needs a retry.
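
A hedged sketch of the batching step. The --headless, --crawl-list, and --export-tabs flags come from the description above; everything else is an assumption: the launcher name (it differs by operating system), the --output-folder flag, the batchN_urls.txt list file name, and the internal_all.csv default export name. Check each against your own Screaming Frog install before relying on it.

```python
import subprocess
from pathlib import Path

# Assumption: the Linux launcher name; macOS and Windows use different paths.
SF_BINARY = "screamingfrogseospider"
BATCH_SIZE = 1000


def chunk(urls: list[str], size: int = BATCH_SIZE) -> list[list[str]]:
    """Split the master URL list into consecutive batches of at most `size`."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_command(list_file: Path, out_dir: Path) -> list[str]:
    """Assemble one CLI invocation for one batch file."""
    return [
        SF_BINARY, "--headless",
        "--crawl-list", str(list_file),
        "--export-tabs", "internal:all",
        "--output-folder", str(out_dir),  # assumption: not named in the post
    ]


def crawl_batch(number: int, urls: list[str], out_dir: Path) -> Path:
    """Write the batch's URL list, run the crawl, rename the export."""
    out_dir.mkdir(parents=True, exist_ok=True)
    list_file = out_dir / f"batch{number}_urls.txt"  # hypothetical file name
    list_file.write_text("\n".join(urls) + "\n", encoding="utf-8")
    subprocess.run(build_command(list_file, out_dir), check=True)
    exported = out_dir / "internal_all.csv"  # assumption: default export name
    target = out_dir / f"batch{number}_internal_all.csv"
    exported.replace(target)
    return target
```

With check=True a non-zero exit from the CLI raises CalledProcessError, so a failed batch surfaces as an exception the caller can log instead of a silently missing CSV.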

Phase 3 - combine. pandas.concat over the batch CSVs, write the result as full_combined_output.xlsx. One spreadsheet, every URL, every metric. Goes straight to the client deliverable folder.
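
The combine step could be sketched like this, assuming pandas is installed and, for the Excel write, an engine such as openpyxl. The function names are hypothetical; reading every column as text is a design choice of this sketch so status codes and similar fields are not silently converted.

```python
import re
from pathlib import Path

import pandas as pd


def batch_number(path: Path) -> int:
    """Extract N from batchN_internal_all.csv so batch10 sorts after batch2."""
    return int(re.search(r"batch(\d+)_", path.name).group(1))


def combine_batches(folder: Path) -> pd.DataFrame:
    """Concatenate every batch CSV in the folder, in batch order."""
    files = sorted(folder.glob("batch*_internal_all.csv"), key=batch_number)
    frames = [pd.read_csv(f, dtype=str) for f in files]
    if not frames:
        return pd.DataFrame()  # pd.concat raises on an empty list
    return pd.concat(frames, ignore_index=True)


def write_workbook(df: pd.DataFrame, path: Path) -> None:
    """Write the combined sheet; requires an Excel engine such as openpyxl."""
    df.to_excel(path, index=False)
```

Because the workbook is rebuilt purely from the CSVs on disk, this step can be re-run at any time without touching the crawler.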

Phase 4 - summarize. A short text file with subdomain counts, robots.txt status, llms.txt status, and the sitemap source. It is the "TL;DR at the top of the audit" that the client opens before they touch the spreadsheet.
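
A small sketch of how such a summary could be assembled. The layout of the text and the function names are assumptions for illustration; the counting itself is just a Counter over each URL's hostname.

```python
from collections import Counter
from urllib.parse import urlparse


def subdomain_counts(urls: list[str]) -> Counter:
    """Count URLs per hostname (www.example.com, blog.example.com, ...)."""
    return Counter(urlparse(u).hostname or "(no host)" for u in urls)


def build_summary(sitemap_url: str, urls: list[str], file_status: dict) -> str:
    """Render the short text summary: source, totals, file status, host counts."""
    lines = [f"Sitemap source: {sitemap_url}", f"Total URLs: {len(urls)}"]
    for name in ("robots.txt", "llms.txt"):
        lines.append(f"{name}: {file_status.get(name, 'not checked')}")
    lines.append("URLs per host:")
    for host, count in subdomain_counts(urls).most_common():
        lines.append(f"  {host}: {count}")
    return "\n".join(lines) + "\n"
```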

Why batching matters more than I expected

The reason I batched was memory pressure. The reason I'm glad I batched is resumability.

Crawls of big sites are slow. Many minutes to many hours. They get interrupted - your laptop sleeps, the network blips, Screaming Frog decides to chew through your CPU and trip thermal throttling. In a single monolithic crawl, an interruption means starting over.

In a batched crawl, an interruption means restarting the current batch. One failed batch retrying its 1,000 URLs while every other batch sits on disk, already exported, is a far better failure mode than the whole job starting over. At 180,000 URLs that distinction is the difference between "I'll re-run that one chunk" and "I'll see you Monday."

Re-runnability shaped the file naming too. Every batch file is timestamped and prefixed. Every output CSV is named batchN_internal_all.csv. The combined Excel is regenerable from the batch CSVs without re-crawling. If a client comes back two months later and says "can you re-run the analysis with a different filter," I open the batch CSVs and the new analysis takes minutes, not hours.
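
One way to turn that naming scheme into resumability is to treat the output folder as the checkpoint. This is a sketch, not the script's code: pending_batches is a hypothetical helper, and treating a zero-byte CSV as "not done" is an assumption about what a half-written export looks like.

```python
from pathlib import Path


def batch_csv(out_dir: Path, number: int) -> Path:
    """Path of the exported CSV for a given batch number."""
    return out_dir / f"batch{number}_internal_all.csv"


def is_done(out_dir: Path, number: int) -> bool:
    """A batch counts as done when its CSV exists and is not empty."""
    path = batch_csv(out_dir, number)
    return path.is_file() and path.stat().st_size > 0


def pending_batches(out_dir: Path, total: int) -> list[int]:
    """Return the batch numbers (1..total) that still need a crawl."""
    return [n for n in range(1, total + 1) if not is_done(out_dir, n)]
```

After an interruption, re-running the script would crawl only what pending_batches returns and leave the finished CSVs alone.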

The bit about llms.txt that nobody asked for

llms.txt is interesting because the convention is still gelling. The idea: a site publishes a small Markdown file at /llms.txt that gives AI models a curated summary of the site and links to the content it most wants them to read. It sits next to robots.txt in spirit, but it is a guide for LLMs rather than a set of allow/disallow rules for search engine crawlers.

Most sites don't have one yet. The ones that do tend to be either AI-forward companies signalling that they're thinking about this, or content publishers who have explicit positions about AI training. Either way, whether a site publishes an llms.txt and what it says is interesting context for an audit.

Downloading it costs one HTTP request and adds two lines to the summary file. It's the kind of low-effort, high-signal data point that's worth catching even when it's not what the client asked for.

The "do not crash silently" principle

Every external call in this script is wrapped in try/except. Sitemap fetches that fail print the failure and continue with whatever URLs were already collected. Robots.txt or llms.txt failures print and move on without aborting the crawl. A batch that fails the SF CLI is logged and the rest of the batches run.

The principle: an audit script should produce as much output as possible on a partial failure, not zero output on any failure. A client gets a summary text file plus an Excel with 179 of 180 batches even if batch 47 mysteriously failed - and the summary tells them which one didn't make it. That's a recoverable situation. An empty folder and a stack trace is not.
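
The principle can be sketched as a loop that records failures instead of raising them. The names here are hypothetical: run_all takes any crawl_one(number, urls) callable that raises on failure, and returns what finished alongside what did not, so the summary can name the missing batches.

```python
def run_all(batches: dict, crawl_one) -> tuple[list, dict]:
    """Run every batch; collect failures instead of stopping at the first one.

    batches maps batch number -> list of URLs.
    crawl_one(number, urls) is expected to raise on failure.
    """
    done, failed = [], {}
    for number, urls in batches.items():
        try:
            crawl_one(number, urls)
        except Exception as exc:
            # Deliberately broad: one bad batch must not stop the rest.
            print(f"batch {number} failed: {exc}")
            failed[number] = str(exc)
        else:
            done.append(number)
    return done, failed
```

The failed dict is what makes the partial result honest: the deliverable still ships, and it says exactly which batch is absent and why.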

What I would change

A few things on the next iteration:

Parallelize the batches. Right now they run sequentially because the script is dumb. Screaming Frog can handle a few concurrent invocations on a multi-core machine. Running two batches at a time instead of one could cut total run time roughly in half, provided memory and CPU hold up.
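
A sketch of what that could look like with a thread pool. Threads are enough here because each worker would spend its time waiting on an external process. The function name and the default of two workers are assumptions, and each concurrent crawl would likely need its own output folder so the exports do not overwrite each other.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_parallel(batches: dict, crawl_one, max_workers: int = 2) -> tuple[list, dict]:
    """Run crawl_one(number, urls) for every batch, a few at a time.

    Returns (sorted list of finished batch numbers, {failed number: error text}).
    """
    done, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(crawl_one, number, urls): number
            for number, urls in batches.items()
        }
        for future in as_completed(futures):
            number = futures[future]
            try:
                future.result()  # re-raises whatever the worker raised
            except Exception as exc:
                failed[number] = str(exc)
            else:
                done.append(number)
    return sorted(done), failed
```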

Move config out of the source file. The domain, sitemap URL, batch size, and output folder are hardcoded variables at the top of the script. For my own use that's fine. For sharing, a config.yaml or argparse-driven CLI would make this drop-in usable without editing the source.
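
An argparse version might look like the sketch below. The option names and defaults are assumptions chosen to mirror the four hardcoded settings listed above.

```python
import argparse
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    """Parse crawl settings from the command line instead of editing the source."""
    parser = argparse.ArgumentParser(
        description="Batch a sitemap through the Screaming Frog CLI."
    )
    parser.add_argument("sitemap_url", help="URL of the sitemap or sitemap index")
    parser.add_argument("--domain", default=None,
                        help="site root; if omitted, derive it from the sitemap URL")
    parser.add_argument("--batch-size", type=int, default=1000,
                        help="URLs per Screaming Frog run")
    parser.add_argument("--output-folder", type=Path, default=Path("output"),
                        help="where batch CSVs and the summary are written")
    return parser.parse_args(argv)
```

Invoked as python crawl.py https://example.com/sitemap.xml --batch-size 500, this would yield args.batch_size == 500 with the other settings left at their defaults.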

Diff against a previous crawl. I have the full Excel of every URL on every audit I've ever done. Comparing this month's crawl to last month's would surface pages that disappeared, status codes that changed, and metadata that drifted. The hooks are there - the diff layer isn't built yet.
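
A first cut of that diff layer could be as small as this sketch. It assumes the exports carry Address and Status Code columns and that each crawl has been saved as CSV; the function names are hypothetical, and metadata drift would need more columns compared in the same way.

```python
import csv


def load_status(csv_path) -> dict:
    """Map each crawled URL to its status code from one export."""
    # utf-8-sig tolerates a byte-order mark at the start of the file.
    with open(csv_path, newline="", encoding="utf-8-sig") as fh:
        return {row["Address"]: row["Status Code"] for row in csv.DictReader(fh)}


def diff_crawls(old: dict, new: dict) -> dict:
    """Compare two {url: status} maps from consecutive crawls."""
    shared = sorted(old.keys() & new.keys())
    return {
        "removed": sorted(old.keys() - new.keys()),
        "added": sorted(new.keys() - old.keys()),
        "status_changed": {
            url: (old[url], new[url]) for url in shared if old[url] != new[url]
        },
    }
```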

Push the summary to email or Slack. Right now I run it locally and open the folder. For recurring audits, it should land in the same channel as the rest of the marketing reporting - one less place to remember to look.
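
For the Slack half, an incoming webhook accepts a JSON body with a text field, so the push could be sketched with the standard library. The function names are hypothetical and the webhook URL would come from your own Slack workspace; nothing here is part of the current script.

```python
import json
from urllib.request import Request, urlopen


def build_slack_request(webhook_url: str, summary_text: str) -> Request:
    """Build the POST for a Slack incoming webhook carrying the summary text."""
    payload = json.dumps({"text": summary_text}).encode("utf-8")
    return Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_summary(webhook_url: str, summary_text: str) -> bool:
    """Send the summary; report failure instead of crashing the run."""
    try:
        with urlopen(build_slack_request(webhook_url, summary_text), timeout=10) as resp:
            return resp.status == 200
    except OSError as exc:  # covers URLError and HTTPError
        print(f"summary not posted: {exc}")
        return False
```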

But for what it is - a one-file Python script that turns a sitemap URL into a complete crawl deliverable without requiring me to babysit the Screaming Frog GUI for an afternoon - it's earned its keep across multiple audits. Free to fork, free to ship.

Source: github.com/schandler7171/octo-python