Headless browsers for SEO audits - because BeautifulSoup wasn't enough

Why I stopped trying to scrape client sites with requests + BS4 and started driving a real browser instead. A workflow that hit roughly 99% extraction success across hundreds of audits.

A few years back I had a stack of client site audits piling up and a Python script that was supposed to be doing most of the heavy lifting. It was doing about 60% of the work. The other 40% was me hand-copying content out of pages the script came back empty on, while quietly cursing whoever decided "modern web" should mean "your scraper sees an empty <div id='root'> and nothing else."

The fix was the same fix everyone eventually arrives at: stop pretending to be a browser, be a browser. Selenium driving headless Firefox, then BeautifulSoup for the parsing. That shift took my client extraction rate from "sometimes" to about 99%, and it stopped getting me blocked at the door by aggressive bot detection.

This post is about why, and the workflow I settled on.

What BeautifulSoup actually does

BS4 is a parser. It takes a string of HTML and gives you a tree to walk. It is excellent at that job. It does not fetch HTML, it does not execute JavaScript, it does not maintain cookies, it does not negotiate TLS, it does not present a user-agent. People say "BS4 doesn't work on this site" when what they mean is "the HTML I fetched with requests.get(url) doesn't have the content I want, and there's nothing BS4 can do about that."

There are two reasons the HTML you fetch can come up empty:

The content isn't in the HTML. Modern marketing sites are JS-heavy. React, Vue, Next.js, headless WordPress setups, infinite scroll, lazy-loaded sections. The server sends a shell. The browser hydrates the shell into a real page by executing JavaScript. requests and urllib don't execute JavaScript. They give you the shell.

The server doesn't want you. Cloudflare's bot walls. Akamai. PerimeterX. Even basic WAF rules at the origin. They look at your request and ask: does this look like a real browser? python-requests/2.31.0 says no. Even with a spoofed user-agent string, the request headers don't match what real Chrome sends. The TLS handshake fingerprint (JA3) doesn't match. There are no Sec-Fetch-* headers. No cookies from a prior session. No JavaScript challenge response. You can paper over some of this with libraries that mimic browser fingerprints, but it's an arms race and you'll lose it eventually.
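To make the first reason concrete, here is a small sketch of what BS4 has to work with when the server only sends a shell. The HTML string is a made-up example (not from a real client site), and visible_text is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical shell HTML, similar to what a JS-rendered site
# returns to a plain HTTP client that never runs app.js.
SHELL_HTML = """
<!doctype html>
<html>
  <head><title>Acme</title></head>
  <body>
    <div id="root"></div>
    <script src="/static/app.js"></script>
  </body>
</html>
"""


def visible_text(html: str) -> str:
    """Return the text inside <body>, ignoring <script> and <style>."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    body = soup.body
    return body.get_text(" ", strip=True) if body else ""


root = BeautifulSoup(SHELL_HTML, "html.parser").find(id="root")
print(repr(root.get_text(strip=True)))  # text inside the mount point
print(repr(visible_text(SHELL_HTML)))   # text inside the whole <body>
```

Both prints come back as empty strings. BS4 parsed the document correctly; there was simply nothing in it to find.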

The fix for both problems is the same.

The shift: drive an actual browser

Selenium, Playwright, Puppeteer - pick your driver. I went with Selenium and headless Firefox because at the time it was slightly more robust against Chromium-specific fingerprint checks (some bot walls specifically target headless Chrome). The principle is identical no matter which you choose: you're not fetching HTML, you're opening the page.

The browser executes JS. The DOM hydrates. The TLS handshake looks real because it is real. The user-agent says "Firefox 120" because that's what's running. Cookies persist across requests in the same session. The bot wall sees a browser and lets it through.

Then - and this is the part people forget - you hand the rendered DOM to BeautifulSoup for parsing. BS4 isn't being replaced. It's getting better input.

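from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

url = "https://example.com/"  # placeholder: the page to audit

options = Options()
options.add_argument("-headless")  # run Firefox without a visible window
driver = webdriver.Firefox(options=options)

try:
    driver.get(url)
    # Block until <body> exists (up to 10 seconds) instead of guessing with sleep()
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, 'body'))
    )
    # Parse the rendered DOM, not the raw HTTP response
    soup = BeautifulSoup(driver.page_source, 'html.parser')
finally:
    driver.quit()  # always release the browser process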

That's the whole transformation. A few lines of requests become a few lines of Selenium-plus-BS4, and the success rate goes from "sometimes" to "almost always."

The audit workflow it enabled

For a typical client site audit, the input is a CSV of URLs - usually exported from Search Console, the client's sitemap, or a Screaming Frog crawl. The output the client cares about is content they can read, comment on, and act on.

The pipeline:

  1. Headless render each URL. Wait for <body> to be present (or a more specific selector if the site lazy-loads its content into a known container).
  2. Strip the chrome. Remove <nav>, <header>, <footer>. Nav text isn't audit content and it pollutes downstream analysis.
  3. Extract hierarchical content. Walk h1 through h6 plus p tags in document order. The H1/H2/H3 structure is half the story of an SEO audit by itself - if a page has zero H2s, that's a finding before I read a word.
  4. Score readability per page. Flesch Reading Ease via the textstat library. One number per URL that any client can interpret at a glance.
  5. Output to Word. One .docx per page, plus an aggregate Flesch report.
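Steps 2 and 3 of the pipeline can be sketched roughly like this. It assumes the rendered HTML is already in a string; extract_outline, count_tag, and the sample markup are illustrative names and data, not lifted from the original script.

```python
from bs4 import BeautifulSoup

CHROME_TAGS = ["nav", "header", "footer"]
CONTENT_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6", "p"]

# Made-up page used only to exercise the functions below.
SAMPLE_HTML = """
<html><body>
  <header><nav><a href="/">Home</a></nav></header>
  <h1>Pricing</h1>
  <p>Plans for every team.</p>
  <h2>Starter</h2>
  <p>Five seats included.</p>
  <footer><p>Copyright Acme</p></footer>
</body></html>
"""


def extract_outline(html: str) -> list[tuple[str, str]]:
    """Return (tag_name, text) pairs for headings and paragraphs, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove site chrome one element at a time so nested cases
    # (a <nav> inside a <header>) are never touched twice.
    while (chrome := soup.find(CHROME_TAGS)) is not None:
        chrome.decompose()
    outline = []
    for element in soup.find_all(CONTENT_TAGS):
        text = element.get_text(" ", strip=True)
        if text:
            outline.append((element.name, text))
    return outline


def count_tag(outline: list[tuple[str, str]], name: str) -> int:
    """Count how many outline entries use the given tag, e.g. 'h2'."""
    return sum(1 for tag_name, _ in outline if tag_name == name)


outline = extract_outline(SAMPLE_HTML)
for tag_name, text in outline:
    print(f"{tag_name}: {text}")
if count_tag(outline, "h2") == 0:
    print("Finding: page has no H2 headings")
```

Keeping the tag name next to the text is what makes the "zero H2s" check a one-liner later on.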

Why Word and not Markdown or JSON? Because clients live in Word and Google Docs. They want to track-change-comment-suggest on the content I send them. They don't want to clone a repo. The audit deliverable has to land where the client already works.

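from pathlib import Path
from docx import Document

# `content` and `slug` come from the extraction step for one URL
Path("content").mkdir(exist_ok=True)  # save() fails if the folder is missing

doc = Document()
doc.add_heading(f"URL: {content['URL']}", level=1)
doc.add_paragraph(content['text'])
doc.save(f"content/{slug}.docx")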
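The slug used in that filename has to come from somewhere. One possible way to derive it from the URL, using only the standard library, is sketched below. slug_for is a hypothetical helper, and it ignores query strings, so two URLs that differ only in their parameters would collide.

```python
import re
from urllib.parse import urlparse


def slug_for(url: str) -> str:
    """Turn a URL into a filesystem-safe name built from its host and path."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".lower()
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", raw).strip("-")
    return slug or "index"


print(slug_for("https://example.com/blog/My-Post/"))
```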

A 200-page client site becomes 200 Word docs and one readability summary in the time it takes to drink a coffee.

The hidden benefit nobody mentions

Google's rendering crawler is a headless Chromium. The view of your client's site that ends up in Google's index is, more or less, the view a headless browser sees. So if your audit is meant to reflect what Google indexes, you actively want the rendered DOM. requests + BS4 would have given you a 2008-crawler view of the page. Headless-rendered HTML gives you a 2026-Googlebot view.

This is not a small thing. Half the audits I've run have flagged content that "exists" in the human view of the page but isn't actually in the indexable DOM - usually because it's behind a tab, an accordion, a "Load More" button, or rendered after a deferred script. A requests scrape would have shown some of that as missing. A headless render shows it the way a crawler would.

Lessons that took me embarrassingly long to learn

Explicit waits beat sleeps. WebDriverWait(driver, 10).until(EC.presence_of_element_located(...)) is correct. time.sleep(5) is "I have no idea when the page is ready, so I'll guess." Use the former everywhere.

Strip nav/header/footer before scoring readability. Repeated chrome text inflates word counts and drags down the Flesch score. The audit number you give the client should reflect the content of the page, not the navigation.
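For intuition about what the Flesch number measures, here is a rough standard-library approximation of the formula. It is not how textstat computes it: textstat's flesch_reading_ease uses more careful syllable counting, so its scores will differ from this sketch. The vowel-run syllable counter is a deliberately crude assumption.

```python
import re


def count_syllables(word: str) -> int:
    """Crude estimate: one syllable per run of vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """Approximate Flesch Reading Ease; higher means easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


easy = "The cat sat on the mat. The dog ran to the park."
hard = "Organizational interoperability necessitates comprehensive infrastructural modernization."
print(round(flesch_reading_ease(easy), 1))
print(round(flesch_reading_ease(hard), 1))
```

The formula only sees sentence length and syllables per word, which is why menu labels and footer links with no sentence punctuation can skew it.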

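Webdriver-manager is the right answer. Manually managing geckodriver or chromedriver versions is a Saturday afternoon you'll never get back. webdriver_manager.firefox.GeckoDriverManager().install() handles versions, paths, and updates. On Selenium 4.6 and later, the bundled Selenium Manager resolves the driver automatically, so the extra package is optional there.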

Respect robots.txt and rate-limit. Audits are for sites you have permission to audit. Even then, hammering a client's production origin is a bad way to introduce yourself.
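A minimal sketch of both habits using only the standard library. allowed_urls and paced are illustrative names, the "audit-bot" user agent and the two-second delay are arbitrary assumptions, and the robots.txt lines are passed in directly so the example runs offline.

```python
import time
from urllib.robotparser import RobotFileParser


def allowed_urls(urls, robots_lines, user_agent="audit-bot"):
    """Keep only the URLs that the given robots.txt rules allow for user_agent."""
    parser = RobotFileParser()
    # For a live site, call parser.set_url(".../robots.txt") and parser.read() instead.
    parser.parse(robots_lines)
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def paced(items, delay_seconds=2.0, sleep=time.sleep):
    """Yield items one at a time, pausing between them (not before the first)."""
    for index, item in enumerate(items):
        if index:
            sleep(delay_seconds)
        yield item


robots = ["User-agent: *", "Disallow: /admin/"]
urls = ["https://example.com/pricing", "https://example.com/admin/login"]
for url in paced(allowed_urls(urls, robots), delay_seconds=0.1):
    print("would render:", url)
```

The sleep function is a parameter so the pacing logic can be tested without actually waiting.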

Headless Firefox vs. headless Chrome is a real choice. They render slightly differently in edge cases. If one keeps tripping a bot wall on a particular site, swap to the other. Cheap experiment.

What I'd do differently if I started today

Playwright over Selenium for any new work. Modern async API, better debugging tools, screenshot/PDF/trace built in, and a healthier ecosystem in 2026 than Selenium's. The mental model is identical - drive a real browser, then parse the DOM - so the lessons transfer cleanly. For very aggressive bot walls, libraries like playwright-stealth (for Playwright) and undetected-chromedriver (for Selenium with Chrome) exist specifically to make the browser fingerprint less detectable as automation.

I'd also output to a small SQLite database before generating Word docs, not after. Once you have the content in a table, you can re-run an audit, diff against last quarter, and surface "what changed since last time" - which is the audit clients actually pay for the second time around.
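A sketch of what that table and the "what changed" query might look like. The schema, the run labels, and the function names are all assumptions for illustration; a real audit would probably store headings and scores as well.

```python
import hashlib
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the audit database with one row per run and URL."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               run TEXT NOT NULL,
               url TEXT NOT NULL,
               content_hash TEXT NOT NULL,
               text TEXT NOT NULL,
               PRIMARY KEY (run, url)
           )"""
    )
    return conn


def save_page(conn, run, url, text):
    """Store extracted text plus a hash that makes change detection cheap."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    conn.execute(
        "INSERT OR REPLACE INTO pages (run, url, content_hash, text) VALUES (?, ?, ?, ?)",
        (run, url, digest, text),
    )


def changed_urls(conn, old_run, new_run):
    """URLs that are new in new_run or whose content differs from old_run."""
    rows = conn.execute(
        """SELECT curr.url
           FROM pages AS curr
           LEFT JOIN pages AS prev
             ON prev.url = curr.url AND prev.run = ?
           WHERE curr.run = ?
             AND (prev.url IS NULL OR prev.content_hash != curr.content_hash)
           ORDER BY curr.url""",
        (old_run, new_run),
    ).fetchall()
    return [url for (url,) in rows]


conn = open_store()
save_page(conn, "2025-Q4", "https://example.com/a", "Old copy")
save_page(conn, "2025-Q4", "https://example.com/b", "Same copy")
save_page(conn, "2026-Q1", "https://example.com/a", "New copy")
save_page(conn, "2026-Q1", "https://example.com/b", "Same copy")
save_page(conn, "2026-Q1", "https://example.com/c", "Brand new page")
print(changed_urls(conn, "2025-Q4", "2026-Q1"))
```

The Word docs can then be generated from whichever run the client wants to see, or only for the URLs that changed.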

But the core lesson is the same as it was: if your scraper is fighting the website, the website is going to win eventually. Stop fighting. Be the browser. The mental shift from "scrape" to "render then read" changes everything.