All posts

Killing a spreadsheet that was hiding $100K - the ad spend pipeline rewrite

A Google Sheet doing real ETL work was misreporting client ad spend by about $100K per quarter. Here's how I rebuilt it in Python and what the spreadsheet was actually doing wrong.

A few months into auditing this client's marketing data, I found something uncomfortable: the spreadsheet they trusted was lying.

Not catastrophically. Not maliciously. Just steadily, quietly, in the way that VLOOKUP-based ETL pipelines tend to lie - a few rows misjoined here, a missing date dimension there, a Supermetrics sync that pulled a slightly different version of "ad spend" than the previous month. The compounding error across multiple platforms and campaigns came out to about $100K in misreported quarterly ad spend.

The spreadsheet was not the right tool for the job. It had become the right tool because nobody noticed when the job got too big.

What the spreadsheet was doing

The client had a Google Workspace setup with a Looker Studio dashboard for executive review. Behind that dashboard:

  • A Google Sheet pulling Google Ads data via the Google Sheets add-on.
  • Another tab pulling Facebook Ads data via Supermetrics.
  • A "combined" tab with VLOOKUPs joining the two by date and campaign.
  • Manual columns mapping campaigns to budget categories.
  • Hand-pasted corrections when somebody noticed a number looked off.

Each piece worked in isolation. Together, they were a fragile data pipeline pretending to be a spreadsheet. Every time the column order changed in either source, a VLOOKUP silently returned #N/A for some rows. Every time a campaign was renamed, the budget mapping broke until somebody updated it manually. Every time the spreadsheet hit Google's calculation limit, formulas started returning stale values from the last successful refresh.

The dashboard at the top of the chain looked clean. Nobody was looking at the spreadsheet that fed it.

The rebuild

The rule I settled on: Google Sheets is great for client review. It is terrible for transformation. Separate those.

The new pipeline:

Google Ads API   ─►  Python ETL  ─►  MySQL  ─►  Python loader  ─►  Google Sheets  ─►  Looker Studio
Facebook (Sheets) ────────────────────►

Five components, three languages, one source of truth in the middle.

Step 1 - data collection. Python script hits the Google Ads API directly. Facebook Ads still flowed through the existing Supermetrics-to-Sheets connection (no point rebuilding what works), but Python read it from there. Both feeds normalized to the same schema before continuing.

Step 2 - transformation in MySQL. This is where I stopped using VLOOKUPs forever. Data lands in MySQL, and SQL handles the joins, the budget category mapping, the date dimension enrichment, the deduplication. SQL is built for this; spreadsheets are not. A single JOIN that runs in 80ms replaced about thirty interlocking VLOOKUPs that took several seconds to recalculate and broke whenever somebody sorted a column.

Step 3 - upload back to Sheets. The combined, cleaned data lands back in Google Sheets via gspread. The dashboard sees the same Sheet structure it has always seen. The Looker Studio source did not have to change. The client did not have to learn anything new. They still get to open a Sheet and look at numbers.

Step 4 - notifications. A short status email goes out after every run: "Google Ads data current to Y, Facebook current to Z, X rows updated." If anything fails, the email says what failed and exits non-zero. Easy to glance at, easy to integrate with cron alerting later.

The whole thing runs on cron, three times a day. Most updates take under thirty seconds.

Why the database layer mattered more than I expected

I almost skipped the MySQL step. The plan early on was "Python pulls API data, transforms in Pandas, writes back to Sheets." Simpler stack, no database to maintain.

I am very glad I did not.

The database changed three things:

Historical data is queryable. When the client asked "what was our cost per conversion in March compared to last March," I had a SQL answer in 30 seconds. The old system would have required digging into a snapshot of the spreadsheet, if anyone had thought to snapshot it.

Schema is enforced. A decimal(10,2) column will not accept a #N/A string masquerading as a number. The class of error where Sheets quietly type-coerces nonsense into numeric columns is impossible.

The transformation is auditable. The SQL that does the budget category mapping lives in a file I can read. The VLOOKUPs that used to do that lived inside cells I had to click on one at a time to inspect.

The maintenance cost of MySQL is real - somebody has to back it up, somebody has to upgrade it - but for a pipeline that handles money, the auditability alone justifies it.

What this looked like in client conversations

I led with the math. "The system you currently trust has misreported your ad spend by about $100K this quarter. Here is the spreadsheet cell showing one of the misjoined rows. Here is what the dashboard says, and here is what the platforms actually report. Here is the gap." The client did not push back. Once you can show somebody a cell-level error in their own data, the conversation is no longer about whether the rewrite is justified.

The political move was the easy part. The technical move was the same as every "fix a fragile spreadsheet that runs the business" project: get the data into a database, do the transformation in SQL, write the output back to whatever interface the humans already use. Do not try to migrate the humans. Migrate the data.

What I would change

If I were starting this rewrite today, I would skip MySQL on a remote SSH server and put the whole thing in BigQuery or a managed Postgres. The SSH and SFTP dance to push SQL files to a remote MySQL instance worked, but it added moving pieces that managed cloud databases would eliminate. The cost difference at this data volume is negligible.

I would also build a simple "reconciliation report" as a fourth stage: a daily comparison of what each platform's API says vs what is in the database vs what is in the Sheet. Catching drift between layers is the whole reason the rewrite happened. The rewrite should be the system that catches its own drift.

Source: github.com/schandler7171/ad-spend-data-pipeline