A YAML-driven Python orchestrator - sequencing ten scripts without writing an eleventh

A common evolution in any data project: you start with one Python script. Then two. Then a few related scripts that need to run in sequence. Then a wrapper bash file that calls them all. Then the bash file gets too big, so you migrate to a Python wrapper. Then the wrapper hard-codes the order and the wait times, and changing anything means editing Python instead of editing config.

The fix is to externalize the sequence into a YAML file and write a single runner that reads it.

The shape

The runner is small. About 80 lines. The YAML is large because it has to list every script and its options.

A typical YAML config:

scripts:
  - name: "1-merge-CSV.py"
    enabled: true
  - name: "2.3.0-SendEmail-Status.py"
    enabled: true
  - name: "5.0.0-GSC-AggregateTotals.py"
    enabled: false
  - name: "7.0.2-FormerB-contenttoWord-JS_heavy.py"
    enabled: true
    custom_wait: 120
  - name: "11.5.4-SSH-ExecuteDB-Sync.py"
    enabled: true
    custom_wait: 60

The runner reads the YAML, iterates over the script list, and for each enabled entry:

Builds the absolute path to the script in the same directory as the runner.
Launches it via subprocess.run using the current Python interpreter.
Captures stdout and stderr into a timestamped log file.
Waits for the per-script custom_wait if one is specified, otherwise for the default wait_time_regular.
Moves on to the next.
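The steps above can be sketched roughly as follows. This is a minimal sketch, not the actual runner: the function names, the fallback of 5 seconds, and the idea that wait_time_regular is a top-level key in the same YAML file are all assumptions. Parsing uses the third-party PyYAML package, imported lazily so the loop itself runs on the standard library alone.

```python
import subprocess
import sys
import time
from pathlib import Path


def load_config(path):
    """Parse the YAML file. Requires the third-party PyYAML package."""
    import yaml  # imported lazily so the rest of the sketch runs without it

    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)


def run_pipeline(config, script_dir, sleep=time.sleep):
    """Run every enabled script in order; return a list of (name, exit_code)."""
    default_wait = config.get("wait_time_regular", 5)  # assumed key and fallback
    results = []
    for entry in config.get("scripts", []):
        if not entry.get("enabled", False):
            continue  # disabled entries are skipped entirely
        script_path = Path(script_dir).resolve() / entry["name"]
        # sys.executable is the interpreter that is running the runner itself
        proc = subprocess.run(
            [sys.executable, str(script_path)],
            capture_output=True,
            text=True,
        )
        # Relay the captured output so it lands wherever the runner logs
        sys.stdout.write(proc.stdout)
        sys.stderr.write(proc.stderr)
        results.append((entry["name"], proc.returncode))
        sleep(entry.get("custom_wait", default_wait))
    return results
```

A call might look like run_pipeline(load_config("config.yaml"), Path(__file__).parent), where config.yaml is a hypothetical file name. The sleep parameter is injectable only so the loop can be exercised without actually waiting.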

Disabling a script is one boolean flip. Reordering is moving lines in the YAML. Adding a new script is one entry in the YAML. None of those touch the runner code.

Why per-script wait times matter

Most scripts in a pipeline are quick - they do their thing in a few seconds and return. But some scripts trigger expensive downstream work that the next script depends on. Examples:

A script that POSTs to a CMS to push 200 articles. The CMS needs 90 seconds to process the queue before the next step can read the new state.
A script that triggers an SSH command to import a SQL file remotely. The remote import takes 45 seconds to complete; you can't query the destination DB before that finishes.
A script that fires a webhook to a third-party automation platform (Zapier, Make, etc.). The third-party platform takes 30-60 seconds to do its thing.

In the early version of this pipeline, those waits were hard-coded time.sleep(120) calls inside each script. Wrong place. The script doesn't know how long the next step needs; only the orchestrator knows the relationship between adjacent steps.

Moving the waits into the YAML, keyed by script name, made the dependency explicit. The YAML says: "after 7.0.2-FormerB-contenttoWord-JS_heavy.py runs, wait 120 seconds before starting the next script." That's a piece of pipeline knowledge, not script-internal knowledge.
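One side benefit of keeping the waits in config is that the whole wait plan can be printed without running anything. A small sketch, assuming the config has already been parsed into a dict and that wait_time_regular is a top-level key (the helper name wait_plan and the 5-second default are hypothetical):

```python
def wait_plan(config):
    """Return (script name, seconds to wait afterwards) for each enabled script."""
    default_wait = config.get("wait_time_regular", 5)  # assumed key and default
    return [
        (entry["name"], entry.get("custom_wait", default_wait))
        for entry in config.get("scripts", [])
        if entry.get("enabled", False)
    ]


config = {
    "wait_time_regular": 5,
    "scripts": [
        {"name": "1-merge-CSV.py", "enabled": True},
        {"name": "5.0.0-GSC-AggregateTotals.py", "enabled": False},
        {
            "name": "7.0.2-FormerB-contenttoWord-JS_heavy.py",
            "enabled": True,
            "custom_wait": 120,
        },
    ],
}

# Dry run: show the pause that follows each enabled script
for name, seconds in wait_plan(config):
    print(f"after {name}: wait {seconds}s")
```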

Logging that survives unattended runs

The pipeline runs unattended overnight on some clients. When something fails at 3:47 AM and the error message scrolls off the terminal, you need that error in a file somewhere.

The runner overrides sys.stdout and sys.stderr with a LoggerWriter that tees every write to both the terminal and a timestamped log file. The log file lives in a logs/ subdirectory and is named after the runner's parent directory plus _master_runs.txt.

The pattern looks like:

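import sys

class LoggerWriter:
    """Tee every write to the original stream and append it to a log file."""

    def __init__(self, stream, log_file_path):
        self.stream = stream
        self.log_file_path = log_file_path

    def write(self, message):
        self.stream.write(message)
        # Reopen in append mode on each write so the text reaches disk
        # even if the run is killed partway through.
        with open(self.log_file_path, 'a', encoding='utf-8') as f:
            f.write(message)
        return len(message)

    def flush(self):
        self.stream.flush()

# log_filename is the path under logs/ described above, built earlier in the runner.
sys.stdout = LoggerWriter(sys.stdout, log_filename)
sys.stderr = LoggerWriter(sys.stderr, log_filename)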

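Now everything the runner prints gets tee'd to the log. That includes the output of every script it launches, because the runner captures each child's stdout and stderr and writes them back through these streams. (Replacing sys.stdout in the parent process does not, by itself, redirect a subprocess's output.) The unattended run produces a transcript I can read the next morning to see exactly what happened, what failed, and where the time went.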
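The log path and the timestamps can be sketched like this. It is an illustration under assumptions: I read "the runner's parent directory" as the directory that contains the runner file, and the helper names and the timestamp format are hypothetical.

```python
from datetime import datetime
from pathlib import Path


def master_log_path(runner_file):
    """Return logs/<runner directory name>_master_runs.txt next to the runner."""
    runner_dir = Path(runner_file).resolve().parent
    log_dir = runner_dir / "logs"
    log_dir.mkdir(exist_ok=True)  # create logs/ on first use
    return log_dir / f"{runner_dir.name}_master_runs.txt"


def run_header(script_name, now=None):
    """Format a timestamped marker line to print before each script starts."""
    now = now or datetime.now()
    return f"[{now:%Y-%m-%d %H:%M:%S}] starting {script_name}"
```

Inside the runner, master_log_path(__file__) would supply the path handed to LoggerWriter, and printing run_header(name) before each launch makes it possible to see where the time went.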

The "skip on failure, log the error" principle

When script #4 of 10 fails, the right behavior depends on the kind of failure.

Transient failure (network timeout, rate limit, temporary 503): the right move is often to retry, or to skip the step for this run and try again on the next one.
Hard failure (script crashed, data corruption, dependency missing): you want to stop the pipeline and surface the error.

The runner errs on the side of "keep going and log loudly." Each script invocation is wrapped in try/except. One caveat: subprocess.run raises on a non-zero exit only when called with check=True; without it, a crashed child has to be detected from its return code. When a script fails, the runner prints a clear error block to the log, increments a failure counter, and moves on to the next script. The summary at the end of the run lists which scripts succeeded, which failed, and how long the whole pipeline took.

This is the opposite of the "stop on first error" pattern that most pipelines default to. The reasoning: in a 10-step pipeline where step 4 fails, steps 5-10 might still be useful (they might depend on a different upstream that's fine). Better to run them and let the operator decide what's salvageable than to stop everything.

For pipelines where one step's failure invalidates everything downstream (financial reconciliation, for example), stop-on-first-error is the right pattern. Those pipelines should not use this runner without modification.
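Both behaviors fit in one loop with a switch. The sketch below is an illustration, not the runner's code: steps are passed as (name, callable) pairs where each callable returns an exit code, and the stop_on_error flag is a hypothetical addition for the stricter pipelines described above.

```python
import time


def run_all(steps, stop_on_error=False):
    """Run (name, callable) steps; each callable returns an exit code."""
    started = time.monotonic()
    succeeded, failed = [], []
    bar = "=" * 40
    for name, step in steps:
        try:
            exit_code = step()
            if exit_code != 0:
                # Treat a non-zero exit the same way as a raised exception
                raise RuntimeError(f"exit code {exit_code}")
            succeeded.append(name)
        except Exception as exc:
            failed.append(name)
            print(f"{bar}\nFAILED: {name}\n{exc!r}\n{bar}")
            if stop_on_error:
                break  # strict mode: nothing downstream runs
    return {
        "succeeded": succeeded,
        "failed": failed,
        "seconds": time.monotonic() - started,
    }
```

In the real runner each callable would wrap a subprocess.run call and return its returncode. The returned dict is the raw material for the end-of-run summary.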

What I would change

A few next-iteration improvements:

Conditional execution. Right now enabled: true/false is a static per-script flag. A future version should support "run this script only if the previous script's exit code is 0" so the YAML can encode actual dependency logic, not just a flat sequence.
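A sketch of what that check could look like. The field name only_if_previous_ok is hypothetical, as is the choice to let the first script run when nothing has executed before it.

```python
def should_run(entry, previous_exit_code):
    """Decide whether an entry runs, given the exit code of the last script that ran."""
    if not entry.get("enabled", False):
        return False
    if entry.get("only_if_previous_ok", False):  # hypothetical YAML field
        # None means no script has run yet, so there is nothing to block on
        return previous_exit_code in (None, 0)
    return True
```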

Parallelism for independent steps. Some scripts in the pipeline don't depend on each other. They're sequential only because the YAML lists them in order. Adding a parallel_group: field would let the runner fan out independent scripts and join them before moving on.
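One possible shape for this, as a sketch only: consecutive entries that share a parallel_group value form a batch, each batch runs on a thread pool, and the runner joins before starting the next batch. The helper names and the run_one callback (which would launch one script and return its exit code) are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby


def batches(entries):
    """Group consecutive entries sharing a parallel_group; others run alone."""
    # Entries without the field get a unique key so they are never merged
    keyed = [
        (entry.get("parallel_group", f"__solo_{i}"), entry)
        for i, entry in enumerate(entries)
    ]
    return [
        [entry for _, entry in group]
        for _, group in groupby(keyed, key=lambda pair: pair[0])
    ]


def run_batches(entries, run_one):
    """Run each batch concurrently, joining before the next batch starts."""
    exit_codes = {}
    for batch in batches(entries):
        # Leaving the with-block waits for every worker in the batch
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            for entry, code in zip(batch, pool.map(run_one, batch)):
                exit_codes[entry["name"]] = code
    return exit_codes
```

Threads are enough here because each worker would spend its time blocked on a child process rather than executing Python code.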

Notifications. When the pipeline finishes - successfully or not - it should post a summary to Slack or send an email. Right now it just writes to a log file and waits for someone to look at it. Wiring the existing send-status script into the runner's exit hook would close this loop.

Schedule integration. The runner is currently invoked by cron. A future version should know its own schedule (a schedule: "0 2 * * *" field at the top of the YAML) and self-register a launchd job on macOS or a cron entry on Linux. Removes the "did I remember to add this to cron?" cognitive load.

But for the actual job - "run these ten scripts in this order, wait the right amount between each, log everything, and survive overnight" - the YAML-plus-runner pattern is the lowest-friction way I've found. Adding a step is one entry in the YAML. Disabling a step is one boolean flip. Reordering is drag-and-drop in your editor. The runner code itself almost never has to change.

Source: github.com/schandler7171/portfolio-example-scripts/tree/main/master-run-yaml