SaaS apps do not fail because a server explodes. They fail because the stack has become too complex to debug and maintain.
The project is built in such a way that it is impossible to reproduce bugs locally. After jumping through dozens of nested files and multiple microservices, you still can't figure out where the bug is coming from. Or you don't have a reliable way to run database migrations, so they fail all the time. Or maybe no one knows where anything is or how it works, because the person who built it left six months ago and everything is so convoluted that nobody understands it.
This is what kills productivity. This is what kills projects.
Be a minimalist
On every app I work on, I try to keep the stack as lean as I can.
Take this blog for example. It is just HTML files. Nothing else. Nothing fancy. And it just works. I can edit anything I want at any time. I have version control with Git and it scales because I have a CDN. I do not need anything else. It is just me working on this alone, so why would I need a complex system?
Before I add anything, I ask:
- Can I run it locally?
- Can I easily replace it?
- Can I reuse something I already have?
- Can I remove it in one sprint without breaking everything?
- Does it give more value than its full cost, including complexity, migration risk and maintenance burden?
Building a resilient system takes years of failures
I had a client who wanted maximum redundancy. We set up multi-cloud failover so if AWS died, Azure would take over.
In the end, the card on file expired, both accounts got suspended, and the whole setup fell over.
Hidden coupling is everywhere. Billing, identity, DNS, CI, secrets managers, etc. I can do my best but I won't catch them all.
I have also seen provider lock-in hit teams hard. A vendor gets acquired, prices jump, or an account gets suspended. So today, I build in a way that lets me leave easily.
Optimize for recovery, not for fantasy uptime
Across my career, hardware has rarely caused the worst incidents. Human mistakes, bad deploys, provider lockouts and architecture mistakes did.
I cannot predict every outage. I can control how fast I recover.
I want incidents to be boring and recovery to be boring.
The stack I trust by default
For small teams, this setup has worked for me for a long time:
- One oversized server
- Postgres for primary data
- Files on local disk
- Containerized app deployment
I oversize the machine on purpose. Extra RAM gives room for spikes, leaks and debugging without panic. It also stops premature optimization work that does not matter yet.
A boring server with too much headroom beats a clever setup with no margin.
Use Postgres for more than tables
Most of the time, I need a database, a key-value store and a queue. Turns out, Postgres can already do all of this and so much more.
- Normal relational tables for core data
- Materialized views for caching
- Unlogged tables for fast transient data
- A Postgres-backed queue library such as pg-boss
- Scheduled jobs with the pg_cron extension
- Built-in full-text search, which is more than enough for most apps
- Append-only tables for audit trails, events, received webhooks and logs
- Time-series analytics with partitioning, indexes and rollups
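The queue item above deserves a concrete sketch, because it is the pattern people doubt the most. The core trick is `FOR UPDATE SKIP LOCKED`, which lets many workers poll the same table without blocking each other or claiming the same job twice. The table name and columns below are hypothetical, and the worker assumes a psycopg-style connection:

```python
# Hypothetical jobs table; adapt columns to your schema.
CREATE_JOBS_SQL = """
CREATE TABLE IF NOT EXISTS jobs (
    id         bigserial PRIMARY KEY,
    payload    jsonb NOT NULL,
    status     text  NOT NULL DEFAULT 'pending',
    created_at timestamptz NOT NULL DEFAULT now()
);
"""

# FOR UPDATE SKIP LOCKED: concurrent workers skip rows another worker
# has already locked, so each pending job is claimed exactly once.
CLAIM_SQL = """
UPDATE jobs
SET    status = 'running'
WHERE  id = (
    SELECT id FROM jobs
    WHERE  status = 'pending'
    ORDER  BY created_at
    FOR UPDATE SKIP LOCKED
    LIMIT  1
)
RETURNING id, payload;
"""

def claim_next_job(conn):
    """Atomically claim one pending job; returns (id, payload) or None."""
    with conn.cursor() as cur:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()
    conn.commit()
    return row
```

Libraries like pg-boss wrap exactly this pattern, with retries and dead-letter handling on top.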
Eventually, I might move to Kafka, Redis, Elasticsearch or ClickHouse but Postgres works really well for a surprisingly long time.
In practice, fewer systems mean fewer moving parts and fewer places to look at when I debug something.
Everything should be boring
My rule is simple: everything should be boring. There's nothing exciting about using Postgres. It's easy to explain to people and it won't look shiny on my CV.
But a boring stack is just much easier to debug.
I recreate the production environment with its data from a backup and reproduce the issue locally.
My backup plan is also dead simple:
- Run regular `pg_dump`
- Incremental backups to object storage with restic
Then I can run restore drills on any machine.
Rebuild on a fresh server, restore files and database and verify the app boots end-to-end.
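The whole plan fits in a handful of commands. Here is a sketch of it as a small Python wrapper; the database name, dump path and restic repository are hypothetical placeholders:

```python
import subprocess

DB_NAME = "app"                             # hypothetical database name
DUMP_PATH = "/backups/app.dump"             # local dump target
RESTIC_REPO = "s3:s3.example.com/backups"   # hypothetical object-storage repo

def backup_commands():
    """The two commands the backup plan runs, in order."""
    dump = ["pg_dump", "--format=custom", f"--file={DUMP_PATH}", DB_NAME]
    push = ["restic", "-r", RESTIC_REPO, "backup", "/backups"]
    return [dump, push]

def restore_commands(snapshot="latest"):
    """Commands for a restore drill on a fresh machine."""
    pull = ["restic", "-r", RESTIC_REPO, "restore", snapshot, "--target", "/"]
    load = ["pg_restore", "--clean", "--create", "-d", "postgres", DUMP_PATH]
    return [pull, load]

def run_backup():
    for cmd in backup_commands():
        subprocess.run(cmd, check=True)
```

Because restic deduplicates snapshots, the push stays cheap even when the dump itself is large, and the restore drill is the same two commands on any machine.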
Deploy fast
I care a lot about CI speed. If deploys take half an hour, the team only gets a handful of deploys in a day.
My deploy path is equally boring every single time.
Build a Docker container with everything, including the frontend bundle and any other assets. No need for S3 or anything complicated; the CDN will cache it anyway.
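A multi-stage build keeps this single-image approach tidy. This is only a sketch: the Node frontend, Python backend and directory layout are assumptions, not a prescription.

```dockerfile
# Stage 1: build the frontend bundle (hypothetical Node setup).
FROM node:22 AS frontend
WORKDIR /app
COPY frontend/ .
RUN npm ci && npm run build   # emits static files into /app/dist

# Stage 2: the app image that actually ships (hypothetical Python backend).
FROM python:3.12-slim
WORKDIR /srv
COPY backend/ .
RUN pip install --no-cache-dir -r requirements.txt
# Bake the frontend bundle into the same image; the CDN caches it from here.
COPY --from=frontend /app/dist ./static
CMD ["python", "-m", "app"]
```

One image, one artifact to version, one thing to roll back.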
To microservice or not
I do not like microservices, but they make sense when:
- Different services need different runtimes
- A critical ingestion path must stay isolated
Most of the other reasons I've heard are organisational issues or shortcomings in the architecture. At this stage, your team is too small for "coordination overhead", "domain-driven design" or the worst reason to use microservices, "scaling".
Still, I keep things simple with HTTP API calls. No Kafka or PubSub unless I really need it.
I choose boring protocols because boring protocols fail in obvious ways and have tools with decades of development to monitor and debug them.
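"Simple HTTP API calls" can be as plain as the standard library plus a timeout and a retry. The URL, payload shape and retry policy below are hypothetical defaults, not a universal recipe:

```python
import json
import time
import urllib.request

def backoff_schedule(retries=3, base=0.5):
    """Exponential backoff delays in seconds: 0.5, 1.0, 2.0, ..."""
    return [base * (2 ** i) for i in range(retries)]

def call_service(url, payload, timeout=5.0):
    """POST JSON to another service, retrying on transient network errors."""
    body = json.dumps(payload).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    last_err = None
    for delay in backoff_schedule():
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return json.load(resp)
        except OSError as err:  # network errors fail in obvious ways
            last_err = err
            time.sleep(delay)
    raise last_err
```

When this fails, the error is a timeout or a status code, and every proxy, load balancer and packet sniffer ever built knows how to show it to you.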
Get off the cloud
This one is controversial but hear me out. It actually makes sense.
I am not against cloud. I am against cloud complexity theater that teams adopt before they have a real need.
There's the classic mistake of over-engineering too early with Kubernetes and full cloud ceremony, but there's another more subtle one.
The memory- and CPU-starved PaaS that forces you to deal with fake scaling issues.
Under-provisioned platforms distort priorities. I have seen teams waste time on caps, cold starts and workarounds instead of shipping product.
People misread platform limits as architecture needs and add caches and service splits they never needed.
None of that is needed if you get one HUGE server to begin with. I can get a fully managed server with 32 cores, 256 GB of RAM and 2 TB of storage for €199/month.
Hardware failures rarely happen. Usually the app goes down because a human messed something up. Even if the server itself fails, recovery from backups is so simple that the whole app can be redeployed somewhere else.
But there's still a place for innovation. Strong engineering is boring by default and innovative by exception. When we innovate, it has to be justified, calculated and easy to undo.