5 mins
This post is part of a blog series on the book Designing Data-Intensive Applications by Martin Kleppmann
Intro
In a software system, reliability is the ability of the system to continue working correctly even when things go wrong. Things that can go wrong are called faults, and we should design systems that anticipate and deal with them. Such systems are called fault-tolerant or resilient.
Faults vs Failures
Faults can be confused with failures but they are very different:
- Fault: One component of the system deviating from its defined behaviour
- Failure: The system as a whole stops providing its required service
It is impossible to reduce the probability of a fault to zero, so we should design mechanisms that prevent individual faults from cascading into system failures.
For example, in an asynchronous system, your message broker's at-least-once delivery means that you may receive the same message twice. You should handle this case by introducing message deduplication and/or idempotency checks, turning the broker's at-least-once delivery guarantee into an effectively exactly-once processing guarantee in your system. Lest you send a payment to a user twice!
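Here is a minimal sketch of consumer-side deduplication, assuming each message carries a unique "id". The message shape and the send_payment helper are hypothetical, and a real system would keep processed IDs in a shared store (for example Redis with a TTL) rather than in process memory:

```python
# Minimal deduplication sketch: skip messages whose ID we have already processed.
processed_ids: set[str] = set()

def send_payment(user_id: str, amount_cents: int) -> None:
    """Stand-in for the real payment call."""
    print(f"charging user {user_id}: {amount_cents} cents")

def handle_payment_message(message: dict) -> None:
    message_id = message["id"]
    if message_id in processed_ids:
        return  # duplicate delivery from the broker: acknowledge and drop it
    send_payment(message["user_id"], message["amount_cents"])
    processed_ids.add(message_id)

# The broker may deliver the same message twice; only the first one charges the user.
handle_payment_message({"id": "msg-1", "user_id": "u-42", "amount_cents": 500})
handle_payment_message({"id": "msg-1", "user_id": "u-42", "amount_cents": 500})
```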
Hardware Faults
You’re still managing your own hardware in 2022? Damn.
The data and computing demands of applications have increased by an order of magnitude since the inception of the internet. Where previously one bare-metal server was enough to meet all the needs of your service, we now have applications running thousands of nodes across dozens or even hundreds of different physical machines, thanks to modern cloud platforms like AWS.
Still, it doesn't hurt to learn how we can mitigate hardware faults. Typically we add redundancy to hardware components:
- Setup your storage in RAID configurations
- Servers with dual power supplies and hot-swappable CPUs
- Batteries and diesel generators for backup power
When a component fails, its redundant backup takes over. Machines can run uninterrupted for years in a hardened, redundant setup.
Software Errors
Software bugs can be caused by seemingly anything and can manifest as seemingly anything. Realistically there is no single solution that accounts for all possible software errors, but we can take a first-principles approach and assess the root causes that tend to recur across different domains.
Software faults can be introduced into a system but lie dormant for a long time, until the perfect storm of unusual circumstances arrives. In these situations we usually find that the software is making some sort of assumption about something. That something may be correct and true for some amount of time, until it no longer is, perhaps due to a bug introduced in an upstream dependency, a change in the environment, or a config change.
Some considerations that may help mitigate systemic software faults are:
- Think carefully about assumptions and interactions in the system
- What if your message bus seemingly breaks its SLA and replays events that are months old? Surely that could never happen, right?
- Thoroughly testing your system
- Process isolation
- Purposely crashing processes and observing restart behaviour
- Robust observability in production: metrics, logging, and tracing
- Taking a state-machine approach to systems where entity state changes cause many side effects (see the sketch after this list)
- etc
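To illustrate that last point, here is a minimal sketch of guarding entity state changes with an explicit state machine. The order states and allowed transitions are hypothetical; the point is that an unexpected transition fails loudly instead of silently violating an assumption and firing side effects anyway:

```python
from enum import Enum

class OrderState(Enum):
    CREATED = "created"
    PAID = "paid"
    SHIPPED = "shipped"
    CANCELLED = "cancelled"

# Every legal transition is written down; anything else is a broken assumption.
ALLOWED_TRANSITIONS = {
    OrderState.CREATED: {OrderState.PAID, OrderState.CANCELLED},
    OrderState.PAID: {OrderState.SHIPPED, OrderState.CANCELLED},
    OrderState.SHIPPED: set(),
    OrderState.CANCELLED: set(),
}

class InvalidTransition(Exception):
    pass

def transition(current: OrderState, target: OrderState) -> OrderState:
    if target not in ALLOWED_TRANSITIONS[current]:
        raise InvalidTransition(f"{current.value} -> {target.value} is not allowed")
    return target

transition(OrderState.CREATED, OrderState.PAID)      # fine
# A months-old replayed "payment succeeded" event on a cancelled order now raises
# instead of re-triggering shipping side effects:
# transition(OrderState.CANCELLED, OrderState.PAID)  # raises InvalidTransition
```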
Human Errors
Earlier this year we saw outages of enormous scale happen to Amazon and Facebook. The root cause of these outages? Human error when applying changes to DNS config.
Even the most talented, battle-hardened engineers make mistakes from time to time. As a result, we should aim to make our systems reliable in spite of unreliable humans. Some approaches are:
- Design systems in a way that minimises opportunities for error.
- Well-designed abstractions, APIs, and interfaces do a lot to discourage the wrong thing. But if they are too restrictive, your consumers will find clever ways to work around them, and once you make something available, people will find ways to depend on it forever. So some thought should go into the surface area of your API.
- Separate the place where people make the most mistakes from the place where they can cause failures.
- Fully featured non-production sandbox environments with real data that cannot affect real users.
- Test thoroughly at all levels.
- Unit tests, integration tests, end-to-end tests, contract tests, browser tests, smoke tests
- Allow quick and easy recovery from human errors, to minimize the impact in the case of a failure.
- Make it possible, fast and easy to roll back configuration changes and deployments
- Roll out code changes gradually (A/B testing, dark launches); see the bucketing sketch after this list
- Provide tools to recompute and replay data
- Set up robust observability (logging, metrics, and traces)
- A good monitor on a key metric can warn you that something is about to go wrong, or at the very least alert you the moment it does
- Traces can help us more quickly track down the root cause of an issue
- Logs are good. Nuff said.
- Implement good management practices and training
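On gradual rollouts, here is a minimal sketch of percentage-based bucketing. The flag name and the 5% figure are hypothetical, and real systems usually sit behind a feature-flag service, but the core idea is deterministic per-user bucketing so a user stays in the same cohort as the percentage ramps up:

```python
import hashlib

def in_rollout(user_id: str, flag: str, percentage: int) -> bool:
    """Place a user into one of 100 stable buckets and enable the first `percentage`."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percentage

def handle_request(user_id: str) -> str:
    # Start small, watch your metrics, then ramp up, or drop back to 0 to roll back.
    if in_rollout(user_id, "new-billing-path", percentage=5):
        return "new code path"
    return "old code path"

print(handle_request("user-123"))
```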
Conclusion
Reliability is a huge topic. In future articles we will explore different patterns and algorithms for securing a distributed, data-intensive application and see how we can make our systems more reliable.