Resilient Systems: A Primer on Stability and Recovery

Reading Michael Nygard's Release It! introduced me to many ideas about building robust software. Part one of the book covers stability: what causes instability, and patterns for mitigating it.

Stability is a system's ability to keep processing transactions even in the face of

  • shock

  • stress

  • flat-out component failure

Whether from unexpected shocks or sustained stress, components will eventually begin to give way. These failure points can be thought of as the cracks in our system. Every system has them, so we must accept that failures will happen; what matters is designing your system's reaction to those failures.

In the following sections, we'll dive into the common causes of these cracks, such as Integration Points and Chain Reactions, and discuss effective strategies like Circuit Breakers and Bulkheads that help mitigate the impact of system failures.

Uncovering the Culprits of System Instability

The antipatterns in this section are certain to create, accelerate or multiply cracks within a system. Being aware of them is half the battle.

Integration Points

A system so isolated that it integrates with nothing is rare. Integration points are the number-one killer of systems.

Each time your system calls out to another, it presents a stability risk.

  • network requests

  • shared resources

  • message brokers

All of these can and will hang, raise an exception, or create other shocks. Failure in a remote system propagates and has a habit of quickly becoming your problem.

Programming defensively using Circuit Breakers and Timeouts can go a long way in navigating the dangers of integration points.

Chain Reactions

Many of our systems are now multiple nodes behind some form of load balancer. In a scenario where one node fails, all others must pick up the slack.

When one of N nodes fails, each survivor's share of the traffic rises by a factor of N/(N-1). With four servers each handling 25% of the workload, losing one pushes each survivor to roughly 33%, a one-third increase in load. The smaller the group, the harsher the jump.
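To make the arithmetic concrete, here is a small Python sketch of how each surviving node's share of the workload grows as nodes drop out of a load-balanced group (the group size is illustrative):

```python
def surviving_load(total_nodes: int, failed: int) -> float:
    """Fraction of the total workload each surviving node carries."""
    survivors = total_nodes - failed
    if survivors <= 0:
        raise ValueError("no survivors left to carry the load")
    return 1.0 / survivors

# With 4 nodes, each carries 25% of traffic.
before = surviving_load(4, 0)           # 0.25
# Lose one and each survivor carries about 33%.
after = surviving_load(4, 1)
increase = (after - before) / before    # roughly a one-third jump
```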

A chain reaction occurs when there is an underlying problem within an application. Usually some kind of

  • load-related resource leak

  • race conditions causing deadlock

  • blocked threads

Each application that falls over from this problem increases the likelihood of another failing, which can quickly bring a whole layer down. Because these applications are identical to one another, the only real fix is to address the underlying defect. Partitioning servers with Bulkheads is a good way to prevent a chain reaction from taking out the entire service.

A chain reaction in one layer can just as easily jump the gap and become a Cascading Failure in the layer above. The calling layer can protect itself with a Circuit Breaker.

Cascading Failures

All failures begin as a crack. When the patterns for containing these cracks are absent, we witness cracks that jump layers. A cascading failure is when a crack from one layer becomes the catalyst for a crack in another.

The easiest example is a database failure. If a database cluster croaks, any application using it will certainly see issues. If the application doesn't handle this case well, the whole application layer will begin to fail.

Cascading failures are often the result of exhausted resource pools in lower layers. Leaving these Integration Points unprotected is a surefire way of allowing the cracks to jump.

A cascading failure happens after something has already gone wrong. Protect the callers using Timeouts and Circuit Breakers.

Slow Responses

Slow responses can be thought of as a gradual Cascading Failure. Upstream services waiting on our slow responses experience slowness of their own, along with tied-up resources and potential instability. This tends to propagate upwards through the layers.

Sometimes an underlying issue (a memory leak, network congestion) is the root cause, but more often than not unrestricted traffic is the culprit.

If your system is capable of self-monitoring, consider failing fast if the average response time exceeds the system's allowed time. For example, a web server may respond with HTTP 503 and a suitable Retry-After HTTP header.
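As a sketch of that idea, a self-monitoring service might decide per request whether to shed load. The latency budget and Retry-After value below are illustrative, not prescriptive:

```python
def load_shed_response(avg_latency_ms: float, budget_ms: float,
                       retry_after_s: int = 30):
    """Decide how to answer an incoming request given current health.

    Returns an (HTTP status, headers) pair. When the rolling average
    response time exceeds our budget, we fail fast with a 503 and tell
    the client when to come back, rather than queueing more slow work.
    """
    if avg_latency_ms > budget_ms:
        return 503, {"Retry-After": str(retry_after_s)}
    return 200, {}
```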

Strategies to Mitigate System Instability

Timeouts

Networks are unreliable and we cannot wait forever. Timeouts provide a way to stop waiting for a response when it’s unlikely to come.

Good placement of a timeout can provide fault isolation within your system. If another system is struggling, it shouldn't be able to take us down with it.

Consider combining timeouts with a retry mechanism. Immediately retrying an operation may sometimes succeed; however, it is often better to delay retries, since some problems will not resolve straight away.
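One way to sketch that combination: a retry helper with exponential backoff between attempts. The attempt count and base delay are assumptions for illustration, and the sleep function is injectable so the backoff schedule can be tested:

```python
import time

def with_retries(operation, attempts=3, base_delay_s=0.5, sleep=time.sleep):
    """Run `operation`; on failure, wait with exponential backoff and retry.

    `operation` would typically be a network call whose own client-side
    timeout is already set, so a hung dependency can't stall us forever.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            # Back off: 0.5s, then 1s, then 2s, ...
            sleep(base_delay_s * (2 ** attempt))
```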

Circuit Breakers

In electrical engineering, a circuit breaker is a safety device that is designed to protect an electrical circuit from damage when there is too much current.

Similarly, in software systems, a circuit breaker allows one subsystem to fail without destroying the entire system.

In a normal closed state, the breaker allows operations to execute as normal. If the breaker detects a threshold of failures it trips, opening the circuit and failing further calls immediately. After a reset period, the breaker can place itself in a half-open state whereby it will allow the next call to pass. If it succeeds it returns to the regular closed state, allowing all operations. If it fails, we enter the open state again and wait out the reset.

It can be useful for the circuit breaker to raise its own exception class. This lets the system distinguish breaker trips from regular errors and handle them differently, and it makes the frequency of state changes easy to monitor.
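Putting those pieces together, a minimal breaker might look like the sketch below. The thresholds and the CircuitOpenError name are assumptions, and a production breaker would also need thread safety and instrumentation:

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""

class CircuitBreaker:
    """Closed -> open after N failures -> half-open after a reset period."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                # Open: fail fast with our own exception class.
                raise CircuitOpenError("circuit is open; failing fast")
            # Reset period elapsed: half-open, let one probe call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = self.clock()   # trip (or re-trip) the breaker
            raise
        # A success closes the circuit again.
        self.failures = 0
        self.opened_at = None
        return result
```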

Circuit breakers, when combined with Timeouts, are a powerful tool for protecting a system from the many problems that arise at Integration Points.

Bulkheads

Much like how a bulkhead in a ship refers to a partition designed to prevent flooding from spreading, in systems a bulkhead enforces a principle of damage containment through a partition.

The most common form of bulkhead is physical redundancy. If we have four servers, a hardware failure in one cannot impact the others. A bulkhead doesn't require partitioning exclusively through physical means; we can also, for example, partition application instances or thread groups.
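Within a single process, one lightweight bulkhead is a dedicated thread pool per downstream dependency, so a hang in one pool cannot starve the others. The dependency names and pool sizes here are purely illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# One small pool per downstream dependency: if "reports" calls hang and
# saturate their pool, "checkout" work still has threads of its own.
bulkheads = {
    "checkout": ThreadPoolExecutor(max_workers=4, thread_name_prefix="checkout"),
    "reports": ThreadPoolExecutor(max_workers=2, thread_name_prefix="reports"),
}

def submit(dependency: str, fn, *args):
    """Route work to the pool that bulkheads its dependency."""
    return bulkheads[dependency].submit(fn, *args)
```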

The bulkhead pattern excels at preserving some functionality in the face of critical failure. Knowing that failure is inevitable, you must consider how to minimise the damage any one failure can cause.

Fail Fast

Nobody likes waiting around, especially not for a failure response. If the system can determine upfront that it will fail an operation, it's better to fail fast.

At the start of any transaction, a service can roughly estimate what resources it requires and which Integration Points it will touch. Armed with this knowledge, the pragmatic engineer can obtain the necessary connections, verify the state of the Circuit Breakers around those Integration Points, and validate the required parameters. If any of these are unavailable or invalid, fail right away rather than expend valuable resources.
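A sketch of that precondition check, with all names hypothetical: validate parameters and dependency health before spending resources on real work.

```python
class FailFastError(Exception):
    """Raised before any work is done when preconditions are not met."""

def begin_transaction(params: dict, required: list, breaker_states: dict):
    """Check preconditions up front; only proceed if everything is ready.

    `breaker_states` maps a dependency name to True when its circuit
    breaker is closed (healthy). Both structures are illustrative.
    """
    missing = [key for key in required if key not in params]
    if missing:
        raise FailFastError(f"missing parameters: {missing}")
    tripped = [name for name, closed in breaker_states.items() if not closed]
    if tripped:
        raise FailFastError(f"dependencies unavailable: {tripped}")
    return "proceed"
```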

Conclusion

Failures are inescapable. The systems we create and depend on are certain to succumb to cracks in many ways. Taking a vigilant and distrustful approach to system design can go a long way.

It is important to remember, however, that the stability of your system isn't measured by the number of these patterns you have implemented or antipatterns you have avoided. The true measure lies in developing a recovery-oriented mindset and strategically applying the specific patterns that fit your system's failure modes.