Operating On The Edge Of Failure… of MicroServices

Systems have Always been Complex

The complexity of a system is a correlation between the problem it aims to solve and the technology it is built upon.

Microservices Architectures offer solutions for contexts requiring extreme scalability, high performance, and rapid change. These systems are characterized by strong decoupling and redundancy. While they are excellent at solving complex problems by modularizing them into “sub-problems,” they introduce a new challenge: the macro-system must now handle the cohesion of many small, moving parts.

This introduces types of failures that were simply unknown to standard monolithic applications. It might seem like microservices “caused” new problems, but the reality is that these problems are inherent to complexity. Choosing one architecture over another doesn’t remove problems; it simply changes their nature.

As Richard Cook describes in his talk, a complex system is always operating in a state of partial failure. We should consider these failures as part of the normal context, not as rare exceptions.

The complexity of a system is the correlation of the problem is aiming to solve and the technology is made of.

The edge of failure

A system operates within three distinct boundaries. When a system crosses one of these lines, it either stops working or loses its reason to exist:

edge-of-failure
  • Economic Failure: An application exists only as long as it makes sense from an economic perspective. Management constantly pushes the application away from this boundary by requesting new features or changes to maintain value.
  • Unacceptable Workload: Every application requires a certain amount of effort to run or modify. This workload must remain bearable. Developers are constantly pushing away from this boundary, trying to minimize the effort and “friction” required to keep the system alive.
  • The Accident Boundary: This defines the point at which the application stops working. “Failure” means different things for different applications, and the definition changes over time.

It has been observed that applications tend to run very close to the Accident Boundary, separated only by a thin Error Margin. This margin represents the developers’ confidence, the calculated risk of how close they can operate and make decisions without triggering a catastrophe.

Microservices and “Accepted” Failures

The Microservices paradigm has pulled new types of failures into the “Error Margin”, failures that would be considered catastrophic in other architectures.

In a distributed system, unreliable communication and machine failure are not “bugs”; they are normalcy. Because this is the expected environment, every individual service must be designed to handle these interruptions gracefully.

Reliability by Design

This new class of failures has elevated reliability to a primary role in architectural design. Reliability cannot be an afterthought; it must be built into every component that interacts with unreliable resources or critical systems from day one.