Operating On The Edge Of Failure… of MicroServices

Systems Have Always Been Complex

The complexity of a system depends on both the problem it aims to solve and the technology it is made of.

Micro Services architectures offer solutions to the problems that arise when we need scalability, high performance, and the ability to change quickly. These solutions are characterized by strong decoupling and redundancy.

These architectures are also good at solving complex problems because they break them down into sub-problems. On the other hand, the overall system has to manage the cohesion of the small services it is made of, and this introduces new types of failure that were unknown to a standard monolithic application.

It seems that Micro Services have solved some problems but caused new ones. In reality, the problems are related not to the architecture but to the complexity. Choosing one architecture over another only changes the type of problems rather than removing them.

Richard Cook, in his talk, describes how a complex system always has problems, to the point that we should consider failures part of the context, not exceptions.

The Edge of Failure

A system operates within three boundaries. When the system crosses one of them, it stops working or stops having a reason to work.

[Figure: edge of failure]

The first boundary is the Economic Failure. An application exists only as long as it makes sense from an economic perspective. The company or its management works to push the application away from this boundary, for example by asking for new features or changes.

The second boundary is the Unacceptable Workload. An application requires a certain amount of work in order to run or to be modified. This amount should be bearable, and the developers or other people working on the project are always pushing away from this boundary, trying to minimize the effort needed.

The last boundary is the Accident, which defines the point beyond which the application stops working. "Stops working" means different things for different applications, and its meaning changes from time to time.

It has been observed that applications tend to run close to the Accident boundary, at a distance defined by the Error Margin. This margin reflects the developers' confidence in how close they can operate and make decisions without causing an accident.

Micro Services Failures

The Micro Service architecture has brought new types of accepted failures inside the error margin, failures that in other architectures would be considered catastrophic or unacceptable. These errors are caused by unreliable communication and the possibility of machines failing. This is considered normal in a Micro Service architecture, so each application has to handle it.
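One common way to treat a failed remote call as a normal event rather than an exception to the rule is to retry it a bounded number of times with a growing delay. The following is only a minimal sketch of that idea, not a prescribed implementation: `TransientError`, `call_with_retry`, and all parameter names and defaults are hypothetical and chosen for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type standing in for a failed remote call."""


def call_with_retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run `operation`, retrying on TransientError with exponential backoff.

    Returns the result of the first successful attempt, or re-raises the
    last TransientError once all attempts have been used.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: the failure becomes the caller's problem
            # Base delay doubles on each retry; random jitter (up to the same
            # amount again) keeps many clients from retrying in lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
```

A caller would wrap the unreliable call, for example `call_with_retry(lambda: fetch_profile(user_id))`, where `fetch_profile` is a placeholder for whatever remote call the service makes. The number of attempts and the delay are assumptions that would need tuning for each dependency.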

Reliability

This new class of failures has given reliability an important role in architecture design. Reliability should be part of the application from the beginning; it should be built into every component that interacts with unreliable resources or critical systems.
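As an illustration of building reliability into a component that talks to an unreliable resource, the sketch below wraps calls in a simple circuit breaker: after a number of consecutive failures it stops calling the resource for a cool-down period, then lets a trial call through. This is one well-known pattern among several, not something the article prescribes; the class name, thresholds, and the injectable `clock` are assumptions made for the example, and the code is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the protected resource."""


class CircuitBreaker:
    """Hypothetical, single-threaded circuit breaker for illustration."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast without touching the resource.
                raise CircuitOpenError("resource marked unavailable")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the circuit is open keeps a struggling dependency from being flooded with requests, and gives the caller a clear signal to fall back to a degraded behaviour.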