Exploring Fault-Tolerance with Circuit Breakers

Tech

Jul 24

Developing a fantastic user-facing application takes tremendous technical skill and attention to detail to capture your users’ requirements. The payoff for exceeding your users’ expectations is akin to scoring a touchdown, and I don’t know about you, but I’m in it to win it! Have you ever spent countless hours meticulously crafting the perfect user experience, accounting for every possible exception and making the API as ergonomic as possible, only to have it all foiled by an unreliable dependency? This is one of the most frustrating things about user-centric development. I had been searching for a way to handle fickle upstream dependencies for quite a while before I stumbled on the simple concept of a circuit breaker.

TL;DR

Adding circuit breakers to our existing code will increase the durability and resilience of our applications. It follows that this will also reduce the need for manual intervention when a system we depend on fails or goes offline. Furthermore, with proper observability, we can be proactive about responding to upstream outages.

What problem are we trying to solve?

Modern software applications almost always rely on someone else’s data to provide their users with the best possible experience. To get this data, our applications have to make calls to another set of applications. This is where we can run into issues. Martin Fowler, one of the technical gurus in the software development space, explains it pretty well.

It's common for software systems to make remote calls to software running in different processes, probably on different machines across a network. One inherent issue with this is that remote calls can fail, or hang without a response until some timeout limit is reached. What's worse is if you have many callers on an unresponsive supplier, then you can run out of critical resources leading to cascading failures across multiple systems.

How do Circuit Breakers fix this?

The basic idea behind the circuit breaker is pretty simple. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all. Usually, you'll also want some kind of monitor alert if the circuit breaker trips. (Fowler)

Well, that’s pretty dense, so let’s break it down. Let’s suppose that we have one call to Google Maps that is failing intermittently, disrupting the user experience of our application.

Counterintuitively, the desired state is for the Circuit Breaker to be Closed. In this state the application is running as expected — your service is receiving successful responses from Google Maps. Now, let’s say Google Maps goes down for 2 hours… uh oh!

All of a sudden we have a bunch of failed calls to Google Maps and our users are complaining… let’s stop making requests to that service. The Circuit Breaker will pump the brakes and move to the Open state. In the Open state the Circuit Breaker will prevent calls to Google Maps for some configured time, let’s say 5 minutes.

Once those 5 minutes are up, the Circuit Breaker will transition to the Half-Open state. Here the Circuit Breaker lets calls through again to test whether Google Maps is actually working. If the calls succeed, nice! We can move back to the Closed state and resume feeding our users Google Maps data. If they are still failing, back to the Open state we go!
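
To make the three states concrete, here is a minimal Python sketch of the cycle described above. It is an illustration under my own assumptions rather than any particular library’s API: the names CircuitBreaker, CircuitOpenError, failure_threshold and open_seconds are all hypothetical, the threshold of 5 failures is arbitrary, and a production breaker would also need thread safety and monitoring hooks.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # healthy: calls go through
    OPEN = "open"            # tripped: calls are rejected immediately
    HALF_OPEN = "half-open"  # probing: a trial call is allowed through


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is Open."""


class CircuitBreaker:
    # All names and defaults here are illustrative assumptions.
    def __init__(self, failure_threshold=5, open_seconds=300.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds  # 300 s = the 5 minutes from the example
        self.clock = clock                # injectable so tests can fake time
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.open_seconds:
                raise CircuitOpenError("circuit is open; call skipped")
            self.state = State.HALF_OPEN  # wait is over: allow a trial call

        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success resets the count and (re)closes the circuit.
        self.failures = 0
        self.state = State.CLOSED

    def _on_failure(self):
        self.failures += 1
        # A failed trial call in Half-Open reopens the circuit right away;
        # in Closed we wait until the threshold is reached.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

Wrapping a call then looks like breaker.call(fetch_directions, origin, destination), where fetch_directions stands in for whatever function talks to Google Maps. Note that this sketch closes the circuit after a single successful trial call; requiring several successes in a row is an equally reasonable choice.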

How to Use Circuit Breakers

This section is a bit more technical, and the right setup is subjective, but let me describe the configuration I’m running across some of the applications I’ve built recently.

First, let me explain the difference between a Network Exception and a Business Exception… stay with me, it will make sense in a bit, I promise.

A Network Exception is one that happens because of a transient connectivity issue between servers, or because one of the servers is down or not accepting requests. These exceptions are usually not easily fixed by code changes; they usually have to do with network configuration or application load.

A Business Exception is one that occurs when some logic condition fails, such as a null value where an object is expected. These exceptions are usually fixed through code changes.

I’ve found that Circuit Breakers are most useful when reacting to Network Exceptions, because they give a stronger signal that the upstream service is having an outage. See, I told you it would make sense. I don’t want to stop sending requests to a service because I accidentally sent a null object 5 times in a row; I just need to patch my logic or retry once my object has a value. I want to optimize for the Circuit Breaker being Open only when that service is down.
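
One way to express that rule in code is to count only network-type exceptions toward tripping. The Python sketch below assumes that the built-in ConnectionError and TimeoutError stand in for Network Exceptions; the exception types your HTTP client actually raises may differ, and FailureCounter is a hypothetical helper showing just the counting part of a breaker, not a complete one.

```python
# Assumption: these built-in exception types represent "Network Exceptions".
# Swap in whatever your HTTP client raises for connection and timeout errors.
NETWORK_ERRORS = (ConnectionError, TimeoutError)


class FailureCounter:
    """Counts only the failures that should push a breaker toward Open."""

    def __init__(self, threshold=5, trip_on=NETWORK_ERRORS):
        self.threshold = threshold
        self.trip_on = trip_on
        self.failures = 0

    def call(self, func, *args, **kwargs):
        try:
            result = func(*args, **kwargs)
        except self.trip_on:
            # Network Exception: evidence the upstream service may be down.
            self.failures += 1
            raise
        except Exception:
            # Business Exception (e.g. a null where an object was expected):
            # let it propagate, but leave the failure count unchanged.
            raise
        self.failures = 0  # a successful call clears the streak
        return result

    @property
    def should_open(self):
        return self.failures >= self.threshold
```

With this in place, five ValueErrors in a row caused by my own null object leave should_open as False, while five consecutive ConnectionErrors flip it to True.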

How to Optimize the User Experience

The best part about using Circuit Breakers, in my opinion, is that they offer me a backup plan for missing data. To speak plainly, if I can’t rely on my first-choice service to serve my users the data they need, then I need a contingency that gives them some type of data. When the Circuit Breaker is Open, we can use it to switch to alternate logic that gets the same or similar data from another service. Going back to our Google Maps hypothetical, if Google Maps is down, I want to get similar data from Apple Maps, Waze, or some combination.
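
Here is a rough Python sketch of that contingency. Everything in it is hypothetical: CircuitOpenError is the error I assume the breaker raises while Open, fetch_with_fallback is a helper name I made up, and the provider functions are placeholders rather than real Google Maps, Apple Maps, or Waze client calls.

```python
class CircuitOpenError(Exception):
    """Hypothetical error a breaker raises when it rejects a call while Open."""


def fetch_with_fallback(primary, fallbacks):
    """Try the breaker-protected primary, then each fallback in order.

    `primary` and every item in `fallbacks` are zero-argument callables.
    """
    try:
        return primary()
    except CircuitOpenError:
        pass  # breaker is Open: skip the primary and use the backup plan

    last_error = None
    for fallback in fallbacks:
        try:
            return fallback()
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc  # this backup is unreachable too; try the next one
    raise RuntimeError("no provider available") from last_error


# Usage with placeholder providers (not real API clients):
def google_maps_route():
    raise CircuitOpenError("Google Maps breaker is open")


def apple_maps_route():
    return {"provider": "apple", "minutes": 18}


route = fetch_with_fallback(google_maps_route, [apple_maps_route])
```

This version only falls back once the breaker is Open, so the first few failed calls before it trips still surface as errors. Falling back on every failed call is another option, at the cost of more traffic to the backup services.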

In a perfect world, no service would ever have an outage. Unfortunately, as we all know, nothing is perfect. When a service we depend on goes down, the experience for users should not have to change at all; Circuit Breakers protect our users from this inevitability.