Overview of the circuit breaker in Hystrix

Libraries provided by Netflix usually look simple, but once you dive deeper you realize they are pretty complicated. In this article, I want to explain the behavior and usage of the circuit breaker pattern as implemented in Hystrix.

According to the Wikipedia definition, "Circuit breaker is used to detect failures and encapsulates logic of preventing a failure to reoccur constantly (during maintenance, temporary external system failure or unexpected system difficulties)." What does that mean?
For example, think of the circuit as a connection between two services. It operates normally (that is, stays closed) while the connection is stable and the target service is healthy. We can execute our requests (say, 100 per second) without any problems.
But when the target service is down (which we learn after a few unsuccessful attempts), it no longer makes sense to keep trying 100 times per second. So we open the circuit and serve responses from a predefined fallback, since we know the service won't reply.

How is it implemented in Hystrix?

We start with a closed circuit. All requests are processed, and Hystrix gathers metrics. The metrics include the number of processed requests, their execution time and their final status. As long as everything is up and running and there are no network issues, the circuit breaker (CB) stays closed.
If we execute fewer than a minimum number of requests (the circuitBreakerRequestVolumeThreshold property, 20 by default) in a given time window (the metricsRollingStatisticalWindowInMilliseconds property, 10 seconds by default), the status will not change, no matter how many of them fail. Once we meet the minimum volume and the share of failed requests reaches a given percentage (circuitBreakerErrorThresholdPercentage, 50% by default), Hystrix opens the CB. From then on, the next request we try to execute is redirected straight to the fallback.
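
The decision described above can be sketched in a few lines of plain Java. This is a simplified illustration, not Hystrix source code: the class and method names are made up, the constants just repeat the defaults quoted above, and the comparison is assumed to be "at or above the threshold", which is how the Hystrix configuration documentation words it.

```java
// Simplified sketch of the trip decision for a closed circuit; not Hystrix source code.
final class TripDecisionSketch {
    // Defaults quoted in the text; the constant names are hypothetical.
    static final int REQUEST_VOLUME_THRESHOLD = 20;   // circuitBreakerRequestVolumeThreshold
    static final int ERROR_THRESHOLD_PERCENTAGE = 50; // circuitBreakerErrorThresholdPercentage

    /**
     * Returns true if a closed circuit should open, given the request counts
     * collected in the current rolling window.
     */
    static boolean shouldOpen(int totalRequests, int failedRequests) {
        if (totalRequests < REQUEST_VOLUME_THRESHOLD) {
            return false; // too little traffic in the window to decide anything
        }
        int errorPercentage = failedRequests * 100 / totalRequests;
        return errorPercentage >= ERROR_THRESHOLD_PERCENTAGE;
    }
}
```

With these defaults, 19 failures out of 19 requests leave the circuit closed because the volume is too low, while 10 failures out of 20 requests open it.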

But how do we know when the service is back online? Once a period called the 'sleep window' (circuitBreakerSleepWindowInMilliseconds, 5 seconds by default) has elapsed, Hystrix lets a single request execute normally. Launching that request changes the CB status to half-open. Depending on the result of this request, the CB status changes to:
- closed, if the request finished successfully,
- open, if the request failed.
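
To make the transitions easier to follow, here is a minimal state machine in plain Java. It is only a sketch under my own assumptions: the names are hypothetical, the clock is injected so the example can run without sleeping, and the real Hystrix implementation differs in its details (for instance, in how it handles concurrency).

```java
import java.util.function.LongSupplier;

// Simplified sketch of the open -> half-open -> closed/open cycle; not Hystrix source code.
final class HalfOpenSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final long sleepWindowMs; // plays the role of circuitBreakerSleepWindowInMilliseconds
    private final LongSupplier clock; // injected time source, in milliseconds
    private State state = State.CLOSED;
    private long openedAtMs;

    HalfOpenSketch(long sleepWindowMs, LongSupplier clock) {
        this.sleepWindowMs = sleepWindowMs;
        this.clock = clock;
    }

    /** Opens the circuit and restarts the sleep window. */
    synchronized void trip() {
        state = State.OPEN;
        openedAtMs = clock.getAsLong();
    }

    /** Returns true if the caller may execute normally, false if it should use the fallback. */
    synchronized boolean allowRequest() {
        if (state == State.CLOSED) {
            return true;
        }
        if (state == State.OPEN && clock.getAsLong() - openedAtMs >= sleepWindowMs) {
            state = State.HALF_OPEN; // only this single trial request gets through
            return true;
        }
        return false; // still sleeping, or a trial request is already in flight
    }

    /** Result of the trial request: success closes the circuit. */
    synchronized void onSuccess() {
        if (state == State.HALF_OPEN) {
            state = State.CLOSED;
        }
    }

    /** Result of the trial request: failure reopens the circuit for another sleep window. */
    synchronized void onFailure() {
        if (state == State.HALF_OPEN) {
            trip();
        }
    }

    synchronized State state() {
        return state;
    }
}
```

A caller would ask allowRequest() before each execution and report the outcome of the trial request through onSuccess() or onFailure().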

Now, to be fully correct, I need to clarify one thing. Strictly speaking, Hystrix has no notion of a "request". It operates on "commands", which can of course be implemented to execute a RESTful request.
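
The idea of a command can be illustrated without the library. The sketch below is a stripped-down stand-in written for this article, not the real HystrixCommand class: the method names run, getFallback and execute mirror the ones HystrixCommand uses, but everything else (the class names, the circuitOpen parameter, the hard-coded responses) is an assumption made for illustration.

```java
import java.util.function.BooleanSupplier;

// Stripped-down stand-in for a command; the real HystrixCommand does much more.
abstract class CommandSketch<T> {
    /** The protected operation, e.g. a REST call. */
    protected abstract T run() throws Exception;

    /** The predefined response used when run() fails or the circuit is open. */
    protected abstract T getFallback();

    /** circuitOpen is a hypothetical hook telling the command whether to short-circuit. */
    final T execute(BooleanSupplier circuitOpen) {
        if (circuitOpen.getAsBoolean()) {
            return getFallback(); // open circuit: do not even try
        }
        try {
            return run();
        } catch (Exception e) {
            return getFallback(); // failed execution: degrade gracefully
        }
    }
}

// Hypothetical command that pretends to call "GET /users".
final class GetUsersSketch extends CommandSketch<String> {
    private final boolean serviceHealthy;

    GetUsersSketch(boolean serviceHealthy) {
        this.serviceHealthy = serviceHealthy;
    }

    @Override
    protected String run() {
        if (!serviceHealthy) {
            throw new IllegalStateException("user service unavailable");
        }
        return "[\"alice\",\"bob\"]"; // canned response standing in for the HTTP call
    }

    @Override
    protected String getFallback() {
        return "[]"; // empty list as a safe default
    }
}
```

Whether the target service fails or the circuit is already open, the caller receives the fallback value instead of an exception.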


But when I decide to use Hystrix in my application, does that mean there will be just one circuit breaker? No! There are usually several instances, and the one used for a given command is determined by a key, which is the name of the command. So is there a naming convention? Not really, because it depends on what exactly we want to achieve, that is, how we want to divide our commands.
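
Conceptually, this works like a map from the command name to a breaker instance. The following is a minimal sketch of that idea in plain Java; the class names and the bare "open" flag are my own simplifications, not the Hystrix implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of "one circuit breaker per command name"; not Hystrix source code.
final class BreakerRegistrySketch {
    /** Minimal breaker holding only its key and an open/closed flag. */
    static final class Breaker {
        final String key;
        volatile boolean open;

        Breaker(String key) {
            this.key = key;
        }
    }

    private final Map<String, Breaker> breakers = new ConcurrentHashMap<>();

    /** Returns the breaker for the given command name, creating it on first use. */
    Breaker forCommand(String commandKey) {
        return breakers.computeIfAbsent(commandKey, Breaker::new);
    }

    int size() {
        return breakers.size();
    }
}
```

Two commands with the same name share one breaker, so opening it affects both, while a command with a different name is untouched.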

If, for example, you're afraid that a particular REST endpoint can fail regardless of the physical host we hit, let's say "GET /users", it's a good idea to name the command "getUsers". Then all calls to userService1, userService2, ..., userServiceN will be protected by the same circuit breaker.

When it matters more which host you are talking to (for example, because we use sticky-session client-side load balancing), the above name won't help. Then it's better to name the command after the host: "userService1", "userService2", etc. An outage of one instance then won't affect calls to the others. There is just one tiny problem: command properties (such as a timeout) are specified per command name, so all requests to userService1 must share the same configuration. Some endpoints often take longer to finish than others, so this looks like a serious limitation.

There is a third possibility: merge both approaches and use a name like "userService1#getUsers". It solves the problem of different settings, but it is less effective than the second solution, because every circuit we have for a given host needs to open and close independently.
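
To compare the three naming strategies side by side, here is a small plain-Java sketch. The helper names are hypothetical, and the only thing it demonstrates is how many independent circuits each strategy produces for the same set of calls.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BinaryOperator;

// Sketch of the three command-naming strategies; all names here are hypothetical.
final class CommandKeySketch {
    /** One circuit per endpoint, shared by all hosts. */
    static String byEndpoint(String host, String operation) {
        return operation;
    }

    /** One circuit per host, shared by all endpoints of that host. */
    static String byHost(String host, String operation) {
        return host;
    }

    /** One circuit per host and endpoint pair. */
    static String byHostAndEndpoint(String host, String operation) {
        return host + "#" + operation;
    }

    /** Counts the distinct circuits a strategy creates for calls given as {host, operation} pairs. */
    static int circuitCount(List<String[]> calls, BinaryOperator<String> strategy) {
        Set<String> keys = new HashSet<>();
        for (String[] call : calls) {
            keys.add(strategy.apply(call[0], call[1]));
        }
        return keys.size();
    }
}
```

For two hosts exposing two endpoints each, the first two strategies yield two circuits and the combined one yields four, each of which has to detect an outage of its host on its own.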

The best solution would be a custom key for circuit breaker resolution; unfortunately, that is still not possible.


newvalue said…
Thanks for your detail explanation.
