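When server traffic fluctuates, we need to protect the system with explicit policies based on downstream server capacity, business requirements, and similar constraints. Common protection strategies include:

Rate limiting: limit the system's input or output to keep the service stable;
Circuit breaking: stop sending outgoing requests when the system receives too many failing responses;
Load shedding: reject incoming requests when their response times grow too long.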

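The Circuit Breaker pattern prevents an application from performing an operation that is likely to fail: it stops a service from sending too many requests that will probably not succeed. Starting from the open-source project sony/gobreaker ^[1], this article explains how a circuit breaker works and the mechanisms behind it.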

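Requirements

In a distributed system, common faults such as a slow network connection, a request timeout, or overload are often transient. They can be resolved by the system recovering on its own or by the elasticity of the cloud platform (horizontal/vertical scaling). A circuit breaker targets faults that are unpredictable and hard to recover from automatically, such as an unavailable downstream service or a database outage. A circuit breaker also helps prevent cascading failures. For example, when one downstream service of a gateway becomes unavailable and the gateway keeps sending and retrying requests, the gateway can exhaust its own resources, such as memory, and crash as a whole. If the downstream service is only partially unavailable, the flood of failing requests can push it into a complete outage.

A circuit breaker must therefore react quickly when a fault occurs and recover automatically once the fault is resolved.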

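State machine

A circuit breaker is essentially a small state machine that adjusts its state dynamically according to the results of requests. There are three states: Closed, Open, and Half-open. The relationships between them are shown in the state diagram in ^[2].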
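The transitions between the three states, which the following sections describe one by one, can be summarized in a small transition function. This is only a minimal sketch: the State constants follow the names used in the gobreaker excerpts, while the Event type, its values, and the next function are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// State is one of the three circuit breaker states.
type State int

const (
	StateClosed State = iota
	StateHalfOpen
	StateOpen
)

// String returns a readable name for the state.
func (s State) String() string {
	switch s {
	case StateClosed:
		return "closed"
	case StateHalfOpen:
		return "half-open"
	case StateOpen:
		return "open"
	default:
		return "unknown"
	}
}

// Event is a hypothetical type naming what drives a transition.
type Event int

const (
	EventTripped        Event = iota // failure threshold reached while closed
	EventTimeoutExpired              // the open-state timer ran out
	EventProbeSucceeded              // enough trial requests succeeded
	EventProbeFailed                 // a trial request failed
)

// next returns the state that follows s when event e occurs.
// Pairs that are not valid transitions leave the state unchanged.
func next(s State, e Event) State {
	switch {
	case s == StateClosed && e == EventTripped:
		return StateOpen
	case s == StateOpen && e == EventTimeoutExpired:
		return StateHalfOpen
	case s == StateHalfOpen && e == EventProbeSucceeded:
		return StateClosed
	case s == StateHalfOpen && e == EventProbeFailed:
		return StateOpen
	}
	return s
}

func main() {
	s := StateClosed
	events := []Event{EventTripped, EventTimeoutExpired, EventProbeFailed,
		EventTimeoutExpired, EventProbeSucceeded}
	for _, e := range events {
		s = next(s, e)
		fmt.Println(s)
	}
}
```

Keeping the transitions in one function makes it easy to see that Half-open is the only state with two exits, and that Closed can only be reached again through Half-open.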

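The circuit breaker wraps every request with two internal functions, beforeRequest and afterRequest. beforeRequest is called before the request is sent: it decides, based on the current state, whether to block the request, and it counts the request. afterRequest runs after the request finishes and updates the state and the counters according to whether the request succeeded.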

// Execute runs the given request if the CircuitBreaker accepts it.
// Execute returns an error instantly if the CircuitBreaker rejects the request.
// Otherwise, Execute returns the result of the request.
// If a panic occurs in the request, the CircuitBreaker handles it as an error
// and causes the same panic again.
func (cb *CircuitBreaker) Execute(req func() (interface{}, error)) (interface{}, error) {
	generation, err := cb.beforeRequest()
	//...
	
	defer func() {
		e := recover()
		if e != nil {
			cb.afterRequest(generation, false)
			panic(e)
		}
	}()

	cb.afterRequest(generation, err == nil)
	// ...
}

Closed

While the circuit breaker is closed, the system sends requests normally. In this state the breaker keeps a count of recent request failures.

// Counts holds the numbers of requests and their successes/failures.
// CircuitBreaker clears the internal Counts either
// on the change of the state or at the closed-state intervals.
// Counts ignores the results of the requests sent before clearing.
type Counts struct {
	Requests             uint32
	TotalSuccesses       uint32
	TotalFailures        uint32
	ConsecutiveSuccesses uint32
	ConsecutiveFailures  uint32
}
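To make the role of the consecutive counters concrete, the following sketch shows one plausible way to update Counts. The method names onSuccess and onFailure follow the calls that appear in the excerpts in this article; onRequest and clear, and all method bodies, are assumptions made for illustration rather than a copy of the library source.

```go
package main

import "fmt"

// Counts mirrors the struct shown above so the example compiles on its own.
type Counts struct {
	Requests             uint32
	TotalSuccesses       uint32
	TotalFailures        uint32
	ConsecutiveSuccesses uint32
	ConsecutiveFailures  uint32
}

// onRequest counts a request that was allowed through.
func (c *Counts) onRequest() {
	c.Requests++
}

// onSuccess counts a success and ends any streak of failures.
func (c *Counts) onSuccess() {
	c.TotalSuccesses++
	c.ConsecutiveSuccesses++
	c.ConsecutiveFailures = 0
}

// onFailure counts a failure and ends any streak of successes.
func (c *Counts) onFailure() {
	c.TotalFailures++
	c.ConsecutiveFailures++
	c.ConsecutiveSuccesses = 0
}

// clear resets every counter, as on a state change.
func (c *Counts) clear() {
	*c = Counts{}
}

func main() {
	var c Counts
	for _, ok := range []bool{true, false, false} {
		c.onRequest()
		if ok {
			c.onSuccess()
		} else {
			c.onFailure()
		}
	}
	fmt.Printf("%+v\n", c)

	c.clear()
	fmt.Printf("%+v\n", c)
}
```

Resetting the opposite streak on every result is what makes the consecutive counters useful as a trip or recovery signal: a single success in the middle of a run of failures starts the failure streak over.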

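The success or failure counters are updated for every request. Once the failures exceed a threshold (the readyToTrip check in the code below), the circuit breaker switches to the Open state.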

func (cb *CircuitBreaker) onFailure(state State, now time.Time) {
	switch state {
	case StateClosed:
		cb.counts.onFailure()
		if cb.readyToTrip(cb.counts) {
			cb.setState(StateOpen, now)
		}
	// ...
	}
}
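The readyToTrip call above takes the current Counts and returns whether the breaker should open, which means the trip condition can be expressed as a plain predicate. The two predicates below are a sketch under stated assumptions: the helper names tripOnConsecutive and tripOnRatio and the threshold values are hypothetical, chosen only to show two common styles of condition.

```go
package main

import "fmt"

// Counts mirrors the struct shown earlier so the example compiles on its own.
type Counts struct {
	Requests             uint32
	TotalSuccesses       uint32
	TotalFailures        uint32
	ConsecutiveSuccesses uint32
	ConsecutiveFailures  uint32
}

// tripOnConsecutive returns a predicate that trips after more than
// limit failures in a row.
func tripOnConsecutive(limit uint32) func(Counts) bool {
	return func(c Counts) bool {
		return c.ConsecutiveFailures > limit
	}
}

// tripOnRatio returns a predicate that trips once at least minRequests
// have been counted and the share of failed requests reaches ratio.
func tripOnRatio(minRequests uint32, ratio float64) func(Counts) bool {
	return func(c Counts) bool {
		if c.Requests < minRequests {
			return false
		}
		return float64(c.TotalFailures)/float64(c.Requests) >= ratio
	}
}

func main() {
	c := Counts{Requests: 10, TotalSuccesses: 3, TotalFailures: 7, ConsecutiveFailures: 2}

	fmt.Println("consecutive rule trips:", tripOnConsecutive(5)(c))
	fmt.Println("ratio rule trips:", tripOnRatio(3, 0.6)(c))
}
```

A consecutive-failure rule reacts quickly to a hard outage, while a ratio rule with a minimum request count is less sensitive to a few isolated errors under low traffic.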

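Open

Once the circuit breaker is open, every request fails immediately with an error. When the breaker enters the Open state it is given an expiry time; once it has been open for longer than that, it moves to the Half-open state automatically. This lets the circuit breaker probe the availability of the downstream service on its own.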

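Half-open

When the Open state times out, the circuit breaker enters the Half-open state, in which only a limited number of requests are let through. Once the number of consecutive successful requests reaches a threshold (maxRequests in the code below), the breaker returns to the Closed state and passes all requests again. If any request fails, the breaker goes back to the Open state and restarts the timer.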

func (cb *CircuitBreaker) onSuccess(state State, now time.Time) {
	//...
	case StateHalfOpen:
		cb.counts.onSuccess()
		if cb.counts.ConsecutiveSuccesses >= cb.maxRequests {
			cb.setState(StateClosed, now)
		}
	}
}
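The excerpt above covers the recovery path; the admission limit and the failure path can be sketched in the same spirit. The following is an illustrative sketch, not the library code: the halfOpen type, its admit and report methods, the errTooManyRequests value, and the string state names are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errTooManyRequests is returned when the trial quota is used up.
var errTooManyRequests = errors.New("too many requests")

// halfOpen tracks trial requests while the breaker is half-open.
type halfOpen struct {
	maxRequests          uint32
	requests             uint32
	consecutiveSuccesses uint32
}

// admit lets at most maxRequests trial requests through.
func (h *halfOpen) admit() error {
	if h.requests >= h.maxRequests {
		return errTooManyRequests
	}
	h.requests++
	return nil
}

// report records a trial result and returns the name of the resulting
// state: "open" on any failure, "closed" after maxRequests successes
// in a row, and "half-open" otherwise.
func (h *halfOpen) report(success bool) string {
	if !success {
		return "open"
	}
	h.consecutiveSuccesses++
	if h.consecutiveSuccesses >= h.maxRequests {
		return "closed"
	}
	return "half-open"
}

func main() {
	h := &halfOpen{maxRequests: 2}

	fmt.Println("first trial:", h.admit())
	fmt.Println("second trial:", h.admit())
	fmt.Println("third trial:", h.admit())

	fmt.Println("after first success:", h.report(true))
	fmt.Println("after second success:", h.report(true))
}
```

Limiting the number of trial requests keeps a recovering downstream service from being hit with full traffic the moment the timer expires.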

Reference

[1] https://github.com/sony/gobreaker “Circuit Breaker implemented in Go”
[2] https://docs.microsoft.com/en-us/previous-versions/msp-n-p/dn589784(v=pandp.10)?redirectedfrom=MSDN “Circuit Breaker Pattern”