
Three Datacenters

You are building a system that is going to be hosted in the cloud. You have the option of spreading your application across any number of regions or zones. You need to make your system resilient to the failures that are inherent in the cloud model.

What is the best number of datacenters to deploy to in a system built with minimum cost but maximum redundancy?

One datacenter doesn’t make a resilient system, but do we need two, or three, or twenty? The more datacenters you operate, the more complex operating them becomes; but fewer datacenters increase the consequences of any single one going down. If you want to guard against losing a single datacenter, then the remaining datacenters need enough reserve to absorb the load of the one that has dropped out of the platform. With $ n $ datacenters, each must be able to carry $ 1/(n-1) $ of the total load, so the total deployed capacity is $ n/(n-1) $ times the load; it follows that the spare capacity carried ($ 1/(n-1) $ of the load) goes down as $ n $ increases.
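This arithmetic can be sketched in a few lines of Python; the `spare_fraction` helper is illustrative, not part of the pattern:

```python
# Spare-capacity arithmetic for surviving the loss of one datacenter.
# With n datacenters, the remaining n-1 must absorb the full load, so each
# datacenter is sized for 1/(n-1) of the load and the fleet as a whole
# carries n/(n-1) times the load.
def spare_fraction(n: int) -> float:
    """Fraction of the load carried as spare capacity with n datacenters."""
    return 1 / (n - 1)

for n in range(2, 7):
    print(f"n={n}: deployed capacity = {n / (n - 1):.2f}x load, "
          f"spare = {spare_fraction(n):.0%}")
```

With two datacenters you carry a full extra copy of the load (100% spare); with three, 50%; with six, only 20%.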

On the other hand, every datacenter requires a set of “overhead” machines–for monitoring, infrastructure, etcetera–which can be regarded as fixed relative to the number of systems running in that datacenter. With more datacenters, the total fixed overhead goes up while the total spare capacity carried goes down–often, the overhead increase outweighs the spare capacity decrease. Furthermore, you often want to run consensus systems over your datacenters requiring a quorum decision; these systems almost always work best with an odd number of participants in order to avoid split-brain situations.


Choose the minimum number of datacenters that avoids split-brain situations while still giving redundancy: three. Have good reasons to go above that, and consider always using an odd number to keep quorum consensus simple.

There are a number of conflicting forces at work here. Even in a cloud, datacenters often have special machines that manage infrastructure and thus do not contribute compute and/or storage to the systems running the business logic; examples are login bastion hosts, machines collecting logs and metrics (but see [Outsource Metrics and Logging]), machines responsible for running software deployment services, and so on. This is, in a sense, fixed overhead per datacenter which needs to be amortized over the machines that do contribute capacity. Especially for smaller systems, this overhead may make adding too many datacenters infeasible. With more datacenters, however, each datacenter contributes less of the total capacity required, and thus the spare capacity that needs to be carried to protect against the outage of a datacenter becomes smaller. The detailed calculations are outside the scope of this pattern, but it should be relatively simple to find, for a given system, the point where the spare capacity plus the total overhead is at a minimum. For example, if the total capacity required is 60 machines and the fixed overhead per datacenter is 5 machines, then at $ n = 4 $ or $ n = 5 $ the total overhead ($ n * 5 $) plus the spare capacity ($ total * (1 / (n - 1)) $) sums to 40 machines in both cases; with $ n $ above or below this point, the sum is higher[^2].
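The worked example above can be reproduced with a short script; the machine counts and the `extra_machines` helper are illustrative assumptions, not part of the pattern:

```python
TOTAL = 60     # machines of business capacity required (example figure)
OVERHEAD = 5   # fixed overhead machines per datacenter (example figure)

def extra_machines(n: int) -> float:
    """Overhead plus spare capacity when deploying to n datacenters."""
    return n * OVERHEAD + TOTAL / (n - 1)

# The sum bottoms out at 40 machines for n = 4 and n = 5, and rises for
# n above or below that range.
costs = {n: extra_machines(n) for n in range(2, 9)}
```

For a real system, plug in your own capacity and overhead figures; the shape of the curve (a minimum between "too few, lots of spare" and "too many, lots of overhead") stays the same.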

The other consideration is more technical and less a matter of punching cost figures into spreadsheets. Various patterns in this language argue that to make the system work as a mostly seamless whole across datacenters, you must consider what happens if one goes away. Distributed systems very often rely on not ending up in a split-brain situation: a network outage between parts of a system lets each part continue to work and mutate data, ending up with conflicts when the parts are re-joined. A simple way to achieve resistance is to make sure that the members of a distributed system only work when a simple majority, or quorum, is visible; there can be only one simple majority, and thus only one part of the system keeps working under a loss of network connectivity[^3]. Simple majorities work best with an odd number of participants; with an even number, some participants’ votes need to be weighted more heavily to make sure that a split-brain situation cannot occur, and these weighting factors often make systems more brittle and harder to maintain. If possible, an odd number of datacenters is the simplest solution.
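The strict-majority rule is a one-liner; the sketch below (a hypothetical `has_quorum` helper, not taken from any particular consensus library) shows why an odd member count behaves better under a split:

```python
def has_quorum(visible: int, total: int) -> bool:
    """A partition may keep working only if it sees a strict majority."""
    return visible > total // 2

# With 3 members, a 2/1 split leaves exactly one side with quorum:
# the majority side keeps working, the minority side stops.
# With 4 members, a 2/2 split leaves neither side with quorum: the
# whole system stalls even though every machine is still healthy.
```

This is why three datacenters is the sweet spot: any single loss or split leaves exactly one working majority, with no vote-weighting tricks required.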