We are experiencing too much load. Let's add a new server.


Oct 27, 2010

Taking a look through antirez's redis sharding slides tonight, one of the bullet points really jumped out at me. From slide #11:

We are experiencing too much load. Let's add a new server.

There's this idea floating around that we can scale out our data services "just in time". Proponents of cloud computing frequently tout this as an advantage of such a platform. Got a load spike? No problem, just spin up a few new instances to handle the demand. It's a great sounding story, but sadly, things don't quite work that way.

First of all, we don't have any information about the "load". From the statement, we can assume that we're running low on some kind of resource. But, what kind of resource? Network resources, disk resources, CPU resources, memory resources? All of the above?

Moving data onto a new machine is a resource-intensive process. Even if the data is stored entirely in memory, replicating it to a new machine and splitting a partition is going to require CPU and network resources (and probably some memory). If the data is stored on hard drives, bootstrapping a new node will consume network, disk, CPU and memory resources. So, the process of adding capacity is going to add load before it can relieve any.

If we're maxed out on one of the resources we need to add capacity, attempting to spin up a new node is only going to make the situation worse. If you're maxing out your disk array's IOPS capacity, for example, and bootstrapping a new node requires reading data from that disk array, you're in trouble. If you need memory to add a node, and you're out of memory, you're in trouble. Let me say it again: adding a node to a storage cluster is not free.
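To make that concrete, here is a rough sketch of the check you'd want to run before kicking off a bootstrap. Everything in it is an assumption for illustration: the resource names, the utilization figures, and the estimated cost of bootstrapping are all made up, and real numbers would have to come from your own metrics.

```python
# Hypothetical sketch: will bootstrapping a new node push any resource
# past capacity? All figures are fractions of capacity (0.0 to 1.0)
# and are invented for illustration.
RESOURCES = ("cpu", "memory", "disk_iops", "network")


def blocking_resources(current, bootstrap_cost):
    """Return the resources that would exceed capacity during bootstrap."""
    return [
        name
        for name in RESOURCES
        if current.get(name, 0.0) + bootstrap_cost.get(name, 0.0) > 1.0
    ]


# Current utilization on the node we'd be copying data from (assumed).
current = {"cpu": 0.60, "memory": 0.70, "disk_iops": 0.95, "network": 0.40}
# Extra load we guess the bootstrap would add on that same node (assumed).
bootstrap_cost = {"cpu": 0.10, "memory": 0.05, "disk_iops": 0.20, "network": 0.30}

print(blocking_resources(current, bootstrap_cost))  # ['disk_iops']
```

In this made-up case the disks are already at 95%, so the bootstrap has nowhere to get its reads from, no matter how idle the CPU is.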

If you have enough data and traffic to make a clustered database relevant, capacity has to be planned carefully. You have to use your system and application metrics to develop an understanding of your usage patterns. You may not always be able to predict when traffic spikes will occur, but in general, it's possible to know in advance roughly how large they might be.
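As a very rough illustration of what "roughly how large" can mean, the sketch below takes a handful of historical load samples and compares the worst spike to a typical value. The sample numbers, the function names, and the 25% safety margin are all assumptions; it's a back-of-the-envelope starting point, not a capacity model.

```python
from statistics import median


def spike_ratio(samples):
    """Ratio of the largest observed value to the typical (median) value."""
    typical = median(samples)
    if typical <= 0:
        raise ValueError("median load must be positive")
    return max(samples) / typical


def required_capacity(samples, safety_margin=0.25):
    """Peak observed load plus a margin for spikes bigger than any seen so far."""
    return max(samples) * (1.0 + safety_margin)


# Hypothetical daily peak requests per second, with one spike in the middle.
requests_per_second = [1200, 1350, 1100, 1280, 4100, 1300, 1250]

print(spike_ratio(requests_per_second))        # 3.203125
print(required_capacity(requests_per_second))  # 5125.0
```

The point is that the capacity to absorb the 4100 has to exist before the spike arrives, with room left over for whatever it costs to grow the cluster.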

You can focus on writing your app and let Mongo focus on scaling it. - Kristina Chodorow

If you take the marketing materials of many NoSQL database vendors at face value, you'd think that with a horizontally scalable data store, operations engineering simply isn't necessary. Recent high-profile outages suggest otherwise.

How is this issue triggered? ... Systems are at or over capacity.

MongoDB, Redis-cluster (if and when it ships), Cassandra, Riak, Voldemort, and friends are tools that may be able to help you scale your data storage to varying degrees. Compared to sharding a relational database by hand, using a partitioned data store may even reduce operations costs at scale. But fundamentally, no software can use system resources that aren't there.