Primer on Database Sharding

Covers what database sharding means and different implementations

Intro

Another exciting database article! I bet you couldn’t wait! =) I’m going to keep this one short, and cover one topic: database sharding.

What Is It?

Let’s say you have an application that needs a backend database. You’re using Postgres, and you’ve scaled it as much as you can vertically. Like, you’re running it on an AWS x1e.32xlarge that costs around $20,000 USD per month. What can you do next? Shard, of course!

Here’s the definition of sharding we’re going to be working with: A database shard is a horizontal partition of data in a database or search engine.

What’s a Horizontal Partition?

Glad you asked!

Imagine we have this table:

Table 1. Customers
id | first_name | last_name
---+------------+----------
1  | Fletcher   | Haynes
2  | Perceval   | Grimwhisker
3  | Xochipilli | Deathpaw

A horizontal partition is one or more rows of a table. If this table held a thousand rows, we could break it into two shards, each containing 500 rows.

You’re probably wondering where we store the second shard, and the answer is on a second database server. =)

Note
This could technically be two Docker containers on one machine. Or two Postgres processes on one machine. But that defeats the purpose of gaining more compute power.

After the Partition

So we pay AWS another ridiculous sum of money and get a second server. Now our tables might look like:

Table 2. Customers on Server A
id | first_name | last_name
---+------------+----------
1  | Fletcher   | Haynes
2  | Perceval   | Grimwhisker

Table 3. Customers on Server B
id | first_name | last_name
---+------------+----------
3  | Xochipilli | Deathpaw

Each server now has to deal with half the load. This introduces a problem, though. Well, several problems, but let’s start with one: how does the application know which shard has the data it is looking for?

Right now, our topology might look something like (enjoy my ASCII art, ha!):

                +------------+
                | Database A |
                +------------+
               /
+------------+
| App Server |
+------------+
               \
                +------------+
                | Database B |
                +------------+

Option 1: Ask ALL the Servers

This is a simple way to do it. You can make your application aware of all the servers, and it can send a query to all of them. The one with the data will return a result; the rest will come back empty.

While workable, this is not an ideal solution, because we’re still sending needless queries, which puts load on every server. It’s probably less load than on the server that actually has the data, but it’s still unnecessary.
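
As a toy illustration, here is a Python sketch of the ask-everyone approach. The dicts stand in for the two servers and the shard names are made up; in a real application they would be connections to separate Postgres instances.

# Toy scatter-gather lookup: query every "shard" and keep the first hit.
# Plain dicts stand in for the two database servers.
SHARDS = {
    "database-a": {1: ("Fletcher", "Haynes"), 2: ("Perceval", "Grimwhisker")},
    "database-b": {3: ("Xochipilli", "Deathpaw")},
}

def find_customer(customer_id):
    for name, shard in SHARDS.items():
        row = shard.get(customer_id)
        if row is not None:
            return name, row   # this shard had the data
    return None                # no shard had it; every other query was wasted

print(find_customer(3))  # ('database-b', ('Xochipilli', 'Deathpaw'))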

Option 2: Partition by ID

If we have two servers, we do a simple check on the primary key. If it is an even number, it goes to Database A. If it is odd, it goes to Database B.
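
In code, that check is a one-liner. A minimal sketch:

# Route by primary key parity: even IDs live on Database A, odd IDs on Database B.
def shard_for(customer_id):
    return "Database A" if customer_id % 2 == 0 else "Database B"

print(shard_for(2))  # Database A
print(shard_for(3))  # Database B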

Problems

What if we need to add a third server? The even/odd check won’t work in that case. We could move to a round-robin style scheme, where each server gets roughly 33% of the rows, and do some modulo tricks on the application side to figure out which server a given row is located on.
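
The modulo trick generalizes the even/odd check to any number of servers: take the ID modulo the server count and use the result as an index. A sketch:

# Generalized modulo routing: id % N picks one of N servers.
SHARDS = ["Database A", "Database B", "Database C"]

def shard_for(customer_id):
    return SHARDS[customer_id % len(SHARDS)]

print(shard_for(7))  # 7 % 3 == 1, so Database B

The catch is that adding a fourth server changes customer_id % len(SHARDS) for most IDs, so a large chunk of the existing rows would have to move to keep the scheme consistent.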

But all of this only works when we query by ID. What if we want to query by first_name? We’re back to having no fucking clue which server the row is on.

Option 3: Dynamic Sharding

When in doubt, add more databases!

What if we create a table that looks something like this:

Table 4. Shards
first_id | last_id | database
---------+---------+----------
0        | 500     | Database A
501      | 1000    | Database B

Now when the application needs to look something up by ID, it first queries this table, then queries the appropriate database. This also accommodates the addition of more servers; we just add another row to the table.
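
Here is a minimal sketch of the application-side lookup, assuming the shard map has already been read out of the metadata table; the structure and names are illustrative, not any particular database’s API.

# Each entry mirrors a row of the Shards metadata table above.
SHARD_MAP = [
    {"first_id": 0,   "last_id": 500,  "database": "Database A"},
    {"first_id": 501, "last_id": 1000, "database": "Database B"},
]

def shard_for(customer_id):
    for entry in SHARD_MAP:
        if entry["first_id"] <= customer_id <= entry["last_id"]:
            return entry["database"]
    raise KeyError(f"no shard covers id {customer_id}")

print(shard_for(42))   # Database A
print(shard_for(999))  # Database B

Adding a third server is just one more entry in the map, e.g. one covering IDs 1001 through 1500.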

Problems

If the field we are sharding on (often called the shard key, and it does NOT have to be the same as the primary key) is monotonically increasing (always going up), then the newest data will always be on the newest servers.

If our service is one that sees a lot of activity from new users, but then trails off, we’ll end up with a lot of unused databases, and a few overloaded ones. This is often called the hot shard problem.

Rebalancing

To deal with the hot shard problem, systems often do something called rebalancing: a background process examines the system and redistributes the partitions across the servers so that utilization is more even. HDFS, MongoDB, and many others have processes that do this and then update the metadata table.
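
To make that concrete, here is a toy sketch of the metadata side of one rebalance step, reusing the shard-map structure from the sketch above: split an overloaded range in half and hand the upper half to a new server. Real systems also have to copy the underlying rows and coordinate the cutover; none of that is shown here, and the function is illustrative rather than how any particular system does it.

# Toy rebalance step: split a hot range in half and assign the upper half
# to a new server. Only the metadata bookkeeping is shown.
def split_shard(shard_map, hot_index, new_database):
    hot = shard_map[hot_index]
    midpoint = (hot["first_id"] + hot["last_id"]) // 2
    upper = {"first_id": midpoint + 1,
             "last_id": hot["last_id"],
             "database": new_database}
    hot["last_id"] = midpoint
    shard_map.insert(hot_index + 1, upper)
    return shard_map

# split_shard(SHARD_MAP, 1, "Database C") turns the 501-1000 range into
# 501-750 on Database B and 751-1000 on Database C.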

Problems

If your application’s access pattern has a high degree of randomness, the re-sharding process may not be able to move data around fast enough. Re-sharding also uses network capacity. If you have a multi-TB DB that is frequently being re-sharded, your cloud provider bill is going to go up.

A lot.

Trust me on this.

End

The typical solution you will see in databases like MongoDB, Cassandra, etc. is dynamic sharding with a metadata server. Some applications implement sharding at the application level and just run completely discrete DB instances that share nothing; this is also a perfectly valid choice. Know your data, and know your usage patterns.

P.S. For an interesting example of sharding, check out https://vitess.io/.


If you need some assistance with any of the topics in the tutorials, or just devops and application development in general, we offer consulting services. Check it out over here or click Services along the top.