Primer on Database Sharding
Covers what database sharding means and different implementations
Intro
Another exciting database article! I bet you couldn’t wait! =) I’m going to keep this one short, and cover one topic: database sharding.
What Is It?
Let’s say you have an application that needs a backend database. You’re using Postgres, and you’ve scaled it as far as you can vertically. Like, you’re on an AWS x1e.32xlarge
that costs around $20,000 USD per month. What can you do next? Shard, of course!
Here’s the definition of sharding we’re going to be working with: A database shard is a horizontal partition of data in a database or search engine.
What’s a Horizontal Partition?
Glad you asked!
Imagine we have this table:
id | first_name | last_name |
---|---|---|
1 | Fletcher | Haynes |
2 | Perceval | Grimwhisker |
3 | Xochipilli | Deathpaw |
A horizontal partition is simply some subset of the rows. If this table held a thousand rows, we could break it into two shards, each containing 500 rows.
You’re probably wondering where we store the second shard, and the answer is on a second database server. =)
Note: This could technically be two Docker containers on one machine, or two Postgres processes on one machine. But that defeats the purpose of gaining more compute power.
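To make the split concrete, here’s a minimal sketch in Python of breaking a 1,000-row table into the two halves described above. The row contents and the 500-row boundary are made up for illustration.

```python
# A toy illustration: split 1,000 rows into two shards by id range.
# The row contents and the 500-row boundary are made-up examples.

rows = [{"id": i, "first_name": f"first_{i}", "last_name": f"last_{i}"}
        for i in range(1, 1001)]

shard_a = [row for row in rows if row["id"] <= 500]   # would live on Database A
shard_b = [row for row in rows if row["id"] > 500]    # would live on Database B

print(len(shard_a), len(shard_b))  # 500 500
```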
After the Partition
So we pay AWS another ridiculous sum of money and get a second server. Now our tables might look like this.

Database A:

id | first_name | last_name |
---|---|---|
1 | Fletcher | Haynes |
2 | Perceval | Grimwhisker |

Database B:

id | first_name | last_name |
---|---|---|
3 | Xochipilli | Deathpaw |
Each server now has to deal with half the load. This introduces a problem, though. Well, several problems, but let’s start with one: how does the application know which shard has the data it is looking for?
Right now, our topology might look something like (enjoy my ASCII art, ha!):
                 +------------+
                 | Database A |
                 +------------+
                /
+--------------+
|  App Server  |
+--------------+
                \
                 +------------+
                 | Database B |
                 +------------+
Option 1: Ask ALL the Servers
This is a simple way to do it. You can make your application aware of all the servers and send a query to all of them. The one with the data will return a result; the rest will return nothing.
While workable, this is not an ideal solution, because we’re still sending needless queries, which puts extra load on every server. It’s probably less load than on the server that actually has the data, but it is still unnecessary.
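Here’s a sketch of what that “ask every shard” routing might look like. The shard objects and their query() method are hypothetical stand-ins for whatever database driver you’re actually using.

```python
# A sketch of "ask every shard" routing. The shard objects and their
# query() method are hypothetical stand-ins for a real DB driver.

def find_user_everywhere(shards, user_id):
    """Send the same query to every shard and return the first hit."""
    for shard in shards:
        row = shard.query(
            "SELECT id, first_name, last_name FROM users WHERE id = %s",
            (user_id,),
        )
        if row is not None:
            return row
    return None  # no shard had the row
```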
Option 2: Partition by ID
If we have two servers, we do a simple check on the primary key. If it is an even number, the row goes to Database A. If it is odd, it goes to Database B.
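In code, the even/odd check is about as small as routing logic gets. The database handles here are placeholders, not a real API.

```python
# Even ids go to Database A, odd ids go to Database B.
# database_a and database_b are placeholders for real connections.
def pick_shard(user_id, database_a, database_b):
    return database_a if user_id % 2 == 0 else database_b
```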
Problems
What if we need to add a third server? Even/odd won’t work in that case. We could start doing a round-robin-style scheme where each server gets roughly 33% of the rows, and do some modulo tricks on the application side to figure out which server a given row is located on.
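The modulo trick generalizes to N servers, but notice what happens when N changes. This little sketch (with a hypothetical shard_index() helper) shows that most ids map to a different server after adding one:

```python
def shard_index(user_id, num_servers):
    """id % N routing: simple, but most ids move when N changes."""
    return user_id % num_servers

moved = sum(1 for i in range(1, 11)
            if shard_index(i, 2) != shard_index(i, 3))
print(f"{moved} of 10 ids change shards when going from 2 to 3 servers")  # 7 of 10
```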
But all of this only works when we query by ID. What if we want to query by first_name? We’re back to having no fucking clue which server it’s on.
Option 3: Dynamic Sharding
When in doubt, add more databases!
What if we create a table that looks something like this:
first_id | last_id | database |
---|---|---|
0 | 500 | Database A |
501 | 1000 | Database B |
Now when the application needs to look something up by ID, it first queries this table, then queries the appropriate database. This also accommodates the addition of more servers; we just add another row to the metadata table.
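Here’s a minimal sketch of routing through such a metadata table. In a real system the shard map would live in its own small database; here it’s just a Python list for illustration, and the names are made up.

```python
# A sketch of routing through a metadata table. In a real system the shard
# map would live in its own small database; here it's a list for illustration.

SHARD_MAP = [
    {"first_id": 0,   "last_id": 500,  "database": "database_a"},
    {"first_id": 501, "last_id": 1000, "database": "database_b"},
]

def database_for_id(user_id):
    """Return the name of the database that holds this id."""
    for entry in SHARD_MAP:
        if entry["first_id"] <= user_id <= entry["last_id"]:
            return entry["database"]
    raise KeyError(f"no shard covers id {user_id}")

print(database_for_id(42))   # database_a
print(database_for_id(750))  # database_b

# Adding a third server is just another row in the shard map:
# SHARD_MAP.append({"first_id": 1001, "last_id": 1500, "database": "database_c"})
```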
Problems
If the field we are sharding on (often called the shard key, and it does NOT have to be the same as the primary key) is monotonically increasing (always going up), then the newest data will always be on the newest servers.
If our service is one that sees a lot of activity from new users, but then trails off, we’ll end up with a lot of unused databases, and a few overloaded ones. This is often called the hot shard problem.
Rebalancing
To deal with the hot shard problem, systems often do something called rebalancing: a background process examines the system and redistributes the partitions across the servers so that utilization is more even. HDFS, MongoDB, and many others have background processes that do this, and then update the metadata table.
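For intuition, a rebalancing pass boils down to something like this greatly simplified sketch. The move_rows() helper and the row_counts structure are invented for illustration; real systems are far more careful about copying data and swapping the metadata atomically.

```python
# A greatly simplified rebalancing pass: find the busiest and quietest shards,
# move the top half of the hot id range to the quiet shard, then update the
# metadata. move_rows() is a hypothetical helper that would copy data over
# the network; row_counts maps database name -> number of rows it holds.

def rebalance(shard_map, row_counts, move_rows):
    hottest = max(shard_map, key=lambda entry: row_counts[entry["database"]])
    coldest = min(shard_map, key=lambda entry: row_counts[entry["database"]])
    if hottest is coldest:
        return  # only one shard, nothing to rebalance

    # Split the hot range in half and hand the upper half to the cold shard.
    mid = (hottest["first_id"] + hottest["last_id"]) // 2
    moved = {"first_id": mid + 1,
             "last_id": hottest["last_id"],
             "database": coldest["database"]}
    hottest["last_id"] = mid

    move_rows(moved, src=hottest["database"], dst=coldest["database"])
    shard_map.append(moved)  # finally, record the new range in the metadata table
```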
Problems
If your application’s access pattern has a high degree of randomness, the re-sharding process may not be able to move data around fast enough. Re-sharding also eats network capacity: if you have a multi-TB database that is frequently being re-sharded, your cloud provider bill is going to go up.
A lot.
Trust me on this.
End
The typical solution you will see in databases like MongoDB, Cassandra, and others is dynamic sharding with a metadata server. Some applications implement sharding at the application level and just run completely discrete DB instances that share nothing; this is also a perfectly valid choice. Know your data, and know your usage patterns.
P.S. For an interesting example of sharding, check out https://vitess.io/.
If you need some assistance with any of the topics in the tutorials, or just devops and application development in general, we offer consulting services. Check it out over here or click Services along the top.