Leveraging Hierarchical Partition Keys in Azure Cosmos DB

Partitioning is a core concept in Azure Cosmos DB. Getting it right means your database can scale virtually without limit, in both storage and throughput.

I explain the concept in detail in my earlier post, Horizontal Partitioning in Azure Cosmos DB, but the gist of it is this:

You need to define a partition key for your container, which is a property that every document contains. This property is used to group documents with the same value together in the same logical partition.

NoSQL databases like Cosmos DB do not support JOINs, so related documents must be fetched individually. That becomes a performance problem if those documents are scattered across a huge container with many partitions, requiring a “cross-partition” query. But if the related documents all share the same partition key value, then they are essentially “joined” in the sense that they are all stored in the same logical partition, and they can therefore be retrieved very efficiently using a “single-partition” query, no matter how large the container may be.
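To make the grouping concrete, here is a minimal Python sketch (the documents and property names are hypothetical, and this is plain Python rather than the Cosmos DB SDK) showing how documents sharing a partition key value land together in one logical partition:

```python
from collections import defaultdict

# Hypothetical user documents; "tenantId" is the partition key property.
docs = [
    {"id": "u1", "tenantId": "contoso", "name": "Alice"},
    {"id": "u2", "tenantId": "contoso", "name": "Bob"},
    {"id": "u3", "tenantId": "fabrikam", "name": "Carol"},
]

# Cosmos DB groups documents with the same partition key value
# into the same logical partition.
logical_partitions = defaultdict(list)
for doc in docs:
    logical_partitions[doc["tenantId"]].append(doc)

# A "single-partition query" only ever touches one of these groups,
# no matter how many other partitions exist in the container.
contoso_users = logical_partitions["contoso"]
print([d["id"] for d in contoso_users])  # → ['u1', 'u2']
```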

Containers are elastic and can scale indefinitely. There is no upper limit to the number of logical partitions in a container. There is, however, an upper limit on the size of each logical partition, and that limit is 20 GB. This limit is usually generous enough, since the general recommendation is to choose a partition key that yields relatively small logical partitions, which Cosmos DB can then distribute uniformly across the physical partitions it manages internally. However, some scenarios have less uniformity in logical partition size than others, and this is where the new hierarchical partition key feature just announced for Azure Cosmos DB comes to the rescue.

Before Hierarchical Partition Keys

Let’s use a multi-tenant scenario to explain and understand hierarchical partition keys. Each tenant has user documents, and since each tenant will always query only for its own user documents, the tenant ID is the ideal property to choose as the container’s partition key. It means that user documents for any given tenant can be retrieved very efficiently, even if the container is storing terabytes of data for hundreds of thousands of tenants.

For example, below we see a container partitioned on tenant ID. It is storing user documents grouped together inside logical partitions (the dashed blue lines), with one logical partition per tenant, spread across four physical partitions.

If we need to retrieve the documents for all the users of a given tenant, then that can be achieved with a single-partition query (since Cosmos DB knows which logical partition the query is asking for, it can route the query to just the single physical partition known to be hosting that logical partition):

Likewise, if we need to retrieve the documents for just a specific user of a given tenant, then that will also be a single-partition query (since the tenant ID is known, and we are retrieving only a subset of that tenant’s logical partition filtered on user ID):

However, partitioning on tenant ID also means that no single tenant can exceed 20 GB of data for their users. While most tenants can fit comfortably inside a 20 GB logical partition, there will be a few that can’t.

Traditionally, this problem is handled in one of two ways.

One approach is to have a single container partitioned on tenant ID, and use this container for the majority of tenants that don’t need to exceed 20 GB of storage for their user documents (as just demonstrated). Then, create one container for each large tenant that is partitioned on the user ID. This works, but adds obvious complexity. You no longer have a single container for all your tenants, and you need to manage the mapping of the large tenants to their respective containers for every query. Furthermore, querying across multiple users in the large tenant containers that are partitioned on user ID still results in cross-partition queries.

Another approach is to use what’s called a synthetic key to partition a single container for tenants of all sizes. A synthetic key is manufactured from multiple properties in the document, so in this case we could partition on a property that combines the tenant ID and user ID. With this approach we get a single container for tenants of all sizes, but the partition granularity is at the user level. So all the tenants that could fit comfortably in a 20 GB logical partition (and could therefore use single-partition queries to retrieve documents across multiple users if we partitioned just on the tenant ID) are forced down to user-level granularity just to accommodate the few larger tenants that require more than a single 20 GB logical partition. Clearly here, the needs of the many (the majority of smaller tenants) do not outweigh the needs of the few (the minority of larger tenants).
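A synthetic key is typically manufactured in application code before the document is written. A minimal sketch, with hypothetical property names and a hyphen as the (assumed) separator:

```python
def synthetic_key(doc):
    # Manufacture a single partition key value from two properties.
    # Because the combined value is the partition key, the partition
    # granularity is per user, even for tenants that would fit
    # comfortably inside one 20 GB logical partition.
    return f"{doc['tenantId']}-{doc['userId']}"

doc = {"id": "order-1", "tenantId": "contoso", "userId": "alice"}
print(synthetic_key(doc))  # → contoso-alice
```

The application would store this value in a dedicated document property and partition the container on that property.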

Introducing Hierarchical Partition Keys (aka Subpartitioning)

Microsoft has just released a new feature called Hierarchical Partition Keys that can solve this problem. Instead of using separate containers or synthetic partition keys, we can define a hierarchical partition key with two levels: tenant ID and user ID. Note that this is not simply a combination of these two values like a synthetic key is, but rather an actual hierarchy with two levels, where the user ID is the “leaf” level (hierarchical partition keys actually support up to three levels of properties, but two levels will suffice for this example).
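The difference between a flat synthetic value and a true hierarchy can be sketched in plain Python (property names are illustrative; this is not SDK code). A hierarchical key is an ordered sequence of levels, which means a query supplying only the leading level(s) can still be matched by prefix:

```python
# A hierarchical partition key is an ordered sequence of levels,
# not a single concatenated value.
def hierarchical_key(doc):
    return (doc["tenantId"], doc["userId"])  # tenant = top level, user = leaf

def matches_prefix(key, prefix):
    # A query that supplies only the leading levels of the hierarchy
    # (e.g. just the tenant ID) can still be matched against the key.
    return key[: len(prefix)] == prefix

key = hierarchical_key({"tenantId": "contoso", "userId": "alice"})
print(matches_prefix(key, ("contoso",)))          # → True  (tenant-only query)
print(matches_prefix(key, ("contoso", "alice")))  # → True  (full-key query)
```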

Here’s what the updated portal experience looks like for defining a hierarchical partition key:

Switching from a simple partition key of tenant ID to a hierarchical partition key of tenant ID and user ID won’t have any effect as long as each tenant fits inside a single logical partition. So below, each tenant is still wholly contained inside a single logical partition:

At some point, that large tenant on the second physical partition will grow to exceed the 20 GB limit, and that’s the point where hierarchical partitioning kicks in. Cosmos DB will add another physical partition, and allow the large tenant to span two physical partitions, using the finer level granularity of the user ID property to “sub-group” user documents belonging to this large tenant into logical partitions:

For the second and third physical partitions shown above, the blue dashed line represents the entire tenant, while the green dashed lines represent the logical partitions at the user ID level, for the user documents belonging to the tenant. This tenant can now exceed the normal 20 GB limit, and can in fact grow to an unlimited size, where now its user documents are grouped inside 20 GB logical partitions. Meanwhile, all the other (smaller) tenants continue using a single 20 GB logical partition for the entire tenant.

The result is the best of all worlds!

For example, we can retrieve all the documents for all users belonging to TenantX, where TenantX is one of the smaller tenants. Naturally, this is a single-partition query that is efficiently routed to a single physical partition:

However, if TenantX is the large tenant, then this becomes a “sub-partition” query, which is actually a new term alongside “single-partition” and “cross-partition” queries (and also why, incidentally, the new hierarchical partition key feature is also known as sub-partitioning).

A sub-partition query is the next best thing to a single-partition query. Unlike a cross-partition query, Cosmos DB can still intelligently route the query to just the two physical partitions containing the data, and does not need to check every physical partition.

Querying for all documents belonging to a specific user within a tenant will always result in a single-partition query, for large and small tenants alike. So if we query for UserY documents in large TenantX, it will still be routed to the single physical partition for that data, even though the tenant has data spread across two physical partitions:
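The routing behavior for all three query types can be simulated with a toy partition map, loosely modelling the figures above. The layout, tenant names, and range scheme here are entirely made up for illustration; real routing inside Cosmos DB uses hashed key ranges, not alphabetical ones:

```python
# Each entry maps a (tenant, user-range) slice to the physical
# partition hosting it. "tenant-x" is the large tenant spanning two.
PARTITION_MAP = [
    {"tenant": "small-a",  "users": None,       "physical": 0},
    {"tenant": "tenant-x", "users": ("a", "m"), "physical": 1},
    {"tenant": "tenant-x", "users": ("n", "z"), "physical": 2},
    {"tenant": "small-b",  "users": None,       "physical": 3},
]

def route(tenant=None, user=None):
    """Return the set of physical partitions a query must touch."""
    hits = set()
    for entry in PARTITION_MAP:
        if tenant is not None and entry["tenant"] != tenant:
            continue
        if user is not None and entry["users"] is not None:
            lo, hi = entry["users"]
            if not (lo <= user[0] <= hi):
                continue
        hits.add(entry["physical"])
    return hits

print(route(tenant="small-a"))                 # single-partition → {0}
print(route(tenant="tenant-x"))                # sub-partition    → {1, 2}
print(route(tenant="tenant-x", user="alice"))  # single-partition → {1}
print(route())                                 # cross-partition  → {0, 1, 2, 3}
```

The key point the simulation captures: supplying the full hierarchy always routes to one physical partition, supplying only the top level routes to the subset of partitions holding that tenant, and supplying neither touches them all.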

I don’t know about you, but I find this to be pretty slick!

Now consider how this can apply to other scenarios, like IoT for example. You could define a hierarchical partition key based on device ID and month. Devices that accumulate less than 20 GB of data can store all their telemetry inside a single logical partition, while devices that accumulate more can exceed 20 GB and span multiple physical partitions, keeping each month’s worth of data “sub-partitioned” inside individual 20 GB logical partitions. Querying on any device ID will then always result in either a single-partition query (devices with less telemetry) or a sub-partition query (devices with more telemetry), and qualifying the query with a month will always result in a single-partition query for any device. In neither case do we ever require a cross-partition query.
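Deriving the two levels for the IoT case is straightforward application code. A sketch, assuming hypothetical property names and ISO-8601 timestamps:

```python
from datetime import datetime

def telemetry_key(reading):
    # Two-level hierarchical key: device ID at the top, month as the leaf.
    # Property names and timestamp format are illustrative assumptions.
    ts = datetime.fromisoformat(reading["timestamp"])
    return (reading["deviceId"], ts.strftime("%Y-%m"))

reading = {
    "deviceId": "sensor-42",
    "timestamp": "2023-05-17T08:30:00+00:00",
    "temp": 21.5,
}
print(telemetry_key(reading))  # → ('sensor-42', '2023-05')
```

Storing the month as its own document property (rather than computing it at query time) is what lets it serve as the leaf level of the partition key.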

One final note about using hierarchical partition keys: if the top-level property value is not included in the query, it becomes a full cross-partition query. For example, if you query only on UserZ without specifying the tenant ID, the query must hit every physical partition to produce a result (where of course, if UserZ existed in multiple tenants, they would all be returned):

And that’s all there is to it. Get started with hierarchical partition keys by checking out the docs at https://learn.microsoft.com/en-us/azure/cosmos-db/hierarchical-partition-keys.

Happy coding!

