Demystifying the Multi-Model Capabilities in Azure Cosmos DB

When someone casually asks me, “Hey, what is Cosmos DB?,” I casually respond, “Well, that’s Microsoft’s globally distributed, massively scalable, horizontally partitioned, low latency, fully indexed, multi-model NoSQL database, of course.” One of two things happen then. I’ll often get a long weird look, after which the other person politely excuses themselves and moves along. But every now and again, I’ll hear “wow, that sounds awesome, tell me more!” If you’re still reading this, then I’m guessing you’re in the latter category.

If you start to elaborate on each of the bullet points in my soundbite response, there’s a lot to discuss before you get to “multi-model NoSQL” at the tail end. Starting with “globally distributed,” Cosmos DB is – first and foremost – a database designed for modern web and mobile applications, which are (typically) global applications in nature. Simply by clicking the mouse on a map in the portal, your Cosmos DB database is instantly replicated anywhere and everywhere Microsoft hosts a data center (there are nearly 50 of them worldwide, to date). This delivers high availability and low latency to users wherever they’re located.

Cosmos DB also delivers virtually unlimited scale, both in terms of storage – via server-side horizontal partitioning, and throughput – by provisioning a prescribed number of request units (RUs) per second. This ensures low latency, and is backed by comprehensive SLAs to yield predictable database performance for your applications. And it’s unique “inverted indexing” scheme enables automatic indexing of every property that you store, with minimal overhead.

Whew. That’s quite a lot to digest before we even start pondering Cosmos DB’s multi-model support, mentioned there at the end of my lengthy description. In fact, it’s very deliberately placed at the end. Because regardless of which data model you choose, it actually makes no difference to Cosmos DB. The capabilities around global distribution, horizontal partitioning, provisioned throughput, and automatic indexing apply – these are durable concepts that transcend whatever data model you choose. So you get to pick and choose among any of the supported data models, without compromising any of the core features of the Cosmos DB engine.

Which segues right into the topic of this blog post. What exactly is “multi-model”? And specifically, what does it mean for a database platform like Cosmos DB to support multiple data models?

It all boils down to how you’d like to treat your data – and this is where the developer comes in. Because, while massive scale is clearly important (if not critical), developers don’t really care about such details as long as it all “just works.” And it’s Cosmos DB’s job to make sure that this all works. When it comes to actually building applications – well, that’s the developer’s job, and this is where the decision of which data model to choose comes into play.

Depending on the type of application being built, it could be more appropriate to use one data model and not another. For example, if the application focuses more on relationships between entities than the entities themselves, then a graph data model may work better than a document model. In other cases, a developer may want to migrate an existing NoSQL application to Cosmos DB; for example, an existing Mongo DB or Cassandra application. In these scenarios, the Cosmos DB data model will be pre-determined; depending on the back-end database dependency of the application being ported, the developer would choose either the Mongo DB-compatible or Cassandra-compatible data model. Such a migration would require minimal (to no) changes to the existing application code. And yet, in other “green field” situations, developers that are very opinionated about how data should be modeled are free to choose whichever data model they prefer.

Each data model has an API for developers to work with in Cosmos DB. Put another way, the developer chooses an API for their database, and that determines the data model that is used. So, let’s break it down:

MM1

Document Data Model (SQL & MongoDB APIs)

The first thing to point out is that the SQL API is, essentially, the original DocumentDB programming model from the days when Cosmos DB was called DocumentDB. This is arguably the most robust and capable of all the APIs, because it is the only one that exposes a server-side programming model that lets you build fully transactional stored procedures, triggers, and user-defined functions.

So both the SQL and MongoDB APIs give you a document data model, but the two APIs themselves are radically different. Yes, they are similar from a data modeling perspective; you store complete denormalized entities as a hierarchical key-value document model; pure JSON in the case of the SQL API, or BSON in the case of the MongoDB API (BSON is MongoDB’s special binary-encoded version of JSON that extends JSON with additional data types and multi-language support).

The critical difference between the two APIs is the programming interface itself. The SQL API uses Microsoft’s innovative variant of structured query language (SQL) that is tailored for searching across hierarchical JSON documents. It also supports the server-side programming model (for example, stored procedures), which none of the other APIs do.

In contrast, the MongoDB API actually provides wire-level support, which is a compatibility layer that understands the protocol used by the MongoDB driver for sending packets over the network. MongoDB has a built-in find method used for querying documents (unlike the SQL support found in the SQL API). So the MongoDB API appeals to existing MongoDB developers, because they now enjoy the scale-out, throughput, and global distribution capabilities of Cosmos DB, without deserting the MongoDB ecosystem. Because the MongoDB API in Cosmos DB gives you full compatibility with existing MongoDB application code, and lets you continue working with familiar MongoDB tools.

Key-Value Data Model (Table API)

You can also model your data as a key-value store, using the Table API. This API is actually the evolution of Azure Table Storage – one of the very first NoSQL databases available on Azure. In fact, all existing Azure Table Storage customers will eventually be migrated over to Cosmos DB and the Table API.

With this data model, each entity consists of a key and a value pair, but the value itself is a set of key-value pairs. So this is nothing like a table in a relational database, where each row has the same columns; with the Table API in Cosmo DB, each entity’s value can have a different set of key-value pairs.

The Table API appeals primarily to existing Azure Table Storage customers, because it emulates the Azure Table Storage API. Using this API, existing Azure Table Storage apps can be migrated quickly and easily to Cosmos DB. For a new project though, there would be little reason to ever consider using the Table API, when you consider the fact that the SQL API is far more capable than the Table API.

So then, when would you actually choose to use the Table API? Well, again, the primary use case is to migrate an existing Azure Table Storage account over to Cosmos DB, without having to change any code in your applications. Remember that Microsoft is planning to do this for every customer over a long term migration, but there’s no reason to wait for them to do that, if you don’t want to wait. You can migrate the data yourself now, and then immediately start enjoying the benefits of Cosmos DB as a back-end, and you don’t have to make any changes whatsoever to your existing Azure Table Storage applications. You just change the connection string to point to Cosmos DB, and the application continues to work seamlessly against the Cosmos DB Table API.

Graph Data Model (Gremlin API)

You can also choose the Gremlin API, which gives you a graph database based on the Apache Tinkerpop open source project. Graph databases are becoming increasingly popular in the NoSQL world.

What do you put in a graph? One of two things; either a vertex or an edge. Now don’t let these terms intimidate you. They’re just fancy words for entities and relationships, respectively. So a vertex is an entity, and an edge is a one-way relationship between any two vertices. And that’s it – nothing more and nothing less. These are the building blocks of any graph database. And whether you’re storing a vertex or an edge, you can attach any number of arbitrary properties to it; much like the arbitrary key-value pairs you can define for a row using the Table API, or a flat JSON document using the SQL API.

The Gremlin API provides a succinct graph traversal language that enables you to efficiently query across the many relationships that exist in a graph database. For example, in a social networking application, you could easily find a user, then look for all of that user’s posts where the location is NY, and of those, find all the relationships where some other user has commented on or liked those posts.

Columnar (Cassandra API)

There’s a fourth option for choosing a data model, and that’s columnar, using the Cassandra API. Columnar is yet another way of modeling your data, where – in a departure from the typical way of dealing with schema-free data in the NoSQL world – you actually do define the schema of your data up-front. However, data is still stored physically in a column-oriented fashion, so it’s still OK to have sparse columns, and it has good support for aggregations. Columnar is somewhat similar to the Key-Value data model with the Table API, except that every item in the container is required to adhere to the schema defined for the container. And in that sense, columnar is really most similar to columnstore in SQL Server, except of course that it is implemented using a NoSQL architecture, so it’s distributed and partitioned to massively scale out big data.

Atom Record Sequence (ARS)

The fact of the matter is, these APIs merely project your data as different data models; whereas internally, your data is always stored as ARS – or Atom Record Sequence – a Microsoft creation that defines the persistence layer for key-value pairs. Now you don’t need to know anything about ARS; you don’t even need to know that it’s there. But it is there, under the covers, storing all your data as key-value pairs in a manner that’s agnostic to the data-model you’ve chosen to work with.

Because at the end of the day, it’s all just keys and values – not just the key-value data model, but all these data models. They’re all some form of keys and values. A JSON or BSON document is a collection of keys and values, where values can either be simple values, or they can contain nested key-value pairs. The key-value model is clearly based on keys and values, but so are graph and columnar. The relationships you define in a graph database are expressed as annotations that are, themselves key-value pairs, and certainly all the columns defined for a columnar data model can be viewed as key-value pairs as well.

So, these API’s are here to broaden your choices in terms of how you get to treat your data; they bear no consequence on the ability to scale your database. For example, if you want to be able to write SQL queries, you would choose the SQL API, and not the Table API. But if you want MongoDB or Azure Table Storage compatibility, then you’d go with the MongoDB or Table API respectively.

Switching Between Data Models

As you’ve seen, when you choose an API, you are also choosing a data model. And today (since the release of Cosmos DB in May 2017), you choose an API when you create a Cosmos DB account. This means that today, a Cosmos DB account is tied to one API, which ties it to one data model:

MM2

But again, each data model is merely a projection of the same underlying ARS format, and so eventually you will be able to create a single account, and then switch freely between different APIs within the account. So that then, you’ll be able to access one database as graph, key-value, document, or columnar, all at once, if you wish:

MM3

You can also expect to see additional APIs in the future, as Cosmos DB broadens its compatibility support for other database systems. This will enable a an even wider range of developers to stick with their database of choice, while leveraging Cosmos DB as a back end for horizontal partitioning, provisioned throughput, global distribution, and automatic indexing.

Summary

Azure Cosmos DB has multiple APIs and supports multiple data models. In this blog post, we explored the multi-API, multi-model capabilities of Cosmos DB, including the document data model with either the SQL or MongoDB APIs, key-value with the Table API, graph with the Gremlin API, and columnar with the Cassandra API.

Regardless of which data model you choose, however, Cosmos DB stores everything in ARS, and merely projects different data models, based on the different APIs. This provides developers with a wide range of choices for how they’d like to model their data, without making any compromises in scale, partitioning, throughput, indexing, or global distribution.

Advertisements

Learning Azure Cosmos DB

Hey everyone!

I’m extremely delighted to announce the launch of my brand new Pluralsight course on Azure Cosmos DB.

This is a 10-module course that targets developers – and that’s you!

The course should be live on Pluralsight in the next few days. If you’re interested in building next-generation global apps for the web and mobile, just join me for about 6 hours, and get up to speed in a heartbeat!

Here’s what the course delivers:

Part 1: Introduction and Overview

We begin by defining NoSQL, explaining big data, and describing the characteristics of a NoSQL Database. Then we introduce Cosmos DB with a high-level discussion around global distribution, server-side partitioning, and its multi-model capabilities (with support for documents, tables, and graphs). We also introduce the local emulator, which lets you develop for Cosmos DB without an internet connection or Azure subscription. This module has several demos showing how to create an Azure Cosmos DB account, how to download and install the local emulator, and how to use the SQL API to create a collection, populate it with documents, and query for documents using Cosmos DB SQL.

Part 2: Tuning for Throughput and Performance

This module explains how to provision throughput in Cosmos DB. We begin by learning about request units, or RUs. With RUs, you don’t worry about hardware concerns, like CPU, memory, or I/O. Instead, Cosmos DB delivers predictable performance by letting you reserve as many RUs per second that your application needs, and will throttle requests if you start to exceed to throughput limits that you reserve. The portal has many customizable charts and graphs that let you monitor the request unit consumption of your database, and you can whiteboard the cost to estimate your throughput needs using the online request unit calculator. The module concludes with an explanation of how Cosmos DB pricing is divided between storage and throughput.

Part 3: Horizontal Partitioning

This module discusses how Cosmos DB achieves elastic scale via server-side horizontal partitioning. We explain how containers are logical resources that Cosmos DB manages by creating and replicating multiple physical partitions to scale both storage and throughput. For best results, we discuss the considerations to keep in mind when choosing a partition key, which groups related items together for queries, and supports updating multiple items in a transaction using stored procedures. The module concludes by showing how to enable cross-partition queries, where Cosmos DB automatically fans out the query to multiple partitions, and aggregates each partition’s results into a single result for the query.

Part 4: Globally Distributing Data

In this module, we explore the geo-replication capabilities in Cosmos DB, also called turnkey global distribution. In addition to local replication within one Azure data center, Cosmos DB can replicate your database to any number of other Azure data centers in regions located throughout the world. This brings your data closer to your users, giving them low latency and high availability wherever they are located. You will see how easy it is to enable global distribution, and configure your application with a failover sequence of preferred data. We wrap up the module by discussing consistency, and explaining the five consistency levels that let you balance latency and availability with stale reads that occur when querying replicas that are not up to date.

Part 5: Data Modeling for the SQL API

Using the SQL API, you model items in the database as JSON documents. This module starts by calling out some of the key differences in data modeling between a traditional relational database platform (like SQL Server) and a JSON document database (like Cosmos DB with the SQL API). We then examine the special properties, including the resource ID and self-link properties that uniquely identify each resource, and the URI factory which simplifies the task of constructing the self-link to any resource in the database. The module concludes by showing how the Data Migration Tool can be used to transform and import data from SQL Server to Cosmos DB.

Part 6: Querying Documents

This module explores the special version of SQL used by Cosmos DB for querying JSON documents in a collection
with the SQL API. We explain how the query language is rooted in JSON and JavaScript semantics, and then
dive into numerous demos that show how to filter, sort, iterate arrays, and perform intra-document joins,
using the many available operators and built-in math, type checking, string, array, aggregate, and spatial
functions provided by Cosmos DB SQL.

Part 7: Programming with the .NET SDK

This module shows how to use the .NET SDK with the SQL API to build client applications for Cosmos DB. We cover how to use the SDK for working with databases and collections, creating and querying documents, indexing, and resource tokens based on users and permissions.

Part 8: Programming the Cosmos DB Server

This module teaches you how to write server-side code that runs within the Cosmos DB service, including stored procedures, triggers, and user-defined functions. We also show you how to cope with throttled requests, and how to implement a continuation model for long-running processes.

Part 9: Using the Table API

This module explains the Table API in Cosmos DB, which provides 100% compatibility with the original Azure Table Storage NoSQL service, but adds predictable reserved throughput, and automatic indexing. We discuss the key-value data model used by the Table API, explain when it makes sense to use this API and when it doesn’t, and show how to migrate an existing Azure Table Storage account and application to Cosmos DB and the Table API.

Part 10: Using the Gremlin API

This module explains the Gremlin API, which provides a graph data model over Cosmos DB. This API is based on the Apache TinkerPop open source project, and defines a set of steps that can be used to populate a graph with vertices (entities) and edges (relationships). After populating the graph, you can run Gremlin queries that traverse the many relationships defined in the database. We’ll create several demos, including a simple graph, airport model, and a multi-model comic book catalog.

I hope this has whet your appetite to learn more about Cosmos DB!