Introducing Azure DocumentDB

On April 8, 2015, Microsoft officially launched Azure DocumentDB, and it certainly can be characterized as a typical NoSQL document database. It is a massively scalable NoSQL document database that works with schema-free JSON documents. Beyond this, however, DocumentDB stands out with some very unique capabilities.

SQL Queries Over Schema-Free JSON

Of course, DocumentDB works with schema-free JSON. But unlike some platforms that require you to define index paths in advance of being able to query on specific properties, DocumentDB automatically indexes every property in a document as soon as the document is added to the database. Simply put, every document is instantly queryable the moment it’s created, and you can search on any property anywhere within the document hierarchy. Furthermore, documents are queryable using SQL, or I should say, using a special flavor of SQL that anyone with SQL experience should immediately find intuitive.

ACID Transactions Updating Multiple Documents

DocumentDB provides a server-side environment inside which you can write JavaScript code to update multiple documents with full transactional processing. This is an easy and powerful way to ensure data consistency across multiple documents, because DocumentDB ensures that all updates made on the server are committed together, or will roll everything back together in the event of an error.

Tunable Performance

There are many ways to tune DocumentDB for the performance needed by your application. For example, throughput can be scaled up or down instantly across three different performance tiers. And although DocumentDB indexes every property on every document, you can take control and fine-tune an indexing policy that reduces storage and processing overhead for specific documents and/or properties that never need to be indexed. And while DocumentDB supports both strong and eventual consistency, it also has two additional options to give you even greater control over the tradeoffs between performance and consistency.

Runs on Azure

Finally, DocumentDB is available as a fully managed, cloud-based, Platform As A Service, running on Azure. There’s just nothing for you to install or manage. No servers, cables, operating systems, or updates to deal with, no replicas to setup – Microsoft does all that work, and keeps the service running. Azure guarantees availability as well as predictable performance based on the service tier that you sign up for. Within literally minutes, you get started working with DocumentDB using just a browser and an Azure subscription.

Stay tuned for upcoming posts, where I’ll dig into all of these exciting capabilities in greater detail.

Advertisements

Relational Databases vs. NoSQL Document Databases

In this post, we’ll take a close look at some of the differences between a traditional relational store and a NoSQL document store.

Rows vs. Documents

To begin with, a document database stores entities as documents – meaning JSON documents, and this is very different to the way relational databases store data as rows in a table.

i1

Columns vs. Properties

While rows in the relational world are made of up columns, documents contain properties.

i2

Schema vs. Schema-Free

In the relational world, every table has a schema that defines the columns and data types that every row in the table must conform to. In contrast, a document database has no defined schema, and every document can be structured differently.

i3

In this example, there are four columns defined for a table, and it would be necessary to alter the table schema if we wanted a fifth column, or if we wanted to change the maximum length of the name column, or if we wanted to allow nulls in date-of-birth, you get the idea. But because document databases as schema-free, they aren’t subject to these constraints. This makes them ideal when you have a rapidly evolving schema, as is usually the case in software development today.

i4

Here, the first document has several properties, one of which is called name. Yet this doesn’t prevent us from adding a second document that uses the property fullName, rather than name, because there are no rules on schema. Similarly, we can add a third document that stores the fullName property an object that has distinct first and last name properties embedded within it. The database is more than happy to accept all these variations. Of course, this is just an example, and it would normally make no sense to mix up property names and shapes like this, just because you can. In this case, we can speculate that – at some point, it was determined that fullName was a better choice than name, and then at some later point, the fullName property was enhanced to distinguish between first and last names. This is a big selling point of document databases, because they let you evolve your schema as your needs dictate, as compared to relational database, where keeping up with an evolving schema is far more disruptive. Of course, when you are supporting varying schemas like this, it falls on you to present a unified view of the data. In this scenario, you could easily devise a query that returns the full name, first, by testing for the presence of a property called name, or a property called fullName if there is no name property, and then by testing if the fullName is an object or not; if it is, then you’d concatenate the embedded first and last name properties, otherwise, you’d just return the fullName property.

Normalized vs. Denormalized

Another significant difference between relational databases and document databases has to do with data normalization. In the relational world, we strive to normalize data as much as possible. This means avoiding data duplication, and maintaining separate tables for each related entity that can be joined together to produce a complete view of the data.

i5

In this example, we have a User table with a row for John, and we have a separate Holdings table with multiple rows for all of John’s stock holding. The two tables are related on User ID, which is 1 for John, so that all individual users are stored in the user table, and each user’s related holdings are stored in the holdings table. Not only must the tables be joined when retrieving data for an application, but the application must know how to take a single modified object with a user and their holdings, and persist changes to the database by updating the user and holding tables separately. This takes significant effort, even if you leverage assistance from an Object-Relational Mapping framework (ORM) like Entity Framework.

In a document database, we typically do the opposite.

i6

Here we see a single document that contains both the user and their holdings. Suddenly, there is no need to join when running a query, nor is there any need to shred the object into different places when saving changes, in fact, no need for an ORM layer at all. This, together with a schema-free model, eliminates a tremendous amount of friction that you normally have with relational database when designing schemas, joining tables, and maintaining an ORM layer on top of your database.

Of course, this isn’t a silver bullet that works well in every scenario. Here we see the one-to-many relationship implemented in a JSON document using an array of embedded objects. This is fine when you expect a reasonable maximum number of child objects, but what about so-called “unbounded” data. For example, imagine a blog post document with an infinite number of related comments. It would not be possible to store a single blog post with all of its comments inside a single document, and this is an example of where you actually might implement a model similar to a relational database, with each blog post in their own document, and each comment in a separate related document that is tied logically on the blog post ID. You can certainly do this, but it will fall on you and your application to store and retrieve blog posts and comments separately, because the document database won’t join them for you. You’re also free to model things however you’d like, so you could create blog post documents that include the first 100 comments, and then related documents that contain the next hundred, and the next hundred, with 100 comments per related document. There’s no one approach to data modeling that works in every scenario, so it’s up to you to experiment with what works best.

Data duplication is another aspect to this. With document databases, it’s not uncommon to duplicate data across multiple documents so that each document has the data it needs without having to locate other documents. Of course, if this is data that frequently changes, then you face another question as to whether it’s better to update multiple documents when a single piece of duplicated data is changed, or to extract the duplicate data out of each document and maintain it in a single shared document. Once again, how you model your data is all up to you. But at the end of the day, it’s important to understand that document databases work best when dealing with rich hierarchical documents that are entirely, or almost entirely, self-contained. Yes, you can model related documents when you need to, but if you find yourself modeling a database that contains many related documents, and/or, your documents have mostly flat structures, then this is a clear sign that a document database may not be the right tool for the job.

Strong Consistency vs. Eventual Consistency

Relational databases also enforce strong consistency on write operations. After updating the balance of a bank account, for example, we must be guaranteed that queries immediately show the updated balance – it’s entirely unacceptable to continue showing the old balance any time after it’s been changed.

One reason that document databases perform as well as they do is because when you write to the database, the changes are propagated to multiple replicas in the background. Then, read requests can be satisfied by any replica, making it possible to satisfy a high volume of client queries just by maintaining enough replicas. However, because not all replicas may be up-to-date at the point in time that a client issues a query, it is possible to receive inconsistent query results. Eventually, of course, all the replicas will be updated, and queries will return consistent results, which is why this behavior is called “eventual consistency.”

Simple vs. Complex

As I had started explaining, NoSQL databases are simple by design, which is the primary reason that they are able to achieve scale and performance that surpasses relational databases. This is also the reason why they do not and cannot replace relational databases that are often better at handling more complex requirements that don’t necessarily need to achieve massive scale.

Scale-Up vs. Scale-Out

And finally, speaking of scale, relational databases simply don’t scale out easily. You can do it, but it’s hard and expensive, whereas scale-out is a fundamental design goal with document databases.

What is a NoSQL Document Database?

In this blog post, we’ll discuss the major concepts around NoSQL document databases. In future posts, I’ll introduced Azure DocumentDB, Microsoft’s newest NoSQL document database, and discuss the major differences between relational databases and document databases.

What is NoSQL?

The best way to start is to clear up some terminology, where the industry has unfortunately adopted a couple of terms that are arguably misleading. SQL means Structured Query Language – meaning it’s just a language; a way of expressing a request to “go find something from someplace, where some condition is true, and give me back the result in the shape that I want it.” And so, SQL per-se doesn’t really define a specific technology. Again, it’s just a language, a dialect, but because SQL is the traditional query language of relational databases, the terms are often equated. So it’s really more helpful to think of a NoSQL database as a “non-relational” database, where – however you go about querying this database – it’s a database that abandons may of the concepts of relational databases. And so we wound up with the term “NoSQL,” where by now many NoSQL databases have emerged, and Azure DocumentDB is the latest NoSQL contender from Microsoft. But unlike most other NoSQL databases, the primary way to query DocumentDB is – oddly enough – by using SQL, or at least, a version of SQL that’s been adapted to the non-relational world of NoSQL databases. I’ll be talking a lot about DocumentDB in upcoming posts.

OK, so NoSQL really means non-relational. Now that’s a really broad definition. Saying it isn’t relational is like saying it’s anything else. And that’s true, which is why in fact there are different types of NoSQL databases. These include key-value stores, such as Azure Table Storage, column based stores like Cassandra, graph databases like Neo4, and document databases like MongoDB and Azure DocumentDB. While there are key differences between these types, all NoSQL database platforms share several common characteristics.

Huge amounts of data

First, they are designed to scale out, not just up. Meaning that while relational databases scale up easily enough, simply by adding more hardware, it’s much more difficult to scale them out horizontally – that is, to spread relational data across multiple partititions – once you hit the ceiling on CPU, disk, and memory, and can no longer scale up. In contrast, NoSQL databases are designed to scale out – infinitely, in fact—making it much easier to achieve internet scale for modern applications.

Schema-free data

Another common characteristic among NoSQL databases is the concept of schema-free data. That is, unlike relational databases, a NoSQL database does not enforce any schema. Every item in the database is free to store information that may or may not be structured the same as other items – even other items of the same type. This means that you can simply introduce new elements in your data as they become pertinent, without requiring any design changes in the database, such as adding or dropping columns, or changing data types. Similarly, you can stop including elements in new data as they start becoming irrelevant, again, without maintaining a schema in the database.

Simplicity Rules

By design, NoSQL databases are simple. They are not nearly as robust as traditional relational database platforms, like SQL Server and Oracle, and there are two reasons for this. For one, NoSQL databases don’t try to provide the complete functionality that is currently available in a relational database. That is, they are specifically designed to be simpler than relational databases, which is how they are able to out-perform relational databases on a large scale. In other cases, NoSQL databases lack features simply because they are much younger than their relational counterparts, which achieved maturity a long time ago. So while you can expect to see improvements in areas of missing functionality as NoSQL databases evolve, these platforms won’t make an attempt to replace full feature set available in relational databases. Despite the negative connotation in the name “NoSQL,” eliminating relational databases in favor of NoSQL is certainly not a stated goal. But at the same time, a scalable, schema-free, and easy-to-use database platform is rather compelling, and gives us more choices for our applications than we had before. Relational databases are definitely here to stay, but they no longer enjoy the monopoly they once had as the back-end platform of choice for new applications, now that a variety of NoSQL alternatives are here.

Document Database

I mentioned that Azure DocumentDB is a NoSQL database, and that it’s a NoSQL document database specifically. This is yet another unfortunate term, because when most people think of a document, they think of a file – like a Word document, spreadsheet, PDF file, and this is definitely not the type of document we mean when we say “document database.” In NoSQL terms, a document is more like an object graph; a complete representation of an entity and its related entities. While object graphs can be serialized and deserialized in any format, NoSQL databases typically leverage JSON, or JavaScript object notation, as the format for storing, projecting, and transporting data. Given the pervasiveness of JSON in today’s world – particularly among web applications – it should really take noone by surprise that NoSQL document databases have generally embraced JSON as their native data format. It’s a simple, lightweight format, yet expressive enough to support many different data modeling scenarios.