Google Cloud Platform Blog: Introduction to data models in Cloud Datastore

Introduction to data models in Cloud Datastore

Tuesday, August 4, 2015

Non-relational databases, like Google Cloud Datastore, present benefits in terms of ease of management and scale, but can also introduce unique challenges for modeling and storing data. Many of the best practices in relational databases, such as joins and normalization, are discouraged as they can be unscalable for non-relational data. The common metaphor of viewing a relational database as objects and relationships is difficult to apply to a non-relational database.

This blog post explores a different way of modeling non-relational data. Instead of thinking in terms of relationships, think of paths and entities, which are conceptually similar to a file system. This metaphor reveals some interesting solutions to common data-modeling problems.

Today, we'll explore two example situations for how to organize data as paths: multi-user blogs and wikis.

The following examples use the idea of representing data as paths and file systems as a way to understand data design for Cloud Datastore, not as an implementation of an actual file system. If you’re interested in a scalable file-storage solution, see Google Cloud Storage.

Creating a Cloud Datastore data model

Consider the data model for a multi-user blog. This application can have millions of users, each of whom can create thousands of small posts. Modeling the data as User and Post objects in a traditional relational database looks like this:
User

Id	Username	Name
1	tonystark	Tony Stark
2	dianaprince	Diana Prince

Post

id	user_id	content
1	1	Hello, world!
2	2	Another post

You would query for all of the posts written by a particular user using the following SQL:
SELECT * FROM Post WHERE user_id = {user_id}
In Cloud Datastore, there are no tables; instead, each entity has a particular kind. Directly converting the two tables into two kinds, Post and User, gives the following data in Cloud Datastore:

Key	Data
(Post, 1)	{"user": Key(User, tonystark), "content": "Hello, World!"}
(Post, 2)	{"user": Key(User, dianaprince), "content": "Another post"}
(User, tonystark)	{"name": "Tony Stark"}
(User, dianaprince)	{"name": "Diana Prince"}

Querying the database

We can then query for all of the posts written by a particular user by filtering the entire Post kind by username.
datastore.Query(kind='Post', filters=[('user', '=', Key('User', username))])

Global queries, like the one above, are eventually consistent in Cloud Datastore.
Most non-relational databases are optimized for high scalability and high availability. The trade-off for this performance is that some types of queries are eventually consistent. This means there may be a delay between when data is added or updated in the database and when it is returned in a query. This is in contrast to the strong consistency you are familiar with from relational databases, where updated data is immediately available.
In this example, eventually consistent means that when a user creates a new post they may not see it immediately on their home page. For some applications, eventual consistency is acceptable, but in a multi-user blog, users want to see and verify their updates immediately.
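To make the delay concrete, here is a toy in-memory model. It is not how Cloud Datastore is implemented; it only assumes a store where lookups by key read the entity directly, while queries read from an index that is updated in a separate step. The class name `ToyStore` and all of its methods are hypothetical.

```python
class ToyStore:
    """Toy model: key lookups are immediate, queries read a lagging index."""

    def __init__(self):
        self.entities = {}   # key -> data, updated on every put
        self.index = {}      # what queries can see, updated later
        self.pending = []    # writes not yet visible to queries

    def put(self, key, data):
        self.entities[key] = data
        self.pending.append((key, data))

    def get(self, key):
        # A lookup by key always sees the latest write.
        return self.entities.get(key)

    def query(self, kind, user):
        # Global query: scans the (possibly stale) index.
        return [data for (k, _id), data in self.index.items()
                if k == kind and data.get("user") == user]

    def catch_up(self):
        # Simulates the index eventually receiving the pending writes.
        for key, data in self.pending:
            self.index[key] = data
        self.pending.clear()


store = ToyStore()
store.put(("Post", 1), {"user": "tonystark", "content": "Hello, World!"})

before = store.query("Post", "tonystark")   # index has not caught up yet
store.catch_up()
after = store.query("Post", "tonystark")    # the post is now visible
```

In this sketch, `before` is empty even though `store.get(("Post", 1))` already returns the post, which mirrors a user not seeing a new post on their home page right away.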
Data modeling for strong consistency

When you need to read data immediately after a write, Cloud Datastore lets you choose between strong and eventual consistency, depending on how you organize your data. You could think of a kind as a table and the key name as the primary key.
However, the key can have ancestors that create a key path. Instead of thinking of non-relational data in terms of tables and keys, imagine it as a file system. Representing the non-relational data as a file system is as follows:
/1.post
/2.post
/tonystark.user
/dianaprince.user

You can re-organize the data, grouping it by user:

/tonystark.user
    /1.post
/dianaprince.user
    /2.post

In this model, the conceptual path to the first post is /tonystark.user/1.post. In Cloud Datastore the data is now organized as follows:

Key	Data
(User, tonystark)	{"name": "Tony Stark"}
(User, tonystark, Post, 1)	{"content": "Hello, World!"}
(User, dianaprince)	{"name": "Diana Prince"}
(User, dianaprince, Post, 2)	{"content": "Another post"}
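
The mapping from a file-system-style path to a key path can be sketched in a few lines of Python. This is an illustration, not part of any Cloud Datastore client library: the function name `path_to_flat_key`, the tuple representation of keys, and the capitalization of kinds are assumptions made here to match the table above. Names and IDs are kept as strings.

```python
def path_to_flat_key(path):
    """Convert '/tonystark.user/1.post' to ('User', 'tonystark', 'Post', '1')."""
    flat = []
    for segment in path.strip("/").split("/"):
        # Each segment is '<name>.<kind>'; split on the last dot so that
        # names containing dots still parse.
        name, _, kind = segment.rpartition(".")
        if not name or not kind:
            raise ValueError("expected '<name>.<kind>', got %r" % segment)
        flat.extend([kind.capitalize(), name])
    return tuple(flat)


key = path_to_flat_key("/tonystark.user/1.post")
parent = key[:-2]   # dropping the last (kind, name) pair gives the ancestor
```

Representing a key as a flat tuple makes the ancestor relationship a simple prefix check: every key under a user starts with that user's `(kind, name)` pair.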

Organizing the posts by user creates an entity group that contains the user’s profile and all of their posts. In Cloud Datastore, queries over a single entity group are strongly consistent. Now when a user creates a post, they will see it immediately. You now perform the query by ancestor key.
datastore.Query(kind='Post', ancestor=Key('User', username))

However, grouping the user profile and posts into an entity group trades write throughput for consistency. In Cloud Datastore, each entity group can only be updated about once per second, but all reads within it are strongly consistent. This means each individual user can only post about once per second, but every user sees a strongly consistent view of their own posts.
Note: The once-per-second limitation only applies to a single user. Multiple users can post simultaneously because each updates a different entity group.
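The per-group nature of this limit can be illustrated with a toy model. The real service does not expose a limiter like this one, and the hard rejection below is a simplification of the "about once per second" guideline stated above. The class `EntityGroupWriteLimiter`, its methods, and the tuple keys are all hypothetical.

```python
class EntityGroupWriteLimiter:
    """Toy model of 'about one write per second per entity group'."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self.last_write = {}   # entity group root -> time of last accepted write

    @staticmethod
    def entity_group(key):
        # The root of a key path is its first (kind, name) pair.
        return key[:2]

    def try_write(self, key, now):
        group = self.entity_group(key)
        last = self.last_write.get(group)
        if last is not None and now - last < self.min_interval:
            return False   # same group written too recently
        self.last_write[group] = now
        return True


limiter = EntityGroupWriteLimiter()
results = [
    limiter.try_write(("User", "tonystark", "Post", 1), now=0.0),
    limiter.try_write(("User", "dianaprince", "Post", 2), now=0.1),  # other group
    limiter.try_write(("User", "tonystark", "Post", 3), now=0.5),    # too soon
    limiter.try_write(("User", "tonystark", "Post", 4), now=1.5),    # allowed again
]
```

Only the third write is refused in this sketch, because it targets the same entity group as the first write within the same second; the second write goes to a different user's group and is unaffected.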
This design also allows for eventually consistent queries that return all posts submitted by all users. You could use this type of query, for example, to display a stream of new posts.
datastore.Query(kind='Post')

Eventual consistency is typically acceptable for this use case. Users may not see their post in the “all posts” query immediately, but will see their post immediately in the “my posts” query. And it’s unlikely that a user will submit more than one post per second.
In this application, organizing a user’s posts under the same entity group forms a natural boundary. Many applications demonstrate a natural boundary you can use to create entity groups. The trade-off between write throughput and read consistency is one of the key factors in deciding how to organize your data in Cloud Datastore.
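The difference between the "my posts" and "all posts" queries can be sketched over plain Python data. This is an in-memory illustration that only shows which entities each query selects; it says nothing about consistency and does not use the Cloud Datastore client. The `query` function and the tuple keys are hypothetical.

```python
# Hypothetical in-memory data using the grouped layout from the table above.
entities = {
    ("User", "tonystark"): {"name": "Tony Stark"},
    ("User", "tonystark", "Post", 1): {"content": "Hello, World!"},
    ("User", "dianaprince"): {"name": "Diana Prince"},
    ("User", "dianaprince", "Post", 2): {"content": "Another post"},
}


def query(entities, kind, ancestor=None):
    """Return (key, data) pairs of the given kind, optionally under an ancestor."""
    results = []
    for key, data in entities.items():
        if key[-2] != kind:
            continue   # an entity's kind is the second-to-last key element
        if ancestor is not None and key[:len(ancestor)] != ancestor:
            continue   # key path does not start with the ancestor key
        results.append((key, data))
    return results


my_posts = query(entities, "Post", ancestor=("User", "tonystark"))
all_posts = query(entities, "Post")
```

The ancestor query touches a single entity group, while the kind-only query spans every group, which is why only the former can be strongly consistent in the model described above.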
Data hierarchy with ancestor keys and transactions

Let’s say we’re building a wiki app. Each time a page is saved, a new revision is created. Users can restore a page from any revision.
Representing this data as a file system is as follows:
/home.page
    /current.revision
    /05-29-2015-10-30-27.revision
    /05-20-2015-06-33-11.revision
/another.page
    /current.revision
    /04-10-2015-11-23-10.revision

This structure stores all revisions of each page in a separate entity group.
Note: A key path can contain keys that refer to ancestors that do not exist as separate entities. In this example, this means that a revision entity can exist even though the page entity it refers to does not (e.g., /another.page). A page doesn’t have any properties other than its current content and revisions, so there’s no need to create an actual page entity.
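A short sketch shows what it means for an ancestor to exist only as part of a key path. The dictionary below is a hypothetical in-memory stand-in for the stored entities, with keys written as flat tuples; it is not the Cloud Datastore API.

```python
# Hypothetical in-memory store: flat key path tuple -> entity data.
store = {
    ("page", "another", "revision", "current"): {"content": "Draft"},
    ("page", "another", "revision", "04-10-2015-11-23-10"): {"content": "Old"},
}

page_key = ("page", "another")

# No entity is stored under the page key itself...
page_entity_exists = page_key in store

# ...yet the key still works as an ancestor prefix for its revisions.
revisions = [key for key in store if key[:len(page_key)] == page_key]
```

Here `page_entity_exists` is `False`, but both revisions are still found through the page key.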
To save a page, your application copies the current page data into a new revision, and overwrites the current.revision with the new content. You can use Cloud Datastore transactions to ensure that all of the steps of saving a page successfully complete, or that the whole process fails. You can create a transaction across up to 25 entity groups, though the more entity groups involved in a transaction, the greater the chance that the transaction will fail due to contention. In the example of a wiki, you only need to update a single entity group when you save a page.
The following Python code uses a transaction to save a page, raising an error if the transaction fails for any reason, such as attempting to save the page more than once a second.
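import datetime

# `ds` is assumed to be a Cloud Datastore client and `datastore` the client
# library module. `path_to_key` is a helper (not shown here) that converts a
# path string into a Cloud Datastore key.
def save_page(ds, page, content):
    with ds.transaction():
        now = datetime.datetime.utcnow()
        current_key = path_to_key(ds, '{}.page/current.revision'.format(page))
        revision_key = path_to_key(ds, '{}.page/{}.revision'.format(page, now))

        # Refuse to overwrite a revision that already has this timestamp.
        if ds.get(revision_key):
            raise AssertionError("Revision %s already exists" % revision_key)

        current = ds.get(current_key)

        if current:
            # Archive the existing content as a new revision.
            revision = datastore.Entity(key=revision_key)
            revision.update(current)
            ds.put(revision)
        else:
            # First save of this page: there is nothing to archive yet.
            current = datastore.Entity(key=current_key)

        current['content'] = content

        ds.put(current)

As with the blog, you use an ancestor query to list all of the revisions associated with a page.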
datastore.Query(kind='revision', ancestor=Key('page', 'home'))
Restoring a revision is nearly the same as saving a page. Instead of using newly submitted content, however, you specify the revision’s content.
The following Python code demonstrates how to restore a revision.
def restore_revision(ds, page, revision):
    save_page(ds, page, revision['content'])
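The save and restore logic above can be exercised without a Datastore connection by swapping in an in-memory stand-in. The sketch below is an assumption-laden illustration: `FakeDatastore` is a hypothetical class, keys are plain tuples rather than real Datastore keys, and the timestamp is passed in explicitly so the behavior is deterministic. Its transaction simply restores a snapshot if the block raises.

```python
import contextlib
import copy


class FakeDatastore:
    """Hypothetical in-memory stand-in: a dict with all-or-nothing transactions."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return copy.deepcopy(self.data.get(key))

    def put(self, key, entity):
        self.data[key] = copy.deepcopy(entity)

    @contextlib.contextmanager
    def transaction(self):
        snapshot = copy.deepcopy(self.data)
        try:
            yield
        except Exception:
            self.data = snapshot   # undo every write made inside the block
            raise


def save_page(ds, page, content, timestamp):
    with ds.transaction():
        current_key = ("page", page, "revision", "current")
        revision_key = ("page", page, "revision", timestamp)
        if ds.get(revision_key) is not None:
            raise AssertionError("Revision %r already exists" % (revision_key,))
        current = ds.get(current_key)
        if current is not None:
            ds.put(revision_key, current)   # archive the old content first
        ds.put(current_key, {"content": content})


def restore_revision(ds, page, revision, timestamp):
    save_page(ds, page, revision["content"], timestamp)


ds = FakeDatastore()
save_page(ds, "home", "v1", "t1")
save_page(ds, "home", "v2", "t2")            # archives "v1" under "t2"
restore_revision(ds, "home", ds.get(("page", "home", "revision", "t2")), "t3")

try:
    save_page(ds, "home", "v4", "t3")        # "t3" is taken, so nothing is written
except AssertionError:
    pass

current = ds.get(("page", "home", "revision", "current"))
```

After these calls the current revision holds the restored "v1" content, the "t3" revision holds the "v2" content that was current before the restore, and the failed save leaves the store unchanged.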
Just as the blog example had a natural boundary between users, the wiki has a natural boundary between pages.
This design creates the following results:

A page can be updated no more than once per second.

Queries on a page and its revisions are strongly consistent.

Because the save operation uses strongly consistent queries, you can save a page or restore a revision in a transaction.

Reading and writing pages remains fast even if the number of pages becomes exceedingly large.

In a collaborative authoring application, transactions are important to prevent data loss while creating a revision, or to prevent revisions created by multiple users from overwriting each other.
Learn More

It’s often easier to approach storing your data in Cloud Datastore as an organization problem instead of a modeling problem. By thinking in terms of a file system, you gain insight into how to organize and manipulate non-relational data. Look for natural boundaries that you can use to organize your data for strong consistency and transactionality. To learn more about the topics discussed in this blog, check out these resources: