Google Cloud Platform Blog
Product updates, customer stories, and tips and tricks on Google Cloud Platform
Introduction to data models in Cloud Datastore
Tuesday, August 4, 2015
Non-relational databases, like Google Cloud Datastore, present benefits in terms of ease of management and scale, but can also introduce unique challenges for modeling and storing data. Many of the best practices in relational databases, such as joins and normalization, are discouraged as they can be unscalable for non-relational data. The common metaphor of viewing a relational database as objects and relationships is difficult to apply to a non-relational database.
This blog post explores a different way of modeling non-relational data. Instead of thinking in terms of relationships, think of paths and entities, which are conceptually similar to a file system. This metaphor reveals some interesting solutions to common data-modeling problems.
Today, we'll explore two example situations for how to organize data as paths: multi-user blogs and wikis.
The following examples use the idea of representing data as paths and file systems as a way to understand data design for Cloud Datastore, not as an implementation of an actual file system. If you’re interested in a scalable file-storage solution, see Google Cloud Storage.
Creating a Cloud Datastore data model
Consider the data model for a multi-user blog. This application can have millions of users, each of whom can create thousands of small posts. Modeled as User and Post objects in a traditional relational database, the data looks as follows:
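User

| id | username    | name         |
|----|-------------|--------------|
| 1  | tonystark   | Tony Stark   |
| 2  | dianaprince | Diana Prince |

Post

| id | user_id | content       |
|----|---------|---------------|
| 1  | 1       | Hello, world! |
| 2  | 2       | Another post  |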
You would query for all of the posts written by a particular user using the following SQL:
SELECT * FROM Post WHERE user_id = {user_id}
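As a point of comparison, the relational version can be tried out with Python's built-in sqlite3 module. This is only a sketch of the baseline: the in-memory database and the `posts_by_user` helper are assumptions made for illustration, and the query uses a bound parameter in place of the `{user_id}` placeholder.

```python
import sqlite3

# In-memory database, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE User (id INTEGER PRIMARY KEY, username TEXT, name TEXT);
    CREATE TABLE Post (id INTEGER PRIMARY KEY, user_id INTEGER, content TEXT);
    INSERT INTO User VALUES (1, 'tonystark', 'Tony Stark');
    INSERT INTO User VALUES (2, 'dianaprince', 'Diana Prince');
    INSERT INTO Post VALUES (1, 1, 'Hello, world!');
    INSERT INTO Post VALUES (2, 2, 'Another post');
""")


def posts_by_user(user_id):
    """Return the content of every post written by the given user."""
    rows = conn.execute(
        "SELECT content FROM Post WHERE user_id = ? ORDER BY id", (user_id,)
    )
    return [content for (content,) in rows]


tony_posts = posts_by_user(1)
```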
In Cloud Datastore, there are no tables. Instead, data entities have a particular kind. If you convert the tables directly into two kinds, Post and User, the data looks as follows in Cloud Datastore:
| Key                 | Data                                                        |
|---------------------|-------------------------------------------------------------|
| (Post, 1)           | {"user": Key(User, tonystark), "content": "Hello, World!"}  |
| (Post, 2)           | {"user": Key(User, dianaprince), "content": "Another post"} |
| (User, tonystark)   | {"name": "Tony Stark"}                                      |
| (User, dianaprince) | {"name": "Diana Prince"}                                    |
Querying the database
We can then query for all of the posts written by a particular user by filtering the entire Post kind by username.
datastore.Query(kind='Post', filters=[('user', '=', Key('User', username))])
Global queries, like the one above, are eventually consistent in Cloud Datastore.
Most non-relational databases are optimized for high scalability and high availability. The trade-off for this performance is that some types of queries are eventually consistent. This means there may be a delay between when data is added or updated in the database and when it is returned in a query. This is in contrast to the strong consistency you are familiar with from relational databases, where updated data is immediately available.
In this example, eventually consistent means that when a user creates a new post they may not see it immediately on their home page. For some applications, eventual consistency is acceptable, but in a multi-user blog, users want to see and verify their updates immediately.
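The delay described above can be illustrated with a toy model. The sketch below is not how Cloud Datastore is implemented; `ToyStore` and its methods are hypothetical names, and the only behavior it imitates is that a global query reads from an index that lags behind the writes.

```python
class ToyStore:
    """Toy model: writes land at once, the global index catches up later."""

    def __init__(self):
        self.entities = {}         # key -> properties (always current)
        self.global_index = set()  # keys visible to global queries
        self.pending = []          # writes not yet applied to the index

    def put(self, key, properties):
        self.entities[key] = properties
        self.pending.append(key)   # the index update is deferred

    def get(self, key):
        return self.entities.get(key)

    def global_query(self, kind):
        return sorted(k for k in self.global_index if k[0] == kind)

    def catch_up(self):
        """Simulate the index eventually catching up with the writes."""
        self.global_index.update(self.pending)
        self.pending.clear()


store = ToyStore()
store.put(("Post", 1), {"content": "Hello, World!"})
before = store.global_query("Post")  # the new post is not visible yet
store.catch_up()
after = store.global_query("Post")   # now the query returns it
```

In this model, the author of post 1 who loads a home page built from `global_query` between `put` and `catch_up` would not see the new post, which is the situation the blog example wants to avoid.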
Data modeling for strong consistency
With Cloud Datastore, you can choose between strong and eventual consistency depending on how you organize your data. This choice matters when you need to read data immediately after a write. You could think of a kind as a table and the key name as the primary key.
However, the key can have ancestors that create a key path. Instead of thinking of non-relational data in terms of tables and keys, imagine it as a file system.
Represented as a file system, the non-relational data looks as follows:
/1.post
/2.post
/tonystark.user
/dianaprince.user
You can re-organize the data, grouping it by user:
/tonystark.user
    /1.post
/dianaprince.user
    /2.post
In this model, the conceptual path to the first post is /tonystark.user/1.post.
In Cloud Datastore the data is now organized as follows:
| Key                          | Data                         |
|------------------------------|------------------------------|
| (User, tonystark)            | {"name": "Tony Stark"}       |
| (User, tonystark, Post, 1)   | {"content": "Hello, World!"} |
| (User, dianaprince)          | {"name": "Diana Prince"}     |
| (User, dianaprince, Post, 2) | {"content": "Another post"}  |
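The mapping from a file-system-style path to a key path like the ones in the table can be sketched in a few lines. The function name `path_to_flat_key`, the capitalized kind names, and the rule that numeric names become integer IDs are assumptions made for this illustration, not part of the original sample code.

```python
def path_to_flat_key(path):
    """Turn '/tonystark.user/1.post' into ('User', 'tonystark', 'Post', 1)."""
    flat = []
    for segment in path.strip("/").split("/"):
        # Split on the last dot: the name comes first, the kind is the suffix.
        name, _, kind = segment.rpartition(".")
        # Assumption: purely numeric names are treated as integer IDs.
        identifier = int(name) if name.isdigit() else name
        flat.extend([kind.capitalize(), identifier])
    return tuple(flat)


first_post_key = path_to_flat_key("/tonystark.user/1.post")
```

With the Python client library, a tuple like this could presumably be unpacked into the client's key factory (for example `client.key(*first_post_key)`), but that wiring is left out here so the sketch runs without any dependencies.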
Organizing the posts by user creates an entity group that contains the user’s profile and all of their posts. In Cloud Datastore, queries over a single entity group are strongly consistent, so when a user creates a post, they will see it immediately. You now perform the query by ancestor key.
datastore.Query(kind='Post', ancestor=Key('User', username))
Grouping the user profile and posts into an entity group, however, trades write throughput for consistency. In Cloud Datastore, each entity group can only be updated about once per second, but all reads within the group are strongly consistent. This means each individual user can only post about once per second, but all users see a strongly consistent view of their own posts.
Note: The once-per-second limitation applies to each user individually. Multiple users can post simultaneously because each updates a different entity group.
This design also allows for eventually consistent queries that return all posts submitted by all users. You could use this type of query, for example, to display a stream of new posts.
datastore.Query(kind='Post')
Eventual consistency is typically acceptable for this use case. Users may not see their post in the “all posts” query immediately, but will see their post immediately in the “my posts” query. And it’s unlikely that a user will submit more than one post per second.
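The difference between the “my posts” and “all posts” queries comes down to whether the query is restricted to one key-path prefix. The following in-memory sketch makes that concrete; the `ENTITIES` dictionary and the `query` function are stand-ins invented for this illustration, and they say nothing about consistency, only about which keys each query selects.

```python
# Entities keyed by flat key paths, mirroring the grouped layout above.
ENTITIES = {
    ("User", "tonystark"): {"name": "Tony Stark"},
    ("User", "tonystark", "Post", 1): {"content": "Hello, World!"},
    ("User", "dianaprince"): {"name": "Diana Prince"},
    ("User", "dianaprince", "Post", 2): {"content": "Another post"},
}


def query(kind, ancestor=None):
    """Return keys of the given kind, optionally limited to one ancestor."""
    results = []
    for key in ENTITIES:
        if key[-2] != kind:
            continue  # the kind is the second-to-last element of the path
        if ancestor is not None and key[:len(ancestor)] != ancestor:
            continue  # the key is outside the requested entity group
        results.append(key)
    return results


my_posts = query("Post", ancestor=("User", "tonystark"))
all_posts = query("Post")
```

Here `my_posts` touches a single entity group, which is the case Cloud Datastore serves with strong consistency, while `all_posts` spans every group.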
In this application, organizing a user’s posts under the same entity group forms a natural boundary. Many applications demonstrate a natural boundary you can use to create entity groups. The trade-off between write throughput and read consistency is one of the key factors in deciding how to organize your data in Cloud Datastore.
Data hierarchy with ancestor keys and transactions
Let’s say we’re building a wiki app. Each time a page is saved, a new revision is created. Users can restore a page from any revision.
Representing this data as a file system is as follows:
/home.page
    /current.revision
    /05-29-2015-10-30-27.revision
    /05-20-2015-06-33-11.revision
/another.page
    /current.revision
    /04-10-2015-11-23-10.revision
This structure stores all revisions of each page in a separate entity group.
Note: A key path can contain keys that refer to ancestors that do not exist as separate entities. In this example, this means that a revision entity can exist even though the page entity it refers to does not (e.g., /another.page). A page doesn’t have any properties other than its current content and revisions, so there’s no need to create an actual page entity.
To save a page, your application copies the current page data into a new revision, and overwrites the current.revision with the new content. You can use Cloud Datastore transactions to ensure that either all of the steps of saving a page complete successfully or the whole process fails. A transaction can span up to 25 entity groups, though the more entity groups involved in a transaction, the greater the chance that the transaction will fail due to contention. In the wiki example, you only need to update a single entity group when you save a page.
The following Python code uses a transaction to save a page, raising an error if the transaction fails for any reason, such as attempting to save the page more than once a second.
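    import datetime

    # Assumed import: the Cloud Datastore Python client library.
    from google.cloud import datastore


    def save_page(ds, page, content):
        # path_to_key is a helper from the sample code for this post; it
        # converts a file-system-style path into a Cloud Datastore key.
        with ds.transaction():
            now = datetime.datetime.utcnow()
            current_key = path_to_key(ds, '{}.page/current.revision'.format(page))
            revision_key = path_to_key(ds, '{}.page/{}.revision'.format(page, now))

            if ds.get(revision_key):
                raise AssertionError("Revision %s already exists" % revision_key)

            current = ds.get(current_key)

            if current:
                # Archive the existing content as a new revision.
                revision = datastore.Entity(key=revision_key)
                revision.update(current)
                ds.put(revision)
            else:
                # First save of this page: there is nothing to archive yet.
                current = datastore.Entity(key=current_key)

            # Overwrite the current revision with the new content.
            current['content'] = content
            ds.put(current)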
As with the blog, you use an ancestor query to list all of the revisions associated with a page.
datastore.Query(kind='revision', ancestor=Key('page', 'home'))
Restoring a revision is nearly the same as saving a page. Instead of using newly submitted content, however, you specify the revision’s content.
The following Python code demonstrates how to restore a revision.
    def restore_revision(ds, page, revision):
        save_page(ds, page, revision['content'])
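To trace what saving and restoring do to the stored data, the same flow can be run against a plain dictionary. This is a sketch under stated assumptions: the dictionary stands in for Cloud Datastore, the tuple keys and the explicit `now` parameter are invented for the illustration, and there is no transaction, so it shows the order of the steps but none of the atomicity guarantees.

```python
import datetime


def save_page(store, page, content, now):
    """Copy the current content into a revision, then overwrite current."""
    current_key = (page, "current")
    revision_key = (page, now.isoformat())
    if revision_key in store:
        raise AssertionError("Revision %s already exists" % (revision_key,))
    if current_key in store:
        # Archive the existing content before it is replaced.
        store[revision_key] = dict(store[current_key])
    store[current_key] = {"content": content}


def restore_revision(store, page, revision, now):
    """Restoring is a save that reuses the content of an old revision."""
    save_page(store, page, revision["content"], now)


store = {}
t0 = datetime.datetime(2015, 5, 20, 6, 33, 11)
t1 = datetime.datetime(2015, 5, 29, 10, 30, 27)
t2 = datetime.datetime(2015, 5, 30, 9, 0, 0)

save_page(store, "home", "first draft", t0)   # nothing to archive yet
save_page(store, "home", "second draft", t1)  # archives "first draft"
old_revision = store[("home", t1.isoformat())]
restore_revision(store, "home", old_revision, t2)
```

After these calls the current content is "first draft" again, and the store holds two archived revisions alongside it, so no version of the page has been lost.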
Just as the blog example had a natural boundary between users, the wiki has a natural boundary between pages.
This design creates the following results:
A page can be updated no more than once per second.
Queries on a page and its revisions are strongly consistent.
Because the save operation uses strongly consistent queries, you can save a page or restore a revision in a transaction.
Reading and writing pages remains fast even if the number of pages becomes exceedingly large.
In a collaborative authoring application, transactions are important to prevent data loss while creating a revision, or to prevent revisions created by multiple users from overwriting each other.
Learn More
It’s often easier to approach storing your data in Cloud Datastore as an organization problem instead of a modeling problem. By thinking in terms of a file system, you gain insight into how to organize and manipulate non-relational data. Look for natural boundaries that you can use to organize your data for strong consistency and transactionality. To learn more about the topics discussed in this blog, check out these resources:
Sample code for this post
Entities, Properties, and Keys in Datastore
Structuring data for strong consistency
Datastore transactions
- Posted by Jon Wayne Parrott, Developer Programs Engineer, Google Cloud Platform