Google Cloud Platform Blog
Product updates, customer stories, and tips and tricks on Google Cloud Platform
Geofencing 340 million NYC taxi locations with Google Cloud Dataflow
Friday, December 19, 2014
Posted by Thorsten Schaeff, Sales Engineer Intern
Fun fact: around 170 million taxi journeys take place across New York City every year, and a wealth of information is recorded each time someone steps in and out of one of those bright yellow cabs. How much information exactly? Being a not-so-secret maps enthusiast, I made it my challenge to visualize a NYC taxi dataset on Google Maps.
Anyone who’s tried to put a large number of data points on a map knows the difficulties of working with big geolocation data. That's why I want to share with you how I used Cloud Dataflow to spatially aggregate every single pick-up and drop-off location, with the objective of painting the whole picture on a map. For background info, Google Cloud Dataflow is now in alpha stage and can help you gain insight into large geolocation datasets. You can try experimenting with it by applying for the alpha program, or learn more with yesterday's update.
When I first sat down to think through this data visualization, I knew I needed to create a thematic map, so I built a simple pipeline that was able to geofence all 340 million pick-up and drop-off locations against 342 different polygons that resulted from converting the NYC neighbourhood tabulation areas into single-part polygons. You can find the processed data in this public BigQuery table. (In order to access BigQuery you need to have at least one project listed in your Google Developers Console. After creating a project you can access the table by following this link.)
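To make the geofencing step concrete, here is a minimal Python sketch of a point-in-polygon test using ray casting. It is an illustration only, not the code behind this post: the function names and the two rectangular zones are made up for the example, and polygon holes are ignored.

```python
from typing import Dict, Optional, Sequence, Tuple

Point = Tuple[float, float]   # (longitude, latitude)
Ring = Sequence[Point]        # outer ring of a single-part polygon, no holes


def point_in_ring(point: Point, ring: Ring) -> bool:
    """Ray casting: count how many edges a ray going east from the point crosses."""
    x, y = point
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can cross the ray.
        if (y1 > y) != (y2 > y):
            # Longitude at which this edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def geofence(point: Point, fences: Dict[str, Ring]) -> Optional[str]:
    """Return the name of the first polygon containing the point, or None."""
    for name, ring in fences.items():
        if point_in_ring(point, ring):
            return name
    return None


# Two made-up rectangular zones (hypothetical, not real neighbourhood boundaries).
FENCES: Dict[str, Ring] = {
    "zone_a": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    "zone_b": [(2.0, 0.0), (4.0, 0.0), (4.0, 2.0), (2.0, 2.0)],
}
```

Calling `geofence((1.0, 1.0), FENCES)` checks each zone in turn. With 342 polygons and 340 million points, a linear scan like this is the part a spatial index is meant to avoid.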
Thematic map showing the distribution of taxi pick-up locations in NYC in 2013. Midtown South is New Yorkers’ favourite area to get a cab with almost 28 million trips starting there, which is roughly 1 trip per second. You can find an interactive map here.
This open data, released by the NYC Taxi & Limo Commission, has been the foundation for some beautiful visualizations. Using Google Cloud Platform's tools, I was able to spatially aggregate the data with Cloud Dataflow and then run ad hoc queries on the results with BigQuery, gaining fast and comprehensive insight into this immense dataset.
With the Google Cloud Dataflow SDK, which parallelizes the data transformations across multiple Cloud Platform instances, I was able to build, test and run the whole processing pipeline in a couple of days. The actual processing, distributed across five workers, took slightly less than two hours.
The pipeline’s architecture is extremely simple. Since Cloud Dataflow offers a BigQuery reader and writer, most of the heavy lifting is already taken care of. The only thing I had to provide was the geofencing function that could be parallelised across multiple instances. For a detailed description of how to do complex geofencing using open source libraries, see this post on the Google Developers Blog.
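The read, geofence, write shape described above can be sketched in plain Python. This is not the Cloud Dataflow SDK: generators stand in for the BigQuery reader and writer, and the column names (`pickup_lon`, `pickup_lat`, `dropoff_lon`, `dropoff_lat`) and the `toy_lookup` function are assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional, Tuple

Point = Tuple[float, float]                    # (longitude, latitude)
Lookup = Callable[[Point], Optional[str]]      # the geofencing function


def tag_trip(row: Dict[str, float], lookup: Lookup) -> Dict[str, object]:
    """Per-record transform. It depends only on its own row, so any worker can run it."""
    tagged: Dict[str, object] = dict(row)
    tagged["pickup_zone"] = lookup((row["pickup_lon"], row["pickup_lat"]))
    tagged["dropoff_zone"] = lookup((row["dropoff_lon"], row["dropoff_lat"]))
    return tagged


def run_pipeline(rows: Iterable[Dict[str, float]], lookup: Lookup) -> Iterator[Dict[str, object]]:
    """Read -> transform -> write, with an iterable as the reader and a generator as the writer."""
    for row in rows:
        yield tag_trip(row, lookup)


def toy_lookup(point: Point) -> Optional[str]:
    """Hypothetical stand-in geofence: splits the plane at longitude 0."""
    return "west" if point[0] < 0 else "east"
```

Because `tag_trip` keeps no state between rows, the input can be split into arbitrary chunks and processed on separate machines, which is the property that lets a service distribute the work.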
When executing the pipeline, Cloud Dataflow automatically optimizes your data-centric pipeline code by collapsing multiple logical passes into a single execution pass and deploys the result to multiple Google Compute Engine instances. When you deploy the pipeline, you can read in files from Google Cloud Storage that contain data you need for your transformations, for example shapefiles or GeoJSON files. Alternatively, you can call an external API to load the geofences you want to test against.
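As a rough illustration of loading geofences from a GeoJSON file, the sketch below parses a FeatureCollection with Python's standard `json` module. The `name` property and the sample document are assumptions for the example; a real file may use a different property, and reading it from Cloud Storage is left out.

```python
import json
from typing import Dict, List, Tuple

Ring = List[Tuple[float, float]]   # list of (longitude, latitude) pairs


def load_geofences(geojson_text: str, name_property: str = "name") -> Dict[str, Ring]:
    """Parse a GeoJSON FeatureCollection into a mapping of fence name -> outer ring."""
    collection = json.loads(geojson_text)
    fences: Dict[str, Ring] = {}
    for feature in collection["features"]:
        geometry = feature["geometry"]
        if geometry["type"] != "Polygon":
            continue  # this sketch handles single-part polygons only
        outer_ring = geometry["coordinates"][0]  # index 0 is the exterior ring
        # GeoJSON positions are [lon, lat] with an optional third element (altitude).
        fences[feature["properties"][name_property]] = [
            (lon, lat) for lon, lat, *_ in outer_ring
        ]
    return fences


# Made-up sample document: one polygon and one point feature.
SAMPLE = """{"type": "FeatureCollection", "features": [
  {"type": "Feature", "properties": {"name": "zone_a"},
   "geometry": {"type": "Polygon",
                "coordinates": [[[0, 0], [2, 0], [2, 2], [0, 2], [0, 0]]]}},
  {"type": "Feature", "properties": {"name": "a_point"},
   "geometry": {"type": "Point", "coordinates": [1, 1]}}
]}"""
```

`load_geofences(SAMPLE)` keeps the polygon and skips the point feature.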
I used an API I built on App Engine, which exposes a list of geofences stored in Datastore. Using the Java Topology Suite, I created a spatial index, held in a class variable in the memory of each instance, for fast querying.
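The idea of building the index once and keeping it in a class variable can be sketched as follows. This is a simplified Python stand-in, not the Java Topology Suite: it stores plain bounding boxes and scans them linearly instead of using a tree, and the class and method names are hypothetical.

```python
from typing import Callable, ClassVar, Dict, List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Ring = Sequence[Point]
BBox = Tuple[float, float, float, float]   # min_x, min_y, max_x, max_y


class GeofenceIndex:
    """Fences plus their bounding boxes, built once per process and then reused."""

    _shared: ClassVar[Optional["GeofenceIndex"]] = None   # class variable acts as the cache
    build_count: ClassVar[int] = 0                        # how often an index was built

    def __init__(self, fences: Dict[str, Ring]) -> None:
        self.entries: List[Tuple[BBox, str, Ring]] = []
        for name, ring in fences.items():
            xs = [x for x, _ in ring]
            ys = [y for _, y in ring]
            self.entries.append(((min(xs), min(ys), max(xs), max(ys)), name, ring))
        GeofenceIndex.build_count += 1

    @classmethod
    def get(cls, loader: Callable[[], Dict[str, Ring]]) -> "GeofenceIndex":
        """Return the cached index, calling the (possibly slow) loader only the first time."""
        if cls._shared is None:
            cls._shared = cls(loader())
        return cls._shared

    def candidates(self, point: Point) -> List[str]:
        """Cheap prefilter: names of fences whose bounding box contains the point."""
        x, y = point
        return [
            name
            for (x0, y0, x1, y1), name, _ in self.entries
            if x0 <= x <= x1 and y0 <= y <= y1
        ]
```

Each record then calls `GeofenceIndex.get(loader).candidates(point)` and runs an exact point-in-polygon test only on the few candidates returned. The sketch is not thread-safe; a real worker running several threads would need a lock around the first build.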
Distributed across five workers, Cloud Dataflow was able to process an average of 25,000 records per second, each record having two locations, ploughing through more than 170 million table rows in just under two hours. The number of workers can be set flexibly at the time of deployment. The more workers you use, the more records can be processed in parallel and the faster your pipeline runs.
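The figures above are consistent with each other, as a quick back-of-the-envelope check shows. The numbers come from the post; the `estimated_hours` helper assumes perfectly linear scaling with the number of workers, which is an idealization rather than a measured result.

```python
ROWS = 170_000_000            # table rows processed (the post says "more than")
RECORDS_PER_SECOND = 25_000   # average throughput across all workers
WORKERS = 5

seconds = ROWS / RECORDS_PER_SECOND                # total processing time in seconds
hours = seconds / 3600                             # just under two hours
per_worker_rate = RECORDS_PER_SECOND / WORKERS     # records per second per worker
locations_per_second = RECORDS_PER_SECOND * 2      # each record has a pick-up and a drop-off


def estimated_hours(workers: int, rows: int = ROWS, rate: float = per_worker_rate) -> float:
    """Idealized estimate: assumes throughput grows linearly with the worker count."""
    return rows / (workers * rate) / 3600
```

Here `seconds` works out to 6,800, or about 1.89 hours, which matches "just under two hours". Under the linear-scaling assumption, `estimated_hours(10)` would halve that.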
The interactive Cloud Dataflow graph of your pipeline helps you monitor and debug it from the Google Developers Console in your browser.
With the data preprocessed and written back into BigQuery, we were then able to run very fast queries over the whole table, answering questions like “Where do the best-paid trips start from?”
Unsurprisingly, they start from JFK airport, with an average fare of $46 and an average tip of 20.7%*. Okay, this is probably not a secret, but did you know that, even though the average fare from LGA airport is $15 less, roughly 800,000 more trips start from LGA? And with 22.2%*, passengers from LGA airport actually tip best.
* As cash tips aren’t reported, only 52% of trips have a tip noted, so the values regarding tips could be inaccurate.
Most taxi trips start in Midtown South (28 million), with an average fare of $11. Carnegie Hill in the Upper East Side comes fourth with 12 million pick-ups, but these trips are fairly short. Journeys that start there mostly stay in the Upper East Side and therefore only generate an average fare of $9.80.
Here's an interactive map visualizing where people went, what they paid on average and how they tipped, along with some other visualizations of how people tip depending on where their trip starts:
(click to visit interactive map)
The processed data is publicly available in this BigQuery table. You can find some interesting queries to run against this data in this gist.
Though NYC taxi cab journeys may not seem to amount to much, they actually conceal a ton of information, which Google Cloud Dataflow, as a powerful big data tool, helped reveal by making big data processing easy and affordable. Maybe I'll try London's black cabs next.