Google Cloud Platform Blog
Product updates, customer stories, and tips and tricks on Google Cloud Platform
Google Cloud Dataproc: Making Spark and Hadoop Easier, Faster, and Cheaper
Wednesday, September 23, 2015
Working with large datasets requires powerful tools, but too often those tools add new layers of complexity. To use your data efficiently, you need to minimize the time from data capture to insight. But concerns about deployment, scaling, monitoring, utilization, and cost can get in the way of what matters most: your data. With more data being generated each day, you have less time to peel back the layers of complexity around the tools you rely on for success. We think using powerful data tools should be as easy as 1-2-3.
Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Cloud Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data. In the time it takes you to read this blog post, you can have a Spark or Hadoop cluster created, configured, and ready to work for you.
Cloud Dataproc minimizes the time you spend on administration and management
When compared to traditional, on-premises products and competing cloud services, Cloud Dataproc has a number of unique advantages for clusters of 3 to hundreds of nodes:
Low-cost. Cloud Dataproc is priced at only 1 cent per virtual CPU in your cluster per hour, on top of the other Cloud Platform resources you use. In addition to this low price, Cloud Dataproc clusters can include preemptible instances that have lower compute prices, reducing your costs even further. Instead of rounding your usage up to the nearest hour, Cloud Dataproc charges you only for what you really use with minute-by-minute billing and a low, ten-minute-minimum billing period.
Super fast. Without Cloud Dataproc, it can take anywhere from 5 to 30 minutes to create Spark and Hadoop clusters on-premises or through IaaS providers. By comparison, Cloud Dataproc clusters are quick to start, scale, and shut down, with each of these operations taking 90 seconds or less, on average. This means you can spend less time waiting for clusters and more hands-on time working with your data.
Integrated. Cloud Dataproc has built-in integration with other Google Cloud Platform services, such as BigQuery, Google Cloud Storage, Google Cloud Bigtable, Google Cloud Logging, and Google Cloud Monitoring, so you have more than just a Spark or Hadoop cluster; you have a complete data platform. For example, you can use Cloud Dataproc to effortlessly ETL terabytes of raw log data directly into BigQuery for business reporting.
Managed. Use Spark and Hadoop clusters without the assistance of an administrator or special software. You can easily interact with clusters and Spark or Hadoop jobs through the Google Developers Console, the Google Cloud SDK, or the Cloud Dataproc REST API. When you're done with a cluster, you can simply turn it off so you don’t spend money on an idle cluster. You won’t need to worry about losing data, because Cloud Dataproc is integrated with Cloud Storage, BigQuery, and Cloud Bigtable.
Simple and familiar. You don’t need to learn new tools or APIs to use Cloud Dataproc, making it easy to move existing projects into Cloud Dataproc without redevelopment. Spark, Hadoop, Pig, and Hive are frequently updated, so you can be productive faster. Today, we are launching with clusters that have Spark 1.5 and Hadoop 2.7.1.
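To make the pricing described under "Low-cost" concrete, here is a minimal Python sketch of the Cloud Dataproc fee for a single cluster run. It covers only the 1 cent per virtual CPU per hour charge and ignores the underlying Compute Engine and other Cloud Platform costs. The function names are hypothetical, and the rounding behavior (partial minutes rounded up to the next whole minute) is an assumption; the post states only minute-by-minute billing with a ten-minute minimum.

```python
import math
from fractions import Fraction

# Figures taken from the post.
DATAPROC_CENTS_PER_VCPU_HOUR = 1   # 1 cent per virtual CPU per hour
MINIMUM_BILLED_MINUTES = 10        # ten-minute minimum billing period


def billed_minutes(runtime_minutes: float) -> int:
    """Minutes charged for a run.

    Assumption: partial minutes are rounded up to the next whole minute.
    """
    if runtime_minutes < 0:
        raise ValueError("runtime_minutes must be non-negative")
    return max(MINIMUM_BILLED_MINUTES, math.ceil(runtime_minutes))


def dataproc_fee_cents(total_vcpus: int, runtime_minutes: float) -> Fraction:
    """Cloud Dataproc fee in cents under minute-by-minute billing."""
    minutes = billed_minutes(runtime_minutes)
    return Fraction(DATAPROC_CENTS_PER_VCPU_HOUR * total_vcpus * minutes, 60)


def hourly_rounded_fee_cents(total_vcpus: int, runtime_minutes: float) -> Fraction:
    """The same rate if usage were rounded up to the nearest hour instead."""
    hours = max(1, math.ceil(runtime_minutes / 60))
    return Fraction(DATAPROC_CENTS_PER_VCPU_HOUR * total_vcpus * hours)


if __name__ == "__main__":
    # Hypothetical cluster: 24 vCPUs in total, running for 25 minutes.
    print(float(dataproc_fee_cents(24, 25)))        # 10.0 cents
    print(float(hourly_rounded_fee_cents(24, 25)))  # 24.0 cents
    # A 4-minute run is billed at the ten-minute minimum.
    print(float(dataproc_fee_cents(24, 4)))         # 4.0 cents
```

Under these assumptions, a 25-minute job on a 24-vCPU cluster incurs a 10-cent Cloud Dataproc fee, compared with 24 cents if the same rate were rounded up to a full hour.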
Cloud Dataproc joins a rich set of cloud technologies focused on faster speed, robust features, and lower costs. With Cloud Platform you have access to:
Awesome infrastructure, including Google Compute Engine, Cloud Storage, and Google Cloud Networking.
Cloud Dataproc, which builds on this infrastructure to let you use Spark and Hadoop more easily, faster, and at a lower cost. Since Cloud Dataproc is built on Cloud Platform, you have instant access to solid-state drives (SSD) and preemptible virtual machines.
Next-generation data processing and analytics services in Google Cloud Platform that you can combine with Cloud Dataproc. These services are powered by Google-native technologies and include BigQuery, Google Cloud Dataflow, and Google Cloud Pub/Sub.
Today we’re releasing Google Cloud Dataproc as a beta service. Cloud Dataproc gives you anytime access to super-fast, simple yet powerful, managed Spark and Hadoop clusters. Since you only pay for what you use with minute-by-minute billing, you won’t break the bank in the process. We look forward to seeing how you find creative, innovative, and productive ways to use Cloud Dataproc. To learn more about Cloud Dataproc, visit the Cloud Dataproc site, review our getting started guide, or submit your questions and feedback on Stack Overflow.
- Posted by James Malone, Product Manager