Google Cloud Platform Blog
Product updates, customer stories, and tips and tricks on Google Cloud Platform
GitHub on BigQuery: Analyze all the open source code
Wednesday, June 29, 2016
Posted by Felipe Hoffa, Google Developer Advocate
Google, in collaboration with GitHub, is releasing an incredible new open dataset on Google BigQuery. So far you've been able to monitor and analyze GitHub's pulse since 2011 (thanks GitHub Archive project!) and today we're adding the perfect complement to this. What could you do if you had access to analyze all the open source software in the world, with just one SQL command?
The Google BigQuery Public Datasets program now offers a full snapshot of the content of more than 2.8 million open source GitHub repositories in BigQuery. Thanks to our new collaboration with GitHub, you'll have access to analyze the source code of almost 2 billion files with a simple (or complex) SQL query. This will open the doors to all kinds of new insights and advances that we're just beginning to envision.
For example, let's say you're the author of a popular open source library. Now you'll be able to find every open source project on GitHub that's using it. Even more, you'll be able to guide the future of your project by analyzing how it's being used, and improve your APIs based on what your users are actually doing with it.
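As a rough illustration of the library-usage idea above, the sketch below runs the same kind of query against a tiny in-memory SQLite table instead of BigQuery. The table names `files` and `contents`, their columns, the sample rows, and the library name `requests` are all assumptions made up for this example; the real dataset's schema and SQL dialect may differ.

```python
import sqlite3

# Toy stand-in for the GitHub dataset. Table and column names are
# assumptions for illustration, not the real BigQuery schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (repo_name TEXT, path TEXT, id TEXT);
    CREATE TABLE contents (id TEXT, content TEXT);
""")
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?)",
    [
        ("alice/webapp", "app.py", "f1"),
        ("bob/scraper", "crawl.py", "f2"),
        ("carol/mathlib", "stats.py", "f3"),
    ],
)
conn.executemany(
    "INSERT INTO contents VALUES (?, ?)",
    [
        ("f1", "import requests\nrequests.get('https://example.com')\n"),
        ("f2", "import requests\nimport re\n"),
        ("f3", "import math\n"),
    ],
)


def repos_using(library):
    """Return sorted names of repositories with a file that imports `library`."""
    rows = conn.execute(
        """
        SELECT DISTINCT f.repo_name
        FROM files AS f
        JOIN contents AS c ON c.id = f.id
        WHERE c.content LIKE '%import ' || ? || '%'
        ORDER BY f.repo_name
        """,
        (library,),
    ).fetchall()
    return [name for (name,) in rows]


print(repos_using("requests"))
```

The `LIKE` match is a crude substring test (searching for `re` would also match `import requests`), so a real analysis would likely want a stricter pattern per language.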
On the security side, we've seen how the most popular open source projects benefit from having multiple eyes and hands working on them. This visibility helps projects get hardened and buggy code cleaned up. What if you could search for errors with similar patterns in every other open source project? Would you notify their authors and send them pull requests? Well, now you can.
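To make the pattern-search idea above a little more concrete, here is a minimal sketch that scans a handful of file contents for one risky construct. The sample repositories and files are invented, and the chosen pattern (calls to C's `gets()`, which cannot bound its input) is just an assumed example of an "error with a similar pattern"; against the real dataset the contents would come from a query result rather than a literal dictionary.

```python
import re

# Hypothetical sample of (repo, path) -> file content, made up for this sketch.
SAMPLE_FILES = {
    ("alice/netd", "src/input.c"): "char buf[64];\ngets(buf);\n",
    ("bob/parser", "read.c"): "char line[128];\nfgets(line, sizeof line, stdin);\n",
    ("carol/tool", "main.c"): "int main(void) {\n  char b[8];\n  gets (b);\n  return 0;\n}\n",
}

# Assumed example pattern: a call to gets(). The \b word boundary keeps
# the safer fgets( from matching.
UNSAFE_GETS = re.compile(r"\bgets\s*\(")


def find_matches(files, pattern):
    """Return sorted (repo, path, line_number) tuples for lines matching pattern."""
    hits = []
    for (repo, path), content in files.items():
        for lineno, line in enumerate(content.splitlines(), start=1):
            if pattern.search(line):
                hits.append((repo, path, lineno))
    return sorted(hits)


for repo, path, lineno in find_matches(SAMPLE_FILES, UNSAFE_GETS):
    print(f"{repo}:{path}:{lineno}")
```

A line-level regex like this will produce false positives (comments, strings), so any hit is a candidate to review before notifying an author, not a confirmed bug.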
Some concepts to keep in mind while working with BigQuery and the GitHub contents dataset:
With BigQuery everyone gets a terabyte every month to run queries. If you've never tried BigQuery before, follow these getting started instructions.
The contents table has all the non-binary files in GitHub that are smaller than 1 MB. It's a huge table, with more than 1.5 terabytes of data! This means the monthly terabyte for BigQuery queries won't last long if you want to query this table. To make your life easier, we've created extracts with only a sample of 10% of all files of the most popular projects, as well as another dataset with all the .go, .rb, .js, .php, .py, and .java code. Use them to make your free quota last!
If these tables are not enough, you can always create your own extracts (but you'll be billed for the respective storage). To do so, you could sign up for $300 in Google Cloud Platform credits. These credits could be used to store terabytes (and more) of data in BigQuery.
BigQuery makes it easy to join different datasets. How about ranking coding patterns by the number of stars their projects get? See a related post looking at the Hacker News effect on a project’s GitHub stars.
SQL is not enough? Learn how BigQuery allows you to run arbitrary JavaScript code inside SQL to enable a full range of possibilities.
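The quota note above (a free terabyte per month against a contents table of more than 1.5 terabytes) can be put into numbers with a short sketch. It assumes decimal terabytes and that a query reads the whole table; the 25 GB extract size is a made-up figure for comparison, not a number from this post.

```python
GB = 10**9
TB = 10**12  # decimal units are an assumption; billing units may differ

FREE_BYTES_PER_MONTH = 1 * TB       # "a terabyte every month" from the post
CONTENTS_TABLE_BYTES = 1500 * GB    # "more than 1.5 terabytes" from the post
HYPOTHETICAL_EXTRACT_BYTES = 25 * GB  # invented size of a smaller extract


def full_queries_per_month(bytes_per_query, free_bytes=FREE_BYTES_PER_MONTH):
    """How many whole queries of this size fit in the monthly free quota."""
    return free_bytes // bytes_per_query


def quota_fraction(bytes_per_query, free_bytes=FREE_BYTES_PER_MONTH):
    """Share of the monthly free quota that a single query consumes."""
    return bytes_per_query / free_bytes


for label, size in [
    ("full contents table", CONTENTS_TABLE_BYTES),
    ("hypothetical extract", HYPOTHETICAL_EXTRACT_BYTES),
]:
    print(label, full_queries_per_month(size), quota_fraction(size))
```

Under these assumptions a single full scan of the contents table already exceeds the free terabyte, which is why the smaller extracts are the better starting point.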
To learn more, read GitHub's announcement and try some sample queries. Share your queries and findings in our reddit.com/r/bigquery and Hacker News posts. The ideas are endless, and I'll start collecting tips and links to other articles on this post on Medium.
Stay curious!