Cracking the GitHub code: this week on Google Cloud Platform

July 15, 2016
Alex Barrett

GCP Blog Editor, Google Cloud Platform Blog

It’s been a couple of weeks since GitHub announced that it was making 3+TB of its open source library available on BigQuery, and the Google Cloud Platform community has been busy ever since.

Google Developer Advocate Felipe Hoffa showed the world how it's done in “GitHub on BigQuery: Analyze all the open source code,” and fellow DA Francesc Campoy followed suit with a post analyzing GitHub Go packages. Along the way, he discovered that he could write even more nuanced queries by using BigQuery User Defined Functions.

Then, one of Google’s newest DAs, Guillaume Laforge, informs us that there are 743,070 Groovy files on GitHub containing 16,464,376 lines of code, while CloudFlare’s Filippo Valsorda (the “Heartbleed guy”) analyzes how the Go ecosystem “does vendoring.”

Meanwhile, over on Medium, Lak Lakshmanan, Google program manager for big data and machine learning, uses BigQuery to discover which popular Java projects need the most help by searching for tagged comments such as FIXME and TODO. The post also shows how to use Google Cloud Dataflow to build a Java pipeline that reads from BigQuery and processes the data in steps.
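To make the idea of counting tagged comments concrete, here is a minimal local sketch in Python. It is not the method from the post, which runs on BigQuery and Cloud Dataflow; the function name `count_tags`, the `(project, content)` input shape, and the sample data are all hypothetical and used only for illustration.

```python
import re
from collections import Counter

# Match the two tags named in the post as whole words only,
# so that identifiers such as "TODOS" are not counted.
TAG_PATTERN = re.compile(r"\b(FIXME|TODO)\b")


def count_tags(files):
    """Return a Counter mapping project name to its number of FIXME/TODO tags.

    `files` is an iterable of (project, content) pairs. This input shape
    is an assumption made for the sketch, not the GitHub dataset schema.
    """
    totals = Counter()
    for project, content in files:
        totals[project] += len(TAG_PATTERN.findall(content))
    return totals


# Hypothetical sample data standing in for real source files.
sample = [
    ("demo/alpha", "// TODO: handle nulls\nint x = 0; // FIXME overflow\n"),
    ("demo/alpha", "/* nothing to see */\n"),
    ("demo/beta", "// TODO refactor\n"),
]

# Projects ordered from most to fewest tagged comments.
ranked = count_tags(sample).most_common()
```

Ranking projects by this count gives a rough proxy for "needs the most help"; a real analysis would likely also normalize by project size.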

Or check out Robert Kozikowski’s blog for a treasure trove of GitHub data analysis, including posts on visualizing relationships between Python packages and on the top pandas, NumPy and SciPy functions, Emacs packages and Angular directives.

And if that’s still not enough GitHub on BigQuery for you, here’s a Changelog podcast on the topic for your drive home!
