
Third Annual GitHub Data Challenge - geetarista
https://github.com/blog/1864-third-annual-github-data-challenge
======
peterwaller
This just made me discover the github archive.

    
    
      $ wget http://data.githubarchive.org/2014-07-21-{0..23}.json.gz
      ...
      Downloaded: 24 files, 129M in 30s (4.25 MB/s)
    

Cool. A day's worth of public events is 129MB compressed. That's surprisingly
small! Let's play for a second.

    
    
      $ ls *.gz | xargs -P4 -n1 gunzip
      $ du -sch *.json
      ...
      807M	total
    
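(As an aside, the gunzip step isn't strictly needed: `gunzip -c` can stream the compressed archives straight into a pipeline without ever writing the decompressed JSON to disk. A minimal sketch on a tiny generated file, standing in for the real `2014-07-21-*.json.gz` files above:)

```shell
# Build a tiny stand-in for one archive file (two fake events),
# then count its lines without decompressing to disk.
printf '{"type":"PushEvent"}\n{"type":"CreateEvent"}\n' | gzip > sample.json.gz
gunzip -c sample.json.gz | wc -l
```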

Time to break out jq:
[https://stedolan.github.io/jq/manual/](https://stedolan.github.io/jq/manual/)

    
    
      $ time jq .type *.json | wc -l
      408218
    
      real	0m16.788s
      user	0m16.366s
      sys	0m0.325s
    

That's an easy amount of data to mess with. If a day is 16 seconds to process,
I can do 14 years on my measly desktop in one day! 408k public records -
around 5 a second. I somehow imagined events would flood into github even
faster than that. I wonder what their public/private activity ratio is.
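(Sanity-checking those back-of-envelope numbers with a bit of shell arithmetic; the 408218 events and 16.788 s figures are from the run above, and 86400 is seconds per day:)

```shell
# Events per second across the day: 408218 events / 86400 s
awk 'BEGIN { printf "%.1f events/sec\n", 408218 / 86400 }'
# prints 4.7 events/sec

# If one day of data takes 16.788 s of jq time, how many years of
# archive fit into 24 hours of wall-clock processing?
awk 'BEGIN { printf "%.1f years\n", (86400 / 16.788) / 365 }'
# prints 14.1 years
```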

Let's explore the event types:

    
    
      $ time jq .type *.json | sort | uniq -c | sort -n
          405 "PublicEvent"
          697 "TeamAddEvent"
         1018 "ReleaseEvent"
         1636 "MemberEvent"
         3166 "CommitCommentEvent"
         3892 "GollumEvent"
         6925 "DeleteEvent"
         7051 "PullRequestReviewCommentEvent"
        14807 "ForkEvent"
        18579 "PullRequestEvent"
        19919 "IssuesEvent"
        37942 "WatchEvent"
        38402 "IssueCommentEvent"
        46033 "CreateEvent"
       207746 "PushEvent"
    

Pushes dominate - 10 pushes for every issue created.
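(The same sort|uniq trick extends to other fields, e.g. which repositories get pushed to most. A sketch on a hand-made sample, since I'm assuming the archive's events carry a `.repository.name` field — not verified here:)

```shell
# Hypothetical sample in the shape I assume the 2014 archive uses.
printf '%s\n' \
  '{"type":"PushEvent","repository":{"name":"demo"}}' \
  '{"type":"PushEvent","repository":{"name":"demo"}}' \
  '{"type":"IssuesEvent","repository":{"name":"tool"}}' > sample.json

# Most-pushed repositories, most active first:
jq -r 'select(.type == "PushEvent") | .repository.name' sample.json \
  | sort | uniq -c | sort -rn | head
```

On the real archive you'd point jq at `*.json` instead of `sample.json`.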

This is probably more than enough for an HN comment. It'll be fun to see what
people do with this stuff this year. :)

~~~
minimaxir
The Google BigQuery implementation of the archive can do such a query across
all the data in seconds.

I wasn't aware until today that you could use BigQuery on a recently-updated
data set, though.

~~~
rsivapr
I can confirm. That query took about 2 seconds. More discussion here:
[http://www.datatau.com/item?id=3608](http://www.datatau.com/item?id=3608)

------
onetimeusename
I don't get why the first prize is a one-day course about data visualization.
Winning the contest already shows you're knowledgeable about data
visualization, so what would a one-day course do for you?

~~~
cmthornton
The course is taught by Edward Tufte.

------
aleksi
> you’re not participating from a country against which the United States has
> issued export sanctions or other trade restrictions, including Cuba, Iran,
> North Korea, the Sudan and Syria

Does that include Russia these days?

~~~
ahmett
No, this is just the official sanctions list, I guess.

------
sytelus
Why are the prizes for this competition so pathetic? It looks like a corporate
morale budget where you skimp on pennies (I know for a fact that one big
company has a $75 morale budget per person per year). Is this how much our
time is worth? If the execs at GitHub decided to throw mere pennies at
developers to compete, the organizers could at least have been more creative,
like giving out some cool designed t-shirts or something.

 _Top 3 winners get $200, $100, and $50_

~~~
oniTony
> Top 3 winners get $200, $100, and $50

Those are 2013 numbers. This year it's "all-expense paid trip to attend a one-
day data visualization course", $500, and $250 for top 3 winners. But
presumably this isn't meant to be done as a job.

