
Here's a Google BigQuery query that lists the most common PDFs referenced in the GitHub sample dataset, along with the top 100 results: https://gist.github.com/llimllib/3f1877eab06208958060f491cf3...

It's possible to run this query against the full GitHub dataset, but I couldn't figure out how to pay for it, so it would be excellent if somebody wants to do that.




Just a note: it's bizarre that I cannot find any way to determine a) how much the query would cost to run or b) how I would pay for it if I wanted to run it.


I changed it to query [bigquery-public-data:github_repos.contents] instead, and before I execute the query, BigQuery says "Valid: This query will process 1.68 TB when run."

Queries are $5/TB [0].

So about $8.40, a bit less than 10 bucks. :)

Edit: brb, that's totally worth it.

[0]: https://cloud.google.com/bigquery/pricing
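
For anyone who wants to redo the arithmetic with a different query size, here is a minimal sketch in Python. It assumes the flat $5/TB rate quoted above and ignores any free tier or minimum charge; the function name `estimate_query_cost` is made up for illustration.

```python
# Assumption: flat on-demand rate of $5 per TB processed, as quoted above.
PRICE_PER_TB_USD = 5.00


def estimate_query_cost(tb_processed, price_per_tb=PRICE_PER_TB_USD):
    """Return a rough cost in USD for a query that scans tb_processed TB."""
    return tb_processed * price_per_tb


# 1.68 TB is the size reported for the full-dataset query.
cost = estimate_query_cost(1.68)
print(f"${cost:.2f}")
```

Swap in whatever size the validator reports for your own query; the result is only as good as the assumed rate.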



OK, so why is the most common document something to do with the Turkish 2012 elections? (If the rough Google Translate is to be believed...)

Weird...


> 3896 http://www.pdf

Hmm...


Yeah, I didn't care to make the regexp perfect. The most common site is www.pdfsharp.com, then www.pdfparser.org, then www.pdflib.com, etc.
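
To illustrate how a result like "http://www.pdf" can show up, here is a small Python sketch. The patterns below are hypothetical and are not the regexp from the gist: the loose one stops at the first ".pdf" it sees, while the stricter one also requires a word boundary after "pdf", so hostnames such as www.pdfsharp.com no longer match. The example.com URL is a placeholder.

```python
import re

# Hypothetical loose pattern: lazily matches up to the first ".pdf".
LOOSE = re.compile(r"https?://\S+?\.pdf")

# Hypothetical stricter pattern: ".pdf" must end at a word boundary,
# so ".pdfsharp" or ".pdflib" in a hostname does not count.
STRICT = re.compile(r"https?://\S+?\.pdf\b")

text = "See http://www.pdfsharp.com and http://example.com/paper.pdf for details."

loose_hits = LOOSE.findall(text)
strict_hits = STRICT.findall(text)

print(loose_hits)
print(strict_hits)
```

The stricter pattern is still not perfect (it would miss URLs with query strings before the extension, for example), but it avoids truncating at hostnames that merely start with "pdf".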


Weird! Mine just says "Quota exceeded..." without ever saying how big the query will be. Where do I find that info?

(http://i.imgur.com/3EkPYIY.png is what I see)



