
Tell HN: Full Hacker News dataset now available on BigQuery - minimaxir
https://bigquery.cloud.google.com/table/fh-bigquery:hackernews.comments

The dataset is up to date through October 2015 and uses the official HN API as its data source. (Unfortunately, this means the comments include their HTML formatting.) The dataset is about 4 GB total; since BigQuery allows 1,000 GB of processing for free each month, it is effectively free to analyze.

Felipe Hoffa (who uploaded the dataset) has links to the dataset along with some sample Python code for analysis: https://github.com/fhoffa/notebooks/blob/master/analyzing%20hacker%20news.ipynb

I have a few scripts for downloading all the data manually (https://github.com/minimaxir/get-all-hacker-news-submissions-comments), but the BigQuery table may be more pragmatic. I can do some more sample queries if requested. (See my BigQuery tutorial for Reddit data: http://minimaxir.com/2015/10/reddit-bigquery/)
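As a taste of what a sample query looks like (a minimal sketch; I'm assuming the comments table exposes a time_ts timestamp column, so double-check the schema in the BigQuery UI), this counts comments per year and uses only a tiny fraction of the free quota:

    
    
      -- Comments per year. Only the referenced column is scanned, so this
      -- uses a negligible amount of the monthly free processing quota.
      -- Assumes a time_ts TIMESTAMP column; adjust if the actual schema differs.
      SELECT YEAR(time_ts) AS posting_year, COUNT(*) AS comments
      FROM [fh-bigquery:hackernews.comments]
      GROUP BY posting_year
      ORDER BY posting_year
    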
======
fhoffa
Hi! I'm the one that loaded this dataset into BigQuery. Feel free to ask any
questions :).

The notebook with sample queries and visualizations:

[https://github.com/fhoffa/notebooks/blob/master/analyzing%20...](https://github.com/fhoffa/notebooks/blob/master/analyzing%20hacker%20news.ipynb)

~~~
rndn
There are items in there that are deleted/[dead] on HN (and not very recent).
How come?

~~~
mintplant
You can see dead things on HN if you turn on "showdead" in your account
settings.

------
minimaxir
Clickables:

[https://bigquery.cloud.google.com/table/fh-bigquery:hackerne...](https://bigquery.cloud.google.com/table/fh-bigquery:hackernews.comments)

[https://github.com/fhoffa/notebooks/blob/master/analyzing%20...](https://github.com/fhoffa/notebooks/blob/master/analyzing%20hacker%20news.ipynb)

[https://github.com/minimaxir/get-all-hacker-news-submissions...](https://github.com/minimaxir/get-all-hacker-news-submissions-comments)

[http://minimaxir.com/2015/10/reddit-bigquery/](http://minimaxir.com/2015/10/reddit-bigquery/)

------
JoshMandel
This makes a really nice introduction to BigQuery (which is to say: BigQuery
is nicely discoverable, given an easy-to-understand dataset).

Is there a good way to find the story to which a comment belongs? This dataset
raises the need for recursive queries (e.g. "with recursive" in SQLite or
PostgreSQL, or "connect by" in Oracle). The only approach I see in BigQuery is
hard-coding a fixed depth with something scary like:

    
    
      SELECT p0.text, s.id, s.title
      FROM
        [fh-bigquery:hackernews.comments] p0
      JOIN EACH [fh-bigquery:hackernews.comments] p1 ON p1.id=p0.parent
      JOIN EACH [fh-bigquery:hackernews.comments] p2 ON p2.id=p1.parent
      JOIN EACH [fh-bigquery:hackernews.comments] p3 ON p3.id=p2.parent
      JOIN EACH [fh-bigquery:hackernews.comments] p4 ON p4.id=p3.parent
      JOIN EACH [fh-bigquery:hackernews.stories] s ON s.id=p4.parent
      WHERE
        REGEXP_MATCH(p0.text, '(?i)bigquery')
      ORDER BY
        p0.time DESC
    

For this particular dataset, linking each comment to its story might be a good
denormalization.

~~~
fhoffa
You are right - I'll prepare a new release with that data.

My oversight, sorry! :)

~~~
JoshMandel
Not an oversight — just a different use case for the data! And I wasn't sure
if BigQuery had a generic approach here, but it looks like not.

~~~
fhoffa
Btw, I really like your query.

I modified it to get the story for up to 7 levels of recursion:

    
    
      SELECT p0.id, s.id, s.title, level
      FROM (
        SELECT p0.id, p0.parent, p2.id, p3.id, p4.id,
               COALESCE(p7.parent, p6.parent, p5.parent, p4.parent,
                        p3.parent, p2.parent, p1.parent, p0.parent) story_id,
               GREATEST(IF(p7.parent IS null, -1, 7), IF(p6.parent IS null, -1, 6),
                        IF(p5.parent IS null, -1, 5), IF(p4.parent IS null, -1, 4),
                        IF(p3.parent IS null, -1, 3), IF(p2.parent IS null, -1, 2),
                        IF(p1.parent IS null, -1, 1), 0) level
        FROM    [fh-bigquery:hackernews.comments] p0
        LEFT JOIN EACH [fh-bigquery:hackernews.comments] p1 ON p1.id=p0.parent
        LEFT JOIN EACH [fh-bigquery:hackernews.comments] p2 ON p2.id=p1.parent
        LEFT JOIN EACH [fh-bigquery:hackernews.comments] p3 ON p3.id=p2.parent
        LEFT JOIN EACH [fh-bigquery:hackernews.comments] p4 ON p4.id=p3.parent
        LEFT JOIN EACH [fh-bigquery:hackernews.comments] p5 ON p5.id=p4.parent
        LEFT JOIN EACH [fh-bigquery:hackernews.comments] p6 ON p6.id=p5.parent
        LEFT JOIN EACH [fh-bigquery:hackernews.comments] p7 ON p7.id=p6.parent
        HAVING level=0
        LIMIT 100
      ) a
      LEFT JOIN EACH [fh-bigquery:hackernews.stories] s
      ON s.id=a.story_id
    
    

(Having so many LEFT JOINs consumes a lot of resources, so to run this over the
whole dataset I would look for a different strategy.)
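
One such strategy, once the story id is materialized in a future release (as
discussed above), is to denormalize: resolve each comment's story once, store
it, and then every lookup is a single cheap join. A rough sketch, where the
comments_with_story table and its story_id column are hypothetical names for
that future denormalized data:

    
    
      -- Hypothetical: a table with the resolved story_id precomputed per comment.
      -- With that in place, the seven-level walk collapses to a single join.
      SELECT c.text, s.id, s.title
      FROM [fh-bigquery:hackernews.comments_with_story] c
      JOIN EACH [fh-bigquery:hackernews.stories] s ON s.id = c.story_id
      WHERE REGEXP_MATCH(c.text, '(?i)bigquery')
      ORDER BY c.time DESC
    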

------
rm999
Very cool, thanks! Looking forward to playing around with this.

FYI, you named a column with a reserved SQL keyword ('by'). For future
reference, and for others reading this: this is bad database design and makes
the table harder to use. You can work around it by wrapping the column name in
brackets, like:

>select ... where [by] = ...

~~~
fhoffa
I didn't name the column "by", I just gave the field the name that the API
uses for it.

And to make everyone's lives easier (including mine), I copied the [by] column
to an [author] column, so you can do a

>select ... where author = ...

instead :)

~~~
malkia
Oh, at first I thought there was some difference, so I did this:

    
    
      SELECT * FROM [fh-bigquery:hackernews.stories] WHERE [by] != author

------
nsmalch
Nice trove to pore through when I find the time.

I like to use Twitter to analyze HN datasets. It's mostly limited to links,
because that's what I'm usually after.

[https://twitter.com/newsycombinator](https://twitter.com/newsycombinator)
[https://twitter.com/HackerNews](https://twitter.com/HackerNews) ... and a few
other accounts. Try to avoid Bitly-wrapped links.

Use something like Greptweet to harvest the tweets and filter out the noise.

~~~
minimaxir
That will not get you accurate results for HN data analysis since a) those
accounts only tweet important links, so the analysis will be biased, and b) you
can only retrieve an account's most recent 3,200 tweets (a Twitter API
limitation).

You have to look at both the good and the bad.

~~~
nsmalch
I hear you. Raw unfiltered links always have hidden gems.

One thing, though: Greptweet has an archive somewhere with a huge trove of
tweets that users of the service have searched for, and which were thus logged
and kept (some even go past the 3,200 limit). It's a massive tarball, so set
aside time to download it and weed out boring/noisy links.

A lot of HN links are tech-press posts that consist of hearsay and merely proxy
the thoughts of others. The recent shift on HN toward more academia-style posts
is refreshing.

------
BinaryIdiot
This is really cool; thanks! Not exactly sure what I'll do with it (honestly,
probably nothing), but it's still cool, especially since I didn't know about
BigQuery before.

------
jonathannorris
Huge fan of BQ; it's the best product Google Cloud has that AWS doesn't.
Excited for its future.

------
gregw134
Many thanks! Could you put up a magnet link so we can download the data
ourselves? It would also be useful to have the submission id as a field in the
comment table.

------
arikfr
If anyone wants to play with this dataset and easily share their queries or
visualize the data, you're welcome to use re:dash's demo instance [1]. I've
also created an example query [2].

[1] [http://demo.redash.io](http://demo.redash.io) [2]
[http://demo.redash.io/queries/667/source#table](http://demo.redash.io/queries/667/source#table)

------
hoodoof
Is this OK from a privacy perspective?

~~~
0942v8653
Why wouldn't it be? All the data provided by the HN API is public.

~~~
andreasvc
Yes, when you submit a comment to HN you intend for it to be published on the
HN site. However, using comments for text-mining purposes is a different thing,
and in principle consent is required for that. The default case is that
user-submitted content remains copyrighted by the author, with implied consent
for the site to publish it. There can be an explicit license, as with Wikipedia
or Stack Overflow; in that case text mining may be explicitly allowed.

------
cobookman
All of the GitHub event data is on BigQuery as well.
[https://www.githubarchive.org/#bigquery](https://www.githubarchive.org/#bigquery)

Now for someone to analyze GitHub <-> HN correlations.

~~~
fhoffa
I joined GitHub and Hacker News here - how many stars after a front page post?

[https://www.reddit.com/r/bigquery/comments/3qpyor/joining_ha...](https://www.reddit.com/r/bigquery/comments/3qpyor/joining_hacker_news_and_github_how_much_attention/)
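
The HN half of that join can be sketched without touching the GitHub tables at
all; a rough example (assuming the stories table exposes url and score columns,
which is my reading of the schema) pulls out the most-submitted GitHub repos:

    
    
      -- Most-submitted GitHub repos on HN, with the repo path extracted from the story URL.
      -- Assumes the stories table has url and score columns; verify against the actual schema.
      SELECT REGEXP_EXTRACT(url, 'github.com/([^/]+/[^/?#]+)') AS repo,
             COUNT(*) AS submissions,
             SUM(score) AS total_score
      FROM [fh-bigquery:hackernews.stories]
      WHERE url CONTAINS 'github.com/'
      GROUP BY repo
      HAVING repo IS NOT NULL
      ORDER BY submissions DESC
      LIMIT 20
    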

------
DanBC
Cool.

I'd be interested to see a list of people who submit a lot. I submit too much
- about one submission per day - and I'm curious what percentile that puts me
in.
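
A query along these lines would show where that falls (a sketch, using the
author column mentioned above and counting all rows in the stories table, dead
and deleted included):

    
    
      -- Submissions per author, most prolific first; percentile follows from the row's rank.
      SELECT author, COUNT(*) AS submissions
      FROM [fh-bigquery:hackernews.stories]
      GROUP BY author
      ORDER BY submissions DESC
    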

------
buzzsudo
Hey guys check it out
[https://news.ycombinator.com/item?id=10445270](https://news.ycombinator.com/item?id=10445270)

------
apryldelancey
Thank you for this, I think it's great!

