Tell HN: Full Hacker News dataset now available on BigQuery
238 points by minimaxir 578 days ago | 43 comments

The dataset is up to date through October 2015 and uses the official HN API as its data source. (Unfortunately, this means comment bodies retain their HTML formatting.) The dataset is about 4GB total; since BigQuery allows 1000GB of processing for free each month, it is effectively no cost to analyze.
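Since the comment bodies come through with HTML tags and entities, here's a minimal Python sketch for flattening them to plain text using only the stdlib (the function name and sample input are mine, not from the dataset):

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects the text content of an HTML fragment, dropping all tags.
    Character references (e.g. &#x2F;) are decoded automatically."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_comment_html(raw):
    """Convert an HN comment body (HTML, as stored by the API) to plain text.
    <p> tags become newlines so paragraph structure is roughly preserved."""
    raw = raw.replace("<p>", "\n")
    parser = _TextExtractor()
    parser.feed(raw)
    return "".join(parser.parts).strip()

print(strip_comment_html('I use <i>BigQuery</i>&#x2F;SQL.<p>Second paragraph.'))
```

This is only a sketch; for heavy-duty cleaning you'd probably reach for a real HTML library.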

Felipe Hoffa (who uploaded the dataset) has links to the dataset along with some sample Python code for analysis: https://github.com/fhoffa/notebooks/blob/master/analyzing%20hacker%20news.ipynb

I have a few scripts for downloading all the data manually (https://github.com/minimaxir/get-all-hacker-news-submissions-comments ), but the BigQuery table may be more pragmatic. I can do some more sample queries if requested. (See my BigQuery tutorial for Reddit data: http://minimaxir.com/2015/10/reddit-bigquery/ )

Hi! I'm the one that loaded this dataset into BigQuery. Feel free to ask any questions :).

The notebook with sample queries and visualizations:


Great stuff! It looks like the whole dataset was imported two days ago. If you can stream it daily, I can use a table decorator and avoid processing all the records.

There are items in there that are deleted/[dead] on HN (and not very recent). How come?

You can see dead things on HN if you turn on "showdead" in your account settings.

What is the "comment_ranking" data you mention in the notebook?

Hacker News chose to hide comment scores some time ago, but I still wanted a way to rank comments. The good news is that the API gives you a "kids" column that lists comments in the order they should be displayed - that's how I find the top comment for each post (as shown in the linked notebook).
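A sketch of how that ordering can be used, assuming items shaped like the HN API's JSON responses (the helper names and toy data here are hypothetical):

```python
def ranked_comments(item, items_by_id):
    """Return an item's direct comments in HN display order.
    The API's "kids" field already lists child ids ranked as HN displays
    them, so no score column is needed."""
    return [items_by_id[k] for k in item.get("kids", []) if k in items_by_id]

def top_comment(item, items_by_id):
    """The first kid is the top-ranked comment (or None if there are none)."""
    kids = ranked_comments(item, items_by_id)
    return kids[0] if kids else None

# Made-up items keyed by id, shaped like HN API responses:
items = {
    1: {"id": 1, "type": "story", "kids": [3, 2]},
    2: {"id": 2, "type": "comment", "text": "second-ranked"},
    3: {"id": 3, "type": "comment", "text": "top-ranked"},
}
print(top_comment(items[1], items)["text"])  # the first kid wins
```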

Is that column independent of "gravity"?

I guess it incorporates gravity - it's whichever way Hacker News tells its clients to rank the comments by.

This makes a really nice introduction to BigQuery (which is to say: BigQuery is nicely discoverable, given an easy-to-understand dataset).

Is there a good way to find the story to which a comment belongs? This dataset raises the issue of recursive queries (e.g. "WITH RECURSIVE" in SQLite or PostgreSQL, or "CONNECT BY" in Oracle). The only approach I see in BigQuery is specifying a fixed level with something scary like:

  SELECT p0.text, s.id, s.title
  FROM [fh-bigquery:hackernews.comments] p0
  JOIN EACH [fh-bigquery:hackernews.comments] p1 ON p1.id=p0.parent
  JOIN EACH [fh-bigquery:hackernews.comments] p2 ON p2.id=p1.parent
  JOIN EACH [fh-bigquery:hackernews.comments] p3 ON p3.id=p2.parent
  JOIN EACH [fh-bigquery:hackernews.comments] p4 ON p4.id=p3.parent
  JOIN EACH [fh-bigquery:hackernews.stories] s ON s.id=p4.parent
  WHERE REGEXP_MATCH(p0.text, '(?i)bigquery')
  ORDER BY p0.time DESC
For this particular dataset, linking each comment to its story might be a good denormalization.
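For comparison, in an engine that does support recursive queries, the walk to any depth is a single query. A toy version of the SQLite "WITH RECURSIVE" form mentioned above, run through Python's stdlib `sqlite3` (the tables and rows are made up, mirroring the BigQuery schema's id/parent columns):

```python
import sqlite3

# Toy schema mirroring the BigQuery tables: comments(id, parent), stories(id, title).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE stories (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, parent INTEGER);
    INSERT INTO stories VALUES (1, 'Tell HN: dataset on BigQuery');
    INSERT INTO comments VALUES (10, 1), (11, 10), (12, 11);  -- a 3-deep thread
""")

# Follow parent pointers upward from comment 12 until a story id is reached,
# at whatever depth that happens.
row = con.execute("""
    WITH RECURSIVE chain(id, parent) AS (
        SELECT id, parent FROM comments WHERE id = 12
        UNION ALL
        SELECT c.id, c.parent FROM comments c JOIN chain ON c.id = chain.parent
    )
    SELECT s.id, s.title FROM chain JOIN stories s ON s.id = chain.parent
""").fetchone()
print(row)  # (1, 'Tell HN: dataset on BigQuery')
```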

You are right - I'll prepare a new release with that data.

My oversight, sorry! :)

Not an oversight — just a different use case for the data! And I wasn't sure if BigQuery had a generic approach here, but it looks like not.

Btw, I really like your query.

I modified it to get the story for up to 7 levels of recursion:

  SELECT p0.id, s.id, s.title, level
  FROM (
    SELECT p0.id, p0.parent, p2.id, p3.id, p4.id, COALESCE(p7.parent, p6.parent, p5.parent, p4.parent, p3.parent, p2.parent, p1.parent, p0.parent) story_id,
           GREATEST(IF(p7.parent IS null, -1, 7), IF(p6.parent IS null, -1, 6), IF(p5.parent IS null, -1, 5), IF(p4.parent IS null, -1, 4), IF(p3.parent IS null, -1, 3),
                    IF(p2.parent IS null, -1, 2), IF(p1.parent IS null, -1, 1), 0) level
    FROM    [fh-bigquery:hackernews.comments] p0
    LEFT JOIN EACH [fh-bigquery:hackernews.comments] p1 ON p1.id=p0.parent
    LEFT JOIN EACH [fh-bigquery:hackernews.comments] p2 ON p2.id=p1.parent
    LEFT JOIN EACH [fh-bigquery:hackernews.comments] p3 ON p3.id=p2.parent
    LEFT JOIN EACH [fh-bigquery:hackernews.comments] p4 ON p4.id=p3.parent
    LEFT JOIN EACH [fh-bigquery:hackernews.comments] p5 ON p5.id=p4.parent
    LEFT JOIN EACH [fh-bigquery:hackernews.comments] p6 ON p6.id=p5.parent
    LEFT JOIN EACH [fh-bigquery:hackernews.comments] p7 ON p7.id=p6.parent
    HAVING level=0
    LIMIT 100
  ) a
  LEFT JOIN EACH [fh-bigquery:hackernews.stories] s
  ON s.id=a.story_id

(having so many left joins consumes a lot of resources, so to run it at scale I would look for a different strategy)
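One such strategy, sketched client-side: export the id/parent pairs and chase parents iteratively, which handles any depth without stacked self-joins (the function name and data below are hypothetical):

```python
def story_of(comment_id, parent_of, story_ids):
    """Follow parent pointers until a story id is reached.
    parent_of maps comment id -> parent id; story_ids is the set of story ids.
    Raises KeyError if the parent chain is broken (e.g. a deleted item)."""
    node = comment_id
    while node not in story_ids:
        node = parent_of[node]
    return node

# A made-up 3-deep comment chain hanging off story 1:
parent_of = {12: 11, 11: 10, 10: 1}
print(story_of(12, parent_of, {1}))  # 1
```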

Very cool, thanks! Looking forward to playing around with this.

FYI, you named a column with a reserved SQL keyword ('by'). For future reference, and for others reading this: this is bad database design and makes the table harder to use. You can work around it by wrapping the column name in brackets, like:

>select ... where [by] = ...

I didn't name the column "by", I just gave the field the name that the API uses for it.

And to make everyone's lives easier (including mine), I copied the [by] column to an [author] column, so you can do a

>select ... where author = ...

instead :)

Oh, at first I thought there was some difference, so I ran this:

  SELECT * FROM [fh-bigquery:hackernews.stories] WHERE [by] != author

Nice trove to pore through when I find the time.

I like to use Twitter to analyze HN datasets. It's mostly limited to links, because links are what I'm mostly after.

https://twitter.com/newsycombinator https://twitter.com/HackerNews ... and a few other accounts. Try to avoid Bitly-wrapped links.

Use something like Greptweet to harvest the tweets and parse out any noise.

That will not get you accurate results for HN data analysis, since a) those accounts only tweet important links, so the analysis will be biased, and b) you can only get the most recent 3,200 tweets per account (a Twitter API limitation).

You have to look at both the good and the bad.

I hear you. Raw unfiltered links always have hidden gems.

One thing though: Greptweet has an archive somewhere with a huge trove of tweets that users of the service have searched for, which were thus logged and kept. (Some even go over the 3,200 limit.) It's a massive tarball, so set aside time to download it and parse out boring/noisy links.

A lot of HN links are tech-press posts that consist of hearsay and merely proxy the thoughts of others. The recent shift on HN toward more academia-style posts is refreshing.

This is really cool; thanks! Not exactly sure what I'll do with it (honestly, probably nothing), but it's cool regardless, especially since I didn't know about BigQuery before.

Huge fan of BQ; it's the best product Google Cloud has that AWS doesn't. Excited for its future.

Many thanks! Could you put up a magnet link so we can download the data ourselves? It would also be useful to have the submission id as a field in the comment table.

If anyone wants to play with this dataset and easily share their queries or visualize the data, you're welcome to use re:dash's demo instance [1]. I've also created an example query [2].

[1] http://demo.redash.io [2] http://demo.redash.io/queries/667/source#table


In what way is it similar?

The parent comment was deleted, but given the grandparent comment I'll guess you're asking how BigQuery differs from Redshift?

There is an ongoing conversation about this on reddit: https://www.reddit.com/r/bigdata/comments/3jnam1/whats_your_...

The deleted comment asked how the BigQuery dataset was different from http://hn.algolia.com

Bad guess then!

The answer is that the dataset must be very similar to the Algolia one, since both get their data from the same source: the Hacker News official API on Firebase.

(but Algolia keeps it up-to-date in realtime, while I haven't written anything to keep the BigQuery one updated - yet)

Is this OK from a privacy perspective?

Why wouldn't it be? All the data provided by the HN API is public.

Yes, when you submit a comment to HN you intend for it to be published on the HN site. However, using comments for text mining purposes is a different thing and in principle consent is required to do that. The default case is that user submitted content remains copyrighted by the author, with an implied consent for the site to publish the material on the site. There can be an explicit license such as with Wikipedia or StackOverflow; in that case text mining may be explicitly allowed.

It would be a godsend to the #hnwatch crowd looking for sentences to take out of context.

Except for email addresses, there is nothing private on HN to raise a privacy issue. The public internet is written in ink.

Perhaps not in the content. But I'm not so sure there aren't actual legal issues (in jurisdictions that have privacy/data-protection laws) around things like being able to see patterns in login times (based on post metadata), and similar. It's tenuous - but just because things are a) online in the form of the main HN site, and b) accessible through the API, doesn't automatically mean that giving raw access to the data carries no implications for whether the data should be considered personal or (more) sensitive.

I'm assuming this dataset offers something the API doesn't (that's the point of uploading it, right? Other than being an easy way for others to "back up" the HN data) - and that can very well mean there are privacy/personal-data implications.

I'm mostly stating this on principle, not because I think there's anything wrong with this particular upload of HN data (and wrong != illegal anyway).

As an example, at least in Norway, while you may have two separate databases that are linked to people, such as a database of cars passing through toll booths, and credit card charges from a large bar/restaurant chain -- it would generally be illegal to link the two.

It's sort of a moot point. It's not especially difficult to scrape HN and compile a dataset like this for yourself.

It's a little like saying it's easy enough to make your own wine and sell it to minors. It is, but that doesn't make it legal.

No, it's more like the wine is already available for everyone to buy (via HTTP), and people are arguing that perhaps the bottled version (via a non-HTTP protocol) should be age-limited while keeping the non-bottled version as is.

I don't see how difficulty has any bearing on privacy or copyright issues.

All of the GitHub event data is on BigQuery as well. https://www.githubarchive.org/#bigquery

Now for someone to analyze GitHub <-> HN correlations.

I joined GitHub and Hacker News here - how many stars after a front page post?
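The shape of that join, sketched in Python with made-up rows (matching HN story URLs against GitHub repo URLs is my assumption about how the linking works; the real query would join the two BigQuery datasets):

```python
# Hypothetical front-page HN stories and GitHub repos, joined on repository URL
# to ask "how many stars after a front page post?" (all numbers invented).
hn_stories = [
    {"url": "https://github.com/fhoffa/notebooks", "points": 238},
    {"url": "https://example.com/post", "points": 50},  # not a GitHub link
]
gh_repos = {
    "https://github.com/fhoffa/notebooks": {"stars": 120},
}

joined = [
    {"url": s["url"], "points": s["points"], "stars": gh_repos[s["url"]]["stars"]}
    for s in hn_stories
    if s["url"] in gh_repos  # inner join: keep only stories that are repos
]
print(joined)
```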



I'd be interested to see a list of people who submit a lot. I submit too much - about one submission per day - and I'm curious what percentile that puts me in.
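Once per-user submission counts are pulled from the dataset, the percentile itself is simple; a sketch with invented counts (the helper name is mine):

```python
def percentile_of(value, population):
    """Percentage of the population strictly below `value`.
    Feed it per-user submission counts (e.g. submissions per year) from
    a GROUP BY query over the stories table."""
    below = sum(1 for v in population if v < value)
    return 100.0 * below / len(population)

# Made-up per-user yearly submission counts; 365 ~ one submission per day.
counts = [1, 2, 3, 5, 8, 40, 365]
print(percentile_of(365, counts))
```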

Thank you for this, I think it's great!
