
Hacker News BigQuery Dataset - svdr
https://console.cloud.google.com/marketplace/details/y-combinator/hacker-news
======
fhoffa
Hi, Felipe Hoffa at Google here.

We're aware the dataset hasn't been updated for about a month, and we are
working to fix it. You can track the issue here:

\-
[https://issuetracker.google.com/issues/127132286](https://issuetracker.google.com/issues/127132286)

In the meantime you can still play with the dataset and dig into the full
history of Hacker News, minus this last month. I left some interesting queries
to get you started here:

\- [https://medium.com/@hoffa/hacker-news-on-bigquery-now-
with-d...](https://medium.com/@hoffa/hacker-news-on-bigquery-now-with-daily-
updates-so-what-are-the-top-domains-963d3c68b2e2)

~~~
wodenokoto
How do you run the import? I'd love to read more about how you consume the
data.

~~~
physcab
I would also like to know where the data comes from!

------
danielecook
I've been using the HN API to maintain a BigQuery table of all posts,
comments, and URLs on HN for a while now. I use it to put this site together:
[https://hntrending.com/](https://hntrending.com/). BQ is awesome.

It's a side project, so it may have some issues!
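
For a flavor of what this kind of table makes easy, here's a minimal sketch
of a trending-domains query (the 30-day window and the 25-row cutoff are
arbitrary illustrative choices, not the query the site actually runs):

    
    
        #standardSQL
        -- Top domains by total score among recent stories
        SELECT
          NET.HOST(url) AS domain,
          COUNT(*) AS num_stories,
          SUM(score) AS total_score
        FROM `bigquery-public-data.hacker_news.full`
        WHERE type = 'story'
          AND url IS NOT NULL
          AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
        GROUP BY domain
        ORDER BY total_score DESC
        LIMIT 25
    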

------
minimaxir
Looks like it stopped updating as of February 2nd, but otherwise it's pretty
reliable, and as noted in the description, it's free (you probably won't hit
the 1 TB/month free query limit working with this dataset). Here are a few
queries I've done recently to get exact answers to ad-hoc questions:

Top posts about bootstrapping
([https://news.ycombinator.com/item?id=19258249](https://news.ycombinator.com/item?id=19258249)):

    
    
        #standardSQL
        SELECT *
        FROM `bigquery-public-data.hacker_news.full`
        WHERE REGEXP_CONTAINS(title, '[Bb]ootstrap')
        ORDER BY score DESC
        LIMIT 100
    

Count of YC startup posts over time by month
([https://news.ycombinator.com/item?id=19185946](https://news.ycombinator.com/item?id=19185946)):

    
    
        #standardSQL
        SELECT TIMESTAMP_TRUNC(timestamp, MONTH) as month_posted,
        COUNT(*) as num_posts_gte_5
        FROM `bigquery-public-data.hacker_news.full`
        WHERE REGEXP_CONTAINS(title, 'YC [S|W][0-9]{2}')
        AND score >= 5
        AND timestamp >= '2015-01-01'
        GROUP BY 1
        ORDER BY 1

~~~
shanecglass
Hey all,

I manage the BigQuery Public Datasets Program here at Google. You're right,
the dataset was last updated February 2nd, but we intend to continue updating
it. We had an issue on our end that disrupted our update feed, but we're
working to repair it now and get the latest data uploaded to BigQuery.

~~~
minimaxir
Great to hear! :)

------
cobookman
Top commenters of all time: tptacek is in first place with 33839 comments.

Hacker News is 12 years old, so that's an average of over 7 comments per day
since inception (33839 / (12 × 365) ≈ 7.7). Wow

    
    
        #standardSQL
        SELECT
          author,
          COUNT(DISTINCT id) AS num_comments
        FROM `bigquery-public-data.hacker_news.comments`
        WHERE id IS NOT NULL
        GROUP BY author
        ORDER BY num_comments DESC
        LIMIT 100;

~~~
minimaxir
Don't use the `comments` table: it was last updated in December 2017.
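
You can check a table's freshness directly before relying on it. A quick
sketch (I'm assuming the legacy table's `time_ts` timestamp column here):

    
    
        #standardSQL
        -- Newest row in each table
        SELECT
          (SELECT MAX(time_ts)
           FROM `bigquery-public-data.hacker_news.comments`) AS comments_latest,
          (SELECT MAX(timestamp)
           FROM `bigquery-public-data.hacker_news.full`) AS full_latest
    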

On the full table:

    
    
        #standardSQL
        SELECT
         `by`,
         COUNT(DISTINCT id) as `num_comments`
        FROM `bigquery-public-data.hacker_news.full`
        WHERE id IS NOT NULL AND `by` != ''
        AND type='comment'
        GROUP BY 1
        ORDER BY num_comments DESC
        LIMIT 100
    

tptacek is in first place with 47283 comments.

------
sbr464
I added a simple API endpoint to access favorites on HN, since they aren’t
available in the official API.

[https://github.com/reactual/hacker-news-favorites-
api](https://github.com/reactual/hacker-news-favorites-api)

------
vinnyglennon
[https://hnify.com/leaderboard.html](https://hnify.com/leaderboard.html) uses
this dataset too. It's amazing to have so much data freely available to play
with.

------
refrigerator
Last year I built a domain leaderboard based on this dataset:
[https://hnleaderboard.com](https://hnleaderboard.com) — planning to update
for 2019 soon!

------
lettergram
I’m actually fairly excited to learn about this. I painstakingly scraped HN
to build:

[https://hnprofile.com/](https://hnprofile.com/)

I’m excited about this alternative.

------
fsiefken
Is there a way to download the dataset and query it locally, for example in
PostgreSQL or SQLite? How big is the database? Around 4 GB compressed?
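
One route, as a sketch: BigQuery's EXPORT DATA statement can stage query
results into a Cloud Storage bucket you own as compressed CSV, which you can
then download and load into PostgreSQL or SQLite (the bucket name below is
hypothetical):

    
    
        #standardSQL
        -- Stage the full table to Cloud Storage as gzipped CSV shards
        EXPORT DATA OPTIONS(
          uri='gs://my-bucket/hn/full-*.csv.gz',  -- hypothetical bucket
          format='CSV',
          compression='GZIP',
          overwrite=true,
          header=true
        ) AS
        SELECT id, type, `by`, timestamp, score, title, url, text
        FROM `bigquery-public-data.hacker_news.full`
    

From there, gsutil (or the Cloud Console) can pull the shards down for local
import.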

------
tobr
A dataset like this is going to have a bunch of personal information in it.
When it’s distributed like this, how does that jibe with regulations like the
GDPR? If an HN user would like to delete all their comments, how would that
request be forwarded to every user of this dataset?

~~~
petercooper
If people are sharing their _own_ PII in HN comments, they agreed to HN's T&Cs
when signing up. Such T&Cs state (heavily trimmed for length):

 _By uploading any User Content you hereby grant [..] a nonexclusive,
worldwide [..] irrevocable license to [..] distribute [..] your User Content
for any Y Combinator-related purpose in any form [..]_

Agreeing to the T&Cs and deliberately sharing information publicly covers the
GDPR's "consent" lawful basis.

Even under the GDPR this is not a situation where someone signed up for
something _else_ and then happened to have their personal data shared as a
byproduct. They signed up to a site, agreed to the T&Cs, and then explicitly
and deliberately shared their personal data.

~~~
CobrastanJorji
The GDPR's super complicated, but I'm pretty sure its right to erasure, and
specifically Article 7(3), which gives data subjects the right to withdraw
consent at any time and requires that "it shall be as easy to withdraw
consent as to give it", trumps any ridiculous "irrevocable" license to
distribute your content in any form forever.

Also importantly, the GDPR requires that a controller not make a service
conditional upon consent. Hacker News is likely not in compliance unless it
makes such data processing optional and requires anyone interested to
explicitly opt in.

But, then again, I'm not a lawyer, and even if I were, actual lawyers don't
seem to know what the hell the GDPR actually requires either.

~~~
dasil003
IANAL, and I don't mean to single you out here, but this seemingly rational
argument strikes me as subtle FUD. It's the type of argument that someone with
a vested interest in collecting user data for profit might put forth in the
hopes of polarizing the tech community and painting the GDPR as out of touch
with technical common sense.

Again, I'm not accusing you of anything here, I'm just pointing out who
benefits from framing the conversation this way. So far there is a lot of
precedent for small operators shutting down their sites out of _fear_ of the
GDPR, but there is no precedent for regulators having gone after small
operators for anything resembling reasonable practices. The day may come when
EU regulators try to crack down on forums that are unwilling or unable to
redact users' messages post facto, but we're nowhere close to that today and
I don't see strong reason to believe that's where we're headed either.

~~~
tobr
What about this is out of touch with technical common sense? If you export all
your user data and syndicate it, why would it be so unreasonable to have a
system in place for syndicating deletion requests as well?

All of us here are users of this forum, so this concerns the legal rights to
our personal information. It’s not FUD for us to discuss how those rights are
affected by things like this.

~~~
dasil003
Let's leave syndication aside for a moment; I don't think it's unreasonable
for a forum to have terms of use stating that you are participating in a
public forum that needs to maintain its integrity. If people just go around
deleting their posts, it screws up the public discourse. I've actually had the
experience of building and running a forum that allowed deleting your content
in this way, and we had to remove the capability because trolls used it in a
specifically destructive way.

Now this position is certainly debatable, but I think it's at least a
reasonable argument that you could take to regulators. Contrast that with the
bullshit that Facebook, Google and a zillion ad-tech companies are doing with
our data every day. You're free to object to the syndication of HN data, but
personally I feel that is a distraction from the issues GDPR is meant to
address, and I am hoping regulators feel the same way.

------
cannabisfarmer
BigQuery keeps adding useless data.

What we truly need is common crawl data then we can check specific site on our
own.

Or wait, BigQuery simply can't handle common crawl size dataset in their
public service!

Otherwise there is no reason to not add it, maybe it puts their search
engine/ad business in geoparady.

Is there any other Google public dataset BigQuery like platform? Where their
direct search engine/ad platform interests don't get in way of Common Crawl
like data searching/indexing?

~~~
mayank
> Or wait, maybe BigQuery simply can't handle a Common Crawl-sized dataset in
> their public service!

This is not true. Source: ex-googler.

