
Hacker News on BigQuery: Now with daily updates. Top domains and time to post? - fhoffa
https://medium.com/@hoffa/hacker-news-on-bigquery-now-with-daily-updates-so-what-are-the-top-domains-963d3c68b2e2#.mkt08mped
======
minimaxir
A few comments on the Hacker News data (i.e., why I haven't played with the
data in a while):

1) The algorithm changed recently. This post uses >40pts as a proxy for front
pageness. That's too conservative; even my 10pt threshold back then was
conservative. With recent algorithm changes to Hacker News (<1 yr), I've seen
posts with _3pts_ get into the Top 10 for whatever reason, which breaks
predictive analysis.

2) The dataset (and this submission) only includes submissions and submission
scores; comment scores were removed from the API, which is disappointing.

3) Given that HN titles/links can be edited by moderators (and they do a good
job), it's harder to judge initial submissions from the final result.

4) Slight edge case in the article, but link shorteners are auto-killed, which
is why youtu.be/goo.gl links are not prominent.

~~~
fhoffa
You are the expert! (and I linked to one of your posts in the article too)

Replying to the comments:

1) Yes, it doesn't take many votes to reach the front page, but once a post
gets established there I expect it to gather more upvotes (and then cross the
40-point threshold). 85% of posts on the front page right now have >30
upvotes, and I expect them to keep going up.

2) We don't have comment scores, but we do have them ranked. So you could
assign them points based on their position relative to the other comments and
on their parent's position. Looking forward to the experiment :).

3) Yup, titles matter a lot, and unfortunately we don't know what titles were
modified after the fact.

4) Thanks for the clarification!

Thinking about 2) - as you comment a lot and you can see your own comment
scores, we could train a model that goes from rank/time to score.
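
A rough sketch of the rank-based idea in 2), not a tested model: each comment
gets a proxy score from its position among its siblings, scaled down by how
deep it sits under its parent. The HN API lists a comment's children in ranked
display order in the `kids` field; the function name `rank_points`, the
`1 / (position + 1)` weighting and the `decay` factor are all assumptions made
up for illustration.

```python
from typing import Dict, List, Optional


def rank_points(
    children_by_parent: Dict[Optional[int], List[int]],
    decay: float = 0.5,
) -> Dict[int, float]:
    """Assign each comment a proxy score from its rank and its parent's score.

    children_by_parent maps a parent id (None for the story itself) to its
    child comment ids in ranked display order. The weights are arbitrary
    assumptions, not anything Hacker News publishes.
    """
    points: Dict[int, float] = {}

    def visit(parent: Optional[int], base: float) -> None:
        for position, comment_id in enumerate(children_by_parent.get(parent, [])):
            # The first sibling gets the full base, the second half of it, etc.
            points[comment_id] = base / (position + 1)
            # Replies can earn at most a decayed share of their parent's points.
            visit(comment_id, points[comment_id] * decay)

    visit(None, 1.0)
    return points


# Hypothetical thread: comments 1 and 2 are top-level, 3 and 4 reply to 1.
thread = {None: [1, 2], 1: [3, 4]}
proxy_scores = rank_points(thread)
```

Fitting the real rank/time-to-score model would still need known scores (for
example, a commenter's own) to calibrate these weights.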

~~~
fhoffa
Data points: So this post made it to the front page with few votes, but 2
hours later it dropped off with 31 votes.

Lessons:

\- Maybe don't post on a Friday afternoon.

\- Maybe don't post from medium.com (low chances).

\- The >40 score choice seems sound for looking at posts with front-page
permanence.
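
To make the >40 cutoff concrete, here is a minimal sketch of counting domains
among posts above a score threshold. It assumes each post is a dict with the
`url` and `score` fields that HN API items carry; the helper name
`top_domains` and the sample rows are made up, and this is not the query used
in the article.

```python
from collections import Counter
from urllib.parse import urlparse


def top_domains(posts, min_score=40, n=10):
    """Count domains among posts scoring strictly above min_score."""
    counts = Counter()
    for post in posts:
        url = post.get("url")
        # Skip text-only posts (no url) and posts at or below the cutoff.
        if not url or post.get("score", 0) <= min_score:
            continue
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        counts[host] += 1
    return counts.most_common(n)


# Made-up sample rows, only to show the shape of the input.
sample_posts = [
    {"url": "https://www.github.com/a", "score": 50},
    {"url": "https://github.com/b", "score": 41},
    {"url": "https://medium.com/x", "score": 31},
    {"url": "https://medium.com/y", "score": 100},
    {"score": 200},
]
ranking = top_domains(sample_posts)
```

Lowering `min_score` to 10 or 3 would show how sensitive the ranking is to the
choice of front-page proxy.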

~~~
minimaxir
\- Meta-submissions receive an extra penalty.

Another reason why analysis is tricky, as such submissions are hard to
algorithmically identify.

If manual penalties were public, they could be accounted for in a model.
That's another reason why I like Reddit data better; fewer unknown penalties.

------
koolba
How does the data get to BigQuery? Anything special/fun or just repeatedly
polling the API endpoint?

~~~
fhoffa
I'm not sure if we are ready to document the process, but I can tell you that
having an API based on Firebase helps a lot:

\- [https://github.com/HackerNews/API](https://github.com/HackerNews/API)
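
Not our pipeline, just a sketch of the plain "repeatedly poll the endpoint"
option using the `maxitem` and `item/<id>` endpoints documented in that repo.
The function names and the `last_seen_id` bookkeeping are assumptions for
illustration.

```python
import json
from typing import Callable, Iterator
from urllib.request import urlopen

BASE_URL = "https://hacker-news.firebaseio.com/v0"


def fetch_json(path: str):
    """GET one JSON document from the public HN API."""
    with urlopen(f"{BASE_URL}/{path}.json", timeout=10) as response:
        return json.load(response)


def poll_new_items(last_seen_id: int, fetch: Callable = fetch_json) -> Iterator[dict]:
    """Yield every item created after last_seen_id, in id order."""
    max_id = fetch("maxitem")
    for item_id in range(last_seen_id + 1, max_id + 1):
        item = fetch(f"item/{item_id}")
        # Some ids come back as null; skip those.
        if item is not None:
            yield item
```

A caller would store the highest id it has seen and pass it back in on the
next run; `fetch` is injectable so the loop can be tried without network
access.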

~~~
koolba
I'd be interested in reading up on it if you ever do write it up.

I put something similar together (for HN data). One hitch is tracking
historical changes: I don't think you can get a raw feed of them, so you're
stuck with polling a historical range to ensure you get the latest version of
a comment.
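
Roughly what that re-polling looks like, as a simplified sketch rather than my
actual code: re-fetch a range of ids and overwrite the stored copy whenever it
differs. `refresh_range`, `store` and `fetch` are hypothetical names; `fetch`
stands in for whatever calls the item endpoint.

```python
from typing import Callable, Dict, Iterable, List, Optional


def refresh_range(
    store: Dict[int, dict],
    item_ids: Iterable[int],
    fetch: Callable[[int], Optional[dict]],
) -> List[int]:
    """Re-fetch item_ids and keep the latest version; return the ids that changed."""
    changed = []
    for item_id in item_ids:
        latest = fetch(item_id)
        if latest is None:
            continue  # nothing returned for this id, keep what we have
        if store.get(item_id) != latest:
            store[item_id] = latest  # new item or edited since the last poll
            changed.append(item_id)
    return changed


# Made-up data: item 1 was edited, item 2 is unchanged, item 3 is new.
store = {1: {"id": 1, "text": "old"}, 2: {"id": 2, "text": "same"}}
remote = {
    1: {"id": 1, "text": "new"},
    2: {"id": 2, "text": "same"},
    3: {"id": 3, "text": "fresh"},
}
changed_ids = refresh_range(store, range(1, 5), remote.get)
```

The cost is one request per id per pass, which is why the size of the
historical window matters.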

~~~
fhoffa
We need to test this - but I'm almost sure we have the latest version of each
comment in BigQuery.

