

Ask PG: HN Ngram Viewer? - zissou

Since writing a scraper to discover and parse all historical comments/submissions on HN would obviously get me in trouble, would the HN admins be willing to provide a dump of the historical text/metadata from all comments and [local] submissions so I can make an HN Ngram Viewer for the HN public?

I work in an academic lab where I'm one of the developers of a system that generates ngram viewers from large corpora of text, which we call "Bookworms". Here are a few Bookworms we've created:

arXiv scientific publications: http://bookworm.culturomics.org/arxiv/

US Congress legislation: http://bookworm.culturomics.org/congress/

Open Library books: http://bookworm.culturomics.org/OL/

Chronicling America historical newspapers: http://bookworm.culturomics.org/ChronAm/

Social Science Research Network research paper abstracts: http://bookworm.culturomics.org/ssrn/

We have more Bookworms in the pipeline, including historical legislation in the UK and a massive corpus of texts (70MM+ documents) from the National Library of Australia (Trove) spanning multiple centuries. A new GUI for all our Bookworms will also be rolling out shortly. (Preview: http://bookworm.culturomics.org/new_gui_teaser.png)

In my opinion, HN would be an awesome candidate for an ngram viewer because there are so many subsets of topics that come/go/stay here, such as the frequency of discussions about web technologies, programming languages, companies/services, the NSA, etc.

If this is something the HN admins would be interested in, I'd be happy to put it together. If a privacy agreement is desired before handing off any bulk data, that's not a problem, as we've gone this route before, albeit only for private ngram viewers we've created for companies, like the NYT, to use internally.
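At its core, building an ngram viewer means tokenizing each document, counting n-gram occurrences, and grouping the counts along a time axis. A minimal sketch in plain Python (function names and data layout are illustrative only, not Bookworm's actual code):

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Yield successive n-grams (space-joined) from a token list."""
    return (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def build_index(docs, n=1):
    """Count n-gram occurrences per year from (year, text) pairs."""
    index = defaultdict(Counter)  # year -> Counter mapping ngram -> count
    for year, text in docs:
        tokens = text.lower().split()
        index[year].update(ngrams(tokens, n))
    return index

def frequency(index, ngram):
    """Relative frequency of one ngram per year -- the curve a viewer plots."""
    return {year: counts[ngram] / total
            for year, counts in sorted(index.items())
            if (total := sum(counts.values()))}
```

A real system would normalize tokenization and store counts in a database rather than in memory, but the per-period relative frequency above is the quantity being charted.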
======
kogir
The Octopart team has done a great job with HNSearch, and we really appreciate
the huge favor they've done us by providing it. That said, due to limitations
on our end around how they integrate with us, they're not able to offer real-
time updates or full fidelity ranking snapshots.

I'm working on a more comprehensive first-party API for HN, and plan to
implement the following, in this order:

      1) Near-real-time profiles, comments, and stories as JSON.
      2) Real-time streaming of profile and item changes.
      3) Near-real-time ranking of comments and stories.
      4) Real-time streaming of ranking changes.
      5) History of ranking changes.

Sadly, I can't commit to any firm timeline for future progress right now, but
know that I'm working on it :)
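Once items are exposed as JSON, consuming them might look something like the sketch below. The field names ('id', 'type', 'by', 'time', 'text') are assumptions about what a first-party API could return, since the API described above doesn't exist yet:

```python
import json

def parse_item(raw):
    """Parse a JSON-encoded HN item into a normalized dict.

    The input field names are hypothetical -- adjust them to whatever
    the eventual first-party API actually returns.
    """
    item = json.loads(raw)
    return {
        "id": int(item["id"]),
        "type": item.get("type", "story"),
        "author": item.get("by"),
        "created": item.get("time"),  # assumed to be a Unix timestamp
        "text": item.get("text", ""),
    }
```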

-- Edit: Removed link to broken data file. Fixing it up tomorrow.

~~~
zissou
Thanks for the info and I do look forward to the new API!

However, my question still remains with regard to historic posts/comments. The
historic aspect is really the important element here. Generally speaking,
building an ngram viewer requires a collection of texts over time, with each
text having some kind of metadata that is categorical, boolean, datetime, or
numeric. Categorical data can always be made from numeric data by creating
bins -- e.g. posts by people with karma or a ranking of 1-50, 51-150, 151-300,
etc. at the time the comment/post was created. Datetimes can also be made into
useful categorical variables for an ngram viewer, such as day of the week (to
spot weekly seasonality trends) or day of the year (annual seasonality
trends).
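To make the binning concrete, here is a small sketch assuming each comment carries a karma score and a Unix timestamp; the bin edges are just the example ranges from the paragraph above:

```python
from bisect import bisect_right
from datetime import datetime, timezone

KARMA_EDGES = [50, 150, 300]  # bins: 1-50, 51-150, 151-300, 300+
KARMA_LABELS = ["1-50", "51-150", "151-300", "300+"]

def karma_bin(karma):
    """Map a numeric karma score to a categorical bin label."""
    return KARMA_LABELS[bisect_right(KARMA_EDGES, karma - 1)]

def time_features(unix_ts):
    """Derive categorical time variables useful for seasonality plots."""
    dt = datetime.fromtimestamp(unix_ts, tz=timezone.utc)
    return {"weekday": dt.strftime("%A"),
            "day_of_year": dt.timetuple().tm_yday}
```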

If I were allowed, I would be willing to write a scraper/crawler to discover
as many historic threads (since: threads -> comments) as possible using
HNSearch, but this could take a long time depending on rate limits and/or be
subject to unknown biases in my discovery method. I'm sure you can understand
why a "top-down" approach like a database dump would make for a much
higher-quality corpus than attempting the "bottom-up" approach of a crawler. I
have no idea whether a "database dump of everything" is even feasible, as I
don't know anything about HN's backend infrastructure. However, if it is
feasible, then I'm certain that I can work with whatever would be available.
Adding structure to unstructured data is my bread and butter.

I really think this would be a very cool tool that a lot of people would
enjoy, so I'm willing to do what is needed on my end to help make it work.
After all, I'd be on the clock while working on this rather than treating it
as just a hobby project, so the incentives are definitely aligned on my end.

If you want to discuss anything in private, I can be reached at the following
_reversed_ address: moc{dot}liamg{at}yalkcin{dot}wehttam

------
kristianp
You could look into
[https://www.hnsearch.com/api](https://www.hnsearch.com/api) . They provide
the search bar functionality on this site.
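For bulk discovery, a paginated query builder for HNSearch might look like the sketch below. The base endpoint and parameter names are assumptions from memory of the ThriftDB-backed service; check the docs at hnsearch.com/api for the real paths and limits before relying on them:

```python
from urllib.parse import urlencode

# Assumed endpoint for the Octopart/ThriftDB-backed HNSearch API --
# verify against the official API docs before use.
BASE = "http://api.thriftdb.com/api.hnsearch.com/items/_search"

def search_url(query, start=0, limit=100):
    """Build a paginated HNSearch query URL (parameter names assumed)."""
    return BASE + "?" + urlencode({"q": query, "start": start, "limit": limit})
```

Walking `start` forward in steps of `limit` would enumerate matches, subject to whatever rate limits and result-window caps the service imposes.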

