
Show HN: A thin Python library to access HN data using Algolia's API - santiagobasulto
Hello community. Some time ago I was trying to create a project for my
students using Hacker News data. As you might know, HN offers an official
API [0], but it's based on Firebase and I felt its main use is building
clients rather than querying data.

I found out that Algolia also provides an official REST API [1]. It's
exactly what I needed: the ability to "search" HN, whether by keywords,
type of story (Show HN, Ask HN, etc.) and/or date.

So I created a thin Python wrapper on top of Algolia's Search API:
https://github.com/santiagobasulto/python-hacker-news

The library is in an early stage, but already usable. A few examples:

How to search posts from one user:

    results = search_by_date(
        author='pg',
        hits_per_page=1000)

How to search posts by type (this would find this same post):

    results = search_by_date(
        'thin python library',
        show_hn=True,
        hits_per_page=1000)

I'm working on implementing the other methods. If you have suggestions,
please bring them up!

[0] https://github.com/HackerNews/API

[1] https://hn.algolia.com/api
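For comparison, here is a minimal sketch of hitting Algolia's
search_by_date endpoint directly. The endpoint URL and the
"query"/"tags"/"hitsPerPage" parameters come from the Algolia HN API docs
[1]; the helper name is made up for illustration:

```python
from urllib.parse import urlencode

ALGOLIA_SEARCH_BY_DATE = "https://hn.algolia.com/api/v1/search_by_date"

def build_search_url(query=None, tags=(), hits_per_page=1000):
    """Compose a search_by_date URL for Algolia's HN API.

    `tags` is an iterable of Algolia tag filters such as "story",
    "show_hn", "ask_hn", or "author_pg"; comma-joined tags are ANDed.
    """
    params = {"hitsPerPage": hits_per_page}
    if query:
        params["query"] = query
    if tags:
        params["tags"] = ",".join(tags)
    return ALGOLIA_SEARCH_BY_DATE + "?" + urlencode(params)

# Rough equivalent of the show_hn example above; fetch the URL
# with any HTTP client and read the "hits" key of the JSON reply.
url = build_search_url("thin python library", tags=["show_hn"])
```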
======
minxomat
Awesome. I imagine this being useful for things that quickly check if
something exists on HN or watch for new items etc.

Though I think this needs clarification:

> but it's based on Firebase

The entire HN dataset is also available as a public BigQuery dataset, which
enables much more intricate queries. For example, the following query means
"Get all Show HNs with 5 or more points and 5 or more comments,
along with the decoded submission title and all decoded top-level comments
which are neither dead nor deleted" (and page):

    
    
        CREATE TEMPORARY FUNCTION
          HTML_DECODE(enc STRING)
          RETURNS STRING
          LANGUAGE js AS """
        var decodeHtmlEntity = function(str) {
          return str.replace(/&#(\\d+);/g, function(match, dec) {
            return String.fromCharCode(dec);
          }).replace(/&#x([a-fA-F0-9]+);/g, function(match, hex) {
            return String.fromCharCode(parseInt(hex, 16));
          });
        };
          try {
            return decodeHtmlEntity(enc);
          } catch (e) {
            return null;
          }
        """;
        WITH
          top_shows AS (
          SELECT id, HTML_DECODE(title) AS dtitle
          FROM `bigquery-public-data.hacker_news.stories`
          WHERE descendants >= 5 AND score >= 5 AND title LIKE "Show HN:%"),
          first_comments AS (
          SELECT
            parent,
            HTML_DECODE(REGEXP_REPLACE(text, r"(</?[a-z]+>)|(<a href=\")|(\" rel=\"nofollow\">.+?</a>)", " ")) AS dcomment
          FROM `bigquery-public-data.hacker_news.comments`
          WHERE dead IS NULL AND deleted IS NULL )
        SELECT
          top_shows.id, top_shows.dtitle, first_comments.dcomment
        FROM top_shows JOIN first_comments
        ON first_comments.parent = top_shows.id
        LIMIT 1000 OFFSET 0
    

So if you need to answer specific questions like this, which could return
500k+ rows, it's better to use BigQuery than to stress the API.

