
Show HN: Hndex.org – a full-text search engine of articles submitted to HN - mcovalt
https://hndex.org/
======
arawde
I was just telling a friend about a post I had read about buying cheap tools
to start out. I couldn't find it at all googling, and the HN search didn't
work. I saw this post, searched "cheap tools quality", and immediately found
it. Thank you!

The post in question: [https://www.johndcook.com/blog/2020/07/25/worst-tool-
for-the...](https://www.johndcook.com/blog/2020/07/25/worst-tool-for-the-job/)

~~~
haddr
where the best part actually happens in the comments under the HN submission of
this article (in short: the claim seems not to be valid for certain domains)

------
ALittleLight
I love this. The search box had an option to autocomplete terms I had recently
searched, I clicked on one (a term I was interested in, obviously) and
immediately found an interesting HackerNews post. Nice!

I think HackerNews and other link aggregators (e.g. reddit) have a kind of
recency problem, where there is a lot of great content, but people only see
the recent stuff. This seems like a great way to uncover some of the latent
value of old HackerNews content.

Is there a way to suggest content too? e.g. If you liked that, you'd probably
also like X, Y, and Z.

~~~
MattGaiser
>I think HackerNews and other link aggregators (e.g. reddit) have a kind of
recency problem, where there is a lot of great content, but people only see
the recent stuff.

With Reddit and HackerNews, I want a relative ranking index. I can search by
the top content, but something 5th today could have more votes than the top
submission of 2015 because of forum/subreddit growth.

I want them ranked by something like the ratio of views to upvotes or upvotes
compared to total upvotes for that day.
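A minimal sketch of the second idea (upvotes as a share of that day's total upvotes), so that stories from smaller and larger eras of the forum can be compared. The field names (`title`, `day`, `points`) and the sample numbers are hypothetical, not taken from any real HN or Reddit data:

```python
from collections import defaultdict


def relative_scores(stories):
    """Rank stories by their share of all upvotes cast on the same day."""
    # Total upvotes per day, used as the normalizing denominator.
    daily_total = defaultdict(int)
    for story in stories:
        daily_total[story["day"]] += story["points"]

    ranked = [
        (story["title"], story["points"] / daily_total[story["day"]])
        for story in stories
        if daily_total[story["day"]] > 0  # skip days with no votes at all
    ]
    # Highest share first.
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical data: the forum is ten times busier in 2020 than in 2015.
stories = [
    {"title": "Top story of 2015", "day": "2015-03-01", "points": 300},
    {"title": "Other 2015 story", "day": "2015-03-01", "points": 100},
    {"title": "5th story today", "day": "2020-08-01", "points": 400},
    {"title": "Top story today", "day": "2020-08-01", "points": 3600},
]

for title, share in relative_scores(stories):
    print(f"{share:.2f}  {title}")
```

With these made-up numbers, the 2015 top story (300 of 400 votes that day) ranks above today's 5th story even though the latter has more raw points.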

~~~
rhaksw
> I want them ranked by something like the ratio of views to upvotes or
> upvotes compared to total upvotes for that day.

Yeah, this is definitely possible with public data. I did something similar on
reveddit [1] for removed reddit content. Hovering over the graph shows the
item with the highest vote ratio [3], and clicking skips to that point in
time. Code is here [2] for anyone interested and I apologize in advance..

[1] [https://i.imgur.com/p3Bi5IS.png](https://i.imgur.com/p3Bi5IS.png)

[2]
[https://github.com/reveddit/ragger/blob/master/revddit_aggre...](https://github.com/reveddit/ragger/blob/master/revddit_aggregator.py)

[3]
[https://www.reveddit.com/r/worldnews/?showFilters=true](https://www.reveddit.com/r/worldnews/?showFilters=true)

------
mcovalt
I spend a lot of time trying to find articles I once read on here via Google
with...

    
    
      site:news.ycombinator.com some term
    

...but many times "some term" is not in the title of the HN post, but in the
body of the article. I didn't see any other tools that did this for HN.

~~~
walterbell
There's also Algolia, [https://hn.algolia.com](https://hn.algolia.com)

~~~
tomhoward
It doesn't index the article contents though; only the story titles and
comments, same as Google.

~~~
1vuio0pswjnm7
How did you come to this conclusion with respect to hndex?

Most articles have no comments. Yet one can search on hndex for terms only
found in the article, not in the title.

~~~
tomhoward
I wasn't really saying anything about hndex, apart from that it indexes the
contents of external articles submitted to HN, as you said.

The correction I was making was about HN Algolia search (which is linked from
the search box at the bottom of each HN page), which only indexes content on
news.ycombinator.com itself – i.e., article titles, comments and text-only
posts like Show/Ask HN – but not the external content in submitted articles.

------
kody
So the first term I searched was "cognitive". Search for it yourself, and
click the first result.

Go back and click "cached".

That was a cool experience.

~~~
freediver
I wonder how cached results are created and whether there is some kind of
copyright violation because they store the content.

~~~
JoshuaRLi
I suspect it might be something mentioned in
[https://github.com/masukomi/arc90-readability](https://github.com/masukomi/arc90-readability).
I believe it's what powers "readability" view in modern browsers (firefox
[https://github.com/mozilla/readability](https://github.com/mozilla/readability),
safari).

------
benatkin
That's awesome! I searched for MuleSoft, not expecting anything in particular,
and found a random fact (at the end) that certainly wasn't in the title of the
article.

Bug report: in Safari on dark mode, the text in the search box is almost the
same color as the background (white on white).

From "GitHub was also talking to Google about a deal, but went with Microsoft
instead" [https://www.cnbc.com/2018/06/05/github-interest-from-
google-...](https://www.cnbc.com/2018/06/05/github-interest-from-google-and-
others-revenue-about-300-million.html?__source=twitter%7Cmain) \- Salesforce
got MuleSoft for $7.5 billion, Microsoft got GitHub for $7.5 billion.

------
omarchowdhury
I searched for "red light therapy" and none of the articles matched, not even
the most recent article on red light therapy that was on the homepage. Same
for "red light".

------
z3t4
It's fast and light weight! What stack are you using?

My only wish is that it should be possible to sort chronologically.

------
1123581321
Neat project. It’s a bit depressing how many of the article links don’t work
anymore. For example, only the official signal vs noise article works on the
first page of results for Basecamp:
[https://hndex.org/?q=Basecamp](https://hndex.org/?q=Basecamp)

How does ranking work? Apologies if I’ve missed the explanation.

~~~
codetrotter
Articles which have been posted to HN often end up in the archives of
Archive.org and Archive.is in my experience.

The OP site also has a “cached” link for each article, don’t know if you saw
that.

Also, to the maker of HNdex, consider adding links to Archive.org and
Archive.is next to the cached link you have, so that readers can click through
and check if they have a version of it in case there’s images etc
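A rough sketch of what generating such links could look like for the Wayback Machine, which accepts the original URL appended to `https://web.archive.org/web/`. The function name is hypothetical and nothing here is known about how HNdex is built; Archive.is is left out because its lookup URL scheme is not assumed here:

```python
def wayback_links(url):
    """Build Wayback Machine lookup links for an article URL."""
    base = "https://web.archive.org/web/"
    return {
        # Redirects to the most recent capture, if one exists.
        "latest": base + url,
        # Calendar view listing every capture of this URL.
        "all_captures": base + "*/" + url,
    }


links = wayback_links("https://example.com/some-article")
print(links["latest"])
print(links["all_captures"])
```

These are plain links, so no request to the archive is needed until a reader actually clicks one.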

~~~
specialist
Nice.

HN feature request: Add submissions to archive.org (or equiv), include that
cache link with story.

------
edjrage
What does this offer over [https://hn.algolia.com](https://hn.algolia.com)?

(Algolia was backed by YC btw)

~~~
thsowers
Looks like this search engine searches the _articles_ and links that have been
shared, whereas hn.algolia.com only searches title, author, and the text of a
post if it is a text post.

------
BlackLotus89
Ohhhh I first thought this was a search engine for hacker news and I was
wondering why the hell anyone would reimplement this, but this is fricking
cool. Tbh I wanted to implement something similar for more than hackernews,
but this is a nice thing to have. I especially love that you implemented a
cache :)

Seems nice. Sort and filter functionality would probably add to this, but I
will bookmark this for sure and try it out as a search engine for tech topics
in general.

Thanks for creating this!

------
visarga
May I suggest you add time as well to the search results? It's super relevant
in the context of news and technical articles.

------
arc-in-space
Holy crap this is fast! At first I was mildly disappointed it wasn't a live
search like hn.algolia.com, but I was seriously not expecting the results page
to load so quickly. I guess goog et al. trained me to expect garbage latency
in internet search forms.

This seems to be a good complement to algolia, given that it searches through
the linked pages instead of comments.

Minor nitpick: would it be possible to make it give a 'past' link, to search
for all discussions on a result? Some of the 'comments' take you to duplicate
posts with no comments instead of the more popular cases.

------
ksangeelee
I wrote something similar to this a year ago (though shabby looking by
comparison!): -

[http://kakapo.susa.net:8080/cfs/](http://kakapo.susa.net:8080/cfs/)

I abandoned it when Google deprecated the WebRequest API, but the code's still
available on GitLab.
[https://gitlab.com/ksangeelee/cfs_build](https://gitlab.com/ksangeelee/cfs_build)

It allows article score and uBlock Origin 'hits' as search criteria.

~~~
freediver
Very cool implementation. I am wondering whether you have considered using
something like
[https://github.com/cliqz-oss/adblocker](https://github.com/cliqz-oss/adblocker),
which can sit on top of a headless browser and does not require bridging to an
extension.

------
1f60c
This is neat!

- What is your stack?

- How often is your database updated?

- And will it be open source in the future?

------
reactchain
This is fantastic. Way better than the Algolia search which seems to be the
most "official" one. I love that you cached the article content too.

------
city41
What are the criteria for an article getting into the index? To test it out I
tried several of my submissions and could only find one in the index.

------
wh-uws
this is great whats the stack / architecture you built this with? Would love
some insights.

I often find myself searching HN for opinions on different technologies over
time and this will be invaluable for that use case.

------
tucif
So many articles found that are gone from the original sites, but still
readable from the cached version, that's pretty useful!

------
NOGDP
Nice tool, though I tried a few searches, and a few searches of <same text> +
' hackernews' on google, and I gotta say I like the google results better.
Search options for comments/titles/users, date range, ask/show/jobs, sorting
by date/votes (weighted?), etc would be nice additions.

------
sriram_malhar
Love it, love it! Thank you.

Is it possible to give a little inside peek of the techniques, algorithms and
tech stack etc. you have used?

------
IncludeSecurity
Tried to find an article from earlier this week, no dice.
[https://hndex.org/?q=open%20source%20security](https://hndex.org/?q=open%20source%20security)
[https://hndex.org/?q=openssf](https://hndex.org/?q=openssf)

Great idea, I appreciate how fast it returns results! Just needs more
control over the search parameters and figuring out why articles like the
example I posted above aren't working and you got yourself a nice HN search.

------
adaszko
There's also Falcon[1] Chrome extension which does full text indexing on your
browser history so if you read something and can't either Google it or find it
in the browser's history, Falcon will broaden the search scope.

[1]
[https://chrome.google.com/webstore/detail/falcon/mmifbbohghe...](https://chrome.google.com/webstore/detail/falcon/mmifbbohghecjloeklpbinkjpbplfalb?hl=en)

------
maddyboo
Here we are in 2020, and web search is not what I’d hoped it would be. I would
love to see more niche, small search tools like this rather than better
general-purpose search. Building upon that, maybe an aggregator of search
results from many small niche search engines. Actually, just writing that
sentence reminded me of Searx [0], which I have to admit I haven’t tried in
earnest but really should.

0: [https://searx.me/](https://searx.me/)

------
tyingq
Well done. One nit...if you enter something it doesn't like, the error that
pops up isn't helpful: _" Please match the requested format"_

------
edent
Found a bug! Searched for my name and found a bunch of unrelated BoingBoing
links.

[https://hndex.org/?q=Terence+Eden](https://hndex.org/?q=Terence+Eden)

Why? Because those old posts have a "what's new" set of links. One of which
contains my name. I'd suggest only searching the `<main>` element, perhaps?
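A minimal sketch of that suggestion using only Python's standard library: collect text only while inside a `<main>` element, so sidebar links do not end up in the index. The class and function names and the sample page are hypothetical, and how HNdex actually extracts text is not known here:

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect text nodes that appear inside <main>...</main>."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <main> elements are currently open
        self.chunks = []  # text fragments found inside <main>

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "main" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.chunks.append(text)


def main_text(html):
    """Return the text inside <main>, or "" if the page has no <main>."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = """
<html><body>
<aside><h2>What's new</h2><a href="/x">Post by Terence Eden</a></aside>
<main><h1>Some article</h1><p>Body text about gadgets.</p></main>
</body></html>
"""

print(main_text(page))
```

Many older pages have no `<main>` element at all, so a real indexer would need a fallback (for example a readability-style heuristic); this sketch simply returns an empty string in that case, and it does not filter out `<script>` or `<style>` content nested inside `<main>`.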

------
zerop
Thanks for making it. great resource. Two things:

1. Show dates in results.

2. Are results better than searching on Google with
"site:news.ycombinator.com"?

~~~
rustamm
I think it indexes the content of articles referenced on Hacker News, which
makes it different from Google with "site:news.ycombinator.com".

------
breakfastduck
This produces such a unique set of results for things.

Fantastic project and well worth creating!

As another commenter has noted - this totally disobeys recency bias and throws
up interesting articles for a topic.

Edit - it is disappointing how many of these links 404. But even if that's the
case, the headlines and intros are a set of time capsules of sorts nonetheless.

------
smhmd
Could you add post points and how much time since/when was the article posted?
It would help as some topics do not age as well as others. Having sorting
(most popular, most recent) and range search (last year, custom range) would
also help those who want to narrow their search.

------
Kelamir
I love the option to view cached versions of websites. This is a nice project.
Thank you, Mcovalt.

------
aabbcc1241
How does the "full-text search" work? I tried to search my previous post[1]
but no luck, even tried the exact title.

[1]
[https://news.ycombinator.com/item?id=22830472](https://news.ycombinator.com/item?id=22830472)

------
nelsonenzo
can i sort by date? i searched for kubernetes and the first page results were
from 2016.

~~~
nmstoker
Yes, with so much context missing from the results it's pretty confusing.

Ideally for this to be useful I'd want date on HN, date of article (although I
see that'd perhaps be hard to extract from unstructured pages) and also the HN
points, as those are a major proxy to quality usually.

As others have said it has a nice crisp UI and i like that it's so quick.

------
chvid
What stack did you use to make this?

Ps: show the date / age of the posts on the results page

------
gkhartman
I would love to see the source and/or a write up about the stack used here.
I've thought about making something similar, but got distracted while in the
planning phase.

------
noseratio
Very cool! How frequently does it index the HN content? I've just tried
searching for DevComrade which I submitted 14hrs ago and nothing shows up.

------
disqard
Beautiful! Please consider adding a dark mode option.

~~~
airstrike
I don't know why but to me the background is #000 black with #fff text by
default...

It's such an intense "dark mode" that I couldn't read the first full article
of interest I found because the contrast was killing me, and then coming back
to regular HN left me nearly blinded for a few secs

------
mindhash
I would love to search books, video recommendations that people post on HN.
It's one of the cool things I like here

~~~
smoe
There is this site / newsletter for books mentioned in hn comments

[https://hackernewsbooks.com/](https://hackernewsbooks.com/)

There are far fewer books in the last six or so weeks' editions. Don't know
why this is. Whether it is some scraper not working, or actually no books
being mentioned.

------
antihero
This is absolutely fantastic, is there any way to display the dates on the
cached version and/or result?

------
abhayhegde
How does a site which can index text inside a document usually work? Is it
similar to Linux grep?

------
sarosh
Hi Matt, can you provide any details on the code/backend?

------
baccredited
Wow - would love an RSS feed feature to subscribe to a search

------
dennisy
Would you be willing to provide an API for this service?

------
asdf333
awesome! would be great to have a submissiondate/votes/num comments/link to
the hn submission in the results!

------
Watchnerd
this is really cool, now I can filter by topics I really care about, well
done!

------
crazypython
I'm worried making it public will encourage companies to "SEO"ify HN.

------
throwawaysea
Does this work with articles behind a paywall?

