Show HN: Hndex.org – a full-text search engine of articles submitted to HN (hndex.org)
442 points by mcovalt on Aug 7, 2020 | 78 comments



I was just telling a friend about a post I had read about buying cheap tools to start out. I couldn't find it at all by googling, and the HN search didn't work. I saw this post, searched "cheap tools quality", and immediately found it. Thank you!

The post in question: https://www.johndcook.com/blog/2020/07/25/worst-tool-for-the...


Though the best part is actually in the comments on the HN submission of that article (in short: the claim doesn't seem to hold for certain domains).


I love this. The search box had an option to autocomplete terms I had recently searched, I clicked on one (a term I was interested in, obviously) and immediately found an interesting HackerNews post. Nice!

I think HackerNews and other link aggregators (e.g. reddit) have a kind of recency problem, where there is a lot of great content, but people only see the recent stuff. This seems like a great way to uncover some of the latent value of old HackerNews content.

Is there a way to suggest content too? e.g. If you liked that, you'd probably also like X, Y, and Z.


>I think HackerNews and other link aggregators (e.g. reddit) have a kind of recency problem, where there is a lot of great content, but people only see the recent stuff.

With Reddit and HackerNews, I want a relative ranking index. I can search by the top content, but something 5th today could have more votes than the top submission of 2015 because of forum/subreddit growth.

I want them ranked by something like the ratio of views to upvotes or upvotes compared to total upvotes for that day.
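A minimal sketch of the second idea (a story's upvotes as a share of all upvotes cast that day), assuming you already have `(story_id, day, points)` tuples from somewhere. The function name and the sample data are made up for illustration:

```python
from collections import defaultdict

def rank_by_daily_share(stories):
    """Rank stories by their share of all points awarded on the same day.

    `stories` is a list of (story_id, day, points) tuples (hypothetical shape).
    """
    stories = list(stories)

    # Total points handed out per day, so each story can be normalized.
    totals = defaultdict(int)
    for _, day, points in stories:
        totals[day] += points

    # Share of that day's points; days with zero total points are skipped.
    shares = [
        (story_id, points / totals[day])
        for story_id, day, points in stories
        if totals[day] > 0
    ]
    return sorted(shares, key=lambda item: item[1], reverse=True)

# Invented sample: a quiet day in 2015 and a busy day in 2020.
sample = [
    ("a", "2015-01-01", 300),
    ("b", "2015-01-01", 100),
    ("c", "2020-08-07", 400),
    ("d", "2020-08-07", 1600),
]
ranked = rank_by_daily_share(sample)
```

In this sample, "c" has more raw points than "a" but ranks lower, because it took a smaller share of its day's votes.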


> I want them ranked by something like the ratio of views to upvotes or upvotes compared to total upvotes for that day.

Yeah, this is definitely possible with public data. I did something similar on reveddit [1] for removed reddit content. Hovering over the graph shows the item with the highest vote ratio [3], and clicking skips to that point in time. Code is here [2] for anyone interested and I apologize in advance..

[1] https://i.imgur.com/p3Bi5IS.png

[2] https://github.com/reveddit/ragger/blob/master/revddit_aggre...

[3] https://www.reveddit.com/r/worldnews/?showFilters=true


Absolutely. HN user 'EvanMiller had a lot to say about this, 11 years ago. His tl;dr is that the ranking score should be "the Lower bound of Wilson score confidence interval for a Bernoulli parameter."
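For anyone curious, here is a rough sketch of that score. It assumes you know both the number of positive ratings and the total number of ratings, which HN does not expose publicly (only points are visible), so treat the inputs as hypothetical. The function name and the default `z = 1.96` (roughly 95% confidence) are my choices:

```python
import math

def wilson_lower_bound(positive: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a Bernoulli parameter.

    positive: number of positive ratings (assumed known)
    total:    number of ratings overall (assumed known)
    z:        normal quantile for the chosen confidence level
    """
    if total == 0:
        return 0.0
    p_hat = positive / total
    z2 = z * z
    denominator = 1 + z2 / total
    centre = p_hat + z2 / (2 * total)
    margin = z * math.sqrt((p_hat * (1 - p_hat) + z2 / (4 * total)) / total)
    return (centre - margin) / denominator

# Same 90% positive ratio, different sample sizes.
many_votes = wilson_lower_bound(90, 100)
few_votes = wilson_lower_bound(9, 10)
```

The point of the lower bound is that an item with few votes gets a more cautious score than an item with the same ratio and many votes, so `many_votes` comes out higher than `few_votes`.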

I believe HN's ranking system is extremely creative and works great for the main page and day-to-day use (my understanding is that it additionally makes use of time-decay terms for comments and stories). Like you said, it's really just historical search (i.e. algolia) that seems broken.

https://www.evanmiller.org/how-not-to-sort-by-average-rating...

https://news.ycombinator.com/item?id=15709405

https://www.evanmiller.org/bayesian-average-ratings.html

https://www.evanmiller.org/ranking-items-with-star-ratings.h...

https://redditblog.com/2009/10/15/reddits-new-comment-sortin...


I want a map-reduce over their content that turns them into a canonical “master guide” of their topics.


Agreed! The problem I see is that the score does not correlate well with the perceived quality of the community. I've been researching that for some time now and am in the process of preparing a blog article with data analysis and solution approaches.

Work in progress:

https://github.com/fdietze/downvote-scoring

https://felix.unote.io/hacker-news-scores


I spend a lot of time trying to find articles I once read on here via Google with...

  site:news.ycombinator.com some term
...but many times "some term" is not in the title of the HN post, but in the body of the article. I didn't see any other tools that did this for HN.


There's also Algolia, https://hn.algolia.com


It doesn't index the article contents though; only the story titles and comments, same as Google.


How did you come to this conclusion with respect to hndex?

Most articles have no comments. Yet one can search on hndex for terms only found in the article, not in the title.


I wasn't really saying anything about hndex, apart from that it indexes the contents of external articles submitted to HN, as you said.

The correction I was making was about HN Algolia search (which is linked from the search box at the bottom of each HN page), which only indexes content on news.ycombinator.com itself – i.e., article titles, comments and text-only posts like Show/Ask HN – but not the external content in submitted articles.


It does seem to have a much larger index though, which for me at least makes it more useful. Hopefully hndex's index will grow.


Consider expanding the scope to index links posted in user comments, too.


This is really useful! You might want to consider hiding the “More”-button if the current page isn’t filled up, so as to not just have an empty page when clicking it.

For some reason I couldn’t find my own blog post[0], even when searching for the embarrassing typo I made in the title - acommodating.

[0] https://news.ycombinator.com/item?id=23426951


Very nice. Works as expected. I stumbled upon two things that could be improved:

- Add a search button for convenient mobile use, or for when a user copy-pastes things into the search field using the mouse

- Add a comment counter on the results page. Since you index every article, a lot of them have no comments or very few.

Oh, and just a warning: depending on the jurisdiction, providing the cache could be problematic under copyright law, since it's basically a copy of the article.


For that specific use case, you can try the intitle keyword

  site:news.ycombinator.com intitle:"some term"


Yes, me too! This is really great, thanks! I'll definitely be using it.


what are you using to do the full text search?


So the first term I searched was "cognitive". Search for it yourself, and click the first result.

Go back and click "cached".

That was a cool experience.


I wonder how cached results are created, and whether there is some kind of copyright violation because they store the content.


I suspect it might be something mentioned in https://github.com/masukomi/arc90-readability. I believe it's what powers "readability" view in modern browsers (firefox https://github.com/mozilla/readability, safari).


That's awesome! I searched for MuleSoft, not expecting anything in particular, and found a random fact (at the end) that certainly wasn't in the title of the article.

Bug report: in Safari on dark mode, the text in the search box is almost the same color as the background (white on white).

From "GitHub was also talking to Google about a deal, but went with Microsoft instead" https://www.cnbc.com/2018/06/05/github-interest-from-google-... - Salesforce got MuleSoft for $7.5 billion, Microsoft got GitHub for $7.5 billion.


I searched for "red light therapy" and none of the articles matched, not even the most recent article on red light therapy that was on the homepage. Same for "red light".


It's fast and lightweight! What stack are you using?

My only wish is that it should be possible to sort chronologically.


Neat project. It’s a bit depressing how many of the article links don’t work anymore. For example, only the official signal vs noise article works on the first page of results for Basecamp: https://hndex.org/?q=Basecamp

How does ranking work? Apologies if I’ve missed the explanation.


Articles which have been posted to HN often end up in the archives of Archive.org and Archive.is in my experience.

The OP site also has a “cached” link for each article, don’t know if you saw that.

Also, to the maker of HNdex, consider adding links to Archive.org and Archive.is next to the cached link you have, so that readers can click through and check whether they have a version of it, in case there are images etc.


Nice.

HN feature request: Add submissions to archive.org (or equiv), include that cache link with story.


I did miss the cached link—thank you!


It would be great if each result also had [archive] and [web] links like HN.

> How does ranking work?

Great question, it's something I'd love to know more about as well.


It would be great if that was native HN functionality. Why isn't it? It seems like it would only help, and how hard could it be?


They easily could (I think it would take the average HN user no more than a couple of hours to implement this functionality), but they won't, because they're very reluctant to make changes.

This is also why HN still looks like a site from the 90s, instead of New Reddit (thank God).


It’s not all bad. We still have HN as it is now. Status quo is its own kind of success. HN is essentially feature complete, so I don’t blame them for not trying to fix what ain’t broke.


What does this offer over https://hn.algolia.com?

(Algolia was backed by YC btw)


Looks like this search engine searches the _articles_ and links that have been shared, whereas hn.algolia.com only searches title, author, and the text of a post if it is a text post.


Ohhhh I first thought this was a search engine for hacker news and I was wondering why the hell anyone would reimplement this, but this is fricking cool. Tbh I wanted to implement something similar for more than hackernews, but this is a nice thing to have. I especially love that you implemented a cache :)

Seems nice. Sort and filter functionality would probably add to this, but I will bookmark this for sure and try it out as a search engine for tech topics in general.

Thanks for creating this!


May I suggest you add time as well to the search results? It's super relevant in the context of news and technical articles.


Holy crap this is fast! At first I was mildly disappointed it wasn't a live search like hn.algolia.com, but I was seriously not expecting the results page to load so quickly. I guess goog et al. trained me to expect garbage latency in internet search forms.

This seems to be a good complement to algolia, given that it searches through the linked pages instead of comments.

Minor nitpick: would it be possible to make it give a 'past' link, to search for all discussions on a result? Some of the 'comments' take you to duplicate posts with no comments instead of the more popular cases.


I wrote something similar to this a year ago (though shabby looking by comparison!): -

http://kakapo.susa.net:8080/cfs/

I abandoned it when Google deprecated the WebRequest API, but the code's still available on GitLab. https://gitlab.com/ksangeelee/cfs_build

It allows article score and uBlock Origin 'hits' as search criteria.


Very cool implementation. I am wondering whether you have considered using something like https://github.com/cliqz-oss/adblocker, which can sit on top of a headless browser and does not require bridging to an extension.


This is neat!

- What is your stack?

- How often is your database updated?

- And will it be open source in the future?


This is fantastic. Way better than the Algolia search which seems to be the most "official" one. I love that you cached the article content too.


What are the criteria for an article getting into the index? To test it out I tried several of my submissions and could only find one in the index.


This is great! What's the stack / architecture you built this with? Would love some insights.

I often find myself searching HN for opinions on different technologies over time and this will be invaluable for that use case.


So many articles found that are gone from the original sites, but still readable from the cached version, that's pretty useful!


Nice tool, though I tried a few searches, and a few searches of <same text> + ' hackernews' on google, and I gotta say I like the google results better. Search options for comments/titles/users, date range, ask/show/jobs, sorting by date/votes (weighted?), etc would be nice additions.


Love it, love it! Thank you.

Is it possible to give a little inside peek of the techniques, algorithms and tech stack etc. you have used?


Tried to find an article from earlier this week, no dice. https://hndex.org/?q=open%20source%20security https://hndex.org/?q=openssf

Great idea, I appreciate how fast it returns results! It just needs more control over the search parameters, plus figuring out why articles like the example I posted above aren't showing up, and you've got yourself a nice HN search.


Here we are in 2020, and web search is not what I’d hoped it would be. I would love to see more niche, small search tools like this rather than better general-purpose search. Building upon that, maybe an aggregator of search results from many small niche search engines. Actually, just writing that sentence reminded me of Searx [0], which I have to admit I haven’t tried in earnest but really should.

0: https://searx.me/


There's also the Falcon[1] Chrome extension, which does full-text indexing on your browser history, so if you read something and can't either Google it or find it in the browser's history, Falcon will broaden the search scope.

[1] https://chrome.google.com/webstore/detail/falcon/mmifbbohghe...


Well done. One nit...if you enter something it doesn't like, the error that pops up isn't helpful: "Please match the requested format"


Found a bug! Searched for my name and found a bunch of unrelated BoingBoing links.

https://hndex.org/?q=Terence+Eden

Why? Because those old posts have a "what's new" set of links. One of which contains my name. I'd suggest only searching the `<main>` element, perhaps?
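Something like the following is what I have in mind. It is only a sketch using Python's standard `html.parser`; I have no idea what hndex actually uses to extract text, and the class and function names here are invented. It also does not skip `<script>` or `<style>` content that happens to sit inside `<main>`:

```python
from html.parser import HTMLParser

class MainTextExtractor(HTMLParser):
    """Collect text that appears inside <main> ... </main> only."""

    def __init__(self):
        super().__init__()
        self.main_depth = 0   # > 0 while the parser is inside a <main> element
        self.chunks = []      # text fragments found inside <main>

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.main_depth += 1

    def handle_endtag(self, tag):
        if tag == "main" and self.main_depth > 0:
            self.main_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.main_depth > 0 and text:
            self.chunks.append(text)

def main_text(html: str) -> str:
    """Return the whitespace-joined text of the <main> element(s) in `html`."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Invented page: the sidebar mentions a name that the article itself does not.
page = (
    "<html><body><nav>Home</nav>"
    "<main><h1>Title</h1><p>Body text</p></main>"
    "<aside>What's new: Terence Eden</aside></body></html>"
)
indexed = main_text(page)
```

With that, the sidebar links never reach the index, so a search for my name would no longer match this page. Plenty of older pages have no `<main>` element at all, so a real crawler would need a fallback.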


Thanks for making it. great resource. Two things:

1. Show dates in results.

2. Are results better than searching on Google with "site:news.ycombinator.com"?


I think it indexes the content of articles referenced on Hacker News, which makes it different from Google with "site:news.ycombinator.com".


This produces such a unique set of results for things.

Fantastic project and well worth creating!

As another commenter has noted - this totally disobeys recency bias and throws up interesting articles for a topic.

Edit - it is disappointing how many of these links 404. But even if that's the case, the headlines and intros are a set of time capsules of sorts nonetheless.


Could you add post points and how long ago (or when) the article was posted? It would help, as some topics do not age as well as others. Having sorting (most popular, most recent) and range search (last year, custom range) would also help those who want to narrow their search.


I love the option to view cached versions of websites. This is a nice project. Thank you, Mcovalt.


How does the "full-text search" work? I tried to search my previous post[1] but no luck, even tried the exact title.

[1] https://news.ycombinator.com/item?id=22830472


can i sort by date? i searched for kubernetes and the first page results were from 2016.


Yes, with so much context missing from the results it's pretty confusing.

Ideally for this to be useful I'd want date on HN, date of article (although I see that'd perhaps be hard to extract from unstructured pages) and also the HN points, as those are a major proxy to quality usually.

As others have said, it has a nice crisp UI and I like that it's so quick.


What stack did you use to make this?

Ps: show the date / age of the posts on the results page


I would love to see the source and/or a write up about the stack used here. I've thought about making something similar, but got distracted while in the planning phase.


Very cool! How frequently does it index the HN content? I've just tried searching for DevComrade which I submitted 14hrs ago and nothing shows up.


Beautiful! Please consider adding a dark mode option.


I don't know why but to me the background is #000 black with #fff text by default...

It's such an intense "dark mode" that I couldn't read the first full article of interest I found because the contrast was killing me, and then coming back to regular HN left me nearly blinded for a few secs


Hm, dark mode is working for me (iPhone, iOS 13.6)


I would love to search books, video recommendations that people post on HN. It's one of the cool things I like here


There is this site / newsletter for books mentioned in hn comments

https://hackernewsbooks.com/

There are far fewer books in the editions from the last six or so weeks. I don't know why this is: whether some scraper is not working, or no books are actually being mentioned.


This is absolutely fantastic, is there any way to display the dates on the cached version and/or result?


How does a site which can index text inside a document usually work? Is it similar to Linux grep?
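For comparison with grep: a common approach (not necessarily what hndex does; the author hasn't said) is an inverted index, which maps each term to the documents containing it, so a query becomes a few lookups instead of a scan over every document. A toy sketch, with all names and sample documents invented:

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every term in the query (AND search)."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Start from the first term's documents, then intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Invented documents standing in for crawled articles.
docs = {
    1: "Buy cheap tools first",
    2: "Quality tools last",
    3: "Cheap quality is rare",
}
index = build_index(docs)
hits = search(index, "cheap tools")
```

grep rereads all the text on every query, whereas the index is built once up front. Real engines add ranking, stemming, and on-disk storage on top of this.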


Hi Matt, can you provide any details on the code/backend?


Wow - would love an RSS feed feature to subscribe to a search


Would you be willing to provide an API for this service?


awesome! would be great to have a submissiondate/votes/num comments/link to the hn submission in the results!


this is really cool, now I can filter by topics I really care about, well done!


I'm worried making it public will encourage companies to "SEO"ify HN.


Does this work with articles behind a paywall?



