Hacker News Search (ycombinator.com)
344 points by pg on June 4, 2011 | 117 comments

I like hnsearch a lot, but I'd like us to take a second to thank whoever was running SearchYC, which for the past couple years has been practically indispensable in keeping up with this community.

SearchYC was also more fully featured, with lists of top users based on points per submission, points per comment, etc. - at least until pg removed comment points from the site.

That could be a good app for the HNSearch API contest: http://www.hnsearch.com/contest

Every hnsearch contestant gets a dotCloud account to run their app easily, and for free.

Email hn@dotcloud.com to get started.

You rock. I'll make a note on the contest page.

I miss searchyc and feel sad that they went completely dark with no real explanation.

Yes, I saw that the first time around thanks.

It describes an action, but explains nothing.

Just off the top of my head, here are some unanswered questions: Why did Comcast shut them down? Why did Comcast have the authority to shut them down? Why didn't the searchyc admins take their site somewhere else? Why have they said nothing here on HN? Why were they shut down mere days before pg announced an official search (hopefully just a coincidence)? What happened to their data (it's especially valuable now that we can no longer see comment scores)? Etc.

Jiminy. Can we skip the outrage! this time?

You obviously haven't even looked at HNSearch if your big concern is comment scores.


No outrage intended; sorry if it came across that way.

I was commenting on the abrupt departure of searchyc not on the new search. Haven't seen the new search yet since it's not working on my blackberry for some reason. And I don't have my laptop with me.

Not everything that isn't explained in crystalline detail is a conspiracy or even evidence of significant drama.

You seem to be reading an undertone into my comment that simply wasn't there.

"Hopefully it's just a coincidence" that SearchYC lost its Comcast connectivity right before Paul Graham announced HNSearch. No. I don't think I misread the undertone.

Regardless, we don't need to beat this to death. Have a good weekend.

I don't see any outrage there...

You're coming across as more than a little paranoid.

I can see that. You know, if I squint. :)

I came here to throw some love at searchYC and heard this bad news. Very bad news.

searchyc is still alive and well http://searchyc.com/

Definitely! I'm actually trying to get in touch with the SearchYC folks. Recently they put up a link to HNSearch which was very nice of them.

SearchYC also provided excellent search for pg's Arc Forum, which is now gone: http://news.ycombinator.com/item?id=2620297

...and still is: http://searchyc.com/

ThriftDB is the software they had to write to do Octopart's electronic parts searches; no existing software was up to the task.

Can the Octopart guys comment on why no existing software was up to the task? Also, I'm curious if there's something that makes ThriftDB a particularly good choice for HN search as well.

A search engine has many moving parts and we found that the existing technology worked but with great pain.

Lucene/Solr is great for search but it's useless without a fast datastore. Additionally, you can't expose it to the open internet so you need to build a REST API wrapper around the DB/index.

One particularly acute problem we had at Octopart was search across a frequently changing schema. We have 15M parts in our database so it was a real pain to change the schema. ThriftDB uses the Thrift serialization protocol internally to maintain a flexible schema so you can change the schema independently of the underlying data.

The current implementation of ThriftDB takes advantage of solr in smart ways to simplify app development dramatically. At Octopart we found that we were spending a lot of time building custom search solutions. Now we're using the same backend technology to search for electronic parts on Octopart and comments on Hacker News.

Using ThriftDB for search is like going from a compiled language to a scripting language. It takes care of a lot of issues so you can focus on app development.

Hope that explanation helps! Please let me know if you have more questions.

One particularly acute problem we had at Octopart was search across a frequently changing schema. We have 15M parts in our database so it was a real pain to change the schema.

Solr can be run without a schema quite easily, or use the very common trope of the dynamicField typed schema (string_*, text_*, int_* or whatever)
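
For anyone who hasn't used dynamic fields: the idea is that a field's type is picked from a naming pattern instead of being declared per field. Here is a rough Python sketch of that matching, using prefix patterns like the ones mentioned above. It only illustrates the idea and is not Solr's code; `DYNAMIC_FIELDS`, `field_type`, and the example field names are all made up for this sketch.

```python
from fnmatch import fnmatchcase

# Hypothetical pattern -> type table, mirroring the "string_*, text_*, int_*"
# naming convention described in the comment above.
DYNAMIC_FIELDS = [
    ("string_*", "string"),
    ("text_*", "text"),
    ("int_*", "int"),
]


def field_type(name):
    """Return the type implied by the first matching pattern, or None."""
    for pattern, type_name in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return type_name
    return None


if __name__ == "__main__":
    for name in ("int_points", "text_comment", "title"):
        print(name, "->", field_type(name))
```

The point is that a new attribute such as `int_points` needs no schema change as long as its name follows one of the patterns.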

Lucene/Solr is great for search but it's useless without a fast datastore.

This is true at real scale (EN has 500m docs in our bigger solrs and we have to back it w/ TT) but for 15m docs, no way. Stored data in lucene/solr is certainly fast enough at that level. Think <50ms including data even on AWS.

Additionally, you can't expose it to the open internet so you need to build a REST API wrapper around the DB/index.

This I don't get. Surely you have a web app layer that talks to your services? That is, you don't have a text field in HTML hitting solr on port 8983, you call your web app's search which then hits solr's HTTP API?

Solr can be run without a schema quite easily, or use the very common trope of the dynamicField typed schema (string_*, text_*, int_* or whatever)

We're using solr dynamic fields in our current implementation of ThriftDB. ThriftDB adds a layer on top though so you can change attribute names on the fly.

This is true at real scale (EN has 500m docs in our bigger solrs and we have to back it w/ TT) but for 15m docs, no way. Stored data in lucene/solr is certainly fast enough at that level. Think <50ms including data even on AWS.

In our experience, stored data in lucene/solr doesn't scale well.

This I don't get. Surely you have a web app layer that talks to your services? That is, you don't have a text field in HTML hitting solr on port 8983, you call your web app's search which then hits solr's HTTP API?

It sounds like we're saying the same thing. If you use solr you're responsible for the web app layer. With ThriftDB you get a JSON REST API out of the box for every collection you create.

OK, thanks. I think I understand ThriftDB a little better now. It's hosted (only), so instead of booting a solr on your own boxes, you just hit api.thriftdb.com to create indexes, add data, and do queries. Hence my "API" confusion: I was assuming this was software I'd run on my own box, and I didn't see what I would get from the API that I wouldn't get by hitting solr directly. But a turnkey search service is a nice idea.

But do you guys have a lot of experience with solr / lucene scaling (I mean well beyond the 15m docs octopart has)? What happens when an API customer starts ingesting >100m docs? And then 10 do it at once? Are the indexes on different boxes? Are you on AWS? What's the disk backing it? Does each index fit in RAM? Does the user have any control over caches, tokenizers, stemming, triefields? Like... when I add a date, you're not indexing it with millisecond accuracy, are you?

After EN gets bought on accident by the norwegian fish cannery I plan to retire on giving speeches of 5 years of Solr scaling woes. Unless Otis G is running it there's no way I would trust a hosted solr solution that I didn't have full control over.

Exactly! Would love to get more feedback if you have a chance to try out ThriftDB (http://www.thriftdb.com).

We have a lot of experience scaling faceted search with solr. 15M docs might not sound like a lot if you're doing full-text search but faceted-search adds another level of complexity:


As far as scaling goes, most of the demand right now is for smaller indexes so that's not a problem. We know how to scale out though so we're pretty excited to get customers with >100M docs.

We're hosted on AWS but we're still iterating on the architecture. The indexes fit in RAM. Currently you can't control caches, tokenizers, stemming, etc. but that's in the pipeline.

Good luck with your new Norwegian cannery owners! I hear you on hosted search. We're trying our best to create a headache-free hosted solution.

I agree. I'm not sure what's so different about HN's data that something like Lucene, Sphinx, or any other full-text search tool couldn't handle it.

ThriftDB uses Solr internally so I don't want to give the impression that we've built new full-text indexing technology. Our goal with ThriftDB was to solve the search infrastructure problem. We wanted to give developers access to fast, scalable, on-demand, cloud-based search.

Here's a link to the ThriftDB docs in case you'd like to learn more:


I should add that HNSearch has an API


which should be a lot faster than crawling HN itself to get the same data.

So after some quick hacking I created a new HN bookmarklet based on the new API:


If the current URL is added, it will go straight to discussion thread (without voting). If not, it will ask if you want to submit it.
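
Roughly, the decision the bookmarklet makes looks like the sketch below. It is Python rather than the bookmarklet's JavaScript, and it assumes the lookup against the HNSearch API has already happened: `existing_item_id` is a hypothetical argument holding the HN id of a matching story, or `None` if nothing matched. The `item?id=` link form appears elsewhere in this thread; the `submitlink` URL shape is my recollection of the original HN bookmarklet, so treat it as an assumption.

```python
from urllib.parse import urlencode

HN = "http://news.ycombinator.com"


def bookmarklet_target(page_url, page_title, existing_item_id=None):
    """Decide where the bookmarklet should send the user.

    Returns an (action, url) pair:
      ("discuss", ...)        if the page is already on HN (no vote is cast)
      ("confirm_submit", ...) if it is not, so the caller can ask first
    """
    if existing_item_id is not None:
        return ("discuss", "%s/item?id=%d" % (HN, existing_item_id))
    # Assumed submit URL shape; not confirmed anywhere in this thread.
    query = urlencode({"u": page_url, "t": page_title})
    return ("confirm_submit", "%s/submitlink?%s" % (HN, query))


if __name__ == "__main__":
    print(bookmarklet_target("http://example.com/", "Example", 2620297))
    print(bookmarklet_target("http://example.com/", "Example"))
```

Keeping the "already submitted" branch free of any voting side effect is the whole difference from the stock bookmarklet.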

Why not model it after the submission page (submit or upvote)?

Original HN bookmarklet already works that way http://ycombinator.com/bookmarklet.html

However - I don't always want to upvote a post just because I want to check the discussion. Published this with an assumption that I'm not the only one.

I'm already working on integrating the api in unscatter.com using a /hn tag - http://www.unscatter.com/search/?q=%2Fhn+facebook (currently only functional)

It would be super cool if JSONP support were added. If we want to build a web app with JavaScript, it is much easier to use JSONP directly.

When I built this web app http://www.vcarrer.com/2010/11/hacker-news-mobile-front-page... I needed to use Yahoo YQL to obtain JSONP.

So pretty please, add JSONP.

There is already JSONP support, see callback argument:
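
To show what the `callback` argument changes, here is a small Python sketch: one helper builds a search URL with a callback name, the other strips the JSONP wrapper from a response body. The endpoint is the one that appears in URLs elsewhere in this thread; the helper names and the sample response body (`{"hits": 3}`) are made up, and the real response shape is not shown here.

```python
import json
import re
from urllib.parse import urlencode

# Endpoint as it appears in URLs quoted in this thread.
SEARCH_URL = "http://api.thriftdb.com/api.hnsearch.com/items/_search"


def jsonp_url(query, callback):
    """Build a search URL that asks for a JSONP-wrapped response."""
    return SEARCH_URL + "?" + urlencode({"q": query, "callback": callback})


def unwrap_jsonp(body, callback):
    """Strip a `callback(...)` wrapper and parse the JSON inside it."""
    pattern = r"^\s*" + re.escape(callback) + r"\s*\((.*)\)\s*;?\s*$"
    match = re.match(pattern, body, re.DOTALL)
    if match is None:
        raise ValueError("not a JSONP response for callback %r" % callback)
    return json.loads(match.group(1))


if __name__ == "__main__":
    print(jsonp_url("facebook", "handleResults"))
    # Made-up body, only to show the wrapper being removed.
    print(unwrap_jsonp('handleResults({"hits": 3});', "handleResults"))
```

In a browser you would not unwrap anything yourself; the script tag calls your function with the parsed data. The unwrapping helper is only useful when consuming the same URL from a non-browser client.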


Cool! Thanks!

This is cool, but the API looks a bit cryptic. Is there a way to simply look up an item by URL?

Edit: Never mind, figured it out: http://api.thriftdb.com/api.hnsearch.com/items/_search?prett...

The API docs are a work in progress. Please let me know which parts need work.

For example, what's the difference between the "filter[queries][]" and "q" arguments in http://www.thriftdb.com/documentation/rest-api/search-api? As I understand it, q does simple keyword matching on all fields, while "filter[queries][]" is more like a SQL "where" clause. In that case, what operators are supported?

The `q` argument is used to perform a full-text search across all fields. One benefit to using `q` is that you can rank results based on field matches (using the `weights` argument).

The `filter` arguments are used to cut the data before performing a full-text query:

`filter[fields][fieldname]` can be used to filter on fieldnames:


`filter[queries][]` can be used to add arbitrary filters:



The syntax for filter queries is the same as solr:
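
To make the `q` versus `filter` distinction concrete, here is a Python sketch that assembles a search URL from the argument names described above. The endpoint comes from URLs quoted in this thread; `build_search_url` is a made-up helper, and the field names in the example (`username`, `points`) are assumptions about the schema, not taken from the docs. The range expression uses ordinary Solr query syntax.

```python
from urllib.parse import urlencode

# Endpoint as it appears in URLs quoted in this thread.
SEARCH_URL = "http://api.thriftdb.com/api.hnsearch.com/items/_search"


def build_search_url(q, field_filters=None, query_filters=None):
    """Combine a full-text query with optional filters.

    q              full-text query across all fields
    field_filters  dict of fieldname -> value, sent as filter[fields][fieldname]
    query_filters  list of Solr-style filter strings, sent as filter[queries][]
    """
    params = [("q", q)]
    for name, value in (field_filters or {}).items():
        params.append(("filter[fields][%s]" % name, value))
    for filter_query in (query_filters or []):
        params.append(("filter[queries][]", filter_query))
    return SEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical field names: comments by one user with at least 10 points.
    print(build_search_url(
        "thriftdb",
        field_filters={"username": "pg"},
        query_filters=["points:[10 TO *]"],
    ))
```

The filters narrow the candidate set first; `q` then ranks whatever is left, which is why the two are separate arguments rather than one query string.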


Or, by "id"? I looked, but I wasn't able to figure out, for example, how I'd get pg's top-level comment here from the id "2619846". I was able to find it with this search, but it requires a second-part to the "_id" that I don't know where to find: http://api.thriftdb.com/api.hnsearch.com/items/_search?q=... (Trying to filter for "id" instead of "_id" throws an error, for some reason.)

The `_id` attribute for each item is its submission id plus a signature. ThriftDB references objects by their unique `_id` attributes.

The `id` attribute on the other hand is an item's submission id. It's not indexed though so you can't do lookup by id.
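
A small Python sketch of what that means for client code: if you already have an `_id`, you can recover the HN submission id from it, but not the other way round, since the signature is not derivable from the id. The hyphen separator is an assumption based on the item URLs quoted in this thread, and the signature value in the example is made up.

```python
# Item endpoint as it appears in URLs quoted in this thread.
ITEMS_URL = "http://api.thriftdb.com/api.hnsearch.com/items"


def split_item_id(item_id):
    """Split a ThriftDB `_id` into (submission_id, signature).

    Assumes the form "<submission id>-<signature>".
    """
    submission_id, separator, signature = item_id.partition("-")
    if not separator or not submission_id.isdigit() or not signature:
        raise ValueError("unexpected _id format: %r" % item_id)
    return int(submission_id), signature


def item_url(item_id):
    """URL for fetching a single item by its full `_id`."""
    return ITEMS_URL + "/" + item_id


if __name__ == "__main__":
    # "abc123" is a placeholder signature, not a real one.
    print(split_item_id("2619846-abc123"))
    print(item_url("2619846-abc123"))
```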

Well that explains why I failed at fetching an item by id, and I tried a whole bunch of different ways. You should really mention somewhere that the "id" referenced is an internal id, and not the HN id.

I would like to look up items by HN id.

could you please index the submission ids? i can think of a few reasons this would be useful


How do I find this comment through the API? How do I find its replies?

Would you mind emailing me with your ideas? andres@octopart.com

Should be faster for the developer and also good for HN's servers to not have so many independent apps scraping the forum.

Browser notification when someone comments on your post. Would that be useful?

Will you move the search box to the top right hand corner? Perhaps just below my username?

Please do this. Having the search box at the bottom of the page makes no sense to me.

+1 for this. I use AutoPatchWork so it's not very accessible (though I can imagine it's the same for those not using AutoPatchWork)

One downside of your suggested placement is it would push down page content about 20px. But I agree that the footer is sub-optimal.

Or actually, maybe if it was just in the top bar -- between "submit" and my username.

Why at the bottom? I'm sure it's supposed to be more important than the footer. You could use SmashingMagazine resources when it comes to UX and design. It looks like you made almost all the frequent mistakes:

   Placing the search box at the bottom of the page, or hiding it in the navigation menu.
   Making the input field too short; users are forced to use short, imprecise queries, because longer queries would be hard and inconvenient to read.
   Making the submit button too small, so that users have to point the mouse very precisely.
   Making the search box hard to find.
Source: http://www.smashingmagazine.com/2008/12/04/designing-the-hol...

Comment scores are back! (in a convoluted way, through the API). I can see that the top comment on this page right now (tptacek's) has 58 points:


Looks like it's been changed. Points seem to always be null through the API now.

edit: just removed from the search. still here: http://api.thriftdb.com/api.hnsearch.com/items/2619811-02e34...

expect to be gone soon.

Interface improvement suggestions:

When the search results page comes up, the search box is active. This makes any keystrokes I type appear in the search box instead of being passed on to my browser to do things like scroll down the page of results.

This is pretty annoying, and makes me have to click outside the search box before I can use the keyboard to scroll down the page, etc.

So please make it so that the search box does not have focus when the search results are returned.

The other suggestion I have is to allow people to use their keyboards to go to the next page of search results. In Opera, I do this simply by hitting the space bar once I get to the bottom of the page.

This is how Google's interface works. In general, I think you can't go wrong by studying the design of their search interface carefully and copying it.

It's a shame that, rather than embracing SearchYC which has existed and filled a needed feature for several years, you chose to use this as a plug for one of your companies...

The feature that jumps out at me is that I can see comment karma scores for all the comments that I search up with a keyword search. Evidently the upvote/downvote buttons I see next to the comments in the search results don't change the comment karma scores.

I noticed the same thing, especially for current discussions. But then it changed. It looks like there's now a 5-day threshold for karma point exposure.

Would love to see the search box placed at the top-right of the site. Kind of a universal standard these days (and would mean no scrolling required to search).

The HTML5 `type=search` attribute would be cool as well :)

Edit: Additionally, use of `placeholder=Search` and the ditching of the label shouldn’t be terribly controversial here. Given the audience, I have to imagine most users will see the placeholder text, no polyfill required.

PG: Need to change the link in the bottom right of the page that says search to hnsearch instead of google with site:...

This is a wonderful gain in functionality. Any chance pg and Octopart can implement search for pg's other forum? (Relevant HN thread at http://news.ycombinator.com/item?id=2620297 .)

Should be a good test of the ThriftDB technology. Too bad the HN discussion about ThriftDB didn't come up in my test search, found it via Google instead.


Really? That exact story is result #4 searching for "thriftdb" (http://www.hnsearch.com/search#request/all&q=thriftdb) and the only result when searching for stories (http://www.hnsearch.com/search#request/submissions&q=thr...)

These results are as of 1pm EST on Saturday; as this very comment thread gets indexed, the ordering may change...

Ah, I had JavaScript disabled. Once enabled, I got good search results!

Do you typically browse with JavaScript disabled? I would think that these days it would seriously change the browsing experience and many sites would be unusable. Just curious. :)

I have JavaScript disabled by default, and selectively enable it for the sites where I feel it is worthwhile. I'm a longtime NoScript user, and these days I use NotScripts with Chrome.

Is there any search engine that returns the right results for the term "C++"?

Both Google and Bing seem to handle it: http://www.google.com/search?q=site%3Ahttp%3A%2F%2Fnews.ycom... http://www.bing.com/search?q=site%3Ahttp%3A%2F%2Fnews.ycombi... Are there some particular cases where you see them screwing up?

Yeah sorry, I realize Google and Bing have fixed those issues but for the longest time, even Google didn't do it!

It's on the todo list.

This was by far the single feature that I wanted most for HN. Thank you

Pardon me, and I don't want to come off as ignorant, but what is the big deal with this? I have been using Google to search Hacker News articles by querying `<subject> site:news.ycombinator.com` and it has worked wonders.

Since hackernews is all publicly available, Googlebots must have indexed every single page of this site. And we all know how good Google has been with ranking and relevance. Why did HN decide to reinvent the wheel? Why didn't HN use the Google Custom Search plugin?

(I am just curious to know and not criticizing.)

The problem with Google Search for a site like Hacker News is that it doesn't have the same level of understanding of the metadata that makes up the site. A good example is sort-by-date (which Google can approximate based on the date something was first spotted by its crawlers, but it's not nearly as accurate as having access to the "date" field in the underlying data structures) - another is "just search comments by this username".

SearchYC allowed results to be sorted by date of submission, allowed searches by username and points, and other aspects that Google simply doesn't understand and can't provide.

HNSearch is an Octopart/ThriftDB project to test out our search technology and give back to the HN community.

Google has access to the link structure of the internet so it's great for macro searches. However, if you want to do a site-specific micro search then it usually helps to have access to the underlying metadata (e.g. points, karma, timestamps).

Google is great for simple searches, but fails utterly for any site-specific filtering of results.

I search HN a lot, so am excited by this native implementation. I would be curious to know more about how the "relevance" sorting algorithm works though. When I did a test search for "domain registrar," for example, the top result was a comment with a score of -4. It seems like there are many ways to implement that feature, so I would be quite interested to hear more if the creators were able to share some thoughts on the general "relevance" problem.

We're using the HN hotness algorithm with some points boosts to surface older items. For a full explanation of the ranking algorithm check out the API docs:
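
For context, the formula usually quoted for HN's front-page ranking divides points by a power of the item's age. The Python sketch below uses that commonly cited form with the commonly cited gravity of 1.8. Neither is confirmed in this thread, and the points boosts HNSearch adds on top are not specified here, so the sketch leaves them out.

```python
def hotness(points, age_hours, gravity=1.8):
    """Commonly cited HN-style ranking score (an assumption, see above).

    points     votes on the item
    age_hours  hours since the item was posted
    gravity    how quickly older items sink
    """
    if age_hours < 0:
        raise ValueError("age_hours must be non-negative")
    return (points - 1) / (age_hours + 2) ** gravity


if __name__ == "__main__":
    # Same points, different ages: the newer item ranks higher.
    print(hotness(50, 1), hotness(50, 48))
```

With a pure age decay like this, anything more than a few days old scores close to zero no matter how many points it has, which is presumably why a search engine over the full archive needs some extra points-based boost.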


The app flow is a bit unnatural. You get to it directly from HN, but the only back link to the main page is hidden in the footer.

That's still a great thing to have. Simple and efficient.

Good point. The UI is designed for someone coming directly to hnsearch.com. I'll see if I can tweak it for people coming from HN.

any chance of integrating the search box into the top bar? i use some auto pagerize plugins that sort of get in the way. i know i could just turn it off, but that would defeat its purpose. also, i suppose this would help with general visibility of the search box, as most sites have theirs at the top of their pages!

but it's definitely great to see search built into the site! will make finding stuff on hn a lot more convenient.

Do you use Chrome? You can list this search engine as a custom search engine in Preferences > Basics > Manage Search Engines...

Add a new search engine -> Hacker News

Keyword -> hn

Url with %s in place of query -> http://www.hnsearch.com/search#request/all&q=%s

You can do the same thing with a keyword bookmark in Firefox.
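
In case the `%s` part is unclear: the browser substitutes whatever you type after the keyword into the URL template. A Python sketch of that substitution follows. The template is the one given above; `expand_keyword_search` is a made-up name, and the exact escaping a given browser applies may differ from `urllib`'s.

```python
from urllib.parse import quote

# URL template from the steps above; %s marks where the query goes.
TEMPLATE = "http://www.hnsearch.com/search#request/all&q=%s"


def expand_keyword_search(template, terms):
    """Replace %s in a keyword-search template with the escaped query."""
    return template.replace("%s", quote(terms, safe=""))


if __name__ == "__main__":
    # Typing "hn domain registrar" in the address bar would open roughly:
    print(expand_keyword_search(TEMPLATE, "domain registrar"))
    print(expand_keyword_search(TEMPLATE, "C++"))
```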

I've been searching HN all this time with a custom Google search including "site:news.ycombinator.com". I'll have to see if this is better. Google offers results filtered by date, which is really nice, since technology moves so incredibly fast.

You can add HNSearch as a custom search engine in Firefox as well. Just visit hnsearch.com, click on the favicon in the browser's search box and select "Add HNSearch".

For the Chrome custom search engine it would be better to use the non-javascript url: http://www.hnsearch.com/search?q=

That will get redirected automatically and won't change with the webapp implementation.

You can filter by date/points with HNSearch.

I can't reproduce the Firefox steps you mention. I click on the favicon in the browser's search box and I don't get a chance to select "Add HNSearch," but rather a message saying "This website does not supply identity information."

It's the favicon in the search box, not the location bar.

Any reason for the search to be at the bottom?

PG: Can we please have it at the top, after submit?

The search box should be on the top

I agree, but where would you fit it in? I'm trying to figure that out as well.

There should be a prominent link back to news.ycombinator.com from HNSearch pages.

How is ThriftDB helping HN search? It'd be interesting to hear the details.

HNSearch consists of a simple JavaScript webapp hosted at hnsearch.com which sends Ajax requests directly to ThriftDB. ThriftDB returns items in the search response, which makes it ideal for this type of architecture. For a more detailed explanation, check out the API docs:


And here's a sample search response from ThriftDB:


So is HN data now being stored in ThriftDB? Or is it being replicated on your server? If so, what's the lag?

We wrote a crawler to download data from HN, parse it into JSON, and upload it to ThriftDB. The webapp at hnsearch.com sends requests directly to ThriftDB. There's a ~15min lag between HN and HNSearch.

Finally. Could we get one on the Arc forum as well?

Thank you! I was wondering exactly the same thing: http://news.ycombinator.com/item?id=2620297

How about adding the search box in the top?

I put together a simple safari extension that puts the bar after the submit link in the header. (not sure on the amount of safari users, but maybe a chrome version?)


I'm really excited to start using this! Also - "built outside of Octopartitself" needs a space.

I noticed a good percentage of the spam I received originated from SearchYC links (via links placed on HN under a different account name).

I would be interested to know if HN also receives less spam with the removal of SearchYC.

Does HN track stats on this?

I don't know if it was left intentionally, but the search link in the footer menu bar still goes to google instead of hnsearch.

typo: "versionso" should be "version_so"

This should cut down on duplicate submissions. That is, IF people remember to use it.

submission could automatically give search results on the title and/or url before confirming (like SO).

MetaFilter has been doing this for years too and it works ridiculously well. You have to see a list of potential matches before you can commit your post. So +1 to that.

People won't use it. Decent search has been available via SearchYC for years, and no one seemed to use it to prevent duplicate and/or similar submissions. Further, if you submit via the bookmarklet there is a huge incentive not to search.

Wow, what an achievement.

To who ever runs it: 'Yeserday' is spelled wrong. :)

Nice catch! Fixed.

whoever ;)

First thought: Seriously, you guys couldn't just make SQL or something lucene-based work for this (like elasticsearch) ?

Second thought: Hmm, ooh well, I just use google with site: news.ycombinator.com

EDIT: It annoys me when people hate on SQL and act like it's OMGSOSLOW. I would agree that fulltext search BY ITSELF isn't amazing, but you can use something like Sphinx to wrap it and support awesome things like Soundex / Metaphone fuzzy matching. It's also stupid easy to set up. The NoSQL movement is, IMO, largely misguided. If you need to shard data and scale ridiculously, something like Lucene can probably do it. If you need bigger than that, use memcached/redis and shard and map/reduce queries... that makes sense, but NoSQL doesn't seem like a great idea for PRIMARY data storage. It seems like you better have amazing backups :)

How about opening up an API and letting the hackers here have a bash at it?

It would be a fun challenge to see who can come up with the best and most useful solution :)

There is an API, linked from the bottom of every page: http://www.hnsearch.com/api

For the next two weeks there's even an API contest going on! http://www.hnsearch.com/contest

How many more posts do we need expressing our displeasure that searchyc has gone AWOL?

Wake me up when there's an official status update.
