
Hacker News Search - pg
http://ycombinator.com/newsnews.html#4jun11
======
tptacek
I like hnsearch a lot, but I'd like us to take a second to thank whoever was
running SearchYC, which for the past couple years has been practically
indispensable in keeping up with this community.

~~~
Sukotto
I miss searchyc and feel sad that they went completely dark with no real
explanation.

~~~
xentronium
[http://webcache.googleusercontent.com/search?q=cache:_FVEnmE...](http://webcache.googleusercontent.com/search?q=cache:_FVEnmEbigYJ:searchyc.com/post/6027748746/why-is-searchyc-down+searchyc)

~~~
Sukotto
Yes, I saw that the first time around thanks.

It describes an action, but explains nothing.

Just off the top of my head, here are some unanswered questions: Why did
Comcast shut them down? Why did Comcast have the authority to shut them down?
Why didn't the searchyc admins take their site somewhere else? Why have they
said nothing here on HN? Why were they shut down mere days before pg announced
an official search (hopefully just a coincidence)? What happened to their data
(it's especially valuable now that we can no longer see comment scores)? Etc.

~~~
tptacek
Jiminy. Can we skip the outrage this time?

You obviously haven't even looked at HNSearch if your big concern is comment
scores.

Sheesh!

~~~
Sukotto
No outrage intended; sorry if it came across that way.

I was commenting on the abrupt departure of searchyc, not on the new search.
I haven't seen the new search yet since it's not working on my BlackBerry for
some reason, and I don't have my laptop with me.

~~~
tptacek
Not everything that isn't explained in crystalline detail is a conspiracy or
even evidence of significant drama.

~~~
Sukotto
You seem to be reading an undertone into my comment that simply wasn't there.

~~~
tptacek
"Hopefully it's just a coincidence" that SearchYC lost its Comcast
connectivity right before Paul Graham announced HNSearch. No. I don't think I
misread the undertone.

Regardless, we don't need to beat this to death. Have a good weekend.

------
sigil
_ThriftDB is the software they had to write to do Octopart's electronic parts
searches; no existing software was up to the task._

Can the Octopart guys comment on why no existing software was up to the task?
Also, I'm curious if there's something that makes ThriftDB a particularly good
choice for HN search as well.

~~~
andres
A search engine has many moving parts and we found that the existing
technology worked but with great pain.

Lucene/Solr is great for search but it's useless without a fast datastore.
Additionally, you can't expose it to the open internet so you need to build a
REST API wrapper around the DB/index.

One particularly acute problem we had at Octopart was search across a
frequently changing schema. We have 15M parts in our database so it was a real
pain to change the schema. ThriftDB uses the Thrift serialization protocol
internally to maintain a flexible schema so you can change the schema
independently of the underlying data.

The current implementation of ThriftDB takes advantage of solr in smart ways
to simplify app development dramatically. At Octopart we found that we were
spending a lot of time building custom search solutions. Now we're using the
same backend technology to search for electronic parts on Octopart and
comments on Hacker News.

Using ThriftDB for search is like going from a compiled language to a
scripting language. It takes care of a lot of issues so you can focus on app
development.

Hope that explanation helps! Please let me know if you have more questions.

~~~
brianwhitman
_One particularly acute problem we had at Octopart was search across a
frequently changing schema. We have 15M parts in our database so it was a real
pain to change the schema._

Solr can be run without a schema quite easily, or use the very common trope of
the dynamicField typed schema (`string_*`, `text_*`, `int_*` or whatever)

 _Lucene/Solr is great for search but it's useless without a fast datastore._

This is true at real scale (EN has 500m docs in our bigger solrs and we have
to back it w/ TT) but for 15m docs, no way. Stored data in lucene/solr is
certainly fast enough at that level. Think <50ms including data even on AWS.

 _Additionally, you can't expose it to the open internet so you need to build
a REST API wrapper around the DB/index._

This I don't get. Surely you have a web app layer that talks to your services?
That is, you don't have a text field in HTML hitting solr on port 8983, you
call your web app's search which then hits solr's HTTP API?

~~~
andres
_Solr can be run without a schema quite easily, or use the very common trope
of the dynamicField typed schema (`string_*`, `text_*`, `int_*` or whatever)_

We're using solr dynamic fields in our current implementation of ThriftDB.
ThriftDB adds a layer on top though so you can change attribute names on the
fly.

 _This is true at real scale (EN has 500m docs in our bigger solrs and we have
to back it w/ TT) but for 15m docs, no way. Stored data in lucene/solr is
certainly fast enough at that level. Think <50ms including data even on AWS._

In our experience, stored data in lucene/solr doesn't scale well.

 _This I don't get. Surely you have a web app layer that talks to your
services? That is, you don't have a text field in HTML hitting solr on port
8983, you call your web app's search which then hits solr's HTTP API?_

It sounds like we're saying the same thing. If you use solr you're responsible
for the web app layer. With ThriftDB you get a JSON REST API out of the box
for every collection you create.

~~~
brianwhitman
OK, thanks I think I get what thriftDB is a little better. It's hosted (only)
-- so instead of booting a solr on your own boxes, you just hit
api.thriftdb.com to create indexes, add data and do queries. So thus my "API"
confusion -- I was assuming this was software I run on my own box and I don't
know what I would get from the API that I wouldn't get hitting solr direct.
But a turnkey search service is a nice idea.

But do you guys have a lot of experience with solr / lucene scaling (I mean
well beyond the 15m docs octopart has)? What happens when an API customer
starts ingesting >100m docs? And then 10 do it at once? Are the indexes on
different boxes? Are you on AWS? What's the disk backing it? Does each index
fit in RAM? Does the user have any control over caches, tokenizers, stemming,
triefields? Like... when I add a date, you're not indexing it with millisecond
accuracy, are you?

After EN gets bought on accident by the norwegian fish cannery I plan to
retire on giving speeches of 5 years of Solr scaling woes. Unless Otis G is
running it there's no way I would trust a hosted solr solution that I didn't
have full control over.

~~~
andres
Exactly! Would love to get more feedback if you have a chance to try out
ThriftDB (<http://www.thriftdb.com>).

We have a lot of experience scaling faceted search with solr. 15M docs might
not sound like a lot if you're doing full-text search but faceted-search adds
another level of complexity:

[http://octopart.com/partsearch/#search/requestData&q=cap...](http://octopart.com/partsearch/#search/requestData&q=capacitor)

As far as scaling goes, most of the demand right now is for smaller indexes so
that's not a problem. We know how to scale out though so we're pretty excited
to get customers with >100M docs.

We're hosted on AWS but we're still iterating on the architecture. The indexes
fit in RAM. Currently you can't control caches, tokenizers, stemming, etc. but
that's in the pipeline.

Good luck with your new Norwegian cannery owners! I hear you on hosted search.
We're trying our best to create a headache-free hosted solution.

------
pg
I should add that HNSearch has an API

<http://www.hnsearch.com/api>

which should be a lot faster than crawling HN itself to get the same data.

~~~
k7d
This is cool, but the API looks a bit cryptic. Is there a way to simply look
up an item by URL?

Edit: Never mind, figured it out:
[http://api.thriftdb.com/api.hnsearch.com/items/_search?prett...](http://api.thriftdb.com/api.hnsearch.com/items/_search?pretty_print=true&filter\[fields\]\[url\]\[\]=http%3A%2F%2Fycombinator.com%2Fnewsnews.html%234jun11)
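
For anyone who wants to build that lookup programmatically, here is a minimal sketch. The endpoint and the `filter[fields][url][]` and `pretty_print` parameters are copied from the link above; the helper name `build_url_lookup` is hypothetical, and nothing here is checked against the official API docs. Note that `urlencode` also percent-encodes the square brackets, which should be equivalent to the escaped form in the link.

```python
from urllib.parse import urlencode

# Endpoint taken from the link above; treat it as an assumption, not documented fact.
SEARCH_ENDPOINT = "http://api.thriftdb.com/api.hnsearch.com/items/_search"


def build_url_lookup(story_url: str) -> str:
    """Return a search URL that filters items on their exact `url` field."""
    params = [
        ("pretty_print", "true"),
        ("filter[fields][url][]", story_url),
    ]
    # urlencode percent-encodes both keys and values (':' '/' '#' '[' ']').
    return SEARCH_ENDPOINT + "?" + urlencode(params)


lookup = build_url_lookup("http://ycombinator.com/newsnews.html#4jun11")
print(lookup)
```

The important detail is that the `#4jun11` fragment has to be encoded as `%23`, otherwise it would be treated as a fragment of the API URL and never reach the server.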

~~~
news-yc
Or, by "id"? I looked, but I wasn't able to figure out, for example, how I'd
get pg's top-level comment here from the id "2619846". I was able to find it
with this search, but it requires a second-part to the "_id" that I don't know
where to find:
[http://api.thriftdb.com/api.hnsearch.com/items/_search?q=...](http://api.thriftdb.com/api.hnsearch.com/items/_search?q=&filter\[fields\]\[_id\]=2619846-a30f9&pretty_print=true)
(Trying to filter for "id" instead of "_id" throws an error, for some reason.)

~~~
andres
The `_id` attribute for each item is its submission id plus a signature.
ThriftDB references objects by their unique `_id` attributes.

The `id` attribute on the other hand is an item's submission id. It's not
indexed though so you can't do lookup by id.
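
As a rough illustration of that `_id` layout, the sketch below splits a value such as `2619846-a30f9` into its two parts. It assumes the separator is a single hyphen, which is only inferred from the two `_id` values quoted in this thread; the function name is hypothetical, and how the signature is computed is not described here, so the code does not try to derive it.

```python
from typing import Tuple


def split_item_id(item_id: str) -> Tuple[int, str]:
    """Split a ThriftDB-style `_id` into (submission id, signature).

    Assumes the form "<digits>-<signature>", as seen in this thread.
    """
    submission_id, _, signature = item_id.partition("-")
    if not submission_id.isdigit() or not signature:
        raise ValueError(f"unexpected _id format: {item_id!r}")
    return int(submission_id), signature


print(split_item_id("2619846-a30f9"))
```

Going the other way (from a bare submission id to the full `_id`) is not possible with this alone, since the signature part is unknown.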

~~~
milkshakes
could you please index the submission ids? i can think of a few reasons this
would be useful

~~~
lt
Exactly.

How do I find this comment through the API? How do I find its replies?

------
stevenj
Will you move the search box to the top right hand corner? Perhaps just below
my username?

~~~
arctangent
Please do this. Having the search box at the bottom of the page makes no sense
to me.

------
fara
Why at the bottom? I'm sure it's supposed to be more important than the
footer. You could use SmashingMagazine resources when it comes to UX and
design. It looks like you made almost all the frequent mistakes:

    
    
       Placing the search box at the bottom of the page, or hiding it in the navigation menu.
       Making the input field too short; users are forced to use short, imprecise queries, because longer queries would be hard and inconvenient to read.
       Making the submit button too small, so that users have to point the mouse very precisely.
       Making the search box hard to find.
    

Source: [http://www.smashingmagazine.com/2008/12/04/designing-the-hol...](http://www.smashingmagazine.com/2008/12/04/designing-the-holy-search-box-examples-and-best-practices/)

------
lt
Comment scores are back! (in a convoluted way, through the API). I can see
that the top comment on this page (tptacek's) currently has 58 points:

[http://api.thriftdb.com/api.hnsearch.com/items/_search?filte...](http://api.thriftdb.com/api.hnsearch.com/items/_search?filter\[fields\]\[username\]\[\]=tptacek&sortby=create_ts%20desc&pretty_print=true)

~~~
lt
Looks like it's been changed. Points seem to always be null through the API
now.

edit: just removed from the search. still here:
[http://api.thriftdb.com/api.hnsearch.com/items/2619811-02e34...](http://api.thriftdb.com/api.hnsearch.com/items/2619811-02e34?pretty_print=true)

expect to be gone soon.

------
gnosis
Interface improvement suggestions:

When the search results page comes up, the search box is active. This makes
any keystrokes I type appear in the search box instead of being passed on to
my browser to do things like scroll down the page of results.

This is pretty annoying, and makes me have to click outside the search box
before I can use the keyboard to scroll down the page, etc.

So please make it so that the search box does _not_ have focus when the search
results are returned.

The other suggestion I have is to allow people to use their keyboards to go to
the next page of search results. In Opera, I do this simply by hitting the
space bar once I get to the bottom of the page.

This is how Google's interface works. In general, I think you can't go wrong
by studying the design of their search interface carefully and copying it.

------
mrjbq7
It's a shame that, rather than embracing SearchYC which has existed and filled
a needed feature for several years, you chose to use this as a plug for one of
your companies...

------
tokenadult
The feature that jumps out at me is that I can see comment karma scores for
all the comments that I search up with a keyword search. Evidently the
upvote/downvote buttons I see next to the comments in the search results don't
change the comment karma scores.

~~~
woodrow
I noticed the same thing, especially for current discussions. But then it
changed. It looks like there's now a 5-day threshold for karma point exposure.

------
alanh
Would love to see the search box placed at the top-right of the site. Kind of
a universal standard these days (and would mean no scrolling required to
search).

The HTML5 `type=search` attribute would be cool as well :)

 _Edit:_ Additionally, use of `placeholder=Search` and the ditching of the
label shouldn’t be terribly controversial here. Given the audience, I have to
imagine most users will see the placeholder text, no polyfill required.

------
ZackOfAllTrades
PG: Need to change the link in the bottom right of the page that says search
to hnsearch instead of google with site:...

------
evanrmurphy
This is a wonderful gain in functionality. Any chance pg and Octopart can
implement search for pg's other forum? (Relevant HN thread at
<http://news.ycombinator.com/item?id=2620297> .)

------
evangineer
Should be a good test of the ThriftDB technology. Too bad the HN discussion
about ThriftDB didn't come up in my test search, found it via Google instead.

<http://news.ycombinator.com/item?id=2581652>

~~~
bnewbold
Really? That exact story is result #4 searching for "thriftdb"
([http://www.hnsearch.com/search#request/all&q=thriftdb](http://www.hnsearch.com/search#request/all&q=thriftdb))
and the only result when searching for stories
([http://www.hnsearch.com/search#request/submissions&q=thr...](http://www.hnsearch.com/search#request/submissions&q=thriftdb&start=0))

These results are as of 1pm EST on Saturday, as this very comment thread gets
indexed the ordering may change...

~~~
evangineer
Ah, I had JavaScript disabled. Once enabled, I got good search results!

~~~
evanrmurphy
Do you typically browse with JavaScript disabled? I would think that these
days it would seriously change the browsing experience and many sites would be
unusable. Just curious. :)

~~~
evangineer
I have JavaScript disabled by default, and selectively enable it for the sites
where I feel it is worthwhile. I'm a longtime NoScript user, and these days I
use NotScripts with Chrome.

------
cheez
Is there any search engine that returns the right results for the term "C++"?

~~~
util
Both Google and Bing seem to handle it:
[http://www.google.com/search?q=site%3Ahttp%3A%2F%2Fnews.ycom...](http://www.google.com/search?q=site%3Ahttp%3A%2F%2Fnews.ycombinator.com%2F+c%2B%2B)
[http://www.bing.com/search?q=site%3Ahttp%3A%2F%2Fnews.ycombi...](http://www.bing.com/search?q=site%3Ahttp%3A%2F%2Fnews.ycombinator.com%2F+c%2B%2B)
Are there some particular cases where you see them screwing up?

~~~
cheez
Yeah sorry, I realize Google and Bing have fixed those issues but for the
longest time, even Google didn't do it!

------
jaxonrice
This was by far the single feature that I wanted most for HN. Thank you

------
vicngtor
Pardon me, and I don't want to come off as ignorant, but what is the big deal
with this? I have been using Google to search hackernews articles by querying
`<subject> site:news.ycombinator.com` and it has worked wonders.

Since hackernews is all publicly available, Googlebots must have indexed every
single page of this site. And we all know how good Google has been with
ranking and relevance. Why did HN decide to reinvent the wheel? Why didn't HN
use the Google Custom Search plugin?

(I am just curious to know and not criticizing.)

~~~
simonw
The problem with Google Search for a site like Hacker News is that it doesn't
have the same level of understanding of the metadata that makes up the site. A
good example is sort-by-date (which Google can approximate based on the date
something was first spotted by its crawlers, but it's not nearly as accurate
as having access to the "date" field in the underlying data structures) -
another is "just search comments by this username".

------
akikuchi
I search HN a lot, so am excited by this native implementation. I would be
curious to know more about how the "relevance" sorting algorithm works though.
When I did a test search for "domain registrar," for example, the top result
was a comment with a score of -4. It seems like there are many ways to
implement that feature, so I would be quite interested to hear more if the
creators were able to share some thoughts on the general "relevance" problem.

~~~
andres
We're using the HN hotness algorithm with some points boosts to surface older
items. For a full explanation of the ranking algorithm check out the API docs:

<http://www.hnsearch.com/api>

------
Wilya
The app flow is a bit unnatural. You get to it directly from HN, but the only
back link to the main page is hidden in the footer.

That's still a great thing to have. Simple and efficient.

~~~
andres
Good point. The UI is designed for someone coming directly to hnsearch.com.
I'll see if I can tweak it for people coming from HN.

------
dnlk
any chance of integrating the search box into the top bar? i use some auto
pagerize plugins that sort of get in the way. i know i could just turn it off,
but that would break its purpose. also, i suppose this would help with general
_visibility_ of the search box, as most sites have theirs on top of their
pages!

but it's definitely great to see search built into the site! will make finding
stuff on hn _a lot_ more convenient.

~~~
peng
Do you use Chrome? You can list this search engine as a custom search engine
in Preferences > Basics > Manage Search Engines...

Add a new search engine -> Hacker News

Keyword -> hn

Url with %s in place of query ->
[http://www.hnsearch.com/search#request/all&q=%s](http://www.hnsearch.com/search#request/all&q=%s)

You can do the same thing with a keyword bookmark in Firefox.

I've been searching HN all this time with a custom Google search including
"site:news.ycombinator.com". I'll have to see if this is better. Google offers
results filtered by date, which is really nice, since technology moves so
incredibly fast.

~~~
andres
You can add HNSearch as a custom search engine in Firefox as well. Just visit
hnsearch.com, click on the favicon in the browser's search box and select "Add
HNSearch".

For the Chrome custom search engine it would be better to use the
non-javascript url: <http://www.hnsearch.com/search?q=>

That will get redirected automatically and won't change with the webapp
implementation.

You can filter by date/points with HNSearch.
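
To make the keyword-search idea concrete, here is a small sketch of roughly what a browser does with a `%s` template: it URL-encodes the typed query and substitutes it into the template. The template is the non-JavaScript URL mentioned above with `%s` appended; the function name is hypothetical, and real browsers may encode some characters slightly differently.

```python
from urllib.parse import quote_plus

# Non-JavaScript search URL from the comment above, with %s as the placeholder.
TEMPLATE = "http://www.hnsearch.com/search?q=%s"


def expand_keyword_search(template: str, query: str) -> str:
    """Substitute a URL-encoded query for the %s placeholder."""
    return template.replace("%s", quote_plus(query))


print(expand_keyword_search(TEMPLATE, "domain registrar"))
print(expand_keyword_search(TEMPLATE, "c++"))
```

Encoding matters for queries like "c++", where an unescaped `+` would otherwise be read as a space.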

~~~
tokenadult
I can't reproduce the Firefox steps you mention. I click on the favicon in the
browser's search box and I don't get a chance to select "Add HNSearch," but
rather a message saying "This website does not supply identity information."

~~~
akkartik
It's the favicon in the search box, not the location bar.

------
dawie
Any reason for the search to be at the bottom?

PG: Can we please have it at the top, after submit?

------
rbreve
The search box should be on the top

~~~
evanrmurphy
I agree, but where would you fit it in? I'm trying to figure that out as well.

------
amichail
There should be a prominent link back to news.ycombinator.com from HNSearch
pages.

------
revorad
How is ThriftDB helping HN search? It'd be interesting to hear the details.

~~~
andres
HNSearch consists of a simple javascript webapp hosted at hnsearch.com which
sends ajax requests directly to ThriftDB. ThriftDB returns items in the search
response which makes it ideal for this type of architecture. For a more
detailed explanation check out the API docs:

<http://www.hnsearch.com/api>

And here's a sample search response from ThriftDB:

[http://api.thriftdb.com/api.hnsearch.com/items/_search?q=fac...](http://api.thriftdb.com/api.hnsearch.com/items/_search?q=facebook&pretty_print=true)

~~~
revorad
So is HN data now being stored in ThriftDB? Or is it being replicated on your
server? If so, what's the lag?

~~~
andres
We wrote a crawler to download data from HN, parse it into JSON, and upload it
to ThriftDB. The webapp at hnsearch.com sends requests directly to ThriftDB.
There's a ~15min lag between HN and HNSearch.

------
Typhon
Finally. Could we get one on the arc forum as well ?

~~~
evanrmurphy
Thank you! I was wondering exactly the same thing:
<http://news.ycombinator.com/item?id=2620297>

------
nabaraj
How about adding the search box in the top?

------
thomasswift
I put together a simple safari extension that puts the bar after the submit
link in the header. (not sure on the amount of safari users, but maybe a
chrome version?)

<http://1821design.com/HNTopBarSearchField/>

------
jasonshen
I'm really excited to start using this! Also - "built outside of
Octopartitself" needs a space.

------
blackstag
I noticed a good percentage of the spam I received originated from SearchYC
links (via links placed on HN under a different account name).

I would be interested to know if HN also receives less spam with the removal
of SearchYC.

Does HN track stats on this?

------
nhebb
I don't know if it was left intentionally, but the search link in the footer
menu bar still goes to google instead of hnsearch.

------
daniel-cussen
typo: "versionso" should be "version_so"

------
krzysz00
This should cut down on duplicate submissions. That is, IF people remember to
use it.

~~~
6ren
submission could automatically give search results on the title and/or url
before confirming (like SO).

~~~
petercooper
MetaFilter has been doing this for years too and it works ridiculously well.
You have to see a list of potential matches before you can commit your post.
So +1 to that.

------
brupm2
Wow, what an achievement.

------
duck
To whoever runs it: 'Yeserday' is spelled wrong. :)

~~~
andres
Nice catch! Fixed.

------
seanp2k
First thought: Seriously, you guys couldn't just make SQL or something
lucene-based work for this (like elasticsearch)?

Second thought: Hmm, oh well, I just use google with
site:news.ycombinator.com

EDIT: It annoys me when people hate on SQL and act like it's OMGSOSLOW. I
would agree that fulltext search BY ITSELF isn't amazing, but you can use
something like Sphinx to wrap it and support awesome things like Soundex /
Metaphone fuzzy matching. It's also stupid easy to set up. The NoSQL movement
is, IMO, largely misguided. If you need to shard data and scale ridiculously,
something like Lucene can probably do it. If you need bigger than that, use
memcached/redis and shard and map/reduce queries... that makes sense, but
NoSQL doesn't seem like a great idea for PRIMARY data storage. It seems like
you better have amazing backups :)

------
mrvc
How about opening up an API and letting the hackers here have a bash at it?

It would be a fun challenge to see who can come up with the best and most
useful solution :)

~~~
bnewbold
There is an API, linked from the bottom of every page:
<http://www.hnsearch.com/api>

For the next two weeks there's even an API contest going on!
<http://www.hnsearch.com/contest>

------
brndnhy
How many more posts do we need expressing our displeasure that searchyc has
gone AWOL?

Wake me up when there's an official status update.

