Email firstname.lastname@example.org to get started.
It describes an action, but explains nothing.
Just off the top of my head, here are some unanswered questions:
Why did Comcast shut them down? Why did Comcast have the authority to shut them down? Why didn't the SearchYC admins take their site somewhere else? Why have they said nothing here on HN? Why were they shut down mere days before PG announced an official search (hopefully just a coincidence)? What happened to their data (it's especially valuable now that we can no longer see comment scores)? Etc.
You obviously haven't even looked at HNSearch if your big concern is comment scores.
I was commenting on the abrupt departure of SearchYC, not on the new search. I haven't seen the new search yet since it's not working on my BlackBerry for some reason, and I don't have my laptop with me.
Regardless, we don't need to beat this to death. Have a good weekend.
Can the Octopart guys comment on why no existing software was up to the task? Also, I'm curious if there's something that makes ThriftDB a particularly good choice for HN search as well.
Lucene/Solr is great for search but it's useless without a fast datastore. Additionally, you can't expose it to the open internet so you need to build a REST API wrapper around the DB/index.
One particularly acute problem we had at Octopart was search across a frequently changing schema. We have 15M parts in our database so it was a real pain to change the schema. ThriftDB uses the Thrift serialization protocol internally to maintain a flexible schema so you can change the schema independently of the underlying data.
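That schema-flexibility idea can be sketched in plain Python (the part fields here are invented for illustration; ThriftDB's actual wire format is Thrift and isn't shown): documents written under an old schema keep working after the app's schema gains a field, because readers fall back to a default instead of failing.

```python
# Two documents indexed at different times: the second was added after
# a hypothetical "datasheet_url" field joined the schema.
old_doc = {"mpn": "NE555", "manufacturer": "TI"}
new_doc = {"mpn": "LM317", "manufacturer": "TI",
           "datasheet_url": "http://example.org/lm317.pdf"}

def read_part(doc):
    # Readers tolerate missing optional fields, so the schema can
    # evolve independently of the data already stored.
    return {
        "mpn": doc["mpn"],
        "manufacturer": doc["manufacturer"],
        "datasheet_url": doc.get("datasheet_url"),  # None if absent
    }

print(read_part(old_doc)["datasheet_url"])  # → None for pre-migration docs
```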
The current implementation of ThriftDB takes advantage of Solr in smart ways to simplify app development dramatically. At Octopart we found that we were spending a lot of time building custom search solutions. Now we're using the same backend technology to search for electronic parts on Octopart and comments on Hacker News.
Using ThriftDB for search is like going from a compiled language to a scripting language. It takes care of a lot of issues so you can focus on app development.
Hope that explanation helps! Please let me know if you have more questions.
Solr can be run without a schema quite easily, or you can use the very common trope of a dynamicField typed schema (`string_*`, `text_*`, `int_*`, or whatever).
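By way of illustration, prefix-style dynamic fields like those would be declared in Solr's `schema.xml` roughly as follows (a sketch; the `string`/`text`/`int` field types must already be defined elsewhere in the schema):

```xml
<!-- Any undeclared field whose name starts with one of these
     prefixes is accepted; the prefix picks its type. -->
<dynamicField name="string_*" type="string" indexed="true" stored="true"/>
<dynamicField name="text_*"   type="text"   indexed="true" stored="true"/>
<dynamicField name="int_*"    type="int"    indexed="true" stored="true"/>
```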
Lucene/Solr is great for search but it's useless without a fast datastore.
This is true at real scale (EN has 500M docs in our bigger Solrs and we have to back it with TT), but for 15M docs, no way. Stored data in Lucene/Solr is certainly fast enough at that level. Think <50ms, including data, even on AWS.
Additionally, you can't expose it to the open internet so you need to build a REST API wrapper around the DB/index.
This I don't get. Surely you have a web app layer that talks to your services? That is, you don't have a text field in HTML hitting Solr on port 8983; you call your web app's search, which then hits Solr's HTTP API?
We're using Solr dynamic fields in our current implementation of ThriftDB. ThriftDB adds a layer on top, though, so you can change attribute names on the fly.
In our experience, stored data in Lucene/Solr doesn't scale well.
It sounds like we're saying the same thing. If you use Solr, you're responsible for the web app layer. With ThriftDB you get a JSON REST API out of the box for every collection you create.
But do you guys have a lot of experience with Solr/Lucene scaling (I mean well beyond the 15M docs Octopart has)? What happens when an API customer starts ingesting >100M docs? And then 10 do it at once? Are the indexes on different boxes? Are you on AWS? What's the disk backing it? Does each index fit in RAM? Does the user have any control over caches, tokenizers, stemming, trie fields? Like... when I add a date, you're not indexing it with millisecond accuracy, are you?
After EN gets bought by accident by the Norwegian fish cannery, I plan to retire on giving speeches about 5 years of Solr scaling woes. Unless Otis G is running it, there's no way I would trust a hosted Solr solution that I didn't have full control over.
We have a lot of experience scaling faceted search with Solr. 15M docs might not sound like a lot if you're doing full-text search, but faceted search adds another level of complexity:
As far as scaling goes, most of the demand right now is for smaller indexes, so that's not a problem. We know how to scale out, though, so we're pretty excited to get customers with >100M docs.
We're hosted on AWS but we're still iterating on the architecture. The indexes fit in RAM. Currently you can't control caches, tokenizers, stemming, etc. but that's in the pipeline.
Good luck with your new Norwegian cannery owners! I hear you on hosted search. We're trying our best to create a headache-free hosted solution.
Here's a link to the ThriftDB docs in case you'd like to learn more:
which should be a lot faster than crawling HN itself to get the same data.
If the current URL has already been submitted, it will go straight to the discussion thread (without voting). If not, it will ask if you want to submit it.
However, I don't always want to upvote a post just because I want to check the discussion. I published this on the assumption that I'm not the only one.
When I built this web app http://www.vcarrer.com/2010/11/hacker-news-mobile-front-page... I needed to use Yahoo YQL to obtain JSONP.
So pretty please, add JSONP.
Nevermind, figured it out:
The `filter` arguments are used to cut the data before performing a full-text query:
`filter[fields][fieldname]` can be used to filter on fieldnames:
`filter[queries]` can be used to add arbitrary filters:
The syntax for filter queries is the same as solr:
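Concretely, those filter parameters are ordinary query-string arguments, so a request for comments by one user with at least 10 points might be assembled like this (the endpoint and field names here are illustrative, not confirmed against the API docs):

```python
from urllib.parse import urlencode

# Hypothetical search endpoint for the HN items collection.
base = "http://api.thriftdb.com/api.hnsearch.com/items/_search"

params = {
    "q": "thriftdb",                          # full-text query
    "filter[fields][username]": "pg",         # exact match on a field
    "filter[queries][]": "points:[10 TO *]",  # arbitrary Solr-style filter query
}
url = base + "?" + urlencode(params)
print(url)
```

The filters cut the candidate set down before the full-text query is scored, which is why they're separate from `q`.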
The `id` attribute, on the other hand, is an item's submission id. It's not indexed, though, so you can't look items up by id.
I would like to lookup items by HN id.
How do I find this comment through the API? How do I find its replies?
Placing the search box at the bottom of the page, or hiding it in the navigation menu.
Making the input field too short; users are forced to use short, imprecise queries, because longer queries would be hard and inconvenient to read.
Making the submit button too small, so that users have to point the mouse very precisely.
Making the search box hard to find.
Edit: just removed from the search. Still here:
Expect it to be gone soon.
When the search results page comes up, the search box is active. This makes any keystrokes I type appear in the search box instead of being passed on to my browser to do things like scroll down the page of results.
This is pretty annoying, and makes me have to click outside the search box before I can use the keyboard to scroll down the page, etc.
So please make it so that the search box does not have focus when the search results are returned.
The other suggestion I have is to allow people to use their keyboards to go to the next page of search results. In Opera, I do this simply by hitting the space bar once I get to the bottom of the page.
This is how Google's interface works. In general, I think you can't go wrong by studying the design of their search interface carefully and copying it.
The HTML5 `type=search` attribute would be cool as well :)
Edit: Additionally, use of `placeholder=Search` and the ditching of the label shouldn’t be terribly controversial here. Given the audience, I have to imagine most users will see the placeholder text, no polyfill required.
These results are as of 1pm EST on Saturday; as this very comment thread gets indexed, the ordering may change...
Since Hacker News is all publicly available, Googlebot must have indexed every single page of this site. And we all know how good Google has been with ranking and relevance. Why did HN decide to reinvent the wheel? Why didn't HN use the Google Custom Search plugin?
(I am just curious to know and not criticizing.)
Google has access to the link structure of the internet so it's great for macro searches. However, if you want to do a site-specific micro search then it usually helps to have access to the underlying metadata (e.g. points, karma, timestamps).
That's still a great thing to have. Simple and efficient.
but it's definitely great to see search built into the site! It will make finding stuff on HN a lot more convenient.
Add a new search engine -> Hacker News
Keyword -> hn
Url with %s in place of query -> http://www.hnsearch.com/search#request/all&q=%s
You can do the same thing with a keyword bookmark in Firefox.
I've been searching HN all this time with a custom Google search including "site:news.ycombinator.com". I'll have to see if this is better. Google offers results filtered by date, which is really nice, since technology moves so incredibly fast.
That will get redirected automatically and won't change with the webapp implementation.
You can filter by date/points with HNSearch.
PG: Can we please have it at the top, after submit?
And here's a sample search response from ThriftDB:
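(The sample itself didn't survive in this thread. A hypothetical response, with every value invented, might be shaped roughly like this, based on the fields mentioned elsewhere in the thread: `id`, `username`, `points`.)

```json
{
  "hits": 1,
  "results": [
    {
      "item": {
        "id": 123456,
        "type": "comment",
        "username": "pg",
        "points": 42,
        "text": "…"
      }
    }
  ]
}
```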
I would be interested to know if HN also receives less spam with the removal of SearchYC.
Does HN track stats on this?
Second thought: Hmm, oh well, I just use Google with site:news.ycombinator.com
EDIT: It annoys me when people hate on SQL and act like it's OMGSOSLOW. I would agree that fulltext search BY ITSELF isn't amazing, but you can use something like Sphinx to wrap it and support awesome things like Soundex / Metaphone fuzzy matching. It's also stupid easy to set up. The NoSQL movement is, IMO, largely misguided. If you need to shard data and scale ridiculously, something like Lucene can probably do it. If you need bigger than that, use memcached /redis and shard and map/reduce queries....that makes sense, but NoSQL doesn't seem like a great idea for PRIMARY data storage. It seems like you better have amazing backups :)
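For a sense of how cheap the fuzzy-matching side is, American Soundex (the simpler cousin of Metaphone) fits in a dozen lines of Python. This is a from-scratch sketch of the classic algorithm, not Sphinx's implementation:

```python
def soundex(name: str) -> str:
    """American Soundex: first letter + three digits (a sketch)."""
    groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
              "l": "4", "mn": "5", "r": "6"}
    codes = {ch: d for letters, d in groups.items() for ch in letters}
    name = name.lower()
    out, prev = [], codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":
            continue            # h/w don't break a run of identical codes
        code = codes.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code             # a vowel resets prev, so repeats survive
    return (name[0].upper() + "".join(out) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # → R163 R163
```

Spelling variants of the same surname collapse to one code, which is the whole trick behind this style of fuzzy matching.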
It would be a fun challenge to see who can come up with the best and most useful solution :)
For the next two weeks there's even an API contest going on! http://www.hnsearch.com/contest
Wake me up when there's an official status update.