Cloudant (YC S08) Releases In-Database, Distributed Search (cloudant.com)
53 points by _bkgg on July 14, 2011 | 19 comments



Nice! I really appreciate the tone of Tim's post: it honestly acknowledges both the strengths and weaknesses of NoSQL in general and of Cloudant Search in particular.


How does this compare to the various existing Lucene-based search options for CouchDB? The post says the new Cloudant Search "is a way that would not require you to set up a third-party, financially or operationally expensive solution." Adding basic Lucene searches to a CouchDB setup isn't all that hard. What about elasticsearch and Solr? Aside from the cost and hosting, are there other differentiators between Cloudant's Search and these third-party options?


I already responded to this partly in http://news.ycombinator.com/item?id=2764736. We think the integration into a single deployment is in itself a big gain. Maintaining several of those infrastructures can be very painful, especially as your clusters grow large. Also, as mentioned, we’ve already added several features on top of Lucene, and we’ll be adding more in the future.


Yes, thank you. Just after I posted my question here I realized that much of what I was wondering about was also answered in another post on the Cloudant blog: http://blog.cloudant.com/technical-look-at-cloudant-search/


I have a very noob question. For most databases, suffix search is always super slow. However, can't someone just build an index on the reversed string, then treat suffix search exactly the same way as prefix search? This doubles your index storage requirement, but index storage is generally not a problem. Finally, this could perhaps be extended to cover wildcard searches with a wildcard in the middle (hell*world).


The issue there is that you're still anchoring your index to one end of the string which means you're not solving the general problem, only a specific manifestation of it.

As a general example, given the string "foo bar baz", your solution could find "foo%" or "%baz" efficiently, but not "%bar%". It's not out of the question if what you really want is a suffix search, but the general problem of finding an internal substring is still less than optimal.

Edit: Formatting


Yep. That is exactly how you solve this problem. I've done this with Lucene and then searched either the regular field or the reversed field depending on the wildcard query.
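
A minimal sketch of the reversed-index idea discussed above, assuming a plain in-memory sorted list stands in for the index (this is not how Lucene or Cloudant Search implement it; the function names `build_indexes`, `prefix_search`, `suffix_search`, and `wildcard_search` are made up for illustration). A suffix query becomes a prefix query against the reversed strings, and a single-wildcard query like hell*world is answered by intersecting a prefix match with a suffix match. As noted in the reply above, none of this helps with an internal substring such as "%bar%".

```python
import bisect


def build_indexes(words):
    """Return two sorted lists: the words, and the words reversed.

    The reversed list is the extra index that doubles storage.
    """
    forward = sorted(words)
    backward = sorted(w[::-1] for w in words)
    return forward, backward


def prefix_search(sorted_keys, prefix):
    """Binary-search to the first key >= prefix, then scan the contiguous run."""
    i = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        matches.append(sorted_keys[i])
        i += 1
    return matches


def suffix_search(backward, suffix):
    """A suffix query is a prefix query on the reversed index."""
    hits = prefix_search(backward, suffix[::-1])
    return [h[::-1] for h in hits]  # un-reverse before returning


def wildcard_search(forward, backward, prefix, suffix):
    """Answer `prefix*suffix` by intersecting both lookups.

    The length check stops the prefix and suffix from overlapping
    (e.g. "ab*ba" should not match "aba").
    """
    ends_ok = set(suffix_search(backward, suffix))
    min_len = len(prefix) + len(suffix)
    return [w for w in prefix_search(forward, prefix)
            if w in ends_ok and len(w) >= min_len]


forward, backward = build_indexes(
    ["helloworld", "hello", "world", "hell world", "shell"])
print(suffix_search(backward, "world"))
print(wildcard_search(forward, backward, "hell", "world"))
```

Choosing between the regular and the reversed index per query, as described in the comment above, corresponds here to calling `prefix_search` or `suffix_search` depending on which end of the pattern the wildcard sits.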


Maybe slightly off-topic here, but are there plans to eventually merge BigCouch upstream into CouchDB core?


It's hard to say. There are quite a few ways in which this could play out.

Firstly, there are two important points to consider. Currently, BigCouch is more or less a superset of Apache CouchDB. The only patches we have to CouchDB sources can and should be back-ported, but that requires us to resolve a couple of potential bugs in non-clustered deployments. Secondly, Erlang is a language that allows for easy mixing and matching of code, so once we have back-ported these patches there's no real requirement for a merge at all.

There are also a few things that we're discussing in the CouchDB community that could very well contribute to not needing to merge the projects. Specifically, these include rearranging our source tree to be more prototypically Erlang, as well as tools like a couch-config script that could allow plugin-type extensions to CouchDB.

In the end, it's hard to tell how things will shape up. It could be a full-on back-port, or it could just be a general improvement to CouchDB's source tree and build system so that BigCouch is strictly "CouchDB + Other Erlang Apps", if that makes sense. And with my CouchDB committer hat on, it really depends on what the community wants. It's easy to fall into the trap of thinking "it's obvious", but we also have to consider that others are taking CouchDB and porting it to mobile phones. What we end up with in "core" CouchDB has to consider a lot of use cases.


Thanks for this response, and thanks for the great work on both Couch and BigCouch.


An "open" alternative could be the CouchDB integration that elasticsearch provides:

http://www.elasticsearch.org/guide/reference/river/couchdb.h...


Will this make its way upstream into the open-source BigCouch?


For now, this feature will remain part of the closed-source hosted and licensed products. But remember, you can always try it out for free with our Oxygen plan at cloudant.com.


While I share your sentiment that it would be nice to have this as part of BigCouch, that is probably asking a lot, since it is a good value-add for their (Cloudant's) service.

BTW, I have only set up BigCouch a couple of times to play with it but it is very impressive.


Wouldn't it be possible to do the same thing with Apache Lucene/Solr?


Not entirely. The big difference here is that you only have one, integrated architecture to maintain. So that means less operational complexity, smoother “scalability” of your infrastructure, tighter integration between DB & Search. Also, we added stuff like queries across several indices, typed queries, etc.


Hm, as far as I understand multiple-index (core) search has been implemented in Solr 1.3?


Sharded search is not new. Solr, elasticsearch, and Riak do it as well. The difference here is that we've built the Search on top of the BigCouch map-reduce view model. Views are calculated post-commit, so there are no data insertion locks. Multiple copies of each shard exist for fault tolerance. Also, multiple map-reduce analytics passes can be used as input to the search.


Come and kick the tires!



