Cloudant (YC S08) Releases In-Database, Distributed Search

turbodog · on July 14, 2011

Nice! I really appreciate the tone of Tim's post: it honestly acknowledges both the strengths and weaknesses of NoSQL in general and of Cloudant Search in particular.

dcaylor · on July 14, 2011

How does this compare to the various existing Lucene-based search options for CouchDB? The post says the new Cloudant Search "is a way that would not require you to set up a third-party, financially or operationally expensive solution." Adding basic Lucene searches to a CouchDB setup isn't all that hard. What about elasticsearch and Solr? Aside from the cost and hosting, are there other differentiators between Cloudant's Search and these third-party options?

_bkgg · on July 14, 2011

I already responded to this partly in http://news.ycombinator.com/item?id=2764736. We think the integration into a single deployment is in itself a big gain. Maintaining several of those infrastructures can be very painful, especially as your clusters grow large. Also, as mentioned, we’ve already added several features on top of Lucene, and we’ll be adding more in the future.

dcaylor · on July 14, 2011

Yes, thank you. Just after I posted my question here I realized that much of what I was wondering about was also answered in another post on the Cloudant blog: http://blog.cloudant.com/technical-look-at-cloudant-search/

gniquil · on July 14, 2011

I have a very noob question. For most databases, suffix search is always super slow. However, can't someone just build an index on the reversed string, then treat suffix search exactly the same way as prefix search? This doubles your index storage requirement, but index storage is generally not a problem. Finally, this could perhaps be extended to cover any wildcard search (hell*world).

davisp · on July 14, 2011

The issue there is that you're still anchoring your index to one end of the string, which means you're not solving the general problem, only a specific manifestation of it.

As a general example, given the string "foo bar baz", your solution could find "foo%" or "%baz" efficiently, but not "%bar%". It's not out of the question if what you really want is a suffix search, but the general problem of finding an internal substring is still handled less than optimally.

Edit: Formatting
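
To make the anchoring point concrete, here is a rough sketch in plain Python. It is not how Cloudant Search or Lucene is implemented; the sorted lists stand in for an ordered index, and the names (prefix_search, forward, backward) are made up for illustration. A forward index answers "foo%" with a binary search, a reversed index answers "%baz" the same way, but "%bar%" has nothing to anchor on and falls back to a full scan.

```python
from bisect import bisect_left

def prefix_search(sorted_terms, prefix):
    """Return all entries starting with `prefix` using a binary search.

    Entries sharing a prefix are contiguous in a sorted list, so we jump
    to the first candidate and walk forward until the prefix stops matching.
    """
    i = bisect_left(sorted_terms, prefix)
    matches = []
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches

# Hypothetical data; the sorted lists play the role of an ordered index.
docs = ["foo bar baz", "baz qux", "bar none", "foo fighters"]
forward = sorted(docs)                      # anchored at the start
backward = sorted(d[::-1] for d in docs)    # anchored at the end

# "foo%": efficient, uses the forward index.
starts_with_foo = prefix_search(forward, "foo")

# "%baz": efficient, a prefix search for "zab" on the reversed index.
ends_with_baz = [r[::-1] for r in prefix_search(backward, "baz"[::-1])]

# "%bar%": neither index helps, so every document has to be inspected.
contains_bar = [d for d in docs if "bar" in d]
```

The first two lookups touch only the matching slice of the index, while the last one is linear in the number of documents regardless of how many match.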

mronge · on July 14, 2011

Yep. That is exactly how you solve this problem. I've done this with Lucene and then searched either the regular field or the reversed field depending on the wildcard query.
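
A minimal sketch of that routing idea, in plain Python rather than Lucene: the class and method names (ReversedFieldIndex, search) are hypothetical, and the sorted lists only approximate what a real term index does. Patterns with a trailing wildcard go to the regular field, patterns with a leading wildcard go to the reversed field, and a single internal wildcard such as hell*world is handled by intersecting the two result sets.

```python
from bisect import bisect_left

class ReversedFieldIndex:
    """Toy term index with a regular and a reversed 'field' (illustrative only)."""

    def __init__(self, terms):
        unique = set(terms)
        self.forward = sorted(unique)
        self.reverse = sorted(t[::-1] for t in unique)

    @staticmethod
    def _prefix_scan(sorted_terms, prefix):
        # Binary search to the first candidate, then walk the contiguous run.
        i = bisect_left(sorted_terms, prefix)
        out = []
        while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
            out.append(sorted_terms[i])
            i += 1
        return out

    def search(self, pattern):
        """Resolve a pattern containing exactly one '*' wildcard."""
        if pattern.count("*") != 1:
            raise ValueError("this sketch supports exactly one '*'")
        head, tail = pattern.split("*")
        if not head and not tail:          # bare "*": everything matches
            return list(self.forward)
        if not tail:                       # "hell*": regular field
            return self._prefix_scan(self.forward, head)
        suffix_hits = {r[::-1] for r in self._prefix_scan(self.reverse, tail[::-1])}
        if not head:                       # "*world": reversed field
            return sorted(suffix_hits)
        # "hell*world": intersect both fields; the length check stops the
        # head and tail from overlapping inside a short term.
        prefix_hits = set(self._prefix_scan(self.forward, head))
        min_len = len(head) + len(tail)
        return sorted(t for t in prefix_hits & suffix_hits if len(t) >= min_len)

index = ReversedFieldIndex(
    ["hello", "helloworld", "hell world", "world", "underworld", "help"]
)
trailing = index.search("hell*")
leading = index.search("*world")
internal = index.search("hell*world")
```

This covers one wildcard anchored to at least one end of the term. A pattern like *bar* still has no anchor on either side, which is the limitation raised in the other reply.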

brendoncrawford · on July 14, 2011

Maybe slightly off-topic here, but are there plans to eventually merge BigCouch upstream into Couch core?

davisp · on July 14, 2011

It's hard to say. There are quite a few ways in which this could play out.

There are two important points to consider. First, BigCouch is currently more or less a superset of Apache CouchDB. The only patches we have to CouchDB sources can and should be back-ported, but that requires us to first resolve a couple of potential bugs in non-clustered deployments. Second, Erlang is a language that allows for easy mish-mashing of code, so once we have back-ported these patches there's no real requirement for a merge at all.

There are also a few things that we're discussing in the CouchDB community that could very well contribute to not needing to merge the projects. These include rearranging our source tree to be more prototypically Erlang, as well as tools like a couch-config script that could allow plugin-type extensions to CouchDB.

In the end, it's hard to tell how things will shape up. It could be a full-on back-port, or it could just be a general improvement to CouchDB's source tree and build system so that BigCouch is strictly "CouchDB + Other Erlang Apps", if that makes sense. And with my CouchDB committer hat on, it really depends on what the community wants. It's easy to fall into the trap of thinking "it's obvious", but we also have to consider that others are taking CouchDB and porting it to mobile phones. What we end up with in "core" CouchDB has to consider a lot of use cases.

brendoncrawford · on July 15, 2011

Thanks for this response, and thanks for the great work on both Couch and BigCouch.

rb2k_ · on July 15, 2011

An "open" alternative could be the CouchDB integration that elasticsearch provides:

http://www.elasticsearch.org/guide/reference/river/couchdb.h...

owenmarshall · on July 14, 2011

Will this make its way upstream into the open-source BigCouch?

ahoff · on July 14, 2011

For now, this feature will stay a part of the closed-source hosted and licensed products. But remember you can always try it out for free with our Oxygen plan at cloudant.com

mark_l_watson · on July 14, 2011

While I share your sentiment that it would be nice to have as part of BigCouch, that is probably asking a lot, since this is a good value-add for their (Cloudant's) service.

BTW, I have only set up BigCouch a couple of times to play with it but it is very impressive.

paulasmuth · on July 14, 2011

Wouldn't it be possible to do the same thing with Apache Lucene/Solr?

_bkgg · on July 14, 2011

Not entirely. The big difference here is that you only have one, integrated architecture to maintain. So that means less operational complexity, smoother “scalability” of your infrastructure, tighter integration between DB & Search. Also, we added stuff like queries across several indices, typed queries, etc.

paulasmuth · on July 14, 2011

Hm, as far as I understand, multiple-index (core) search has been implemented since Solr 1.3?

hardtke · on July 14, 2011

Sharded search is not new. Solr, elasticsearch, and Riak do it as well. The difference here is that we've built the Search on top of the BigCouch map-reduce view model. Views are calculated post-commit, so there are no data insertion locks. Multiple copies of each shard exist for fault tolerance. Also, multiple map-reduce analytics passes can be used as input to the search.

mlmilleratmit · on July 14, 2011

Come and kick the tires!