How does this compare to the various existing Lucene-based search options for CouchDB? The post says the new Cloudant Search "is a way that would not require you to set up a third-party, financially or operationally expensive solution." Adding basic Lucene searches to a CouchDB setup isn't all that hard. What about Elasticsearch and Solr? Aside from the cost and hosting, are there other differentiators between Cloudant's Search and these third-party options?
I already partly responded to this in http://news.ycombinator.com/item?id=2764736. We think the integration into a single deployment is in itself a big gain. Maintaining several of those infrastructures can be very painful, especially as your clusters grow large. Also, as mentioned, we’ve already added several features on top of Lucene, and we’ll be adding more in the future.
I have a very noob question. For most databases, suffix search is always super slow. However, can't someone just build an index on the reversed string, then treat suffix search exactly the same way as prefix search? This doubles your index storage requirement, but index storage is generally not a problem. Finally, this could perhaps be extended to cover wildcard searches such as hell*world.
The issue there is that you're still anchoring your index to one end of the string, which means you're not solving the general problem, only a specific manifestation of it.
For example, given the string "foo bar baz", your solution could find "foo%" or "%baz" efficiently, but not "%bar%". It's not out of the question if what you really want is a suffix search, but the general problem of finding an internal substring is still less than optimal.
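To make the trade-off concrete, here is a rough Python sketch of the reversed-index idea. Everything in it (the `AffixIndex` class and its method names) is made up for illustration, and a sorted list with binary search stands in for a real database index. Prefix and suffix lookups both become a binary search plus a short scan, and a pattern like hell*world can be answered by intersecting the two, but an internal substring like "%bar%" still falls back to scanning every entry.

```python
import bisect


class AffixIndex:
    """Toy index (hypothetical): sorted strings plus sorted reversed strings."""

    def __init__(self, strings):
        self.forward = sorted(strings)
        self.backward = sorted(s[::-1] for s in strings)

    @staticmethod
    def _prefix_scan(sorted_keys, prefix):
        # Binary search to the first key >= prefix. In sorted order, all keys
        # sharing that prefix form one contiguous run starting there.
        i = bisect.bisect_left(sorted_keys, prefix)
        matches = []
        while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
            matches.append(sorted_keys[i])
            i += 1
        return matches

    def starts_with(self, prefix):
        return self._prefix_scan(self.forward, prefix)

    def ends_with(self, suffix):
        # A suffix of s is a prefix of reversed(s), so search the reversed list.
        hits = self._prefix_scan(self.backward, suffix[::-1])
        return sorted(h[::-1] for h in hits)

    def wildcard(self, prefix, suffix):
        # prefix*suffix: intersect both lookups; the length check rejects
        # strings where the prefix and suffix would have to overlap.
        min_len = len(prefix) + len(suffix)
        ends = set(self.ends_with(suffix))
        return [s for s in self.starts_with(prefix)
                if s in ends and len(s) >= min_len]

    def contains(self, needle):
        # Neither sorted list helps with an internal substring: full scan.
        return [s for s in self.forward if needle in s]


index = AffixIndex(["foo bar baz", "foo qux", "hello world", "hell of a world", "baz"])
print(index.starts_with("foo"))
print(index.ends_with("baz"))
print(index.wildcard("hell", "world"))
print(index.contains("bar"))
```

The `contains` method is the point of the reply above: anchoring at either end of the string does nothing for a match in the middle.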
It's hard to say. There are quite a few ways in which this could play out.
Firstly, there are two important points to consider. Currently, BigCouch is more or less a superset of Apache CouchDB. The only patches we have to the CouchDB sources can and should be back-ported, but that requires us to resolve a couple of potential bugs in non-clustered deployments. Secondly, Erlang is a language which allows for an easy mish-mashing of code, so once we have back-ported these patches there's no real requirement for a merge at all.
There are also a few things that we're discussing in the CouchDB community that could very well contribute to not needing to merge the projects. Specifically, we're looking at rearranging our source tree to be more prototypically Erlang, as well as at tools like a couch-config script that could allow plugin-type extensions to CouchDB.
In the end, it's hard to tell how things will shape up. It could be a full-on back-port, or it could just be a general improvement to CouchDB's source tree and build system so that BigCouch is strictly "CouchDB + Other Erlang Apps", if that makes sense. And with my CouchDB committer hat on, it really depends on what the community wants. It's easy to fall into the trap of thinking "it's obvious", but we also have to consider that others are taking CouchDB and porting it to mobile phones. What we end up with in "core" CouchDB has to consider a lot of use cases.
Not entirely. The big difference here is that you only have one integrated architecture to maintain. That means less operational complexity, smoother “scalability” of your infrastructure, and tighter integration between DB & Search. Also, we added stuff like queries across several indices, typed queries, etc.
Sharded search is not new. Solr, Elasticsearch, and Riak do it as well. The difference here is that we've built the Search on top of the BigCouch map-reduce view model. Views are calculated post-commit, so there are no data insertion locks. Multiple copies of each shard exist for fault tolerance. Also, multiple map-reduce analytics passes can be used as input to the search.
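For anyone unfamiliar with the general pattern, sharded search with replicated shards usually comes down to scatter-gather: ask one copy of every shard, then merge the ranked results. The Python sketch below shows only that generic idea and is not how Cloudant Search, BigCouch, Solr, Elasticsearch, or Riak actually implement it. The function names, the dict-based "replicas", and the term-count scoring are all assumptions made for illustration.

```python
import heapq
import itertools
import random


def search_shard(replica, term):
    """Return (score, doc_id) hits from one replica, best first.

    The score is just the number of occurrences of the term (toy ranking).
    """
    hits = [(text.count(term), doc_id)
            for doc_id, text in replica.items() if term in text]
    return sorted(hits, reverse=True)


def search_cluster(shards, term, limit=10):
    """Scatter the query to one replica per shard, then merge the results.

    `shards` is a list of shards; each shard is a list of replicas holding
    identical data, so any single replica can answer for its shard.
    """
    per_shard = []
    for replicas in shards:
        replica = random.choice(replicas)  # any surviving copy will do
        per_shard.append(search_shard(replica, term))
    # Each per-shard list is already sorted best first, so a k-way merge
    # yields the global ranking without re-sorting everything.
    merged = heapq.merge(*per_shard, reverse=True)
    return [doc_id for _, doc_id in itertools.islice(merged, limit)]


shard_one = {"a": "couch couch db", "b": "lucene index"}
shard_two = {"c": "couch search"}
cluster = [[shard_one, dict(shard_one)], [shard_two, dict(shard_two)]]
print(search_cluster(cluster, "couch"))
```

Because every replica of a shard holds the same documents, losing one copy only changes which replica gets picked, not the answer.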