
Elassandra: Elasticsearch implemented on top of Cassandra - ddorian43
https://github.com/vroyer/elassandra
======
ceocoder
A while ago (4ish years) Jake Luciani implemented Lucandra (Cassandra as the
store for Lucene indices) and Solandra (extension of Lucandra - solr on top of
Cassandra). I used both of those at one point.

It was really interesting work.

OP - was Elassandra influenced by either of those in any way?

[1] [https://github.com/tjake/Lucandra](https://github.com/tjake/Lucandra) [2]
[https://github.com/tjake/Solandra](https://github.com/tjake/Solandra)

~~~
isoos
Also, Stratio's Lucene index:
[https://github.com/Stratio/cassandra-lucene-index](https://github.com/Stratio/cassandra-lucene-index)

------
grizzles
In solr vs elasticsearch, my understanding is that solr is more correct and
has faster & better algorithms for certain edge cases. Though I'd be
interested in hearing the details of any differences of opinion.

Therefore Stratio's Cassandra Lucene Index is worth an equal mention.
[https://github.com/Stratio/cassandra-lucene-index](https://github.com/Stratio/cassandra-lucene-index)

~~~
phamilton
Solr is operationally more complex to cluster (at least it was a few years
ago) and the API is less intuitive in my opinion.

Elasticsearch is terribly broken in a lot of ways, but it's awfully easy to
get up and running.

I'm not sure which I'm holding out for: that Solr will become easier to
manage, or that Elasticsearch will stop doing terrible things.

~~~
joslin01
Would be curious to hear what Elasticsearch is doing wrong if you don't mind.
I have no dog in the fight, I just always hear good things.

~~~
fusiongyro
I've used Solr in production for about five years. All of the pain with Solr
is up-front. Once you get it running, set up your schemas and you have your
indexes built, it just hums along, doing its job and being obscenely fast.

Trying out Elasticsearch, my experience was that it really wants to be run in
a cluster, but it also loses data pretty easily. I had more issues with it
crashing and it's generally a lot hungrier for memory.

Both have non-obvious shortcomings. Solr's schema will make you believe that
it likes deeply nested JSON documents. False! It actually wants pretty flat
"documents" without nesting (you /can/ nest, but it usually doesn't do what
you want without some extra legwork). ES will have you believe that it
supports lots of query types and they'll all perform great on semi-structured
data. My experience was that it was difficult to predict performance, but that
generally the fancier the query the worse it would be.

Solr's querying functionality is not extremely powerful (though they
"helpfully" made it offensively complex with different query parsers and
stuff) but performance has always been excellent for me.

IMO, if you don't need clustering, Solr is definitely better. A cranky but
robust piece of engineering from before scaling was everything. ES has better
documentation, a better "getting started" story, and is generally a lot more
user-friendly. Aphyr's posts about it have made me wary of using it without a
re-indexing story.

I haven't tried Solr's scaling stuff because I haven't needed it, but I would
expect it to be in pretty rough shape compared to ES because it's not a
primary use case for Solr and it is for ES.

~~~
rrampage
We use a SolrCloud cluster where I work. The initial setup is rather daunting
and involves reading up on Zookeeper, and Solr terminology on Collections,
Cores, Shards and Nodes. But once the reading is done, clustering is
effortless to execute. Solr 5 and 6 have a robust REST API for managing
Collections.

The new SQL / Parallel Streaming has also made querying multiple collections a
cinch.

~~~
johnbellone
What's daunting about running Zookeeper?

Most of the problems that I have had are with the ZK clients and not the
server. As long as you follow the operational documentation (there are a few
basic rules) it hums along nicely. We have a few clusters with Kafka and have
a decent process in Chef:

[https://github.com/bloomberg/zookeeper-cookbook](https://github.com/bloomberg/zookeeper-cookbook)

~~~
phamilton
Compared to ES documentation on clustering it's a lot of work. ES merely
requires a seed host to connect to and will gossip the rest of the cluster. No
external service needed.

On the other hand, it's had its share of improperly handled split brain
scenarios. I still think it has problems with partial split brain (where A and
B can't talk, but C can talk to both of them).
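
For what it's worth, the partial split described above can be modeled in a few lines. This is only a toy sketch, not Elasticsearch's election code: `quorum`, `visible`, and `can_claim_master` are hypothetical helpers, and the only rule modeled is "a node may claim master if it can directly see a majority of the cluster".

```python
def quorum(cluster_size):
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def visible(node, nodes, broken_links):
    """Nodes that `node` can reach directly, including itself."""
    return {
        other
        for other in nodes
        if other == node or frozenset((node, other)) not in broken_links
    }


def can_claim_master(node, nodes, broken_links):
    """Naive rule (an assumption of this sketch): seeing a majority is enough."""
    return len(visible(node, nodes, broken_links)) >= quorum(len(nodes))


nodes = ["A", "B", "C"]
# Partial partition: only the A <-> B link is down; C still talks to both.
broken_links = {frozenset(("A", "B"))}

claims = {n: can_claim_master(n, nodes, broken_links) for n in nodes}
print(claims)
```

Under this naive rule all three nodes pass the majority check, A and B included, even though they cannot see each other. That is why a bare majority count is not enough to rule out two masters in this kind of split.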

------
cipherzero
This looks very interesting, however I'm concerned about some of the
implementation details...

1\. It mentions using secondary indexes - it's my understanding that's a huge
no-no, as they have to hit the whole cluster

2\. Uses "lightweight" transactions - also another perf hit, as lightweight
transactions have (anecdotally) a 6x slowdown...

I like the idea but I'm curious if these are issues and whether these uses are
something the author is looking to replace...

Very interesting idea though!

(CoAuthor of cassieq here so these were things we had to learn about.)

~~~
ddorian43
Indexes are stored together with the data. The partition key becomes the
_routing key, so you can always search 1, x, or all nodes depending on your
_routing value.

Lightweight transactions are only used on schema changes (which are/should be
rare).
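
A rough sketch of the "1, x, or all nodes" idea, for illustration only. The hash-modulo placement here is a stand-in I made up for the example; Cassandra actually places partitions on a token ring, and `node_for` / `nodes_to_search` are hypothetical names, not Elassandra functions.

```python
import hashlib


def node_for(routing, nodes):
    """Map a routing value (the partition key) to one node.

    Stand-in placement: a stable hash modulo the node count. This is an
    assumption of the sketch, not how Cassandra's partitioner works.
    """
    digest = hashlib.md5(routing.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def nodes_to_search(nodes, routing_values=None):
    """No routing: fan out to every node. With routing: only the owners."""
    if not routing_values:
        return sorted(nodes)
    return sorted({node_for(r, nodes) for r in routing_values})


nodes = ["node1", "node2", "node3", "node4"]

print(nodes_to_search(nodes))                         # all nodes
print(nodes_to_search(nodes, ["user42"]))             # exactly one node
print(nodes_to_search(nodes, ["user42", "user99"]))   # one or two nodes
```

The point is only that a known partition key lets the search skip nodes that cannot hold matching rows.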

~~~
cipherzero
Awesome, I watched the demo video... I will be trying this. Thanks for the
info on lightweight transactions, sounds like the perfect use then!

As for the indexes - are they standard Cassandra secondary indexes? "Custom
secondary indexes" \- does that mean that it just looks like a secondary
index, but is actually backed by Elasticsearch?

~~~
ddorian43
Cassandra offers a way to create your own custom-secondary-index. In this
case, the secondary-index is backed by elasticsearch/lucene.

Though you can't query it from Cassandra yet. You have to use the
Elasticsearch REST API.
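
As a hedged example of what that looks like from a client, the sketch below only builds a search request and does not send it. The `_search` endpoint and its `routing` query parameter are part of the standard Elasticsearch REST API; the host, the `tweets` index, the `message` field, and the `build_search_request` helper are made up for illustration, and I'm assuming Elassandra accepts the same request shape.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(base_url, index, field, text, routing=None):
    """Build (but do not send) a match query against an index's _search endpoint."""
    url = f"{base_url}/{index}/_search"
    if routing is not None:
        # Restrict the search to the shard(s) owning this routing value.
        url += "?" + urlencode({"routing": routing})
    body = json.dumps({"query": {"match": {field: text}}}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, index, field, and routing value.
req = build_search_request(
    "http://localhost:9200", "tweets", "message", "cassandra", routing="user42"
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` would actually send it, assuming a node is listening on that address.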

~~~
cipherzero
That's ridiculously beautiful! Thank you!

------
vroyer
Here is a typical use case of Elassandra + Kibana with cross-datacenter
replication:
[https://github.com/vroyer/elassandra/blob/master/cross-datac...](https://github.com/vroyer/elassandra/blob/master/cross-datacenter-replication.md)

~~~
thulya
@vroyer, we are currently migrating the MSSQL + Elasticsearch backend of
[http://apply4u.co.uk](http://apply4u.co.uk) as well as
[http://thulya.com](http://thulya.com) to Elassandra. Our initial tests are
very promising. I'll be happy to share more details soon about these
production use cases.

------
cphoover
I don't understand... Elasticsearch is built on top of Lucene indexes. Data
indexed with a postings list is designed around the search use case. I don't
know too much about Cassandra, but it's not a search engine?

Would like to know more about how indexing is handled.

~~~
cnlwsu
Closest thing Cassandra has to behaving like a search engine is the new SASI
indexes. Good deep dive here:
[http://www.doanduyhai.com/blog/?p=2058](http://www.doanduyhai.com/blog/?p=2058)
which describes how it's different from Elasticsearch in the "SASI vs Search
Engines" section.

    
    
        - SASI requires 2 passes on disk to fetch data: 1 pass to read the index files and 1 pass for the normal
        Cassandra read path whereas search engines retrieves the result in a single pass (DSE Search has a singlePass option too).
        By laws of physics, SASI will always be slower, even if we improve the sequential read path in Cassandra
    
        - Although SASI allows full text search with tokenization and CONTAINS mode, there is no scoring applied
        to matched terms SASI returns result in token range order, which can be considered as random order from the
        user point of view. It is not possible to ask for total ordering of the result, even when LIMIT clause is used.
        Search engines don't have this limitation
    
        - last but not least, it is not possible to perform aggregation (or faceting) with SASI.
        The GROUP BY clause may be introduced into CQL in a near future but it is done on Cassandra side,
        there is no pre-aggregation possible on SASI terms that can help speeding up aggregation queries
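
To make the second point concrete, here is a toy comparison of token-order results versus scored results under a LIMIT. Everything in it is an assumption for illustration: the four documents, the MD5-based `token` stand-in (Cassandra's real partitioner differs), and the term-count score (real search engines use more elaborate scoring).

```python
import hashlib

# Hypothetical rows: partition key -> text column.
docs = {
    "k1": "cassandra cassandra cassandra search",
    "k2": "search with cassandra",
    "k3": "lucene only",
    "k4": "cassandra cassandra",
}


def token(key):
    """Stand-in for a partitioner token: a stable hash of the partition key."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


def token_order_hits(term, limit):
    """Matches in token order, with no relevance involved (SASI-like)."""
    matches = [k for k, text in docs.items() if term in text.split()]
    return sorted(matches, key=token)[:limit]


def scored_hits(term, limit):
    """Matches ranked by a toy score: how often the term occurs."""
    matches = [
        (text.split().count(term), k)
        for k, text in docs.items()
        if term in text.split()
    ]
    ranked = sorted(matches, key=lambda pair: (-pair[0], pair[1]))
    return [k for _, k in ranked[:limit]]


print(token_order_hits("cassandra", 2))  # order depends only on key hashes
print(scored_hits("cassandra", 2))       # best-matching rows first
```

With a LIMIT, the token-order version returns whichever matches happen to come first on the ring, so the most relevant row is not guaranteed to be in the result.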

------
jason_heo
Good naming ;)

------
bryanrasmussen
Seems interesting, but we currently use Elasticsearch as a third-party
service, so we would need a third-party Cassandra service that also supports
Elassandra on top, which seems unlikely. Does anyone know of any I can look
at?

~~~
ddorian43
I don't know if there's any. One of the good things is that you don't have to
overprovision shards, since each node has 1 shard for each index.

------
tjake
This is close to my heart, I love these integrations!

