Hacker News new | past | comments | ask | show | jobs | submit login
Crate: Distributed SQL Database for the Age of Docker (crate.io)
67 points by based2 on April 6, 2015 | hide | past | favorite | 44 comments

Anybody who is using Elastic Search as the basis for persistent data storage is going to come in for some scary surprises down the road in production. Without violating my NDA(s) I can't give significant details, but I've personally witnessed weird split-brain multi-master, dropped writes, etc. in response to network partitions and other common failure scenarios for a distributed data store. Elastic Search is a fantastic tool for high-speed full-text queries, but it is NOT and should never be used as a reliable persistent data store. Crate looks doomed to failure on this alone.

Without violating your confidentiality, can you share what version of ES you experienced these failures with? I have also experienced this first hand, but since 0.9x and 1.0x a TON of work has been done on both OOM (one of the leading causes of split brain scenarios) and split-brain from network partitioning. As I mentioned below, all of these issues are being addressed in an open and transparent way, and while there's still work to be done, a non-trivial amount of progress has already been made. I hope you can share at least some about your experiences.

I wish someone would revive the Zookeeper election and discovery backend. It's orders of magnitude better than the Zen ping system or anything like it (JGroups springs to mind). Unfortunately it's been left for dead for quite a while now. :(

When it comes to consensus real consensus servers are the only way and Zookeeper is the only production consensus server available outside of Google. (sorry etcd, you aren't quite there yet).

Speaking of, this fork has been posted on the README of the orignal Sonian repo: https://github.com/grmblfrz/elasticsearch-zookeeper

Seems to be compatible with ES 1.4 :D

full disclosure. i'm one of the co-founders of crate :)

In December 2010 we found about Elasticsearch and were truly amazed by it's simplicity, speed and elegance. We built our service and consultancy business around it.

In 2011 we've built some of the largest ES applications at that time (6TB, 120node cluster, http://2012.berlinbuzzwords.de/sessions/you-know-search-quer...) and started to develop a set of plugins, such as the in-out-plugin to allow distributed dump/restore.

With this background - and the mission to build a datastore that is as easy to use and administer as Elasticsearch we founded in 2013 Crate and raised some Seed money. Since that we're working hard to make this vision become true. We're often confronted with the results of the so-called Jepsen-Test (https://aphyr.com/posts/317-call-me-maybe-elasticsearch) that Aphyr published in 2014. Don't forget: Lucene, Netty, Elasticsearch, Crate - all are Open Source products (APL) and rely on all kinds of contributions - such as this analysis! No matter be it bad news or good news. We can only improve based on hard testing and feedback. However, this caused a lot of rumblings in the Elasticsearch ecosystem and the reaction of Elasticsearch was exemplary:

1) explain the reasoning and make an official statement: https://www.elastic.co/blog/resiliency-elasticsearch/

2) list all the issues and hunt them down. one by one: http://www.elastic.co/guide/en/elasticsearch/resiliency/curr... (and add new ones as they occur).

3) stay in contact with the community that reported the issues: https://twitter.com/aphyr/status/525712547911974913

All that being said. We see many people using Crate as primary store (and of course backing up their data) but we also see people that don't put that much trust in a younger database and keep all their primary data in another location and sync/index to Crate.

ALWAYS make backups (COPY TO / COPY FROM), make sure you have replicas, and most important configure minimum_master_nodes correctly to avoid split brain.

At Crate we stand on the shoulders of these great Open Source project, try to be as good citizens as possible and focus mainly on our Query engine (Analyzer, Planner, Execution Engine).

Elastic Search is known to drop your data in some occasions. I think it's fair to advise against using it as your source of truth (or anything built on top of it like Crate).

the elastic team is working hard in fixing the issues discovered by aphyr (jepsen test). you can find the current status here: http://www.elastic.co/guide/en/elasticsearch/resiliency/curr... - we're doing our best to support them. (btw. here's what aphyr thinks about that document: https://twitter.com/aphyr/status/525712547911974913)

Yes, and while there are still cases (see the link @jodok supplied below) a ton of work was done in ES v1.4.x (which Crate uses) and even more issues were closed in ES v1.5 which we're working on supporting.

With the amount of attention and support this has received from both Crate and ES, coupled with the amount of progress that's been made against these issues, I don't think it's a fair to advise against using either as a source of truth.

The Lucene team has been promising an end to corruption since the 2.9 days and before (~10 years). If you believe them that the problems are all fixed, and you need to be serious about it, I'd wait more than a month before declaring the problem solved.

That said, they are checksumming a hell of a lot more than before, so there's a chance.

The URL is annoyingly similar to https://crates.io/

That's what I was thinking... I know Rust isn't 1.0 yet, but it's hardly esoteric.

I spent a bunch of time reading the website and I can't tell what this product does: Is it a database engine or an abstraction on top of other products?

If it's a database engine, then here are my thoughts: How is this database engine built to replace SQL and No-SQL? If it doesn't support JOINS why would I replace my SQL with it? Are transactions ACID? Why would anyone use this if there are no built-in user / group security mechanisms to protect data?

Crate is a database engine that uses Lucene for storage and leverages portions of Elastic for cluster management. It is a NoSQL storage engine that gives you an SQL API (via REST). It does not support JOINs yet but this work is well underway and we expect a release with JOINs soon. We have several customers migrating from MySQL without JOINs (either handling joins in the application layer, using arrays in columns, or denormalizing data). Transactions are not ACID, we fall into the eventually consistent realm. And finally, not all use cases require user/group security, in fact much of the current SQL usage falls into the single user category.

I also agree that our site is not as clear as it needs to be and we're working on it already.

Seems like it's an layer over ElasticSearch using Presto SQL parser. SQL queries are translated into ES queries.

That's part of it, but not the whole picture. For example, we mostly bypass the ES query engine and go directly to Lucene. Queries are not simply translated to ES query syntax. Also, we've done a lot more work than simply pasting an SQL layer over the top. We've built streaming BLOB support, a distributed SQL layer with real-time MapReduce, and a distributed aggregation engine that gives accurate results for aggregations rather than HLL estimates.

If you'd like we're happy to answer any questions in IRC or our Google Group: https://groups.google.com/forum/#!forum/crateio IRC Freenode #crate: irc://irc.freenode.net/crate @mention anyone with Voice

do you have a field type that indexes in real time? or are you bound to the (default 1s) index delay from es?

this is one thing that bothers me with elasticsearch, that I can not define eg "type": "cart","index":"realtime", "not-analyzed" so if an item gets added to a cart, the subsequent count would directly return the correct number of items in the cart.

not yet. but we have some "tweaks" for exactly your use-case on our backlog. using the client libraries should make it mucn easier (e.g. https://crate.io/docs/projects/crate-python/stable/sqlalchem...). so right now you would need to do a refresh. on a side note: it's not an index delay. it's the readers that "sit" on the lucene index. they are being repurposed for performance reasons (and meanwhile other writes are appending). like the client libraries you can force reopening them (https://crate.io/docs/en/0.47.8/sql/reference/refresh.html) - of course at the cost of performance.

Why can't you aggregate on non-indexed fields? I know lucene doesn't allow that, but why? It seems to work on normal-rdbms ?

We run aggregations fully distributed and when iterating over the values we heavily rely on the field-caches. They hold the values of the latest used fields in memory and therefor allow in-memory performance on them. for example they don't grow linearly with the amount of rows stored, but depend on the cardinality of the fields. Running aggregations over non-indexed data is not supported.

arthur, that's how we started two years ago. meanwhile the queries are being handled by our own engine. https://github.com/crate/crate

Is just SQL query syntax supported?

The beauty of the Elasticsearch query syntax is that you can dynamically create complex JSON dsl objects as you drill down just using push and other methods.

With the crate SQL syntax it looks like it would be a messier dynamic query generation using string functions?

So my question is: Is it possible to query Crate with Elasticsearch syntax?

yes, you can enable the elasticsearch API ("es.api.enabled: true" in the crate.yml config file). however, it's not officially supported and we recommend to use it read-only as we store additional information about tables,...

Read-Only should be fine.

Is there specific "Crate" query syntax for selects that is not supported by the Elasticsearch DSL?

Crate-only queries - like exact counts, distinct counts, (soon JOINs) - are not exposed via this API. The Crate query engine is accessing Lucene directly and not running on top of the ES query engine.

Thought that name was familiar: https://github.com/crateio/crate.io

Hi, Spanky from Crate.io here. We still maintain the PyPI portion of the site at https://crate.io/packages

As for crates.io, I guess their entitled to choose their own name & domain, but I admit it is confusing.

Might end up being a source of confusion for you guys, as the package manager for a major language will have much more mindshare than a database.

we're good friends with the original owner "dstufft" and took over the domain after he took over an official pypi position (more than a year ago). read his own blog post: https://caremad.io/2014/03/crate-io-new-ownership/

I should have mentioned, the Crate.io (database) codebase is on GitHub: https://github.com/crate and this is what we're building: https://crate.io/overview/

looks like the domain was bought. https://github.com/crateio/crate.io/issues/18

I also confused the name with this: https://crates.io/

Lots of great questions here. Don't forget that you can also ask detailed questions in our Google Group: https://groups.google.com/forum/#!forum/crateio

So the website mentions that a Crate cluster is a share-nothing system; That each node can act independently. This would mean that the data is limited to one machines storage size? Doesn't really fit in with the claim of a distributed database unless I'm missing something.

By shared-nothing, we mean that no node is special, no node is indispensable. Unlike other popular NoSQL databases, we don't have different classes of nodes. The persistent data is distributed across the cluster, and nodes communicate amongst each other about which data are where.


Given that this is based on Elasticsearch, there will be a master node, which should be chosen explicitly in configuration to be away from the data nodes, because ES responds poorly to garbage collection on the master node. So you want to keep the load down on your master.

So you can run in a sort-of-shared-nothing configuration, but its not recommended.

True. There is an elected master node and you can configure a cluster to have master-only (non-data) nodes, data-only nodes (non-master eligible), or both. Even still, and even if you choose to set up a cluster this way, every node can perform the same functions, it's more a matter of what responsibility a node has. And even though there is a master node, that master can disappear without consequence, another will be elected in its stead. There is nothing really "special" about a master node.

No, it can hold as much data as the sum of all the nodes in the cluster have (also depending on the amount of replicas you configure). Some background: All data is being sharded by default. These shards are distributed evenly between the nodes. Nodes can hold multiple shards. Furthermore shards can have replicas, being hold by other nodes. Memory is being utilized to hold indizes in RAM and to cache values that are being needed frequently for e.g. aggregations ("field caches").

Where does it keep its data? Docker volumes?

Like Jodok says, we don't recommend this, but you could do it. What this would mean that if you destroy a container, you also destroy the data. This would cause your cluster to have to rebalance itself adding IOPs overhead.

When you map a volume into a container as suggested, the data can persist through a container restart/replacement. When the container is instantiated, the volume is read, the node checksums the shards it finds to make sure they're not stale. If so, they're brought up to date. By tuning the recovery settings you can avoid extraneous shard movement and therefore leverage containers as you would expect.

we recommend to expose a host directory to crate ('docker run -d -p 4200:4200 -p 4300:4300 -v <data-dir>:/data crate') and configure replicas. if one of the crate containers disappears, replicas will be promoted as primary shard and new replicas created on the fly. it's also possible to expose multiple directories (e.g. on multiple disks for more performance), you can configure crate to use them in parallel.

I guess this is using ES multiple datapath support?

Are you also planning to move to single shard per datapath like ES? If that is the case what is your thoughts on increasing the shard count post single shard per datapath?

I guess you are talking about the plan to have the data of one shard only on one disk (see https://github.com/elastic/elasticsearch/issues/9498)? This does not necessarily mean that you will end up having only one shard per datapath - only if you have just one shard per node. But you are right, the change might lead to unbalanced disk usage in some scenarios, where increasing the number of shards would solve the problem.

There are two options:

1. (Recommended for now) Export the table with COPY TO ( https://crate.io/docs/stable/sql/reference/copy_to.html). Drop the table and then import it again using COPY FROM (https://crate.io/docs/stable/sql/reference/copy_from.html ).

2. Use insert by query (see https://crate.io/docs/stable/sql/dml.html#inserting-data-by-... ) if it is ok for you to copy the whole data to another table (with more shards).

1) is recommended, since it allows for throttling on import time (see https://crate.io/docs/en/latest/best_practice/data_import.ht...) and also does not require a rename of a table, which is currently not implemented but is on our backlog. However i think once ES 2.0 is out we will have table renames and also throttling in insert by query, so option 2) will be recommended then.

Our genreal recommendation to the fixed number of shards limitation is to choose a higher number of shards upfront (number of expected cores matches the most use cases) or to use partitioned tables (https://crate.io/docs/en/latest/sql/partitioned_tables.html) where possible since those allow to change the number of shards for future partitions.

Look cool. What font are they using?

It's named Blender Pro and has been designed by our swiss friends from binnenland. We're super happy that they provided us with a generous license to use it within Crate. More about the font: http://www.binnenland.ch/notes/view/about-the-blender-typefa...

Use of Java is a deal breaker...

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact