Nevertheless, just by browsing the documentation, it seems they only support very basic SQL. For example, joins are not supported.
To me, "SQL support" basically entails taking advantage of the relational model. Here, "SQL support" means "you can use a SQL-like syntax to insert and fetch records from a table". Those are two very different things.
I do understand why they did that though. Automatic sharding of a single table in a cluster is a much simpler problem than auto-scaling cross-cluster joins.
I'll still download and take a look at the project since conceptually it is interesting.
1. The main reason I want SQL is for relational data. The lack of joins basically makes this a "NoSQL" database in every respect except that the query language resembles SQL. I'm fairly sure ANSI SQL requires support for joins.
2. It does auto-sharding, but I don't know how. The documentation doesn't specify how the data is sharded, and this is quite important. Range-based sharding, which MongoDB uses by default, is often not what users want (it depends on the use-case), so if that's what this is, we need to know. Whatever it is, there are trade-offs with the different approaches, and that's something users need to take into account.
3. You can't change the shard cluster size after initially sharding. I assume this is a planned feature, but until then, it's probably not ready for production use.
4. You can only shard on the primary key, if you have a primary key.
5. The configuration for the number of replicas is confusing, and appears to not be very configurable.
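Without server-side joins, the usual workaround is an application-level join: run two single-table queries and combine the results in the client. A minimal sketch in plain Python, where the in-memory row lists are hypothetical stand-ins for two query results:

```python
# Application-level hash join: fetch both result sets separately,
# then join them client-side on a shared key.

def hash_join(left_rows, right_rows, left_key, right_key):
    """Join two lists of dicts on the given key columns."""
    index = {}
    for row in right_rows:
        index.setdefault(row[right_key], []).append(row)
    joined = []
    for row in left_rows:
        for match in index.get(row[left_key], []):
            merged = dict(row)
            merged.update(match)
            joined.append(merged)
    return joined

# Hypothetical result sets from two single-table queries:
users = [{"user_id": 1, "name": "ada"}, {"user_id": 2, "name": "bob"}]
orders = [{"order_id": 10, "user_id": 1}, {"order_id": 11, "user_id": 1}]

result = hash_join(orders, users, "user_id", "user_id")
```

This works, but it pulls both result sets over the wire and gives up everything the database's query planner could have done, which is exactly why "no joins" is such a big caveat.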
From a brief read, this seems to be basically MongoDB from its early days, with many of the same disadvantages, but with some advantages, like custom analysers (which appear to replace Mongo's map-reduce), all queryable via an SQL-like syntax.
When I saw this, I thought it was going to be a relational database that did auto-sharding and replication. That would have been great. Unfortunately it's not. It might become that a few years down the line, but right now, I'm not inspired by it.
i promise, we're working hard and trying to get there faster than in a few years :)
Basic JOIN support would be good, but I think it is misleading to advertise Crate as having SQL support, because most users would assume that the SQL everyone knows, ANSI SQL, is supported, which it definitely isn't yet.
As I said, sharding is mentioned, but there are no details about how it is actually implemented. A potential user could check the code, but realistically this is unlikely, as many users have probably not given much thought to how routing to different shards is done. It's a really important issue: if you're doing range-based sharding, as MongoDB does by default, that changes the kind of key you should partition on.
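To illustrate why the routing scheme matters: with range-based sharding on a monotonically increasing key (an auto-increment id, a timestamp), every new write lands in the last range, creating a hot shard, while hash-based sharding spreads writes out. A toy sketch (the shard count, ranges, and keys are made up for illustration):

```python
import hashlib

NUM_SHARDS = 4
keys = list(range(1000, 1100))  # monotonically increasing keys, e.g. new ids

# Range-based: each shard owns a contiguous key range.
ranges = [(0, 500), (500, 1000), (1000, 1500), (1500, 2000)]

def range_shard(key, ranges):
    for i, (lo, hi) in enumerate(ranges):
        if lo <= key < hi:
            return i
    return len(ranges) - 1

# Hash-based: spread keys by a stable hash, modulo the shard count.
def hash_shard(key, num_shards=NUM_SHARDS):
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_shards

range_hits = {range_shard(k, ranges) for k in keys}
hash_hits = {hash_shard(k) for k in keys}
# All 100 increasing keys hit a single shard under range sharding,
# while hash sharding distributes them across shards.
```

The flip side, of course, is that hash sharding destroys key locality, so range scans become scatter-gather queries; neither scheme is "right" without knowing the workload.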
With regards to point 4, your documentation says "If a primary key constraint is defined, the routing column definition can be omitted or must match a primary key column." I read this as: if a primary key is defined, the routing column must either match it or be omitted entirely, which is to say you can only shard on the primary key. If this isn't the case, I think this needs re-wording.
In terms of replication, the explanation of replica ranges is confusingly worded, and I'm wondering what the use-case for it is. Surely the idea of replicas is to decide how many node failures you want to survive, and then set the count to the minimum number required to support that, so as not to waste resources. Also, if you set a range, how does Crate determine where in that range to set the number of replicas? Is it as many as possible?
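For what it's worth, Elasticsearch (which Crate builds on) interprets a replica range like "0-2" as "expand to as many replicas as the node count allows, capped at the upper bound". Assuming Crate inherits that behaviour (an assumption, not something the docs confirm), the effective count would be computed roughly like this:

```python
def effective_replicas(replica_range, num_nodes):
    """Hypothetical interpretation of a replica range such as "0-2":
    use as many replicas as possible within the range, limited by the
    fact that each replica must sit on a different node than its primary."""
    lo, hi = (int(x) for x in replica_range.split("-"))
    # At most num_nodes - 1 replicas can be placed on distinct nodes.
    return max(lo, min(hi, num_nodes - 1))

# e.g. "0-2" on a 2-node cluster -> 1 replica; on 4 nodes -> 2 replicas
```

If that is the semantics, the use-case would be letting replication grow automatically as nodes join, but the docs should say so explicitly.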
Yup. The older SQL-92 spec BNF definition for SELECT clauses:
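Paraphrasing the relevant productions from memory (check the actual SQL-92 text for the exact grammar), the point is that joined tables are part of the core query specification:

```
<query specification> ::=
    SELECT [ <set quantifier> ] <select list> <table expression>

<table expression> ::=
    <from clause> [ <where clause> ] [ <group by clause> ] [ <having clause> ]

<table reference> ::=
    <table name> [ <correlation specification> ]
  | <derived table> <correlation specification>
  | <joined table>
```

So a FROM clause that can't produce a `<joined table>` isn't SQL-92, even at the entry conformance level.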
I've done a lot of work with Elasticsearch and while it's a great search engine, it is NOT a primary source of truth or something you want to trust not to lose your data.
: Built a startup's product on top of it and have written an open source client library for it.
at least i remember they didn't recommend it as a database until the backup functionality was done
The data needs to go somewhere other than ES before you consider it "saved".
Edit: I'm not sure they are related
- we come from the service business and discovered that nearly every database design for applications which needed to scale somewhere reached a point where data needed to be de-normalized, because joins were simply too expensive in terms of cost and latency once the data no longer fits on a single affordable machine. therefore we do not have join support yet. however, we have already planned to allow joins in the future, which still makes sense for smaller datasets of course, but it is currently not a top priority, since many join use-cases can also be implemented using nested objects, which we support.
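The nested-object approach mentioned above amounts to denormalizing at write time: embed the related rows inside the parent record so reads need no join at all. A sketch of the idea in plain Python (the record shapes are invented for illustration):

```python
# Normalized: two "tables" that would need a join at read time.
user = {"user_id": 1, "name": "ada"}
orders = [
    {"order_id": 10, "user_id": 1, "total": 25.0},
    {"order_id": 11, "user_id": 1, "total": 40.0},
]

# Denormalized: nest the orders inside the user document at write time,
# so "user with their orders" becomes a single-record lookup.
user_doc = dict(user)
user_doc["orders"] = [
    {k: v for k, v in o.items() if k != "user_id"} for o in orders
]
```

The trade-off is the usual one: reads get cheap, but any update to shared data now has to touch every document that embeds a copy of it.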
- we have chosen SQL as a query language, since this allows us to re-use existing ORMs and tools. but most of all, SQL is still a great language for defining queries, so we thought "why re-invent the wheel and create yet another query syntax"
- regarding sharding: we use a hash/modulo based sharding mechanism - actually the same as elasticsearch, since we use elasticsearch under the hood for cluster state, sharding and replication. we have also added partitioned table support in our current development branch.
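Elasticsearch-style routing, as described above, boils down to `shard = hash(routing_key) % number_of_shards`. A minimal sketch, assuming a stable hash (Elasticsearch actually uses a murmur3 variant; md5 stands in here):

```python
import hashlib

def route(routing_key, number_of_shards):
    """Pick a shard by hashing the routing key, modulo the shard count.

    Note the consequence discussed elsewhere in the thread: because the
    modulus is baked into every document's placement, number_of_shards
    cannot be changed later without re-hashing (re-indexing) everything.
    """
    digest = hashlib.md5(str(routing_key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % number_of_shards
```

The determinism is the whole point: the same key always lands on the same shard, so lookups by routing key never need to fan out to the cluster.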
there are still a lot of features on our roadmap, and apparently also a lot of things we need to document and explain better. so if you are interested in our progress, you might keep an eye on our github project page https://github.com/crate/crate
As a quibble on the suggested use cases, this platform will not work for the Internet of Things generally for two reasons. First, it lacks support for the spatial data types, including polygons, and operations, including spatial joins, that are typical of those types of data models. Second, typical commercial IoT data sources are often on the order of 100TB-1PB per day, which implies continuous insert rates far beyond what this architecture can support and still have real-time queries. It is why Internet of Things is a completely different class of Big Data problem that does not fit on platforms like Hadoop or Spark that were not specifically designed for it.
You can design databases for Internet of Things workloads but this isn't an example of one. However, Crate should work great for more traditional Big Data workloads.
Interesting. Do you have a source on that?
People are just beginning to take advantage of these data sources but table stakes is being able to continuously ingest and index many millions of complex spatial relationships every second, which by itself is something no popular Big Data platform supports.
The (fun!) computer science challenge of it is that the systems have to be unbelievably scalable, but you can't use hash partitioning or range partitioning, and many things you were taught about database engine design turn out to be completely wrong in this context. It is an area ripe for innovation and growth.
I'm sure there are some IoT apps sending 1PB/day, but there are plenty that don't.
If you look at every company that is working in this space, one of the first things you will notice is that they all use custom storage engines that do a full operating system bypass i.e. they manage all the system resources in userspace. If you do not do this, you cannot reliably get the necessary throughput out of the system for IoT. As far as I know, no scalable storage+execution engine in open source is designed like this yet. It requires much more computer science sophistication and lines of code to implement compared to traditional storage engines, so not the kind of thing you hack together over the weekend.
Also, as far as throughput goes, Netflix is doing 1.5 trillion (yes, trillion) transactions per day in production on Cassandra.
I know of a production IoT system in the private sector that does 1.5 trillion (quasi-)transactions every 10 minutes, so almost three orders of magnitude higher throughput. Cassandra is an okay choice for storing IoT data but it isn't real-time in the sense that you can do immediate, fast queries about the relationships across those records as they are hitting the system.
That depends on how you are using Cassandra. Typically, you are expected to know your query patterns up front, and so you will lay your data out accordingly when ingesting. When done properly, this allows for ~1ms queries that return completely up-to-date results.
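Concretely, "laying the data out for the query" in Cassandra means one denormalized table per access pattern, partitioned so that each query is a single-partition read. A rough sketch of the idea in plain Python, with a dict standing in for partitions and an invented sensor-readings schema:

```python
from collections import defaultdict

# Write path: append each reading into the partition for the query we
# will ask later ("latest readings for sensor X"), newest first.
# Roughly equivalent to a table with PRIMARY KEY ((sensor_id), ts DESC):
# partition key = sensor_id, clustering column = ts, descending order.
readings_by_sensor = defaultdict(list)

def ingest(sensor_id, ts, value):
    readings_by_sensor[sensor_id].insert(0, (ts, value))

def latest(sensor_id, limit=3):
    # A single-partition read: no scatter-gather across nodes, which is
    # what makes the ~1ms, up-to-date queries possible.
    return readings_by_sensor[sensor_id][:limit]

ingest("s1", 1, 10.0)
ingest("s1", 2, 11.0)
ingest("s1", 3, 12.0)
ingest("s1", 4, 13.0)
```

The cost is rigidity: a query pattern you didn't model up front means a new table and a backfill, which is exactly the "know your queries in advance" constraint described above.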
Also, to nitpick, these are not transactions; these are "operations" or some other word that doesn't imply what the word "transactions" implies.
My view is that it's always a good idea to wait until a data management platform or db is widely adopted before using it for anything. For some reason I don't like the idea of using the word 'realtime' for what's really just very fast queries but that may be nitpicking. I think of real-time as something where you're given a real-time view of some data where you don't have to query for updates.
This makes no sense. Presto is already a SQL query engine, so what does it mean to "extend" it with a SQL interface? Furthermore, the crate.io SQL is even more limited than Presto SQL! Presto allows operations that are not in crate SQL, like joins and "create table as select".
Their demo only works against ~190,000 records. I don't know of any databases that won't perform the demonstrated operations quickly.
Not sure why RethinkDB would be moving away from SQL when everyone else (e.g. Cassandra, Hadoop [with Hive, Presto, etc.]) seems to be coming back to SQL(-like) syntax.
Hard to tell exactly what the sweet spot is on this...