Hacker News new | past | comments | ask | show | jobs | submit login
Crate.io – Big Data SQL in real time (crate.io)
161 points by adrianmsmith on April 18, 2014 | hide | past | favorite | 77 comments


Nevertheless, just by browsing the documentation, it seems they only support very basic SQL. For example, joins are not supported.

To me, "SQL support" basically entails taking advantage of the relational model. In here, "SQL Support" means "you can use something that looks like a SQL-like syntax to insert and fetch records from a table". Those are two very different things.

I do understand why they did that though. Automatic sharding of a single table in a cluster is much simpler problem than auto-scaling cross cluster joins.

I'll still download and take a look at the project since conceptually it is interesting.

Not looking to down the work here but it sounds like in using this then, you're basically lowering DB admin complexity at the cost of increasing programming complexity. Does that sound right? I'm kind of torn as to when I would want to go that route. Perhaps on very simple data sets with limited relationships.

Some initial thoughts from looking at the website and documentation...

1. The main reason I want SQL is for relational data. The lack of joins basically makes this a "NoSQL" database in every respect except the query language being something resembling SQL. I'm fairly sure ANSI SQL requires support for joins.

2. It does auto-sharding, but I don't know how. The documentation doesn't specify how the data is sharded, and this is quite important. Range-based sharding, like MongoDB uses by default, is often not what users want (depends on use-case), so if it's that, we need to know. Whatever it is, there are trade-offs with different approaches and that's something users need to take account of.

3. You can't change the shard cluster size after initially sharding. I assume this is a planned feature, but until then, it's probably not ready for production use.

4. You can only shard on the primary key, if you have a primary key.

5. The configuration for the number of replicas is confusing, and appears to not be very configurable.

A brief read makes this seem to be basically MongoDB from its early days, with many of the disadvantages, but some advantages like custom analysers which appear to replace Mongo's map reduce, that can be queried by an SQL-like syntax.

When I saw this, I thought it was going to be a relational database that did auto-sharding and replication. That would have been great. Unfortunately it's not. It might become that a few years down the line, but right now, I'm not inspired by it.

thanks for your summary and valuable feedback. a few notes from the crate team: 1. JOINs are on our roadmap. we might never reach full JOIN support anytime soon (INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER and CROSS), but simple use-cases (1:n), equal joins shouldn't be too hard. 2. after re-reading our documentation on sharding i need to admit that we need to improve it. we mention it briefly here https://crate.io/docs/current/sql/ddl.html#sharding, more will be added asap 3. if we have growing datasets we typically work with parted tables. we just pushed that change just a few days ago (https://github.com/crate/crate/blob/master/docs/sql/partitio...), not yet in the RPM build. 4. no, you can specify routing https://crate.io/docs/current/sql/ddl.html#routing 5. agreed, that the doc can be improved. and we're at the very beginning https://crate.io/docs/current/sql/ddl.html#replication - rack awareness,... there are quite some replica settings on our roadmap. i'd be happy to hear your most important additions to replica policies.

i promise, we're working hard and try to be there faster than in a few years :)

Thanks for the reply!

Basic JOIN support would be good, but I think it is misleading to advertise Crate as having SQL support, because I think most users would assume that the SQL everyone knows, ANSI SQL, would be supported, which it definitely isn't yet.

As I said, sharding is mentioned, but no details about how this is actually implemented. A potential user could check in the code, but realistically this is unlikely as many users have probably not given a huge amount of thought to how the routing is done to different shards. It's a really important issue, because if you're doing range-based sharding like MongoDB does by default, that changes the kind of key that it should be partitioned on.

With regards to point 4, your documentation says "If a primary key constraint is defined, the routing column definition can be omitted or must match a primary key column." I read this as if there is a primary key defined, the routing column must match it, or can just not be specified, or more simply, you can only shard on the primary key. If this isn't the case, I think this needs re-wording.

In terms of replication, the explanation of the ranges of replicas is confusingly worded. I was wondering what the use-case for this is? Surely the idea of replicas is to determine how many node failures you want to support, and then set the count at the minimum number required to support this so as not to waste resources. Also, if you set a range, how does Crate determine where in that range to set the number of replicas? Is it as many as possible?

Even if you can get basic JOINs going, then I can see your tool becoming super useful in a lot of my use cases :)

> I'm fairly sure ANSI SQL requires support for joins

Yup. The older SQL-92 spec BNF definition for SELECT clauses:


its actually sql on top of elasticsearch

That is bad.

I've done a lot of work with Elasticsearch[1] and while it's a great search engine, it is NOT a primary source of truth or something you want to trust not to lose your data.

[1]: Built a startup's product on top of it and have written an open source client library for it.

why ?

at least i remember they didn't recommend it for database until the backup funcionality was done

Most people aren't good enough at backups for me to consider that enough to mitigate the risks.

The data needs to go somewhere other than ES before you considered it "saved".

If you were initially confused like I was, crate.io used to be the domain for an alternative Python Package Index.


It still is: https://simple.crate.io/ :-)

we took crate.io over in december. some people still haven't updated their links. that's why we keep some subdomains and redirects up and running (https://crate.io/thanks-dstufft)

I visited crate.io last month and remembered this. It was more than a little confusing at first.

Confusing, but it seems to be moved to https://warehouse.python.org/

Edit: I'm not sure they are related

Warehouse is the re-reimplementation, also done by the same guy, which will eventually become the new pypi.python.org

we're unrelated. but we keep some redirects and subdomains from the "old" crate.io. see https://crate.io/thanks-dstufft.

hi, this is bernd from crate ... first - we are overwhelmed here at crate that we are mentioned at hacker news! by reading through the comments on this page, i thought it would make sense to give you some background, which should at least partly answer some questions mentioned here.

- we come from the service business and discovered that nearly every database design for applications which needed to scale somewhere reached a point where data needed to be de-normalized because joins where simply too expensive in terms of cost and latency when data does not fit on a single affordable machine. therefore we do not have join support yet. however we already planned to allow joins in the future which still makes sense for smaller datasets of course, but it is currently not a top priority, since many join use-cases could also be implemented by using nested objects which we support.

- we have chosen SQL as a query language, since this allows us to re-use existing ORMs and tools. but most of all SQL is still a great language to define queries, so we thought "why re-invent the wheel and crate yet another query syntax"

- regarding sharding: we use a hash/modulo based sharding mechanisms - actually the same as elasticsearch, since we use elasticsearch under the hood for cluster state, sharding and replication. we also added partitioned table support in our current development branch.

there are still a lot of features on our roadmap; and apparently also a lot of things we need to document and explain in our documentation. so if you are interested in our progress you might keep an eye on our github project page https://github.com/crate/crate

thx, bernd

Any idea what app or library they used to make the retro green-screen terminal video?

Winner. Thank you.

I also found more links at: http://www.jwz.org/blog/2011/01/cathode-vintage-terminal-emu...

As a bonus, I noticed they just released the iOS version in February. Pretty cool.

Nicely done and thoughtfully executed.

As a quibble on the suggested use cases, this platform will not work for the Internet of Things generally for two reasons. First, it lacks support for the spatial data types, including polygons, and operations, including spatial joins, that are typical of those types of data models. Second, typical commercial IoT data sources are often on the order of 100TB-1PB per day, which implies continuous insert rates far beyond what this architecture can support and still have real-time queries. It is why Internet of Things is a completely different class of Big Data problem that does not fit on platforms like Hadoop or Spark that were not specifically designed for it.

You can design databases for Internet of Things workloads but this isn't an example of one. However, Crate should work great for more traditional Big Data workloads.

> Second, typical commercial IoT data sources are often on the order of 100TB-1PB per day

Interesting. Do you have a source on that?

These are the kinds of applications and workloads I have been working on for almost a decade now, spanning many industries, so it is my area of domain expertise. The above numbers are a pretty typical range for the myriad companies and organizations I work with. A small system or pilot project might be 10 TB per day. For some government systems, 1PB per day is a small fraction.

People are just beginning to take advantage of these data sources but table stakes is being able to continuously ingest and index many millions of complex spatial relationships every second, which by itself is something no popular Big Data platform supports.

The (fun!) computer science challenge of it is that the systems have to be unbelievably scalable, but you can't use hash partitioning or range partitioning, and many things you were taught about database engine design turn out to be completely wrong in this context. It is an area ripe for innovation and growth.

I was hoping you could give a source not because I was questioning the veracity of your statement, but because I wanted to learn more about it, specifically what kind of data that would be and what you do with it. I bet there are a lot of interesting challenges in that space.

Assume Nest has 10 million users, which is ballpark-ish. 100TB is 10megs per user per day. That sounds high unless my Nest is sending audio clips to Google.

I'm sure there are some IoT apps sending 1PB/day, but there are plenty that don't.

so what kind of tool would you recommend instead to deal with 100TB-1PB per day ? I'm genuinely interested.

There are commercial database solutions focused on IoT that have no problems with these workloads (e.g. my company, SpaceCurve, or Pixia) but nothing open source. A single rack of servers arranged as a parallel system with a 10GbE switch fabric can support it if you design the software correctly.

If you look at every company that is working in this space, one of the first things you will notice is that they all use custom storage engines that do a full operating system bypass i.e. they manage all the system resources in userspace. If you do not do this, you cannot reliably get the necessary throughput out of the system for IoT. As far as I know, no scalable storage+execution engine in open source is designed like this yet. It requires much more computer science sophistication and lines of code to implement compared to traditional storage engines, so not the kind of thing you hack together over the weekend.

Cassandra is open source and used in a large number of IoT applications. Most famous one being Nest.

Also as far as throughput Netflix is doing 1.5 trillion (yes trillion) transactions per day in production on Cassandra.

This is a great example of not understanding the scaling problem of IoT. Not only is the above (quasi-)transaction rate modest by IoT system standards today, but Cassandra is not doing real-time analysis or ad hoc querying of complex relationships across those transactions at the same time, which is usually a requirement.

I know of a production IoT system in the private sector that does 1.5 trillion (quasi-)transactions every 10 minutes, so almost three orders of magnitude higher throughput. Cassandra is an okay choice for storing IoT data but it isn't real-time in the sense that you can do immediate, fast queries about the relationships across those records as they are hitting the system.

"Cassandra [...] isn't real-time in the sense that you can do immediate, fast queries about the relationships across those records as they are hitting the system."

That depends on how you are using Cassandra. Typically, you are expected to know your query patterns up front, and so you will lay your data out accordingly when ingesting. When done properly, this allows for ~1ms queries that return completely up-to-date results.

Do you have a source on the Netflix number? Not doubting, just curious...

Also, to nitpick, these are not transactions; these are "operations" or some other word that doesn't imply what the word "transactions" implies.

spatial data types are high on our backlog. you'll be able to store geo points, query them by location, bounding box, polygon, sort them by geo distance, and also aggregate them based on geo-distance (e.g. 0-10km, 10-50km,..). we haven't thought about additional topics on top of lucene - but that's our list for a start. feedback, github issues welcome!

This is a cool idea. Its good to see that SQL is a driver in new data management technologies. It supports the idea that as applications change over the years, the data model should try to remain the same.

My view is that it's always a good idea to wait until a data management platform or db is widely adopted before using it for anything. For some reason I don't like the idea of using the word 'realtime' for what's really just very fast queries but that may be nitpicking. I think of real-time as something where you're given a real-time view of some data where you don't have to query for updates.

"It includes solid established open source components (Presto, Elasticsearch, Lucene, Netty) and extends those with added core functionalities like read/write support, a SQL interface, a dashboard and a query console."

This makes no sense. Presto is already a SQL query engine, so what does it mean to "extend" it with a SQL interface? Furthermore, the crate.io SQL is even more limited than Presto SQL! Presto allows operations that are not in crate SQL, like joins and "create table as select".

From their webpage it seems that they just use presto's SQL parser.

hi, i'm jodok from crate. yes, right now we only use the sql parser (and extended it)

Just a minor detail, the video show the load of the table with 190K rows, but then the distinct shows 194K results.. that is weird, because the table was empty before the load. :P

bummer! good catch. video was taken before this bugfix: https://github.com/crate/crate/blob/master/CHANGES.txt#L160 we'll re-captchure it: https://twitter.com/burndee/status/457256576345448449

It would be helpful to know the largest data set this has been used with and the performance of some example queries. Ideally > 1TB to qualify as "big data".

we've worked with upt to 25bn records (small size) in crate, that was about 3TB (without replicas). if you have larger records or blobs, you're soon in the 10th of TB.

I don't understand how this can be called "Big Data SQL in real time" when that is not even their selling point. Their selling point is the scalability.

Their demo is only working against ~190,000 records. I don't now any databases that aren't going to perform the demonstrated operations quickly.

190k records is Big Data? For me big data starts from 9M+ rows

9m rows, you might as well use Excel...

Big data starts at N+1 rows, where N is the number of rows the person your responding to manages.

9M rows? Luxury! Big data starts at 100M rows! /s

I just checked my database, 69235709557 rows. And I wouldn't even consider that "big data".

I parsed the final "/s" as per second.

It looks like it is behind where Apache Phoenix (https://phoenix.incubator.apache.org/) overtop of Hbase in a hadoop cluster is at. They at least have joins somewhat working.

This seems very interesting. Like others have mentioned it lacks details on sharding & clusterting. But it reminds me of RethinkDb (http://www.rethinkdb.com) which does not support SQL. Both of these have similar functionalities (admin GUI, sharding/clustering etc.) but RethinkDb supports joins as well.

Not sure why would RethinkDb be moving away from SQL where every one (e.g. cassandra, hadoop [with hive, prestodb etc.]) seems to be coming back to SQL (like syntax).

"It requires zero administration." I am skeptical.

Does this compress data? is it a columnar store? Is this Dremel with 'realtime' inserts?

Hard to tell exactly what the sweet spot is on this...

Looks like it has a lot of potential for data scientists. Relatively simple set up with little maintenance is key for data science, especially those with less hardcore dev experience. I'd be interested to know if there is any support for queries too large for Presto / Lucene to handle. Give me arbitrary map-reduce and I'll be in love.

No Joins?

Right, what were they thinking? :)

I'm just trying this out of curiosity and I haven't been able so far to find how to restrict the access to the admin panel (not per IP but per user/pass or event public key / certificate).

Looks like it's a sql-layer + blob layer on top of elasticsearch.

This looks really cool. I'm curious how this compares to something like Couchbase, which also has horizontal scaling abilities. I guess I know what I'm doing this weekend.

Looks interesting, I like the query syntax for object columns. Will be waiting for more information about the horizontal scaling.

Isn't this very similar to Apache Cassandra ? Perhaps with management-automation included.

cassandra is a super choice for write oriented workloads with really high performance. it's also super stable. we want to be in the spot where you also need advanced realtime search, high numbers of concurrent read/writes. and yes, we focus to make administration as easy as possible

io.crate:crateClient:0.1.0 doesnt seem to exist on maven as indicated on the docs https://crate.io/docs/clients/java/java.html

This looks interesting, but the Cathode-based video is extremely distracting.

It seems like replicating NoSQL with SQL?

is drill same thing?

Anybody who comes up with a new "Big Data" solution has to realize that the 90% of the data users are vested deeply in Hadoop and the ecosystem around it. Having tools like PrestoDB, Hive, Shark on the top of HDFS makes it really hard to newcomers to convince companies to invest in something else. If data was a greenfield territory these project would be more viable. Btw. on this note, PrestoDB just getting the right features like join optimization, support for nested structures, etc.

Maybe in your case they have Hadoop, but not everyone is satisfied with Hadoop or wants to create new systems with it. Choice is good, as long as we can differentiate the choices by the values they provide.

we here at crate were living in the hadoop ecosystem for some time. but when we came across the beautiful architecture of netty.io (async, event-driven) and the way elasticsearch orchestrated it - since that time we know there will be room for newcomers :)

Well sorry my buzzword filter removed half of your sentence. :) I understand that engineers can get excited about async and event driven but these mean absolutely nothing to your endusers. I could implement your service with blocking IO and non-event driven code and still get the same performance. It simply does not matter that much. The key to Hadoop's success is scalability and predictable performance. Don't get me wrong, I think Hadoop is one of the worst ecosystems I have ever seen in my entire life, the code quality makes me cry sometimes, but putting all these aside, projects like PrestoDB making Hadoop viable and keeping it alive in the long run. Your project looks interesting but without extensive performance testing and proving that your TCO is lower than Hadoop's, and having the same features as Hive, PrestoDB does will be really hard to break in to this market. Again, I would be the happiest person to see something more sane than Hadoop on the market.

Well I guess not everyone can hire extra engineers to run and keep Hadoop happy so this is a viable solution for smaller teams as long as the performance is on par. I am no big data guy nor have I worked with Petabytes of data but as a front end focused developer running the tech part of my startup, I'd rather deploy Crate and worry less about it than pay tons of money for a DBaas solution.

I like Hadoop tonnes so I would love to get an injection of reality and perspective: what are the insanities of Hadoop?

sorry for my ignorance, but Crate and "SQL on top of Hadoop" fulfil quite different needs. How would you handle tens of thousands concurrent queries on a Hadoop platform? Read and write at the same time? On the other hand side - crate will never be able to batch-oriented, complex map/reduce workload.

totally; the main reason for me is the idea of substantial long term investment by large support communities (including me!)

that green terminal looks insanely cool. I'd love if Putty.exe turned into something like that.

I thought that this page is a joke? is it real?

Applications are open for YC Summer 2021

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact