
Crate.io – Big Data SQL in real time - adrianmsmith
https://crate.io
======
rburhum
Interesting.

Nevertheless, just by browsing the documentation, it seems they only support
very basic SQL. For example, joins are not supported.

To me, "SQL support" basically entails taking advantage of the relational
model. In here, "SQL Support" means "you can use something that looks like a
SQL-like syntax to insert and fetch records from a table". Those are two very
different things.

I do understand why they did that though. Automatic sharding of a single table
in a cluster is much simpler problem than auto-scaling cross cluster joins.

I'll still download and take a look at the project since conceptually it is
interesting.

~~~
xhrpost
Not looking to down the work here but it sounds like in using this then,
you're basically lowering DB admin complexity at the cost of increasing
programming complexity. Does that sound right? I'm kind of torn as to when I
would want to go that route. Perhaps on very simple data sets with limited
relationships.

------
danpalmer
Some initial thoughts from looking at the website and documentation...

1\. The main reason I want SQL is for relational data. The lack of joins
basically makes this a "NoSQL" database in every respect except the query
language being something resembling SQL. I'm fairly sure ANSI SQL requires
support for joins.

2\. It does auto-sharding, but I don't know how. The documentation doesn't
specify how the data is sharded, and this is quite important. Range-based
sharding, like MongoDB uses by default, is often not what users want (depends
on use-case), so if it's that, we need to know. Whatever it is, there are
trade-offs with different approaches and that's something users need to take
account of.

3\. You can't change the shard cluster size after initially sharding. I assume
this is a planned feature, but until then, it's probably not ready for
production use.

4\. You can only shard on the primary key, if you have a primary key.

5\. The configuration for the number of replicas is confusing, and appears to
not be very configurable.

A brief read makes this seem to be basically MongoDB from its early days, with
many of the disadvantages, but some advantages like custom analysers which
appear to replace Mongo's map reduce, that can be queried by an SQL-like
syntax.

When I saw this, I thought it was going to be a relational database that did
auto-sharding and replication. That would have been great. Unfortunately it's
not. It might become that a few years down the line, but right now, I'm not
inspired by it.

~~~
jodok
thanks for your summary and valuable feedback. a few notes from the crate
team: 1\. JOINs are on our roadmap. we might never reach full JOIN support
anytime soon (INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER and CROSS), but
simple use-cases (1:n), equal joins shouldn't be too hard. 2\. after re-
reading our documentation on sharding i need to admit that we need to improve
it. we mention it briefly here
[https://crate.io/docs/current/sql/ddl.html#sharding](https://crate.io/docs/current/sql/ddl.html#sharding),
more will be added asap 3\. if we have growing datasets we typically work with
parted tables. we just pushed that change just a few days ago
([https://github.com/crate/crate/blob/master/docs/sql/partitio...](https://github.com/crate/crate/blob/master/docs/sql/partitioned_tables.txt)),
not yet in the RPM build. 4\. no, you can specify routing
[https://crate.io/docs/current/sql/ddl.html#routing](https://crate.io/docs/current/sql/ddl.html#routing)
5\. agreed, that the doc can be improved. and we're at the very beginning
[https://crate.io/docs/current/sql/ddl.html#replication](https://crate.io/docs/current/sql/ddl.html#replication)
\- rack awareness,... there are quite some replica settings on our roadmap.
i'd be happy to hear your most important additions to replica policies.

i promise, we're working hard and try to be there faster than in a few years
:)

~~~
danpalmer
Thanks for the reply!

Basic JOIN support would be good, but I think it is misleading to advertise
Crate as having SQL support, because I think most users would assume that the
SQL everyone knows, ANSI SQL, would be supported, which it definitely isn't
yet.

As I said, sharding is mentioned, but no details about how this is actually
implemented. A potential user could check in the code, but realistically this
is unlikely as many users have probably not given a huge amount of thought to
how the routing is done to different shards. It's a really important issue,
because if you're doing range-based sharding like MongoDB does by default,
that changes the kind of key that it should be partitioned on.

With regards to point 4, your documentation says "If a primary key constraint
is defined, the routing column definition can be omitted or must match a
primary key column." I read this as if there is a primary key defined, the
routing column must match it, or can just not be specified, or more simply,
you can only shard on the primary key. If this isn't the case, I think this
needs re-wording.

In terms of replication, the explanation of the ranges of replicas is
confusingly worded. I was wondering what the use-case for this is? Surely the
idea of replicas is to determine how many node failures you want to support,
and then set the count at the minimum number required to support this so as
not to waste resources. Also, if you set a range, how does Crate determine
where in that range to set the number of replicas? Is it as many as possible?

------
huxley
If you were initially confused like I was, crate.io used to be the domain for
an alternative Python Package Index.

[http://www.reddit.com/r/Python/comments/1wcp93/what_happened...](http://www.reddit.com/r/Python/comments/1wcp93/what_happened_to_crateio/)

~~~
caioariede
Confusing, but it seems to be moved to
[https://warehouse.python.org/](https://warehouse.python.org/)

Edit: I'm not sure they are related

~~~
kingkilr
Warehouse is the re-reimplementation, also done by the same guy, which will
eventually become the new pypi.python.org

------
dobe
hi, this is bernd from crate ... first - we are overwhelmed here at crate that
we are mentioned at hacker news! by reading through the comments on this page,
i thought it would make sense to give you some background, which should at
least partly answer some questions mentioned here.

\- we come from the service business and discovered that nearly every database
design for applications which needed to scale somewhere reached a point where
data needed to be de-normalized because joins where simply too expensive in
terms of cost and latency when data does not fit on a single affordable
machine. therefore we do not have join support yet. however we already planned
to allow joins in the future which still makes sense for smaller datasets of
course, but it is currently not a top priority, since many join use-cases
could also be implemented by using nested objects which we support.

\- we have chosen SQL as a query language, since this allows us to re-use
existing ORMs and tools. but most of all SQL is still a great language to
define queries, so we thought "why re-invent the wheel and crate yet another
query syntax"

\- regarding sharding: we use a hash/modulo based sharding mechanisms -
actually the same as elasticsearch, since we use elasticsearch under the hood
for cluster state, sharding and replication. we also added partitioned table
support in our current development branch.

there are still a lot of features on our roadmap; and apparently also a lot of
things we need to document and explain in our documentation. so if you are
interested in our progress you might keep an eye on our github project page
[https://github.com/crate/crate](https://github.com/crate/crate)

thx, bernd

------
pdq
Any idea what app or library they used to make the retro green-screen terminal
video?

~~~
TruthSHIFT
Looks like Cathode:

[http://www.secretgeometry.com/apps/cathode/](http://www.secretgeometry.com/apps/cathode/)

~~~
pdq
Winner. Thank you.

I also found more links at: [http://www.jwz.org/blog/2011/01/cathode-vintage-
terminal-emu...](http://www.jwz.org/blog/2011/01/cathode-vintage-terminal-
emulator/)

------
jandrewrogers
Nicely done and thoughtfully executed.

As a quibble on the suggested use cases, this platform will not work for the
Internet of Things generally for two reasons. First, it lacks support for the
spatial data types, including polygons, and operations, including spatial
joins, that are typical of those types of data models. Second, typical
commercial IoT data sources are often on the order of 100TB-1PB _per day_ ,
which implies continuous insert rates far beyond what this architecture can
support and still have real-time queries. It is why Internet of Things is a
completely different class of Big Data problem that does not fit on platforms
like Hadoop or Spark that were not specifically designed for it.

You can design databases for Internet of Things workloads but this isn't an
example of one. However, Crate should work great for more traditional Big Data
workloads.

~~~
icambron
> Second, typical commercial IoT data sources are often on the order of
> 100TB-1PB per day

Interesting. Do you have a source on that?

~~~
jandrewrogers
These are the kinds of applications and workloads I have been working on for
almost a decade now, spanning many industries, so it is my area of domain
expertise. The above numbers are a pretty typical range for the myriad
companies and organizations I work with. A small system or pilot project might
be 10 TB per day. For some government systems, 1PB per day is a small
fraction.

People are just beginning to take advantage of these data sources but table
stakes is being able to continuously ingest and index many millions of complex
spatial relationships every second, which by itself is something no popular
Big Data platform supports.

The (fun!) computer science challenge of it is that the systems have to be
unbelievably scalable, but you can't use hash partitioning or range
partitioning, and many things you were taught about database engine design
turn out to be completely wrong in this context. It is an area ripe for
innovation and growth.

~~~
icambron
I was hoping you could give a source not because I was questioning the
veracity of your statement, but because I wanted to learn more about it,
specifically what kind of data that would be and what you do with it. I bet
there are a lot of interesting challenges in that space.

------
capkutay
This is a cool idea. Its good to see that SQL is a driver in new data
management technologies. It supports the idea that as applications change over
the years, the data model should try to remain the same.

My view is that it's always a good idea to wait until a data management
platform or db is widely adopted before using it for anything. For some reason
I don't like the idea of using the word 'realtime' for what's really just very
fast queries but that may be nitpicking. I think of real-time as something
where you're given a real-time view of some data where you don't have to query
for updates.

------
asdfologist
"It includes solid established open source components (Presto, Elasticsearch,
Lucene, Netty) and extends those with added core functionalities like
read/write support, a SQL interface, a dashboard and a query console."

This makes no sense. Presto is already a SQL query engine, so what does it
mean to "extend" it with a SQL interface? Furthermore, the crate.io SQL is
even more limited than Presto SQL! Presto allows operations that are not in
crate SQL, like joins and "create table as select".

~~~
ma2rten
From their webpage it seems that they just use presto's SQL parser.

------
crorella
Just a minor detail, the video show the load of the table with 190K rows, but
then the distinct shows 194K results.. that is weird, because the table was
empty before the load. :P

~~~
jodok
bummer! good catch. video was taken before this bugfix:
[https://github.com/crate/crate/blob/master/CHANGES.txt#L160](https://github.com/crate/crate/blob/master/CHANGES.txt#L160)
we'll re-captchure it:
[https://twitter.com/burndee/status/457256576345448449](https://twitter.com/burndee/status/457256576345448449)

------
tedchs
It would be helpful to know the largest data set this has been used with and
the performance of some example queries. Ideally > 1TB to qualify as "big
data".

~~~
jodok
we've worked with upt to 25bn records (small size) in crate, that was about
3TB (without replicas). if you have larger records or blobs, you're soon in
the 10th of TB.

------
monokrome
I don't understand how this can be called "Big Data SQL in real time" when
that is not even their selling point. Their selling point is the scalability.

Their demo is only working against ~190,000 records. I don't now any databases
that aren't going to perform the demonstrated operations quickly.

------
edbond22
190k records is Big Data? For me big data starts from 9M+ rows

~~~
dvirsky
9M rows? Luxury! Big data starts at 100M rows! /s

~~~
gd1
I just checked my database, 69235709557 rows. And I wouldn't even consider
that "big data".

~~~
ithkuil
I parsed the final "/s" as per second.

------
adenner
It looks like it is behind where Apache Phoenix
([https://phoenix.incubator.apache.org/](https://phoenix.incubator.apache.org/))
overtop of Hbase in a hadoop cluster is at. They at least have joins somewhat
working.

------
attozk
This seems very interesting. Like others have mentioned it lacks details on
sharding & clusterting. But it reminds me of RethinkDb
([http://www.rethinkdb.com](http://www.rethinkdb.com)) which does not support
SQL. Both of these have similar functionalities (admin GUI,
sharding/clustering etc.) but RethinkDb supports joins as well.

Not sure why would RethinkDb be moving away from SQL where every one (e.g.
cassandra, hadoop [with hive, prestodb etc.]) seems to be coming back to SQL
(like syntax).

------
Xorlev
"It requires zero administration." I am skeptical.

------
lerchmo
Does this compress data? is it a columnar store? Is this Dremel with
'realtime' inserts?

Hard to tell exactly what the sweet spot is on this...

------
billwilliams
Looks like it has a lot of potential for data scientists. Relatively simple
set up with little maintenance is key for data science, especially those with
less hardcore dev experience. I'd be interested to know if there is any
support for queries too large for Presto / Lucene to handle. Give me arbitrary
map-reduce and I'll be in love.

------
bra-ket
No Joins?

~~~
leccine
Right, what were they thinking? :)

------
mobiplayer
I'm just trying this out of curiosity and I haven't been able so far to find
how to restrict the access to the admin panel (not per IP but per user/pass or
event public key / certificate).

------
ddorian43
Looks like it's a sql-layer + blob layer on top of elasticsearch.

------
jetblackio
This looks really cool. I'm curious how this compares to something like
Couchbase, which also has horizontal scaling abilities. I guess I know what
I'm doing this weekend.

------
Goosey
Looks interesting, I like the query syntax for object columns. Will be waiting
for more information about the horizontal scaling.

------
arjn
Isn't this very similar to Apache Cassandra ? Perhaps with management-
automation included.

~~~
jodok
cassandra is a super choice for write oriented workloads with really high
performance. it's also super stable. we want to be in the spot where you also
need advanced realtime search, high numbers of concurrent read/writes. and
yes, we focus to make administration as easy as possible

------
Meai
io.crate:crateClient:0.1.0 doesnt seem to exist on maven as indicated on the
docs
[https://crate.io/docs/clients/java/java.html](https://crate.io/docs/clients/java/java.html)

------
dpcx
This looks interesting, but the Cathode-based video is extremely distracting.

------
michaelchum
It seems like replicating NoSQL with SQL?

------
nagarch
is drill same thing?

------
leccine
Anybody who comes up with a new "Big Data" solution has to realize that the
90% of the data users are vested deeply in Hadoop and the ecosystem around it.
Having tools like PrestoDB, Hive, Shark on the top of HDFS makes it really
hard to newcomers to convince companies to invest in something else. If data
was a greenfield territory these project would be more viable. Btw. on this
note, PrestoDB just getting the right features like join optimization, support
for nested structures, etc.

~~~
jodok
we here at crate were living in the hadoop ecosystem for some time. but when
we came across the beautiful architecture of netty.io (async, event-driven)
and the way elasticsearch orchestrated it - since that time we know there will
be room for newcomers :)

~~~
leccine
Well sorry my buzzword filter removed half of your sentence. :) I understand
that engineers can get excited about async and event driven but these mean
absolutely nothing to your endusers. I could implement your service with
blocking IO and non-event driven code and still get the same performance. It
simply does not matter that much. The key to Hadoop's success is scalability
and predictable performance. Don't get me wrong, I think Hadoop is one of the
worst ecosystems I have ever seen in my entire life, the code quality makes me
cry sometimes, but putting all these aside, projects like PrestoDB making
Hadoop viable and keeping it alive in the long run. Your project looks
interesting but without extensive performance testing and proving that your
TCO is lower than Hadoop's, and having the same features as Hive, PrestoDB
does will be really hard to break in to this market. Again, I would be the
happiest person to see something more sane than Hadoop on the market.

~~~
agonbina
Well I guess not everyone can hire extra engineers to run and keep Hadoop
happy so this is a viable solution for smaller teams as long as the
_performance is on par_. I am no big data guy nor have I worked with Petabytes
of data but as a front end focused developer running the tech part of my
startup, I'd rather deploy Crate and worry less about it than pay tons of
money for a DBaas solution.

------
notastartup
that green terminal looks insanely cool. I'd love if Putty.exe turned into
something like that.

------
robot
I thought that this page is a joke? is it real?

