

Citus Data (YC S11) Wants To Make Scalable Data Analytics Accessible To Anyone - umur
http://techcrunch.com/2012/06/27/citus-data-launches-new-scalable-analytics-database/

======
smilliken
We've been using Citus at MixRank for storing our timeseries data, and it's
worked out magnificently well for our use-case.

A few points:

(i) We can do ad-hoc realtime analytics on hundreds of millions of data
points.

(ii) We can also do realtime analytics on billions of datapoints as long as we
pre-compute along one dimension.

(iii) We could do a lot better at (i) and (ii) if we invested more heavily in
hardware (and Citus would make this pretty painless, actually).

(iv) I'd normally not consider a closed-source solution personally, but since
Citus is based so heavily on PostgreSQL (protocol-level compatibility,
configuration, codebase), this has been a non-issue for us. We can still lean
on the amazing PostgreSQL community, documentation, and for the parts we don't
have the source code to, the Citus team has been very helpful in explaining
how things work.

(v) Fault tolerance is immaculate. At the node level, PostgreSQL is
notoriously one of the most reliable and robust databases available. At the
cluster level, Citus will magically fall back to a replica mid-query when a
server dies.

(vi) Although realtime inserts are not supported out of the box, the system is
flexible enough that we were able to get this working on our own without help
from Citus.

(vii) Schema migrations are also not supported out of the box, but we built a
schema migration framework that takes care of this for us.

(viii) We're not worried about vendor lock-in, since the data is just stored
on our servers, in the PostgreSQL serialization format. If we wanted to, we
could just give up the features that Citus gives us and build our own data-
access layer on top of our cluster.

Anyway, it won't be everything to everyone, but it works very well for our
OLAP use-case of timeseries ad impression data. I'd definitely recommend
looking into it if you're otherwise considering Hadoop, Vertica, Aster,
Greenplum, or a sharded MySQL/PostgreSQL setup.

Full disclosure: I am extremely biased since I've gotten to know the team very
well after using Citus. I'm definitely one of their biggest fans, if for no
other reason than the amount of time they've saved us at MixRank.

~~~
nwyc2012
Hi smilliken, would it be possible to elaborate a bit on how you do real-time
inserts/updates? I'm interested in trying Citus, but I would most probably
need the realtime feature for production use.

~~~
smilliken
I'd be happy to help, but this is probably out of scope of an HN comment. Feel
free to reach out to the email in my profile.

------
johnpmayer
This sounds a lot like the AsterData database, which I know has been around
for at least a few years. I'm interested to know whether you can write queries
that define explicit parallelism, as in the SQL-MapReduce language extension,
and also about the query preprocessor's ability to handle distributed
workloads.

~~~
ozgune
Hey, we currently don't have an SQL/MapReduce language extension, but we do
have the Map & Reduce execution primitives implemented under the covers (for
parallel query processing).

For the distributed query processor, we can efficiently parallelize SQL
queries that involve look-ups, complex selections, groupings and orderings,
analytic functions, and joins between one large and multiple small tables. We
also have a lot more coming; are there any queries that you are particularly
interested in?
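
To give a concrete (hypothetical) example of that last query class, here is
the "one large table joined with multiple small tables, then grouped and
ordered" shape, sketched against an in-memory SQLite database with a made-up
ad-impressions schema — plain SQLite standing in for Citus, schema and data
invented for illustration:

```python
import sqlite3

# Hypothetical star schema: one large fact table plus small dimension
# tables. In a distributed setting, each worker can scan its shard of the
# big table, join the small tables locally, and pre-aggregate before the
# master merges the partial results.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE impressions (ad_id INT, site_id INT, cost REAL);
    CREATE TABLE ads   (ad_id INT, campaign TEXT);
    CREATE TABLE sites (site_id INT, country TEXT);
    INSERT INTO impressions VALUES (1, 1, 0.1), (1, 2, 0.2), (2, 1, 0.3);
    INSERT INTO ads   VALUES (1, 'spring'), (2, 'fall');
    INSERT INTO sites VALUES (1, 'US'), (2, 'DE');
""")
rows = conn.execute("""
    SELECT a.campaign, s.country, SUM(i.cost) AS spend
    FROM impressions i
    JOIN ads a   ON a.ad_id   = i.ad_id
    JOIN sites s ON s.site_id = i.site_id
    GROUP BY a.campaign, s.country
    ORDER BY spend DESC
""").fetchall()
```

Queries of this shape decompose cleanly because the scan, the small-table
joins, and the partial aggregation can all happen per shard of the large
table.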

~~~
pella
any information about PostGis compatibility?

~~~
ozgune
In all honesty, I'd have to check. The worker nodes in our architecture will
be able to use PostGIS indexes just like regular PostgreSQL instances, and the
master node should correctly handle (partition) most PostGIS functions. Still,
we'd probably need to implement parallelization for the && operator, which
shouldn't be that hard.

That said, I don't want to misguide you here before going over PostGIS'
documentation more thoroughly. If you could ping us at engage@citusdata.com,
we'll send you a reply once we know for sure.

------
benbjohnson
Distributed SQL queries are cool and accessible to people, but I feel like
projects that apply relational languages to event data don't make much sense.
If I have clickstream data, then I'm more interested in knowing what users are
doing after they performed actions "A", "B" & "C" than in rolling up how many
users performed a single action "A". SQL falls flat on its face for this type
of analysis.

Also, the name threw me off. I thought it was "Citrus" and not "Citus".

[Full disclosure: I am writing an open source, distributed, behavioral
database - <https://github.com/skylandlabs/sky>]

~~~
bgilroy26
I'm working outside the bounds of my understanding here, but why can't that
result be formed from a query that pulls from the user_history table and a
subquery that pulls from the user_history table with different conditions on
each one?

~~~
ozgune
It's doable, but the paradigm doesn't naturally fit into SQL. Users typically
need to use sub-selects and self-joins for this kind of "A", "B" & "C"
analysis; and that introduces inefficiencies.
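
For concreteness, here is a minimal sketch of that self-join shape, against a
hypothetical `events(user_id, action, ts)` table (plain SQLite here, not
Citus; table and data invented for illustration). Each additional funnel step
would cost another self-join, which is where the inefficiency comes in:

```python
import sqlite3

# Hypothetical events table: the funnel "did A, then later did B",
# expressed as a self-join. It works, but every extra step in the
# funnel means joining the events table against itself once more.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INT, action TEXT, ts INT);
    INSERT INTO events VALUES
        (1, 'A', 10), (1, 'B', 20),  -- user 1 does A, then B
        (2, 'A', 10),                -- user 2 only does A
        (3, 'B', 5),  (3, 'A', 10);  -- user 3 does B *before* A
""")
users = conn.execute("""
    SELECT DISTINCT a.user_id
    FROM events a
    JOIN events b ON b.user_id = a.user_id
                 AND b.action  = 'B'
                 AND b.ts      > a.ts   -- B must happen after A
    WHERE a.action = 'A'
    ORDER BY a.user_id
""").fetchall()
```

Only user 1 completes the funnel in order, which is exactly the ordering
constraint that a plain roll-up query can't express.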

~~~
bkj123
This is where a good analytical (as opposed to operational) data model comes
into play. Something less normalized that is intuitive and efficient to use.

~~~
benbjohnson
Event data is difficult to denormalize within a relational database. You can
group events by actor (e.g. the user performing the events) and stuff them all
into a single row. You'll get great performance, but you'll have to
custom-process your data, which is now in a custom format; SQL functions
become useless.

You can also store every event as a row and pull it out through Hadoop to
process (as another commenter mentioned), but you're going to hit huge
performance bottlenecks just in extracting the data from the database,
serializing it, transferring it to another server for processing, and then
deserializing it. Not to mention that MapReduce is batch-oriented, so it's not
going to be real-time.

~~~
bgilroy26
In a typical data environment for an event table, you can have a user_history
table with unique_primary_key, timestamp, user_id, attribute_column_name,
previous_value, and subsequent_value columns.

If you want to know which users did X and later did Y, couldn't you select
all of the users who did X in a subquery, and then find out how many of those
user_ids match the user_ids of people who did Y, where the timestamp on Y is
between X's timestamp and X's timestamp +
some_predetermined_amount_of_time?
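
That query can be written; here is a minimal sketch of the shape being
described, against a hypothetical simplified `user_history(user_id, action,
ts)` table (SQLite standing in for the database; schema, data, and window
value all invented for illustration):

```python
import sqlite3

# Users who did Y within a fixed window after doing X, expressed as a
# correlated subquery -- the shape described above.
WINDOW = 100  # some_predetermined_amount_of_time (made up)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_history (user_id INT, action TEXT, ts INT);
    INSERT INTO user_history VALUES
        (1, 'X', 0), (1, 'Y', 50),   -- Y inside the window: counts
        (2, 'X', 0), (2, 'Y', 500),  -- Y too late: doesn't count
        (3, 'Y', 50);                -- Y with no preceding X
""")
count = conn.execute("""
    SELECT COUNT(DISTINCT y.user_id)
    FROM user_history y
    WHERE y.action = 'Y'
      AND y.user_id IN (SELECT x.user_id
                        FROM user_history x
                        WHERE x.action = 'X'
                          AND y.ts BETWEEN x.ts AND x.ts + ?)
""", (WINDOW,)).fetchone()[0]
```

So it's expressible, but each funnel step adds another subquery or join,
which is the inefficiency being discussed.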

I am out of my depth, but isn't there some database best practice for tracking
user session start and stop times?

_Edit:_ I wrote this comment before I saw a different comment of yours
above, which I think answers my question.

------
pella
"Features Not in v1.0"

<http://www.citusdata.com/documentation#missing-features>

~~~
bsg75
"Real-time insert, update, or deletes issued against the master node."

Is this a bulk/batch load only system then?

~~~
spathak
That is correct. This is primarily a bulk-load system. There are setups (as
mentioned in smilliken's comment above) where it can be used for real-time
inserts, but they require more hands-on setup and configuration.

------
edouard1234567
Congrats Umur and team. I look forward to trying it for ZeTrip analytics.

------
kt9
Congratulations on the launch! I've been following this company for a while,
and it's great to see the public launch!

------
emre
Congrats to the Citus team! It looks like a great product, will definitely use
it!

------
kolistivra
Looks very promising, I will definitely give it a shot! Good job!

------
seboavalin
Scalable data analytics accessible to anyone? Great!

------
pinarsezer
Love the video btw!

