
Call me maybe: Cassandra - lucian1900
http://aphyr.com/posts/294-call-me-maybe-cassandra
======
fleitz
Great article on the fundamental problems associated with mutable state. The
fundamental problem is that the idea of an object with a set of state that is
the same to all observers violates pretty much the whole of information
theory. It's not a problem that will ever be fixed without changing the
fundamental laws of the universe.

Ditch the mutable data and you can stop asking questions like what do we do if
10 becomes 10.5 before it becomes 11 and start storing values which never
change.
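Here is a minimal Python sketch of the "values which never change" idea. It assumes a single numeric cell and integer timestamps, and the `ValueHistory` and `Version` names are hypothetical. Each update appends a new immutable version instead of overwriting the old one, so 10, 10.5 and 11 all stay stored, and a reader asks for the value as of a point in time.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Version:
    """One immutable fact: 'the value was `value` as of `timestamp`'."""
    timestamp: int
    value: float


class ValueHistory:
    """Hypothetical append-only history: versions are added, never overwritten."""

    def __init__(self) -> None:
        self._versions: List[Version] = []

    def record(self, timestamp: int, value: float) -> None:
        # Instead of mutating a cell from 10 to 10.5 to 11, keep every value.
        self._versions.append(Version(timestamp, value))

    def as_of(self, timestamp: int) -> Optional[float]:
        # Readers pick the latest version at or before the time they care
        # about; no version they read can change underneath them.
        candidates = [v for v in self._versions if v.timestamp <= timestamp]
        if not candidates:
            return None
        return max(candidates, key=lambda v: v.timestamp).value
```

With this model, whether a reader sees 10.5 or 11 depends on which timestamp it asks about. It no longer depends on what the cell happened to contain when the reader looked.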

~~~
fusiongyro
The argument here is basically to replace mutable state with event sourcing.
It's an interesting idea and sometimes the right one, but if each user action
that triggers a one-cell update becomes an event I have to keep forever, I
see my database size exploding. This is also going to yield performance
problems I'll be tempted to avert with caching, leading to fantastically bad
performance or cascading failures whenever the caches fail or are restarted.

I'm sure it's the right answer for some people, some of the time. In fact, I'm
sure it's not applied as frequently as it should be. But it's definitely a
specialized tool with special applications, whereas mutable state is, for better
or worse, the hammer we can and ought to continue relying on.

~~~
agilord
> but if each user action that triggered a one cell update becomes an event I
> have to keep forever, I see my database size exploding

I think there might be a balance here: you can always garbage-collect events
that have already been merged into an updated value of the given object.
Depending on the requirements, this GC can be done e.g. days or months after
that merge...
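Here is a sketch of that GC rule. It assumes each object keeps a snapshot recording the highest event sequence number folded into it and the time the fold happened. The `Event`, `Snapshot` and `garbage_collect` names and the counter payload are hypothetical. Events covered by the snapshot are dropped only once the fold is older than a configurable retention window.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    seq: int          # position in the object's event log
    timestamp: int    # when the event was recorded (e.g. seconds)
    delta: int        # hypothetical payload: an increment to a counter


@dataclass(frozen=True)
class Snapshot:
    upto_seq: int     # every event with seq <= upto_seq is folded into `value`
    folded_at: int    # when the fold happened
    value: int


def garbage_collect(events: List[Event], snapshot: Snapshot,
                    now: int, retention: int) -> List[Event]:
    """Drop events already merged into the snapshot, but only once the
    merge is at least `retention` old (e.g. days or months)."""
    if now - snapshot.folded_at < retention:
        return list(events)  # too soon: keep everything for auditing/replay
    return [e for e in events if e.seq > snapshot.upto_seq]
```

Keeping the raw events for a while after the fold leaves room to re-run or audit the fold before the history is discarded.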

~~~
fusiongyro
Doesn't having an "updated value" imply mutable state? If I have some, why not
have it all? Having both mutable state and a complex folding operation to
maintain it is going to give me multiple sources for the same information.
Which one will be authoritative? I'd expect a compromise to be worse than
either extreme.

~~~
agilord
I assume you have 'events' that you will 'merge' together into a single state
object (in case you want to display something). So the operation is to fetch
every related event, merge, display.

Now the 'folding' can be defined as snapshotting the 'merged state'. Instead
of fetching 10 events, after the folding + GC, you will fetch e.g. 2 + the
folded one. You are saving some CPU and bandwidth over time and that's it.
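The read path described here can be sketched as follows. The sketch assumes events are `(seq, field, value)` tuples where the latest value per field wins, which is a hypothetical merge rule (real systems define their own). It shows that folding the snapshot plus the tail of newer events yields the same state as folding the whole log.

```python
from functools import reduce
from typing import Dict, List, Tuple

# Hypothetical event shape: (seq, field, new_value); a later event for the
# same field wins when events are merged into the state object.
Event = Tuple[int, str, str]


def merge(state: Dict[str, str], event: Event) -> Dict[str, str]:
    _, field, value = event
    merged = dict(state)   # don't mutate the input state
    merged[field] = value
    return merged


def fold(state: Dict[str, str], events: List[Event]) -> Dict[str, str]:
    # Apply events in log (sequence-number) order.
    return reduce(merge, sorted(events), state)


def read(snapshot: Dict[str, str], snapshot_seq: int,
         events: List[Event]) -> Dict[str, str]:
    # Fetch only the events newer than the snapshot and fold them on top.
    tail = [e for e in events if e[0] > snapshot_seq]
    return fold(snapshot, tail)
```

Take 10 events and a snapshot covering events 1 through 8. In that case `read` folds 2 events onto the snapshot instead of replaying all 10, which is the CPU and bandwidth saving described above.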

------
tomjohnson3
I've been following Kyle's Jepsen project for a few months (check out his
other posts), and it seems (thankfully) to be nudging the focus of discussion
away from only "performance metrics" (transactions/sec, etc.) of distributed
systems and toward data consistency and behavior in the face of partitions and
other faults. ...what's the use of being able to make 20k updates per second
if half your data is lost during a common failure?

kudos!

------
jbellis
The bugs he ran into with lightweight transactions were fixed within days, two
weeks ago, and included in the 2.0.1 release shortly afterwards.

~~~
itp
The fact that these bugs were so trivially reproduced and made it into an
initial release at all should be cause for concern.

Conflict of interest disclosure -- I work for FoundationDB, where we put a
shockingly high level of effort into testing our software in simulation and
the real world. [1][2]

[1] [https://foundationdb.com/white-papers/testing](https://foundationdb.com/white-papers/testing)

[2] [https://foundationdb.com/blog/quicksand-continuous-real-worl...](https://foundationdb.com/blog/quicksand-continuous-real-world-fault-tolerance-testing)

~~~
rdtsc
Has FoundationDB been released, and is it open source?

I remember others (employees of the company) hailing its wonderful qualities
for quite a while now (years), but when I went to the website all I could find
was a bunch of white papers and a registration form. So this reads a bit like
"my vaporware's features are better than your shipped product's features".

No matter how many white papers there are, I would still put my data in
Cassandra rather than this new thing (last I checked I couldn't even download
it; I had to fill out a form of some sort).

~~~
itp
FoundationDB isn't open source software, but it is a 1.0 product that's freely
available to download right now[1]. (You are right in recalling that during
our alpha and beta programs there was a simple registration form, and there's
still an account signup for our community site and of course for Enterprise
licensing and support).

[1] [https://foundationdb.com/get](https://foundationdb.com/get)

------
penguindev
I'm concerned that the Paxos test couldn't handle 50/sec without timing out
practically all of them. Also, the 'official' DataStax blog post I read
describing the feature was not at all clear about the granularity/sharding of
the Paxos state machines. It said a 'partition'. Does that mean a 'vnode' in
Cassandra-speak?

Otherwise, kudos for adding a potentially useful feature to Cassandra.

~~~
jbellis
As I mentioned above, we fixed the bugs he ran into; this is not
representative behavior of 2.0.1.

The granularity is a CQL partition:
[http://www.datastax.com/documentation/cql/3.0/webhelp/index....](http://www.datastax.com/documentation/cql/3.0/webhelp/index.html#cql/ddl/ddl_anatomy_table_c.html)
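Here is a toy illustration of what "the granularity is a CQL partition" (as opposed to a vnode) means for contention. It uses a hypothetical acceptor that keeps one promised ballot per partition key. As a result, conditional updates to different rows of the same partition compete for the same Paxos instance, while different partitions proceed independently. Only the prepare/promise step is shown, and this is not Cassandra's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical CQL-style primary key: (partition_key, clustering_key).
PrimaryKey = Tuple[str, str]


@dataclass
class ToyPaxosAcceptor:
    """Tracks one promised ballot per *partition*, not per row or vnode."""
    promised: Dict[str, int] = field(default_factory=dict)

    def prepare(self, key: PrimaryKey, ballot: int) -> bool:
        partition = key[0]  # only the partition key selects the Paxos instance
        if ballot <= self.promised.get(partition, -1):
            return False    # an equal or higher ballot was already promised
        self.promised[partition] = ballot
        return True
```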

------
penguindev
Interesting findings re. the Paxos bugs. Dare I ask how reliable Cassandra's
sharded Paxos is (in theory, at least) while you're adding/removing nodes from
the cluster? How do all clients see consistent membership?

~~~
jbellis
The principle is the same as with normal reads; expand the quorum to include
both old and new owners until the transition is done:
[https://issues.apache.org/jira/browse/CASSANDRA-833?focusedC...](https://issues.apache.org/jira/browse/CASSANDRA-833?focusedCommentId=13028232&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13028232)
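Here is a sketch of the counting rule this implies. As an illustrative assumption (not Cassandra's code), the required number of acknowledgements is a majority of the current owners plus every pending new owner. Let N = |old| + |pending| be the number of participants and Q the number of acks required. Then 2Q > N, so any two such quorums overlap.

```python
from typing import Set


def required_acks(old_owners: Set[str], new_owners: Set[str]) -> int:
    """Hypothetical rule: a majority of the current owners, plus every
    node that is in the process of becoming an owner."""
    pending = new_owners - old_owners
    return len(old_owners) // 2 + 1 + len(pending)


def quorum_reached(acks: Set[str], old_owners: Set[str],
                   new_owners: Set[str]) -> bool:
    # Only responses from nodes in either the old or the new replica set count.
    participants = old_owners | new_owners
    counted = acks & participants
    return len(counted) >= required_acks(old_owners, new_owners)
```

Once the transition finishes, the pending owners become ordinary owners and the requirement drops back to a plain majority.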

------
dschiptsov
What is the reason for trying to squeeze into the JVM an engine that must be
implemented one level up, at the OS level? ;)

Is it still not obvious that "universal object storage" is a naive idea?)

