Hacker News
Call me maybe: Cassandra (aphyr.com)
268 points by lucian1900 on Sept 26, 2013 | 51 comments

Great article on the fundamental problems associated with mutable state. The core issue is that the idea of an object with a set of state that is the same to all observers violates pretty much the whole of information theory. It's not a problem that will ever be fixed without changing the fundamental laws of the universe.

Ditch the mutable data and you can stop asking questions like what do we do if 10 becomes 10.5 before it becomes 11 and start storing values which never change.

Information theory does not have a problem, only, as you say, our universe. Mathematicians and their various hangers-on like programming language researchers often prefer to deal with models that have no concept of time, in which the very concept of "observer" is extraneous since there isn't really anything like a "point of view". Everything just... is.

Also, I think the focus on immutability misses the more interesting discussion about lattices and how important they are to distributed programming. Check out this video, which will take you up from the beginning: http://vimeo.com/53904989

(Seriously, if you do anything remotely distributed, this is required viewing. There's some serious stuff going on in the distributed world and this is a great intro.)
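
For a concrete picture of the lattice idea, here is a minimal sketch of a grow-only counter whose merge is a lattice join (element-wise maximum). The `GCounter` class and the replica names are hypothetical illustrations, not taken from the talk or from any particular library.

```python
from dataclasses import dataclass, field


@dataclass
class GCounter:
    """Grow-only counter: one non-decreasing count per replica."""
    counts: dict = field(default_factory=dict)

    def increment(self, replica: str, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a grow-only counter cannot decrease")
        self.counts[replica] = self.counts.get(replica, 0) + amount

    def merge(self, other: "GCounter") -> "GCounter":
        # Lattice join: element-wise maximum over all known replicas.
        keys = self.counts.keys() | other.counts.keys()
        return GCounter(
            {k: max(self.counts.get(k, 0), other.counts.get(k, 0)) for k in keys}
        )

    def value(self) -> int:
        return sum(self.counts.values())


# Two replicas accept increments independently, then exchange state.
a, b = GCounter(), GCounter()
a.increment("a", 2)
b.increment("b", 3)

merged_ab = a.merge(b)
merged_ba = b.merge(a)
```

Because the merge is commutative, associative, and idempotent, replicas can exchange state in any order, any number of times, and still converge on the same value.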

The talk is getting interesting at 11 minutes in. But I can't find the slides anywhere and the video was edited by someone who likes to show the slide for 5 seconds and the person pointing at it for 2 minutes.

The questions of consensus and mutability are, to some extent, separable; immutable data does not provide liveness without consensus guarantees. See, for instance, Datomic's design, where a strong coordinator is needed to serialize updates to otherwise immutable/replayable state.

Indeed. Although the use of confluent data structures where possible, combined with immutable data stores and 'live' functional reactive programming across device boundaries, -would- drastically simplify cases like these.

The argument here is basically to replace mutable state with event sourcing. It's an interesting idea and sometimes the right one, but if each user action that triggered a one cell update becomes an event I have to keep forever, I see my database size exploding. This is also going to yield performance problems I'll be tempted to avert with caching, leading to fantastically bad performance or cascading failures whenever the caches fail or are restarted.

I'm sure it's the right answer for some people, some of the time. In fact, I'm sure it's not applied as frequently as it should be. But it's definitely a specialized tool with special applications, where mutable state is, for better or worse, the hammer we can and ought to continue relying on.

> but if each user action that triggered a one cell update becomes an event I have to keep forever, I see my database size exploding

I think there might be a balance here: you can always garbage-collect events that are already merged in an updated value of the given object. Depending on the requirement, this GC can be done e.g. after days or after months of that merge...

Doesn't having an "updated value" imply mutable state? If I have some, why not have it all? Having both mutable state and a complex folding operation to maintain it is going to give me multiple sources for the same information. Which one will be authoritative? I'd expect a compromise to be worse than either extreme.

I assume you have 'events' that you will 'merge' together in a single state object (in case you want to display something). So the operation is to fetch every related event, merge, display.

Now the 'folding' can be defined as snapshotting the 'merged state'. Instead of fetching 10 events, after the folding + GC, you will fetch e.g. 2 + the folded one. You are saving some CPU and bandwidth over time and that's it.
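
The fetch/merge/display loop and the folding step described above can be sketched roughly as follows. This is a single-process illustration under stated assumptions: events are dicts with hypothetical `cell` and `value` keys, they arrive in a total order, and `keep_last` is a made-up retention knob. It deliberately ignores the hard part, which is deciding when compaction is safe across nodes.

```python
from functools import reduce


def apply_event(state: dict, event: dict) -> dict:
    """Return a new state with one cell update applied; inputs are not mutated."""
    return {**state, event["cell"]: event["value"]}


def current_state(snapshot: dict, events: list) -> dict:
    """Fetch, merge, display: fold the retained events over the snapshot."""
    return reduce(apply_event, events, snapshot)


def compact(snapshot: dict, events: list, keep_last: int) -> tuple:
    """Fold all but the newest `keep_last` events into a new snapshot."""
    cut = max(len(events) - keep_last, 0)
    return current_state(snapshot, events[:cut]), events[cut:]


# Ten updates to one cell plus one update to another.
events = [{"cell": "A1", "value": v} for v in range(10)]
events.append({"cell": "B2", "value": "x"})

before = current_state({}, events)
snapshot, retained = compact({}, events, keep_last=2)
after = current_state(snapshot, retained)
```

After compaction a reader fetches the folded snapshot plus two retained events instead of eleven events, and `before` and `after` describe the same state.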

> The argument here is basically to replace mutable state with event sourcing. It's an interesting idea and sometimes the right one, but if each user action that triggered a one cell update becomes an event I have to keep forever, I see my database size exploding.

Once an event is visible to all users, it can be merged into the base state of the system and no longer needs to be stored (or at least kept online) as a separate event. For the performance reasons you allude to, you probably want to do that as much as possible.

(This is still effectively an append-only series of immutable states, but losing, at least from regular on-line access, "older" states that no one can see anymore.)

> Once an event is visible to all users,

There's the rub: any GC/compaction (and when you get down to it, any read) is actually a distributed consensus problem. If you're interested in this problem, you might take a look at the CRDT garbage collection literature for more details.

But don't you buy yourself a stronger guarantee that all inputs have been collected by waiting to converge?

> if each user action that triggered a one cell update becomes an event I have to keep forever, I see my database size exploding

Yes I could see that getting prohibitively expensive when SSDs cost 70 cents/GB and hard drives 5 cents/GB. You should really throw out your historical data at those kinds of costs, probably not worth 5 cents per GB.

Personally, that's why I don't keep backups, files change all the time, and I was going broke making sure I had older copies of my data. I'd rather just rewrite all my code and retake all my pictures.

Hardware cost is hardly the only cost. Managing a few GBs of data is quite a bit different than managing 100s of GBs or more of data.

Perhaps 100s of TBs, but 100s of GBs is not a difficult problem to solve. Perhaps once you get to more than 100 TB you'd be beyond the capabilities of a single chassis.

Even a petabyte should fit in a rack or two.

I really fail to understand how a business could acquire that much data and not be able to sell it.


Here ya go, 180 TB for $10K in 4U, which means 10 to a rack which means 1.8 PB per rack. Who has 180 TB of database that isn't worth $10K?
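
A quick check of that arithmetic, assuming decimal units (1 TB = 1,000 GB) and taking the quoted figures at face value:

```python
price_usd = 10_000      # quoted price of one 4U unit
capacity_tb = 180       # quoted capacity of one unit
units_per_rack = 10     # quoted density
gb_per_tb = 1_000       # assumption: decimal units

cost_per_gb = price_usd / (capacity_tb * gb_per_tb)
rack_capacity_pb = capacity_tb * units_per_rack / 1_000
rack_price_usd = price_usd * units_per_rack

print(f"{cost_per_gb * 100:.1f} cents/GB, {rack_capacity_pb} PB per rack")
```

That works out to roughly 5.6 cents per GB, in line with the 5 cents/GB hard drive figure mentioned earlier in the thread, and 1.8 PB per rack for about $100K of hardware.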

Hi, I work in computational fluid dynamics.

Again, you need to look at more than the cost of hardware; that's not the issue. More data requires more managing: performance, backup, failures, etc.

Hi, I work in astronomy.

As the other commenters have pointed out, I'm pretty sure that you're supposed to compact old events once they have been determined to be eventually consistent amongst all (or almost all) of the nodes.

i've been following kyle's jepsen project for a few months (check out his other posts) - and it seems (thankfully) to be nudging the focus of discussion away from only "performance metrics" (transactions/sec, etc.) of distributed systems and toward data consistency and behavior in the face of partitions and other faults. ...what's the use of being able to make 20k updates per second if half your data is lost during a common failure?


The bugs he ran into with lightweight transactions were fixed within days, two weeks ago, and included in the 2.0.1 release shortly afterwards.

DataStax still claimed these properties while the bugs were live. It's important they fixed them quickly, but the point still stands—the point of the entire series, I think—that database companies very frequently make big claims that few people go through the process of checking... even at the companies themselves.

Have you ever seen a product feature description edited on account of a bug report, particularly one that was resolved within days?


Congratulations then.

Seriously, I've seen plenty of bugs found in database transaction mechanisms, and I've never seen a database pull claims of, "transactional integrity". I can't even imagine that getting through the product team in a couple of days.

It's important to distinguish architecture and design from implementation. The former may be correct, even if the latter is found to have a bug.

If you distinguish between those two, and implementation doesn't match design, then you must concede you have a level of incompetence in your development chain.

When dealing with data, you don't get any points for "well, we meant to do it the right way, but it didn't happen. sorry."

I guess I'll have to settle for joining the crowd of incompetents behind PostgreSQL, Linux, the JVM, et al., who have also been known to release bug fixes on occasion. :)

I guess I'm just quite surprised Aphyr turned that one up. I didn't look at the fix, but it's a pretty severe bug if it completely, silently disables a feature.

It is, and I think the author did.

You mean companies sometimes promote features that still have bugs? My god that is incredible. I can't believe that has never happened before. Oh wait it has. On nearly every single software project in the history of software projects.

His test amounts to inserting integers without even simulating a network partition. This is about the most basic test of the purported functionality you could come up with, not a fencepost error or some cosmetic problem. This is advertising a fundamental property of your software and not even doing the most minimal checking that it is true.

Databases are and must be held to a higher standard than the software sitting above them in the stack, just as kernels must be held to an even higher standard, because bugs in lower layers cause more damage with higher costs. DBAs and commercial databases are expensive because data is valuable and there are liabilities. If the database developer made a remark like yours, I would run in the opposite direction. A smug reply like that exposes a fundamental disrespect for other people's data--their property. If one has no respect for our property, one shouldn't find it surprising that we have no respect for one's software or services.

I note and appreciate that the Cassandra developer's reply below is even-handed and serious.

Of course, if aphyr hadn't reported the bugs before publishing, they'd be live right now, right?

And who knows how many more are lurking? I think that's the issue exposed by this post -- if these really basic things are broken, why would anyone believe that the actual tricky stuff has been tested?

It's awfully hard to prove a negative, but our test suites are open [1] [2], and so is our bug tracker [3].

[1] https://github.com/apache/cassandra/tree/trunk/test/unit/org... [2] https://github.com/riptano/cassandra-dtest [3] https://issues.apache.org/jira/browse/CASSANDRA

Interesting. Care to comment on why you think the issues brought up in this blog post weren't covered by your test cases? To me they seem to be approximately the first thing you would test with any transaction feature, but I don't work on a mutable datastore, so maybe I have a misapprehension of the situation.

I guess what I'm trying to say is, if I see something I think should obviously be tested going untested, my go-to assumption is that the failure was in bad testing practices. To dispel that belief, I need something to replace it with that explains the observation even better. If you're willing to share your view of the situation, maybe that would do it. (There's no real business case for you to do it, since I'm not in your market. But as a hacker, I'm interested in what I can learn from this.)

No excuses. It's as simple as me dropping the ball on this one.

> I think that's the issue exposed by this post -- if these really basic things are broken, why would anyone believe that the actual tricky stuff has been tested?

So, I think it is relevant that LWT is still a very new feature in Cassandra and not something basic to it at all (arguably counter to a lot of its original design goals).

Personally, I was much more concerned by the server side timestamps only using millisecond granularity (and even that is somewhat understandable given the JVM's limitations).
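
To illustrate why timestamp granularity matters for last-write-wins, here is a toy register. The tie-break rule (compare the values when timestamps are equal) is an assumption made for this sketch, and the timestamps and function names are made up; it is not presented as Cassandra's actual code path.

```python
def lww_merge(a: tuple, b: tuple) -> tuple:
    """Pick the winner of two (timestamp, value) writes.

    Assumption: when timestamps tie, the larger value wins, which is
    what Python's tuple ordering gives us here.
    """
    return max(a, b)


def to_millis(t_micros: int) -> int:
    """Truncate a microsecond timestamp to millisecond granularity."""
    return t_micros // 1000


# "zebra" is written first, "apple" 400 microseconds later.
first_us = (1_380_000_000_123_000, "zebra")
second_us = (1_380_000_000_123_400, "apple")

# At microsecond granularity the later write wins, as expected.
winner_us = lww_merge(first_us, second_us)

# At millisecond granularity both writes carry the same timestamp,
# so the tie-break decides the outcome, not the order of the writes.
first_ms = (to_millis(first_us[0]), first_us[1])
second_ms = (to_millis(second_us[0]), second_us[1])
winner_ms = lww_merge(first_ms, second_ms)
```

In the millisecond case the earlier write survives and the later one is silently discarded, even though neither client saw an error.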

The fact that these bugs were so trivially reproduced and made it into an initial release at all should be cause for concern.

Conflict of interest disclosure -- I work for FoundationDB, where we put a shockingly high level of effort into testing our software in simulation and the real world. [1][2]

[1] https://foundationdb.com/white-papers/testing

[2] https://foundationdb.com/blog/quicksand-continuous-real-worl...

> The fact that these bugs were so trivially reproduced and made it into an initial release at all should be cause for concern.

Umm... no. The .0 releases of the project are actually where you'd expect the most bugs. That's where you have the first commit with an implementation of a new design.

You'll note that DataStax's commercial version of DSE is still based on 1.2.x...

Is FoundationDB released, and is it open source?

I remember others (employees of company) hailing its wonderful qualities for quite a while now (years), then I go to the website and all I could find was a bunch of white papers and a registration form. And here it seems a bit of a "my vaporware's features are better than your shipped product's features".

No matter how many white papers there are I would still put my data in Cassandra rather than this new thing (last I checked I couldn't even download it, I had to fill out a form of some sort).

FoundationDB isn't open source software, but it is a 1.0 product that's freely available to download right now[1]. (You are right in recalling that during our alpha and beta programs there was a simple registration form, and there's still an account signup for our community site and of course for Enterprise licensing and support).

[1] https://foundationdb.com/get

advice from someone who's been at this nosql thing for about 6 years now. schadenfreude will always come back to bite you. tread lightly.

Sorry for the offtopic post: just noticed now that three months ago, you mentioned I misrepresented your words. I deeply apologize; that's not at all what I meant. I had found your Ricon talk refreshingly honest and quoted it admiringly to people I know. (I had zero issue with anything you said; but rather with certain institutions.)

I'm sorry for any errors/misrepresentations I made about your words, and will be more careful in the future. (Unfortunately, I can no longer edit, delete nor respond to that post.)

haha no worries my friend :) glad you enjoyed the talk!

Please. I have yet to work in any environment that hasn't had a brown bag release or two, regardless of how well tested it is.

Nice marketing attempt though.

I'm concerned that the paxos test couldn't handle 50/sec without timing out practically all of them. Also, the 'official' datastax blog post I read describing the feature was not clear at all describing the granularity / sharding of the paxos state machines. It said a 'partition'. Does that mean a 'vnode' in cassandra speak?

Otherwise, kudos for adding a potentially useful feature to cassandra.

As I mentioned above, we fixed the bugs he ran into; this is not representative behavior of 2.0.1.

The granularity is a CQL partition: http://www.datastax.com/documentation/cql/3.0/webhelp/index....

Interesting findings re. the paxos bugs. Dare I ask how reliable cass's sharded paxos is (in theory, at least) while you're adding/removing nodes from the cluster? How do all clients see consistent membership?

The principle is the same as with normal reads; expand the quorum to include both old and new owners until the transition is done: https://issues.apache.org/jira/browse/CASSANDRA-833?focusedC...
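
A toy model of that idea, assuming a simple majority quorum computed over the union of old and new owners during the transition. The node names and helper functions are hypothetical, and this is a sketch of the principle rather than Cassandra's implementation.

```python
from itertools import combinations


def quorum_size(replicas: set) -> int:
    """Simple majority of a replica set."""
    return len(replicas) // 2 + 1


def effective_replicas(old_owners: set, new_owners: set, transitioning: bool) -> set:
    """While ownership is moving, consult both old and new owners."""
    return old_owners | new_owners if transitioning else new_owners


def overlaps_both(old_owners: set, new_owners: set) -> bool:
    """Check that every quorum of the union holds a majority of each owner set."""
    union = old_owners | new_owners
    need = quorum_size(union)
    return all(
        len(set(q) & old_owners) >= quorum_size(old_owners)
        and len(set(q) & new_owners) >= quorum_size(new_owners)
        for q in combinations(sorted(union), need)
    )


# n1 hands its range to n4; n2 and n3 stay put.
old = {"n1", "n2", "n3"}
new = {"n2", "n3", "n4"}

during = effective_replicas(old, new, transitioning=True)
after = effective_replicas(old, new, transitioning=False)
safe = overlaps_both(old, new)
```

In this one-node handoff, any 3 of the 4 nodes contain a majority of both the old and the new owners, so a quorum taken during the move overlaps quorums taken before and after it. The check is specific to this example; it does not hold for arbitrary old and new sets.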

What is the reason for trying to squeeze into the JVM an engine which must be implemented one level up, at the OS level? ;)

Is it still not obvious that "universal object storage" is a naive idea?)
