
Storage is Cheap, Don't Mutate - Copy on Write - hashbo
http://hashbo.posterous.com/storage-is-cheap-dont-mutate-copy-on-write-te
======
wheels
There's a problem here in going from their specific case, where this may
make sense, to the more general case they're trying to make, where it
doesn't.

For instance, if you're mapping disk space to memory addresses and have
large blocks of data, this would mean that every time you needed to update a
single byte, you'd either have to duplicate the entire block (possibly
several mega- or gigabytes) or give up linear access. Yuck. That would be
absolutely miserable for a lot of use cases.
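
A toy sketch of the cost in Python (purely illustrative; the block size and
helper are made up):

    # Purely illustrative: "updating" one byte of an immutable block
    # means copying the whole block.
    block = bytes(64 * 1024 * 1024)  # a 64 MB immutable block

    def cow_set_byte(block, offset, value):
        # Copy-on-write without structure sharing: O(len(block)) per write.
        return block[:offset] + bytes([value]) + block[offset + 1:]

    block2 = cow_set_byte(block, 12345, 0xFF)  # copies ~64 MB for one byte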

You can get away with such things a lot further if you're implementing them
at the file-system level.

~~~
hvs
There's a great presentation of these types of issues in the book _Purely
Functional Data Structures_:

<http://www.cs.cmu.edu/~rwh/theses/okasaki.pdf>

~~~
rlander
This is a great read for any programmer, not only those interested in
functional programming.

Also related (HN discussion): <http://news.ycombinator.com/item?id=1138979>

And Chris Okasaki's blog: <http://okasaki.blogspot.com/>

------
hvs
I'm all for finding ways to avoid locking, and using immutable objects is
one way to help, but this:

      Remove the mutation part and now you have a version history,
      allowing conflict resolution.

is a little too "hand-wavy" for my tastes. Conflict resolution is not an
easier problem than resource contention.

~~~
hashbo
Fair point, maybe I should have been clearer (and an edit will be
forthcoming). My point is that with a version history it's possible to let
the content originator know there is a conflict and allow them to choose
which version they wish to go with. This can be as simple as displaying an
icon to show there has been a conflict and giving the choice to keep or
revert, showing the other user's changes, like most wikis do. Or it can be
much more complex, as in a version control system.

My point is more the negative here: you can't do conflict resolution if you
write over the previous state.

Sorry if it was hand wavy, was just a quick Friday afternoon blog :-)
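
For illustration, a minimal hypothetical sketch (not our actual code) of how
keeping versions makes conflicts detectable:

    # Hypothetical sketch: every save records the version it was based
    # on, so two saves sharing a parent are a detectable conflict.
    versions = {}  # version_id -> (parent_id, content)
    counter = 0

    def save(parent_id, content):
        global counter
        counter += 1
        versions[counter] = (parent_id, content)
        return counter

    def conflicts_with(a, b):
        # Same parent, different versions: concurrent edits.
        return a != b and versions[a][0] == versions[b][0]

    base = save(None, "original text")
    mine = save(base, "my edit")
    theirs = save(base, "their edit")
    assert conflicts_with(mine, theirs)  # show a keep-or-revert UI here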

~~~
hvs
Fair enough. I'm on your side about keeping a version history and using
immutable data (for some things). It's just that my biggest question when I was
reading (as it is any time someone discusses concurrency) is: how does he
handle updates to the "same" data? Because, ultimately, you will have state
somewhere in your system.

~~~
hashbo
So the simple answer is that right now we use a most-recent-wins strategy,
because our data is not critical or highly correlated like relational data.
For us it's simple to add UI features at a future date to allow wiki-style
history. Obviously with correlated data it's not so easy :-) and that is, as
I believe you were implying, where versioning gets tricky! Of course the
point still applies about the need to avoid mutation :-)
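
Something like this hypothetical sketch of most-recent-wins:

    # Hypothetical sketch: never overwrite, just append a new version;
    # reads resolve to the most recent one, and older versions stay
    # around for wiki-style history later.
    import time

    history = []  # append-only list of (timestamp, content)

    def write(content):
        history.append((time.time(), content))

    def read():
        return max(history)[1]  # most recent wins

    write("first draft")
    write("second draft")
    assert read() == "second draft"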

------
robertskmiles
I followed the link at the bottom, and it made me pretty angry. Hashbo is a
very neat example of the kind of bullshit I hate in upcoming software products
and services.

The only information available before asking for signup is this:

> Hashbo is a simple innovative, new way to communicate, collaborate and
> curate.

What the hell is that supposed to mean? Is it a blogging system?
Microblogging? A social network? IM? Is it some combination, or a new thing?

People come to a site like this with the question "What is this and _why
should I care?_". At the end of that (quite nice) animation, I have no idea,
and that doesn't make me intrigued and curious, it makes me annoyed and
frustrated at not getting a straight answer.

_Nobody cares about your buzzwords_, they care about what your site is, and
what it can enable them to do that they couldn't do before.

~~~
hashbo
Hey Robert

It's not attempting to create intrigue; I just really haven't figured out
the right copy yet :) Take a look at some more of my blog articles if you
like - they explain at least the general idea of what we're doing. We hope
to get everyone on the beta soon.

The problem for me is that it's actually difficult to describe what it does;
it's better experienced, and then you can make your own evaluation as to
whether it suits you or not. Much like when the Internet started, it was
easier to show than to explain. So while we're just in the early stages of
beta I didn't want to prejudice people by giving partial information.

So I’ll see if I can get some feedback from early beta testers to find out
what words they use to describe it - that may help me get a more unbiased
view.

Glad you liked the animation btw - we're (mostly) a one-man band in reality,
and I'm a techy at that, so any compliment towards the design is gratefully
accepted.

All the best and thanks for the feedback, it’s really helpful to know what
people think, Neil.

~~~
sgrove
Yeah, don't mind the haters. I haven't read the copy from your page - just
from the OP, it does look lame. But that's fine. Work on it, make it
specific, and pitch concrete benefits.

If your product involves education (a new model, etc.), you should be
pitching to everyone, measuring the mean time to understanding, and
relentlessly honing it downwards.

~~~
hashbo
Really appreciate the combination of honest and helpful feedback. I will
certainly keep doing exactly that :-)

------
balakk
Ironic that SQL is mentioned here - I think the idea is close to Multi-version
concurrency control, which "real" databases have been doing for years now.

<http://en.wikipedia.org/wiki/Multiversion_concurrency_control>

~~~
hashbo
Absolutely, but that is internal to the database and cannot be cached
locally. With copy-on-write semantics at the application level, you can
cache your data locally and not have to worry about distributed cache
invalidation etc.

~~~
birken
How do your distributed nodes figure out which is the most recent version of a
particular object?

~~~
hashbo
Hiya - by an index. The index, of course, isn't cached. This model applies
equally well to RDBMS and non-RDBMS systems: you can index which is the most
recent version without losing the benefit of caching the data (and
potentially its de-serialized form within the application). Not suggesting
one size fits all, just highlighting a useful strategy.
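
A hypothetical sketch of the idea (not our actual implementation):

    # Hypothetical sketch: an (id, version) entry is immutable, so any
    # cached copy is valid forever; only the small id -> latest-version
    # index has to stay authoritative (uncached).
    index = {}  # id -> latest version number (not cached)
    store = {}  # (id, version) -> data (safe to cache anywhere)
    cache = {}  # local cache; never needs invalidating

    def put(obj_id, data):
        version = index.get(obj_id, 0) + 1
        index[obj_id] = version
        store[(obj_id, version)] = data

    def get(obj_id):
        key = (obj_id, index[obj_id])  # one small index lookup
        if key not in cache:           # a version is immutable, so a
            cache[key] = store[key]    # cache hit can never be stale
        return cache[key]

    put("doc", "hello")
    put("doc", "hello world")
    assert get("doc") == "hello world"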

~~~
birken
And how do you keep this consistent across different objects? E.g. object A
and object B are updated at the same time, going from generation 1 to
generation 2. I am on a remote machine reading from the cache. How can I
ensure I don't get a view where I see object A at generation 1 and object B
at generation 2?

I.e. the basic point I am making here is that MVCC does a lot of relatively
complicated stuff to handle this for you. I don't disagree that you are
gaining a lot of power at your remote nodes to do things that would increase
performance, but for 99% of applications I think it's a huge amount of work
for a small amount of benefit.

~~~
hashbo
Hi, so I believe you're asking how CoW solves consistency issues. This seems
to be a recurring theme. I suspect (and correct me if I'm wrong) this comes
from the fact that we're using a non-RDBMS database, or perhaps because our
actual caching is by nature distributed. In either case the questions being
asked apply to all distributed caching: if you need an absolute guarantee of
consistency (nuclear power plant, large financial transactions) then I would
personally avoid using caches and reconcile these relationships in the
database. I might add this is not our use case :-)

My experience has been in both high-volume financial systems and consumer
startups, and I personally have found that sometimes ACID behaviour is near
essential. Or at least near essential to keep the implementor sane :-) And
other times it's just a hindrance. Nothing in my article implied that I'm
out to solve all of these (interesting!) cases - rather that a particular
pattern, which can be implemented in different ways, has a set of benefits.
For the record, Neo4J supports ACID transactions, in case I have implied
otherwise.

What's interesting to me is that various of the comments I have received
seem to extrapolate my article into, I believe, an attempt to remove or
somehow change the workings of RDBMS systems. Which I can honestly say has
me at a loss; goodness knows what I said to trigger that view. No offence to
the commenters, I'm just uncertain where this came from. What I was
discussing originally was a design pattern which I have had recent
experience of, in my case within the context of persistence.

Again, forgive me if I misunderstood, but I believe you are talking about
the internals of your database when you're talking about MVCC. That wasn't
the problem we were trying to solve. In fact Neo4J is a very high-speed
transactional database that we're more than happy with.

In fact we're not trying to solve a problem: we needed versioned data, so we
implemented that, and now we're getting lots of benefits, including zero
cache invalidations, which means we have a safe write-through cache - one
that can of course cache application-level representations of the data
(i.e. objects). And the various other benefits I mentioned.

I can only state my practical experience here, which is that we do get a
massive benefit from being able to arbitrarily cache data without fear of
stale data.

But hey, horses for courses. Enough from me for tonight - a huge pile of
code awaits me and a beta needs to get underway. If you'd like to get back
to me, please drop a comment on the original article and I will be pinged.
Always happy to chat.

Thanks for the discussion!

Neil

------
joevandyk
This is what bugs me about state machines (like
<https://github.com/pluginaweek/state_machine>).

Adding a history or audit always seems like a giant hack.

~~~
hashbo
Actually, a history on a state machine gives you undo functionality - a
fairly common pattern, no? You have a stack of states instead of a single
state. The trouble though is inter-related states, isn't it :-) Things can
get messy if you don't have atomic states.
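
For illustration, a hypothetical sketch:

    # Hypothetical sketch: a stack of states instead of a single
    # mutable state, and undo falls out for free.
    class StateMachine:
        def __init__(self, initial):
            self.history = [initial]  # never overwritten, only pushed

        @property
        def state(self):
            return self.history[-1]

        def transition(self, new_state):
            self.history.append(new_state)

        def undo(self):
            if len(self.history) > 1:
                self.history.pop()

    sm = StateMachine("draft")
    sm.transition("review")
    sm.transition("published")
    sm.undo()
    assert sm.state == "review"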

Actually someone else just pointed me in the direction of Event Sourcing -
every pattern has a name somewhere -
<http://martinfowler.com/eaaDev/EventSourcing.html>

~~~
jchrisa
There is a lot of complexity in that Event Sourcing document. The pattern can
be simpler: <https://github.com/iriscouch/follow/blob/master/README.md>

------
EGreg
One of the most versatile ways to get scalability and partition tolerance
while also having consistency is eventual consistency.

And having eventual consistency with MVCC implemented with vector clocks
really rocks.
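
A minimal sketch of the idea (hypothetical names, not Riak's actual code):

    # Hypothetical sketch: each node keeps a counter per node, and
    # comparing clocks tells you whether one version supersedes another
    # or whether they are concurrent (a real conflict to keep).
    def increment(clock, node):
        clock = dict(clock)
        clock[node] = clock.get(node, 0) + 1
        return clock

    def descends(a, b):
        # True if a has seen everything b has seen.
        return all(a.get(n, 0) >= c for n, c in b.items())

    def concurrent(a, b):
        return not descends(a, b) and not descends(b, a)

    v1 = increment({}, "node_a")   # write on node A
    v2 = increment(v1, "node_b")   # node B writes, having seen v1
    v3 = increment(v1, "node_c")   # node C writes concurrently
    assert descends(v2, v1)        # v2 supersedes v1: v1 can be dropped
    assert concurrent(v2, v3)      # the "contentious history" to keep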

Riak takes Brewer's CAP theorem seriously and even lets you configure the
values per bucket so that you can get more of C, more of A, or more of P.

You don't need to store all the history, just the contentious history. Riak
pretty much rocks in the sense that you can add nodes to it and it just works,
or take them away and it just works. Just like Dynamo. (Not like Cassandra.)

------
RexRollman
Just thinking out loud: does doing copy on write result in less filesystem
fragmentation (assuming the data is not held in RAM)?

~~~
hashbo
In theory yes, as it avoids holes in the databases or files used for storage.
It’s also the semantics of ZFS - <http://en.wikipedia.org/wiki/ZFS>

~~~
ora600
... and ZFS copy-on-write has been known to cause severe performance
degradation for databases. Not to mention that too many SAs keep the default
ZFS block size, which is far too large for OLTP databases.

