
There is something between append only and reusing blocks - ot
http://antirez.com/post/btree-reuse-blocks-with-delay.html
======
ot
> But you are already near to the I/O limit, what happens once you start
> writing a new file to rewrite the btree? The additional I/O can affect badly
> the performance of the btree.

I wonder if there are any benchmarks that compare append-only + compaction vs.
updates in-place (with a free list or something). To me it is not obvious why
the second should be less I/O-intensive (for example, in-place does random
writes while append-only is sequential).

Also, if persistence is done by checkpointing every n seconds (as far as I
understand this will be the default behaviour) part of the compaction can be
done in memory before committing the batch of the last n seconds' worth of
operations. In append-only, this can save a lot of I/O if the operations are
on a small set of keys.
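To make the point concrete, here is a minimal sketch (not Redis diskstore's actual code) of buffering one checkpoint interval's operations in a dict, so that repeated writes to the same key collapse into a single log record before the batch is flushed:

```python
# Sketch: buffer operations in a dict so repeated writes to the same key
# within one checkpoint interval collapse into one entry before the batch
# is appended to the log.

class CheckpointBuffer:
    def __init__(self):
        self.pending = {}  # key -> latest value (or None for a delete)

    def set(self, key, value):
        self.pending[key] = value  # overwrites any earlier write to this key

    def delete(self, key):
        self.pending[key] = None

    def flush(self, log):
        # One log record per *key*, not per operation.
        for key, value in self.pending.items():
            log.append((key, value))
        self.pending.clear()

log = []
buf = CheckpointBuffer()
for i in range(1000):
    buf.set("counter", i)   # 1000 operations on one key...
buf.flush(log)
print(len(log))             # 1 -- a single record reaches disk
```

If the workload keeps hitting a small hot set of keys, almost all of the append-only I/O is coalesced away before it ever touches disk.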

I think that Bitcask (<http://downloads.basho.com/papers/bitcask-intro.pdf>)
uses a similar approach, with append-only storage. I would be curious to see a
performance comparison with Redis diskstore.
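For readers who haven't seen the linked paper, the Bitcask idea can be sketched in a few lines (this is an illustration of the scheme, not Basho's implementation): values are appended to a log, and an in-memory hash (the "keydir") maps each key to the offset of its most recent record, so a read costs at most one seek.

```python
# Minimal Bitcask-style store: append-only data file plus an in-memory
# keydir mapping each key to the offset of its latest record.
import io

class TinyBitcask:
    def __init__(self):
        self.log = io.BytesIO()   # stands in for the append-only data file
        self.keydir = {}          # key -> (offset, length) of latest value

    def put(self, key, value):
        offset = self.log.seek(0, io.SEEK_END)
        self.log.write(value)
        self.keydir[key] = (offset, len(value))

    def get(self, key):
        offset, length = self.keydir[key]
        self.log.seek(offset)
        return self.log.read(length)

db = TinyBitcask()
db.put(b"k", b"v1")
db.put(b"k", b"v2")     # old record becomes garbage, reclaimed by merging
print(db.get(b"k"))     # b'v2'
```

The superseded records are exactly the garbage that a periodic merge/compaction pass reclaims.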

~~~
moe
Many smart people have pondered long and hard over this exact problem for
decades. Why not climb on their shoulders and adopt the WAL approach that's
working so well for most RDBMS?

~~~
jbellis
bq. Why not climb on their shoulders and adopt the WAL approach

The WAL approach solves a different problem: "how do we provide durable writes
with minimal performance impact?"

WAL is not queryable, so the question of "how do we efficiently provide
random, indexed access to data" is a separate one. (And the traditional answer
of B-trees is working increasingly poorly in the face of a growing
capacity:seek time gap.)
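jbellis's distinction is easy to see in code. A sketch of the WAL contract (my illustration, not any particular database's implementation): durability comes from appending a record and fsync'ing before acknowledging the write, but *reading* a key back still means scanning the whole log, which is why a separate indexed structure is needed on top.

```python
# Durable writes: append + fsync before acking. The log is write-optimized,
# but it is not an index -- lookups must scan every record.
import json
import os
import tempfile

def wal_append(fd, record):
    os.write(fd, (json.dumps(record) + "\n").encode())
    os.fsync(fd)              # the write is durable once this returns

def wal_lookup(path, key):
    value = None
    with open(path) as f:
        for line in f:        # O(log size): no random, indexed access
            rec = json.loads(line)
            if rec["key"] == key:
                value = rec["value"]   # last write wins
    return value

path = tempfile.mktemp()
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
wal_append(fd, {"key": "a", "value": 1})
wal_append(fd, {"key": "a", "value": 2})
os.close(fd)
print(wal_lookup(path, "a"))  # 2 -- but only after reading every record
```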

~~~
moe
You are right, I realize I misjudged/simplified the problem somewhat.

My mind had spontaneously formed this idea of an in-memory overlay b-tree of
committed (WAL) but not yet merged changes... But thinking it through further
I have to admit that this would quickly lead to a complexity explosion
(MVCC'ish).
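The overlay idea is essentially what LSM-tree designs call a memtable, and the point lookup path is simple: consult the in-memory overlay of committed-but-unmerged changes first, then the on-disk base. A sketch (hypothetical names) also hints at where the complexity moe mentions comes from: deletes need tombstones to shadow the base copy, and range scans need to merge two sorted views.

```python
# Read path for an in-memory overlay of committed (WAL) but unmerged changes.
TOMBSTONE = object()

class OverlayStore:
    def __init__(self, base):
        self.base = base      # stands in for the on-disk b-tree
        self.overlay = {}     # committed via WAL, not yet merged into base

    def put(self, key, value):
        self.overlay[key] = value

    def delete(self, key):
        self.overlay[key] = TOMBSTONE   # must shadow the base copy

    def get(self, key, default=None):
        if key in self.overlay:
            v = self.overlay[key]
            return default if v is TOMBSTONE else v
        return self.base.get(key, default)

store = OverlayStore(base={"x": 1, "y": 2})
store.put("x", 10)
store.delete("y")
print(store.get("x"), store.get("y"))   # 10 None
```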

I also notice my tone in that comment was more snarky than I meant to be,
sorry for that. However I still believe that this is a wheel that's been
reinvented often enough (and that is complex enough) that it'd be a good idea
to focus on researching how the others have eventually solved it - to minimize
the risk of repeating history.

------
carterschonwald
I believe that the right perspective is to "view it as a GC" and leave the
file system to do its job!

I don't have the references on hand, but among possible memory management
schemes, "stop and copy" garbage collection is asymptotically fastest, though
it does potentially result in "pauses" in the mutator/executing program. Any
other scheme has to worry about keeping track of unused data and all the bad
news that comes with multiple things going on at the same time.
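The stop-and-copy analogy maps onto log compaction directly: like a semispace collector, a merge pass walks the old log, copies only the live records (the latest version of each key) into a fresh file, and discards the old file wholesale. Cost is proportional to live data, not to accumulated garbage. A sketch:

```python
# Semispace-style compaction of an append-only log: copy live records to a
# new "space", drop the old one. No free lists, no in-place bookkeeping.
def compact(old_log):
    live = {}
    for key, value in old_log:   # last write wins
        live[key] = value
    new_log = [(k, v) for k, v in live.items()]
    return new_log               # old_log can now be deleted wholesale

old = [("a", 1), ("b", 2), ("a", 3), ("a", 4)]
print(compact(old))              # only the live versions survive
```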

~~~
ot
You don't have to stop the world to copy from a log-structured file: the old
segments are immutable, so compaction can run concurrently with the mutator.

