
Building a Distributed Log from Scratch, Part 1: Storage Mechanics - tylertreat
http://bravenewgeek.com/building-a-distributed-log-from-scratch-part-1-storage-mechanics/
======
marknadal
This is a really neat article! I did a talk in Sweden the other month about
how to build a distributed database; hopefully it is also fun/useful/informative
for others: [https://youtu.be/5fCPRY-9hkc](https://youtu.be/5fCPRY-9hkc)
(it uses CRDTs instead, so it's a counterpart that isn't totally ordered and
isn't append-only).

I like the OP's article though, because I learned about NATS Streaming, which
I hadn't heard of before - just Kafka. Will have to check it out.

~~~
mandelliant
NATS Streaming was definitely neat. And thanks for sharing your video; I'm
always looking for new distributed database resources.

------
nicolaslem
I have to admit that I only recently became familiar with logs. I was
designing a B+Tree[1] in Python for fun and was struggling to make it survive
crashes: a single insertion into the tree often results in multiple page
writes, which are not atomic.

The solution to this problem is simple and elegant with a write-ahead log:
every page write is appended to the log and only merged back into the tree
file once the log is known to be safely written to storage.

SQLite has extensive documentation of its WAL file format[2], which is great
for learning.

[1]
[https://github.com/NicolasLM/bplustree](https://github.com/NicolasLM/bplustree)

[2]
[https://www.sqlite.org/fileformat.html#the_write_ahead_log](https://www.sqlite.org/fileformat.html#the_write_ahead_log)
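
A rough Python sketch of the idea (a hypothetical record format, not
SQLite's actual WAL layout): append each page write as a record and fsync
before touching the tree file, then on recovery replay only the complete
records:

```python
import os

class WAL:
    """Minimal write-ahead log sketch: one record per page write."""

    def __init__(self, path):
        self.path = path

    def append(self, page_no, data):
        # Record layout (hypothetical): 8-byte page number,
        # 4-byte payload length, payload bytes.
        with open(self.path, "ab") as f:
            f.write(page_no.to_bytes(8, "big"))
            f.write(len(data).to_bytes(4, "big"))
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # durable before the tree file is touched

    def replay(self):
        # After a crash, re-apply every complete record in order;
        # a truncated tail (torn write) is simply dropped.
        records = []
        with open(self.path, "rb") as f:
            while True:
                header = f.read(12)
                if len(header) < 12:
                    break  # incomplete header: discard the tail
                page_no = int.from_bytes(header[:8], "big")
                length = int.from_bytes(header[8:12], "big")
                data = f.read(length)
                if len(data) < length:
                    break  # incomplete payload: discard the tail
                records.append((page_no, data))
        return records
```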

~~~
rakoo
If you're interested in those, there are at least two other designs you should
have a look at:

- CouchDB uses a single file for each database, which means the write-ahead
log _is_ the storage. Atomicity is guaranteed by declaring that the latest
root is the valid root: if writes are interrupted, everything since the last
root is invalid and is discarded on restart. A simple design that just works,
although it tends to be wasteful and requires frequent compaction.

- LMDB ([https://en.m.wikipedia.org/wiki/Lightning_Memory-Mapped_Database](https://en.m.wikipedia.org/wiki/Lightning_Memory-Mapped_Database))
uses copy-on-write to make sure space is used properly, and atomicity is
provided by sharing only the strictly minimal pieces of information - so
small, in fact, that atomicity is guaranteed by the OS. Follow the links on
the Wikipedia page; there's a lot of interesting stuff there.
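
The "latest valid root is the valid root" recovery can be sketched like this
(a hypothetical checksummed record format, not CouchDB's actual file layout):
on restart, keep only the last root record whose checksum verifies and
discard any torn tail after it:

```python
import struct
import zlib

MAGIC = b"ROOT"

def append_root(f, payload):
    # Root record (hypothetical): magic, 4-byte length, payload, 4-byte CRC.
    rec = MAGIC + struct.pack(">I", len(payload)) + payload
    f.write(rec + struct.pack(">I", zlib.crc32(rec)))
    f.flush()

def find_last_valid_root(data):
    # Scan for the last record whose checksum verifies; anything written
    # after it (an interrupted write) is treated as garbage and ignored.
    last = None
    i = 0
    while True:
        i = data.find(MAGIC, i)
        if i < 0:
            break
        if i + 8 <= len(data):
            (length,) = struct.unpack(">I", data[i + 4:i + 8])
            end = i + 8 + length
            if end + 4 <= len(data):
                rec = data[i:end]
                (crc,) = struct.unpack(">I", data[end:end + 4])
                if zlib.crc32(rec) == crc:
                    last = rec[8:]
        i += 1
    return last
```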

------
kthielen
This is a good introduction, and a very useful abstraction. At Morgan Stanley
we have built a PL/compiler and tools around a method of logging like this --
logging algebraic data types and live-querying them with Haskell-like
comprehensions/pattern matching/etc:

[https://github.com/Morgan-Stanley/hobbes](https://github.com/Morgan-Stanley/hobbes)

------
ww520
One question about the index file: "in Kafka, the index uses 4 bytes for
storing an offset relative to the base offset and 4 bytes for storing the log
position." Isn't the offset relative to the base offset already pointing to
the physical location of the message in the segment file? What's the purpose
of the second 4-byte field, the log position?

~~~
foota
I think the offsets are both in terms of number of messages, whereas the log
position is the byte offset in the file at which the message starts.
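
In other words (a hypothetical sketch of that layout, not Kafka's actual
code): each 8-byte index entry maps a message offset relative to the
segment's base offset to a byte position in the segment file, and a lookup
binary-searches for the nearest preceding entry:

```python
import bisect
import struct

def pack_entry(relative_offset, file_position):
    # 4 bytes of offset relative to the base offset,
    # 4 bytes of byte position in the segment file.
    return struct.pack(">II", relative_offset, file_position)

def lookup(index_bytes, base_offset, target_offset):
    # The index is sparse: find the greatest entry whose relative offset
    # is <= target, then the reader scans forward from that file position.
    entries = [struct.unpack(">II", index_bytes[i:i + 8])
               for i in range(0, len(index_bytes), 8)]
    rel = target_offset - base_offset
    idx = bisect.bisect_right([e[0] for e in entries], rel) - 1
    return entries[idx][1] if idx >= 0 else 0
```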

~~~
kungfooguru
Right, I think in earlier Kafka versions it actually was the byte position as
well, but then they moved to the current method.

