
Building a Distributed Log from Scratch: Sketching a New System - tylertreat
https://bravenewgeek.com/building-a-distributed-log-from-scratch-part-5-sketching-a-new-system/
======
fapjacks
I am usually skeptical of links to blog posts people add to HN themselves
because I'm skeptical of self-promotion more generally, but this series of
posts is actually really insightful. I enjoyed reading it and have been
passing it to some friends of mine.

------
sagichmal
This is interesting but the constant reference to NATS is more than a little
irritating. It's one player among many in the space, and its decisions aren't
particularly notable. I would have loved to read something that was honestly
"from first principles" rather than a pivot from the NATS architecture or
whatever.

~~~
tylertreat
To be fair, part one of the series explains this:

> we will focus on what it takes to build something like this using Kafka and
> NATS Streaming as case studies of sorts—Kafka because of its ubiquity, NATS
> Streaming because it’s something with which I have personal experience.

------
AndyNemmity
I am in the process of building a system like this, and a question I've been
asked is: why the need for a message broker at all? It's been a little
difficult to justify once I really start to think about it.

Why not just put everything into essentially an S3 bucket, and then process
off of it? What's the benefit of a broker in between them, really?

~~~
TheCowboy
It depends. What is your use case? If you mean this project specifically, part
one says "you will probably never need to build something like this yourself
(nor should you)".

Dumping everything into an S3 bucket may be perfectly fine and even efficient
for some projects. I personally don't see myself wanting to use S3 buckets for
transmitting messages between multiple different systems with different
purposes.

If you have a lot of different services producing data that multiple services
will consume, then a pub/sub system can simplify things. Some systems, like
Kafka, also have added redundancy/scaling benefits.
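The fan-out benefit is easy to sketch. Here's a minimal in-process pub/sub in Python (not any particular broker's API; the `Broker` class and topic name are illustrative): the producer publishes once and doesn't know or care how many consumers exist, which is the decoupling being described.

```python
from collections import defaultdict


class Broker:
    """Minimal in-process pub/sub: producers publish to a topic
    without knowing which consumers are subscribed."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Fan out to every subscriber; adding a new consumer
        # requires no change on the producer side.
        for handler in self.subscribers[topic]:
            handler(message)


broker = Broker()
received = []
broker.subscribe("audit", lambda m: received.append(("analytics", m)))
broker.subscribe("audit", lambda m: received.append(("archive", m)))
broker.publish("audit", {"event": "login"})
```

With raw object storage, each new consumer instead has to poll the bucket and agree with every producer on a key layout; the broker hides that coordination.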

~~~
AndyNemmity
The use case is a cloud platform: taking all audit/log data, wrapping it in
metadata, and then placing it somewhere that analytics can run against.

Currently this is done for the control plane by just loading the data directly
into Elasticsearch. To increase availability for all users, the consideration
is: does a pub/sub add anything over just throwing the data directly into an
S3-like bucket (Swift in our case)?

I just assumed it did. Everyone does it. Scale, decoupling the sender from the
receiver...

But... I mean, why? What's the point of an additional hop, and another thing
to manage? If your source is already wrapping metadata, and perhaps putting it
into a structure, what is the value of a pub/sub there? If you can load it
into the pub/sub, you can load it into the S3 bucket in the same manner.

~~~
amarkov
Coordinating a partitioning scheme across an S3 bucket is terribly
complicated. The senders have to know exactly how every receiver is
partitioned, so they can adopt a compatible partitioning scheme when writing
to S3. And the receivers have to know exactly how the senders are partitioned,
so they can divvy up incoming data appropriately.
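Concretely, the coupling shows up in the object keys. A sketch of what a sender would have to do (all names and the partition count here are hypothetical, just to illustrate the shared-knowledge problem):

```python
# Hypothetical: every sender AND every receiver must agree on this
# value out of band. A broker would own this detail instead.
NUM_PARTITIONS = 8


def object_key(source, event_id):
    """Build the S3 object key a sender would write to.

    The sender hard-codes the receivers' partitioning scheme into
    the key. If any consumer repartitions (say, 8 -> 16 partitions),
    every producer's key logic must change in lockstep.
    """
    partition = event_id % NUM_PARTITIONS
    return f"events/partition={partition:02d}/{source}/{event_id}.json"


object_key("auth-service", 42)
# -> "events/partition=02/auth-service/42.json"
```

With a broker, producers publish to a topic and the broker assigns partitions; producers and consumers can scale or repartition independently.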

