
WesternDigital/blb: Distributed object storage system for use on bare metal - mastabadtomm
https://github.com/westerndigitalcorporation/blb
======
brokensegue
> In September 2017, Upthere was acquired by Western Digital, and it was
> decided to pause development on Blb and move data to other storage systems

So wait, this is abandoned?

~~~
pknopf
It is using go.mod, so it can't be _that_ abandoned.

~~~
andy_ppp
Yes, it has commits to master from 4 days ago ...

~~~
zingplex
The commit date can't really be used to determine activity in this case. There
is only an initial commit.

------
monocasa
What does bare metal mean here, not virtualized?

It's hard to tell sometimes; one man's 'bare metal' is another man's
abstraction layer.

~~~
linkmotif
Maybe no filesystem, only block storage?

~~~
justincormack
It says they currently use ext4 but might bypass later.

------
notacoward
Looks broadly similar to the bottom layer of the system I work on, except that
blb only claims that it "should" scale to very high levels whereas the one I
work on has already been running at many-petabyte scale for years.

Not too surprisingly, a lot of things that seem "impossible" at smaller scale
become everyday things for a large enough system. For example, you _will_ get
some kinds of inconsistencies that necessitate various forms of active GC or
scrubbing. You _will_ have hot spots, which you need to explicitly deal with
instead of relying on statistical distribution guarantees. You _will_ have to
migrate whole racks' worth of data at once as equipment (not just disks and
hosts but also network switches and power infra) gets upgraded. And of course
you'll have to monitor the hell out of it so you can fix these problems as
they occur instead of having them multiply until your system is irretrievably
broken. Don't take claims of super-duper scalability (from blb or Minio) on
faith. Look for these "extras" as evidence that the system actually has been
run at scale. BTW they don't seem to be there in the blb source.
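
To make the hot-spot point concrete, here is a minimal Go sketch. It is not taken from blb: `nodeFor`, `loadPerNode`, and `hottest` are hypothetical names, and FNV hashing is only an assumed placement scheme. It shows that hashing decides where a blob lives but does nothing about skewed access, so one popular blob still lands all of its traffic on a single node.

```go
package main

import "hash/fnv"

// nodeFor maps a blob key to one of n nodes by hashing the key.
// Assumes n > 0. This is an illustrative placement scheme, not blb's.
func nodeFor(key string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(key)) // Write on an FNV hash never returns an error
	return int(h.Sum32() % uint32(n))
}

// loadPerNode sums the request count of every key onto the node that holds it.
func loadPerNode(requests map[string]int, n int) []int {
	load := make([]int, n)
	for key, count := range requests {
		load[nodeFor(key, n)] += count
	}
	return load
}

// hottest returns the index and load of the busiest node.
func hottest(load []int) (node, peak int) {
	for i, l := range load {
		if l > peak {
			node, peak = i, l
		}
	}
	return
}
```

Feeding this a request map in which one key has a thousand requests and the rest have one each puts the peak on whichever node holds the hot key, no matter how uniform the hash is. Dealing with that takes something explicit (caching, extra replicas, or moving data), which is the kind of "extra" to look for.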

What does seem to be there is a reference buried in the deployment docs to a
"master" component, not mentioned in the architectural overview. It seems to
be responsible for assigning partitions of the blob space to curators and
directing clients to the right one. Seems like a bottleneck but OK, let's take
a look anyway.

// Since we don't persist the address and last heartbeat time, when a master
// failover happens, the new leader cannot service requests until it hears
// heartbeat from the curators. See PL-1102.
(from master.go lines 35-37)

Hm. That seems like a pretty big disruption, even if it's rare. Also, what
kinds of heartbeats are these? I think it's generally a bad idea for systems
like this to implement their own liveness checking. That's a hard problem, and
there are tried and true specialist-written systems for doing it. Systems that
do it themselves are almost certainly drifting away from their
own core competency. This comment is getting long enough so I won't do a full
analysis of the blb heartbeat system, but I invite others to look at it with
an eye toward how much load it imposes on masters in a large cluster, how
reliable failure detection is, and what things should be done (but aren't)
when heartbeats fail.
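
For anyone who wants to picture the failover issue in the quoted comment, here is a rough Go sketch. It is not blb's code: `CuratorTable` and its methods are hypothetical, and timestamps are plain Unix seconds to keep it small. The only point is that a liveness table kept purely in memory starts out empty on a new leader, so that leader has nobody to route to until heartbeats arrive.

```go
package main

import (
	"sort"
	"sync"
)

// CuratorTable is a hypothetical in-memory liveness table. Nothing in it is
// persisted, so a newly elected master starts with an empty table.
type CuratorTable struct {
	mu         sync.Mutex
	timeoutSec int64
	lastSeen   map[string]int64 // curator address -> Unix time of last heartbeat
}

// NewCuratorTable returns an empty table with the given heartbeat timeout.
func NewCuratorTable(timeoutSec int64) *CuratorTable {
	return &CuratorTable{timeoutSec: timeoutSec, lastSeen: make(map[string]int64)}
}

// Heartbeat records that the curator at addr was heard from at nowUnix.
func (t *CuratorTable) Heartbeat(addr string, nowUnix int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastSeen[addr] = nowUnix
}

// Live returns, in sorted order, the curators whose last heartbeat is no
// older than the timeout.
func (t *CuratorTable) Live(nowUnix int64) []string {
	t.mu.Lock()
	defer t.mu.Unlock()
	live := make([]string, 0, len(t.lastSeen))
	for addr, seen := range t.lastSeen {
		if nowUnix-seen <= t.timeoutSec {
			live = append(live, addr)
		}
	}
	sort.Strings(live)
	return live
}

// CanServe reports whether at least one curator is currently known to be live.
func (t *CuratorTable) CanServe(nowUnix int64) bool {
	return len(t.Live(nowUnix)) > 0
}
```

A table built with `NewCuratorTable(10)` on a fresh leader returns false from `CanServe` until the first `Heartbeat` call comes in, so the outage after a failover lasts roughly one heartbeat interval. Note that every `Heartbeat` and `Live` call takes the same lock, which is one place to look when asking how much load heartbeats put on a master in a large cluster.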

It looks like a pretty good _start_ to a distributed blob store. The basic
architectural principles are sound, the code looks pretty clean and well
commented, etc. OTOH, seems a bit light on tests, and the lower-level
implementation details suggest that in its current form it might not handle
even a hundred-node cluster all that well. Caveat emptor.

~~~
CTrox
Just out of curiosity, what system are you working on?

~~~
notacoward
It's a system within Facebook called Warm Storage. There have been some public
presentations about it, so I'm comfortable mentioning the name, but
unfortunately I can't provide many other details about architecture or scale.
I'll just warn people that the public information on it is _way_ out of date.
Most of it seems to be from 2014, and what it describes is really a separate
system from what we have now.

~~~
pstuart
I (and I'm sure many others on HN) would love to learn more about this when
possible.

------
jagadishg
How does it compare to minio?

~~~
glibgil
No one knows what that is

~~~
octetta
[https://minio.io](https://minio.io)

------
bluedino
Was WD planning on making a block storage service or something?

~~~
scrollaway
Haven't they? WD MyCloud is an end-user storage service.

------
Serow225
Seagate has their Kinetic KVS-harddrive project too...

~~~
notacoward
Are you sure? Even their own website for it seems to be dead.

[https://www.seagate.com/tech-insights/kinetic-vision-how-
sea...](https://www.seagate.com/tech-insights/kinetic-vision-how-seagate-new-
developer-tools-meets-the-needs-of-cloud-storage-platforms-master-ti/)

None of their code repos seem to have been updated for years. Good riddance,
too. Object storage is a fine thing, but their implementation was laughably
bad.

[https://pl.atyp.us/2013-10-comedic-open-
storage.html](https://pl.atyp.us/2013-10-comedic-open-storage.html)

