
Baidu File System – A distributed file system for real-time applications - bluebore
https://github.com/baidu/bfs
======
notacoward
Disclosure: I'm a Gluster developer.

Looks like a pretty good first attempt at a distributed filesystem. Initial
impression is HDFS with a distributed NameNode/Nameserver. The first diagram
also shows a Metaserver layer that's not mentioned at all in the more recent
of the two design docs but "separate Metaerver from Nameserver" appears
(unchecked) in the roadmap. All operations using access methods other than
their own SDK seem to get funneled through the NameServer cluster, which will
severely limit throughput. Not clear how they do replication, though weakly
implied that it's driven from the client (like Gluster) or NameServer rather
than the first ChunkServer (like Ceph, HDFS, everything else). No mention of
how they handle consistency or repair. Likewise no information about
performance or security. Not clear if it's anywhere near POSIX compliant
(probably not).

FUSE support is in the diagrams, but _not_ checked off on the roadmap. Slow-
node detection and avoidance seemed like one of the most interesting features
from the design, but is not checked off either. Other things not even on the
roadmap, using Gluster not as a fair comparison but as a handy list of
possibilities: multiple replication levels, tiering, erasure coding, NFS/SMB,
caching, quota, snapshots.

As I said, looks like a good first attempt. Better than most I've seen, with
lots of potential, but as of _today_ it seems rather bare-bones. Many hard
problems remain to be solved, and I wish them well.

~~~
Jacky007
> Looks like a pretty good first attempt at a distributed filesystem.

You are damn right. It is. ~3 years ago, the most widely used DFS in Baidu was
Peta, which is similar to HDFS V2. We have migrated to AFS now.

------
justinsb
Looks nice. I know Raft better than most of the other pieces, so that's where
I started; I didn't see code for dynamic membership changes nor log
truncation. I can understand getting by with a fixed membership, but log
truncation seems like a requirement for a production system. Would be
interested to hear whether this is planned or whether there is a clever way
around it!

~~~
imafatboy
Well, there's another project in the same organization, named iNexus, that has
implemented log truncation. It uses leveldb as its underlying storage, and the
leveldb is slightly modified to clean out outdated data when compacting. Maybe
BFS will do something similar. For the source code, please refer to
[https://github.com/baidu/ins](https://github.com/baidu/ins) And I'm sorry for
the lack of English documents in this repo. We are working on it.
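
For readers unfamiliar with the term: log truncation in Raft generally means discarding the prefix of the log that is already covered by a snapshot or by applied state. Below is a minimal C++ sketch of that idea, assuming a purely in-memory log. The class and method names (`RaftLog`, `TruncatePrefix`) are hypothetical and are not taken from BFS or iNexus, which, per the comment above, handles this during leveldb compaction instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>

// Hypothetical in-memory Raft log; illustrative only, not BFS or iNexus code.
class RaftLog {
public:
    // Appends an entry and returns its 1-based log index.
    int64_t Append(const std::string& entry) {
        entries_.push_back(entry);
        return LastIndex();
    }

    // Copies the entry at `index` into `out`. Returns false if the entry
    // was truncated or has not been written yet.
    bool Get(int64_t index, std::string* out) const {
        if (index < first_index_ || index > LastIndex()) {
            return false;
        }
        *out = entries_[static_cast<size_t>(index - first_index_)];
        return true;
    }

    // Drops every entry whose index is <= `applied_index`. This is only safe
    // once those entries are captured in a snapshot or durable applied state.
    void TruncatePrefix(int64_t applied_index) {
        while (!entries_.empty() && first_index_ <= applied_index) {
            entries_.pop_front();
            ++first_index_;
        }
    }

    int64_t FirstIndex() const { return first_index_; }
    int64_t LastIndex() const {
        return first_index_ + static_cast<int64_t>(entries_.size()) - 1;
    }

private:
    int64_t first_index_ = 1;          // log index of entries_.front()
    std::deque<std::string> entries_;  // retained suffix of the log
};
```

Indices keep growing after truncation, so followers and the leader can still refer to entries by their original positions. A real implementation would also have to persist the truncation point and ship a snapshot to followers that have fallen behind it.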

~~~
justinsb
Thank you - excited to see this! I do think there is a lack of a C++ library
for Raft that stands alone (and you have two projects just within baidu that
could share code). I'd be excited to help with a standalone project! And I'm
sorry for my lack of non-English, but it seems that the variable names are
still in English so I can follow the code :-)

(It is a pity that Chrome doesn't automatically translate github pages that
contain different languages - not sure why that isn't happening.)

------
usgroup
Looking through the code it supports fuse, but the documentation in ENG is
sparse. It also looks to underpin Tera: the Baidu distributed DB.

I think a low read/write latency dfs suitable for real time applications would
be a game changer. I'm hoping they up the documentation from here and engage
the English speaking community.

~~~
lylei
Thanks for your advice. We are working on translating all the documents :)

~~~
usgroup
PS: if your DFS works within a docker container you'll have a very strong
differentiator since the rest don't. You'd also possibly solve the "how to do
storage in a container cloud without resorting to NAS or separate clusters"
problem.

~~~
notacoward
> if your DFS works within a docker container you'll have a very strong
> differentiator since the rest don't.

Untrue. Gluster is already deployed that way in many places. Yes, in
production and at scale.

~~~
usgroup
Do you mean hackery of this sort:

[http://blog.xebia.com/persistence-with-docker-containers-
tea...](http://blog.xebia.com/persistence-with-docker-containers-
team-1-glusterfs-2/)

Or do you know of clean, container only (no plugins or special external tools)
solution ?

~~~
notacoward
Oh, sorry, didn't realize we were playing the "move the goalposts" game. If
you were to google for "gluster" and "containers" you'd get everything from
slick marketing stuff to a presentation at the recent Gluster developer summit
in Berlin. I have no idea if any of those would meet your next set of
standards but, frankly, meh.

~~~
usgroup
Container hosting with a homogeneous cluster constraint was a real requirement
for me that I could not find a solution for amongst existing options but since
you're a gluster dev you'd probably know better whether it's possible; so happy
to stand corrected. Thanks for correcting; and no offence intended.

~~~
notacoward
It is certainly possible. The first user I know of who did this was using
Mesos. Nowadays the push is more around doing it with Kubernetes and
OpenShift; I know there was at least one presentation on it at Red Hat Summit.
I'm a core-infrastructure guy, so that's kind of not my bailiwick, but if
there's nothing in Gluster's own documentation about such things there might
be something in one of those other communities.

------
espadrine
Based on the design[1], it has a leader / follower pattern (although you
should have multiple leaders with Raft consensus to avoid having a single
point of failure), where the leader is called "nameserver" and decides where
to put each piece of data and metadata among a set of chunk servers and
metadata servers.

That design is very reminiscent of CephFS's cluster monitors, metadata
servers, object storage devices.

[1]:
[https://github.com/baidu/bfs/blob/master/docs/design.md](https://github.com/baidu/bfs/blob/master/docs/design.md),
[https://github.com/baidu/bfs/blob/master/docs/BFS_design.md](https://github.com/baidu/bfs/blob/master/docs/BFS_design.md)

~~~
ergo14
what is the leader/follower pattern? Something like master/slave approach?

~~~
kuschku
Leader/Follower is actually something entirely different, but the linked
Chinese document talks about a master/client approach.

Sadly, in the past years, due to some political movements, the term
"master/slave" has been declared problematic, and GitHub actively warns that
projects using such language can and will be excluded from the service.

There have been previous discussions about this on HN.

~~~
Cyph0n
Wow, that's very interesting. I thought GitHub delegates moderation to the
repo owners.

There was actually a huge debate about this on Reddit caused by Swift merging
a rename change PR into master. The Swift team was so excited about the change
for some reason that they didn't even run tests before the merge...

~~~
ergo14
Do you have a link for that? I don't follow swift development and I'm unsure
what they actually changed.

Rename of what?

~~~
Cyph0n
Changing variable names using master/slave to leader/follower.

Here's my comment from the thread:

[https://reddit.com/r/ProgrammerHumor/comments/3veu2t/comment...](https://reddit.com/r/ProgrammerHumor/comments/3veu2t/comment/cxn12gl)

The rest of the discussion is a good read too.

------
revelation
They need to stop saying "real-time". Real-time does not mean "fast", it means
"guaranteed performance". This is nothing of the sort.

------
anilgulecha
Can a distributed storage expert comment in what ways this differs from
hadoop?

~~~
00k
- first of all, implemented in C++ (JVM/GC is a pain in the ass)
- clear architecture (only master and dataserver)
- very concise config file and easy to deploy
- most important, 10k-node scalability without a federated namespace design

~~~
pkolaczk
Lack of good documentation, no tests and possibly undefined behaviour in a few
places. The code also doesn't look any cleaner than HDFS and uses some weird
mix of C (*printf, error codes) and C++ (vectors, smart pointers, RAII etc).

~~~
jstimpfle
> weird mix of C (*printf, error codes) and C++ (vectors, smart pointers, RAII
> etc).

Haven't looked at any code, but what you describe is very common usage.

------
gravypod
For the distributed FS people out there I've got a complicated question. In my
job I need to poll and collect data from many remote sensor devices, log all
the output, and process that. Not only do I do this but MANY of my colleagues do
this, and each has a different way to manage the process. Can a distributed file
system help with this case?

Is there any file system that would be able to sync what amounts to
text/binary data across many hosts and allow me to aggregate the data off the
network for more secure storage?

I was thinking about using IPFS for this but this also seems better. I'd
hopefully like to have a private network for this use case so that other
people can't post up a device on this file system and introduce fake data.

------
chubot
Interesting: Google flags, protocol buffers, and Google C++ style.

~~~
puzzle
If you look at Baidu's infrastructure, it's almost like a parallel universe
where the names are identical or almost identical to Google's: BFE, GTC, GSLB.
And BFS does look a lot like GFS2 aka Colossus.

~~~
keketi
Really makes you think...

------
jpgvm
More like a C++ clone of HDFS than most people are likely hoping. While you
seem to be able to mount it with FUSE I imagine it's primarily meant to be
programmed against directly.

Using Raft over a dependency on an external consensus system is nice.
Definitely makes the namenode architecture much better.

~~~
ciucanu
It looks like a faster version of HDFS since it's written in C++ (vs Java).

Another important aspect is that it is using SSD + SATA (I suppose), which could
be a better option than standard SATA/SSD or LV cache using SATA + SSD.

Even if it's just a new thing, if it proves to be faster it may be implemented
in Hadoop ecosystem in the future. HDFS has a lot of features being a mature
piece of software but it falls short on response time.

~~~
pkolaczk
"It looks like a faster version of HDFS since it's written in C++ (vs Java)."

This is a non sequitur. The conclusion does not follow from the premise.

~~~
otterley
During non-GC periods, probably true. But having a realtime filesystem service
that is prone to stop-the-world GC pauses is a showstopper for many
applications.

Also, a C++ implementation is likelier to use far less memory than a Java
implementation, assuming the skills of both programmers are roughly equal.

~~~
pkolaczk
The underlying local filesystem on each node is not truly realtime, so a
"realtime distributed file system" is already quite a stretch. Also JVM is
perfectly fine with pause times below a few tens of ms worst-case (when using
properly tuned G1, CMS GC), which is lower than worst-case latency induced by
network + I/O.

As for using less memory - you don't allocate buffers for file data on the JVM
heap. You allocate them in native memory exactly as you'd do it in C++.
Therefore it is possible to create a JVM-based file system that handles
petabytes of data with just as little as 100 MB heap, used mostly for small
temporary objects.

Also, the code here is using mutexes a lot to synchronize threads and lock out
whole objects. Therefore I think these "realtime" claims are quite
exaggerated.

~~~
GauntletWizard
You're using the academic version of realtime, not the one that anybody cares
about. HDFS's biggest problem is, and has always been, that it's literally
impossible to tune it to give anything like reliable performance, mostly
because the nameserver is a single point of lag for the entire system. "Worst
case network and IO" latency is a huge stretch. Network performance is
predictably sub-ms if you're using a network designed for modern distributed
computing (A real stretch, I know, since almost all HDFS installations are on
old-school core-router-tree infrastructure.) The IO operations are incredibly
unpredictable, but for one client at a time. Having individual servers with 10-20ms
worst-case performance hiccoughs is nowhere near as bad for a system as all of
your clients hiccoughing for even 5ms at the same time.

~~~
pkolaczk
HDFS's biggest problem is its SPOF master-slave architecture, not the JVM or GC.
With a truly distributed shared-nothing system Java GC would not be a problem,
because servers can now run with no major GC for hours or days. So two servers
or clients doing GC at the same time is very unlikely. And even if some of
them do, the pauses from GC are much more predictable than the pauses from I/O,
which on a loaded system can take seconds, not milliseconds.

Also if GC was such a huge problem, exchanges or HFT companies wouldn't use
Java for their low latency stuff, and there definitely are companies which do.

~~~
otterley
> Also if GC was such a huge problem, exchanges or HFT companies wouldn't use
> Java for their low latency stuff, and there definitely are companies which
> do.

Can you name one?

~~~
pkolaczk
LMAX, New York Exchange.

~~~
otterley
Wow, that's neat. Thanks for the pointer!

------
NicoJuicy
What surprises me the most is when they change titles. 1 Chinese character
sometimes matches 1 English word.

Eg. lylei changed the title from "cs启动太慢" to "cs start is too slow " ( on
[https://github.com/baidu/bfs/issues/376](https://github.com/baidu/bfs/issues/376)
)

lylei changed the title from "其他SDK写策略" to "SDK writing strategies(fan-out
write for example)" (on
[https://github.com/baidu/bfs/issues/243](https://github.com/baidu/bfs/issues/243)
)

~~~
toxik
"cs启动太慢" means "cs start-up too slow," where 启动 is start-moving and likely a
verb-result construction, a pattern in Chinese that to my knowledge doesn't
exist in Germanic languages. The second one is more accurately translated to
"Other SDK writing strategies."

------
jeffbax
Not commenting on BFS so much as noting that it's really cool to see large
Chinese companies contributing to open source. Does anyone know of other large
projects outside of the main Android forks? Pardon the ignorance.

Also wonder if there will be larger skepticism toward integrating Chinese O/S
in regards to potential influence by the government (like the NSA has tried to
influence in the past)

------
marknadal
This looks extremely promising and good. I work on distributed systems, in
particular on databases (so one abstraction layer above file systems). This
looks like it would make for a really nice storage engine for
[https://github.com/amark/gun](https://github.com/amark/gun) . Also it is nice
to see non-English projects! Very exciting work.

------
vonnik
How is this better/different than HDFS? Is this simply an example of NIH?

------
khc
What I really want to know:

"Once your code has passed the code-review and merged, it will be run on
thousands of servers"

And the Chinese text below says tens of thousands of servers, which is it? :-)

~~~
muddyrivers
Considering Baidu's scale, it would be tens of thousands.

There are several other discrepancies in the doc between the Chinese version
and the English one. Some technical proofreading is needed.

~~~
kinkrtyavimoodh
In this case, I think it's fine. Chinese has a named number for 10000 (wàn/万),
so they used that. Since English doesn't, they used 'thousands'. In either
case, the idea is that the code would run on a large number of servers.

For instance, Hindi has special names for 100000 (lakh), 10M (crore/karod)
etc. so a similar translation to Hindi would use those even if it meant
introducing a factor of 10 in the literal interpretation.

------
HammadB
Has anyone found a good deep-dive on the architecture in english?

------
merb
I wonder why they chose to rewrite Raft and didn't use something with etcd
or another working raft solution.

~~~
chronid
You usually don't want to add another dependency you don't control to your
system, if you don't have to (in terms of time and resources).

It's another point of failure, more infrastructure you have to keep alive...

------
andrewclunn
Wasn't there already a file system named BFS? This might get confusing.

~~~
codezero
BFS was part of BeOS which is defunct. The creator went on to work at Apple.

~~~
andrewclunn
Is it not still used by Haiku?

------
faizshah
I wonder how it compares to quantcast's qfs, anyone know?

------
sshb
Does it support ipv6?

~~~
p1mrx
The underlying socket libraries might in theory, but they're using them
poorly. Example from nameserver_main.cc:

    std::string listen_addr = std::string("0.0.0.0") + server_addr.substr(server_addr.rfind(':'));
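
As a hedged illustration of the point above: the quoted line hardcodes the IPv4 wildcard, and if `server_addr` contains no `:` then `rfind` returns `npos` and `substr` throws `std::out_of_range`. A minimal sketch of a family-aware alternative follows. The helper name `MakeListenAddr` and the `use_ipv6` flag are hypothetical, and whether the RPC library BFS uses accepts the bracketed `[::]:port` form is an assumption that would need checking.

```cpp
#include <string>

// Hypothetical helper, not from the BFS source: builds a wildcard listen
// address from a "host:port" string. Returns "" when no port is present.
// Assumes IPv6 hosts are written in bracketed form, e.g. "[::1]:8828";
// an unbracketed IPv6 literal would be ambiguous here.
std::string MakeListenAddr(const std::string& server_addr, bool use_ipv6) {
    // The port is whatever follows the last ':'.
    const std::string::size_type pos = server_addr.rfind(':');
    if (pos == std::string::npos || pos + 1 == server_addr.size()) {
        return "";  // no port found; the caller must handle this
    }
    const std::string port = server_addr.substr(pos + 1);
    const std::string wildcard = use_ipv6 ? "[::]" : "0.0.0.0";
    return wildcard + ":" + port;
}
```

With this sketch, `MakeListenAddr("127.0.0.1:8828", false)` yields `0.0.0.0:8828` and `MakeListenAddr("[::1]:8828", true)` yields `[::]:8828`.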

------
andeb
Seems good! But I have tried to uninstall it and I can't...

------
qwertyuiop924
Anybody know how this differs from AFS?

~~~
knorker
I think AFS is still only replicated for read-only, not for read-write.

~~~
qwertyuiop924
No, it's replicated read-write, but according to Wikipedia, file locks are
only machine-wide, so write collisions are easy to create.

~~~
knorker
Are you sure? From that same Wikipedia article:

"AFS volumes can be replicated to read-only cloned copies."

~~~
qwertyuiop924
They _can_ be replicated to read-only copies, but AFS also supports multiple
machine writes.

------
fsiefken
How does this compare to IPFS?

------
haosdent
Nice work!

------
taotaowill
awesome

~~~
techolic
Care to explain the awesomeness you found?

~~~
stomato
I see many positive comments that are not downvoted, so when you downvote
someone saying "awesome", I suspect it is because you disagree, not because it
was a low value post, which would be the reason why you would downvote. Also,
your response was "explain why"; again, I don't see people usually questioning
each acclamation.

~~~
techolic
For the record I didn't downvote GP, at the time I asked there were only two
comments and I was in the mood for learning as this isn't my area.

------
Dowwie
Released right on the heels of the IPFS announcement?

~~~
daenney
"The" IPFS announcement? IPFS itself has been around for quite a while. Could
you be more specific to which announcement you're referring, and how that
relates to BFS?

------
asitdhal
Why is there no English documentation ?

~~~
hardwaresofton
Why is there rarely any non-English-language documentation for most codebases
nowadays? The world doesn't revolve around English-speaking countries.

~~~
thegeomaster
Because English is the lingua franca of the software industry, and developers
are usually expected to know English, no matter where they're from.

~~~
hardwaresofton
The point of my comment was, that shouldn't necessarily be the case forever,
and it might not be reasonable to expect it to be.

~~~
thegeomaster
Sure, nothing ever stays the same, but I think that for the foreseeable
future, we can reasonably expect English to stay the universal language of
software development. It's the default foreign language people learn in their
home countries for a lot of reasons, so a lot of people who want to get into
the field already have at least a basic understanding of English. This aids
them in learning and communication with other developers, and it's just too
convenient to be displaced any time soon.

------
gnipgnip
I just want to note how comments complaining about Mandarin being the dev
language get downvoted (by sympathetic Europeans);

while at the same time so do those complaining about English's hegemony in
India (by furious Indians).

Strange is our world.

~~~
rwallace
Maybe both kinds of comments are being downvoted by those of us who like
technical conversations not to be full of people bitching about other people's
choice of language. I don't find that strange at all.

------
jstoja
Too bad that the documentation is so poor... Having PRs in Chinese is not
ideal either.

~~~
kzrdude
Do you even know how it is when not-your-own-language is the dominant one for
everything in computers?

~~~
nowayyeah
Sssshhh don't point out the hypocrisy.

~~~
ricardobeat
Is it hypocrisy? English is not my native language but I consider it the
default CS language. There must be a way for us to share knowledge, and that
turned out to be English.

~~~
gbog
Yes, but it is a problem. Unlike computer languages, human languages do not
only convey pure meanings, i.e. pure descriptions of relations between
entities. They embed a full baggage of culture, so even if it is convenient
and pragmatic to use English in CS as the main language, it is not neutral: it is
both an effect and a cause of the Anglo-Saxon cultural, economic and military
hegemony over the world.

------
sleepychu
[https://github.com/search?q=baidu&type=Everything&repo=&lang...](https://github.com/search?q=baidu&type=Everything&repo=&langOverride=&start_value=1)

I feel like we should just make up random strings when we name things...

------
gnipgnip
Can't speak much for the project, but have to admire them for sticking with
Mandarin.

India is at least 100-200 years away from something of this kind happening;
or more likely never at all.

Indeed, there is not a single research university worth its name that isn't
also essentially an export hub of brains to the 5-eyes (& Singapore).

English is crucial for India's system of feudal slavery to work.

~~~
erikb
I don't think they stick to Mandarin because they want to. English is
considered hip and intellectual in China, especially in the first tier cities
where I suppose the developers of such modern technology are. But the problem
is that English is really, really hard for Chinese native speakers, since
grammar, words, culture and pronunciation are so different from all the
Chinese languages.

So as a team leader or project manager in China I would probably also stick
with Chinese since it is much easier to find really good and not too expensive
employees that way. Let them try to use English, support the ambition, but
don't enforce it.

And I don't know much about India, but from what I've heard it is more a
cultural issue that India lags behind. Everything is (so I heard) still very
traditional and backward-focused, while China as a country spent 20-30 years
to become more open for new ideas and approaches. How true is that from other
people's perspective here?

~~~
gnipgnip
There are deep social divisions in India, considering its tortured (and
propagandized) history.

[http://sankrant.org/2011/03/the-english-class-
system-2/](http://sankrant.org/2011/03/the-english-class-system-2/)

There are systematic faults, which prevent much change, if the current
policies are kept up (note: India's literacy rate is ~78 %, since literacy
(except in English) brings no great advantage).

[http://www.nytimes.com/2015/03/22/opinion/sunday/how-
english...](http://www.nytimes.com/2015/03/22/opinion/sunday/how-english-
ruined-indian-literature.html?_r=0)

[http://www.forbes.com/sites/realspin/2014/11/06/the-
problem-...](http://www.forbes.com/sites/realspin/2014/11/06/the-problem-with-
the-english-language-in-
india/&refURL=https://duckduckgo.com/&referrer=https://duckduckgo.com/)

Imagine China, with only the expensive class of engineers, for instance. Or
at least, one with this class, and another class that was educated in English
but barely knows the language, let alone possesses any usable skill.

There are now villages, driven by this economics, where rural-children are
being taught in English. Considering how bad the Japanese/Chinese are with
English, it shouldn't be hard to interpolate how disabling this is when
everything is being taught in a foreign language.

------
GoToRO
[https://github.com/baidu/bfs/blob/master/src/client/bfs_clie...](https://github.com/baidu/bfs/blob/master/src/client/bfs_client.cc)

    std::string pad;
    if (path[path.size() - 1] != '/') {
        pad = "/";
    }

Else?
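
Reading only the quoted fragment, the missing `else` looks harmless, since `pad` is default-constructed to the empty string. The riskier case is an empty `path`: `path.size() - 1` wraps around and the index is out of range. A minimal sketch of a guarded version follows. The helper name `WithTrailingSlash` is hypothetical and not from the BFS source, and treating an empty path as "/" is an assumption about the intended behavior.

```cpp
#include <string>

// Hypothetical helper, not from the BFS source: returns `path` guaranteed
// to end with '/'. An empty path becomes "/" (an assumed policy).
std::string WithTrailingSlash(const std::string& path) {
    // Check empty() first so back() is never called on an empty string.
    if (path.empty() || path.back() != '/') {
        return path + "/";
    }
    return path;
}
```

For example, `WithTrailingSlash("/home/bfs")` gives `/home/bfs/`, while `WithTrailingSlash("/home/bfs/")` is returned unchanged.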

~~~
yvxiang
It's an interface that's no longer in use. Would you like to write an issue to
us or make a pr to fix it? :)

~~~
GoToRO
:) send me your billing details.

~~~
kjs3
You took time to find it and complain, but want to bill to fix? Classy.

