

Ask HN: Distributed File System - ErrantX

Hey peeps.

So I need some recommendations.

I've been building a distributed file system for work to store our hash
tables. These are 1 GB files (about 40TB worth of them) that are write once,
read many.

It needs to duplicate the data across servers and make the files available
via HTTP. Oh, and it needs to scale quite well, because from next month we
are potentially adding another TB per month.

So far I haven't been able to find a DFS that does all of the above, so I
have been working on my own. But I am nervous - the files _are_ mission
critical, though I'm not too worried about losing data per se (there are
alternative backup solutions that make sure we have multiple static copies
safe). I'm more worried about not being able to cope with the load. My
current implementation is in Python and simply uses a central MySQL server
to track file locations.

So: can anyone recommend a DFS option I have missed that fulfills my
requirements? Or, even better, can anyone offer technical ideas to help with
the development of our code?

:)
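
A rough sketch of that central-tracker lookup path, assuming a MySQL table
file_locations(file_hash, server_url) and the MySQLdb driver -- the table
and column names here are illustrative, not our actual schema:

    import random
    import MySQLdb

    db = MySQLdb.connect(host="tracker.internal", user="dfs",
                         passwd="secret", db="dfs")

    def locate(file_hash):
        """Return the HTTP URLs of every replica holding this file."""
        cur = db.cursor()
        cur.execute("SELECT server_url FROM file_locations"
                    " WHERE file_hash = %s", (file_hash,))
        return [row[0] for row in cur.fetchall()]

    def pick_replica(file_hash):
        """Pick one replica at random for a GET, spreading the read load."""
        urls = locate(file_hash)
        return random.choice(urls) if urls else None

The GET itself then goes straight to the chosen storage server over HTTP;
the tracker only answers "where does this file live?".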
======
comice
GlusterFS is a FUSE-based distributed filesystem. You can mirror the files
across as many servers as you like and aggregate storage from different
servers. It has no central metadata server either.

It doesn't serve via HTTP directly, but it's easy to point a web server at the
filesystem (it has an Apache module to provide direct access without going
through FUSE if you want higher performance).

<http://www.gluster.org>

~~~
bjclark
I highly recommend _against_ GlusterFS. We used to use it at my current
employer and had nothing but trouble serving large files from it. It was fine
until we started putting large files in it; then healing would randomly bring
the whole server to its knees.

------
kierank
MogileFS is what you're looking for. Native HTTP support with a MySQL tracker.

------
cperciva
I'm not sure I understand what the problem is. You say that you've got
approximately 40000 files; a database which keeps track of where those files
are stored is a trivially small problem.

~~~
ErrantX
It needs to scale robustly to support at least a million files (with 3
duplications - so 4 million entries minimum) and accept really high load
(min. 1,000 simultaneous requests, ideally 5,000). The main concern is that
MySQL is a single point of failure.

~~~
cperciva
The load you're talking about is insignificant -- even 4M entries is going to
fit into RAM, so you should be able to perform lookups at wire speed.

 _The main concern is that MySQL is a single point of failure._

Don't make it a single point of failure, then. Set up several boxes. Heck,
don't use MySQL -- your update rate is going to be low (even if you have
10Gbps of incoming traffic, that's only about one new 1GB file per second),
so you could run the file-location service over DNS if you wanted.
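
To make the DNS idea concrete -- a hedged sketch, assuming you publish one
TXT record per replica under a name like HASH.files.example.com (the zone
layout is made up) and use the dnspython library:

    import dns.resolver  # dnspython

    def locate_via_dns(file_hash, zone="files.example.com"):
        # Each replica publishes its hostname/URL as a TXT record.
        answers = dns.resolver.query("%s.%s" % (file_hash, zone), "TXT")
        return [rdata.to_text().strip('"') for rdata in answers]

You get caching, replication and failover from ordinary DNS infrastructure
for free; the trade-off is that updates propagate at TTL speed, which is
fine for write-once files.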

~~~
sho
Exactly what I was thinking. On a box with a decent amount of memory, MySQL's
query cache will easily handle 5k/sec or, probably, quite a bit more. From
fuzzy memory of a friend's test earlier this year I think he got about 30k/sec
over a socket connection and a little less over TCP; sounds about right to me.
And that's nothing but MySQL, default installation, on an iMac. Just make sure
you have plenty of memory, that's always the key.

SPOF concerns could be alleviated by having a standard master/slave failover
setup. You might have other needs you haven't mentioned but otherwise sounds
like MySQL can handle your needs for now, and keeping the setup as simple and
off-the-shelf as possible is always a good thing ..

~~~
ErrantX
Wicked, that makes me feel better about it. I was worried a lot about one
server being the focus of so much activity. I know it _should_ handle a lot
of traffic, but it's always best to sanity check thoughts like that (when a
crashed server ultimately = lost revenue :P)

~~~
sho
Well don't take anyone's word for it, make sure to test it yourself! But I am
fairly confident you'll be reassured by the results of those tests. Just make
sure you're not actually testing the language the driver is in ...
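
A rough sanity check along those lines: hammer the tracker with concurrent
lookups from several client threads and count queries per second. The
connection details and the file_locations table are assumptions carried over
from the sketch in the original post:

    import threading
    import time
    import MySQLdb

    N_THREADS = 50
    QUERIES_PER_THREAD = 1000

    def worker(results, idx):
        # One connection per thread, so we measure the server rather than
        # lock contention inside a shared client connection.
        db = MySQLdb.connect(host="tracker.internal", user="dfs",
                             passwd="secret", db="dfs")
        cur = db.cursor()
        start = time.time()
        for i in range(QUERIES_PER_THREAD):
            cur.execute("SELECT server_url FROM file_locations"
                        " WHERE file_hash = %s", ("hash%06d" % i,))
            cur.fetchall()
        results[idx] = QUERIES_PER_THREAD / (time.time() - start)

    results = [0.0] * N_THREADS
    threads = [threading.Thread(target=worker, args=(results, i))
               for i in range(N_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("~%d lookups/sec aggregate" % sum(results))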

Do share more details of the setup you end up with eventually, though, it's an
interesting topic and I for one will be curious to see what you came up with.

~~~
ErrantX
Yes I plan to.

Hopefully some part of our system will be open sourced :)

------
Tichy
Not sure if something like Cassandra would help?
<http://en.wikipedia.org/wiki/Cassandra_%28database%29>

I think there must be open source clones of the FS Google uses, but I don't
know the names.

~~~
ErrantX
A GFS clone was the angle I was looking at.

Cassandra looks pretty fun - you're suggesting that as the database, right?
I'm thinking a quick Python implementation for PUT (and maybe DELETE) and
meta operations, using Cassandra as a backend and lighttpd for GETs (high
performance), might work... cheers.
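
Something like this, perhaps -- a minimal sketch of that PUT/metadata path,
assuming the pycassa client and a column family called file_locations (the
keyspace and names are made up); lighttpd would then serve GETs straight off
the storage nodes:

    import pycassa

    pool = pycassa.ConnectionPool('dfs', ['cass1.internal:9160'])
    locations = pycassa.ColumnFamily(pool, 'file_locations')

    def register_file(file_hash, replica_urls):
        # One column per replica: replica0, replica1, ...
        locations.insert(file_hash, dict(
            ('replica%d' % i, url) for i, url in enumerate(replica_urls)))

    def lookup(file_hash):
        # Returns the URLs a front-end can redirect a GET to.
        return list(locations.get(file_hash).values())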

~~~
simonw
If you want a GFS clone, you /really/ want HDFS from the Hadoop project - I'm
reading the new O'Reilly book on Hadoop at the moment, which is excellent.

~~~
skorgu
No first-hand experience here, but my understanding is that HDFS and Hadoop
in general are geared towards high throughput and not necessarily low
latency. You'll likely have to put a caching layer in front of it if this is
still [1] the case. There is already some [2] HTTP accessibility for HDFS;
I'm again not 100% clear on its status, but if it works well enough, a nice
caching proxy (Varnish) with plenty of memory should help with latency on
heavily-accessed files.

[1] <http://www.nabble.com/Using-HDFS-to-serve-www-requests-td22725659.html>

[2] <https://issues.apache.org/jira/browse/HADOOP-5010>

------
simonw
Sounds like a job for the Hadoop HDFS.

~~~
tk999
Agreed. You need to look into Hadoop HDFS (a GFS clone...).

------
ajb
Hmm, sounds like tahoe may do what you want: <http://allmydata.org/trac/tahoe>

~~~
Top80sSons
Tahoe looks very promising, but I've been unable to install it any time I've
tried. Do you know of any environment where it works out of the box? I tried
Fedora 11, Ubuntu 9 and Windows XP myself.

~~~
zooko
Was this with our most recent release -- Tahoe v1.4.1? Could you tell me more
about what went wrong? Perhaps by opening a ticket on <http://allmydata.org>
so that the other people who are responsible for making it easy to install can
see the issue.

~~~
zooko
By the way, I should point out that our overall strategy of how to deal with
"packaging and installation" issues is to use automation. The creation of
packages such as Python .egg's, Debian .deb's, and Mac .dmg's is done
automatically by our buildbot, and we also have tests run automatically by
buildbots to see whether the resulting packages can be installed and run.

I'm sure that it isn't all there yet, and your bug reports would help, but you
might be interested in the overall approach -- applying the principles of test
driven development (if it isn't automatically tested by your buildbot, then it
isn't done) to packaging.

------
fh973
I suggest you also check out XtreemFS (<http://www.xtreemfs.org/>). It
supports striping/parallel IO and your throughput will scale with the number
of storage servers you deploy (we ran at GB/s, not Gbits, on a single file).
It gives you full file system semantics (you mount it via FUSE) and is quite
easy to set up.

~~~
bjko
XtreemFS can also replicate write-once files over the WAN in addition to
striping. There is a screencast at
<http://www.youtube.com/watch?v=0co_-_e0Hq4>

------
gamache
I am surprised that no one has mentioned Amazon S3. You will definitely plunk
down a few bucks to store 40TB there, but it's a ready-made solution to your
problem.

Edit: OK, now I see that S3 was suggested but vetoed. Unwise decision, in my
opinion. If you need data security, encrypt your data.
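
For what it's worth, the "encrypt your data" route doesn't have to be much
work. A hedged sketch, assuming the boto library and client-side encryption
by shelling out to gpg (bucket name, key layout and passphrase handling are
all placeholders, and the gpg flags may need adjusting for your version):

    import subprocess
    import boto

    conn = boto.connect_s3()     # picks up AWS credentials from env/config
    bucket = conn.get_bucket('my-hashtable-bucket')

    def put_encrypted(path, s3_key_name,
                      passphrase_file='/etc/dfs/passphrase'):
        enc_path = path + '.gpg'
        # Symmetric AES256 encryption before the bytes ever leave the box.
        subprocess.check_call(['gpg', '--batch', '--yes', '--symmetric',
                               '--cipher-algo', 'AES256',
                               '--passphrase-file', passphrase_file,
                               '--output', enc_path, path])
        key = bucket.new_key(s3_key_name)
        key.set_contents_from_filename(enc_path)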

~~~
ErrantX
Bandwidth costs would probably kill us, TBH. Ultimately we will have 20,000
clients connecting at least 3 times a week and uploading once a fortnight. It
adds up to about 500TB of new data a year.

S3 is great for us atm, but the work to encrypt it isn't worth it because
the _cost_ doesn't scale for us long term. Potentially within a couple of
years we would be spending half a million on bandwidth and half a million on
storage - and that will keep going up :)

We're following the Google model: commodity hardware on a cheap T1
connection :)

------
dizz
Ceph? <http://ceph.newdream.net/about/>

------
speek
Ever thought about getting some super-duper SAN toys?

~~~
ErrantX
Teensy bit expensive :)

We have 4000 clients atm but hope to increase that quickly (20,000 within a
year). That's TB's of data a month. SAN was considered but it too expensive to
scale :( EDIT: well, not over the top expensive. But commodity servers with
software is cheaper/better for us.

Also we at some point need a HTTP interface to the outside world.

~~~
moe
Your use-case doesn't sound like rocket surgery so I'd say you're on the right
track. Write _once_ , read many is trivial to scale. Just build it according
to your needs and throw in memcached hosts as needed.
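
Memcached in front of whatever tracker you end up with is only a few lines --
a sketch, assuming the python-memcached client; the locate() stub stands in
for your real tracker query:

    import memcache

    mc = memcache.Client(['cache1.internal:11211', 'cache2.internal:11211'])

    def locate(file_hash):
        # Stand-in for the real tracker lookup (MySQL, Cassandra, ...).
        return ['http://store1.internal/%s' % file_hash]

    def cached_locate(file_hash):
        urls = mc.get('loc:' + file_hash)
        if urls is None:
            urls = locate(file_hash)      # hit the tracker only on a miss
            mc.set('loc:' + file_hash, urls, time=3600)
        return urls

Write-once data never needs invalidating, so the hit rate should be
excellent.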

If your access pattern allows for good caching you could also skip all the
hassle of maintaining physical spindles and just move your stuff to S3.
Their storage fees are quite affordable; 100TB would set you back around
$15k a month.

The killer with S3 is in the traffic bill, though. So that's only an option
when most of your requests can be served from cache.

~~~
ErrantX
S3 would fit our needs _perfectly_. But the boss vetoed it :( security issues
(I know, I know, but the data is fairly high value). Sucks.

~~~
moe
Use encryption, perhaps?

------
gcv
I'm about to look into somewhat similar requirements for an app I'm consulting
on, and I plan to research GlusterFS first.

