

Ask HN: Content Serving Cluster - ajkirwin

I run a small website, which shall remain unnamed so as to avoid unnecessary threads. I'm pushing probably a few terabytes of data a month now, so to improve speed whilst keeping costs low, I plan to serve files from multiple machines. A mini-CDN, if you will.

The files will be the same across each distribution server, but I am not sure of the best way to replicate the uploaded files across machines.

Example: Someone uploads xyz.tar.gz to Machine #1. I need it replicated, as fast as possible, to Machines #2 and #3, so that when people visit the site, if they get http://cdn-3.mysite.com/, they'll get the file.

Does HN have any suggestions as to the best way to go about this?

Edit: My files aren't very large, they're maybe 5 GB in total and don't grow too fast; they're just accessed a lot.
======
jbyers
MogileFS will get you close to this, but stores files with its own ID and
directory scheme. If filenames don't matter, or if you have a gateway that's
handling the mogile lookup and filename translation, it's a great solution.

If you already have a reliable central filestore, varnish or squid might
accomplish faster distribution without having to replicate all your files.

Otherwise, I'm curious to see everyone's suggestions. I've looked at more
*sync programs than I can count to handle this use case and come up empty-
handed.

------
patio11
Honestly, I think you're likely to find that Amazon S3 is the best option. It
costs money, but assuming your business generates money (many businesses do)
it will probably be more reliable and cost less in expensive you-time than
anything else.

Otherwise: rsync is your friend. Run it in daemon mode. If you've just got a
handful of machines I'd nominate one machine as the server to receive all
uploads. Everybody else just syncs their upload directory to that machine's.

You may also want to consider offering bittorrent as an option, since this
situation appears to be tailor-made for it.

~~~
pmjordan
I don't see how rsync will work here: during the sync, the new file won't be
available (in its entirety) from the other machines. As I understand it, the
OP is looking for something that will sync while already serving up the file,
presumably giving priority to syncing the parts that are already being
requested by clients (which will be the beginning of the file most of the
time, unless clients use partial downloading).

I've never set one up myself, but a clustering file system might work, e.g.
OCFS: <http://oss.oracle.com/projects/ocfs2/>

------
cperciva
You say "someone uploads xyz.tar.gz to Machine #1"; does this mean that all
the bits are uploaded to the same machine, or will you have some files
uploaded to machine #1, some files uploaded to machine #2, et cetera?

If files are uploaded to multiple machines, is it possible that you'd get two
different files with the same name uploaded to different machines? If so, how
do you want to handle this?

Will you ever have files deleted?

Do you have any ordering requirements, e.g., files have to appear on each
machine in the same order as they were originally uploaded?

~~~
ajkirwin
Files will never be deleted and there are no ordering requirements; files just
need to exist across all machines. (Every machine must have a copy of every
file so that they can serve files in a round-robin fashion.)

------
bluelu
Why not write a custom 404 handler which, when triggered, fetches the file
from the origin server and, while the file hasn't yet been downloaded,
redirects the client to the master server as well?

This worked pretty well for a friend's site, and you don't have to care about
replication anymore.
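One way to sketch that idea as an nginx config fragment (the hostname and root path are placeholders; `proxy_store` is one option for keeping the fetched copy on local disk so the next request is served without the round trip):

```nginx
location / {
    root /var/www/files;
    # Serve the file locally if it exists; otherwise fall through
    # to the master instead of returning a 404.
    try_files $uri @master;
}

location @master {
    # Hypothetical origin host. proxy_store writes the fetched
    # response to the local docroot, so subsequent requests for the
    # same file are served directly from this edge server.
    proxy_pass http://master.mysite.com;
    proxy_store /var/www/files$uri;
}
```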

------
jawngee
Panther Express is cheap, cheaper than S3.

Let someone else worry about that issue. The amount of time you'll spend
setting it up, testing it, and maintaining it is time taken away from other
important aspects of your business/application. You are not going to be able
to do it any better or any cheaper.

~~~
fizx
Panther really is dirt cheap.

------
wheels
You could use a distributed file system like:

<http://en.wikipedia.org/wiki/Lustre_(file_system)>

A simpler option would be to do a little scripting magic to catch inotify
signals and then trigger an rsync.

------
olefoo
Your options (if you are going to do it yourself and not use S3 or cachefly):

1. rsync on the backend; it's easy, it's relatively fast, but it is
asynchronous and files won't be immediately available on all servers.

2. An inotify watcher that copies files to the slaves when a file is written
or changed on the master. Faster than an rsync solution, but you'll need to
write it yourself.

In either case you will want to seriously question whether you should do it
yourself; look at the costs of keeping machines operational and how much you
are paying for bandwidth.

------
Steve0
rsync is built for this: <http://samba.anu.edu.au/rsync/>

------
vlisivka
I recommend looking at GlusterFS. It is a fast, modular, layered clustering
file system. It is not tied to the Linux kernel because it is built on (a
patched) FUSE. With InfiniBand hardware, it is the fastest clustered file
system available for free. See
[http://www.gluster.org/docs/index.php/GlusterFS_1.3.pre2-VER...](http://www.gluster.org/docs/index.php/GlusterFS_1.3.pre2-VERGACION_vs_Lustre-1.4.9.1_Various_Benchmarks)
for example.

