

Ask HN: How do Flickr, Youtube and other high traffic websites handle file uploads? - wave

How do Flickr, YouTube, and other high-traffic websites handle file uploads? I am mostly interested in what they do on the server side.

I might be mistaken, but I read that PHP might not be great as a daemon for handling file uploads due to memory leaks. Is this true? Is Python or Java a better choice? How do you handle your file uploads?

I appreciate any help in pointing me to a better solution.
======
eax
One of the Flickr engineers, Cal Henderson, wrote a book with a title
something like "Building Scalable Web Sites" that was published by O'Reilly.
I'm pretty sure he covers that topic. You may be able to get access online via
your public library's web site (you can in Seattle, at least).

There are the obvious issues with file uploads: they can take a lot of
bandwidth and disk space. But there are also a lot of less obvious problems.

1\. File uploads take a lot longer than most web requests, both because of the
size of the data, and because most client connections download faster than
they upload.

2\. As a result, file upload requests hold server resources longer than other
requests. This usually comes down to memory, but there can also be file handle
and socket limits. Also, more in the past than in the present day, just the
CPU overhead from dealing with lots of open sockets could get to be an issue.

3\. File uploads often carry a lot of memory overhead. The brain-dead simple
way of handling file uploads in PHP, etc., ends up buffering the whole file
in memory until the upload is complete. That can really add up.
Furthermore, the process handling the upload request has the memory overhead
of the PHP (or Ruby, or Python...) interpreter, and any code and libraries
associated with your application. This overhead is carried even though most of
that code and those data structures are unnecessary for most of the request
duration.

This memory usage really stacks up when each upload request lives for
seconds, or minutes, rather than the milliseconds required for most requests.

There are lots of ways to deal with the resource issues. Writing the upload to
disk as it arrives is a big improvement. Another is running uploads through a
separate app/server instance that is tuned to minimize the size of each
app/interpreter process.

There are also ways to take advantage of file upload features built into a
front-end webserver (like nginx) to buffer the whole upload to disk before
your app has to get involved. Not to mention the Amazon examples mentioned
elsewhere in this thread.

Turning to a specialized custom file upload server written in Java or C seems
like an optimization to undertake only if you outgrow the other solutions
(including adding more memory per server, or more servers).
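To make the memory point concrete, here is a minimal sketch (Python, with hypothetical names; your framework's request object would supply the stream) of spooling an upload to disk in fixed-size chunks, so memory use stays flat no matter how large the file is:

```python
import os
import tempfile

CHUNK_SIZE = 64 * 1024  # read 64 KiB at a time instead of buffering the whole body

def spool_upload_to_disk(stream, dest_dir):
    """Copy an incoming upload stream to a temp file in fixed-size chunks.

    Memory use stays roughly CHUNK_SIZE regardless of the upload's size,
    which is the whole point versus reading the body into one big string.
    """
    fd, path = tempfile.mkstemp(dir=dest_dir, suffix=".upload")
    with os.fdopen(fd, "wb") as out:
        while True:
            chunk = stream.read(CHUNK_SIZE)
            if not chunk:          # empty read means end of stream
                break
            out.write(chunk)
    return path
```

In a WSGI app, `stream` would be something like `environ["wsgi.input"]`; the same chunked-copy idea applies in PHP, Ruby, or anything else.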


------
mlLK
Here are some interesting stats on how much data 4chan handles on any given
day. The upstream might not quite compare to the likes of Flickr or YouTube,
but from what I can tell, while software is certainly important, it is your
hardware that will make or break whether or not your site can handle such
volume.

 _4chan is currently powered by seven servers (five content, two database). We
are colocated on a full 500mbps Global Crossing connection, allowing us to
push over 5TB (5,000GB) of data per day_ [Image:
<http://content.4chan.org/img/traffic.feb5-12.png>]

------
stillmotion
I'm not entirely sure about them, but I use S3 and EC2 with SQS: the user
uploads the file, it waits in a queue and gets encoded, and the result is
placed back into storage. That way nothing ever touches my production
server.
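As a rough illustration of that pipeline, here is a local simulation in Python (a `dict` stands in for S3 and a `queue.Queue` stands in for SQS; in a real deployment those would be calls to the AWS APIs, and the "encoding" below is a placeholder):

```python
import queue
import threading

storage = {}          # stand-in for S3: key -> bytes
jobs = queue.Queue()  # stand-in for SQS

def upload(key, data):
    """Web tier: store the raw file and enqueue an encode job."""
    storage["raw/" + key] = data
    jobs.put(key)

def encode_worker():
    """Encoder box: pull jobs, 'encode', write results back to storage."""
    while True:
        key = jobs.get()
        if key is None:                       # sentinel: shut down
            break
        raw = storage["raw/" + key]
        storage["encoded/" + key] = raw.upper()  # placeholder for real encoding
        jobs.task_done()

worker = threading.Thread(target=encode_worker)
worker.start()
upload("cat.avi", b"frame data")
jobs.join()       # wait until the queue drains
jobs.put(None)    # tell the worker to exit
worker.join()
```

The production server only ever does the `upload` step; the worker can run on as many separate boxes as the queue depth justifies.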

~~~
bprater
Yep, I use this model too. It's great because multiple disparate servers can
be dealing with the media coming from/heading to S3. You don't get stuck with
media files on one box.

It's like a giant disconnected filesystem.

Oh, and use whatever language you are most comfortable using.

~~~
goodgoblin
I use Merb on EC2 with a redirect back to the app server to write the DB
reference. We are using a regular HTML-based file uploader and also have a
Flash uploader, which sometimes gives us fits.

------
ezmobius
Use this nginx module: <http://www.grid.net.ru/nginx/upload.en.html>

It is highly scalable: it spools the uploads to disk and does the MIME parsing
in efficient nginx C code. Once it finishes, it just passes some params to
your backend processes with the location of the file on disk, and you can
process it however you want.
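For reference, a configuration along these lines (directive names as I recall them from that module's documentation; double-check against the docs before relying on this) looks roughly like:

```nginx
location /upload {
    # hand control to the backend location once the upload is on disk
    upload_pass /after_upload;
    # spool incoming files here instead of buffering them in the app
    upload_store /var/spool/nginx_uploads;
    # tell the backend the original name and the on-disk path
    upload_set_form_field $upload_field_name.name "$upload_file_name";
    upload_set_form_field $upload_field_name.path "$upload_tmp_path";
    # forward selected ordinary form fields untouched
    upload_pass_form_field "^submit$";
}

location /after_upload {
    proxy_pass http://backend;
}
```

The backend never sees the file body, only small POST params pointing at the spooled file.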

~~~
bjclark
I use this too.

And, FWIW, I see some people suggesting Merb, which is cool, but even the guy
who wrote Merb (ezmobius) uses this now.

------
staunch
One really easy way that works extremely well is to use an old-school CGI
script to handle the upload. It dies as soon as the upload process is
finished, which keeps things very self-contained and clean.

The most important thing (as others have noted) is that you do the processing
asynchronously. Get the file onto the server, queue it (however simply), and
then process the uploads in batches as big as your machine(s) can handle
optimally. 99% of the time you're going to want ffmpeg for videos and
ImageMagick for images.
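A sketch of that batch step in Python (the `convert -resize` and `ffmpeg -i` invocations are basic real-world forms, but tune the flags for your formats; the helper names here are made up for illustration):

```python
import shutil
import subprocess

def thumbnail_cmd(src, dest, size="200x200"):
    """argv for an ImageMagick resize of src into dest."""
    return ["convert", src, "-resize", size, dest]

def transcode_cmd(src, dest):
    """argv for a basic ffmpeg transcode of src into dest."""
    return ["ffmpeg", "-i", src, dest]

def process_batch(commands):
    """Run queued conversion jobs one at a time, skipping any whose
    tool isn't installed on this box."""
    for cmd in commands:
        if shutil.which(cmd[0]):
            subprocess.run(cmd, check=True)
```

A cron job or worker daemon would drain the queue by calling `process_batch` with however many jobs the box can chew through at once.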

------
mdasen
So, what you want to do is stream the file to disk. The problem that most
people will face is that someone is uploading a 100MB video file and your code
is trying to hold it in memory. Bad! Get it on disk, then deal with it by
opening the file.

In terms of a daemon, you don't need one. PHP and other languages can execute
other processes. So, you want to convert that AVI to Flash? Save it to disk,
then execute another process to convert it.
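A small Python sketch of that hand-off (names are hypothetical): spawn the converter as a separate OS process and return to the request immediately, rather than converting inside the request handler.

```python
import subprocess
import sys

def convert_in_background(argv):
    """Spawn the converter as a separate OS process and return at once;
    the web request can finish while the conversion runs."""
    return subprocess.Popen(argv)

# Demo using the Python interpreter as a stand-in for the converter;
# a real app would pass something like ["ffmpeg", "-i", src, dest].
proc = convert_in_background([sys.executable, "-c", "print('converting...')"])
proc.wait()  # a real app would record the PID and poll later, not block here
```

PHP's `exec()`/`proc_open()` and Ruby's `spawn` give you the same pattern; the key is that the interpreter handling the HTTP request never holds the file in memory or waits on the conversion.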

------
bprater
Sniff around the Amazon S3/EC2 documentation and you'll find pipelines
demonstrating what you want to do, such as:

[http://developer.amazonwebservices.com/connect/entry.jspa?ex...](http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1602)

------
tzury
If you plan to deploy your client application as an HTTP application, then
there isn't much you can do other than let your web application handle it
(PHP/Python/Ruby, whatever).

Another option is writing the uploader as a stand-alone application (such as
the Flickr Uploadr, Facebook's iPhoto plug-in, and the like).

The third option is BAD but still exists on Facebook: a Java applet within the
browser.

The fourth option is to write the client with one (or more) of the browser
extenders such as Google Gears, MS Silverlight, or Adobe AIR/Flex/Flash
(look also at <http://www.jnext.org> and
<http://www.google.com/search?q=flash+uploader>).

For all three of these client options, the server side can be implemented in
whichever language you choose.

------
ars
You are mistaken; it's not true. If your site is PHP, let PHP handle the
uploads.

------
jdavid
So, for the most part I don't think I can answer in detail, but in public
situations hi5's response has been that they have a dedicated pool of servers
handling the uploads, and they store the files on a static server. Once that
server fills up, they provision a new one. I will call this the Viking image
upload process, as it's like a Viking burial. Each one of those images
is then buffered by a CDN.

I should point out that hi5 has more photo uploads per day than Flickr does.

------
dhotson
<http://code.flickr.com/svn/trunk/uploadr/>

