Does the webapp push jobs onto a queue and call a command-line tool like mencoder, or is there an industry-standard tool? How do you deal with concurrency (some kind of actor model)? And most importantly, do you have to tune the Linux kernel to achieve performance on this (I just saw the LinkedIn NUMA post as well, so I'm thinking about that)?
I'm sure YouTube and the like do it using the enviable Google infrastructure, but how does everyone else do it?
Right now there's a cron job that runs every few seconds, finds the next unprocessed file, and processes it via the command line. If/when volume gets really high, I'll probably have to do this somewhat differently to make sure it scales.
DistroKid uses MediaInfo to figure out what the user uploaded:
Then uses SoX for audio conversions:
And native Railo (the backend programming language I use) functions for image processing/resizing.
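For reference, a SoX conversion like the one implied above can be assembled as follows (the sample rate and bit depth are illustrative defaults, not DistroKid's actual settings):

```python
def sox_convert_cmd(src: str, dst: str, rate: int = 44100, bits: int = 16) -> list:
    """Build a sox command converting src to dst at the given rate/depth."""
    # -r sets the output sample rate and -b the output bit depth;
    # sox infers input/output formats from the file extensions.
    return ["sox", src, "-r", str(rate), "-b", str(bits), dst]


# Usage: subprocess.run(sox_convert_cmd("upload.flac", "master.wav"), check=True)
```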
Hah! Really cool. I wrote a very similar script to do that for huge video files, using tons of different case/switch branches wrapped around mplayer2.
Really awesome project, BTW.
Could you talk about why you chose Railo, which looks to be a fairly esoteric stack? Is it something you chose specifically for its media capabilities?
This isn't the type of service that would do massive Youtube-like volume, so I cannot imagine that they are doing anything special to handle high volumes of uploads. I would guess that a single lower-end AWS server would do an adequate job for the volume they'll be handling.
AWS makes a lot of the concurrency issues easy (and scaling). Basically you can use their SQS ("Simple Queue Service"), add tasks to it, and when the individual drones check out a song from the queue, it's no longer available for a set amount of time.
If the drone finishes the process completely, it removes it from the queue permanently, but if the drone fails, dies, whatever, after that timeout it gets bumped back into the queue for the next worker drone.
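The checkout/timeout behavior described above is SQS's visibility timeout. A small in-memory sketch of the same semantics (a toy model, not the actual SQS API) might look like:

```python
import time


class VisibilityQueue:
    """Toy queue mimicking SQS visibility-timeout semantics."""

    def __init__(self, timeout: float = 30.0):
        self.timeout = timeout
        self._tasks = {}      # task_id -> payload
        self._hidden = {}     # task_id -> time its checkout expires
        self._next_id = 0

    def put(self, payload):
        self._tasks[self._next_id] = payload
        self._next_id += 1

    def checkout(self, now=None):
        """Hand out a visible task; it stays hidden until the timeout elapses."""
        now = time.monotonic() if now is None else now
        for task_id, payload in self._tasks.items():
            if self._hidden.get(task_id, 0) <= now:
                self._hidden[task_id] = now + self.timeout
                return task_id, payload
        return None  # everything is checked out, or the queue is empty

    def done(self, task_id):
        """Drone finished: remove the task permanently."""
        self._tasks.pop(task_id, None)
        self._hidden.pop(task_id, None)
```

If a drone checks a task out and never calls `done()`, the task simply becomes visible again once the timeout passes, which is exactly the crash-recovery property described above.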
We use FFmpeg for conversion.
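An FFmpeg transcode for this kind of pipeline could be assembled like so (the H.264/AAC codec choice and quality settings are generic web-video defaults I've picked for illustration, not necessarily what this service uses):

```python
def ffmpeg_cmd(src: str, dst: str, crf: int = 23) -> list:
    """Build an H.264/AAC transcode command for a web-friendly MP4."""
    return [
        "ffmpeg", "-y",             # overwrite output without prompting
        "-i", src,
        "-c:v", "libx264",          # H.264 video
        "-crf", str(crf),           # constant-quality factor (lower = better)
        "-c:a", "aac",              # AAC audio
        "-movflags", "+faststart",  # moov atom up front, for streaming
        dst,
    ]


# Usage: subprocess.run(ffmpeg_cmd("upload.mov", "out.mp4"), check=True)
```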
Note: tracktrack.it, if you're curious about watermarking.
Plus a delayed_job worker process managed by Upstart, and tada: you've got yourself a video-encoding-and-streaming system.
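An Upstart job for such a worker might look roughly like this (the paths, user, and delayed_job invocation are my assumptions about a typical Rails deploy, not this poster's actual config):

```
# /etc/init/delayed-job.conf -- hypothetical paths and user
description "delayed_job encoding worker"

start on runlevel [2345]
stop on runlevel [016]

respawn

setuid deploy
chdir /var/www/app

exec bundle exec bin/delayed_job run
```

`respawn` gives you crash recovery for free, which pairs nicely with a queue that re-issues unfinished jobs.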
> And most importantly, do you have to tune the linux kernel to achieve performance on this (just saw the LinkedIn NUMA post as well, so thinking about that)
Ha, no. Definitely not until you have really significant scale. It's remarkably fast on a reasonable dedicated server and scales well across cores; run N/2 to N+1 delayed_job processes for N cores, depending on how well ffmpeg et al. make use of your cores. It's well faster than realtime.
And yeah, it's an actor-like model, except you don't really have a concurrency problem. You just need some kind of task queue that you add encode jobs to, and a bunch of worker machines that take tasks off the queue, run them, and report back. Almost every big system seems to end up looking like this.
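That queue-plus-workers shape is easy to prototype; here's a threaded sketch in Python (the worker count and job payloads are placeholders, and in the real setup each `handler` call would shell out to ffmpeg):

```python
import queue
import threading


def run_workers(jobs, handler, n_workers=4):
    """Drain `jobs` with a pool of worker threads; returns all results."""
    q = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = q.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            out = handler(job)  # e.g. shell out to ffmpeg here
            with lock:
                results.append(out)
            q.task_done()

    for job in jobs:
        q.put(job)
    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Swap the in-process `Queue` for SQS (or any broker) and the threads for machines, and you have the architecture described above.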