sigh The article makes it sound like these sites are doing everything themselves including pushing the bits.
Maybe some are, but I can say from personal experience that most of your traffic, if you're smart, comes out of a CDN. The sites themselves are definitely not that interactive, which makes them simpler to publish. The pages are almost all cached, and that doesn't take much horsepower to serve up. The big video sites have ratings and comments, but they are not that big of a deal. People go to porn sites to watch porn, not to interact. Customer analytics have shown that over and over.
I know of virtually no porn company that handles their own transactions, either. They all go through billing companies that handle things like PCI compliance for them.
Most sites also use a system like NATS to do their affiliate management. You need one that the affiliates trust isn't shaving sign-ups from their account. They tend to trust NATS.
For the data on the backend, you either have a SAN to manage it or you manage it on a few servers with lots of disks; if you are really at the 100TB mark, then you get a SAN, I would think. That's what we did. Sure, it's a lot of space, but they're big files, so managing them isn't that hard.
I'd say the largest issue that a company like YouPorn will have is the amount of data in their working set for a CDN. CDNs generally charge you for the size of your working set that they keep at each POP in their network so you want to keep it as small as possible.
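That charging model works out to simple arithmetic. A back-of-the-envelope sketch in Python; the rate, POP count, and working-set size below are made up for illustration, since real CDN contracts vary wildly:

```python
# Hypothetical CDN pricing: you pay to keep your working set
# replicated at every POP, so shrinking the set cuts cost linearly.

def monthly_storage_cost(working_set_gb: float, pops: int,
                         rate_per_gb_month: float) -> float:
    """Cost of keeping the working set resident at every POP."""
    return working_set_gb * pops * rate_per_gb_month

# e.g. a 2 TB working set at 30 POPs at an assumed $0.05/GB-month:
print(monthly_storage_cost(2000, 30, 0.05))  # 3000.0
```

Halve the working set and that line item halves with it, which is why keeping it small matters.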
At the end of the day running a large porn network is more about integrating the myriad of partners you need to run the network. The infrastructure is interesting for a while but once you have it working the business of doing deals and handling promotions and figuring out why integration point A isn't working like it should is what keeps you busy.
It's really quite easy to serve large volumes of porn. The dataset doesn't change often and 90% of your working set is the first couple pages of content. Back in 2007 when I was in the business, there was only one CDN that would touch porn (LimeLight) and they were absurdly expensive. Today there are hundreds of porn-friendly CDNs and they charge 1/10th the price (no exaggeration).
Storing a couple hundred terabytes of porn is not expensive or complicated.
Sites like YouPorn don't authenticate their content. Most of the high-volume web page content is static. Even then, you're looking at just a few page views before the user spends 5 minutes watching a video that streams from the CDN.
Payment transactions are handled by third parties, and usually abstracted through third-party affiliate software like NATS. Which, BTW, is a piece of junk and the one part of our system which we had trouble scaling.
Big bandwidth numbers sound impressive but the truth is running even a mildly successful social network with heavily personalized pages is ten times harder than running even the largest porn sites.
Serious misconception: it's just a couple of boxes and two dudes, nothing more. It runs itself. CDN FTW!
And the only thing a CDN will help you with in this case, is offloading CSS, images and JS. You can't put that much streaming content up unless you host it yourself or want to spend every penny you have.
This is utter nonsense.
It is the nature of YouPorn's UX that the vast majority of requests are for the first couple pages of data. You don't have to put all the content on the CDN, only the part that represents 80-90% of your traffic. If you have a pull-based CDN you don't even need to plan it; the CDN automatically populates itself with what it considers a reasonable working set.
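A pull-based CDN edge is, in effect, the cache below: misses go to origin, the result is kept, and the least recently used object gets evicted when space runs out, so a skewed access pattern automatically keeps the hot first couple of pages resident. A minimal sketch (a real edge also honors TTLs, Vary headers, and so on):

```python
from collections import OrderedDict

class PullCache:
    """Toy model of a pull-based CDN edge: populate on miss, evict LRU."""

    def __init__(self, capacity, origin_fetch):
        self.capacity = capacity
        self.origin_fetch = origin_fetch  # called only on a cache miss
        self.store = OrderedDict()

    def get(self, url):
        if url in self.store:
            self.store.move_to_end(url)      # mark as recently used
            return self.store[url]
        body = self.origin_fetch(url)        # miss: pull from origin
        self.store[url] = body
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)   # evict least recently used
        return body
```

With 80-90% of requests hitting a handful of URLs, almost everything is served without ever touching origin.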
Updated: I should add, I designed Kink.com's modern porn-serving architecture back in 2007. Prior, it ran off of 20 apache httpd boxes at 365 Main. Now it runs off of a handful of appservers, a couple MySQL boxes, and a lot of CDN capacity... on vastly more traffic. Believe me when I say there's no reason that the bulk of YouPorn's traffic couldn't be served off of one or more CDNs.
This is really in reply to powertower's statement (above or below since I couldn't reply directly to the comment) that they aren't using a CDN for their content. Here's the source domain for the content on one of their videos:
This domain resolves out to:
Which is hosted by a CDN company called SwiftWill.
Besides, the article you referenced says that they are using nginx to serve static content such as CSS, JS, etc.
According to the info, that's all YouPorn uses CDNs for (page assets minus video). That might, or might not, have changed recently.
I'd imagine that paying extra to shave a few tens of milliseconds off latency might not really be much of a benefit in this type of business; they are not doing VoIP phone calls. I'd imagine having fat pipes on a decent tier is #1 here.
The point of using a CDN (in this case) is not to reduce latency. The problem is that it becomes exponentially more expensive to serve high data rates out of a single data center. Basic infrastructure like switches and load balancers starts to get crazy expensive, as do the support contracts. Also, it requires a lot of fairly rare (and highly paid) expertise to set it up.
Distributed CDNs are like the RAID of content serving. Each node can be simpler, cheaper.
Another bonus of using CDNs is that you're in a great negotiating position. If you're serving 80% of traffic through one and 20% through another, you can flip it around the moment one offers to shave a percent or two off the price. I've had people in the sales department of the formerly-80% side notice the traffic drop and suddenly call up with counteroffers. In contrast, getting someone to draw fiber cables across the datacenter usually requires a lot of one-time expense and long-term contracts.
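That 80/20 split is easy to implement: make a weighted choice each time you generate a media URL, so flipping the ratio is a one-line change. A sketch with hypothetical hostnames:

```python
import random

# Hypothetical CDN hostnames with their traffic weights.
CDN_WEIGHTS = [("cdn-a.example.com", 80), ("cdn-b.example.com", 20)]

def pick_cdn(rng=random):
    """Pick a CDN host for one request, proportionally to the weights."""
    hosts, weights = zip(*CDN_WEIGHTS)
    return rng.choices(hosts, weights=weights, k=1)[0]
```

Over many requests, roughly 80% of the generated URLs point at the first host; swap the weights and the traffic (and the sales calls) follow.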
I'd be really curious what kind of CDN deal they're getting.
At regular CDN rates you're looking at ballpark $150k/month for that kind of traffic (rather optimistic extrapolation from my own rates...).
Also, the figures remain mind-boggling regardless of how you slice them. 900TB/day breaks down to a healthy ~80 Gbit/s average. That's more than most mid-sized datacenter uplinks (plus conveniently ignoring any bell curves they may have).
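A sanity check on that conversion (decimal units, assuming a flat 24h curve):

```python
def tb_per_day_to_gbps(tb_per_day: float) -> float:
    """Average Gbit/s for a given TB/day transfer volume."""
    bits_per_day = tb_per_day * 1e12 * 8   # TB -> bytes -> bits
    return bits_per_day / 86_400 / 1e9     # per second, in Gbit

print(round(tb_per_day_to_gbps(900), 1))  # 83.3
```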
Yup, that seems more realistic (my estimate was too optimistic, then). Works out to around 2.2ct/GB. Personally I haven't seen a CDN quote below 10ct/GB, but we also measure our traffic in TB/month, not TB/day.
My guess is that this one came from a post a while ago where someone at Youporn wrote about how they used Redis. Obviously not for the videos - the article writer clearly didn't read that part very thoroughly, or didn't understand it.
I think you are. That just means that you hit MySQL but doesn't necessarily imply that the data itself are served from MySQL. Filesystems are just fine for this task and as was already mentioned, most of the data are in the CDN anyway.
What's really interesting to me personally is how porn continues to stay ahead of, or at least at/near the front of, the pack technology/performance-wise. Back in the late 1990s and early 2000s I co-ran the technology department at a very large network of high-traffic adult web sites (I'm not sure exactly where we would have been in the rankings, but I'd take a wild guess to say it was top 20, if not top 10.) We were doing streaming video (in Real, QT, and WM) at a time when it was still images as the default. Reading comments from SystemOut and stickfigure reminded me of just how (obviously) primitive everything seems compared to today, but we still made it work. Some broad notes from the period:
- Started with single-processor Sun SPARCs, which were later replaced by dual- and quad-processor ones (went from 32 to 64 bit early due to file size limitations), along with a collection of Linux boxes from Penguin Computing (remember them?). Most were in the mid-hundreds MHz range, topping out at a blazing 1GHz by the end.
- Apache, mod_perl, MySQL (postgres for one system), later replaced some of the front end code with PHP.
- No CDNs! Akamai was more or less the only game in town and was still unproven/considered too expensive at the time, so we did traditional multiple-host setups (things like image1, image2, along with RRDNS for some other bits).
- No really good, well-integrated turnkey billing systems. The ones at the time often took too large a chunk of the revenue, or were designed for low volume and were very inflexible. Custom billing code talked directly to charge processors (we spoke a custom protocol right over UDP to ours; we had a dedicated line to the processor too, IIRC). Every time a transaction was processed, you got to hear a classic modem-like noise. The hardware on our side was connected to a text terminal (monochrome, orange text).
- In-browser video started out using NPH tricks(!), and later used a custom Java applet. Most, however, was served directly to separate client applications. In the days before the YouTubes and Vimeos came along, you had to, yes, have your customers download third-party software and then provide support for it.
- RAID 1 under Linux at the time had some ugly bugs which would partially corrupt one of the mirrors, requiring weekly manual rebuilds. I had a script monitoring for corruption which would send an email to this crazy old device called a "pager." The corruption always seemed to occur 15 minutes after I fell asleep, too.
Anyhow, interesting to see just how far things have come. Impressive numbers.
This and the article about YouPorn's stack make me really want to go work for these places. I'm sure that the day to day challenges would be fascinating and it would be a thrilling technical experience.
I worked in porn running sites like these (not at that scale, but hundreds of thousands of daily visitors) for about 10 years... Sure, traffic is easy to come by, but making money is a lot harder. You have to sell memberships to sites with premium content, and you get either revenue sharing or a flat fee.
Unless you're targeting working for a major company (Brazzers, which owns PornHub, Tube8, KeezMovies, ExtremeTube, etc.) you've gotta have some serious skills; otherwise you have to create your own site... Not many sites in the realm of porn actually have full companies behind them; most of them are just your average joe running a site by himself.
Not to mention most porn website owners are cheap... Doing any type of work in that realm is a pain.
Believe me... The idea of being paid to watch porn wears off after the first 3 months... After that it screws with you...
One of the big things is, normally you see a hot girl and you're wondering how she's in bed... After working in porn for a year plus, you're wondering how much you could make off selling her on your site.
That doesn't solve the problem at all; it just moves it from the generate-HTML request to a follow-on get-the-content-appropriate-to-this-user AJAX request. Generating the information for each page is complicated, especially at Kink, where your ability to interact with each "shoot" is determined by subscription rights, admin rights, microcurrency purchases, and who knows what else these days (I haven't seen the codebase in over four years).
Hibernate's clustered 2nd-level cache is still pretty magical. It means that the vast majority of web page requests are serviced out of RAM with zero database hits - without writing any special caching code. And it's transactional. For a certain set of scaling problems, this feature is golden.
I've worked for a gaming site under the same constraints (probably stricter because real money was involved with nearly every transaction, no caching allowed).
Hibernate caching can surely be helpful, but you make it sound like a silver bullet and like there are no other approaches. There are plenty without tying yourself to J2EE hell. A little bloom filtering with memcached or redis can work wonders and might be more predictable than an opaque caching layer that can make you very unhappy once your working set exceeds a "magical" threshold (been there, with Hibernate).
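For the curious, the "bloom filtering" trick sketched in Python: the filter cheaply answers "definitely not there," so most guaranteed misses never hit memcached/redis at all. The bit-array size and hash count below are illustrative, not tuned:

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter to sit in front of a cache or database."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Derive independent bit positions by salting the key.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means definitely absent; True means "maybe, go check".
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))
```

The cost is a small false-positive rate (a "maybe" that turns out to be a miss), tunable via the filter size and hash count.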
Oh, I don't claim it is a silver bullet. It can be frustrating as hell at times. And of course you can build your own distributed, transactional caching layer.
My point is that once you end up with a certain level of sophistication and scaling, you create your own hell. Hibernate is fairly refined technology for dealing with this exact situation. Homebrewing your solution is like wandering around in the desert - if you're smart, you'll make it out, if not, you'll be the next Friendster.
And FWIW, Java EE is not hell if you avoid the overengineered pieces like JSF.
Pron doesn't have that much of a negative stigma attached to it when it comes to the tech industry. At least none that I've experienced. I do admit that I use the holding company's name on my resume/LinkedIn profile, but I always disclose pretty early on what it was, just to make sure. If they have a problem with it, I don't want to work for them anyway.
Also, you try to stop saying the word "hard" after working in the adult business... everyone snickers when they know where you've worked. ;-)
> While it obviously varies from site to site, most adult sites will probably store in the region of 50 to 200 terabytes of porn. This is quite a lot for a website (only something like Google, Facebook, Blogger, or YouTube would store more data),
Netflix, Hulu, Apple, Flickr, Dropbox, Steam...
I find it disappointing that this list (and the one about bandwidth saying only YouTube or Hulu comes close to Xvideos) is incomplete but isn't really presented as such.
Does anyone have real-world experience monetizing porn sites on such a large scale? I am not familiar with the business aspects of it. Is it driven through affiliates? Direct advertising? Something else?
It depends on the kind of business you are. If you are actually a producer of porn (Bang Bros, Naughty America, Kink, etc.) you sell subscriptions or "points" packages for pay-per-view/live streaming. To get those subscriptions you use just about every method out there, and keep trying new ones, since the way traffic is driven to porn sites changes every year. It's a mix of affiliate traffic under different programs like revenue sharing, pay-per-signup, etc., and then you throw in a mix of other direct ad programs like AdWords.
One of the difficult parts of being in the porn business, aside from the difficulty in getting a bank that will actually process your transactions, is that the way traffic is driven to your business changes annually. The affiliates bear the brunt of that, but you pay them about 50% of the revenue they bring in, so it is costly. And you're always trying to find ways to bring in traffic yourself, since you don't like giving up 50% of your revenue to an affiliate.
>To put that 800Gbps figure into perspective, the internet only handles around half an exabyte of traffic every day, which equates to around 50Tbps — in other words, a single porn site accounts for almost 25% of the internet’s total traffic.
That should be more like 1.6% if those numbers are correct...
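The arithmetic, for anyone checking, using the article's own numbers:

```python
site_gbps = 800                               # claimed per-site traffic
internet_tbps = 50                            # article's figure for total traffic
share = site_gbps / (internet_tbps * 1000)    # convert Tbps -> Gbps
print(f"{share:.1%}")                         # 1.6%, not 25%
```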
As someone who used to work over there (Pornhub, Youporn, ...), it is not a question of legacy.
If you use PHP the way it's meant to be used, you are not gonna have any surprises, and it'll run faster than the alternatives (or close to it), for lower development time, as well as ease of finding developers.
Also, the article is a bit off on some points. A website like Pornhub (100 million+ pageviews/day) is on the most standard stack you could imagine: PHP, Apache, MySQL, Memcached/Redis. Varnish gets mentioned a lot, but when I was working there (not so long ago) it was not in use, and as far as I know YouPorn might be the only one relying on it right now.
If you know what you are doing with PHP, you will have no surprises, no performance issues, and maintenance will be trivial. But sadly I have to admit few PHP developers actually use PHP the way it should be used.
PHP actually seems like a good balance in terms of server support vs. (what I can guess of) the application requirements. It's brain-dead simple to run and if you weren't precomputing everything a page needs for response-time reasons the language would push you towards doing so anyways.
I wouldn't be surprised if they're using PHP for what it was originally meant to do (add a thin layer of dynamic-ness to straight html) and precomputing all of the data it uses in something else.
PHP is not particularly known for being fast, and no, you wouldn't expect PHP to move petabytes of data. The normal approach for moving data with PHP is to readfile() the whole file into a byte string in memory before echo()ing it to the user; doing something chunked, incremental, and seekable is probably just as much difficulty in PHP as in Perl. It also isn't legacy -- as they say, they switched from Perl to PHP. (Although, it might be. They may have switched to PHP just for the MySQL library functions, or perhaps they wanted to switch to nginx and couldn't get it at the time to run exactly how they wanted with Perl -- either way, it could be the case that now they're staying with it for legacy reasons.)
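What "chunked, incremental, and seekable" means in outline, independent of the PHP-vs-Perl question (sketched in Python for brevity): read and yield fixed-size pieces instead of loading the whole file, with a seek to support Range requests.

```python
import io

def stream_file(fileobj, start=0, chunk_size=64 * 1024):
    """Yield a binary file in fixed-size chunks; `start` gives Range-style seeking."""
    fileobj.seek(start)
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk

# Demo on an in-memory 200 kB "video":
video = io.BytesIO(b"x" * 200_000)
chunks = list(stream_file(video))
print(len(chunks))  # 4: three full 64 KiB reads plus the remainder
```

Memory use stays at one chunk regardless of file size, which is the whole point when the files are multi-gigabyte videos.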
I can think of some special cases where PHP would be better, especially in a porn site's case -- the most common clicks are front-page links and there are probably a bunch of common keywords and clicks to links off the first page of those searches, which means that caching whole pages is probably economical. As far as I know, both Perl and PHP are identically suited to talking with upstream caching proxies, but PHP might have felt more natural for day-to-day feature development.
Most of the html content will be pushed out by Varnish. PHP just generates the most popular pages once before Varnish takes over. As for pushing out videos, I doubt they're using PHP readfile(). They're probably serving it out of a CDN.
It was recently (2011) rewritten on top of Symfony2 (http://symfony.com), a modern, documented, and stable framework. It was likely done because finding quality PHP programmers is easier than finding quality Perl programmers.
Like stated before, most adult website owners are average joes, and PHP being easy to learn with a low entry barrier, it was the logical choice for getting a site up and running fast. Also, many of the tools written for the adult industry were done in PHP. Just like ICQ, it won't ever be replaced as the standard for the industry.
From what I recall, it's essentially how people do business. Everything from technical support with adult-oriented hosting companies to making deals to sell sites/traffic/etc., or talking to sponsors about promotions. Plus a great deal of just general BS-ing. You have to remember that outside of maybe a dozen large companies, most adult sites are run by 1 or 2 guys, so it's their form of talking at the water cooler.
I heard it offhand from an acquaintance who did a Google internship that Google has to downrank porn sites by several orders of magnitude; otherwise, all that would ever come up in Google searches would be porn...
I noticed this a lot with regional languages. For example, searching for the simple word 'bhabhi' (which means brother's wife) returns a whole lot of erotic stories instead of the meaning of the word or its usage, etc. It looks like Google does not downrank regional porn/erotica sites.
I love how the article ends with >The Internet really is for porn<
>It’s probably not unrealistic to say that porn makes up 30% of the total data transferred across the internet.< If this is the case, is the online porn industry held up as a model of high tech and innovation? I thought I heard somewhere that investors, and VCs in particular, shy away from porn...
Most new technologies were. Photography, film cameras, home video, etc. Porn was always a great help in getting them to the mainstream (years ago I read a good article about the history of porn in technology, but I can't find it now).
Now I'm really looking forward to the creative things to be done to Google's new "Glass" product.
On a non-tech note, I read on the internet (yes, aldaily; I don't have the link right now) that 1/3 of Casanova's autobiography was about his affairs with women. It came to mind when the article said that a third of the internet was dedicated to porn.
Conclusion: A third of our lives are dedicated to sex.