Here's how I think blogging should work:
1. Visit a web app where you create your blog post, add pictures, use your rich text editor, that sort of thing.
2. Click the "Publish" button, which generates static HTML and runs any other processing like tag generation or pngcrush.
3. Your static HTML gets pushed out to some server that does the hosting for you. It could even be one of those really cheap shared hosting providers.
If you really want comments, let someone like Disqus or Intense Debate handle it. Pretty much any dynamic feature you need can be outsourced.
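To make the idea concrete, here is a rough sketch (in PHP, purely illustrative; every file name and host below is made up) of what the "Publish" step could boil down to: render each post through a template, write static HTML, and push it to whatever cheap host you like.

    <?php
    // Hypothetical "Publish" step: render posts to static HTML, then push the
    // result to a plain static host. All paths and names here are examples.
    $posts    = json_decode(file_get_contents('posts.json'), true); // [{slug, title, html}, ...]
    $template = file_get_contents('template.html');                 // contains {{title}} and {{body}}
    @mkdir('out', 0755, true);

    foreach ($posts as $post) {
        $page = str_replace(
            ['{{title}}', '{{body}}'],
            [htmlspecialchars($post['title']), $post['html']],
            $template
        );
        file_put_contents("out/{$post['slug']}.html", $page);
    }

    // Ship the generated files to the (possibly very cheap) shared host.
    system('rsync -az out/ user@cheap-host:/var/www/blog/');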
Fun fact: that's essentially how Movable Type worked, and it was terrible. Most bloggers I know hated it.
WordPress is a bucket of poop too, but it has the virtue that when you press 'publish', the site is updated in seconds, not minutes.
(I believe MT has changed since then).
The main problem to solve is inter-page dependency. The rest is not that hard (famous last words).
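For what it's worth, a toy sketch of what "solving inter-page dependency" means in practice: record which generated pages depend on which pieces of content, and only regenerate the affected pages on an update. Everything below (keys, page names) is invented for illustration.

    <?php
    // Invented example: a dependency map from generated pages to the content
    // items they include, so an edit only rebuilds the pages that need it.
    $deps = [
        'index.html'      => ['post:1', 'post:2', 'post:3'], // front page lists recent posts
        'tags/linux.html' => ['post:1', 'post:3'],
        'posts/1.html'    => ['post:1'],
        'posts/2.html'    => ['post:2'],
    ];

    function pagesAffectedBy(array $deps, $changed) {
        $affected = [];
        foreach ($deps as $page => $needs) {
            if (in_array($changed, $needs, true)) {
                $affected[] = $page;
            }
        }
        return $affected;
    }

    // Editing post 1 should rebuild its own page, the front page and its tag page.
    foreach (pagesAffectedBy($deps, 'post:1') as $page) {
        echo "rebuild $page\n"; // a real generator would re-render and rewrite the file here
    }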
It's called a caching plugin, but it also compresses and minifies stuff. Any site can benefit from loading faster. You can turn it on and let it go. It has sane defaults.
I want you to print this out and stick it somewhere out of sight. When one day you are linked to on Reddit, Slashdot, a major news site or the like, pull it out for a hearty laugh.
Define "sufficient resources". It might be enough for your normal traffic, but without caching those blogs will crash from the /. effect (unexpected traffic spike).
Even with caching, I suspect Apache would fall over if I got slashdotted.
An easy solution like those plugins is such a big win for so little work that I just can't agree with "it's good enough for normal use". The last thing you want is for your blog to fail just when you're getting a traffic spike.
Around 2002-2003, I built a gigantic, sub-optimal, regex-heavy Perl script for automatic updating of webcomic sites (it's still in use on several sites I know of, ugh).
Each day of the archive had its own separate HTML page generated from a template.
I never got around to the complexity of supporting selective updates, because brute-force regenerating the entire archive every time the script ran, even with several thousand comics, took just seconds on anything but Windows (whose file I/O was just too bloody slow, though even there the run only took a minute or so).
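The original script was Perl, but the brute-force idea it used is simple enough to sketch in a few lines of PHP (file names and template placeholders below are made up).

    <?php
    // Sketch of the brute-force approach described above: regenerate every
    // archive page from a template on every run. Names are hypothetical.
    $template = file_get_contents('day.tmpl');   // contains {{image}}, {{prev}}, {{next}}
    $comics   = glob('comics/*.png');
    sort($comics);
    @mkdir('archive', 0755, true);

    foreach ($comics as $i => $img) {
        $prev = $i > 0                  ? sprintf('%04d.html', $i)     : 'index.html';
        $next = $i < count($comics) - 1 ? sprintf('%04d.html', $i + 2) : 'latest.html';
        $page = str_replace(['{{image}}', '{{prev}}', '{{next}}'], [$img, $prev, $next], $template);
        file_put_contents(sprintf('archive/%04d.html', $i + 1), $page);
    }
    // Even at several thousand comics this is just string substitution and
    // sequential writes, which is why it finished in seconds.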
That is an implementation issue. That problem is in no way inherent to static content.
Pre-gzipped pages are nice too. :)
The permission changes it asks for are because caching plugins need to create files on the server. Most servers are either poorly configured or make managing groups and users difficult, so temporarily asking the user to loosen permissions so the plugin can create its files improves the installation experience: it lowers the technical knowledge needed and the number of steps required.
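A rough illustration of why the permission request exists at all (the directory path and messages are just examples): the plugin has to be able to write cache files somewhere under the site, and on a badly configured host the PHP process often can't.

    <?php
    // Example of the check a caching plugin effectively has to make before it
    // can work. The path and the suggested fixes are illustrative only.
    $cacheDir = __DIR__ . '/wp-content/cache';

    if (!is_dir($cacheDir) && !@mkdir($cacheDir, 0755, true)) {
        exit("Please create $cacheDir and make it writable by the web server user.\n");
    }
    if (!is_writable($cacheDir)) {
        // This is the point where poorly configured hosts end up telling users
        // to loosen permissions so installation can proceed.
        exit("Please make $cacheDir writable (adjust group ownership or permissions) and retry.\n");
    }

    file_put_contents("$cacheDir/test.html", '<!-- a cached page would be written here -->');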
I don't recommend suPHP in practice, as it's quite slow. Similar if not better security can be achieved with a reverse proxy in front and PHP running in FastCGI mode on a backend web server.
If you are fine with your comments residing somewhere else, yes. I usually prefer to keep them with me and to have them indexable by search engines (that's free user-created content for me).
9 million hits/day with 120 megs RAM
Especially because of the scalability discussion around it. Only later in the article did it dawn on me that he is using a cheap cloud instance.
* Whenever anything is updated the entire cache is invalidated and every item has to be fetched again. This means some page loads will be slow and others fast, and if you have a very dynamic website you will hardly ever see a cached version at all.
* You can't cache things forever, primarily because when anything is updated the entire version namespace is invalidated. This means that even if a site isn't updated for a long time, the cache still expires via the TTL and has to be rebuilt. Of course, if you decide to cache forever and the version namespace is then incremented...
* You never know when your cache is full. Since the method of updating the cache isn't to invalidate keys but rather to fill it with new ones, you accumulate a lot of stale data. That data eventually has to be evicted from the cache, which means you can't reliably tell when you need to add cache memory.
All that said, version-namespacing your cache is better than not caching at all, and it's usually also better than keeping a lot of stale data as active keys. If you want to do proper cache invalidation for a highly dynamic site, it's still possible, but it takes a lot more work; there's a reason for the famous quote: http://martinfowler.com/bliki/TwoHardThings.html
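For readers who haven't seen the pattern, here is a minimal sketch of version-namespaced caching, assuming Memcached; the key names, TTL and the sidebar function are invented. This is the scheme whose trade-offs are described above.

    <?php
    // Minimal sketch of version namespacing: a single counter is baked into
    // every key, so bumping it "invalidates" the whole namespace at once.
    $mc = new Memcached();
    $mc->addServer('127.0.0.1', 11211);

    function versionedKey(Memcached $mc, $key) {
        $v = $mc->get('site_version');
        if ($v === false) {
            $v = 1;
            $mc->set('site_version', $v);
        }
        return "v{$v}:{$key}";
    }

    function buildSidebar() {
        return '<ul><li>Recent posts...</li></ul>'; // stand-in for a DB-backed render
    }

    // Read-through cache for a page fragment.
    $key  = versionedKey($mc, 'sidebar_html');
    $html = $mc->get($key);
    if ($html === false) {
        $html = buildSidebar();
        $mc->set($key, $html, 300); // a TTL is still needed, per the point above
    }

    // On *any* content update: bump the version. Old keys are orphaned, not
    // deleted -- they sit in memory until evicted, which is the "you never
    // know when your cache is full" problem.
    $mc->increment('site_version');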
Now, I agree with you that the shared hosting market will continue to move more towards constraining the user to a particular language within that shared hosting environment, as that allows the host to provide better service, but this has been true for a while. For the last 10+ years, if a shared-hosting provider allowed you to run an arbitrary binary, it only allowed you to do so as a second-class citizen, behind several layers of setuid wrappers and other (usually slow) security. Back in the day, if you wanted to run at reasonable speed on a shared host, you'd write in php, and use the supplied MySQL database, which conceptually isn't that different from what many of the language specific cloud providers do today.
The interesting thing is that this means shared hosting, usually with a well-supported but language-constrained environment, is actually becoming the premium hosting option. VPSs, formerly the high-end option, are now the cheap stuff. Which makes sense; as a provider, sure, the VPS costs me more RAM, but I can get an 8 GiB registered ECC DDR3 module for under a hundred bucks now. RAM is cheap. What costs me money is the support, and it is a lot easier to rent out unsupported VPSs than it is to rent out unsupported shared hosting.
If anything, with this new 'premium' image and advances in billing, I think we are seeing a renaissance in 'premium' shared hosting services.
Exactly this. I have been using shared hosting for around $100/yr, which was cheap 3 years ago when private servers were double or triple that. But recently I bought a VPS for $15/year. Sure, I have to do my own admin, but I'm fine with that. If the cost for the shared hosting doesn't come down, I'm going to have to move everything to my VPS.
That's an interesting counterpoint to this discussion; advanced chroot jail hosts are still cheaper than Xen hosts, and OpenVZ (the common advanced chroot jail software on Linux) sits between shared hosting and full virtualization with regard to how many resources are shared and how much host sysadmin effort is required.
With OpenVZ, all users share a kernel and share swap and page cache; you can fit more OpenVZ guests than Xen guests on a particular bit of hardware, and generally speaking, more host sysadmin involvement is required when running an OpenVZ host than a Xen or KVM host.
The interesting part of the Xen/OpenVZ dichotomy is that it goes the other direction; so far the market price for OpenVZ guests is much lower than the market price for Xen guests, and OpenVZ mostly occupies the very low end of the VPS market.
As far as I can tell (from the kernel name), it is OpenVZ. I only discovered that last night, coincidentally.
It is tiny: only 256 MB of RAM, bursting to 512 MB. But at that price it is fine for my needs; I even get my own static IP address (the full cost of the VPS is less than just adding a static IP address to my shared site!).
I was reading up on OpenVZ vs. Xen or KVM last night, and I take your comments above on board. The virtual server market seems to be in transition right now, with pricing yet to settle into a common range the way it had a few years ago.
Disk space is 10GB and traffic is 500GB.
I found the vendor on www.lowendbox.com.
I've thought quite a lot about this, in fact; for a while I was talking about starting a PaaS company that used the customer's equipment. As far as I can tell, the current model (where the PaaS company controls everything) is by far the easiest (thus, if sysadmin/programmer time is your constraint, the best) way to solve the problem.
1. Use caching.
2. Use Nginx.
3. Use PHP-FPM.
People bitch a lot about the memory consumption of apache, but it's often overstated. They add up the resident set for each individual apache process and come up with a huge number, missing the fact that a significant amount is actually shared between processes.
Apache's memory consumption problems have more to do with how it handles clients. Each client ties up a worker process for the entire duration of a request, or beyond if keepalives are enabled (as they should be). That means that the memory overhead of the PHP interpreter is locked up doing nothing after the page is generated while it's being fed to the client. Even worse, that overhead is incurred while transmitting a static file.
I use nginx as a reverse proxy to Apache. It lets me deploy PHP apps easily and efficiently. Apache returns the dynamic and static requests as quickly as it can and moves on to the next request. Nginx buffers the result from apache and efficiently manages returning the data to all the clients. I used to have Nginx serve the static files directly, which is more efficient, but added complexity to my config. I chose simplicity.
A PHP opcode cache, like APC, is also a big win, because it cuts the overhead of parsing and loading the PHP source files. I'm not convinced of the value of other caching for most uses. CPU time usually isn't the scarce resource, RAM is. The DB and filesystem cache are already trying to keep needed data in RAM. Adding more caching layers generally means more copies of the same data in different forms, which means less of the underlying data fits in RAM.
Funny, I felt the same way about ditching Apache entirely. Just one more moving part I don't need.
In fact FastCGI still ties up an Apache process for the duration of the request: Apache hands the PHP request off to a FastCGI worker, then waits for that PHP worker to send back the output, so the Apache process is still blocked waiting on PHP in either scenario.
Also, the overhead of Apache serving static content is minuscule compared to the amount of work a PHP app does per dynamic request, unless the static content is very large, like big media files.
Excessive traffic means a spike in simultaneously served connections, which means a spike in apache threads (assuming worker MPM, iirc there's 1 process per 25 threads by default). With the mod_php model, the per-thread php memory usage can be very expensive when you have dozens or hundreds of apache threads serving requests. A spike in running php instances leads to a spike in mysql connections for a typical web app. If you haven't tuned mysql carefully, which most typical environments have not, mysql memory usage will also skyrocket.
Then for the coup de grace you get stupid apps which think it's perfectly fine to issue long-running queries occasionally (occasionally meaning something like .1% to a few percent of page loads). When that happens, if you're using myisam tables which were the default with mysql < 5.5 (and which therefore dominate deployments, even if "everyone knows" you're supposed to be using innodb), then those infrequent long-running queries block the mysql thread queue, leading to an often catastrophically severe mysql thread backlog. Since php threads are issuing those queries, the apache+mod_php threads stack up as well, and they do not use trivial amounts of memory.
The result is that you have to severely over-engineer the machine with excess memory if you want to survive large traffic spikes. If you don't, you can easily hit swap which will kill your site temporarily, or worse, run out of swap too and have the oomkiller kill something... either your webserver or mysql.
The benefit to fastcgi is it takes the memory allocation of php out of apache's hands, so every new apache thread is more limited in how much bloat it adds to the system. With a limited pool of fastcgi processes, you can also limit the number of db connections which further improves the worst-case memory usage scenario.
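To put made-up but plausible numbers on it: if each Apache+mod_php worker holds ~50 MB resident and a spike pushes you to 200 simultaneous requests, that is on the order of 10 GB for the web tier alone, before MySQL's per-connection memory. Cap a FastCGI pool at, say, 20 PHP processes and the same spike costs roughly 1 GB of PHP plus a pile of comparatively tiny workers holding the queued connections.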
The advantage of in-apache-process php is that it serves php faster when there are few parallel requests, but it's on the order of single-digit milliseconds difference (the extra overhead of sending requests through a fastcgi socket), which is dwarfed by network rtt times even if none of the above pathologies rear their heads.
The apache+mod_php model is to do php processing for all active connections in parallel. The fastcgi model is to do php processing for at most x connections where x is the php fastcgi pool size, leaving all other requests to wait for an open slot. It may intuitively seem like the fastcgi model is going to be slower because some requests have to wait for a fastcgi process to become free, but if you think about average php request time it's going to be better for high parallelism, because the limiting factors are cpu and i/o. The apache model ends up using ridiculous amounts of resources just so no php request has to wait to begin getting processed by php. The high contention particularly for i/o created by apache and mysql when they have large numbers of threads is what makes the fastcgi model superior.
I hope that answers your question.
We were recently paying a small fortune for hosting one of our websites. It was bumping up against memory limits even after a serious code rework and aggressive caching. Instead of upgrading we decided to test a new config using Nginx.
Now we run three sites, one fairly popular, on a 512 MB Linode with Nginx, APC, FPM, Varnish and a CDN, and it can take an amazing amount of load. Varnish needs memory, but without Varnish we could run this setup on a box a fraction of the size.
This plan costs $19/month! I still can't believe we're paying so little.
Rather than focusing just on the server, though, HTTP caching is probably the best place to start in terms of performance, as the TumbleDry article somewhat suggests. Varnish, CDNs, etc. all rely on intelligent HTTP caching, and if you do it right you don't need to worry (too often) about cache invalidation.
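As a hedged example of what "doing it right" can look like at the application level (the header values and the render function are arbitrary examples, not recommendations), emitting sane Cache-Control/ETag headers is all Varnish, a CDN or the browser needs to do its job:

    <?php
    // Illustrative cache headers that Varnish, a CDN or the browser can act on.
    function renderPage() {
        return '<html><body>expensive page</body></html>'; // stand-in for the real render
    }

    $lastModified = filemtime(__FILE__);            // stand-in for the content's mtime
    $etag         = '"' . md5($lastModified) . '"';

    header('Cache-Control: public, max-age=300, s-maxage=3600');
    header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');
    header('ETag: ' . $etag);

    // Let clients and proxies revalidate cheaply instead of refetching.
    if (isset($_SERVER['HTTP_IF_NONE_MATCH']) && $_SERVER['HTTP_IF_NONE_MATCH'] === $etag) {
        http_response_code(304);
        exit;
    }

    echo renderPage();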
What I'm really looking forward to is making use of ESI in Symfony2 and Varnish. That will mean setting different cache headers for portions of pages, which will further reduce the need to manually invalidate the cache.
For now though, I'm loving Nginx + FPM + APC.
New self-hosted blogs can't handle the load at peak times, because new bloggers start a WordPress blog, install a lot of plugins on it, and never think about performance. I think most of us have made that mistake at some point.
What I don't understand is why the various CMSs don't offer automatic on-the-fly reconfiguration. It's 2011. We should be able to have the best of both worlds:
-- when load is light, your blog software hits the database 42 times with every page load, no problem
-- when site load jumps from one page per hour to ten pages per second, the CMS should automatically, with no user intervention, say "oh shit" (perhaps audibly in the server room), generate a static front page and static article pages for whatever is being linked to, turn off the bells and whistles, and email and tweet at the server administrator. The CMS should cheerfully weather a storm all by itself. And when the load dies down, the static pages should revert to dynamic, 42-queries-on-the-database pages, again without any intervention from the administrator.
Does this exist anywhere, out of the box?
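I don't know of one that ships it, but a crude version of the "oh shit" switch is only a front controller away. Everything below (load threshold, paths, the CMS entry point) is invented, and there is no input sanitization, so treat it strictly as a sketch:

    <?php
    // Sketch of an automatic static fallback: if system load is high, serve a
    // pre-rendered copy and never touch the CMS or the database.
    $load = sys_getloadavg();                          // [1min, 5min, 15min]; not available on Windows
    $uri  = $_SERVER['REQUEST_URI'] === '/' ? '/index' : $_SERVER['REQUEST_URI'];
    $staticFile = __DIR__ . '/panic-cache' . $uri . '.html';

    if ($load !== false && $load[0] > 4.0 && is_file($staticFile)) {
        header('X-Served-By: panic-cache');
        readfile($staticFile);
        exit;                                           // the 42 database queries never happen
    }

    // Normal path: let the CMS render the page, but keep a static copy for next time.
    ob_start();
    require __DIR__ . '/cms/index.php';                 // hypothetical dynamic front controller
    $html = ob_get_flush();
    @mkdir(dirname($staticFile), 0755, true);
    @file_put_contents($staticFile, $html);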
There was a loop on the page that instantiated objects by id, with the number of ids depending on the results of a previous query. What did that object instantiation do under the hood? It performed a query to fetch state. I calculated that 5000 queries were being run, and it only cropped up every once in a while (and seemingly 'never' on the live site) because queries were automatically cached for a set amount of time.
I was new on the project, and in what is probably poor form, went around the whole office, letting my horror be fully known. People just shrugged though.
edit: I forgot to add, modifying it to only perform one query was trivial.
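For anyone who hasn't met this bug, its shape (and the trivial fix) looks roughly like this; the table and column names are invented:

    <?php
    // The shape of the problem described above, and the one-query fix.
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

    // Before: one query for the ids, then one more query *per id*, hidden inside
    // the object constructor -- thousands of round trips on a busy page.
    $ids   = $pdo->query('SELECT id FROM items WHERE active = 1')->fetchAll(PDO::FETCH_COLUMN);
    $items = [];
    foreach ($ids as $id) {
        $stmt = $pdo->prepare('SELECT * FROM items WHERE id = ?');
        $stmt->execute([$id]);
        $items[] = $stmt->fetch(PDO::FETCH_ASSOC);
    }

    // After: the same state in a single query.
    $items = $pdo->query('SELECT * FROM items WHERE active = 1')->fetchAll(PDO::FETCH_ASSOC);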
I managed to have it up and running again in about half an hour by converting some content which didn't need to be dynamically generated for every request, into static content.
Once things calmed down, I spent some time optimising it further and making lots of content static, which didn't need to be dynamically generated for every request.
I'm glad I did, because I hit the #2 spot on Slashdot for most of a day this very weekend because of this article:
My lowly Linode VPS didn't even break a sweat this time.
This caching (I’m using APC right now) got my page load times down to about 170 microseconds for most pages, and 400 microseconds
400 microseconds ≈ 2,500 hits/sec; 170 microseconds ≈ 5,880 hits/sec. Both seem reasonable for a simple PHP site using a single CPU core at 1+ GHz.
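The quoted article doesn't show the code, but whole-page caching in APC's user cache can be as small as this sketch (apc_* functions from the 2011-era APC extension, apcu_* today; the key scheme, TTL and render file are only examples, not the article author's actual code):

    <?php
    // Rough sketch of whole-page caching in APC's user cache.
    $key  = 'page:' . $_SERVER['REQUEST_URI'];
    $html = apc_fetch($key, $hit);

    if (!$hit) {
        ob_start();
        require __DIR__ . '/render.php';   // hypothetical: the real page generation
        $html = ob_get_clean();
        apc_store($key, $html, 60);        // keep the rendered page for 60 seconds
    }

    echo $html;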
- Movable type
- Drupal + boost
- Wordpress + SuperCache
- Jekyll or other static website generators
Better still if Nginx is serving those static files; the LAMP stack can sit behind it, creating the static files.
I used that approach with Drupal + Boost for a long time and it worked.
This is purely speculation, but I would be interested to see if there is any actual research to back it up.
I suppose if the guest in this instance is not swapping very often, then this is fairly irrelevant, but the article didn't mention anything about swap.
[Edit: This is 100 qps. It's a lot for a blog, but is not an unreasonable load by any means.]
Note: This is a variation of a previous comment I made, but a variation nonetheless. Sorry to belabor the point.
The plan was to offer quick tests (à la Load Impact), whole-site tests (you give it a starting URL and it hits the linked pages with some probability as well), custom scenarios (a list of URLs to hit in order, rinse, repeat), and an API to trigger tests automatically (for use in routine integration/regression testing).
Got the backend working, never finished the frontend/UI. If anyone reading this is interested, let me know in a reply - I might put up a quick working demo page for it.
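In the meantime, the core of the "hit a list of URLs in parallel" idea fits in a few lines of PHP with curl_multi; the URLs and concurrency below are placeholders:

    <?php
    // Bare-bones parallel load test: fire N requests at a list of URLs and
    // report status codes and timings. Purely illustrative.
    $urls        = ['http://example.com/', 'http://example.com/about', 'http://example.com/blog'];
    $concurrency = 10;

    $mh      = curl_multi_init();
    $handles = [];
    for ($i = 0; $i < $concurrency; $i++) {
        $ch = curl_init($urls[$i % count($urls)]);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[] = $ch;
    }

    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);
    } while ($running > 0);

    foreach ($handles as $ch) {
        printf("%s -> %d in %.0f ms\n",
            curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
            curl_getinfo($ch, CURLINFO_HTTP_CODE),
            curl_getinfo($ch, CURLINFO_TOTAL_TIME) * 1000);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);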
The free version can give you an idea of how your server will handle a small spike, and the other versions are per day so if you only need to test once or so a week as you finish a sprint then it wouldn't be a horrible cost.
But the way his comments work and the way he has implemented the infinite scroll are great, and I'm no expert, but I think that is not possible with static pages.
> Internal Server Error