Ideas from the HN crowd?
Building Scalable Websites by the Flickr guy.
For example, the Facebook Engineering Blog (http://www.facebook.com/eblog) or Google Tech Talks (http://au.youtube.com/results?search_query=google+tech+talk+...).
Web applications don't have a problem handling one person. They have a problem handling 1,000 people who all want to load a page at the same time. That means you're not worried about using all your server resources for one query. You're going to have 1,000 queries, so 250 go to each core in a quad-core box, and you don't care that MySQL (InnoDB, really) can't parallelize a single query across cores. PostgreSQL wouldn't benefit from that either, since it's easier to just run different queries on different cores.
Multi-core query execution could help when you are running fewer concurrent queries than you have processor cores. But once you reach 4-8 concurrent users, you don't need to worry about it, because the separate queries will saturate all the cores anyway; and below 4-8 concurrent users you don't need to worry about it, because serving that few users is easy.
Now, for data warehousing and other applications, multi-core capabilities can be the difference between a query taking 1 hour and a query taking 4 hours. But that's not a web application. In fact, using MySQL for data warehousing is just a bad idea.
The author is totally right that you shouldn't just expect technology to save you (and I personally think DHH is taking the "wait until it is a problem. if becomes a problem and new hardware can solve it, do that. if not, then deal with it" approach rather than simply expecting a solution). However, web applications are about running many things at once, not running one big thing.
P.S. Databases usually aren't CPU constrained. It's almost always memory or disk that slows you down. Spend the money on more RAM or better disks and don't worry about the CPU so much.
P.P.S. Both MySQL and PostgreSQL are wonderful databases for web applications so let's not turn this into some silly flame war, please?
When you have 1,000 people who all want to load a page at the same time, you're only going to hit the database once for that.
What you will have instead is 5,000 people who want to go to 1,000 different pages, some of them updating data and some of them retrieving it, and you will have 1,000 different queries that are either selects or updates.
Now, PostgreSQL was designed from the start to be robust enough to handle this type of concurrency - to lock the minimum amount of data while updating, and to free selects from being blocked by those writes. As a result of being built on such a solid foundation, it has been easy to optimize over the years, and it has supported the emergence of common multi-core computing quite well.
MySQL, on the other hand, was designed to return queries extremely quickly, with as little getting in the way as possible. Unfortunately, that model did not scale as well. When you have 500 reads and 1 write, and you lock the entire table for that write, it's not a big deal, because the write takes no time and you're unblocked again. But when you have hundreds of simultaneous reads and writes, it becomes a mess if you don't have fine-grained locking capabilities. That said, MySQL is progressing as well.
But I think (not trying to start a flame war) that PostgreSQL did things slow and right in the beginning and it's paying off pretty well.
Finally, I disagree that databases are not CPU constrained. Any website with decent traffic will have the database entirely cached in RAM at all times, with the only disk activity being the journaling.
You start by saying you'll have 5,000 people who want 1,000 different pages with some updates, some selects, etc. Well, once you have more concurrent queries than cores, the benefit of splitting a single query over multiple cores is gone, since at that level of concurrency no individual query could even use a whole core to itself.
You go on to say that PostgreSQL was designed to eliminate unnecessary locking so that selects wouldn't be blocked. MySQL's MyISAM storage engine doesn't support this, that is correct. However, InnoDB supports the same MVCC model that PostgreSQL uses and likewise eliminates the locking issue.
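To make the MVCC point concrete, here is a small sketch using SQLite's WAL mode from Python's standard library (standing in for PostgreSQL or InnoDB, which implement true MVCC): an open read transaction keeps seeing its snapshot while a writer commits, and the reader never blocks the writer. The file path and table are made up for the demo.

```python
import os
import sqlite3
import tempfile

# Illustrative MVCC-style behavior: SQLite in WAL mode gives readers a
# consistent snapshot while writers commit concurrently.
path = os.path.join(tempfile.mkdtemp(), "demo.db")

writer = sqlite3.connect(path)
writer.execute("PRAGMA journal_mode=WAL")
writer.execute("CREATE TABLE users (id INTEGER, name TEXT)")
writer.execute("INSERT INTO users VALUES (1, 'alice')")
writer.commit()

reader = sqlite3.connect(path)
reader.execute("PRAGMA journal_mode=WAL")
reader.execute("BEGIN")  # the first read below pins a snapshot
before = reader.execute("SELECT COUNT(*) FROM users").fetchone()[0]

writer.execute("INSERT INTO users VALUES (2, 'bob')")
writer.commit()  # the write commits while the read transaction is open

during = reader.execute("SELECT COUNT(*) FROM users").fetchone()[0]
reader.rollback()  # end the snapshot
after = reader.execute("SELECT COUNT(*) FROM users").fetchone()[0]

print(before, during, after)  # 1 1 2: the snapshot never saw the write
```

The reader's count stays at 1 inside its transaction even after the writer commits, and nobody waited on a table lock; that is the behavior the MVCC argument is about.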
Your argument rests on MySQL locking an entire table to do a write: "When you have 500 reads and 1 write, and you lock the entire table for that write, it's not a big deal. . . But when you have an array of hundreds of simultaneous reads and writes, it becomes a mess if you don't have fine-grained locking capabilities." A great article on this problem is here: http://www.onlamp.com/pub/a/onlamp/2001/05/25/postgresql_mvc.... It's great that PostgreSQL supports that, but MySQL supports it too today (http://wiki.oracle.com/page/InnoDB?t=anon).
There are plenty of things that one can say are issues between the two. MySQL's inferior query planner. The fact that PostgreSQL can suggest indexes based on query history. PostgreSQL's weak replication (Slony-I's communication costs grow quadratically, yuck!). MySQL's acceptance of February 31st as a real date. MySQL's poor subquery optimization. PostgreSQL's more limited data partitioning.
They aren't equal in all ways and PostgreSQL is a wonderful database, but saying that MySQL needs to lock a table to do a write is just wrong in a very partisan manner. I've never really understood such partisanship. Knowing the strengths and weaknesses of multiple products makes you aware of what is good for a project and what isn't. Even better, once you're using one of them, you know what to do and what to avoid with it. Not confronting the reality of how alternative systems work just means that the chance of picking the best system is more luck than information. For what it's worth, I use PostgreSQL in my personal projects. It's great. However, it's also important to understand that MySQL of 2008 is not MySQL of 2001. It's come a long way in the "correctness" camp and the old arguments about Multi-Version Concurrency Control don't apply anymore.
Oh, and from Power PostgreSQL, Disks > RAM > CPU (http://www.powerpostgresql.com/PerfList).
2) I am not trying to slam MySQL. But seven years ago, we had one database which was not ACID compliant and had inconsistent behavior, but was very fast. And we had another which was designed and built properly from the ground up - with the future in mind - but was quite slow. There were advantages and disadvantages to each one. But over the past seven years, PostgreSQL has had time to optimize and stabilize code, and as a result it performs just as well as MySQL in most cases. Whereas in the past seven years, MySQL has worked to implement the essential features that allow it to be a robust database. If you were starting a website in 2009, why would you pick MySQL? I could easily understand why in 2001, but not now.
3) My point is that the "Disks > RAM > CPU" argument is no longer valid. The point of that statement was that you should spend your money on disks first, then RAM, and focus less on CPUs. This is not true in 2009! Now (for web apps), you can easily buy enough RAM to cache your entire database. All of a sudden, RAM and Disks are no longer an issue and your DB becomes CPU limited. All that discussion about "more spindles = better" and "raid 1+0 > raid 5" is not very important anymore; you just need a disk that's fast enough to log your transactions.
As for why someone would choose MySQL, there are a bunch of reasons. There are a lot more people with MySQL experience out there. MySQL has better replication facilities (and I've set up replication with MySQL, Slony-I and PgPool-II). I'd really like to see Mammoth Replicator become the standard in the PostgreSQL community (as well as for 1.8 to be out of beta) as I think it's a considerably better replication design than the other options in PostgreSQL, but right now MySQL replication looks a lot better. Maybe you have a good use for one of MySQL's less used storage engines. MySQL Cluster looks interesting, but I wouldn't trust my data to it today (even if Zillow seems to think it's the best thing since sliced bread).
The differences between the two are really minor today. Choose whichever one you like, but there are definite reasons to choose either one.
Asking out of curiosity/ignorance, not trolling
It takes more programming to think how best to organize things - what stuff should be in memory? How best to store it in ram for performance/size? What should be on the disk as flat files? What needs to be in a db? Should parts of the db be cached in ram and just used for writes etc.
Also does everything need to be in a db? Or are some things better dealt with by just passing messages around, queuing them up if needed etc.
You are working way too hard. Let the OS figure out what needs to be in RAM - it does it automatically anyway, and it does a better job than you can, since it caches what is actually used, not what you think should be used.
You should not use flat files for web apps - they don't handle concurrency very well.
> Also does everything need to be in a db? Or are some things better dealt with by just passing messages around, queuing them up if needed etc.
Message passing and a DB are not interchangeable, so that's a false dichotomy.
>> "You should not use flat files for web apps - they don't handle concurrency very well."
That's a silly blanket statement. If I have a single thread that deals with something, of course it can use a flat file to store it. The problem is that some people decide to use webservers that cause concurrency issues by having multiple threads doing similar things for different users.
It sounds like you're still thinking about the standard, accepted approach: many threads, a database, etc.
If you have just a single thread of course you don't have to worry about concurrency. Isn't that what I just said?
And you are seriously making a webserver that serves just one request at a time?
There is a good reason it's "standard accepted" to use many threads, and a database. I guess you could have just one thread, with a queue, and do just one thing at a time without a database. Don't know why you would want that though.
If you follow "standard accepted", you won't get anywhere.
Can you handle 1 million requests (per day) on a single thread?
By using just one thread you are serializing your bottlenecks (CPU, IO, network).
If you have more than one, one can be waiting for IO, while the other uses CPU.
Plus CPU speeds are not getting faster, the future is multi core.
He's not talking about the number of requests per day, he's talking about the number of simultaneous requests.
And I knew he couldn't possibly mean 5K per day, but he didn't say how many it was.
I'm not too shocked you don't know about it, to be honest - not enough people do for some reason. Axod and I are lucky to have worked together on a very large-scale problem at a previous company that could never have been handled with a threaded approach.
I've since moved on to Justin.TV, where I wrote a single-threaded chat server that scales to over 10k connections per server cpu (we run it on many 8-cpu boxes). Axod is now the founder of mibbit, and he's obviously using a single-threaded approach there too.
You have one program, that handles multiple requests in the same program - but it's just one program.
As opposed to multiple programs, each handling one request.
I can see how that will handle any IO issues, and if starting a program has overhead, that will help too, but it still seems like it won't do a good job of keeping the CPU busy.
But you did say earlier that you were not CPU bound. All my websites have been CPU bound (well I think they are CPU bound), so I guess that's why I didn't get it at first.
- Adding threads "works" up to some small number (maybe a few hundred or so - depends on your platform). Then adding more threads just takes up more cpu without doing any useful work. Your program can seem cpu-bound, when actually you just have so many threads that none of them can do anything.
- The approach axod and I are talking about uses a single thread to service many network connections. Obviously you have to write your code quite differently to handle this: your code is typically "triggered" by an event like receiving a few bytes over one network connection. You do a (very!) small amount of work, and then you quickly return control to the network core (called the "reactor" in Python's Twisted library). The core then waits for another event (e.g. more bytes arrive on a different network connection), and the cycle repeats.
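A minimal sketch of that pattern, using Python's stdlib selectors module rather than Twisted; the socketpair stands in for real client connections, and the uppercasing echo is just a placeholder for real per-event work:

```python
import selectors
import socket

# One thread, one event loop: the selector tells us which connection is
# ready, we do a tiny amount of work, then return to waiting.
sel = selectors.DefaultSelector()

def on_readable(conn):
    data = conn.recv(1024)          # small amount of work per event
    if data:
        conn.sendall(data.upper())  # then hand control back to the loop
    else:
        sel.unregister(conn)
        conn.close()

# A socketpair keeps the example self-contained; a real server would
# also register a listening socket whose callback accepts new clients.
server_side, client_side = socket.socketpair()
sel.register(server_side, selectors.EVENT_READ, on_readable)

client_side.sendall(b"hello")
for key, _ in sel.select(timeout=1):  # one turn of the event loop
    key.data(key.fileobj)             # dispatch to the registered callback

reply = client_side.recv(1024)
print(reply)  # b'HELLO'
```

A production server wraps the `sel.select()` call in a `while True:` loop and registers thousands of sockets with the same selector; the single thread only ever does one small callback at a time.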
Hope that helps.
I was letting apache do the threads, so under 100 probably.
Thanks for posting this - and staying on the thread. I should probably go back and re-read the thread now that I get what you are saying.
Paul Tyma claims handling 40,000 chat messages per second on a quad-core desktop system with it.
It's also just far far simpler to go with a single networking thread. Then pass off any cpu intensive, or long running tasks, or blocking tasks, to other threads.
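A sketch of that hand-off, with a queue between the (single) networking thread and a small pool of workers; the squaring is a made-up stand-in for any CPU-heavy or blocking task:

```python
import queue
import threading

# The networking thread enqueues work; workers pull from the queue so the
# networking thread never blocks on a slow task.
jobs = queue.Queue()
results = queue.Queue()

def worker():
    while True:
        task = jobs.get()
        if task is None:          # sentinel: time to shut down
            break
        results.put(task * task)  # stand-in for the expensive work

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()

for n in range(10):               # the "networking thread" just enqueues
    jobs.put(n)
for _ in workers:                 # one sentinel per worker, after the jobs
    jobs.put(None)
for w in workers:
    w.join()

out = sorted(results.get() for _ in range(10))
print(out)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Because the queue is FIFO, the sentinels are only reached after all real jobs are drained, so the join is a clean shutdown.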
You are of course right though. Easy to forget...
I'm not trying to be argumentative, I have never done a very large scale website, and only now did I check your profile to see that you did.
But I still don't understand how come it's better to do a single thread. Also, all my websites have been CPU bound.
And even if it's IO bound, if you have more than one disk, adding a thread can only help, no?
But even then, it's better to have a set number of threads doing different tasks, rather than one per user. eg have a network thread, a db thread(s) etc.
But is it really better to statically allocate resources to threads? You may have 8 cores on a box and 1 of them burning and 7 of them cruising. By utilizing a small thread pool and letting the scheduler spin things off dynamically you can turn that into 8 cruising instead.
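A sketch of the shared-pool idea with Python's ThreadPoolExecutor; the task names are illustrative, and `handle` stands in for whatever mixed work the scheduler would spread across cores:

```python
from concurrent.futures import ThreadPoolExecutor

# Instead of dedicating one thread per task type, a small shared pool
# runs whatever work is ready, so no core sits idle while another burns.
def handle(task):
    return f"done:{task}"  # placeholder for network/db/render work

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(handle, ["net", "db", "render"]))
print(results)  # ['done:net', 'done:db', 'done:render']
```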
So why does memcached exist?
The kernel can cache the FS calls that the DB makes, but it can't cache calls to the DB!
To cache the results of complicated joins (or queries without indexes).
It's pretty much pointless if all you are doing is caching the result of a simple query using an index.
Not if your database is under heavy load, and you can easily shift some of that load by putting frequently accessed things in memcached instead.
But that's clearly not true. In the most extreme case, that hash table is referenced simply by a variable in your program - it's already in your program's address-space! There's no way a database can come close to that.
hash at: key put: anObject
The second one commits you to using a relational database, which often easily triples the size of the code base. There's nothing simple about that.
Not true at all. The purpose of Memcached is to completely avoid a call to the database, because the database, even if it keeps everything in memory, can't touch the read speed of a distributed hash table. Memcached allows you to spread the reads across farms of boxes instead of sending them all to what is usually a single database server.
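As a toy illustration of the idea (a plain in-process dict rather than a real memcached client; `slow_query` is a made-up stand-in for an expensive join):

```python
import time

# Cache-aside pattern: check the hash table first, and only fall through
# to the "database" on a miss. Repeat reads never touch the database.
cache = {}

def slow_query(sql):
    time.sleep(0.01)      # pretend this is an expensive join
    return sql.upper()    # fake "result set"

def cached_query(sql):
    if sql not in cache:             # miss: hit the database once
        cache[sql] = slow_query(sql)
    return cache[sql]                # hit: pure in-memory lookup

first = cached_query("select * from users")
second = cached_query("select * from users")
print(first == second, len(cache))  # True 1
```

Real memcached does the same lookup over the network against a farm of cache boxes, which is what lets it absorb read load that would otherwise land on one database server.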
Although what I find interesting is the data store in google app engine - if you can work within that sort of database, you have a much better chance of scaling if you need to (and like you said, you rarely need the relational part of a RDBMS anyway).
This should be enough to show you that at some point, Moore's law will end. It is less important right now (there is still some development going on, and there may be new paradigms in the far future), but the world is not limitless.
Stackless seems like a good alternative to Erlang.
Take the two statements above as standalone.
Yeah, not a big difference really.
"python interpreter doesn't even support multiple real threads let alone multiple cores"
multiprocessing allows you to have the interpreter start another entire OS process (another interpreter). It mimics the threading API.
It's useful and it does allow you to take advantage of multiple cores without unfamiliar APIs which I take it is your point...
You can see in the PEP that for many situations the overhead of processes is not a big concern:
It greatly depends on the task, but my usage of the module disagrees with the PEP: I don't use its threading-like API, preferring its "distributed" capabilities.
The main point is that there are multiple concurrency approaches, and the Java-like threading approach is the worst for many tasks. It would be nice if the GIL were gone, but for many concurrency approaches it doesn't matter. http://wiki.python.org/moin/ParallelProcessing
If I need to do some tightly coupled thing with threads and shared data etc. I'm not going to be looking at Python anyhow.
It seems like MySQL is therefore dreck. I can't see any reason not to use PostgreSQL.
(But do use Postgres, it's better than MySQL for most cases!)
Thanks for the info, I'll have to dig further, but at least now I know it should be fast even in MySQL.
Moore's law only states that the density of transistors on a chip will double every two years. This will fail because of the Zeno's-paradox-like effect of limited miniaturization.
At some point, transistors will have to reach molecular and then atomic size, at which point it should be theoretically impossible to get any smaller.
(Sorry for the derailment, and yes, I also go bonkers over centrifugal vs centripetal.)
Finally, DHH may be amazing at writing frameworks, but he is just about the absolute last person I would trust with anything that resembled math.