Hacker News
Don't Bet on Moore Saving Your Ass (zawodny.com)
82 points by prakash on Jan 7, 2009 | 78 comments

I'm very interested in book recommendations for scaling websites from hundreds of users to millions of users. I'm interested in both database and web server configurations as well as hardware details, and war stories from the trenches. Also, I'd like to know if there are bits of wisdom I can incorporate into my 1.0 codebase to make scaling easier should I be fortunate enough to have that problem.

Ideas from the HN crowd?


Building Scalable Websites by the Flickr guy.

Reading articles and watching videos from the engineering teams of big sites is interesting, though not always easily applicable.

For example, the Facebook Engineering Blog (http://www.facebook.com/eblog) or Google Tech Talks (http://au.youtube.com/results?search_query=google+tech+talk+...).

Does anyone know if PostgreSQL has this similar multicore bottleneck?

PostgreSQL is much better at using multiple cores, but it doesn't matter (at least not for web applications).

Web applications don't have a problem handling one person. They have a problem handling 1,000 people who all want to load a page at the same time. That means you're not worried about using all your server resources for one query. You're going to have 1,000 queries, so roughly 250 go to each core of a quad-core box. It doesn't matter that MySQL (InnoDB, really) can't parallelize one query across cores, because PostgreSQL wouldn't benefit from that here either: each query simply runs on a different core.

Multi-core capabilities could help when you are running fewer concurrent queries than you have processor cores. But once you reach 4-8 concurrent users you'll saturate all the cores with separate queries anyway, and below 4-8 concurrent users the load is so light that it doesn't matter either.

Now, for data warehousing and other applications, multi-core capabilities can be the difference between a query taking 1 hour and a query taking 4 hours. But that's not a web application. In fact, using MySQL for data warehousing is just a bad idea.

The author is totally right that you shouldn't just expect technology to save you (and I personally think DHH is taking the "wait until it is a problem; if it becomes a problem and new hardware can solve it, do that; if not, then deal with it" approach rather than simply expecting a solution). However, web applications are about running many things at once, not running one big thing.

P.S. Databases usually aren't CPU constrained. It's almost always memory or disk that slows you down. Spend the money on more RAM or better disks and don't worry about the CPU so much.

P.P.S. Both MySQL and PostgreSQL are wonderful databases for web applications so let's not turn this into some silly flame war, please?

PostgreSQL is much better at using multiple cores, and it absolutely does matter for web applications.

When you have 1,000 people who all want to load a page at the same time, you're only going to hit the database once for that.

What you will have instead is 5,000 people who want to go to 1,000 different pages, some of them updating data and some of them retrieving it, and you will have 1,000 different queries that are either selects or updates.

Now, PostgreSQL was designed from the start to be robust enough to handle this type of concurrency - to allow the minimum amount of data to be locked while updating, to free the selects from being blocked by those writes. As a result of being built on such a solid foundation, it has been easy to optimize over the years, and has supported the emergence of common multi-core computing quite well.

MySQL, on the other hand, was designed to return queries extremely quickly, with as little getting in the way of doing so as possible. Unfortunately, their model did not scale as well. When you have 500 reads and 1 write, and you lock the entire table for that write, it's not a big deal because the write takes no time and you're unblocked again. But when you have an array of hundreds of simultaneous reads and writes, it becomes a mess if you don't have fine-grained locking capabilities. That having been said, MySQL is progressing as well.

But I think (not trying to start a flame war) that PostgreSQL did things slow and right in the beginning and it's paying off pretty well.

Finally, I disagree that databases are not CPU constrained. Any website with decent traffic will have the database entirely cached in RAM at all times, with the only disk activity being the journaling.

PostgreSQL was designed "correct" from the start. However, MySQL does many of the same things today.

You start by saying you'll have 5,000 people who want 1,000 different pages with some updates, some selects, etc. Well, once you have more queries than cores, the benefit of splitting a single query over multiple cores is gone, since at that level of concurrency no query could even get a whole core to itself anyway.

You go on to say that PostgreSQL was designed to eliminate unnecessary locking so that selects wouldn't be blocked. MySQL's MyISAM database doesn't support this, that is correct. However, InnoDB does support the same MVCC model that PostgreSQL uses and likewise eliminates the locking issue.
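The MVCC idea being contrasted here can be sketched in a few lines. This is a toy model of my own (nothing like either database's actual implementation): writers append new versions rather than locking rows, and each reader sees only versions committed before its snapshot, so reads never block on writes.

```python
# Toy sketch of MVCC (multi-version concurrency control): writers
# append new versions instead of locking; readers see the newest
# version committed at or before their snapshot.
class MVCCStore:
    def __init__(self):
        self.clock = 0
        self.versions = {}          # key -> list of (commit_ts, value)

    def write(self, key, value):
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))

    def snapshot(self):
        return self.clock           # a reader's view of "now"

    def read(self, key, snap):
        # newest version committed at or before the snapshot
        for ts, value in reversed(self.versions.get(key, [])):
            if ts <= snap:
                return value
        return None
```

A reader holding an older snapshot keeps seeing a consistent value even while a writer commits a newer one, which is the "selects aren't blocked by writes" property under discussion.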

Your argument rests on MySQL locking an entire table to do a write: "When you have 500 reads and 1 write, and you lock the entire table for that write, it's not a big deal. . . But when you have an array of hundreds of simultaneous reads and writes, it becomes a mess if you don't have fine-grained locking capabilities." A great article on this problem is here: http://www.onlamp.com/pub/a/onlamp/2001/05/25/postgresql_mvc.... It's great that PostgreSQL supports that, but MySQL supports it too today (http://wiki.oracle.com/page/InnoDB?t=anon).

There are plenty of things that one can say are issues between the two. MySQL's inferior query planner. The fact that PostgreSQL can suggest indexes based on query history. PostgreSQL's weak replication (Slony-I's communication costs grow quadratically, yuck!). MySQL's acceptance of February 31st as a real date. MySQL's poor subquery optimization. PostgreSQL's more limited data partitioning.

They aren't equal in all ways and PostgreSQL is a wonderful database, but saying that MySQL needs to lock a table to do a write is just wrong in a very partisan manner. I've never really understood such partisanship. Knowing the strengths and weaknesses of multiple products makes you aware of what is good for a project and what isn't. Even better, once you're using one of them, you know what to do and what to avoid with it. Not confronting the reality of how alternative systems work just means that the chance of picking the best system is more luck than information. For what it's worth, I use PostgreSQL in my personal projects. It's great. However, it's also important to understand that MySQL of 2008 is not MySQL of 2001. It's come a long way in the "correctness" camp and the old arguments about Multi-Version Concurrency Control don't apply anymore.

Oh, and from Power PostgreSQL, Disks > RAM > CPU (http://www.powerpostgresql.com/PerfList).

1) I understand that MySQL (with InnoDB) supports row-level locking today. I was speaking of both databases as they existed in the past, to express how much easier it was for PostgreSQL to optimise and scale over the past decade.

2) I am not trying to slam MySQL. But seven years ago, we had one database which was not ACID compliant and had inconsistent behavior, but was very fast. And we had another which was designed and built properly from the ground up - with the future in mind - but was quite slow. There were advantages and disadvantages to each one. But over the past seven years, PostgreSQL has had time to optimise and stabilize code, and as a result it performs just as well as MySQL in most cases. Whereas in the past seven years, MySQL has worked to implement the essential features that allow it to be a robust database. If you were starting a website in 2009, why would you pick MySQL? I could easily understand why in 2001, but not now.

3) My point is that the "Disks > RAM > CPU" argument is no longer valid. The point of that statement was that you should spend your money on disks first, then RAM, and focus less on CPUs. This is not true in 2009! Now (for web apps), you can easily buy enough RAM to cache your entire database. All of a sudden, RAM and Disks are no longer an issue and your DB becomes CPU limited. All that discussion about "more spindles = better" and "raid 1+0 > raid 5" is not very important anymore; you just need a disk that's fast enough to log your transactions.

Well, I personally choose PostgreSQL for my personal stuff, but your logic doesn't hold. PostgreSQL was slow, but correct. Now it is fast and correct. MySQL was fast, but wrong. Now it is fast and correct. Most of the reasoning behind picking one or the other has disappeared and we're left scraping the bottom of the nitpick barrel trying to convince people to use one over the other.

As for why someone would choose MySQL, there are a bunch of reasons. There are a lot more people with MySQL experience out there. MySQL has better replication facilities (and I've set up replication with MySQL, Slony-I and PgPool-II). I'd really like to see Mammoth Replicator become the standard in the PostgreSQL community (as well as for 1.8 to be out of beta) as I think it's a considerably better replication design than the other options in PostgreSQL, but right now MySQL replication looks a lot better. Maybe you have a good use for one of MySQL's less used storage engines. MySQL Cluster looks interesting, but I wouldn't trust my data to it today (even if Zillow seems to think it's the best thing since sliced bread).

The differences between the two are really minor today. Choose whichever one you like, but there are definite reasons to choose either one.

You're ignoring the issue of locking, which can effectively turn your multi-core CPU into a single-core one. Secondly, databases can be limited depending on what you're asking them to do, not just what type of work you're doing. IMO the largest advantage of multi-core CPUs is handing a longer list of requests to the disk controller so it can better optimize the head's path.

Sun has previously demonstrated Postgres scaling well on their 32-way (8 cores x 4 threads/core) hardware. http://it.toolbox.com/blogs/database-soup/postgresql-publish...

One thing to make note of: if you are in a space where you have massive concurrency, the right kind of hardware can help you out. There was a time in the mid 90's when mainframes with CPUs we would regard as quaintly slow could still achieve I/O throughput that could dwarf the throughput of a PC with an order of magnitude faster CPU.

Quite a few webapps just shouldn't be using a db for most things. It's lazy programming.

What then?

Asking out of curiosity/ignorance, not trolling

I think this website (hacker news) just uses persisted hash tables. I'm not sure why he thinks databases indicate "lazy programming" though.

It's really easy to just throw everything in a db, and then start worrying when you scale.

It takes more programming to think how best to organize things - what stuff should be in memory? How best to store it in ram for performance/size? What should be on the disk as flat files? What needs to be in a db? Should parts of the db be cached in ram and just used for writes etc.

Also does everything need to be in a db? Or are some things better dealt with by just passing messages around, queuing them up if needed etc.

> Should parts of the db be cached in ram and just used for writes etc.

You are working way too hard. Let the OS figure out what needs to be in ram - it does it automatically anyway, and it does a better job than you can, since it caches what's actually used, not what you think should be used.

You should not use flat files for web apps - they don't handle concurrency very well.

> Also does everything need to be in a db? Or are some things better dealt with by just passing messages around, queuing them up if needed etc.

Message passing and a db are not interchangeable, so that's a false question.

Often a database is used to do simple message passing - campfire, twitter, pretty sure they both use a db to pass messages around.

>> "You should not use flat files for web apps - they don't handle concurrency very well."

That's a silly blanket statement. If I have a single thread that deals with something, of course it can use a flat file to store it. The problem is that some people decide to use webservers that cause concurrency issues by having multiple threads doing similar things for different users.

It sounds like you're still thinking about standard accepted many threads, database, etc etc.

Uh what?

If you have just a single thread of course you don't have to worry about concurrency. Isn't that what I just said?

And you are seriously making a webserver that serves just one request at a time?

There is a good reason it's "standard accepted" to use many threads, and a database. I guess you could have just one thread, with a queue, and do just one thing at a time without a database. Don't know why you would want that though.

My webserver is currently serving 5k clients in a single thread...

If you follow "standard accepted", you won't get anywhere.

Um, everyone's webserver can handle 5k requests (per day) in a single thread. Not sure how many requests your clients do per day since you didn't say.

Can you handle 1 million requests (per day) on a single thread?

By using just one thread you are serializing your bottlenecks (CPU, IO, network).

If you have more than one, one can be waiting for IO, while the other uses CPU.

Plus CPU speeds are not getting faster, the future is multi core.

Um, everyone's webserver can handle 5k requests (per day)

He's not talking about the number of requests per day, he's talking about the number of simultaneous requests.

Maybe I'm not understanding something - how do you do simultaneous requests in a single thread?

And I knew he couldn't possibly mean 5K per day, but he didn't say how many it was.

Maybe I'm not understanding something - how do you do simultaneous requests in a single thread?


I'm not too shocked you don't know about it, to be honest - not enough people do for some reason. Axod and I are lucky to have worked together on a very large-scale problem at a previous company that could never have been handled with a threaded approach.

I've since moved on to Justin.TV, where I wrote a single-threaded chat server that scales to over 10k connections per server cpu (we run it on many 8-cpu boxes). Axod is now the founder of mibbit, and he's obviously using a single-threaded approach there too.

I'm starting to understand what you mean.

You have one program, that handles multiple requests in the same program - but it's just one program.

As opposed to multiple programs, each handling one request.

I can see how that will handle any IO issues, and if starting a program has overhead, that will help too, but it still seems like it won't do a good job of keeping the CPU busy.

But you did say earlier that you were not CPU bound. All my websites have been CPU bound (well I think they are CPU bound), so I guess that's why I didn't get it at first.

They may well be CPU bound because of context switching.

Right. Ars, here are a couple of points you may be missing:

- Adding threads "works" up to some small number (maybe a few hundred or so - depends on your platform). Then adding more threads just takes up more cpu without doing any useful work. Your program can seem cpu-bound, when actually you just have so many threads that none of them can do anything.

- The approach axod and I are talking about uses a single thread to service many network connections. Obviously you have to write your code quite differently to handle this: Your code is typically "triggered" by an event like receiving a few bytes over one network connection. You do a (very!) small amount of work, and then you quickly return control to the network core (called the "reactor" in Python's Twisted library). The core then waits for another event (e.g. more bytes arrive on a different network connection), and the cycle repeats.
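For the curious, a minimal sketch of that style using Python's standard selectors module (all names here are mine, purely illustrative): one thread, one selector, and tiny handlers that do a little work and immediately return control to the loop.

```python
import selectors
import socket

# Single-threaded echo server in the reactor style: one selector
# watches every socket, and each readiness event dispatches to a
# small non-blocking handler before control returns to the loop.
sel = selectors.DefaultSelector()

def accept(server_sock):
    conn, _addr = server_sock.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, handle)

def handle(conn):
    data = conn.recv(4096)
    if data:
        conn.sendall(data)          # do a tiny bit of work...
    else:
        sel.unregister(conn)        # ...and get out of the way
        conn.close()

def serve(host="127.0.0.1", port=0):
    server = socket.socket()
    server.bind((host, port))
    server.listen()
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ, accept)
    return server

def loop_once(timeout=1.0):
    # One turn of the event loop: wait for events, dispatch each one.
    for key, _mask in sel.select(timeout):
        key.data(key.fileobj)       # calls accept() or handle()
```

One such loop can service thousands of connections, which is the point being made: no per-connection thread, no context-switch storm.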

Hope that helps.

It does, thanks.

I was letting apache do the threads, so under 100 probably.

Thanks for posting this - and staying on the thread. I should probably go back and re-read the thread now that I get what you are saying.

I'd be curious to know what you think about this alternative architecture:


Paul Tyma claims handling 40000 chat messages per second on a quad core desktop system with it.

I tried a similar sort of architecture at one point, the issue is that if you share blocking calls in a thread, at some point something will block that you don't expect to.

It's also just far far simpler to go with a single networking thread. Then pass off any cpu intensive, or long running tasks, or blocking tasks, to other threads.

The internet has made me sad. :(


Be happy, there was a time when we didn't know about this stuff either ;-)

Ah the days of threadpools, the cpu battling to context switch. The late nights battling to hold the cpu steady and add just another few threads, good times, good times :)

You are of course right though. Easy to forget...

I don't think this discussion is going anywhere, but for reference, my webserver does around 45 million requests a day, and yes, that's in a single thread. Webservers aren't typically cpu bound.

Thanks for the numbers, 45 million and 5k are kinda different.

I'm not trying to be argumentative, I have never done a very large scale website, and only now did I check your profile to see that you did.

But I still don't understand how come it's better to do a single thread. Also, all my websites have been CPU bound.

And even if it's IO bound, if you have more than one disk, adding a thread can only help, no?

Adding threads only helps if you have more cores, or if you're forced to use something that may block.

But even then, it's better to have a set number of threads doing different tasks, rather than one per user. eg have a network thread, a db thread(s) etc.

First: this is a really interesting thread and I have a lot of respect for your experience.

But is it really better to statically allocate resources to threads? You may have 8 cores on a box and 1 of them burning and 7 of them cruising. By utilizing a small thread pool and letting the scheduler spin things off dynamically you can turn that into 8 cruising instead.

Just curious.

The networking thread shouldn't be using hardly any CPU. If it is, then something's badly wrong. It's better to have a single networking thread, trawling through the connections, moving data around, and have it talking to other threads that handle long jobs, cpu intensive tasks, or anything that might block.
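A minimal sketch of that division of labour (the names and the queue-based hand-off are my own illustration, not anyone's actual server): the single network thread would put jobs on a queue, a few worker threads pick them up, and results come back on a second queue, so nothing slow ever runs on the networking thread.

```python
import queue
import threading

# The network thread never blocks on slow work: it enqueues jobs
# and later drains results. Workers do the slow part off to the side.
jobs = queue.Queue()
results = queue.Queue()

def worker():
    while True:
        conn_id, task = jobs.get()
        if task is None:            # sentinel: shut this worker down
            break
        # .upper() stands in for a long job / blocking call
        results.put((conn_id, task.upper()))
        jobs.task_done()

def start_workers(n=4):
    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(n)]
    for t in threads:
        t.start()
    return threads
```

The fixed, small pool is the design choice under discussion: a handful of task-specific threads instead of one thread per user.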

You are working way too hard. Let the OS figure out what needs to be in ram - it does it automatically anyway, and it does a better job than you can, since it caches what's actually used, not what you think should be used.

So why does memcached exist?

Because the kernel's cache is for things accessed via the VFS layer. It is extremely good at caching file accesses.

The kernel can cache the FS calls that the DB makes, but it can't cache calls to the DB!

> So why does memcached exist?

To cache the results of complicated joins (or queries without indexes).

It's pretty much pointless if all you are doing is caching the result of a simple query using an index.

It's pretty much pointless if all you are doing is caching the result of a simple query using an index.

Not if your database is under heavy load, and you can easily shift some of that load by putting frequently accessed things in memcached instead.

That's if you have two machines. The comparison was vs keeping stuff in a hash table in memory, and I was saying databases are no worse.

The comparison was vs keeping stuff in a hash table in memory, and I was saying databases are no worse.

But that's clearly not true. In the most extreme case, that hash table is referenced simply by a variable in your program - it's already in your program's address-space! There's no way a database can come close to that.

Not to mention that you can hash arbitrary objects in a hash table with no mapping of any kind.

    hash at: key put: anObject
Databases are vastly more complicated and require me to completely disassemble the object graph anObject may contain into a set of tables and rows to store it, and then reassemble the graph from tables and rows back into their object form when fetching.

The moment one commits to using a relational database, one often triples the size of the code base. There's nothing simple about that.
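The hash-table side of that comparison can be sketched in Python (pickle stands in here for whatever serializer a persisted hash table might use; the data is illustrative):

```python
import pickle

# An arbitrary object graph goes in and comes back out whole,
# with no mapping layer: no tables, no rows, no reassembly.
comment = {
    "author": "axod",
    "text": "My webserver is currently serving 5k clients...",
    "replies": [{"author": "ars", "text": "Uh what?", "replies": []}],
}

store = {}                                   # the "persisted" hash table
store["item:1"] = pickle.dumps(comment)      # hash at: key put: anObject
restored = pickle.loads(store["item:1"])     # graph comes back intact
```

The relational version of the same data would need a comments table, a self-referencing foreign key for replies, and code to walk both directions, which is the code-size point being made.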

> To cache the results of complicated joins (or queries without indexes).

Not true at all. The purpose of Memcached is to completely avoid a call to the database because the database, even if it keeps everything in memory can't touch the read speed of a distributed hash table. Memcached allows you to spread the reads across farms of boxes instead of sending them all to what is usually a single database server.

yeah but if you get its going quick - is that really lazy? If you value more getting it done and working over getting things "right" from the start - for many cases that is a good enough approach.

Although what I find interesting is the data store in google app engine - if you can work within that sort of database, you have a much better chance of scaling if you need to (and like you said, you rarely need the relational part of a RDBMS anyway).

Care to expand on this?

You have to be more specific. What do you mean exactly by "db"? a relational DBMS? a file-based one? a tree-based one? document-oriented maybe? column-oriented?

Most websites are likely to be using mysql/postgresql.

One other thing: Moore's law is fundamentally limited. Remember that the limit of the exponential function as t-> infinity is infinity.

This should be enough to show you that at some point, Moore's law will end. It is less important right now (there is still some development going on, and new paradigms in the far future), but the world is not limitless.

I think all kinds of components and applications are going to hit the wall. The python interpreter doesn't even support multiple real threads let alone multiple cores. Programmers don't know how to write the code. Here is an article musing about the problems in Java: http://www.devwebsphere.com/devwebsphere/2006/11/multicore_m... Do we all need to learn erlang?

Clojure has put a major emphasis on fixing multithreading for Java - I don't know about garbage collection and problems with the JVM itself, though.

One of the few things I liked about Java was how easy it was to do multi-threaded programming. Granted it's clunky but Java network apps are much simpler than doing the same thing in C / C++.

Java has a very nice available tool for scaling out a webapp: terracotta.org

Languages (e.g. Haskell) which support transactional software memory can also be useful.

I'd rather have some sort of support for multiple cores without real threads.

Stackless seems like a good alternative to Erlang.

Take the two statements above as standalone.

I agree that Stackless is better. Threads are very heavyweight. The problem is that the underlying C interpreter works on only one core. If Stackless can work with multiple interpreters then it may be the greatest thing since sliced bread. I haven't been following it closely enough to say for sure.

All you have to do is build a mechanism that distributes threads between interpreter processes then just spawn an interpreter for each core. For many applications, that's all you need.

I seem to recall a benchmark that showed Erlang as much faster than Stackless (for an admittedly relatively arbitrary problem).

Python uses native threads which can be scheduled by the OS to multiple cores. It's just that two threads can't run the interpreter at the same time (C-coded extensions can, and most of them release the lock).

Yeah, not a big difference really.

The `multiprocessing` module is in Python's stdlib.

The parent's statement is still technically right:

"python interpreter doesn't even support multiple real threads let alone multiple cores"

multiprocessing allows you to have the interpreter start another entire OS process (another interpreter). It mimics the threading API.

It's useful and it does allow you to take advantage of multiple cores without unfamiliar APIs which I take it is your point...
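A minimal sketch of that point (illustrative names; assumes a platform where worker processes can see top-level functions, e.g. fork-based start on Linux): the code has the same shape as threaded code, but each worker is a separate interpreter with its own GIL, so it can use multiple cores.

```python
import multiprocessing

# Must be a top-level function so worker processes can find it.
def square(n):
    return n * n

def run_in_processes(numbers, workers=4):
    # Pool.map looks just like map()/threaded fan-out, but each
    # worker is its own OS process (its own interpreter and GIL).
    with multiprocessing.Pool(workers) as pool:
        return pool.map(square, numbers)
```

This is also roughly what "distribute work between interpreter processes, one per core" looks like in practice, and `Pool.map` is about as close as stock Python gets to a parallel list comprehension.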

You can see in the PEP that for many situations the overhead of processes is not a big concern:


The `multiprocessing` module supports multiple "concurrent" programming paradigms.

It greatly depends on the task, but my usage of the module disagrees with the PEP: I don't use its threading-like API, preferring its "distributed" capabilities.

The main point is there are multiple "concurrent" approaches, and the Java-like "threading" approach is the worst for many tasks. It would be nice if the GIL were gone, but for many "concurrent" approaches it doesn't matter. http://wiki.python.org/moin/ParallelProcessing

The endless GIL debate: why not remove thread support instead? http://mail.python.org/pipermail/python-dev/2008-December/08...

I'm in agreement (note that I could not really tell what you were saying in your original comment).

If I need to do some tightly coupled thing with threads and shared data etc. I'm not going to be looking at Python anyhow.

I would love to have parallel list comprehensions, for instance.

I only just started using MySQL to store tables of around 100,000 entries. But MySQL would take literally > 30 seconds (sometimes minutes) for a query that PostgreSQL (which I installed after it) can handle in 10 seconds.

It seems like MySQL is therefore dreck. I can't see any reason not to use PostgreSQL.

There's something terribly, horribly wrong with either your database setup or your queries. 100,000 entries is minuscule. Do you have indexes on the join columns in your queries?

(But do use Postgres, it's better than MySQL for most cases!)

I learned SQL at University, but that was 5 years ago so I'm a bit rusty. There's 1 table with a primary key (a string ID), which is used for rows with duplicate ID's, so in theory MySQL should be quick with it.

Thanks for the info, I'll have to dig further, but at least now I know it should be fast even in MySQL.

Try using EXPLAIN on your queries. That will give you a lot of info on how many rows are examined by your SELECT statements.
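To illustrate what EXPLAIN-style output tells you (using SQLite's EXPLAIN QUERY PLAN here purely because it's self-contained; MySQL's EXPLAIN output looks different but answers the same question), adding an index turns a full table scan into an index search:

```python
import sqlite3

# Build a small table, then compare the query plan before and
# after adding an index on the lookup column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id TEXT, val INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [("k%d" % i, i) for i in range(1000)])

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the detail text.
    return " ".join(row[-1] for row in
                    conn.execute("EXPLAIN QUERY PLAN " + sql))

before = plan("SELECT val FROM items WHERE id = 'k500'")   # full scan
conn.execute("CREATE INDEX idx_items_id ON items (id)")
after = plan("SELECT val FROM items WHERE id = 'k500'")    # index search
```

The "SCAN" vs "SEARCH ... USING INDEX" difference in the plan text is exactly the kind of thing that turns a 30-second query into a millisecond one.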

A website can grow a lot faster than Moore's law would account for.

Zuckerberg's Law?

Though I suppose Moore's law has a head start, and grows faster than the overall population. At some point, if it continued indefinitely, databases would become powerful enough to serve 100% of the human population for a typical app on a single server.

All this Moore's law talk is driving me insane. Improvements in performance are only a side effect of Moore's law.

Moore's law only states that the density of transistors on a chip will double every 2 years. This will fail because of the Zeno's paradox-like effect of limited miniaturization.

At some point, transistors will have to become molecule size, and then atomic, at which point it should be theoretically impossible to get any smaller.

(Sorry for the derailment, and yes, I also go bonkers over centrifugal vs centripetal.)

Finally, DHH may be amazing at writing frameworks, but he is just about the absolute last person I would trust with anything that resembled math.

The issue isn't about math. The opinion of anyone who's maintaining a profitable business should be given extra attention.

I don't agree with that. Running a profitable business doesn't automatically make your opinions on everything unusually worthy of consideration.
