

Don't Bet on Moore Saving Your Ass - prakash
http://jeremy.zawodny.com/blog/archives/010841.html

======
ruby_roo
I'm very interested in book recommendations for scaling websites from hundreds
of users to millions of users. I'm interested in both database and web server
configurations as well as hardware details, and war stories from the trenches.
Also, I'd like to know if there are bits of wisdom I can incorporate into my 1.0
codebase to make scaling easier should I be fortunate enough to have that
problem.

Ideas from the HN crowd?

~~~
mdasen
<http://oreilly.com/catalog/9780596102357/>

Building Scalable Websites by the Flickr guy.

------
nihilocrat
Does anyone know if PostgreSQL has this similar multicore bottleneck?

~~~
mdasen
PostgreSQL is much better at using multiple cores, but it doesn't matter (at
least not for web applications).

Web applications don't have a problem handling one person. They have a problem
handling 1,000 people who all want to load a page at the same time. That means
you're not worried about devoting all your server resources to one query.
You're going to have 1,000 queries, so roughly 250 land on each core of a
quad-core box, and you don't care that MySQL (InnoDB really) can't parallelize
a single query across cores. PostgreSQL wouldn't benefit much from it either,
since each core is already kept busy running a different query.

Multi-core capabilities within a single query could help when you are running
fewer concurrent queries than you have processor cores. But once you get to
4-8 concurrent users, you don't need to worry about it because the separate
queries will saturate all the cores anyway; and below 4-8 concurrent users you
don't need to worry about it because it's easy to serve that few users.

Now, for data warehousing and other applications, multi-core capabilities can
be the difference between a query taking 1 hour and a query taking 4 hours.
But that's not a web application. In fact, using MySQL for data warehousing is
just a bad idea.

The author is totally right that you shouldn't just expect technology to save
you (and I personally think DHH is taking the "wait until it is a problem; if
it becomes a problem and new hardware can solve it, do that; if not, then deal
with it" approach rather than simply expecting a solution). However, web
applications are about running many things at once, not running one big thing.

P.S. Databases usually aren't CPU constrained. It's almost always memory or
disk that slows you down. Spend the money on more RAM or better disks and
don't worry about the CPU so much.

P.P.S. Both MySQL and PostgreSQL are wonderful databases for web applications
so let's not turn this into some silly flame war, please?

~~~
sounddust
PostgreSQL is much better at using multiple cores, and it absolutely _does_
matter for web applications.

When you have 1,000 people who all want to load a page at the same time,
you're only going to hit the database once for that.

What you will have instead is 5,000 people who want to go to 1,000 different
pages, some of them updating data and some of them retrieving it, and you will
have 1,000 different queries that are either selects or updates.

Now, PostgreSQL was designed from the start to be robust enough to handle this
type of concurrency - to allow the minimum amount of data to be locked while
updating, to free the selects from being blocked by those writes. As a result
of being built on such a solid foundation, it has been easy to optimize over
the years, and has supported the emergence of common multi-core computing
quite well.

MySQL, on the other hand, was designed to return queries extremely quickly,
with as little getting in the way of doing so as possible. Unfortunately,
their model did not scale as well. When you have 500 reads and 1 write, and
you lock the entire table for that write, it's not a big deal because the
write takes no time and you're unblocked again. But when you have an array of
hundreds of simultaneous reads and writes, it becomes a mess if you don't have
fine-grained locking capabilities. That having been said, MySQL is progressing
as well.

But I think (not trying to start a flame war) that PostgreSQL did things
slowly and correctly in the beginning and it's paying off pretty well.

Finally, I disagree that databases are not CPU constrained. Any website with
decent traffic will have the database entirely cached in RAM at all times,
with the only disk activity being the journaling.

~~~
mdasen
PostgreSQL was designed "correct" from the start. However, MySQL does many of
the same things today.

You start by saying you'll have 5,000 people who want 1,000 different pages,
some updating data and some retrieving it. Well, once you have more concurrent
queries than cores, the benefit of splitting a single query over multiple
cores is gone, since at that level of concurrency no query could claim even a
whole core to itself anyway.

You go on to say that PostgreSQL was designed to eliminate unnecessary locking
so that selects wouldn't be blocked. MySQL's MyISAM storage engine doesn't
support this, that is correct. However, InnoDB does support the same MVCC
model that PostgreSQL uses and likewise eliminates the locking issue.
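As a rough illustration of the MVCC idea both engines implement - readers keep a consistent snapshot while a writer commits underneath them - here is a sketch using SQLite in WAL mode, which uses a similar snapshot model (the details differ from both PostgreSQL and InnoDB; table and values are invented):

```python
import os
import sqlite3
import tempfile

# Two connections to the same file; WAL mode lets a writer commit while
# a reader holds an open read transaction (its snapshot).
path = os.path.join(tempfile.mkdtemp(), "mvcc.db")
reader = sqlite3.connect(path, isolation_level=None)
reader.execute("PRAGMA journal_mode=WAL")
reader.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
reader.execute("INSERT INTO t VALUES (1, 'old')")

writer = sqlite3.connect(path, isolation_level=None)

reader.execute("BEGIN")  # the first SELECT below pins a snapshot
assert reader.execute("SELECT val FROM t").fetchone()[0] == "old"

writer.execute("UPDATE t SET val = 'new' WHERE id = 1")  # not blocked

# The reader still sees its snapshot, not the committed write...
assert reader.execute("SELECT val FROM t").fetchone()[0] == "old"
reader.execute("COMMIT")
# ...until its own transaction ends.
assert reader.execute("SELECT val FROM t").fetchone()[0] == "new"
```

The select never blocked the update, and the update never disturbed the in-flight read - which is the point both commenters are circling.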

Your argument rests on MySQL locking an entire table to do a write: "When you
have 500 reads and 1 write, and you lock the entire table for that write, it's
not a big deal. . . But when you have an array of hundreds of simultaneous
reads and writes, it becomes a mess if you don't have fine-grained locking
capabilities." A great article on this problem is here:
<http://www.onlamp.com/pub/a/onlamp/2001/05/25/postgresql_mvcc.html>.
It's great that PostgreSQL supports that, but MySQL supports it too today
(<http://wiki.oracle.com/page/InnoDB?t=anon>).

There are plenty of things that one can say are issues between the two.
MySQL's inferior query planner. The fact that PostgreSQL can suggest indexes
based on query history. PostgreSQL's weak replication (Slony-I's communication
costs grow quadratically, yuck!). MySQL's acceptance of February 31st as a
real date. MySQL's poor subquery optimization. PostgreSQL's more limited data
partitioning.

They aren't equal in all ways and PostgreSQL is a wonderful database, but
saying that MySQL needs to lock a table to do a write is just wrong in a very
partisan manner. I've never really understood such partisanship. Knowing the
strengths and weaknesses of multiple products makes you aware of what is good
for a project and what isn't. Even better, once you're using one of them, you
know what to do and what to avoid with it. Not confronting the reality of how
alternative systems work just means that the chance of picking the best system
is more luck than information. For what it's worth, I use PostgreSQL in my
personal projects. It's great. However, it's also important to understand that
MySQL of 2008 is not MySQL of 2001. It's come a long way in the "correctness"
camp and the old arguments about Multi-Version Concurrency Control don't apply
anymore.

Oh, and from Power PostgreSQL, Disks > RAM > CPU
(<http://www.powerpostgresql.com/PerfList>).

~~~
sounddust
1) I understand that MySQL (with InnoDB) supports row-level locking today. I
was speaking of both databases as they existed in the past, to express how
much easier it was for PostgreSQL to optimise and scale over the past decade.

2) I am not trying to slam MySQL. But seven years ago, we had one database
which was not ACID compliant and had inconsistent behavior, but was very fast.
And we had another which was designed and built properly from the ground up -
with the future in mind - but was quite slow. There were advantages and
disadvantages to each one. But over the past seven years, PostgreSQL has had
time to optimise and stabilize code, and as a result it performs just as well
as MySQL in most cases. Whereas in the past seven years, MySQL has worked to
implement the essential features that allow it to be a robust database. If you
were starting a website in 2009, why would you pick MySQL? I could easily
understand why in 2001, but not now.

3) My point is that the "Disks > RAM > CPU" argument is no longer valid. The
point of that statement was that you should spend your money on disks first,
then RAM, and focus less on CPUs. This is not true in 2009! Now (for web
apps), you can easily buy enough RAM to cache your entire database. All of a
sudden, RAM and Disks are no longer an issue and your DB becomes CPU limited.
All that discussion about "more spindles = better" and "raid 1+0 > raid 5" is
not very important anymore; you just need a disk that's fast enough to log
your transactions.

~~~
mdasen
Well, I personally choose PostgreSQL for my personal stuff, but your logic
doesn't hold. PostgreSQL was slow, but correct. Now it is fast and correct.
MySQL was fast, but wrong. Now it is fast and correct. Most of the reasoning
behind picking one or the other has disappeared and we're left scraping the
bottom of the nitpick barrel trying to convince people to use one over the
other.

As for why someone would choose MySQL, there are a bunch of reasons. There are
a lot more people with MySQL experience out there. MySQL has better
replication facilities (and I've set up replication with MySQL, Slony-I and
PgPool-II). I'd really like to see Mammoth Replicator become the standard in
the PostgreSQL community (as well as for 1.8 to be out of beta) as I think
it's a considerably better replication design than the other options in
PostgreSQL, but right now MySQL replication looks a lot better. Maybe you have
a good use for one of MySQL's less used storage engines. MySQL Cluster looks
interesting, but I wouldn't trust my data to it today (even if Zillow seems to
think it's the best thing since sliced bread).

The differences between the two are really minor today. Choose whichever one
you like, but there are definite reasons to choose either one.

------
stcredzero
One thing to make note of: if you are in a space where you have massive
concurrency, the right kind of hardware _can_ help you out. There was a time
in the mid 90's when mainframes with CPUs we would regard as quaintly slow
could still achieve I/O throughput that could dwarf the throughput of a PC
with an order of magnitude faster CPU.

------
axod
Quite a few webapps just shouldn't be using a db for most things. It's lazy
programming.

~~~
mixmax
What then?

Asking out of curiosity/ignorance, not trolling

~~~
mynameishere
I think this website (hacker news) just uses persisted hash tables. I'm not
sure why he thinks databases indicate "lazy programming" though.

~~~
axod
It's really easy to just throw everything in a db, and then start worrying
when you scale.

It takes more programming to think how best to organize things - what stuff
should be in memory? How best to store it in ram for performance/size? What
should be on the disk as flat files? What needs to be in a db? Should parts of
the db be cached in ram and just used for writes etc.

Also does _everything_ need to be in a db? Or are some things better dealt
with by just passing messages around, queuing them up if needed etc.
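As a toy sketch of the "pass messages around, queue them up" idea, here is Python's stdlib `queue` standing in for a real message queue - the event names and batch size are invented for illustration:

```python
import queue
import threading

# Page-view events are queued and written in batches by a background
# thread, instead of hitting the database once per request.
events = queue.Queue()
batches = []  # stand-in for "one DB write per batch"

def writer():
    batch = []
    while True:
        item = events.get()
        if item is None:          # sentinel: flush what's left and stop
            if batch:
                batches.append(batch)
            return
        batch.append(item)
        if len(batch) >= 100:     # flush every 100 events
            batches.append(batch)
            batch = []

t = threading.Thread(target=writer)
t.start()
for i in range(250):              # 250 simulated page views
    events.put(("page_view", i))
events.put(None)
t.join()
# 250 events became 3 writes: two full batches of 100 and one of 50.
```

Whether this beats "just put it in the db" depends entirely on the write pattern, which is axod's point about thinking it through first.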

~~~
ars
> Should parts of the db be cached in ram and just used for writes etc.

You are working way too hard. Let the OS figure out what needs to be in RAM -
it does it automatically anyway, and it does a better job than you can, since
it caches what is actually used, not what you think should be used.

You should not use flat files for web apps - they don't handle concurrency
very well.

> Also does everything need to be in a db? Or are some things better dealt
> with by just passing messages around, queuing them up if needed etc.

Message passing and a db are not interchangeable, so that's a false choice.

~~~
abstractbill
_You are working way too hard. Let the OS figure out what needs to be in RAM -
it does it automatically anyway, and it does a better job than you can, since
it caches what is actually used, not what you think should be used._

So why does memcached exist?

~~~
ars
> So why does memcached exist?

To cache the results of complicated joins (or queries without indexes).

It's pretty much pointless if all you are doing is caching the result of a
simple query using an index.
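A minimal cache-aside sketch of that point, with an in-memory SQLite database standing in for the real database and a plain dict standing in for memcached (schema and key format are invented for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER)")
db.execute("INSERT INTO users VALUES (1, 'ada')")
db.executemany("INSERT INTO orders (user_id) VALUES (?)", [(1,)] * 3)

cache = {}  # stand-in for memcached

def user_order_count(user_id):
    """Cache-aside: only the result of the join is cached."""
    key = "user_orders:%d" % user_id
    if key not in cache:
        # Cache miss: run the "complicated join" once...
        cache[key] = db.execute(
            "SELECT u.name, COUNT(o.id) FROM users u "
            "LEFT JOIN orders o ON o.user_id = u.id "
            "WHERE u.id = ? GROUP BY u.name", (user_id,)
        ).fetchone()
    # ...then every later call skips the database entirely.
    return cache[key]
```

The win comes when the query is expensive relative to a dict lookup; for an indexed single-row select, as ars says, the gap mostly disappears (at least on one machine).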

~~~
abstractbill
_It's pretty much pointless if all you are doing is caching the result of a
simple query using an index._

Not if your database is under heavy load, and you can easily shift some of
that load by putting frequently accessed things in memcached instead.

~~~
ars
That's if you have two machines. The comparison was vs keeping stuff in a hash
table in memory, and I was saying databases are no worse.

~~~
abstractbill
_The comparison was vs keeping stuff in a hash table in memory, and I was
saying databases are no worse._

But that's clearly not true. In the most extreme case, that hash table is
referenced simply by a variable in your program - it's already in your
program's address-space! There's no way a database can come close to that.

~~~
gnaritas
Not to mention that you can hash arbitrary objects in a hash table with no
mapping of any kind.


    hash at: key put: anObject

Databases are vastly more complicated and require me to completely disassemble
the object graph anObject may contain into a set of tables and rows to store
it, and then reassemble the graph from tables and rows back into its object
form when fetching.

The moment one commits to using a relational database, one often easily
triples the size of the code base. There's nothing simple about that.

------
likpok
One other thing: Moore's law is fundamentally limited. Remember that an
exponential function grows without bound as t -> infinity, but the physical
world doesn't.

This should be enough to show you that at some point, Moore's law will end. It
is less important right now (there is still some development going on, and new
paradigms may arrive in the far future), but the world is not limitless.

------
russell
I think all kinds of components and applications are going to hit the wall.
The python interpreter doesn't even support multiple real threads let alone
multiple cores. Programmers don't know how to write the code. Here is an
article musing about the problems in Java:
<http://www.devwebsphere.com/devwebsphere/2006/11/multicore_may_b.html>
Do we all need to learn erlang?

~~~
stcredzero
I'd rather have some sort of support for multiple cores without real threads.

Stackless seems like a good alternative to Erlang.

Take the two statements above as standalone.

~~~
russell
I agree that Stackless is better. Threads are very heavyweight. The problem is
that the underlying C interpreter works on only one core. If Stackless can
work with multiple interpreters then it may be the greatest thing since sliced
bread. I haven't been following it closely enough to say for sure.

~~~
stcredzero
All you have to do is build a mechanism that distributes work between
interpreter processes, then spawn an interpreter for each core. For many
applications, that's all you need.

------
zandorg
I only just started using MySQL to store tables of around 100,000 entries.
But MySQL would take literally more than 30 seconds (sometimes minutes) for a
query that PostgreSQL (which I installed afterwards) can handle in 10 seconds.

It seems like MySQL is therefore dreck. I can't see any reason not to use
PostgreSQL.

~~~
zmimon
There's something terribly, horribly wrong with either your database setup or
your queries. 100,000 entries is minuscule. Do you have indexes on the join
columns in your queries?

(But do use Postgres, it's better than MySQL for most cases!)

~~~
zandorg
I learned SQL at University, but that was 5 years ago so I'm a bit rusty.
There's 1 table with a primary key (a string ID), which is used for rows with
duplicate IDs, so in theory MySQL should be quick with it.

Thanks for the info, I'll have to dig further, but at least now I know it
should be fast even in MySQL.

~~~
thamer
Try using EXPLAIN on your queries. That will give you a lot of info on how
many rows are examined by your SELECT statements.
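The thread is about MySQL's EXPLAIN, but the same habit can be sketched with SQLite's analogous EXPLAIN QUERY PLAN (the table and index names here are invented for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entries (id TEXT, body TEXT)")
db.executemany("INSERT INTO entries VALUES (?, ?)",
               [("key%d" % i, "x") for i in range(1000)])

query = "SELECT * FROM entries WHERE id = 'key500'"

# Without an index, the plan's detail column reports a full table scan...
plan_before = db.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan_before[0][-1])  # mentions a SCAN of entries

# ...after adding one, it reports a search using the index instead.
db.execute("CREATE INDEX idx_entries_id ON entries (id)")
plan_after = db.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan_after[0][-1])   # mentions idx_entries_id
```

If zandorg's 30-second queries show a table scan in their plan, a missing index on the filtered column is the usual culprit, in MySQL and PostgreSQL alike.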

------
mattmaroon
A website can grow a lot faster than Moore's law would account for.

~~~
mattmaroon
Though I suppose Moore's law has a head start, and grows faster than the
overall population. At some point, if it continued indefinitely, databases
would become powerful enough to serve 100% of the human population for a
typical app on a single server.

~~~
icey
All this Moore's law talk is driving me insane. Improvements in performance
are only a side effect of Moore's law.

Moore's law only states that the density of transistors on a chip will double
every 2 years. That doubling will eventually fail because miniaturization,
Zeno-like, runs into hard physical limits.

At some point, transistors will have to become molecule size, and then atomic,
at which point it should be theoretically impossible to get any smaller.
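A hedged back-of-envelope on that limit, assuming roughly 45 nm features (circa 2008) and about 0.2 nm atomic spacing in silicon; since density doubles when linear feature size shrinks by a factor of sqrt(2), the remaining runway is:

```python
import math

feature_nm = 45.0          # assumed 2008-era feature size
atom_nm = 0.2              # rough silicon atomic spacing
years_per_doubling = 2

# A 225x linear shrink is a 225^2 gain in density; each density
# doubling costs one halving, so count the doublings and the years.
shrink_ratio = feature_nm / atom_nm           # ~225x linear
doublings = math.log2(shrink_ratio ** 2)      # ~15.6 density doublings
years = doublings * years_per_doubling        # ~31 years of runway
```

Crude as the inputs are, the conclusion is robust: a few decades at most before single-atom features, which is exactly the "theoretically impossible to get any smaller" point.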

(Sorry for the derailment, and yes, I also go bonkers over centrifugal vs
centripetal.)

Finally, DHH may be amazing at writing frameworks, but he is just about the
absolute last person I would trust with anything that resembled math.

~~~
skenney26
The issue isn't about math. The opinion of anyone who's maintaining a
profitable business should be given extra attention.

~~~
mattmaroon
I don't agree with that. Running a profitable business doesn't automatically
make your opinions on everything unusually worthy of consideration.

