
Disaster: MySQL 5.5 Flushing - LiveTheDream
http://www.mysqlperformanceblog.com/2011/09/18/disaster-mysql-5-5-flushing/
======
seldo
TLDR: on big machines with lots of memory (72GB), vanilla MySQL has great
performance but suffers from minutes-long "stalls" during which throughput
drops to zero. The only viable way the author found to avoid the stalls is to
decrease throughput by 20x by cutting the amount of memory MySQL uses (thus
wasting the hardware).

This is really terrible news for anyone trying to do high-throughput DB
operations on MySQL.

The only "good" news is that the average web app isn't going to be running on
hardware anywhere near as powerful as this, so you are highly likely to
already be in the "slower", consistent state (e.g. anything less than 40GB of
RAM).

~~~
dmk23
Not really: if you are using hardware of this size, chances are you are
hosting it yourself and can install SSDs.

Solid-state drives are the easiest solution to most problems requiring high
I/O throughput and data persistence.

Too bad most cloud hosting vendors are way behind on offering anything like
that.

EDIT: Not sure why this comment got downvoted, but perhaps whoever did it
might comment on why using SSDs is not a solution.

~~~
zokier
Using SSD is kinda "throw more hardware at the problem so that we don't need
to fix our precious crappy software" type of solution.

~~~
rickmb
Although abhorred by software makers such as most HN readers, throwing
hardware at the problem is often the fastest and cheapest way to tackle
performance issues.

Especially when it indeed is "crappy software" that will take major effort to
rewrite/refactor.

Hardware is cheaper than developers, and the former doesn't fail to deliver
half as often as the latter...

~~~
fauigerzigerk
The problem with this principle is that adding hardware tends to scale
linearly at best, whereas improving an algorithm sometimes buys you orders of
magnitude.

Also, you need to compare the cost of a single developer who fixes innodb to
the cost of hardware incurred by _all_ users of innodb who would benefit from
the solution.

~~~
dmk23
Adding SSDs is not linear scaling. SSDs have entirely different
characteristics from rotational drives, which in itself brings a 1-3
order-of-magnitude improvement in I/O performance.

Trying to fit high-I/O workloads onto spinning disks is a lot like spending
your efforts making your program run in 64K of RAM: an interesting challenge
that would surely exercise your coding skills, but one that would not deliver
optimal performance.

~~~
fauigerzigerk
I'm not denying the benefits of SSD. I just don't think throwing hardware at a
problem caused by bad algorithms is a good general principle. It may still be
the best choice in a particular situation and timeframe.

------
mmaunder
This is awesome data as usual from Vadim and team. I always thought the upper
bound of InnoDB performance was limited by memory and how much you can afford.
Now I know different.

I wanted to share some of our data to help in your decision making:

A typical server in our cluster gets 5000 to 7500 TPS:

<http://i.imgur.com/oT26G.png>

This is a 24 hour graph and as you can see there are no multi-minute lockups.
We've been extremely happy with the performance our servers are delivering.

The config is 8GB of memory, RAID 1, 2X 15K RPM disks, single Intel E5410 quad
core CPU.

Total DB size is around 12GB with around 15 million rows. Our TPS is higher
than Vadim's but our data is smaller than the benchmark he uses which is 200
million rows in a 58 GB database.

With the following for InnoDB:

    innodb_file_per_table = 1
    innodb_flush_log_at_trx_commit = 0
    innodb_buffer_pool_size = 5G
    innodb_additional_mem_pool_size = 20M
    innodb_log_file_size = 1024M
    innodb_log_buffer_size = 16M
    innodb_max_dirty_pages_pct = 90
    innodb_thread_concurrency = 4
    innodb_log_group_home_dir = /var/lib/mysql/
    innodb_log_files_in_group = 2

So my approach to avoiding this issue will be to shard to finer granularity
and avoid architectures with monolithic DBs. Keep in mind that this only
affects your ability to cache greater than around 16GB, so spreading massive
data across multiple servers with fast drives and moderate memory will help
bring your TPS back up. I would also add that servers with memory > 32GB are
very expensive - especially the memory itself. So it may not cost much more to
have 5 servers with fast disk and moderate memory vs one server with massive
memory.

Having said all that this is obviously a very serious problem and perhaps an
opportunity for a talented computer scientist to branch InnoDB and solve it if
Oracle doesn't want to.

~~~
spudlyo
_innodb_flush_log_at_trx_commit = 0_

This is cheating, you've effectively disabled the 'D' in ACID; you aren't
doing an fsync() at every commit.

~~~
mmaunder
Cheating what? As I said, it's just a typical server in our cluster and
hopefully helpful data.

~~~
robryan
_A value of 1 is required for ACID compliance. You can achieve better
performance by setting the value different from 1, but then you can lose at
most one second worth of transactions in a crash. With a value of 0, any
mysqld process crash can erase the last second of transactions. With a value
of 2, then only an operating system crash or a power outage can erase the last
second of transactions. However, InnoDB's crash recovery is not affected and
thus crash recovery does work regardless of the value._

Just depends whether or not you can deal with a small amount of lost data in
the event of a crash.
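For reference, the trade-off the manual describes comes down to a single
setting, which can be inspected and changed at runtime. A sketch (standard
MySQL syntax, not from the thread):

    -- see the current value
    SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
    -- 1 = fsync the log at every commit (full 'D' in ACID)
    -- 0/2 = faster, but up to ~1 second of transactions can be lost
    SET GLOBAL innodb_flush_log_at_trx_commit = 1;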

------
sgt
I switched to PostgreSQL several years ago. Bulletproof, exactly the features
I need (and more), and no problems like these. Speaking of which, I'm about
to upgrade my installations to 9.1 now.

~~~
sc68cal
Are you running an equivalent hardware setup? Similar to:

>Hardware: HP ProLiant DL380 G6, with 72GB of RAM and RAID10 on 8 disks.

Playing devil's advocate - there is no proof that PostgreSQL does not have any
teething problems on high capacity hardware either.

~~~
pointyhat
The issue is x86/x64 being a turd. Throughput of just about everything drops
when you go over the 36GB boundary for some reason; I have no idea why. It
can drop by as much as 30%. This is from experience operating VMware on
physically the same type of kit (except with FC SAN / NetApp).

I'd personally rather buy high-end SPARC64 kit, which has linear performance
scalability, but we all know what happened to Sun, plus it doesn't run
Windows anyway :( _whimpers a little_

~~~
jeremyw
Be careful of your NUMA boundaries. 20-30% is exactly the cross-bank access
cost.

~~~
spudlyo
That's what I thought too. There are a lot of weird problems that can happen
due to NUMA architecture issues. Reading the numactl man page really drives
home just how much I don't understand it.

<http://www.linuxmanpages.com/man8/numactl.8.php>
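To make that concrete: the commonly cited workaround for large-buffer daemons
like mysqld is to interleave their allocations across nodes, so the buffer
pool doesn't exhaust a single memory bank. An illustration (my assumption
that it applies here, not something from the thread):

    # show the topology: node count and per-node free memory
    numactl --hardware

    # start mysqld with allocations spread round-robin across all
    # NUMA nodes instead of filling up one node first
    numactl --interleave=all mysqld_safe &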

------
schiptsov
Yet another self-promotion article. Oh, we stuff data in memory and then when
it is time to write the changes to disk everything freezes! MySQL sucks, buy
our solution! Nice guys. :)

First, it is useless to keep data in memory if you want it committed to disk
storage. The general idea here isn't about switching to some new version of
MySQL or to SSDs; it is about realizing that your data flow is inadequate for
a one-server architecture.

Second, checkpoint intervals should be adjusted to your actual data flow,
which means they should be executed often enough. If you are in a state of
near-constant checkpointing (non-stop data writes), that is a sign you need
to consider a sharded/multi-server solution.

The hints that there must be no other disk activity on the same hardware
volume, and no swapping in the OS, are, I suppose, obvious. People who have
/var/log and /var/db on the same volume are idiots.

It is also a good idea to use one-file-per-table storage and to put the data
and the physical logs on separate hardware volumes (symlinks are your
friends). One RAID-X volume that fits all is a rather naive solution. RAID
isn't a guarantee of reliability; replication to backup servers is.

Third, when you test your configuration before putting it into production,
you should tune your servers to perform with data and log syncing, and then
figure out appropriate buffer sizes and checkpoint intervals. Then, in
production, when you're experiencing an increasing flow of queries, you may
choose to switch to a different syncing strategy and/or more frequent but
slightly faster checkpoints.

Configuring MySQL with huge buffers and no sync means a lack of understanding
of the basic concepts, self-delusion, and misuse of software and hardware. ^_^
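The checkpoint tuning being described comes down to a handful of InnoDB
settings. A my.cnf sketch with illustrative values only (to be tuned per
workload, not a recommendation from the thread):

    [mysqld]
    # bigger redo logs = less frequent forced checkpointing
    innodb_log_file_size = 1024M
    innodb_log_files_in_group = 2
    # how many dirty pages may accumulate before flushing is forced
    innodb_max_dirty_pages_pct = 75
    # how many IOPS the storage can absorb for background flushing
    innodb_io_capacity = 200
    # spread flushing out instead of bursting at checkpoints
    innodb_adaptive_flushing = ON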

~~~
morgo
"Yet another self-promotion article. Oh, we stuff data in memory and then
when it is time to write the changes to disk everything freezes! MySQL sucks,
buy our solution! Nice guys. :)"

You got this part all wrong. He explicitly said that 'their solution' is
unlikely to do any better. This is a real problem.

~~~
schiptsov
_What to do? I wish I could say you should use Percona Server with XtraDB. If
we were using SSD as storage, then I would recommend it._

------
teoruiz
How is this different in MySQL 5.1?

From the article I get the impression that this is a new problem introduced
with 5.5.

~~~
spudlyo
It's not. The adaptive flushing algorithm has improved over the various
versions, but checkpointing has been the cause of performance stalls for a
long time. Here is an early bug report of the problem from 2006.

<http://bugs.mysql.com/bug.php?id=19673>

------
moe
Why does he use flush_method = O_DIRECT?

Last time I checked that was a recipe for terrible performance. We normally
leave the flush_method set to default and keep a couple slaves around to
minimize the potential for data loss.

~~~
morgo
Common myth. Real answer: it depends. I recently had a bug approved to update
the manual: <http://bugs.mysql.com/bug.php?id=54306>
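For reference, the setting under discussion is a single my.cnf line; whether
it helps is, as noted, workload- and hardware-dependent:

    [mysqld]
    # bypass the OS page cache for data files; InnoDB caches in the
    # buffer pool anyway, so double-buffering is often wasted RAM
    innodb_flush_method = O_DIRECT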

------
Andys
I can't see what was wrong with the second graph. Regular bursts of writes
for flushing leave the disks more idle for low-latency reads. Optimising
throughput doesn't mean you have to have a smooth, flat write graph.

(For comparison, see ZFS)

------
zokier
I wonder if sharding in some clever way could alleviate the problem: some way
of having one instance processing queries while the other is flushing logs.

------
tedjdziuba
> Disaster: MySQL 5.5 Flushing

> Disaster: MySQL 5.5

> MySQL

I found your problem right here.

------
mrinterweb
I have personally noticed MySQL having horrible performance for views that
use joins. Every time a view does a join, it creates a temporary table, and
you might as well throw indexing out once that happens. I had a query that
was taking 4 seconds with a view; using the same SQL minus the view got it
down to 5ms. I know this is not specific to flushing, but it is still an
example of a potential performance problem.

~~~
rhizome
That sounds like a bad data model and/or bad MySQL usage to me.

~~~
mrinterweb
I recently learned about the difference between the MERGE and TEMPTABLE
algorithms a MySQL view can use. I was using the default TEMPTABLE approach
and did not know about the MERGE algorithm that can be specified for views.
In my experience with MySQL's views, MySQL does not do much to optimize
execution plans for them.
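For anyone hitting the same thing: the algorithm can be specified at view
creation time. A minimal sketch (table and column names invented for
illustration):

    -- merge the view's query into the outer query so table
    -- indexes remain usable, instead of materializing a temp table
    CREATE ALGORITHM = MERGE VIEW active_users AS
        SELECT id, name FROM users WHERE active = 1;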

~~~
IgorPartola
By default, MySQL will use the MERGE algorithm. It will create a temp table
if the result of the view is large, and you can control what "large" means.
That is a better way of controlling the behavior than potentially telling
MySQL that it needs to create a multi-GB result set but must keep it all in
RAM. Read the manual or the O'Reilly book for more info.

Another option is to use materialized views, which are not natively supported
in MySQL but can easily be simulated.
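One common way to simulate a materialized view (a sketch with invented
names): keep a plain table and refresh it periodically from cron or the
event scheduler:

    -- the "materialized view" is just a regular, indexable table
    CREATE TABLE user_counts (
        country CHAR(2) PRIMARY KEY,
        n INT NOT NULL
    );

    -- hypothetical refresh step, run on a schedule
    TRUNCATE user_counts;
    INSERT INTO user_counts
        SELECT country, COUNT(*) FROM users GROUP BY country;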

~~~
jeltz
I do not see why views need to be so complex in MySQL. Why not do like in
PostgreSQL where they are almost just subqueries saved within the database.

~~~
IgorPartola
That is essentially what they are. The merge algorithm _merges_ your query
with the query of the view and returns the result. However, if the query of
the view returns a very large result set, it is much faster to pre-flight it,
then select against it.

~~~
jeltz
Yeah, but letting users specify whether views are MERGE or TEMPTABLE seems
quite pointless. Why not always use UNDEFINED and let the planner choose? The
planner knows which tables are large.

