
Fixing a Disk Space Alert at 3 AM - mrastian
http://slashmili.org/blog/2015/07/13/fixing-disk-space-alert-at-3am/
======
lmm
Randomly rewriting a file that mysqld has open seems a lot more dangerous than
restarting mysqld. What if that was e.g. where mysqld was batching up query
results before streaming them to the client? Now you've not merely had a query
fail, you've silently sent back incorrect results.

If you can't restart mysqld you have an architectural problem. Your database
will go down sometimes and your system should be built to tolerate that.

~~~
peterevans
Good thing the author wasn't randomly rewriting a file, then, and that the
file was deleted and therefore presumably unlikely to be used in a critical
operation!

Restarting MySQL is a violent operation. For large databases, it can take
several minutes. If I can avoid it, I do so, and this was a really clever
workaround for an issue that indeed avoided a restart. That's a good thing.

~~~
takeda
I find it quite ironic that you and the author of the article completely
misunderstand what is happening.

Creating and then deleting a temporary file is a very common technique to
make sure that the file will be removed as soon as the application closes it;
it is also used to make sure that nothing will tamper with it. I find it
amusing that the author found a way to still tamper with it.

Now, depending on what the file was used for, the corrupted file either
caused the database to return corrupted data to the user, write corrupted
data to the database, or, in the best scenario, return an error to the
application issuing the query.
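
For reference, the pattern itself is just this (a minimal Python sketch; the
filename is made up):

    import os

    # Create a temp file, then immediately unlink it: the name vanishes
    # from the directory, but the open descriptor keeps the inode (and
    # its disk blocks) alive until the fd is closed.
    fd = os.open("/tmp/scratch.tmp", os.O_CREAT | os.O_EXCL | os.O_RDWR, 0o600)
    os.unlink("/tmp/scratch.tmp")

    os.write(fd, b"spill data")    # still fully usable
    os.lseek(fd, 0, os.SEEK_SET)
    print(os.read(fd, 10))         # b'spill data'

    os.close(fd)                   # only now does the kernel free the space

Once the name is gone, only the fd keeps the data alive, which is also why
du can't see the space.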

~~~
leni536
open's O_TMPFILE (since Linux 3.11) does exactly that without creating a named
file and immediately removing it from the filesystem. I don't argue that the
other technique isn't common though.
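
A minimal sketch with Python's os module (Linux-only; os.O_TMPFILE has been
exposed there since 3.4, if I recall correctly):

    import os

    # Linux >= 3.11: create an unnamed temp file in one step. No name
    # ever appears in the directory, so there is no window in which
    # another process could open or tamper with it.
    fd = os.open("/tmp", os.O_TMPFILE | os.O_RDWR, 0o600)
    os.write(fd, b"spill data")
    os.close(fd)                   # space reclaimed; nothing to unlink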

~~~
takeda
The older technique is probably more common, since it is available in older
versions of Linux and is also compatible with other Unix-based systems.

That said, it still doesn't prevent the author from modifying the file by
referencing its file descriptor.
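
Something like this, which is essentially what the article did (pid and fd
numbers are hypothetical, as read off lsof output; do not try this on a live
mysqld):

    # A deleted-but-open file is still reachable through procfs, and
    # opening /proc/<pid>/fd/<n> with mode "w" truncates it out from
    # under the process that owns it.
    pid, fdnum = 12345, 42         # hypothetical values from lsof
    with open(f"/proc/{pid}/fd/{fdnum}", "w"):
        pass                       # "w" implies O_TRUNC on open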

------
falcolas
Former MySQL admin's advice (for OP, and others who run into this same issue):

Someone probably started running queries which were doing disk sorts. Look for
abusive queries coming in and kill those; the temporary files will go away at
that point. Based on the size of the files, simply looking for long-running
queries should be sufficient; if it's the backend for a web server, it's
likely that the client and server have already given up on the query (or, in
the worst case, re-sent the debilitating query).
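
For the hunting part, something along these lines works; a rough sketch with
Python and the PyMySQL driver, with invented credentials and an arbitrary
threshold:

    import pymysql

    # Placeholder credentials; point this at the affected server.
    conn = pymysql.connect(host="127.0.0.1", user="admin", password="secret")
    with conn.cursor() as cur:
        cur.execute("SHOW FULL PROCESSLIST")
        for row in cur.fetchall():
            # Standard columns: Id, User, Host, db, Command, Time, State, Info
            thread_id, command, seconds, query = row[0], row[4], row[5], row[7]
            if command == "Query" and seconds > 300:   # 5 minutes, arbitrary
                print(f"long runner {thread_id} ({seconds}s): {query}")
    conn.close()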

As stated by other users, truncating random files will cause more problems for
MySQL than just restarting mysqld. In fact, I'd recommend going in and
restarting it now, to ensure that you're in a good state. Failing over to your
slave (you do have a slave and a failover procedure, right?) is less of a
headache than trying to identify what problems were caused by manually
truncating these sort files.

Finally, have a look at Skyline from Etsy[1]; a trend-monitoring tool like
this would have alerted you when the ramp-up started closer to 1am, well
before this was suddenly an outage event.

[1] [https://github.com/etsy/skyline](https://github.com/etsy/skyline)

~~~
jerf
There's a quick mention that this was a reporting server. I'd guess the
reports have a common query in them that yesterday did not spill to file
sorts, and today does, so literally overnight the report process goes from
using virtually no disk space to using arbitrary multiples of how much you
have.

Nothing really great leaps to mind to solve this in the general case. These
sorts of correlated behaviors can really be jerks.

~~~
falcolas
In this case I'd look for a recent table alter which involves a blob-style
column - queries involving those will always be sorted on disk. Limiting the
queries to not include the blob fields, or altering those columns to varchar
(even big ones), will help this out.

Having a daemon which kills off long-running queries (such as ones with
extensive disk sorts) can help as well; just be sure to follow up on the
queries which were killed, to fix the frontend or chastise the person doing
`SELECT * FROM blob_table ORDER BY id`.
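
Percona's pt-kill does this off the shelf; a bare-bones homemade version
(same caveats as above: PyMySQL, invented credentials and threshold) might
look like:

    import pymysql

    def reap_long_queries(max_seconds=600):
        # Placeholder credentials; run this from cron or a supervised loop.
        conn = pymysql.connect(host="127.0.0.1", user="admin", password="secret")
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW FULL PROCESSLIST")
                for row in cur.fetchall():
                    thread_id, command, seconds, query = row[0], row[4], row[5], row[7]
                    if command == "Query" and seconds > max_seconds:
                        # Log before killing, so there's a record to follow up on.
                        print(f"killing {thread_id} after {seconds}s: {query}")
                        cur.execute(f"KILL {int(thread_id)}")
        finally:
            conn.close()

    reap_long_queries()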

------
perlgeek
Opening files and then deleting them is a classic pattern for using temp
files; it makes sure that they will be gone when the process exits. Finding a
deleted but still-open file is no indication that it's not needed anymore.

My approach would have been to kill the mysql worker process that keeps them
alive. That way the program that started the query gets an error message,
instead of whatever undefined behaviour you get by emptying the files and
surprising mysqld.

------
feld
Top commenter on that article:

    First of all... If you're getting paged on disk alerts at 3 in the morning,
    you're doing it wrong. Write a script that checks every file system and
    makes sure that X amount of free space is always present, adding free space
    as-needed, and emailing you on the backend during business hours whenever
    the volume group is nearing depletion.

I'm sorry, but no -- I'm not going to automate adding more disk to a VM,
extending a volume group, and growing a filesystem. Not now, not ever. No way
in hell.

That needs to be done by a human after a backup has been tested and writes
have been quiesced.

    This is what grown-ups do. Storage is cheap. Downtime is not.

Then I don't want to grow up like you. Have fun recovering your corrupted
filesystem.

~~~
digi_owl
Welcome to the devops world, where admin is boring and coding is fun...

~~~
toomuchtodo
Eh, sort of. I do Devops/Infrastructure exclusively, and while the tedious
stuff is automated away, I've seen too much shit break on the other side of
the fence to trust it fully (AWS API throwing errors when you absolutely need
it to work, etc).

If something is broken, you should have enough automation to get it into a
known good state without a human involved, while maintaining data consistency.
Anything else should be automated, but you should still be around to babysit
it while it's going through the motions.

If you think you can automate everything and always trust it to work
flawlessly, you haven't been around long enough for the edge cases.

~~~
digi_owl
> If you think you can automate everything and always trust it to work
> flawlessly, you haven't been around long enough for the edge cases.

And then you have some "clever" coder coming along with a fix for said edge
case (one that invariably creates some new edge cases just outside the domain
of the fix).

------
Maakuth
It is sometimes useful to keep around a 1-10GB 'ballast' file full of zeros
that you can remove if you're in this kind of emergency. Tuning the root
reserved space, as mentioned in the article comments, is another useful trick.
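
Creating the ballast up front is cheap; for instance, in Python (path and
size are just placeholders):

    import os

    # Reserve ~2 GB of real blocks now, while the disk is healthy.
    # In an emergency: delete the file, breathe, then fix the actual problem.
    fd = os.open("/var/ballast", os.O_CREAT | os.O_WRONLY, 0o600)
    os.posix_fallocate(fd, 0, 2 * 1024**3)
    os.close(fd)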

~~~
michaelx386
I love this ballast file idea. It reminds me of a programming story by Noel
Llopis[0] where a game developer was working with a strict memory limit and
placed `static char buffer[1024 * 1024 * 2];` in the code as an insurance
policy[1]:

[0] [http://gamesfromwithin.com](http://gamesfromwithin.com)

[1] [http://www.dodgycoder.net/2012/02/coding-tricks-of-game-developers.html](http://www.dodgycoder.net/2012/02/coding-tricks-of-game-developers.html)

------
rhpistole
I cannot imagine being in this situation and not either looking for queries to
kill in mysql or restarting the daemon.

And as others have mentioned, having an open file descriptor to a deleted temp
file is a classic unix pattern; truncating such files is a horrific idea.

------
Daneel_
As mentioned in the article comments, this is a classic example of why you
should partition your systems.

Still an interesting way to troubleshoot and resolve the issue though. I'm
sure that on duty at 3am I wouldn't have come up with anything better.

------
marknadal
This is the canonical story of databases, I had similar frightening nights
several years ago with my MongoDB setup. I'm not a DevOps guy and having to
wake up to figure out what is broken (and it always being the database)
eventually took a toll on me. What's the point of running my database in the
cloud if my storage space is still finite? Isn't the whole point of the cloud
unlimited scalability? So why on earth do I have to be an expert in MDADM, LVM
and all that other junk in order to make use of that unlimited supply of hard
drives? Ugh, sorry for the rant, but this story triggers me right back to
those days. I've tried doing things differently since, experimenting with new
database concepts and trying to make things better, and the good news is that
I have never had this problem since - so hopefully I can save future souls
the pain and woe, [http://gunDB.io/](http://gunDB.io/) .

------
woof
MySQL uses TMPDIR for sorting; multiple large temp files indicate heavy
queries that should be optimised and/or missing indexes.

Deleting those files while the server runs is an awesome way to "f@ck s*it
up"! Do you really trust MySQL to behave nicely in that situation?

Add more disk space, and point TMPDIR somewhere other than /tmp, /var/tmp or
/usr/tmp!
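
E.g. via the tmpdir option in my.cnf (the path is a placeholder for a big,
dedicated filesystem; mysqld needs a restart to pick it up):

    [mysqld]
    # placeholder path on a big, dedicated filesystem
    tmpdir = /srv/mysql-tmp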

------
Buetol
To easily find out where disk space is going, I'm personally a big fan of
`ncdu`. It's a really nice ncurses interface to du.

~~~
DEinspanjer
Second this in general, although I don't think du (and consequently ncdu) will
show anything about disk space claimed by deleted files that are still held
open by file descriptors.
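
The usual answer for that part is `lsof | grep deleted`; a rough pure-Python
equivalent that walks /proc (Linux-only, and you'll need privileges to read
other users' fd tables):

    import os

    # Walk every process's fd table looking for links the kernel marks
    # as "(deleted)" -- space that du/ncdu cannot see.
    for pid in filter(str.isdigit, os.listdir("/proc")):
        fd_dir = f"/proc/{pid}/fd"
        try:
            fds = os.listdir(fd_dir)
        except (PermissionError, FileNotFoundError):
            continue                    # not ours, or it exited mid-walk
        for fd in fds:
            try:
                target = os.readlink(f"{fd_dir}/{fd}")
                size = os.stat(f"{fd_dir}/{fd}").st_size
            except OSError:
                continue
            if target.endswith("(deleted)"):
                print(f"pid {pid} fd {fd}: {size} bytes held by {target}")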

------
brazzledazzle
Is there any risk to mysql when we nuke those handles?

~~~
djcapelis
Yes. Never do what this person did.

------
stephengillie
On Windows we use TreeSize Free to find the offending files. Tracking down the
lock and unlocking can be tricky, but Process Monitor and File Unlocker are
usually the go-to tools.

------
amelius
I'm still looking for a way to close open sockets of a process. I know about
the gdb hack [1] mentioned in the comments of the article, but it seems like
an opportunity for somebody to write a nice tool for this.

[1] [http://hacktracking.blogspot.nl/2013/06/closing-process-file-descriptor-while.html](http://hacktracking.blogspot.nl/2013/06/closing-process-file-descriptor-while.html)
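
Until that tool exists, the gdb hack at least wraps up into a few lines; a
sketch, with a hypothetical pid/fd and assuming gdb is installed and ptrace
is allowed:

    import subprocess

    def close_remote_fd(pid: int, fd: int) -> None:
        """Attach to `pid` with gdb and close `fd` on its behalf.
        The process is paused briefly and may not expect its
        descriptor to vanish, so treat this as a last resort."""
        subprocess.run(
            ["gdb", "-p", str(pid), "--batch",
             "-ex", f"call (int)close({fd})"],
            check=True,
        )

    # close_remote_fd(12345, 42)    # hypothetical pid/fd from lsof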

