
Curious Case of a 99.9% Latency Hike - myinnerbanjo
https://mahdytech.com/2019/01/13/curious-case-999-latency-hike/
======
hinkley
I've watched team after team get surprised that their app gets _substantially_
slower when they use up nearly all of the memory on the servers.

They look at the free memory and think, "wow look at all that free space, I
can totally take up half of that no problem".

The moment the collective working set exceeds total memory, things go sideways.
They learn that the performance of the system tanks when there's not enough
space left for OS-level caches and _especially_ for memory mapped files.
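
On Linux, the number to size against is MemAvailable rather than MemFree; a minimal sketch of the difference (assumes /proc/meminfo, so Linux-only):

    /* Sketch: "free" memory is misleading because the page cache is
     * doing useful work. MemAvailable estimates what can actually be
     * claimed without evicting hot cache and mapped files. */
    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("/proc/meminfo", "r");
        char line[256];
        long mem_free = -1, cached = -1, mem_available = -1;

        if (!f) { perror("/proc/meminfo"); return 1; }
        while (fgets(line, sizeof line, f)) {
            sscanf(line, "MemFree: %ld kB", &mem_free);
            sscanf(line, "Cached: %ld kB", &cached);
            sscanf(line, "MemAvailable: %ld kB", &mem_available);
        }
        fclose(f);

        printf("MemFree:      %8ld kB (truly idle pages)\n", mem_free);
        printf("Cached:       %8ld kB (page cache, incl. mapped files)\n", cached);
        printf("MemAvailable: %8ld kB (what you can safely claim)\n", mem_available);
        return 0;
    }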

~~~
aybassiouny
The surprising part for me is that the OS does not recover. It's as if, once
it tanks, it becomes more cautious about which memory is paged out and which
is not, even if there's plenty of free physical memory later.
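
One likely explanation: the kernel doesn't proactively swap a process back in; pages return only as they're touched, one major fault at a time. A minimal sketch of forcing the issue on Linux, using mlockall(2) to fault everything back in and pin it (needs CAP_IPC_LOCK or a large enough RLIMIT_MEMLOCK; that's an assumption about the setup):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        /* MCL_CURRENT faults in and locks every page currently mapped
         * by this process; MCL_FUTURE extends that to future mappings. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return 1;
        }
        puts("all pages resident and pinned");
        return 0;
    }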

~~~
existencebox
Preemptive apologies for this glorified "it's nice to hear it's not just me"
of a post.

To let you skip the anecdote if you don't care, I'd be very curious if someone
can answer _why_ this happens.

Earlier this week I was running into Very Strange errors coming out of some
deep libraries in one of my hobby services. Nothing was specifically failing,
but things were going very slow and vomiting a bit. I observed that I had over
25 GB of active RAM (typically idle at ~8 GB) and some absurd amount of paged
data too. Even after killing almost all non-system processes I was sitting at
~20 GB total, including a 10 GB nonpaged pool and 6 GB process private. On
reboot, and for the week since? 8 GB. (On Win10)

~~~
aybassiouny
I recommend doing more debugging next time to see which process exactly is
costing you all this. It sounds like a classic memory leak - those can be
triggered nondeterministically. Maybe try some of the tools suggested here:
[https://mahdytech.com/2019/01/05/task-manager-memory-info/](https://mahdytech.com/2019/01/05/task-manager-memory-info/)

~~~
existencebox
That's the fascinating part; I followed _that specific_ top Google result plus
a few others, dug around in RamMap, but couldn't find anything to finger any
particular process as owning all the consumed RAM. Admittedly I'm a bit out of
my league in terms of system-level debugging here, but all of the typical
culprits said "everything looks normal, except you actually have like no RAM
free," and given that it was impeding my primary work machine and that I had
no track record of these symptoms occurring before, I eventually resigned
myself to a reboot.

------
tjungblut
You could add a liveness probe that makes sure the binary is paged in
completely.
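
A minimal sketch of such a probe on Linux, using mincore(2) to check how much of the running binary is resident (the /proc/self/exe choice and the 90% threshold are assumptions):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        /* /proc/self/exe is the running binary; a real probe might
         * also check its data files or mapped indexes. */
        int fd = open("/proc/self/exe", O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) != 0) { perror("open"); return 2; }

        long page = sysconf(_SC_PAGESIZE);
        size_t pages = (st.st_size + page - 1) / page;
        void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        unsigned char *vec = malloc(pages);
        if (map == MAP_FAILED || !vec) { perror("mmap"); return 2; }

        /* mincore fills one byte per page; bit 0 = resident in RAM. */
        if (mincore(map, st.st_size, vec) != 0) { perror("mincore"); return 2; }

        size_t resident = 0;
        for (size_t i = 0; i < pages; i++)
            resident += vec[i] & 1;
        printf("%zu of %zu pages resident\n", resident, pages);

        /* Probe contract (assumed threshold): fail if <90% resident. */
        return (double)resident / pages >= 0.9 ? 0 : 1;
    }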

------
Thaxll
Why did you not see that on metrics/dashboards? Tracing an app should be a
last resort; a proper setup would have told you that you had memory issues
right away.

~~~
tayo42
There's that, and you could be monitoring swap, something like the sketch
below. Though this reads like the writer might not know the potential impact
swapping has on performance. They seem to have their own datacenters? I'd
imagine the servers would have some sort of monitoring?

It would be nice to know the language and OS you're talking about.

Also, to the writer: I think you want to turn off swap?
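
A minimal sketch of that swap monitoring on Linux: pswpin/pswpout in /proc/vmstat are cumulative page counts, and a rising delta means the box is actively thrashing rather than just holding stale pages in swap (the 5-second interval is an arbitrary choice):

    #include <stdio.h>
    #include <unistd.h>

    /* Read cumulative swap-in/out page counts from /proc/vmstat. */
    static void swap_counters(long *in, long *out) {
        FILE *f = fopen("/proc/vmstat", "r");
        char line[128];
        if (!f) return;
        while (fgets(line, sizeof line, f)) {
            sscanf(line, "pswpin %ld", in);
            sscanf(line, "pswpout %ld", out);
        }
        fclose(f);
    }

    int main(void) {
        long in0 = 0, out0 = 0;
        swap_counters(&in0, &out0);
        for (;;) {
            sleep(5);
            long in1 = 0, out1 = 0;
            swap_counters(&in1, &out1);
            /* Sustained nonzero rates mean active thrashing. */
            printf("swap-in %ld pages/s, swap-out %ld pages/s\n",
                   (in1 - in0) / 5, (out1 - out0) / 5);
            in0 = in1; out0 = out1;
        }
    }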

~~~
devereaux
Better than turning it off: on Linux, try using vm.overcommit_memory.

It will give you the option of swap when the OOM killer would otherwise kill a
process, but you can decide exactly how much, to prevent swapping from hurting
performance -- say, 10% extra.

Assuming you have a significant number of identical servers, turning off swap
should be reserved for after you have tuned the daemons and their memory use,
and you are confident nothing will run out (still, you should monitor dmesg),
to maximize bang for buck.

If however you need swap, a good old trick: don't put the swap on LVM or RAID.
Just have a separate partition on each physical device. Given equal swap
priority, the kernel will use them in parallel, and you get to squeeze out a
bit of extra performance.
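
For context on what that tuning controls, a read-only sketch (the knobs themselves are set via sysctl): with vm.overcommit_memory=2 the kernel enforces CommitLimit = swap + RAM * overcommit_ratio/100, and comparing Committed_AS against it shows how close you are to allocation failures instead of OOM kills:

    /* Sketch: inspect Linux overcommit accounting. CommitLimit is the
     * hard ceiling under strict accounting; Committed_AS is what has
     * been promised so far. Read-only; tuning is done via sysctl. */
    #include <stdio.h>

    static long read_long(const char *path) {
        FILE *f = fopen(path, "r");
        long v = -1;
        if (f) { fscanf(f, "%ld", &v); fclose(f); }
        return v;
    }

    int main(void) {
        FILE *f = fopen("/proc/meminfo", "r");
        char line[256];
        long commit_limit = -1, committed_as = -1;

        if (!f) { perror("/proc/meminfo"); return 1; }
        while (fgets(line, sizeof line, f)) {
            sscanf(line, "CommitLimit: %ld kB", &commit_limit);
            sscanf(line, "Committed_AS: %ld kB", &committed_as);
        }
        fclose(f);

        printf("overcommit_memory: %ld (2 = strict accounting)\n",
               read_long("/proc/sys/vm/overcommit_memory"));
        printf("overcommit_ratio:  %ld%%\n",
               read_long("/proc/sys/vm/overcommit_ratio"));
        printf("CommitLimit:  %ld kB\n", commit_limit);
        printf("Committed_AS: %ld kB (%.0f%% of limit)\n", committed_as,
               100.0 * committed_as / commit_limit);
        return 0;
    }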

------
conorh
Nice insight into the tools, but wouldn't a simple memory monitor have shown
the issue?

~~~
aybassiouny
That is already tracked, but it is not very meaningful. We already have a lot
of page faults (and paged memory); they are just not on the hot path.
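
For what it's worth, a minimal sketch of checking the hot path directly: ru_majflt from getrusage(2) counts only hard faults (the ones that wait on disk), so diffing it around a request shows whether the latency-critical code is actually paging (do_request() is a hypothetical stand-in):

    #include <stdio.h>
    #include <sys/resource.h>

    /* Hypothetical stand-in for the latency-critical work. */
    static void do_request(void) {
        /* ... serve one request ... */
    }

    int main(void) {
        struct rusage before, after;

        getrusage(RUSAGE_SELF, &before);
        do_request();
        getrusage(RUSAGE_SELF, &after);

        /* Hard faults here are the ones that show up in tail latency. */
        printf("hard faults on hot path: %ld\n",
               after.ru_majflt - before.ru_majflt);
        return 0;
    }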

------
loa-in-backup
Nice, easy read. Light but to the point.

