

Visualizing AWS Storage with Real-Time Latency Spectrograms - degio
http://sysdigcloud.com/aws-storage-latency-sysdig-spectrogram/

======
btown
> Every few seconds one of the writes takes forever [~5s]. You can notice the
> long periods of inactivity, and after that a green dot at the right of the
> chart: that’s our slow call. What is likely happening is: the local cache
> saturates and when that happens the application has to wait until the local
> data is pushed to the remote volume. Boy, you sure don’t want one of your
> critical code paths to hit one of these slow calls.

I'm surprised that there's no asynchronous way for the FS cache to flush
itself, i.e. start flushing when it reaches 50% capacity, and rate-limit
incoming requests if it gets too full. The idea that an FS cache is so dumb
that it can't do _anything_ while it's flushing its entire self is a bit
scary - I'd expect that circular buffers and granular locking mechanisms
could be used to great effect here. Is this kernel code? Userspace code? Is
there research into this? Fundamental tradeoffs that I'm missing?

~~~
mrjones
It would be interesting to see the client/benchmarking program. It almost
sounds like it could be single-threaded ... which would mean the delay is an
artifact of the benchmark only having one op outstanding, rather than
something inherent in the storage layer.
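
One way to check (a rough sketch under that assumption -- file names and
sizes are placeholders, not the author's code): issue the same timed writes
from several threads and see whether the stalls hit all of them at once.

    import os, threading, time

    def writer(tid):
        # Each thread keeps one write outstanding. If stalls hit every
        # thread simultaneously, the pause is in the cache/storage layer;
        # if only one thread stalls at a time, it's an artifact of having
        # a single op outstanding.
        buf = b"\0" * 4096
        fd = os.open("/tmp/bench-%d.dat" % tid, os.O_WRONLY | os.O_CREAT)
        for i in range(50000):
            t0 = time.time()
            os.write(fd, buf)
            elapsed = time.time() - t0
            if elapsed > 0.1:
                print("thread %d: write %d took %.2fs" % (tid, i, elapsed))
        os.close(fd)

    threads = [threading.Thread(target=writer, args=(t,)) for t in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()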

~~~
btown
Even with one client thread, though, shouldn't there be a background OS thread
maintaining the FS cache and flushing parts of it? I don't think it should
block the client just because it decided it was too full.
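
(Partially answering my own question: on Linux this is kernel code --
writeback is done by kernel flusher threads, and the thresholds are tunable.
A quick way to inspect them, assuming a standard /proc layout:)

    # dirty_background_ratio is where the kernel starts flushing
    # asynchronously in the background; dirty_ratio is where writers
    # themselves start blocking.
    for knob in ("dirty_background_ratio", "dirty_ratio",
                 "dirty_expire_centisecs", "dirty_writeback_centisecs"):
        with open("/proc/sys/vm/" + knob) as f:
            print(knob, "=", f.read().strip())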

------
huhtenberg
That's clever and well executed. Wrong palette though :P

Red implies problems, green implies "normality", but here this association is
misplaced. Perhaps a typical "fire" palette would be better - from dark brown
to red to orange to yellow and, ultimately, to white for the extremes.

~~~
degio
OP here. Unfortunately the ANSI palette is pretty limited so I didn't have a
lot of flexibility in the color choice. That said, this can definitely be
improved. I can work on it if people find it useful.

In the meantime, it's very easy to tune the colors on your own: just modify this
line
[https://github.com/draios/sysdig/blob/master/userspace/sysdi...](https://github.com/draios/sysdig/blob/master/userspace/sysdig/chisels/spectrogram.lua#L40)
in your local version of the script, using this as a reference
[http://misc.flogisoft.com/_media/bash/colors_format/256_colo...](http://misc.flogisoft.com/_media/bash/colors_format/256_colors_fg.png).
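
If it helps, here's a quick way (a small Python 3 sketch) to preview all 256
codes in your own terminal before editing the chisel:

    # Print each ANSI 256-color code on its own colored background,
    # 16 per row, to make picking palette codes by eye easier.
    for code in range(256):
        print("\033[48;5;%dm %3d \033[0m" % (code, code), end="")
        if code % 16 == 15:
            print()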

~~~
chrisan
> Unfortunately the ansi palette is pretty limited so I didn't have a lot of
> flexibility in the color choice.

I believe the issue raised isn't the palette range itself, but rather that it
is the reverse of what is typically expected. The current red area "should"
be green, indicating that there are many calls in the fast region, while the
current trailing green blocks "should" be red, indicating problems.

This convention of green=good and red=bad I believe stems from triage tags:
[http://en.wikipedia.org/wiki/Triage_tag](http://en.wikipedia.org/wiki/Triage_tag)

Sometimes white is used below green as 'dismiss/not an issue'.

~~~
morpher
Here, "good" is on the left and "bad" is on the right. The color is orthogonal
(it gives the number of operations with latencies in a given bucket). For
example, a red square on the right side of the output would have definitely
been "bad".

------
bcantrill
Neat! This is definitely a step forward -- and thanks for the shout-out to our
(that is, Sun's and Joyent's) prior work here. Tempted to also incorporate
this into agghist and aggpack, the new DTrace actions I added for this kind of
functionality.[1] Anyway, good stuff -- it's always good to see new
visualizations of system behavior!

[1] [http://dtrace.org/blogs/bmc/2013/11/10/agghist-aggzoom-and-a...](http://dtrace.org/blogs/bmc/2013/11/10/agghist-aggzoom-and-aggpack/)

------
andrewguenther
It would be interesting to run these tests on different instance sizes,
specifically for data on the instance store. The larger the instance, the
fewer neighbors you have to worry about competing for those precious IOPS.

As for SSD vs Magnetic EBS, I can't say that I'm surprised. I'd assume that
EBS implements some sort of cache in between you and your actual disk on the
other side of the network so that the writes can return even faster. Try doing
this again with reads and I'd bet you'd get some interesting results.

Edit: Also, did you pre-warm your EBS volumes?
[http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-prewa...](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-prewarm.html)
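
(For anyone unfamiliar: pre-warming just means touching every block once so
first-access penalties don't pollute the benchmark. Roughly, as a Python
sketch -- the device name is an example and this needs root:)

    # Read the whole volume once, 1 MiB at a time -- a rough equivalent
    # of the dd-based pre-warm described in the AWS docs.
    with open("/dev/xvdf", "rb") as f:  # example device name
        while f.read(1024 * 1024):
            pass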

~~~
degio
Yes, I did pre-warm the volumes before using them.

And yes, there are several interesting workloads that I didn't test, including
read only and read+write. It's potential material for another blog post.

------
robszumski
Nice job on the graphics for the post. Thanks for taking the time to animate
and annotate well.

------
amulyasharma
In the world of IOPS provisioned iops application demanding faster and faster
iops this tool is handy for devops guy to find the truth of iops being used
and how its performing, selecting if there is need to upgrade the storage ..

------
outputlogic
Calling this visualization a heatmap would be more appropriate than a
spectrogram.

------
digikata
I really want to lop off the 'ns' and '10 sec' divisions of all the charts and
expand the resolution...

------
simonebrunozzi
Well done!

------
armandomonaco
cool project

