
Two frequently used system calls are ~77% slower on AWS EC2 - jcapote
https://blog.packagecloud.io/eng/2017/03/08/system-calls-are-much-slower-on-ec2/
======
brendangregg
Yes, this is why we (Netflix) default to tsc over the xen clocksource. I found
the xen clocksource had become a problem a few years ago, quantified using
flame graphs, and investigated using my own microbenchmark.

Summarized details here:

[https://www.slideshare.net/brendangregg/performance-
tuning-e...](https://www.slideshare.net/brendangregg/performance-tuning-
ec2-instances/42)

~~~
brendangregg
This reminds me: I should give an updated version of that talk for 2017...

~~~
jankedeen
I've been in a couple of positions recently where they mention your name and I
look at your work and think to myself... here is a sysadmin with modest skills
who (by exposure) has become notably vocal and somewhat adept at scale
computing. In general, if a company mentions Netflix or Brendan Gregg I flinch.
Just an FYI.

~~~
brendangregg
Sorry to make you flinch! I'm curious what of my work you were looking at; on
this thread I had mentioned this:

[https://www.slideshare.net/brendangregg/performance-
tuning-e...](https://www.slideshare.net/brendangregg/performance-tuning-
ec2-instances)

I think it's a pretty good summary, and includes work from my team and some
original work of my own.

Is there something I could change in it that would make it more helpful for
you?

------
drewg123
Another option is to reduce usage of gettimeofday() when possible. It is not
always free.

Roughly 10 years ago, when I was the driver author for one of the first full-
speed 10GbE NICs, we'd get complaints from customers who were sure our NIC
could not do 10Gb/s, as iperf showed it was limited to 3Gb/s or less. I would
ask them to re-try with netperf, and they'd see full bandwidth. I eventually
figured out that the complaints were coming from customers running distros
without the vdso stuff, and/or running other OSes which (at the time) didn't
support that (Mac OS, FreeBSD). It turns out that the difference was that
iperf would call gettimeofday() around every socket write to measure
bandwidth. But netperf would just issue gettimeofday calls at the start and
the end of the benchmark, so iperf was effectively gettimeofday bound. Ugh.
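
For anyone who hasn't looked at the two benchmarks side by side, the difference
boils down to something like this (a simplified sketch of the two measurement
patterns, not the actual iperf/netperf source):

    #include <sys/time.h>
    #include <unistd.h>
    
    /* Hypothetical sketch of the two measurement styles. The per-write
     * variant issues two gettimeofday() calls for every socket write, so
     * when gettimeofday() is expensive the benchmark measures the clock,
     * not the NIC. */
    
    /* iperf-style: sample the clock around every write */
    long send_timed_per_write(int fd, const char *buf, size_t len, long n)
    {
        struct timeval t0, t1;
        long total = 0;
    
        for (long i = 0; i < n; i++) {
            gettimeofday(&t0, NULL);
            total += write(fd, buf, len);
            gettimeofday(&t1, NULL);      /* per-write bandwidth sample */
        }
        return total;
    }
    
    /* netperf-style: timestamps only bracket the whole run */
    long send_timed_once(int fd, const char *buf, size_t len, long n)
    {
        struct timeval t0, t1;
        long total = 0;
    
        gettimeofday(&t0, NULL);
        for (long i = 0; i < n; i++)
            total += write(fd, buf, len);
        gettimeofday(&t1, NULL);
        return total;
    }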

~~~
LanceH
Maybe you could set up some caching.

~~~
sly010
For time?

~~~
coldtea
Yes, for time too. For one, if you don't need sub-second precision, why have
some of your servers, for example, ask for the current time thousands of times
per second? There are ways to get a soft expiration that don't involve asking
for the time.

~~~
kccqzy
In case someone is interested in a concrete example, I first learned about
caching time by discovering this package in my dependencies:
[https://hackage.haskell.org/package/auto-
update](https://hackage.haskell.org/package/auto-update)

Its README basically says that instead of having every web request result in a
call to get the current time, it creates a green thread that runs every second,
updating a mutable pointer that stores the current time.
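
A rough C equivalent of that pattern (a hypothetical sketch, not the
auto-update code): one background thread stamps a shared variable once per
second, and request handlers read the cached value instead of calling the
clock.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    
    /* One background thread refreshes a cached timestamp once per second;
     * request handlers read the cached value instead of hitting the clock
     * on every request. */
    static _Atomic time_t cached_now;
    
    static void *time_updater(void *arg)
    {
        (void)arg;
        for (;;) {
            atomic_store(&cached_now, time(NULL));  /* one real clock read... */
            sleep(1);                               /* ...per second */
        }
        return NULL;
    }
    
    static time_t approx_now(void)   /* what handlers call instead of time() */
    {
        return atomic_load(&cached_now);
    }
    
    int main(void)
    {
        pthread_t tid;
    
        atomic_store(&cached_now, time(NULL));
        pthread_create(&tid, NULL, time_updater, NULL);
    
        /* ... serve requests, each calling approx_now() as needed ... */
        sleep(2);
        printf("approximate time: %ld\n", (long)approx_now());
        return 0;
    }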

------
nneonneo
The title is misleading. 77% slower sounds like the system calls take 1.77x
the time on EC2. In fact, the results indicate that the normal calls are 77%
faster - in other words, gettimeofday and clock_gettime calls take _nearly 4.5x
longer_ to run on EC2 than they do on ordinary systems.

This is a _big_ speed hit. Some programs can use gettimeofday extremely
frequently - for example, many programs call timing functions when logging,
performing sleeps, or even constantly during computations (e.g. to implement a
poor-man's computation timeout).

The article suggests changing the time source to tsc as a workaround, but also
warns that it could cause unwanted backwards time warps - making it dangerous
to use in production. I'd be curious to hear from those who are using it in
production how they avoided the "time warp" issue.

~~~
klodolph
77% faster is not correct either. "Speed" would probably be ops/s.

4.5x longer = 350% slower.
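
For anyone who wants the arithmetic spelled out, using the rough 4.5x ratio
quoted upthread:

    % t = time per vDSO call, 4.5t = time per EC2 syscall (approximate ratio)
    \[
    \text{slower (in time): } \frac{4.5t - t}{t} = 350\% \qquad
    \text{time saved: } \frac{4.5t - t}{4.5t} \approx 78\% \qquad
    \text{faster (in ops/s): } \frac{1/t}{1/(4.5t)} = 4.5\times
    \]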

~~~
stouset
Even _this_ is confusing as hell.

Just say the native calls take 22% of the time they do on EC2. Or that the EC2
calls take 450% of the time of their native counterparts.

"Faster" and "slower" combined with percentages are rife with confusion.
Please don't use them.

~~~
klodolph
I can't agree. Speed is usually units/time, and everyone knows that 100 mph is
2x as fast as 50 mph, or 100% faster.

------
binarycrusader
I prefer the way Solaris solved this problem:

1) first, by eliminating the need for a context switch for libc calls such as
gettimeofday(), gethrtime(), etc. (there is no public/supported interface on
Solaris for syscalls, so libc would be used)

2) by providing additional, specific interfaces with certain guarantees:

[https://docs.oracle.com/cd/E53394_01/html/E54766/get-sec-
fro...](https://docs.oracle.com/cd/E53394_01/html/E54766/get-sec-
fromepoch-3c.html)

This was accomplished by creating a shared page, set up during system startup,
in which the kernel keeps the time updated. At process exec time that page is
mapped into every process address space.

Solaris' libc was then updated to simply read directly from this memory page.
Of course, this is more practical on Solaris because libc and the kernel are
tightly integrated, and because system calls are not public interfaces, but
this seems greatly preferable to the VDSO mechanism.
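
Conceptually, the read side (on either Solaris or the Linux vDSO) looks
something like this seqlock-style loop. This is an illustrative sketch only;
the struct layout and field names are invented, not the real comm-page or vDSO
format.

    #include <stdint.h>
    
    /* Illustrative sketch of reading a kernel-maintained time page. The
     * layout and field names are invented for this example. */
    struct time_page {
        volatile uint32_t seq;    /* odd while the kernel is mid-update */
        volatile int64_t  sec;
        volatile int64_t  nsec;
    };
    
    /* The page is mapped into every process at exec time, so reading the
     * time needs no kernel entry at all. Real implementations also insert
     * memory barriers around these reads. */
    int64_t read_time_ns(const struct time_page *tp)
    {
        uint32_t s1, s2;
        int64_t sec, nsec;
    
        do {
            s1   = tp->seq;
            sec  = tp->sec;
            nsec = tp->nsec;
            s2   = tp->seq;
        } while (s1 != s2 || (s1 & 1));   /* retry if an update raced us */
    
        return sec * 1000000000LL + nsec;
    }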

~~~
jdamato
This is precisely what the vDSO does. The clocksources mentioned in the article
explicitly mark themselves as not supporting vDSO reads, hence the fallback to
a regular system call.

~~~
binarycrusader
Not quite; vdso is a general syscall-wrapper mechanism. The Solaris solution
is specifically just for the gettimeofday(), gethrtime() interfaces, etc.

The difference is that on Solaris, since there is no public system call
interface, there's also no need for a fallback. Every program is just faster,
no matter how Solaris is virtualized, since every program is using libc.

There's also no need for an administrative interface to control clocksource;
the best one is always used.

~~~
jdamato
Not quite. The vDSO provides a general syscall-wrapper mechanism for certain
types of system call interfaces. It _also_ provides implementations of
gettimeofday, clock_gettime, and two other system calls completely in userland, and
acts precisely as you've described.

Please see this[1] for a detailed explanation. For a shorter explanation,
please see the vDSO man page[2]. Thanks for reading my blog post!

[1]: [https://blog.packagecloud.io/eng/2016/04/05/the-
definitive-g...](https://blog.packagecloud.io/eng/2016/04/05/the-definitive-
guide-to-linux-system-calls/#virtual-system-calls) [2]:
[http://man7.org/linux/man-pages/man7/vdso.7.html](http://man7.org/linux/man-
pages/man7/vdso.7.html)

~~~
binarycrusader
I'm aware of the high-level details of the VDSO implementation, but I would
still say that the Solaris implementation is more narrowly focused and as a
result does not have the subtle issues / tradeoffs that VDSO does.

Also, I personally find VDSO disagreeable, as do others, although perhaps not
in as dramatic terms as some:

[https://mobile.twitter.com/bcantrill/status/5548101655902617...](https://mobile.twitter.com/bcantrill/status/554810165590261761)

I think Ian Lance Taylor's summary is the most balanced and thoughtful:

 _Basically you want the kernel to provide a mapping for a small number of
magic symbols to addresses that can be called at runtime. In other words, you
want to map a small number of indexes to addresses. I can think of many
different ways to handle that in the kernel. I don 't think the first
mechanism I would reach for would be for the kernel to create an in-memory
shared library. It's kind of a baroque mechanism for implementing a simple
table.

It's true that dynamically linked programs can use the ELF loader. But the ELF
loader needed special changes to support VDSOs. And so did gdb. And this
approach doesn't help statically linked programs much. And glibc functions
needed to be changed anyhow to be aware of the VDSO symbols. So as far as I
can tell, all of this complexity really didn't get anything for free. It just
wound up being complex.

All just my opinion, of course._

[https://github.com/golang/go/issues/8197#issuecomment-660959...](https://github.com/golang/go/issues/8197#issuecomment-66095902)

~~~
amluto
> Not quite; vdso is a general syscall-wrapper mechanism.

It's not. On 32-bit x86, it sort of is, but that's just because the 32-bit x86
fast syscall mechanism isn't really compatible with inline syscalls. Linux
(and presumably most other kernels) provides a wrapper function that means "do
a syscall". It's only accelerated insofar as it uses a faster hardware
mechanism. It has nothing to do with fast timing.

On x86_64, there is no such mechanism.

> It's true that dynamically linked programs can use the ELF loader. But the
> ELF loader needed special changes to support VDSOs. And so did gdb. And this
> approach doesn't help statically linked programs much.

That's because the glibc ELF loader is a piece of, ahem, is baroque and
overcomplicated. And there's no reason whatsoever that vDSO usage needs to be
integrated with the dynamic linker at all.

I wrote a CC0-licensed standalone vDSO parser here:

[https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux....](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/vDSO/parse_vdso.c?h=v4.10)

It's 269 lines of code, including lots of comments, and it works in static
binaries just fine. Go's runtime (which is static!) uses a vDSO loader based
on it. I agree that a static table would be _slightly_ simpler, but the
tooling for _debugging_ the vDSO is a heck of a lot simpler with the ELF
approach.

------
jdamato
Author here, greetings. Anyone who finds this interesting may also enjoy our
writeup describing every Linux system call method in detail [1].

[1]: [https://blog.packagecloud.io/eng/2016/04/05/the-
definitive-g...](https://blog.packagecloud.io/eng/2016/04/05/the-definitive-
guide-to-linux-system-calls/)

~~~
a_t48
Nitpick - `77 percent faster` is not the inverse of `77 percent slower`. The
line that says `The results of this microbenchmark show that the vDSO method
is about 77% faster` should read `446% faster`.

~~~
woolly
Should that not be 346% faster? If A takes 1 second and B takes two seconds,
then B is 100% faster than A. So the calculation would be (B/A - 1) * 100.
Applying this here gives around 346%.

EDIT: B would, of course, take 100% longer than A, rather than be 100% faster.

~~~
mulmen
How can something that takes twice as long be faster?

~~~
woolly
You're right, of course: hadn't had the morning coffee. It should have been
'takes 100% longer' in the 1 second/2 seconds example. The point I was trying
to make is that you have to factor in the initial 100% which doesn't
contribute to the final value.

------
JoshTriplett
For anyone looking at the mentions of KVM "under some circumstances" having
the same issue and wondering how to avoid it with KVM: KVM appears to support
fast vDSO-based time calls as long as:

- You have a stable hardware TSC (you can check this in /proc/cpuinfo on the
host, but all reasonably recent hardware should support this).

- The host has the host-side bits of the KVM pvclock enabled.

As long as you meet those two conditions, KVM should support fast vDSO-based
time calls.
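
If you want to verify what a given guest ended up with, the active clocksource
is visible in sysfs; a small sketch (assuming the usual Linux path):

    #include <stdio.h>
    #include <string.h>
    
    /* Print the active clocksource. On a KVM guest meeting the two
     * conditions above you'd expect "kvm-clock" or "tsc", both of which
     * the vDSO can service without falling back to a real syscall; "xen"
     * means the slow path described in the article. */
    int main(void)
    {
        char buf[64] = "";
        FILE *f = fopen("/sys/devices/system/clocksource/clocksource0/"
                        "current_clocksource", "r");
    
        if (!f || !fgets(buf, sizeof(buf), f)) {
            perror("current_clocksource");
            return 2;
        }
        fclose(f);
    
        printf("clocksource: %s", buf);
        return strncmp(buf, "xen", 3) == 0;  /* nonzero exit if still on xen */
    }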

------
masklinn
So… it's not that the syscalls themselves are slower, it's that the
Linux-specific mechanism for avoiding these calls altogether does not currently
work on Xen (and thus EC2).

~~~
yellowapple
Depends on if you're looking at this from userspace or kernelspace. From the
latter, you're spot on. From the former, the headline's spot on.

~~~
masklinn
> From the former, the headline's spot on.

Only if you're using Linux guests and counting on the vDSO, so not really. The
headline made me think first of issues with the host/virtual hardware and of
some syscalls being much slower than normal across the board.

------
andygrunwald
This was also presented at the last AWS re:Invent in December. See AWS EC2
Deep Dive: [https://de.slideshare.net/mobile/AmazonWebServices/aws-
reinv...](https://de.slideshare.net/mobile/AmazonWebServices/aws-
reinvent-2016-deep-dive-on-amazon-ec2-instances-featuring-performance-
optimization-best-practices-cmp301)

------
chillydawg
Interesting way to find out the version of the hypervisor kernel. If the gtod
call returns faster than the direct syscall for it, then you know the kernel
version is prior to that of the patch fixing the issue in xen.

I expect there are many such patches that you could use to narrow down the
version range of the host kernel. Once you have that information, you may be in
a better position to exploit it, knowing which bugs are and are not patched.

------
nodesocket
If anybody is interested, here is the result from a Google Compute Engine VM.

    
    
        blog   ~ touch test.c
        blog   ~ nano test.c
        blog   ~ gcc -o test test.c
        blog   ~ strace -ce gettimeofday ./test
        % time     seconds  usecs/call     calls    errors syscall
        ------ ----------- ----------- --------- --------- ----------------
          0.00    0.000000           0       100           gettimeofday
        ------ ----------- ----------- --------- --------- ----------------
        100.00    0.000000
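
(The test.c above isn't shown; a guess at a minimal equivalent, matching the
100 calls in the strace summary, might be:)

    #include <stdio.h>
    #include <sys/time.h>
    
    /* A guess at what test.c might contain (not the actual source): call
     * gettimeofday() in a loop so that strace -c can count how many of
     * the calls actually reach the kernel. */
    int main(void)
    {
        struct timeval tv;
    
        for (int i = 0; i < 100; i++)
            gettimeofday(&tv, NULL);
    
        printf("%ld.%06ld\n", (long)tv.tv_sec, (long)tv.tv_usec);
        return 0;
    }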

------
gtirloni
Previous related discussion:
[https://news.ycombinator.com/item?id=13697555](https://news.ycombinator.com/item?id=13697555)

------
amluto
vDSO maintainer here.

There are patches floating around to support vDSO timing on Xen.

But isn't AWS moving away from Xen or are they just moving away from Xen PV?

------
apetresc
Does anyone have any intuition around how this affects a variety of typical
workflows? I imagine that these two syscalls are disproportionately likely to
affect benchmarks more than real-world usage. How often is this syscall
happening on a system doing things like serving HTTP, or running batch jobs,
or hosting a database, etc?

~~~
TheDong
You can use strace and see!

Go to your staging environment, use `strace -f -c -p $PID -e
trace=clock_gettime` (or don't use -p and just launch the binary directly),
replay a bit of production traffic against it, and then interrupt it and check
the summary.

HTTP servers typically return a date header, often internally dates are used
to figure out expiration and caching, and logging almost always includes
dates.

It's incredibly easy to check the numbers of syscalls with strace, so you
really should be able to get an intuition fairly easily by just playing around
in staging.

------
anonymous_iam
I wonder if they tried this: [https://blog.packagecloud.io/eng/2017/02/21/set-
environment-...](https://blog.packagecloud.io/eng/2017/02/21/set-environment-
variable-save-thousands-of-system-calls/)

------
xenophonf
Is this just an EC2 problem, or does it affect any Xen/KVM guest?

I ran the test program on a Hyper-V VM running CentOS 7 and got the same
result: 100 calls to the gettimeofday syscall. Conversely, I tested a vSphere
guest (also running CentOS 7), which didn't call gettimeofday at all.

~~~
officelineback
>Is this just an EC2 problem, or does it affect any Xen/KVM guest?

Looks like it's how the Xen hypervisor works.

~~~
ahoka
It is slower because it misses an optimization where you can get the current
time without having to enter the kernel. The trick is the RDTSC instruction,
which is not privileged, so you can execute it from userspace. The Time Stamp
Counter is a 64-bit register (an MSR, actually) that increments monotonically.
You can turn it into the current time by calibrating it against a known
duration at boot, or by reading the frequency from a system table first, then
doing a simple division and adding an offset. There are some caveats, though:
you have to check via CPUID that the CPU has an invariant TSC, and every core
has a separate register. I think the problem with Xen is that the VM could be
moved across hypervisors or CPUs, which would suddenly change the value of the
counter. The latter could be mitigated by syncing the TSCs across cores (did I
mention that they are writable?), and Xen also supports emulating the RDTSC
instruction. I'm not sure how it's configured on AWS, so it may be perfectly
safe or mostly safe.
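
A bare-bones illustration of the technique (a sketch that assumes an invariant
TSC and an already-known frequency; real code would check CPUID and deal with
per-core offsets as described above, and the 2.6 GHz figure is made up):

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtsc() */
    
    /* Sketch only: TSC_HZ is assumed to be known, e.g. calibrated against
     * CLOCK_MONOTONIC at startup or read from a system table. */
    #define TSC_HZ 2600000000ULL
    
    static uint64_t tsc_to_ns(uint64_t cycles)
    {
        return cycles * 1000000000ULL / TSC_HZ;  /* may overflow for huge deltas */
    }
    
    int main(void)
    {
        uint64_t start = __rdtsc();              /* unprivileged, no kernel entry */
    
        volatile uint64_t sink = 0;              /* ... work being timed ... */
        for (int i = 0; i < 1000000; i++)
            sink += i;
    
        uint64_t end = __rdtsc();
        printf("elapsed: ~%llu ns (sink=%llu)\n",
               (unsigned long long)tsc_to_ns(end - start),
               (unsigned long long)sink);
        return 0;
    }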

------
MayeulC
Wasn't a workaround posted for this some time ago, that requires setting the
TZ environment variable?

[https://news.ycombinator.com/item?id=13697555](https://news.ycombinator.com/item?id=13697555)

It seems very closely related, unless I am mistaken.

~~~
daenney
You are not mistaken in that the topics are (somewhat) related: they all have
to do with time. But setting the TZ environment variable doesn't mean your
programs don't execute the syscalls discussed in this article.

This is about the speed of execution of the mentioned syscalls, which will be
called regardless of the TZ environment variable, and how vDSO changes that.
However, by setting the TZ environment variable you can avoid an additional
stat() call made while libc tries to determine if /etc/localtime exists.
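
To see the difference described above, run a toy loop like this under strace
with and without TZ set (the program is just a hypothetical sketch):

    #include <stdio.h>
    #include <time.h>
    
    /* localtime() goes through glibc's tzset machinery, which (when TZ is
     * unset) re-checks /etc/localtime; with TZ set, those stat-family
     * syscalls go away. Compare "strace -c ./a.out" with
     * "TZ=:/etc/localtime strace -c ./a.out". */
    int main(void)
    {
        char buf[64] = "";
    
        for (int i = 0; i < 1000; i++) {
            time_t now = time(NULL);
            struct tm *tm = localtime(&now);     /* timezone conversion */
            strftime(buf, sizeof(buf), "%F %T", tm);
        }
        printf("last: %s\n", buf);
        return 0;
    }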

------
pgaddict
I wonder why the blog post claims setting clock source to 'tsc' is considered
dangerous.

~~~
bandrami
Because if the clock rate changes, tsc can become out of sync.

[https://lwn.net/Articles/209101/](https://lwn.net/Articles/209101/)

~~~
pgaddict
Not really. Recent CPUs (at least those from Intel, which is what EC2 runs on)
implement constant_tsc, so the frequency does not affect the tsc.

A worse issue is that the counters may not be synchronized between cpus, which
may be an issue when the process moves between sockets.

But I wouldn't call that "dangerous", it's simply a feature of the clock
source. If that's an issue for your program, you should use CLOCK_MONOTONIC
anyway and not rely on gettimeofday() doing the right thing.
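
For the interval-measurement case, the pattern is just (minimal sketch):

    #include <stdio.h>
    #include <time.h>
    
    /* Measuring an interval with CLOCK_MONOTONIC: unaffected by wall-clock
     * jumps (NTP steps, admin changes), which is the property you actually
     * want when asking "how long did this take". */
    int main(void)
    {
        struct timespec start, end;
    
        clock_gettime(CLOCK_MONOTONIC, &start);
        /* ... work being timed ... */
        clock_gettime(CLOCK_MONOTONIC, &end);
    
        double elapsed = (end.tv_sec - start.tv_sec)
                       + (end.tv_nsec - start.tv_nsec) / 1e9;
        printf("elapsed: %.9f s\n", elapsed);
        return 0;
    }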

~~~
blibble
how does constant_tsc interact with VMs being silently migrated from one
physical machine to another?

~~~
pgaddict
Not sure, but it can't be better than moving processes between CPUs I guess.
Also, does EC2 silently move VMs like this?

~~~
poofyleek
Even without migration, synchronization can be an issue. In older multi-core
machines, TSC synchronization between cores was a problem. Modern systems take
care of this, and core CPU clock frequency changes are also handled, so a
constant rate is available via the TSC. However, when hypervisors such as
VMware or paravirtualization like Xen come into play, there are further issues,
because the RDTSC instruction either has to be passed through to physical
hardware or emulated via a trap, and emulation brings a number of
considerations with it. Xen actually has PVRDTSC features that are normally not
used but can be effective in paravirtual environments.

The gettimeofday() syscalls (and clock_gettime) are liberally used across a
huge amount of existing software, partly for historical reasons and partly
because the calls look deceptively "atomic", "isolated" or "self-contained" in
their appearance and usage. A lot of issues come about from their use,
especially in time-sensitive applications (e.g. WAN optimization), and this is
especially true in virtual environments. There are complex issues described
elsewhere that are kind of fun to read:
[https://www.vmware.com/pdf/vmware_timekeeping.pdf](https://www.vmware.com/pdf/vmware_timekeeping.pdf)
and
[https://xenbits.xen.org/docs/4.3-testing/misc/tscmode.txt](https://xenbits.xen.org/docs/4.3-testing/misc/tscmode.txt).

The issue becomes even more complex in distributed systems, beyond NTP. Some
systems, like Erlang, have provisions for this:
[http://erlang.org/doc/apps/erts/time_correction.html#OS_Syst...](http://erlang.org/doc/apps/erts/time_correction.html#OS_System_Time).
Other systems use virtual vector clocks. And some systems, like Google TrueTime
as used in Spanner, synchronize using GPS and atomic clocks. Satellite GPS
pulses are commonly used on trading floors and in HFT software. This is a very
interesting area of study.

~~~
pgaddict
It's complex stuff, no doubt about that.

For me, it's much simpler - I come from the PostgreSQL world, so
gettimeofday() is pretty much what EXPLAIN ANALYZE does to instrument queries.
Good time source means small overhead, bad time source means instrumented
queries may take multiples of actual run time (and be skewed in various ways).
No fun.

~~~
poofyleek
It is complex and interesting. I am a novice database user, but I do know many
databases use gettimeofday quite a lot; just strace any SELECT query. Most
databases I have used, including PostgreSQL, also have to implement MVCC, which
mostly depends on timestamps. Imagine the time drift induced by hypervisor CPU
and memory pressure, or even drift across a distributed cluster of database
nodes. It hurts my head to think of the cases that will give me the wrong
values, or the wrong estimate when getting the values. It is an interesting
area.

~~~
pgaddict
MVCC has nothing to do with timestamps, particularly not with timestamps
generated from gettimeofday(), but with XIDs, which you might imagine as a
monotonically increasing sequence of integers assigned at the start of a
transaction. You might call that a timestamp, but the trouble is that what
matters is commit order, and the XID has nothing to do with that. Which is why
MVCC requires 'snapshots' - a list of transactions that are in progress.

------
teddyuk
How common are get-time calls, such that they would actually be an issue?

I've worked on quite a few systems and can't think of a case where an API for
getting the time would have been called so often that it would affect
performance.

~~~
tyingq
Timestamped logs, transaction timeouts, http keepalive timeouts, cache
expiration/eviction, etc.

Apache and nginx for example, both call gettimeofday() a lot.

Edit: Quick google searches indicate software like redis and memcached also
call it quite often.

~~~
Anderkent
So does cassandra.

------
westbywest
OpenJDK has an open issue about this in their JVM:
[https://bugs.openjdk.java.net/browse/JDK-8165437](https://bugs.openjdk.java.net/browse/JDK-8165437)

------
peterwwillis
> All programmers deploying software to production environments should
> regularly strace their applications in development mode and question all
> output they find.

Or, instead, you could just not do that. Then you could go back to being
productive, instead of wasting time tracking down unstable small tweaks for
edge cases that you can barely notice after looping the same syscall 5 million
times in a row.

When will people learn not to micro-optimize?

~~~
jankedeen
Crapulent and without merit.

------
known
Just curious to know the status on Azure.

------
damagednoob
w

