
Running a database on EC2? Your clock could be slowing you down - drob
https://heapanalytics.com/blog/engineering/clocksource-aws-ec2-vdso
======
scarface74
Why would you run your own Postgres instance on EC2 within AWS? That kind of
defeats the purpose of paying for AWS. Why not use Postgres RDS or Aurora?

It makes some sense with SQL Server and Oracle in a few cases because of
licensing, but hosting your own Postgres instance on AWS is the worst of both
worlds -- you're paying more than you would for a cheaper VPS, you have to do
all of the maintenance yourself, and you're not taking advantage of the things
that AWS provides -- point-in-time restores, easy cross-region read replicas,
faster disk I/O (Aurora), etc.

~~~
aidos
In my experience you get much better performance outside of RDS, and you can
inspect and tune it better. Maybe I’m missing something, and no doubt I could
put more work into it, but we’ve actually talked about moving our RDS DBs back
to EC2 because plenty of the queries we run are embarrassingly slow on RDS
when they shouldn’t be.

Also, you can’t replicate out of RDS. I like to know where my data is and how
to bring it back online during a disaster.

~~~
chc
I've worked on a project that migrated from MySQL on EC2 to MySQL on RDS and
then back to EC2 because the performance was massively worse — a process that
took a few hours before now took days. We contacted Amazon support to try and
resolve whatever was going wrong with the RDS instance, and their response was
basically "Yeah, we don't guarantee performance on RDS. If you want to
maximize performance, you should run your own DB on EC2."

~~~
nodesocket
It makes sense because of the dependency on EBS, but what instance type were
you using on RDS? Were you using provisioned IOPS storage?

~~~
chc
We tried a few very large instance types to see if that made any difference,
and provisioned IOPS did help, but everything was still slower than the EC2 DB
and it cost a lot more on top of that.

------
AbacusAvenger
This is exactly why I wrote the "clockperf" tool while I was working at AWS:
[https://github.com/tycho/clockperf](https://github.com/tycho/clockperf)

At the time, we were trying to benchmark disk I/O for new platforms, but we
found that things were underperforming compared to the hardware's
specifications. We figured out that fio was reading the clock before and after
each I/O (which isn't really necessary unless you really care about latency
measurement) and that just by reading the clock we were rate-limiting our I/O
throughput. By switching to "clocksource=tsc" in our fio config, we managed to
get the performance behavior we expected.
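
For a rough sense of the gap, something like the quick sketch below (not
clockperf itself, just an illustration for x86-64 Linux) times
clock_gettime(CLOCK_MONOTONIC) against a raw RDTSC read; the difference is
roughly the overhead you pay on every clock read:

    /* Quick sketch (not clockperf): compare the per-call cost of
     * clock_gettime(CLOCK_MONOTONIC) with a raw RDTSC read on x86-64 Linux.
     * Build with e.g.: gcc -O2 clockcost.c -o clockcost (name is arbitrary) */
    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>
    #include <x86intrin.h>

    enum { ITERS = 10 * 1000 * 1000 };

    static double elapsed_ns(struct timespec a, struct timespec b) {
        return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    }

    int main(void) {
        struct timespec t0, t1, tmp;
        volatile uint64_t sink = 0;

        /* clock_gettime goes through the vDSO; if the current clocksource
         * isn't vDSO-capable (e.g. xen), it falls back to a real syscall. */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++) {
            clock_gettime(CLOCK_MONOTONIC, &tmp);
            sink += tmp.tv_nsec;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("clock_gettime: %.1f ns/call\n", elapsed_ns(t0, t1) / ITERS);

        /* Raw, unserialized TSC read. */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++)
            sink += __rdtsc();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("rdtsc:         %.1f ns/call\n", elapsed_ns(t0, t1) / ITERS);

        return (int)(sink & 1);
    }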

~~~
logicallee
>we managed to get the performance behavior we expected.

can you put this into roughly quantitative terms? How much of a performance
hit did you remove this way?

~~~
AbacusAvenger
I don't remember the exact numbers (this was 2011), but the overhead of using
e.g. CLOCK_MONOTONIC was substantial. Under Xen, the cost of reading
CLOCK_MONOTONIC was a few orders of magnitude higher than reading the TSC. I
think on Xen PV it was around 500ns per read, while on HVM it was about
2000-3000ns, or something like that.

I remember that with 8 disks that should each have been able to do 60K 4K IOPS
(early SSD models), we were capping out at 90K IOPS total with all disks in
parallel at a queue depth of 32 while reading CLOCK_MONOTONIC. When we
switched to TSC I think we ended up getting around 320K IOPS. Still not
perfect, but at that point we were also capped by the particular HBA we chose
(which didn't have multiqueue support).

------
wallstprog
Nice article!

If you're interested in clocks on Linux, you might also find this article
useful (shameless plug): [http://btorpey.github.io/blog/2014/02/18/clock-sources-in-li...](http://btorpey.github.io/blog/2014/02/18/clock-sources-in-linux/)

~~~
amluto
> Note that the 100ns mentioned above is largely due to the fact that my Linux
> box doesn’t support the RDTSCP instruction, so to get reasonably accurate
> timings it’s also necessary to issue a CPUID instruction prior to RDTSC to
> serialize its execution.

Huh? That’s definitely not true now, and I don’t think it ever was. Linux uses
LFENCE or MFENCE, depending on CPU.

~~~
_msw_
Using CPUID as a serializing instruction before RDTSC{,P} is a bad, bad thing
to do inside a virtual machine on Intel processors. CPUID will cause a VMEXIT,
and the CPUID instruction will be emulated. The Intel Software Developer's
Manual Instruction Set Reference gives good guidance on using MFENCE and
LFENCE as required.

[https://software.intel.com/sites/default/files/managed/39/c5...](https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf#page=1667)
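
Roughly what that guidance looks like in practice -- a minimal sketch,
assuming an x86-64 compiler with <x86intrin.h> and a CPU where LFENCE is
ordered with respect to RDTSC (as on modern Intel parts); the helper names are
just illustrative:

    /* Sketch of fenced TSC reads per the SDM guidance -- no CPUID,
     * so no VMEXIT when running inside a VM on Intel hardware. */
    #include <stdint.h>
    #include <x86intrin.h>

    /* Start of a timed region: LFENCE keeps earlier instructions from
     * being reordered past the timestamp read. */
    static inline uint64_t tsc_start(void)
    {
        _mm_lfence();
        return __rdtsc();
    }

    /* End of a timed region: RDTSCP waits for previous instructions to
     * finish executing, and the trailing LFENCE keeps later instructions
     * from starting early. (Use LFENCE; RDTSC if RDTSCP isn't available.) */
    static inline uint64_t tsc_end(void)
    {
        unsigned int aux;
        uint64_t t = __rdtscp(&aux);
        _mm_lfence();
        return t;
    }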

~~~
amluto
Linux has mostly stopped using CPUID to serialize at all. When full
serialization is needed, we use IRET now. In the future, we could optimize a
bit by writing to CR2, except on Xen.

------
misiti3780
Here is a link that doesn't have the SSL problem:
[http://archive.is/7zVmu#selection-1533.0-1536.0](http://archive.is/7zVmu#selection-1533.0-1536.0)

~~~
xahrepap
This also works, and still has SSL:
[https://archive.is/7zVmu#selection-1533.0-1536.0](https://archive.is/7zVmu#selection-1533.0-1536.0)

------
aarongolliver
An older discussion of this:
[https://news.ycombinator.com/item?id=13813079](https://news.ycombinator.com/item?id=13813079)

~~~
kalmar
Yup, that's the blog post I mentioned in the intro. Sometimes obsessively
reading HN pays off. Even if it's a year later... :-)

------
Johnny555
_And, EC2 does not live migrate VMs across physical hosts. I couldn’t find
anything explicit from AWS on this, but it’s something that Google is happy to
point out._

Is it a good idea for a production database to depend on a feature not being
used when the vendor hasn't said that they don't or won't use it? They may
very well live-migrate when it's convenient, and just not expose that
functionality to customers because they don't want customers demanding it.

~~~
KayEss
With the instance types they're using, live migration isn't really an option,
because the whole point of the i3 instances is the locally attached NVMe SSDs
where the database files live.

------
tofflos
Is AWS NVMe storage still ephemeral, and how do you deal with that? What
happens if a machine, or several, reboots?

~~~
kalmar
Post author here. It's ephemeral, yes. It survives reboots, so that's not a
problem. It doesn't survive instance-stop, so if a machine is being
decommissioned by AWS we do indeed lose its data. As for how we protect
against it, the main thing is replication: the data is stored on more than one
machine. If we lose a machine for whatever reason, the shards from that
machine are copied from a replica to another DB instance.

~~~
_msw_
As local NVMe storage does not have any interaction with the "classic" block
device mapping APIs (the storage shows up as a PCI device, the same way that a
GPU or FPGA does, and it doesn't matter in any way how the block device
mapping is set up), there is no reason to use "ephemeral" to describe it.

Said more directly: no, it is not ephemeral. It is local storage that is tied
to the life cycle of the instance.

------
abraham_lincoln
We set our server 4 hours ahead!

