
Why is my NTP server costing $500 per year? Part 1 (2014) - t0mas88
https://blog.pivotal.io/labs/labs/ntp-server-costing-500year
======
preinheimer
Reminds me of when Netgear decided to use the University of Wisconsin NTP
servers as the default in their consumer products:
[http://pages.cs.wisc.edu/~plonka/netgear-
sntp/](http://pages.cs.wisc.edu/~plonka/netgear-sntp/)

~~~
BoorishBears
The most frustrating part of things like this, and that Snapchat issue, is the
largest abusers could probably swallow the cost of their own NTP server usage
as a rounding error to their bottom line.

------
tyingq
I wonder if they somehow mistakenly joined their server to the region-specific
pr.pool.ntp.org group[1]. At the moment, that pool exists, but has no servers
in it.

So, if you were the only server in the pool, perhaps you would get a lot of
Puerto Rican traffic?

[1][http://www.pool.ntp.org/zone/pr](http://www.pool.ntp.org/zone/pr)

~~~
matt_wulfeck
Someone else commented in the article, but it's probably related to Puerto
Rico ISPs using a NAT because of lack of IPv4 address space. That single ip is
probably many many people.

~~~
tyingq
That one doesn't make sense to me. Most people don't have ntp configured to
point to pool.ntp.org...they are mostly PCs pointed at time.windows.com. And,
the pool is big enough anyway that it would spread the load from a relatively
small island pretty well. NAT could be a small part of it, but there's a
different primary cause.

~~~
matt_wulfeck
It doesn't necessarily need to be Windows making the calls. Cell phones use a
NAT typically and there was recently an issue with Snapchat DDoSing NATs.

~~~
tyingq
That's an example of a "different primary cause". It's not NAT in that case,
it's an app using a library with terrible defaults.

------
libeclipse
An interesting theory in one of the comments.

> I wonder if Puerto Rico has run out of its pool of IPv4 addresses. After
> Europe and Asia, just this month Latin America as well, have exhausted their
> IPv4 pools, many local ISPs have resorted to using NAT to deal with the
> scarcity of addresses (of course, after years procrastinating IPv6 and
> pretending that this day wouldn't come about). Given that the source is a
> Puerto Rican ISP, and one of the offending addresses from a small /21
> network, it's possible that NAT is to blame. As ISP NAT increasingly becomes
> more prevalent, this is going to be rather touchy to deal with abuses. For
> is it an abuser or just several innocent users behind a NAT?

~~~
tyingq
I'm not understanding why NAT would cause it. I could see something like a
misconfigured forwarding DNS cache causing it. Where it only queries
pool.ntp.org once, and continues returning the result in the same order (with
pivotal's ip at the top of the list) to a large number of querying clients.
Then, perhaps, if there are a bunch of natted clients behind one ip? NAT, on
its own, without some other contributing factor, shouldn't cause this.
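
A minimal sketch of the stuck-cache theory above, assuming a healthy resolver rotates the answer list per query while a misbehaving cache keeps handing out one frozen ordering. The addresses are hypothetical (documentation range), and each client is assumed to use only the first address it receives:

```python
from collections import Counter

# Hypothetical pool members (192.0.2.0/24 is reserved for documentation).
SERVERS = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]


def rotating_answers(servers, n_clients):
    """Healthy round robin: each query sees the list rotated by one,
    and the client uses the first address in its answer."""
    return Counter(servers[i % len(servers)] for i in range(n_clients))


def frozen_answers(servers, n_clients):
    """Stuck cache: every client sees the same ordering, so every
    client picks the same first address."""
    return Counter(servers[0] for _ in range(n_clients))


if __name__ == "__main__":
    print("rotating:", dict(rotating_answers(SERVERS, 1000)))
    print("frozen:  ", dict(frozen_answers(SERVERS, 1000)))
```

Under these assumptions the frozen cache sends every client to the first server, regardless of whether NAT is hiding them behind one address.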

~~~
discordianfish
NAT wouldn't cause it, but it would hide the fact that those are many clients
all sharing the same source IP. Of course, that wouldn't explain why they
observed a general increase in traffic.

------
sigio
part 2: [https://blog.pivotal.io/labs/labs/ntp-server-
costing-500year...](https://blog.pivotal.io/labs/labs/ntp-server-
costing-500year-part-2-characterizing-ntp-clients)

~~~
rincebrain
I wish he'd explained somewhere how they leapt to examining virtualized NTP
clients, or what they ultimately did (since there's no part 3 that I can
find).

~~~
spydum
Virtualization and time sync have had notorious problems. One ugly workaround
was frequent NTP polling and adjustments. NTP has a min and max poll interval,
and it determines how frequently it should poll _automatically_ based on how
far it sees drift happening. If it drifts pretty fast, it will quickly
gravitate to the minpoll value, which is exactly what they show in their first
graph: tons of polling at the minimum poll interval for certain hypervisors.
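
To put numbers on that: ntpd expresses poll intervals as powers of two in seconds, with defaults of minpoll 6 and maxpoll 10. A rough sketch of how much more traffic a client pinned at minpoll generates (the helper names are illustrative, not part of ntpd):

```python
SECONDS_PER_DAY = 86_400


def poll_interval_seconds(exponent: int) -> int:
    """minpoll/maxpoll are exponents: the interval is 2**exponent seconds."""
    return 2 ** exponent


def requests_per_day(exponent: int) -> float:
    """Requests one client sends per day to one server at this interval."""
    return SECONDS_PER_DAY / poll_interval_seconds(exponent)


if __name__ == "__main__":
    # ntpd defaults: minpoll 6 (64 s), maxpoll 10 (1024 s).
    for name, exp in (("minpoll", 6), ("maxpoll", 10)):
        print(f"{name} {exp}: every {poll_interval_seconds(exp)} s, "
              f"{requests_per_day(exp):.1f} requests/day")
```

A client stuck at the default minpoll polls 16 times as often as one that has settled at the default maxpoll.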

~~~
raverbashing
I wonder, should this be solved with local NTP servers that resolve from the
name pool.ntp.org?

~~~
rhizome
I've found it to be good practice to run one or a few stratum 2 nodes to
serve local resources. It's usually more important that these resources be in
sync with each other than with a satellite. That is easier when as few nodes
as possible try their luck over the internet against bogged-down public
stratum 1 NTP sources, and the rest are configured to use a single source of
time from a box in their vicinity.

------
Spooky23
Hats off to everyone contributing to public services like this.

My then company wanted to give back by doing this many years ago and it was an
eye-opening experience. We had troubles almost immediately with utilization
and script kiddies. The company ended up only doing it for a relatively short
period and ended up making contributions to projects instead.

~~~
zanchey
Our student-run computing club added a machine to the pool and melted the
University's firewall. Oops.

~~~
jlgaddis
Honestly, that's the University's fault then. Properly configured, it
should've had very little noticeable effect on the firewall (i.e. "permit udp
any host 10.11.12.13 eq 123") as there's no need to do any inspection or
tracking state ...

... unless they saturated the available bandwidth but, really, that's a
different issue (although also preventable!).

------
ChuckMcM
I think this is a great look at walking through the analysis. I too experienced
a huge spike in NTP traffic in 2014 but it was because of people exploiting
NTP for reflection attacks to DDOS other parties. That forced me to use a GPS
module and a Beaglebone Black as an internal time server (which has been
great).

~~~
daveguy
I have a few questions about that if you have a minute:

What GPS module did you go with and is it still available? Did you have
problems getting signal inside (need to be by a window, run an antenna, etc)?

~~~
ChuckMcM
I used the Adafruit "ultimate" GPS module
([https://www.adafruit.com/product/746](https://www.adafruit.com/product/746))
which has the 1pps output and can connect to an external antenna. Then I got
this antenna
([https://www.adafruit.com/product/960](https://www.adafruit.com/product/960))
and this adapter
([https://www.adafruit.com/product/851](https://www.adafruit.com/product/851)).
Soldered a header connector to Beaglebone protocape
([https://www.adafruit.com/product/572](https://www.adafruit.com/product/572)),
wired it to the serial port and PPS to the GPIO pin (just like this:
[https://web.archive.org/web/20131209092059/http://the8thlaye...](https://web.archive.org/web/20131209092059/http://the8thlayerof.net/2013/12/08/adafruit-
ultimate-gps-cape-creating-custom-beaglebone-black-device-tree-overlay-
file/)).

I put the GPS antenna on my window sill, it has no problem at all staying
locked. My plan had been to stick it outside the window but turned out not to
be necessary.

------
johansch
Because you use AWS and they charge insane fees for outgoing bandwidth.

~~~
z92
This $500 would have been $60, on a Digital Ocean box. DO has 1 TB/month
limit. Their usage was 300 GB/month.
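
A back-of-the-envelope version of that comparison. The 300 GB/month usage, the 1 TB/month allowance, and the $60/year figure come from the comment above; the per-GB metered price is a hypothetical placeholder, not a quoted AWS rate, and instance charges are ignored:

```python
GB_PER_MONTH = 300            # usage quoted in the thread
FLAT_PLAN_PER_MONTH = 5.00    # implied by the $60/year figure
INCLUDED_GB = 1000            # 1 TB/month allowance quoted in the thread
ASSUMED_PRICE_PER_GB = 0.09   # hypothetical metered egress price


def metered_yearly_cost(gb_per_month, price_per_gb):
    """Yearly bandwidth bill when every GB is charged."""
    return gb_per_month * price_per_gb * 12


def flat_yearly_cost(gb_per_month, plan_per_month, included_gb):
    """Yearly bill on a flat plan, valid only while under the allowance."""
    if gb_per_month > included_gb:
        raise ValueError("usage exceeds the plan's included transfer")
    return plan_per_month * 12


if __name__ == "__main__":
    metered = metered_yearly_cost(GB_PER_MONTH, ASSUMED_PRICE_PER_GB)
    flat = flat_yearly_cost(GB_PER_MONTH, FLAT_PLAN_PER_MONTH, INCLUDED_GB)
    print(f"metered bandwidth: ${metered:.2f}/year")
    print(f"flat plan:         ${flat:.2f}/year")
```

With the assumed price, metered bandwidth alone comes to a few hundred dollars a year, while the flat plan stays at $60 as long as usage remains under the allowance.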

~~~
brian_cunnie
[author]

Digital Ocean is a great deal! Thanks for pointing that out.

The reason I use {aws,azure,google} to host my NTP servers is that my day job
is developing a VM orchestrator (BOSH) for Cloud Foundry, and BOSH doesn't
support Digital Ocean yet (AFAIK). But that's a personal choice, and an
admittedly expensive one.

~~~
diegorbaquero
What about amazon lightsail?

~~~
ilaksh
It also runs on AWS so even though they try to trick you into thinking you
don't pay for bandwidth you do. Also small EC2 instances are throttled so
their server would probably not function most of the time.

~~~
diegorbaquero
They include 1TB BW in the $5 instance, that's $90 savings.

------
fxlv
I'm surprised there are no comments about the fact that these guys decided to
run NTPD on a VM.

~~~
brian_cunnie
[author]

NTP runs fairly decently in a VM. Don't take my word for it — look at the
graphs of my servers:

Here's my Google VM, notice the jitter is within +/- 5 milliseconds:

[http://www.pool.ntp.org/scores/104.155.144.4](http://www.pool.ntp.org/scores/104.155.144.4)

Here's my Hetzner VM (Germany). +/- 10 milliseconds, though I can't help but
suspect the distance from the monitoring station (Los Angeles) may have more
to do with it than being a VM:

[http://www.pool.ntp.org/scores/78.46.204.247](http://www.pool.ntp.org/scores/78.46.204.247)

Here's my AWS VM. Much worse than Google in that it's +/- 50 milliseconds,
but still good enough to pass muster with pool.ntp.org:

[http://www.pool.ntp.org/scores/52.0.56.137](http://www.pool.ntp.org/scores/52.0.56.137)

Here's my Azure VM. It's in Singapore, and I re-deployed it last night, so the
numbers are still coming in, but it has a pretty tight distribution:

[http://www.pool.ntp.org/scores/52.187.42.158](http://www.pool.ntp.org/scores/52.187.42.158)

~~~
jlgaddis
Everyone's needs differ, I suppose, so some might consider that "decent". 10ms
-- or even 50ms -- might be acceptable for many (most?) use cases but not for
me.

From a quick look, my own (stratum 2) server in the pool currently has an
offset of just under 1/20th of one millisecond.

Regardless, thanks for contributing to the pool!

------
feld
Why are people joining VMs to the NTP pool? These servers should be identified
by address space and blacklisted.

~~~
foota
Why?

~~~
therein
Because VMs themselves might not be able to keep track of time accurately
(potentially inconsistent tickrate) the way that a bare-metal setup would be
able to. That's why they should be mere consumers (as in sync their time to
whatever the remote says rather than contribute to the pool).

~~~
feld
Correct, unless you have a very specific VM configuration where you are truly
dedicating a CPU/core to a VM, it's not fit for being an NTP server.

------
technion
I ran an NTP server on a Raspberry Pi for some time.

The bottleneck I kept hitting was the 65535 NAT translation limit on my Cisco
router, at which point, load was quite manageable on the Pi.

It's extraordinary how much traffic one cheap device could service.

------
Steeeve
This is a 2 year old article. Where's the follow up? What did they end up
doing?

EDIT: AHA! Part 2: [https://blog.pivotal.io/labs/labs/ntp-server-
costing-500year...](https://blog.pivotal.io/labs/labs/ntp-server-
costing-500year-part-2-characterizing-ntp-clients)

------
jelder
Why would anyone run an authoritative time service on a virtual server in the
first place? My experience is that system time suffers greatly from noisy
neighbors.

------
ddorian43
... because it's hosted in the cloud and you get no free bandwidth allowance
with your VPS?

------
lucb1e
I'm not sure if I've missed it, but is the question (from the title) ever
answered? The discrepancy between expected traffic volume and actual traffic
volume is huge and seemingly unexplained.

~~~
brian_cunnie
[author]

My bad — I never wrapped it up. Thanks to the HN interest, I'll try to write
Part 3 over the winter break.

The short version is this: it's gonna cost a couple of hundred dollars to run
a 1Gbe NTP server in pool.ntp.org, but you can tweak the ntp.conf to save
~$100.

------
Animats
This is the Snapchat bug reported yesterday, right?

Incidentally, how is AWS dealing with the leap second next week? Google is
going to have their time servers start to run fast around 20 minutes in
advance of the leap second, so they're back in sync at 00:00:60 UTC.

~~~
jeffbarr
Details on AWS at [https://aws.amazon.com/blogs/aws/look-before-you-leap-
decemb...](https://aws.amazon.com/blogs/aws/look-before-you-leap-
december-31-2016-leap-second-on-aws/)

------
Jabdoa
Second part is here: [https://blog.pivotal.io/labs/labs/ntp-server-
costing-500year...](https://blog.pivotal.io/labs/labs/ntp-server-
costing-500year-part-2-characterizing-ntp-clients)

------
Faaak
It was mainly due to the poorly coded Snapchat program:
[https://news.ntppool.org/2016/12/load/](https://news.ntppool.org/2016/12/load/)

EDIT: this post was indeed from 2014. My bad then. However, the same issue
started again two weeks ago (~17 Dec 2016).

~~~
brian_cunnie
Here is a visual representation of the effect of the Snapchat broken-ness on
my NTP server:

[https://cloud.githubusercontent.com/assets/1020675/21468123/...](https://cloud.githubusercontent.com/assets/1020675/21468123/ac026274-c9d2-11e6-8334-2f56e9c9d20f.png)

Note that inbound traffic which was steady at ~4k packets/sec spikes as high
as five times as much. Also note that the snapchat traffic followed a
circadian rhythm (much higher traffic during the daytime).
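
For a sense of scale, a rough conversion from packet rate to bandwidth. It assumes plain 48-byte NTP packets over UDP and IPv4 (76 bytes at the IP layer), ignores link-layer framing, and counts one direction only; the 5x multiplier is the spike mentioned above:

```python
NTP_PAYLOAD = 48   # basic NTP packet, no extensions or authentication
UDP_HEADER = 8
IPV4_HEADER = 20
PACKET_BYTES = NTP_PAYLOAD + UDP_HEADER + IPV4_HEADER  # 76 bytes


def bytes_per_second(packets_per_second):
    """One-direction IP-layer throughput for a given packet rate."""
    return packets_per_second * PACKET_BYTES


def gb_per_day(packets_per_second):
    """Same traffic expressed in decimal gigabytes per day."""
    return bytes_per_second(packets_per_second) * 86_400 / 1e9


if __name__ == "__main__":
    for label, pps in (("steady", 4_000), ("spike", 5 * 4_000)):
        print(f"{label}: {pps} pkt/s = {bytes_per_second(pps) / 1e3:.0f} kB/s, "
              f"{gb_per_day(pps):.1f} GB/day inbound")
```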

------
rupellohn
Could this have been the result of an NTP amplification attack?
[https://www.us-cert.gov/ncas/alerts/TA13-088A](https://www.us-
cert.gov/ncas/alerts/TA13-088A)

~~~
moxious
Article said no, because the traffic was symmetrical and not lopsided. If this
had been part of an attack you'd expect to see far more outgoing bandwidth
than incoming.
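
That symmetry test can be sketched as a simple ratio check. The threshold of 2.0 is an arbitrary illustrative value, not a figure from the article, and the byte counts are made up:

```python
def amplification_ratio(bytes_in, bytes_out):
    """Outbound bytes sent per inbound byte received."""
    if bytes_in <= 0:
        raise ValueError("bytes_in must be positive")
    return bytes_out / bytes_in


def looks_like_reflection(bytes_in, bytes_out, threshold=2.0):
    """Flag traffic where replies are far larger than requests.
    The threshold is a hypothetical cut-off for illustration."""
    return amplification_ratio(bytes_in, bytes_out) >= threshold


if __name__ == "__main__":
    # Ordinary NTP: request and reply are the same size.
    print(looks_like_reflection(bytes_in=1_000_000, bytes_out=1_000_000))
    # Reflection: small spoofed requests trigger much larger replies.
    print(looks_like_reflection(bytes_in=1_000_000, bytes_out=50_000_000))
```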

