
Some DNS lookups causing 5xx errors due to leap second bug - nomadicactivist
https://www.cloudflarestatus.com/incidents/1fczgjmknplp
======
nullc
My CDMA phone dropped service for a few minutes after the leap second.

It's absurd that we continue to keep subjecting ourselves to these disruptions
and the considerable amount of work that goes into handling leap seconds for
the systems that aren't disrupted by them.

Leap seconds serve no useful purpose. Applications that care about solar time
usually care about the local solar time, while UT1 is a 'mean solar time'
that doesn't really have much physical meaning (it's not a quantity that can
be observed anywhere, but a model parameter).

It would take on the order of 4000 years for time to slip even one hour. If we
found that we cared about this thousands of years from now, we could simply
shift timezones over by one hour after 2000 years; existing systems already
handle devices in a mix of timezones.

[And a fun aside: it appears likely that in less than 4000 years we would need
more than two leap seconds per year, sooner if warming melts the icecaps. So
even the things that correctly handle leap seconds now will eventually fail.
Dealing with the Earth's changing rotation speed can't be avoided forever, but
we can avoid suffering over and over again now.]

There are so many hard problems that can't easily be solved that we should be
spending our efforts on. Leap seconds are a folly purely of our own making,
which we can choose to stop at any time. Discontinuing leap seconds is
completely backwards compatible with virtually every existing system. The very
few specialized systems (astronomy) that actually want mean solar time should
already be using UT1 directly to avoid the up-to-0.9-second error between UTC
and UT1._ For everything else, all that is required is that we stop issuing
leap seconds (a decision of the ITU), or that we stop listening to them (a
decision of various technology industries to move from using UTC to
TAI+offset).

The recent leap smear moves are an example of the latter course but a half-
hearted one that adds a lot of complexity and additional failure modes.

(_In fact, for the astronomy applications that leap seconds theoretically
help, they _still_ add complication: it is harder to apply corrections from
UTC to an astronomical time base because UTC has discontinuities in it.)

~~~
hartator
Or maybe we just need to code systems that handle leap seconds correctly.

~~~
lend000
Two valid points. A third: convince everyone to adopt epoch time for data
transfer (seconds since epoch), and let applications that require formatted
time do the transformation where it will be used (not earlier). It doesn't
make a lot of sense that a timestamp represented as HOUR/MIN/SEC:DAY/YEAR
should be passed around on the network of a production system. Leave it to the
recipient to convert. I guess this is a subset of your point.

~~~
rspeer
Leap seconds aren't just an issue of formatting times. Leap seconds actually
involve turning the UTC count of seconds back by a second.

You'd have to switch the "seconds since epoch" count to TAI, and that would
cause new formatting bugs because all kinds of software assumes that the
minute changes on a multiple of 60 seconds since the epoch.
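The assumption rspeer describes can be made concrete. A minimal sketch (the
timestamp is a real Unix value; the 37-second TAI-UTC offset is the value in
force since the end of 2016):

```go
package main

import "fmt"

// Lots of formatting code derives the seconds field of a wall-clock time
// as the epoch count modulo 60. That only yields the correct UTC seconds
// field if the count excludes leap seconds, as Unix time does. A count
// that included them (TAI-style) would be offset from UTC minute
// boundaries by the accumulated leap seconds.
func secondsField(epoch int64) int64 {
	return epoch % 60
}

func main() {
	unix := int64(1483228800) // 2017-01-01T00:00:00Z as a Unix timestamp
	fmt.Println(secondsField(unix)) // 0: a minute boundary, as expected

	// The same instant on a TAI-style count is 37 seconds further along,
	// so naive formatting would print :37 instead of :00.
	fmt.Println(secondsField(unix + 37)) // 37
}
```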

~~~
lend000
Yes but even here, seconds since Epoch should remain unaltered, and the
correction should be made by whatever is rendering a human readable date
format (to address every leap second). In most cases, the renderer wouldn't
have to address it (since it's only being read by humans, and a second
difference does not usually matter) and it's truly a non-issue! The
application-layer dev can choose to increment time in whatever blocks he wants
instead of having an if-else chain for every "official" leap second. Like
adding a minute every few hundred years, to that other commenter's point.

~~~
samrolken
Unfortunately, epoch time is not literally "seconds since epoch", at least not
as implemented/standardized as "Unix time". It skips or repeats itself around
leap seconds. So it can't save us here.

I think if there were such a thing as a different kind of epoch time that
literally, actually is "seconds since epoch", it would help a lot and work
like you suggest.

~~~
pycal
Is this correct? Because I am having trouble understanding the rationale
behind making the unix epoch relative to an earth solar year, as opposed to
just the "number of seconds which have elapsed since the unix epoch". Do you
have an example of this implementation? The Wikipedia article regarding epoch
notes many counter-examples.

~~~
zeta0134
The Wikipedia article on Unix time describes precisely this issue. Unix time
does not include leap seconds, which means that when a leap second occurs,
extra time has to be inserted at the midnight transition to the next day.
Strictly following the standard, the Unix timestamp _rolls time backwards by
one second_ over midnight, which is precisely the kind of behavior that breaks
systems depending on continuous timestamps:
[https://en.wikipedia.org/wiki/Unix_time#Leap_seconds](https://en.wikipedia.org/wiki/Unix_time#Leap_seconds)

It sounds like in your scenario, you would prefer Unix time to instead include
the leap second, so that no rollback or time smearing behavior would need to
occur. I believe the reason it does not has to do with simplicity: current
systems rely on a day being 86,400 seconds, making each year (regardless of
leap days) a multiple of 86,400. Leap seconds break this simple assumption.
While it would be simple for a new time formatting system to take leap seconds
into account, it is not so simple to go and retrofit all of the existing
systems for a new formatting standard, and convince so many different groups
of developers to change that much code while also agreeing with one another
about the changes.
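The rollback is easy to see in numbers. A small sketch of the 2016-12-31 leap
second, assuming the common behavior of repeating the timestamp
(implementations vary; some step backwards, some freeze):

```go
package main

import "fmt"

// UTC labels across the 2016-12-31 leap second, and the Unix timestamps
// a standard clock assigns to them. Unix time has no slot for :60, so
// the count for midnight repeats the leap second's value.
var events = []struct {
	utc  string
	unix int64
}{
	{"2016-12-31T23:59:59Z", 1483228799},
	{"2016-12-31T23:59:60Z", 1483228800}, // the inserted leap second
	{"2017-01-01T00:00:00Z", 1483228800}, // repeated!
}

func main() {
	for _, e := range events {
		fmt.Printf("%s -> %d\n", e.utc, e.unix)
	}
}
```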

------
ChuckMcM
Once again we're screwed by different people wanting "time" to mean different
things. There is no hope for humanity once we start traveling anywhere close
to light speed into and out of the solar system.

I propose a new "non-time" time system. It has exactly two real values, each
ranging from 0 to tau, and an integer: the first real is radians of Earth's
rotation, the second is radians of its orbit around the Sun, and the integer
counts the number of complete cycles. So lunch time in Greenwich is 'pi'.

It has the benefit that its "source" is actually the planet, so we can use a
telescope at Greenwich to pick a certain alignment of stars as the "zero"
point, and then each time it realigns to that exact point, you can increment
the "year" count.

I believe we can build a robust system to support this out of stone. We'll
need to create a circle of stones, but using a small hole drilled through a
stone and a marker on the ground we can always identify (0.0, 0.0),
(0.0, pi/2), (0.0, pi), and (0.0, 3*pi/2).

~~~
perennate
If we somehow manage to "reverse" the Earth's orbit around the sun, does that
mean we will have invented a time machine?!

~~~
cmdrfred
I'd be more impressed with a machine that can decelerate the Earth and reverse
its spin direction without killing everyone, as opposed to a time machine,
personally.

------
gamegoblin
I'd have guessed most big services would be using something akin to time
smearing [1] since the first big leap-second outages years ago. Is there any
reason why Cloudflare would be unable to use this technique?

[1]
[https://developers.google.com/time/smear](https://developers.google.com/time/smear)

~~~
wyager
It's pretty lame that a lot of software is so fragile that it breaks if we
give it the correct time.

The solution here is that any software that relies on accurate timing and/or
breaks when you change the time should be using epoch seconds, not any sort of
human-oriented time format.

~~~
rspeer
Leap seconds _are_ a change to the number of UTC epoch seconds.

~~~
wyager
> Leap seconds are a change to the number of UTC epoch seconds.

UTC is different from epoch time. Epoch time by definition does not count leap
seconds.
[https://en.m.wikipedia.org/wiki/Unix_time](https://en.m.wikipedia.org/wiki/Unix_time)

~~~
nshepperd
"Does not count leap seconds" has to win some kind of award for misleading
terminology. I'd wager it's responsible for a significant portion of leap
second bugs due to confusion & misunderstanding about what Unix time is:

It _sounds_ like what it means is that Unix time counts the number of _real,
actual, by-the-clock_ seconds that have passed since the epoch. That would be
logical. But what it actually means is that it counts the number of real,
actual, by-the-clock seconds, _minus_ the number of those that have been
designated "leap seconds".

That is to say, whenever a "leap second" occurs, the nice monotonic progress
of Unix time is mutilated by suddenly adding or subtracting 1 from the total
count so far. That's what "does not count leap seconds" means, and sometimes
even what "ignores leap seconds" means (which is of course even worse
terminology).

~~~
wyager
Shit, you're right. It had me totally fooled.

------
karlhughes
This was shared a while ago, but it's relevant again:
[http://www.madore.org/~david/computers/unix-leap-seconds.html](http://www.madore.org/~david/computers/unix-leap-seconds.html)

------
userbinator
I'm curious what if anything would be problematic if everything just
effectively "ignored" leap seconds (i.e. would this outage not have occurred?)
--- one minute is always 60 seconds, an hour is always 60 minutes, and a day
always 24h. I mean, if you consider the fact that human society has managed to
function perfectly well with almost everyone not knowing nor caring what a
leap second is, and yet apparently some software does --- leading to problems
like this --- something doesn't feel right.

~~~
DigitalJack
100 seconds per minute, 100 minutes per hour, 20 hours a day. New seconds are
0.432 old seconds, or whatever ratio they need to make to quit leaping around.

~~~
colejohnson66
Fun fact: the reason we use 60 seconds and 60 minutes is because of the
Babylonians who used base-60. IIRC, it's also why we use 360° for a circle.

~~~
fnj
It might also have something to do with the fact that 60 is evenly divisible
by 30, 20, 15, 12, 10, 6, 5, 4, 3, and 2.

------
jlgaddis
I'm guessing CloudFlare runs their own custom DNS server software?

~~~
ckdarby
Is this not it?
[https://github.com/cloudflare/dns](https://github.com/cloudflare/dns)

~~~
alexforster
Yeah, and Go doesn't expose a monotonic clocksource in its stdlib[1]. I'd bet
that's what this boils down to.

[1]
[https://github.com/golang/go/issues/12914](https://github.com/golang/go/issues/12914)

~~~
kondbg
I'm confused on why a DNS server would need to rely on a monotonic clock for
its use cases. Is there a part of DNS that relies on the assumption of
synchronized, monotonic time? (Perhaps TTL/expiry of records? But I still
don't see why having a non monotonic clock source would harm if CF is using Go
timers for expiry)

~~~
tyingq
One example might be rate limiting. Count requests over elapsed time. If
elapsed time is a negative number, the math might trigger a bug that causes CF
to block requests...too many requests over time period X.
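A sketch of how that can go wrong, with a hypothetical limiter (illustrative
code, not Cloudflare's actual implementation):

```go
package main

import (
	"errors"
	"fmt"
)

// rate computes requests per second over a wall-clock interval. Across
// a leap second (or any backwards clock step) the measured interval can
// come out zero or negative; dividing anyway yields a negative or
// infinite rate that downstream logic may read as "absurdly many
// requests" and start blocking.
func rate(requests, elapsedSec int64) (float64, error) {
	if elapsedSec <= 0 {
		return 0, errors.New("clock went backwards; discard sample")
	}
	return float64(requests) / float64(elapsedSec), nil
}

func main() {
	r, err := rate(100, 2)
	fmt.Println(r, err) // 50 <nil>

	_, err = rate(100, -1) // the kind of interval a leap second can produce
	fmt.Println(err)
}
```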

~~~
kondbg
Cloudflare posted a post-mortem [1]. They were measuring round trip time, and
supplying the result of that into the golang rand.Int63n() function, which
panics the process when given a negative number.

[1] [https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/](https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/)
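The failure mode is easy to reproduce: rand.Int63n is documented to panic when
its argument is not positive, so any negative RTT fed into it takes the
process down. A sketch (the recover guard is illustrative, not Cloudflare's
exact fix):

```go
package main

import (
	"fmt"
	"math/rand"
)

// pick returns a pseudo-random value in [0, rttNanos), the kind of
// weighted selection the post-mortem describes. rand.Int63n panics if
// its argument is <= 0, which is exactly what a negative RTT measured
// across the leap second made happen.
func pick(rttNanos int64) (n int64, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	return rand.Int63n(rttNanos), nil
}

func main() {
	if _, err := pick(-1); err != nil {
		fmt.Println(err) // err is non-nil: rand.Int63n panicked
	}
}
```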

------
ComputerGuru
I was at a relative's and tried to load two different web sites... My first
thought was that their wifi sucked. My second was "will we finally learn a
lesson today about the disturbing trend towards constant re-centralization of
all our online services?"

------
justinholmes
Funny that they wrote about it in 2014:
[https://blog.cloudflare.com/its-go-time-on-linux/](https://blog.cloudflare.com/its-go-time-on-linux/)

------
brongondwana
I was glad things have improved since four years ago!

[https://blog.fastmail.com/2012/07/03/a-story-of-leaping-seconds/](https://blog.fastmail.com/2012/07/03/a-story-of-leaping-seconds/)

This time I didn't get paged for anything on leap second day :)

------
zkms
What causes real-world problems with leap seconds is actually unrelated to the
nasty interactions of metrology and solar time -- it's a specific and
avoidable problem with how NTP (and many OSes/languages) represent time --
it's a types issue.

The right way for computers to represent time is with a number that represents
the number of constant-rate ticks that have elapsed past some agreed-upon
epoch. If you know what the epoch is and how long each tick is (lots of people
use 1 / 9.192 GHz), it is easy to know how many ticks are between any two time
values, and you can convert a time value with one epoch to one with a
different epoch and tick rate -- you can do everything people expect to do
with time. There are no numbers that represent an invalid time value, and for
each moment, there is a _unique_ time value that represents it. There's a
one-to-one mapping with no nasty edge cases.
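A sketch of that representation (nanosecond ticks for simplicity rather than
the 1 / 9.192 GHz interval mentioned above; the type and method names are
illustrative):

```go
package main

import "fmt"

// Instant is a count of constant-rate ticks since an agreed epoch.
// Every int64 is a valid time, every moment has exactly one value, and
// differences are plain subtraction with no edge cases.
type Instant struct {
	Ticks int64 // nanoseconds since the chosen epoch
}

// Sub returns the number of ticks between two instants.
func (a Instant) Sub(b Instant) int64 { return a.Ticks - b.Ticks }

// Rebase expresses the same physical moment against a different epoch,
// given the new epoch's offset in ticks from the old one.
func (a Instant) Rebase(epochOffsetTicks int64) Instant {
	return Instant{Ticks: a.Ticks - epochOffsetTicks}
}

func main() {
	a := Instant{Ticks: 1_000_000_000}
	b := Instant{Ticks: 3_500_000_000}
	fmt.Println(b.Sub(a)) // 2500000000 ticks = 2.5 s
}
```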

Leap seconds are a step function that is added to a constant-rate timescale
(whose name is "TAI") in order to generate a discontinuous timescale (whose
name is "UTC") that never is too different from solar time. There is nothing
fundamentally abhorrent about leap seconds -- there are just good and bad ways
to represent, disseminate, and compute with timescales that involve leap
seconds.

The right way to handle leap seconds can be seen with many GNSSes and PTP
(very high precision hardware-assisted time synchronization over Ethernet).
GPS, BeiDou, Galileo, and PTP all involve dissemination and computation on
time values -- and with dire consequences for failure/downtime/inaccuracy.

The designers of those systems all somehow converged on the choice to
_separate out_ the nice, predictable, constant-rate and discontinuity-free
part of UTC from the nasty step function (the leap second offset). Times in
all those systems are represented as the _tuple_ (TAI time at t, leap offset
at t). This means that the entire system can calculate and work with
(discontinuity-free and constant-rate) TAI times but also truck around the
leap offsets, so when time values need to be presented to a user (or anything
that _requires_ a UTC time), the leap offset can be added then. Crucially, all
the maths that are done on time values are done on TAI values, so calculating
a time difference or a frequency is easy and the result is always correct,
regardless of the leap second state of affairs. Representing UTC time as a
tuple makes the semantics of that data type easy to reason about -- the "time"
bit is in the first element and is completely harmless -- the edge cases all
live in the second element of the tuple.
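A sketch of the tuple representation zkms describes (illustrative types, not
any particular GNSS or PTP structure):

```go
package main

import "fmt"

// UTCInstant is the GNSS/PTP-style pair: a continuous TAI count plus
// the leap offset in force at that instant. All arithmetic uses the TAI
// part; the offset is applied only when a UTC label is needed.
type UTCInstant struct {
	TAI        int64 // seconds on the continuous TAI timescale
	LeapOffset int64 // TAI - UTC at this instant (37 since 2017)
}

// Elapsed is a plain TAI difference: correct even across a leap second.
func Elapsed(a, b UTCInstant) int64 { return b.TAI - a.TAI }

// UTCSeconds applies the step function only at presentation time.
func (u UTCInstant) UTCSeconds() int64 { return u.TAI - u.LeapOffset }

func main() {
	// Two instants two real seconds apart, spanning the 2016-12-31
	// leap second insertion (the offset goes 36 -> 37).
	before := UTCInstant{TAI: 1_000_000, LeapOffset: 36}
	after := UTCInstant{TAI: 1_000_002, LeapOffset: 37}

	fmt.Println(Elapsed(before, after))                   // 2 real seconds
	fmt.Println(after.UTCSeconds() - before.UTCSeconds()) // 1 UTC label second
}
```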

NTP and Unix (and everything descending from and affected by those) have made
the mistake of representing and transmitting time as a single integer, TAI(t)
+ leap offset(t). This is not a data representation with sensible semantics,
and it is very hard to reason about. First of all, the leap second offset is
nondeterministic and also unknown -- there is no way to get it from NTP and
there is no good way to know the time of the next leap event. Second of all,
there are _repeated_ time values for different moments in time (and when a
negative leap second happens, there will be time values that represent _no_
moment in time). Predictably, introducing nondeterministic
discontinuities doesn't work so well in the real world. There are a bunch of
bugs in NTP software and OS kernels and applications that make themselves
shown every time there is a leap second. It's not even just NTP _clients_ that
struggle -- 40% of public Stratum-1 NTP servers had erroneous behavior [0]
related to the 2015 leap second! Given that level of repeated and widespread
failure, the right solution is not to blame programmers -- it should be to
blame the standard. The UTC standard and how NTP disseminates UTC are
fundamentally not fit for computer timekeeping.

GNSS receivers and PTP hardware get used in mission-critical applications
(synchronizing power grids and multi-axis industrial processes, timestamping
data from test flights and particle accelerators) all the time -- and even
worse, there's no way to conveniently schedule downtime/maintenance windows
during leap second events! "Leap smear" isn't an acceptable solution for those
applications, either -- you can't lie about how long a second is to the Large
Hadron Collider. GNSS and PTP systems handle leap second timescales without a
hitch by representing UTC time with the right data type -- a tuple that
properly separates two values that have the same _unit_ (seconds) but have
vastly different _semantics_. The NTP and unix timestamp approach of directly
baking the discontinuities into the time values reliably causes problems and
outages. The leap second debacle is not about solar time vs atomic time; it's
about the need for data types that accurately represent the semantics of what
they describe.

[0]:
[http://crin.eng.uts.edu.au/~darryl/Publications/LeapSecond_camera.pdf](http://crin.eng.uts.edu.au/~darryl/Publications/LeapSecond_camera.pdf)

~~~
anderskaseorg
Except people want to be able to talk about times years in the future despite
not knowing the number of leap seconds that may happen in the intervening
time. It is more useful in most fields to talk about an event happening every
N years/months/days than an event happening every N seconds. Most people do
not want a leap second to shift their scheduled event from 10:00:00 every
Monday to 9:59:59 or 10:00:01 in the name of using a whole number of
86400-second intervals.

~~~
TazeTSchnitzel
> Except people want to be able to talk about times years in the future
> despite not knowing the number of leap seconds that may happen in the
> intervening time.

Doesn't the (TAI, leap second count) tuple solution work for this? Maybe I
misunderstand the purpose, but you could use the leap second count to figure
out how many seconds the TAI is off by.

But that doesn't matter, because date intervals shouldn't be represented with
seconds anyway. Months and years have different lengths.

~~~
wereHamster
You cannot represent "next Monday at 12:00" with a tuple (TAI, leap second
count), because you don't know how many leap seconds there will be. Or maybe
you know for next Monday, but you definitely don't know for the Monday a year
out, as leap seconds are only announced ~6 months in advance.

~~~
beering
I think you cannot represent "Monday in a year at 12:00" with a simple integer
either, right? For example, the king of the country may decide to cancel DST
for the year. Either way you would have to store it as a calendar event and
figure out the exact time once you're closer.

------
zitterbewegung
So you can basically tell when a tool or company has never experienced a leap
second before: their system goes down because of it.

------
mikehollinger
I'll just leave this here:

Have a look at an excellent video that explains why time algorithms are hard
to get right:
[https://m.youtube.com/watch?v=-5wpm-gesOY](https://m.youtube.com/watch?v=-5wpm-gesOY)

Happy New Year from Austin!

------
tyingq
They apparently run their own DNS proxy called "RRDNS", written in golang.
[https://blog.cloudflare.com/tag/rrdns/](https://blog.cloudflare.com/tag/rrdns/)

------
dmd
Half the people posting here need to read
[https://qntm.org/calendar](https://qntm.org/calendar)

------
aburan28
I was wondering who this leap second was going to affect!

------
tscs37
I knew it.

I knew that something was going to break somehow, because for some reason
people continue to falsely believe that 1 minute always has 60 seconds.

------
web007
Are there any public "skewing" NTP pools that distribute the leap seconds as
lag / gain over 24 or 48 hours as some of the large providers do? That seems
to be the generally accepted answer to leap-second chaos, and certainly seems
simpler than all of the hidden bugs in systems all over the place trying to
deal with :60 on a clock.
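For reference, Google's published scheme is a linear smear over a 24-hour
window centered on the leap second. A sketch of the idea (the window and ramp
follow their description, but this is illustrative, not their implementation):

```go
package main

import "fmt"

// smearOffset returns how much of the inserted leap second has been
// absorbed after `elapsed` seconds of a `window`-second linear smear.
// Each smeared second is lengthened by 1/window, so the offset ramps
// from 0 to 1 with no discontinuity and no :60 on any clock.
func smearOffset(elapsed, window float64) float64 {
	if elapsed <= 0 {
		return 0
	}
	if elapsed >= window {
		return 1
	}
	return elapsed / window
}

func main() {
	fmt.Println(smearOffset(43200, 86400)) // 0.5: halfway through the window
}
```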

~~~
benjiweber
Google have
[https://developers.google.com/time/smear](https://developers.google.com/time/smear)
but it can introduce different problems.

------
thisrod
There is a higher order issue here. DNS time stamps have been stable for
decades. Why has anyone written new code to format them since the last leap
second?

------
tim_hutton
Surely now we can agree that it is FINALLY time to adjust our planet's orbit
to correct for this problem once and for all.

------
known
I just did sudo rdate -s time-a.nist.gov

------
homero
I didn't see an outage at all

------
iopq
Is that why Google Maps was down?

