Some DNS lookups causing 5xx errors due to leap second bug (cloudflarestatus.com)
267 points by nomadicactivist on Jan 1, 2017 | 126 comments



My CDMA phone dropped service for a few minutes after the leap second.

It's absurd that we continue to keep subjecting ourselves to these disruptions and the considerable amount of work that goes into handling leap seconds for the systems that aren't disrupted by them.

Leap seconds serve no useful purpose. Applications that care about solar time usually care about the local solar time, while UT1 is a 'mean solar time' that doesn't really have much physical meaning (it's not a quantity that can be observed anywhere, but a model parameter).

It would take on the order of 4000 years for time to slip even one hour. If we found that we cared about this thousands of years from now, we could simply adopt timezones one hour over after 2000 years; existing systems already handle devices in a mix of timezones.

[And a fun aside: it appears likely that in less than 4000 years we would need more than two leapseconds per year, sooner if warming melts the icecaps. So even the things that correctly handle leapseconds now will eventually fail. Having to deal with the changing rotation speed of the earth eventually can't be avoided but we can avoid suffering over and over again now.]

There are so many hard problems that can't just easily be solved that we should be spending our efforts on. Leapseconds are a folly purely made by man which we can choose to stop at any time. Discontinuing leapseconds is completely backwards compatible with virtually every existing system. The very few specialized systems (astronomy) that actually want mean solar time should already be using UT1 directly to avoid the 0.9 second error between UTC and UT1. All that is required is that we choose to stop issuing them (a decision of the ITU), or that we stop listening to them (a decision of various technology industries to move from using UTC to TAI+offset).

The recent leap smear moves are an example of the latter course but a half-hearted one that adds a lot of complexity and additional failure modes.

(In fact for the astronomy applications that leap seconds theoretically help they _still_ add additional complication because it is harder to apply corrections from UTC to an astronomical time base due to UTC having discontinuities in it.)


CDMA system time is already defined as free of leap seconds.

---

3GPP2 C.S0002-A section 1.3 "CDMA System Time":

All base station digital transmissions are referenced to a common CDMA system-wide time scale that uses the Global Positioning System (GPS) time scale, which is traceable to, and synchronous with, Universal Coordinated Time (UTC). GPS and UTC differ by an integer number of seconds, specifically the number of leap second corrections added to UTC since January 6, 1980. The start of CDMA System Time is January 6, 1980 00:00:00 UTC, which coincides with the start of GPS time.

System Time keeps track of leap second corrections to UTC but does not use these corrections for physical adjustments to the System Time clocks.

---

I'm pretty sure the only use of leap seconds in CDMA is for converting system time to customary local time, along with the daylight-time indicator and time-zone offset also contained in the sync channel message.
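A minimal Python sketch of that conversion, assuming the spec text quoted above: system time is a leap-free count of seconds from January 6, 1980, and the leap second count is only applied when producing UTC or local time. The function name, the parameter names, and the units (minutes for the offset, a plain boolean for daylight time) are hypothetical simplifications, not the actual sync channel field encodings.

```python
from datetime import datetime, timedelta

# Start of GPS time and CDMA System Time, per the quoted spec text.
GPS_EPOCH = datetime(1980, 1, 6)

def system_time_to_local(system_seconds, leap_seconds, utc_offset_minutes, dst_active):
    """Convert a leap-free system-time count to (UTC, customary local time).

    leap_seconds: leap seconds added to UTC since the GPS epoch.
    utc_offset_minutes, dst_active: hypothetical stand-ins for the local
    time offset and daylight-time indicator fields.
    """
    # The system clock itself is never adjusted; the correction is applied here.
    utc = GPS_EPOCH + timedelta(seconds=system_seconds - leap_seconds)
    local = utc + timedelta(minutes=utc_offset_minutes)
    if dst_active:
        local += timedelta(hours=1)
    return utc, local

# Example: the instant 2017-01-01 00:00:00 UTC, when GPS led UTC by 18 seconds.
days = (datetime(2017, 1, 1) - GPS_EPOCH).days
utc, local = system_time_to_local(days * 86400 + 18, 18, -480, False)
print(utc, local)
```

If the handset only ever uses the leap count this way, a wrong count would show the wrong wall-clock time but would not disturb the system-time clock itself.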

Edit: C.S0005-E section 2.6.1.3 says the mobile station shall store most of the fields of the sync channel message; it may store leap second count, local time offset, and daylight time indicator. This suggests that these fields aren't really that important for talking CDMA.


But yet, parent poster's phone dropped, so that migration from UTC=GPS+N to UTC=GPS+N+1 would result in the same conniptions we all have to deal with an extra second in the day. Even if it is at the phone's presentation layer, that's several GB of software that might hold lurking N+1 bugs causing the data layer to drop.


Let's reframe that: His phone dropped on the new year, when everyone is sending a happy new year message to everyone they know.

No link with leap second till proven.


It wasn't midnight anywhere with CDMA service when the leap second happened.


Agreed, while the CDMA specification requires tight time syncing, like everything else the UNIX OSs used to run the equipment can receive the leap indicator. Any problem within the OS, or the software reading date/time from the OS can cause instability.

Also, without knowing more about what exactly went wrong with your phone, it's possible other infrastructure within the network was unstable, signalling equipment, etc. I can't remember exactly, but I think some of the CDMA equipment at my previous company had a leap second problem previously. And the equipment is no longer being maintained or patched really.


Past life $work made UTDOA equipment for GSM networks. The hardware used a number of GPS modules from various vendors to obtain GPS time, which someone noted above includes leap offset broadcast in the periodic almanac.

Anyway, of three different module vendors, two got leap handling wrong. Then our own code had its own leap bugs, on top of the OS (Solaris) timekeeping bugs such as clock jumps and timezone update issues. Good times.


Some GPSDO that are used to time CDMA base stations are known to misbehave around leap seconds (though often when the GPS signal sends the leapsecond warning, not at the leapsecond itself).


But is it a bug or network saturation under people wishing each other a happy new year? (And perhaps more so with a selfie than a text message as in previous years.)


Given that it's CDMA, most likely the user was in the US or thereabouts, so the leap second was several hours before local new year.


Happy new year at midnight UTC ... in the Pacific timezone?

Doubtful. :) I've observed a similar outage at the last leapsecond (and in that case, dropped me off a call-- which is why I even checked this time.)


AT&T got hit too. Duplicate SMS everywhere. Neville, the T-Mobile CTO, was kind enough to answer when I asked about their prep for it. https://mobile.twitter.com/vvtgd/status/814654159614050304


While we're at it, can we get rid of daylight savings time too?


As a programmer, I hate daylight savings, but as a human being, I love daylight savings.

It's just so nice to get that extra bit of sunlight in the evening.


As a human being, I hate daylight saving time... It's very disruptive to sleep schedules (particularly for children, but adults as well). Traffic accidents spike in the days after the DST switch (likely due in part to the aforementioned sleep disruptions). Summer days are plenty long already...


Yes, exactly my thoughts. While it's great that automated systems and networks can account for this man-made invention of time change, it is very stressful on humans and also unnecessary. A dissolving and averaging out to a new standard would help everyone out in the long run..


Or we could just keep time on DST and stop switching. No reason to give up those nice, long summer evenings.


Or maybe we just need to code systems that handle leap seconds correctly.


Two valid points. A third: convince everyone to adopt epoch time for data transfer (seconds since epoch), and let applications that require formatted time do the transformation where it will be used (not earlier). It doesn't make a lot of sense that a timestamp represented as HOUR/MIN/SEC:DAY/YEAR should be passed around on the network of a production system. Leave it to the recipient to convert. I guess this is a subset of your point.


Leap seconds aren't just an issue of formatting times. Leap seconds actually involve turning the UTC count of seconds back by a second.

You'd have to switch the "seconds since epoch" count to TAI, and that would cause new formatting bugs because all kinds of software assumes that the minute changes on a multiple of 60 seconds since the epoch.
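A small Python sketch of that formatting assumption. It assumes a hypothetical leap-inclusive count built by adding the 27 leap seconds inserted from 1972 through the end of 2016 to the Unix value; the pre-1972 history is ignored, and the names are illustrative only.

```python
import calendar

LEAP_SECONDS_SINCE_1972 = 27  # inserted through 31 Dec 2016

def minute_boundary(count):
    """The common assumption: a new minute starts when count % 60 == 0."""
    return count % 60 == 0

# Unix time at 2017-01-01 00:00:00 UTC.
unix_midnight = calendar.timegm((2017, 1, 1, 0, 0, 0))

# Hypothetical count that includes every inserted leap second.
elapsed_midnight = unix_midnight + LEAP_SECONDS_SINCE_1972

print(minute_boundary(unix_midnight))     # True: the assumption holds for Unix time
print(minute_boundary(elapsed_midnight))  # False: the same instant is 27 past a multiple of 60
```

Any code that derives the seconds field with `% 60` would show 00:00:27 at midnight if it were handed the leap-inclusive count unchanged.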


Yes but even here, seconds since Epoch should remain unaltered, and the correction should be made by whatever is rendering a human readable date format (to address every leap second). In most cases, the renderer wouldn't have to address it (since it's only being read by humans, and a second difference does not usually matter) and it's truly a non-issue! The application-layer dev can choose to increment time in whatever blocks he wants instead of having an if-else chain for every "official" leap second. Like adding a minute every few hundred years, to that other commenter's point.


Unfortunately, epoch time is not literally "seconds since epoch", at least not as implemented/standardized as "Unix time". It skips or repeats itself in case of leap seconds. So it can't save us here.

I think if there were such a thing as a different kind of epoch time that literally actually is "seconds since epoch" it would help a lot and work like you suggest.


Yes, I agree -- I am not referring to any specific implementations. Didn't know that about Unix time, but it makes sense for compatibility given the current way we adjust for leap seconds.


Is this correct? Because I am having trouble understanding the rationale behind making the unix epoch relative to an earth solar year, as opposed to just the "number of seconds which have elapsed since the unix epoch". Do you have an example of this implementation? The Wikipedia article regarding epoch notes many counter-examples.


The Wikipedia Article here, describing Unix Time, indicates precisely the issue. Unix time does not include leap seconds, which means that when a leap second occurs, the midnight transition to the next year needs to insert additional time. Strictly following the standard thus, the Unix timestamp rolls time backwards by one second over midnight, which is precisely the kind of behavior that breaks systems depending on continuous timestamps: https://en.wikipedia.org/wiki/Unix_time#Leap_seconds

It sounds like in your scenario, you would prefer Unix time to instead include the leap second, so that no rollback or time smearing behavior would need to occur. I believe the reason it does not has to do with simplicity: current systems rely on a day being 86,400 seconds, making each year (regardless of leap days) a multiple of 86,400. Leap seconds break this simple assumption. While it would be simple for a new time formatting system to take leap seconds into account, it is not so simple to go and retrofit all of the existing systems for a new formatting standard, and convince so many different groups of developers to change that much code while also agreeing with one another about the changes.
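The repeated timestamp can be reproduced in Python. This sketch relies on `calendar.timegm` applying plain 86,400-seconds-per-day arithmetic without range-checking the seconds field, so it accepts 23:59:60. It uses the 31 December 2016 leap second rather than the 1999 example in the Wikipedia article.

```python
import calendar

# The last three UTC seconds around the leap second, converted with
# the "every day has 86,400 seconds" rule that Unix time uses.
before      = calendar.timegm((2016, 12, 31, 23, 59, 59))
leap_second = calendar.timegm((2016, 12, 31, 23, 59, 60))
midnight    = calendar.timegm((2017, 1, 1, 0, 0, 0))

print(before, leap_second, midnight)
# prints: 1483228799 1483228800 1483228800
```

Two distinct UTC seconds map to the same integer, which is why a clock that tracks UTC has to repeat, step back, or smear.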


I wrote a WebDAV client the other year and dates were one of the hardest parts to implement because they expected them in calendar format! It seemed so odd to me that they would do that.


Http/1.* headers are intended to be human readable


That one is called ephemeris time (ET). Astronomical applications require the information of the current difference UT-ET.


I believe those efforts could be better spent making systems more robust against other threats that can't be avoided by simply deciding to stop cutting ourselves.

Besides: Even expensive commercial time keeping devices frequently mishandle leap seconds. History suggests that we are underestimating how difficult they are to get right in complex systems.


It's difficult to write bug-free code for events that happen very infrequently. Even more so when the distributed nature of the system makes effective testing under real-world conditions nearly impossible.


Yeah, that's exactly why Leap Day getting skipped every 4th year is such a disaster.


The ratio of embedded systems that care about the calendar to those that care about time has to be astronomical.


Maybe we should code systems with zero bugs. /s


Well, yeah, that's not a bad idea. It was kind of Dijkstra's whole thing. The problem is that, at current levels of technology, it's economically better to write cheap buggy software than more expensive bug-free software for almost all consumer applications. We are gradually pushing the optimality curve towards the provably bug-free end of the spectrum, but it will take time.


I thought we were pushing the optimality curve toward ever greater volumes of ever buggier code.


Glad TOTP codes refresh every 30 seconds, and are generally valid for at least 1 minute. One second less wouldn't make a large difference.


Someday when everyone switches to Rust we can have a standard library that handles all of this and software bugs will be obsolete ;)


Or maybe we should start holding project managers accountable for these issues instead of simply basing their performance on deadlines.


Classic Hacker News, it's always the PM or the management at fault, engineers are faultless.


I'd have time to develop a new toy programming language as a service alongside 3 new JavaScript frameworks in 2017 if it weren't for the evil PMs.


;-) 98% of developers would be clueless to leap second issues. Hell, 50% struggle with leap years.


> if you can’t measure it, you can’t manage it.


Just do a leap minute every 60 theoretical leap seconds and reduce the number of times these problems occur by 60 (and they always do because fallible humans programming machines).


Once again we're screwed by different people wanting "time" to mean different things. There is no hope for humanity once we start traveling anywhere close to light speed into and out of the solar system.

I propose a new "non-time" time system. It has exactly two real values which range from 0 to tau and an integer. The first real number is radians of earth rotation, and the second is radians of the rotation around the Sun. The integer reflects the number of complete cycles. So lunch time in Greenwich is 'pi'.

It has the benefit that its "source" is actually the planet, so we can use a telescope at Greenwich to pick a certain alignment of stars as the "zero", "zero" point and then each time it realigns to that exact point, you can increment the "year" count.

I believe we can build a robust system to support this out of stone. We'll need to create a circle of stones but using a small hole drilled through a stone and a marker on the ground we can always identify 0.0,0.0, 0.0,pi/2, 0.0, pi, and 0.0, 3*pi/2.


> It has the benefit that its "source" is actually the planet, so we can use a telescope at Greenwich to pick a certain alignment of stars as the "zero", "zero" point and then each time it realigns to that exact point, you can increment the "year" count

If you're going for such drastic change to get rid of the occasional minor issue with leap seconds, then a star clock is a bad idea - the stars move relative to us and each other. The constellations we look upon are differently arranged to the ones Julius Caesar & Co looked upon. You're basically swapping one source of error for another.

Similarly - the oddity of choosing a planet-based time system for the synchronisation of clocks moving interstellar distances? How do they accurately measure time when they're no longer on the planet? And, as others have mentioned, the reason why we have leap seconds in the first place is because the length of a day (and of a year) changes.

It's also worth noting that when stonehenge was used to mark the time, webpages came in the form of bardic tales. If your bard was asleep, you get a 500 error... and they were asleep a lot. Stonehenge time was terrible for information delivery :)


> Once again we're screwed by different people wanting "time" to mean different things. There is no hope for humanity once we start traveling anywhere close to light speed into and out of the solar system.

Time is relative, you'll have the feeling of angst until you accept the relativity. No pun intended.


And if the earth speeds up or slows down, we lose sync with far-off time keepers who don't know yet?

Mostly this is tectonic re-adjustment on a minor level or an asteroid strike on a major level. Of course you could say that an asteroid strike would be a larger problem than re-syncing the stone clock.


If we somehow manage to "reverse" the Earth's orbit around the sun, does that mean we will have invented a time machine?!


I'd be more impressed with a machine that can decelerate the earth and reverse its spin direction without killing everyone as opposed to a time machine personally.


This isn't exactly HN style, but here ya go...

https://s-media-cache-ak0.pinimg.com/originals/f3/f7/7e/f3f7...



What if the stones move?


Guess we better use really large stones :-)


You do know that stars move, right? And not even all in the same direction or at an easily predictable speed.

It's illustrative about how hard time is that you tried to create a new system from scratch, with the express purpose of being future proof for space travel, and it's already broken because the fixed point you chose is not, in fact, fixed.

Edit: grammar


I believe that ChuckMcM is making a lighthearted, elaborate reference to Stonehenge. (If not, nevermind!)


It has been interesting to see the responses. And yes it was a not so oblique reference to Stonehenge. :-)


I guessed most big services would be using something akin to time smearing [1] since the first big leap-second outages years ago. Is there any reason why Cloudflare would be unable to use this technique?

[1] https://developers.google.com/time/smear


It's pretty lame that a lot of software is so fragile that it breaks if we give it the correct time.

The solution here is that any software that relies on accurate timing and/or breaks when you change the time should be using epoch seconds, not any sort of human-oriented time format.


Leap seconds are a change to the number of UTC epoch seconds.


> Leap seconds are a change to the number of UTC epoch seconds.

UTC is different from epoch time. Epoch time by definition does not count leap seconds. https://en.m.wikipedia.org/wiki/Unix_time


"Does not count leap seconds" has to win some kind of award for misleading terminology. I'd wager it's responsible for a significant portion of leap second bugs due to confusion & misunderstanding about what Unix time is:

It sounds like what it means is that Unix time counts the number of real, actual, by-the-clock seconds that have passed since the epoch. That would be logical. But what it actually means is that it counts the number of real, actual, by-the-clock seconds, minus the number of those that have been designated "leap seconds".

That is to say, whenever a "leap second" occurs, the nice monotonic isotonic progress of unix time is mutilated by suddenly adding or removing 1 to the total count so far. That's what "does not count leap seconds" means, and sometimes even what "ignores leap seconds" means (which is of course even worse terminology).


Shit, you're right. It had me totally fooled.


But it is affected by leap seconds. According to the example labeled "Unix time across midnight when a UTC leap second was inserted on 1 January 1999" the unix time went backwards. So using epoch/unix time does not help.


Is it safe to say that time() in php and Date.now() in JS do not care about leap seconds?


From [1], it is likely messier than that:

> The time() function will resolve to the system time of that server. If the server is running an NTP daemon then it will be leap second aware and adjust accordingly. PHP has no knowledge of this, but the system does.

> It took me a long time to understand this, but what happens is that the timestamp increases during the leap second, and when the leap second is over, the timestamp (but not the UTC!) jumps back one second.

[1] http://stackoverflow.com/questions/7780464/is-time-guarantee...


It would seem much simpler to just use TAI for timestamps, and get civil datetimes from UTC. Why didn't people do this? Why use UTC for timestamps??


That doesn't seem simpler at all.

- UTC timestamps can unambiguously refer to times years in the future; TAI timestamps cannot, because it is unknown how many seconds will be in each year.

- Converting UTC timestamps to human-readable UTC times is simple modular arithmetic. A beginner in any programming language can do it. Converting TAI to UTC requires a lookup table, and it must be updated after the software is released.

What would be simpler would be ending the use of leap seconds for a millennium or so.
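The contrast in the two bullets above can be sketched in Python. The table is deliberately partial (three entries I am confident of), and `tai_to_unix` assumes a hypothetical TAI count defined as the Unix value plus the TAI-UTC offset in effect; neither name comes from any real library.

```python
import calendar

def unix_to_hms(ts):
    """Time of day from a Unix timestamp using only modular arithmetic."""
    secs = ts % 86400
    return secs // 3600, (secs % 3600) // 60, secs % 60

# Partial TAI-UTC table (seconds), keyed by the Unix time at which each
# offset took effect. A full table starts in 1972 and has to be extended
# after release, whenever a new leap second is announced.
TAI_MINUS_UTC = [
    (calendar.timegm((2012, 7, 1, 0, 0, 0)), 35),
    (calendar.timegm((2015, 7, 1, 0, 0, 0)), 36),
    (calendar.timegm((2017, 1, 1, 0, 0, 0)), 37),
]

def tai_to_unix(tai_count):
    """Convert the hypothetical TAI count to Unix time via the lookup table."""
    for start, offset in reversed(TAI_MINUS_UTC):
        if tai_count - offset >= start:
            return tai_count - offset
    raise ValueError("timestamp predates this partial table")

u = calendar.timegm((2016, 7, 1, 12, 34, 56))
print(unix_to_hms(u))               # (12, 34, 56)
print(tai_to_unix(u + 36) == u)     # True: 36 s offset was in effect in mid-2016
```

The first function never needs an update; the second is only as correct as the newest row in its table.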


TAI timestamps can unambiguously refer to times years in the future. You can subtract the TAI timestamp from the current time, set an alarm for that number of seconds and it will actually occur on cue.

TAI timestamps don't refer unambiguously to UTC timestamps or calendar dates in the future, because the latter two depend on the variable rotation of the earth and (for zoned times) geopolitical whimsy.

I don't see why this matters though - most "timestamps" are for events in the past. The proper representation for events in the future will depend on your application (eg. are you writing a calendar for humans, or a spacecraft guidance system - does the event happen at a fixed point in time or a fixed point in the human work day?).


Think through your proposal to track all times in TAI and then convert them for display. What TAI time do you pick to represent the time that will be displayed as "00:00:00 on January 1, 2020"? Are you going to be okay with it changing to "23:59:58 on December 31, 2019" due to the geopolitical whimsy you mention? What if you just wanted it to represent the day "January 1, 2020"? Do you not use timestamps for anything but exact seconds anymore?

UTC doesn't know how long a second is, and TAI doesn't know how long a day or a year is. But most people need to specify "10 years from now" more often than they need to specify "300 megaseconds from now".

Banning leap seconds would be fine. UTC leap seconds are messy but at least we get by. But updating all time conversion software (instead of just updating authoritative clocks) every time there's a leap second is ludicrous.

It's true that applications should use the proper representation for what they're intended to do. And they do. Most applications use UTC because they describe human-centric events. Astronomers use TAI.


"00:00:00 on January 1, 2020" UTC is a rather useless timestamp for most practical purposes, unless you live close to the zero meridian. In most places you want to talk about 00:00:00 on January 1, 2020 localtime, which depends on what timezone you are in, and of course the daylight savings rules. Both can change. In some countries, daylight saving seems to be even more random than leap seconds.

In my opinion, the real problem is that TAI is not an option in most current systems. There is no way to get the time in TAI, no way to convert between TAI and UTC etc.

So even in applications where it makes sense to use TAI (think logging and billing) we don't do that because the necessary infrastructure is not available.

I think it is time that the technical community gets together and makes TAI a first-class citizen.

TAI doesn't help with timestamps in the future, but usually those applications don't need second level granularity anyhow. The applications that break during a leap second are the ones that need to track the current time or passage of time with sub-second accuracy. And those can be served perfectly with TAI.


"00:00:00 on January 1, 2020, local time", if it is to represent a calendar event in the human work day in the future, should be represented as neither a TAI "timestamp" nor a UTC "timestamp", but as a data type containing exactly "00:00:00 on January 1, 2020, local time".

> But updating all time conversion software (instead of just updating authoritative clocks) every time there's a leap second is ludicrous.

All time conversion software is already updated every time a government changes a time zone - by downloading the most recent tzdata. All software that needs second-level granularity is constantly updated, by synchronizing with NTP. There's nothing at all ludicrous about distributing leap second tables instead of mutilating the NTP time signal.


"- UTC timestamps can unambiguously refer to times years in the future; TAI timestamps cannot, because it is unknown how many seconds will be in each year."

Isn't it the other way around, since TAI has no leap seconds?

"What would be simpler would be ending the use of leap seconds for a millennium or so."

That effectively means using TAI consistently, which is what software not aware of leap seconds would be doing anyway (despite the fact that it's actually working with UTC.)


TAI has no leap seconds, but years, as currently defined, do.

And yes, I'm advocating for changing that definition, and making UTC a constant offset from TAI that works the same as TAI for the foreseeable future.


Read what I wrote carefully. For things that you need to convert to human-readable dates, by all means use UTC.

But for timestamps that can be used to obtain a time difference between them, TAI should be readily available to be used. Most timestamps are for figuring out time differences here on Earth and not relative to astrological signs, they are for knowing what came before what, etc. They are not for generating human-readable dates. It's silly that such a major use case is hardly implemented on major systems, and instead an unreliable Unix Time is used which can "at any moment" have the same second twice.


And alternatively I guessed that big services like Cloudflare, responsible for fronting 2.5 million websites, would have been running a preproduction environment clocked-forward to 2017.


This was shared a while ago, but it's relevant again: http://www.madore.org/~david/computers/unix-leap-seconds.htm...


I'm curious what if anything would be problematic if everything just effectively "ignored" leap seconds (i.e. would this outage not have occurred?) --- one minute is always 60 seconds, an hour is always 60 minutes, and a day always 24h. I mean, if you consider the fact that human society has managed to function perfectly well with almost everyone not knowing nor caring what a leap second is, and yet apparently some software does --- leading to problems like this --- something doesn't feel right.


100 seconds per minute, 100 minutes per hour, 20 hours a day. New seconds are 0.432 old seconds, or whatever ratio they need to make to quit leaping around.


Fun fact: the reason we use 60 seconds and 60 minutes is because of the Babylonians who used base-60. IIRC, it's also why we use 360° for a circle.


It might also have something to do with the fact that 60 is evenly divisible by 30, 20, 15, 12, 10, 6, 5, 4, 3, and 2.


The metric system is a great system because it is almost universal. Base 10, as well as the other choices for what is ideal might be argued as inferior to other measurement systems.

Since the time we are talking about is going to be used by computers it might as well be base 2.

64 seconds per minute, 64 minutes per hour, 36 hours per day. You could then choose 8 days a week, 32 days per month, and 11 months (44 weeks) per year followed by 13.24... days of festivals to the pagan gods.

Just like before, the problem is that there are exogenous values: a non-constant length of year at an Earth location, a non-constant length of day at an Earth location, and a more constant period of time defined by a lower level process of nature like caesium atom vibrations for the seconds that scientists use.


Financial systems need very precise timekeeping.

I'm sure other fields do as well.


They do, and actually HFT did come to mind when I was writing that comment, but then I realised that, as explained in https://news.ycombinator.com/item?id=13294747 , they have no need to precisely synchronise time with the rotation of the Earth, and would be fine without leap seconds.


But financial systems don't care if the position of the sun in the sky is a couple of seconds off from where a model says it should be. Astronomers would care about that, but they already don't use UTC.


But they do care what GPS reference time is, and GPS relies on satellites that are very much dependent on holding an accurate position in the sky (which is dependent on the Earth's rotational speed, which changes, which is why we have leap seconds).

So go figure: which part of this system should be broken because people keep ignoring that leap seconds happen?


Actually, GPS time is not adjusted for leap seconds:

http://tycho.usno.navy.mil/leapsec.html


That's neither here nor there: GPS receivers, and the GPS satellites, do broadcast the leap-second insertions (though they don't reset their own clocks, they simply maintain the differential as additional information).


I'm guessing CloudFlare runs their own custom DNS server software?



Yeah, and Go doesn't expose a monotonic clocksource in its stdlib[1]. I'd bet that's what this boils down to.

[1] https://github.com/golang/go/issues/12914


Wow, that discussion is cringeworthy

I really thought anyone discussing systems programming should be aware of the need for a monotonically increasing clock source

REALLY


I'm confused on why a DNS server would need to rely on a monotonic clock for its use cases. Is there a part of DNS that relies on the assumption of synchronized, monotonic time? (Perhaps TTL/expiry of records? But I still don't see why having a non monotonic clock source would harm if CF is using Go timers for expiry)


One example might be rate limiting. Count requests over elapsed time. If elapsed time is a negative number, the math might trigger a bug that causes CF to block requests...too many requests over time period X.


Cloudflare posted a post-mortem [1]. They were measuring round trip time, and supplying the result of that into the golang rand.Int63n() function, which panics the process when given a negative number.

[1] https://blog.cloudflare.com/how-and-why-the-leap-second-affe...
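A rough Python analogue of that failure mode, not Cloudflare's actual code. It assumes an elapsed time computed from a wall clock that stepped backwards, then passed as the upper bound to a random-number call. Python's `random.randrange` raises `ValueError` on a non-positive bound, much as Go's `rand.Int63n` panics. The function names are hypothetical.

```python
import random
import time

def jitter_unsafe(rtt_ns):
    """Pick a random value below rtt_ns; raises ValueError if rtt_ns <= 0."""
    return random.randrange(rtt_ns)

def jitter_safe(rtt_ns):
    """Clamp the bound so a backwards clock step cannot break the call."""
    return random.randrange(max(rtt_ns, 1))

def measure_rtt_ns(operation):
    """Measure elapsed time with a monotonic clock, which never steps back."""
    start = time.monotonic_ns()
    operation()
    return time.monotonic_ns() - start

# Simulated round trip that straddles a one-second backwards step:
# 0.5 s really elapsed, but the wall clock reports -0.5 s.
bad_rtt_ns = -500_000_000

try:
    jitter_unsafe(bad_rtt_ns)
except ValueError as exc:
    print("unsafe call failed:", exc)

print(jitter_safe(bad_rtt_ns))          # 0, the only value below the clamped bound
print(measure_rtt_ns(lambda: None) >= 0)  # True
```

Either defence is enough on its own: clamping guards the consumer of the value, while a monotonic clock stops the negative duration from appearing in the first place.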


That's really odd. You can't even make a working progress display without that.


I was at a relative's and tried to load two different web sites.. my first thought was that their wifi sucked. My second was "will we finally learn a lesson today about the disturbing trend towards constant re-centralization of all our online services?"


Funny that they wrote about it in 2014 https://blog.cloudflare.com/its-go-time-on-linux/


Was glad things have improved since 4 years ago!

https://blog.fastmail.com/2012/07/03/a-story-of-leaping-seco...

This time I didn't get paged for anything on leap second day :)


What causes real-world problems with leap seconds is actually unrelated to the nasty interactions of metrology and solar time -- it's a specific and avoidable problem with how NTP (and many OSes/languages) represent time -- it's a types issue.

The right way for computers to represent time is with a number that represents the number of constant-rate ticks that have elapsed past some agreed-upon epoch. If you know what the epoch is and how long each tick is (lots of people use 1 / 9.192 GHz), it is easy to know how many ticks are between any two time values, and you can convert a time value with one epoch to one with a different epoch and tick rate -- you can do everything people expect to do with time. There are no numbers that represent an invalid time value, and for each moment, there is a unique time value that represents it. There's a one-to-one mapping with no nasty edge cases.
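As a sketch of the epoch and tick-rate conversion described above, assuming both timescales are continuous and their epochs are known relative to a shared reference. The `Timescale` class and `convert` function are made up for illustration, and exact fractions are used so no rounding creeps in.

```python
from fractions import Fraction

class Timescale:
    """A continuous timescale: an epoch plus a constant tick length."""
    def __init__(self, epoch_offset_s, tick_s):
        # Position of this epoch, in seconds, relative to a shared reference.
        self.epoch_offset_s = Fraction(epoch_offset_s)
        # Length of one tick in seconds.
        self.tick_s = Fraction(tick_s)

def convert(ticks, src, dst):
    """Re-express a tick count from one timescale in another."""
    absolute_s = src.epoch_offset_s + ticks * src.tick_s
    return (absolute_s - dst.epoch_offset_s) / dst.tick_s

seconds_scale = Timescale(epoch_offset_s=0, tick_s=1)
millis_scale = Timescale(epoch_offset_s=10, tick_s=Fraction(1, 1000))

print(convert(25, seconds_scale, millis_scale))     # 15000
print(convert(15000, millis_scale, seconds_scale))  # 25
```

The mapping is one-to-one in both directions, which is exactly the property that breaks once a step function is folded into the count.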

Leap seconds are a step function that is added to a constant-rate timescale (whose name is "TAI") in order to generate a discontinuous timescale (whose name is "UTC") that never is too different from solar time. There is nothing fundamentally abhorrent about leap seconds -- there are just good and bad ways to represent, disseminate, and compute with timescales that involve leap seconds.

The right way to handle leap seconds can be seen with many GNSSes and PTP (very high precision hardware-assisted time synchronization over Ethernet). GPS, BeiDou, Galileo, and PTP all involve dissemination and computation on time values -- and with dire consequences for failure/downtime/inaccuracy.

The designers of those systems all somehow converged on the choice to separate out the nice, predictable, constant-rate and discontinuity-free part of UTC from the nasty step function (the leap second offset). Times in all those systems are represented as the tuple (TAI time at t, leap offset at t). This means that the entire system can calculate and work with (discontinuity-free and constant-rate) TAI times but also truck around the leap offsets so when time values need to be presented to a user (or anything that requires a UTC time), the leap offset can be added then. Crucially, all the maths that are done on time values are done on TAI values, so calculating a time difference or a frequency is easy and the result is always correct, regardless of the leap second state of affairs. Representing UTC time as a tuple makes the semantics of that data type easy to reason about -- the "time" bit is in the first element and is completely harmless -- the edge cases all live in the second half of the tuple.
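A rough sketch of that tuple, assuming the second element holds TAI minus UTC in seconds (the sign convention is a choice made here, not something the protocols above dictate). The `Timestamp` type and the TAI values are invented for illustration; only the 36 and 37 second offsets correspond to the real values on either side of this leap second.

```python
from typing import NamedTuple


class Timestamp(NamedTuple):
    tai_s: int          # seconds on the continuous, constant-rate scale
    leap_offset_s: int  # TAI - UTC in effect at that moment (assumed convention)


def elapsed(a: Timestamp, b: Timestamp) -> int:
    """All arithmetic uses the continuous part only."""
    return b.tai_s - a.tai_s


def utc_seconds(t: Timestamp) -> int:
    """Apply the step function only when a UTC value is needed for display."""
    return t.tai_s - t.leap_offset_s


# Arbitrary TAI values two seconds apart, straddling a leap second.
before = Timestamp(tai_s=1000, leap_offset_s=36)
after = Timestamp(tai_s=1002, leap_offset_s=37)

real_elapsed = elapsed(before, after)
utc_elapsed = utc_seconds(after) - utc_seconds(before)
```

Two real seconds pass, yet the UTC labels advance by only one. Code that subtracts the first elements never sees the discontinuity.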

NTP and Unix (and everything descended from or affected by those) have made the mistake of representing and transmitting time as a single integer, TAI(t) + leap offset(t). This is not a data representation that has sensible semantics and it is very hard to reason about it. First of all, the leap second offset is nondeterministic and also unknown -- there is no way to get it from NTP and there is no good way to know the time of the next leap event. Second of all, there are repeated time values for different moments in time (and when a negative leap second will happen, there will be time values that represent no moments in time). Predictably, introducing nondeterministic discontinuities doesn't work so well in the real world. There are a bunch of bugs in NTP software and OS kernels and applications that show themselves every time there is a leap second. It's not even just NTP clients that struggle -- 40% of public Stratum-1 NTP servers had erroneous behavior [0] related to the 2015 leap second! Given that level of repeated and widespread failure, the right solution is not to blame programmers -- it should be to blame the standard. The UTC standard and how NTP disseminates UTC are fundamentally not fit for computer timekeeping.
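The repeated-value problem is easy to reproduce with Python's standard library. This is only an illustration of Unix-style timestamps, using `calendar.timegm`, which does plain arithmetic on the fields and does not reject a seconds value of 60.

```python
import calendar

# The three UTC labels around the leap second at the end of 2016.
before = calendar.timegm((2016, 12, 31, 23, 59, 59))
leap = calendar.timegm((2016, 12, 31, 23, 59, 60))
after = calendar.timegm((2017, 1, 1, 0, 0, 0))

# 23:59:60 and 00:00:00 are different moments but get the same integer.
collides = leap == after

# Two SI seconds separate 23:59:59 and 00:00:00, but subtraction says one.
apparent_gap = after - before
```

Any duration or rate computed across that boundary from these integers is off by one second, with nothing in the value itself to signal it.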

GNSS receivers and PTP hardware get used in mission-critical applications (synchronizing power grids and multi-axis industrial processes, timestamping data from test flights and particle accelerators) all the time -- and even worse, there's no way to conveniently schedule downtime/maintenance windows during leap second events! "Leap smear" isn't an acceptable solution for those applications, either -- you can't lie about how long a second is to the Large Hadron Collider. GNSS and PTP systems handle leap second timescales without a hitch by representing UTC time with the right data type -- a tuple that properly separates two values that have the same unit (seconds) but have vastly different semantics. The NTP and unix timestamp approach of directly baking the discontinuities into the time values reliably causes problems and outages. The leap second debacle is not about solar time vs atomic time; it's about the need for data types that accurately represent the semantics of what they describe.

[0]: http://crin.eng.uts.edu.au/~darryl/Publications/LeapSecond_c...


Except people want to be able to talk about times years in the future despite not knowing the number of leap seconds that may happen in the intervening time. It is more useful in most fields to talk about an event happening every N years/months/days than an event happening every N seconds. Most people do not want a leap second to shift their scheduled event from 10:00:00 every Monday to 9:59:59 or 10:00:01 in the name of using a whole number of 86400-second intervals.


If you want to say "10:00 every Monday" then say "10:00 every Monday" and accept that what you have is not an unambiguous point in time, nor an integer, but a calendar event that may occur at some time in the future depending on geopolitical changes to the local time zone and the rotation of the earth.

Mutilating all timestamps and network time representations by adding a variable unknown step function (the leap second "correction") in order to preserve the illusion that days are always 86400 "seconds" long doesn't help solve this problem at all.


> Except people want to be able to talk about times years in the future despite not knowing the number of leap seconds that may happen in the intervening time.

Doesn't the (TAI, leap second count) tuple solution work for this? Maybe I misunderstand the purpose, but you could use the leap second count to figure out how many seconds the TAI is off by.

But that doesn't matter, because date intervals shouldn't be represented with seconds anyway. Months and years have different lengths.


…I forgot to mention this in my original comment, but real-world wall-clock time is in any case discontinuous due to daylight savings time and other timezone changes. This means that it's not only months and years that change in length, but days and weeks too.


You cannot represent "Next Monday at 12:00" with a tuple (TAI, leap second count), because you don't know how many leap seconds there should be. Or maybe you know for next Monday, but you definitely don't know for the Monday in a year, as leap seconds are only announced ~6 months in advance.


I think you cannot represent "Monday in a year at 12:00" with a simple integer either, right? For example, the king of the country may decide to cancel DST for the year. Either way you would have to store it as a calendar event and figure out the exact time once you're closer.


You should store that similar to this:

    begin = (today, 12:00) (eg. 2017-01-01T12:00:00)
    repeat = RRULE:FREQ=WEEKLY;COUNT=1;BYDAY=MO
Note that "begin" is usually something software figures out itself.
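
As a rough illustration of what software might do with such a record, this stdlib-only sketch expands "first Monday on or after the start date" into a wall-clock value. The function name is made up, and it deliberately stops at naive local time; turning that into an exact instant is left until the event is close.

```python
from datetime import date, datetime, time, timedelta


def first_monday_on_or_after(start: date, at: time) -> datetime:
    """Return the naive wall-clock datetime of the first Monday >= start."""
    days_ahead = (0 - start.weekday()) % 7  # Monday is weekday 0
    return datetime.combine(start + timedelta(days=days_ahead), at)


# 2017-01-01 was a Sunday, so the first match is the following day.
occurrence = first_monday_on_or_after(date(2017, 1, 1), time(12, 0))
```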


I think this is solved by storing an event date as UTC (since we can't always know how many leap seconds will be required), but when triggering an event, we calculate the UTC from TAI + Leap Seconds.

An event in the future isn't necessarily a known number of seconds away, which I think is the point you were trying to make. But the parent comment wasn't suggesting all instances of time should be stored as (tai, leap seconds). Calculating a UTC value from (tai, leap seconds) is trivial, but if the thing you care about is the UTC value then that's what you store.


Sometimes it's best to store scheduled events as "Event localtime" and "Timezone" (where timezone is a named description - e.g. "Europe/Madrid" - rather than an offset - e.g. "+1:00").

This allows the record to stay consistent, even if there are changes to the local time rules - e.g. leap seconds, daylight savings, timezone offset.

Imagine a tech-camp had been planned in Cairo, Egypt, to start at 9am on July 10, 2016: that would have been scheduled for 06:00 UTC. When Egypt cancelled daylight savings with three days' notice, that record should then have been 07:00 UTC.
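
A small sketch of the Cairo case, assuming the event is stored as local wall-clock time plus a zone name and only resolved to UTC on demand. The two `RULES_*` dicts are hypothetical stand-ins for two releases of the tz database, reduced to a single fixed offset each; a real system would consult the actual tz data.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical rule sets: what the zone's July 2016 offset was believed to be
# before and after Egypt cancelled daylight saving time.
RULES_BEFORE = {"Africa/Cairo": timedelta(hours=3)}  # DST expected
RULES_AFTER = {"Africa/Cairo": timedelta(hours=2)}   # DST cancelled

# The stored record never changes: local time and a named zone.
event = {"local": datetime(2016, 7, 10, 9, 0), "zone": "Africa/Cairo"}


def resolve_utc(event: dict, rules: dict) -> datetime:
    """Convert the stored local time to UTC using the rules known right now."""
    offset = rules[event["zone"]]
    return (event["local"] - offset).replace(tzinfo=timezone.utc)


planned = resolve_utc(event, RULES_BEFORE)
actual = resolve_utc(event, RULES_AFTER)
```

Had the record been stored as 06:00 UTC, it would have needed rewriting; stored this way, only the rules change.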


Yup, I'm aware of this and should have mentioned it in my comment. Thanks for the follow up.

As an aside, how often do the tz databases for each language get released? Are they usually responsive to notices 3 days out?

Edit: I went looking into the pytz release for the Cairo example from parent.

Olson Timezone Database:

Release 2016f - 2016-07-05 16:26:51 +0200

https://github.com/stub42/pytz/commit/03a4e9b31dd90f3dace1eb...

Pytz:

Release 2016.6 - 2016-07-13

https://pypi.python.org/pypi/pytz/2016.6

So even if the tz database is up to date, there's no guarantee that various library usages of the tz database will be correct for these kinds of changes. Interesting.


I just came across a note about Morocco, which entered daylight savings time in March 2016, but then left daylight-savings in June for 35 days, re-starting daylight-savings in July [1].

I've read that the explanation for this temporary suspension of daylight-savings is Ramadan [2], and Ramadan is dependent on the observed sighting of the new moon - so you can't necessarily predict the date in advance.

I ended up coming across that after looking for an explanation for something bizarre I experienced on a trip to Morocco in March 2016…with my iPhone set to use "Marrakesh, Morocco", the time on the phone displayed correctly, but the time on my sync'd Apple watch was an hour out. I think I ended up manually setting it to Paris time to get the correct time, but never did get an explanation for the difference.

So even across two devices from the same manufacturer, theoretically sharing the same date-time information, they can be inconsistent.

Conclusion: time is hard!

[1] https://www.timeanddate.com/time/change/morocco/tanger?year=...

[2] http://codeofmatt.com/2016/04/23/on-the-timing-of-time-zone-...


Anything less than 2 weeks is a gamble; I follow the time zone list closely and go out of my way to poke some maintainers of libraries we depend on when something like the Egypt change happens


Another neat example in the "UTC ain't always the right thing to do" category.


Thank you so very much for the effort to elucidate the real problem hiding underneath the usual slew of Leap Second issues.

I keep telling people to use TAI. I once contemplated writing kernel code to rebase internal clock stuff to TAI but at the end of the day it was not worth doing because I would have needed to build a completely new stack of things above the kernel to use it in order to avoid problems.


So you can basically see if a tool or company hasn't experienced a leap second if their system goes down because of it.


I'll just leave this here:

Have a look at an excellent video that explains why time algorithms are hard to sort out: https://m.youtube.com/watch?v=-5wpm-gesOY

Happy New Year from Austin!


They apparently run their own DNS proxy called "RRDNS", written in golang. https://blog.cloudflare.com/tag/rrdns/


Half the people posting here need to read https://qntm.org/calendar


I was wondering who this leap second was going to affect!


I knew it.

I knew that something was going to break somehow because for some reason people continue to falsely believe that 1 minute always has 60 seconds.


Are there any public "skewing" NTP pools that distribute the leap seconds as lag / gain over 24 or 48 hours as some of the large providers do? That seems to be the generally accepted answer to leap-second chaos, and certainly seems simpler than all of the hidden bugs in systems all over the place trying to deal with :60 on a clock.
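
For anyone unsure what such skewing looks like, here is a minimal sketch of a linear smear over a 24-hour window. The function and its parameters are invented for illustration and are not any provider's actual implementation.

```python
def smear_offset(t: float, window_start: float,
                 window_len: float = 86_400.0, leap: float = 1.0) -> float:
    """Portion of the leap second already absorbed at time t (in seconds)."""
    if t <= window_start:
        return 0.0
    if t >= window_start + window_len:
        return leap
    # Inside the window the correction grows linearly, so served seconds
    # are stretched slightly instead of a :60 ever appearing on the clock.
    return leap * (t - window_start) / window_len


halfway = smear_offset(43_200.0, window_start=0.0)
done = smear_offset(90_000.0, window_start=0.0)
```

The trade-off is that every second inside the window is about 11.6 microseconds longer than an SI second, and smeared clocks disagree with unsmeared ones by up to the full leap second.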


Google have https://developers.google.com/time/smear but it can introduce different problems.


There is a higher order issue here. DNS time stamps have been stable for decades. Why has anyone written new code to format them since the last leap second?


Surely now we can agree that it is FINALLY time to adjust our planet's orbit to correct for this problem once and for all.


I just did sudo rdate -s time-a.nist.gov


I didn't see an outage at all


Is that why Google Maps was down?



