Nice to see Amazon being proactive about the forthcoming change. Time is just a prime example of one of those things that seems superficially simple, but turns out to be deviously complicated. (If you ever want to bore non-technical types to death at a party, just start talking about the history of time. Bonus points for mentioning the proleptic Gregorian calendar in context.)
I don't think the leap second bug had any effect on EC2 itself; the only reason EC2 was ever blamed is that many of the sites which had issues happened to be hosted on EC2.
And let's not forget:
Case in point: the Linux leap second kernel bug that caused problems in 2012. Read the commit message of the fix:
"This patch tries to avoid the problem by reverting back to not using an hrtimer to inject leapseconds, and instead we handle the leapsecond processing in the second_overflow() function. The downside to this change is that on systems that support highres timers, the leap second processing will occur on a HZ tick boundary, (ie: ~1-10ms, depending on HZ) after the leap second instead of possibly sooner (~34us in my tests w/ x86_64 lapic)."
So the bug came from the programmers worrying more about making the leap second adjustment happen as fast as possible, within 34 microseconds, while ignoring that the call from that particular point in the code caused a livelock.
And instead of pushing the change "as fast as possible", we see that both Google and AWS solve the problem by spreading the change over a long period of time. That is generally the right approach for all automatic adjustments to the system clock: avoid discontinuities.
There was an interesting HN thread along these lines back when the current impending leap second was announced: https://news.ycombinator.com/item?id=8840440
UTC is an offset from TAI, which changes over time (leap seconds).
The time zone files already keep track of historical changes.
Conceptually, they're pretty similar; the only difference is that leap seconds have a special clock value (23:59:60 instead of showing you 23:59:59 twice).
TAI is defined, like UNIX time, as a count of the progression of proper time. It is the primary reference from which we build all other times; UTC is a humanist overlay on TAI to maintain norms, since we need an approximate terrestrial solar time for sanity purposes.
If we switch to TAI as the "base storage representation" for timestamps and internal reference time, the math immediately becomes sane, since TAI can be relied on as a directly linear sequence of time, with no lookup tables or other cruft. Move the cruft "up the stack" to where it doesn't cause issues like the ones we see every time a leap second is needed.
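The idea can be sketched in a few lines of Python. This is a minimal illustration, not a real implementation: the offsets 34 and 35 are the real TAI - UTC values around the 2012 leap second, but the table keys (TAI timestamps) are purely illustrative.

```python
# Minimal sketch: store timestamps in TAI and push the leap-second
# lookup "up the stack" to the display layer.
import bisect

# (tai_time_at_which_offset_begins, tai_minus_utc); keys are illustrative.
LEAPS = [(0, 34), (1_000_000, 35)]

def elapsed(tai_a, tai_b):
    """Elapsed time between two TAI instants: plain subtraction, no tables."""
    return tai_b - tai_a

def tai_to_utc(tai):
    """Display-layer conversion: subtract the TAI - UTC offset in force."""
    i = bisect.bisect_right([t for t, _ in LEAPS], tai) - 1
    return tai - LEAPS[i][1]

# Interval arithmetic in TAI stays exact across the (illustrative) leap
# boundary at tai = 1_000_000:
assert elapsed(999_999, 1_000_001) == 2
# ...while the UTC labels of those two instants differ by only 1 second,
# because UTC absorbed a leap second in between:
assert tai_to_utc(1_000_001) - tai_to_utc(999_999) == 1
```

All the interval math stays linear; only the human-facing label needs the table.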
Then we have the time signal from GPS, though typically only on mobile phones, and other signals on other distribution mechanisms:
"GPS time was zero at 0h 6-Jan-1980 and since it is not perturbed by leap seconds GPS is now ahead of UTC by 16 seconds.
Loran-C, Long Range Navigation time. (..) zero at 0h 1-Jan-1958 and since it is not perturbed by leap seconds it is now ahead of UTC by 25 seconds.
TAI, Temps Atomique International (...) is currently ahead of UTC by 35 seconds. TAI is always ahead of GPS by 19 seconds. "
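The fixed relationships quoted above can be checked with simple arithmetic. A small Python sketch, with the offsets valid as of the quote (before the 2015-06-30 leap second), and timestamps treated as instants on a shared second count for simplicity:

```python
# Offsets exactly as quoted (GPS - UTC and TAI - UTC each grow by one
# with every leap second, so these values are era-specific).
TAI_MINUS_UTC = 35  # seconds
GPS_MINUS_UTC = 16  # seconds
TAI_MINUS_GPS = TAI_MINUS_UTC - GPS_MINUS_UTC

def gps_to_utc(gps):
    """UTC label for a GPS timestamp (shared epoch assumed for simplicity)."""
    return gps - GPS_MINUS_UTC

def gps_to_tai(gps):
    return gps + TAI_MINUS_GPS

# Consistent with the quote: "TAI is always ahead of GPS by 19 seconds."
assert TAI_MINUS_GPS == 19
```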
And we have NTP servers, which differ from one another all the time, and to which our computers connect and try to adjust what they report.
So the bugs are really in how the adjustments are handled; it's not that the world can be made simpler.
A: Honestly, nobody knows.
You can estimate the number of future leap seconds, but not know them (much) in advance. Having the representation of (future) dates change occasionally doesn't lead to sanity either.
I personally implemented it 12 years ago, to support (a) a kind of geo-balancing based on time-coordinated spoofed DNS responses, and (b) second-sensitive validation of time-stamped URL token expiration across distributed datacenters. I got the idea from Microsoft Windows' ability to drift the clock back into sync.
Google has adopted or reinvented a lot, and gets credit because they have extra time and resources to publish.
// TBC, they also invented a lot. But not this one. Sibling comments point out other implementations predating Google as well.
1. "Each second is 1/86400 longer and AWS clocks fall behind UTC. The gap gradually increases to up to 1/2 second."
If the seconds are longer, wouldn't the AWS clocks be creeping ahead, not falling behind? Bear in mind the leap second hasn't been added to UTC yet at this point in the table.
2. "AWS clocks gain 1/2 second ahead of UTC."
They do? But an entire leap second was just added to UTC. Aren't the AWS clocks 1/2 second BEHIND at this point?
Note that the window in which AWS smears this 1 second straddles the UTC injection, with 12 hours on each side.
"AWS clocks keep falling behind and the gap with UTC shrinks gradually."
Shouldn't it be "AWS clocks keep catching up and the gap with UTC shrinks gradually"?
Not sure how I can be reading this so backwards, but if it's me who is wrong here, I'd love to hear why. In any case it shows how time is easy to get wrong (whether it's me who is wrong, or hah, doubtful, AWS).
What am I missing here?
Edit: duh yeah I get it now. Amazon is right, of course. Obviously I'm a n00b when it comes to dealing with leap seconds. I was thinking the leap second sets time ahead, but it doesn't; it effectively does the opposite. Leaving this message in place as an example of how thinking about time and calendar-related programming is easy to mess up.
Here are some examples I thought of that may help clarify:
1. AWS clocks do indeed fall behind when their "seconds" tick longer. Because their seconds tick longer, over a fixed period of time AWS will count fewer ticks than UTC.
(Think of it this way: A mile is longer than a kilometer. After you have traveled 800 kilometers, you've only traveled ~500 miles.)
2. "AWS clocks gain 1/2 second ahead of UTC."
- Before the addition of the leap second, AWS clocks are behind by 1/2 second as per (1). The addition of a leap second to UTC is another second that the UTC clock must tick - the AWS clocks don't have to tick this amount - so the AWS clocks are now ahead by 1/2 second.
Basically, this is what AWS is doing, using our distance analogy again:
1. Normally, we have to cover 1000 km every day. This includes a Civil group of travelers and an AWS group of travelers.
2. Today, we decide we're going to cover 1001 km. Everyone in the Civil group decides it's OK to count to 1001 instead of 1000, just for today. But the AWS group only wants to count to 1000, because counting to 1001 is a Very Bad Thing. However, the AWS group still has to cover 1001 km.
3. The AWS group comes up with the ingenious idea of just making each "kilometer" 1.001 real-world kilometers, just for today. Thus they will only count to 1000, but each counted "kilometer" covers 1.001 real km. The end result is the same: they will have covered 1001 km.
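The distance analogy maps directly onto code. Here's a rough Python sketch of a linear smear over a 24-hour window (it abstracts away the detail that AWS centers the window on the leap second itself):

```python
import math

# Linear leap-second smear: for one 24-hour window each smeared "second"
# lasts 1 + 1/86400 real seconds, so the smeared clock counts 86400 ticks
# while 86401 real (UTC) seconds elapse.
WINDOW = 86_400  # smear window length, in smeared seconds

def real_elapsed(smeared_ticks):
    """Real (UTC) seconds elapsed after `smeared_ticks` smeared seconds."""
    return smeared_ticks * (1 + 1 / WINDOW)

# Halfway through the window the smeared clock trails real time by ~1/2 s:
assert math.isclose(real_elapsed(WINDOW / 2) - WINDOW / 2, 0.5)
# By the end of the window it has absorbed the whole leap second:
assert math.isclose(real_elapsed(WINDOW) - WINDOW, 1.0)
```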
"Leaving this message in place as an example of how thinking about time and calendar-related programming is easy to mess up."
lie(t) = (1.0 - cos(pi * t / w)) / 2.0
I think I'd prefer the curved one. I don't see any advantage of the linear one, except that it takes less effort to implement and test.
Let's suppose that some task's runtime is relatively small. Now suppose the linear slew starts between two runs. All of a sudden the second run appears slower than it really is, by a constant factor.
With a curved change, this effect still happens, but it's less for smaller time intervals.
bjackman may have "never worked on an application that would care about this kind of thing", fine, but for those that do care, the curve is obviously better.
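The trade-off is easy to quantify: the cosine smear's rate error vanishes at the window edges but peaks at about pi/2 times the linear rate mid-window. A small Python sketch, using the quoted lie(t) formula (the window length here is illustrative, not Google's actual value):

```python
import math

W = 86_400.0  # smear window in seconds (illustrative)

def lie_cos(t):
    """The quoted cosine smear: fraction of the leap second applied by t."""
    return (1.0 - math.cos(math.pi * t / W)) / 2.0

def lie_linear(t):
    return t / W

def rate_error(lie, t, dt=1.0):
    """Apparent extra slowdown of a short task of length dt starting at t."""
    return (lie(t + dt) - lie(t)) / dt

# Right after the smear starts, the cosine ramp barely distorts rates,
# while the linear one immediately applies its full constant skew:
assert rate_error(lie_cos, 0.0) < rate_error(lie_linear, 0.0) / 1000
# The price: mid-window, the cosine smear's skew exceeds the linear rate
# (peaking at about pi/2 times it):
assert rate_error(lie_cos, W / 2) > rate_error(lie_linear, W / 2)
```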
I wonder if they're actually "slowing down time" in the kernel, or just not implementing the leap second and instead injecting the 1/..... second into all user-visible or exported fields?
However, I do wonder how they implement this. Do you change timekeeping in the kernel so that a fraction of extra time must pass for each second to be "counted"?
Clock: inserting leap second 23:59:60 UTC
The stock Linux kernel works such that if ntpd has set the STA_INS flag via adjtimex some time before, the kernel will do the leap-second insertion at the end of the UTC day.
If you disable ntpd and it doesn't reset this flag (which I doubt it does, but you'd have to check), the kernel will insert the leap second on its own, even if ntpd is not running.
If you disable ntpd, and either ntpd clears the STA_INS flag on termination (which I doubt), or you clear it via the adjtimex syscall, then the kernel will not insert the leap second. After UTC midnight the clock will be one second off, and a restart of ntpd will slowly steer it back to the correct time.
For playing with all of this, there's an adjtimex tool which can display and even change the timex values:
➜ sbin ./adjtimex -pV | sed 's/^/ /'
raw time: 1432018768s 308111us = 1432018768.308111
return value = 5
In the most basic case, code that fetches a date n days ahead and naively assumes fixed-length days would return a date n-1 days ahead instead if it ran at exactly midnight; e.g. date("Y-m-d", mktime()+($days*86400)) would return a date $days-1 in the future if it ran on leap second day. An edge case, certainly, but if you're adding millions of records a day it's something you ought to consider.
No it won't. Those functions deal in UNIX-epoch times, which ignore leap seconds.
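This is easy to check: POSIX timestamps pretend every day has exactly 86400 seconds, so the naive arithmetic still lands on midnight, even across a leap second day. A quick Python equivalent of the PHP snippet above, using the 2015-06-30 leap second:

```python
# POSIX time ignores leap seconds: every day is exactly 86400 seconds,
# so midnight + 86400 lands on midnight of the next day even when a
# leap second was inserted in between (as on 2015-06-30).
from datetime import datetime, timezone

midnight = datetime(2015, 6, 30, tzinfo=timezone.utc)
next_day = datetime.fromtimestamp(midnight.timestamp() + 86_400,
                                  tz=timezone.utc)

assert next_day == datetime(2015, 7, 1, tzinfo=timezone.utc)
```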
The one place I know I've assumed a day is 86400 seconds I also assumed a month was 30 days and a year 365, because it only needed to be an approximate gauge of how much time had passed, didn't need any relation to actual calendar time, and would be dropping far more precision elsewhere. I think it's incorrect to call that a bug, even a non-critical one.
It affected pretty much everything running (a recent version of) Linux.
I'm also very curious about what kind of applications will fail if the kernel simply ignored leap seconds, and let the clock get a second out of sync.
A computer does not last long enough for leap seconds to add up to anything. I'm unable to come up with any hypothetical use case where that small difference is relevant but specialized hardware and software to deal with the issue are not already necessary.