I sometimes think it would've been better if a few things had visibly failed in January 2000.
I spotted something like half a dozen failures in various systems I interact with which I strongly suspected, based upon the timing, were likely Y2K problems that slipped through testing. For example, I received duplicate bills for one of my credit cards for the January 2000 billing period, and then a subsequent apology for the duplicate bills. They never said Y2K, but the timing was very suggestive.
It's pretty much exactly what I expected from most companies...the big stuff had largely been dealt with, but a few things slipped through which they could dismiss with some hand-waving. The thing that surprised me was that there didn't really seem to be any high profile disasters (like a company that couldn't ship products, an airline that couldn't issue tickets, or whatever) at all...I figured there'd be at least a couple.
The call centre staff told me that this wasn't a Y2K bug, but a year-end bug. As if that was meant to make me feel better about an obvious, grim, failure.
They were there before then too. Things that could go wrong at midnight on NYE were only one of a few classes of problem associated with roll-overs. There were a lot of bugs in things like scheduling applications (and similar system tools) in the run up to 2000 that the man on the street didn't associate with the Y2K issue because they didn't happen at that exact moment.
No parade would be thrown for this senator for having prevented 9/11 and likely he'd be castigated for having given airlines an excuse to raise prices due to restrictive government regulations.
The guy at work whose work doesn't cause a problem and things keep working fine: why aren't you as much a team player as that other guy?
I suspect the parable it alludes to passes the cultural literacy threshold for comprehension, at least in the US.
The availability heuristic is a mental shortcut that relies on immediate examples that come to a given person's mind when evaluating a specific topic, concept, method or decision. The availability heuristic operates on the notion that if something can be recalled, it must be important, or at least more important than alternative solutions which are not as readily recalled. Subsequently, under the availability heuristic, people tend to heavily weigh their judgments toward more recent information, making new opinions biased toward that latest news.
"You use Head & Shoulders? But you don't have dandruff!"
"Exactly!"
https://www.urmc.rochester.edu/encyclopedia/content.aspx?con...
> Confirmation bias, also known as Observational selection or The enumeration of
> favorable circumstances is the tendency for people to (consciously or
> unconsciously) seek out information that conforms to their pre-existing view
> points, and subsequently ignore information that goes against them, both
> positive and negative [0]
*"Thinking, Fast and Slow", the best book recommendation I got from the HN crowd
I was thrilled that we had something to point to as a "see, this is why we put in so much work". Prior to that had received lots of criticism about the amounts spent, people hired, blah blah blah.
A lot of money was spent fixing the Y2K issue. Can't exactly recall how much time I spent myself, but it was a dominant factor as far as IT projects went back then.
Your car runs into a bus full of children.
Other lawyer: Jury, they saved millions of dollars per year not checking the brakes on their cars, you should award those millions of dollars, and some more millions of dollars to the families whose children died that day.
Maybe this isn't a "real" failure, but rather the symptom of some IT departments working diligently to solve the problem before it happened. In any case though, I'm curious how inaccurate the televised reality from my youth actually was.
In short and very simple terms, some software stored dates as DD/MM/YY.
It would assume that the year always had the prefix 19. The problem came when you reached /00. Any calculations or software decisions based on that date would be off. Way off.
Some solutions were expanding the date to YYYY or adding a prefix to the new dates. Dates without the prefix would be 19xx. Dates with the prefix X were 19xx and dates with the prefix Y were 20xx.
This turned out to be one of the most cost-effective methods of fixing the problem, and was probably one of the most likely to be implemented. This was especially the case in situations involving software which ran closer to the hardware (for example, BIOS or firmware) or on systems where RAM or storage couldn't be increased and/or the change might increase the software's requirements beyond the system's capabilities.
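To make the DD/MM/YY failure and the X/Y prefix idea from a few lines up concrete, here is a minimal sketch. The function names and the exact prefix syntax (`Y01/01/00`) are made up for illustration; real systems encoded the century marker in many different ways.

```python
from datetime import date

def parse_yy_naive(ddmmyy: str) -> date:
    """Parse DD/MM/YY the way the buggy code did: always assume 19YY."""
    day, month, yy = (int(part) for part in ddmmyy.split("/"))
    return date(1900 + yy, month, day)

def parse_yy_prefixed(stamp: str) -> date:
    """Hypothetical prefix scheme: 'Y' means 20YY, 'X' or no prefix means 19YY."""
    century = 1900
    if stamp[0] in "XY":
        century = 2000 if stamp[0] == "Y" else 1900
        stamp = stamp[1:]
    day, month, yy = (int(part) for part in stamp.split("/"))
    return date(century + yy, month, day)

opened = parse_yy_naive("01/06/99")            # 1 June 1999, parsed correctly
today_naive = parse_yy_naive("01/01/00")       # meant 1 Jan 2000, becomes 1 Jan 1900
today_fixed = parse_yy_prefixed("Y01/01/00")   # prefix says "20xx"

print((today_naive - opened).days)   # negative: the account is "older than today"
print((today_fixed - opened).days)   # positive, as expected
```

Anything downstream that computes an age, an interest period or a billing cycle from the naive result inherits the roughly 100-year error.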
32-bit Unix timestamps have the 2038 problem.
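For anyone who wants to see where the 2038 date comes from, here is a small sketch. It assumes the usual convention of a signed 32-bit count of seconds since 1970-01-01 UTC and does the arithmetic with plain Python integers rather than a real `time_t`.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1   # largest value a signed 32-bit counter can hold
INT32_MIN = -2**31      # where the counter lands if it wraps around

last_good = EPOCH + timedelta(seconds=INT32_MAX)
after_wrap = EPOCH + timedelta(seconds=INT32_MIN)

print(last_good.isoformat())    # 2038-01-19T03:14:07+00:00
print(after_wrap.isoformat())   # 1901-12-13T20:45:52+00:00
```

One second after the first timestamp, a wrapped counter reads as a date in 1901.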
Despite how common these problems have been in the history of computers, we keep making them: https://en.wikipedia.org/wiki/Time_formatting_and_storage_bu...
So, while a Y2K compliance program dealt mostly with software, a complete program went through and tested all hardware in inventory.
The image of a bunch of perfectly capable computers being discarded for no reason describes this media frenzy wonderfully.
Except many countries that spent no money on upgrading their systems had very few problems too.
A fair number of websites went from showing "Dec. 31, 1999" to showing "Jan. 1, 19100".
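A likely cause of "19100", though I can't know what each site actually did: C's `tm_year` and JavaScript's old `getYear()` return years since 1900, and a lot of code glued the string "19" in front of that number. A sketch with hypothetical function names:

```python
def year_buggy(years_since_1900: int) -> str:
    """String concatenation: fine for 0-99, wrong from 100 onwards."""
    return "19" + str(years_since_1900)

def year_fixed(years_since_1900: int) -> str:
    """Arithmetic instead of concatenation."""
    return str(1900 + years_since_1900)

print(year_buggy(99), year_fixed(99))     # 1999 1999
print(year_buggy(100), year_fixed(100))   # 19100 2000
```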
Having to spend billions of dollars on programmers' goofy means of saving a few bytes of memory is a pretty visible failure.
My dad was a programmer in the early days. The machines he started on in the 1960s had 8 KB of RAM. Saving a byte then is the equivalent today of saving 1 MB on an 8 GB machine.
Multiply that times, say, the thousands of customer orders you're trying to process and the goofy thing would be burning a lot of additional RAM because it might help somebody 35 years later. Who among us is writing code today worried about how it will be used in 2052?
This decade I knowingly wrote code that will break in 2036 [1]. My supervisor was against investing the time to do it future-proof (he will be retired by 2036), and I have good reason to believe the code will still be around by that time. I don't think I'm the only programmer in that position.
[1] Library specific variant of the y2038 problem https://en.wikipedia.org/wiki/Year_2038_problem
Sure, but how bad was it really? Something you could fix relatively quickly with a little time and money, or an instance of Lovecraftian horror unleashed upon the world like so much COBOL code?
In one of those scenarios, where we expected the growth of an integer to last at least 100 years, due to certain unaccounted for pathological behaviors, a user burned through 20% of that in a single day. But we had heavy warnings around this, so we were able to address the problem before it escalated.
strip a.out
Spending that much money on storing "19" just so your code keeps working in the unlikely event that it's still in use 3+ decades into the future isn't a good tradeoff. Obviously things are different now.
And in some ways, even "hundreds of dollars per date" doesn't quite convey it. These machines were rare and fiendishly expensive. In 2017 dollars, they started at $1M and went up rapidly from there. Getting more memory wasn't a quick trip to Fry's; even if you could afford it and it was technically possible, it was a long negotiation and industrial installation.
Another constraint that we forget about is the physicality of storage. Every 80 columns was a whole new punch card. That's a really big incentive to keep your records under 80 characters. Each one of those characters took time to punch. Each new card required physical motion to write and read, and space to store.
There were just so many incentives to make do.
Even a 32-bit int could hold 11 million years worth of dates. And if your software is used for longer than that, you can just change it to a 64-bit long and have software that will outlast the sun.
As computer hardware grew out of that, it maintained much of the legacy, down to hardware data paths and specialized processor instructions. It was more than a programming convention.
That was the right choice for the era. As mikeash points out, your approach takes more bits and more CPU cycles. But it also takes a computer to decode. Any programmer can look at a punch card, a hex dump, or even blinkenlights and read BCD. Decoding a 32-bit int for the date takes special code. Which you have to make sure to manually include in your program, the size of which you are already struggling to keep under machine limits.
We've come a long way.
Running a complicated date routine to convert to/from 32-bit timestamps would also have cost a huge amount. These machines had speeds measured in thousands of operations per second, and the division operations needed to do that sort of date work would take ages, relatively speaking. All on a machine that cost dozens of times the average yearly wage at the time, and accordingly needed to get as much work done as possible in order to earn its keep.
Even if a system internally can store a timestamp with nanosecond precision since the beginning of the universe, all that precision is lost when communicating with another system if it must send the timestamp as a six character string formatted as "yymmdd" in ASCII.
Sure, they could have used a custom encoding. But that increases maintenance cost and extra development work. All to solve a problem that nobody cared about at the time.
You are assuming 8 bits per byte, but a byte can be any number of bits.
With two bytes of 7 bits each, the range is only about 40 years.
It is also impractical when the storage medium is punch cards and the system's adder unit only counts in binary-coded decimal.
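The ranges being thrown around here are easy to check. This sketch assumes the counter holds a number of days (my reading of the "11 million years" figure) and uses the average Gregorian year length.

```python
DAYS_PER_YEAR = 365.2425  # average Gregorian year

def years_covered(bits: int) -> float:
    """How many years a day counter with this many bits can span."""
    return 2**bits / DAYS_PER_YEAR

for bits in (14, 16, 32, 64):
    print(f"{bits:>2} bits: about {years_covered(bits):,.0f} years")
```

Two 7-bit bytes come out at roughly 45 years, two 8-bit bytes at roughly 179, and 32 bits at a bit under 12 million.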
It means that data interchange is now much more complicated too. How do you get everybody to agree on the same 2-byte representation for dates? This is the 1960s, so you can't just email them. You have to have somebody type up a letter and mail it. Or if you want to get on the phone, a 3-minute international call will cost $12, which is about $100 in 2017 dollars.
Plus then you can't look at a hex dump or a punch card or front panel lights and see the date, so now you've made debugging much harder.
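A quick illustration of the debugging point, as a sketch only: packed BCD puts one decimal digit in each 4-bit nibble, so the date is legible in a hex dump, while a binary day count is not. The 1 Jan 1900 epoch for the day count is an arbitrary choice for the example.

```python
from datetime import date

def to_packed_bcd(yymmdd: str) -> bytes:
    """Pack six decimal digits into three bytes, one digit per nibble."""
    if len(yymmdd) != 6 or not (yymmdd.isascii() and yymmdd.isdigit()):
        raise ValueError("expected six digits, e.g. '991231'")
    return bytes(
        (int(yymmdd[i]) << 4) | int(yymmdd[i + 1]) for i in range(0, 6, 2)
    )

bcd = to_packed_bcd("991231")
day_count = (date(1999, 12, 31) - date(1900, 1, 1)).days  # hypothetical epoch

print(bcd.hex())                         # 991231, readable straight off the dump
print(day_count.to_bytes(4, "big").hex())  # same date, but needs code to decode
```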
Some systems, where storing numbers in columns of characters was common practice (COBOL idiomatic style?), stored the date as two digits (possibly BCD), so the possible range is 00-99 no matter how many bits are used.
But it's worse than that. In the 90s a lot of code used 16-bit values, character strings. That is, it stored a char(2), parsed it as 2-digit number and then converted it to a date by adding 1900.
So it was only really "saving space" when compared with storing a char(4).
No punch cards or BCD, I'm talking about DOS/Windows systems.
The good programmers who are fortunate enough to be working on software that matters.
There are plenty of good programmers working on software that matters that should absolutely not be trading off hazy possible benefits in 2052 for significant costs now.
It's occasionally necessary. When I wrote the code for Long Bets [1], I took a number of prudent steps to make sure things would have a good shot at surviving for decades. But I only took the cheap ones; the important thing was to ship on time.
And I think that's the right choice for most people. Technological change has slowed down some, but 35 years is still an incredible amount of volatility. Betting a lot of money on your theories of what will be beneficial then is very risky.
I guess it's not obvious, but I think there's really a continuum here. You don't necessarily need to write software that will run perfectly in 2052, but it'd be good if you wrote software that can be comprehended, adapted and altered later on. Maintainability is never a "hazy benefit." (If the problem isn't a total throwaway.)
wat
But Moore's Law is now dead:
https://www.technologyreview.com/s/601441/moores-law-is-dead...
Per-thread performance has been basically flat after decades of exponential gain:
http://preshing.com/20120208/a-look-back-at-single-threaded-...
The iPhone is a decade old; every phone now looks like it, and it's highly plausible that they'll look basically the same a decade from now, possibly much longer. Laptops are 30 years old; they've gotten cheaper, faster, and better, but are recognizably the same. HTML is coming up on 30, and it will be in use until long after I'm dead. TCP is nearly 40; Ethernet is over 40; even Wifi is 20.
So it's just easier now to guess what programmers will be doing in 35 years compared to 1965.
It's the quick hacks and bodges that stick around forever.
It's not just the shitty programmers who do this. Sometimes we have shitty product managers who won't push back against this kind of thing. And you're forced into creating something evil because most of the job is very good but this one time, you have to suck it up.
My response to that, though I agree with you, is that when a supervisor or PM or whoever gets on you about something you know is bad, you negotiate.
"Yes, I'll do this for now because the company needs it now. But only if you guarantee me the time (and possibly people) to do it right later."
You get agreement in an email, create the ticket and assign it to yourself as mustfix two months from now. And you shove it down their throats.
That's not an ideal place to work if you have to do that, but I have worked at those places and this is how you deal with that situation.
"Yeah, I'll give you a shit solution in 1 day right now. But only if you give me a couple of weeks for a good solution later."
In reality, I've mostly only had to deal with this situation in startups. Mid-level and mature companies are usually open to pushing back and getting things right. But there are exceptions. Today was an exception. But that's also one of the reasons I don't really want to work at startups anymore.
I'm not saying this is true in your case. But there are so many different classes and types of programmers and projects that it's hard to generalize.
99% of your shit code isn't getting thrown away. It's sticking around making life hell for people like me.
Stop writing shit code because it's going to get thrown away. If you work for startups, you are always operating in protoduction mode. Everything you write ends up in prod.
Write code that doesn't suck. It doesn't have to be perfect or optimal, but make it not suck before you push.
Probably about 80% of the code I write doesn't even get looked at or used by another developer. If the technique/analysis proves useful, it gets rewritten/refactored. That has the added advantage that I then understand the model better.
For me there's a giant difference between code that lasts, which needs to be sustainable, and disposable code, which doesn't. I'm also very big on YAGNI; my code gets so much cleaner and more maintainable when I'm only solving problems that are at hand or reasonably close. Speculative building for the future can get insanely expensive: there are many possible futures, but we only end up living in one.
Indeed, I think a "do it right" tendency can prevent people from really doing it right. If we invest in the wrong sorts of rightness up front, we can create code bases that are too heavy or rigid to meet the inevitable changes. So then people are forced into different sorts of wrongness, working around the old architecture rather than cleaning it up.
When there are real business reasons to rush something, I'm glad to support that by splitting the work like you suggest. But the flipside is them recognizing that not every thing is an emergency, and that most of the time we have to do it right if they don't want to get bogged down.
(that doesn't stop me from sometimes having a weird admiration for incomprehensible software kept going forever with weird hacks. It's like with movies, sometimes they're so uniquely awful that you have to admire the art of them)
My dad's first programming job was initially to mechanically change how variables were stored, thus saving 1 and only 1 byte of disk space. Someone ran the numbers and having someone do that for a few months was a net savings.
A few years after that he was talking about some relatively minor optimizations that saved a full million dollars worth of hardware costs by delaying a single new computer purchase.
uint64_t even
Or a UUID as others have suggested.
Technically C spec doesn't really say exactly how many bits int, long and long long should be. If you want specific sizes and your code to be somewhat portable use the specific bit sizes to make that clear. There are also types for size-like things (size_t) and pointer and offset like things.
I would go further and say you should _always_ use specific sizes, unless forced otherwise. There's no reason not to.
If you're looking for the best performance you shouldn't use leastX types, you should use fastX types (e.g. int_fast32_t for the "fastest integer type available in the implementation, that has at least 32 bits").
The difference between "leastX" and "fastX" is that "leastX" is the smallest type in the implementation which has at least X bits. So if the implementation has 16, 32 and 64b ints and is a 32b architecture, least8 would give you a 16b int but fast8 might give you a 32b one.
int32_t x = call_returning_int();
auto x = call_returning_int();
auto foo = func_returning_int(); to my knowledge worked in C because 'auto' was the lifetime keyword - like 'register' - and the default type in C is 'int'.
That's why when you miss a definition in C++ the compiler warns you that there's no default int.
Best to use the stdint types, just in case.
Remember, these fancy computing devices were built for the rich and the government, not for the average joe. No one thought computing would be this easily accessible.
Can't really use NAT on a primary key...
IPv6 isn't perfect, but we could have avoided a lot of hassle if we'd started off with it.
There are still plenty of people with that mindset. Some will even quote "YAGNI" when you question them.
It gets better yet when you realize that on 32bit systems (like in TFA) long usually is 32bit too ;)
I had some students that asked me if even a long would be enough to handle exponential growth, after all it's only twice as big. As a thought experiment I asked them to come up with a time to fill a 32 bit int completely. They came up with roughly a year. Then as a margin of error I said, let's assume you have 4.3 billion transactions every second instead. This volume can be sustained for 100 years, and we're still not in the danger zone yet.
2^32 ~ 4,294,967,000
2^64 ~ 18,446,700,000,000,000,000
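The thought experiment checks out if you run the numbers. A small sketch; the rate of 136 IDs per second is just my back-calculation of what "fills a 32 bit int in roughly a year" implies.

```python
SECONDS_PER_YEAR = 365.2425 * 24 * 60 * 60

def years_to_exhaust(bits: int, ids_per_second: float) -> float:
    """Years until a counter of the given width runs out at a constant rate."""
    return 2**bits / ids_per_second / SECONDS_PER_YEAR

print(round(years_to_exhaust(32, 136), 2))     # about one year
print(round(years_to_exhaust(64, 4.3e9), 1))   # well over a century
```

At 4.3 billion IDs per second, 64 bits last about 2^32 seconds, which is roughly 136 years.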
The simplest way to do that is to just throw UUID at all problems from the start. (https://github.com/alizain/ulid s are better, but there aren't libraries to generate them in literally every language + RDBMS.)
For some applications you don't want to leak the time. Choose wisely.
Computers set limits internally on how big numbers can be when they're keeping track of stuff.
Your developers had given each game a number to identify it. So your first game was #1, the 40th game was #40, and so on.
The limit for how big the number could be was a bit over 2 billion, and your players have just now played a bit over 2 billion games, and so that id number suddenly exceeded the computer's internal limit. Specifically, the limit was 2147483648, so basically it crashed on game #2147483649, which is the next id after the last acceptable one (notice the last digit is 1 higher.)
I'm simplifying slightly but that's the idea. It'll be fixable by essentially using a different format for the id number so that the limit is higher, much like telling the computer "use a higher limit for this particular number, it's special."
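If you want to watch it happen, here is a sketch that uses Python's `ctypes` fixed-width integers to imitate what a 32-bit field does. The function names are hypothetical, and the real app may well have failed in a different way (a crash rather than a silent wrap); the largest value a signed 32-bit integer can hold is 2^31 - 1 = 2147483647.

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2147483647

def next_game_id_int32(current: int) -> int:
    """Increment as if the id lived in a signed 32-bit field (wraps silently)."""
    return ctypes.c_int32(current + 1).value

def next_game_id_int64(current: int) -> int:
    """Same increment with a 64-bit field: plenty of headroom."""
    return ctypes.c_int64(current + 1).value

print(next_game_id_int32(INT32_MAX))   # -2147483648
print(next_game_id_int64(INT32_MAX))   # 2147483648
```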
The original Pacman crashes at level 256 for the same reason. - http://pacman.wikia.com/wiki/Map_256_Glitch
Edit: 32 bits worth of games played means about 4 billion games. 4 billion X 4 bytes for 32-bit = 16GB just for the 32-bit ID's. 64-bit ID's would need 32GB for the 4 billion games. I guess memory and storage weren't that cheap back then.
If the issue happened only on 32-bit iPads, but not on 64-bit iPads, the programmer probably picked a "long", not an "int". Had the programmer picked an "int", the problem would also happen on 64-bit iPads.
I think it's a really easy mistake for the first developer to make (especially because they weren't a C/Obj-C programmer), and then the sort of thing that no one audits after that.
(Another place where Java is confusingly different: "volatile" implies a memory barrier in Java, but not in C and C++.)
A 32bit integer is pretty much the default numeric type for the majority of programming tasks over the last 20 years. Even with 64bit CPUs, 32bit is still a common practice. Probably 99% of all programmers would make the same choice unless given specific requirements to support more than 2 billion values.
Up until recently, Rails defaulted to 32 bit IDs, so there are a ton of apps out there that could have these issues, especially since Rails has always prided itself on providing sane defaults: https://github.com/rails/rails/pull/26266
Many dynamically typed languages have an automatic change from int to bigint rather than allowing overflow. For example, Python.
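In Python the promotion is invisible, which is nice until the value has to go into a fixed-width column or wire format. A sketch of both halves; `checked_int32` is a made-up helper, not a standard function.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def checked_int32(value: int) -> int:
    """Raise instead of silently exceeding what a 32-bit field could store."""
    if not INT32_MIN <= value <= INT32_MAX:
        raise OverflowError(f"{value} does not fit in a signed 32-bit integer")
    return value

games_played = INT32_MAX
games_played += 1                  # no overflow, Python just uses a bigger integer
print(games_played)                # 2147483648
print((2**64 + 1).bit_length())    # 65
```

Calling `checked_int32(games_played)` at the storage boundary turns the problem into an explicit error rather than a surprise later.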
Computer history is riddled with assumptions like that. The Y2K problem, Unix dates running out in 2038, and 32 bit computers unable to address more than 4 GB of memory are just the big ones. It's everywhere. Smaller software projects are generally built for what you need right now, and less for what might happen in the distant future.
Ideally you want to retain some awareness that this is an issue so you can start working on it once you go over a billion games, but in a small company, there are probably always more urgent things to worry about, and nobody ever gets around to fixing this technical debt.
As a developer, this sentence made my skin crawl.
It's also really awesome that you're here, and that you guys were so honest about the nature of the bug - this is really something that should be encouraged.
I disagree. Simple napkin calculation: 100 million players playing 40 games each per year (about 1 per week) over 5 years = 20 billion unique games.
As others pointed out it was likely not a miscalculation, just a lack of calculation. The bug occurred only in the client and the decision to use a smaller data type was likely not a conscious one.
In any case, I wouldn't hold it against an individual programmer. But arguably this sort of bug indicates your development process has flaws (not enough testing, code reviews, etc).
Go easy on your eng team ;)
That's a very poetic typo
Congrats on such a success.
Same deal here. 32-bit numbers are stored as 32 switches, starting from
0000 0000 0000 0000 0000 0000 0000 0000
and counting up to
1111 1111 1111 1111 1111 1111 1111 1111
So what happens on game 4,294,967,296? Just like the odometer, everything rolls back to 0, and things start breaking because the program gets confused.
Pretty common problem, really. The fix would be to use a 64-bit number, which doubles the number of binary digits.
Snarky... Except that there were probably years of games to notice that you were approaching a "magic number" like 2^31.
SomeHacker44 -- sincere
CGamesPlay -- sincere
blktiger -- sarcastic
i_cant_speel -- mildly sarcastic
jazoom -- sincere
But maybe you're just being sarcastic ;)
EDIT: Apparently they already said that: https://news.ycombinator.com/item?id=14540509
I inherited a system where, among other things, the entire response body from a payment gateway callback is saved into a text field using utf8 character set, despite the fact that most of the supported payment gateways send data in iso-8859-x (and indicate the used charset inside the body itself, how's that for a chicken-and-egg problem). Of course when the data gets truncated due to not actually being utf8, nobody notices. Fun times.
[1]: https://dev.mysql.com/doc/refman/5.7/en/sql-mode.html#sql-mo...
Yes, yes it is - it burned me so badly (catastrophic, unrecoverable production data loss) in the early days of my career (~15 years ago as a junior level dev in a senior level role) that it has forever colored my opinion of MySQL - I will really never trust it again.
Long live PostgreSQL!
EDIT: Though I am curious why MySQL doesn't throw an error when you try to store more than 64KB in BLOB?
> time_t is now 64 bits on all platforms.
Linux[2]:
> The vast majority of 64-bit hosts use 64-bit time_t. This
> includes GNU/Linux, the BSDs, HP-UX, Microsoft Windows, Solaris, and probably others. There are one or two holdouts (Tru64 comes to mind) but they're a clear minority.
This is little help for older, already deployed systems, of course.
That said, this is definitely indicative of what's going to happen in just 20 years, 6 months and 20 days from now. I mean, we're still cranking out 32bit CPUs in the billions, running more and more devices, and devs still aren't thinking beyond a few years out. I know of code that I wrote 12 years ago still happily cranking away in production, and there may be some I wrote even longer than that out there... and I guarantee I hadn't given two thoughts about the year 2038 problem back then, and I doubt many devs are giving it much thought today.
It's truly going to be chaos.
I expect 2038 to be a rare hell because of the nature of the devices. Y2K was an IT problem, but 2038 will be an embedded system problem and that's going to be a much more painful thing to audit. Moving from the server room to inside equipment and walls is going to be fun.
That was a valuable lesson.
(I actually generated most entries myself while testing stuff - live in prod of course - and while there were probably fewer than 255 votes, the AUTO_INCREMENT did its job and produced an overflow).
Seems you have learned your lesson :-)
Twitter saw it coming and forced the issue by announcing that at a certain date and time they would manually jump the ID numbers rather than wait for it to happen at some unpredictable time.
(Or we're thinking of different events, I apologize if so)
From a (former) Twitter dev:
> Given the current allocation rate, they'll probably never overflow Javascript's precision nor get anywhere near the 64-bit integer space.
https://twittercommunity.com/t/discussion-for-moving-to-64-b...
The 2^53 problem was for Javascript, which has no native integer type, and is thus limited by the mantissa size of Number (which is defined as an IEEE double-precision float).
Twitter ids are unsigned 64-bit, since they're generated using Snowflake. That link must pre-date the move to snowflake ids, and is speaking to the count of tweets instead.
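For the curious, a Snowflake id can be taken apart with a few shifts. This sketch assumes the layout from Twitter's published Snowflake description (millisecond timestamp in the high bits, then a 10-bit machine id, then a 12-bit sequence) and its custom epoch constant; neither comes from this thread, so treat both as assumptions. The sample id is the tweet id quoted elsewhere in this discussion.

```python
from datetime import datetime, timedelta, timezone

TWITTER_EPOCH_MS = 1288834974657  # assumed Snowflake epoch, in Unix milliseconds
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def decode_snowflake(snowflake_id: int):
    """Split an id into (creation time, machine id, sequence number)."""
    timestamp_ms = (snowflake_id >> 22) + TWITTER_EPOCH_MS
    machine_id = (snowflake_id >> 12) & 0x3FF   # 10 bits
    sequence = snowflake_id & 0xFFF             # 12 bits
    created = UNIX_EPOCH + timedelta(milliseconds=timestamp_ms)
    return created, machine_id, sequence

created, machine_id, sequence = decode_snowflake(875423039323688960)
print(created.isoformat(), machine_id, sequence)
```

Because most of the bits are a timestamp, the ids grow with wall-clock time rather than with the number of tweets.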
On a more serious side, that number won't be reached anytime soon..
I could have definitely chosen my words better.
It looks like they are using PHP/MySQL/Javascript/Flash, with only MySQL having any explicit types.
Even so, an error is often preferable to overflow, which is usually undefined behavior and could lead to a duplicate primary key anyways if it wraps to the first game.
A better question is "why 32-bit over 64-bit", but the site dates back to 2005 where that was the norm and the question has the same issues.
Your reserved bottom range is a perfectly good solution. But rolling into negatives seems fine, too.
[1]https://stackoverflow.com/questions/18195715/why-is-unsigned...
The -fstrict-overflow option is enabled at levels -O2, -O3, -Os.
They do it, so they can simplify things like "x < x+1" to "true".
for (int x = 0; x >= 0; x++)
Also, you have problems whenever you compare against signed ints.
It was a mistake to use them for sizes in C++. Google code style requires using int64 to count sizes instead of uint32 for good reasons.
At least they put an actual check in there - you didn't suddenly overrun and wake up with an enormous debt, so I'll give them some credit for that ;-)
And I think it isn't uncommon to have to add digits to the front of telephone numbers as regions grow (or telephones become more common) and the number space isn't large enough.
One interesting aspect of it was that Twitter realized what was going to happen in advance, and artificially pushed their IDs over the edge at a preplanned time so they could have as many people available as possible to work on any problems that appeared.
Thanks for all the comments! Always lots to learn from.
You probably mean 2^31 -1.
Moving to strings for Javascript was really just safety planning for the future since:
Tweet ID from today: 875423039323688960
Number of bits of precision necessary to represent it exactly: 60
Overflowed 53-bit precision long long ago. You can read about it here: https://dev.twitter.com/overview/api/twitter-ids-json-and-sn...
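You can reproduce the precision problem without a browser, since a Python `float` is the same IEEE double that a JavaScript Number is. A minimal sketch; `survives_double` is a made-up helper name.

```python
tweet_id = 875423039323688960   # the id quoted above

def survives_double(n: int) -> bool:
    """True if the integer comes back unchanged after a trip through a double."""
    return int(float(n)) == n

print(tweet_id.bit_length())           # 60
print(survives_double(2**53))          # True
print(survives_double(2**53 + 1))      # False
print(survives_double(tweet_id + 1))   # False, it rounds to a neighbouring value
print(str(tweet_id + 1))               # a string carries every digit
```

At this magnitude adjacent doubles are 128 apart, so most ids in the range cannot round-trip, which is why the API also hands them out as strings.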
https://groups.google.com/forum/m/#!topic/django-developers/...
Didn't expect Chess.com and YouTube to have a crossover of users? Surprised there isn't active moderation on a site this size.
I think this is more of a capacity planning issue.
Usually when I hit some sort of unexpected bug in production I try to think about what type of testing will prevent similar problems in the future.
https://github.com/ornicar/lila/blob/master/app/controllers/...
Hmmm... :)
I really doubt this is in any way linked to Apple's reasons for dropping 32-bit, though.
It's a pretty lame bug, to be honest and certainly something easily foreseeable as this wasn't an overnight occurrence.
IMPOSSIBLE to predict.