Microsoft Exchange stops passing mail due to bug on 1/1/22 (reddit.com)
711 points by technion 6 months ago | hide | past | favorite | 358 comments



Many eons ago... around 10 to 15 years ago... I saw a similar serialization and date-processing bug in a Java-based enterprise integration platform used by some of the largest companies in the world.

Several European central banks were unable to process transactions ;-). I can't provide more details, to protect the innocent. Since then, it has been standard practice within that vendor's testing group to always have a running platform set up with a date 6 months ahead (clock synced from a different source). Something I added to my own software engineering and testing standard practices.


A couple of years ago, I did work for a major airline where the test suites that exercised date processing would fail on every leap day (Feb 29). So they decided not to run tests or deploy on Feb 29th.


Lol, reminds me of the large EMR platform I used to work on that (until very recently) could not handle the "fall back" DST event (thanks to local date encoding in various parts of the system)... So the recommended 'fix' for hospitals was to just turn it off (their entire electronic health record) for the duration of the DST changeover... I personally believe that a non-trivial number of people die due to DST and similar time-related bugs...


Same for SAP. Seems like it's still 'best practice' to do so:

http://www.sapbasis1solution.com/how-to-prepare-sap-system-f...

    During this time of one hour “Fall Back” on the clock, as is our Basis team best practice (SAP Note #7417, #102088 and others) we intend to gracefully shut down all Production applications prior to 2:00 AM, at which time the “fall back” happens that will revert back clock time by one hour.  We would then wait for clock time to advance past the “new” 2:00 AM Local time.  And start from 2:05 AM EST will bring up applications.


Oh, EMR systems kill patients all the time.

Like, I’m not convinced “turn it off” isn’t an improvement.


I hesitate to find it amusing that healthcare software is so busted that we laugh off the negative, fatal consequences.


No, the commenter isn't laughing or joking.

They're that bad.


Not having one at all is also bad.


How so? "First, do no harm" is the foundation of medicine.


Not of software engineering though.


Paper medical record systems cause errors and deaths for many of the same reasons that electronic medical record systems do. EMRs, with all their faults, are probably a net positive.


> How so?

Empirically, it's associated with a higher rate of medical errors, which is a big part of the drive for electronic records.


Poor record keeping practices in general have been linked to killing patients. Especially ones involving medication errors. It’s not like paper systems were perfect or even demonstrably better.


From personal experience, this is Epic.


For anybody who missed this very insightful comment :-) see the "Criticisms and controversies" section: https://en.wikipedia.org/wiki/Epic_Systems#Product_and_marke...


Yes. Epic was deployed in Helsinki in 2021, and you bet it miscalculates the administration of recurring medications during DST changes, which nurses then need to correct manually.


DST has long overstayed its usefulness. I wonder why it's so hard to get rid of?


Our unit tests break twice a year, starting a couple of days before DST change. I'm sure it could be solved, but since the DST change happens on Sunday and the CI builds fail on Friday, no one has really bothered.


So I guess your unit tests run with the current date and don't test behavior with different dates. That has the potential to hide quite a few date-related bugs until it's too late.

What I like to do to avoid this situation is use property-based testing to check that the code works for all dates in the desired time span. For Python I use hypothesis [1] in combination with freezegun [2] for that. Here is an example of how that could look:

  from datetime import datetime, timezone as tz
  
  from freezegun import freeze_time
  from hypothesis import given, strategies as st
  
  
  def business_logic():
      # TODO: Refactor before 2038!
      if datetime.now(tz=tz.utc) > datetime(2038, 1, 19, 3, 14, 7, tzinfo=tz.utc):
          raise Exception("This code is not working after UNIX time overflow")
  
      ...
  
  
  @given(dt=st.datetimes(max_value=datetime(9999, 12, 31, 23, 59, 59), timezones=st.just(tz.utc)))
  def test_business_logic(dt):
      with freeze_time(dt):
          business_logic()
  
  
  test_business_logic()
[1]: https://hypothesis.readthedocs.io/en/latest/

[2]: https://github.com/spulec/freezegun


It's just that the unit tests are built on the assumption that if you go forwards 24/48/72 hours, the time will be the same but N day(s) ahead. And that breaks down when you shift time zones.

>don't test behavior with different dates

It basically tests with different dates every time it's run.


> the CI builds fail on Friday

That seems different from what I understand by "continuous integration".


It's just an unintentional read-only Friday


We were doing a midnight deploy with 40 people. After the boss explained the process, I asked him: so, are we deploying at the first midnight or the second?

Long quiet pause, then a series of curses. So we deployed at 2am instead.


Hmmm. Where are you? Here in the US DST changes happen at 2AM to avoid ambiguities/weirdness with midnight.


Midnight UTC solves most of these problems.


One project had tests fail every Saturday and Sunday, near the organization’s fiscal week boundary. Ever since then, I have strived to hard-code all dates in test suites, choosing dates several decades in the future.


The outcome of tests shouldn’t depend on the time of day. Are these integration tests and you are having to deal with external bugs you can’t fix?


Doesn't that just hide the bugs? Your tests should pass on any and all dates.


No. I write explicit tests on the boundaries, such that they would fail on any day of the week.


I came to ask if anyone was aware of systems that automate tests of dates in the future, only to find it already answered.

> Something I added to my own Software Engineering and Testing standard practices.

Noted.


Isn’t this basically fuzzing?

Making sure it works at the business domain level is another thing entirely.


Not exactly fuzzing, no.

Most tests are written with a fixed date, or now(). Instead, using now()+delta can help discover future bugs and give you some time to remedy the issue. It's often annoying to write tests like that, though, without them breaking on unrelated timing issues, compared to a fixed value.
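
As a sketch of the now()+delta idea (the `days_until_expiry` business logic here is hypothetical), the trick is to inject the clock so the same checks can run against a date roughly six months out:

```python
from datetime import datetime, timedelta, timezone

def days_until_expiry(expiry: datetime, now: datetime) -> int:
    """Hypothetical business logic: days remaining before `expiry`."""
    return (expiry - now).days

def run_checks(now: datetime) -> None:
    # The same assertion, evaluated against whatever clock we inject.
    expiry = datetime(2100, 1, 1, tzinfo=timezone.utc)
    assert days_until_expiry(expiry, now) > 0, f"broken at {now}"

real_now = datetime.now(timezone.utc)
run_checks(real_now)                        # today
run_checks(real_now + timedelta(days=183))  # now() + delta: ~6 months ahead
```

The delta run gives you a lead time equal to the delta: if it fails, you have that long to ship a fix before real clocks reach the breaking date.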


If you expect your software to exist in 6 months, then the input is entirely expected, and therefore not fuzzing.

It's a nice early-warning system for Y2K or Y2038-like problems.


Not especially. Fuzzing would be running the software through an API to test all of the possible dates.


No, that's exhaustive testing.

This is fuzzing: https://en.wikipedia.org/wiki/Fuzzing


A date 6 months in the future is not invalid or off; it is a normal input that the program is expected to receive in the future.

Fuzzing would be testing the date parser by passing it '25-44-2021' to see what breaks.


Fuzzing inputs don’t necessarily have to be invalid, just dynamically generated in order to probe program states not normally exercised by manually generated test data. Per Wikipedia: “providing invalid, unexpected or random data as inputs”


Exactly my original point.

Invalid input is invalid input


Valid, random inputs is still fuzzing, which is what the person you are replying to was sharing with the quote from Wikipedia.

Inputs don't have to be invalid for it to be fuzzing.


Indeed. But you probably meant to address the parent, rather than my comment.


Blerp


C'mon, it's Microsoft! You shouldn't need to do that!


Agreed. Microsoft should be doing that already.


Reminds me of how Homakov forward-tested GitHub issues by sending a message from 3012 (it worked).

Here's the original link: https://github.com/rails/rails/issues/5239

Unfortunately, they fixed the date and his bender account got renamed.


Is it someone's responsibility to periodically use the environment as a real user would?

Without this, there's no value in running a duplicate environment like this. Automated tests won't capture human interaction perfectly...


You can automate at the UI level and run that set of interactions (as well as screen captures and image matching if needed)


We shouldn't blame Microsoft for this; I once worked in the veterinary industry, and one practice management system allowed you to register a pet with their real or approximate birth date, or by entering an approximate age.

If you entered a birth date, the software would calculate the pet's age, and we noticed the odd quirk, like a 400-year-old cat etc. I couldn't see an obvious pattern to the anomaly, and it didn't occur often, so I created a support ticket.

After a couple of days, I received a response that 'dates were hard' and that we should correct the ages manually as needed.

So there you have it: Dates are hard.

(No, the bug wasn't ever fixed.)


Dates are hard; it's got to be one of the things on the list of "things programmers don't understand."

Choices to represent dates as integers (seconds from an epoch), or as strings, or as arrays, or as other data types all come with a lot of non-obvious consequences and edge cases to deal with. I always use a provided "date" datatype from a library, and never try to roll my own with base data types if I have any choice in the matter. And they are still hard, depending on what range of dates you need to handle.


Semi-related, one of my favorite posts on UTC. Seems simple to use until it’s not.

https://zachholman.com/talk/utc-is-enough-for-everyone-right


Heh, if dates are hard now, just wait till 2038 (https://en.m.wikipedia.org/wiki/Year_2038_problem)


You know, when I started working with computers (many moons ago) and found out about the 2038 problem, I didn't worry about it. After all, that will never affect me, right?

Well, it's 2022 and 2038 doesn't look that far away now. Hopefully I'll be retired by then and won't have to deal with it. But, something tells me that's not going to happen.


The issue is that even when retired, it might hit us in one form or another.


The knowledge has indeed existed for years.

* https://news.ycombinator.com/item?id=4128208


Sure, dates are hard. But I think it's too generous to say Microsoft shouldn't be blamed for releasing such a bug.


Dates are hard but Microsoft is a trillion dollar company that sells products on every locale and timezone on the planet.

Something like this is unacceptable.


I talked to someone who worked on this team. A mere 30-minute conversation opened my eyes to the complexity, and he wasn't even scratching the surface. It's hard for most people to appreciate the true complexity here. I agree that accountability is needed, but even a trillion dollars won't buy bug-free code, especially with dates.


The nature of this specific bug isn't something related to hard date stuff, like leap years, leap seconds, DST, or time zones. It's just that they formatted version strings as YYMMDDHHMM (e.g. 2201010001) and tried converting that number to an int32, causing an overflow once 2022 struck.

Date bugs are hard, but this one isn't.


It's not even really important that it's a date bug from a testing perspective; it's just a bug and a signal of an insufficient testing pipeline. You make a product (including a definition file), you test it before publishing it. You have canaries. If testing it happens to involve moving the date forward, you do that.


How hard is it to write a test that checks every date in the next 5 years, so we'd know the function won't break without leaving us enough time to publish a fix through the regular release pipeline?
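
As a sketch, assuming the YYMMDD-plus-build-number encoding described in the thread (the function names here are made up), such a sweep is only a few lines:

```python
from datetime import date, timedelta

INT32_MAX = 2**31 - 1

def version_for(day: date, build: int = 0) -> int:
    # Hypothetical re-creation of the scheme from the article:
    # two-digit year, month, day, then a four-digit build counter.
    return int(f"{day:%y%m%d}{build:04d}")

def first_overflow(start: date, days: int):
    """Scan `days` days ahead; return the first date whose encoded
    version no longer fits in a signed 32-bit integer."""
    day = start
    for _ in range(days):
        if version_for(day) > INT32_MAX:
            return day
        day += timedelta(days=1)
    return None

# A five-year sweep starting mid-2021 flags the problem immediately:
print(first_overflow(date(2021, 7, 1), 5 * 366))  # → 2022-01-01
```

Run nightly, a test like this would have reported the exact failure date months before any production server hit it.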


Good point. It feels like what you're describing is time travel to erase the bug from the timeline completely, vs. getting a one-time advance warning. I guess for date-related bugs where you have no other way to avoid them, you would need to set the clocks forward at least as far as you need to get a fix in, regression tested, plus enough time for your users to do the same.


We're talking about 1 specific instance. Every single problem looks small until you start looking at the big picture. It's hard!


It's painful to learn of such a bug and have no insight into how it was caused. Would love to get my hands on the codebase to figure this one out!

In another universe The 400 Year Old Cat could have been our 500 Mile Email.


We know how it was caused; the bottom of the reddit post is the update. The developers got clever and used a common method to store a version as an integer, but they never checked that they had enough space to store such a large number for the uncommon format they chose.

It's just another form of Y2K, or Y2038.


The comment you replied to is talking about the 400-year-old-cat from the previous comment, not the Exchange bug from the article.


Oops! Thank you, I lost track of the thread while scrolling back up! Please disregard; I'd love to know what that bug is too.


> So there you have it: Dates are hard.

Being an amateur developer, I never try to touch a date in any way other than through a specialized module (arrow in Python, moment or luxon in JS, ...).

I know only a few such modules, so everything that remotely looks like time/date computation is a nail for my universal hammer.


I learned this one the hard way when using jQuery DataTables: dates aren't natively supported for sorting, so I had to integrate moment.js and learn the quirks of date formatting between it, DataTables, .NET, and SQL, especially since with moment you uppercase everything (MM/DD/YYYY vs MM/dd/yyyy), plus the notation for AM/PM and seconds is also different.

Dates are hard.


> date formatting between it

Oh yes. The nightmare of software that reinvents the wheel.

Instead of using ISO 8601 [1], they feel the need to do something else.

I currently suffer from how time was botched in the otherwise great backup program Borg [2]

[1] https://xkcd.com/1179/ - yes, xkcd has an entry for everything

[2] https://www.borgbackup.org/


As a Borg user, could you explain more about the issues with date formatting in Borg?


The timestamps are naive (no timezone) and therefore hardly useable for a non-local installation (backups on/to servers in different locations).

See for instance https://github.com/borgbackup/borg/issues/4832
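
For illustration, Python's own datetime module shows why naive timestamps are hard to use across locations; comparing a naive value with an aware one doesn't even have a defined answer:

```python
from datetime import datetime, timezone

naive = datetime(2022, 1, 1, 12, 0)                       # no timezone attached
aware = datetime(2022, 1, 1, 12, 0, tzinfo=timezone.utc)  # explicit UTC

try:
    naive < aware
except TypeError as exc:
    # Python refuses to guess which offset the naive value meant.
    print(exc)  # → can't compare offset-naive and offset-aware datetimes
```

Backups taken on servers in different timezones have the same problem: without an offset, there's no way to order the archives correctly.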


Did you check with the cat to see if it was actually 400 years old?


You joke, but if they had been tracking shark ages instead it might well have been!

https://www.bbc.com/news/science-environment-37047168


I feel like after Y2K, we should've learned never to store two-digit years again. If the format they used had actually been YYYYMMDDhhmm, it would've crashed as soon as someone tried to convert it to an int, and they would've fixed it without another thought. Sure, dates are hard, but a few simple rules cover a lot of problems.


Do you mean it would crash from that number being too big for a byte-sized int? Just checking I'm not missing anything here.


Yeah, they converted the DateTime object into a string of the format YYMMddhhmm, which for 2022 starts with 2201010000. Then they convert that into a signed 32-bit integer: 2,201,010,000.

A signed 32-bit integer uses one bit for the sign and 31 bits to represent the actual number, so it can express values from 0 up to 2^31 - 1, or 2,147,483,647. 2,201,010,000 is larger than that.

If they had used the format YYYYMMddhhmm, the number for every year would have overflowed the field, because 201,801,010,000, for example, is far too big to fit. Every year past year 99 would be too big.

But if it's, say, 2008 and you realize that including the thousands and hundreds digits will always overflow, then 801,010,000 fits fine. So why not just cut off the thousands and hundreds? It's not like we'll run into a 1999/2000-style problem before the software is replaced.
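
The overflow is easy to reproduce; in Python, ctypes can emulate what forcing that value into 32 signed bits does:

```python
import ctypes

# "2201010000" (YYMMDDhhmm for 2022-01-01 00:00) forced into 32 signed bits:
value = 2201010000
print(2**31 - 1)                    # → 2147483647, the int32 maximum
print(ctypes.c_int32(value).value)  # → -2093957296, i.e. 2201010000 - 2**32
```

Whether the real code wrapped like this or threw a conversion error, either way the version comparison stopped working at midnight on January 1st.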


So this system was implemented sometime after 1999, and presumably after 2010, since leading zeros are truncated (ignored) in ints... which means this bug was implemented and pushed out, and for up to 12 years no one thought this could possibly be a problem, at Microsoft?


They probably figured it would be no problem: who will still be using Exchange in 2100? Same thinking as the guys who set up all those databases in 1975. And at least once it overflows (again), it will be somebody else's problem to fix.


As far as I can tell, there's currently no official acknowledgement of this issue.

It has affected 100% of the Exchange servers I'm overseeing. In every case, a complete halt to mail was corrected by a fix only described on Reddit and Twitter.


There is one here: https://techcommunity.microsoft.com/t5/exchange-team-blog/em... though you'd be hard-pressed to recognize it as this bug, since it doesn't mention the error message that everyone has seen, and the workaround instructions are not as explicit and detailed as I'd want if I were paged for this.


Phone numbers are not numbers. Zip codes are not numbers. Model numbers are not numbers. Characters are not numbers. Which part of this progression escapes you?

Seriously, if you can’t add, subtract, multiply, and divide them, then they aren‘t numbers and you shouldn’t use a numeric data type for them. This is CS 101 knowledge. How badly does MS run their engineering?


> Seriously, if you can’t add, subtract, multiply, and divide them, then they aren‘t numbers

You've missed a ton of other scenarios here, the relevant one being a compare operation with a < or > output. These are version numbers. Version numbers, the way they're used here, increment; they are not random strings, which is what you're effectively suggesting.

If the version numbers here were represented as strings they’d need to be converted to a number at some point to compare them, and you’d probably run into the same problem you have here.

Of course I’m sure you’d have the perfect, completely applicable, and universally accepted solution then as well…


> Version numbers

They still aren't compatible with number types. They are groupings of numbers at best.

2.9 is an earlier (not less than) version than 2.10.

3.1.6 isn't a (strict) numerical format.

5.4.11-beta1 has non-numerics.


Ignoring the tags (which not everything uses), it's possible to express all of those as a single increasing number by simply padding with enough zeros, which does make them representable as simple numbers, so I think you're kind of splitting hairs. Most version-number representations/libraries will likely do something similar to read them anyway, even if you keep the string: split it into separate individual numbers and compare major/minor/patch/etc. separately. By virtue of fitting the individual parts into separate integers you're giving them all a maximum value, which is what the zeros would do.

The problem is they shoved the individual parts into a single 32-bit integer, which just isn't enough space, so they had to compromise and give some of the numbers extremely small ranges.


There are exceptions to that rule, however - Maintenance releases. Node, Typescript, Java, .NET, and I'm sure many more obvious instances I need not list.

If you "just" increment you leave no room for showing patching in your versioning strategy.


I'm not entirely sure what you're getting at? That's the point of the `feature` and `patch` version numbers. If you want you can put a 4th or 5th number in there as well to fit whatever uses you have.

Perhaps you're missing what I meant by 'increasing' - it's just a 1:1 conversion of the version number to a single integer, it's not like you're assigning every version a new number. 3.9.3 would become something like 3009003, and 3.10.1 would become 3010001. If you compare those representations then 3010001 always compares higher, and there's also nothing preventing you from releasing 3.9.3 after 3.10.1.

My point was that a format like the above (major, 3-digit minor, 3-digit patch) is not fundamentally different from storing each individual number as its own integer, which most version libraries I've seen do at some point. The problem is just that the range of each number is restricted significantly by requiring 3 digits, rather than using individual 32-bit or 16-bit integers for each one.
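
A sketch of that 1:1 conversion, with the range limits made explicit:

```python
def pack(major: int, minor: int, patch: int) -> int:
    # 3.9.3 -> 3009003 and 3.10.1 -> 3010001; ordering is preserved
    # as long as minor and patch each stay below 1000.
    if not (0 <= minor < 1000 and 0 <= patch < 1000):
        raise ValueError("component out of range for 3-digit fields")
    return major * 1_000_000 + minor * 1_000 + patch

print(pack(3, 9, 3))                    # → 3009003
print(pack(3, 10, 1) > pack(3, 9, 3))   # → True
```

The explicit range check is the piece the Exchange scheme was missing: the encoding is fine until a component silently outgrows its digit budget.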


We agree. :)

A "Version Number" is not just a "Number", as I said right at the start.

Just to clarify, having an integral incrementing number as the whole version number strategy explicitly prohibits the use of major/minor/patch.

Unless you intend on using 0s (instead of fullstops) as delimiters I suppose.

Edit: as you demonstrate in your post. I'll stop posting now :)


If you define your version numbers to be numbers, then your version numbers will be numbers. I really don't get the confusion over this.


The assumptions/traits/expectations that come with numbers (and most importantly, that are embedded into the number _types_) may not be consistent with _version number_ types.


I mean if your standard for “numbers” is “can be mapped 1-1 to integers” then everything is a number. The operations you actually do to numbers have to have meaning in the space you’re defining.

Does “1.2.0” + “2.3.0” make sense in your world?

And if you’re like what about ordering and comparisons, then sure, you have a partially ordered set.


Dates can be very meaningfully and sensibly converted to/from a numeric representation, yet the addition of two dates doesn't make sense, so I don't think "not being able to add versions" holds.

And not everything can be mapped 1-1 with the integers; reals, for example, and therefore arbitrary strings. (Of course, that's only theoretical; with a limited string length you can obviously manage.)

You're right that the ordering is the key feature of a version representation; munging it into an integer gives you that for free but risks overflow quite easily (as in this case). I guess the conclusion is: use a language like Rust or Ruby where you can easily define an ordering on structured data.
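
The ordering-on-structured-data idea works in plain Python too, since tuples compare element by element (this sketch assumes purely numeric components, no `-beta1` tags):

```python
def parse(version: str):
    # "2.10" -> (2, 10); tuples compare element by element, so the
    # ordering comes for free, with no overflow from integer packing.
    return tuple(int(part) for part in version.split("."))

print(parse("2.10") > parse("2.9"))  # → True (correct as versions)
print("2.10" > "2.9")                # → False (wrong as plain strings)
```

The second line is the classic trap the thread keeps circling: lexicographic string comparison ranks "2.9" after "2.10".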


I don't get what you're trying to add. If Microsoft defined their version numbers as integers, then they're integer values. They didn't define them as, say, "1.2.0", so I'm not sure what that has to do with anything.


somebody better tell Cantor...


Did you read the link that we’re all discussing here? The versioning scheme they use is not any of those formats.


Google is forcing all Android applications to have both a version name, which is an arbitrary string that's only used for display to the user, and a version code, which is a plain integer and is what is used to decide which is the latest version that will be installed.

That should fix a lot of version-number problems (probably not all), and I've been stealing the idea for my own versioning lately.


The reference was

> Version numbers the way they’re used here

And none of that applies to them. Yes, there are versioning schemes that work differently, irrelevant to the example here.


> 2.9 is an earlier (not less than) version than 2.10.

Not necessarily. Some projects go from 2.9 to 3. I'm not sure what you are trying to prove.


For software version numbering, I consider 2.9 to have been released "earlier than" 3 (and not lesser than 3).


How do you deal with a project like Ansible where 2.9.16 was released after 2.10.0 or Python 2.7.18 was after 3.0.4?

I think version numbers are a partial ordering at best.


They aren't numbers, easy.


> They still aren't compatible with number types.


Why?


These should have been strings, end of story. You can sort a string. Amazing to see storing a non-number as a number being defended. Truly CS101.


You can certainly sort a string, but I think there are valid reasons to avoid it if possible. What they shouldn't be doing is trying to shove it into a single 32-bit integer; if they parsed it into separate integers for each of the individual parts (like a simple date representation would do), then the issue would go away. Whether it's worth that extra work to avoid the string comparison is debatable, but if they go on to mask out parts of the version number and look at them individually (e.g. only the year, or only the revision number), then it would probably be worth it to just do all the parsing upfront and avoid any messing with strings after that.


I guess it depends on the language but using a string for this doesn't seem like a great choice except for user display / logging. Given this is a version number you'd want to support comparison operators, and a struct with those operators defined (and a string format method) would make more sense imo.


Depends on the language and function. Some sorting algorithms happily rank ‘9’ > ‘10’.


I'm curious which languages you learned in CS101 (actual question, not snark)

Every non-number, as you put it, is just a number to a computer. Comparing binary is faster than sorting strings, and it makes zero sense to waste memory simply so a human could potentially read it.


But versions are for human beings.

Otherwise, a timestamp, a date, a commit hash or an increment would do.


That's not true; they can be for humans, computers, or both. Software still has to understand and do something with that version number, even if it came from a human. And if it has to store, sort, or process that number a lot, it will still be faster to store and compare binary than a string.

For comparison, I think some here would be shocked to learn their IPv4 address is stored as an unsigned 32-bit integer. It's not a "number" either, and definitely not faster to use as a string.
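
For illustration, Python's standard library exposes exactly that round trip:

```python
import socket
import struct

packed = socket.inet_aton("192.168.0.1")  # 4 bytes in network byte order
as_int = struct.unpack("!I", packed)[0]   # the same value as an unsigned 32-bit int
print(as_int)                                       # → 3232235521
print(socket.inet_ntoa(struct.pack("!I", as_int)))  # → 192.168.0.1
```

The dotted-quad form is purely a display convention; routing tables and socket APIs work on the 32-bit value.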


ISO8601 strings are sortable and are not numbers.


No they aren't.

10000 can be a valid year according to ISO 8601, but sorted as a string it would come before the year 2000, which would obviously be incorrect.

Here's the relevant passage from the standard:

> 3.5 Expansion By mutual agreement of the partners in information interchange, it is permitted to expand the component identifying the calendar year, which is otherwise limited to four digits. This enables reference to dates and times in calendar years outside the range supported by complete representations, i.e. before the start of the year [0000] or after the end of the year [9999].
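
A two-line Python illustration of that failure mode:

```python
# Lexicographic sort puts the year 10000 before the year 2000,
# because "1" < "2" character by character:
dates = ["2000-01-01", "10000-01-01"]
print(sorted(dates))  # → ['10000-01-01', '2000-01-01']
```

Fixed-width (four-digit) years are what make ISO 8601 strings sort correctly; the expansion clause above breaks that guarantee.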


Sounds like a problem we don’t have to worry about for another 8,000 years or so.


Which comes first, "2021-W52-6T15:10.4" or "20220101T151115.1-05"? Lexicographic sort between those two ISO-8601 date/time strings will get it wrong.

Did you mistake ISO-8601 for RFC-3339? Even there lexicographic sorting isn't guaranteed, since you can have a pure date or a pure time or a date-time combined, and there's a choice between "T" and " " to separate dates from times in a date-time but it's much more likely to work than with ISO-8601.


Character sorting isn't a natural law. The fact that "A" comes before "Z" is historical; it's 'always been done that way'-- at least since the Phoenicians.

https://en.wikipedia.org/wiki/History_of_the_alphabet#Descen...



Comparing strings is trivial, and would work out of the box with the format that they're using (YYMMDD###).


Posets aren't numbers either.


> the relevant one being a compare operation with a < or >

In Math we call that subtraction!


I have found that any time you would refer to a thing as a "Code" or "Number", you are almost certainly talking about a string. Every property name in our codebase that contains one of these terms is a string type without exception.


If it's not a float or integer, it's a string.


And if it's currency, it should be an integer not a float.


We use decimals for all currency and rate properties. Integer is fine in many cases too.


Be careful with floats:

  20.40 == (20.39 + 0.01)
  => false
("decimal" can either be colloquially equivalent to "float", or it can refer to a distinct data type supported by some languages and databases. A proper decimal type is currency-safe to use, if available.)
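
A quick Python illustration of both points, using the standard library's decimal type:

```python
from decimal import Decimal

# Binary floats can't represent these values exactly:
print(20.39 + 0.01 == 20.40)  # → False
# A true decimal type does exact base-10 arithmetic:
print(Decimal("20.39") + Decimal("0.01") == Decimal("20.40"))  # → True
```

Construct Decimals from strings, not floats; `Decimal(20.39)` would inherit the float's representation error.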


It's neither a phone number, zip code, nor model number. It's a version number. And version numbers don't multiply or divide, but they do compare. This kind of code has a decades-long history:

    #if BUILD_VERSION >= FIRST_VERSION_THAT_WORKS
    ...
    #endif
And in fact if you look around, almost every large C or C++ project has a WHATEVER_VERSION token (c.f. LINUX_VERSION_CODE for the kernel) that packs the version into a comparable integer literal.

The bug here is that they picked an encoding that was obviously going to overflow in a Y2.022K bug, not with the technique.
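
The kernel's macro packs one byte per component; a Python rendering of the same idea:

```python
def kernel_version(major: int, minor: int, patch: int) -> int:
    # Same shape as the kernel's KERNEL_VERSION(a, b, c) macro:
    # ((a) << 16) + ((b) << 8) + (c). One byte per component, so the
    # packing only misbehaves if a component exceeds 255.
    return (major << 16) + (minor << 8) + patch

print(kernel_version(5, 10, 0) > kernel_version(5, 4, 120))  # → True
```

The technique is sound; as with the Exchange bug, everything hinges on whether each component can actually outgrow its allotted bits.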


> [...] almost every large C or C++ project has a WHATEVER_VERSION token (c.f. LINUX_VERSION_CODE for the kernel) that packs the version into a comparable integer literal. The bug here is that they picked an encoding that was obviously going to overflow [...]

Speaking of LINUX_VERSION_CODE, it had a similar problem recently. See the article at https://lwn.net/Articles/845120/ which was summarized in this year's LWN.net retrospective at https://lwn.net/SubscriberLink/879053/aaea44782e8c760d/ as follows:

"Sometimes the most obvious things can be the hardest to anticipate; consider, for example, the case of minor revision numbers for stable kernel releases. A single byte was set aside for those numbers, since nobody ever thought there would be more than 255 releases of a single stable series. But we live in a different world, with fast-arriving stable updates and kernels that are supported for several years. It should have been possible for us, and for the developers involved, to not only predict that those numbers would overflow, but to make a good guess as to exactly when that would happen. We were all caught by surprise anyway."


Similar, but with the important distinction that this problem was recognized months in advance and remedied with some care (or kludge, depending on your preferences), and didn't become a bug in production software.

All computers work in finite representations. All values can overflow. Good engineering is about working within that world and not trying to Quixotically design its absence.


YYMMDD + 4 digits of version. At a glance, even a bit more than a glance, that seems fine for a long. And if 2020 worked and 2021 worked... why would 2022 break?

Lol.

I'd bet the conversation way back when this was being decided was "will 3 digits for the daily version/build be enough?" and the answer was "go with 4, better safe than sorry".


The only reason I can think of is YYYYMMDDHHMM represented as an unsigned long is 4 bytes, while the equivalent in characters would require 12 bytes.

Is it possible that at Microsoft's scale this would actually make a significant difference?


Anything you do will take 3x the memory, so yes... I can't see why anyone would store it as a string internally; you have to decode it into a number anyway, so that's just extra work. Strings are only useful to humans.

Edit: reading all the comments here, I get the feeling high-level programmers don't fully understand how things are represented at the low level; there is no purpose in storing a version/serial as a string.


There are lots of "numbers" that are not actually numbers. E.g. phone "numbers" are not numbers. The same goes for house "numbers". Version "numbers" are also often not numbers (e.g. 2.1.230-rc.1).

If the OP bug was really triggered by a string of numeric characters not fitting into a 32-bit integer, all the trouble of saving a few bytes on the client system was not worth it.


Yes, that is what the OP of the thread said; I've read it. Version numbers can be numbers if you want, and plan them to be. Which is the case here: it's not a string fit into a 32-bit integer, it's actually an integer. But they didn't think to check the max size, which would only fit in an unsigned. It's an odd choice on their part, but it certainly would have worked.
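To make the overflow concrete, here is a quick sketch (the YYMMDD + 4-digit build layout is as described in this thread; the exact build counters are illustrative) showing that the 2021 value still fits a signed 32-bit integer while the 2022 value wraps negative:

```python
import ctypes

# Hypothetical Exchange-style version number: YYMMDD + 4-digit build counter.
INT32_MAX = 2**31 - 1          # 2,147,483,647

v_2021 = 2112310001            # 2021-12-31, build 0001 -> fits
v_2022 = 2201010001            # 2022-01-01, build 0001 -> too big

assert v_2021 <= INT32_MAX
assert v_2022 > INT32_MAX

# Stored in a signed 32-bit field, the 2022 value wraps negative:
wrapped = ctypes.c_int32(v_2022).value
print(wrapped)                 # -2093957295
```

So 2020 and 2021 "worked" only because 21xxxxxxxx happens to sit just under 2^31 - 1.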


This very much depends on the protocol. Microsoft loves XML, and if the update package has an XML manifest with update information attached to it, it may very well be a string that gets serialized to a number. The same applies to JSON and other text-based formats.


And that is fine, though, because at some lower level everything becomes a number to a computer. I think everyone is getting hung up on this string/number difference; data from humans is going to come as a string, but that's no reason to store or process it as such.

Plenty of data is stored in varied types for processing that are efficient for a computer, for example DNS has been using date-based serial numbers successfully for decades, and stored internally as uint32.


Re your edit: these days, most developers are web developers where people don't care about these optimizations.

It's a little silly that a test didn't catch this problem, but using a version number somewhere high up in the long makes total sense to me. After all, version numbers can be anything you define them to be, as long as they're unique. You'd like them to be sortable, but they don't even have to be.


Sorry, I edited twice (just for confusion hah).

I don't think it's that they don't care, I suspect many simply do not understand what is under the hood in the first place; and they don't need to when doing web development. The fact that some here suggest storing a version as a string is somehow faster and safer is a big giveaway.


Perhaps saying they don't care is a bit much, they usually just don't need to care. The data is delivered to most websites in the form of JSON which can be parsed to any data format you want, and at that point it might even be faster to do a string compare to skip the double/int conversion.


I got what you were saying. That certainly may be the case, depending on the application. It looks in this case like it's just definition files being loaded into memory, and if they need to compare that version more than once, doing it as a single-operation binary comparison will be significantly faster than repeatedly comparing a UTF-8 string (SIMD aside).


It probably did, 25 years ago


Not likely. Modern software developers laugh this kind of difference away, and will almost always default to taking a memory/storage space hit over other possible trade-offs (performance, correctness, convenience...)


Considering storage/memory requirements for modern desktop/mobile apps, I wouldn't mind if developers would occasionally think about optimizing them.


Storing non-numeric data as strings is not what makes modern apps bloated. It's the technologies used (Electron) and a general “I don't care” approach.


Nah, we computer scientists are happy embedding Electron into every application no matter how small.


Application developers, not systems engineers. People who work on high-performance servers will always look to reduce overhead as much as possible.


a simple tuple of appropriately-sized integer types would only need 8 bytes, which coincidentally is what MS will need come 2043.

That does mean rolling your own "greater-than-or-equal" operation though.
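As a sketch, a tuple of fields sidesteps the packed-integer overflow entirely. In Python the field-by-field "greater-than-or-equal" comes for free because tuples compare lexicographically; in C you would indeed roll your own:

```python
# A version as a tuple of fields instead of one packed integer.
# Python compares tuples element by element, left to right, so the
# ordering logic mentioned above is built in.
v_old = (21, 12, 31, 9999)   # (year, month, day, build)
v_new = (22, 1, 1, 1)

# 22 > 21 decides the comparison; no single field can overflow the whole value.
assert v_new > v_old
```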


The only reason I can think of is that storing a version as a number would allow you to check if your version is incompatible with another server or feature / extension. By using less than or greater than.


Not necessarily.

Numerically, 2.2 is greater than 2.12, but v2.2 is actually an early predecessor of v2.12.


That would work if padding is always forced (like date). And in this case, it actually is.

2.02 is smaller than 2.12


Best hope you don't end up releasing 2.100 then


It can also end up as 2.99.01. Jokes aside, there was another recent bug caused by some poor programmer hard-coding the length of the version field. https://news.ycombinator.com/item?id=29702128


that is exactly why you would create a transformation to something like a big fat integer with each segment of the version number isolated to separate parts of the bit mask so you can compare it as an integer.
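A minimal sketch of that transformation (the 16-bit-per-segment widths are an arbitrary choice for illustration): each version segment gets its own bit field inside a 64-bit integer, so a plain integer comparison matches the segment-by-segment ordering.

```python
def pack_version(major: int, minor: int, patch: int, build: int) -> int:
    """Pack each segment into its own 16-bit field of a 64-bit integer,
    so ordinary integer comparison orders versions correctly."""
    assert all(0 <= x < 2**16 for x in (major, minor, patch, build))
    return (major << 48) | (minor << 32) | (patch << 16) | build

# Unlike the decimal-looking "2.2 > 2.12" trap, this compares correctly:
assert pack_version(2, 2, 0, 0) < pack_version(2, 12, 0, 0)
# A higher segment always outranks everything below it:
assert pack_version(1, 0, 0, 0) > pack_version(0, 65535, 65535, 65535)
```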


If they were stored as characters you could still do a string comparison using greater than and less than signs and logic.

"202201010001" > "202112122359"

evaluates to True in Python, SQL and many other languages.


But "1000"<"999" in Python.


String comparison works as long as you order your date time by significance and zero pad all values, which they were doing with their date stamp anyway.

Plus if they weren’t, then even numerical comparisons would fail, e.g. 1012021 (Oct/1/2021) < 5102020 (May/10/2020), so your argument is moot.
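Both halves of that claim are easy to check in a quick sketch: zero-padded, most-significant-first strings sort correctly as strings, while unpadded values fail under either comparison:

```python
# Zero-padded, year-first date strings sort correctly lexicographically:
a = "20211212"   # 2021-12-12
b = "20220101"   # 2022-01-01
assert b > a

# Without padding, neither string nor integer comparison is safe:
assert "1000" < "999"        # lexicographic: '1' < '9', numerically wrong
assert 1012021 < 5102020     # Oct/1/2021 vs May/10/2020 packed without padding
```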


This is going to be a problem in about 8,000 years tho.


Wouldn't it be done more like this though?

  "%04d%02d%02d" % (999,1,1)


But the format and length for these numerical strings is fixed.


I'd say if they don't form an ordered field, they shouldn't be called "numbers". The complex numbers really ought to be called an algebra, not numbers. They're the dimension-2 Cayley-Dickson algebra, after all!


This is so extremely common that it’s funny to me you scoff at it.

Under the hood these systems will convert to some numerical format to compare which version is newer. You store it as a (proper) numerical format or you serialize it into one later, but at the end of the day it’s a number because that’s how you compare it.


Version numbers can of course be numbers, but the problem here is that they were constructing a number like a string from semantically meaningful constituent parts (and not testing thoroughly).


And again, if there are no negative versions, they should have used an unsigned int, not an int.


pushing the date that it breaks to the version released in 2043?


You're right, but that's not the issue here: the version is structured as a number, but they failed to give it a type large enough to represent it; simply a signed/unsigned screwup.


That's not a fix, barely an improvement, really just a shitty workaround: the exact same issue rears its head again in 2043.

The actual fix is exactly what the original comment states: a version is not a number.


Oh I agree it's a bad long-term solution, but assuming they had noticed and used the correct type, 2043 is likely well past the lifetime of the product and would have worked fine till then. Not great, but certainly passable.


> Oh I agree it's a bad long-term solution, but assuming they had noticed and used the correct type, 2043 is likely well past the lifetime of the product

You expect Microsoft Exchange to stop being used some time in the next 20 years?


That version, yes, absolutely. I expect at some point Microsoft will have made enough updates that they will say version X is no longer supported, and it will stop working on its own. And it's not like it's a product installed on some embedded device that will never get updates until then; Exchange will get security fixes for as long as it exists, and nobody will be running this version in 20 years.

Even more likely, we'll have new protocols the current version doesn't support, so the bug will have been long fixed or noticed.


> That version yes, absolutely.

But in your alternate timeline the issue would only have been hit in 2043, and since in our timeline it was not noticed before it was hit, there is no reason to believe it would have been noticed in the alternate.

So the exact same issue would have occurred, a few years later.

I fail to see why that would be an improvement.

> Even more likely, we'll have new protocols the current version doesn't support, so the bug will have been long fixed or noticed.

That is a completely unsubstantiated assertion.


I'll quote myself again: "Oh I agree it's a bad long-term solution, but assuming they had noticed and used the correct type, 2043 is likely well past the lifetime of the product and would have worked fine till then."

I never suggested it was any improvement except for the advantage of time; although I shouldn't have used the word 'correct'. Signed 32-bit dates have a well-known limitation that people will likely hunt down and fix over the next 20 years. But yes, it's all unsubstantiated, no doubt there.

The main point of my post you first replied to is that it doesn't matter that the version was stored as an integer. The failure happened when they used their own format and didn't confirm it was properly bounded. Everyone here is having a hard time grasping how their data is stored and processed at the low level - your 'string' is still just an array of 8-bit bytes.

Storing an only-increasing number as any form of integer is a perfectly acceptable and efficient way for a computer to compare and process. Version numbers are one of those. Phone numbers are obviously not.


The version is not structured as a number, it's structured as a string of digits that incidentally resembles a number. There are no operations that take two version numbers and return a new version number, so they are not numbers.


That is how they are entered by a human, yes. Then they are converted to binary and processed, like everything else. There is no need to add two version numbers together, but as they are only ever increasing, it's sane and efficient to store and compare them in binary.

Everything in software takes data in one form and processes it into another. There are no 'strings', there are no version numbers. There is only binary. They simply did not give it enough space to store it, that's all.

DNS is a perfect example: it stores serial numbers internally as an unsigned 32-bit integer and has worked for decades and will continue to. But they chose the format YYYYMMDDnn, which will last far longer than a 2-digit year.
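A sketch of why the conventional DNS zone serial (the YYYYMMDDnn scheme recommended in RFC 1912) has so much headroom in an unsigned 32-bit field:

```python
UINT32_MAX = 2**32 - 1        # 4,294,967,295

def soa_serial(year: int, month: int, day: int, rev: int) -> int:
    """Conventional YYYYMMDDnn DNS zone serial, packed decimally."""
    return year * 10**6 + month * 10**4 + day * 10**2 + rev

assert soa_serial(2022, 1, 1, 0) == 2022010100
assert soa_serial(2022, 1, 1, 0) <= UINT32_MAX

# The scheme only runs out of uint32 room early in the year 4295:
assert soa_serial(4294, 12, 31, 99) <= UINT32_MAX
assert soa_serial(4295, 1, 1, 0) > UINT32_MAX
```

The contrast with the Exchange bug is the choice of a 4-digit year and an unsigned type, not the decision to pack a date into an integer.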


Just look at Win 10 and at Teams for examples. The only engineering which counts at MS is the one which maximizes profits.


And how is this different for everyone else in the free market economy?


I had a weird kind of epiphany when I was watching the fireworks over Sydney last night.

The harbour bridge has lights on it.

For what financial benefit?

The fireworks display itself, I can't understand where the financial benefit is and it costs millions of dollars. ($7 million as far as I'm aware)

---

I think that you could make a financial motivation statement for making good software (it makes it less likely for people to switch, for example), but my broader point is: why does everything have to be financially motivated?

Because, a lot of stuff that we enjoy doesn't seem to be primarily financial in nature.


>The harbour bridge has lights on it.

>For what financial benefit?

Same for the Opera House. Imagine all of those stock photos of Sydney. Now narrow those down to the nighttime shots of Sydney. Imagine those without lights on the bridge or opera house. For that matter, any of the lights on any of the buildings. What do you have left? A really boring photo. Nobody wants to visit a city with really boring photos. What's the financial benefit of that?


Does not compute. What's the financial benefit of selecting cities based on how well lit their photos are?


Better lit structures make photos of them look better. Tourists choose destinations to spend money at based on how good the photos of the destinations look.


that doesn't answer my question, i was asking why the tourist selects like that. How does selecting based on how well lit photos are create a financial benefit for the tourist? ;-)


It's not for the benefit of the tourists at all, it's an advertisement to increase tourism for the benefit of local businesses.


Yet such adverts should not increase tourism because tourists must surely prefer to visit cities that offer benefits for tourists.


You're such a troll. The benefits of tourists to the tourist for visiting a city can come in many forms, and financial benefit is rarely one of them. In fact, it typically comes at great financial cost.


That's their point. Not everything done in a market economy is done with a goal of gaining a financial benefit.


> cities that offer benefits for tourists

An aesthetically pleasing scenery would be one of those benefits.


A lot of things about humans is based on desire, not on logical conclusion or requirement. Humans do things because they want to, not because they need to or it's the logical or rational thing to do.

Evoking desire is key.


So taking this whole comment chain into consideration, why are Windows 11 and Teams going out of their way to be awful to use?

We agree that the Free Market(tm) incentives are to maximise desire even at an up-front loss (Fireworks, Lighting the bridge) but the parent said that it's free market economics that prevent Teams and Windows from being desirable to use.

Is someone wrong or am I misunderstanding something?

Is there more profit in awful things? Why does the Harbour bridge have lights then?


Why are we even trying to equate the 2 things? The harbor bridge has a very pleasant look and people want to accentuate that, so they have decided to put lights on it so that it can be enjoyed at night. There is a very pleasing effect from things being lit at night. Why why why is this hard/difficult to grasp?

That is so so so different from a group of engineers building a product and totally not grasping that while it technically works, it is not pleasant for the end users. It takes a certain level of asshattery to assume that the devs are going out of their way to make it this way.


city user experience :)

There is the same financial incentive for MS to fix their Y2k22 bug.


It doesn't have to be, but our economy is structured to name it one of the primary ways to justify something's existence, and basically the only way to make something sustainable.

The incentive structures are built to make financial incentives take precedence with most things


A lot of people and companies take pride in their work. Saying that people are only in it for a buck is a tired old cliche that is easily disproven through even casual observation.


Isn't it lovely that Bill Gates' famous words are that vaccines are the best investment he has ever made?


> How badly does MS run their engineering?

Bad enough to be a 2.53 Trillion Dollar company.


side note

some numbers^* cannot be meaningfully multiplied or divided. For example, 3^rd place, 30° F, or 37.388N,-122.067S.

Some numbers can only be added to by numbers of a different type. For example x°C + y°K, 3rd grader + year, absolute position + offset. Similarly, subtracting two of these numbers results in a different unit. e.g. position - position = offset

it would be great if someone could comment the exact terminology for what I'm trying to describe.

* Types/Units


In statistics, one usually differentiates between

- nominal scales (e.g. man/woman),

- ordinal scales (elementary school/middle school/high school),

- interval scales (degrees Celsius) and

- ratio scales (human height, degrees Kelvin).

Only ratio scales may be multiplied and divided. Interval scales may be subtracted and added. Ordinal scales may only be compared (less/greater/equal). Nominal scales may only be compared in terms of (in)equality.

https://en.wikipedia.org/wiki/Level_of_measurement
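A hypothetical sketch of encoding those rules in types (the `Celsius` and `DegDelta` names are invented for illustration): an interval-scale value supports subtraction, which yields a value of a different type, and supports adding that delta back, but has no multiplication at all.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Celsius:
    """Interval scale: differences are meaningful, ratios are not."""
    value: float
    def __sub__(self, other: "Celsius") -> "DegDelta":
        return DegDelta(self.value - other.value)   # Celsius - Celsius -> delta
    def __add__(self, delta: "DegDelta") -> "Celsius":
        return Celsius(self.value + delta.value)    # Celsius + delta -> Celsius

@dataclass(frozen=True)
class DegDelta:
    """The 'different unit' produced by subtracting two interval values."""
    value: float

warm, cool = Celsius(25.0), Celsius(20.0)
assert warm - cool == DegDelta(5.0)
assert cool + DegDelta(5.0) == warm
# warm * 2 raises TypeError: "twice as hot" is meaningless on an interval scale.
```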


thank you! this is what I was looking for


I'm not sure if it's what you're looking for, but algebraic data types are a fairly common way to encode different kinds of data that have the same shape.

You may also be looking for fields: https://en.m.wikipedia.org/wiki/Field_(mathematics). You could probably also describe the relation between different number types as linear operators over a vector space


F# has Units of Measure: https://docs.microsoft.com/en-us/dotnet/fsharp/language-refe...

There are libraries that implement this concept in other programming languages (C++, Java, C#, etc.)



Time is a number. At least, it makes sense to subtract them (I guess you can say that time is a one dimensional affine space).


Time is a number and a location. If you normalize everything to a reference time such as the UNIX epoch, you can usually do subtraction. Even then you may have got the times from a non-monotonic clock ...


Time is a number, but a date is not. 211231 + 1 != 211232.
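A quick sketch of the point: calendar arithmetic needs a date type, because the packed YYMMDD integer jumps discontinuously at month and year boundaries:

```python
from datetime import date, timedelta

# A real date type rolls the year over correctly:
assert date(2021, 12, 31) + timedelta(days=1) == date(2022, 1, 1)

def yymmdd(d: date) -> int:
    """Pack a date into the YYMMDD integer form discussed above."""
    return (d.year % 100) * 10000 + d.month * 100 + d.day

assert yymmdd(date(2021, 12, 31)) == 211231
assert yymmdd(date(2022, 1, 1)) == 220101   # +1 day, but +8870 numerically
```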


To be specific, that representation of date does not implement integer addition. To be fair, neither do most "number" types (uint, int, float, etc).

For uint, you actually have x + y = (x + y) mod (MAX_UINT + 1).

For float, sometimes x + 1 = x.

The issue is not that date is not a number, but that people poorly reason about common cases of date arithmetic. Similar stuff comes up for floats and ints but they're just easier to reason about.


> For float, sometimes x + 1 = x

Would you spell this out for us?


If x is a huge number, then absorption can happen. Floating point numbers store a mantissa and an exponent (m × 2^e). When you have a huge number, the exponent will be large, so the mantissa cannot express the tiny addend anymore.


Floats effectively work by specifying a level of precision and a multiplier against that. For big numbers, you need a lower precision (larger gap) between the representable values. Eventually the precision gets so low that adding 1 doesn't reach the next possible value and rounds back down to your current value.
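A two-line sketch of exactly where this kicks in for IEEE 754 doubles (Python's `float`): integers are exact only up to 2^53, the point where the gap between representable values grows past 1.

```python
import sys

limit = 2**53                        # 9,007,199,254,740,992

assert float(limit) + 1 == float(limit)       # 1 falls in the gap: absorbed
assert float(limit - 1) + 1 == float(limit)   # still exact just below the limit
print(sys.float_info.mant_dig)                # 53
```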


Bit of a nitpick, but due to relativity, time is not simply a number though.


Relativity is not modeled in the vast, vast majority of computer systems that track time. So the nitpick is irrelevant.


But there are rare software that model it (eg. GPS systems famously have to account for it) - so I think it is a textbook definition of a nitpick.


just like pointers, it makes sense to subtract them and add offsets but to straight up add them makes no sense.


Well, to put on my pedantic hat and indulge my inner math major, what you describe is algebra on top of objects—you’re talking about the math that applies to numbers, not the numbers themselves. Specifically you described groups and fields and rings…the symbols themselves are numbers, but they don’t recombine with each other in the same way as other numbers.

You’re right that they shouldn’t be implemented as numeric data types.

That said, dates do have properties of modular arithmetic, so in a strange sense they are very much numbers…

:p


Everything ends up in your database as bytes. You can say everything is a number. It's just naming. Whether it's sane to do basic math operations on them is loosely correlated with whether the user sees that number as an integer on their screen.

Most bugs are stupid and should be obvious. They happen.

You can look at your old code and say WTF, so judging decisions in somebody's else complex system without having any idea about requirements and constraints seems pointless.


Not that it excuses the issue, but I suspect the motivation was numeric sorting to identify the newest. So it was sort of a number in that respect.


Not criticizing you, but no, it’s a string which happens to have a natural ordering. The fact that it happens to look like a number is irrelevant.


It’s like nothing could have multiple useful forms. Which is of course not true. This is just a dumb bug.


The idea wasn't justifying the choice, but rather trying to imagine what happened. Because they did want a way to sort and pop the highest numeric value, they made a bad choice.


Sometimes a performance reason? A number is usually smaller and faster, also better if stored in a database, and more efficient if used as a primary key.


Gotta free up space for those electron apps


These are version numbers. It's perfectly fine to use numbers to express version numbers, a great majority of protocols reserve numeric fields for this.

The problem here was basic overflow, they reached a version number exceeding the capacity of the numeric type.


But they are. Zips here in Denmark are clustered progressively, so that if you have a package going to (say) 8600 and 8704 then it makes some sense to send those orders to the same distribution node, since they will be close together, but to do that you need to treat them as numbers.

My grandparents first phone number was 13. Not because they got in super early and had the 13th phone number in the country, but because that was the 13th number for their exchange. When a bunch of exchanges were put together everybody got another number in front of theirs. Their phone number is much longer today, but it still ends in 13.

I guess you could treat the phone numbers as strings, but I don't know if you can do that with the Danish zip numbers.


You should never use clustering/subtraction on Danish zip codes as a source of "how nearby". To quote https://www.postnumre.dk

   2400 Copenhagen NV (North West)
   2412 Santa Claus, Greenland 
   2450 Copenhagen SV (South West)    
   ...
   3790 Hasle (close to Roskilde)
   3900 Nuuk (Greenland)
   3992 Dog sled patrol "Sirius"
   4000 Roskilde


Unless it is encoded in its design. For US zip codes the first three digits encode a geographic region. But there is no guarantee that 601 is next to 602


Lots of phone numbers start with a zero; converting strings to numbers will most likely change the phone number by removing the information of the leading zero. Doing this in legacy JS can even parse it as base 8 (octal). Neither will make anyone happy.
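The leading-zero loss is easy to demonstrate (the phone number here is made up):

```python
raw = "0123456789"            # a phone "number" with a significant leading zero

as_int = int(raw)             # 123456789 -- the zero is silently gone
assert str(as_int) != raw

# Re-padding only recovers it if you know the expected length out-of-band:
assert str(as_int).zfill(10) == raw
```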


Excel loves replacing phone numbers with 12031041e8 or whatever when importing CSV. So useful.


This is one of the most annoying things to me about Excel (of a long list). Pretty much every time an MS application thinks it is smarter than me, it ends with the application being wrong. It doesn't matter if the data is manually typed or pasted from elsewhere; when Excel tries to auto-guess how to set the formatting of the cell it is just horrid. Dates are shite when you're in the US and using UK-formatted dates.


What's even worse is if you inadvertently save the CSV with the auto format, the data will be lost, too. Excel is both a wonderful tool and the bane of my existence when dealing with financials.


Even in North America, you shouldn't treat ZIP/postal codes as numbers. The US added 4 digits 30-ish years ago (ZIP+4) and they're separated with a dash. You could store them as separate numbers [0] and rely on the presentation code to add the dash. But since Canada has alphanumeric postal codes (Mexico uses numbers), you're better off treating them as a string field.

The US used to prefix phone numbers with letters (like MElrose 1-2345) as an aide-mémoire during the transition to all-digit dialing with area codes. And you'll still see businesses advertising with letters in place of digits (like 800-GOT-JUNK, which is a junk removal firm), but they never get entered into phone number fields like that. The best reason to have phone numbers as text, though, is the digit groupings [1]. In North America it's 3-3-4, but other countries group their numbers differently to help people remember them. So by allowing users to enter numbers in their own format, you're making it easy for them.

[0] The main reason to do this is if you're in the business of sending out mailings. The US postal service gives discounts if you sort and bundle by ZIP code, and mixed ZIP and ZIP+4 values would make that hard.

[1] https://en.wikipedia.org/wiki/Chunking_(psychology)


A ZIP code is an identifier made of numerals, it is not a number in the mathematical sense. It does not make sense to multiply, add, subtract, or divide them - even though the numerals are assigned in apparently ascending order. There's nothing preventing a future decree from making one that starts with a Z, or a 0, or putting a dash in the middle of it. It is foolish to store it as an integer type.

You can cluster them for distance without storing them as an integer. Just throw them in a graph data type and compute edges having actual integral weights. The ZIPs themselves are not what you want to be doing your math on.


But you can add and subtract dates.


The map is not the territory.

The efficacy of any computational operator on a specific representation does not impute the same for all instantiations since the semantic link between identity and representation may only exist in the mind of the programmer and not the computational device.


You can only add a Duration to a Timestamp; you cannot add two Timestamps.
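Python's standard `datetime` module encodes exactly this algebra, as a quick sketch shows:

```python
from datetime import datetime, timedelta

t = datetime(2022, 1, 1, 0, 0)
d = timedelta(hours=1)

assert t + d == datetime(2022, 1, 1, 1, 0)              # Timestamp + Duration
assert t - datetime(2021, 12, 31) == timedelta(days=1)  # Timestamp - Timestamp -> Duration

try:
    t + t                          # Timestamp + Timestamp is rejected
except TypeError:
    pass
else:
    raise AssertionError("datetime + datetime should raise TypeError")
```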


Add March 5th to June 2nd...


August 5th.


You should try PHP. You can multiply all sorts of things!


We use exchange at work. Daily, I will utter some form of "Microsoft can't do anything right, man".


I used to work on a customer API that accepted these fields as JSON. My first project was improving validation (I used JSON schema) while not breaking existing integrations. While it's true that those fields aren't numbers, I saw all those fields sent as numbers.


Captain Hindsight strikes again!


So.. are database IDs numbers? Because they sure are autoincrement integers in almost every DBMS. I always thought that was a mistake but it's practical.


> So.. are database IDs numbers?

No.

> Because they sure are autoincrement integers in almost every DBMS.

The autoincrement occurs before it's a database ID. The generator for database ids is a sequence, that has no bearing on the semantics of the database ids.


Yes, but as soon as they leave the realm of the DB itself, they (should) become opaque identifiers.


They are absolutely not numbers.

As many people already commented, it's reasonable to encode things as numbers as long as you keep in mind there's an encoding operation with it's possible flaws.


Honestly that seems like a cop out.


The usual habit of typing ids as integral numbers is incorrect; they are not numbers. But the problems it causes are well known and dealt with by what is broadly known as good practices.

Is it better this way?

(You can see the problems clashing with each other if you search for discussions about defaulting ids to 32 or 64 bits. So they are not completely solved; the practices are just good enough that things don't usually break in practice.)

At the end of the day, all you have on a computer is a bitfield to mess around. You will always have to deal with encoding at some point.


I mean strings are numbers if you want to be that reductive. The “on-disk” representation of the thing is independent of its real type and properties.

So sure, store a date in an “integer” but if you try to treat it like one instead of a bitfield you’ve made an error and if your storage isn’t wide enough to hold all dates that’s gonna be a PITA later.


It's becoming pretty standard to use UUIDs over integers now. That provides more architectural flexibility


> How badly does MS run their engineering?

I think the real problem is ... how many good engineers and engineering managers nowadays want to work at MS?


> Seriously, if you can’t add, subtract, multiply, and divide them, then they aren‘t numbers

Sadly, sometimes you can, regardless of whether you should.


Isn't working with integers generally faster than working with strings? That might be why some ppl store phone numbers and zip codes as int.


that might be why some ppl store phone numbers and zip codes as int

Anyone who stores ZIP Codes as an int should have his dev license revoked. You've just corrupted the data you store for a hundred million people in the northeast.

I'm currently dealing with a situation where a system developed by an offshore team stored Social Security numbers as integers. They had no idea that an SSN can start with a zero, and didn't even do a basic web search to see what the possible range of values is before designing the database and application.


You can tell a 4-digit zip code has a leading zero that was removed, though. You'll also get faster SQL queries if you search for zip codes as int vs string.


you can tell a 4 digit zip code has a leading zero that was removed though

Only if you're 100% sure your data is completely clean. Only very rarely is this the case, especially with ZIP Codes, because the data almost always traces back to human input.

The initial query may be quicker, but then you have to compensate for the missing digits elsewhere, likely multiple times. You have to consider the expense to the whole system, not just to one query.


Not if the only thing you're doing with the number is converting it from/to a string...


And its very doubtful that is all Exchange is doing with it. Likely they have a large list of definitions to compare and track.


We need null-terminated binary-coded-decimals as a standard datatype for cases like this. "Numeric String".


Okay, so if it's null-terminated, how does one represent a zero?


F-terminated might work better (given 1-nybble-per-digit BCD), or do a Pascal-style length prefix. I was less fixated on the implementation detail than on the basic concept: arbitrary-length numeric strings with numeric semantics and string-like storage and length characteristics.
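A hypothetical sketch of the F-terminated variant (one decimal digit per nibble, with the invalid digit 0xF as terminator/padding; `bcd_encode`/`bcd_decode` are invented names):

```python
def bcd_encode(digits: str) -> bytes:
    """Pack one decimal digit per nibble, terminated by the invalid digit 0xF."""
    nibbles = [int(c) for c in digits] + [0xF]
    if len(nibbles) % 2:                      # pad to a whole byte with a second 0xF
        nibbles.append(0xF)
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[::2], nibbles[1::2]))

def bcd_decode(data: bytes) -> str:
    out = []
    for byte in data:
        for nib in (byte >> 4, byte & 0xF):
            if nib == 0xF:                    # terminator reached
                return "".join(map(str, out))
            out.append(nib)
    return "".join(map(str, out))

assert bcd_encode("0") == b"\x0f"             # leading zeros survive, unlike int("0...")
assert bcd_decode(bcd_encode("202201010001")) == "202201010001"
```

Half the storage of ASCII digits, arbitrary length, and no leading-zero or overflow hazards.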


Decide that the null value is one of the invalid digits - e.g., %1111. 0 can remain %0000.


"0\x00"?

