If you think that is bad, it’s arguably even worse that memcpy() was broken for years when running a 32-bit process on a 64-bit Linux system, also due to a hand-optimised assembly function.
We had an issue for years where our backend database would occasionally fall over, and sometimes we would just see strange corruption in the stored data.
Turns out that memcpy() didn’t correctly handle data that was not aligned to a 4-byte boundary in the 32-on-64 case. It took one of our sysadmins a very, very long time to find it.
What blew me away is the idea that there can be a bug in something that is used literally everywhere all the time.
In addition, we found a pull request with a fix someone had written that had been ignored for years. We just applied the diff from that pull request and it fixed our issues. The fix was eventually applied upstream a few years later.
> What blew me away is the idea that there can be a bug in something that is used literally everywhere all the time.
We've become so accustomed to computers being generally unreliable that I guess people see something weird and just reboot. We checksum, run redundant servers, and just generally tolerate and work around way more failures and weirdness than we should.
It's funny, because we're theoretically an industry that could be full of maths and proofs, but ask anybody outside of it and you'll get "yeah, computers just don't work sometimes, I guess that's how it is".
The problem is that even a perfectly implemented computer will fail randomly, due to inherent unreliability in the underlying hardware (cosmic rays, bad silicon etc.).
And once you start designing around that unreliability, it is harder to justify chasing esoteric bugs rather than folding that alongside the probability model of random hardware failure.
It’s hard for any system dealing with the real world to be completely “maths and proofs”. There is plenty of it that’s applied even in the real world systems we operate today. At the end of the day, it’s Engineering and not a pure science.
And I guess the other problem is that quite often you don’t need to quantify the failure modes and do “proper engineering”, because a lot of the systems we build and use just don’t need the level of reliability that e.g. a bridge needs (yet). There are definitely industries where software reliability is considered paramount, but that’s not the majority of the market.
> a lot of the systems we build and use just don’t need the level of reliability that e.g. a bridge needs
Sure. As the saying goes, "Anybody can build a bridge that never falls down. But it takes an engineer to build a bridge that _almost_ doesn't fall down" (referring to being able to build it in a cost-conscious way). There's a spectrum there for sure, but as an industry we're on the other side of it from where I personally wish we were.
We should be embarrassed that "it must have been a computer glitch" can be used to explain away nearly any problem. Because it's usually right, and it's usually our fault rather than the cosmic rays.
memmove was broken on 32-bit executables on systems with SSE2 whenever the move occurred over the 2GB memory boundary due to an issue with signed vs unsigned integers.
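Roughly the failure shape, as I understand it (a minimal sketch of my own, not the actual glibc code): once addresses cross the 2 GB mark, treating them as signed 32-bit values flips their sign, so an overlap/direction check based on a signed comparison goes the wrong way:

```c
/* My own sketch of the signed-vs-unsigned failure shape, not the glibc code. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t src = 0x7FFFFF00u;  /* source just below the 2 GB boundary */
    uint32_t dst = 0x80000100u;  /* destination just above it           */

    /* Correct: compared as unsigned 32-bit addresses, dst is above src. */
    printf("unsigned: dst > src -> %d\n", dst > src);                    /* prints 1 */

    /* Broken: reinterpreted as signed 32-bit, dst looks negative and
     * therefore "below" src, so a copy routine choosing forward vs.
     * backward copying based on this picks the wrong direction. */
    printf("signed:   dst > src -> %d\n", (int32_t)dst > (int32_t)src);  /* prints 0 */
    return 0;
}
```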
> Turns out that memcpy() didn’t correctly handle data that was not aligned to a 4-byte boundary in the 32-on-64 case. It took one of our sysadmins a very, very long time to find it.
That is surprising outside of the kernel. There are a large number of programs that would fail very hard very fast with that sort of artificial restriction.
> Turns out that memcpy() didn’t correctly handle data that was not aligned to a 4-byte boundary in the 32-on-64 case. It took one of our sysadmins a very, very long time to find it.
That sounds like an interesting detective story. Did they write it up anywhere?
They didn’t, but it was one hell of a pain in the ass. We only ever saw it on the production databases, and the crash, which was the main thing we were debugging, would only happen every week or so.
Core dumps didn’t show anything useful, because the crashes just showed a pointer mysteriously pointing somewhere bad. But how did it get corrupted?
We were doing increasingly insane things to reproduce the problem locally.
One interesting idea I had for reproducing it was to use a database replica, which did crash sometimes.
We used tcpdump to capture the replication stream, then waited a week for the replica to crash. We could then start up a copy of the replica in the state it was in at the beginning and replay the week of replication data into the socket to reproduce the problem.
My theory was that since a DB replica doesn’t do anything other than apply the sequence of operations that come to it via the replication stream, this should deterministically reproduce the problem.
Turns out that it didn’t.
Anyway, if my memory is correct, the issue was that memcpy was copying an extra 1-3 bytes at the end of the range, if the start of the range was unaligned. If there happened to be a pointer there then that would lead to a crash down the line. My memory could be off though.
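For what it’s worth, a canary-style check like the sketch below (my own illustration, not anything we actually ran) is the sort of thing that catches an overcopying memcpy: exercise unaligned starts and odd lengths, and assert that nothing outside the requested range gets written.

```c
/* Minimal sketch: verify memcpy() never writes outside the requested range
 * for unaligned starting offsets and odd lengths. */
#include <assert.h>
#include <string.h>

int main(void) {
    unsigned char src[64], dst[64];
    memset(src, 0xAA, sizeof src);

    for (size_t off = 0; off < 8; off++) {          /* unaligned starts  */
        for (size_t len = 1; len <= 32; len++) {    /* unaligned lengths */
            memset(dst, 0x55, sizeof dst);          /* canary fill       */
            memcpy(dst + off, src + off, len);

            for (size_t i = 0; i < off; i++)                 /* bytes before the range */
                assert(dst[i] == 0x55);
            for (size_t i = off + len; i < sizeof dst; i++)  /* bytes after the range  */
                assert(dst[i] == 0x55);
        }
    }
    return 0;
}
```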
> Anyway, if my memory is correct, the issue was that memcpy was copying an extra 1-3 bytes at the end of the range, if the start of the range was unaligned. If there happened to be a pointer there then that would lead to a crash down the line. My memory could be off though.
Ahhh I could see how that could be! Fascinating. That's exactly the kind of heisenbug I used to love. I always liked the ones that seemed literally impossible but had a nice explanation.
I cross-compile my side-project program from 64-bit to 32-bit, and I noticed it would crash when testing after updating Linux.
The stack trace said the crash came from the SSE memcpy. I did not bother investigating it, since I have no users anyway.
I had a really old CPU (an i5-520M, I think). Perhaps the code required some SSE feature it did not have.
That was libc. I also use Pascal, which has its own memcpy implementation. A while ago, they discovered it does not handle alignment correctly (it would adjust by 4-i when it should have adjusted by i). I thought that was what I get for using Pascal rather than a more commonly used language, but apparently nothing works.
That is not the first `memcpy()` issue that I have heard of. There was the backwards-copying `memcpy()` in glibc 2.13 that broke plenty of software using `memcpy()` in places where `memmove()` should have been used.
I dispute that it has "always" been broken -- the bug has to do with ordering non-ASCII input data when char is unsigned. I suspect strcmp() wasn't expected to provide an order on non-ASCII data when that code was written, with char defined as u8. The original K&R code just subtracts the two (signed) chars too:
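(Reproduced from memory of K&R, 2nd edition, §5.5:)

```c
/* strcmp: return <0 if s<t, 0 if s==t, >0 if s>t  (K&R 2e, from memory) */
int strcmp(char *s, char *t)
{
    for ( ; *s == *t; s++, t++)
        if (*s == '\0')
            return 0;
    return *s - *t;
}
```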
> I dispute that it has "always" been broken -- the bug has to do with ordering non-ASCII input data when char is unsigned.
No. As the article notes, there's also a bug when char is signed (the subtraction may overflow, which is UB for signed char, and additionally produces the wrong result even assuming defined overflow). It's just more obvious when char is unsigned.
Plain ASCII is 7-bit, so it won't overflow with signed char. There are many extensions that use that extra bit to increase the number of characters, but they aren't ASCII; they are ISO 8859-*, or ANSI, or UTF-8, etc.
The problem is not just that it results in a wrong order for 8-bit encodings, but that it doesn’t result in a well-defined order at all. For example, you get:
-2 > 127 > 64 > -2
This can make sorting algorithms loop infinitely or crash.
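A quick way to see the cycle (a small sketch of mine, emulating a comparison that returns a wrapped 8-bit difference instead of a full-width int, which is effectively what the broken assembly did):

```c
#include <stdio.h>

/* Emulate a strcmp-style comparison whose result is truncated to a signed
 * 8-bit difference (wrapping on a typical two's-complement target). */
static int cmp8(signed char a, signed char b) {
    return (signed char)((unsigned char)a - (unsigned char)b);
}

int main(void) {
    printf("cmp8(-2, 127) = %d\n", cmp8(-2, 127)); /* positive: -2 "greater" than 127 */
    printf("cmp8(127, 64) = %d\n", cmp8(127, 64)); /* positive: 127 "greater" than 64  */
    printf("cmp8(64, -2)  = %d\n", cmp8(64, -2));  /* positive: 64 "greater" than -2 -> a cycle */
    return 0;
}
```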
Signed overflow is undefined in C, but this was implemented in asm. I'm not familiar with the m68k ISA, but I'd be very surprised if it doesn't define signed values to wrap around, as is natural for a two's complement implementation.
Subtracting two chars produces an `int`. C generally promotes all smaller types to `int` before doing arithmetic, and `int` is required to be at least 16 bits. So I think the code you linked is correct?
C has no arithmetic operations on types narrower than int and unsigned int. This bug (which involves subtracting 8-bit values and yielding an 8-bit result) could only happen in assembly language (or in an exotic C implementation where CHAR_BIT > 8 and int is 1 byte, or in C code that uses casts to deliberately create the bug).
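A tiny illustration of that promotion rule (my own snippet, nothing from the linked code):

```c
#include <stdio.h>

int main(void) {
    signed char a = -2, b = 127;
    int diff = a - b;  /* both operands are promoted to int before subtracting */
    printf("a - b = %d\n", diff);                    /* -129, not a wrapped 8-bit value */
    printf("sizeof(a - b) = %zu\n", sizeof(a - b));  /* sizeof(int) */
    return 0;
}
```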
Note this bug is for the strcmp implementation used in Linux kernel code itself. The userland implementations of strcmp in glibc, etc, are separate so this didn't affect uses of strcmp in application code.
I'm surprised to learn there is no test suite that would test the overflow case. I also realize I have no idea about the extent of testing in the kernel.
The Linux kernel has not historically had great unit testing practices; it's mostly tested with large integration tests. That's partly because it inherits old development practices from when unit testing was less popular, and partly because it's hard to test a lot of kernel code outside of specialized environments (e.g., you would need an m68k emulator in the test environment to test this code).
The doubly-linked `list` type that's used ~everywhere in the kernel didn't have any in-repo unit tests until 2019 ( https://github.com/torvalds/linux/commit/ea2dd7c0875ed31955c... ), after kunit, a library for userspace testing of kernel code, was added.
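For illustration, here's a rough sketch (kunit API details from memory, so treat the names as approximate) of the kind of in-tree test that would have caught the sign bug:

```c
#include <kunit/test.h>
#include <linux/module.h>
#include <linux/string.h>

/* strcmp() must compare bytes as unsigned, so 0x01 sorts before 0xfe;
 * a truncated signed 8-bit subtraction gets this wrong. */
static void strcmp_high_byte_test(struct kunit *test)
{
	KUNIT_EXPECT_TRUE(test, strcmp("\x01", "\xfe") < 0);
	KUNIT_EXPECT_TRUE(test, strcmp("\xfe", "\x01") > 0);
	KUNIT_EXPECT_EQ(test, strcmp("\xfe", "\xfe"), 0);
}

static struct kunit_case strcmp_test_cases[] = {
	KUNIT_CASE(strcmp_high_byte_test),
	{}
};

static struct kunit_suite strcmp_test_suite = {
	.name = "strcmp-sign",
	.test_cases = strcmp_test_cases,
};
kunit_test_suite(strcmp_test_suite);

MODULE_LICENSE("GPL");
```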
The submitter can go back and correct this by clicking "edit". The software changes capitalization only on the initial submission, not subsequent edits.
(Overall, I actually like this feature. You WOULDN'T BELIEVE the CRAZY Headlines We'd See Otherwise!)
"Otherwise" was the operative word in my (slightly sarcastic) example. :)
Avoiding all caps for emphasis means you sometimes have to change "Faa" back to "FAA" after submitting a story about the Federal Aviation Administration. And avoiding grammatically-incorrect or all-lowercase headlines means you sometimes have to change "Strcmp()" back to "strcmp()".
Are we assuming these submitters are changing the headlines? Because I was focusing on articles that have headlines with all-caps words in them on the original site, and I can't think of any I've seen on HN.
This follows standard practice[1]: in titles, capitalize everything except articles, prepositions, particles, and conjunctions, but do capitalize the first and the last word even if it’s one of those. Sometimes long prepositions etc. may also be capitalized. A linking “has” looks like one of those words that shouldn’t be capitalized, but it’s a verb, so (as far as the rules go) it should be.
(There are also publications that capitalize everything indiscriminately, I think.)
I'm prepared to bet that both of the people who ran into this bug went "huh, that's weird, you'd swear that was a bug in strcmp(). Oh well. Tell you what I'll do instead..."
This would have been so, sooooo easy to detect with a unit test. It boggles the mind that no one would bother to write a test for something both so fundamental AND so eminently testable.
I'm skeptical. Odds are any such test would have used short ASCII strings and probably not uncovered the problem. Also, the most common use cases of strcmp are comparisons against 0. Any such unit test would probably do the same, further minimizing the odds of discovery.
I agree, this is the type of bug where the test would not have been written until after the bug was found. When the bug gets found, the test will be written, the code will be updated and no one will touch it unless another bug gets found; rinse and repeat.
Just like it is hard to proofread your own papers, it is hard to write complete tests for your own code.
Writing good unit tests always means looking at the possible value ranges, and specifically the edge cases within those ranges, and adding tests for those. That’s the standard course of action when writing unit tests for some API. Unit tests aren’t supposed to just mimic typical usage.
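As a concrete example, something along these lines (a sketch against the libc strcmp, just to show the shape of such an edge-case test) covers the high-bit bytes that typical "is it equal to 0" usage never exercises:

```c
#include <assert.h>
#include <string.h>

/* Reference comparison: bytes compared as unsigned, as the C standard
 * specifies for strcmp(). */
static int ref_cmp(const char *a, const char *b) {
    const unsigned char *ua = (const unsigned char *)a;
    const unsigned char *ub = (const unsigned char *)b;
    while (*ua && *ua == *ub) { ua++; ub++; }
    return (*ua > *ub) - (*ua < *ub);
}

static int sign(int x) { return (x > 0) - (x < 0); }

int main(void) {
    const char *cases[][2] = {
        {"\xfe", "\x7f"},   /* high-bit bytes: the edge case that broke */
        {"\x7f", "\x40"},
        {"\x40", "\xfe"},
        {"abc", "abd"}, {"abc", "abc"}, {"abc", "ab"},
    };
    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++)
        assert(sign(strcmp(cases[i][0], cases[i][1])) ==
               ref_cmp(cases[i][0], cases[i][1]));
    return 0;
}
```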
Right. They're not supposed to test only the typical, minimal usage, but they often do. In very old C code, you'll be lucky if it has any tests at all.
That code is older than the culture of comprehensive automated testing. It's hard for modern minds to believe how little automated testing existed before the xunit revolution.
Yes, it boggles the mind that you didn't. Because if you have enough time to criticize other open source maintainers, you certainly have time to write the unit tests they aren't writing, right?
That's because you lack imagination, you lack knowledge about how we've gotten to where we are, and you lack understanding about the value of keeping software portable.
There are many, many reasons for maintaining support for alternate architectures. How many problems would be baked in if we always had this attitude?
Back in the day, when "all the world is an i386", people complained when told that their code had bugs when compiled on 64-bit processors. Imagine if we had baked in all of those 32-bit-centric bugs that broke compiling on 64-bit.
People do the same thing now when told their code doesn't compile cleanly on 32-bit, on PowerPC, on ARM, on big endian, et cetera. Imagine if we now acted like "all the world is an amd64" and lots of code were simply broken on aarch64.
If you were alive and paid attention through other transitions, you'd understand how every example of programmers being "forced" to care about the correctness and portability of their code has paid dividends years later.
Indeed. The kernel isn't a museum; code or architectures that aren't used can and will get dropped. But the converse is also true: if somebody steps up to regularly test and fix issues in some code, it can stay. Even if said somebody happens to be a bunch of retrocomputing enthusiasts and not a zillion-dollar corporation.
Linux would still be a portable kernel if it dropped m68k support. It still targets PPC, ARM, x86, Sparc, avr, ia64, s390, blah blah blah. Dropping m68k is a far cry from dropping everything except amd64.
You can still buy a ColdFire (now from NXP, after Motorola and Freescale), and I suspect it would have used that strcmp() implementation. The 68k family is not completely dead.
IIRC Texas Instruments uses 68k CPUs for their calculators. I'm pretty sure the TI Voyage 200 I bought in high school (15+ years ago) uses one, and it's still great.
However, NXP have ColdFire on a "Legacy MPU/MCUs" page, which doesn't exactly show them pushing for new design wins. It's not dead, but it doesn't really have a future...
It is available [1] as a core for FPGAs (possibly used in some obscure ASICs to boot), and it is very possible that those applications need fairly current support. The ColdFire processors came in speeds up to 300 MHz, and modern FPGAs could probably enable even higher speeds. There are variants with Ethernet [2], so security is probably a real concern.
I would very much like to learn how to get one 1) running on an FPGA, 2) running Linux, and then 3) set up to automatically run new builds.
ColdFire isn't precisely 680x0. It has a similar instruction set, but it isn't identical, and (crucially) it doesn't have an external memory/peripheral bus.
As far as I'm aware, there are no longer any "true" 680x0 parts in production. NXP stopped production of the 68SEC000 in 2014; anything still in stock is at least that old.
Is anyone making any new m68k chips these days? I went looking for one for a hobby project earlier this year, and it looked like Freescale was doing their best to sunset it.
I went through a similar process a few years back. As far as I know, there is some extended availability through Rochester Electronics, but I don't know if anyone's still actually fabbing them or if they're dicing/packaging out of a stockpile of NOS wafers or what. I'm not aware of anyone still having a 68K/ColdFire chip that is "recommended for new designs".
I also read somewhere that the 2011 earthquake+tsunami destroyed the only fab that had still been producing the original (well, CMOS) 68K-family chips, and Freescale had been planning to close that fab anyway.
The Vampire accelerator by Apollo Computers (which is FPGA-based AIUI) is not officially an m68k chip, but it's been developed as a drop-in replacement.