It is a ridiculous feature of modern C that you have to write the super verbose "mask and shift" code, which then gets compiled to a simple `mov` and maybe a `bswap`. Whereas the direct equivalent in C, an assignment with a (type changing) cast, is illegal. There is a huge mismatch between the assumptions of the C spec and actual machine code.
One of the few reasons I ever even reached for C is the ability to slurp in data and reinterpret it as a struct, or the ability to reason about which registers things will show up in and mix in some `asm` with my C.
I think there should really be a dialect of C(++) where the machine model is exactly the physical machine. That doesn't mean the compiler can't do optimizations, but it shouldn't do things like prove code as UB and fold everything to a no-op. (Like when you defensively compare a pointer to NULL that according to spec must not be NULL, but practically could be...)
`-fno-strict-overflow -fno-strict-aliasing -fno-delete-null-pointer-checks` gets you halfway there, but it would really only be viable if you had a blessed `-std=high-level-assembler` or `-std=friendly-c` flag.
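To make the parenthetical above concrete, here is a minimal sketch (a made-up function, not taken from anywhere) of the kind of defensive check those flags are about:

#include <stddef.h>

/* The dereference lets the compiler assume p != NULL, so at -O2 the
   defensive check below can legally be folded away unless you pass
   -fno-delete-null-pointer-checks. */
int first_byte(const char *p) {
    int c = p[0];      /* per the spec, this implies p is non-null */
    if (p == NULL)     /* defensive check: a candidate for deletion */
        return -1;
    return c;
}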
> One of the few reasons I ever even reached for C is the ability to slurp in data and reinterpret it as a struct, or the ability to reason about which registers things will show up in and mix in some `asm` with my C.
Which results in undefined behavior according to the C ISO standard.
Quote:
“2 All declarations that refer to the same object or function shall have compatible type; otherwise, the behavior is undefined.”
That "should present no problem unless binary data written by one implementation are read by another" quoth ANSI X3.159-1988. One example of a time where I've used that, is when storing intermediary build artifacts. Those artifacts only exist on the host machine. If the binary that writes/reads those artifacts gets recompiled, then the Makefile will invalidate the artifacts so they're regenerated. Since flags like -mstructure-size-boundary=n do exist and ABI breakages have happened with structs in the past.
Sensitive emotional subjects shouldn't be noted. Reminding C developers of the void* incompatibility is a good way to get them to feel triggered because it makes the language unpleasant.
> Whereas the direct equivalent in C, an assignment with a (type changing) cast, is illegal.
I don't understand what you mean by that. The direct equivalent of what?
Endianess is not part of the type system in C so I'm not sure I follow.
> I think there should really be a dialect of C(++) where the machine model is exactly the physical machine.
Linus agrees with you here, and I disagree with both of you. Some UBs could certainly be relaxed, but as a rule I want my code to be portable and for the compiler to have enough leeway to correctly optimize my code for different targets without having to tweak my code.
I want strict aliasing and I want the compiler to delete extraneous NULL pointer checks. Strict overflow I'm willing to concede; at the very least the standard should mandate wrap-on-overflow even for signed integers IMO.
I am sympathetic, but portability was more important in the past and gets less important each year. I used to write code strictly keeping the difference between numeric types and sequences of bytes in mind, hoping to one day run on an Alpha or a Tandem or something, but it has been a long time since I have written code that runs on anything other than Intel/AMD or little-endian ARM.
x86_32, x86_64, arm, arm64, POWER, RISC-V and several others are alive and kicking. China is making their own ISA. And there is still plenty of space and time for new ISAs to be created.
Actually, it is true - which is why endian is a problem in the first place. ASM code is different when written for little endian vs big endian. Access patterns are positively offset instead of negatively.
A language that does the same things regardless of endianness would not have pointer arithmetic. That is not ASM and not C.
You can make the preprocessor condition broader if you care about more compilers and more platforms. Yes, I'm making assumptions about which platforms you want to target... which is fine. No, I don't care about your PDP-11, nor about dynamically changing your endian at runtime. Nearly any problem in C can be made arbitrarily difficult if you care about sufficiently bizarre platforms, or ask that people write code that is correct on any theoretical conforming C implementation. So we pick some platforms to support.
The above code is fairly simple. You can separate the part where you care about unaligned memory access and the part where you care about endian.
Author here. The blog post has that as the naive example. The whole intention was to help people understand why we don't need to do that. Could you at least explain why you disagree if you're going to use this thread to provide the complete opposite advice?
Which as you correctly state in the article, is incorrect code. We agree about this. I proposed an alternate solution, where the READ32BE would be like this:
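Something along these lines (a sketch of the shape, assuming memcpy for the unaligned read plus a plain swap helper; not necessarily the exact snippet):

#include <stdint.h>
#include <string.h>

static inline uint32_t read32_native(const void *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);   /* deals with alignment only */
    return v;
}

static inline uint32_t swap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* READ32BE(p) is then swap32(read32_native(p)) on a little-endian host,
   or just read32_native(p) on a big-endian one. */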
What I like about this is that it breaks the problem down into two parts: reading unaligned data and converting byte order. The reason for this is, sometimes, you need a half of that. Some wire formats have alignment guarantees, and if you know that the alignment guarantees are compatible with your platform, you can just read the data into a buffer and then (optionally) swap the bytes in place.
Just to give an example... not too long ago I was working with legacy code that was written for MIPS. Unaligned access does not work on MIPS, so the code was already carefully written to avoid that. All I had to do was make sure that the data types were sized (e.g. replace "long" with "int32_t") and then go through and byte swap everything.
So it's nice to have a function like swap32be(), and "you don't have to mask and shift" I would say is true, it just depends on which compilers you want to support. I would say that a key part of being a C programmer is making a conscious decision about which compilers you want to support.
Yes, I'm aware that structs are not a great way to serialize data in general, but sometimes they're damn convenient.
There have been CPU architectures where the endianness at compile time isn't necessarily sufficient. I forget which, maybe it was DEC Alpha, where the CPU could flip back and forth? I can't recall if it was a "choose at boot" or a per process change.
Which nothing will be able to deal with so you might as well not bother to support it. Your compiler will also assume a fixed endianness based on the target triple.
The entire problem of using byte swaps is that you need to use them when your native platform's byte order is different from that of the data you are reading.
You know the byte order of the data. But the tricky part is, what is the byte order of the platform?
It will always be correct, but you can't just assume that the compiler will optimize the shifts into a byteswap instruction. If you look at the article you will see that it tries to no-true-scotsman that concern away by talking about a "good modern compiler".
And what exactly is the problem there? Are you going to be writing code that a) is built with a weird enough compiler that it fails this optimisation but also b) does byte swapping in a performance critical section?
Of course nobody wants C to backstab them with UB, but at the same time programmers want compilers to generate optimal code. That's the market pressure that forces optimizers to be so aggressive. If you can accept less optimized code, why aren't you using tcc?
The idea of C that "just" does a straightforward machine translation breaks down almost immediately. For example, you'd want `int` to just overflow instead of being UB. But then it turns out indexing `arr[i]` can't use 64-bit memory addressing modes, because they don't overflow like a 32-bit int does. With UB it doesn't matter, but a "straightforward C" would emit unnecessary separate 32-bit mul/shift instructions.
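A small sketch of the kind of loop where that shows up (nothing exotic, just an int-indexed walk over an array):

/* With signed-overflow UB the compiler may assume i never wraps, widen it
   to a 64-bit induction variable, and address arr[i] with a simple scaled
   addressing mode. If int were defined to wrap, the 32-bit arithmetic
   (and re-sign-extension) would have to be preserved on each iteration. */
long sum(const int *arr, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += arr[i];
    return s;
}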
So in your 'machine model is the physical machine' flavour, should "I cast an unaligned pointer to a byte array to int32_t and deref" on SPARC (a) do a bunch of byte-load-and-shift-and-OR or (b) emit a simple word load which segfaults? If the former, it's not what the physical machine does, and if the latter, then you still need to write the code as "some portable other thing". Which is to say that the spec's UB here is in service of "allow the compiler to just emit a word load when you write *(int32_t*)p".
What I think the language is missing is a way to clearly write "this might be unaligned and/or wrong endianness, handle that". (Sometimes compilers provide intrinsics for this sort of gap, as they do with popcount and count-leading-zeroes; sometimes they recognize common open-coded idioms. But proper standardised support would be nicer.)
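Today the closest thing is the memcpy idiom plus a compiler builtin; a sketch, assuming GCC or Clang for __builtin_bswap32 and the predefined byte-order macros:

#include <stdint.h>
#include <string.h>

static inline uint32_t load_be32(const unsigned char *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);        /* "might be unaligned": memcpy */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    v = __builtin_bswap32(v);       /* "wrong endianness": compiler builtin */
#endif
    return v;
}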
Endianness doesn't matter though, for the reasons Rob Pike explained. For example, the bits inside each byte have an endianness probably inside the CPU but they're not addressable so no one thinks about that. The brilliance of Rob Pike's recommendation is that it allows our code to be byte order agnostic for the same reasons our code is already bit order agnostic.
I agree about bsf/bsr/popcnt. I wish ASCII had more punctuation marks because those operations are as fundamental as xor/and/or/shl/shr/sar.
D's machine model does actually assume the hardware, and using the compile time metaprogramming you can pretty much do whatever you want when it comes to bit twiddling - whether that means assembly, flags etc.
> There is a huge mismatch between the assumptions of the C spec and actual machine code.
Right, which is why the kind of UB pedantry in the linked article is hurting and not helping. Cranky old man perspective here:
Folks: the fact that compilers will routinely exploit edge cases in undefined behavior in the language specification to miscompile obvious idiomatic code is a terrible bug in the compilers. Period. And we should address that by fixing the compilers, potentially by amending the spec if feasible.
But instead the community wants to all look smart by showing how much they understand about "UB" with blog posts and (worse) drive-by submissions to open source projects (with passive-aggressive sneers about code quality), so nothing gets better.
Seriously: don't tell people to shift and mask. Don't pontificate over compiler flags. Stop the masturbatory use of ubsan (though the tool itself is great). And start submitting bugs against the toolchain to get this fixed.
I agree, but the language of the standard very unambiguously lets them do it. Quoth X3.159-1988:
* Undefined behavior --- behavior, upon use of a nonportable or
erroneous program construct, of erroneous data, or of
indeterminately-valued objects, for which the Standard imposes no
requirements. Permissible undefined behavior ranges from ignoring the
situation completely with unpredictable results, to behaving during
translation or program execution in a documented manner characteristic
of the environment (with or without the issuance of a diagnostic
message), to terminating a translation or execution (with the issuance
of a diagnostic message).
In the past compilers "behaved during translation or program execution in a documented manner characteristic of the environment" and now they've decided to "ignore the situation completely with unpredictable results". So yes what gcc and clang are doing is hostile and dangerous, but it's legal. https://justine.lol/undefined.png So let's fix our code. The blog post is intended to help people do that.
No; I say we force the compiler writers to fix their idiotic assumptions instead of bending over backwards to please what's essentially a tiny minority. There's a lot more programmers who are not compiler writers.
The standard is really a minimum bar to meet, and what's not defined by it is left to the discretion of the implementers, who should be doing their best to follow the "spirit of C", which ultimately means behaving sanely. "But the standard allows it" should never be a valid argument --- the standard allows a lot of other things, not all of which make sense.
force the compiler writers to fix their idiotic assumptions instead of bending over backwards to please what's essentially a tiny minority
As far as I understand it, they do neither. Transforming an AST to any level of target code is not done by handcrafted recipes, but instead is fed into efficient abstract solvers which have these assumptions as an operational detail. E.g.:
p = &x;
if (p != &x) foo(); // optimized out
is not much different from
if (p == NULL) foo(); // optimized out
printf("%c", *p);
No assumption here is idiotic, because no single human was involved; it's just a class of constraints, and merely separating them properly would take extensive head-scratching (imagine telling a logic system that p is both 0 and not-0 when the 0-test is "explicit" and asking it to operate normally). Compiler writers do not format disks just to punish your UBs. Of course you can write a boring compiler that emits opcodes at face expression value, without most UBs being a problem. Plenty of those exist, so why not just take one?
In your example, why should it optimise out the second case? Maybe foo() changed p so it's no longer null.
Compiler writers do not format disks just to punish your UBs.
IMHO if the compiler exploiting UB is leading to counterintuitive behaviour that's making it harder to use the language, the compiler is the one that needs fixing, regardless of whether the standard allows it. "But we wrote the compiler so it can't be fixed" just feels like a "but the AI did it, not me" excuse.
The address of p could have been taken somewhere earlier and stored in a global that foo accesses, or a similar path to that; and of course, p could itself be a global. Indeed, if the purpose of foo is to make p non-null and point to valid memory, then by optimising away that code you have broken a valid program.
If the compiler doesn't know if foo may modify p, then it can't remove the call. Even if it can prove that foo does not modify p, it still can't remove the call: foo may still have some other side-effects that matter (like not returning --- either longjmp()'ing elsewhere or perhaps printing an error message about p being null and exiting?), so it won't even get to the null dereference.
As a programmer, if I write code like that, I either intend for foo to be doing something to p to make it non-null, or if it doesn't for whatever reason, then it will actually dereference the null and whatever happens when that's attempted on the particular platform, happens. One of the fundamental principles of C is "trust the programmer". In other words, by trying to be "helpful" and second-guessing the intent of the code while making assumptions about UB, the compiler has completely broken the expectations of the programmer. This is why assumptions based on UB are stupid.
The standard allows this, but the whole intent of UB is not so compiler-writers can play language-lawyer and abuse programmers; things it leaves undefined are usually because existing and possible future implementations vary so widely that they didn't even try to consider or enumerate the possibilities (unlike with "implementation-defined").
But in fact compilers do regularly prove such things as, "this function call did not touch that local variable". Escape analysis is a term related to this.
I'm more of two minds about that other step, where the compiler goes like, "here in the printf call the p will be dereferenced, so it surely is non-null, so we silently optimize that other thing out where we consider the possibility of it being null".
Also @joshuamorton, couldn't the compiler at least print a warning that it removed code based on an assumption that was inferred by the compiler? I really don't know a lot about those abstract logic solver approaches, but it feels like it should be easy to do.
warning that it removed code based on an assumption that was inferred by the compiler
That would dump a ton of warnings from various macro/meta routines, which real-world C is usually peppered with. Not that it’s particularly hard to do (at the very least compilers know which lines are missing from debug info alone).
Yes, the assumption that p is non-null is idiotic. Also, the implicit assumption that foo will always return.
> no single human was involved
Humans implemented the compilers that use the spec adversarially and humans lobby the standards committee to not fix the bugs
> Of course you can write a boring compiler that emits opcodes at face expr value, without most UBs being a problem. Plenty of these, why not just take one
The majority of optimizations are harmless and useful, only a handful are idiotic and harmful. I want a compiler that has the good optimizations and not the bad ones.
For essentially every form of UB that compilers actually take advantage of, there's a real program optimization benefit. Are there any particular UB cases where you think the benefit isn't worth it, or it should be implementation-specific behavior instead of undefined behavior?
Most performance wins from UB come from removing code that someone wrote intentionally. If that code wasn't meant to be run, it shouldn't be written. If it was written, it should be run.
Now obviously there are lots of counter-examples for that. You can probably list ten in a minute. But it should be the guiding philosophy of compiler optimizations. If the programmer wrote some code, it shouldn't just be removed. If the program would be faster without that code, the programmer should be the one responsible for deciding whether the code gets removed or not.
MSVC and ICC have traditionally been far less keen on exploiting UB, yet are extremely competitive on performance (ICC in particular). That alone is enough evidence to convince me that UB is not the performance-panacea that the gcc/clang crowd think it is, and from my experience with writing Asm, good instruction selection and scheduling is far more important than trying to pull tricks with UB.
Get the teamsters and workers world party to occupy clang. You should fork C to restore the spirit of C and call it Spiritual C since we need a new successor to Holy C.
I read this, and go "yes, yes, yes", and then "NO!".
Shifts and ors really is the sanest and simplest way to express "assembling an integer from bytes". Masking is _a_ way to deal with the current C spec which has silly promotion rules. Unsigned everything is more fundamental than signed.
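For example (a sketch): with the input held as unsigned char and each byte widened to uint32_t before shifting, no masks are needed and the promotion rules have nothing to bite on.

#include <stdint.h>

uint32_t assemble_be32(const unsigned char b[4]) {
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}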
> That doesn't mean the compiler can't do optimizations, but it shouldn't do things like prove code as UB and fold everything to a no-op.
UB doesn't just mean the compiler can treat it as a no-op. It means the compiler can do whatever it likes and still be compliant with the spec.
From the POV of someone consulting the spec, if something results in UB, what it means is: "Don't look here for documentation, look in the documentation of your compiler!".
Many compilers prefer to do a no-op because it is the cheapest thing to do.
My read of the standard is that the worst the compiler can do is to do nothing. For example, the blog post links a tweet where clang doing nothing meant generating an empty function, so that when it was called, execution fell through to a different function the author had written which formats the hard drive. However, it wouldn't be kosher for the compiler to generate the asm that formats your hard drive itself as an intended punishment for UB, since the standard recommends that, if you're not going to ignore the situation, you either have the compiler act in a way that's characteristic of the environment or terminate with an error.
C is careful to distinguish “implementation-defined behavior” (every compiler must document a consistent choice) from “undefined behavior”, which doesn't necessarily have any safe uses.
Rust gets this right. These primitives are available for all the numeric types.
u32::from_le_bytes(bytes) // u32 from 4 bytes, little endian
u32::from_be_bytes(bytes) // u32 from 4 bytes, big endian
u32::to_le_bytes(num) // u32 to 4 bytes, little endian
u32::to_be_bytes(num) // u32 to 4 bytes, big endian
This was very useful to me recently as I had to write the marshaling and un-marshaling for a game networking format with hundreds of messages. With primitives like this, you can see what's going on.
There are equivalent functions in C too. The point of the article is about not using them. So how would you implement the above functions in Rust would be more pertinent.
Given that Rust isn’t C, the answer is Rust has a compiler intrinsic for bswap and it calls that as appropriate. LLVM will then turn that into the correct instruction(s) for the target platform.
If I were forced to implement them myself for some reason, I would probably simply do them like this:
fn from_be(bytes: [u8; 4]) -> u32 {
    (bytes[0] as u32) << 24
        | (bytes[1] as u32) << 16
        | (bytes[2] as u32) << 8
        | (bytes[3] as u32) << 0
}
It's direct, to the point, and does exactly what it says on the tin because all pertinent behaviour is defined. The way Rust's corelib implements it is to transmute the array into the integer, then call the bswap intrinsic if the bytes need swapping (detected at compile time).
Isn't the point to be careful when implementing them, so the compiler detects the intention to byteswap?
When we ported little endian x86 Linux to the big endian mainframe we sprinkled hton/ntoh all over the place, happily so. They are the way to go and they should be implemented properly, not be replaced by a homegrown version.
All that said, I'm surprised 64-bit htonll and ntohll are not standard yet. Anybody know why?
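In the meantime most codebases define their own; a sketch of the usual shape (the name is made up, assuming <arpa/inet.h> for htonl):

#include <stdint.h>
#include <arpa/inet.h>

static inline uint64_t my_htonll(uint64_t v) {
    if (htonl(1) != 1) {                 /* little-endian host: swap the halves */
        return ((uint64_t)htonl((uint32_t)v) << 32) |
               htonl((uint32_t)(v >> 32));
    }
    return v;                            /* big-endian host: already network order */
}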
Blech. I learned to program (around ‘99) by implementing the crusty old FCS1.0 format, which allows for aggressively weird wire formats. Our machine was a PDP-11/72 with its head sawzalled off and custom wire wrap boards dropped in. The “native” format (coming from analog) was 2143 order as a 36b packet. The bits were [8,0:7] (using verilog notation). However, sprinkled randomly in the binary header were chunks of 7- and 8- bit ANSI (packed) and some mutant knockoff 6-bit EBCDIC.
The original listing was written by “Jennifer — please call me if you have troubles”, an undergraduate from MIT. It was hand-assembled machine code, in a neat hand in a big blue binder. That code ran non-stop except for a few hurricanes from 1988 until 2008; bug-free as far as I could tell. Jennifer last-name-unknown, you were my idol & my demon!
I swore off programming for nearly a year after that.
Functions like ntohl and htonl are the biggest blemish in the design of the Berkeley Sockets API because it's defined to read memory off the wire without deserializing it. Those functions shouldn't have been invented for the reasons described in the linked blog posts. The C standard isn't going to evolve to include functions that only exist to accommodate code that misunderstands the standard.
It's about accessing memory plus an extra conversion step vs. accessing memory in the right order in one step. As an extra, platform-dependent implementations of the accessors could be done, like using the LWL+LWR instruction pair on MIPS.
Because it makes it clear what's going on. Most of those functions just generate a move, but it's the correct move.
I had to read through excessively-clever C++ code that did the same thing to figure out what conversions were happening, then re-express it in Rust. I'm re-implementing a legacy mess that people are afraid to work on. As it happens, in this message system, some items, mainly packet sequence numbers, are big-endian, because they were following what IP and UDP do, and everything else is little endian.
I know how to do this with shifts and masks, and I've done things like that when programming in assembly. That was a long time ago. There's been progress in how to write programs.
Obviously if you are interacting with an old protocol that uses network byte order for some things, you will need to use these functions.
But what's the argument for using them in new game code?
The code will not be running anywhere that has big-endianness. No current platform a game could run on uses it, and I can't imagine a scenario where a new platform would come into existence and use it either.
If you insist on using network-byte order anyway, then you have to do an extra bswap op for each bit of data you send. Sure, the cost of that is super minor and probably not worth worrying about.
But the bigger cost is that you can't just send whole structures at a time. You have to individually serialise each thing. Now you have to have a whole serialisation concept. You have to have some way of enumerating all the fields. You have to walk all the structures. What a pain.
If you want to send a thing over the network, just send it.
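In code, that argument is literally this small (a sketch; the struct is made up, and it only works when both peers share the same architecture, compiler and build):

#include <stdint.h>
#include <unistd.h>

struct player_update { uint32_t id; float x, y, z; uint16_t hp; };

ssize_t send_update(int fd, const struct player_update *u) {
    return write(fd, u, sizeof *u);   /* padding and native byte order go over the wire as-is */
}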
This is why, in 2021, the mantra that C is a good language for these low level byte twiddling tasks needs to die. Dealing with alignment and endianness properly requires a language that allows you to build abstractions.
The following is perfectly well defined in C++, despite looking like almost the same as the original unsafe C:
#include <boost/endian.hpp>
#include <cstdio>
#include <cstdint>
using namespace boost::endian;
unsigned char b[5] = {0x80,0x01,0x02,0x03,0x04};
int main() {
    uint32_t x = *((big_uint32_t*)(b+1));
    printf("%08x\n", x);
}
Note that I deliberately misaligned the pointer by adding 1.
[Edit] Fun twist: the above code doesn't work if the intermediate variable x is removed, because printf itself is not type safe, so no type conversion happens (and the conversion is where the deferred bswap takes place). In pure C++, when using a type safe formatting function (like fmt or iostreams), this wouldn't happen. printf will let you throw any garbage into it. tl;dr outside embedded use cases writing C in 2021 is fucking nuts.
Correct me if I'm wrong, but your example is just using a library to do the same task, rather than illustrating any difference between C and C++. If you want to pull boost in to do this, that's great, but that hardly seems like a fair comparison to the OP, since instead of implementing code to solve this problem yourself you're just importing someone else's code.
No, the fact that this can be done in a library and looks like a native language feature demonstrates the power of C++ as a language.
This example is demonstrating:
- First class treatment of user (or library) defined types
- Operator overloading
- The fact that it produces fast machine code. Try changing big_uint32_t to regular uint32_t to see how this changes. When you use the latter, ubsan will introduce a trap for runtime checks, but it doesn't need to in this case.
Operator overloading is a mixed blessing though; it can be very convenient, but it's also very good at obfuscating what's going on.
For instance, I'm not familiar with this boost library, so I'd have a lot of trouble piecing out what your snippet does, especially since there's no explicit function call besides the printf.
Personally, if we're going the OOP route I'd much prefer something like Rust's `var.to_be()`, `var.to_le()` etc... At least it's very explicit.
My hot take is that operator overloading should only ever be used for mathematical operators (multiplying vectors etc...), everything else is almost invariably a bad idea.
Ironically, it was proposed not so long ago to deprecate to_be/to_le in favour of to_be_bytes/to_le_bytes, since the former conflate abstract values with bit representations.
That's fine if whatever type 'var' happens to be is NOT usable as an arithmetic type, otherwise you can easily just forget to call .to_le() or .to_native(), or whatever, and end up with a bug. I don't know Rust, so don't know if this is the case?
Boost.Endian actually lets you pick between arithmetic and buffer types.
'big_uint32_buf_t' is a buffer type that requires you to call .value() or do a conversion to an integral type. It does not support arithmetic operations.
'big_uint32_t' is an arithmetic type, and supports all the arithmetic operators.
There are also variants of both endian suffixed '_at' for when you know you have aligned access.
The idiomatic way to do this in Rust is to use functions like .to_le_bytes(), so you have the u32 (or whatever) on one end and raw bytes (something like [u8; 4]) on the other. It can get slightly tedious if you're doing it by hand, but it's impossible to accidentally forget. If you're doing this kind of thing at scale, like dealing with TrueType fonts (another bastion of big-endian), it's common to reach for derive macros, which automate a great deal of the tedium.
Who decides what methods to add to the bytes type/abstraction?
If I have a 3 byte big endian integer, can I access it easily in Rust without resorting to shifts?
In C++ I could probably create a fairly convincing big_uint24_t type and use it in a packed struct and there would be no inconsistencies with how it's used with respect to the more common varieties
In Rust, [u8; N] and &[u8] are both primitive types, and not abstractions. It's possible to create an abstraction around either (the former even more so now with const generics), but that's not necessary. It's also possible to use "extension traits" to add methods, even to existing and built-in types[1].
I'm not sure about a 3 byte big endian integer. I mean, that's going to compile down to some combination of shifting and masking operations anyway, isn't it? I suspect that if you have some oddball binary format that needs this, it will be possible to write some code to marshal it that compiles down to the best possible asm. Godbolt is your friend here :)
I agree then that in Rust you could make something consistent.
I think there's no need for explicit shifts. You need to memcpy anyway to deal with alignment issues, so you may as well just copy into the last 3 bytes of a zero-initialized, big endian, 32bit uint.
C also has _Generic() so you can roll up a family of endianness conversion functions and safely change types without blowing up somewhere else with a hardcoded conversion routine.
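A sketch of what that can look like (the helper names are made up): one macro that dispatches on the argument's type at compile time, so changing a field's width can't silently pick the wrong conversion.

#include <stdint.h>

static inline uint16_t bswap16v(uint16_t v) { return (uint16_t)((v >> 8) | (v << 8)); }
static inline uint32_t bswap32v(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}
static inline uint64_t bswap64v(uint64_t v) {
    return ((uint64_t)bswap32v((uint32_t)v) << 32) | bswap32v((uint32_t)(v >> 32));
}

#define byteswap(x) _Generic((x), \
    uint16_t: bswap16v,           \
    uint32_t: bswap32v,           \
    uint64_t: bswap64v)(x)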
I find you missed the point of the post and the issues described in it.
In my estimation, libraries like boost are way too big and way too clever and they create more problems than they solve. Also, they don't make me happy.
You're overfocusing on a "problem" that is almost completely irrelevant for most of programming. Big endian is rare to be found (almost no hardware to be found, but some file formats and networking APIs have big-endian data in them). Where you still meet it, you don't do endianness conversions willy-nilly. You have only a few lines in a huge project that should be concerned with it. Similar situation for dealing with aligned reads.
So, with boost you end up with a huge slow-compiling dependency to solve a problem using obscure implicit mechanisms that almost no-one understands or can even spot (I would never have guessed that your line above seems to handle misalignment or byte swapping).
This approach is typical for a large group of C++ programmers, who seem to like to optimize for short code snippets, cleverness, and/or pedantry.
The actual issue described in the post was the UB that is easy to hit when doing bit shifting, caused by the implicit conversions that are defined in C. While this is definitely an unhappy situation, it's easy enough to avoid this using plain C syntax (cast expression to unsigned before shifting), using not more code than the boost-type cast in your above code.
The fact that the UB is so easy to hit doesn't call for excessive abstraction, but simply a revisit of some of the UB defined in C, and how compiler writers exploit it.
(Anecdata: I've written a fair share of C code, though not compression or encryption algorithms, and personally I'm not sure I've ever hit one of the evil cases of UB. I've hit segmentation faults or had out-of-bounds accesses, sure, but personally I've never seen the language or compilers "haunt me".)
Do you use UBSAN and ASAN? When you write unit tests do you feed numbers like 0x80000000 into your algorithm? When you allocate test memory have you considered doing it with mmap(4096) and putting the data at the end of the map? (Or better yet, double it and use mprotect). Those are some good examples of torture tests if you're in the mood to feel haunted.
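A sketch of that mmap trick (assuming a POSIX-ish system with MAP_ANONYMOUS): put the test data flush against a PROT_NONE page so any read past the end faults instead of silently succeeding.

#include <stddef.h>
#include <sys/mman.h>

unsigned char *guarded_alloc(size_t n) {        /* n <= page size */
    size_t pg = 4096;                           /* or sysconf(_SC_PAGESIZE) */
    unsigned char *p = mmap(NULL, 2 * pg, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return NULL;
    mprotect(p + pg, pg, PROT_NONE);            /* second page is a tripwire */
    return p + pg - n;                          /* data ends exactly at the guard */
}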
Every day I spend futzing around with endianness is a day I'm not solving 'real' problems. These things are a distraction and a complete waste of developer time: It should be solved 'once' and only worried about by people specifically looking to improve on the existing solution. If it can't be handled by a library call, there's something really broken in the language.
(imo, both c and cpp are mainly advocated by people suffering from stockholm syndrome.)
But that's the point: No one spends a day futzing around with endianness, and there are in fact functions for swapping endianness. You can just call them, no need to hide the swap in a pointer cast expression to a type that has the dereferencing operator overloaded.
Re the anecdata at the end. Have you ever run your code through the sanitizers? I have. CVE-2016-2414 is one of my battle scars, and I consider myself a pretty good programmer who is aware of security implications.
Very little, quite frankly. I've used valgrind in the past, and found very few problems. I just ran -fsanitize=undefined for the first time on one of my current projects, which is an embedded network service of 8KLOC, and with a quick test covering probably 50% of the codepaths by doing network requests, no UB was detected (I made sure the sanitizer works in my build by introducing a (1<<31) expression).
Admittedly I'm not the type of person who spends his time fuzzing his own projects, so my statement was just to say that the kind of bugs that I hit by just testing my software casually are almost all of the very trivial kind - I've never experienced the feeling that the compiler "betrayed" me and introduced an obscure bug for something that looks like correct code.
I can't immediately see the problem in your CVE here [0], was that some kind of betrayal by compiler situation? Seems like strange things could happen if (end - start) underflows.
This one wasn't specifically "betrayal by compiler," but it was a confusion between signed and unsigned quantities for a size field, which is very similar to the UB exhibited in OP.
Also, the fact that you can't see the problem is actually evidence of how insidious these problems are :)
The rules for this are arcane, and, while the solution suggested in OP is correct, it skates close to the edge, in that there are many similar idioms that are not ok. In particular, (p[1] << 8) & 0xff00, which is code I've written, is potentially UB (hence "mask, and then shift" as a mantra). I'd be surprised if anyone other than jart or someone who's been part of the C or C++ standards process can say why.
> the fact that you can't see the problem is actually evidence of how insidious these problems are
I've looked for a while now, but still can't see it, would you be willing to share?
> (p[1] << 8) & 0xff00
With p[1] being uint8_t? Because then I cannot imagine why, and also fail to see a reason to apply the 0xff00 mask here.
If this is for int8_t instead, the problem you are alluding to is sign extension? If p[1] gets promoted to an int in the negative range, (then its representation has the high order bit set), and shifting that to the left is UB.
Yes, I was assuming it was char *, as in the OP, which can be signed. And any left shift of a negative quantity is UB in C (I'm not sure if this is fixed in recent C++), it doesn't have to be what's commonly thought of as overflow.
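For the record, a sketch of the two shapes side by side (hypothetical helpers, assuming the same char* input as the OP):

#include <stdint.h>

uint32_t second_byte_risky(const char *p) {
    return (p[1] << 8) & 0xff00;                 /* UB if char is signed and p[1] < 0 */
}

uint32_t second_byte_safe(const char *p) {
    return (uint32_t)(unsigned char)p[1] << 8;   /* defined for every input */
}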
As a very minor counterpoint: I like C because frankly it’s fun. I wouldn’t start a web browser or maybe even an operating system in it today, but as a language for messing around I find it rewarding. I also think it is incredibly instructive in a lot of ways. I am not a C++ developer but ANSI C has a special place in my heart.
Also, I will say that when it comes to programming Arduinos and ESP8266/ESP32 chips, I still find that C is my go to despite things like Alia, MicroPython, etc. I think it’s possible that once Zig supports those devices fully that I might move over. But in the meantime I guess I’ll keep minding my off by one errors.
This has nothing to do with C++ because your example only hides the real issue occurring in the blog post example: The unaligned read on the array. Try adding something like
printf("%08x\n", *((uint32_t*)(b)));
to your example and you'll see that it produces UB as well. The reason there is no UB with big_uint32_t probably is that that struct/class/whatever it is probably redefines its dereferencing operator to perform byte-wise reads.
I fail to see your point. The point of my post is that the abstractions you can build in C++ are as easy to use and as efficient as doing things the wrong, unsafe way...so there's no reason not to do things in a safe, correct way.
Obviously if you write C and compile it as C++ you still end up with UB, because C++ aims for extreme levels of compatibility with C.
Sorry for being unclear. My point is that the example in the blog post does two things, a) it reads an unaligned address causing UB and b) it performs byte-order swapping. The post then goes on about avoiding UB in part b), but all the time the UB was caused by the unaligned access in a).
Of course your example solves both a) and b) by using big_uint32_t, and I agree that this is an interesting abstraction provided by Boost, but I think the takeaway "use C++ for low-level byte fiddling" is slightly misleading: Say I was a novice C++ programmer, saw your example of how C++ improves this but at the same time don't know that big_uint32_t solves the hassle of reading a word from an unaligned address for me. Now I use your pattern in my byte-fiddling code, but then I need to read a word in host endianness. What do I do? Right, I remember the HN post and write *((uint32_t*)(b+1)) (without the big_, because I don't need that!). And then I unintentionally introduced UB. In other words, big_uint32_t is a little "magic" in this case, as it suggests a similarity to uint32_t which does not actually exist.
To be honest, I don't think the byte-wise reading is in any way inappropriate in this case: If you're trying to read a word in non-native byte order from an unaligned access, it is perfectly fine to be very explicit about what you're doing in my opinion. There also is nothing unsafe about doing this as long as you follow certain guidelines, as mentioned elsewhere in this thread.
Sure, the only correct way to read an unaligned value in to an aligned data type in both C or C++ is via memcpy.
I still think being able to define a type that models what you're doing is incredibly valuable because as long as you don't step outside your type system you get so much for free.
You could also mask and shift the value byte-wise just like with an endian swap. Depending on the destination and how aggressive the compiler optimizes memcpy or not, it could even produce more optimal code, perhaps by working in registers more.
Conceptual consistency is a good thing, but there is a generally higher cognitive load to using C++ over C. I've used both C++ and C professionally, and I've gone deeper with type safety and metaprogramming than most folk. I've mostly used C for the last few years, and I don't feel like I'm missing anything. It's still possible to write hard-to-misuse code by coming up with abstractions that play to the language's strengths.
Operator overloading in particular is something I've refined my opinion on over the years. My current thought is that it's best not to use operators in user/application defined APIs, and should be reserved for implementing language defined "standard" APIs like the STL. Instead, it's better to use functions with names that unambiguously describe their purpose.
C is perfect for these problems. I like teaching the endian serialization problem because it broaches so many of the topics that are key to understanding C/C++ in general. Even if we choose to spend the majority of our time plumbing together functions written by better men, it's nice to understand how the language is defined so we could write those functions, even if we don't need to.
For sure, it's a good way to teach that C is insufficient to deal with even the simplest of tasks. Unfortunately teaching has a bad habit of becoming practice, no matter how good the intention.
With regard to teaching C++ specifically I tend to agree with this talk:
One of her slides was titled "Stop teaching pointers!" too. My VP back at my old job snapped at me once because I got too excited about the pointer abstractions provided by modern C++. Ever since that day I try to take a more rational approach to writing native code where I consider what it looks like in binary and I've configured my Emacs so it can do what clang.godbolt.org does in a single keystroke.
One does not simply introduce C++. It's the most insanely hardcore language there is. I wouldn't have stood any chance understanding it had it not been for my gentle introduction with C for several years.
Apparently the first year students at my university didn't have any issue going from Standard Pascal to C++, in the mid-90's.
Proper C++ was taught using our string, vector and collection classes, given that we were still a couple of years away from ISO C++ being fully defined.
C style programming with low level tricks were only introduced later as advanced topics.
Apparently thousands of students managed to get going the remaining 5 years of the degree.
Well there's a reason universities switched to Java when teaching algorithms and containers after the 90's. C++ is a weaker abstraction that encourages the kind of curiosity that's going to cause a student's brain to melt the moment they try to figure out how things work and encounter the sorts of demons the coursework hasn't prepared them to face. If I was going to teach it, I'd start with octal machine codes and work my way up. https://justine.lol/blinkenlights/realmode.html Sort of like if I were to teach TypeScript then I'd start with JavaScript. My approach to native development probably has more in common with web development than it does with modern c++ practices to be honest, and that's something I talk about in one of my famous hacks: https://github.com/jart/cosmopolitan/blob/4577f7fe11e5d8ef0a...
Yes, there is some value in using C for teaching these concepts. But the problem I see is that, once taught, many people will then continue to use C and their hand written byte swapping functions, instead of moving on to languages with better abstraction facilities and/or availing themselves of the (as you point out) many available library implementations of this functionality.
What are the advantages of this over a simple function with the following signature?
uint32_t read_big_uint32(char *bytes);
Having a big_uint32_t type seems wrong to me conceptually. You should either deal with sequences of bytes with a defined endianness or with native 32-bit integers of indeterminate endianness (assuming that your code is intended to be endian neutral). Having some kind of halfway house just confuses things.
The library provides those functions too, but I don't see how having an arithmetic type with well defined size, endiannness and alignment is a bad thing.
If you're defining a struct to mirror a data structure from a device, protocol or file format then the language / type system should let you define the properties of the fields, not necessarily force you to introduce a parsing/decoding stage which could be more easily bypassed.
It is no longer arithmetic if there is an endianness. Some things are numbers and some things are sequences of bytes. Arithmetic only works on the former.
I agree, but a little nitpick: A sequence of bytes does not have a defined endianness. Only groups of more than one byte (i.e. half words, words, double words or whatever you want to call them) have an endianness.
In practice, most projects (e.g. the Linux kernel or the socket interface) differentiate between host (indeterminate) byte order and a specific byte order (e.g. network byte order/big endian).
I'd say, putting multiple of those types into a struct that then perfectly describes the memory layout of each byte of data in memory/network packet in a reliable and user friendly way to manipulate for the coder.
I see. That does seem helpful once you consider how these types compose, rather than thinking about a one-off conversion. However, I think it would be cleaner to have a library that auto-generated a parser for a given struct paired with an endianness specification, rather than baking the endianness into the types. (Probably this could be achieved by template metaprogramming too.)
By the same token, I think most uses for C++ these days are nuts. If you're doing a greenfield project 90% of the time it's better to use Rust.
C++ has a multitude of its own pitfalls. Some of the C programmer hate for C++ is justified. After all, it's just C with a pre-processing stage in the end.
There's good reasons why many C projects never considered C++ but are already integrating the nascent Rust. I always hated low level programming until Rust made it just as easy and productive as high level stuff
No, because no punning exists here. The code is C++, so this calls a conversion function that likely does the bit manipulation internally in a legal way.
Why is it not ok to convert from a char? Some of the information in the gist is wrong. Type punning with unions for example is legal. ANSI X3.159-1988 is quite clear on that point in its aliasing rules. I've seen a lot of comments people post online saying you must use memcpy to read the bits in a float or that C++ forbids union punning, but where is that written? Since if that were true every math library would break.
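For reference, the kind of union punning being defended here, as a sketch (assuming float is 32 bits):

#include <stdint.h>

static inline uint32_t float_bits(float f) {
    union { float f; uint32_t u; } pun;
    pun.f = f;
    return pun.u;    /* reading the other member: allowed by C's aliasing rules */
}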
Remember how we used to have machines with a 7 bit byte? And everything was written to handle either 6, 7, or 8 bit bytes?
And now we've settled on all machines being 8 bit bytes, and programmers no longer have to worry about such details?
Is it time to do the same for big endian machines? Is it time to accept that all machines that matter are little endian, and the extra effort keeping everything portable to big endian is no longer worth the mental effort?
That reminds me of a project to interface with vending machines. (We built a bookshop in a vending machine that would tweet whenever it sold an item, with automated stock management.)
Vending machines have an internal protocol a little like I2C. We created a custom peripheral to bridge the machine to the web, based on a Raspberry Pi.
The protocol was defined by Coca Cola Japan in 1975 (in order to have optionality in their supply chain). It's still in use today. But because it was designed in Japan, with a need for wide characters, it assumes 9 bit bytes.
We couldn't find any way to get a Raspberry Pi to speak 9 bit bytes. The eventual solution was a custom shield that would read the bits, and reserialise to 8 bit bytes for the Pi to understand. And vice versa.
9 bit bytes. I grew up knowing that bytes had variable length, but this was the first time I encountered it in the wild. This was 2015.
This just doesn't seem right. Granted, I don't know much about your use case, but Raspberry Pi's are powerful computing devices and I find it difficult to believe there was no way to handle this without additional hardware.
I’m not familiar with the “vending machine” protocol he’s talking about, but it’s entirely reasonable that it has certain timing requirements. Usually the way you interface with these is by having a dedicated HW block to talk the protocol, or by bit banging. The former wouldn’t be supported on RPi because it’s obscure, the latter requires tight GPIO timing control that is difficult to guarantee on a non-real-time system like the RPi usually runs.
I'm familiar with both, and have Pi's bit-banging at 8MHz. It's not hard-realtime like a PIC though (where I've bitbanged a resistor D2A hung off a dsPIC33 to 17.734475MHz). It's an improvement over the years, but surprisingly little since bit-banging 4MHz Z80's more than 4 decades ago, where resolution was 1 T state (250ns).
The 9 bit serial OP mentioned likely doesn't have a seperate clock line, so it is hard realtime and timing matters a lot, and I doubt the Pi could reliably do anything over 1 kHz baud with bit banging. You could do much better if you didn't run Linux.
In order to exchange data over a serial connection, the ones and zeroes have to be sent with exact timing, so the receiver can reliably tell where one bit ends and the next begins. Because of this, the hardware that's doing the communication can't do anything else at the same time. And since the actual mechanics of the process are simple and straightforward, most computers with a serial connection have special serial-interface hardware (a Universal Asynchronous Receiver/Transmitter, or UART) to take care of it - the CPU gives the UART some data, then returns to more productive pursuits while the UART works away.
But sometimes you can't use a UART: maybe you're working on a tiny embedded computer without one, or maybe you need to speak a weird 9-bit protocol a standard UART doesn't understand. In that case, you can make the CPU pump the serial line directly. It's inefficient (there's probably more interesting work the CPU could be doing) and it can be difficult to make the CPU pause for exactly the right amount of time (CPUs are normally designed to run as fast or as efficiently as possible, nothing in between), but it's possible and sometimes it's all you've got. That's bit-banging.
The practice of using software to literally toggle (or read) individual pins with the correct software-controlled timing in order to communicate with some hardware.
To transmit a bit pattern 10010010 over a single pin channel, for example, you'd literally set the pin high, sleep for a some predetermined amount of time, set it low, sleep, set it low, sleep, set it high, etc.
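As a runnable sketch (gpio_write and delay_us are stand-ins for whatever the platform actually provides; here they just print and do nothing):

#include <stdio.h>

/* Hypothetical platform hooks: a real driver would poke a GPIO register
   and busy-wait or use a hardware timer. */
static void gpio_write(int level) { printf("%d", level); }
static void delay_us(unsigned us) { (void)us; }

/* Shift each bit out on one pin, MSB first, one bit period apart. */
static void bitbang_send(unsigned value, int nbits, unsigned bit_time_us) {
    for (int i = nbits - 1; i >= 0; i--) {
        gpio_write((value >> i) & 1);
        delay_us(bit_time_us);
    }
}

int main(void) {
    bitbang_send(0x92, 8, 104);   /* 10010010, roughly a 9600 baud bit period */
    putchar('\n');
    return 0;
}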
To be a bit more explicit: Unicode is a character encoding, to 20-and-a-half-bit 'bytes', that is variable-width in those 'bytes', even before considering how the 'bytes' are encoded to actual bytes. E.g. "ψ̊" (greek small psi with ring above) is U+3C8 U+30A (two 'bytes').
To be technical, by "character" I mean "user-perceived character" or (in Unicode speak) "extended grapheme cluster". This is the thing a user will think of as one character when looking at it on their screen.
A code point is the atomic unit of the abstract Unicode encoding. By "abstract" I mean it is not an actual text encoding you can write to a file.
A code unit is the atomic unit of an actual text encoding, such as UTF-8, UTF-16LE or UTF-32LE (and their BE equivalents).
---
So to put it together a "user-perceived character" is made up of one or more "code points". When implemented in an application, each "code point" is encoded using one or more "code units".
> One byte = one "character" makes for much easier programming.
Only if you are naively operating in the Anglosphere / world where the most complex thing you have to handle is larger character sets. In reality, there's ligatures, diacritics, combining characters, RTL, nbsp, locales, and emoji (with skin tones!). Not to mention legacy encoding.
And no, it does not use a "small fraction of memory and storage" in a huge range of applications, to the point where some regions have transcoding proxies still.
This is not about covering ALL of Unicode. This is about starting to cover Unicode.
"Anglosphere" would be just 7(&"8") bit ASCII, and it's the current situation where it takes quite a lot of skill and knowledge just to start learning how to properly deal with Unicode, because it's often not even taught !
IMHO 32-bit bytes would help tremendously with onboarding developers into Unicode, because it would force dumping ASCII-only as the starting point (and sadly, often ending point) for teaching how to deal with text.
And who can blame the teachers, Unicode is already hard enough without even having to deal with the difficulties coming from having to explain its multi-byte representation...
Last but not least: this would have forced standardization between the Unix world now on UTF-8 and the Windows world which is still stuck on UTF-16 (and Windows-1252?!?) for some of the core functions like filenames, which, for instance, still regularly results in issues working with files with non-ASCII filenames.
Not all user-perceived characters can be represented as a single Unicode codepoint. Hence, Unicode text encodings (almost[1]) always have to be treated as variable length, even UTF-32.
[1] at runtime, you could dynamically assign 'virtual' codepoints to grapheme clusters and get a fixed-length encoding for strings that way
Every time I see one of these threads, my gratitude to only do backend grows. Human behavior is too complex, let the webdevs handle UI, and human languages are too complex, not sure what speciality handles that. Give me out of order packets and parsing code that skips a character if the packet length lines up just so any day.
I am thankful that almost all the Unicode text I see is rendered properly now, farewell the little boxes. Good job lots of people.
I think we really have the iPhone jailbreakers to thank for that. U.S. developers were allergic to, almost offended by, anything that wasn't ASCII, and then someone released an app that unlocked the emoji icons that Apple had originally intended only for Japan. Emoji is defined in the astral planes so almost nothing at the time was capable of understanding them, yet they were so irresistible that developers worldwide who would otherwise have done nothing to address their cultural biases immediately fixed everything overnight to have them. So thanks to cartoons, we now have a more inclusive world.
There's supporting Unicode, and 'supporting' Unicode. If you're only dealing with western languages, it's easy to fall into the trap of only 'supporting' Unicode. Proper emoji handling will put things like grapheme clusters and zero-width joiners on your map.
Anyway, it doesn't make much sense to define the size of a “byte” as anything other than 8 bits, because that's the smallest addressable memory unit. If you need a 32 bit data type, just use one!
My very point is that we should have increased the size of the smallest addressable memory unit from 8 to 32 bits, increasing it once again just as earlier computer architectures, which used anywhere from 4 to 7 bits per byte, eventually moved to 8. (There might still be e-mail servers around directly compatible with "non-padded" 7-bit ASCII?)
(You'll also notice that caring about not wasting the 8th bit with ASCII has led us into all sorts of issues... and why care so much about it when, as soon as data density becomes important, we can use compression, which AFAIK easily rids us of padding?)
You're basically arguing against variable width text encodings - which is ok. But you know, it's entirely possible to use UTF32. In fact, some programming languages use it by default to represent strings.
But again and again, all of this has nothing to do with the size of a byte.
BTW, are you aware that 8-bit Microcontrollers are still in widespread use and nowhere near of being discontinued?
Static width text encoding + Unicode = Cannot fit a "character" in a single octet, which currently is the default addressable unit of storage/memory.
Programming microcontrollers isn't considered to be "mandatory computer literacy" in college, while basic scripting, which involves understanding how text is encoded at the storage/memory level - is.
Again and again and again, a byte is not meant to hold a text character. Also, as the sibling parent has pointed out, fixed width encoding only gets you so far because it doesn't help with grapheme clusters. That's probably why the world has basically settled with UTF8: it saves memory and destroys any notion that every abstract text character can somehow be represented by a single number.
> mandatory computer literacy
I don't understand why you keep bringing up this phrase and ignore a huge part of real world computing. College students should simply learn how Unicode works. Are you seriously demanding that CPU designers should change their chip design instead?
Generally, I think you are conflating/confusing the concept of “byte” (= smallest unit of memory) with the concept of “character” or “code unit” (= smallest unit of text encoding). The size of the former depends on the CPU architecture and on modern systems it's always 8 bits. The size of the latter depends on the specific text encoding.
People were holding off on transitioning because pointers use twice as much space on x64. If bytes had quadrupled in size with x64 we would still be using 32-bit software everywhere.
MIDI (8-bit), 16-bit PCM, 24-bit PCM and basically any compressed data format (which is always byte based, because the idea is to save memory). You obviously don't care about memory, but many people do!
To start with, RGB (and I assume Y′CbCr?) can be encoded in many different ways. The most common one today (still) uses 8 bits per channel, meaning that a separate 1-octet value can only define monochrome. Therefore 8-bpc RGB is a 24-bit format, not an 8-bit one.
And, by an interesting coincidence, with the arrival of "HDR", 8 bits per channel is slowly becoming obsolete (because insufficient). The next "step" is 10 bits per channel with 3 channels (hence "HDR10(+)"), and so should fit quite well in 32 bits?
(However, it would seem that even Dolby's Perceptual Quantizer transfer function might need 12 bits per channel to avoid banding over the "HDR" Rec.2020/2100-sized color gamut..?)
We're debating semantics, but if I reshaped an RGB image into component arrays i.e. u8[yn][xn][3] → u8[3][yn][xn] then would you still view that as a 24-bit format? What if those 24-bit values were Huffman or run-length encoded, would it be an n-bit format? If your Y′CbCr luminance plane has a legal range of 16..235 and the chrominance planes are 16..240, then would it be a 23.40892 bit format?
I'm arguing about uncompressed, possibly padded data types that make learning Unicode (or any other applicable data format) easier because of the equivalence: 1 atomic unit ("character", pixel) = 1 smallest addressable unit of memory (byte). This requires the byte size to be at least as large as the atom size.
And it's particularly important to have this property for text, because not only is data overwhelmingly stored as text (in importance, not by "weight"), but computer programs themselves are written as text.
Well, I figured since you feel strongly about using a type wider than 8 bits for RGB, you must have a really good display that actually lets you perceive the colors that it enables you to encode. Most PC displays are garbage, including the expensive ones, because first, sRGB only specifies a very small portion of the light that's perceivable, and second, any display maker who builds something better is going to run into complaints about how terrible Netflix looks, because it reveals things like banding (which you mentioned) that otherwise wouldn't be perceivable. So I was hoping you could recommend me a better monitor, so I can get into >8 bit RGB, because I've found it exceedingly difficult to shop around for this kind of thing.
Ok, so you weren't sarcastic and/or misunderstanding my use of "atomic".
Sadly, I kind of gave up on getting a "HDR" display, at least for now, because:
- AFAIK neither Linux nor Windows have good enough "HDR" support yet. (MacOS supposedly does, but I'm not interested.)
- I'm happy enough with my HP LP2475w which I got for dirt cheap just before "HDR" became a thing. I consider the 1920x1200 resolution to be perfect for now (as a bonus I can manually scale various old resolutions like 800x600 to be pixel-perfect) - too many programs/OSes still have issues with auto-scaling programs on higher resolution screens (which would come with "HDR"). I'm also particularly fond of the 16:10 ratio, which seems to have gone extinct.
- Maybe I'll be able to run this monitor properly in wide gamuts (though with banding), or maybe even in some kind of "HDR compatibility mode", though it would seem that the current sellers of "HDR" screens aren't going to make that easy. I might be able to get a colorimeter soon to properly calibrate it.
If you have a $200 monitor then it probably struggles to make proper use of 8-bit formats. I have a display that claims to simulate DICOM but it's not enough I want more. However I'm not willing to spend $3000 on a display which doesn't have engineering specs and then send it back because it doesn't work. I don't care about resolution. I care about being able to see the unseen. I care about edge cases like yellow and blue making pink. That was the first significant finding Maxwell reported on when he invented RGB. However nearly every monitor ever made mixes those two colors wrong, as gray, due to subpixel layout issues. Nearly every scaling algorithm mixes those two colors wrong too, due to the way sRGB was designed. It's amazing how poorly color is modeled on personal computers. https://justine.lol/maxwell.png
Well, when released in 2008 it was a $600 monitor, I got it second-hand for 80€.
I'm not sure what DICOM has to do with color reproduction quality? Also it seems to be a standard quite a bit older than sRGB...
By definition, you can't "see the unseen". "Yellow" and "blue" are opponent "colors", so a proper mixture of them is, by definition, going to give you grey:
Also, when talking about subtle color effects, you have to consider that personal variation might come into play (for instance red-green "colorblindness" is a spectrum).
It looks like this thing is the thing I want to buy https://www.apple.com/pro-display-xdr/ There's plenty of light that is currently unseeable. Look at the chromaticity chart for sRGB. If your definition of color mixes yellow and blue as grey then you've defined color wrong, because nature has a different definition where it's pink. For example the CIELAB colorspace will mix the two as pink. Also I'm not colorblind. If I'm on the spectrum I would be on the able-to-see-more-color-more-accurately end of the spectrum. Although when designing charts I'm very good at choosing colors that accommodate people who are colorblind, while still looking stylish, because I feel like inclusive technology is important.
Not really. Strings are a list of integers [1]; integers are signed and fill a system word, but there are also 4 bits of type information. So you can have a 28-bit signed integer char on a 32-bit system or a signed 60-bit integer on a 64-bit one.
However, since Unicode is limited to 21 bits by the UTF-16 encoding, a Unicode code point will fit in a small integer.
[1] unless you use binaries, which is often a better choice.
>In the PDP-10 a "byte" is some number of contiguous bits within one word. A byte pointer is a quantity (which occupies a whole word) which describes the location of a byte. There are three parts to the description of a byte: the word (i.e., address) in which the byte occurs, the position of the byte within the word, and the length of the byte.
>POS is the byte position: the number of bits from the right end of the byte to the right end of the word.
>SIZE is the byte size in bits.
>The U field is ignored by the byte instructions.
>The I, X and Y fields are used, just as in an instruction, to compute an effective address which specifies the location of the word containing the byte.
"If you're not playing with 36 bits, you're not playing with a full DEC!" -DIGEX (Doug Humphrey)
I feel like big endian is more _intuitive_ because that's what our number notation has evolved to be.
But more _natural_ is little endian because, well, it's just more straightforward to have the digits' magnitude be in ascending order (2^0, 2^1, 2^2, 2^3...) instead of putting it in reverse.
Plus you encounter fewer roadblocks in practice with little endian (e.g. address changes with casts), which is often a sign of good natural design
I'm curious how you're defining "natural", and if you think ISO-8601 is the reverse of "natural" too.
All human number systems I've ever seen write numbers out as big endian (yes, even Roman numerals), so I'm really struggling to see how that wouldn't be considered natural.
Counting out change is little endian - you usually start with cents, then dollars.
I wonder if we went big endian “by mistake” with Arabic numerals given that Arabic is written right to left.
Some ancient texts have “four and twenty” which is little endian.
We also add commas to large numbers to help with a human processing problem - you have to get to the end of the number to know what the first digit represents and then count backwards (groups of three help).
> The bits start from highest to lowest on a serial connection.
This is only true on a big endian serial connection (that is, one that, tautologically, sends the most-significant bit first). Offhand, I think most serial protocols are big endian, but by that logic, most CPUs are little endian, so that doesn't really help.
The thing that's actually useful about big endian is not that it's natural (as kangalioo points out, that's little endian) or that it's how humans write numbers (by that logic crap like BCD or decimal floats is a good idea), but that big endian preserves lexicographic order of fixed-width integers.
Machines that matter to whom? Maintainers of packages for OSes that support s390x (RHEL/Fedora, SUSE, Debian), Arduino AVR users, AIX users, embedded systems people with, say, Coldfire or some DSP? I don't know to what extent portable code is relevant to embedded systems, but I guess they care about cryptography libraries, for instance. (I know it's bi-endian in principle, but SPARC is still even in the Top500.)
In my experience IBM does the right thing and sends patches rather than asking us to fix their problems for them, and I respect them for that reason, even if it's a tiny burden to review those changes.
However endianness isn't just about supporting IBM. Modern compilers will literally break your code if you alias memory using a type wider than char. It's illegal per the standard. In the past compilers would simply not care and say, oh the architecture permits unaligned reads so we'll just let you do that. Not anymore. Modern GCC and Clang force your code to conform to the abstract standard definition rather than the local architecture definition.
It's also worth noting that people think x86 architecture permits unaligned reads but that's not entirely true. For example, you can't do unaligned read-ahead on C strings, because in extremely rare cases you might cross a page boundary that isn't defined and trigger a segfault.
> It's also worth noting that people think x86 architecture permits unaligned reads but that's not entirely true. For example, you can't do unaligned read-ahead on C strings, because in extremely rare cases you might cross a page boundary that isn't defined and trigger a segfault.
But that's not a problem with an unaligned read but rather that you are reading more than you are allowed to. And in C even an aligned readahead is UB.
A better example might be SSE instructions which do have aligned variants that trap on unaligned pointers.
Yes, IBM provided asm for s390 hton/ntoh, and "all we had to do" for mainframe Linux was patch x86-only packages to use hton/ntoh when they persisted binary data. For the kernel IBM did it on their own, contributing mainline; for userland SUSE did it, grabbing some patches from the Japanese TurboLinux, and then Red Hat grabbed the patches from Turbo and SUSE, and together we got them mainline, lol. And PPC then just piggybacked on top of that effort.
> So the solution is simple right? Let's just use unsigned char instead. Sadly no. Because unsigned char in C expressions gets type promoted to the signed type int.
If you do use unsigned char, an alternative to masking would be performing the cast to uint32_t before instead of after the shift.
edit: For reference, this is what it would look like when implemented as a function instead of a macro:
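A minimal sketch of that approach (the function name is just illustrative; the key point is casting each unsigned char to uint32_t before the shift):

#include <stdint.h>

static inline uint32_t read_be32(const unsigned char *b) {
    /* cast before shifting, so no operand is promoted to a signed int
       with the high bit set */
    return ((uint32_t)b[0] << 24) |
           ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8) |
           ((uint32_t)b[3]);
}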
In case anyone else wonders how the code in the linked tweet [0] would format your hard drive, it's the missing return on f1. Therefore, f1 is empty as well (no ret) and calling it will result in f2 being run. The commented out code is irrelevant.
EDIT: Reading the bug report [1], the actual cause for the missing ret is that the for loop will overflow, which is UB and causes clang to not emit any code for the function.
The first example in the article is flawed (or at least misleading).
1) They define a char array (which defaults to signed char, as mentioned in the post), including the value 0x80 which can't be represented in char, resulting in a compiler warning (e.g. in GCC 11.1).
The mentioned reason against using unsigned char (that shifting 128 left by 24 places results in UB) is also misleading: I could not reproduce the UB when changing the array to unsigned char. Perhaps the author meant leaving the array defined as signed char, but casting the signed chars to unsigned before shifting. That indeed results in UB, but I don't see why you would define the array as signed in the first place.
2) The cause for the undefined behavior isn't the bswap_32; rather, it's that they try reading a uint32_t value from a char array, where b[0] is not aligned on a word boundary.
There is no need at all to redefine bswap. The simple solution would be to use an unsigned char array instead of a char array and just read the values byte-wise.
Of course C has its footguns and warts and so on, but there is no need to dramatize it this much in my opinion.
Edit: To add to point 2) above: Another way to avoid the UB (in this specific case) would be to add __attribute__ ((aligned (4))) to the definition of b. In that case, even reading the array as a single uint32_t works as expected since the access is aligned to a word boundary.
Obviously, you can't expect any random (unsigned char) pointer to be aligned on a word boundary. Therefore, it is still necessary to read the uint32_t byte by byte.
> The mentioned reason against using unsigned char (that shifting 128 left by 24 places results in UB) is also misleading
No, that reasoning is correct. Integer promotions are performed on the operands of a shift expression, meaning the left operand will be promoted to signed int even if it starts out as unsigned char. Trying to shift a byte value with the highest bit set left by 24 will result in a value not representable as signed int, leading to UB.
Thanks, I just noticed a small mistake in my example (I don't trigger the UB because I access b[0] containing 0x80 without shifting, however I meant to do it the other way around).
Still, adding an explicit cast to the left operand seems to be enough to avoid this, e.g.:
uint32_t x = ((uint32_t)b[0]) << 24;
In summary, I think my point that using unsigned char would be appropriate in this case still stands.
Why is it an issue any more than, say, the order of fields in a struct is an issue? In one case you read bytes off the disk by doing ((b[0] << 8) | b[1]) (or equivalent); in the other case, the same thing with the order reversed. Any application-level program (say, not a compiler, debugger, etc.) should not even need to know the native byte order; it should only need to know the encoding that the file it's trying to read used.
The good thing is that Big Endian is pretty much irrelevant these days.
Of all the historically Big Endian architectures, s390x is indeed the only one left that has not switched to little endian.
Also, this might be irrelevant at the CPU level, but within a byte, bits are usually displayed most significant bit first, so with little endian you end up with bit order:
7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8
instead of
15 to 0
This is because little endian is not how humans write numbers. For consistency with little endianness we would have to switch to writing "one hundred and twenty three" as "321".
Correct me if I'm wrong, but were the now common numbers not imported in the same order from Arabic, which writes right to left? So numbers were invented in little endian, and we just forgot to translate their order.
Good question, I just did a little digging to see if I could find out. It sounds like old Arabic did indeed use little endian in writing and speaking, but modern Arabic does not. However, place values weren’t invented in Arabic, Wikipedia says that occurred in Mesopotamia, which spoke primarily Sumerian and was written in Cuneiform - where the direction was left to right.
It might not be how humans write numbers but it is consistent with how we think about numbers in a base system.
123 = 3x10^0 + 2x10^1 + 1x10^2
So if you were to go and label each digit in 123 with the power of 10 it represents, you end up with little endian ordering (eg the 3 has index 0 and the 1 has index 2). This is why little endian has always made more sense to me, personally.
I always think about values in big endian, largest digit first. Scientific notation, for example, since often we only care about the first few digits.
I sometimes think about arithmetic in little endian, since addition always starts with the least significant digit, due to the right-to-left dependency of carrying.
Except lately I’ve been doing large additions big-endian style left-to-right, allowing intermediate “digits” with a value greater than 9, and doing the carry pass separately after the digit addition pass. It feels easier to me to think about addition this way, even though it’s a less efficient notation.
Long division and modulus are also big-endian operations. My favorite CS trick was learning how you can compute any arbitrarily sized number mod 7 in your head as fast as people are reading the digits of the number, from left to right. If you did it little-endian you’d have to remember the entire number, but in big endian you can forget each digit as soon as you use it.
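A minimal sketch of that trick in C (the helper name is made up): fold in one decimal digit at a time and keep only the running remainder, so each digit can be forgotten as soon as it is used.

#include <stdio.h>

/* Left-to-right (big endian) modulus: process digits as they arrive. */
static int mod7_of_decimal_string(const char *digits) {
    int r = 0;
    for (const char *p = digits; *p; p++)
        r = (r * 10 + (*p - '0')) % 7;
    return r;
}

int main(void) {
    printf("%d\n", mod7_of_decimal_string("123456789")); /* prints 1 */
    return 0;
}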
I don't know, when we write in general, we tend to write the most significant stuff first so you lose less information if you stop early. Even numbers we truncate: "twelve million" instead of something like "twelve million, zero thousand, zero hundred and zero".
Next you are going to want little endian polynomials, and that is just too far. Also, the advantage of big endian is it naturally extends to decimals/negative exponents where the later on things are less important. X squared plus x plus three minus one over x plus one over x squared etc.
Loss of big endian chips saddens me like the loss of underscores in var names in Go Lang. The homogeneity is worth something, thanks intel and camelCase, but the old order that passes away and is no more had the beauty of a new world.
In German _ein hundert drei und zwanzig_, literally _one hundred three and twenty_. The hardest part is telephone numbers, which are usually given in blocks of two digits.
Well that would be hard for me to learn. I always find the small numbers between like 10 and 100 or 1000 the hardest for me to remember in languages I am trying to learn a bit of.
The only benefit to big endian is that it's easier for humans to read in a hex dump. Little endian on the other hand has many tricks available to it for building encoding schemes that are efficient on the decoder side.
Could you elaborate on these tricks? This sounds interesting.
The only thing I'm aware of that's neat in little endian is that if you want the low byte (or word or whatever suffix) of a number stored at address a, then you can simply read a byte from exactly that address. Even if you don't know the size of the original number.
- Long addition is possible across very large integers by just adding the bytes and keeping track of the carry.
- Encoding variable sized integers is possible through an easy algorithm: set aside space in the encoded data for the size, then encode the low bits of the value, shift, repeat until the value is 0. When done, store the number of bytes you wrote to the earlier length field. The length calculation comes for free (see the sketch after this list).
- Decoding unaligned bits into big integers is easy because you just store the leftover bits in the next value of the bigint array and keep going. With big endian, you're going high bits to low bits, so once you pass to more than one element in the bigint array, you have to start shifting across multiple elements for every piece you decode from then on.
- Storing bit-encoded length fields into structs becomes trivial since it's always in the low bit, and you can just incrementally build the value low-to-high using the previously decoded length field. Super easy and quick decoding, without having to prepare specific sized destinations.
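A rough sketch of the second item above (the function name and the one-byte length prefix are assumptions for illustration, not a real format):

#include <stdint.h>
#include <stddef.h>

/* Reserve a length byte, emit the value low byte first, shifting right
   until nothing is left, then backfill the byte count. */
static size_t encode_le_varint(uint8_t *out, uint64_t value) {
    size_t n = 1;                            /* out[0] is reserved for the length */
    do {
        out[n++] = (uint8_t)(value & 0xff);  /* low bits first */
        value >>= 8;
    } while (value != 0);
    out[0] = (uint8_t)(n - 1);               /* length comes for free */
    return n;                                /* total bytes written */
}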
Blame the people who failed to localize the right-to-left convention when Arabic numerals were adopted. It's one of those things like pi vs. tau or Jacobin weights and measures vs. Planck units. Tradition isn't always correct. John von Neumann understood that when he designed modern architecture, and muh hex dump is not an argument.
Even if all CPUs were little-endian, big-endian would exist almost everywhere except CPUs, including in your head. Unless you're some odd person that actually thinks in little-endian.
I don't think it's a fuck up, rather I think it was unavoidable: Both ways are equally valid and when the time came to make the decision, some people decided one way, some people decided the other way.
Big and little endian are named after the never-ending "holy" war in Gulliver's Travels over how to open eggs. So we were always of the opinion that it doesn't really matter. But I open my eggs on the little end.
Big Endian of course :-) However the one which has won is Little Endian. Even IBM admitted this when it switched the default in POWER 7 to little endian. s390x is the only significant architecture that is still big endian.
Little endian has the advantage that you can read the low bits of data without having to adjust the address. So you can for example do long addition in memory order rather than having to go backwards, or (with an appropriate representation such as ULEB128) in one pass without knowing the size.
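A small illustration of the "no address adjustment" point (assumes a little endian host; reading through unsigned char is permitted by the aliasing rules):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t x = 0x11223344;
    /* The low byte sits at the lowest address, so a narrower read uses
       the same pointer. */
    unsigned char low = *(const unsigned char *)&x;
    printf("0x%02x\n", low);  /* prints 0x44 on a little endian machine */
    return 0;
}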
Maybe I am biased working on mainframes, but I would personally take big endian over little endian. The reason is when reading a hex dump, I can easily read the binary integers from left to right.
But for example bitmaps in BE are a huge source of bugs, as readers and writers need to agree on the size to use for memory operations.
"SIMD in a word" (e.g. doing strlen or strcmp with 32- or 64-bit memory accesses) might have mostly fallen out of fashion these days, but it's also more efficient in LE.
Big endian is easier for humans to read when looking at a memory dump, but little endian has many useful features in binary encoding schemes due to the low byte being first.
I used to like big endian more, but after deep investigation I now prefer little endian for any encoding schemes.
I think the fundamental problem is that if you start a computation using the N most significant bits and then incrementally add more bits, e.g. N+M bits total, then your first N bits might change as a result.
E.g. decimal example:
1.00/1.00 = 1.00
1.000/1.001 = 0.999000999000...
(adding one more bit changes the first bits of the outcome)
You can put emphasis on high order bits, but that makes decoding more complex. With little endian the decoder builds low to high, which is MUCH easier to deal with, especially on spillover.
For example, with ULEB128 [1], you just read 7 bits at a time, going higher and higher up the value you're reconstituting. If the value grows too big and you need to spill over to the next (such as with big integer implementations), you just fill the last bits of the old value, then put the remainder bits in the next value and continue on.
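A minimal ULEB128 decoder along those lines (a sketch with no overflow or truncation checks; it assumes the value fits in 64 bits):

#include <stdint.h>
#include <stddef.h>

static uint64_t uleb128_decode(const uint8_t *in, size_t *consumed) {
    uint64_t value = 0;
    unsigned shift = 0;
    size_t i = 0;
    uint8_t byte;
    do {
        byte = in[i++];
        value |= (uint64_t)(byte & 0x7f) << shift;  /* low 7-bit group first */
        shift += 7;
    } while (byte & 0x80);                          /* high bit set = more bytes follow */
    if (consumed) *consumed = i;
    return value;
}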
With a big endian encoding method (i.e. VLQ used in MIDI format), you start from the high bits and work your way down, which is fine until your value spills over. Because you only have the high bits decoded at the time of the spillover, you now have to start shifting bits along each of your already decoded big integer portions until you finally decode the lowest bit. This of course gets progressively slower as the bits and your big integer portions pile up.
Encoding is easier too, since you don't need to check if for example a uint64 integer value can be encoded in 1, 2, 3, 4, 5, 6, 7 or 8 bytes. Just encode the low 8 bits, shift the source right by 8, repeat, until the source value is 0. Then backtrack to the as-yet-blank encoded length field in your message and stuff in how many bytes you encoded. You just got the length calculation for free. Use a scheme where you only encode up to 60-bit values, place the length field in the low 4 bits, and Robert's your father's brother!
For data that is right-heavy (i.e. the fully formed data always has real data on the right side and blank filler on the left - such as uint32 value 8 is actually 0x00000008), you want a little endian scheme. For data that is left-heavy, you want a big endian scheme. Since most of the data we deal with is right-heavy, little endian is the way to go.
You can see how this has influenced my encoding design in [2] [3] [4].
The greatest of all is Lisp not being the most mainstream language, and we can only blame the Lisp companies for this fiasco. In an ideal world we would all be using a Lisp with parametric polymorphism, from the highest level abstractions to the machine level, all in one language.
A while back I was on a project to port a satellite simulator from SPARC/Solaris to RHEL/x64. The compressed telemetry stream that came from the satellite needed to be in big endian (and that's what the ground station software expected), and the simulator needed to mimic the behavior.
This was not a problem for the old SPARC system, which naturally put everything in the correct order without any fuss, but one of the biggest sticking points in porting over to x64 was having to now manually pack all of that binary data. Using Ada (what else!), of course.
If memory serves correctly, Ada 2012 and beyond has language-level support for this. I was working on porting some code from an aviation platform to run on PC and it was all in Ada 2005, so we didn't have the benefit of that available.
Same here, Ada2005 for the port. The simulator was originally written in Ada95. Part of what made it even less fun was the data was highly packed and individual fields crossed byte boundaries (these 5 bits are X, the next 4 bits are Y, etc.) :(
Couldn't you add the Bit_Order and Scalar_Storage_Order attributes (or aspects in Ada 2012) to your records/arrays? Or did Scalar_Storage_Order not exist at the time?
UBSan should default to on. If people don't like it, then they should be made to turn it off with a switch, so at least it's more likely to be run than not run. It could save a huge amount of time debugging when compilers or architectures change. Without it, I'd say many a programmer would be caught by these subtleties in the standard. Coming from a HW background (Verilog) I'd more naturally default to masking and shifting when building up larger variables from smaller ones, but I can imagine many would not.
There was a blog post and a FOSDEM presentation by (misguided) Gentoo developers a few years ago, and it was retracted, because sanitizers add their own exploitable vulnerabilities due to the way they work.
Sorry for my ignorance, but surely some UB being used for optimization by the compiler is compile time only. This is the part that should default to on. Runtime detection is a different thing entirely, but compile time is a no-brainer.
UBSAN detects undefined behavior at run-time. Compile-time detection of undefined behavior is present in the form of compiler warnings, but catches far from all cases of undefined behavior. The compiler does not actively exploit undefined behavior in the sense that it does not contain code like this:
if (undefined_behavior) break_program()
If it did, it could easily report the undefined behavior. However, that's not how it works. Instead, the compiler has optimization rules that are only valid if the code contains no undefined behavior. If the code contains undefined behavior, the optimization rules change the result of the program. For example, this code:
bool function(int x) {
return x + 1 > x;
}
Can be optimized to "return true". That is correct if x does not overflow, but if x overflows and wraps around the optimization changes the result of the program. In this case, it is acceptable according to the C/C++ standards for the optimization to assume x does not overflow, and hence this optimization is valid.
The compiler could tell you for every instance of signed integer arithmetic that it is making assumptions about your program, and that the signed integer arithmetic could potentially overflow, but that doesn't seem particularly helpful.
Thanks, though I'm not sure all compile-time-detectable undefined behaviours are exposed through warnings today. In the example in the article, why would something like a left shift of a negative value require runtime detection? Surely the fact that a signed char was used with a left shift is all the compiler needs. So perhaps the subset of UB that is detectable at compile time should be reported.
In your example about the comparison of x + 1 vs x, I'm not sure that is a controversial optimization. However this one, to me, is:
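Something along these lines, as a hypothetical sketch of the pattern (not the exact snippet; the names are made up):

struct packet { int len; };

int read_len(struct packet *p) {
    int len = p->len;   /* dereference happens first, so the compiler may assume p != NULL */
    if (p == NULL)      /* ...and this defensive check can then be deleted as dead code */
        return -1;
    return len;
}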
Here a diligent programmer is trying to do a null pointer check, but because dereferencing null is UB, the optimizer removes the null pointer check. This is compile-time UB that should be flagged to users.
Sanitizers have the ability to bring Rust-like safety assurances to all the C/C++ code that exists. The fact that existing ASAN runtimes weren't designed for setuid binaries shouldn't dissuade us from pursuing those benefits. We just need a production-worthy runtime that does fewer things. For example, here's the ASAN runtime that's used for the redbean web server: https://github.com/jart/cosmopolitan/blob/master/libc/intrin...
Run-time detection and heuristics on a language that is hard to analyze (e.g. due to weak aliasing, useless const, ad-hoc ownership and thread-safety rules) aren't in the same ballpark as compile-time safety guaranteed by construction, and an entire modern ecosystem centered around safety. Rust can use LLVM sanitizers in addition to its own checks, so that's not even a trade-off.
Oh I believe you but as you point out we need ASAN to make Rust codebases safer too. One of the things that's helped Rust be successful is that we're able to quickly write bindings for legacy C/C++/FORTRAN code using the unsafe keyword. The last Rust codebase I worked on had about 70k unsafe lines. One day Rust will be complete and we will rewrite all the legacy code but until then we depend on the low level C tooling to provide assurances like byte-granular invalid address access trapping.
> Could save a huge amount of time debugging when compilers or architecture changes.
I'm assuming we come from very different backgrounds, but it's not clear to me how switching compilers or architectures is so common that hardening code against it by default is appropriate. I would think that switching compilers or architectures is generally done very deliberately, so instrumenting code with UBsan for that transition would be the right thing to do?
Changing the gcc version could cause your code with undefined behaviour to change. If you rely on UB, whether you know you are or not, you are in for a bad time. UBSan at least lets you know if your code is robust, or a ticking time bomb...
Changing compilers is a pretty regular thing IMHO; I use the compiler that comes with the OS and let's assume a yearly OS release cycle. Most of those will contain at least some changes to the compiler.
I don't really want to have to take that yearly update to go through and review (and presumably fix) all the UB that has managed to sneak in over the year. It would be better to have avoided putting it in.
If you can assume GCC or Clang then __builtin_bswap{16,32,64} functions are provided which will be considerably more efficient, less error-prone, and easier to use than anything you can homebrew.
Well, yes. The only thing missing is knowing if you have to swap or not, if you don't want to assume your code will run on little endian systems exclusively.
Or, on Linux and BSD systems at least, you can use the <endian.h> or <sys/endian.h> functions (https://linux.die.net/man/3/endian) and rely on the libc implementation to do the system/compiler detection for you and use an appropriate compiler builtin inside of an inline function instead of bothering to hack something together in your own code.
The article mentions those functions at the bottom, but strangely still recommends hacking up your own macros.
But then you have to #ifdef the endianness of the target architecture. If you do it the right way as Russ Cox and Justine Tunney say, then your code can serialize and deserialize correctly regardless of the platform endianness.
That's not true. If you write the byte swap in ANSI C using the gigantic mask+shift expression it'll optimize down to the bswap instruction under both GCC and Clang, as the blog post points out.
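For reference, a plain C byte swap in that style (GCC and Clang at optimization levels like -O2 typically recognize this pattern and emit a single bswap/rev instruction):

#include <stdint.h>

static inline uint32_t bswap32_portable(uint32_t x) {
    return ((x & 0x000000ffu) << 24) |
           ((x & 0x0000ff00u) << 8) |
           ((x & 0x00ff0000u) >> 8) |
           ((x & 0xff000000u) >> 24);
}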
Assuming the macros or your giant expression are correct. But you might as well use the compiler intrinsics which you know are both correct and the most efficient possible, and get on with your life.
Sorry I'd rather place my faith in arithmetic rather than someone's API provided the compiler is smart enough to understand the arithmetic and optimize accordingly.
"Someone" here is the same compiler you're trusting to optimize your giant arithmetic expression of the same idea. Your statement is internally inconsistent.
There is value in keeping completely clear in your head the difference between a value with arithmetic semantics and a value with octets-in-a-stream semantics. That thinking will work in all contexts, while the compiler knowledge is limited. The thinking will help you write correct ways to encode data in the URL or into a file being uploaded that your code generates for Discord or whatever, in Python, without knowledge of the true endianness of the system the code is running on.
The article explicitly shows that the provided macros are very efficient with a modern compiler. You can check on godbolt.org that they emit the same code.
Though the article only mentions bswap64 and mentioning __builtin_bswap64 would be a nice addition.
This problem is its own special horror in CAN bus data. Between endianness and sign it's a nightmare of en/decoding possibilities and the associated mistakes that come with that.
TIFF is another one. The only endian-switchable image format that I'm aware of.
Fun fact: CD-ROM superblocks have both-endian fields. Each integer is stored twice, in big and little endian format. I assume this was to allow underpowered 80s hardware which didn't have enough resources to do byte swapping.
> If you program in C long enough, stuff like this becomes second nature, and it starts to almost feel inappropriate to even have macros like the above, since it might be more appropriately inlined into the specific code. Since there have simply been too many APIs introduced over the years for solving this problem. To name a few for 32-bit byte swapping alone: bswap_32, htobe32, htole32, be32toh, le32toh, ntohl, and htonl which all have pretty much the same meaning.
> Now you don't need to use those APIs because you know the secret.
This sentiment seems problematic. The solution shouldn't be "we just have to educate the masses of C programmers on how to properly deal with endianness". That will never happen.
The solution should be "It's in the standard library. Go look there and don't think too hard." C is sufficiently low-level, and endianness problems sufficiently common, that I would expect that kind of routine to be available.
The point is that keeping the distinction clear in your head between numeric semantics and sequence-of-octets semantics makes the problem universally tractable. You have a data structure with a numeric value. Here you have a sequence of octets described by some protocol formalism, BNF in the old days. The mapping from one to the other occurs in the math between octets and numeric values and the various network protocols for representing numbers. There are many more choices than just big endian or little endian. Could be ASN infinite precision ints. Could be 32-bit IEEE floats or 64-bit IEEE floats. The distinction is universal between language semantics and external representations.
This is why people that memcpy structs right into the buf get such derision, even if it’s faster and written for a mono-Implementation of a language semantics. It is sloppy thought made manifest.
Of course, the canonical work on this subject is Danny Cohen's On Holy Wars And A Plea For Peace [0]. It's an informative and highly readable article. My favorite quote, from the conclusion, is:
The "Be reasonable, do it my way" approach does not work. Neither does the Esperanto approach of "let's all switch to yet a new language".
His bottom line conclusion being
It is more important to agree upon an order than which order is agreed upon.
In her first sentence, the phrase “the C / C++ programming language” is no longer correct: C++20 requires two’s complement signed integers.
C++ 20 is quite new so I would assume that very few people know this yet.
C and C++ obviously differ a lot, but by that phrase she clearly means “the part where the two languages overlap”. The C++ committee has been willing to break C compatibility in a few ways (not every valid C program is a valid C++ program), and this has been true for a while.
"the c/c++ language" exists insofar as you can import this c code into your c++, and this is something that c++ programmers need to know how to do, so they'd better learn enough of the differences between c and c++ or they'll be stumped when they crack open somebody else's old code.
I haven’t seen a one’s complement machine in decades, but at the time C was standardized there were still quite a few (afaik none had a single-chip CPU, to get to your question). But since they existed, the language definition didn’t require two's complement and some optimizations were technically UB.
The C++ committee decided that everyone had figured this out by now and so made this breaking change.
Of course it gets a bit hairier if the code is also supposed to run on other systems.
MacOS has OSSwapHostToLittleIntXX, OSSwapLittleToHostIntXX, OSSwapHostToBigIntXX and OSSwapBigToHostIntXX in <libkern/OSByteOrder.h>.
I'm not sure if Windows has something similar, or if it even supports running on big endian machines (if you know, please tell).
My solution for achieving some portability currently entails cobbling together a "compat.h" header that defines macros for the MacOS functions and including the right headers. Something like this:
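(A sketch covering only a Linux/glibc and macOS split; other platforms would need their own branches, and only the 32-bit macros are shown.)

#ifndef COMPAT_ENDIAN_H
#define COMPAT_ENDIAN_H

#if defined(__APPLE__)
#  include <libkern/OSByteOrder.h>
#  define htobe32(x) OSSwapHostToBigInt32(x)
#  define be32toh(x) OSSwapBigToHostInt32(x)
#  define htole32(x) OSSwapHostToLittleInt32(x)
#  define le32toh(x) OSSwapLittleToHostInt32(x)
#else
#  include <endian.h>   /* glibc; the BSDs use <sys/endian.h> */
#endif

#endif /* COMPAT_ENDIAN_H */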
This is usually my go-to-solution for working with low level on-disk or on-the-wire binary data structures that demand a specific endianness. In C I use "load/store" style functions that memcpy the data from a buffer into a struct instance and do the endian swapping (or reverse for the store). The copying is also necessary because the struct in the buffer may not have proper alignment.
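A sketch of such a "load" helper (the struct and its fields are made up for the example, and the wire format is assumed to be big endian):

#include <stdint.h>
#include <string.h>
#include <endian.h>   /* glibc */

struct wire_header { uint32_t magic; uint16_t version; uint16_t flags; };

static void load_wire_header(struct wire_header *dst, const void *buf) {
    memcpy(dst, buf, sizeof *dst);        /* the copy handles an unaligned source */
    dst->magic   = be32toh(dst->magic);   /* then fix endianness field by field */
    dst->version = be16toh(dst->version);
    dst->flags   = be16toh(dst->flags);
}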
Technically, the giant macro of doom in the article takes care of all of this as well. But unlike the article, I would very much not recommend hacking up your own stuff if there are systems libraries readily available that take care of doing the same thing in an efficient manner.
In C++ code, all of this can of course be neatly stowed away in a special class with overloaded operators that transparently takes care of everything and "decays" into a single integer and exactly the above code after compilation, but is IMO somewhat cleaner to read and adds much needed type safety.
Indeed, I don't get the article. It's like writing "C is hard because here is how hard it is to implement memcpy using SIMD correctly."
Please don't do that. Use battle-tested low-level routines. Unless your USP is "our software swaps bytes faster than the competition", you should not spend brain power on that.
Windows/MSVC has _byteswap_ushort(), _byteswap_ulong(), _byteswap_uint64(). (note that unsigned long is 32 bits on Windows) It's ugly but it works.
Boost provides boost::endian which allows converting between native and big or little, which just does the right thing on all architectures and compilers and compiles down to a no-op or a bswap instruction. It's much better than writing (and testing!) your own giant pile of macros and ifdefs to detect the compiler/architecture/OS, include the correct includes, and perform the correct conversions in the correct places.
Itanium can be configured to run in either endianness (it's "bi-endian"). Windows on Itanium always ran in little-endian mode and did not support big-endian mode. The same was true of PowerPC. Windows never ran in big-endian mode on any architecture.
Or just cast the pointer to uint##_t and use be##toh and htobe## from <endian.h>? I think this is making a mountain out of a molehill. I've spent tons of time doing wire (de)serialization in C for network protocols and endian swaps are far from the most pressing issue I see. The big problem imo is the unsafe practices around buffer handling allowing buffer overruns.
Historical and obscure machines aside, there are a few things modern C++ code should take for granted, because even new systems will probably not bother breaking them: Text is encoded in UTF-8. Negative integers are two's complement. Float is 32-bit IEEE 754, double and long double are 64-bit IEEE 754. Char is 8 bit, short is 16 bit, int is 32 bit, long long is 64 bit.
I start with unsigned char to begin with (well `uint8_t` to be precise, which has the advantage of not compiling at all if you happen to use a DSP that uses 32-bit chars). Then I convert those chars to unsigned 32-bit integers. Only then do I shift them. There is no need to mask anything here.
Note that modern compilers translate this whole thing into a single unaligned load operation. Even better, I've noticed that using a macro instead of a function tends to make performance worse with modern compilers.
If this is for deserialisation then it's okay for x[0] to be signed. You just need to recast the result as int32_t (or simply assign to an int32_t variable without any cast) and it is not UB.
I agree that in an ideal world we should just write load code using byte loads and shifts. But in the world we live in, compilers only got the ability to recognize that and emit a bswap instead in relatively recent [0] versions (compared to the age of C). And the recognition can still depend on the exact pattern used. Also, debug builds will still emit the whole shift mess, which in some cases can be annoying.
Isn't the 'modern' solution to memcpy into a temp and swap the bytes in that? C++ has added/will add std::launder and std::bless to deal with this issue
> Isn't the 'modern' solution to memcpy into a temp and swap the bytes in that?
Or just use the endian.h / sys/endian.h routines, which do the right thing (be32dec / be32enc / whatever). memcpy+swap is fine, and easier to get right than the author's giant expressions, but you might as well use the named routines that do exactly what you want already.
No, it is to read a byte at a time and turn it into the semantic value for the data structure you are filling in. Like read 128 and then 1 and set the variable to 32769. If you are the author of protobufs then you may run profiling and write the best assembly etc., but otherwise no, don't do it.
That huge macro appears to be wrong, as there are little endian PowerPC systems, where the __ppc__ and __powerpc__ macros are also defined, making the outcome of the detection invalid.
There are no current middle-endian systems but they used to exist. The PDP-11 is the most famous one. The macros would work on all systems, but as only very old systems are middle-endian, they also have old compilers so may not be able to optimise it as well.
I for one would go for big-endian, simply because reading memory dumps and byte blocks in assembly or elsewhere works without mental byte-swapping arithmetic for multi-byte entities.
Just out of curiosity, I would be interested in learning why so many CPUs today are little-endian. Is it because it is cheaper / more efficient for processor implementations or is it because “the others do it, so we do it the same way”?
It simplifies certain instructions internally. Practically everything is little endian because x86 won.
> And if you think about a serial machine, you have to process all the addresses and data one-bit at a time, and the rational way to do that is: low-bit to high-bit because that’s the way that carry would propagate. So it means that [in] the jump instruction itself, the way the 14-bit address would be put in a serial machine is bit-backwards, as you look at it, because that’s the way you’d want to process it. Well, we were gonna built a byte-parallel machine, not bit-serial and our compromise (in the spirit of the customer and just for him), we put the bytes in backwards. We put the low- byte [first] and then the high-byte. This has since been dubbed “Little Endian” format and it’s sort of contrary to what you’d think would be natural. Well, we did it for Datapoint. As you’ll see, they never did use the [8008] chip and so it was in some sense “a mistake”, but that [Little Endian format] has lived on to the 8080 and 8086 and [is] one of the marks of this family.
I suspect it does exist somewhere, as a system that had words as the main addressable unit of memory but also allowed byte addressing could have little endian double words but big endian ordering inside the words (as bytes).
Why would one choose the memory representation of the number based on the advantages of the internal ALU wiring?
Of all those reasons, the only one I can make sense of is the "I can’t transparently widen fields after the fact!", and that one is way too niche to explain anything.
I don’t understand? Why not make the memory representation sympathetic with the operations you’re going to do on it? It’s the raison d’être of computers to compute and to do it fast.
Another example: memory representation of pixels in GPUs which are swizzled to make computations efficient
> I don’t understand? Why not make the memory representation sympathetic with the operations you’re going to do on it?
There's no reason to, as there's no reason not to. It's basically irrelevant.
If carry passing is so important, why can't you just mirror your transistors and operate on the same wires, but in the opposite order? Well, you can, and it's trivial. (And, by the way, carry passing isn't important. High performance ALUs pass the carry only through blocks, which can appear anywhere. And the wiring of those isn't even planar, so how you arrange them isn't a showstopper.)
Hint: The reason why it's called "endianness" comes from the novel Gulliver's Travels, in which the neighboring nations of Lilliput and Blefuscu went to bitter, bloody war over which end to break your eggs from: the big end or the little end. The warring factions were also known as Big-Endians and Little-Endians, and each thought themselves superior to the dirty heathens on the other side. If one side were objectively correct, if there were an inherent advantage to breaking your egg from one side or the other, would there be a war at all?
> if there were an inherent advantage to breaking your egg from one side or the other, would there be a war at all?
Fascism vs. not-fascism, Stalinist Communism vs. Western Capitalism, Islamism vs. liberal democracy... I’m not sure “the existence of war around a divide in ideas proves that neither sides ideas are correct” is a particularly comfortable maxim to consider the ramifications of.
Two similar societies warring over a trivial idea probably means neither is right. Swift's Big Endians and Little Endians are a satire of the Catholic-Anglican schism in England.
> Two similar societies warring over a trivial idea probably means neither is right.
Well, sure, that it’s a trivial idea pretty much inherently means either that neither is right or (and this is very much not an exclusive or) being right doesn’t matter.
The problem with real cases is that people inside the conflict don’t believe the idea is trivial (conversely, to people outside the conflict—or caught in the middle—even the conflicts we think of as about foundational ideas seem like trivial or irrelevant differences.)
Not sure what you think the source code "says". I mean, I know what you want it to mean, but just because integer wrapping is intuitive to you doesn't imply that that is what the code means. C++ abstract machine and all.
But to answer the actual question: For C++20, integer types were revisited. It is now (finally) guaranteed that signed integers are two's complement, along with a list of other changes. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p090... also for how the committee voted on the individual issues.
Note in particular:
> The main change between [P0907r0] and the subsequent revision is to maintain undefined behavior when signed integer overflow occurs, instead of defining wrapping behavior. This direction was motivated by:
> - Performance concerns, whereby defining the behavior prevents optimizers from assuming that overflow never occurs;
> - Implementation leeway for tools such as sanitizers;
> - Data from Google suggesting that over 90% of all overflow is a bug, and defining wrapping behavior would not have solved the bug.
So yes, the committee very recently revisited this specific issue, and re-affirmed that signed integer overflow should be UB.
I've never been very satisfied with these approaches for C where you hope the compiler does the right thing. It makes sense to provide some C implementation for portability's sake but any sizeable reordering cries out for a handtuned, processor specific, approach (and the non-sizeable probably doesn't require high speed). I would expect any SIMD instruction set to include a shuffle.
It can also be a good idea to swap recursively. First swap the upper and lower half, then swap the upper and lower quarters (bytes, for a 32-bit value), which can be done with only 2 masks. Then if it's a 64-bit value, swap alternate bytes, again with only 2 masks. This can be extended all the way to a full bit reverse in 3 more lines, each with 2 masks and shifts.
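For a 32-bit value, a sketch of that recursive swap (swap the halves, then swap the bytes within each half using two masks) might look like this:

#include <stdint.h>

static inline uint32_t bswap32_recursive(uint32_t x) {
    x = (x << 16) | (x >> 16);                               /* swap the 16-bit halves */
    x = ((x & 0x00ff00ffu) << 8) | ((x & 0xff00ff00u) >> 8); /* swap bytes within each half */
    return x;
}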