The Byte Order Fiasco (justine.lol)
319 points by cassepipe on May 8, 2021 | 364 comments



It is a ridiculous feature of modern C that you have to write the super verbose "mask and shift" code, which then gets compiled to a simple `mov` and maybe a `bswap`. Whereas the direct equivalent in C, an assignment with a (type-changing) cast, is illegal. There is a huge mismatch between the assumptions of the C spec and actual machine code.
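
Concretely, the code in question looks something like this (a sketch of the mask-then-shift idiom; recent GCC and Clang typically recognize the pattern and emit a single load plus a bswap):

    #include <stdint.h>

    /* Read a 32-bit big-endian value from p without undefined behavior:
       mask each byte to 0..255, then shift it as an unsigned type. */
    #define READ32BE(p)                      \
      ((uint32_t)(255 & (p)[0]) << 24 |      \
       (uint32_t)(255 & (p)[1]) << 16 |      \
       (uint32_t)(255 & (p)[2]) << 8  |      \
       (uint32_t)(255 & (p)[3]))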

One of the few reasons I ever even reached for C is the ability to slurp in data and reinterpret it as a struct, or the ability to reason about which registers things will show up in and mix in some `asm` with my C.

I think there should really be a dialect of C(++) where the machine model is exactly the physical machine. That doesn't mean the compiler can't do optimizations, but it shouldn't do things like prove that code invokes UB and fold everything into a no-op. (Like when you defensively compare a pointer to NULL that according to the spec must not be NULL, but practically could be...)

`-fno-strict-overflow -fno-strict-aliasing -fno-delete-null-pointer-checks` gets you halfway there, but it would really only be viable if you had a blessed `-std=high-level-assembler` or `-std=friendly-c` flag.


> One of the few reasons I ever even reached for C is the ability to slurp in data and reinterpret it as a struct, or the ability to reason about which registers things will show up in and mix in some `asm` with my C.

Which results in undefined behavior according to the C ISO standard.

Quote:

“2 All declarations that refer to the same object or function shall have compatible type; otherwise, the behavior is undefined.”

From: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf 6.2.7


How? I mean, doesn't GP mean this?

    struct whatever p;
    fread(&p, sizeof(p), 1, fp);


It should be perfectly fine to do this:

  union reinterpret {
    char raw[100];
    struct myStruct interpreted;
  } example;

  read(fd, example.raw, sizeof(example.raw));
  struct myStruct dest = example.interpreted;

This is standard-compliant C code, and it is a common way of reading IP addresses from packets, for example.


You don't even need to pun that. It's legal to say:

    struct myStruct example;
    read(fd, &example, sizeof(example));
That "should present no problem unless binary data written by one implementation are read by another" quoth ANSI X3.159-1988. One example of a time where I've used that, is when storing intermediary build artifacts. Those artifacts only exist on the host machine. If the binary that writes/reads those artifacts gets recompiled, then the Makefile will invalidate the artifacts so they're regenerated. Since flags like -mstructure-size-boundary=n do exist and ABI breakages have happened with structs in the past.


(It should be noted that this is not valid C++ code.)


Sensitive emotional subjects shouldn't be noted. Reminding C developers of the void* incompatibility is a good way to get them to feel triggered because it makes the language unpleasant.


Exactly.


> Whereas the direct equivalent in C, an assignment with a (type-changing) cast, is illegal.

I don't understand what you mean by that. The direct equivalent of what? Endianness is not part of the type system in C so I'm not sure I follow.

> I think there should really be a dialect of C(++) where the machine model is exactly the physical machine.

Linus agrees with you here, and I disagree with both of you. Some UBs could certainly be relaxed, but as a rule I want my code to be portable and for the compiler to have enough leeway to correctly optimize my code for different targets without having to tweak my code.

I want strict aliasing and I want the compiler to delete extraneous NULL pointer checks. Strict overflow I'm willing to concede; at the very least the standard should mandate wrap-on-overflow even for signed integers IMO.


I am sympathetic, but portability was more important in the past and gets less important each year. I used to write code strictly keeping the difference between numeric types and sequences of bytes in mind, hoping to one day run on an Alpha or a Tandem or something, but it has been a long time since I have written code that runs on anything other than x86 (Intel/AMD) or little-endian ARM.


x86_32, x86_64, arm, arm64, POWER, RISC-V and several others are alive and kicking. China is making their own ISA. And there is still plenty of space and time for new ISAs to be created.

Portability is still plenty relevant.


> I think there should really be a dialect of C(++) where the machine model is exactly the physical machine.

Sounds great, until you have to rewrite all your software to go from x86-64 to ARM


Quite common when coding games back in the 8 and 16 bit days. :)

However, for the case at hand, it would suffice to just write the key routines in assembly, not everything.


> There is a huge mismatch between the assumptions of the C spec and actual machine code.

People like to say „C is close to the metal“. Really not true at all anymore.


Actually, it is true - which is why endian is a problem in the first place. ASM code is different when written for little endian vs big endian. Access patterns are positively offset instead of negatively.

A language that does the same things regardless of endianness would not have pointer arithmetic. That is not ASM and not C.


You don’t have to mask and shift. You can memcpy and then byte swap in a function. It will get inlined as mov/bswap.

Practically speaking, common compilers have intrinsics for bswap. The memcpy function can be thought of as an intrinsic for unaligned load/store.
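
For instance, something along these lines (a sketch assuming GCC/Clang, which provide the __builtin_bswap32 intrinsic and the __BYTE_ORDER__ predefined macros):

    #include <stdint.h>
    #include <string.h>

    static uint32_t load32be(const void *p) {
        uint32_t x;
        memcpy(&x, p, sizeof(x));      /* compiles to a plain (unaligned) load */
    #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
        x = __builtin_bswap32(x);      /* compiles to bswap or movbe */
    #endif
        return x;
    }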


How do you detect if a byte swap is needed? I.e. whether the (fixed) wire endianness matches the current platform endianness?


Using the preprocessor, something like this:

    uint32_t swap32(uint32_t x) { ... }

    #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    uint32_t swap32be(uint32_t x) { return swap32(x); }
    #elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    uint32_t swap32be(uint32_t x) { return x; }
    #else
    #error "Unknown endian"
    #endif
You can make the preprocessor condition broader if you care about more compilers and more platforms. Yes, I'm making assumptions about which platforms you want to target... which is fine. No, I don't care about your PDP-11, nor about dynamically changing your endian at runtime. Nearly any problem in C can be made arbitrarily difficult if you care about sufficiently bizarre platforms, or ask that people write code that is correct on any theoretical conforming C implementation. So we pick some platforms to support.

The above code is fairly simple. You can separate the part where you care about unaligned memory access and the part where you care about endian.

Some irrelevant details left out above.


Author here. The blog post has that as the naive example. The whole intention was to help people understand why we don't need to do that. Could you at least explain why you disagree if you're going to use this thread to provide the complete opposite advice?


When I read the blog post I saw this,

    #define READ32BE(p) bswap_32(*(uint32_t *)(p))
Which, as you correctly state in the article, is incorrect code. We agree about this. I proposed an alternate solution, where READ32BE would be like this:

    uint32_t read32be(const void *ptr) {
        uint32_t x;
        memcpy(&x, ptr, sizeof(x));
        return swap32be(x); // Nop on big-endian.
    }
What I like about this is that it breaks the problem down into two parts: reading unaligned data and converting byte order. The reason for this is, sometimes, you need a half of that. Some wire formats have alignment guarantees, and if you know that the alignment guarantees are compatible with your platform, you can just read the data into a buffer and then (optionally) swap the bytes in place.

Just to give an example... not too long ago I was working with legacy code that was written for MIPS. Unaligned access does not work on MIPS, so the code was already carefully written to avoid that. All I had to do was make sure that the data types were sized (e.g. replace "long" with "int32_t") and then go through and byte swap everything.

    struct Something {
        int32_t x, y;
        char name[16];
    };

    void Something_Swap(struct Something *p) {
        p->x = swap32be(p->x);
        p->y = swap32be(p->y);
    }
So it's nice to have a function like swap32be(), and "you don't have to mask and shift" I would say is true, it just depends on which compilers you want to support. I would say that a key part of being a C programmer is making a conscious decision about which compilers you want to support.

Yes, I'm aware that structs are not a great way to serialize data in general, but sometimes they're damn convenient.


I.e. how do you know the target's endianness? C++20 added std::endian. Otherwise you can use a macro like this one from SDL:

https://github.com/libsdl-org/SDL/blob/9dc97afa7190aca5bdf92...


There have been CPU architectures where knowing the endianness at compile time isn't necessarily sufficient. I forget which, maybe it was DEC Alpha, where the CPU could flip back and forth? I can't recall if it was a "choose at boot" or a per-process change.


ARM allows dynamic changing of endianness[1].

[1]: https://developer.arm.com/documentation/dui0489/h/arm-and-th...


Which nothing will be able to deal with so you might as well not bother to support it. Your compiler will also assume a fixed endianness based on the target triple.


When do you byte swap?


24 hours a day, man. I'm always byte swapping.

(I'm not sure how to answer the question... what do you mean, "when?")


The entire problem of using byte swaps is that you need to use them when your native platform's byte order is different from that of the data you are reading.

You know the byte order of the data. But the tricky part is, what is the byte order of the platform?


Whether it is tricky depends on what platforms you care about.


Or, you can just follow the advice of the article, and not need to worry about it because the compiler takes care of it for you.


> because the compiler takes care of it for you.

It will always be correct, but you can't just assume that the compiler will optimize the shifts into a byteswap instruction. If you look at the article you will see that it tries to no-true-scotsman that concern away by talking about a "good modern compiler".


And what exactly is the problem there? Are you going to be writing code that a) is built with a weird enough compiler that it fails this optimisation but also b) does byte swapping in a performance critical section?


Of course nobody wants C to backstab them with UB, but at the same time programmers want compilers to generate optimal code. That's the market pressure that forces optimizers to be so aggressive. If you can accept less optimized code, why aren't you using tcc?

The idea of C that "just" does a straightforward machine translation breaks down almost immediately. For example, you'd want `int` to just overflow instead of being UB. But then it turns out indexing `arr[i]` can't use 64-bit memory addressing modes, because they don't overflow like a 32-bit int does. With UB it doesn't matter, but a "straightforward C" would emit unnecessary separate 32-bit mul/shift instructions.

https://gist.github.com/rygorous/e0f055bfb74e3d5f0af20690759...


> nobody wants C to backstab them with UB, but at the same time programmers want compilers to generate optimal code

The value of compiler optimization isn't the same thing as the value of having extensive undefined behaviour in a programming language.

Rust and Ada perform about the same as C, but lack C's many footguns.

> indexing `arr[i]` can't use 64-bit memory addressing modes

What do you mean here?


Typically, the assembly instruction that would do the read in arr[i] can do something like:

    x = *(y + z);
where y and z are both 64-bit integers. If I had

    int arr[1000];
    initialize(&arr);
    int i = read_int();
    int x = arr[i];
    print(x);
then to get x I'd need to do something like,

    tmp = i * 4;
    tmp1 = (uint64_t)tmp;
    x = *(arr + tmp1);
Which, since i is signed, can't just be a cheap shift, and then needs to be upcasted to a uint64_t (which is cheap, at least).


So in your 'machine model is the physical machine' flavour, should "I cast an unaligned pointer to a byte array to int32_t and deref" on SPARC (a) do a bunch of byte-load-and-shift-and-OR or (b) emit a simple word load which segfaults? If the former, it's not what the physical machine does, and if the latter, then you still need to write the code as "some portable other thing". Which is to say that the spec's UB here is in service of "allow the compiler to just emit a word load when you write *(int32_t *)p".

What I think the language is missing is a way to clearly write "this might be unaligned and/or wrong endianness, handle that". (Sometimes compilers provide intrinsics for this sort of gap, as they do with popcount and count-leading-zeroes; sometimes they recognize common open-coded idioms. But proper standardised support would be nicer.)


Endianness doesn't matter though, for the reasons Rob Pike explained. For example, the bits inside each byte arguably have an endianness inside the CPU, but they're not addressable, so no one thinks about that. The brilliance of Rob Pike's recommendation is that it allows our code to be byte order agnostic for the same reasons our code is already bit order agnostic.
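
For reference, the idiom Pike describes looks roughly like this (a sketch; the function name is illustrative and assumes the wire format is little-endian):

    #include <stdint.h>

    /* Decode a little-endian uint32 from a byte buffer, without ever
       asking what the host's byte order is. */
    uint32_t get32le(const unsigned char *data) {
        return (uint32_t)data[0] << 0  |
               (uint32_t)data[1] << 8  |
               (uint32_t)data[2] << 16 |
               (uint32_t)data[3] << 24;
    }
The same source is correct on either kind of host, because it only talks about the byte order of the data, never the byte order of the machine.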

I agree about bsf/bsr/popcnt. I wish ASCII had more punctuation marks because those operations are as fundamental as xor/and/or/shl/shr/sar.


D's machine model does actually assume the hardware, and using compile-time metaprogramming you can pretty much do whatever you want when it comes to bit twiddling - whether that means assembly, flags etc.


I suspect you might like C--.

https://en.m.wikipedia.org/wiki/C--


> There is a huge mismatch between the assumptions of the C spec and actual machine code.

Right, which is why the kind of UB pedantry in the linked article is hurting and not helping. Cranky old man perspective here:

Folks: the fact that compilers will routinely exploit edge cases in undefined behavior in the language specification to miscompile obvious idiomatic code is a terrible bug in the compilers. Period. And we should address that by fixing the compilers, potentially by amending the spec if feasible.

But instead the community wants to all look smart by showing how much they understand about "UB" with blog posts and (worse) drive-by submissions to open source projects (with passive-aggressive sneers about code quality), so nothing gets better.

Seriously: don't tell people to shift and mask. Don't pontificate over compiler flags. Stop the masturbatory use of ubsan (though the tool itself is great). And start submitting bugs against the toolchain to get this fixed.


I agree, but the language of the standard very unambiguously lets them do it. Quoth X3.159-1988:

     * Undefined behavior --- behavior, upon use of a nonportable or
       erroneous program construct, of erroneous data, or of
       indeterminately-valued objects, for which the Standard imposes no
       requirements.  Permissible undefined behavior ranges from ignoring the
       situation completely with unpredictable results, to behaving during
       translation or program execution in a documented manner characteristic
       of the environment (with or without the issuance of a diagnostic
       message), to terminating a translation or execution (with the issuance
       of a diagnostic message).
In the past compilers "behaved during translation or program execution in a documented manner characteristic of the environment" and now they've decided to "ignore the situation completely with unpredictable results". So yes what gcc and clang are doing is hostile and dangerous, but it's legal. https://justine.lol/undefined.png So let's fix our code. The blog post is intended to help people do that.


So let's fix our code.

No; I say we force the compiler writers to fix their idiotic assumptions instead of bending over backwards to please what's essentially a tiny minority. There's a lot more programmers who are not compiler writers.

The standard is really a minimum bar to meet, and what's not defined by it is left to the discretion of the implementers, who should be doing their best to follow the "spirit of C", which ultimately means behaving sanely. "But the standard allows it" should never be a valid argument --- the standard allows a lot of other things, not all of which make sense.

A related rant by Linus Torvalds: https://bugzilla.redhat.com/show_bug.cgi?id=638477#c129


force the compiler writers to fix their idiotic assumptions instead of bending over backwards to please what's essentially a tiny minority

As far as I understand it, they do neither. Transforming an AST to any level of target code is not done by handcrafted recipes, but instead is fed into efficient abstract solvers which have these assumptions as an operational detail. E.g.:

  p = &x;
  if (p != &x) foo(); // optimized out
is not much different from

  if (p == NULL) foo(); // optimized out
  printf("%c", *p);
No assumption here is idiotic, because no single human was involved; it's just a class of constraints, and separating them properly would take extensive head-scratching (imagine telling a logic system that p is both 0 and not-0 when the 0-test is "explicit" and asking it to operate normally). Compiler writers do not format disks just to punish your UBs. Of course you can write a boring compiler that emits opcodes at face expr value, without most UBs being a problem. Plenty of these, why not just take one?


In your example, why should it optimise out the second case? Maybe foo() changed p so it's no longer null.

Compiler writers do not format disks just to punish your UBs.

IMHO if the compiler exploiting UB is leading to counterintuitive behaviour that's making it harder to use the language, the compiler is the one that needs fixing, regardless of whether the standard allows it. "But we wrote the compiler so it can't be fixed" just feels like a "but the AI did it, not me" excuse.


You would need to pass &p or declare it as volatile, I assume; otherwise by what means would foo change p?


The address of p could have been taken somewhere earlier and stored in a global that foo accesses, or a similar path to that; and of course, p could itself be a global. Indeed, if the purpose of foo is to make p non-null and point to valid memory, then by optimising away that code you have broken a valid program.

If the compiler doesn't know if foo may modify p, then it can't remove the call. Even if it can prove that foo does not modify p, it still can't remove the call: foo may still have some other side-effects that matter (like not returning --- either longjmp()'ing elsewhere or perhaps printing an error message about p being null and exiting?), so it won't even get to the null dereference.

As a programmer, if I write code like that, I either intend for foo to be doing something to p to make it non-null, or if it doesn't for whatever reason, then it will actually dereference the null and whatever happens when that's attempted on the particular platform, happens. One of the fundamental principles of C is "trust the programmer". In other words, by trying to be "helpful" and second-guessing the intent of the code while making assumptions about UB, the compiler has completely broken the expectations of the programmer. This is why assumptions based on UB are stupid.

The standard allows this, but the whole intent of UB is not so compiler-writers can play language-lawyer and abuse programmers; things it leaves undefined are usually because existing and possible future implementations vary so widely that they didn't even try to consider or enumerate the possibilities (unlike with "implementation-defined").


But in fact compilers do regularly prove such things as, "this function call did not touch that local variable". Escape analysis is a term related to this.

I'm of two minds about that other step, where the compiler goes, "here in the printf call p will be dereferenced, so it surely is non-null, so we silently optimize out that other thing where we consider the possibility of it being null".

Also @joshuamorton, couldn't the compiler at least print a warning that it removed code based on an assumption that was inferred by the compiler? I really don't know a lot about those abstract logic solver approaches, but it feels like it should be easy to do.


You don't need to worry about null check removal optimizations unless you do this:

    int main() {
      char *p;
      p = mmap(0, 65536, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
      // ...
      return __builtin_popcountl((uintptr_t)p);
    }
Or you do this:

    void ContinueOnError(int sig, siginfo_t *si, ucontext_t *ctx) {
      xed_decoded_inst_zero_set_mode(&xedd, XED_MACHINE_MODE_LONG_64);
      xed_instruction_length_decode(&xedd, (void *)ctx->uc_mcontext.rip, 15);
      ctx->uc_mcontext.rip += xedd.length;
    }

    int main() {
      signal(SIGSEGV, ContinueOnError);
      volatile long *x = NULL;
      printf("*NULL = %ld\n", *x);
    }


warning that it removed code based on an assumption that was inferred by the compiler

That would dump a ton of warnings from various macro/meta routines, which real-world C is usually peppered with. Not that it’s particularly hard to do (at the very least compilers know which lines are missing from debug info alone).


> No assumption here is idiotic

Yes, the assumption that p is non-null is idiotic. Also, the implicit assumption that foo will always return.

> no single human was involved

Humans implemented the compilers that use the spec adversarially and humans lobby the standards committee to not fix the bugs

> Of course you can write a boring compiler that emits opcodes at face expr value, without most UBs being a problem. Plenty of these, why not just take one

The majority of optimizations are harmless and useful, only a handful are idiotic and harmful. I want a compiler that has the good optimizations and not the bad ones.


For essentially every form of UB that compilers actually take advantage of, there's a real program optimization benefit. Are there any particular UB cases where you think the benefit isn't worth it, or it should be implementation-specific behavior instead of undefined behavior?


Most performance wins from UB come from removing code that someone wrote intentionally. If that code wasn't meant to be run, it shouldn't be written. If it was written, it should be run.

Now obviously there are lots of counter-examples for that. You can probably list ten in a minute. But it should be the guiding philosophy of compiler optimizations. If the programmer wrote some code, it shouldn't just be removed. If the program would be faster without that code, the programmer should be the one responsible for deciding whether the code gets removed or not.


MSVC and ICC have traditionally been far less keen on exploiting UB, yet are extremely competitive on performance (ICC in particular). That alone is enough evidence to convince me that UB is not the performance-panacea that the gcc/clang crowd think it is, and from my experience with writing Asm, good instruction selection and scheduling is far more important than trying to pull tricks with UB.


Get the teamsters and workers world party to occupy clang. You should fork C to restore the spirit of C and call it Spiritual C since we need a new successor to Holy C.


I read this, and go "yes, yes, yes", and then "NO!".

Shifts and ORs really are the sanest and simplest way to express "assembling an integer from bytes". Masking is _a_ way to deal with the current C spec, which has silly promotion rules. Unsigned everything is more fundamental than signed.


It does: macro assemblers, especially those with PC and Amiga roots.

Which, given its heritage, is what PDP-11 C used to be; after all, BCPL's origin was as the minimal language required to bootstrap CPL, nothing else.

Actually, I think TI has a macro assembler with a C-like syntax; I just cannot recall the name any longer.


> That doesn't mean the compiler can't do optimizations, but it shouldn't do things like prove that code invokes UB and fold everything into a no-op.

UB doesn't just mean the compiler can treat it as a no-op. It means the compiler can do whatever it likes and still be compliant with the spec.

From the POV of someone consulting the spec, if something results in UB, what it means is: "Don't look here for documentation, look in the documentation of your compiler!".

Many compilers prefer to do a no-op because it is the cheapest thing to do.


My read of the standard is that the worst the compiler can do is nothing. For example, the blog post links a tweet where clang doing nothing meant generating an empty function, so that when it was called, execution fell through to a different function the author had written which formats the hard drive. However, it wouldn't be kosher for the compiler itself to generate the asm that formats your hard drive as an intended punishment for UB, since the standard recommends that, if you're not going to ignore the situation, the compiler should act in a way that's characteristic of the environment, or crash with an error.


C is careful to distinguish “implementation-defined behavior” (every compiler must document a consistent choice) from “undefined behavior”, which doesn't necessarily have any safe uses.


you could instead simply use hton/ntoh and trust the library properly does The Right Thing tm
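
for example, something like this (a sketch; htonl/ntohl are declared in <arpa/inet.h> on POSIX systems, and the helper name mirrors the read32be used elsewhere in the thread):

    #include <arpa/inet.h>  /* htonl(), ntohl() */
    #include <stdint.h>
    #include <string.h>

    uint32_t read32be(const void *p) {
        uint32_t x;
        memcpy(&x, p, sizeof(x));  /* avoids the unaligned-access UB */
        return ntohl(x);           /* network (big-endian) to host order */
    }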


Rust gets this right. These primitives are available for all the numeric types.

    u32::from_le_bytes(bytes) // u32 from 4 bytes, little endian
    u32::from_be_bytes(bytes) // u32 from 4 bytes, big endian
    u32::to_le_bytes(num) // u32 to 4 bytes, little endian
    u32::to_be_bytes(num) // u32 to 4 bytes, big endian
This was very useful to me recently as I had to write the marshaling and un-marshaling for a game networking format with hundreds of messages. With primitives like this, you can see what's going on.


There are equivalent functions in C too. The point of the article is about not using them. So how would you implement the above functions in Rust would be more pertinent.


Given that Rust isn’t C, the answer is Rust has a compiler intrinsic for bswap and it calls that as appropriate. LLVM will then turn that into the correct instruction(s) for the target platform.


If I were forced to implement them myself for some reason, I would probably simply do them like this:

    fn from_be(bytes: [u8; 4]) -> u32 {
        (bytes[0] as u32) << 24
        | (bytes[1] as u32) << 16
        | (bytes[2] as u32) << 8
        | (bytes[3] as u32) << 0
    }
It's direct, to the point, and does exactly what it says on the tin because all pertinent behaviour is defined. The way Rust's corelib implements it is to transmute the array into the integer, then call the bswap intrinsic if the bytes need swapping (detected at compile time).


isn't the point to be careful when implementing them? so the compiler detects the intention to byteswap?

when we ported little endian x86 Linux to the big endian mainframe we sprinkled hton/ntoh all over the place, happily so. they are the way to go and they should be implemented properly, not be replaced by a homegrown version.

all that said, I'm surprised 64bit htonll and ntohll are not standard yet. anybody know why?


Blech. I learned to program (around ‘99) by implementing the crusty old FCS1.0 format, which allows for aggressively weird wire formats. Our machine was a PDP-11/72 with its head sawzalled off and custom wire wrap boards dropped in. The “native” format (coming from analog) was 2143 order as a 36b packet. The bits were [8,0:7] (using verilog notation). However, sprinkled randomly in the binary header were chunks of 7- and 8- bit ANSI (packed) and some mutant knockoff 6-bit EBCDIC.

The original listing was written by “Jennifer — please call me if you have troubles”, an undergraduate from MIT. It was hand-assembled machine code, in a neat hand in a big blue binder. That code ran non-stop except for a few hurricanes from 1988 until 2008; bug-free as far as I could tell. Jennifer last-name-unknown, you were my idol & my demon!

I swore off programming for nearly a year after that.


Functions like ntohl and htonl are the biggest blemish in the design of the Berkeley Sockets API because it's defined to read memory off the wire without deserializing it. Those functions shouldn't have been invented for the reasons described in the linked blog posts. The C standard isn't going to evolve to include functions that only exist to accommodate code that misunderstands the standard.


It's about accessing memory plus an extra conversion step vs. accessing memory in the right order in one step. As an extra, platform-dependent implementations of the accessors could be done, like using the LWL+LWR instruction pair on MIPS.

For reference, check how Linux does it. https://elixir.bootlin.com/linux/latest/source/include/linux...


Doesn't this run into the same issue described by the author?

> Because unsigned char in C expressions gets type promoted to the signed type int.

> So if we say 0x80<<24 it overwrites the sign bit, which is an undefined behavior

Does the u8 type protect against this?


Unless you are planning on running your game on a mainframe, just don’t bother with endianness for the networking.

Big endian is dead for game developers.

Copy entire arrays of structs onto the wire without fear!

(Just #pragma pack them first)
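
In other words, something like this (a sketch of what's being suggested; the struct, field and function names are made up for illustration, and it assumes both peers are little-endian and built with identical packing):

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/socket.h>

    #pragma pack(push, 1)
    struct PlayerState {
        uint32_t id;
        float    x, y, z;
        uint16_t health;
    };
    #pragma pack(pop)

    /* Ship the raw bytes of the array straight onto the wire. */
    void send_states(int sock, const struct PlayerState *states, size_t count) {
        send(sock, states, count * sizeof(*states), 0);
    }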


> game on a mainframe

Maybe your program isn't a game.

Maybe you have to deal with a server that uses Power, or an embedded system that uses PowerPC (or ARM or MIPS in big-endian mode).

Maybe you're running on an older architecture (SPARC, PowerPC, 68K.)

Maybe you have to deal with a pre-defined data format (e.g. TCP/IP packet headers) that uses big-endian byte ordering for some of its components.


That’s theoretically possible. But I’d be very interested in why. Especially if you are doing anything involving networking.


Because it makes it clear what's going on. Most of those functions just generate a move, but it's the correct move.

I had to read through excessively-clever C++ code that did the same thing to figure out what conversions were happening, then re-express it in Rust. I'm re-implementing a legacy mess that people are afraid to work on. As it happens, in this message system, some items, mainly packet sequence numbers, are big-endian, because they were following what IP and UDP do, and everything else is little endian.

I know how to do this with shifts and masks, and I've done things like that when programming in assembly. That was a long time ago. There's been progress in how to write programs.


Obviously if you are interacting with an old protocol that uses network byte order for some things, you will need to use these functions.

But what's the argument for using them in new game code?

The code will not be running anywhere that has big-endianness. No current platform a game could run on uses it, and I can't imagine a scenario where a new platform would come into existence and use it either.

If you insist on using network-byte order anyway, then you have to do an extra bswap op for each bit of data you send. Sure, the cost of that is super minor and probably not worth worrying about.

But the bigger cost is that you can't just send whole structures at a time. You have to individually serialise each thing. Now you have to have a whole serialisation concept. You have to have some way of enumerating all the fields. You have to walk all the structures. What a pain.

If you want to send a thing over the network, just send it.


I disagree... but what I would do is use little endian, so the legacy machines need to byte swap.


This is why, in 2021, the mantra that C is a good language for these low level byte twiddling tasks needs to die. Dealing with alignment and endianness properly requires a language that allows you to build abstractions.

The following is perfectly well defined in C++, despite looking like almost the same as the original unsafe C:

    #include <boost/endian.hpp>
    #include <cstdio>
    using namespace boost::endian;

    unsigned char b[5] = {0x80,0x01,0x02,0x03,0x04};

    int main() {
        uint32_t x = *((big_uint32_t*)(b+1));
        printf("%08x\n", x);
    }
Note that I deliberately misaligned the pointer by adding 1.

https://gcc.godbolt.org/z/5416oefjx

[Edit] Fun twist: the above code doesn't work when the intermediate variable x is removed, because printf itself is not type safe, so no type conversion happens (and the conversion is where the bswap is deferred to). In pure C++, when using a type-safe formatting function (like fmt or iostreams), this wouldn't happen. printf will let you throw any garbage into it. tl;dr outside embedded use cases writing C in 2021 is fucking nuts.


Correct me if I'm wrong, but your example is just using a library to do the same task, rather than illustrating any difference between C and C++. If you want to pull boost in to do this, that's great, but that hardly seems like a fair comparison to the OP, since instead of implementing code to solve this problem yourself you're just importing someone else's code.


No, the fact that this can be done in a library and looks like a native language feature demonstrates the power of C++ as a language.

This example is demonstrating:

- First class treatment of user (or library) defined types

- Operator overloading

- The fact that it produces fast machine code. Try changing big_uint32_t to regular uint32_t to see how this changes. When you use the latter, ubsan will introduce a trap for runtime checks, but it doesn't need to in this case.


Operator overloading is a mixed blessing though, it can be very convenient but it's also very good at obfuscating what's going on.

For instance I'm not familiar with this boost library so I'd have a lot of trouble piecing out what your snippet does, especially since there's no explicit function call besides the printf.

Personally if we're going the OOP route I'd much prefer something like Rust's `var.to_be()`, `var.to_le()` etc... At least it's very explicit.

My hot take is that operator overloading should only ever be used for mathematical operators (multiplying vectors etc...), everything else is almost invariably a bad idea.


Ironically, it was proposed not so long ago to deprecate to_be/to_le in favour of to_be_bytes/to_le_bytes, since the former conflate abstract values with bit representations.


That's fine if whatever type 'var' happens to be is NOT usable as an arithmetic type, otherwise you can easily just forget to call .to_le() or .to_native(), or whatever, and end up with a bug. I don't know Rust, so don't know if this is the case?

Boost.Endian actually lets you pick between arithmetic and buffer types.

'big_uint32_buf_t' is a buffer type that requires you to call .value() or do a conversion to an integral type. It does not support arithmetic operations.

'big_uint32_t' is an arithmetic type, and supports all the arithmetic operators.

There are also variants of both endian suffixed '_at' for when you know you have aligned access.


The idiomatic way to do this in Rust is to use functions like .to_le_bytes(), so you have the u32 (or whatever) on one end and raw bytes (something like [u8; 4]) on the other. It can get slightly tedious if you're doing it by hand, but it's impossible to accidentally forget. If you're doing this kind of thing at scale, like dealing with TrueType fonts (another bastion of big-endian), it's common to reach for derive macros, which automate a great deal of the tedium.


Who decides what methods to add to the bytes type/abstraction?

If I have a 3 byte big endian integer can I access it easily in rust without resorting to shifts?

In C++ I could probably create a fairly convincing big_uint24_t type and use it in a packed struct and there would be no inconsistencies with how it's used with respect to the more common varieties


In Rust, [u8; N] and &[u8] are both primitive types, and not abstractions. It's possible to create an abstraction around either (the former even more so now with const generics), but that's not necessary. It's also possible to use "extension traits" to add methods, even to existing and built-in types[1].

I'm not sure about a 3 byte big endian integer. I mean, that's going to compile down to some combination of shifting and masking operations anyway, isn't it? I suspect that if you have some oddball binary format that needs this, it will be possible to write some code to marshal it that compiles down to the best possible asm. Godbolt is your friend here :)

[1]: https://rust-lang.github.io/rfcs/0445-extension-trait-conven...


I agree then that in Rust you could make something consistent.

I think there's no need for explicit shifts. You need to memcpy anyway to deal with alignment issues, so you may as well just copy into the last 3 bytes of a zero-initialized, big endian, 32-bit uint.

https://gcc.godbolt.org/z/jEnsW8WfE
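
In plain C terms the same trick looks roughly like this (a sketch assuming GCC/Clang for the bswap builtin; p points at the 3 big-endian source bytes):

    #include <stdint.h>
    #include <string.h>

    uint32_t read24be(const unsigned char *p) {
        unsigned char buf[4] = {0, p[0], p[1], p[2]};  /* zero-pad the high byte */
        uint32_t x;
        memcpy(&x, buf, sizeof(x));                    /* alignment-safe copy */
    #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
        x = __builtin_bswap32(x);                      /* big-endian -> host */
    #endif
        return x;
    }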


That's just constant folding. Here's what it looks like when you actually need to go to memory:

https://gcc.godbolt.org/z/9qGqh6M1E

And I think we're on the same page, it should be possible to get similar results in Rust.


It demonstrates that C++ is even less safe.


You are still casting one pointer type into another, which can result in unaligned access.

If you need to change byte orders, you should use a library to achieve that.


Boost.Endian is the library here and this code is safe because the big_uint32_t type has an alignment requirement of 1 byte.

This is why ubsan is silent and not even injecting a check in to the compiled code.

You can check the alignment constraints with static_assert (something else you can't do in standard C): https://gcc.godbolt.org/z/KTcf9ax6r


C11 has static_assert: https://gcc.godbolt.org/z/E3bGc95o3

It also has _Generic() so you can roll up a family of endianness conversion functions and safely change types without blowing up somewhere else with a hardcoded conversion routine.
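
Something along these lines, for instance (a sketch assuming GCC/Clang bswap builtins and a little-endian host; the names are illustrative):

    #include <stdint.h>

    static inline uint16_t bswap16_(uint16_t x) { return __builtin_bswap16(x); }
    static inline uint32_t bswap32_(uint32_t x) { return __builtin_bswap32(x); }
    static inline uint64_t bswap64_(uint64_t x) { return __builtin_bswap64(x); }

    /* Picks the right swap for the argument's type at compile time. */
    #define to_be(x) _Generic((x),   \
        uint16_t: bswap16_,          \
        uint32_t: bswap32_,          \
        uint64_t: bswap64_)(x)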


I find you missed the point of the post and the issues described in it.

In my estimation, libraries like boost are way too big and way too clever and they create more problems than they solve. Also, they don't make me happy.

You're overfocusing on a "problem" that is almost completely irrelevant for most of programming. Big endian is rarely found (almost no hardware, but some file formats and networking APIs have big-endian data in them). Where you still meet it, you don't do endianness conversions willy-nilly. You have only a few lines in a huge project that should be concerned with it. Similar situation for dealing with aligned reads.

So, with boost you end up with a huge slow-compiling dependency to solve a problem using obscure implicit mechanisms that almost no-one understands or can even spot (I would never have guessed that your line above seems to handle misalignment or byte swapping).

This approach is typical for a large group of C++ programmers, who seem to like to optimize for short code snippets, cleverness, and/or pedantry.

The actual issue described in the post was the UB that is easy to hit when doing bit shifting, caused by the implicit conversions that are defined in C. While this is definitely an unhappy situation, it's easy enough to avoid this using plain C syntax (cast expression to unsigned before shifting), using not more code than the boost-type cast in your above code.

The fact that the UB is so easy to hit doesn't call for excessive abstraction, but simply a revisit of some of the UB defined in C, and how compiler writers exploit it.

(Anecdata: I've written a fair share of C code, while not compression or encryption algorithms, and personally I'm not sure I've ever hit one of the evil cases of UB. I've hit Segmentation faults or had Out-of-bounds accesses, sure, but personally I've never seen the language or compilers "haunt me".)


Do you use UBSAN and ASAN? When you write unit tests do you feed numbers like 0x80000000 into your algorithm? When you allocate test memory have you considered doing it with mmap(4096) and putting the data at the end of the map? (Or better yet, double it and use mprotect). Those are some good examples of torture tests if you're in the mood to feel haunted.


Every day I spend futzing around with endianness is a day I'm not solving 'real' problems. These things are a distraction and a complete waste of developer time: It should be solved 'once' and only worried about by people specifically looking to improve on the existing solution. If it can't be handled by a library call, there's something really broken in the language.

(imo, both c and cpp are mainly advocated by people suffering from stockholm syndrome.)


But that's the point: No one spends a day futzing around with endianness, and there are in fact functions for swapping endianness. You can just call them, no need to hide the swap in a pointer cast expression to a type that has the dereferencing operator overloaded.


I agree with the bulk of this post.

Re the anecdata at the end. Have you ever run your code through the sanitizers? I have. CVE-2016-2414 is one of my battle scars, and I consider myself a pretty good programmer who is aware of security implications.


Very little, quite frankly. I've used valgrind in the past, and found very few problems. I just ran -fsanitize=undefined for the first time on one of my current projects, which is an embedded network service of 8KLOC, and with a quick test covering probably 50% of the codepaths by doing network requests, no UB was detected (I made sure the sanitizer works in my build by introducing a (1<<31) expression).

Admittedly I'm not the type of person who spends his time fuzzing his own projects, so my statement was just to say that the kind of bugs that I hit by just testing my software casually are almost all of the very trivial kind - I've never experienced the feeling that the compiler "betrayed" me and introduced an obscure bug for something that looks like correct code.

I can't immediately see the problem in your CVE here [0], was that some kind of betrayal by compiler situation? Seems like strange things could happen if (end - start) underflows.

[0] https://android.googlesource.com/platform/frameworks/minikin...


This one wasn't specifically "betrayal by compiler," but it was a confusion between signed and unsigned quantities for a size field, which is very similar to the UB exhibited in OP.

Also, the fact that you can't see the problem is actually evidence of how insidious these problems are :)

The rules for this are arcane, and, while the solution suggested in OP is correct, it skates close to the edge, in that there are many similar idioms that are not ok. In particular, (p[1] << 8) & 0xff00, which is code I've written, is potentially UB (hence "mask, and then shift" as a mantra). I'd be surprised if anyone other than jart or someone who's been part of the C or C++ standards process can say why.


> the fact that you can't see the problem is actually evidence of how insidious these problems are

I've looked for a while now, but still can't see it, would you be willing to share?

> (p[1] << 8) & 0xff00

With p[1] being uint8_t? Because then I cannot imagine why, and also fail to see a reason to apply the 0xff00 mask here.

If this is for int8_t instead, the problem you are alluding to is sign extension? If p[1] gets promoted to an int in the negative range (so its representation has the high-order bit set), shifting that to the left is UB.


Yes, I was assuming it was char *, as in the OP, which can be signed. And any left shift of a negative quantity is UB in C (I'm not sure if this is fixed in recent C++); it doesn't have to be what's commonly thought of as overflow.
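
A small sketch of the distinction (assuming p points at raw bytes accessed through plain char, which may be signed):

    int shift_then_mask(const char *p) {
        return (p[1] << 8) & 0xff00;  /* UB if p[1] promotes to a negative int */
    }

    int mask_then_shift(const char *p) {
        return (p[1] & 0xff) << 8;    /* operand is 0..255, so the shift is defined */
    }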


Raph, clearly you're just not as good a programmer as you think you are.


Why thank you Vitali. Coming from you, that is high praise indeed.


As a very minor counterpoint: I like C because frankly it’s fun. I wouldn’t start a web browser or maybe even an operating system in it today, but as a language for messing around I find it rewarding. I also think it is incredibly instructive in a lot of ways. I am not a C++ developer but ANSI C has a special place in my heart.

Also, I will say that when it comes to programming Arduinos and ESP8266/ESP32 chips, I still find that C is my go to despite things like Alia, MicroPython, etc. I think it’s possible that once Zig supports those devices fully that I might move over. But in the meantime I guess I’ll keep minding my off by one errors.


This has nothing to do with C++ because your example only hides the real issue occurring in the blog post example: The unaligned read on the array. Try adding something like

  printf("%08x\n", *((uint32_t*)(b)));
to your example and you'll see that it produces UB as well. The reason there is no UB with big_uint32_t is probably that that struct/class/whatever it is redefines its dereferencing operator to perform byte-wise reads.

Godbolt example: https://gcc.godbolt.org/z/seWrb5cz7


I fail to see your point. The point of my post is that the abstractions you can build in C++ are as easy to use and as efficient as doing things the wrong, unsafe way...so there's no reason not to do things in a safe, correct way.

Obviously if you write C and compile it as C++ you still end up with UB, because C++ aims for extreme levels of compatibility with C.


Sorry for being unclear. My point is that the example in the blog post does two things, a) it reads an unaligned address causing UB and b) it performs byte-order swapping. The post then goes on about avoiding UB in part b), but all the time the UB was caused by the unaligned access in a).

Of course your example solves both a) and b) by using big_uint32_t, and I agree that this is an interesting abstraction provided by Boost, but I think the takeaway "use C++ for low-level byte fiddling" is slightly misleading: Say I was a novice C++ programmer, saw your example of how C++ improves this but at the same time don't know that big_uint32_t solves the hassle of reading a word from an unaligned address for me. Now I use your pattern in my byte-fiddling code, but then I need to read a word in host endianness. What do I do? Right, I remember the HN post and write *((uint32_t*)(b+1)) (without the big_, because I don't need that!). And then I unintentionally introduced UB. In other words, big_uint32_t is a little "magic" in this case, as it suggests a similarity to uint32_t which does not actually exist.

To be honest, I don't think the byte-wise reading is in any way inappropriate in this case: If you're trying to read a word in non-native byte order from an unaligned access, it is perfectly fine to be very explicit about what you're doing in my opinion. There also is nothing unsafe about doing this as long as you follow certain guidelines, as mentioned elsewhere in this thread.


Sure, the only correct way to read an unaligned value in to an aligned data type in both C or C++ is via memcpy.

I still think being able to define a type that models what you're doing is incredibly valuable because as long as you don't step outside your type system you get so much for free.


You could also mask and shift the value byte-wise just like with an endian swap. Depending on the destination and how aggressive the compiler optimizes memcpy or not, it could even produce more optimal code, perhaps by working in registers more.

Conceptual consistency is a good thing, but there is a generally higher cognitive load to using C++ over C. I've used both C++ and C professionally, and I've gone deeper with type safety and metaprogramming than most folk. I've mostly used C for the last few years, and I don't feel like I'm missing anything. It's still possible to write hard-to-misuse code by coming up with abstractions that play to the language's strengths.

Operator overloading in particular is something I've refined my opinion on over the years. My current thought is that it's best not to use operators in user/application defined APIs, and should be reserved for implementing language defined "standard" APIs like the STL. Instead, it's better to use functions with names that unambiguously describe their purpose.


C is perfect for these problems. I like teaching the endian serialization problem because it broaches so many of the topics that are key to understanding C/C++ in general. Even if we choose to spend the majority of our time plumbing together functions written by better men, it's nice to understand how the language is defined so we could write those functions, even if we don't need to.


For sure, it's a good way to teach that C is insufficient to deal with even the simplest of tasks. Unfortunately teaching has a bad habit of becoming practice, no matter how good the intention.

With regard to teaching C++ specifically I tend to agree with this talk:

CppCon 2015 - Kate Gregory “Stop Teaching C": https://www.youtube.com/watch?v=YnWhqhNdYyk


One of her slides was titled "Stop teaching pointers!" too. My VP back at my old job snapped at me once because I got too excited about the pointer abstractions provided by modern C++. Ever since that day I try to take a more rational approach to writing native code where I consider what it looks like in binary and I've configured my Emacs so it can do what clang.godbolt.org does in a single keystroke.


For the record, she's not really saying people shouldn't learn this low level stuff... just that 'intro to C++' shouldn't be teaching this stuff first

The biggest problem with C++ in industry is that people tend to write "C/C++" when it deserves to be recognized as a language in its own right.


One does not simply introduce C++. It's the most insanely hardcore language there is. I wouldn't have stood any chance understanding it had it not been for my gentle introduction with C for several years.


Really?

Apparently the first-year students at my university didn't have any issue going from Standard Pascal to C++ in the mid-90s.

Proper C++ was taught using our string, vector and collection classes, given that we were still a couple of years away from ISO C++ being fully defined.

C-style programming with low-level tricks was only introduced later as an advanced topic.

Apparently thousands of students managed to keep going through the remaining 5 years of the degree.


C++ in the mid 90s was a lot simpler than C++ now.


No one obliges you to write C++20 with SFINAE template meta-programming, using classes with CTAD constructors.

Just like no Python newbie is able to master Python 3.9 full language set, standard library, numpy, pandas, django,...


Well there's a reason universities switched to Java when teaching algorithms and containers after the 90's. C++ is a weaker abstraction that encourages the kind of curiosity that's going to cause a student's brain to melt the moment they try to figure out how things work and encounter the sorts of demons the coursework hasn't prepared them to face. If I was going to teach it, I'd start with octal machine codes and work my way up. https://justine.lol/blinkenlights/realmode.html Sort of like if I were to teach TypeScript then I'd start with JavaScript. My approach to native development probably has more in common with web development than it does with modern c++ practices to be honest, and that's something I talk about in one of my famous hacks: https://github.com/jart/cosmopolitan/blob/4577f7fe11e5d8ef0a...


US universities maybe; there wasn't much Java in my former university's curriculum.

The only subjects that went full into Java were distributed computing and compiler design.

And during the last 20 years they have already gone back on that decision.

I should note that languages like Prolog, ML and Smalltalk were part of the learning subjects as well.

Assembly was part of electronic subjects where design of a pseudo CPU was also part of the themes. So we had our own pseudo Assembly, x86 and MIPS.


> Well there's a reason universities switched to Java when teaching algorithms and containers after the 90's

Where ? I learned algorithms in C and C++ (and also a bit in Caml and LISP) and I was in university 2011-2014


Yes, this is the curse of knowledge: people who know C++ from decades of exposure are usually unable to bring any newcomer to it.


C++ makes Rust look easy to learn.


Yes, there is some value in using C for teaching these concepts. But the problem I see is that, once taught, many people will then continue to use C and their hand written byte swapping functions, instead of moving on to languages with better abstraction facilities and/or availing themselves of the (as you point out) many available library implementations of this functionality.


What are the advantages of this over a simple function with the following signature?

    uint32_t read_big_uint32(char *bytes);
Having a big_uint32_t type seems wrong to me conceptually. You should either deal with sequences of bytes with a defined endianness or with native 32-bit integers of indeterminate endianness (assuming that your code is intended to be endian neutral). Having some kind of halfway house just confuses things.


The library provides those functions too, but I don't see how having an arithmetic type with well defined size, endiannness and alignment is a bad thing.

If you're defining a struct to mirror a data structure from a device, protocol or file format then the language / type system should let you define the properties of the fields, not necessarily force you to introduce a parsing/decoding stage which could be more easily bypassed.


It is no longer arithmetic if there is an endianness. Some things are numbers and some things are sequences of bytes. Arithmetic only works on the former.


I agree, but a little nitpick: A sequence of bytes does not have a defined endianness. Only groups of more than one byte (i.e. half words, words, double words or whatever you want to call them) have an endianness.

In practice, most projects (e.g. the Linux kernel or the socket interface) differentiate between host (indeterminate) byte order and a specific byte order (e.g. network byte order/big endian).


I'd say: putting multiple of those types into a struct then perfectly describes the memory layout of each byte of data in memory or in a network packet, in a reliable and user-friendly way for the coder to manipulate.


I see. That does seem helpful once you consider how these types compose, rather than thinking about a one-off conversion. However, I think it would be cleaner to have a library that auto-generated a parser for a given struct paired with an endianness specification, rather than baking the endianness into the types. (Probably this could be achieved by template metaprogramming too.)


Or just use the functions in <arpa/inet.h> to convert from host to network byteorder?


this! use hton/ntoh and be happy.

nitpick: the 64bit versions are not fully available yet, htonll, ntohll


By the same token, I think most uses for C++ these days are nuts. If you're doing a greenfield project 90% of the time it's better to use Rust.

C++ has a multitude of its own pitfalls. Some of the C programmer hate for C++ is justified. After all, it's just C with a pre-processing stage in the end.

There's good reasons why many C projects never considered C++ but are already integrating the nascent Rust. I always hated low level programming until Rust made it just as easy and productive as high level stuff


Wouldn't that cast be UB because it is type punning?


No, because no punning exists here. The code is C++, so this calls a conversion function that likely does the bit manipulation internally in a legal way.


char* is allowed to alias other pointer types.


Hm. Afaik, you are always allowed to convert _to_ a char, but _from_ is not ok in general. See e.g. [0]

[0] https://gist.github.com/shafik/848ae25ee209f698763cffee272a5...


Why is it not ok to convert from a char? Some of the information in the gist is wrong. Type punning with unions, for example, is legal. ANSI X3.159-1988 is quite clear on that point in its aliasing rules. I've seen a lot of comments people post online saying you must use memcpy to read the bits in a float or that C++ forbids union punning, but where is that written? Since if that were true, every math library would break.


Remember how we used to have machines with a 7 bit byte? And everything was written to handle either 6, 7, or 8 bit bytes?

And now we've settled on all machines being 8 bit bytes, and programmers no longer have to worry about such details?

Is it time to do the same for big endian machines? Is it time to accept that all machines that matter are little endian, and the extra effort keeping everything portable to big endian is no longer worth the mental effort?


That reminds me of a project to interface with vending machines. (We built a bookshop in a vending machine that would tweet whenever it sold an item, with automated stock management.)

Vending machines have an internal protocol a little like I2C. We created a custom peripheral to bridge the machine to the web, based on a Raspberry Pi.

The protocol was defined by Coca Cola Japan in 1975 (in order to have optionality in their supply chain). It's still in use today. But because it was designed in Japan, with a need for wide characters, it assumes 9 bit bytes.

We couldn't find any way to get a Raspberry Pi to speak 9 bit bytes. The eventual solution was a custom shield that would read the bits, and reserialise to 8 bit bytes for the Pi to understand. And vice versa.

9 bit bytes. I grew up knowing that bytes had variable length, but this was the first time I encountered it in the wild. This was 2015.


This just doesn't seem right. Granted, I don't know much about your use case, but Raspberry Pi's are powerful computing devices and I find it difficult to believe there was no way to handle this without additional hardware.


I’m not familiar with the “vending machine” protocol he’s talking about, but it’s entirely reasonable that it has certain timing requirements. Usually the way you interface with these is by having a dedicated HW block to talk the protocol, or by bit banging. The former wouldn’t be supported on RPi because it’s obscure, the latter requires tight GPIO timing control that is difficult to guarantee on a non-real-time system like the RPi usually runs.


Well you could bit bang and the 9 bits wouldn't be an issue. (Even if you had a tiny PIC microcontroller just to do that)

This is best solved as close to the device in question as possible, and in the simplest way possible.


The irony is that while a tiny PIC can do bit banging easily, the mighty Pi will struggle with it.


I'm familiar with both, and have Pi's bit-banging at 8MHz. It's not hard-realtime like a PIC though (where I've bitbanged a resistor D2A hung off a dsPIC33 to 17.734475MHz). It's an improvement over the years, but surprisingly little since bit-banging 4MHz Z80's more than 4 decades ago, where resolution was 1 T state (250ns).


The 9 bit serial OP mentioned likely doesn't have a separate clock line, so it is hard realtime and timing matters a lot, and I doubt the Pi could reliably do anything over 1 kbaud with bit banging. You could do much better if you didn't run Linux.


Sorry, dumb question: what is bit banging?


In order to exchange data over a serial connection, the ones and zeroes have to be sent with exact timing, so the receiver can reliably tell where one bit ends and the next begins. Because of this, the hardware that's doing the communication can't do anything else at the same time. And since the actual mechanics of the process are simple and straightforward, most computers with a serial connection have special serial-interface hardware (a Universal Asynchronous Receiver/Transmitter, or UART) to take care of it - the CPU gives the UART some data, then returns to more productive pursuits while the UART works away.

But sometimes you can't use a UART: maybe you're working on a tiny embedded computer without one, or maybe you need to speak a weird 9-bit protocol a standard UART doesn't understand. In that case, you can make the CPU pump the serial line directly. It's inefficient (there's probably more interesting work the CPU could be doing) and it can be difficult to make the CPU pause for exactly the right amount of time (CPUs are normally designed to run as fast or as efficiently as possible, nothing in between), but it's possible and sometimes it's all you've got. That's bit-banging.


Consider being a teacher. That's a good explanation.


The practice of using software to literally toggle (or read) individual pins with the correct software-controlled timing in order to communicate with some hardware.

To transmit a bit pattern 10010010 over a single pin channel, for example, you'd literally set the pin high, sleep for some predetermined amount of time, set it low, sleep, set it low, sleep, set it high, etc.
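For the curious, here's a minimal sketch of a transmit-side bit-bang loop in C. The gpio_write() and delay_us() helpers are hypothetical stand-ins for whatever the platform provides, and the sketch ignores start/stop framing and the timing-jitter problems discussed elsewhere in the thread:

    #include <stdint.h>

    extern void gpio_write(int pin, int level); /* assumed platform helper */
    extern void delay_us(unsigned micros);      /* assumed platform helper */

    /* Drive `nbits` bits of `bits` onto `pin`, most significant bit first,
       holding each level for one bit period. */
    static void bitbang_send(int pin, uint16_t bits, int nbits,
                             unsigned bit_time_us)
    {
        for (int i = nbits - 1; i >= 0; i--) {
            gpio_write(pin, (bits >> i) & 1);
            delay_us(bit_time_us);
        }
    }

For example, bitbang_send(pin, 0x92, 8, 104) would clock out 10010010 at roughly 9600 baud; a 9-bit protocol would simply pass nbits = 9.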


We really should have moved to 32 bit bytes when moving to 64 bit words. Would have simplified Unicode considerably.


Not really. Unicode is a variable width abstract encoding; a single character can be made up of multiple code points.

For Unicode, 32-bit bytes would be an incredibly wasteful in-memory encoding.


> Unicode is a variable width abstract encoding;

To be a bit more explicit: Unicode is a character encoding, to 20-and-a-half-bit 'bytes', that is variable-width in those 'bytes', even before considering how the 'bytes' are encoded to actual bytes. E.g. "ψ̊" (Greek small psi with ring above) is U+3C8 U+30A (two 'bytes').


Unicode is not a text encoding. UTF8, UTF16, UTF32, etc. are text encodings.

> a single character can be made up of multiple code points.

It's really the other way round...


Or maybe you meant to say: A single abstract character (or code point) can be made up of multiple code units.

Unfortunately, the term “character“ alone is ambiguous because depending on the context it can refer to either code points or code units.


To be technical, by "character" I mean "user-perceived character" or (in Unicode speak) "extended grapheme cluster". This is the thing a user will think of as one character when looking at it on their screen.

A code point is the atomic unit of the abstract Unicode encoding. By "abstract" I mean it is not an actual text encoding you can write to a file.

A code unit is the atomic unit of an actual text encoding, such as UTF-8, UTF-16LE or UTF-32LE (and their BE equivalents).

---

So to put it together a "user-perceived character" is made up of one or more "code points". When implemented in an application, each "code point" is encoded using one or more "code units".
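A concrete illustration of those three levels, reusing the "ψ̊" example from upthread (a sketch in C; the byte values are just the standard UTF-8 encoding of U+03C8 followed by U+030A):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* One grapheme cluster ("ψ̊"), two code points (U+03C8, U+030A),
           four UTF-8 code units (bytes). */
        const char utf8[] = "\xCF\x88\xCC\x8A";
        printf("UTF-8 code units: %zu\n", strlen(utf8)); /* prints 4 */
        return 0;
    }

In UTF-16 or UTF-32 the same two code points would be two code units each; the grapheme cluster count stays at one either way.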


Now that's a good summary! When talking about Unicode, some extra clarity can never hurt.


One byte = one "character" makes for much easier programming.

Text generally uses a small fraction of memory and storage these days.


> One byte = one "character" makes for much easier programming.

Only if you are naively operating in the Anglosphere / world where the most complex thing you have to handle is larger character sets. In reality, there's ligatures, diacritics, combining characters, RTL, nbsp, locales, and emoji (with skin tones!). Not to mention legacy encoding.

And no, it does not use a "small fraction of memory and storage" in a huge range of applications, to the point where some regions have transcoding proxies still.


This is not about covering ALL of Unicode. This is about starting to cover Unicode.

"Anglosphere" would be just 7(&"8") bit ASCII, and it's the current situation where it takes quite a lot of skill and knowledge just to start learning how to properly deal with Unicode, because it's often not even taught !

IMHO 32-bit bytes would help tremendously with onboarding developers into Unicode, because it would force dumping ASCII-only as the starting point (and sadly, often ending point) for teaching how to deal with text.

And who can blame the teachers, Unicode is already hard enough without even having to deal with the difficulties coming from having to explain its multi-byte representation...

Last but not least: this would have forced standardization between the Unix world, now on UTF-8, and the Windows world, which is still stuck on UTF-16 (and Windows-1252?!) for some of the core functions like filenames, which, for instance, still regularly results in issues working with files with non-ASCII filenames.


Not all user-perceived characters can be represented as a single Unicode codepoint. Hence, Unicode text encodings (almost[1]) always have to be treated as variable length, even UTF-32.

[1] at runtime, you could dynamically assign 'virtual' codepoints to grapheme clusters and get a fixed-length encoding for strings that way


Even the individual Unicode codepoints themselves are variable width if we consider that things like CJK and emoji take up more than one monospace cell.


Every time I see one of these threads, my gratitude to only do backend grows. Human behavior is too complex, let the webdevs handle UI, and human languages are too complex, not sure what speciality handles that. Give me out of order packets and parsing code that skips a character if the packet length lines up just so any day.

I am thankful that almost all the Unicode text I see is rendered properly now, farewell the little boxes. Good job lots of people.


I think we really have the iPhone jailbreakers to thank for that. U.S. developers were allergic to, almost offended by, anything that wasn't ASCII, and then someone released an app that unlocked the emoji icons that Apple had originally intended only for Japan. Emoji are defined in the astral planes, so almost nothing at the time was capable of understanding them, yet they were so irresistible that developers worldwide who would otherwise have done nothing to address their cultural biases immediately fixed everything overnight to have them. So thanks to cartoons, we now have a more inclusive world.


I'm pretty sure Unicode was pretty widespread before the iphone/emoji popularity.


There's supporting Unicode, and 'supporting' Unicode. If you're only dealing with western languages, it's easy to fall into the trap of only 'supporting' Unicode. Proper emoji handling will put things like grapheme clusters and zero-width joiners on your map.


You know, bytes are not only about text, they are also used to represent binary data...

Not to mention that bytes have nothing to do with unicode. Unicode codepoints can be encoded in many different ways: UTF8, UTF16, UTF32, etc.


https://news.ycombinator.com/item?id=27086928

These various ways to encode Unicode have quite a lot to do with bytes being 8-bit sized!


But Unicode itself doesn't!

Anyway, it doesn't make much sense to define the size of a "byte" as anything else than 8 bits, because that's the smallest addressable memory unit. If you need a 32 bit data type, just use one!


My very point is that we should have increased the size of the smallest addressable memory unit from 8 to 32 bits, increasing it once again, as previous computer architectures used from 4 to 7 bits per byte. (There might still be e-mail servers around directly compatible with "non-padded" 7-bit ASCII?)



But why? Just so that we need to do more bit twiddling and waste memory?

Again, bytes are not foremost about text. We have to deal with all sorts of data, much of which is shorter than 32 bits.

You can always pick a larger data type for your type of work, but not the opposite.


Because these days it's critical for "basic computer literacy":

https://news.ycombinator.com/item?id=27094663

https://news.ycombinator.com/item?id=27104860

(You'll also notice that caring about not wasting the 8th bit with ASCII has led us into all sorts of issues... and why care so much about it when, as soon as data density becomes important, we can use compression, which AFAIK easily rids us of padding?)


You're basically arguing against variable width text encodings - which is ok. But you know, it's entirely possible to use UTF32. In fact, some programming languages use it by default to represent strings.

But again and again, all of this has nothing to do with the size of a byte.

BTW, are you aware that 8-bit microcontrollers are still in widespread use and nowhere near being discontinued?


Static width text encoding + Unicode = Cannot fit a "character" in a single octet, which currently is the default addressable unit of storage/memory.

Programming microcontrollers isn't considered to be "mandatory computer literacy" in college, while basic scripting, which involves understanding how text is encoded at the storage/memory level - is.


Again and again and again, a byte is not meant to hold a text character. Also, as the sibling parent has pointed out, fixed width encoding only gets you so far because it doesn't help with grapheme clusters. That's probably why the world has basically settled with UTF8: it saves memory and destroys any notion that every abstract text character can somehow be represented by a single number.

> mandatory computer literacy

I don't understand why you keep bringing up this phrase and ignore a huge part of real world computing. College students should simply learn how Unicode works. Are you seriously demanding that CPU designers should change their chip design instead?


UTF32 is a variable length encoding if we consider combining characters.


Generally, I think you are conflating/confusing the concept of "byte" (= smallest unit of memory) with the concept of "character", or rather "code unit" (= smallest unit of text encoding). The size of the former depends on the CPU architecture, and on modern systems it's always 8 bits. The size of the latter depends on the specific text encoding.


People were holding off on transitioning because pointers use twice as much space in x64. If bytes had quadrupled in space with x64 we would still be using 32 bit software everywhere


Well, obviously it would have delayed the transition. However, you can only go so far when limited to 4GB of memory.

And do you have examples of still widely used 8-bit sized data formats?


I assume you wrote this comment in UTF-8 over HTTP (ASCII-based) and TLS (lots of uint8 fields).


To clarify: I'm more concerned about "final" data formats, less about "transport" ones, which need much longer legacy support.


You can go very far with just 4GB of memory, especially when not using wasteful software.


MIDI (8 bit), 16 bit PCM, 24 bit PCM and basically any compressed data format (which is always byte based, because the idea is to save memory). You obviously don't care about memory, but many people do!


But compressed data formats aren't going to care about byte size for this very reason...


Ok, bad example.

But still, 'byte' refers to the smallest addressable unit of memory. There's just no point in arguing over its size...


RGB and Y′CbCr


To start with, RGB (and I assume Y′CbCr?) can be encoded in many different ways. The most common one today (still) uses 8 bits per channel, meaning that a separate 1-octet value can only define monochrome. Therefore 8-bpc RGB is a 24-bit sized format, not an 8-bit sized one.

And, by an interesting coincidence, with the arrival of "HDR", 8 bits per channel is slowly becoming obsolete (because insufficient). The next "step" is 10 bits per channel with 3 channels (hence "HDR10(+)"), and so should fit quite well in 32 bits?

(However, it would seem that even Dolby's Perceptual Quantizer transfer function might need 12 bits per channel to avoid banding over the "HDR" Rec.2020/2100-sized color gamut..?)


We're debating semantics, but if I reshaped an RGB image into component arrays i.e. u8[yn][xn][3] → u8[3][yn][xn] then would you still view that as a 24-bit format? What if those 24-bit values were huffman or run-length encoded would it be an n-bit format? If your Y′CbCr luminance plane has a legal range of 16..235 and the chrominance planes are 16..240, then would it be a 23.40892 bit format?


I'm arguing about non-compressed, possibly padded data types that make learning Unicode (or any other applicable data format) easier because of the equivalence: 1 atomic unit ("character", pixel) = 1 smallest addressable unit of memory (byte). This requires the byte size to be at least as large as the atom size.

And it's particularly important to have this property for text, because not only is data overwhelmingly stored as text (in importance, not by "weight"), but computer programs themselves are written using text.


Can you recommend me a good PC computer display at any cost that has an objectively good gamut so I can see what you see?


I'm sorry, I'm not sure that I understand?


Well I figured since you feel strongly about using a type wider than 8 bits for RGB, you must have a really good display that actually lets you perceive the colors that it enables you to encode. Most PC displays are garbage, including the expensive ones, because first, sRGB only specifies a very small portion of light that's perceivable, and secondly, any display maker who builds something better is going to run into complaints about how terrible Netflix looks, because it reveals things like banding (which you mentioned) that otherwise wouldn't be perceivable. So I was hoping you could recommend me a better monitor, so I can get into >8 bit RGB, because I've found it exceedingly difficult to shop around for this kind of thing.


Ok, so you weren't sarcastic and/or misunderstanding my use of "atomic".

Sadly, I kind of gave up on getting a "HDR" display, at least for now, because:

- AFAIK neither Linux nor Windows have good enough "HDR" support yet. (MacOS supposedly does, but I'm not interested.)

- I'm happy enough with my HP LP2475w which I got for dirt cheap just before "HDR" became a thing. I consider the 1920x1200 resolution to be perfect for now (as a bonus I can manually scale various old resolutions like 800x600 to be pixel-perfect) - too many programs/OSes still have issues with auto-scaling programs on higher resolution screens (which would come with "HDR"). I'm also particularly fond of the 16:10 ratio, which seems to have gone extinct.

- Maybe I'll be able to run this monitor properly in wide gamuts (though with banding), or maybe even in some kind of ""HDR" compatibility mode", though it would seem that the current sellers of "HDR" screens aren't going to make that easy. I might be able to get a colorimeter soon to properly calibrate it.


If you have a $200 monitor then it probably struggles to make proper use of 8-bit formats. I have a display that claims to simulate DICOM but it's not enough I want more. However I'm not willing to spend $3000 on a display which doesn't have engineering specs and then send it back because it doesn't work. I don't care about resolution. I care about being able to see the unseen. I care about edge cases like yellow and blue making pink. That was the first significant finding Maxwell reported on when he invented RGB. However nearly every monitor ever made mixes those two colors wrong, as gray, due to subpixel layout issues. Nearly every scaling algorithm mixes those two colors wrong too, due to the way sRGB was designed. It's amazing how poorly color is modeled on personal computers. https://justine.lol/maxwell.png


Well, when released in 2008 it was a $600 monitor, I got it second-hand for 80€.

I'm not sure what DICOM has to do with color reproduction quality? Also, it seems to be a standard quite a bit older than sRGB...

By definition, you can't "see the unseen". "Yellow" and "blue" are opponent "colors", so, by definition, a proper mixture of them is going to give you grey:

https://www.handprint.com/HP/WCL/color2.html#opponentfunctio...

Also, when talking about subtle color effects, you have to consider that personal variation might come into play (for instance red-green "colorblindness" is a spectrum).


It looks like this thing is the thing I want to buy https://www.apple.com/pro-display-xdr/ There's plenty of light that is currently unseeable. Look at the chromaticity chart for sRGB. If your definition of color mixes yellow and blue as grey then you've defined color wrong, because nature has a different definition where it's pink. For example the CIELAB colorspace will mix the two as pink. Also I'm not colorblind. If I'm on a spectrum, I would be on the able-to-see-more-colors-more-accurately end of it. Although when designing charts I'm very good at choosing colors that accommodate people who are colorblind, while still looking stylish, because I feel like inclusive technology is important.


Use Erlang. It has 32-bit char.


Not really. Strings are a list of integers [1], integers are signed and fill a system word, but there's also 4 bits of type information. So you can have a 28-bit signed integer char on a 32-bit system or a signed 60-bit integer.

However, since Unicode is limited to 21 bits by the UTF-16 encoding, a Unicode code point will fit in a small integer.

[1] unless you use binaries, which is often a better choice.


We used to have machines with arbitrarily sized bytes, and 36 bit words!

http://pdp10.nocrew.org/docs/instruction-set/Byte.html

>In the PDP-10 a "byte" is some number of contiguous bits within one word. A byte pointer is a quantity (which occupies a whole word) which describes the location of a byte. There are three parts to the description of a byte: the word (i.e., address) in which the byte occurs, the position of the byte within the word, and the length of the byte.

>A byte pointer has the following format:

     000000 000011 1 1 1111 112222222222333333
     012345 678901 2 3 4567 890123456789012345
     _________________________________________
    |      |      | | |    |                  |
    | POS  | SIZE |U|I| X  |        Y         |
    |______|______|_|_|____|__________________|
>POS is the byte position: the number of bits from the right end of the byte to the right end of the word. SIZE is the byte size in bits.

>The U field is ignored by the byte instructions.

>The I, X and Y fields are used, just as in an instruction, to compute an effective address which specifies the location of the word containing the byte.

"If you're not playing with 36 bits, you're not playing with a full DEC!" -DIGEX (Doug Humphrey)

http://otc.umd.edu/staff/humphrey


What happens is that all machines that matter are little endian, but the network always works in big endian.


Isn’t big endian a bit more natural considered on a bit level? The bits start from highest to lowest on a serial connection.


Big-endian is natural when you're comparing numbers, which is probably why people represent numbers in a big-endian fashion.

Little-endian is natural with casts because the address doesn't change, and it's the order in which addition takes place.


I feel like big endian is more _intuitive_ because that's what our number notation has evolved to be.

But more _natural_ is little endian because, well, it's just more straightforward to have the digits' magnitude be in ascending order (2^0, 2^1, 2^2, 2^3...) instead of putting it in reverse.

Plus you encounter fewer roadblocks in practice with little endian (e.g. address changes with casts), which is often a sign of good natural design.


I'm curious how you're defining "natural", and if you think ISO-8601 is the reverse of "natural" too.

All human number systems I've ever seen write numbers out as big Endian (yes, even Roman numerals), so I'm really struggling to see how that wouldn't be considered natural.


Counting out change is little endian - usually you start with cents, then dollars.

I wonder if we went big endian “by mistake” with Arabic numerals given that Arabic is written right to left.

Some ancient texts have “four and twenty” which is little endian.

We also add commas to large numbers to help with a human processing problem - you have to get to the end of the number to know what the first digit represents and then count backwards (groups of three help).


It seems like it would be a more natural for representing the number when communicating with a human.

But that's not what we're doing here, so it's not entirely relevant.


> The bits start from highest to lowest on a serial connection.

This is only true on a big endian serial connection (that is, one that, tautologically, sends the most-significant bit first). Offhand, I think most serial protocols are big endian, but by that logic, most CPUs are little endian, so that doesn't really help.

The thing that's actually useful about big endian is not that it's natural (as kangalioo points out, that's little endian) or that it's how humans write numbers (by that logic crap like BCD or decimal floats is a good idea), but that big endian preserves lexicographic order of fixed-width integers.


We'll have to keep it as a quirk of history...

A bit like the electron has a negative charge...


They had a 50/50 chance at getting the technical electricity direction right... and they fucked it up!


Machines that matter to who? Maintainers of packages for OSes that support s390x (RHEL/Fedora, SUSE, Debian), Arduino AVR users, AIX users, embedded systems people with, say, Coldfire or some DSP? I don't know to what extent portable code is relevant to embedded systems, but I guess they care about cryptography libraries, for instance. (I know it's bi- in principle, but SPARC is still even in the Top500.)


> Is it time to accept that all machines that matter are little endian.

Well, no, because it's not the case. SPARC is big-endian, and a bunch of IBM processors. ARM processors are mostly bi-endian.

> Is it time to do the same for big endian machines?

No. Not just because of their prevalence, but because there isn't a compelling reason why everything should be little-endian.


IBM is going to be pretty annoyed when your code doesn't work on their mainframes.


In my experience IBM does the right thing and sends patches rather than asking us to fix their problems for them, and I respect them for that reason, even if it's a tiny burden to review those changes.

However endianness isn't just about supporting IBM. Modern compilers will literally break your code if you alias memory using a type wider than char. It's illegal per the standard. In the past compilers would simply not care and say, oh the architecture permits unaligned reads so we'll just let you do that. Not anymore. Modern GCC and Clang force your code to conform to the abstract standard definition rather than the local architecture definition.

It's also worth noting that people think x86 architecture permits unaligned reads but that's not entirely true. For example, you can't do unaligned read-ahead on C strings, because in extremely rare cases you might cross a page boundary that isn't defined and trigger a segfault.


> It's also worth noting that people think x86 architecture permits unaligned reads but that's not entirely true. For example, you can't do unaligned read-ahead on C strings, because in extremely rare cases you might cross a page boundary that isn't defined and trigger a segfault.

But that's not a problem with an unaligned read but rather that you are reading more than you are allowed to. And in C even an aligned readahead is UB.

A better example might be SSE instructions which do have aligned variants that trap on unaligned pointers.


Yes, IBM provided asm for s390 hton/ntoh, and "all we had to do" for mainframe Linux was patch x86-only packages to use hton/ntoh when they persisted binary data. For the kernel IBM did it on their own, contributing mainline; for userland SUSE did it, grabbing some patches from the Japanese TurboLinux, and then Red Hat grabbed the patches from Turbo and SUSE, and together we got them mainline lol. And PPC then just piggybacked on top of that effort.


> So the solution is simple right? Let's just use unsigned char instead. Sadly no. Because unsigned char in C expressions gets type promoted to the signed type int.

If you do use unsigned char, an alternative to masking would be performing the cast to uint32_t before instead of after the shift.

edit: For reference, this is what it would look like when implemented as a function instead of a macro:

    static inline uint32_t read32be(const uint8_t *p)
    {
        return (uint32_t)p[0] << 24
             | (uint32_t)p[1] << 16
             | (uint32_t)p[2] <<  8
             | (uint32_t)p[3];
    }


In case anyone else wonders how the code in the linked tweet [0] would format your hard drive, it's the missing return on f1. Therefore, f1 is empty as well (no ret) and calling it will result in f2 being run. The commented out code is irrelevant.

EDIT: Reading the bug report [1], the actual cause for the missing ret is that the for loop will overflow, which is UB and causes clang to not emit any code for the function.

[0] https://twitter.com/m13253/status/1371615680068526081

[1] https://bugs.llvm.org/show_bug.cgi?id=49599


The first example in the article is flawed (or at least misleading).

1) They define a char array (which defaults to signed char, as mentioned in the post), including the value 0x80 which can't be represented in char, resulting in a compiler warning (e.g. in GCC 11.1).

The mentioned reason against using unsigned char (that shifting 128 left by 24 places results in UB) is also misleading: I could not reproduce the UB when changing the array to unsigned char. Perhaps the author meant leaving the array defined as signed char, but casting the signed chars to unsigned before shifting. That indeed results in UB, but I don't see why you would define the array as signed in the first place.

2) The cause for the undefined behavior isn't the bswap_32, rather it's because they try reading a uint32_t value from a char array, where b[0] is not aligned on a word boundary.

There is no need at all to redefine bswap. The simple solution would be to use an unsigned char array instead of a char array and just read the values byte-wise.

Of course C has its footguns and warts and so on, but there is no need to dramatize it this much in my opinion.

I've prepared a Godbolt example to better explain the arguments mentioned above: https://godbolt.org/z/Y1EWK6e17

Edit: To add to point 2) above: Another way to avoid the UB (in this specific case) would be to add __attribute__ ((aligned (4))) to the definition of b. In that case, even reading the array as a single uint32_t works as expected since the access is aligned to a word boundary.

Obviously, you can't expect any random (unsigned char) pointer to be aligned on a word boundary. Therefore, it is still necessary to read the uint32_t byte by byte.


> The mentioned reason against using unsigned char (that shifting 128 left by 24 places results in UB) is also misleading

No, that reasoning is correct. Integer promotions are performed on the operands of a shift expression, meaning the left operand will be promoted to signed int even if it starts out as unsigned char. Trying to shift a byte value with the highest bit set by 24 will result in a value not representable as signed int, leading to UB.


Thanks, I just noticed a small mistake in my example (I don't trigger the UB because I access b[0] containing 0x80 without shifting, however I meant to do it the other way around).

Still, adding an explicit cast to the left operand seems to be enough to avoid this, e.g.:

  uint32_t x = ((uint32_t)b[0]) << 24;
In summary, I think my point that using unsigned char would be appropriate in this case still stands.


> Still, adding an explicit cast to the left operand seems to be enough to avoid this

Indeed. See my other comment, https://news.ycombinator.com/item?id=27086482


Byte order is one of the great unnecessary historical fuck ups in computing.

A similar one is that signedness of char is machine dependent. It's typically signed on Intel and unsigned on ARM.

Sigh!


Why is it an issue any more than, say, the order of fields in a struct is an issue? In one case you read bytes off the disk by doing ((b[0] << 8) | b[1]) (or equivalent), and with the order reversed the other way around. Any application-level program (say, not a compiler, debugger, etc.) should not even need to know the native byte order; it should only need to know the encoding that the file it's trying to read used.
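To make that concrete, here is a sketch of reading a 16-bit field according to the file format's declared byte order, with no reference at all to the host's endianness:

    #include <stdint.h>

    /* The format says this field is big endian. */
    static uint16_t read16be(const unsigned char *b)
    {
        return (uint16_t)((b[0] << 8) | b[1]);
    }

    /* The format says this field is little endian. */
    static uint16_t read16le(const unsigned char *b)
    {
        return (uint16_t)((b[1] << 8) | b[0]);
    }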


> order of fields in a struct

This is defined in C to be the order the fields are declared in.


But the padding rules between fields are a mess.


By the way, mathematicians also have their fuck ups:

https://tauday.com/tau-manifesto


For anyone curious or who is still attached to pi, here is a response to the tau manifesto:

https://blog.wolfram.com/2015/06/28/2-pi-or-not-2-pi/


The good thing is that Big Endian is pretty much irrelevant these days. Of all the historically Big Endian architectures, s390x is indeed the only one left that has not switched to little endian.


Network byte order is big endian so it is far from being pretty much irrelevant these days.


Also, this might be irrelevant at the cpu level, but within a byte, bits are usually displayed most significant bit first, so with little endian you end up with bit order:

7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8

instead of

15 to 0

This is because little endian is not how humans write numbers. For consistency with little endianness we would have to switch to writing "one hundred and twenty three" as

321


Correct me if I'm wrong, but were the now common numbers not imported in the same order from Arabic, which writes right to left? So numbers were invented in little endian, and we just forgot to translate their order.


Good question, I just did a little digging to see if I could find out. It sounds like old Arabic did indeed use little endian in writing and speaking, but modern Arabic does not. However, place values weren’t invented in Arabic, Wikipedia says that occurred in Mesopotamia, which spoke primarily Sumerian and was written in Cuneiform - where the direction was left to right.

https://en.wikipedia.org/wiki/Number#First_use_of_numbers

https://en.wikipedia.org/wiki/Mesopotamia

https://en.wikipedia.org/wiki/Cuneiform


It might not be how humans write numbers but it is consistent with how we think about numbers in a base system.

123 = 3x10^0 + 2x10^1 + 1x10^2

So if you were to go and label each digit in 123 with the power of 10 it represents, you end up with little endian ordering (eg the 3 has index 0 and the 1 has index 2). This is why little endian has always made more sense to me, personally.


I always think about values in big endian, largest digit first. Scientific notation, for example, since often we only care about the first few digits.

I sometimes think about arithmetic in little endian, since addition always starts with the least significant digit, due to the right-to-left dependency of carrying.

Except lately I’ve been doing large additions big-endian style left-to-right, allowing intermediate “digits” with a value greater than 9, and doing the carry pass separately after the digit addition pass. It feels easier to me to think about addition this way, even though it’s a less efficient notation.

Long division and modulus are also big-endian operations. My favorite CS trick was learning how you can compute any arbitrarily sized number mod 7 in your head as fast as people are reading the digits of the number, from left to right. If you did it little-endian you’d have to remember the entire number, but in big endian you can forget each digit as soon as you use it.
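For anyone who wants to try the mod-7 trick, it is just a left-to-right fold of the digits; a minimal sketch:

    /* Running remainder: 123 mod 7 == ((1*10 + 2)*10 + 3) mod 7, so each
       digit can be folded in and forgotten as soon as it is read. */
    static unsigned mod7_of_decimal(const char *digits)
    {
        unsigned r = 0;
        for (; *digits >= '0' && *digits <= '9'; digits++)
            r = (r * 10 + (unsigned)(*digits - '0')) % 7;
        return r;
    }

The same shape works for any modulus; 7 is just the classic party trick because it has no simple digit rule like 3 or 9 do.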


I don't know, when we write in general, we tend to write the most significant stuff first so you lose less information if you stop early. Even with numbers we truncate: "twelve million" instead of something like "twelve million, zero thousand, zero hundred and zero".


Next you are going to want little endian polynomials, and that is just too far. Also, the advantage of big endian is it naturally extends to decimals/negative exponents where the later on things are less important. X squared plus x plus three minus one over x plus one over x squared etc.

Loss of big endian chips saddens me like the loss of underscores in var names in Go Lang. The homogeneity is worth something, thanks intel and camelCase, but the old order that passes away and is no more had the beauty of a new world.


In German _ein hundert drei und zwanzig_, literally _one hundred three and twenty_. The hardest part is telephone numbers, which are usually given in blocks of two digits.


Well that would be hard for me to learn. I always find the small numbers between like 10 and 100 or 1000 the hardest for me to remember in languages I am trying to learn a bit of.


Exactly. This is so infuriating. Whoever let little-endian win did humanity a huge disservice.


The only benefit to big endian is that it's easier for humans to read in a hex dump. Little endian on the other hand has many tricks available to it for building encoding schemes that are efficient on the decoder side.


Could you elaborate on these tricks? This sounds interesting.

The only thing I'm aware of that's neat in little endian is that if you want the low byte (or word or whatever suffix) of a number stored at address a, then you can simply read a byte from exactly that address. Even if you don't know the size of the original number.


I've posted in some other replies, but a few:

- Long addition is possible across very large integers by just adding the bytes and keeping track of the carry.

- Encoding variable sized integers is possible through an easy algorithm: set aside space in the encoded data for the size, then encode the low bits of the value, shift, repeat until value = 0. When done, store the number of bytes you wrote to the earlier length field. The length calculation comes for free. (A sketch follows after this list.)

- Decoding unaligned bits into big integers is easy because you just store the leftover bits in the next value of the bigint array and keep going. With big endian, you're going high bits to low bits, so once you pass to more than one element in the bigint array, you have to start shifting across multiple elements for every piece you decode from then on.

- Storing bit-encoded length fields into structs becomes trivial since it's always in the low bit, and you can just incrementally build the value low-to-high using the previously decoded length field. Super easy and quick decoding, without having to prepare specific sized destinations.
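Here is a sketch of that variable-size integer point, with a made-up layout (one length byte followed by the value's bytes, low byte first); it is only meant to show how the length falls out of the loop:

    #include <stddef.h>
    #include <stdint.h>

    /* Returns the total number of bytes written to `out`. */
    static size_t encode_varuint_le(uint8_t *out, uint64_t value)
    {
        uint8_t *len_field = out++;   /* reserve space for the length */
        uint8_t n = 0;
        do {
            out[n++] = (uint8_t)(value & 0xff); /* low byte first */
            value >>= 8;
        } while (value != 0);
        *len_field = n;               /* backfill the byte count for free */
        return (size_t)n + 1;
    }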


Blame the people who failed to localize the right-to-left convention when Arabic numerals were adopted. It's one of those things like pi vs. tau or Jacobin weights and measures vs. Planck units. Tradition isn't always correct. John von Neumann understood that when he designed modern architecture, and muh hex dump is not an argument.


that's why little endian == broken endian

said a friend who also quips: "never trust a computer you can lift"


> The good thing is that Big Endian is pretty much irrelevant these days.

This is nonsense - many file formats are big endian.


With a bonus of some being EBCDIC too.


This is true.


Network protocols still mostly use "Network Byte Order", i.e. big endian.


Or text. Or handled by generated code like protobuf.


As discussed in a subthread yesterday [0], ARM does support big endian, though it is not used much anymore; it is still there.

POWER also still uses big endian, though recently little endian POWER has gotten more popular.

[0]: https://news.ycombinator.com/item?id=27075419


Even if all CPUs were little-endian, big-endian would exist almost everywhere except CPUs, including in your head. Unless you're some odd person that actually thinks in little-endian.


I don't think it's a fuck up, rather I think it was unavoidable: Both ways are equally valid and when the time came to make the decision, some people decided one way, some people decided the other way.


And which is the correct byte ordering, pray tell?


Big and little endian are named after the never-ending "holy" war in Gulliver's Travels over how to open eggs. So we were always of the opinion that it doesn't really matter. But I open my eggs on the little end.


Big Endian of course :-) However the one which has won is Little Endian. Even IBM admitted this when it switched the default in POWER 7 to little endian. s390x is the only significant architecture that is still big endian.


Little endian has the advantage that you can read the low bits of data without having to adjust the address. So you can for example do long addition in memory order rather than having to go backwards, or (with an appropriate representation such as ULEB128) in one pass without knowing the size.
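A minimal sketch of that, assuming the big integers are stored least significant byte first, one byte per array element:

    #include <stddef.h>
    #include <stdint.h>

    /* acc += src, both `n` bytes long, walking memory in increasing
       address order and carrying as we go. */
    static void add_le(uint8_t *acc, const uint8_t *src, size_t n)
    {
        unsigned carry = 0;
        for (size_t i = 0; i < n; i++) {
            unsigned sum = acc[i] + src[i] + carry;
            acc[i] = (uint8_t)sum;
            carry = sum >> 8;
        }
        /* a final carry out of the top byte would need one extra byte */
    }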


Maybe I am biased working on mainframes, but I would personally take big endian over little endian. The reason is when reading a hex dump, I can easily read the binary integers from left to right.


That's the only thing that BE has over LE.

But for example bitmaps in BE are a huge source of bugs, as readers and writers need to agree on the size to use for memory operations.

"SIMD in a word" (e.g. doing strlen or strcmp with 32- or 64-bit memory accesses) might have mostly fallen out of fashion these days, but it's also more efficient in LE.


Big endian is easier for humans to read when looking at a memory dump, but little endian has many useful features in binary encoding schemes due to the low byte being first.

I used to like big endian more, but after deep investigation I now prefer little endian for any encoding schemes.


Couldn’t encoding systems be redone with emphasis on the high-order bits? Or is the assumption that the values are clustered in the low bits?


I think the fundamental problem is that if you start a computation using the N most significant bits and then incrementally add more bits, e.g. N+M bits total, then your first N bits might change as a result.

E.g. decimal example:

    1.00/1.00 = 1.00
    1.000/1.001 = 0.999000999000...
(adding one more digit changes the first digits of the outcome)


You can put emphasis on high order bits, but that makes decoding more complex. With little endian the decoder builds low to high, which is MUCH easier to deal with, especially on spillover.

For example, with ULEB128 [1], you just read 7 bits at a time, going higher and higher up the value you're reconstituting. If the value grows too big and you need to spill over to the next (such as with big integer implementations), you just fill the last bits of the old value, then put the remainder bits in the next value and continue on.
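A minimal ULEB128 decoder along those lines (a sketch, not the canonical implementation; each input byte carries 7 payload bits plus a continuation flag in the top bit):

    #include <stddef.h>
    #include <stdint.h>

    /* Returns the number of bytes consumed, or 0 on truncated/oversized input. */
    static size_t uleb128_decode(const uint8_t *in, size_t len, uint64_t *out)
    {
        uint64_t value = 0;
        unsigned shift = 0;
        for (size_t i = 0; i < len && shift < 64; i++) {
            value |= (uint64_t)(in[i] & 0x7f) << shift; /* build low-to-high */
            shift += 7;
            if ((in[i] & 0x80) == 0) {  /* continuation bit clear: done */
                *out = value;
                return i + 1;
            }
        }
        return 0;
    }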

With a big endian encoding method (i.e. VLQ used in MIDI format), you start from the high bits and work your way down, which is fine until your value spills over. Because you only have the high bits decoded at the time of the spillover, you now have to start shifting bits along each of your already decoded big integer portions until you finally decode the lowest bit. This of course gets progressively slower as the bits and your big integer portions pile up.

Encoding is easier too, since you don't need to check if for example a uint64 integer value can be encoded in 1, 2, 3, 4, 5, 6, 7 or 8 bytes. Just encode the low 8 bits, shift the source right by 8, repeat, until the source value is 0. Then backtrack to the as-yet-blank encoded length field in your message and stuff in how many bytes you encoded. You just got the length calculation for free. Use a scheme where you only encode up to 60 bit values, place the length field in the low 4 bits, and Robert's your father's brother!

For data that is right-heavy (i.e. the fully formed data always has real data on the right side and blank filler on the left - such as uint32 value 8 is actually 0x00000008), you want a little endian scheme. For data that is left-heavy, you want a big endian scheme. Since most of the data we deal with is right-heavy, little endian is the way to go.

You can see how this has influenced my encoding design in [2] [3] [4].

[1] https://en.wikipedia.org/wiki/LEB128

[2] https://github.com/kstenerud/concise-encoding/blob/master/cb...

[3] https://github.com/kstenerud/compact-float/blob/master/compa...

[4] https://github.com/kstenerud/compact-time/blob/master/compac...


Middle-endian is the only correct answer. It's a tradeoff between both little-endian and big-endian. The PDP-11 got it right.


Yup, we're all waiting for the rest of the world to catch up to MM/DD/YYYY.


I don't suppose it's being modified, but I wonder how much -11 code is still running, even on real hardware.


the greatest of all is lisp not being the most mainstream language, and we can only blame the lisp companies for this fiasco. in an ideal world we all would be using a lisp with parametric polymorphism. from highest level abstractions to machine level, all in one language.


A while back I was on a project to port a satellite simulator from SPARC/Solaris to RHEL/x64. The compressed telemetry stream that came from the satellite needed to be in big endian (and that's what the ground station software expected), and the simulator needed to mimic the behavior.

This was not a problem for the old SPARC system, which naturally put everything in the correct order without any fuss, but one of the biggest sticking points in porting over to x64 was having to now manually pack all of that binary data. Using Ada, (what else!) of course.


If memory serves correctly, ada 2012 and beyond has language level support for this. I was working on porting some code from an aviation platform to run on PC and it was all in ada 2005 so we didn't have the benefit of that available.


Same here, Ada2005 for the port. The simulator was originally written in Ada95. Part of what made it even less fun was the data was highly packed and individual fields crossed byte boundaries (these 5 bits are X, the next 4 bits are Y, etc.) :(


Couldn't you add the Bit_Order and Scalar_Storage_Order attributes (or aspects in Ada 2012) to your records/arrays? Or did Scalar_Storage_Order not exist at the time?


Given enough memory it may be worth treating the whole stream internally as a bitstream.


Ubsan should default on. If people don't like it, then they should be made turn it off with a switch, so at least it's more likely to be run than not run. Could save a huge amount of time debugging when compilers or architecture changes. Without it, I'd say many a programmer would be caught by these subtleties in the standard. Coming from a HW background (Verilog) I'd more naturally default to masking and shifting when building up larger variables from smaller ones, but I can imagine many would not.


Sanitizers may introduce side channels. This is an issue for crypto code.


There was a blog post and a FOSDEM presentation by (misguided) Gentoo developers a few years ago, and it was retracted, because sanitizers add their own exploitable vulnerabilities due to the way they work.

https://blog.hboeck.de/archives/879-Safer-use-of-C-code-runn...

https://www.openwall.com/lists/oss-security/2016/02/17/9


Sorry for my ignorance, but surely some UB being used for optimization by the compiler is compile time only. This is the part that should default on. Runtime detection is a different thing entirely, but compile time is a no brainer.


UBSAN detects undefined behavior at run-time. Compile-time detection of undefined behavior is present in the form of compiler warnings, but catches far from all cases of undefined behavior. The compiler does not actively exploit undefined behavior in the sense that it does not contain code like this:

  if (undefined_behavior) break_program()
If it did, it could easily report the undefined behavior. However, that's not how it works. Instead, the compiler has optimization rules that are only valid if the code contains no undefined behavior. If the code contains undefined behavior, the optimization rules change the result of the program. For example, this code:

   bool function(int x) {
      return x + 1 > x;
   }
Can be optimized to "return true". That is correct if x does not overflow, but if x overflows and wraps around the optimization changes the result of the program. In this case, it is acceptable according to the C/C++ standards for the optimization to assume x does not overflow, and hence this optimization is valid.

The compiler could tell you for every instance of signed integer arithmetic that it is making assumptions about your program, and that the signed integer arithmetic could potentially overflow, but that doesn't seem particularly helpful.


Thanks, though I'm not sure all compile-time-detectable undefined behaviour is exposed through warnings today. In the example in the article, why would something like a left shift of a negative value require runtime detection? Surely the fact that a signed char was used with a left shift is all the compiler needs. So perhaps the subset of UB that is detectable at compile time should be reported.

In your example about the comparison of x + 1 vs x, I'm not sure that is a controversial optimization. However this one, to me, is:

http://blog.llvm.org/2011/05/what-every-c-programmer-should-...

Here a diligent programmer is trying to do a null pointer check, but because dereferencing null is UB, then the optimizer removes the null pointer check. This is compile time UB that should be flagged to users.


Sanitizers have the ability to bring Rust-like safety assurances to all the C/C++ code that exists. The fact that existing ASAN runtimes weren't designed for setuid binaries shouldn't dissuade us from pursuing those benefits. We just need a production-worthy runtime that does less things. For example, here's the ASAN runtime that's used for the redbean web server: https://github.com/jart/cosmopolitan/blob/master/libc/intrin...


Run-time detection and heuristics on a language that is hard to analyze (e.g. due to weak aliasing, useless const, ad-hoc ownership and thread-safety rules) aren't in the same ballpark as compile-time safety guaranteed by construction, and an entire modern ecosystem centered around safety. Rust can use LLVM sanitizers in addition to its own checks, so that's not even a trade-off.


Oh I believe you but as you point out we need ASAN to make Rust codebases safer too. One of the things that's helped Rust be successful is that we're able to quickly write bindings for legacy C/C++/FORTRAN code using the unsafe keyword. The last Rust codebase I worked on had about 70k unsafe lines. One day Rust will be complete and we will rewrite all the legacy code but until then we depend on the low level C tooling to provide assurances like byte-granular invalid address access trapping.


> Ubsan should default on

> Could save a huge amount of time debugging when compilers or architecture changes.

I'm assuming we come from very different backgrounds, but it's not clear to me how switching compilers or architectures is so common that hardening code against it by default is appropriate. I would think that switching compilers or architectures is generally done very deliberately, so instrumenting code with UBsan for that transition would be the right thing to do?


Changing gcc version could cause the behaviour of your code with undefined behaviour to change. If you rely on UB, whether you know you are or not, you are in for a bad time. Ubsan at least lets you know if your code is robust, or a ticking time bomb...


Changing compilers is a pretty regular thing IMHO; I use the compiler that comes with the OS and let's assume a yearly OS release cycle. Most of those will contain at least some changes to the compiler.

I don't really want to have to take that yearly update to go through and review (and presumably fix) all the UB that has managed to sneak in over the year. It would be better to have avoided putting it in.


If you can assume GCC or Clang then __builtin_bswap{16,32,64} functions are provided which will be considerably more efficient, less error-prone, and easier to use than anything you can homebrew.


Well, yes. The only thing missing is knowing if you have to swap or not, if you don't want to assume your code will run on little endian systems exclusively.

Or, on Linux and BSD systems at least, you can use the <endian.h> or <sys/endian.h> functions (https://linux.die.net/man/3/endian) and rely on the libc implementation to do the system/compiler detection for you and use an appropriate compiler builtin inside of an inline function instead of bothering to hack something together in your own code.

The article mentions those functions at the bottom, but strangely still recommends hacking up your own macros.
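For completeness, if neither header is available, the detection can also be done with the byte-order macros that GCC and Clang predefine. A sketch assuming one of those two compilers (MSVC and others would need their own path):

    #include <stdint.h>

    static inline uint32_t be32_to_host(uint32_t v)
    {
    #if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
        return __builtin_bswap32(v);  /* big-endian wire value, LE host */
    #else
        return v;                     /* already host order on a BE host */
    #endif
    }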


My favourite builtins are the overflow checked integer operations:

https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins...


But then you have to #ifdef the endianness of the target architecture. If you do it the right way as Russ Cox and Justine Tunney say, then your code can serialize and deserialize correctly regardless of the platform endianness.


That's not true. If you write the byte swap in ANSI C using the gigantic mask+shift expression it'll optimize down to the bswap instruction under both GCC and Clang, as the blog post points out.


Funny that compilers (e.g. clang: https://github.com/llvm/llvm-project/blob/b04148f77713c92ee5... ) might be able to do that only because someone on the compiler team has hand-coded a bswap expression detector.


Assuming the macros or your giant expression are correct. But you might as well use the compiler intrinsics which you know are both correct and the most efficient possible, and get on with your life.


Sorry, I'd rather place my faith in arithmetic than in someone's API, provided the compiler is smart enough to understand the arithmetic and optimize accordingly.


"Someone" here is the same compiler you're trusting to optimize your giant arithmetic expression of the same idea. Your statement is internally inconsistent.


There is value in keeping completely clear in your head the difference between a value with arithmetic semantics and a value with octets-in-a-stream semantics. That thinking will work in all contexts, while the compiler knowledge is limited. The thinking will help you write correct ways to encode data into a URL or into a file being uploaded that your code generates for Discord or whatever, in Python, without knowledge of the true endianness of the system the code is running on.


The fallacy in the article is that anyone should code these functions. There's plenty of public domain libraries that do this correctly.

https://github.com/rustyrussell/ccan/blob/master/ccan/endian...


_byteswap_{ushort,ulong,uint64} for MSVC. Together with yours on x86 these should take care of the three major compilers.


The article explicitly shows that the provided macros are very efficient with a modern compiler. You can check on godbolt.org that they emit the same code.

Though the article only mentions bswap64 and mentioning __builtin_bswap64 would be a nice addition.


Given it can be done with careful code AND many processors have a single instruction to do it, I'm surprised it hasn't been added to the C standard.


__builtin_bswap does exactly the same thing as the macros.


This problem is its own special horror in CAN bus data. Between endianness and sign it's a nightmare of en/decoding possibilities and the associated mistakes that come with that.


TIFF is another one. The only endian-switchable image format that I'm aware of.

Fun fact: CD-ROM superblocks have both-endian fields. Each integer is stored twice, in big and little endian format. I assume this was to allow underpowered 80s hardware which didn't have enough resources to do byte swapping.


> If you program in C long enough, stuff like this becomes second nature, and it starts to almost feel inappropriate to even have macros like the above, since it might be more appropriately inlined into the specific code. Since there have simply been too many APIs introduced over the years for solving this problem. To name a few for 32-bit byte swapping alone: bswap_32, htobe32, htole32, be32toh, le32toh, ntohl, and htonl which all have pretty much the same meaning.

> Now you don't need to use those APIs because you know the secret.

This sentiment seems problematic. The solution shouldn't be "we just have to educate the masses of C programmers on how to properly deal with endianness". That will never happen.

The solution should be "It's in the standard library. Go look there and don't think too hard." C is sufficiently low-level, and endianness problems sufficiently common, that I would expect that kind of routine to be available.


Typical C culture, you would also expect that by now something like SDS would be part of the standard as well.

https://github.com/antirez/sds


Adding API that introduces an entirely new string model that is incompatible with the rest of the standard library seems like a nonstarter.


Yeah, because that is the only way that they could ever do it, assuming WG14 would ever care about a safer C.


The point is that keeping the distinction clear in your head between numeric semantics and sequence-of-octets semantics makes the problem universally tractable. Here you have a data structure with a numeric value. There you have a sequence of octets described by some protocol formalism, BNF in the old days. The mapping from one to the other occurs in the math between octets and numeric values and the various network protocols for representing numbers. There are many more choices than just big endian or little endian. Could be ASN.1 infinite precision ints. Could be 32 bit IEEE floats or 64 bit IEEE floats. The distinction is universal between language semantics and external representations.

This is why people that memcpy structs right into the buf get such derision, even if it’s faster and written for a mono-Implementation of a language semantics. It is sloppy thought made manifest.


I just use ntohl/htonl like a civilized person.

(Yes, the article mentions those, but they've been standard for decades).


What's the best practice for 64-bit values these days? Are htonll/ntohll widely available yet?


It’s not every day you can write a blog post that calls out Rob Pike… ;)


Author here. I'm improving upon Rob Pike's outstanding work. Standing on the shoulders of a giant.


Totally agree. My comment was made in jest. Mad kudos to you as you clearly possess talent and humility that’s in short supply today.


Of course, the canonical work on this subject is Danny Cohen's On Holy Wars And A Plea For Peace [0]. It's an informative and highly readable article. My favorite quote, from the conclusion, is:

    The  "Be reasonable, do it my way" approach does not work.  Neither does the Esperanto approach of "let's all switch to yet a new language".
His bottom-line conclusion being:

    It is more important to  agree  upon an order than which order is agreed upon.
[0] https://www.rfc-editor.org/ien/ien137.txt


In her first sentence, the phrase “the C / C++ programming language” is no longer correct: C++20 requires two’s complement signed integers.

C++20 is quite new, so I would assume that very few people know this yet.

C and C++ obviously differ a lot, but by that phrase she clearly means "the part where the two languages overlap". The C++ committee has been willing to break C compatibility in a few ways (not every valid C program is a valid C++ program), and this has been true for a while.


"the c/c++ language" exists insofar as you can import this c code into your c++, and this is something that c++ programmers need to know how to do, so they'd better learn enough of the differences between c and c++ or they'll be stumped when they crack open somebody else's old code.


It hasn't been true since C99, at least -- C++ didn't adopt C99 designated initializers.


What chips can be targeted by C compilers today that don't use 2's complement?


I haven’t seen a one’s complement machine in decades, but at the time C was standardized there were still quite a few (afaik none had a single-chip CPU, to get to your question). But since they existed, the language definition didn’t require two’s complement and some optimizations were technically UB.

The C++ committee decided that everyone had figured this out by now and so made this breaking change.


FWIW there is a <sys/endian.h> on various BSDs that contains "beXXtoh", "leXXtoh", "htobeXX", "htoleXX" where XX is a number of bits (16, 32, 64).

That header is also available on Linux, but glibc (and compatible libraries) named it <endian.h> instead.

See: man 3 endian (https://linux.die.net/man/3/endian)

Of course it gets a bit hairier if the code is also supposed to run on other systems.

MacOS has OSSwapHostToLittleIntXX, OSSwapLittleToHostIntXX, OSSwapHostToBigIntXX and OSSwapBigToHostIntXX in <libkern/OSByteOrder.h>.

I'm not sure if Windows has something similar, or if it even supports running on big endian machines (if you know, please tell).

My solution for achieving some portability currently entails cobbling together a "compat.h" header that defines macros for the macOS functions and includes the right headers. Something like this:

https://github.com/AgentD/squashfs-tools-ng/blob/master/incl...

This is usually my go-to-solution for working with low level on-disk or on-the-wire binary data structures that demand a specific endianness. In C I use "load/store" style functions that memcpy the data from a buffer into a struct instance and do the endian swapping (or reverse for the store). The copying is also necessary because the struct in the buffer may not have proper alignment.
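
A minimal sketch of that load-style pattern (the names and the record layout are made up, and it assumes the struct has no padding so it matches the on-disk layout):

    #include <endian.h>   /* le16toh/le32toh; <sys/endian.h> on the BSDs */
    #include <stdint.h>
    #include <string.h>

    /* toy on-disk record, stored little-endian */
    struct record {
        uint32_t size;
        uint16_t flags;
        uint16_t type;
    };

    /* "load": memcpy out of the (possibly unaligned) buffer,
       then fix the byte order field by field */
    static void record_load(struct record *out, const uint8_t *buf)
    {
        memcpy(out, buf, sizeof(*out));
        out->size  = le32toh(out->size);
        out->flags = le16toh(out->flags);
        out->type  = le16toh(out->type);
    }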

Technically, the giant macro of doom in the article takes care of all of this as well. But unlike the article, I would very much not recommend hacking up your own stuff when there are system libraries readily available that do the same thing efficiently.

In C++ code, all of this can of course be neatly stowed away in a special class with overloaded operators that transparently takes care of everything and "decays" into a single integer (and exactly the above code) after compilation, but which is IMO somewhat cleaner to read and adds much-needed type safety.


Indeed, I don't get the article. It's like writing "C is hard because here is how hard it is to implement memcpy using SIMD correctly."

Please don't do that. Use battle-tested low-level routines. Unless your USP is "our software swaps bytes faster than the competition", you should not spend brain power on that.


Windows/MSVC has _byteswap_ushort(), _byteswap_ulong(), _byteswap_uint64() (note that unsigned long is 32 bits on Windows). It's ugly, but it works.

Boost provides boost::endian, which allows converting between native and big or little endian; it just does the right thing on all architectures and compilers and compiles down to a no-op or a bswap instruction. It's much better than writing (and testing!) your own giant pile of macros and ifdefs to detect the compiler/architecture/OS, include the correct headers, and perform the correct conversions in the correct places.


At least historically, Windows has had big-endian versions, as both SPARC and Itanium use big endian.


Itanium can be configured to run in either endianness (it's "bi-endian"). Windows on Itanium always ran in little-endian mode and did not support big-endian mode. The same was true of PowerPC. Windows never ran in big-endian mode on any architecture.


Or just cast the pointer to uint##_t and use be##toh and htobe## from <endian.h>? I think this is making a mountain out of a molehill. I've spent tons of time doing wire (de)serialization in C for network protocols, and endian swaps are far from the most pressing issue I see. The big problem IMO is the unsafe practices around buffer handling allowing buffer overruns.


Historical and obscure machines aside, there are a few things modern C++ code should take for granted, because even new systems will probably not bother breaking them: text is encoded in UTF-8, negative integers are two's complement, float is 32-bit IEEE 754 and double is 64-bit IEEE 754 (long double still varies), char is 8 bits, short is 16 bits, int is 32 bits, long long is 64 bits.
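
If you do build on those assumptions, it costs nothing to make the build fail loudly on a machine where they don't hold. A sketch with C11 _Static_assert (static_assert in C++); note it only checks sizes and integer representation, since the UTF-8 and "really IEEE 754" parts can't usefully be asserted at compile time.

    #include <limits.h>

    _Static_assert(CHAR_BIT == 8, "8-bit bytes assumed");
    _Static_assert(sizeof(short) == 2, "16-bit short assumed");
    _Static_assert(sizeof(int) == 4, "32-bit int assumed");
    _Static_assert(sizeof(long long) == 8, "64-bit long long assumed");
    _Static_assert((-1 & 3) == 3, "two's complement assumed");
    /* size checks only; they do not prove the IEEE 754 format */
    _Static_assert(sizeof(float) == 4 && sizeof(double) == 8,
                   "IEEE-754-sized float/double assumed");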


Here's how I implement little endian parsing:

  static uint32_t load32_le(const uint8_t s[4])
  {
      return (uint32_t)s[0]
          | ((uint32_t)s[1] <<  8)
          | ((uint32_t)s[2] << 16)
          | ((uint32_t)s[3] << 24);
  }
I start with unsigned char to begin with (well `uint8_t` to be precise, which has the advantage of not compiling at all if you happen to use a DSP that uses 32-bit chars). Then I convert those chars to unsigned 32-bit integers. Only then do I shift them. There is no need to mask anything here.

Note that modern compilers translate this whole thing into a single unaligned load operation. Even better, I've noticed that using a macro instead of a function tends to make performance worse with modern compilers.
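
For symmetry, the store direction under the same assumptions: the conversion to uint8_t truncates, so again no masks are needed, and modern compilers typically collapse it into a single plain store on little-endian targets.

    #include <stdint.h>

    static void store32_le(uint8_t s[4], uint32_t v)
    {
        s[0] = (uint8_t)(v >>  0);
        s[1] = (uint8_t)(v >>  8);
        s[2] = (uint8_t)(v >> 16);
        s[3] = (uint8_t)(v >> 24);
    }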


Why mask and then shift instead of casting to the correct type and then shifting, like this:

    (uint32_t)x[0] << 24 | ...
Of course, this requires that x[0] be unsigned.


If this is for deserialisation then it's okay for x[0] to be signed. You just need to recast the result as int32_t (or simply assign to an int32_t variable without any cast) and it is not UB.
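
If I'm reading this right, the idea is to do the shifting in unsigned and only convert to signed at the very end. A sketch (load32_be_signed is a made-up name):

    #include <stdint.h>

    static int32_t load32_be_signed(const uint8_t x[4])
    {
        uint32_t u = ((uint32_t)x[0] << 24)
                   | ((uint32_t)x[1] << 16)
                   | ((uint32_t)x[2] <<  8)
                   |  (uint32_t)x[3];
        /* Converting a value that doesn't fit into int32_t is
           implementation-defined in C (not undefined behaviour),
           and wraps as two's complement on mainstream compilers. */
        return (int32_t)u;
    }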


I agree that in an ideal world we should just write load code using byte loads and shifts. But in the world we live in, compilers only got the ability to recognize that and emit a bswap instead in relatively recent [0] versions (compared to the age of C). And the recognition can still depend on the exact pattern used. Also, debug builds will still emit the whole shift mess, which in some cases can be annoying.

[0] https://godbolt.org/z/jMbqT86jo


Isn't the 'modern' solution to memcpy into a temp and swap the bytes in that? C++ has added/will add std::launder and std::bless to deal with this issue


> Isn't the 'modern' solution to memcpy into a temp and swap the bytes in that?

Or just use the endian.h / sys/endian.h routines, which do the right thing (be32dec / be32enc / whatever). memcpy+swap is fine, and easier to get right than the author's giant expressions, but you might as well use the named routines that do exactly what you want already.


>C++ has added/will add std::launder and std::bless to deal with this issue

You're thinking of std::bit_cast. std::launder solves a different, much more obscure problem: https://miyuki.github.io/2016/10/21/std-launder.html


oops, my mistake


No, it is to read a byte at a time and turn it into the semantic value for the data structure you are filling in. Like read 128 and then 1 and set the variable to 32769. If you are the author of protobufs then you may run profiling and write the best assembly, etc., but otherwise no, don’t do it.
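
For the 128-then-1 example, that byte-at-a-time decoding might look like the toy sketch below (read_u16_be is a made-up name):

    #include <stdint.h>

    /* most significant byte first: 128 then 1 gives 128*256 + 1 = 32769 */
    static uint16_t read_u16_be(const uint8_t *buf)
    {
        return (uint16_t)(((uint16_t)buf[0] << 8) | buf[1]);
    }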


That huge macro appears to be wrong: there are little-endian PowerPC systems on which the __ppc__ and __powerpc__ macros are also defined, making the outcome of the detection invalid.
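
On GCC and Clang you can sidestep the architecture guessing entirely by asking the compiler which byte order it targets. These predefined macros aren't universal (MSVC doesn't provide them), hence the #error fallback; MY_LITTLE_ENDIAN is a made-up name.

    /* GCC/Clang predefine __BYTE_ORDER__ and the __ORDER_*_ENDIAN__ values */
    #if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
    #  define MY_LITTLE_ENDIAN 1
    #elif defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
    #  define MY_LITTLE_ENDIAN 0
    #else
    #  error "add an endianness fallback for this compiler"
    #endif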


This is valid code in C++20:

    if constexpr (std::endian::native == std::endian::big) {
        std::cout << "big-endian" << '\n';
    }
    else if constexpr (std::endian::native == std::endian::little) {
        std::cout << "little-endian"  << '\n';
    }
    else {
        std::cout << "mixed-endian"  << '\n';
    }
Doesn't solve everything, but it's saner even if what you're writing is C-style low-level code.


It wasn't clear to me but what was the undefined behaviour in the naive approach?


Violation of the effective typing rules ('strict aliasing') and a potential violation of alignment requirements of your platform.



I wonder if those macros work with middle-endian systems.


Is this a joke or am I just unaware of any systems out there that are "middle-endian"..?!


Sadly not a joke, but thankfully quite obscure: https://en.wikipedia.org/wiki/Endianness#Middle-endian


There are no current middle-endian systems but they used to exist. The PDP-11 is the most famous one. The macros would work on all systems, but as only very old systems are middle-endian, they also have old compilers so may not be able to optimise it as well.


No, but the hton*(3)/ntoh*(3) functions from <arpa/inet.h> do.


In an ideal world which endian format would one go for?


I for one would go for big-endian, simply because reading memory dumps and byte blocks in assembly or elsewhere then works without mental byte-swapping arithmetic for multi-byte entities.

Just out of curiosity, I would be interested in learning why so many CPUs today are little-endian. Is it because it is cheaper / more efficient for processor implementations or is it because “the others do it, so we do it the same way”?


https://stackoverflow.com/questions/5185551/why-is-x86-littl...

It simplifies certain instructions internally. Practically everything is little endian because x86 won.

> And if you think about a serial machine, you have to process all the addresses and data one-bit at a time, and the rational way to do that is: low-bit to high-bit because that’s the way that carry would propagate. So it means that [in] the jump instruction itself, the way the 14-bit address would be put in a serial machine is bit-backwards, as you look at it, because that’s the way you’d want to process it. Well, we were gonna built a byte-parallel machine, not bit-serial and our compromise (in the spirit of the customer and just for him), we put the bytes in backwards. We put the low- byte [first] and then the high-byte. This has since been dubbed “Little Endian” format and it’s sort of contrary to what you’d think would be natural. Well, we did it for Datapoint. As you’ll see, they never did use the [8008] chip and so it was in some sense “a mistake”, but that [Little Endian format] has lived on to the 8080 and 8086 and [is] one of the marks of this family.


And does middle endian even exist?


US date format: 12/31/2021


I suspect it does somewhere: a system that had words as the main addressable unit but also allowed byte addressing could have little-endian double words but big-endian ordering of the bytes inside each word.


Not currently AFAIK, but apparently the PDP-11 had a middle-endian arch. See other comments in this thread.


My brain is trained to read little-endian in memory dumps. It's no different than the German "fünf-und-zwanzig" (five and twenty). :))



Why would one choose the memory representation of the number based on the advantages of the internal ALU wiring?

Of all those reasons, the only one I can make sense of is the "I can’t transparently widen fields after the fact!", and that one is way too niche to explain anything.


I don’t understand? Why not make the memory representation sympathetic with the operations you’re going to do on it? It’s the raison d’être of computers to compute and to do it fast.

Another example: memory representation of pixels in GPUs which are swizzled to make computations efficient


> I don’t understand? Why not make the memory representation sympathetic with the operations you’re going to do on it?

There's no reason to, as there's no reason not to. It's basically irrelevant.

If carry propagation is so important, why can't you just mirror your transistors and operate on the same wires, but in the opposite order? Well, you can, and it's trivial. (And, by the way, carry propagation isn't that important. High-performance ALUs propagate carries only through blocks, which can appear anywhere. And the wiring of those isn't even planar, so how you arrange them isn't a showstopper.)


Little endian. There is no extant big-endian CPU that matters.


I did say in an ideal world.


Hint: The reason why it's called "endianness" comes from the novel Gulliver's Travels, in which the neighboring nations of Lilliput and Blefuscu went to bitter, bloody war over which end to break your eggs from: the big end or the little end. The warring factions were also known as Big-Endians and Little-Endians, and each thought themselves superior to the dirty heathens on the other side. If one side were objectively correct, if there were an inherent advantage to breaking your egg from one side or the other, would there be a war at all?


> if there were an inherent advantage to breaking your egg from one side or the other, would there be a war at all?

Fascism vs. not-fascism, Stalinist Communism vs. Western Capitalism, Islamism vs. liberal democracy... I’m not sure “the existence of war around a divide in ideas proves that neither sides ideas are correct” is a particularly comfortable maxim to consider the ramifications of.


Two similar societies warring over a trivial idea probably means neither is right. Swift's Big Endians and Little Endians are a satire of the Catholic-Anglican schism in England.


> Two similar societies warring over a trivial idea probably means neither is right.

Well, sure, that it’s a trivial idea pretty much inherently means either that neither is right or (and this is very much not an exclusive or) being right doesn’t matter.

The problem with real cases is that people inside the conflict don’t believe the idea is trivial (conversely, to people outside the conflict, or caught in the middle, even the conflicts we think of as being about foundational ideas seem like trivial or irrelevant differences).


Case in point: Israel and Palestine :-(


https://twitter.com/m13253/status/1371615680068526081

Would it hurt anyone to define this undefined behavior and do exactly what the source code says?


Not sure what you think the source code "says". I mean, I know what you want it to mean, but just because integer wrapping is intuitive to you doesn't imply that that is what the code means. C++ abstract machine and all.

But to answer the actual question: For C++20, integer types were revisited. It is now (finally) guaranteed that signed integers are two's complement, along with a list of other changes. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p090... also for how the committee voted on the individual issues.

Note in particular:

> The main change between [P0907r0] and the subsequent revision is to maintain undefined behavior when signed integer overflow occurs, instead of defining wrapping behavior. This direction was motivated by:

> - Performance concerns, whereby defining the behavior prevents optimizers from assuming that overflow never occurs;

> - Implementation leeway for tools such as sanitizers;

> - Data from Google suggesting that over 90% of all overflow is a bug, and defining wrapping behavior would not have solved the bug.

So yes, the committee very recently revisited this specific issue, and re-affirmed that signed integer overflow should be UB.
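
Not from the paper, but the stock illustration of the performance argument: because signed overflow is UB, the compiler may assume it never happens and simplify accordingly.

    /* The compiler is allowed to fold this whole function to "return 1",
       since it may assume x + 1 never overflows. */
    int plus_one_is_bigger(int x)
    {
        return x + 1 > x;
    }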


I hadn't noticed the signed integer overflow, which does indeed complicate things; I thought it was just the infinite-loop UB.

> Data from Google suggesting that over 90% of all overflow is a bug, and defining wrapping behavior would not have solved the bug.

Of all overflow? Including unsigned integers where the behavior is defined?


That 90% of all overflows are bugs doesn't surprise me at all, even if you include unsigned integers.


I've never been very satisfied with these approaches for C where you hope the compiler does the right thing. It makes sense to provide some C implementation for portability's sake, but any sizeable reordering cries out for a hand-tuned, processor-specific approach (and the non-sizeable case probably doesn't require high speed). I would expect any SIMD instruction set to include a shuffle.


It can also be a good idea to swap recursively. First swap the upper and lower halves, then swap the upper and lower quarters (bytes, for a 32-bit value), which can be done with only 2 masks. Then, if it's a 64-bit value, swap alternate bytes, again with only 2 masks. This can be extended all the way to a full bit reverse in 3 more lines, each with 2 masks and shifts.
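
A sketch of that for 32 bits (my own naming): the halves first as a rotate, then the bytes within each half with two masks.

    #include <stdint.h>

    static uint32_t bswap32_recursive(uint32_t x)
    {
        x = (x << 16) | (x >> 16);                                /* swap 16-bit halves */
        x = ((x & 0x00ff00ffu) << 8) | ((x & 0xff00ff00u) >> 8);  /* swap bytes in each half */
        /* continuing with masks 0x0f0f0f0f, 0x33333333, 0x55555555
           (shifts 4, 2, 1) would give a full 32-bit bit reverse */
        return x;
    }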



