
The byte order fallacy - enneff
http://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html
======
astrange
His point is that instead of byte swapping input, we should always use single-
byte load operations because "it works for him."

But Plan9 is not a system known for its graphics, and I think performance
would seriously suffer if everyone had to program like that. Being able to
load a pixel as an int is the reason 32-bit RGB is used more often as a pixel
format than 24-bit.

Of course, it might not matter as much these days; GCC and LLVM can optimize
his code sequences into bswap instructions automatically. And SIMD/shader code
doesn't have any endian portability problems that I know of, if only because
SIMD is already not portable.
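
For illustration, here's roughly the byte-load sequence in question (a quick
sketch, not from the article; recent GCC and Clang will typically collapse it
into a single 32-bit load plus a bswap, or a movbe, at -O2):

      #include <stdint.h>

      /* Decode a big-endian 32-bit integer from a byte buffer.  The
       * shift/or idiom is endian-neutral; the compiler may turn it into
       * one load and one byte-swap instruction on x86. */
      static inline uint32_t load_be32(const unsigned char *p)
      {
          return ((uint32_t)p[0] << 24) |
                 ((uint32_t)p[1] << 16) |
                 ((uint32_t)p[2] <<  8) |
                 ((uint32_t)p[3] <<  0);
      }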

~~~
perfunctory
> I think performance would seriously suffer

Evidence please.

~~~
xpaulbettsx
On modern CPUs, a byte load/store is really an integer (i.e. 32-bit/64-bit
depending on arch) load/store that is rigged to only affect the target byte.
On IA64 and PPC, it would just SIGBUS out (as it probably should on x86/amd64
too, but they kept it for compat reasons).

~~~
astrange
Desktop PPC CPUs (when there were such things) allowed misaligned memory
operations with some performance penalty.[1]

x86 practically offers it for free in newer architectures (Sandy Bridge, Ivy
Bridge and Bulldozer).[2]

[1] <https://developer.apple.com/hardwaredrivers/ve/g5.html>

[2] <http://agner.org/optimize/instruction_tables.pdf> (check MOVDQU timings)

~~~
chmike
AFAIK ARM processors don't support misaligned word access. AFAIK a misaligned
word access is twice as slow as an aligned one (it requires 2 reads). So I
don't understand "offers it for free". But this is still twice as fast as the
example code. Note that endianness and word alignment are two distinct
problems.

The point made by the author addresses this issue from a different angle.

As the author says, programmers should always write endianness-neutral code
unless it is impossible, which is generally at the interfaces, where data is
read and written (I/O) by the program. If the code is correctly and
intelligently optimized so that marshaling is done once, then byte swapping
may generally be expected to be a low-frequency operation. In this case the
simplest and most portable code should be favored.

Trying to optimize this operation with a word read and byte swapping provides
an insignificant gain at a higher cost in code portability and
maintainability. The author is right on this.

Though it is also true that in some cases the operation frequency is very
high (e.g. reading a million pixel values of an image). For these use cases,
the programming overhead of using highly optimized code is perfectly
justified. But then don't use half-baked optimizations. Try to align data on
words (twice as fast), read by word (four times as fast) and use the
byte-swapping machine instruction available on the target CPU instead of the
proposed shifts and bit masks.

My opinion is that good languages should provide optimized data marshaling
functions in their library so that the code can be optimal and portable at the
same time.
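
For the fast path, such a library helper might look roughly like this (a
sketch only; it assumes GCC/Clang, whose __builtin_bswap32 maps onto the
CPU's byte-swap instruction, and an aligned little-endian input word):

      #include <stdint.h>

      /* Read one little-endian 32-bit value with a single word load (a
       * sketch).  On a little-endian host the swap compiles to nothing;
       * on a big-endian host it becomes one byte-swap instruction. */
      static inline uint32_t read_le32_word(const uint32_t *p)
      {
          uint32_t v = *p;
      #if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
          v = __builtin_bswap32(v);
      #endif
          return v;
      }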

~~~
mansr
ARM has supported unaligned memory accesses since ARMv6. In most modern
implementations, unaligned accesses falling entirely within a 16-byte aligned
block have no penalty at all, while crossing 16-byte boundaries does impose a
cost. If the locations of unaligned accesses are randomly distributed, this
cost is still cheaper on average than accessing a byte at a time.

------
mrmekon
I agree with his suggestion that most code manipulating byte order is
incorrect or unneeded. Within your application's logic, everything should
already be uniform.

I agree slightly less with "computer's byte order doesn't matter" as
differentiated from peripheral and data stream byte order... that's the same
friggin thing. It matters that you treat your inputs and outputs correctly,
and how you do that depends on your computer's byte order, so the computer's
byte order _does_ matter. Just not so much during the data processing stage.

But mostly, I'm just saddened that every post about C now has an "only people
who do [X thing that requires C] do that, and you're probably not one of them,
so you shouldn't do that!" Maybe there's just a huge disconnect between people-
who-blog and people-who-write-low-level-code, but most of the software guys I
know have worked professionally on microcontrollers, DSPs, operating systems,
or compilers within the last 5 years, and I'm working on a compiler for a DSP
right now (and I expect byte-order to matter).

~~~
oh_sigh
No offense brother, but Rob Pike shits all over you and any of your "software
guys". The man is a living legend. This is not to say that he can't be wrong,
but to call him just a "person who blogs" only shows how little you know.

~~~
varikin
Or it just shows that not everyone looks at the about me on every blog to see
that Rob is Rob Pike.

~~~
oh_sigh
How does that make it any better? If the name on the blog is what makes you
think a post is shit or gold, then you are probably not a very good critical
thinker.

------
adrianmsmith
I wish there were, in C, some equivalent of "struct" but where you could
specify

- The byte order / endianness

- The alignment of variables

- The exact width of variables (32 bits, 64 bits)

"struct" is great for what it was designed for, storing your internal data
structures in a way efficient for the machine.

But everyone abuses structs and tries to read external data sources, e.g.
files, using them. They might hack it to work on their own machine, then as
soon as a machine of the other endianness comes along, hacks and #ifdefs
appear, then machines with ints of different widths come along...

Of course these people are using structs "wrong", like the author of the
article suggests. But nevertheless, the fact that people are using structs
"wrong" suggests there is a need for something that provides what people are
trying to use structs for.

~~~
ge0rg
Yeah, C is really missing a way to serialize/deserialize data from raw
memory/sockets into structs usable by your code. The least insane way,
libpack [1], requires replicating the data format definition three times:

* define the struct with all elements

* define a string for the binary representation

* call fpack/funpack with the string and all the struct elements as parameters...

Unfortunately, fixing this either requires some kind of black X-macro [2]
magic or another template language used to write the specification and to
generate the three above-mentioned representations from it...

[1] <http://www.leonerd.org.uk/code/libpack/intro.html>

[2] <http://drdobbs.com/184401387>

~~~
masklinn
> or another template language

Surely this could be handled via simple syntactic extensions to the struct
specification (with everything wrapped into an ungodly macro from hell) in
order to define the mapping between the struct itself and libpack's format
string, no?

~~~
ge0rg
The problem is that you need to replicate the struct entries in the
pack/unpack calls as well, which is only possible in plain C by using
X-Macros.

It might be possible to construct a macro that creates both the struct and the
format string, though.
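
Something along these lines, perhaps (a hedged X-macro sketch; the field
list, struct name and packing logic are all made up for illustration, and
none of it is libpack's API):

      #include <stddef.h>
      #include <stdint.h>

      /* One field list, expanded twice: once into the struct definition,
       * once into per-field little-endian serialization (C99). */
      #define HEADER_FIELDS(X) \
          X(uint32_t, magic)   \
          X(uint16_t, version) \
          X(uint16_t, flags)

      #define AS_MEMBER(type, name) type name;
      struct header { HEADER_FIELDS(AS_MEMBER) };

      #define AS_PACK(type, name)                              \
          for (size_t i = 0; i < sizeof(h->name); i++)         \
              *out++ = (unsigned char)((uint64_t)h->name >> (8 * i));

      static unsigned char *header_pack(const struct header *h, unsigned char *out)
      {
          HEADER_FIELDS(AS_PACK)
          return out;
      }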

~~~
masklinn
> The problem is that you need to replicate the struct entries in the
> pack/unpack calls as well

Don't you only need the (generated) format string? Ideally, the macro could
generate some wrapper function of some sort as well, which would unpack, fill
and return an instance of the struct.

------
dlsym
The author claims that byte swapping code - _"depends on integers being 32
bits long, or requires more #ifdefs to pick a 32-bit integer type."_

True. But you might consider using inttypes.h (or stdint.h), which defines
some pretty useful things like uint32_t (an unsigned 32-bit-wide integer, for
example).

\- _"may be a little faster on little-endian machines, but not much, and it's
slower on big-endian machines."_

In fact, swapping the byte order is _one_ CPU instruction. You can, for
example, use some inline assembly to optimize your code (if your compiler
fails to recognize this pattern):

    
    
         uint32_t byte_swap( uint32_t x )
         {
             asm( "bswap %0"
                : "=g"(x)
                : "0"(x)
             );
    
             return x;
         }
    

Just my two cents...

~~~
masklinn
> In fact swapping the byte order is _one_ CPU instruction.

That's one _machine_ instruction; I'm pretty sure it's more than one microcode
instruction ;)

~~~
dfox
Swapping bits around is an operation that is essentially free in hardware.
It's just wires.

~~~
ableal
Most hardware is just wires. Especially since transistors shrunk down to
nearly nothing.

Still, the layout of something like a barrel shifter (e.g.
[http://www.erc.msstate.edu/mpl/distributions/scmos/images/bs...](http://www.erc.msstate.edu/mpl/distributions/scmos/images/bshift.gif)
, from a casual search) takes its space on die, much like an adder or
multiplier. It's all wires _and switches_.

------
bluesmoon
I noticed ifdefs like this when I inherited some C code back in 2001. I'd
always worked on x86 systems, so never really encountered machines with
different byte orders. The code was fugly, and I didn't like it, so I studied
it some and it hit me that it didn't matter what the byte order was. If I
constructed a 32 bit int and assigned it to a 32 bit int, the compiler would
take care of the byte order. All I needed to know was the byte order of the
network protocol we were using.

Tested new code on my x86 box and it worked. Then just committed to
sourceforge CVS and told the rest of the world to test. It worked. My code
looked a lot like Rob's.

------
yason
The smallest questions always cause the most heated debate.

It doesn't really matter much how the possible byte order swap is done: what
matters is that these ifdefs aren't littered around the code and that
byte-order swapping is limited to the lowest level where data is actually read
from an external source.

I would personally go with his byte-array reads as they're less confusing,
but I would still wrap the functionality inside inlined functions like these:

    
    
      inline uint32_t read_be32 (const void *);
      inline uint32_t read_le32 (const void *);
    

And then use these whenever reading 32-bit integers from a big-endian or
little-endian data source.
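
To make the intent concrete, here's a hedged sketch of how such helpers end
up being used; the record layout, field names and parse function are invented
for illustration:

      #include <stdint.h>

      /* Hypothetical on-the-wire record: 4-byte LE length, 4-byte BE id. */
      struct record {
          uint32_t length;
          uint32_t id;
      };

      static void parse_record(const unsigned char *buf, struct record *r)
      {
          r->length = read_le32(buf);      /* field stored little-endian */
          r->id     = read_be32(buf + 4);  /* field stored big-endian */
      }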

~~~
alexchamberlain
I've tried to tackle this issue at
<https://github.com/alexchamberlain/byte-order>.

~~~
dchest
You did this to prove Pike's argument, didn't you?

 _Whenever I see code that asks what the native byte order is, the odds are
about a hundred to one the code is either wrong or misguided._

[https://github.com/alexchamberlain/byte-order/commit/b804361...](https://github.com/alexchamberlain/byte-order/commit/b804361636f9233a6c9b0b38b04ceef98f3e8faa#L0L64)

~~~
alexchamberlain
Well caught!

------
tytso
Rob Pike was specifically talking about binary streams. There are many cases
where you can make simplifying assumptions in the name of speed; this is quite
common in the Linux Kernel, where (a) lots of code uses it, so optimizing for
every last bit of CPU efficiency is important, and (b) we need to know a lot
about the CPU architecture anyway, so it's anything _but_ portable code. (Of
course, we do create abstractions to hide this away from the programmer ---
i.e., macros such as le32_to_cpu(x) and cpu_to_le32(x), and we mark structure
members with __le32 instead of __u32 where it matters, so we can use static
code analysis techniques to catch where we might have missed a le32_to_cpu
conversion or vice versa.)
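
For readers who haven't seen the kernel convention, here is a rough userspace
sketch of the pattern being described; the real __le32/le32_to_cpu definitions
live in the kernel's byteorder headers, and the struct here is hypothetical:

      #include <stdint.h>

      typedef uint32_t __le32;   /* marker: "this field is stored little-endian" */

      struct example_super {     /* hypothetical on-disk structure */
          __le32 s_magic;
          __le32 s_block_count;
      };

      static inline uint32_t le32_to_cpu(__le32 v)
      {
      #if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
          return __builtin_bswap32(v);   /* swap only on big-endian hosts */
      #else
          return v;
      #endif
      }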

What are some of the assumptions which Linux makes? For one, that there is a
32-bit type available to the compiler. For just about all modern CPU
architectures where you might want to run Linux, this is true. This means that
we can define a typedef for __u32, and it means that we can declare C
structures whose layout represents the on-the-wire or on-the-disk format
without needing to pull the bytes off the wire one at a time to decode the
stream. It also means that the on-the-wire or on-disk structures can be
designed so that integers are well aligned, and so on all modern architectures
we don't have to worry about unaligned 32-bit or 64-bit accesses.

And it's not just Linux which does this. The TCP/IP headers are designed the
same way, and I guarantee you that networking code that might need to work at
10 Gbps isn't pulling off the IP headers one byte at a time and shifting them
8 bits at a time to decode the IP header fields. No, it's dropping the
incoming packet into an aligned buffer, and then accessing the structures
directly using primitives such as htonl(). (It also means that at least for
the foreseeable future, CPU architectures will be influenced by the
implementation and design choices of such minor technologies as TCP/IP and the
Linux kernel, so it's a fair bet that no matter what, there will always be a
native 32-bit type for which 4-byte-aligned access will be fast.)

The original TCP/IP designers and implementors knew what they were doing, and
having worked with some of them, I have at least as much respect for them, if
not more, than for Rob Pike...

------
ableal
It's by Rob Pike. Not a wise choice of target to nitpick. When he says "I
guarantee that [...]", I'm inclined to take his word for it.

Nice piece, clearing up a cobweb in a poorly lighted corner. And teaches (with
code example) what one really needs to know about handling byte order in data
streams.

------
premchai21
I wish I had time to write a more thorough response right now, but I just did
a short test with Debian sid and its GCC 4.6.3 on a modern Xeon machine under
Xen (not the best performance-testing environment, so take this with some
salt).

At -O9, the compiler optimizes a masks-and-shifts swap of a uint64_t into a
bswapq instruction identical to the one emitted by the GCC-specific
__builtin_bswap64; this can be coupled with an initial memcpy into a temporary
uint64_t. Loading individual bytes and shifting them in emits a pile of
instructions that takes up 16 times as much code space and carries a ~35%
runtime penalty (2.7 s versus 2 s). This is measured in a loop decoding a
big-endian integer into a native uint64_t and writing it to a volatile extern
uint64_t global, 2^30 iterations, function called through a function pointer.

Aligned versus unaligned pointers seem to make no real difference on this CPU,
using a static __attribute__((aligned(8))) uint8_t[16] and offsets of 0
(aligned) and 5 (unaligned) from the start of the array.

I also tried a function with the explicit cast-shift-or that uses an initial
memcpy into a local uint8_t[8] in case the compiler was doing something
strange with regard to memory read fault ordering as compared to the explicit
memcpy in the two bswapq-generating versions. This resulted in some very
"interesting" code that shoves the local array into a register and then very
roughly masks and shifts all the bits around, at about a 100% penalty
relative to the bswapq functions. :-(

If anyone's interested in the details, reply and I'll try to put them
somewhere accessible, though it may take a little while.

~~~
figglesonrails
This isn't surprising. If you set the AC bit on x86, then it will disallow
unaligned accesses and you'll be operating in an environment more similar to
RISC machines. In order to allow such a thing to succeed, GCC can't produce a
32-bit read from a char* address, since the alignment is only guaranteed to be
1 (i.e. no alignment) and this would trigger SIGBUS. Thus, in order to get a
32-bit read, you must deref a 32-bit variable, not 4x 8-bit ones. This makes
even more sense on RISC systems where this "optimization" would be a tragic
bug you'd want to work around in your compiler. See my post with the x86
assembly output confirming your general results.

------
huhtenberg
Rant-y. I don't like that.

> _The byte order of the computer doesn't matter much at all except to
> compiler writers and the like_

Binary protocol parsing is one area that relies heavily on byte ordering,
_struct_ packing and alignment. Tangentially, binary _file_ parsing that is
optimized for speed will have the same dependency. In fact, anything that
deals with fast processing of the off-the-wire data will want to know about
the byte order.

------
kstenerud
Actually it's kind of funny... I recently wrote a base64 encoder/decoder that
makes use of native endian order to build 16-bit unsigned int based lookup
tables that map to the correct byte sequence in memory regardless of native
endianness.

Looking up a 16-bit int rather than 2 chars, and outputting a 32-bit int
rather than 4 chars, yields a nice performance boost at the cost of possibly
not being portable to some more esoteric architectures that don't have 16- and
32-bit unsigned int types.

So while he's right that 99% of the time you shouldn't be fiddling with byte
order, it still pays to know how to wield such a tool, and it's most
definitely not just for compiler writers.
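
A hedged sketch of the sort of table construction described above (the table
size, alphabet and function names here are guesses for illustration, not the
actual code):

      #include <stdint.h>

      /* Each 12-bit input chunk maps to two output characters.  The table
       * entries are written byte-by-byte at init time, so every uint16_t
       * already has those two characters in the correct in-memory order,
       * whatever the host endianness. */
      static const char b64_alphabet[] =
          "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
      static uint16_t enc_table[4096];

      static void init_enc_table(void)
      {
          for (int i = 0; i < 4096; i++) {
              unsigned char *p = (unsigned char *)&enc_table[i];
              p[0] = (unsigned char)b64_alphabet[(i >> 6) & 0x3F];
              p[1] = (unsigned char)b64_alphabet[i & 0x3F];
          }
      }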

------
GoSailTheC
CPU byte order definitely matters to device drivers reading/writing across the
I/O bus: they must perform wide aligned reads and writes using single CPU
loads and stores. Rob's approach simply won't work there. Similarly, OS-bypass
networking and video, which expose hardware device interfaces in user space,
require CPU-endian aware libraries.

That said, use Rob's portable approach anytime you don't have a compelling
reason not to, if only to not have to worry about alignment and portability.
Doing otherwise is premature optimization and a maintenance headache.

------
TwoBit
There's one and only one reason we write code the way he says not to:
performance. Working with words instead of byte munging makes a huge
performance difference. And in game development performance beats most other
reasoning, especially when we are talking about loading tens of thousands of
these on startup. And besides, all our code is wrapped in calls to inline
functions named uint32_t FromBigEndian(...) anyway, so it's actually cleaner
than what he proposes.

------
sparkie
One fallacy is that you ever need to manually convert byte order yourself in
the way the article suggests. Most systems have something that'll do it for
you, e.g. htonl and ntohl.

~~~
ArchD
I don't know why you got downvoted. Your idea is valid, and people may not
have realized that htonl has friends and relatives that totally make the issue
of the article moot.

    
    
           #define _BSD_SOURCE             /* See feature_test_macros(7) */
           #include <endian.h>
    
           uint16_t htobe16(uint16_t host_16bits);
           uint16_t htole16(uint16_t host_16bits);
           uint16_t be16toh(uint16_t big_endian_16bits);
           uint16_t le16toh(uint16_t little_endian_16bits);
    
           uint32_t htobe32(uint32_t host_32bits);
           uint32_t htole32(uint32_t host_32bits);
           uint32_t be32toh(uint32_t big_endian_32bits);
           uint32_t le32toh(uint32_t little_endian_32bits);
    
           uint64_t htobe64(uint64_t host_64bits);
           uint64_t htole64(uint64_t host_64bits);
           uint64_t be64toh(uint64_t big_endian_64bits);
           uint64_t le64toh(uint64_t little_endian_64bits);

~~~
alexchamberlain
Where are these defined?

~~~
gkelly
I found them, on ubuntu, in:

    
    
      /usr/include/endian.h

------
mcculley
I think this is mostly correct. Certainly when dealing with streams, it makes
code more straightforward to just deal with a byte at a time. But grepping
over some old code to see where I've used WORDS_BIGENDIAN, I see cases where I
defined a typedef struct for a memory mapped binary data format. That is one
place where you would sacrifice performance and clarity by dealing with bytes.

------
haberman
I prefer the code snippet:

    
    
      memcpy(&i, data, sizeof(i));
      i = le32toh(i);  // Or whichever function is correct.
    

This is easier to read and requires less smarts from the compiler to do the
right thing efficiency-wise.

------
gaius
LSB-MSB enables certain useful addressing modes in the 6502, e.g. fast access
to the zero page. Therefore _IMHO_ little-endian is better. I'm not really
interested in 16-bit and above :-)

------
duaneb
Of all the things to quibble about, he chooses the 0.01% of binary I/O that's
write once, test once.

------
coffeeaddicted
But now he needs 2 variables, i and data, while otherwise I can just read in
i and swap it afterward on a big-endian machine (assuming I read little-endian
data, of course).

~~~
alexchamberlain
Multiple variables shouldn't be scary. Storing the same data multiple times
is... You just need to use pointers. See
<https://github.com/alexchamberlain/byte-order>.

~~~
coffeeaddicted
You realize your code is littered with exactly those defines which the
article says are not necessary at all? You are exactly proving my point: when
you use a define, you don't need a second variable in the case where you don't
need to swap bytes. And the trick is to use the define only after you already
have the value in the integer. In his solution you would _always_ need the
data pointer which you have put into the define.

~~~
alexchamberlain
They are necessary to provide optimal code without relying on the optimiser.

------
skrebbel
> _byte order doesn't matter._

> _Let's say your data stream has a little-endian-encoded 32-bit integer.
> Here's how to extract it (assuming unsigned bytes):_
    
    
        i = (data[0]<<0) | (data[1]<<8) | (data[2]<<16) | (data[3]<<24);
    

Wait, if byte order doesn't matter, why do I need to do byte-level array
lookups when I'm processing a stream of integers? Oh yeah, because byte order
_does_ matter. If byte order didn't matter (say, if all computers were 32-bit
and had the same byte order), I could just cast the stream to int* and be done
with it. I can't, because of _byte order_. It _matters_.

Whether you deal with it using byte-array lookups and math or #ifdefs and
bitmasks, well, whatever rocks your boat man! Good that you're taking it into
account, because byte order matters!

~~~
msbarnett
Did you bother to read the article? He writes that _native_ byte order
doesn't matter, not, as you botched the quote, that "byte order doesn't
matter".

The byte order of the input data _obviously_ matters, and nothing you've said
here disagrees with anything he wrote.

~~~
demallien
I think that skrebbel is trying to say that doing as the article suggests
comes with an unnecessary performance penalty if your CPU has the same
endianness as the data stream.

~~~
fpgeek
Why should there be any performance penalty? A good compiler (and I've worked
with at least one that could do this) would know the machine's endianness and
could optimize away that sequence of selections, shifts and ors when it isn't
needed.

~~~
alexchamberlain
Ok, but the performance penalty is then at compile time...

~~~
alexchamberlain
Lots of downvotes... Do you disagree? If so, why? Or do you think it is
insignificant?

I've sat and watched C++ compile for 5 hours... Compile-time performance is
important too!

------
alexchamberlain
There are a lot of errors in this article and the code therein.

    
    
      i = *((int*)data);
      #ifdef BIG_ENDIAN
      /* swap the bytes */
      i = ((i&0xFF)<<24) | (((i>>8)&0xFF)<<16) | (((i>>16)&0xFF)<<8) | (((i>>24)&0xFF)<<0);
      #endif
    

This should use uint32_t; it is the best way of getting a platform-independent
unsigned 32-bit integer, which is what you want here.
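
Something like the following, a sketch of the same snippet with a fixed-width
type (the BIG_ENDIAN macro name is kept from the article's code, and the cast
still assumes the data pointer is suitably aligned):

      #include <stdint.h>

      uint32_t i = *((const uint32_t*)data);
      #ifdef BIG_ENDIAN
      /* swap the bytes */
      i = ((i&0xFF)<<24) | (((i>>8)&0xFF)<<16) | (((i>>16)&0xFF)<<8) | (((i>>24)&0xFF)<<0);
      #endif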

 _It's more code._

A couple more lines of C, yes. No more at the machine level.

 _It assumes integers are addressable at any byte offset; on some machines
that's not true._

Not sure about this one...

 _It depends on integers being 32 bits long, or requires more #ifdefs to pick
a 32-bit integer type._

This is caused by bad code - see above.

 _It may be a little faster on little-endian machines, but not much, and it's
slower on big-endian machines._

It is faster on an LE machine, but not slower on a BE machine - the same code
can be used and it's a compile-time #ifdef.

 _If you're using a little-endian machine when you write this, there's no way
to test the big-endian code._

Test on a BE machine?

 _It swaps the bytes, a sure sign of trouble (see below)._

No actual facts here...

As pointed out by another commenter, this can be optimised out by the
compiler on many platforms.

~~~
dfox
The point is, when you use explicit shifting of bytes you have the same code
working independently of endianness. The fact that the compiler is probably
going to generate exactly the same code seems to me like a good argument to go
with the more readable choice (i.e. explicit shifts). Also, the variant
proposed by the article is actually portable C; anything involving casting
arrays of one type to arrays of another, incompatible type is not.

No modern architecture can access arbitrarily aligned words in memory
directly (the presence of caches modifies things slightly, but shifts the
problem from data-bus width to cache-line width, as an unaligned word can
still span two cache lines). There are generally two solutions to this:
disallow it at the CPU level (and handle it by raising SIGBUS), or emulate it
in hardware by doing two memory accesses for one load or store (which involves
significant additional complexity). Intel invented a third solution in the
i386: the OS can select between these two behaviors.

~~~
alexchamberlain
I would argue we, as a community, need to write portable, yet optimised,
byte-order convertors: htobe, htole, htobel, htolel, htobell, htolell, and
vice versa.

~~~
dfox
Converting the byte order of an integer is a mostly pointless operation
(which is what the article tries to say); what is needed is a portable, yet
optimized, way to build/parse portable binary structures. In my opinion there
are two reasons why too much optimization here is a complete waste of time:

1) Even if compilers are not able to optimize the manual conversion of an
integer to/from discrete bytes into the same code as a word-sized access with
an optional byte-order swap, it's mostly irrelevant, as there isn't going to
be any significant difference in performance between one four-byte access and
four one-byte accesses (in both cases you end up with the same number of
actual memory transactions, which is the slow part, thanks to caches).

2) When you are handling a portable binary representation of something, it's
always connected to some I/O, which is slow already, so any performance boost
that you get from a micro-optimization like this is completely negligible.

I tend to just hand-write a few lines of C to pack/unpack integers explicitly
when needed, as that seems to me the most productive thing you can do.

By the way, all the big-endian <-> little-endian functions you propose boil
down to two implementations for each operand size: a no-op and a mirroring of
all the bytes, both of which are mostly trivial.

What is really missing is a portable and efficient way to encode
floating-point numbers, as there is no portable way to find out their
endianness, and in the floating-point case it's more complex than just big vs.
little endian.
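
The common workaround, sketched here under the assumption that both ends use
IEEE-754 binary32 (which C does not guarantee), is to reinterpret the float's
bits as a uint32_t and serialize that like any other integer:

      #include <stdint.h>
      #include <string.h>

      /* Sketch: serialize a float as four little-endian bytes, assuming
       * the host represents float as IEEE-754 binary32. */
      static void write_float_le(float f, unsigned char out[4])
      {
          uint32_t bits;
          memcpy(&bits, &f, sizeof bits);   /* reinterpret without aliasing UB */
          out[0] = (unsigned char)(bits >>  0);
          out[1] = (unsigned char)(bits >>  8);
          out[2] = (unsigned char)(bits >> 16);
          out[3] = (unsigned char)(bits >> 24);
      }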

~~~
alexchamberlain
It's the boiling down that people get confused with... I've started an
implementation at <https://github.com/alexchamberlain/byte-order>.

