

Accessing unaligned memory - djoldman
http://fastcompression.blogspot.com/2015/08/accessing-unaligned-memory.html

======
malkia
Treyarch's Spiderman 2 on the PS2 used LZO-compressed data, compressed in
such a way that compressed chunks were up to 64kb, decompressing to a max of
128kb (avg ratio was 1:1.5 I think). The biggest speedup for the LZO
decompress was to put in a few gcc asm instructions to use the unaligned
write, and I think I used an unaligned read too. Also the data was already
padded to 16 bytes (or even more). This allowed us to do 5-6 mb/sec of
uncompressed data from a DVD that could do 2 to 4 mb/sec depending on whether
it was reading close to the center or at the edge (e.g. it was constant
angular velocity rather than constant linear speed - I think we used this
mode as it was less noisy and more reliable). This also helped us with the
Japanese PS2s, many of which shipped with problems limiting even their outer
speed to 2 mb/sec. (Spiderman 2 was a heavily streaming game.)

The decompression was done on the IOP chip (i.e. the PS1 inside the PS2); on
the PS2 this chip was used for file I/O, sound, and a few other minor things
(the controllers too I think, and maybe the memory card). So the LZO code
originally did only 2-3 mb/s decompression on that 40MHz chip, but adding
these two unaligned reads/writes (I don't remember the details) gave us a
boost of 70-80% or more (again, I don't remember the details). ZLIB was out
of the question, and lz4, lzfast, etc. were not known to us back then.

Later I read that this unaligned read/write instruction had a bit of a legal
issue, but I can't find the article about it. Something about MIPS licensing
it for use by other companies making MIPS-compatible chips.

As for the 64kb compressed -> 128kb uncompressed (on avg.) - basically the
way it was done, I had only one 128kb buffer, and I would read the compressed
chunk (from 8kb to 64kb, aligned to 4kb I think) at the end of that buffer.
LZO was pretty cool in that it could decompress in place, as long as your
compressed data was a bit ahead of where it was going to get decompressed.
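The buffer layout described above can be sketched like this. A toy RLE codec
stands in for LZO here (the codec, sizes, and function name are my own
illustration, not Treyarch's code) - the point is only that the compressed
bytes sit at the end of the single buffer and get consumed before the write
cursor can reach them:

```c
#include <stdint.h>
#include <string.h>

/* In-place decompression sketch: the compressed chunk is placed at the END
 * of the output buffer, and decompression writes forward from the START.
 * With a codec where every token consumes at least as many output bytes as
 * input bytes (here: (count, byte) pairs with count >= 2), the read cursor
 * always stays at or ahead of the write cursor, so nothing unread is ever
 * overwritten - provided the decompressed size fits in the buffer. */

/* Decompress (count, byte) RLE pairs in place; returns bytes written. */
static size_t rle_decompress_in_place(uint8_t *buf, size_t buf_size,
                                      const uint8_t *comp, size_t comp_size)
{
    uint8_t *rd = buf + buf_size - comp_size;  /* compressed data at the end */
    uint8_t *wr = buf;                         /* output grows from the start */
    const uint8_t *end = buf + buf_size;

    memcpy(rd, comp, comp_size);  /* stands in for the disk read */
    while (rd < end) {
        uint8_t count = *rd++;
        uint8_t value = *rd++;
        memset(wr, value, count);
        wr += count;
    }
    return (size_t)(wr - buf);
}
```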

Then that buffer was transferred via DMA to the PS2 CPU chip, and then the
next one was read (not sure whether I tried double buffering or not - the
details escape me now). I think Kung Fu Panda, and maybe one more game, used
it, but Spiderman 2 probably had the biggest benefit. Still, to this day I'm
not sure how much of an impact this was, since the GC and Xbox versions
shipped without compressing any data.

It was just a hint from a very cool coworker to try it out, and I did it :)
I got help from a few other colleagues to speed it up later...

~~~
giovannibajo1
Just out of curiosity: LZO is very popular in games, but it is GPL licensed.
Were you violating the license by using and modifying the original C code in
a proprietary codebase?

~~~
malkia
We had purchased the license.

------
eridius
I find it interesting that for Clang 3.6 the best method is memcpy() on all
platforms. I'm guessing this is because Apple is deeply involved in the
development of Clang, and Apple also cares _very much_ about the performance
of ARMv7 (and ARMv6 in the past, though probably not so much anymore).

~~~
brigade
Yeah... gcc's optimizer _very_ often falls down in absolutely trivial cases on
ARM. Like they discovered here - compiling even

    
    
        u32 read32(const void* ptr) {
            u32 result;
            memcpy(&result, ptr, 4);
            return result;
        }
    

results in a load from ptr, then a store to the stack, then a load from that
stack slot back into the same register to return. That's... kinda bad that it
can't eliminate such an easy dead store to the stack.

    
    
        @ args = 0, pretend = 0, frame = 8
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldr	r0, [r0, #0]	@ unaligned
        sub	sp, sp, #8
        str	r0, [sp, #4]	@ unaligned
        ldr	r0, [sp, #4]
        add	sp, sp, #8
        bx	lr
    

Though this specific case is probably ultimately because gcc had code to
handle unaligned accesses via bytewise accesses (I think... I can't get gcc
to generate such a sequence now), since unaligned support in armv6 could be
(and often was) disabled by the OS, and it still hasn't been excised from the
newer architectures that don't need it. LLVM, on the other hand, has never
cared about ARM CPUs/OSes that don't support unaligned accesses.

------
userbinator
_Of course, a better solution would be for all compilers, and gcc
specifically, to properly translate `memcpy()` into efficient assembly for all
targets._

An even better solution would be for newer non-x86 hardware to handle
unaligned accesses automatically:

[http://lemire.me/blog/archives/2012/05/31/data-alignment-
for...](http://lemire.me/blog/archives/2012/05/31/data-alignment-for-speed-
myth-or-reality/)

ARMv7 is similar to x86 in that it has "natural unaligned access", hence the
great speedup afforded by using packed structures.
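For illustration, these are the two idioms usually compared in this context
(the function names are mine, not from the article). On x86 and ARMv7 a good
compiler lowers both to a single unaligned load; on strict-alignment targets
it must emit a safe byte-by-byte sequence instead:

```c
#include <stdint.h>
#include <string.h>

/* 1. The packed-struct idiom (gcc/clang extension): the packed attribute
 *    tells the compiler the field may sit at any address. */
typedef struct { uint32_t v; } __attribute__((packed)) unalign32;

static uint32_t read32_packed(const void *p)
{
    return ((const unalign32 *)p)->v;
}

/* 2. The memcpy idiom (strictly conforming C): a 4-byte copy the compiler
 *    is expected to optimize into one load. */
static uint32_t read32_memcpy(const void *p)
{
    uint32_t r;
    memcpy(&r, p, sizeof r);
    return r;
}
```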

IMHO data alignment restrictions were an implementation artifact of the early
RISCs which might've seemed like a good idea at the time, when the width of
the databus was the same as the wordsize, but they make little sense now with
caches and _very_ wide datapaths; the amount of extra circuitry needed is
really insignificant compared to the rest of the CPU (including the caches),
and memory bandwidth is often the bottleneck, such that packing data
structures to utilise the cache most efficiently becomes important.

(I say this having worked with MIPS code to do unaligned multibyte
reads/writes; taking the appropriate bytes, shifting them the right amount,
and then combining them is really trivial to do in hardware - it's
essentially just wiring - but it turns into annoying multi-instruction
sequences when done in software.)
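The software sequence alluded to above looks roughly like this (little-endian
byte order assumed; the name is illustrative): four byte loads, three shifts,
and three ORs, versus a single load on hardware with unaligned support.

```c
#include <stdint.h>

/* Shift-and-combine unaligned read, done in software: read four individual
 * bytes and merge them. Safe at any address on any architecture, at the
 * cost of several instructions on CPUs without hardware unaligned loads. */
static uint32_t read32_unaligned_le(const uint8_t *p)
{
    return  (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```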

~~~
caf
Unaligned words within a cacheline is one thing, but what about unaligned
words that straddle the end of a cacheline - or worse, the end of a page?

In particular, consider how these accesses will interact with atomic
operations.

~~~
userbinator
_In particular, consider how these accesses will interact with atomic
operations._

For x86, Intel has guaranteed that since the P6, any accesses within a cache
line are atomic. The LOCK prefix can be used to enforce atomicity even with
accesses that cross cachelines.

 _or worse, the end of a page?_

I suppose you're thinking about page faults? If a #PF occurs and the handler
returns with a valid mapping, the instruction just gets restarted as usual.
This reduces to the cacheline-crossing case.

I'm not saying that alignment is entirely unimportant and can always be
ignored, but the number of cases in which it matters seems to be decreasing
with newer CPUs.

~~~
caf
I was responding mainly to the assertion about the trivial amount of extra
circuitry required. The cacheline-crossing case means that it's a more
complicated affair than just shuffling some bytes around as in the within-
cacheline case.

The "accesses across cachelines require LOCK to be atomic" x86 case is also
illustrative - for example should the pthreads implementation include the LOCK
prefix just in case you have embedded your pthread_mutex_t in a packed
structure where it ends up unaligned?
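The hazard in that last question can be shown without calling any locking
function at all (hypothetical struct name; gcc/clang's `packed` extension
assumed) - the attribute alone is enough to misalign the embedded mutex:

```c
#include <pthread.h>
#include <stddef.h>

/* A packed struct can leave an embedded pthread_mutex_t misaligned, so its
 * internal atomic operations are no longer guaranteed to stay within one
 * cache line. (Illustration only -- actually locking a misaligned mutex
 * would be a serious bug.) */
struct packed_holder {
    char tag;              /* one-byte member pushes the mutex off alignment */
    pthread_mutex_t mutex; /* now at offset 1 instead of a natural boundary */
} __attribute__((packed));
```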

------
0x0
I believe that unaligned memory access on ARM is not only slow (if
traps/kernel fixups are enabled), but if fixups aren't configured, you will
actually just _read garbage data_!

[https://wiki.debian.org/ArmEabiFixes#word_accesses_must_be_a...](https://wiki.debian.org/ArmEabiFixes#word_accesses_must_be_aligned_to_a_multiple_of_their_size)

~~~
david-given
That's pretty common in a lot of low-end microcontrollers --- I've met one
(the MSP430) where the low bit of the address bus is wired to zero when doing
a 16-bit read. It's a bit rage-inducing when trying to track down a bug
caused by it, but totally unsurprising.

I _am_ surprised that recent ARMs have hardware support for unaligned
accesses. What sane person expects these to work?

~~~
0x0
Funny that this comment and the sibling one seem to be contradictory: "who
expects this to work" vs "if it doesn't work, get with the times".

I bet there is a ton of originally x86-targeted software and libraries that
never paid much attention to alignment. It's pretty scary to consider what
could happen when stuff like that is ported over to ARM and garbage data is
returned for unaligned accesses - what if stuff (structs etc.) is misaligned
only sometimes (depending on allocator, array lengths, etc.)? Could it be the
"next heartbleed"? :P

~~~
TillE
Since most people don't write assembly code anymore, it usually makes sense
for the hardware to be simple and fast. That means the compiler is a little
more difficult to write, but code in C should be unaffected.

~~~
0x0
There are lots of ways to hit alignment problems in C, at least if you do a
little typecasting here and there.
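For instance (an illustration of my own, not from the linked posts) - the
cast compiles fine everywhere, but the load is undefined behavior and can
trap or return garbage on strict-alignment targets:

```c
#include <stdint.h>
#include <string.h>

/* A typical way C code trips over alignment: parsing a byte stream by
 * casting into it. */
uint32_t parse_length_unsafe(const uint8_t *packet)
{
    /* packet + 1 is almost never 4-byte aligned */
    return *(const uint32_t *)(packet + 1);  /* UB: misaligned access */
}

/* The well-defined alternative: memcpy into an aligned local. */
uint32_t parse_length_safe(const uint8_t *packet)
{
    uint32_t len;
    memcpy(&len, packet + 1, sizeof len);
    return len;
}
```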

Coincidentally, alignment issues was the topic of yesterday's Old New Thing
blog post:
[http://blogs.msdn.com/b/oldnewthing/archive/2015/08/20/10636...](http://blogs.msdn.com/b/oldnewthing/archive/2015/08/20/10636350.aspx)
(and references an example of bad C code alignment in
[http://blogs.msdn.com/b/oldnewthing/archive/2004/08/25/22019...](http://blogs.msdn.com/b/oldnewthing/archive/2004/08/25/220195.aspx)
)

