

Counting bytes fast - walrus
http://fastcompression.blogspot.com/2014/09/counting-bytes-fast-little-trick-from.html

======
nkurz
I wish the author had been more specific about the processor being tested.
This is the sort of thing that is extremely dependent on the
microarchitecture of the processor used. I wonder if he's testing on AMD or
older Intel?

Modern Intel chips usually do a very good job of "store-to-load forwarding"
when the load is the same size or smaller than the store. Although it's
testing a slightly different effect, this is a great recent article on the
topic: [http://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/](http://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/)
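
For concreteness, the multi-table trick under discussion looks roughly like
this (a minimal sketch with 64-bit counters; the article's actual code may
differ in details):

        #include <stdint.h>
        #include <stddef.h>

        /* Four separate count tables: consecutive identical bytes never
           store to the same slot back-to-back, which sidesteps the
           store-to-load forwarding / write-commit stall. */
        void count_bytes(const uint8_t *in, size_t n, uint64_t counts[256])
        {
            uint64_t t0[256] = {0}, t1[256] = {0}, t2[256] = {0}, t3[256] = {0};
            size_t i;
            for (i = 0; i + 4 <= n; i += 4) {
                t0[in[i]]++;
                t1[in[i + 1]]++;
                t2[in[i + 2]]++;
                t3[in[i + 3]]++;
            }
            for (; i < n; i++)   /* leftover bytes */
                t0[in[i]]++;
            for (int v = 0; v < 256; v++)
                counts[v] = t0[v] + t1[v] + t2[v] + t3[v];
        }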

~~~
sitkack
Pretty sure this would help. [https://software.intel.com/en-us/node/513929](https://software.intel.com/en-us/node/513929)

Author should have supplied test code and a benchmark suite.

~~~
danbruc
How could packed addition help with this task? The project description on
GitHub says benchmarking happens on an »Intel Core i5-3340M (oc'ed to
3.0GHz)«, so maybe the article is based on numbers from the same machine.

~~~
sitkack
The AVX registers have 512 bytes of storage (16 YMM registers x 32 bytes
each).

~~~
nkurz
The space is there, but it's difficult to see how to make use of it for this
problem. You'd need to be able to use an 8-bit input value to choose which
position in which register to update. Incrementing the position (pos = input %
32) is possible, though not as straightforward as you might hope.

But choosing which of the vectors to update (vecX where X = input / 32) has
no good answer that avoids unpredictable branching. There's a thread
elsewhere on this page where we muse about the possibility of using
self-modifying code, but otherwise I don't know of any answer.

~~~
sitkack
During my morning boot (awakening) I was thinking about the self-modifying
code solution. I don't think it is needed, and it would eat up lots of
memory bandwidth anyway. I think a 256-entry jump table is sufficient, which
could be inlined like so:

    
    
        R8 <- load 64 bits from RAM
        JUMP A1                              ; skip the shift for the first byte
        @A0
        SHIFT R8 right by 8                  ; bring the next byte into the low bits
        @A1
        R9 <- low 8 bits of R8               ; the byte value selects a routine
        PC += R9 * SIZEOF_AVX_INCR_ROUTINE   ; computed jump into the inlined table
    

Each AVX_INCR_ROUTINE would be the same size and end with an absolute jump
back to A0. There would need to be a DEC counter so the whole thing is only
done 8 times, once per byte of R8. Memory reads could also be interleaved
during the count, or an AVX register could be used as the destination
directly from memory, but that seems like a waste if there are already
64-bit registers sitting there.

If one used U16 counters, every 65k elements read would need a sweep that
writes the high buckets out to 32-bit counters to prevent overflow.
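
That sweep might look something like this (hypothetical sketch; the names
are made up):

        #include <stdint.h>

        /* Fold the narrow U16 buckets into wide U32 counters and reset
           them, before any U16 bucket can overflow (every <= 65535 reads). */
        static void flush_buckets(uint16_t narrow[256], uint32_t wide[256])
        {
            for (int v = 0; v < 256; v++) {
                wide[v] += narrow[v];
                narrow[v] = 0;
            }
        }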

I don't yet have any solid evidence but I think using the AVX registers for
histogram storage _could_ enable histogramming at the read bandwidth of main
memory.

[https://software.intel.com/sites/landingpage/IntrinsicsGuide/](https://software.intel.com/sites/landingpage/IntrinsicsGuide/)

~~~
nkurz
I don't think any approach based around a jump table is going to perform well.
The issue is that modern processors use speculative execution and have a deep
pipeline that needs to be refilled whenever the wrong path is chosen. A jump
table and a cmp/jc are predicted in the same way, but the jump table is
usually harder to predict. Every time the prediction is wrong, you lose ~15
cycles. The 4-table approach described takes about 2 cycles per byte, so there
isn't much room for error.

The whole idea of "the" program counter is almost gone. You can still set it
to affect program behaviour, but in reality there is a "reorder buffer" filled
with instructions from an "execution window" that are issued by the backend as
soon as their dependencies are available, and the front end has a mostly
independent "speculative pc" that is fetching and decoding instructions 10-100
cycles ahead of the current execution window. The guess as to which execution
path will be taken at the jump table will probably be made even before the
load is issued to get the address.

Per your earlier recommendation, I did read through "Inner Loops". It's a
really good book, and offers a great overall approach to programming, but this
is one of the places where the details of modern processors have changed so
much that the older approach is counterproductive. To be competitive for this
task, you probably need to be branch free, so that you have speculative
execution working with you rather than against you.

~~~
sitkack
I see your point. I am going to try to implement the approach I outlined
above, as well as the code gen. Glad you checked out "Inner Loops". I guess
my takeaway from that book is: understand the machine, understand your code,
and most of all do some science. The hardware used in the book is very
dated, and I'd love to see a second edition using Skylake or newer.

Maybe ... instead of doing the JMP back to the top of the loop, you combine
the two approaches into something like a direct-threaded interpreter: in
each AVX_INCR_ROUTINE, modify the code by overwriting the next JMP address
instead of jumping back to A0. The issue of consecutive duplicates arises
again, though.

Or ask the processor for some L1 cache that never gets flushed to main memory.

Or use R8-R15 as an 8x8 array of U8 counters and do 4 passes ([0-63],
[64-127], [128-191], [192-255]), or as a 4x8 array of U16 counters; one
would have to test.
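
In C the pass structure would look something like this (just a sketch of the
shape; C can't actually pin the counters to R8-R15, so the register
allocation would need asm):

        #include <stdint.h>
        #include <stddef.h>

        /* 4 passes, each counting only one 64-value window, so the live
           counters fit in 64 bytes (R8-R15 in asm). Flushes every 255
           bytes so the U8 counters can't overflow. Caller zeroes counts. */
        void count_4pass(const uint8_t *in, size_t n, uint32_t counts[256])
        {
            for (int pass = 0; pass < 4; pass++) {
                uint8_t base = (uint8_t)(pass * 64);
                size_t done = 0;
                while (done < n) {
                    uint8_t c[64] = {0};
                    size_t end = (n - done > 255) ? done + 255 : n;
                    for (; done < end; done++) {
                        uint8_t v = in[done] - base;   /* wraps mod 256 */
                        if (v < 64)
                            c[v]++;
                    }
                    for (int v = 0; v < 64; v++)
                        counts[pass * 64 + v] += c[v];
                }
            }
        }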

    
    
        Function    Best Rate MB/s  Avg time     Min time     Max time
        Copy:           11569.0     0.145255     0.138301     0.154563
        Scale:          11383.5     0.145140     0.140555     0.148503
        Add:            12544.6     0.196920     0.191318     0.204782
        Triad:          12464.8     0.198199     0.192542     0.207492
        

from
[http://www.cs.virginia.edu/stream/FTP/Code/stream.c](http://www.cs.virginia.edu/stream/FTP/Code/stream.c)

I still think the histogramming can get above 6 GB/s.

------
0x0
Thanks for this post! I've known about the massive effects of optimizing for
cache lines but this is the first time I've heard of the write commit delay.
Great example as well.

------
infogulch
If I remember correctly, the Mill wouldn't have this issue, since all stores
go through the cache hierarchy (by default), so a store finishes as soon as
it's in L1 (~3 cycles typically). It's then evicted down through the cache
and into main memory as usual, while retaining consistent aliasing semantics
automatically.

I may have misunderstood, but this is covered in more detail in the memory
talk I believe: [http://youtu.be/bjRDaaGlER8](http://youtu.be/bjRDaaGlER8)

------
readerrrr
I did a quick test on an i5 (Ivy Bridge) in C, and surprisingly, on
non-random input the distributed version retained the speed it had on random
input, while the single-table version became 2.5x slower.

~~~
jevinskie
Can you share your test harness and implementations? This would be fun to
tinker with. =)

------
rcthompson
So the problem is when the same byte occurs 2 or more times in rapid
succession, right? Couldn't you use a temporary variable to count the number
of consecutive occurrences of the same byte, and then once you hit a different
byte, add the total consecutive count to the appropriate slot? That way you
only write to the count table once for each run of identical bytes, and you
never write to the same slot of the table twice in a row. Assuming that the
compiler can fit all the necessary temporary variables into registers,
wouldn't this eliminate the problem entirely? Or would this be so much extra
work that the processor becomes the bottleneck instead of writes to memory?
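
Something like this, I mean (a minimal sketch, assuming a zeroed counts
table):

        #include <stdint.h>
        #include <stddef.h>

        /* Accumulate a run of identical bytes in a register and write to
           the table once per run, so the same slot is never stored to
           twice in a row. */
        void count_runs(const uint8_t *in, size_t n, uint64_t counts[256])
        {
            if (n == 0)
                return;
            uint8_t cur = in[0];
            uint64_t run = 1;
            for (size_t i = 1; i < n; i++) {
                if (in[i] == cur) {
                    run++;
                } else {
                    counts[cur] += run;   /* one table write per run */
                    cur = in[i];
                    run = 1;
                }
            }
            counts[cur] += run;           /* final run */
        }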

~~~
bodyfour
The cost of an unpredicted branch inside the loop would probably outweigh the
benefit.

~~~
kabdib
This code is generating tons of D-cache misses, at the cost of hundreds of
cycles (in a modern x86 memory system), so you can afford a branch miss now
and then.

~~~
nkurz
I don't think this is correct. The input is sequential, so the hardware
prefetcher should be very successful: the only misses will be the 1-in-4096
first access to each new 4KB page, and even this could be avoided with a
single judicious software prefetch.
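
For illustration, that single prefetch might be placed like this (untested
sketch on top of the 4-table loop; the prefetch distance would need tuning):

        #include <stdint.h>
        #include <stddef.h>
        #include <xmmintrin.h>   /* _mm_prefetch */

        /* Touch the start of the next 4KB page once per page, well before
           the loop reaches it. */
        void count_bytes_pf(const uint8_t *in, size_t n,
                            uint64_t t0[256], uint64_t t1[256],
                            uint64_t t2[256], uint64_t t3[256])
        {
            for (size_t i = 0; i + 4 <= n; i += 4) {
                if ((i & 4095) == 0)
                    _mm_prefetch((const char *)(in + i + 4096), _MM_HINT_T0);
                t0[in[i]]++;
                t1[in[i + 1]]++;
                t2[in[i + 2]]++;
                t3[in[i + 3]]++;
            }
        }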

He's counting 8-bit bytes, and thus has 256 entries per table. Using 64-bit
counters, this is 8 * 256 = 2048 contiguous bytes per table. 4 tables gets him
up to 8KB. Modern L1D is 32KB.

Alternatively viewed, 1700MB/s is 1.7 billion table additions per second. The
processor is running at about 3.5 GHz, which is 3.5 billion cycles per
second. Thus his current algorithm takes ~2 cycles per byte. A branch
prediction error costs ~15 cycles, and thus would be relatively very
expensive.

You'd probably be right for the 16-bit U16 version, though.

~~~
kabdib
Good point about prefetch.

The U16 version can probably benefit from a test for the first 256 values,
leaving the remainder of the range as vanilla counters. You might even get a
win by testing for less than 128 first, in fact.
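
As a sketch of what I mean (hypothetical; a hot table for the small values,
the big table for the rest):

        #include <stdint.h>
        #include <stddef.h>

        /* Counting 16-bit values: give the low 256 values a small table
           that stays in L1, and fall through to the vanilla 65536-entry
           table for everything else. */
        void count_u16(const uint16_t *in, size_t n,
                       uint64_t hot[256], uint64_t cold[65536])
        {
            for (size_t i = 0; i < n; i++) {
                uint16_t v = in[i];
                if (v < 256)
                    hot[v]++;
                else
                    cold[v]++;
            }
        }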

------
im3w1l
It seems what you really want is the probability of the different symbols.
Have you considered estimating them by sampling?
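
For example (a rough sketch using a fixed stride):

        #include <stdint.h>
        #include <stddef.h>

        /* Estimate symbol frequencies from every STRIDE-th byte; each
           count approximates 1/STRIDE of the true total. */
        #define STRIDE 64

        void sample_counts(const uint8_t *in, size_t n, uint64_t counts[256])
        {
            for (size_t i = 0; i < n; i += STRIDE)
                counts[in[i]]++;
        }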

~~~
0x0
If the sampling is deterministic, it sounds like you'd risk malicious input
being able to cause pathological results? Not unlike the various hashmap
DoS'es from a couple of years ago.

~~~
baudehlo
I think you may be off by a few years there. The hash DoS's were discovered
about 11 years ago. Time flies, eh?

~~~
0x0
You're probably right, but there was at least a big reprise of it in 2011 in
PHP - CVE-2011-4885.

------
jevinskie
I wonder what would happen if you used AVX-512 registers to store the byte
distribution. You could use 16 of the 32 512-bit SIMD registers instead of
relying on memory. 256-bit AVX would require 32 SIMD registers, but it has
just 16 available.

~~~
dbaupp
Is there an efficient way to take a byte and increment the appropriate
location of the appropriate register?

~~~
nkurz
No, there is not.

There are reasonable branch-free ways to increment one of the 16 bytes of an
XMM register (shift and add), and slower ways to increment any of the 32
bytes of a single YMM register (permute and add). Interestingly, a lookup
table that loads the appropriate addend vector is surprisingly fast as well.
But there is no branch-free way to select which of several AVX/AVX2
registers to work with.
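
For a single YMM register, the addend-vector lookup is roughly this (AVX2
sketch; the U8 lanes would still need periodic flushing, and the
register-select problem remains):

        #include <immintrin.h>
        #include <stdint.h>

        /* addend[i] is a 32-byte vector with a 1 in lane i, 0 elsewhere. */
        static uint8_t addend[32][32] __attribute__((aligned(32)));

        static void init_addend(void)
        {
            for (int i = 0; i < 32; i++)
                addend[i][i] = 1;
        }

        /* Increment byte lane (byte % 32) of the counter vector. */
        static inline __m256i bump_lane(__m256i counters, uint8_t byte)
        {
            __m256i one = _mm256_load_si256((const __m256i *)addend[byte & 31]);
            return _mm256_add_epi8(counters, one);
        }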

But I'd be happy to be wrong -- maybe there is a way to address them using
the older stack-oriented FPU instructions?

~~~
pmalynin
If one is willing to write self-modifying code, then it is possible to
overwrite the VEX.vvvv field of the instruction encoding with the register
select: for example, to select XMM0/YMM0 you'd have VEX.1111 (one's
complement), or for XMM15/YMM15, VEX.0000.

~~~
userbinator
Unfortunately, although SMC could be shorter, it's even slower, since the
pipeline gets flushed every time a write occurs to a location within the
current fetch window (not sure exactly how long that is, but it's not
small).

~~~
pmalynin
Yes, and the instruction cache becomes stale as well. I guess one way to
avoid that is to have 16 code blocks back-to-back and then do a jmp into the
section that contains the right register. JMPs are pretty cheap, and the end
point is likely to be in cache anyway.

------
zobzu
Interestingly, this is the kind of task where I fear languages with no
pointers exposed may be slower. Does anyone have a good comparison?

