
A Fast x86 Implementation of Select (2017) [pdf] - espeed
https://arxiv.org/abs/1706.00990
======
ot
The authors claim their algorithm is novel, but it had actually been a
well-known folklore technique for quite a while. For example, it was used in
Folly as early as January 2016:

[https://github.com/facebook/folly/commit/b28186247104f8b90cf...](https://github.com/facebook/folly/commit/b28186247104f8b90cfbe094d289c91f9e413317#diff-666bb7d0944159a9ea430bd54196be4a)

~~~
stephencanon
Yeah, I definitely saw this technique in Intel sample code demonstrating use
of BMI before HSW was released. Probably 2011, maybe 2012.

That said, we worry too much about who the first person to have an idea was.
Any idea worth having is come upon independently by many people, and putting
the idea to work and communicating it to others are more valuable
contributions anyway.

------
Veedrac
> Recently, Pandey et al. [15] gave a new implementation of machine-word
> select, which we call PTSelect.

It's worth noting Jukka Suomela had this PDEP idea late 2014, and I've used
the trick a few times from that source.

[https://stackoverflow.com/a/27453505/1763356](https://stackoverflow.com/a/27453505/1763356)
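For readers unfamiliar with the trick: select(x, j) deposits a single bit into the jth set-bit position of x via PDEP, then counts trailing zeros to recover that position. Here's a minimal sketch in Python that emulates both instructions in software (on real hardware you'd use the `_pdep_u64` and `_tzcnt_u64` intrinsics; the helper names here are mine):

```python
def pdep(src, mask):
    """Emulate PDEP: deposit the low bits of src into the set-bit
    positions of mask, in order."""
    result = 0
    k = 0  # next bit of src to consume
    for i in range(64):
        if mask & (1 << i):
            if src & (1 << k):
                result |= 1 << i
            k += 1
    return result

def tzcnt(x):
    """Emulate TZCNT: count trailing zeros (64 for x == 0)."""
    if x == 0:
        return 64
    return (x & -x).bit_length() - 1

def select(x, j):
    """Position of the j-th set bit of x (0-indexed)."""
    return tzcnt(pdep(1 << j, x))
```

For example, `select(0b10110, 1)` deposits a bit at the second set position (bit 2) and returns 2.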

------
lostmsu
Quick tip: this is about relational algebra's select operation (which is
basically row filtering; choosing columns is called 'project' there).

~~~
asdfasgasdgasdg
I didn't get that impression at all. This is "get me the rank (index) of the
first occurrence of x" in a succinct data structure. Nothing to do with
relational algebra AFAICT, except, I suppose, to the extent that some
relational algebra implementations may use such data structures under the hood.
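To make the bitvector sense of rank/select concrete, here's a toy Python sketch (my own illustration, not from the paper): select(j) returns the position of the jth set bit, and rank(i) is its inverse, counting set bits before position i.

```python
# Toy rank/select on a plain list of bits -- illustrative only; a real
# succinct data structure answers both in O(1) using o(n) extra space.

def rank(bits, i):
    """Number of 1s in bits[0:i]."""
    return sum(bits[:i])

def select(bits, j):
    """Position of the j-th set bit (0-indexed); the inverse of rank."""
    seen = 0
    for pos, b in enumerate(bits):
        if b:
            if seen == j:
                return pos
            seen += 1
    raise ValueError("fewer than j+1 set bits")
```

The defining relationship is that rank(select(j)) == j for every valid j.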

~~~
cmrdporcupine
Yes, but 'select' is the name of that operation in the relational algebra.
Which confuses some people whose first exposure to that model is through SQL,
where the keyword implies a combination of 'select' and 'project'. I think the
commenter's intent was to clarify that.

~~~
espeed
Yeah, succinct data structures aren't something taught in most intro data
structures courses, and the concept may even be unfamiliar to old hands, since
SDSs are relatively new on the scene -- the idea only entered the public
consciousness in 1988, when Guy Jacobson introduced it in his CMU thesis. I
have yet to find a digital copy of it online, so those who want to read the
original thesis may have to order a paper copy or go to the CMU library to
read it.

[https://scholar.google.com/scholar?q=guy+jacobson](https://scholar.google.com/scholar?q=guy+jacobson)

But there's been a lot of development on succinct data structures since then,
and this post on the fast _select_ algorithm was in part a continuation of a
conversation from yesterday's post on succinct data structures in general and
the open-source library SDSL 2.0. So if anyone wants to learn more about SDSs
in general, there's a bunch of links to videos and reference material in the
comments of yesterday's thread...

SDSL – Succinct Data Structure Library 2.0
[https://news.ycombinator.com/item?id=18204432](https://news.ycombinator.com/item?id=18204432)

------
purplezooey
I thought this was about select() with file descriptors, which also needs to
be fast. :)

------
derf_
I wondered why they used TZCNT instead of BSF. If their algorithm has
correctly located the word containing the jth set bit, the output of PDEP
can't be zero, and the two instructions should be equivalent. (BSF also
doesn't require BMI1 support -- though if you already require BMI2 for PDEP,
maybe that doesn't matter.)

But then I found
[https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50168](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50168).

~~~
BeeOnRope
Aside from that bug, it's as you said: might as well use tzcnt from BMI1 if
you are already using BMI2, since there are no CPUs implementing the latter
but not the former (and IIRC the reverse is even true).

In principle tzcnt is faster than bsf in many cases, since bsf always carries
a dependency on its output register (because of the zero-input case) but tzcnt
doesn't. In practice tzcnt had a false output dependency anyway, at least
until Skylake, where that was fixed (again IIRC -- I recall that one of these
bit-manipulation instructions didn't have its false dependency fixed, but the
rest did).

~~~
nkurz
What did you think of their explanation of "memory parallelism" in section
4.3? I read it quickly, but thought it was pretty accurate. I liked that they
made explicit the link between the number of instructions and the degree of
parallelism. I was a little disappointed by the final statement that they
"suspect" that the CPU is capable of only 8 parallel requests.

~~~
BeeOnRope
The analysis looks exactly right to me (they don't explain how they calculated
MLP, but the numbers look about as expected for x86 ROB sizes/FB sizes).

One thing they don't mention is that the low MLP isn't _inherent_ in the other
algorithms! With the default/obvious implementation of a benchmark loop using
each method you hit this issue, but in principle I think you could rewrite
almost any algorithm that is "instruction-count MLP limited" like this into
one that isn't. A simple sketch: do a batch of N loads up front, storing them
into a temporary buffer, then do the computation part.

The loading part will have maximum MLP (after all, it's pure loads), and the
computation part will then get all L1 hits, unless you picked N wrong.

It loses a bit because the loads aren't overlapped with computation as
effectively, but if the load time "dominates" as presented here, it would
work well.
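As a rough illustration of that restructuring (Python showing the shape of the loop, not performance code; N and all the names here are mine):

```python
N = 8  # batch size; pick so the temporary buffer stays within L1

def process(indices, table, compute):
    """Split each iteration's load+compute into a load-only phase and a
    compute-only phase, batch by batch."""
    out = []
    for start in range(0, len(indices), N):
        batch = indices[start:start + N]
        # Phase 1: pure loads -- no dependent computation between them,
        # so on hardware all the cache misses can be in flight at once.
        buf = [table[i] for i in batch]
        # Phase 2: computation over values that are now cache-resident.
        out.extend(compute(v) for v in buf)
    return out
```

The naive version would interleave `compute(table[i])` per iteration, limiting how many misses the out-of-order window can keep in flight; the batched version makes the load phase MLP-bound and the compute phase L1-bound.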

A more sophisticated approach would still try to overlap computation and
loads, at max MLP. You could do this by inserting additional prefetches or
actual loads into the computation stream, enough to get the required MLP. This
is actually kind of complicated and more dependent on the actual hardware
parameters which is why I mentioned the other way first.

------
0xFFC
Stony Brook University is one of those special universities that, although not
that famous from a general academic perspective, is top-notch in systems
software.

~~~
aportnoy
Their grad math program is pretty good.

------
skissane
I was confused by the conference this was submitted to: "42nd Conference on
Very Important Topics (CVIT 2016)." I couldn't find any conference with such
an odd name.

Eventually I worked it out: I think they are planning to submit this as a
conference paper, but at the time of creating the preprint they weren't sure
which conference. So "Conference on Very Important Topics" is just a
placeholder for where the actual conference/proceedings info will go on the
article's first page.

------
kolderman
Oh that Select. Yawn.

