

I parallelize an algorithm - lnmx
http://www.dalkescientific.com/writings/diary/archive/2012/01/17/I_parallelize_an_algorithm.html

======
onemoreact
He has a lot of overhead.

If you want to do something to 1/2 of an MxN matrix, pair rows: row x from the
top with row x from the bottom. That gives a constant amount of effort per unit
of work, plus 1/2 a unit of effort if there is an odd number of rows.
Basically, the worst case is that you have to wait for one core to process a
full line when every other line is already finished. And while you can cut the
matrix into smaller chunks, that tends to add overhead which is not worth it
for large matrices split among a small number of cores. The outer loop is x: 0
to N/2; a worker does row x and row (N - x).

Now, in his case he wants a single number: the count of the number of
comparisons over a threshold. In his example he works through _What about
per-thread count arrays?_ but forgets that he can just keep a separate count
per worker and then add it to the total when that worker finishes. That is
far less code.

PS: Cache coherency can be improved by splitting rows into 2D shapes or
grouping rows into wider bands. The second option is a lot simpler for a
similar speedup.

~~~
onemoreact
Just reread my comment; it's not very clear:

He has a lot of overhead related to random assignment in order to show a
progress bar, and he also seems to be using an insane number of threads.

If you want to show steady progress as you do something to 1/2 of an MxN
matrix, pair rows: row x from the top with row x from the bottom. That gives a
constant amount of effort per unit of work, plus 1/2 a unit of effort if there
is an odd number of rows. Basically, the worst case is that you have to wait
for one core to process a full line when every other line is already finished.
And while you can cut the matrix into smaller chunks, that tends to add
overhead which is not worth it for large matrices split among a small number
of cores. The outer loop is x: 0 to N/2; a worker does row x and row (N - x).

Now, in his case he wants a single number: the count of the number of
comparisons over a threshold. In his example he works through _What about
per-thread count arrays?_ but forgets that he can just keep a separate count
per worker and then add it to the total when that worker finishes, or split
off a few workers to add up the columns in those arrays. That is far less
_single-threaded_ code.

PS: Cache coherency can be improved by splitting rows into 2D shapes or
grouping rows into wider bands. The second option is a lot simpler for a
similar speedup, _as long as the bands are small enough not to blow your cache
and you don't have too many worker threads_. Basically, it's simple but not
cache agnostic.

~~~
dalke
I'm the author of that essay.

I mentioned in passing in a couple of locations that the code I describe is
only an approximation to the code which I am actually optimizing. The
fingerprint objects are bit patterns of (in this case) 2048 bits, and the
similarity between two fingerprints is defined as the number of bits in the
intersection divided by the number of bits in the union. (This is the
"Tanimoto" or "Jaccard" similarity.) Simple math says that if the one
fingerprint has B bits, and the threshold is t, then fingerprints to search
must have between B _t and B/t bits.
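
A quick sketch of where that bound comes from, writing a = popcount(A) for
the query and b = popcount(B) for a target (my notation, not the essay's):

```latex
T(A,B) \;=\; \frac{|A \cap B|}{|A \cup B|} \;\le\; \frac{\min(a,b)}{\max(a,b)},
\qquad\text{since } |A \cap B| \le \min(a,b) \text{ and } |A \cup B| \ge \max(a,b).
```

Requiring T(A,B) >= t therefore forces min(a,b)/max(a,b) >= t, so for a fixed
query popcount a = B the target popcount b must lie in [B*t, B/t].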

What I do then to speed up the search is to precompute the popcounts for each
fingerprint, sort them by popcount, and make an auxiliary index so I know
where the start and end of each popcount region is. As a consequence, this
reduces the O(N*N) term by a factor of two. With that in place, my search
algorithm is closer to:

    
    
        for (row=0; row<n; row++) {
          query_popcount = popcount(fingerprint[row]);
          start_popcount = query_popcount*threshold;
          end_popcount = query_popcount/threshold + 1;
          for (target_popcount = start_popcount; target_popcount < end_popcount; target_popcount++) {
            for (col = start_of_popcount[target_popcount]; col < start_of_popcount[target_popcount+1]; col++) {
              num_bits_in_common = intersection_popcount(fingerprint[row], fingerprint[col]);
              if (num_bits_in_common / (double)(query_popcount + target_popcount - num_bits_in_common) > threshold) {
                count++;
              }
            }
          }
        }
    
    

This is less comprehensible than the version I posted in the essay, but still
has the essential problem I wanted to resolve.

With these optimizations in place, it's more complicated to determine the
number of calculations which will be done in a row. For one, the popcount
distribution is highly data dependent. I don't know that number until I'm in
an inner loop, so it isn't worthwhile to attempt to assign a "constant amount
of effort per unit of work."

In the usual case of high thresholds, and if the fingerprint popcount
distribution is broad, then I am mostly working along the near-diagonal of the
matrix, which improves my cache coherency. However, if I have enough
fingerprints then this simple linear scan will blow the cache, and the above
needs to be replaced by some cache oblivious search pattern. If that's what
you meant by "2d shapes" in your comment then perhaps you missed my mention of
Morton Z-ordering?

My goal, however, was to explain the thinking needed to minimize lock
contention and the difficulties in developing a parallel algorithm, so I used
a simplified version of what I actually did to get that point across. Your
comments deal with other aspects of the algorithm, and don't contribute to the
point.

BTW, what do you mean by "insane number of threads"? The target is an 8-core
machine, and my timing numbers show a scaleup of about 6x for that case. I'm
used to working on parallel molecular dynamics codes which use 10-100
processors, so this seems rather small.

Also, "he can just keep a separate count per worker and then add it to the
total when that worker finishes" isn't correct, or rather, I can't see how to
apply it to the symmetric form. The code I came up with has one single-
threaded section, of O(N) time, after the final barrier. I don't see how any
other solution has less contention.

~~~
scott_s
He's probably referencing this line: _This is likely the cost of starting
N=110885 different worker threads in Python._

That's not actually accurate, but it's just a terminology problem. When you
say "thread," what people like me think of is the actual operating system
kernel thread. In your Python code, you only have 4 of them at any given time.
(I know this from "max_workers=4".) What I think you mean by it is
"indivisible work unit I'm supplying to my threading library," and I think you
have 110885 of them.

~~~
dalke
Ah, thanks! I've fixed the problematic text.

------
scott_s
Good introduction to using OpenMP for scientists, but I was confused by his
terminology. He kept referring to his Python code also as the "pthreads"
version, presumably because Python was using Pthreads for the parallelism.
That confused me, since I had assumed he meant that his "pthreads version" was
C code only, that directly used the Pthreads API. It's worth noting that
gcc's OpenMP implementation is also using Pthreads.

edit: I think he's missing "count++" inside the inner if-statement in the
"Amdahl" and "Using many..." sections.

edit the second: I also think he has an unnecessary critical section in the
"Amdahl" section. Assuming the correction above, this should work:

    
    
      #pragma omp parallel for private(count, target_index)
      for (query_index=0; query_index<N; query_index++) {
        count = 0;
        for (target_index=query_index+1; target_index<N; target_index++) {
          if (tanimoto(fingerprint[query_index], fingerprint[target_index]) >= threshold) {
            count++;
            #pragma omp critical (add_count)
            counts[target_index]++;
          }
        }
        counts[query_index] += count;
      }
    

Because query_index is the induction variable for the outer-most loop, we know
that only one thread will ever see any given value for it. In other words, the
iteration space of query_index is completely partitioned among the threads.
Hence, accessing an array using query_index still gives us independent
iterations of the outer-most loop, and we don't need to wrap the array access
using it in a critical section. The above is a more involved case of the
canonical example:

    
    
      #pragma omp parallel for
      for (i = 0; i < N; ++i)
        sum[i] = a[i] + b[i];
    

The iteration space for target_index is not completely partitioned, so the
critical section is still needed.

~~~
dalke
Hi scott_s. Taking your points in order: by "pthreads" I mean "built on top of
the POSIX thread library API, and available from any language which supports
the API calls."

While gcc may use POSIX threads for its implementation, they are not
necessarily inter-compatible. For example, I found on my Mac that OpenMP dies
with a segfault when used from a pthread. OpenMP also works with Microsoft's
C++ compiler, while the available pthreads support for that OS is built on the
actual native methods.

I lack a better way to distinguish these two types of threading libraries,
hence the confusion.

Thanks for the corrections! I am indeed missing the count++, now fixed. My
actual code, which I used to generate the numbers, was a bit more complicated
and using it would obfuscate the points of the essay.

You are absolutely right about how the second critical section is not needed!
That code should only be executed 110,000 times, while in the worst case the
other critical section is executed N*(N-1)/2 = about 6 billion times, so I
decided not to redo the timings.

I've updated the document to point out that those were not needed, with a
pointer to this thread.

~~~
scott_s
As a systems research guy - someone who implements the sorts of things you're
using - I would recommend just calling it the "Python" version as the short
name, and clarify that you are using Pthreads from within Python as needed. In
that version, that you're using Python is the most interesting thing, since
many languages have bindings for Pthreads.

~~~
dalke
As a Python guy, it's an API layer on top of POSIX threads organizing calls to
a shared library extension written in C. Most of the time is spent in the C
extension (at most 10 seconds out of 130 are in Python), so it doesn't seem
right to call it the "Python" version.

In any case, the section is "using pthreads from Python" and the chart lists
"Python/pthreads", so I think I labeled it well enough, and won't be making a
change.

(The point, after all, was to compare an OpenMP version to something which
doesn't use OpenMP.)

~~~
scott_s
Python is driving the program, and the threading library you're using provides
its own abstractions on top of Pthreads itself. Pthreads has no concept of
work pools - although many people end up implementing them when using
Pthreads. Pthreads is a relatively low-level library that specifies how to
create and join threads, and then provides mutexes and condition variables to
synchronize across them. The threading library you're using hides all of that
from you (which is a good thing!).

Where the code spends most of its time isn't the important thing for
classification. What's important is what decided the structure of your code?
And that was the threading library as provided by Python, not Pthreads. In
fact, the threading library you're using only incidentally uses Pthreads as
its implementation mechanism, similar to how gcc's implementation of OpenMP
only incidentally uses Pthreads.

------
rcfox
The per-thread counts are a step in the right direction, but OpenMP will take
care of this itself if you use the 'reduction' clause.

~~~
scott_s
I don't think you can. He is essentially keeping a histogram: for a given _i_
, how many times did we see it? That is not a scalar value, and reductions can
only be applied to scalars. If you think I'm wrong, please show me the code
you think works, because I spent some time thinking about it, and I'm quite
curious.

~~~
dalke
I looked as well but it seems like reduction only works for scalars and not
for arrays.

------
acqq
At the section "What about per-thread count arrays?" he uses
omp_get_num_threads() to allocate the "per thread" array, but as far as I can
see he's still not in the parallel region, so the number of threads there can
just as well be 1.

The author also writes: "This is likely the cost of starting N=110885
different worker threads in Python." As far as I understand, he has only 4
hard-coded threads (and hard coding them doesn't make sense either).

Finally, I don't see why he should get only a speedup of two with four cores;
he certainly doesn't explain it. EDIT 1: I believe commenter onemoreact found
the reason: the compiler probably assumes that all loop passes take around the
same time, but the author managed to make the longest loop pass last N times
longer than the shortest.

~~~
dalke
Good catch, acqq! I wrote the code for that essay first, copied it into my
working algorithm (which is a bit more complicated than what I wrote here),
and after a few segfaults and other problems, got it working there by just
using the max thread count. However, that fix didn't make it back to the
essay.

I've updated the essay code, with commentary about the problem. Thank you for
pointing it out.

I've updated the text so it says "starting N=110885 different jobs for the
worker threads".

Also, I never reported the speedup of one core vs. four cores. What I reported
was the speedup from computing across the NxN matrix vs. using symmetry and
only computing the N*(N-1)/2 similarities. That's the factor-of-two speedup.

Raw data computed yesterday, for a threshold of 0.80 (and a slightly different
problem), on an 8 core machine, using from 1 to 8 cores, is:

    
    
                   OpenMP             pthreads
        cores    time    scaleup    time     scaleup
          1    0:02:00    1.0      0:02:01    0.99
          2    0:01:06    1.8      0:01:06    1.8
          3    0:00:44    2.7      0:00:44    2.7
          4    0:00:32    3.75     0:00:33    3.6
          5    0:00:27    4.4      0:00:28    4.3
          6    0:00:23    5.2      0:00:23    5.2
          7    0:00:20    6.0      0:00:20    6.0
          8    0:00:26    4.6      0:00:19    6.3
    

This is a rented Amazon node, and I consistently found that using 8 cores with
OpenMP gave poor results, so I suspect I didn't have access to a full 8 cores.

This benchmark also doesn't have the randomization which confused my previous
benchmark run, so you can see that the two threading systems gave essentially
identical timings. Looks like I'll need to update a previous essay.

------
archangel_one
IIRC it should be possible to guard the increments with an "omp atomic" pragma
rather than an "omp critical", which may well provide better performance. It's
possible that the compiler is able to replace the critical section with an
atomic increment where appropriate, but equally possible that it doesn't.

