
Parallelism in one line - adamnemecek
https://medium.com/building-things-on-the-internet/40e9b2b36148
======
canjobear
The trouble with Pool.map is that the function you pass in has to be
picklable, i.e. defined at the top level in the module you are currently in.

So you can't use it in interactive sessions:

    
    
      In [1]: import multiprocessing
      In [2]: pool = multiprocessing.Pool(8)
      In [3]: def hello(x):
      ....:     return x+1
      ....:
      In [4]: stuff = range(10)
      In [5]: pool.map(hello, stuff)
      # cryptic errors
    

You also can't use it with lambdas or functions defined within functions;
suppose I put this in test.py:

    
    
      import multiprocessing
      def run_stuff():
          pool = multiprocessing.Pool(8)
          def hello(x):
              return x+1
          stuff = range(10)
          return pool.map(hello, stuff)
    
      print(run_stuff())
    

... I get the same errors.

This is sad because pool.map seems to be faster than anything I have written
myself which might replace it without the above limitations. Unless someone
out there has done better than me? :)
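The cryptic errors come from pickling: Pool.map serializes the function by reference (module plus name), so anything without an importable name fails. A quick stdlib-only way to check whether a function will survive the trip (a sketch; no Pool needed):

```python
import pickle

def picklable(func):
    """Return True if func can be pickled, i.e. handed to Pool.map."""
    try:
        pickle.dumps(func)
        return True
    except Exception:
        return False

print(picklable(len))              # True: builtins pickle by name

print(picklable(lambda x: x + 1))  # False: a lambda has no importable name

def outer():
    def inner(x):                  # nested functions are unpicklable too
        return x
    return inner

print(picklable(outer()))          # False
```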

~~~
hcarvalhoalves
It works on the REPL as long as you create the pool after the function has
been defined:

    
    
        >>> import multiprocessing
        >>> def hello(x):
        ...   return x + 1
        ... 
        >>> pool = multiprocessing.Pool(8)
        >>> pool.map(hello, range(10))
        [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    

It looks like the pool takes a peek at the module during instantiation. Maybe
it could be worked around with a lazy pool that only inspects things when you
call map.
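A minimal sketch of that lazy-pool idea (a hypothetical wrapper, not an existing API): defer creating the real Pool until the first map call, so the workers are forked only after your functions have been defined:

```python
import multiprocessing

class LazyPool:
    """Create the underlying Pool only on the first map() call (sketch)."""

    def __init__(self, processes=None):
        self._processes = processes
        self._pool = None

    def map(self, func, iterable):
        if self._pool is None:
            # fork the workers lazily, after func already exists
            self._pool = multiprocessing.Pool(self._processes)
        return self._pool.map(func, iterable)

    def close(self):
        if self._pool is not None:
            self._pool.close()
            self._pool.join()

pool = LazyPool(2)
# with an eager Pool this ordering can fail in the REPL; here the fork
# happens inside map(), after everything is defined
result = pool.map(abs, [-1, -2, -3])
print(result)   # [1, 2, 3]
pool.close()
```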

------
asperous
I realize this is useful for CPU-bound computations; but for network-bound
work, here is a REAL two-liner that will instantly make your program
concurrent (not parallel):

    
    
        from gevent import monkey
        monkey.patch_all()
    

This is a lot lighter than processes. You can combine the two: use gevent to
handle the network, and then multiprocessing to handle the CPU-bound stuff!
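gevent is a third-party package, so here is a stdlib-only sketch of the same division of labor, with plain threads standing in for gevent's greenlets (fetch is a made-up placeholder for a real network call): a thread pool for the I/O-bound part, a process pool for the CPU-bound part.

```python
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool  # same API, backed by threads

def fetch(url):
    # placeholder for a real network call, e.g. urllib.request.urlopen(url).read()
    return "<html>%s</html>" % url

urls = ["http://a.example", "http://b.example"]

# threads are cheap and fine while we are just waiting on the network...
with ThreadPool(8) as io_pool:
    pages = io_pool.map(fetch, urls)

# ...then hand the results to real processes for CPU-bound work
# (len is a builtin, so it pickles; a closure here would not)
with Pool(2) as cpu_pool:
    sizes = cpu_pool.map(len, pages)

print(sizes)
```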

~~~
johtso
I've heard that using gevent with multiprocessing can cause problems; is that
still the case?

~~~
inportb
I've used gevent in multiprocess situations, and it worked just fine. I did
rig up my own gevent-based IPC, though. Just create a socketpair before
forking, and attach some JSON-RPC (or protobuf/msgpack/etc) endpoints after.
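A bare-bones sketch of that setup (toy protocol, not real JSON-RPC; POSIX-only because of fork): create the socketpair, fork, and speak JSON over it. Here the child exposes a single made-up "sum" endpoint.

```python
import json
import os
import socket

parent_end, child_end = socket.socketpair()   # create the pair *before* forking

pid = os.fork()
if pid == 0:                                  # child: serve one request and exit
    parent_end.close()
    request = json.loads(child_end.recv(4096).decode())
    reply = {"id": request["id"], "result": sum(request["params"])}
    child_end.sendall(json.dumps(reply).encode())
    child_end.close()
    os._exit(0)

# parent: call the child's "sum" endpoint
child_end.close()
parent_end.sendall(json.dumps({"id": 1, "method": "sum", "params": [1, 2, 3]}).encode())
reply = json.loads(parent_end.recv(4096).decode())
os.waitpid(pid, 0)                            # reap the child; no zombies
print(reply["result"])    # 6
```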

------
smnrchrds
The reddit comments[1] on this article are very interesting as well. In
particular, /u/driftingdev suggested it would be better to use
concurrent.futures. To quote him/her:

In the last example:

    
    
        pool = Pool()    
        pool.map(create_thumbnail, images)
        pool.close()
        pool.join()
    

would be 2 fewer lines with concurrent.futures:

    
    
        with ThreadPoolExecutor(max_workers=4) as executor:
            executor.map(create_thumbnail, images)
    

[1]
[https://pay.reddit.com/r/Python/comments/1tyzw3/parallelism_...](https://pay.reddit.com/r/Python/comments/1tyzw3/parallelism_in_one_line/)
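For completeness, the quoted snippet still needs an import; here is a self-contained version (create_thumbnail is a stand-in for the article's function):

```python
from concurrent.futures import ThreadPoolExecutor

def create_thumbnail(image):
    # stand-in for the article's real thumbnail code
    return image.lower().replace(".jpg", "_thumb.jpg")

images = ["A.JPG", "B.JPG", "C.JPG"]

# the with block shuts the executor down and waits for the work on exit
with ThreadPoolExecutor(max_workers=4) as executor:
    thumbs = list(executor.map(create_thumbnail, images))

print(thumbs)   # ['a_thumb.jpg', 'b_thumb.jpg', 'c_thumb.jpg']
```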

~~~
bigd
I see a gigantic difference in execution time between the two. Also, you can
get the same behavior by importing contextlib.closing:

    
    
       from multiprocessing import Pool
       from contextlib import closing
   with closing(Pool(15)) as p:
       p.map(f, myiterable)
    

takes <30 sec vs

    
    
       from concurrent.futures import ThreadPoolExecutor as Pool
   with Pool(15) as p:
       p.map(f, myiterable)
    

which takes > 1 min.

ProcessPoolExecutor, meanwhile, hangs, and I've no clue why.

------
andreasvc
What I don't understand is that neither exceptions nor segfaults are handled
gracefully when using multiprocessing. The former do not have the full
traceback, while the latter aren't even detected and cause a deadlock. This
requires running the code again in single-process mode to figure out where
the bug is. The new concurrent.futures module has the same problem.

~~~
nine_k
If you follow the monadic route and return a Maybe...

Stop. Let me reformulate. If you follow the standard AJAX route and return
either a {'success': True, 'payload': ...} or {'success': False, 'error': ...}
from your workers, your error handling will be fine.
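A sketch of that pattern. It uses multiprocessing.dummy, the thread-backed Pool with the same API, because the wrapper below is a closure and would not pickle for a real process Pool; with real processes you would define the wrapped function at module top level.

```python
import traceback
from multiprocessing.dummy import Pool   # thread-backed Pool, same API

def safely(f):
    """Wrap a worker so it returns a success/error dict instead of raising."""
    def wrapper(x):
        try:
            return {"success": True, "payload": f(x)}
        except Exception:
            return {"success": False, "error": traceback.format_exc()}
    return wrapper

def risky(x):
    return 10 // x          # blows up on x == 0

with Pool(4) as pool:
    results = pool.map(safely(risky), [1, 2, 0])

print([r["success"] for r in results])            # [True, True, False]
print("ZeroDivisionError" in results[2]["error"]) # True
```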

~~~
andreasvc
That would not give me a traceback, and it would not catch segfaults or
ctrl-C. It's a different style of programming that would work around part of
the issue (only the exceptions, and only by checking for errors manually),
whereas what perplexes me is that multiprocessing and concurrent.futures are
not designed to handle these error conditions themselves.

~~~
nine_k
Why, it would give you a traceback if your worker process collects it, as a
reasonable worker would. Since workers are probably non-interactive, they
can't exactly receive a Ctrl+C. The dispatcher process, OTOH, will. I suppose
a pool should kill all processes on SystemExit.

When you think through more details, you grow a more and more sophisticated
solution that best suits your needs. But the point of the post is different:
it's a dead simple way _to start_. Starting is often the hardest part.

~~~
andreasvc
Collecting tracebacks is the sort of plumbing you would expect a 'batteries
included' library to do. I didn't expect the workers to receive the ctrl-C,
I'd expect the main process to terminate, along with its children; none of
that happens.

I think it's clear that these sorts of details belong in the library.
Everybody could hack together their own traceback collector, segfault handlers
etc., but it's difficult to get right and in a scripting language you should
be able to forget about such things.
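A sketch of the sort of plumbing being asked for (hypothetical helper names, again thread-backed to sidestep pickling): the worker captures its traceback as text, and the parent re-raises with that text attached.

```python
import traceback
from multiprocessing.dummy import Pool   # thread-backed; same API as Pool

class RemoteError(Exception):
    """Carries a worker's formatted traceback back to the parent."""

def capturing(f):
    def wrapper(x):
        try:
            return (f(x), None)
        except Exception:
            return (None, traceback.format_exc())
    return wrapper

def map_or_raise(pool, f, iterable):
    """Like pool.map, but re-raises the first worker failure with its traceback."""
    results = []
    for value, tb in pool.map(capturing(f), iterable):
        if tb is not None:
            raise RemoteError("worker failed:\n" + tb)
        results.append(value)
    return results

def worker(x):
    return 1 // x

with Pool(2) as pool:
    try:
        map_or_raise(pool, worker, [1, 0])
        caught = False
    except RemoteError as exc:
        caught = "ZeroDivisionError" in str(exc)

print(caught)   # True
```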

------
agumonkey
see concurrent.futures for a threadpooled map :

[http://www.reddit.com/r/Python/comments/1w6h6g/parallelism_i...](http://www.reddit.com/r/Python/comments/1w6h6g/parallelism_in_one_line/cez7kj5)

------
javafu
One liner in Java 8:

    
    
        List<String> contents = Arrays.stream(urls).parallel().map(UrlLoader::getContent).collect(Collectors.toList());
    

but as far as I know the API doesn't provide any way to get a String from a
URL, so you need a helper class that also works around checked exceptions ...

    
    
        public class UrlLoader {
          // must not be private: it's used via UrlLoader::getContent from outside
          public static String getContent(String url) {
            try {
              return new BufferedReader(
                  new InputStreamReader(new URL(url).openConnection().getInputStream()))
                  .lines().collect(Collectors.joining("\n"));
            } catch (IOException e) {
              throw new IOError(e);
            }
          }
        }
    

so not a real one liner :(

------
afhof
Does anyone seriously think that the difficulty with parallelizing code is the
amount of boilerplate? The real problem is synchronizing multiple threads;
being too lenient means deadlocks; being too strict means underutilization.
The map() function doesn't really help with either of those.

In addition, it looks like you are unfairly picking on Java. The producer
consumer model presented isn't anywhere near what a competent programmer would
do. Building your consumer from within a method called Producer? Calling
isinstance (or instanceof) to check if the "poison pill" is put in the queue?
These are the signs of a crappy programmer, not a crappy language.

~~~
lars512
If you do a lot of data munging, then boilerplate is definitely an annoying
obstacle. Plenty of problems are "embarrassingly parallel", and the map
function may be fine for those cases.

------
kazagistar
Just as an example from julia...

[http://docs.julialang.org/en/latest/manual/parallel-computin...](http://docs.julialang.org/en/latest/manual/parallel-computing/)

1) Put the addresses of your remote workers (via ssh) in a file, and point
Julia at the file.

2) Julia connects to remote workers on each host.

3) Use one of the convenient macros or functions to run the code. This evenly
distributes a map-reduce that throws 200 million coins and counts heads:

    
    
        nheads = @parallel (+) for i = 1:200000000
            int(randbool())
        end
    

4) You can also do all of the pool stuff, manual message passing, etc, if you
need to.

------
bigd
Sadly, it is very easy to leave zombie processes. Usually I put the Pool in a
with statement, but it is not bulletproof.

But yeah, multiprocessing.Pool.map, imap, and imap_unordered are neat.
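For reference: Pool has been a context manager since Python 3.3, but __exit__ calls terminate() rather than close()/join(), so the with form alone does not wait for outstanding work, which may be part of why it is not bulletproof. Both forms side by side (abs stands in for real work):

```python
from multiprocessing import Pool

# with form: __exit__ calls terminate(), stopping workers immediately
with Pool(2) as pool:
    squares = pool.map(abs, [-1, -2, -3])

# explicit form: waits for workers to drain and exit cleanly
pool = Pool(2)
try:
    more = pool.map(abs, [-4, -5])
finally:
    pool.close()   # no more tasks will be submitted
    pool.join()    # block until every worker has exited

print(squares, more)   # [1, 2, 3] [4, 5]
```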

------
polskibus
I never used map for parallelization in Python, but in .NET I used PLINQ.
It's fantastic for exploiting data parallelism and very easy to use (or
refactor to from sequential code).

------
coolsunglasses
People who don't understand the problem, pretending their incantations work.

In the words of a great man, it's a jolly hard problem.

~~~
elnate
Can you clarify what you mean, please?

------
CmonDev
Looks like a relatively decent implementation of .NET's Parallel extensions.

------
laureny
tl;dr: if you run operations in parallel, they go faster, and if you put this
code in a library, you can parallelize stuff in one line of code.

