
Distributed NumPy Arrays - quasiben
http://matthewrocklin.com/blog/work/2016/02/26/dask-distributed-part-3
======
sandGorgon
I recently filed an issue on pandas requesting a save-workspace feature (like
R's .RData / save.image() feature).

It was rejected as being unpythonic [1], even though the base functionality to
save an individual data frame is already present.

Can what dask is doing be adapted to the simpler use case of saving a
workspace snapshot?

[1]
[https://github.com/pydata/pandas/issues/12381#issuecomment-1...](https://github.com/pydata/pandas/issues/12381#issuecomment-185783910)
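
What I had in mind is roughly the sketch below (save_workspace and
load_workspace are hypothetical helper names, not pandas APIs, and HDFStore
needs PyTables installed):

    import pandas as pd

    def save_workspace(frames, path):
        # Write a dict of DataFrames into one HDF5 file, one key per frame.
        with pd.HDFStore(path, mode="w") as store:
            for name, df in frames.items():
                store.put(name, df)

    def load_workspace(path):
        # Read every stored DataFrame back into a dict keyed by name.
        with pd.HDFStore(path, mode="r") as store:
            return {key.lstrip("/"): store[key] for key in store.keys()}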

------
tomrod
I've spent some time recently with both dask and distributed. Continuum
Analytics has a real gem with Matthew Rocklin! I've found the libraries very
intuitive.

------
math_and_stuff
This is great! Any guesses as to what is leading to the large reduction times?

~~~
hcrisp
A third of the time is spent in the p_reduce step, and another third in
elemwise. I'm not exactly sure what those do, but I'm guessing they relate to
the reduce-map-reduce steps of evaluating the standard deviation and then
dividing the elements by it. The mean has to be calculated twice in the
z-score formula (once subtracted directly and once inside the standard
deviation). It sounds like the client-worker communication mechanism might add
extra latency.
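
For reference, the z-score computation presumably looks something like this
dask.array sketch (the array shape and chunking here are made-up stand-ins
for the blog post's data):

    import dask.array as da

    # Synthetic stand-in data; shape and chunks are assumptions.
    x = da.random.normal(size=(1000000, 100), chunks=(100000, 100))

    # The mean shows up twice: once subtracted directly, and once again
    # inside std(), which performs its own reduction over the data.
    z = (x - x.mean(axis=0)) / x.std(axis=0)

    result = z.compute()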

I wonder if this would work if the dask arrays are not equal in length, for
example if the files were time series of unequal duration.

Also, are there any plans for dask to support distributed numpy functions
requiring kernel computation at the array boundaries? For example,
scipy.signal.lfilter? I believe it would require ghosting or further inter-
dask-array communication that is not yet present.

~~~
mrocklin
See
[http://dask.pydata.org/en/latest/ghost.html](http://dask.pydata.org/en/latest/ghost.html)
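
Roughly, the ghosting described there lets you express boundary-aware kernels
like this (a sketch only; a symmetric moving-average filter stands in for
lfilter, which carries sequential state across chunks):

    import dask.array as da
    from scipy.ndimage import uniform_filter1d

    x = da.random.normal(size=(1000000,), chunks=100000)

    # map_overlap shares `depth` boundary elements between neighbouring
    # chunks (the ghost cells), applies the kernel, then trims the overlap.
    smoothed = x.map_overlap(
        lambda block: uniform_filter1d(block, size=5),
        depth=2,
        boundary="reflect",
    )

    result = smoothed.compute()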

