
Xray and Dask: Out-Of-Core, Labeled Arrays in Python - shoyer
http://continuum.io/blog/xray-dask
======
Loic
Since it's in Python, is it compatible with numba[0] when using the
_@jit(nogil=True)_ decorator? Having _f_ in _ds.groupby('some
variable').apply(f)_ be a jit-compiled numba function would be great.

[0]: [http://numba.pydata.org/](http://numba.pydata.org/)

~~~
shoyer
Yeah, Numba makes it awesomely easy to write fast functions in Python that
release the GIL. You can already do this directly with dask.array by passing a
Numba compiled function to the map_blocks method:
[http://dask.readthedocs.org/en/latest/array-api.html#dask.array.core.Array.map_blocks](http://dask.readthedocs.org/en/latest/array-api.html#dask.array.core.Array.map_blocks) -- it should be pretty
straightforward to wrap this with xray.
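A minimal sketch of the map_blocks pattern, assuming dask is installed; the plain Python block function here is just a stand-in for a Numba-compiled one, and the names and shapes are illustrative:

```python
import dask.array as da

def block_square(block):
    # stand-in for a numba @jit(nogil=True) compiled function --
    # with the GIL released, dask's thread pool can run blocks in parallel
    return block ** 2

x = da.ones((1000, 1000), chunks=(250, 250))  # 16 blocks of 250x250
result = x.map_blocks(block_square)           # applies block_square per chunk
print(float(result.sum().compute()))          # → 1000000.0
```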

------
lqdc13
Why not just sample further or use a machine where this dataset fits in
memory? A minute to compute the mean seems unreasonable if the goal is to
perform more complex tasks in the future.

~~~
shoyer
Indeed, those are both great options when possible. But easy access to
parallel computing is still quite useful.

For interactive analysis or building statistical models, you probably do
still want your data to fit in memory. But often it's most useful
to make your data smaller by calculating some sort of summary statistics
instead of subsampling. For example, if you're interested in climate change,
you might want to work with monthly means instead of the original daily or
sub-daily data. Currently, climate scientists usually do this sort of thing
with command line tools.
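For instance, reducing a year of daily data to monthly means can be sketched with pandas (whose resample API xray's groupby machinery mirrors); the temperature values here are made up:

```python
import numpy as np
import pandas as pd

# synthetic daily temperatures for one (leap) year -- values are made up
time = pd.date_range("2000-01-01", "2000-12-31", freq="D")
daily = pd.Series(np.linspace(-5.0, 25.0, time.size), index=time)

# collapse 366 daily values into 12 monthly means, one per month start
monthly = daily.resample("MS").mean()
print(len(monthly))  # → 12
```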

As for machines where datasets fit into memory -- that's also great, if you
have access to them. But even then, for most operations numpy will be limited
to a single core. Calculating the mean of 51GB of data is still pretty slow,
even if it already is in memory. Your machine with 256 GB of memory almost
assuredly has 32+ cores to go along with it, and it's a shame to let them sit
idle.
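The multi-core point can be sketched with dask.array, which reduces each chunk on a thread pool and then combines the partial results, where plain NumPy would compute the whole mean on one core (shapes here are kept small for illustration):

```python
import dask.array as da

# each 1000x1000 chunk is reduced on dask's thread pool, then the
# partial means are combined into the final result
x = da.ones((4000, 4000), chunks=(1000, 1000))
print(float(x.mean().compute()))  # → 1.0
```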

This post by Nikolay Koldunov gives some more context about the value of
dask.array for weather data:
[http://earthpy.org/dask.html](http://earthpy.org/dask.html)

~~~
ngoldbaum
You can get some easy (albeit modest) threaded speedups and streamlined
memory usage with numexpr, though I've had more success coding custom array
processing routines in Cython, where I can easily exploit threads.
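A sketch of the numexpr approach, assuming numexpr is installed; the expression is illustrative:

```python
import numpy as np
import numexpr as ne

a = np.arange(1_000_000, dtype=np.float64)
# numexpr compiles the expression to a small bytecode program and
# evaluates it in cache-sized blocks across its own thread pool
result = ne.evaluate("2 * a + 1")
print(result[2])  # → 5.0
```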

------
devty
Excellent work!

