Dask – A flexible library for parallel computing in Python (dask.org)
232 points by gjvc on Nov 17, 2021 | 85 comments



I recently switched from Spark to Dask because I like it so much. Disclosure: I work for Coiled, the company founded by Matt Rocklin, the creator of Dask.

Here's what I've found thus far with Dask:

* It's way easier to build custom distributed compute systems with Dask than with the alternatives. The Dask futures & delayed APIs give you access to the "engine" so you can build your own custom race car (see the sketch after this list)

* Lots of data scientists are a lot more productive with Dask compared to Java / Scala technologies they're not comfortable with (I have a popular Spark blog / Spark books & love Spark, but some people just can't get productive with Spark)

* Dask cluster visualizations are so nice. It's so easy to understand how computations are running on the cluster in real time.

* We've been working closely with NVIDIA folks to provide real cutting edge GPU support.
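
For illustration, a minimal sketch of the delayed/futures "engine" APIs referenced above (the load/summarize functions and the numbers are made-up placeholders, not anything from the comment):

    import dask
    from dask.distributed import Client

    @dask.delayed
    def load(i):
        # stand-in for reading one shard of data
        return list(range(i * 10, i * 10 + 10))

    @dask.delayed
    def summarize(chunk):
        return sum(chunk)

    if __name__ == "__main__":
        client = Client()                     # local cluster by default
        parts = [summarize(load(i)) for i in range(8)]
        total = dask.delayed(sum)(parts)      # compose lazily into one graph
        print(total.compute())                # run it on the cluster

        # the futures API is the eager counterpart
        fut = client.submit(sum, range(100))
        print(fut.result())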

I'm a believer in multi-tech data pipelines in the future. A data engineering team that loves Scala may write some ETL code in Spark and then pass off the baton to a data science team that loves Python and uses Dask to train machine learning models.

Seems like most other players in the data space want to take over an organization's entire data platform. I like how Dask plays nicely as part of the overall PyData ecosystem. I've always liked the Unix philosophy of building little tools that can be easily composed to solve a variety of different problems, and Dask feels at home in the PyData ecosystem.


Dask is really good for scientists. When I worked with bioinformaticians, it was way easier to get them to use Dask for out-of-memory processing than Spark. In particular, dask.bag offers a great amount of flexibility for ETL use cases.
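
A hedged, self-contained sketch of what a bag-based ETL step can look like (the records and field names here are invented just for the example):

    import json
    import dask.bag as db

    # records generated inline to keep the sketch self-contained
    raw = [json.dumps({"id": i, "score": i * 0.1, "keep": i % 2}) for i in range(1000)]
    records = db.from_sequence(raw, npartitions=8).map(json.loads)

    cleaned = (records
               .filter(lambda r: r["keep"] == 1)
               .map(lambda r: {"id": r["id"], "score": r["score"]}))

    print(cleaned.count().compute())
    df = cleaned.to_dataframe()   # hand off to dask.dataframe if needed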

The problem with Dask for me, though (as a user of Dask and Prefect), is that I was never able to get the throughput of PySpark out of Dask. With Dask I can reach 100 MB/s on my laptop, while PySpark can reach 260 MB/s on my laptop for the same workload (cleaning and restructuring). This was around two years ago that I tested it; I'm curious what the community has experienced since.

Also, for most use cases your data isn't really big data, and Dask is much simpler and fast enough, which is why I reach for Dask first.


We use it to go faster than Spark via dask_cudf: the bottleneck becomes PCIe/SSD, which is in GB/s.

For CPU, we have not benchmarked the latest CPU Dask vs CPU Spark. I would expect to see Spark win on simple kernels (pandas vector ops) and lose on ML/C++ ones (e.g. igraph vs GraphX).

Would be interesting to see it carefully done!


I found that Dask is like half-done Spark. Dask kind of works with regular dataframes, but there are just too many inconsistencies; Dask almost always breaks on a working Pandas algorithm. It's just easier to install PySpark and work with it. IMHO.


I tried to use Dask to do a format conversion of a 5 TB binary dataset coming from an astrophysical simulation. It should have been a trivial job, since it was a single huge 3D array coming from a lot of files that each sliced the array in one direction, and I needed to do a conversion at each point, do a couple of transposes and reshapes, and then output a single file.

After two weeks of tearing my hair out at the appalling speed, I scrapped that solution and spent one day throwing together a pipeline of classical shell tools (dd, sed, xxd etc.), orchestrated using GNU Parallel, and it did the whole job in three hours.


Funny, I had a similar real-life experience with a Hadoop cluster. I replaced a job that needed 10 machines with a few lines of C, some Python, and GNU Parallel (which was written in Perl, I believe, at the time). The process was taking over 24 hours on the cluster and became unworkable. I got it back to 2 hours on my slow MacBook Air.

On Dask, it may not be super fast for very large data sets but it is sure easy to use.


I'm not making any claims about dask being super reliable or easy to use, but let's not pretend that a spark cluster is a trivial thing to set up and deal with.

I've had to administer spark clusters in the past, and there's a good reason why databricks is a thing.

Another detail that's important is that Spark is dramatically overkill for the data sets Dask is meant to work on. Dask is really geared for data sets that can't fit in memory, but not necessarily the data sets that can't fit on disk, which is what Spark was really meant for.


> spark is dramatically overkill for the data sets dask is meant to work on

I disagree with this. I have used Spark standalone/single node both locally and in the cloud for numerous use cases, because you can easily cook up Spark SQL or PySpark cleaning and ETL scripts for small data (a couple TB) and reach high throughput (>260 MB/s). I am also someone who reaches for Dask (and Prefect) first for small data, but in cases where the data is medium sized (100 GB-10 TB) I always reach for Spark first; it's simply easier to write fast code with Spark (especially Spark SQL) without needing to think hard about optimizations. I also have not run into any operational issues running a Spark standalone instance for a few hours or days to process the small data I would otherwise use Dask for.
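
For context, a hedged sketch of the single-node Spark SQL pattern being described (the data, column names, and memory setting are made up):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")                 # standalone, all local cores
             .config("spark.driver.memory", "8g")
             .getOrCreate())

    df = spark.createDataFrame(
        [(1, " US ", 10.0), (2, "de", None), (3, "US", 5.0)],
        ["user_id", "country", "amount"])
    df.createOrReplaceTempView("events")

    cleaned = spark.sql("""
        SELECT user_id, lower(trim(country)) AS country, amount
        FROM events
        WHERE amount IS NOT NULL
    """)
    cleaned.write.mode("overwrite").parquet("/tmp/events_clean")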

I reach for Dask first for most small data (<100 GB) because it's better integrated with Python libraries like Prefect that improve the dev experience.


Spark seems like it will be good, but always disappoints.

I don't know much about Spark, so I may get this wrong, but I feel they need to greatly improve both: 1) the automatic configuration of single-node systems, 2) the changing of config throughout the processing.

The fact that I don't even really know if this is the problem however is the real problem. I get that setting stuff up for a huge network of computers may be challenging, but standard defaults to use all processors on a single machine should be automatic.

It pisses me off that Spark's defaults for a 1 TB RAM, 20-core machine as a single node can't handle reading a 20 GB parquet file with one line. And further, checking substrings can throw OOM errors in the next step without adjusting the config of worker memory, etc.

Worst of all, a 40 MB (!) file after filtering and processing can take like 10 minutes to save because of all the parallelism config. It's of course instantaneous in pandas. They really need to fix the experience for single machines (e.g. even just a small 18-core, 512 GB RAM system), so that people can write packages that work for anyone, but with a small YAML flag can be sent to a cluster.


On the SQL front there has been some active work to make that experience better with Dask.

See dask-sql: https://dask-sql.readthedocs.io/en/latest/pages/api.html


Can't you just run it in Docker with a one-liner? I don't think you need a whole cluster just to leverage the Spark tooling.

If you're in the JVM you can basically embed it in-process.


Yes. I'm not a fan of Spark: dealing with the JVM, new syntax for everything, optimizing parallelism in a weird way. But it always works.

Dask, on the other hand, works some of the time. The rest of the time it'll keep running a calculation forever, or simply fail silently over and over, or some other unpleasant outcome.


I had a similar experience. I usually avoid the Dask dataframe where possible and instead use bag and dask.delayed. But it's hard to get scientists to give up the dataframe mindset. That's one of the reasons I wanted to try Modin; I have heard the dataframe in Modin is a bit easier to use.


I was mainly looking at Dask as a better Pandas DataFrame, but it seems like it's actually not that much better, just more distributed. It's like Dask is trying to do too many things and therefore doesn't do most of them well enough.

While I haven't tried it, vaex[1] seems much better, since it specifically addresses one pain point, the inefficiencies of the pandas DataFrame, and aims to do that one thing well.

[1] https://vaex.io/docs/index.html


Vaex is amazing, I can definitely recommend it.


So much this. Dask does not remove the need to babysit workers and partitions, and that's the real difference between pandas and spark.


It's been over a year now, but I had an identical experience.


A path I haven't explored yet is Koalas, which touted 80% API compatibility with pandas at version 1.0. Now it's at version 1.8, and it has the pandas API as the main design principle for its function interfaces.

https://databricks.com/blog/2020/06/24/introducing-koalas-1-...


Our experience with Dask in distributed mode has been really painful, to be honest. There are ways to shoot yourself in the foot that lead to random data corruption, like changing a dataframe index with map_partitions without resetting the index afterwards. Not a failure, but successful runs with corrupted data. Sometimes your job gets bottlenecked on a single worker for no apparent reason, so you cancel and rewrite it in a slightly different way and then it works fine. We had bugs in two of the last four versions that prevented our pipeline from running properly. If you have a dataframe of numpy arrays then unmanaged memory leaks, so you need to convert them to Python lists first. Again, no warning from Dask about this. Getting things to run efficiently requires a lot of thought about partitions, the operations you'll be doing, what indexes you have, and so on.
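
For readers unfamiliar with the pitfall being described, a hedged toy illustration (not the commenter's pipeline, and not an official recommendation):

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"key": [3, 1, 2, 6, 5, 4], "val": range(6)})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # Risky: per-partition index changes that Dask's global divisions
    # know nothing about.
    risky = ddf.map_partitions(lambda p: p.set_index("key"))

    # Either reset it again afterwards...
    safer = risky.map_partitions(lambda p: p.reset_index())
    # ...or let Dask manage a global index itself (this triggers a shuffle).
    managed = ddf.set_index("key")
    print(managed.compute())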


We haven't seen any corruption yet, but sure, there are some strange things that aren't well documented. One thing that hit us was that parquet could only be read using Dask if you use the same backend it was written with. If you start with thousands of files, partitioning is IMHO tricky, because using Dask naively will create task graphs that can be multiple gigabytes large. We also had a lot of deadlock situations. This is where the apparent intuitiveness shoots you in the foot, because your average data scientist has no clue what to do.

Fortunately many deadlocks have been fixed in one of the latest releases. Other than that I have been really happy and thankful that Dask exists. One other major use case is, however, that my colleagues use it as a simple joblib backend for our HPC cluster. Even if it sounds stupid: Dask is the easiest way to get data scientists to use distributed computing resources. (Disclaimer: I got active in the dask-jobqueue project to get stuff fixed to run on HTCondor.)


We use Dask heavily, along with the rest of the PyData ecosystem. I guess we are in the 'sweet spot' where the data doesn't fit in memory to begin with, but once we perform any filtering and aggregations, we switch over to pandas. That's exactly what Dask recommends too. Our datasets don't exceed 100 GB right now.

Also note, dask clearly acknowledges challenges dealing with data in the terabyte range https://coiled.io/blog/dask-as-a-spark-replacement/

Most of our use cases right now involve using multiple cores of a big instance, rather than resorting to cluster computing.

With Spark, there is an additional/steep learning curve and the complexities of dealing with cluster computing. And Spark ML is not well known. With Dask/pandas it's easy enough to feed scikit-learn and/or bring in dask-ml, just a pip install away, and you can scale well-known sklearn modules effectively.
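
A hedged sketch of the scikit-learn-plus-Dask pattern mentioned here, using the joblib Dask backend (the estimator and parameter grid are arbitrary):

    import joblib
    from dask.distributed import Client
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    if __name__ == "__main__":
        client = Client()                 # or Client("tcp://scheduler:8786")
        X, y = make_classification(n_samples=10_000, n_features=20)

        search = GridSearchCV(
            RandomForestClassifier(),
            {"n_estimators": [50, 100], "max_depth": [5, 10]},
            n_jobs=-1,
        )
        with joblib.parallel_backend("dask"):   # fan the fits out to workers
            search.fit(X, y)
        print(search.best_params_)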

I think in the end, it's about keeping things simple. As others said, if you are already invested big in Spark/Scala/Hadoop, that may make sense for you. For non-CS folks, this will be a challenge.

As for vaex, it's very interesting. One issue is that it seems to want HDF5 and doesn't want to work with parquet. And its API is not fully compatible with pandas.

Ray/Modin: I played with them a bit; maybe they're a bit too new for enterprise use and may be more geared toward ML workloads. That's my take anyway, and they may have progressed substantially already.


If Dask doesn't consider their distributed version even on a single node (which is what we were using) to be production ready then they should label it as such.


Do you have any citation for the claim that Dask doesn't consider their distributed version to be ready? If it is your own view, then that's ok.

I think dask is in heavy usage in real production systems. Let me cite one such usage here, from Capital One (no affiliation, just referencing a big bank for 'production ready' purposes) https://www.capitalone.com/tech/machine-learning/dask-and-ra... (also not necessarily suggesting any rapids/GPU usage, you can decouple it from the article)

And note the article is from Nov 2019. Two years is a substantial amount of time for further improvements.


Your post seemed to argue that Dask is fine if you stick to relatively small data that fits on a single node and then switch to pandas. You also noted that this is what Dask recommends. The implication being that I ran into issues because I didn't use Dask "the right way."

I don't see how you can argue both that and that dask distributed is production ready at the same time.

I've been in big data for 15 years and was probably one of the first few thousand production hadoop users. If you think "a big company used a big data tech so it's production ready" is an argument then I've got a few bridges to sell you. A lot of companies use a lot of technologies that they spend a lot of time beating into a shape where for their specific use cases they work just well enough to not get them all fired.


In the end, it's not about an OPEN SOURCE tool being perfect but whether it is helping you solve a problem. If it did not help you and YOU don't consider it production ready, then that's fine. But you seem to argue that Dask should put this disclaimer out there. That would imply that many other open source tools, including Spark, would have to do it too.

Dask has solved specific problems for us and we are grateful about it. I remain open minded about other choices and listed them with the understanding I have about them.

Switching to pandas when you can is going with the philosophy of keeping things simple. I like the flexibility of going back and forth between these as and when I choose.


Dask and especially the dashboard is a very nice piece of technology.

But "naïve programming" (the art of writing a dumb algorithm with no regards to performance or optimizations) can bite you hard: https://docs.dask.org/en/latest/best-practices.html#avoid-ve...

I implemented an algorithm that was slower than the non-parallelized version this way :D


An anecdote that cannot be extrapolated: I did a course on using Dask a few years ago (Dask was fresh then). They taught us how Dask would improve our calculations, to the point of almost promising a time improvement of "if you chop your dataframe into 8 partitions, your times will be divided by 8," no questions asked.

Then they gave us a sample dataframe to analyze, first with Pandas and then with Dask, and compare. The times not only didn't improve or even equal Pandas: they got worse by more than 10x. The teachers looked at the code with a mix of disbelief and horror, and only after that did they start to talk about tradeoffs, overheads, data fitting in memory, etc.


Yeah it took my computer 10 minutes to save a 40 MB file on Spark today. Instantaneous on pandas.


So, the trouble with Spark (as I suspect you're probably aware) is that a) it's distributed and b) it's lazy.

This means that nothing happens until it needs to. So I have experienced .head() taking hours if there's a lot of upstream processing.

Another reason for this could be that your file is distributed across many nodes, and thus you need lots of network requests to get the file.

I'm not disagreeing with you that Spark can often have problems, but it's a really useful tool that can do things that are really hard to do elsewhere (like distributed ML).


I'm pretty sure the problem was that it was fragmented across workers and such. I had already computed the filters and done a df.show() and a count prior to trying to save it.

It's just a really unfriendly system for creating packages for everyone to run on a single node unfortunately.


Oh yeah, there's stuff you can do, but again it takes tuning and ideally it shouldn't be so annoying.

I definitely get the frustration, but waiting for sklearn for hours to fit a fraction of the observations that spark can do is also frustrating, in a different way.

> run on a single node

yeah, Spark is pretty crap here, but it's not really one of their main use cases, to be fair.


>But "naïve programming" (the art of writing a dumb algorithm with no regards to performance or optimizations) can bite you hard

amen. people should learn to write fast code instead of "spraying and praying"

https://users.ece.cmu.edu/~franzf/papers/gttse07.pdf


Having used Ray, Dask, and custom threads, my personal view is that while there are advantages, I wouldn't want to use any of these unless absolutely necessary.

My personal approach for most of these tasks are to try to break down the problem to be as asynchronous as possible. Then you can create threads.

The nice thing about Dask is really the way you can effectively use it on a pandas dataframe, without any large overhead.

Having said that, we opted to write our own parallelization for this library:

https://github.com/capitalone/DataProfiler

as opposed to using the Dask dataframe. Effectively, the dataframe processing wasn't the bottleneck, and it was easier to maintain the threading ourselves given the particular approaches taken.

That said, if I was working with large pandas dataframes, I'd likely use Dask. For large datasets which couldn't be stored in memory, I'd use ray.io.


To me, Dask is the solution to the wrong question. Rearchitecting the data loading + processing is better than trying to force data frames into a distributed world and will perform a lot better too.


The integration with xarray[0] and the Pangeo[1] community at large are pretty good. Not good enough for sub second response on large datasets but very good for analytic workloads, especially when you are dealing with larger than memory datasets.

If you want fast(ish) performance and you can wait a few minutes or longer, it's a great tool. Climate scientists love it because it lets them focus on their problem domain instead of dealing with developing parallel software.

If you are using it to magically speedup your api backend you will probably be disappointed, but you will also be just as disappointed trying to use a jackhammer to hammer in a nail.

[0] https://xarray.pydata.org/en/stable/ [1] https://pangeo.io/


I rely heavily on the Dask/Pangeo stack to serve time-series weather data via a REST API, primarily based on ERA5. You're correct, it won't give you sub-second responses, but it turns out this is more than sufficient for data analysis work.



If you read through the comments you'll find some contradictory explanations of what Dask does, so as some background:

At its bottom layer, Dask takes a graph of tasks and dispatches them to a scheduler. Schedulers can range from a few threads, to a few processes (using multiprocessing), to a few processes on the local machine (with the distributed scheduler), to tens of thousands of machines.

That latter example is not made up, I've seen geoscience demos doing that in 3 lines of code on one of their giant clusters.

The task graph can be generic Python functions, using e.g. the bag API, which gives you a map-reduce-y like API: https://docs.dask.org/en/latest/bag.html

Separately, and optionally, Dask also has emulation layers for NumPy, Pandas, and (I believe as a separate project) scikit-learn. Here you write code that looks like normal Pandas code, say, but instead of executing immediately it creates an execution graph, and then you submit the graph to your scheduler and it runs it for you. These emulation APIs are partial, by their nature, but also optional.

One interesting use case for this emulation API is processing larger-than-memory datasets. In order to support multiple processes, and even more so multiple machines, Dask reimplements the NumPy and Pandas APIs using batching underneath. And so you can run a single machine and take advantage of that batching to process data that wouldn't fit in memory when using normal Pandas APIs, while also getting access (optionally) to multiple CPUs: https://pythonspeed.com/articles/faster-pandas-dask/
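
A minimal sketch of that emulation layer in action, assuming a hypothetical set of CSV files larger than RAM:

    import dask.dataframe as dd

    # hypothetical CSV glob, larger than RAM in aggregate
    df = dd.read_csv("logs-2021-*.csv", blocksize="256MB")   # many partitions
    daily = df.groupby("date")["bytes"].sum()                # still lazy
    result = daily.compute()   # executes the graph; returns a small pandas Series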

(If you specifically want Pandas APIs, there are other alternatives mentioned here and there in the comments, e.g. Modin.)

My main personal experience was using the lower-level API to run image processing code in parallel, on a single machine with multiple worker processes. It worked great.


Apologies for being pessimistic. I've never had a reason to use Dask. I like Pandas only for very small sample datasets, nothing more than a few MB. I deal with large amounts of alphanumeric data that can in no way fit on a single computer. Even with multi-processing abilities it's never really that useful, and pandas does the job. I have to use large clusters and Spark for any real datasets before I even think about using pandas. Spark has its quirks, but I'd argue that it has a lot more features when it comes to distributed computing.


I don't think that's being pessimistic. I think it reflects dask's documentation. Some items I looked up because I remember from the past:

> If your data fits comfortably in RAM and you are not performance bound, then using NumPy might be the right choice. Dask adds another layer of complexity which may get in the way.

from https://docs.dask.org/en/latest/array-best-practices.html#us...

> In many workloads it is common to use Dask to read in a large amount of data, reduce it down, and then iterate on a much smaller amount of data. For this latter stage on smaller data it may make sense to stop using Dask, and start using normal Python again.

https://docs.dask.org/en/latest/best-practices.html#stop-usi...

So, dask even says not to use it for small data sets. And in their comparison to spark, they say spark has a lot more features.


ohh yeah. Thanks for pointing that out. I do remember reading this a few months ago.


Data in the 1-15 GB range is more common than you'd think. Spark is overkill and comparatively a nightmare to configure. You can fit 15 GB in memory nowadays on a single core, but you probably want things to move a little quicker.


I work at a university HPC centre, and our new 72-core Ice Lake nodes have 512 GB of RAM; even our older 40-core Cascade Lake nodes had 256 GB as standard. So I always scratch my head at who these libraries serve. That's pretty standard hardware for an HPC cluster, which is where people doing this stuff should be working anyway.


I wasn't thinking about HPC at all. I more had in mind "single-node" use cases like a workstation PC.

But even in a small cluster type of use case, I think the main users are scientists or other practitioners who, for whatever reason, don't have the support of a platform team or data engineering team that can keep a Spark (or other) cluster fed and cared for.


Off the top of my head, I can't remember what size nodes we use. I think it's m5x large AWS nodes with 16 cores and 128 GB RAM. Depending on the size of the data I use anywhere between 20 to 80 of these nodes. Sometimes even higher. That's why I've never had a need for these libraries.


I have seen European companies (claiming to be) doing big data on 16 GB, 4-CPU, 15-node clusters and claiming that their Spark jobs were not failing due to the '90s hardware. I will send them your configuration. :)


You're definitely correct about Spark being a nightmare to configure. Luckily we've got data platform teams who take care of that. Analysts like I still do configure parameters depending on the scenario. You might be correct about data in the 1-15 GB being more common. I'm an analyst so most of the time, I've to aggregate data before I do any analysis. But data scientists in my team typically work on smaller datasets and use Sagemaker instances.


Spark is massive overkill for the vast majority of needs. Can't fit the data on a computer, like your use case? Yep, good for that.

Anything else, and it's not worth the overhead. Dask doesn't require a JVM cluster with executors and name nodes and hdfs and everything else that keeps databricks in business.


yeah, like others have pointed out, you might be correct about it being a massive overkill for most users. It takes me at least 10 to 15 minutes to spin up a cluster and configure it before I can start anything.


We have been using dask to support our computational pathology workflows [1], where the images are so big that they cannot be loaded in memory, let alone analyzed (standard pathology whole slide images are ~1GB; some microscopy techniques generate images >1TB). We divide each image into a bunch of smaller tiles and process each tile independently. The dask.distributed scheduler lets us scale up by distributing the tile processing across a cluster.
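
A hedged sketch of that tile-at-a-time pattern (the extract/process functions are stand-ins, not PathML APIs):

    import numpy as np
    from dask.distributed import Client, as_completed

    def extract_tile(coords, size=1024):
        # stand-in: pretend to read a (size x size) RGB region at `coords`
        return np.zeros((size, size, 3), dtype=np.uint8)

    def process_tile(tile):
        # stand-in for segmentation / feature extraction on one tile
        return float(tile.mean())

    if __name__ == "__main__":
        client = Client()    # or point at an HPC / k8s / cloud cluster
        grid = [(x, y) for x in range(0, 4096, 1024) for y in range(0, 4096, 1024)]
        tiles = client.map(extract_tile, grid)
        results = client.map(process_tile, tiles)
        for fut in as_completed(results):
            print(fut.result())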

Benefits of dask.distributed: easy to get up and running, and has support for spinning up clusters on lots of different computing platforms (local machines, HPC cluster, k8s, etc.)

One difficulty is optimizing performance - there are so many configuration details (job size, number of workers, worker resources, etc. etc.) that it's been hard to know what is best.

[1] https://github.com/Dana-Farber-AIOS/pathml


Just FYI, Amazon and most other providers offer instances that go up to many TB of RAM; the largest listed on EC2 right now is 24 TB.

Depending on your workload, you may get far better throughput not worrying about distributing the work and data


I started looking at Dask recently... its scheduler/worker mechanism reminded me of the old Docker Swarm. I liked that!

But when I moved to a Kubernetes deployment (set up via the community-provided Helm charts), I was completely mystified and got slowdowns as reported by others. At one point my physical worker node was using 1 core, and at other times it was using all cores, even though the default config they provide should have meant 2 vCPUs. So, all in all, I was completely mystified about what was going on under the hood, which speaks to the lack of debuggability mentioned by others.


I would not recommend Dask. We use it just for simple job scheduling (that is, none of its fancy data structures) and run into issues just getting the work done efficiently. This issue, for instance, keeps the cluster from actually being utilized fully: https://github.com/dask/distributed/issues/4501. I feel like I'm on crazy pills, because it seems pretty serious yet it's gotten no attention.


FWIW, I'm moving towards [vaex](https://vaex.io/docs/) for out of core python stuff

Dask is well built but is fundamentally flawed by tacking together pandas structures with multiprocessing, rather than having a fundamentally efficient underlying structure.


Vaex is amazing. I haven't used it in production but was blown away when I did some prototyping with it.


I love Dask for parallel computing because instead of threads or agents, you’re primarily using distributed data structures like distributed DataFrames, Series, and Bags.

It’s made our code very small, clean, and bug free.


A lot of the commentary here focuses on comparisons with tabular/dataframe-based analyses, but one of the killer use cases for Dask is more generalized out-of-core array computing. It greatly simplifies many of the forms of dimensional reduction or aggregation that one might perform on n-dimensional data cubes, without worrying about the boilerplate and cognitive overhead of blocking out your calculation and keeping track of all the intermediates. Add in the fact that you can accomplish the reduction out-of-core and you've got an incredibly useful tool.

Is it the perfect tool for all applications? Absolutely not. But for many of my workloads as a geoscientist it allows me to trivially graduate a toy/prototype code to something at "scale" (maybe not for production, but certainly for research) with very little effort. That frees me up to work on other problems.
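
A minimal sketch of the kind of blocked n-dimensional reduction described above (the shapes and chunk sizes are made up):

    import dask.array as da

    cube = da.random.random((365, 1000, 1000), chunks=(30, 250, 250))
    climatology = cube.mean(axis=0)            # lazy, computed block by block
    anomaly = cube - climatology               # broadcasting, still lazy
    result = anomaly.mean(axis=(1, 2)).compute()
    print(result.shape)                        # (365,)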


Used Dask for various research experiments and scripts, and it's very good when you have to handle larger-than-memory datasets. When coming to algorithm parallelization, some quirks emerged, and debugging sent me completely nuts. I love Dask, but I think that the problem of parallelizing Python code is not solvable by Dask alone.


Dask's architecture is unable to handle large-scale data processing tasks (e.g. ETL, dataframe operations) because it's just a distributed task scheduler. It works as long as tasks have very little communication across them. But data processing tasks like table joins need heavy communication (e.g. shuffles), which requires a true parallel architecture like MPI to be done efficiently. Almost all of Dask's problems mentioned here go back to this issue.

Bodo is a new compute engine that brings true parallel computing with MPI to data processing. Bodo is over 100x faster than Dask for large-scale data processing. Forget about speed, are you willing to pay AWS/Azure/GCP 100 times more than necessary (also increasing your carbon footprint)?

Bodo uses a new inferential JIT compiler technology which requires getting used to but it handles actual Pandas APIs (not "Pandas-like").
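
A hedged sketch of what that JIT style looks like, assuming a hypothetical parquet path and columns:

    import bodo
    import pandas as pd

    @bodo.jit
    def daily_totals(path):
        # regular pandas calls, compiled by the decorator
        df = pd.read_parquet(path)
        return df.groupby("date")["amount"].sum()

    totals = daily_totals("s3://bucket/transactions.pq")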

Disclaimer: I work for Bodo.


Recently discovered ray.io. Beats multiprocessing and can integrate with dask.


YMMV. Dask with Ray is 10% slower than Dask with multiprocessing on my workload.


What's your experience with Ray?


Ray is awesome and magical.

You can build really ergonomic distributed programs that are small wrappers around python classes.

I would prefer this to multiprocessing even if you are only on a single host.
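
A minimal sketch of the "small wrapper around a Python class" pattern, using Ray's actor API (the Counter class is just a toy):

    import ray

    ray.init()   # on a cluster: ray.init(address="auto")

    @ray.remote
    class Counter:
        def __init__(self):
            self.n = 0

        def increment(self):
            self.n += 1
            return self.n

    counters = [Counter.remote() for _ in range(4)]      # 4 actor processes
    futures = [c.increment.remote() for c in counters]   # async method calls
    print(ray.get(futures))                              # -> [1, 1, 1, 1]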


I’ll totally second this. It takes away a lot of the tedium of multiprocessing programming.

I implemented it at my last job and it was a game changer. It simplified a lot of gnarly cluster management and let me focus on my code rather than figuring out how to distribute the work.


Fun fact: the first supercomputer in Denmark was called DASK. https://amp.ww.en.freejournal.org/2647879/1/dask.html


Dask is useful on a single laptop as well: it can turn a computation that's too large for memory in xarray into a chunked one that just crunches through, with the same code.
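
A hedged sketch of that same-code-but-chunked workflow with xarray (the file and variable names are hypothetical):

    import xarray as xr

    ds = xr.open_dataset("era5_t2m_2020.nc", chunks={"time": 24})  # dask-backed
    weekly = ds["t2m"].resample(time="7D").mean()   # lazy, chunk by chunk
    weekly.to_netcdf("era5_t2m_weekly.nc")          # streams through memory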


Modin + Ray would have been easier, and would not have required a daskification of the code, because Modin's stated aim is complete compatibility with the pandas API. Still, I've managed to limit it to one module in the project, and I am planning to try Modin + Ray vs Dask to see if it is faster for this dataset.


I worked on a project that used dask to distribute integration tests. It worked, but it was painful and unintuitive.


My dissertation used a lot of Dask code, and I found it to be an amazingly powerful tool for helping me glue a few dozen other libraries together and orchestrate them as a single distributed system. Sure, it's half-done Spark, but if it were fully done Spark I couldn't have used it as flexibly.


Why not just use the multiprocessing library?

Answer: you can distribute way beyond a single computer with Dask.


As mentioned, for all but the simplest use-cases, you probably want to use Spark though.

I think the time and energy that is spent on Dask would be better spent to get proper concurrency support into python.


In the academic world you can run Dask on the typical clusters using slurm or grid-engine, without any additional setup. You're not going to get a university IT department to create a Spark cluster for you.
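
A minimal sketch of that workflow using dask-jobqueue on Slurm (the queue name, resources, and array sizes are hypothetical):

    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster
    import dask.array as da

    cluster = SLURMCluster(cores=16, memory="64GB",
                           walltime="02:00:00", queue="compute")
    cluster.scale(jobs=10)        # submits 10 Slurm jobs, each a Dask worker
    client = Client(cluster)

    x = da.random.random((200_000, 10_000), chunks=(10_000, 10_000))
    print(x.mean().compute())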


If you want built-in GPU support (and distributed), you should check out cuNumeric (released by NVIDIA in the last week or so). Also avoids needing to manually specify chunk sizes, like it says in a sibling comment.

https://github.com/nv-legate/cunumeric
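
A hedged sketch, assuming cuNumeric's drop-in NumPy import works as the project README describes (run under the Legate launcher to pick up GPUs/multiple nodes):

    import cunumeric as np   # assumption: drop-in NumPy replacement import

    a = np.ones((4_000, 4_000))
    b = np.ones((4_000, 4_000))
    c = a @ b                # dispatched by the Legate runtime, no chunk sizes
    print(float(c.sum()))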


I see they also have a pandas replacement: https://github.com/nv-legate/legate.pandas. How is it different from cuDF?


You can probably use https://github.com/rapidsai/cudf/tree/main/python/dask_cudf, a Dask wrapper around cuDF.
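
A hedged sketch of that dask_cudf usage (the path and column names are hypothetical):

    import dask_cudf

    ddf = dask_cudf.read_parquet("s3://bucket/events/*.parquet")
    out = ddf.groupby("user_id")["amount"].sum().compute()   # runs on GPU workers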


I believe cuDF is not distributed? If you use Legate Pandas (or any Legate libraries) you'll get distributed as well as GPU execution out of the box.


Super useful in the pandas and scikit-learn realms, taken to the "cloud." Please note that Dask runs locally too! (You can run Dask on your local machine with Debian/Ubuntu or RH derivatives.)


You mean to say you can't run a local dask executor on other Linux distributions?


Unless things have changed, Dask is pure Python, so it should run anywhere Python can.


Actually, not sure if anyone noticed this: https://github.com/amzn/amazon-ray. That means a good amount of benchmarking has been done and Ray won. Having a commitment from Amazon (and even setting up that repo) is no small feat.


idk, for me Dask was more of a pain to set up than a single-node Spark cluster, and now that Databricks has added a fully compatible pandas API to PySpark, there is even less reason for me to use Dask locally.


Shout out to an alternative to Dask: MPIRE https://github.com/Slimmer-AI/mpire


Prefect can also run on Dask

https://www.prefect.io/


There is a hard cap on productivity given that Dask is in Python, so you can't do advanced stuff unless you code in C or C++.

That's the issue with these tools.


Python is pretty good at calling compiled code :)



