
Parallel Out-Of-Core Dataframes in Python: Dask and OpenStreetMap - Lofkin
https://jakevdp.github.io/blog/2015/08/14/out-of-core-dataframes-in-python/
======
IndianAstronaut
This is awesome. One of the reasons I shifted away from Pandas was its
difficulty in dealing with out-of-core data. Can't wait to try this out.

~~~
nimrody
What did you end up using instead of Pandas?

~~~
IndianAstronaut
Right now I do my analysis work in SQL (currently MSSQL, previously MySQL),
though I always struggle to apply functions to the data there. I then do some
sampling and further analysis in R.

------
bjlkeng
Comparison of PySpark vs Dask:

[http://dask.pydata.org/en/latest/spark.html](http://dask.pydata.org/en/latest/spark.html)

~~~
rxin
This is great work, but certain parts of the comparison are not accurate,
probably due to a lack of familiarity with Spark.

First and foremost, it would make more sense to compare against the DataFrame
API of Spark, which is very Pandas like.

"Dask gives up high-level understanding to allow users to express more complex
parallel algorithms."

\- I don't think this is true. Their example of a complex algorithm (SVD) is
not that complicated, and there are even implementations of it in Spark's
MLlib directly. Spark's DAG/RDD API is essentially the low-level user-facing
task API.

"If you have petabytes of JSON files, a simple workflow, and a thousand node
cluster then you should probably use Spark. If you have 10s-1000s of gigabytes
of binary or numeric data, complex algorithms, and a large multi-core
workstation then you should probably use dask."

\- Actually, a lot of Spark users use Spark as a way to parallelize Python
programs on multi-core machines.

"If you have a terabyte or less of CSV or JSON data then you should forget
both Spark and Dask and use Postgres or MongoDB."

\- Loading a "terabyte" of JSON into Postgres seems pretty painful.

~~~
mrocklin
Thanks for the comments. I wrote most of that document so I'll try to explain
my reasoning in line.

> First and foremost, it would make more sense to compare against the
> DataFrame API of Spark, which is very Pandas like.

It would make more sense to me to compare dask.dataframe to spark's dataframe.
This document is comparing dask to spark. dask.dataframe is a relatively small
part of dask.

>> "Dask gives up high-level understanding to allow users to express more
complex parallel algorithms."

> I don't think this is true. Their example of complex algorithms (SVD) is not
> that complicated, and there are even implementations of that in Spark's
> MLlib directly. Spark's DAG/RDD API is essentially the low level user-facing
> task API.

The point here is that it's quite natural for dask users to create custom
graphs (here is another example
matthewrocklin.com/blog/work/2015/07/23/Imperative/). Doing this in Spark
requires digging more deeply into its guts. This sort of work is neither
idiomatic nor really intended in Spark.
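To make the "custom graphs" point concrete: dask task graphs are plain Python dicts, where each value is either a literal or a tuple of a callable and its arguments. The following is a minimal sketch with a toy recursive scheduler (not dask's actual scheduler; the `get` function and graph keys here are purely illustrative):

```python
from operator import add, mul

# A dask-style task graph: keys name results; values are either literals
# or tuples of (callable, *arguments), where an argument that is itself
# a key refers to another task in the graph.
def get(graph, key):
    """Toy single-threaded scheduler: recursively resolve a key."""
    val = graph[key]
    if isinstance(val, tuple) and callable(val[0]):
        func, *args = val
        return func(*(get(graph, a) if a in graph else a for a in args))
    return val

graph = {
    "x": 1,
    "y": 2,
    "z": (add, "x", "y"),   # z = x + y
    "w": (mul, "z", 10),    # w = z * 10
}

print(get(graph, "w"))  # → 30
```

Because the graph is just a dict, users can build arbitrary dependency structures by hand and hand them to a scheduler, which is the flexibility the quoted sentence is pointing at.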

> Loading "terabyte" of JSON into Postgres seems pretty painful

I've found loading a terabyte of CSV into Postgres or a terabyte of JSON into
Mongo to be quite pleasant, actually. I'd be curious to know what problems you
ran into.

------
justinsaccount
Their conclusion is interesting:

    If you have a terabyte or less of CSV or JSON data
    then you should forget both Spark and Dask and use
    Postgres or MongoDB.

I don't really have "big data" problems. I have "annoying data" problems: 500G
of 10:1 compressed CSV log files that I want to run reports on every now and
then. Often it's just a count or top-k by a column, but sometimes grouping and
counting (i.e., the sum of column 5 grouped by column 3 where column 2 = 'foo').
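That kind of query can be done out-of-core in plain pandas by streaming the file in chunks and combining partial group sums. A minimal sketch (the `c1`..`c5` column names and tiny inline CSV are hypothetical stand-ins for the log files described above):

```python
import io

import pandas as pd

# Hypothetical mini log with columns c1..c5; the query is:
# "sum of column 5 grouped by column 3 where column 2 = 'foo'".
csv = io.StringIO(
    "c1,c2,c3,c4,c5\n"
    "a,foo,x,1,10\n"
    "b,bar,x,2,20\n"
    "c,foo,y,3,30\n"
    "d,foo,x,4,40\n"
)

# Reading in chunks keeps memory bounded; because a groupby-sum is
# associative, per-chunk partial sums can be combined at the end.
partials = []
for chunk in pd.read_csv(csv, chunksize=2):
    filtered = chunk[chunk["c2"] == "foo"]
    partials.append(filtered.groupby("c3")["c5"].sum())

result = pd.concat(partials).groupby(level=0).sum()
print(result.to_dict())  # → {'x': 50, 'y': 30}
```

For real gzipped logs, `pd.read_csv("log.csv.gz", chunksize=...)` streams and decompresses the same way; dask.dataframe essentially automates and parallelizes this chunking pattern.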

I've been looking into tools like Spark and Drill, but my tests running on a
single machine found them to be extremely slow. Maybe things would be faster
if I converted the log files to their native formats?

I've been considering trying to load the data into a Postgres db using
cstore_fdw, but what I really want is a high-performance SQL engine for flat
files, something along the lines of Kdb.

Like this article that I read recently:
[http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...](http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html)
I know this can be done efficiently enough on a single machine; I just need
the right software.

~~~
mrocklin
Flat CSV or JSON files are expensive to parse. Fast CSV parsers and gzip
decompression both run at around 100 MB/s. If you want to go faster than that
you'll need to use better (ideally binary) formats.

This notebook might interest you:
[http://nbviewer.ipython.org/gist/mrocklin/c16c5c483b2b9859de...](http://nbviewer.ipython.org/gist/mrocklin/c16c5c483b2b9859ded5)
, particularly the sections starting at "Eleven minutes is a long time." It
compares CSV costs (minutes) to custom binary storage formats (seconds) on a
20 GB dataset.
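The text-versus-binary gap is easy to see in miniature. This toy sketch round-trips the same numeric column through a text format (every value formatted and re-parsed) and through NumPy's binary `.npy` format (bytes copied nearly verbatim); HDF5 or Parquet play the same role at the scale discussed above, and the array size here is arbitrary:

```python
import io
import time

import numpy as np

data = np.arange(200_000, dtype="float64")

# Text round-trip: each value is formatted to a string, then re-parsed.
tbuf = io.StringIO()
np.savetxt(tbuf, data)
tbuf.seek(0)
t0 = time.perf_counter()
from_text = np.loadtxt(tbuf)
text_s = time.perf_counter() - t0

# Binary round-trip: the raw bytes are copied straight into the array.
bbuf = io.BytesIO()
np.save(bbuf, data)
bbuf.seek(0)
t0 = time.perf_counter()
from_binary = np.load(bbuf)
binary_s = time.perf_counter() - t0

assert (from_text == from_binary).all()
print(f"text parse: {text_s:.3f}s, binary load: {binary_s:.3f}s")
```

The binary load is typically orders of magnitude faster, which is the same effect the linked notebook measures on a 20 GB dataset.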

