

The Top Mistakes Developers Make When Using Python for Big Data Analytics - rbanffy
https://www.airpair.com/python/posts/top-mistakes-python-big-data-analytics

======
stephanfroede
My question? Why Python?

Most (if not all) Big Data technologies are based on the JVM (Java and/or
Scala), so why not just use JVM-based languages like Java, Scala, and
Clojure?

I have nothing against Python as such, but adding yet another language does
not simplify the job.

~~~
dalke
To understand why, I'll start with a quote from the Wikipedia page for "Big
Data":

> Big data is a broad term for data sets so large or complex that traditional
> data processing applications are inadequate. Challenges include analysis,
> capture, curation, search, sharing, storage, transfer, visualization, and
> information privacy. The term often refers simply to the use of predictive
> analytics or other certain advanced methods to extract value from data, and
> seldom to a particular size of data set.

See the last sentence? That's the meaning used in this article. You can see
that in "Mistake #1" when it mentions Python Pandas, which expects the data to
fit into RAM. You can see more of it in "10s of gigabytes of data, the power
of a scripting languange [sic] like Python, no matter how optimized, may not
be enough" -- I've used Python to process 10s of GB of data, and renting a 60
GiB machine from Amazon costs $1.680/hour.
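To make that concrete, here is a minimal sketch of how 10s of GB can be
processed on one machine with Pandas by streaming the file in chunks, so the
whole data set never has to fit in RAM. The file path and column names are
made up for illustration:

```python
# Hypothetical sketch: aggregate a multi-GB CSV on a single machine by
# reading it in fixed-size chunks with pandas, instead of loading it whole.
import pandas as pd

def total_by_key(path_or_buffer, key="user_id", value="bytes"):
    """Sum `value` per `key` without holding the full file in memory."""
    totals = {}
    # chunksize makes read_csv yield DataFrames of at most 1M rows each.
    for chunk in pd.read_csv(path_or_buffer, usecols=[key, value],
                             chunksize=1_000_000):
        # Combine each chunk's partial sums into the running totals.
        for k, v in chunk.groupby(key)[value].sum().items():
            totals[k] = totals.get(k, 0) + v
    return totals
```

Only one chunk is resident at a time, so peak memory is bounded by the chunk
size rather than the file size.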

If you believe that it's not "Big Data" unless you need a cluster of machines
to have enough RAM to work with it, then I can well understand why you might
complain about Python in this context.

On the flip side, I've seen, or heard of, "Big Data" projects which start with
the expectation that it will require a cluster, and never investigate if
'traditional data processing applications' are adequate. E.g., in one project
I developed optimizations that gave an overall 40x performance boost, so that
only one machine was needed where my client had previously required a
cluster.

If Big Data includes using machine learning to identify patterns in data sets
too large for people to understand, then we were using Python to do data
mining of large chemical screening data sets in the late 1990s. Even earlier,
Python was
being used to control supercomputing tasks, where the high-level glue code was
in Python and the low-level code in C or Fortran.
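That glue pattern can be sketched in a few lines. In this hypothetical
example the "low-level code" is simply the C math library's cos(), loaded
through the standard ctypes module; a real supercomputing code would load its
own compiled C or Fortran library the same way:

```python
# Hypothetical sketch of "Python as glue": the numeric kernel lives in a
# compiled C library, and Python only orchestrates the calls.
import ctypes
import ctypes.util

# Load libm (the C math library); fall back to the common glibc soname.
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

def cos_table(xs):
    """High-level Python driver; each cosine is computed in C."""
    return [libm.cos(x) for x in xs]
```

The division of labor is the point: Python handles setup, I/O, and control
flow, while every floating-point operation runs at compiled-code speed.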

Unlike simple map-reduce jobs, these included codes with complex inter-node
dependencies, like molecular dynamics, where network bandwidth is another
important performance factor. A molecular dynamics simulation can produce 10s
of GB per day, so it can easily count as a "Big Data" project, and it cannot
be done using the JVM-based solutions you mentioned.

[http://www.infoq.com/news/2014/01/bigdata-languages](http://www.infoq.com/news/2014/01/bigdata-languages)
also gives a viewpoint on the lack of importance of the specific language in
big data analysis.

