
Mistakes Developers Make When Using Python for Big Data Analytics - yarapavan
https://www.airpair.com/python/posts/top-mistakes-python-big-data-analytics
======
tedchs
In my experience, developers' top 3 mistakes in this area are:

1. Convincing themselves they have "big data" instead of just "data". If it
fits in RAM on your laptop, it's definitely not "big".

2. Thinking an example of "sophisticated analytics" is "Average over time
with no standard deviation".

3. Resume-Driven Development over pragmatic solutions that get stuff done.

~~~
bunderbunder
I'm beginning to think that "big data" is the most unfortunate possible term.

By any reasonable definition of 'big', even a gigabyte or two counts as big.
And since 32-bit isn't quite one for the history books just yet, many of us
still have 2-4 GB fresh in our heads as the threshold where you might have to
start thinking about the data needing special treatment.

~~~
vonmoltke
Not only that, most users of "big data" make assumptions about what the
underlying dataset is. Not all large datasets are unstructured text you are
trying to run information extraction on.

~~~
bunderbunder
Indeed. It's amazing how impressive you can look rescuing someone from a "big
data" quagmire by just loading that CSV file into a halfway decent RDBMS and
letting a simple SELECT statement spin away for a couple tens of μs.

Not that I've got anything against container ships, mind you, it's just that I
find they're a rather unwieldy vehicle to use for getting groceries home from
the market.

------
micro_cam
I kind of feel like pandas and ipython are anti-patterns here.

They seem super convenient but I've found that writing a lower level analysis
with independent scripts linked via make or similar saves massive amounts of
time in the long run.

I.e. the first few steps should retrieve and process the data until it is just
arrays of numbers (or whatever your actual analysis needs) that can be handled
with efficient numpy code.

You can use pandas for this but a real database works too. After this step
pandas becomes irrelevant and just leads to lots of unnecessary allocations
from recasting data on the fly.
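
For illustration, a minimal sketch of that hand-off, with each step as its own
script that make can wire together (file and column names here are
hypothetical):

    # prepare.py -- first target: reduce the raw file to plain numbers
    import csv
    import numpy as np

    values = []
    with open("raw.csv", newline="") as f:
        for row in csv.DictReader(f):
            values.append(float(row["value"]))
    np.save("values.npy", np.asarray(values))

    # analyze.py -- next target: starts with a clean interpreter, works on
    # bare numpy arrays, and releases all of its memory when it exits
    import numpy as np

    values = np.load("values.npy")
    print(values.mean(), values.std())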

The problem with ipython is that it leaks like a sieve: you end up with all
sorts of copies of the data, worker processes, and no-longer-used variables
that will eat all of your RAM even on a big cloud instance.

It is much nicer to have each analysis script start with a clean stack and
release its memory when done. Plus you can use non-Python utilities like grep,
Vowpal Wabbit, etc. as intermediate steps.

I've found practices like these significantly lower the memory requirements of
analysis and allow one to tackle bigger datasets on single machines.

Each to their own though.

~~~
numlocked
I find that IPython NB is a great tool for initial exploration -- to sketch
out the ETL steps that are needed, what types of computations I'm going to be
doing, etc. As soon as those ideas are roughed out, I switch to writing Python
scripts that I can run through completely, write tests for, git commit and get
reasonable diffs, and more. I'll generally keep an ipynb open in a tab for
experimentation, but I'm doing all of the work in PyCharm or Spyder.

This also forces me to think about the engineering implications of the
analysis a bit sooner. Inevitably these projects are not one-offs, and will
need to be at least repeated regularly, if not outright productionalized --
and ipynb files do not lead to production-ready code.

I also agree that using a real database is often a better option than pandas.
Many folks avoid it because it's less comfortable to spin up postgres + a new
DB instance than it is to sit in the comfort of IPython, but it's totally
worth it. Exploring and previewing the data via SQL is just so much faster and
more intuitive than spinning it around with pandas.

~~~
jimbokun
And note seccess's comment about using an in-memory SQLite DB to eliminate the
need to configure postgres:

[https://news.ycombinator.com/item?id=8930488](https://news.ycombinator.com/item?id=8930488)
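
The pattern is tiny; a sketch, assuming a hypothetical events.csv with
user_id/amount columns:

    import csv
    import sqlite3

    conn = sqlite3.connect(":memory:")  # no server to install or configure
    conn.execute("CREATE TABLE events (user_id TEXT, amount REAL)")
    with open("events.csv", newline="") as f:
        rows = ((r["user_id"], float(r["amount"])) for r in csv.DictReader(f))
        conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

    query = "SELECT user_id, SUM(amount) FROM events GROUP BY user_id"
    for row in conn.execute(query):
        print(row)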

------
mkesper
Even when not using pandas, do yourself a favour and use
csv.DictReader/DictWriter, as the CSV format has quite a few quirks.

[https://docs.python.org/3/library/csv.html#csv.DictReader](https://docs.python.org/3/library/csv.html#csv.DictReader)
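
For instance, a minimal round-trip that preserves the header row (file names
hypothetical):

    import csv

    # DictReader/DictWriter handle quoting, embedded commas and the header
    # row for you, so each record is just a dict keyed by column name
    with open("in.csv", newline="") as src, \
         open("out.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            writer.writerow(row)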

~~~
t1m
The csv module has its fair share of quirks to be aware of as well, like not
supporting Unicode.

~~~
juliangregorian
Even in Python 3?

~~~
mkesper
In Python 3 it's completely fine if you open the files with the correct
encoding. For Python 2, you've got to use the UnicodeReader/UnicodeWriter
recipes (and open the files in binary mode):
[https://docs.python.org/2/library/csv.html?highlight=csv#examples](https://docs.python.org/2/library/csv.html?highlight=csv#examples)
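
In Python 3 that boils down to something like this (file name and encoding
are assumptions):

    import csv

    # pass the file's actual encoding, and newline="" as the csv docs advise
    with open("data.csv", newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            print(row)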

------
makmanalp
Great points about data provenance and sanity checking at the right level.
This is all valid for any kind of data and not just python stuff.

> Doing the task in vanilla Python does have the advantage of not needing to
> load the whole file in memory - however, pandas does things behind the
> scenes to optimize I/O and performance.

There's a neat way around this, just set iterator=True or chunksize and
pandas'll return you an iterable TextFileReader object:
[http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking](http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking)
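
A quick sketch of the chunked version (file and column names hypothetical):

    import pandas as pd

    total = 0
    # chunksize makes read_csv yield DataFrames of 100000 rows at a time
    # instead of materializing the whole file in memory
    for chunk in pd.read_csv("big.csv", chunksize=100000):
        total += chunk["amount"].sum()
    print(total)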

Another tip is that even before you get into cythonizing stuff, see how much
of the computation you can push down into high-performance libraries like
numpy: use a vectorized numpy function and store your stuff in a numpy data
structure instead of running a manual for loop over a regular array-of-arrays,
etc.
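
E.g. the same arithmetic written both ways (array size and constants made up):

    import numpy as np

    xs = np.random.rand(1000000)

    # slow: a manual Python-level loop over a plain list
    slow = [x * 2.5 + 1.0 for x in xs.tolist()]

    # fast: the identical computation vectorized over the whole array in C
    fast = xs * 2.5 + 1.0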

Sidenote, is there a decorator version of cythonmagic? Basically I want to
annotate my functions with it since sometimes the non-typed basic cython is
still much faster than the pure python version, and I don't have to manage the
compilation step.

------
arca_vorago
I would agree most strongly with number 6. Having spent time in the biotech
arena, there are so many fractured data format structures. A good example of
this would be small variations in .fastq files. There are different formats
generated by different equipment, and different software expects format A
when you have format B; I've seen people trying to shove the wrong formats
into pipelines that blow up or create a ton of work as a result.

Honestly though, the biggest problem with people using python for bid data is
just the opposite of number 2: sometimes you need to use that framework, but
when you rely too heavily on various frameworks, that's how you end up in
dependency hell, largely due to the aforementioned format issues. E.g.
framework 1's update fixes bug X but creates bug Y in framework 2's parsing.

~~~
bdevine
Your "bid data" typo seems like a nice little Freudian slip!

As far as your second point though: although it's not ideal, don't virtualenvs
do the trick? If you really needed to, you could even set up a workflow of
different sandboxes to pass data through. If the alternative to relying on
frameworks is rolling your own, frankly I would probably choose the former,
but that's just me.

~~~
arca_vorago
Good point on virtualenvs, but it's a Python-specific fix (which does work).
The problem is that often you are tying in other kinds of tooling.

The other thing to remember about roll your own is you can get efficiency that
frameworks can't, and that time difference adds up fast. For example, we
designed a worker distribution system, rolled our own worker management code
on top of a framework, and reduced compute times from ~1-2 days to ~4 hours.
That's a huge increase in productivity that no tool or framework could give
us.

There is power in rolling your own; I would just say whip your programmers
into submission regarding good commenting/documentation, though.

------
bsg75
Nothing to do with the mythical concept of "Big Data" here.

Not necessarily a bad post, but the "Big Data" title is misleading.

------
raincom
big data (noun) 1. another name for data warehousing.

