
Data analytics with Pandas and SQLite - duck237
https://plot.ly/ipython-notebooks/big-data-analytics-with-pandas-and-sqlite/
======
vegabook
This is riddled with problems.

a) I routinely load 6GB of financial data into R in less than 45 seconds, and
it's complex and hierarchical to boot. Admittedly that's in R's native file
format but even a flat CSV takes less than 5 minutes.

b) Why SQL? Has this person considered PyTables or h5py? HDF5 is an excellent
file format created by people who work with large data sets all day long, and
it is vastly superior to an on-disk relational database for data sets of this
size (and much bigger). Moreover, it maps directly to NumPy.

c) Why SQL (II)? Why is all the wrangling being done in slow SQL when Pandas
has 100x faster in-memory querying capability via its groupby and query methods?

d) The visualization is terrible. "Other" dominates the dataset and dwarfs any
of the more interesting categories and their relationship to time of day. This
is a very clear case where a chart grid would have been preferable, easily
constructible in Pandas or of course ggplot2, and surely doable with Plot.ly.
We would have been able to compare how different categories have different
time-of-day profiles (see the sketch after point e).

e) If we're going to push data visualization tutorials into the big wide
world, why not introduce people to quantile plots, useful here to compare the
distribution of complaints (literally one line in R)? Or, more simply and very
intuitively, we could have superimposed all the hour-by-hour complaint density
plots on each other and seen how they compare; again, that's literally a line
or two in R, as in plot(density(x)); lines(density(y), col = "xxx"). After
having normalised the data, of course. This has its analogues in Python. A
second, bog-standard, _unstacked_ bar chart could then have been used for
aggregate complaint volume by hour, and dare I say it, even a pie chart would
have been better here for comparing aggregate complaint category sizes.
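
To make (c) and (d) concrete, here's a rough sketch of what I mean in Pandas:
all the wrangling in memory via query/groupby, then a chart grid so each
category gets its own panel. The file and column names ('Agency',
'CreatedDate') are my guesses at the 311 data's schema, not taken from the
notebook.

    # Sketch only: column names are assumptions, not the real schema.
    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.read_csv('311.csv', usecols=['Agency', 'CreatedDate'],
                     parse_dates=['CreatedDate'])
    df['hour'] = df['CreatedDate'].dt.hour

    # In-memory filtering and aggregation, no SQL round trip.
    top = df['Agency'].value_counts().head(9).index
    counts = (df.query('Agency in @top')
                .groupby(['Agency', 'hour'])
                .size()
                .unstack('Agency', fill_value=0))

    # Small multiples: one panel per category, each with its own
    # y-scale, so "Other" can't drown out the interesting ones.
    counts.plot(subplots=True, layout=(3, 3), figsize=(10, 8),
                sharex=True, legend=False)
    plt.show()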

This looks like a pitch for Plotly aimed at casual HTML developers, not
serious data scientists.

~~~
IndianAstronaut
>The visualization is terrible.

That's a feature, not a bug. Plotting libraries for Python have always been
abysmal. Matplotlib is both ugly and hard to use.

~~~
vegabook
Wow, I have to agree with this (even though the post is using Plotly). I don't
know where the culture difference comes from between the R and Python
ecosystems that makes the latter so much worse than the former when it comes
to graphics. Python seems to be able to do so many things so well, but then at
the final output stage it all falls apart and I find myself back in R. On that
point, I even wonder if, with all the Julia brouhaha, the actual _real_
competitor in data science might be JavaScript! Much faster than Python, and a
community that truly understands graphics quality (I'm looking primarily at D3,
but things like MathBox.js and even Highcharts are pretty decent). I'm thinking
that SVG and increasingly WebGL will incentivize linear algebra skills in the
JavaScript ecosystem, and those same skills will suddenly be transferable to
data.

~~~
IndianAstronaut
Besides D3, Google Charts (also in JS) is quite elegant.

------
haddr
Well, this is surely useful for some use cases. But for repeated analyses,
that 50-minute loading time is a no-go, and there is certainly big room for
improvement. For instance, using R with data.table to process a ~4GB file
would result in a few minutes spent on loading and processing, while giving a
very similar flexibility of REPL and scripting. There should be a similar way
to do it in Python, I guess...
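
Something like this in pandas, probably, with the schema spelled out up front
so the parser doesn't have to guess (column names and dtypes invented here):

    # Sketch: column names/dtypes are placeholders, not the real schema.
    import pandas as pd

    df = pd.read_csv('311.csv',
                     usecols=['Agency', 'ComplaintType', 'CreatedDate'],
                     dtype={'Agency': 'category',
                            'ComplaintType': 'category'},
                     parse_dates=['CreatedDate'])
    df.info(memory_usage='deep')  # check what it actually costs in RAM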

~~~
maxerickson
The loading time is probably because of the way they are using the ORM:

[https://github.com/pydata/pandas/issues/8953](https://github.com/pydata/pandas/issues/8953)

I guess it also isn't a great approach to do row-based filtering using a data
frame.
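
For SQLite specifically, you can sidestep the SQLAlchemy path entirely by
handing to_sql a raw DBAPI connection and loading in chunks. A minimal sketch
(file and table names made up):

    # Sketch: stream the CSV in chunks into SQLite over the plain
    # sqlite3 DBAPI connection (pandas supports this for SQLite),
    # rather than one ORM-mediated insert at a time.
    import sqlite3
    import pandas as pd

    con = sqlite3.connect('data.db')
    for chunk in pd.read_csv('311.csv', chunksize=200000):
        chunk.to_sql('data', con, if_exists='append', index=False)
    con.close()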

~~~
blumkvist
All in all, excellent showcase of the product and the team behind it.

------
rcpt
If you'd like to do this whole thing with open source tools, take a look at
[http://bokeh.pydata.org/en/latest/](http://bokeh.pydata.org/en/latest/) in
place of Plotly.

~~~
elliott34
But if you're setting up bokeh server for the first time...bring a friend and
some coffee

------
aheilbut
The take-home message is that there is a very large class (probably a
majority) of data analysis problems for which SQL databases are very well
suited. Though PostgreSQL would be a better choice.

And while we're at it, we may want to consider a) defining some indexes and b)
buying a little more RAM.
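
On (a), something along these lines would do it for the columns the notebook
filters and groups on (table and column names assumed, not from the notebook):

    # Sketch: index the filter/group-by columns; names are placeholders.
    import sqlite3

    con = sqlite3.connect('data.db')
    con.execute('CREATE INDEX IF NOT EXISTS idx_agency ON data (Agency)')
    con.execute('CREATE INDEX IF NOT EXISTS idx_created ON data (CreatedDate)')
    con.commit()

    # Verify the planner actually uses the index:
    for row in con.execute('EXPLAIN QUERY PLAN '
                           'SELECT count(*) FROM data WHERE Agency = ?',
                           ('NYPD',)):
        print(row)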

~~~
vegabook
SQL is useful when you have a use case that requires randomly querying the
dataset. This is a case of sequential scanning, in which the random-access
capabilities of any database (relational or not) are not used at all, yet
you're paying for them in _much_ lower sequential throughput and in indexing
overhead. HDF5 is far superior here.

For a vast data set (> 100 gig) one may argue that a database will allow you
to query "on disk", but that is not the case here, and if you're doing data
science you will quickly realise that even 32 gig in your machine is a small
amount of RAM. Whatever you gain in on-disk ability with an indexed database,
you'll lose because your "discovery/cleaning" wrangling workflow will be
frustratingly slow. The real answer is that 64 or 128 gig of RAM with HDF5 is
almost indispensable for the working "medium data" scientist (most people),
before we need to talk about Spark, Hadoop, YARN etc. for the ginormous,
properly "big data" cases.

As you say, RAM is where it's at. The real takeaway is that 128 gig of DIMMs
is cheap compared with the time you'll waste trying to do medium data on disk
with (No)SQL.
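
For anyone who wants to try the HDF5 route, a minimal sketch using pandas'
PyTables-backed HDFStore (file and column names invented); a 'table'-format
store even keeps some on-disk queryability without a database:

    # Sketch: write once to HDF5, then read back only what you need.
    import pandas as pd

    df = pd.read_csv('311.csv', parse_dates=['CreatedDate'])
    df.to_hdf('311.h5', key='complaints', format='table',
              data_columns=['Agency'])

    # Fast sequential reads, and `where` pushes the filter down to
    # PyTables so the scan happens on disk:
    nypd = pd.read_hdf('311.h5', 'complaints', where="Agency == 'NYPD'")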

------
Lofkin
This is a job for Blaze, an out-of-core Pandas/NumPy-style frontend to large
datasets.
[http://blaze.pydata.org/en/latest/](http://blaze.pydata.org/en/latest/)

------
hackerews
Hehe - this entire workflow can be done in Google Sheets (you don't even need
Excel) by querying the NY open-data API. No code necessary.

~~~
aw3c2
On the other hand: "This entire workflow can be done in Pandas and SQLite. No
proprietary third-party software or services necessary." I vastly prefer that.
:)

------
elchief
Best title change ever.

Can't we all just agree that "big data" requires at least 2 computers?

~~~
robmccoll
I'm going to say not necessarily.
[https://www.sgi.com/products/servers/uv/uv_2000_20.html](https://www.sgi.com/products/servers/uv/uv_2000_20.html)

------
elliott34
Plotly is awesome. A lot of companies spending $50k+ a year on BI software
could do better (and save a ton of money) by throwing Plotly iframes on a
simple static webpage and updating them at any interval using cron jobs and
Python.
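
The moving parts are small. A rough sketch of the cron job, using plotly's
offline HTML export (the query, paths, and schedule are all invented here):

    # Sketch: regenerate a standalone chart file; a static page iframes
    # it, and cron reruns this script on a schedule.
    import sqlite3
    import pandas as pd
    import plotly.graph_objects as go

    con = sqlite3.connect('data.db')
    daily = pd.read_sql_query(
        'SELECT date(CreatedDate) AS day, count(*) AS n '
        'FROM data GROUP BY day', con)

    fig = go.Figure(go.Scatter(x=daily['day'], y=daily['n']))
    fig.write_html('/var/www/html/charts/complaints.html')

    # crontab entry, refreshing hourly:
    # 0 * * * * /usr/bin/python3 /opt/reports/refresh_chart.py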

~~~
greggyb
I work in the business intelligence space as a consultant.

The cost of any BI project is dominated by the time and effort put into ETL
and data modeling. 80% of our effort goes toward this back-end work. With
this in mind, the presentation layer is essentially an afterthought. The
lion's share of BI cost goes to expensive humans. If you can spend a few
$100Ks for presentation tools that integrate seamlessly with the back-end
analysis services and streamline report design/publication, then it's a no-
brainer.

With technical employees/consultants at a premium, and in short supply
(especially in the data modeling space), it is worth $100Ks annually to have a
toolset that allows non- and minimally-technical employees to quickly and
easily build new reports.

Take a look at Microsoft's PowerBI dashboards, or at the visualization
capabilities in Tableau. The functionality there is what businesses want/need.

Those tools and Plotly equally require a solid data warehouse behind them to
support any meaningful and timely analysis. The marginal cost of a highly-
integrated and easy-for-non-technical-resources reporting layer is pretty low
after that investment.

~~~
elliott34
"a toolset that allows non- and minimally-technical employees to quickly and
easily build new reports"

Been there, done that.

This is the mantra of the BI system marketing machine that I've never actually
seen work in practice.

But I'm glad you've seen BI enterprise systems succeed where the end users are
happy and feel their toolset is flexible enough for the new daily challenges
they encounter (sincerely).

I love Tableau as an exploration tool and think the UX and visualization
capabilities are awesome. It's just not the solution end users wanted. I've
also rolled out a massive cloud-based enterprise BI system (data warehousing,
ETL, etc.) at a separate company. That wasn't the solution, either. Plotly
was. But, just my two cents.

~~~
greggyb
Microsoft's PowerBI platform is growing on me for an end-user UI.

You're definitely right that none of the enterprise solutions are completely
"there."

It still seems to me that "Export to Excel" is the strongest BI feature of any
tool. Pivot tables are ubiquitous. Throw any OLAP cube up on a server and host
a workbook on SharePoint and you're probably 60% of the way to a good BI
ecosystem.

^^ This is in terms of data vis and presentation ^^ The backend work still
dominates.

Fair disclosure: my company is a Microsoft Partner, so my primary exposure and
all of my work is in that ecosystem. For geek-cred, I run Arch and OpenBSD for
all my personal systems.

------
leereeves
> Big data analytics with Pandas and SQLite

> A Large Data Workflow with Pandas

> Data Analysis of 8.2 Million Rows with Python and SQLite

> This notebook explores a 3.9Gb CSV file

Big data?

~~~
rdtsc
Yeah, like others point out, this is small data. My rules are:

1) Does it fit in a desktop computer's memory? If it does, it is "small data".

2) Does it fit on a hard drive in a desktop? Then it is just "data". Or medium
data.

3) If you need a cluster or some centralized network storage to fit and manage
it, you might have big data.

4) Next level up is streaming data. Your data doesn't even sit anywhere
because it accumulates faster than you could ever process it.

~~~
semi-extrinsic
The best definition I've seen of Big Data is "the database indices don't fit
in memory on the beefiest single server you have access to". (And this doesn't
mean you can claim big data because you only have access to your 4 GB RAM
laptop.)

------
blumkvist
Aaaah, big data. Lovely.

And this from people who are marketing a visualization product. SMH...

------
Scott-S-McCoy
No one is going to comment on the table concisely named "data" in all of these
examples?

