
Getting Started with MapD, Part 2: Electricity Dataset - jonbaer
http://randyzwitch.com/mapd-pjm-electricity-data/
======
qeternity
I want to believe that GPUs can accelerate databases, but I rarely see head-to-head
comparisons. Is 100ms for this query really that good? With only 4 million rows,
how would vanilla Postgres hold up, not to mention something like Citus or
MemSQL (both of which are at the core of my startup’s big data pipeline)?

~~~
felixge
100ms for this query is good compared to PostgreSQL (~4s when running the
query as-is, ~275ms when using a window function), considering that MapD
doesn't seem to utilize an index for it.

That being said, a 146MB CSV with 5 million rows is not a challenge for vanilla
PostgreSQL either. If you're willing to add an index and use window functions
(which MapD doesn't seem to support), you can run this query in ~34ms:
[https://gist.github.com/felixge/fffef4bc41a5e87e26f62c2a5c8e...](https://gist.github.com/felixge/fffef4bc41a5e87e26f62c2a5c8ed5bd)
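
For the general shape of that approach (a sketch, not the exact gist code - the
table name, column names, and the "peak load per zone" query are assumptions
here, not necessarily the post's exact query):

    # Sketch: index + window function in PostgreSQL, via psycopg2.
    # Table/column names (hourly_loads, zone_name, mw, ...) and the
    # "peak load per zone" query are assumptions.
    import psycopg2

    conn = psycopg2.connect("dbname=pjm")  # hypothetical connection string
    cur = conn.cursor()

    # An index matching the window's PARTITION BY / ORDER BY.
    cur.execute("""CREATE INDEX IF NOT EXISTS idx_zone_mw
                   ON hourly_loads (zone_name, mw DESC)""")

    # One scan with ROW_NUMBER() instead of a self-join or correlated
    # subquery to pick the peak-load row per zone.
    cur.execute("""
        SELECT actual_date, zone_name, hour_ending, mw
        FROM (SELECT *, ROW_NUMBER() OVER (
                  PARTITION BY zone_name ORDER BY mw DESC) AS rn
              FROM hourly_loads) t
        WHERE rn = 1""")
    print(cur.fetchall())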

I have no experience with MapD, but people often significantly underestimate
the performance of relational databases and reach for the big/expensive toys
too soon.

~~~
randyzwitch
You are right: as a demonstration, there is a trade-off between using a real
dataset and making the data large enough to demonstrate the performance gains.
You are proving the other part of my argument, though, in that you don't need
to add an index (the comparison against a window function in Postgres is fair,
since that would be the canonical way to write that query).

As far as the "big/expensive toys", I'm using the Community Edition, which is
FOSS.

~~~
felixge
I guess you're just trying to show how to use MapD rather than showing when it
makes sense to use it, which is cool. For the latter, I found the results over
here fairly interesting:
[http://tech.marksblogg.com/benchmarks.html](http://tech.marksblogg.com/benchmarks.html)

As far as "big/expensive toys" go, I was referring to total cost of ownership
of putting something like MapD in production vs PostgreSQL. The latter can run
on any machine (no GPU needed), there are plenty of hosted services, and there
is a large pool of people familiar with it. If this is just about running some
analysis on a local machine, that argument doesn't apply so much :).

~~~
tmostak
Note that MapD can run on CPU as well and is generally quite fast. I haven’t
benchmarked it, but on a dataset of this size the numbers might be similar.

I’d also say that while this is a great example of getting up and running with
MapD, the system really shines when doing interactive queries over datasets in
the 100M to 100B record range, without requiring indexes.

------
aw3c2
The exact same query takes 60-70ms in SQLite for me. This is not a very good
MapD advertisement...

edit: Actually, I get this (uncached) in the sqlite3 CLI (the 60-70ms were
via sqlitebrowser):

    Run Time: real 0.018 user 0.018164 sys 0.000135

Here is the database if you want to play around with it.
[https://transfer.sh/VkWXe/sqlite.db.gz](https://transfer.sh/VkWXe/sqlite.db.gz)
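
If you want to reproduce the timing, something like this does it (a sketch;
the table name and query here are placeholders, not the post's exact query):

    # Rough timing harness against the shared sqlite.db.
    import sqlite3
    import time

    conn = sqlite3.connect("sqlite.db")
    query = "SELECT zone_name, MAX(mw) FROM hourly_loads GROUP BY zone_name"  # placeholder

    start = time.perf_counter()
    rows = conn.execute(query).fetchall()
    print(f"{len(rows)} rows in {time.perf_counter() - start:.4f}s")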

~~~
arnon
GPU databases have higher latency, due to the need to copy data to and from
the GPU. The processed data has to end up in host RAM to be sent to the
client - it can't be sent directly from GPU RAM.

The GPU really only helps when you need higher throughput - e.g., you're
analyzing many billions of rows, or running a relatively compute-intensive
query.
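
Back-of-envelope, assuming ~12 GB/s of effective PCIe 3.0 x16 bandwidth (an
assumption, not a measurement):

    # Transfer cost for the post's 146 MB dataset alone, before any
    # query work happens on the device.
    csv_size_gb = 0.146        # the 146 MB CSV from the post
    pcie_gb_per_s = 12.0       # assumed effective PCIe 3.0 x16 bandwidth
    print(f"~{csv_size_gb / pcie_gb_per_s * 1000:.0f} ms per direction")  # ~12 ms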

~~~
aw3c2
Yeah, I know. Maybe the post(s) should not talk about performance for those
badly fitting examples then.

~~~
niviksha
Hi - I took a look at the sqlite db. Not to nitpick, but you have 4 indexes
there?
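
They show up in sqlite_master, e.g.:

    # List the indexes present in the shared sqlite.db.
    import sqlite3

    conn = sqlite3.connect("sqlite.db")
    for name, tbl in conn.execute(
            "SELECT name, tbl_name FROM sqlite_master WHERE type = 'index'"):
        print(name, "on", tbl)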

The overhead of indexes at tiny data sizes like this is minimal, so it's no
surprise that SQLite or pg will compare favorably at laptop scale. (FWIW, I ran
this on a 4-core MacBook Pro with no GPU (i.e. CPU only) and without indexes;
MapD runs the query in ~1.5 sec while SQLite takes ~6 sec.)

However, the bigger point is about how this experience scales. Take a look at
this public demo -
[https://www.mapd.com/demos/ships](https://www.mapd.com/demos/ships). It's 11
billion geo data points, and you'll see the same low-millisecond response that
FridgeSeal is talking about - again, no indexing. (To verify, go to dev tools
in your browser, turn on SQLLogging(true) in the console, and look at the query
times reported for the round trip.) I'm sure you appreciate that the overhead
(both creation and maintenance) of indexing scales with data size, plus
SQLite/pg don't (AFAIK) offer anything by way of GPU-rendered charting.

~~~
aw3c2
Of course I use indexes. It's utterly silly not to do so.

My point was that the submitted post did not make a convincing argument pro
MapD when query times are that much slower than a trivial SQLite database.

~~~
niviksha
I'm not questioning your use of indexes. I'm questioning the basis of the
unqualified assertion that 'query times are slower than a SQLite database' -
with indexes off, and both running on CPUs, the basis of comparison isn't
tilted one way or the other, and then it isn't clear at all that MapD is 'much
slower'.

------
aw3c2
I meant to post some SQLite and Postgres numbers, but the CSV that the code
generates (via "wget
http://www.pjm.com/pub/operations/hist-meter-load/{1993..2018}-hourly-loads.xls")
is garbage:

    2010-01-15 00:00:00,DPL,24,2017.425
    ...
    ,1/3/2017,1,RTO
    ...
    7/5/2017,VMEU,22,97.495

~~~
randyzwitch
The Python code works for Python 3 and pandas v0.22.0. There are some rows
that are mangled, but my assumption was that this was an artifact of reading
the Excel files, and I didn't go back and fix every single row in the dataset.

    ACTUAL_DATE,ZONE_NAME,HOUR_ENDING,MW
    1993-01-01 00:00:00,AE,1,855.0
    1993-01-01 00:00:00,BGE,1,2358.0
    1993-01-01 00:00:00,DPL,1,1150.0
    1993-01-01 00:00:00,GPU,1,3772.0
    1993-01-01 00:00:00,JCPL,1,1632.0
    1993-01-01 00:00:00,METED,1,929.0
    1993-01-01 00:00:00,PECO,1,3015.0
    1993-01-01 00:00:00,PENLC,1,1211.0
    1993-01-01 00:00:00,PEPCO,1,2137.0

~~~
aw3c2
I tried both Python 2.7.14 and Python 3.6.4, with pandas==0.22.0 and xlrd==1.1.0.
[https://transfer.sh/2T4fw/hourly_loads.csv.gz](https://transfer.sh/2T4fw/hourly_loads.csv.gz)
is what I get.

It looks like dates are stored not as dates but as strings in the more recent
spreadsheets.

edit: dateutil's parser.parse(x) & strftime("%Y-%m-%d") helps
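
Something along these lines (a sketch; the column-shifted garbage rows still
need separate handling):

    # Normalize the mixed date formats (ISO timestamps vs. "1/3/2017")
    # to plain YYYY-MM-DD, per the dateutil approach above.
    import pandas as pd
    from dateutil import parser

    df = pd.read_csv("hourly_loads.csv")

    def clean_date(x):
        try:
            return parser.parse(str(x)).strftime("%Y-%m-%d")
        except (ValueError, OverflowError):
            return None  # column-shifted garbage rows end up here

    df["ACTUAL_DATE"] = df["ACTUAL_DATE"].map(clean_date)
    df.to_csv("hourly_loads_clean.csv", index=False)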

