

Processing 6 billion records in 4 seconds on a home PC - rck
https://github.com/antonmks/Alenka/blob/master/green.md

======
aiiane
Hm. I'm looking at this and "6 billion in 4 seconds" seems misleading - the
test it appears to be referring to is "Query 6", which (a) only examines
records from 1994, (b) runs on a table whose entries are sorted by timestamp,
such that (c) only the portions of the data that fall in the correct time
range are actually sent for processing.

In other words, it isn't actually looking at a full 6 billion records for
that query. More representative is the next query discussed, "Query 1", which
takes 72 seconds to scan a much more significant portion of those 6 billion
records.

It's still a pretty impressive set of numbers (as one would expect from GPU
SIMD processing), but it irks me when short descriptions bend the facts to try
to sound more significant. (To say nothing of the disk-time subtraction.)

~~~
seabee
If it makes you feel better (worse?), even the big guys do this; I recall
reading SAP HANA marketing materials that did something similar to the OP[0].
With enough precomputation and knowledge of which data to skip, you too can
achieve better-than-linear performance; though at least they aren't suggesting
they are churning through the entire dataset in a few seconds.

[0] http://www54.sap.com/content/dam/site/sapcom/global/usa/en_us/assetmgt/docs/2012/08-aug/27/09/sap-hana-performance.pdf

------
MichaelGG
"time is counted as total processing time minus disk time."

Can anyone explain why that's a valid benchmark for him to use? Surely the
Hadoop version also had significant disk access time?

~~~
Bill_Dimm
I'm guessing: total time will vary dramatically depending on whether the
database is on HDD, SSD, or in memory, so he is separating out that component.
Of course, he might be optimizing something that doesn't help much if it takes
forever to read the data from disk.

~~~
dragontamer
Not just the disk: transferring the data over to the GPU is often a
bottleneck too.

If there is back-and-forth traffic between the GPU and CPU over the PCIe bus,
things will slow down considerably.
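
To make that concrete, here is a minimal CUDA sketch (buffer size, threshold,
and the kernel are illustrative, not from Alenka) that times the PCIe
host-to-device copy against a trivial filter kernel over the same buffer; on
typical hardware the copy dominates:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Trivial filter: count elements below a threshold.
    __global__ void count_below(const float* x, int n, float t, unsigned* out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && x[i] < t) atomicAdd(out, 1u);
    }

    int main() {
        const int n = 1 << 26;  // 64M floats = 256 MB
        float* h = (float*)malloc((size_t)n * sizeof(float));
        for (int i = 0; i < n; ++i) h[i] = (float)(i % 100);

        float* d; unsigned* d_out;
        cudaMalloc((void**)&d, (size_t)n * sizeof(float));
        cudaMalloc((void**)&d_out, sizeof(unsigned));
        cudaMemset(d_out, 0, sizeof(unsigned));

        cudaEvent_t t0, t1, t2;
        cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

        cudaEventRecord(t0);
        cudaMemcpy(d, h, (size_t)n * sizeof(float), cudaMemcpyHostToDevice);
        cudaEventRecord(t1);
        count_below<<<(n + 255) / 256, 256>>>(d, n, 24.0f, d_out);
        cudaEventRecord(t2);
        cudaEventSynchronize(t2);

        float copy_ms, kernel_ms;
        cudaEventElapsedTime(&copy_ms, t0, t1);   // PCIe transfer time
        cudaEventElapsedTime(&kernel_ms, t1, t2); // compute time
        printf("copy %.1f ms, kernel %.1f ms\n", copy_ms, kernel_ms);
        return 0;
    }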

------
idle_processor
>NVidia Titan GPU : it is a relatively cheap, massively parallel GPU

Relatively cheap compared to having to build clusters, perhaps, but $1000
isn't cheap for a desktop computing GPU.

A mid-high tier card (GeForce GTX 770) is closer to $400. A mid-range gaming
card (GTX 760) is closer to $260.

Those finding the topic link interesting may also be interested in this CUDA
radix sorting article[0] from 2010, as it featured "one billion 32-bit keys
sorted per second."

[0] https://code.google.com/p/back40computing/wiki/RadixSorting
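
As an aside, that back40computing sort later fed into CUB and Thrust's radix
sort, so a rough version of that benchmark is only a few lines today; a
minimal sketch (the key count is illustrative):

    #include <cstdio>
    #include <cstdlib>
    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/sort.h>

    int main() {
        const size_t n = 1 << 28;  // 268M 32-bit keys, ~1 GB
        thrust::host_vector<unsigned> h(n);
        for (size_t i = 0; i < n; ++i) h[i] = (unsigned)rand();

        thrust::device_vector<unsigned> d = h;  // host -> device copy

        cudaEvent_t start, stop;
        cudaEventCreate(&start); cudaEventCreate(&stop);
        cudaEventRecord(start);
        thrust::sort(d.begin(), d.end());  // radix sort for primitive keys
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        printf("sorted %zu keys in %.0f ms (%.2f Gkeys/s)\n",
               n, ms, n / ms / 1e6);
        return 0;
    }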

~~~
jd007
He's probably talking about the professional line of graphics cards (e.g.
Tesla), where $1k is considered quite cheap. The Titan is the first
consumer-oriented card from nVidia that can be used as a pro-grade card (no
throttling on compute tasks), so maybe that's what the OP was talking about.

Of course, for gaming the Titan is not considered cheap at all.

------
DannyBee
"* - time is counted as total processing time minus disk time. "

So, in other words, i subtracted time that both _actually_ have to spend, for
no good reason.

In order to further make results look better, I only subtracted it from my
database, instead of running tests myself and subtracting it from both.

~~~
antonmks
The disk time was subtracted because it is an in-memory system: once your
compressed datasets fit into memory, your disk subsystem will hopefully be
irrelevant.

~~~
DannyBee
So in other words, it makes the system look better.

Look, disk time counts, whether it's Hadoop loading data into the memory of a
given machine or you reading it from disk and transferring it to the GPU
piece by piece.

Your "hopefully it will be irrelevant" is, well, crazy. I work for an employer
with plenty of in-memory systems (very large ones in fact), and it certainly
doesn't discount disk time. In fact, it matters a lot!

~~~
antonmks
Disk time matters only the first time you read your data. Subsequent queries
won't have to read from the disk: the compressed data may sit in memory for
days, and the queries will never touch the disk. Now, if your compressed data
doesn't fit into memory, then disk speed would of course matter a lot; on
that I agree with you.
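
For what it's worth, here is a minimal sketch of the caching pattern being
described (the file name and layout are hypothetical, not Alenka's actual
storage format): the compressed column is read from disk exactly once, and
every later query is served from the host-memory copy.

    #include <cstdio>
    #include <cstdlib>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Hypothetical host-side cache: column name -> compressed bytes.
    static std::unordered_map<std::string, std::vector<char>> g_cache;

    // Returns the compressed column, touching the disk only on first use.
    const std::vector<char>& load_column(const std::string& name) {
        auto it = g_cache.find(name);
        if (it != g_cache.end()) return it->second;  // warm: no disk I/O

        FILE* f = fopen(name.c_str(), "rb");         // cold: read once
        if (!f) { perror(name.c_str()); exit(1); }
        fseek(f, 0, SEEK_END);
        long sz = ftell(f);
        fseek(f, 0, SEEK_SET);
        std::vector<char> buf((size_t)sz);
        fread(buf.data(), 1, (size_t)sz, f);
        fclose(f);
        return g_cache.emplace(name, std::move(buf)).first->second;
    }

    int main() {
        load_column("lineitem.l_shipdate.cmp");  // first query pays disk cost
        load_column("lineitem.l_shipdate.cmp");  // served from memory
        return 0;
    }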

~~~
DannyBee
You are making a lot of assumptions about working-set sizes, etc. In any case,
even if it's paid only once, it is still a cost you are paying, and a cost
Hadoop is paying, and it is completely wrong to simply subtract it out when
comparing performance.

------
staunch
There's nothing about the Hadoop model that precludes the use of GPUs instead
of CPUs. Hadoop solves the problem of storing massive quantities of data and
processing it using a large number of machines. There's no reason the
processing can't be done using GPUs.

~~~
sandfox
A Hadoop cluster of GPU- and SSD-equipped machines is not going to be
cheap.... (but it would be fun)

------
cartick
First, Hadoop has two parts: HDFS and MapReduce. This so-called benchmark
compares only the computation part. People who say Hadoop is slow have never
really understood what Hadoop is: MapReduce is meant for processing big data
in a batch-oriented way, not for real-time analytics. However, there are many
technologies that work on top of Hadoop and provide real-time analytics
capabilities, like HBase and Impala. Column-oriented storage is available in
Hadoop too (Parquet). Also, with Hadoop, the real power comes with the
availability of UDFs and streaming. Please don't do any stupid benchmark like
this without knowing what you are comparing against.

------
brandynwhite
I've been using Hadoop for 4 years now (author of hadoopy.com), so I'll chime
in. I'll state the use cases that Hadoop/MapReduce (and, to a close
approximation, the ecosystem around them) were developed for, so that we're on
the same page: 1.) save developer time at the expense of inefficiency
(compared to custom systems), 2.) really huge data (several petabytes), 3.)
unstructured data (e.g., webpages), 4.) fault tolerance, 5.) shared cluster
resources, and 6.) horizontal scalability. Basically, people already had that
and wanted easier queries, so the ecosystem has been pulled that way for a
second generation: 1.) Pig/Hive and 2.) Impala and others.

Of the 6 design considerations I listed, none of them are really addressed
here. If you outgrow a single GPU, you face a huge performance penalty to grow
further (that's vertical scaling). If you want to write your own operations
(very common), this would be impractical.

It's a nice idea, but it'd be better to compare against things like MemSQL and
the like, which have been designed from first principles for fast SQL
processing. I'd recommend just dropping the Hadoop/HBase comparisons and
comparing within the same class; Hive is embarrassingly slow even in the class
it's in (compare it to Google's Dremel/F1 or Apache Impala).

~~~
antonmks
Comparing against Impala doesn't change anything; Impala is still well behind,
even on a cluster, with just 600 million records:
https://docs.google.com/spreadsheet/ccc?key=0AgQ09vI0R_wIdEVMeTQwZGJSOVQwcFRSRFFFUmcxWWc#gid=6

Your other considerations are still valid, though. The point here was to show
the inefficiency of Hadoop/MapReduce when it comes to relational operations.

------
quizotic
The two most interesting things about this article to me were unstated.

1. The TPC-H benchmark is measured in price-for-performance ($/QphH, dollars
per query-per-hour). At 4 seconds for Q6, he's getting ~900 queries per hour
(3600 s / 4 s). The cost of his rig is probably ~$2k, so he's at roughly $2
per QphH ($2,000 / 900). The top TPC-H scores are around $0.10, but under $10
is pretty good for a first go.

2. The standard knock against GPU processing is the time it takes to load GPU
memory: GPU processing may be blazing fast once the data is in device memory,
but there was an MIT paper last year claiming you couldn't load the GPU fast
enough to keep up. Evidently, he's keeping up.
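
The standard way to keep up is pipelining: split the input into chunks and
overlap the PCIe copies with kernel execution using pinned host memory and
CUDA streams. A minimal sketch (chunk count, sizes, and the kernel are
illustrative):

    #include <cuda_runtime.h>

    // Stand-in for whatever per-chunk query work you actually do.
    __global__ void process(const float* x, int n) { /* ... */ }

    int main() {
        const int chunks = 8, n = 1 << 24;  // 8 chunks of 16M floats
        float* h;                           // pinned memory enables async copies
        cudaMallocHost((void**)&h, (size_t)chunks * n * sizeof(float));
        // ... fill h with input data ...

        float* d[2];                        // double-buffered device memory
        cudaMalloc((void**)&d[0], (size_t)n * sizeof(float));
        cudaMalloc((void**)&d[1], (size_t)n * sizeof(float));

        cudaStream_t s[2];
        cudaStreamCreate(&s[0]); cudaStreamCreate(&s[1]);

        for (int c = 0; c < chunks; ++c) {
            int b = c & 1;  // alternate buffers and streams
            cudaMemcpyAsync(d[b], h + (size_t)c * n, (size_t)n * sizeof(float),
                            cudaMemcpyHostToDevice, s[b]);
            process<<<(n + 255) / 256, 256, 0, s[b]>>>(d[b], n);
        }
        cudaDeviceSynchronize();
        return 0;
    }

With two streams alternating, the copy for chunk c runs while the kernel for
chunk c-1 executes, so most of the transfer time hides behind compute.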

With regard to comparing his performance to Hadoop/Hive: yeah, it's apples and
oranges, but he's in good company. Hadapt, Hortonworks Stinger, Cloudera
Impala, Spark/Shark and others all rate themselves on how many times faster
they are than Hive.

And frankly, I don't buy the whole "the point of MR is for huge, horizontally
scaling networks" argument. If you factor out Yahoo!, Facebook, Amazon,
LinkedIn and a few others, the largest remaining Hadoop clusters are all WELL
south of 1000 nodes. And most run on homogeneous high-end hardware.

------
shousper
So, I found this from back in 2011:
http://www.tomshardware.com/news/ibm-patent-gpu-accelerated-database-cuda,13866.html
However, I couldn't find any commercial or even (active) open-source projects
on this topic. It seems like something that would be valuable to businesses
working with big data, so what's the holdup? Has nobody reached this scale
yet? Is it still too expensive? I don't get it... Maybe I'm overthinking it.

~~~
sendob
http://wiki.postgresql.org/wiki/PGStrom

