
Fast Database Emerges from MIT Class, GPUs and Student’s Invention - signa11
http://data-informed.com/fast-database-emerges-from-mit-class-gpus-and-students-invention/
======
tmostak
Hi all, MapD creator here - I'd be happy to answer any questions. The 700K
figure was done by rendering polygon files to textures and using them as
lookup tables - it ran pretty fast on the CPU, so it's not all the GPU there.
However, if a point falls on a border - perhaps 1% of cases for, say, a
medium-sized raster of the US - you have to do a geometric lookup as usual,
which I didn't benchmark. Probably better would be to do some tiled
implementation - basically something like textured quad-trees.
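
To make that a bit more concrete, the lookup step boils down to something like
the following CUDA sketch (purely illustrative - simplified names and grid
layout, not the actual MapD code):

    // Illustrative sketch only, not MapD code: polygons are assumed to have
    // been rasterized into a regular grid covering the data's bounding box,
    // with each cell holding a region id, OUTSIDE_CELL, or BORDER_CELL (cells
    // that touch a polygon edge and need the exact geometric test later).
    #include <cstdint>

    #define OUTSIDE_CELL -1
    #define BORDER_CELL  -2

    __global__ void lookupRegion(const float* lon, const float* lat, int numPoints,
                                 const int32_t* grid, int gridW, int gridH,
                                 float minLon, float minLat,
                                 float cellW, float cellH,
                                 int32_t* regionOut)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= numPoints) return;

        int cx = (int)((lon[i] - minLon) / cellW);
        int cy = (int)((lat[i] - minLat) / cellH);

        if (cx < 0 || cx >= gridW || cy < 0 || cy >= gridH) {
            regionOut[i] = OUTSIDE_CELL;
            return;
        }
        // One memory read per point; the ~1% of points landing in BORDER_CELL
        // cells fall back to the usual exact point-in-polygon test.
        regionOut[i] = grid[cy * gridW + cx];
    }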

And as I told the journalist, I wasn't sure if I did the PostGIS indexing right
- I feel like the quote got taken a little out of context - but the speedups
were "high" to "quite high" regardless.

~~~
throwaway1979
Heh ... when I saw the speedup numbers, the first thing I wondered was ...
index? Regardless, this is impressive work. Do you have a tech report you can
point to for someone to get more details? I'd love to learn more about your
setup. Also, would you mind describing your programming background? I have a
CS PhD and I admit I never had the guts to get into GPUs! You've given me a
whole lot of motivation!

~~~
tmostak
Thanks! Yeah, I'd like to revisit this benchmark soon (I'm actually working on
rendering choropleths on the GPU now). I'm working on a database paper that
will hopefully be ready for VLDB submission by next month - that would
probably be the best place to get the details.

In terms of CS background - nothing formal except for AP comp sci, which I took
in high school like 12 years ago. However, I've always loved hacking on things
on the side, and a combination of my research on the Middle East and a burning
desire to make an iPhone app kind of got me back into it. Then, to be fair, I
took a GPU graphics class (basically OpenGL) at Harvard and then a database
course at MIT while doing my Master's in ME Studies, which is where this
project started.

------
btilly
The same story was submitted to /., and the person who wrote the database
answered several questions about it there. See
[http://search.slashdot.org/story/13/04/22/2225240/harvardmit...](http://search.slashdot.org/story/13/04/22/2225240/harvardmit-student-creates-gpu-database-hacker-style)
for more.

------
tmostak
(MapD creator again) - Also, check out <http://worldmap.harvard.edu/tweetmap>
for a live demo - even though it's been a bit buggy under Slashdot load and I
had to disable GetFeatureInfo requests (i.e. click to get tweet). Also see
here for a demo of a new interface I'm messing with that does animation -
<http://www.youtube.com/watch?v=Foo8pYbSPv4> .

~~~
intended
And roughly half an hour ago, I was hoping for something that could help me
map GIS data and animate the changes over time so that I could see before and
after effects.

Look forward to seeing this released.

~~~
tmostak
Is your data relational? I can import anything from Postgres, MySQL, Solr (but
it has to be flattened), or shapefiles. If you tell me more, perhaps I could
put it into the system for you and we could try it out.

~~~
zzleeper
Hi Todd! Do you have an email? (mine is sergio.correia@duke.edu )

I'm working on a project linking bank credit with the geographic distance
between individuals/firms and bank branches. It's a fairly big dataset (~4MM
geocoded individuals with matched credit data) and MapD would seem like a cool
tool to visualize how the data changes with time.

Cheers

~~~
tmostak
Hi, yes, I'm tmostak at csail dot mit dot edu (does that fool the spambots
these days?). Yeah, email me and we can chat!

------
contingencies
Hrrm. Since it's light on details, how about zooming out for some random
conjecture about industry trends.

Notion: the current environment definitely has a faster-evolving database-layer
landscape than the previous decade did. It follows that, for longer-term
systems, database-schema-naive architecture becomes far more attractive. How
exactly should we achieve that?

For all the 2005-enterprise-2.0-ishness of it, message-orientation (i.e. on-
the-wire protocol first) seems a fair play. This seems to hark back slightly
toward the traditional (waterfall) approach of a nominal pre-coding design
time - an extroverted design orientation that focuses on understanding inputs
and outputs before coding. Contrast the modern fire-and-forget webdev norm:
model some data in some can-abstract-databases-as-long-as-they-fit-a-prior-
model language-specific framework, and iterate.

Some of this is just common sense but perhaps we are about to see a shift in
architectural paradigm. Of course, the modelling/code generation people have
been harping on about this for years!

------
laurentoget
He is not the first person to try to implement a GPU-based database, and this
is not a trivial undertaking by any means.

<http://gpgpu.org/tag/databases>

Does anybody know any source with more details on this work?

~~~
sendob
Not his, but I found this interesting:
[http://www-cs.ccny.cuny.edu/~jzhang/papers/ACMGIS10_Extended...](http://www-cs.ccny.cuny.edu/~jzhang/papers/ACMGIS10_Extended.pdf)

------
sendob
Curious to learn more as the specifics come out.

I am familiar with PG-Strom, which used Postgres FDWs with CUDA; it had some
pretty severe limitations last I looked. I also remember someone doing a sort
implementation on CUDA on the mailing list, which was pretty slick. Tim Child
also did some presentations (I never saw code personally; would love a link if
anyone has one) on doing OpenCL as a procedural option, which I think would
still be really cool, and potentially easier to deal with as an extension
where applicable.

I am going to be very interested to see what is possible with Postgres 9.3's
writable FDWs and potentially something like this. It may be that PostGIS will
find its way to similar performance given access to the same tech.

35K LOC is compact - can't wait to see the code!

------
downrightmike
Slides from the conference where he won the prize:
[https://cga-download.hmdc.harvard.edu/publish_web/Annual_Spr...](https://cga-download.hmdc.harvard.edu/publish_web/Annual_Spring_Workshops/2012_Africa/presentations/Todd_Mostak.pdf)

Didn't see a link to a published thesis.

------
eelsen
The claim of a 700,000x speedup makes me suspicious of pretty much everything
else about the work.

~~~
danbruc
The question is what kind of processing he did. Usually database systems are
heavily constrained by the I/O bandwidth of the system - even the cheapest
consumer CPU is able to process data orders of magnitude faster than disks are
able to deliver new data.

That is why database servers usually have the fastest disks money can buy, as
much memory as you can fit in order to keep as much data as possible in main
memory, and the largest possible caches to avoid hitting the relatively slow
memory bus as often as possible. The situation becomes much worse if the
database is not read-only, because all changes have to be persisted and you
have to hit the disk on every change.

Therefore I think using GPUs will buy you nothing for common database
applications - processor speed was never your problem. But this does not
exclude the possibility that there are operations, for example correlating
millions of data points from your data set, where the processing power of
GPUs comes in handy - but this is most likely not what you will see in your
line-of-business applications.

The number 700,000 is probably the result of a highly tuned GPU implementation
versus a general-purpose CPU implementation. A quad-core Core i7 at 3.0 GHz
has a theoretical peak performance of 96 GFLOPS [1]; the currently fastest
supercomputer peaks at 27,112,500 GFLOPS (Titan, 560,640 cores, 8.2 MW [2]).
700,000 times the mentioned Core i7 is still more than twice the performance
of Titan: 67,200,000 GFLOPS.

[1] [http://stackoverflow.com/questions/15655835/flops-per-cycle-...](http://stackoverflow.com/questions/15655835/flops-per-cycle-for-sandy-bridge-and-haswell-sse2-avx-avx2)

[2] <http://www.top500.org/lists/2012/11/>

~~~
reeses
It appears from the graphs (yeah, I know) that he's probably doing trig (to
locate the tweet) and collision detection (to determine the area of the city).
He could also have more control over single-precision algorithm choice than he
would have via his original solution. It could be something as simple as pre-
calculating the geolocation with a parallel CUDA kernel during ETL into
PostGIS and bypassing its index generation.
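
For illustration, such a precalculation kernel could be as simple as one thread
per tweet doing a brute-force even-odd test against every district polygon - a
hypothetical sketch, not anything from the actual project:

    // Hypothetical ETL kernel, one thread per tweet: brute-force even-odd
    // (ray-crossing) test against every district polygon. Polygon vertices are
    // flattened into vx/vy, with polyStart[p]..polyStart[p+1] marking polygon p.
    __global__ void assignDistrict(const float* px, const float* py, int numPoints,
                                   const float* vx, const float* vy,
                                   const int* polyStart, int numPolys,
                                   int* districtOut)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= numPoints) return;

        float x = px[i], y = py[i];
        int hit = -1;

        for (int p = 0; p < numPolys && hit < 0; ++p) {
            bool inside = false;
            int begin = polyStart[p], end = polyStart[p + 1];
            // Count how often a ray going in +x from the point crosses an edge.
            for (int a = begin, b = end - 1; a < end; b = a++) {
                if (((vy[a] > y) != (vy[b] > y)) &&
                    (x < (vx[b] - vx[a]) * (y - vy[a]) / (vy[b] - vy[a]) + vx[a]))
                    inside = !inside;
            }
            if (inside) hit = p;
        }
        districtOut[i] = hit;   // -1 if the tweet falls in no district
    }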

I also have an itchy feeling that there may have been some confusion regarding
conversions between "times" and "percent", possibly multiple times. [1]

Extrapolating that an iterative solution would take 40 days, without running
it for 40 days, is tricky as well. Given multiple tweets in close proximity,
PostGIS should benefit from caching the R-tree index.

I'm not assuming the inventor is doing these things, as his DB prof was
impressed, as well as CSAIL and the body awarding his prize. [2] I think, if
anything, the reporting is intended to be high-level enough to have broader
appeal, which makes it more difficult for us to evaluate the results in a
Gedankenversuch. "Reduce slowness here..."

[1] The sidebar says, 'Some statistical algorithms run 70 times faster compared
to CPU-based systems like MapReduce,' while the quote in the article says,
'“The speed ups over PostGIS … I’m not an expert, and I’m sure I could
have been more efficient in setting up the system in the first place, but it
was 700,000 times faster,” Mostak said. “Something that would take 40 days was
done in less than a second.”'

[2] Hey, logical fallacies do not necessarily mean the argument is wrong!

------
pbbakkum
Very cool project! I'm glad to see this area is getting some attention. I did
research in GPU databases a couple summers ago at NEC; if anyone is interested,
the project is at <https://github.com/bakks/virginian> .

Some background mixed with some opinion:

This is an area still without a major commercial competitor and there aren't
yet any such public product efforts as far as I know (IBM has a couple of
researchers publishing on the topic). The difficulty is that really exploiting
the GPU requires rethinking architecture from the ground up, including how
your query plan handles parallelism, how your data is stored, and most
significantly, how to efficiently move data to the GPU for processing. Most
databases aren't partitioned nearly well enough to shoehorn GPU processing in
on the backend, and it seems like the vast majority of recent database
development has been around highly distributed systems and commodity hardware,
e.g. Hadoop. A German company called ParStream has a DB that uses GPUs for
certain indexing operations, though last time I spoke to them it sounded like
there is much, much more that can be done with GPU hardware.

In general, speedup is highly variable and depends on the data set, indexing
characteristics, data type (fixed vs. variable length), individual query, etc.,
but it's fairly common to see 10x speedups over optimized multicore query
execution. Though the extreme parallelism of the GPU is responsible for some
of this, in my opinion the big win there is just absolute memory throughput.
Newer Tesla GPUs have a memory throughput of > 100 GB/s without breaking a
sweat, while most CPUs are around 30 GB/s. If you can push data through your
hardware that quickly, it's easy to see how the GPU can have such an advantage.
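
Back-of-envelope, with my own illustrative numbers rather than anything from
the projects above - the bandwidth gap alone already accounts for a sizable
constant factor on a purely bandwidth-bound column scan:

    // Rough illustration: time to stream a single 4 GB column, assuming the
    // scan is purely memory-bandwidth bound (numbers are illustrative only).
    #include <cstdio>

    int main() {
        const double columnGB = 4.0;    // e.g. one billion 4-byte values
        const double gpuGBps  = 100.0;  // newer Tesla-class GPU
        const double cpuGBps  = 30.0;   // typical multicore CPU

        printf("GPU scan: ~%.0f ms\n", columnGB / gpuGBps * 1000.0);  // ~40 ms
        printf("CPU scan: ~%.0f ms\n", columnGB / cpuGBps * 1000.0);  // ~133 ms
        return 0;
    }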

The difficulty, however, is actually moving the data into GPU memory (1-8 GB
generally) to exploit this throughput, which is a major bottleneck (some of my
research was around doing this better). I understand that GPU direct disk
access is just becoming a reality, which could make this easier. The holy
grail, however, is a full-powered integrated server CPU/GPU. If both of these
components have equivalent main-memory access and this architecture becomes
cheap and commoditized, then GPU databases will immediately become extremely
relevant.
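
As a sketch of one standard mitigation (not necessarily what any of the
projects mentioned here do): chunk the data, stage it in pinned host memory,
and overlap the host-to-device copies with kernel execution on alternating
CUDA streams, roughly like this:

    // Sketch: overlap host-to-device copies with per-chunk processing using two
    // CUDA streams and pinned host memory. process() stands in for whatever
    // query operator runs on each chunk; error checking omitted for brevity.
    #include <cuda_runtime.h>
    #include <cstring>

    __global__ void process(const int* chunk, int n, long long* out) {
        // ... per-chunk query work would go here ...
    }

    void scanColumn(const int* hostData, size_t totalElems, size_t chunkElems) {
        int* hPinned;
        cudaMallocHost(&hPinned, 2 * chunkElems * sizeof(int));  // pinned staging
        int* dChunk[2];
        long long* dOut[2];
        cudaStream_t stream[2];
        for (int s = 0; s < 2; ++s) {
            cudaMalloc(&dChunk[s], chunkElems * sizeof(int));
            cudaMalloc(&dOut[s], sizeof(long long));
            cudaStreamCreate(&stream[s]);
        }

        for (size_t off = 0, c = 0; off < totalElems; off += chunkElems, ++c) {
            int s = (int)(c % 2);
            size_t n = (totalElems - off < chunkElems) ? totalElems - off
                                                       : chunkElems;
            // Make sure the previous chunk staged in this slot has been copied
            // off before we overwrite it, then stage the next chunk in pinned
            // memory so the copy below can run asynchronously.
            cudaStreamSynchronize(stream[s]);
            memcpy(hPinned + s * chunkElems, hostData + off, n * sizeof(int));
            cudaMemcpyAsync(dChunk[s], hPinned + s * chunkElems, n * sizeof(int),
                            cudaMemcpyHostToDevice, stream[s]);
            process<<<(unsigned)((n + 255) / 256), 256, 0, stream[s]>>>(
                dChunk[s], (int)n, dOut[s]);
        }
        cudaDeviceSynchronize();
        // ... copy results back, combine, and free device/pinned memory ...
    }

Two buffers and two streams are the minimum needed to keep the copy engine and
the compute units busy at the same time.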

I expect that there will be a serious commercial competitor in this area
within 3 years, either in the relational data warehousing space (think
Vertica) or closely integrated into MapReduce-type execution. My guess as to
the reason it doesn't exist yet is that 1) production DB development is really
hard and 2) GPU processing is only a really big win for certain workloads,
namely heavy analytical/research workloads over fixed-size data. As I said,
though, if GPUs become commoditized in server hardware or fully integrated
with CPUs, I expect this domain will quickly become extremely important. This
is a very cool area, and it's common to see 10x speedups even over optimized
CPU implementations.

Always happy to chat about this stuff, feel free to shoot me a message.

~~~
tmostak
Actually I've enjoyed reading your papers and would love to talk... will email
you!

------
im3w1l
Why are CPUs built like CPUs and not like GPUs? I mean, this running-things-on-
the-GPU business clearly means that the CPU does not meet the needs of
programmers adequately.

Shoehorning things to run on the GPU seems to me a dirty hack, indicating that
there is an underlying problem that could and should be addressed in a better
way.

~~~
Groxx
Because most consumer software is not as insanely parallel as doing millions
(billions?) of isolated lookups. If you ran most applications on the GPU,
you'd see something like a 1000x _slowdown_ because you're only able to use a
single path (instead of hundreds / thousands at a time), and each path is way
way slower than your average CPU.

~~~
im3w1l
>Because most consumer software is not as insanely parallel as doing millions
(billions?) of isolated lookups.

I think this is more due to tradition than to necessity. But I have no hard
data or sources for this.

~~~
Groxx
I doubt this is so. Lots of consumer software ends up being I/O bound, not CPU
bound - even if they were completely rewritten to be crazy-parallel, you
wouldn't notice. Good portions of games could probably be parallelized
significantly more than they are, but they're quite a bit different from most
software, so that's not very indicative.

Then there's stuff like parsing data. Lots of times data is context-sensitive
- that "<item>" could be a tag, or it could be wrapped in <![CDATA[ <item>
]]>. Without reading through it in order, you can only make certain
assumptions. Some of this can be redone in a more parallel-friendly way, some
of it not so much - if it's encrypted or compressed, you're probably forced
to pre-process it sequentially, or lose some of the quality of the security /
compression.

For those (and think of how much crypto and compression there is in consumer
stuff! Every SSL web endpoint, secure wifi connection, zipped file...), a
powerful-core CPU or a custom piece of hardware (that only does a few
algorithms) is your only option.

~~~
im3w1l
>I doubt this is so. Lots of consumer software ends up being I/O bound, not
CPU bound - even if they were completely rewritten to be crazy-parallel, you
wouldn't notice.

So for these programs it wouldn't matter either way.

>Then there's stuff like parsing data

How many states can you be in, exactly? You can be in a CDATA, you can be in a
quoted attribute, you can be in a tag, you can be between tags. Most other
states are very local, such as an ampersand declarator ( &l ), a tag name
started ( <it ), etc. You could process a text snippet for all of the more
global possibilities at the same time (or just chance it and throw your guess
out if you are wrong).

I suppose you are right about compression and crypto.

------
yangyang
Alenka is another open-source GPU database engine that looks quite
interesting:

<https://github.com/antonmks/Alenka>

------
corresation
Fantastic project, and this absolutely sounds like an embarrassingly parallel
processing task. To the original itch, so to speak:

"Then he plotted the Islamist indicators from 40 million tweets, ranging from
August 2011 through March 2012, against 5,000 political districts from the
Egyptian census."

So the core problem is that you have 5,000 non-overlapping polygons and you
want to determine, from an (x, y), which one each point falls within? I have
to confess that I'm surprised that just 40 million such points would take
several days.

~~~
Someone
Let's assume a naive approach. Also, assume each point falls within a polygon.

For every point to test, you need, on average, to test half the polygons.
Testing a point against a polygon means testing the point against, on average,
about half the polygon's edges. Each of those tests takes a multiplication, an
addition, and a comparison.

That's 4E7 points times 2,500 polygons times, at least, 1.5 edges. Total:
1.5E11 multiplication/addition/comparison steps.

With 100-sided polygons, it would be 1E13 or so
multiplication/addition/comparison steps.

Memory pressure is dominated by polygon storage: 5,000 polygons at 100 points
each is 500,000 edges at two doubles each, or 8 MB. That fits comfortably in
cache.

So, 1E9 multiplication/addition/comparison steps per second should be possible.
That's 10,000 seconds, or about 3 hours.

=> Should be doable on one CPU in hours (famous last words)
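
For reference, the work being counted above is roughly this naive,
single-threaded loop (an illustrative sketch only, using the usual even-odd
crossing test):

    // The naive approach being estimated: for each point, walk the polygons and
    // run an even-odd (ray-crossing) containment test.
    #include <vector>
    #include <cstddef>

    struct Polygon { std::vector<double> x, y; };   // closed ring of vertices

    // Even-odd rule: does (px, py) fall inside poly?
    bool contains(const Polygon& poly, double px, double py) {
        bool inside = false;
        std::size_t n = poly.x.size();
        for (std::size_t i = 0, j = n - 1; i < n; j = i++) {
            if (((poly.y[i] > py) != (poly.y[j] > py)) &&
                (px < (poly.x[j] - poly.x[i]) * (py - poly.y[i]) /
                          (poly.y[j] - poly.y[i]) + poly.x[i]))
                inside = !inside;
        }
        return inside;
    }

    // ~4E7 points against up to 5,000 polygons: the cost counted above.
    void assignAll(const std::vector<double>& px, const std::vector<double>& py,
                   const std::vector<Polygon>& polys, std::vector<int>& out) {
        for (std::size_t i = 0; i < px.size(); ++i) {
            out[i] = -1;
            for (std::size_t p = 0; p < polys.size(); ++p)
                if (contains(polys[p], px[i], py[i])) { out[i] = (int)p; break; }
        }
    }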

~~~
corresation
It really is a fascinating-sounding problem, and I'd really like to see the
source data (regions and points), if I'm understanding the problem right, to
discern alternative approaches.

