
CUDA grep - ColinWright
http://bkase.github.io/CUDA-grep/finalreport.html
======
scott_s
Interesting work! But I'm unclear on the end-to-end performance comparison,
which includes the cost of copying data to and from the device.

I know you guys are eyeballs deep in your data, so what's obvious to you may
not be obvious to a reader. From what I can tell, your performance numbers are
reported as an average per regex. But you're calculating that without
including the string copying cost; you're only timing the cost of performing
the computation on the GPU.

You show the overhead in absolute numbers. It seems small compared to the cost
of processing the whole file, but I need to know the cost for grep to process
that whole file as well. Basically, you show me a number, but you don't
explain the significance of this number.

In my experience, the best way to deal with such issues is to always report
entire-application performance. (This can be in addition to other numbers, of
course.) That way I can say "Ah, calling grep took _x_ seconds on average,
calling their program took _y_ seconds on average, and _y_ < _x_ , so they've
improved performance." If you report entire-application performance, your
reader doesn't need to understand the particulars to at least determine if
your technique is a performance win.
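
For concreteness, here's the shape I mean as an untested sketch (the kernel is a trivial stand-in, not their matcher): the clock starts before the file is read and stops after the last copy back, which is exactly how you'd time grep.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <cuda_runtime.h>

    /* Stand-in kernel: counts bytes equal to 'a'. The real matcher goes here;
       the only point of this sketch is where the clock starts and stops. */
    __global__ void match_kernel(const char *text, size_t n, unsigned int *hits) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n && text[i] == 'a') atomicAdd(hits, 1u);
    }

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(int argc, char **argv) {
        double t0 = now_sec();                     /* start BEFORE reading the file */

        FILE *f = fopen(argv[1], "rb");
        fseek(f, 0, SEEK_END);
        size_t n = (size_t)ftell(f);
        fseek(f, 0, SEEK_SET);
        char *text = (char *)malloc(n);
        fread(text, 1, n, f);
        fclose(f);

        char *d_text;
        unsigned int *d_hits, hits = 0;
        cudaMalloc(&d_text, n);
        cudaMalloc(&d_hits, sizeof(unsigned int));
        cudaMemcpy(d_text, text, n, cudaMemcpyHostToDevice);   /* copy cost included */
        cudaMemcpy(d_hits, &hits, sizeof(unsigned int), cudaMemcpyHostToDevice);

        match_kernel<<<(unsigned)((n + 255) / 256), 256>>>(d_text, n, d_hits);
        cudaMemcpy(&hits, d_hits, sizeof(unsigned int), cudaMemcpyDeviceToHost);

        double t1 = now_sec();                     /* stop AFTER the copy back */
        printf("%u hits, end-to-end %.3f s\n", hits, t1 - t0);
        return 0;
    }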

~~~
bkase
Thanks for the advice. As soon as we get this project to run successfully on
CUDA 5, we'll update the finalreport document with the entire-application
performance numbers (hopefully before the end of tomorrow).

------
bkase
Well I woke up to a nice surprise!

Since people seem really interested in this, mburman and I are going to spend
the day fixing up this project.

This was a final project for 15-418 in Spring of 2012 (that was a great
class). We need to update it for the newer versions of CUDA, and we'll
probably clean up some of our code since we're both better programmers now.

Pull-requests always welcome!

------
Breakthrough
Always interesting to see applications/algorithms sped up with GPGPU
implementations. It looks like they get a great amount of speedup by parsing
each line in parallel (in the thread blocks), followed by parsing each
individual regular expression in separate blocks.

Admittedly, I assumed this was already the case after reading the article
title, since parsing each line in parallel would present a great deal of
parallelism with growing file sizes. Indeed, they also came to the same
conclusion:

 _we realized that it was a much better idea to parallelize across the lines
of the input file against which we had to match regexs_

That was also a great choice for future scalability, as no doubt you would get
greater parallelism with a growing input file (while also allowing
opportunities to speed up more complex expressions).
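
In sketch form (made up, not their code), the line-parallel structure is just one thread per line indexing into a shared text buffer; here the per-line "regex" is a literal needle standing in for the compiled automaton:

    /* One thread per line: line_starts[] holds the byte offset of each line,
       with one extra entry for the end of the buffer, so thread i scans line i
       independently. Assumes every line, including the last, ends in '\n'. */
    __global__ void grep_lines(const char *text, const int *line_starts,
                               int num_lines, const char *needle, int needle_len,
                               int *hit) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= num_lines) return;

        int start = line_starts[i];
        int len   = line_starts[i + 1] - start - 1;   /* drop the '\n' */

        hit[i] = 0;
        for (int j = 0; j + needle_len <= len; j++) { /* naive substring scan */
            int k = 0;
            while (k < needle_len && text[start + j + k] == needle[k]) k++;
            if (k == needle_len) { hit[i] = 1; break; }
        }
    }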

------
glurgh
It seems to be a project for

<http://dolores.sp.cs.cmu.edu/15418_spr13/>

Which answers the first question that comes to mind: 'why would anyone do this
for an I/O-bound process?' For homework, of course.

~~~
BruceIV
I think that getting decent speedup on this kind of non-obvious GPU problem is
an impressive result. Their presentation here is somewhat lacking in academic
rigor (I for one would have liked to see a description of how they got their
timing results, as well as some tests with a smaller source file, or possibly
many small source files), but given that they've provided the source code,
their results should be easy enough to verify.

~~~
glurgh
Oh, I didn't mean to imply there was anything wrong or obvious about it, or
for that matter, with homework.

------
haberman
Doesn't this approach require that you start with a sequential scan of the
file to break it into lines? And does the fact that you beat grep by 8-10x,
even after this up-front cost, then imply that such a scan is at least 10
times faster than an actual grep? If so, that's surprising; I thought that grep
was a lot closer to the raw cost of scanning a string.

~~~
Peaker
I am not an expert on CUDA, but couldn't you split the file into relatively
large chunks, find the nearest newline to the chunk points, re-split on those
positions, and use that division of labor?
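
Something like this on the host side, say (untested sketch):

    #include <stddef.h>
    #include <string.h>

    /* Split buf[0..len) into at most max_chunks pieces whose boundaries fall
       just after a '\n', so no line straddles two chunks. Returns the number
       of chunks; starts[i]/ends[i] give each half-open byte range. */
    int split_on_newlines(const char *buf, size_t len, int max_chunks,
                          size_t *starts, size_t *ends) {
        size_t approx = len / max_chunks + 1;
        size_t pos = 0;
        int n = 0;
        while (pos < len && n < max_chunks) {
            size_t end = pos + approx;
            if (end >= len) {
                end = len;
            } else {
                /* advance to the next newline so the cut lands between lines */
                const char *nl = (const char *)memchr(buf + end, '\n', len - end);
                end = nl ? (size_t)(nl - buf) + 1 : len;
            }
            starts[n] = pos;
            ends[n] = end;
            n++;
            pos = end;
        }
        return n;
    }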

~~~
diroussel
Yes, that is a good approach, and it was used in some of the faster
implementations of Tim Bray's Wide Finder competition.

<http://www.tbray.org/ongoing/When/200x/2007/09/20/Wide-Finder>

------
DigitalTurk
Very nice work!

It seems like this is only fast on large files, though, because the text needs
to be copied from main memory (RAM) to the GPU, which introduces latency. I
wonder what the latency would be like if this algorithm were instead run on the
kind of unified memory architecture that you see in e.g. the PS2 and Xbox One.

Also, I don't quite follow why they're compiling the finite automata on the
GPU. To me, their explanation that they didn't want to copy the automaton node
by node sort of sounds like there's a lot of room for optimization here. E.g.
maybe the regular expressions could be compiled to OpenCL code.

Then again, they did also find that pattern matching is a memory-bound problem,
so maybe emitting native code is pointless. Anyone know if there are regular
expression engines that emit native x86 code?
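
For instance (a made-up sketch, not their actual data structures), the pointer-based automaton could be flattened into an index-based array on the host and shipped to the device in a single cudaMemcpy, which avoids both per-node copies and compiling on the GPU:

    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* Hypothetical host-side NFA node using pointers. */
    typedef struct HostState {
        int c;                              /* byte to match, or a special tag */
        struct HostState *out, *out1;       /* successor states */
    } HostState;

    /* Flat, index-based version: safe to memcpy to the device in one shot. */
    typedef struct { int c; int out; int out1; } FlatState;

    /* Assumes the host states already live in one array of length n, so a
       node's index is just its offset in that array. */
    FlatState *flatten_and_upload(HostState *states, int n) {
        FlatState *flat = (FlatState *)malloc(n * sizeof(FlatState));
        for (int i = 0; i < n; i++) {
            flat[i].c    = states[i].c;
            flat[i].out  = states[i].out  ? (int)(states[i].out  - states) : -1;
            flat[i].out1 = states[i].out1 ? (int)(states[i].out1 - states) : -1;
        }
        FlatState *d_flat;
        cudaMalloc(&d_flat, n * sizeof(FlatState));
        cudaMemcpy(d_flat, flat, n * sizeof(FlatState), cudaMemcpyHostToDevice);
        free(flat);
        return d_flat;
    }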

~~~
dbaupp
Newer versions of PCRE have a JIT:
<http://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions#Features>
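
Roughly what turning it on looks like with the PCRE 8.2x C API (untested sketch; link with -lpcre):

    #include <pcre.h>
    #include <string.h>

    /* Returns 1 on match, 0 on no match, -1 on error. */
    int jit_match(const char *pattern, const char *subject) {
        const char *err;
        int erroff;
        pcre *re = pcre_compile(pattern, 0, &err, &erroff, NULL);
        if (!re) return -1;

        /* The study step with PCRE_STUDY_JIT_COMPILE is what triggers the JIT:
           the pattern gets compiled down to native machine code. */
        pcre_extra *extra = pcre_study(re, PCRE_STUDY_JIT_COMPILE, &err);

        int ovec[3];
        int rc = pcre_exec(re, extra, subject, (int)strlen(subject), 0, 0, ovec, 3);

        if (extra) pcre_free_study(extra);
        pcre_free(re);
        return rc >= 0 ? 1 : (rc == PCRE_ERROR_NOMATCH ? 0 : -1);
    }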

~~~
simonster
V8, SpiderMonkey, and jsc also all have regular expression JITs.

~~~
cglace
I believe pypy also has a regex jit.

~~~
kingkilr
Yup we do.

------
1amzave
Very cool overall. However:

> _Searching for hay in the haystack is not a common use case for a regular
> expression matcher._

Where'd that conclusion come from? Surely "does this pattern exist anywhere in
this string?" with the answer "no" can't be that unusual... I'm pretty sure I
do this fairly often, in fact.

EDIT: just realized I probably misinterpreted. Looking at the context more, I
suppose they're referring to the frequency of matches, not their presence or
absence. However, is 'grep -v commonpattern' (another not-too-unusual use-
case) kind of equivalent?

------
snaky
I suppose the fastest grep with no GPU support is beagrep -
<https://news.ycombinator.com/item?id=5645040>

~~~
atdrummond
jonstewart, you're hellbanned. Nobody can see your comments.

~~~
nkurz
Wow, that's one of the most egregious bans I've seen yet. He has one quippy
but non-offensive political comment from 3 years ago that was voted down to
-7, and then a series of solid on-topic technical comments every few months.
Maybe a mod could just restore him without requiring him to email asking for
forgiveness? He seems worth keeping around.

~~~
snaky
Am I the only one who doesn't have a clue what you're talking about and who
jonstewart is?

~~~
BruceIV
HN "hellbans" users who have karma that's below a certain threshold (0, I
think) - the site looks normal to them, but no one else can see their posts.
You can see the post if you turn on "show dead" in your profile.

------
UK-AL
There are already a few regular expression matchers built for GPU-based
computation. They often have modified finite state machine implementations in
order to exploit GPUs.

There's an NFA implementation called iNFAnt, which explores multiple paths
through the NFA simultaneously, avoiding the backtracking normally used to
simulate NFAs.
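
The core idea, in a toy breadth-first sketch (nothing like iNFAnt's actual data layout): keep the set of all currently active NFA states and advance them together on each input byte, so no path ever has to be retried.

    #include <stddef.h>

    /* Toy NFA with at most 64 states, so a state set fits in one 64-bit word.
       next[state][byte] is a bitmask of successor states; this layout is an
       assumption for the sketch, not iNFAnt's. */
    typedef struct {
        int nstates;
        unsigned long long next[64][256];
        unsigned long long accept;          /* bitmask of accepting states */
        unsigned long long start;           /* bitmask of start states */
    } Nfa;

    /* Advance every active state on each byte; no backtracking, each input
       character is examined exactly once. */
    int nfa_matches(const Nfa *nfa, const char *s, size_t len) {
        unsigned long long active = nfa->start;
        for (size_t i = 0; i < len; i++) {
            unsigned long long nextset = 0;
            unsigned char c = (unsigned char)s[i];
            for (int st = 0; st < nfa->nstates; st++)
                if (active & (1ULL << st))
                    nextset |= nfa->next[st][c];
            active = nextset | nfa->start;  /* allow a match to start anywhere */
            if (active & nfa->accept) return 1;
        }
        return 0;
    }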

------
clamb
Since you're streaming the input dataset, you may want to explore mmap()ing the
files, using cudaHostRegister()/cudaHostUnregister(), and letting the GPU
operate directly on CPU memory rather than copying it over. That should
saturate the PCIe bus, which is way faster than a disk.
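
Something along these lines (untested sketch; the kernel is a stand-in for the real matcher, and some driver versions are picky about registering read-only mappings):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Stand-in kernel; the real matcher reads `text` the same way, it just
       happens to be reading pinned host pages over PCIe. */
    __global__ void count_newlines(const char *text, size_t n, unsigned int *out) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n && text[i] == '\n') atomicAdd(out, 1u);
    }

    int main(int argc, char **argv) {
        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        size_t n = (size_t)st.st_size;

        /* Map the file, then pin + map the pages so the GPU can read them
           directly -- no cudaMemcpy of the text at all. */
        char *text = (char *)mmap(NULL, n, PROT_READ, MAP_PRIVATE, fd, 0);
        cudaHostRegister(text, n, cudaHostRegisterMapped);

        char *d_text;
        cudaHostGetDevicePointer((void **)&d_text, text, 0);

        unsigned int *d_out, lines = 0;
        cudaMalloc(&d_out, sizeof(unsigned int));
        cudaMemcpy(d_out, &lines, sizeof(unsigned int), cudaMemcpyHostToDevice);

        count_newlines<<<(unsigned)((n + 255) / 256), 256>>>(d_text, n, d_out);
        cudaMemcpy(&lines, d_out, sizeof(unsigned int), cudaMemcpyDeviceToHost);
        printf("%u lines\n", lines);

        cudaHostUnregister(text);
        munmap(text, n);
        close(fd);
        return 0;
    }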

------
jhdevos
I don't see what hardware they tested on. What are the specs of this GHC 5000
computer they mention?

