
Ask HN: What would be the fastest way to grep Common Crawl? - freediver
The most recent Common Crawl includes 80TB of WET (extracted text) files from the latest web crawl.

Assuming you have the files locally, what would be the fastest way to "fgrep" (string search) through them?

Testing with ripgrep on my 10-core iMac Pro, I get about 6 seconds to grep through a 20GB file. That means about 5 minutes per TB, or almost 7 hours for Common Crawl. What setup would I need to do it in <100ms? :)
======
burntsushi
ripgrep author here.

When you want to search TBs of data, then you have to ask: what do you need to
do? If you only need to search the data set once for a single query, then
ripgrep (or similar) is probably your shortest path in terms of end-to-end
time to get your results.

If, however, you want to run many queries against an unchanging or growing
data set at this scale, then I think you probably want to find some way to
index it beforehand. Tools like SOLR or Elasticsearch are fairly standard and
shouldn't cost you too much in terms of learning how to use them. You can
perhaps go faster than Elasticsearch/SOLR/Lucene if there is something special
about your problem that permits you to design a custom indexing strategy. But
that requires knowing more about your search goals and also costs more to
develop (probably).
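
Roughly, the shape of that (an untested sketch using the Elasticsearch Python
client; the "commoncrawl" index name and document layout are made up for
illustration):

    # Untested sketch: index WET text into Elasticsearch, then query it.
    # Assumes a node on localhost and a made-up "commoncrawl" index/doc shape.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    def index_record(doc_id, url, text):
        # One document per WET record; Lucene builds the inverted index.
        es.index(index="commoncrawl", id=doc_id,
                 document={"url": url, "text": text})

    def search_phrase(phrase):
        # A phrase query is the closest analogue to fgrep'ing a literal string.
        resp = es.search(index="commoncrawl",
                         query={"match_phrase": {"text": phrase}}, size=10)
        return [hit["_source"]["url"] for hit in resp["hits"]["hits"]]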

Otherwise, if you're willing to pay, then spin up a bunch of machines with
lots of RAM and partition the data set such that each partition fits in memory
of a single machine. Then parallelize your search. (This is probably a middle
ground between grepping the whole thing on a single machine and using
something like fulltext indexing.)
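
As a toy, single-machine stand-in for that middle ground (in reality each
partition would sit in RAM on its own server and the fan-out would go over the
network; paths and names here are hypothetical):

    # Toy stand-in for "partition in RAM, search partitions in parallel".
    # In a real deployment each partition lives on its own machine; here a
    # local process pool and made-up paths are used for illustration.
    from multiprocessing import Pool

    def scan(args):
        needle, path = args
        with open(path, "rb") as f:
            data = f.read()            # one partition, assumed to fit in RAM
        # bytes.find stands in for whatever scan loop (memchr, SIMD) you use
        return path if data.find(needle) != -1 else None

    def parallel_search(needle, partition_paths):
        # One worker per partition; with one partition per server this fan-out
        # would go over the network instead of a local pool.
        with Pool() as pool:
            results = pool.map(scan, [(needle, p) for p in partition_paths])
        return [p for p in results if p is not None]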

~~~
freediver
Thanks for the reply. This is mostly a thought experiment: can you 'grep the
web' (without indexing, etc.)?

If you split 24TB across ~128 servers (so filling up 256 GB of RAM per
server), you would have to search 256 GB of RAM in 100ms, or ~2.5 TB/sec per
server. Unfortunately, most RAM will only give you around 20 GB/s (DDR4-3200:
25.6 GB/s per channel). Multi-core setups do not help much, since the cores
share memory channels (for example, AMD's latest 64-core 'monster' has only 4
shared memory channels). So it seems you would need an insane number of
servers (15,000 with ~2GB of RAM each) to do a 'grep of the web' in 100ms. Am
I missing something?
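
Back-of-envelope, with the effective bandwidth as the big assumption:

    # Back-of-envelope for the numbers above; the ~20 GB/s effective scan
    # rate per server is the assumption doing all the work.
    corpus_tb = 24
    deadline_s = 0.1
    bandwidth_gb_s = 20

    per_server_gb = bandwidth_gb_s * deadline_s    # ~2 GB scannable in 100ms
    servers_needed = corpus_tb * 1000 / per_server_gb
    print(f"{per_server_gb:.0f} GB per server, {servers_needed:,.0f} servers")
    # -> 2 GB per server, 12,000 servers: same ballpark as the ~15,000 above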

~~~
burntsushi
I don't think you're missing anything. As I said, it's a middle ground. :-)

~~~
freediver
Thanks for jumping in! What do you use to alert you to these threads?

~~~
burntsushi
Nothing. I just search HN from time to time and check the threads I'm
participating in. :-)

------
freediver
Update: splitting the files into 100MB chunks increases search speed by 4x, so
it's 1.5 seconds to grep through 20GB. Smaller chunk sizes do not yield further
improvement.
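
For reference, roughly what the experiment looks like (hypothetical paths;
ripgrep parallelizes across files rather than within one file, which is where
the speedup comes from):

    # Sketch of the chunking experiment; paths are hypothetical. ripgrep
    # assigns whole files to worker threads, so many ~100MB chunks keep all
    # cores busy where one 20GB file would be scanned by a single thread.
    import os
    import subprocess

    def split_into_chunks(big_file, chunk_dir, chunk_mb=100):
        # Same effect as `split -b 100M`; note a match that straddles a
        # chunk boundary can be lost, just as with byte-based split.
        os.makedirs(chunk_dir, exist_ok=True)
        size = chunk_mb * 1024 * 1024
        with open(big_file, "rb") as f:
            for i, chunk in enumerate(iter(lambda: f.read(size), b"")):
                with open(f"{chunk_dir}/chunk_{i:05d}", "wb") as out:
                    out.write(chunk)

    def grep_chunks(pattern, chunk_dir):
        # -F searches for a literal string, like fgrep.
        out = subprocess.run(["rg", "-F", "--count-matches", pattern, chunk_dir],
                             capture_output=True, text=True)
        return out.stdout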

~~~
eesmith
Your machine's RAM bandwidth is about 20GB/second, right?

It looks like you're close to that limit.

The only way to go faster is to not search everything. Which means pre-
computing, or otherwise knowing more about the structure of what you plan to
search.
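
For instance, a crude sketch of pre-computing in that sense (a trigram
prefilter; nothing like what a real index such as Lucene does, and everything
here is made up for illustration):

    # Crude illustration of pre-computing: build a trigram -> chunk-id map
    # once, then only scan chunks that could possibly contain the query.
    # Nothing like a real inverted index; purely illustrative.
    from collections import defaultdict

    def trigrams(s):
        return {s[i:i+3] for i in range(len(s) - 2)}

    def build_index(chunks):
        # chunks: iterable of (chunk_id, text); done once, ahead of queries.
        index = defaultdict(set)
        for chunk_id, text in chunks:
            for t in trigrams(text):
                index[t].add(chunk_id)
        return index

    def candidate_chunks(index, query):
        # Only chunks containing every trigram of the query need a real scan.
        sets = [index.get(t, set()) for t in trigrams(query)]
        return set.intersection(*sets) if sets else set()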

