
Oh Good Grep Web Grepper: A New Web Intelligence Feature From Blekko - krishna2
http://searchengineland.com/oh-good-grep-web-grepper-a-new-web-intelligence-feature-from-blekko-92730
======
acabal
I was super excited until I read that you have to submit requests to be voted
on. While I understand the difficulty (impossibility?) of having this kind of
service on-demand, I really would rather not have to submit my obscure and
possibly business-intelligence related queries to a community for voting. This
could be a game-changing service if they could somehow make it on-demand.

~~~
krishna2
It is on-demand, and the number of votes is not a prerequisite (though votes
do help). However, we do want to keep an eye on the number of grep jobs we can
run per day without affecting the performance of other services. You should
submit your grep!

PS: I am one of the engineers who built webgrep.

~~~
badclient
I second the OP. I had the "no f'ing way!" expression as I was reading the
post until I read the part about making a request. Big downer :(

~~~
krishna2
Thanks for the feedback, both you and the OP. We will make this clearer.

------
randomstring
Here is an example web grep that I ran to find the top ranked sites that used
kissmetrics.com for user tracking. Last month there was a huge blow-up over
kissmetrics possibly using ETAGs and other hacks to track users across
multiple sites.

[https://blekko.com/webgrep?page=view&id=3f469c08de300c12...](https://blekko.com/webgrep?page=view&id=3f469c08de300c1211884b86ce814b2a)

~~~
pyre
That grep is too generic to imply that someone was _using_ it vs just linking
to kissmetrics.com.

~~~
krishna2
Agreed. If you know the exact JS file or code snippet that kissmetrics
requires, then it would be an even better grep. Wanna take a stab?
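
For illustration, a tighter grep might look something like the sketch below.
It only matches pages that load a script from a kissmetrics.com host, not
pages that merely link to the site. The pattern, the sample URL, and the
helper name `uses_tracker` are all made up for this example; the snippet
kissmetrics actually ships may look different.

```python
import re

# Hypothetical pattern: a <script> tag whose src points at a kissmetrics.com
# host. This is an assumption, not the real tracking snippet.
SCRIPT_INCLUDE = re.compile(
    r'<script\b[^>]*\bsrc\s*=\s*["\'][^"\']*\bkissmetrics\.com/[^"\']*["\']',
    re.IGNORECASE,
)


def uses_tracker(html: str) -> bool:
    """Return True only if the page loads a script from kissmetrics.com."""
    return SCRIPT_INCLUDE.search(html) is not None


# A page that only links to the site (should not match).
linking_page = '<a href="http://kissmetrics.com/">a post about tracking</a>'

# A page that embeds a script from the site (should match).
# The script path here is invented for the example.
embedding_page = (
    '<script type="text/javascript" '
    'src="//i.kissmetrics.com/hypothetical.js"></script>'
)

print(uses_tracker(linking_page))
print(uses_tracker(embedding_page))
```

A regex like this still misses scripts that are injected dynamically, so it
would narrow the results rather than make them exact.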

------
binarymax
This looks very cool, but what I was really hoping for is a way to _finally_
enter a regular expression in a search box and get results back.

Looks like I'll need to wait a bit longer.

~~~
greglindahl
Grep is a mapjob which takes hours to run. You'll be waiting quite a while
until anyone can afford to quickly run regex queries against billions of
documents! And by then, there will be hundreds of billions of documents.
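
As a rough sketch of why this is a batch job: a grep mapjob has to touch
every stored page, shard by shard, and emit the URLs that match. The code
below is a toy, single-process illustration of that shape, not blekko's
implementation; `grep_map`, `run_grep`, and the sample shards are names and
data invented for the example.

```python
import re
from typing import Iterable, Iterator, List, Tuple

Page = Tuple[str, str]  # (url, html)


def grep_map(shard: Iterable[Page], pattern: str) -> Iterator[str]:
    """Map step: scan every page in one shard and yield matching URLs."""
    compiled = re.compile(pattern)
    for url, html in shard:
        if compiled.search(html):
            yield url


def run_grep(shards: Iterable[Iterable[Page]], pattern: str) -> List[str]:
    """Run the map step over every shard and concatenate the results.

    A real cluster would run the shards in parallel on many machines;
    here they run one after another to keep the sketch self-contained.
    """
    matches: List[str] = []
    for shard in shards:
        matches.extend(grep_map(shard, pattern))
    return matches


# Tiny made-up corpus split into two shards.
shards = [
    [("http://a.example/", "<p>hello world</p>"),
     ("http://b.example/", "<p>nothing here</p>")],
    [("http://c.example/", "<p>hello again</p>")],
]

print(run_grep(shards, r"hello"))
```

There is no index to consult here: the cost grows with the total number of
pages, no matter how few of them match.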

~~~
_delirium
Google Code Search _does_ do regexes against an impressively large set of
documents nearly instantly, though it's clearly much smaller than the set of
all webpages. It'd be interesting to know how much Google could scale it;
could they handle 100x the number of documents in the current code search?
10,000x?

~~~
tikhonj
One thing you should note is that Google Code Search, as far as I know,
supports regular expressions that are actually _regular_. This means you can't
have an expression like /(ab..)\1/, for example.

In all, re2, the regular expression engine that Google Code Search uses, is a
very interesting project; you should read about it on its Google Code page:
<http://code.google.com/p/re2/>.
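
To make the backreference point concrete, here is a small sketch using
Python's built-in `re` module, which is a backtracking engine and does accept
`\1`. An engine restricted to truly regular expressions, such as re2, rejects
this pattern. The sample strings are made up for the example.

```python
import re

# (ab..)\1 means: "ab" plus any two characters, followed by an exact repeat
# of whatever that group matched. Remembering the captured text is what
# takes the pattern outside the regular languages.
backref = re.compile(r"(ab..)\1")

repeated = backref.search("abcdabcd")      # "abcd" followed by "abcd"
not_repeated = backref.search("abcdabce")  # second half differs

print(repeated is not None)
print(not_repeated is not None)
```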

~~~
greglindahl
The issue is not so much the CPU time the regex evaluation takes up; it's the
I/O time of loading every byte of every page we've crawled.

That being said, re2 does look pretty cool... having a guarantee that nothing
in an re can blow up is pretty nice, on top of the overall speed improvement.
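
A back-of-envelope sketch of the I/O point: the time for a full scan is
roughly the total bytes stored divided by the aggregate read throughput. The
function name and every number below are illustrative assumptions, not
blekko's actual crawl size or hardware.

```python
def scan_seconds(num_docs: int, avg_doc_bytes: int,
                 num_disks: int, bytes_per_sec_per_disk: int) -> float:
    """Estimate the time to read every stored byte once.

    Assumes reads are spread perfectly evenly across all disks and that
    regex evaluation costs nothing, so this is a lower bound.
    """
    total_bytes = num_docs * avg_doc_bytes
    aggregate_throughput = num_disks * bytes_per_sec_per_disk
    return total_bytes / aggregate_throughput


# Hypothetical inputs: 1 billion pages at 100 KB each, read from
# 1,000 disks that each sustain 100 MB/s.
seconds = scan_seconds(
    num_docs=1_000_000_000,
    avg_doc_bytes=100_000,
    num_disks=1_000,
    bytes_per_sec_per_disk=100_000_000,
)
print(f"{seconds / 60:.1f} minutes at best")
```

Real jobs share those disks with other services and rarely get ideal
throughput, which is one way an ideal figure stretches into hours.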

------
diegogomes
Seeing how many sites have <script> X installed is amazing! Very cool feature.

------
alukasiewicz
Cool, wondering what people are going to grep for

~~~
krishna2
Here is a list of Greps that have already completed:
<http://blekko.com/webgrep?status=completed>

