
Ask HN: Is there a search engine which understands regular expressions? - gnosis
Are there any general purpose search engines which allow searches to be made using regular expressions?<p>I know http://www.google.com/codesearch allows them, but it's a special purpose search engine for searching through source code.  I'm looking for a general purpose search engine, like Google/Bing, that can search through all the usual document types using regexes.
======
stevelosh
Regexes would be great, but I'd settle for a "raw" mode where the search
engine just _searches for the exact string_.

Example: putting the following into the Google search box:

    "*foo" "bar->baz"

Finds any page with foo, bar, and baz, even though I tried to tell it that
the asterisk and arrow were significant.

[https://encrypted.google.com/search?hl=en&q=%22*foo%22+%...](https://encrypted.google.com/search?hl=en&q=%22*foo%22+%22bar-%3Ebaz%22)

~~~
pbhjpbhj
Google is getting less and less usable for search for me because of something
akin to what you describe.

For example, I was searching for Groupon info in a particular market, and
Google returned results that merely mention "group" without telling me it was
doing so. It's really annoying when barely any of the returned results
contain, or even relate to, the word you're actually searching for.

I tried quoting "groupon", but that doesn't work; you have to negate the
search explicitly: 'groupon -"group on"'. Why don't they give you the option
to remove their guessed results (like "did you mean"), or simply limit
related/thematic results to searches using the ~ modifier? Grrr.

~~~
stevelosh
I'm pretty sure that for the single-word case you describe you can get around
that by using plus, e.g.

    +groupon +chicago

Unfortunately my example with special characters doesn't seem to work with
that trick.

~~~
pbhjpbhj
I used to use that trick on other engines, but Google seems to have shifted
on whether it's necessary. (All figures are from my local Google, not .com.)

    +groupon -"group on" = 177M pages
    groupon = 124M pages

Huh?

    +groupon = 81.2M pages
    +groupon +chicago = 20M pages
    +groupon +chicago -"group on" = 36.9M pages

But this gets altered (sometimes) to remove the "+" when additional terms are
included.

It's a mess.

------
pittsburgh
I've also tried to find a search engine that supports regex and have come up
empty. I hope somebody on this thread pleasantly surprises me, but I now doubt
one exists.

Since matching a regular expression against every document is so slow
compared with an indexed lookup, it's difficult to see how this could scale
to a dataset as large as the public web. There's also the problem of having
to protect against regex denial-of-service attacks:
<http://en.m.wikipedia.org/wiki/ReDoS>
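To make the ReDoS risk concrete, here's a minimal sketch in Python (my own toy illustration, not tied to any search engine): a nested-quantifier pattern that a backtracking matcher rejects in roughly exponential time.

```python
import re

# Classic ReDoS-prone pattern: the nested quantifiers give a backtracking
# matcher exponentially many ways to split a run of 'a's before it can
# reject a near-miss input.
EVIL = re.compile(r"^(a+)+$")

def is_rejected(n):
    """Match n 'a's followed by a stray 'b'; the cost of rejection grows
    roughly as 2^n, so even n around 30 effectively hangs the matcher."""
    return EVIL.match("a" * n + "b") is None
```

A search engine accepting raw regexes would have to defend against patterns like this, e.g. by using a non-backtracking (automaton-based) engine or by time-limiting each query.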

I've been able to (very partially) make up for the lack of regex support by
taking advantage of Google's operators and wildcards:

<http://www.googleguide.com/wildcard_operator.html>

<http://www.googleguide.com/advanced_operators.html>

Some examples:

    "solar|lunar eclipse 1700..1800"
    "William * Clinton"
    Columbus -Ohio -Georgia -Christopher

This is hardly a replacement for regex, but it's the best I've been able to
come up with.

------
smoove
I guess the main problem is that you can't really build an index for
arbitrary regexes; you would need to apply the search regex "live" to every
document the search engine knows about, and that will not scale at all.

Also, if you let users search with arbitrary regexes, it would be really easy
to overload the servers by submitting pathologically complex ones.

~~~
gnosis
Only a small minority of users even know what regexes are, and fewer still use
them.

I'm not sure that allowing regexes would put an undue burden on the search
engines. But if it ever becomes an issue, the search engine could easily deal
with the problem by simply slowing down the search if it contains a regex.

I'd happily wait 2x, 5x, or even 10x as long for my query to complete if I
could use a regex. For some important queries for which non-regex searches are
inadequate, I'd even be willing to wait hours or days, since the alternative
would be not being able to perform the search at all (or returning so many
false positives as to be useless).

~~~
mryan
<http://www.worldwidewebsize.com/> suggests Google is currently indexing
around 46 billion web pages. Running a regex across that amount of data would
be a lot more than 10x slower.

You say you would be happy to wait days for a result, but what incentive would
Google have to run long-running regex processing tasks, without showing you
any ads or gathering any useful info in the process?

I wish it would happen, but I can't see any incentive for the big players in
search to do it at the moment. Like you say, so few people would use it.

~~~
gnosis
_"Running a regex across that amount of data would be a lot more than 10x
slower"_

Would it really? I'd like to see some hard data on that.

 _"what incentive would Google have to run long-running regex processing
tasks, without showing you any ads or gathering any useful info in the
process?"_

What incentive does Google have for allowing regexes to be used in searches of
source code, which it already does?

It's useful, and it gets Google goodwill from its users. Plus, many of its own
employees probably benefit from it.

The number of users of Google's regex code search feature is probably no
greater than the number of people who'd use regexes in general search, perhaps
even smaller.

As far as ads go, I'd bet the vast majority of people who use Google's code
search engine run ad blockers and don't see any ads anyway. I very much doubt
that Google gets much if any profit from running it. And yet they do it.

~~~
smoove
>>Would it really? I'd like to see some hard data on that.

The process would be this:

-> User submits Regex

-> Google fetches all documents in its database (46 billion documents according to mryan). If we assume 1 KB of data per document (which is probably way too small), Google has just fetched 43,869 GB of data

-> Now Google somehow iterates over those 43,869 GB (we assume we have a lot of RAM, btw) and checks whether the regex matches each document

-> Search results are delivered to user (days later?)

I cannot give you any "hard facts", but the problem is that if you cannot
build an index, you have to look at each individual document. And in Google's
case, the number of documents is just way too high.
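The brute-force process described above can be sketched in a few lines of Python (the tiny corpus here is hypothetical; a real engine would hold billions of documents):

```python
import re

def regex_search(documents, pattern):
    """Naive regex search: with no index possible for arbitrary regexes,
    every document must be fetched and scanned, so each query costs
    O(total corpus size) -- the step that cannot scale to the web."""
    compiled = re.compile(pattern)
    return [doc_id for doc_id, text in documents.items()
            if compiled.search(text)]

# Hypothetical three-document "web".
CORPUS = {
    "a": "int *foo = bar->baz;",
    "b": "group on deals in Chicago",
    "c": "solar eclipse of 1706",
}
```

`regex_search(CORPUS, r"bar->baz")` returns `["a"]`, but only after scanning every document; an inverted index answers keyword queries without touching the corpus at all.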

------
pbhjpbhj
I've a vague recollection of using some limited subset of regular expressions
in a search engine, maybe around 1999-2000? However, I have a very bad memory
and could be confusing it with some specialist tech databases I used to
access.

Before I adopted Google I used Teoma, AllTheWeb, Magellan/Excite and probably
some others so it was possibly one of them? Anyone recall such a thing?

Edit: Looks like <http://www.searchlores.org/main.htm#exalead> (Exalead,
private beta) is doing regular expression search.

------
motochristo
<http://duckduckgo.com/> might help. I know it utilizes the bang syntax.

~~~
gnosis
Thanks, but that's really not the same at all.

Those are just predefined custom searches, a feature built in to Opera (my
browser of choice), and probably other browsers as well.

Unfortunately, custom searches are still limited to using whatever syntax the
search engines they query use, so if that search engine does not support
regexes, using a custom search (or "bang search") won't help.

They can still be a valuable search tool, but not what I'm looking for.

~~~
pmjordan
I suspect it's not that easy to graft them onto existing search engines
efficiently. I don't know what sort of data structure search engines usually
use, but to support regexes efficiently it would almost certainly have to be
a prefix tree of some sort.
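As a sketch of that idea (my own toy illustration, not how any real engine is built): a prefix tree can answer anchored queries like ^group.* without scanning a single document, though it does nothing for unanchored or more general regexes.

```python
class Trie:
    """Minimal prefix tree mapping indexed words to document ids."""

    def __init__(self):
        self.children = {}    # char -> child Trie node
        self.doc_ids = set()  # documents containing the word ending here

    def insert(self, word, doc_id):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.doc_ids.add(doc_id)

    def prefix_search(self, prefix):
        """Doc ids for every indexed word starting with `prefix`,
        i.e. the anchored regex ^prefix.* -- no document is scanned."""
        node = self
        for ch in prefix:
            if ch not in node.children:
                return set()
            node = node.children[ch]
        # Collect doc ids from the whole subtree below the prefix.
        result, stack = set(), [node]
        while stack:
            n = stack.pop()
            result |= n.doc_ids
            stack.extend(n.children.values())
        return result
```

Indexing "groupon" and "group" lets `prefix_search("group")` return both documents instantly; handling arbitrary regexes would still require something stronger, such as decomposing the pattern into indexable substrings.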

~~~
gnosis
<http://www.google.com/codesearch> allows them, so it is doable.

It may not be efficient for larger datasets and large numbers of users, but
that need not be a problem for the search engine owners, as they could just
take their sweet time in returning results.

~~~
pmjordan
It still costs them CPU/IO time and thus represents a possible DoS vector.

