

Ask HN: Why is there no regular expression search for the web? - bluegene

What's keeping Google/other search engines from implementing Regular expression search in particular?
======
lacker
The main problem is that you would need a totally different indexing system.

Roughly, search engines work in two phases: retrieval and scoring. Retrieval
is when you figure out which of the billions of documents in the index are
the top few thousand that could be worthy of being search results. Scoring is
when you look at each of those documents in more detail to figure out the
actual top ten.

Scoring based on regular expressions wouldn't be too tough. Retrieval is the
killer. Typically retrieval works based on "posting lists", which are
basically per-word indices listing the documents that contain each word. To
retrieve based on regular expressions, you would need posting lists for
individual characters or short sequences of characters. That would take a lot
more space.
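
A rough sketch of the difference, assuming a toy in-memory corpus. The
documents, function names, and the choice of 3-character sequences (trigrams)
are made up for illustration; this is not how any particular search engine
builds its index:

```python
from collections import defaultdict
import re

# Toy corpus; document IDs and contents are hypothetical.
DOCS = {
    1: "regular expressions are powerful",
    2: "search engines use posting lists",
    3: "regex search needs a different index",
}

def build_word_index(docs):
    """Word posting lists: each word maps to the set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index

def build_trigram_index(docs):
    """Character posting lists: each 3-character sequence maps to doc IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        lowered = text.lower()
        for i in range(len(lowered) - 2):
            index[lowered[i:i + 3]].add(doc_id)
    return index

def candidates_for_literal(trigram_index, literal):
    """Docs containing every trigram of `literal`.

    This is a superset of the true matches, so a real regex pass over the
    candidates is still needed afterwards.
    """
    literal = literal.lower()
    grams = [literal[i:i + 3] for i in range(len(literal) - 2)]
    if not grams:
        raise ValueError("literal must be at least 3 characters long")
    result = set(trigram_index.get(grams[0], set()))
    for gram in grams[1:]:
        result &= trigram_index.get(gram, set())
    return result

word_index = build_word_index(DOCS)
trigram_index = build_trigram_index(DOCS)

# The trigram index has many more keys than the word index, even on 3 documents.
print(len(word_index), len(trigram_index))
# A word index cannot answer "which docs contain 'gex'?"; the trigram index can.
print(candidates_for_literal(trigram_index, "gex"))
```

Even on three short documents the character-level index ends up with far more
keys than the word-level one, which is the space problem in miniature.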

You might be able to hack together some hybrid that would use existing posting
lists. For example, you could require that the regular expression contain a
whole word. But pure regular expressions would require a different index. That
sort of added complexity is not worth it for the feature.

------
curtis
One problem is that there's no easy way to build a regular expression index
for the web. In the general case the only way to do regex search is to scan
the entire content.

It might be practical to do a hybrid search -- a conventional word- or
phrase-based search to return a limited set of documents that can then be
brute-force searched using a regular expression. This could be especially
handy for programmers searching for code samples, a position I often find
myself in.
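
A minimal sketch of that hybrid idea, assuming a tiny in-memory set of pages.
The page keys, the contents, and the `keyword_search` stand-in are all
hypothetical; a real system would call out to an existing search index for the
first step:

```python
import re

# Hypothetical pages standing in for mixed content (blog, mailing list, Q&A).
PAGES = {
    "blog/1": "In Python, call re.sub(pattern, repl, text) to replace matches.",
    "list/2": "Python tip: str.replace(old, new) is faster for fixed strings.",
    "qa/3": "In Ruby, use gsub(pattern, replacement) for substitution.",
}

def keyword_search(pages, keywords):
    """Stand-in for a conventional search: keep pages containing every keyword."""
    wanted = [k.lower() for k in keywords]
    return {
        url: text
        for url, text in pages.items()
        if all(k in text.lower() for k in wanted)
    }

def hybrid_search(pages, keywords, pattern):
    """Narrow by keywords first, then brute-force the regex over the survivors."""
    regex = re.compile(pattern)
    candidates = keyword_search(pages, keywords)
    return sorted(url for url, text in candidates.items() if regex.search(text))

# Pages mentioning "python" that also contain a call into the `re` module.
results = hybrid_search(PAGES, ["python"], r"\bre\.\w+\(")
print(results)
```

The regex only ever runs over the pages the keyword step returned, so its cost
depends on the size of that candidate set rather than on the whole corpus.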

~~~
seiji
Regex capable code search: <http://google.com/codesearch>

~~~
curtis
True, but codesearch only searches codebases. But suppose I want to search for
mixed content and code. A lot of my programming related searches lead me to
StackOverflow, a mailing list entry, or a blog post.

------
zck
Imagine the added complexity that you would require to do that -- you'd need
to have more hardware than a general-purpose search engine. It's also
complicated to precalculate anything, as there isn't a list of regexes that
are more likely to be entered, unlike text (a dictionary).

Who would use the regex search? Only programmers. So your market is _tiny_
compared to a general-purpose search engine.

So more expensive queries that are harder to code up for many fewer people?
Sounds like a losing bet.

------
brudgers
Sounds like a promising idea for a YC application.

~~~
petervandijck
pg asks:

\- how will you make money?

\- how will you implement this cheaply enough?

\- who will really use this? what are they doing now instead?

~~~
brudgers
Initially, with a subscription model for people with an interest in search
results relevant to their purpose rather than suffering through search results
relevant to the purposes of advertising sales. This will allow controlled
growth to match resources to volume.

[edit] The right model might be a sort of meta-search engine - feed the regex
to something like Wikipedia to determine plausible keywords and then return
aggregated search results based on the keywords. At prototype and small scale,
actual search results could be aggregated from other search engines such as
Google or Bing.

[edit] Interestingly, Wikipedia already has regex capability built in to
AutoWikiBrowser.

[http://en.wikipedia.org/wiki/Wikipedia:AutoWikiBrowser/Regul...](http://en.wikipedia.org/wiki/Wikipedia:AutoWikiBrowser/Regular_expression)

~~~
petervandijck
pg then asks: how many people are willing to pay a subscription for this? How
do you know that?

And then pg says: I worry.. I worry..

