
Ask HN: Should I used a lexer and parser for this? - Trindaz
A technical question I'm hoping some experienced hackers can help with:<p>My django application has a search feature for searching articles. It needs to handle logic features (AND, OR, (), etc.) <i>and</i> search articles accordingly using django's ORM tools <i>and</i> be used to create a regular expression that's used in my python code for certain searches that aren't performed at the database level. I've read about lexers such as ply - but I'd be interested to hear about anyone else who's had a similar problem. Take logic expressions, then transform them into something a database can use for searching, and then transform them into something else entirely for regex searches.
======
jerf
You don't need a specialized lexer for this. A regex to split on key words
(AND, OR) and symbols (parens, whatever else you've got) is sufficient. A
parser would be helpful, _if_ you really, _really_ think your customers are
going to run complicated queries against your database with nested OR clauses
and serious use of parentheses and stuff. You may have such customers, they do
exist, but bear in mind the average user has no idea what the difference
between "A AND (B OR C)" and "(A AND B) OR C" is; if you don't know why your
users are above average and picky, you might consider something very, very
simple instead.

But let's go on assuming that's not the case, because it's certainly a fair
possibility. The next step is to find The Tutorial that all parsers _always_
have like it's some sort of law of the universe, the Arithmetic Expression
parser. This covers basic operator precedence (multiplication and division vs.
+/-), parenthesized expressions, hopefully the unary negation operator, and
hopefully at least sketches the evaluation code for the resulting parse tree.
Good news! This is almost exactly what you want, except multiplication is AND,
addition is OR, and unary negation is NOT. You just need to tweak your regular
expression to either emit "wordA wordB wordC" as a series of tokens which your
parser will add together, or with just a bit of work on your regex, they'll
come out as one token. Take your time transforming the grammar, one step at a
time; make it so it takes "3 AND (1 + 3)" instead of "3 / (1 + 3)" first,
little steps, because the error messages that come out of parsers can be a bit
confusing at first and you want your changes to be minimal.

This may take a few hours to wrap your head around, but it's a very easy use
case for parsing, and the time spent learning how to really, truly parse is
time well spent.

Oh, and be careful with the generated regular expressions; there are some
DOSes that can be done with those, and be sure you do something sane with
asterisks and other special characters. If your regex engine supports the
equivalent of \Q and \E in perl, use it.

------
amccloud
You should checkout <http://haystacksearch.org/>

