Regular expressions are strings formatted using a special pattern notation that allow you to describe and parse text.
"Parse" is a questionable word choice. A proper Regular Expression can only describe patterns found in "regular languages", though modern backtracking regex engines have a bit more power. Parsing implies a parser, which means matching a grammar somewhat more sophisticated than a regex engine can handle.
Regular languages can be described by regular grammars, and regular expressions are a notation for regular grammars. So you can indeed parse text with reference to a regular expression.
Most people would call a parser that used only strictly regular expressions a lexical analyser, lexer, or tokenizer, but this is only a convention to simplify language definition and parser implementation when using simple techniques like LALR or LL.
But other uses may not need that level of sophistication. For example, a packet analyser in a router might use a dynamically updated state machine (a DFA), built from multiple regular expressions, for efficient dispatching of packets based on rules. I don't think it would be wrong to describe that component of the router as a packet parser.
For example, this (to me) blindingly obvious patent seems to be using a modified DFA to make routing decisions, and calls it a parser:
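That dispatch idea is easy to sketch in plain Python (the rule names and patterns below are invented for illustration; a real router would compile all the rules into a single DFA rather than trying patterns one at a time):

```python
import re

# Hypothetical rule table: each rule is a name plus a compiled pattern.
RULES = [
    ("ipv4", re.compile(r"(?:\d{1,3}\.){3}\d{1,3}")),
    ("mac",  re.compile(r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}")),
]

def classify(token):
    """Return the name of the first rule whose pattern matches the whole token."""
    for name, pattern in RULES:
        if pattern.fullmatch(token):
            return name
    return "unknown"
```

Python's re backtracks per pattern, so this only shows the interface; the efficiency argument above depends on merging the rules into one automaton.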
There seems to be this emerging pattern in my career where I stumble across all the people in the world who have decided to try and validate HTML using regular expressions.
That poor bastard.
Really? So, we can parse HTML with regular expressions nowadays?
I think a more appropriate word choice would be "recognize" instead of "parse", with regard to the OP of the thread.
My only quibble: I'd promote the character classes (especially \s and maybe \S) into this list from the "expert" category. Here's part2 (or what I call: stuff I can google later, if I need it), if anyone wants to keep reading: http://immike.net/blog/2007/06/21/extreme-regex-foo-what-you...
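For anyone skimming: \s matches any whitespace character and \S anything that isn't, so (in Python's flavour, at least) pulling words out of messy whitespace is one line:

```python
import re

text = "foo \t bar\nbaz"
# \S+ grabs maximal runs of non-whitespace; splitting on \s+ would be the complement.
words = re.findall(r"\S+", text)
print(words)  # ['foo', 'bar', 'baz']
```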
Many programmers seem to think that they possess just the right amount of knowledge. Someone who knows Unix utilities but not algorithm design will argue that Unix utilities are vital, but algorithm design is a random obscure topic; someone with the opposite skillset will defend it just as eloquently. I think programmers have no obligation to know anything. If you can do the job, I don't care how much you use Google.
I'm spoiled enough to think that boring things shouldn't be deliberately memorized. It's enough for me to know the mathematical concept of a regular language, which lets me recognize situations where regexes would help. For concrete syntax we always have cheatsheets. This attitude has its advantages: for example, no one ever had to explain to me why parsing XML with regexes is a bad idea.
+ A lot of programmers aren't familiar with the concept of a regular language, especially those that are hobbyists, or who went to shitty schools. Sad but true.
+ Regular expressions are useful commonly enough that I would be very surprised if an okay programmer with any sort of experience under their belt couldn't remember at least  . * ? + .
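For reference, those four in action (a quick Python sketch; the semantics are the same in most flavours):

```python
import re

# '.'  any single character (except newline)
# '*'  zero or more of the preceding item
# '+'  one or more of the preceding item
# '?'  zero or one of the preceding item
assert re.fullmatch(r"a.c", "abc")
assert re.fullmatch(r"ab*c", "ac")     # '*' is happy with zero 'b's
assert re.fullmatch(r"ab+c", "abbc")   # '+' needs at least one
assert not re.fullmatch(r"ab+c", "ac")
assert re.fullmatch(r"ab?c", "abc")    # '?' makes the 'b' optional
```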
I definitely agree with your main point. There is, however, a weird subjective layer of knowledge that tends to be present in developers at any experience level. That doesn't mean I would look down on an individual programmer who claimed experience but couldn't write a regex character class off the top of their head, though I would be a bit surprised.
We had a great experience at one of our contract jobs: we found a smart math student with no webdev experience, and he did most of the programming while I constantly consulted him on "platform issues" and jumped in to code any tricky parts he had problems with. He became a competent webdev in about two months - way faster than it took me to learn this stuff by myself - then I handed the project over to him and left. I wonder why other employers don't do this more: mix experienced programmers with smart newbies who come from hard-science backgrounds. Seems to be a winning combination.
>> s = "hello world, I love you world"
>> s.sub(/(.*?)world/, '\1universe')
=> "hello universe, I love you world"
>> s.sub(/(.*)world/, '\1universe')
=> "hello world, I love you universe"
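The same lazy-versus-greedy contrast in Python, for comparison (count=1 mirrors Ruby's sub, which replaces only the first match):

```python
import re

s = "hello world, I love you world"

# Lazy: (.*?) grabs as little as possible, so the first "world" is replaced.
print(re.sub(r"(.*?)world", r"\1universe", s, count=1))
# -> hello universe, I love you world

# Greedy: (.*) grabs as much as possible, so the last "world" is replaced.
print(re.sub(r"(.*)world", r"\1universe", s, count=1))
# -> hello world, I love you universe
```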
Every regexp that you apply to a particular block of text reduces
the applicability of regular expressions by an order of magnitude.
The minimum any programmer needs to know about regexps is when they are applicable, and more importantly, when they are not.
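The textbook example of "not applicable": balanced parentheses form a non-regular language, so no pure regular expression can recognize them, while a one-pass counter can. A sketch:

```python
def balanced(s):
    """True iff the parentheses in s are balanced - a check no pure regex can express."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # a ')' arrived before its '('
                return False
    return depth == 0          # every '(' must have been closed
```

(Backtracking engines with recursion extensions, like PCRE's (?R), can pull this off, but that's exactly the "more power than regular languages" caveat mentioned above.)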
PyParsing - http://pyparsing.wikispaces.com/ - this is probably the most well-respected parser library
picoparse - http://github.com/brehaut/picoparse - My Parsec-inspired parser combinator library. It is significantly more lightweight than pyparsing, and I would recommend pyparsing in its place for heavier-duty parsers.
RParsec - http://rubyforge.org/projects/rparsec/ - A Ruby implementation of Parsec. I've heard good things about it.
Parsec - http://legacy.cs.uu.nl/daan/parsec.html - This is an amazing library, worth anyone's time.
FParsec - http://www.quanttec.com/fparsec/ - This is reason enough to learn F#; it is a fairly true port of the Haskell library and very nice to use.
Packrat Parsing - I believe this is part of the Scala standard lib. http://www.scala-lang.org/docu/files/api/scala/util/parsing/combinator/PackratParsers.html
fnparse - http://github.com/joshua-choi/fnparse
jparsec - http://jparsec.codehaus.org/
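The core trick shared by all of these libraries is tiny: a parser is just a function from input to (value, remaining input), and combinators glue such functions together. A stripped-down Python sketch - not the API of any library above:

```python
# A parser: text -> (value, rest) on success, or None on failure.

def literal(expected):
    """Match an exact string at the front of the input."""
    def parse(text):
        if text.startswith(expected):
            return expected, text[len(expected):]
        return None
    return parse

def seq(p, q):
    """Run p, then q on whatever p left over; succeed only if both do."""
    def parse(text):
        r1 = p(text)
        if r1 is None:
            return None
        v1, rest = r1
        r2 = q(rest)
        if r2 is None:
            return None
        v2, rest = r2
        return (v1, v2), rest
    return parse

def alt(p, q):
    """Try p; if it fails, try q on the same input."""
    def parse(text):
        r = p(text)
        return r if r is not None else q(text)
    return parse

greeting = seq(alt(literal("hello"), literal("hi")), literal(" world"))
print(greeting("hi world!"))  # -> (('hi', ' world'), '!')
```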
But there's a big reason regexes are overused compared to parsers: regex engines come stock in most programming languages, and parser generators don't. There's a big mismatch in effort between using something the language gives you for free and bringing in a parser. Perl 6 deserves some credit for its "rules" system, which can actually take a grammar and parse from it.
I'm thinking specifically of the Parsec-inspired family, which has implementations or derivations in many common languages, even C# and Java (despite being noticeably more cumbersome in those languages).
[I cannot speak for Perl 6, but I trust you are correct]
http://weitz.de/regex-coach/ - It's Windows only though
There are many others, but this one taught me quite a lot
also, python's regex module can show you how it interprets a regular expression by passing the re.DEBUG flag to re.compile:
>>> import re
>>> re.compile('a*b+', re.DEBUG)
max_repeat 0 65535
  literal 97
max_repeat 1 65535
  literal 98
<_sre.SRE_Pattern object at 0x100460230>
I think it's easier to get if you match a pattern against some text by hand, taking a simple pattern like ^a+b*c$ and trying to see if it matches:
Here are a few test cases: "aaabbbc", "aaac", "bbc", and "acbbc".
1) ^a+ means it starts with 1 or more 'a's, so those eat all my 'a's and I'm matching "bbbc" against b*c$ now
2) b* means 0 or more 'b's; those eat up everything but the c at the end.
3) c$ matches exactly one c at the end, so yeah, we have a match. If there had been one more c (or anything else), the match would fail because it wouldn't hit the $ (which is end of string).
1) ^a+ eats all the 'a's again.
2) b* ... what b? It's "zero or more" so yes, we can skip it.
3) c$ Yes, there's one c at the end. We have a match.
1) We need *one* or more 'a's at the start. That's the difference between + and *. Match fails immediately.
1) ^a+ matches the a.
2) b* means zero or more 'b's, so it's effectively optional and we don't fail... yet. But we don't match anything, either.
3) c$ That c matches, but we're not at the end of the string yet. There's still "bbc" left to match and we're out of pattern. Fail.
You can do something similar with the tools other people have linked in comments. This was just to give an example pattern and a few test cases to get you started.
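The hand-traces check out mechanically, too. Here are the four cases run through Python's re, with test strings reconstructed from the walkthrough (fullmatch plays the role of the ^...$ anchors):

```python
import re

pattern = re.compile(r"a+b*c")
assert pattern.fullmatch("aaabbbc")    # a's, b's, one trailing c: match
assert pattern.fullmatch("aaac")       # b* cheerfully matches zero b's
assert not pattern.fullmatch("bbc")    # a+ needs at least one 'a'
assert not pattern.fullmatch("acbbc")  # "bbc" is left over after the first c
```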
It's weird, though. Even though this clicks for me, I don't feel like I get monads at all. Yeah, I've read the 3 monad laws (great... so something being a monad means that I can monad it, un-monad it and do something vaguely similar to composition on monads) and I can do some very simple Lisp which uses built-in stuff they say is built from monads (like printing). But I don't know how they work internally at all, or how they're fundamentally different from all the other Lisp statements ("causes side effects" isn't terribly specific... why is printing to the screen a side effect? Is setq a monad?).
In other words, there's probably some topic for every programmer that they really have to struggle with. As easily as regexes come to me, I struggle to understand other stuff.
I've been at it for quite a while, incidentally. But I'm not going to give up any time soon. So keep working on it.
It does help, but I think there's still a lot I don't know. I mean, 'bind' and 'return' still seem like the names are backwards to me. Somehow, I feel like 'return' should give me a value, rather than shoving a value into a monad. But that's probably because I'm used to C and its relatives.
Perhaps I simply need to read up more on FP itself. I have a huge bookshelf full of half-read books (and yes, plenty of fully-read books) but nowhere near enough free time...
It's super handy when learning Regular Expressions because it shows results as you type.
$ printf 'ab[c]d[ef][g[h]i\n' | sed 's/\[[^]]*]/-/g'
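Here's that bracket-stripping substitution via Python's re, in case the character-class syntax looks odd: a ']' listed first inside a class is literal, in both POSIX and Python:

```python
import re

s = "ab[c]d[ef][g[h]i"
# \[[^]]*\] : a '[', any run of non-']' characters, then the closing ']'.
print(re.sub(r"\[[^]]*\]", "-", s))  # -> ab-d--i
```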
kind of disappointed that there are 13 comments and no mention of jwz's quote.
in terms of actual thoughts, this is certainly an aptly named article.
> Some people, when confronted with a problem, think "I know, I'll quote Jamie Zawinski." Now they have two problems.
2. No, they cannot! Extended regexps are a nice hack, but they don't really meet the formal definition of a regular expression.
3. man grep
It should also be included in the bare minimum:
Regex will drive you crazy. Not just writing the expression, but installing a library (especially if there is no handy installer), etc. Regex likes to punch you in the balls whenever possible. (I can say this because I've written many lines of Regex and I do love it.)
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.