Regular expressions match regular languages (hence the name). If your language involves pairs of things (e.g. HTML), it's not regular. Perl hacked in support for this via backreferences and other extensions, but these are slow and illegible. Use a proper context-free grammar parser if you need to parse a context-free grammar, you know?
More broadly, people fear and misunderstand regexes because they have no idea how they work. It becomes much easier if you understand how they map to deterministic finite state machines. Recommended reading: https://swtch.com/~rsc/regexp/regexp1.html
Once you understand how they work, you can basically read a regex left to right and intuitively know all the strings they'd match. There is no such thing as an unmaintainable/illegible basic regex - they're just words with some placeholders in them - it's when you cram in extended functionality (which is basically a programming language where all the keywords are single characters) that shit hits the fan.
>Use a proper context-free grammar parser if you need to parse a context free grammar, you know?
The thing is that regular expressions are supported as a language feature or in the standard library of pretty much every language. If you want to build a proper parser, you'll have to jump through a lot more hoops. For instance, for C++ I've tried tons of different lexer and parser generators and they all suck for various reasons (verbose syntax, uses global variables, C only, lexer not compatible with parser and vice versa, ...). Most people seem to end up writing their own parsers from scratch.
The only time I've ever seen parsing done right is with Parsec for Haskell.
I really hate that link. Not only is it uninformative, I disagree with it. If you know the shape of your html data, parsing it with regular expressions can be much easier than with a dedicated parser.
You might be interested in OMeta, whose goal is to make language experimentation quicker and more accessible by providing a kind of interface layer between a declarative grammar and a host language. I'm still reading the paper, so I can't vouch for it yet. But it's from VPRI and has Alan Kay's backing.[0]
I really enjoyed learning OMeta, and I'd recommend playing with it to others. However, the performance of the JavaScript implementation is really bad. It uses exceptions as a primary mechanism for control flow, which is not generally well optimized in JS. I observed a toy calculator grammar parsing a string about 20 characters long throw and catch hundreds of exceptions.
I've had good success with Jison as a JS parser generator that is performant enough to feel good about using in production.
It's a mistake to write your own parser because they can get so complex so quickly, even for basic stuff, and you'll lose sight of the original goal of the project. See Prolog - a whole programming language built around the idea of parsing language (albeit NLP)!
I suggest you take a look at Antlr4 for a powerful but easy-to-use parser plus lexer combo.
I tried Antlr, even bought the book, and at least for C++ I found it pretty unworkable. Even with the book the documentation felt very incomplete. Maybe things work better on the Java side.
Antlr is shit; it will waste your time for anything real. Bison+flex (once you figure out the quirks, undocumented assumptions and flags) and treetop are usable.
Disclaimer: A long time ago in a galaxy far, far away, I wrote an optimizing Java-to-MIPS compiler (sans GC, so leaky heap) in C++ using Flex/Bison and again in Java using JavaCC.
Except nobody uses strict regexes anymore. In particular, without the minimal match (*?) operator, they're of limited utility in real-world situations. It's easy to tell people to use a context-free grammar parser, but the ones that I know of require a lot more setup than a one-liner regex. There's probably an opportunity for someone to really improve the practice of programming by coming up with a compact language to specify and match structures in CFGs, or some subset of them.
JSON-Schema and XML-Schema provide the "specify and match" functionality, in the sense of deciding whether or not a given instance of a document conforms to a defined standard. For simple regex-style queries there are XPath and JSONPath expressions.
If you want to actually touch them as plain objects it's a very straightforward task of implementing a data provider to marshal the objects. In Java this is provided by libraries such as Jackson (JSON) and JAXB (XML). These basically work just like Hibernate.
JSON and XML cover many of the real-world use cases of parsing such CFGs. If you make your data fit into one of those boxes it's very straightforward to validate or marshal them according to those schema protocols. They obviously do have a greater complexity than just writing a regex, but that's kind of the nature of using a more expressive language, and it's by no means an insurmountable increase.
"There is no such thing as an unmaintainable/illegible basic regex..."
To someone who knows BRE. I am one of those people. It's ERE and PCRE I do not understand very well.
Sharing solutions to common problems using BRE on HN always seems to trigger (unwarranted) criticism using either of the exact words you mention, or synonyms for them. "Unmaintainable" (by who?). "Illegible" (to who?).
I "maintain" 100's of BRE scripts. They are perfecty legible to me. None of them are so complex I cannot re-write them in a short time. It is the structure of the input that is complex and which takes time to recall.
I also use lex, a common utility found on almost all UNIX derived OS; this article seems to ignore that option. I like to think it's faster than Perl or Python, but I cannot say for sure.
> I like to think it's faster than Perl or Python, but I cannot say for sure.
It is. I implemented an assembly language parser with pyparsing. It worked okay but the function call overhead with a combinator-based parser handling both the lexing and grammar was murder. I replaced it with regexes and got a 6x speedup. Not something I would do with a complex grammar though. Native code would obviously blow this away in speed but it is fast enough now.
I use regexes all the time for parsing data on a variety of 3rd-party websites. Just because they aren't perfect at matching every potential theoretical situation doesn't mean they shouldn't be used. In practice, regexes can be a simple and reliable way to grab data out of HTML. Don't dismiss them out of hand!
Another point, generally overlooked by the theoretical purists, is that HTML in the wild is rarely correct, and your perfect HTML parser will barf when trying to process it. Regexes on the other hand don't have to care about exact syntax and can cope with horribly mangled data.
An ordinary regex obviously can't parse html, because html is not regular (given nested elements and the pumping lemma). But what you can easily do with a regex is to tokenize html - extract anything that looks like a start tag, for example. The simple approach will obviously get some things wrong - a tag inside a comment is meaningless, for example - but for a lot of uses, this difference simply isn't important.
If the job is to extract all links from a webpage, regexes will do just fine, and will probably be easier to write and understand than alternate approaches. (This is absolutely not a fair comparison of course; you could compare writing a regex engine to writing a html parser. But I digress.)
If the job is to determine whether a given webpage is a member of the set that includes all valid html documents, then a regex is not sufficient.
If the job is to extract a list of syntax tokens from a webpage, a regex is likely fine.
If the job is to assign semantic meaning to every token in that list, a regex just won't work.
Either way, the point is to know what you're doing. Much "parsing" of webpages is not parsing in the formal language sense, and who cares that it isn't because it doesn't need to be.
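For what it's worth, a rough sketch of the "extract all links" job in Python - assuming reasonably tidy markup with double-quoted attributes, and accepting that hrefs inside comments or scripts get picked up too:

    import re
    import urllib.request

    # Fetch a page and pull out every double-quoted href value from <a> tags.
    html = urllib.request.urlopen("https://example.com/").read().decode("utf-8", "replace")
    links = re.findall(r'<a\b[^>]*\bhref="([^"]*)"', html, re.IGNORECASE)
    print(links)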
No, tasks like "extracting all links from a webpage" are absolutely trivial using an HTML parser. Run the following XPath query on an HTML object:
//a[@class='specified_string']/@href
Yes, you have to understand XPath syntax to write such expressions, just like you have to know the language of regexes. Or at least be able to google them.
The answer to "who cares" is "you", because you're the one who's going to catch hell when your regex failed to capture some hyperlink that utilized some feature of XML that exceeded your test-cases. The one-liner above is guaranteed to Just Work on all valid XML documents, so why even create such a monstrosity?
Everyone knows that Regexes Cannot Parse HTML, and yet people still try it because they think they're smarter than Noam Chomsky. The real truth is that everything looks like a nail to these people, because all they have is a hammer.
Then you're not talking about parsing HTML/XML, are you? How could you possibly know which links or syntax tokens are actually going to be displayed on a page if you feed the browser's parser an invalid document?
There are fault-tolerant HTML parsers like TagSoup that are specifically designed to handle dirty HTML and spit out a valid document object. If you have sources that are malformed badly enough that it's still not working, you can define custom SAX properties to handle them. But a task like that is certainly a best-effort exercise, and the interpretation of such a library is no more or less valid than the interpretation of the browser's parser. It's not a valid document to start with and nothing can make it so.
If you are only parsing values out of a single specific data template, you know it's not going to parse as HTML or XML, you know that it's never going to contain weird values, and you know it's never going to change - then go hog wild. But it's fundamentally a brittle approach that only holds as long as those assumptions do. I've made the mistake of believing some of those about my data and it's bitten me before. And I really question the implicit assertion that "most html parsing" would fall into that exceedingly narrow category. Especially after a couple years of feature creep.
Just keep your logic general and normalize your data. Offer a failover to a fault-tolerant parser in your data layer with a logged warning. This is much more durable and doesn't silently generate invalid tokens or silently fail to capture valid tokens. Regexes simply cannot offer the capability to fail loudly. So once you are no longer actively babysitting your custom regex parser it could have started failing at any time - how would you even know?
Good HTML Parsers typically use engines of browsers and will actually match websites very well. They'll let you use selectors and not regex and are a lot more accurate.
It also won't break if the website adds a single attribute or a quirky value.
It is not a question of "theoretical purity", it is a fact that it is simply not possible in the real world to parse an html document using only a regex. But this is often misunderstood as saying that regexes shouldn't be used in parsing. Regexes are quite useful as a part of a parsing process, e.g. to locate the tags and attributes and separate them from text. This is the scanning or tokenization step. But after this, you need to create a hierarchical data model corresponding to the DOM, and you need to do that outside of a regex. A regex outputs only a string or a flat list of strings (in all regex libraries I know of); it does not output a hierarchical data structure, so it follows that you need something more than a single regex to parse a hierarchical data structure like a programming language or an html document.
Only exception would be if you need to extract very specific data items which are not hierarchical in nature. Eg. you want to extract only the specified character encoding or something like that.
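To make that split concrete, here is a rough Python sketch: the regex only does the scanning/tokenization, and the little stack on top of it (naive - no void elements, comments, or error recovery) is the part that turns the flat token stream into a tree:

    import re

    # Tokenize into tags and text runs - this much a regex can do.
    token_re = re.compile(r'<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<]+)')

    def parse(html):
        root = {"tag": None, "children": []}
        stack = [root]
        for close, tag, text in token_re.findall(html):
            if text:
                stack[-1]["children"].append(text)
            elif close:                       # end tag: pop back to the parent
                if len(stack) > 1:
                    stack.pop()
            else:                             # start tag: push a new node
                node = {"tag": tag, "children": []}
                stack[-1]["children"].append(node)
                stack.append(node)
        return root

    tree = parse("<p>Hello <i>there</i></p>")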
Your point about the "perfect HTML parser" is kind of missing the point, since regexes does not magically solve the problem of imperfect HTML either. A perfect HTML parser would be one which implement HTML spec fully, including the error recovery rules. These are quite complex, eg, if you reach a <p> then an open <h1> is implicitly closed, but an open <i> is not. Try implementing this logic in a single regex! ("Regexes does not care about perfect syntax" - what does that even mean? A regex match exactly what you tell it to match, just like a parser parses what you program it to parse.)
I guess I must be doing something impossible then. I've been happily screen-scraping many different websites with regexes over the last ten years. The data I gather is critical to my work. And yet, regexes are fine.
Your viewpoint is the exact problem I'm complaining about. You think what I am doing is impossible. How can you be so certain? That's the sign of someone who is too tied up in getting something perfect.
In fact, while I've hit many problems in gathering data over those years, I can't think of any problem that has been because of regex limitations. And none of my regexes are overly complicated or rely on obscure extensions. Some sites get redesigned and I have to adjust my code, but those redesigns would break any kind of parsing as the pages got completely redesigned (and the URLs and site structure often change as part of this...)
If you need to gather data from a website, your real problems will be in the networking and reliability side of things.
It's possible to a certain degree, but it's just way, way harder than what you'd do if you used a proper parser. For example, let's say I want to get all the recent question links off http://stackoverflow.com/questions/new. With RegExp, I can write a clever thing that parses `<a href="(.*?)" class="question-link"` and _hope_ that the order never changes and the class never comes before the href and that they don't change that in a redesign or in other pages and that the href does not contain `class="question-link` inside it - and to a degree that's valid.
The thing is - the alternative is to write a query selector - which is another, more suitable domain-specific language for making selections, only on the DOM instead of text. I'd just write `$(".question-link").get().map(x => x.href)` to get the hrefs and I know it's __always perfectly safe__. Now that example is trivial; if I only want links where the questions are tagged with C#, I get a much harder problem with Regex, but with query selectors it's still mostly trivial.
So, it's not that it's particularly hard to use regular expressions to solve it, it's just a lot harder than the alternative which is super simple and obviously correct.
Your method relies on the class names not being renamed (e.g. I see "question-hyperlink" on /questions, not "question-link"). I'd skip the class entirely and match on the URL, since I doubt stackoverflow want their URLs to break:
/<a[^>]*href="(\/questions\/\d+[^"]+)"/i
But... we can go back and forwards posting examples and find fault in any regex that I post or any selector that you post. It's missing the point. Both methods are at the mercy of web page redesigns. Both methods can be made more robust against certain changes, but cannot survive other changes. You are trying to say that regexes won't work. I am saying that both methods work.
Look, you have essentially two options here: 1) add a full featured DOM parser to your program - do that if you have to understand the DOM; 2) write regex - do that if you don't care about the DOM, if you know how the server formats the data you need and if the server formats the data you need in a reproducible way.
> (...) and that they don't change that in a design or in other pages (...)
Well if they change the design of the pages then you will have to rewrite your regex accordingly in order to find the data you need. But if that happens, odds are high that your program, which uses a full featured DOM parser, will have to be rewritten as well in order to handle the modified output of the DOM parser...
in fact, in practice, i find that scrapers using regexp matchers are more robust against the kinds of template changes that people make than scrapers based on the html tree structure.
You're probably using some extension and not a "strict" regular expression engine. Regular expressions describe regular languages, which are level 3 in the Chomsky hierarchy https://en.wikipedia.org/wiki/Chomsky_hierarchy and they formally, provably, do not have the expressive power to describe HTML. This has already been posted in this thread, but make sure to read Larry Wall's quote in the second answer: http://stackoverflow.com/questions/6751105/why-its-not-possi...
Possibly there is some confusion about the word "parsing". If it is just the question of scanning for some substring or pattern on a webpage, sure you can do that with a regex. But this is not what is usually considered parsing.
Right. And that confusion is probably fueled by the existence of smart (invalid html swallowing) parsers mainly used in scraping, like beautiful soup and nokogiri.
Maybe you are indeed doing the impossible, but I consider it more likely that you either (a) are using some parsing logic (like a stack, recursion or some such) intermingled with the pure regexes, or (b) are not extracting hierarchically structured data from the html.
Could you please stop insisting that I'm doing something impossible? Do you think I am imagining my work? I assure you that no backtracking or recursion are involved. Stacks? Well, sometimes I extract bits of a page with one regex and process it more with further regexes in another function, so I guess you could argue that the language uses a stack for a function call, but still... please stop grasping.
I'm very sure that you could invent some website that relied on some hideous deeply recursive complicated structure, and it would be painful to grab data from it. In my experience, those cases are extremely unlikely. If you let these extremely unlikely situations stop you from picking a simple and useful solution for all the other cases, you are indeed getting tied up in trying to be perfect.
I'm not suggesting that your code doesn't work, if that is what you're thinking! It's just a terminology thing.
What you describe ("sometimes I extract bits of a page with one regex and process it more with further regexes in another function") sounds like a recursive descent parser, which is a very common way to implement a simple and fast parser.
I'm not suggesting you can't parse html with the help of regexes, I'm just stating that you cannot parse html only with a regex: You cannot apply a single regex to the html and get something similar to a DOM out. You need some kind of stack or recursive programming logic in order to extract a recursive or hierarchical data structure from a string.
Not sure I understand the relevance of this comment to the post in question. Unless your objection is that using capture groups in this way is an example of 'cramming in extended functionality'?
On the other hand, this post has spawned the usual regex thread including a zalgo link and a pointer to a regex that finds prime numbers, so I guess you've done your part to perpetuate regex mythology.
That's not a solution. I asked for something that checks decimal numbers. A real solution looks uglier: check out https://github.com/matthiasgoergens/Div7, it starts
Regexes we use are an extended form of so-called "regular expressions", which describe regular languages and finite automata, and in that part of maths, a language that accepts binary numbers divisible by N is a very common toy problem.
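For example, here is the classic "binary multiples of 3" automaton written out as a plain regex, checked quickly in Python:

    import re

    # The three-state mod-3 DFA, flattened into a regular expression.
    div3 = re.compile(r'^(0|1(01*0)*1)+$')

    assert all(bool(div3.match(format(n, 'b'))) == (n % 3 == 0) for n in range(1, 200))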
I agree with the general gist of your post, but I should point out that all finite languages are regular. So while potentially infinite HTML documents cannot be parsed by regular expressions, they don't turn up very often.
You're missing the distinction between a finite language, and an arbitrarily large finite production of an infinite language.
No-one cares about "infinite HTML documents". I don't even think the Chomskyan hierarchy concerns itself with languages with "infinite" productions. All you have to worry about is infinite languages -- i.e., languages with arbitrarily large productions.
There's a key difference between "infinite" and "arbitrarily large": the latter is quantifiable. While indeed you can build a regular expression to match any finite subset of HTML, it can only match HTML documents up to some fixed size. I can always give you a (finite!) document that is one tag deeper that your regex will choke on.
"But recursive parsers have the same issue!" you say. "Their stack will run out of memory at some point!" Yes, but they have a key difference: the amount of stack (memory) they require is bounded by the size of the document. This is not true for a regular expression! In fact, not only would a regular expression to match a given subset of HTML require memory exponentially proportional† to the size of the document, the automata itself would be similarly massive!
I really wish someone came up with and promulgated a concise, handy, built-in, ubiquitous equivalent of regular expressions for, say, PEGs. The closest I've seen are DCGs in Prolog. Would make so many parsing problems easier to do correctly!
† It's possible I'm wrong about this since it's early morning and I'm basing this off my intuition rather than a proof. The part about the automata itself being exponential w/r/t the size of the document is definitely true though.
You can't always give me a larger HTML document - eventually you will run out of usable universe. Yes, I should have said 'bounded by some finite maximum size' rather than just finite, but it's not a great leap to see that in reality all documents will be bounded by such a maximum.
Finite state automata do indeed need an exponentially larger number of states compared to pushdown automata; I made no claim as to efficiency. The point remains - for all practical purposes, you can consider all languages to be regular and using a stack is merely an optimisation.
Sure, but that doesn't change the fact that you cannot implement anything at all that accepts more than a regular language. In practice, however you implement it, you will have only implemented something equivalent to some finite state machine (due to the finite memory available to you).
You are confusing regular language with finite state machine. I don't know why there is so much resistance to this. All real machines have bounded memory available to them, thus they can only accept regular languages. Therefore, regular expressions are as powerful as any machine in existence.
Thinking about this some more, there is no reason that a large number of states has to have a proportional amount of memory. If the states are represented in binary notation (thus needing logarithmic number of bits) and the state transition function is represented as a binary decision diagram then this could be quite compressed indeed.
Interesting idea! Wait, here’s another: maybe you could take your regular grammar that accepts some large number of things that a context-free grammar it’s emulating would accept, and implement it using a stack or something. That would save space and remove the arbitrary restriction.
Well you can keep picking nits about implementation choices, but unless your stack can grow unbounded (because you have unlimited resources) then you haven't implemented a PDA and haven't removed any restriction at all. You've just optimised space usage.
If you really believe that regular expressions cannot parse any HTML document in reality (eg given that all web browsers in practice limit the nesting depth of HTML) then please present some evidence.
> If you really believe that regular expressions cannot parse any HTML document in reality (eg given that all web browsers in practice limit the nesting depth of HTML) then please present some evidence.
You can build a regular expression to match any HTML document to any fixed depth. Set that to whatever you think “all web browsers” limit HTML nesting to “in practice” – citation very much needed, I don’t believe they do – and voilà! You have produced something absolutely useless and probably several million characters long.
I don’t know what you’re arguing. I don’t think you know what you’re arguing either. It’s pointless to continue talking.
For example, see this old commit setting the WebKit maximum nesting depth to 2048. I am told it is 512 by default today. https://www.mail-archive.com/webkit-changes@lists.webkit.org... I'm fairly certain all browsers will impose such a limit or risk blowing their stack and crashing. Turns out there are advantages to setting upper bounds after all. Who'da thunk?
Of course, the point (if you read back) was not to say that you should use REs for all parsing. Just merely to correct a commonly repeated mistake that 'REs cannot parse HTML'. They can do so just fine.
Ah, okay. That’s much more straightforward. The answer, though, remains that regular expressions that are actually regular cannot parse HTML that is actually HTML. Regular expressions can parse HTML up to a certain depth specified in advance – which is not the definition of HTML.
Here is the TL;DR. This regex matches Tarzan but not "Tarzan":
"Tarzan"|(Tarzan)
You can also include more than one case of what you don't want to match. This one also finds only the cases of Tarzan that don't match the first three patterns:
Tarzania|--Tarzan--|"Tarzan"|(Tarzan)
You can even use more complex regexes. This matches all words not in an image tag:
<img[^>]+>|(\w+)
And likewise this matches anything not surrounded by <b> tags:
That's not exactly it: the regexp does match both "Tarzan" and Tarzan, but capture group 1 will only be set for strings that contain Tarzan without quotes. So by examining that group after matching you know which case you are in. (That also means you can only use this "trick" when you can examine capture groups, i.e. not generally in editors.)
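A minimal Python sketch of what that looks like in practice (the sample text is made up):

    import re

    text = 'say "Tarzan", visit Tarzania, meet --Tarzan--, and then Tarzan himself'
    pattern = re.compile(r'Tarzania|--Tarzan--|"Tarzan"|(Tarzan)')

    # Only the matches where group 1 is set are the bare Tarzans we actually want.
    wanted = [m.start(1) for m in pattern.finditer(text) if m.group(1)]

Each unwanted form is consumed by an earlier alternative, so it can never also be reported as a bare Tarzan.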
use Text::CSV;
use Regexp::Assemble;

my $csv = Text::CSV->new;          # parses each CSV line
my $ra  = Regexp::Assemble->new;   # accumulates the patterns

while (<$FH>) {                    # $FH: an already-opened handle on the CSV file
    $csv->parse($_);
    next if $. == 1;               # skip the header row
    my @fields = $csv->fields;
    $ra->add($fields[1]);          # add the suburb name field
}

my $suburbs = $ra->as_string;      # one combined regex matching any suburb
Might be good for hacks, but it'll go out of date eventually and can't be used for commercial purposes without incurring the wrath of AusPost. Thanks for hosting it, though.
This technique is unreliable in practice, and the author's discussion is confused.
First, their explanation doesn't make sense. They're supposing that there's some determinacy in the order in which a matcher can be expected to examine the different possible matches. But that's provably not the case: if it were, then deterministic and non-deterministic finite automata would be inequivalent.
But the technique in question does seem to require some determinacy as to which of several alternatives will match against a string. Where does that determinacy come from? The semantics of the alternation operator (the '|') as usually formulated don't specify any preference among alternations. For that reason, POSIX additionally requires that a matcher return the longest possible match (and if there are several such, the leftmost is what must be returned). Where you do find an explicit guarantee concerning which of several different possible ways of matching will be preferred, it's almost certainly because the engine is aiming at POSIX compliance.
Such compliance has a significant cost, though, as it requires the matcher to consider all possible matches (in order to find the largest). For that reason, most regex engines forego strict POSIX compliance and only guarantee that some match will be returned if one exists, not that that match will be the leftmost longest. Some engines offer the option of requesting strict POSIX behavior, but the default will always be to eagerly return the first match encountered (and recall the point above that there provably can't be a guarantee about the order in which matches are encountered, in general).
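You can see the Perl-style, leftmost-first behaviour directly in Python's re, for instance:

    import re

    # The first alternative that succeeds at the leftmost position wins,
    # even though the second alternative would give a longer match.
    print(re.search(r"Tarzan|TarzanBond", "TarzanBond").group(0))   # -> 'Tarzan'

A strictly POSIX leftmost-longest matcher would be required to report 'TarzanBond' instead.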
You should never do this in production code unless you're sure that your matcher is POSIX-compliant.
I almost wrote this off as it seemed to be about how to write unmaintainable regex soup but the author pulled something quite elegant out of the hat at the end.
"People often forget to solve the problem they need to solve (match x), and instead work on other things (find a regex to match x)."
Completely agree. One common mistake is building a complex regex to match the elements you want to find from a string, when an easier approach is to split the string on a simple regex that matches the things you want to throw away.
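A tiny Python illustration of that inversion (the delimiters here are made up):

    import re

    line = "alpha,  beta;gamma |delta"
    # Describe the junk between the tokens and split on it,
    # rather than describing the tokens themselves.
    tokens = re.split(r"\s*[,;|]\s*", line)   # -> ['alpha', 'beta', 'gamma', 'delta']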
No, I don't want to capture anything, I just want to match a string that contains Tarzan but not "Tarzan". Your example seems to match both. Is that too much to ask?
No, it's not. Just run through the matches returned by this regex, and reject the ones which don't have a captured group. Done. Did you want a more complex solution?
This "trick" is simply exploiting a bug in regex implementations.
The regex
"Tarzan"|(Tarzan)
should match the string
"Tarzan"
in two ways: first, matching the entire string; and second, matching the substring "Tarzan" in the whole string "\"Tarzan\"". But most regex implementations drop extra overlapping matches. I argue this is incorrect behavior, because it complicates understanding what a regex means - you have to understand the /order/ in which your regular expression matcher interprets your regular expression, which is an implementation detail. I conjecture that a DFA-based regex engine would not be able to exhibit this order-biased behavior, at least not with the standard approach.
However, it's interesting that this "bug" turns out to be a "feature" for the case of excluding other behavior. I'm not sure what conclusion to draw from this.
Sorry, I made an assumption there that you were talking about the practical applications of regex, not the theoretical applications, and I was asking you to explain how you would practically return multiple matches in... any environment.
This was a small regex designed to create multiple answers, to see how you resolved the issue; obviously we can engineer regexes that return far more results. So something's got to give. I don't agree with you that regexes innately imply all matches are valid.
The original article already relies on finding multiple matches, in order to ignore the matches that don't contain the group that we're interested in.
Python's regex library, for example, can return multiple matches. It has three functions:
- `re.match`, which checks for a match only at the beginning of the string.
- `re.search`, which finds the first location in the string that matches.
- `re.findall`, which finds all non-overlapping matches.
I was simply suggesting that the "non-overlapping" constraint in findall is a "bug", in some sense, because it exposes implementation details of the regex engine.
But, again, given that it is apparently a useful bug, maybe I am wrong. But that leaves open the question what the right spec for regex matching is, anyway.
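Concretely, Python's findall shows the non-overlapping behaviour with a cut-down version of the article's regex:

    import re

    print(re.findall(r'"Tarzan"|(Tarzan)', 'say "Tarzan" or Tarzan'))
    # -> ['', 'Tarzan']: the quoted occurrence is consumed by the first
    # alternative (empty group), so it is never also reported as a bare match.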
that is a very nice hack. this kind of 'do as much work per standard library call as you can' approach is pretty much the way to go in interpreted languages like python or octave, which is in part why numpy is so popular — not only does it allow you to program at a higher level of abstraction, it also makes your program run more efficiently.
however, i think that technically, that's not an A* search he implemented, just a breadth-first search. i'm not an expert (i've never even implemented A* search), so i could be mistaken. i'm interested to hear whether other people agree.
The article actually talks about the prime validator regex, and goes on to say that while it's an awesome trick, it is not the "best ever" because it has limited scope.
I for one think the trick described in the article is pretty useful, and might even use it someday (and possibly have already without realizing it).
Very nice trick. While the use of foo|(bar) is very simple, somehow I don't see this approach being used very often, and it looks like it could simplify a number of things.
Maybe I'm too old. I tend to think of a regex as either matching or not matching.
Finding a bit of code that uses a capture to determine whether a match was found seems like it would easily be confusing/non-obvious.
Some pretty clear commenting and it would be ok... maybe.
Also... I wonder how well it would work as part of a larger regex, one that already uses captures (or non-capturing groups)? The examples are all nice, short and sweet... but how often do regex-based solutions stay short and sweet? A few maintenance cycles/years and suddenly you've got this funky regex/capture thing that only Bob understands and he's way too busy to talk to you for 5 minutes... and once you change things then Bob suddenly finds time to review your code to complain how you broke it for such a simple change. There goes your bonus you told the wife you were sure to get so you could take her and the kids on vacation. The day after your divorce is finalized Bob sends you a fix request to use that improved scheme of yours because the old regex one isn't flexible enough anymore.
I thought the answer was going to be "tricking the world into thinking regex was a good idea". I've always considered regex to have the rare and elusive "write only" flag. Write only. As opposed to read only. Because once you write a regex that's it. You will never know what it does ever again.
I have taken to thoroughly explaining what a regex does in the code comments. Not just "match that and not that" but going through every character in the expression. It is a practice I took up when playing around with brainfuck.
My magnum opus was a program that factored numbers in an arbitrary amount of time. That is _NOT_ something you hammer out in an afternoon, and without commenting everything thoroughly (like what I was doing, where I was in code execution, what state the whole program was in) I would never have made it.
Non-related story time:
At first when playing with brainfuck, I wrote my own BF-interpreter that compiled to an IL which was easier to debug (with my own interpreter that could give me stack traces with comments about where I was in memory and what the hell was going on). After a month, I really didn't need to. My brain had gotten used to reasoning about BF-code and I could actually understand what code was doing as long as I knew which kind of compiler/interpreter it was written for.
Just use tests and comments. For instance, in many languages it's trivial to declare a regexp across multiple lines like this:
regex = "(^[0-9])" //catch the initial digit (group 1)
+ "/" //skip the following slash
+ "([A-Z]+) //capture the identifier (group 2)
...
Regexps are too powerful a tool to ignore. I'd much rather write a simple regex than a whole screenfull of code, especially when the former does the work faster (because regex engines are pretty efficient).
Another way of doing it is to use variables and helper functions, particularly if you're going to be creating lots of regexps from similar components. This is what I usually do in Python. For example, while making no promises about the value of this code specifically:
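(Something along these lines - a sketch with a hypothetical `ident' pattern and a couple of small helpers:)

    import re

    ident = r"[A-Za-z_][A-Za-z0-9_]*"        # a bare identifier

    def group(*parts):
        """Wrap the concatenated parts in a capturing group."""
        return "(" + "".join(parts) + ")"

    def spaced(*parts):
        """Join parts with optional whitespace in between."""
        return r"\s*".join(parts)

    # e.g. a simple "name = value;" assignment
    assignment = re.compile(spaced(group(ident), "=", group(ident), ";"))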
You can handle more or less stuff this way according to how much you and/or your readers like regexp syntax. The above is probably further than I'd take it in practice; I'm familiar with the regular expression syntax, but I'd still probably at least use something like the `ident' variable just to keep clutter out of the regexp.
(A nice demonstration of this sort of thing is emacs's rx module (see, e.g., http://emacswiki.org/emacs/rx). I couldn't find any good non-emacs documentation about this, nor much that would make sense to people unfamiliar with lisp, but when you're in emacs you can get help on it using C-h f rx RET.)
I agree. In terms of variables, I think a good strategy is assigning names to meanings, e.g.
PRODUCT_ID = "[0-9]{2}[A-Z]{1,5}"
...
CUSTOMER_ID = "[0-9]{2}[A-Z]{1,5}"
...
...
regexp = "(" + PRODUCT_ID + ")" // capture product ID in group 1
+ "something something" // something something
+ "(" + CUSTOMER_ID + ")"; // capture customer ID in group 2
In this example, even though the two constants contain the same regular expression, they refer to two different concepts. Part of the problem with understanding regexp-based code is connecting parts of the expression with what they mean. The above strategy addresses this.
I also like assigning names to capture groups I depend on. So instead of, later in code, asking for e.g. matcher.group(2), I ask for matcher.group(GROUP_CUSTOMER_ID). Makes for much more readable code.
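(In Python you can push this one step further with named groups - the ID formats and separator below are made up:)

    import re

    PRODUCT_ID  = r"[0-9]{2}[A-Z]{1,5}"
    CUSTOMER_ID = r"[0-9]{2}[A-Z]{1,5}"

    order_re = re.compile(
        r"(?P<product_id>" + PRODUCT_ID + r")"
        + r"\s*/\s*"
        + r"(?P<customer_id>" + CUSTOMER_ID + r")"
    )

    m = order_re.search("12ABC / 34XYZ")
    print(m.group("product_id"), m.group("customer_id"))   # 12ABC 34XYZ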
Speaking of Emacs's rx, it's absolutely amazing and I'm sad that I only discovered it just a few days ago :(. Another similar concept, from the other side of the "code = data" equality, is Common Lisp's (of course it's Lisp again) CL-PPCRE and its internal representation of regular expressions.
Simple regexes are still fairly readable (e.g. a three number version string match), but if you find yourself having to worry about stuff like matching within escapable delimiters, it's time to put the regexes down and pick a different solution.
I feel like I gain more than I lose by simply excluding regexps from my toolbelt. (It's a personal choice, I'm not saying it'd be the same for others, since their values may differ from mine.)
I've found regexes very useful when writing quick parsers. For these cases it's often the simplest solution. If the alternative would be a huge chunk of hand written parsing code I would claim the regex is more maintainable.
This is a bit true, but when you want to get things done, one line regex is often better than a whole file of code.
Write-only regexes are trivial to deal with once you have tests: when you don't understand one, throw it away and replace it with something you do understand. Tests pass? Here you go.
Whereas having a whole file of many lines of code is much harder to replace when you don't understand what it's all about.
In general, yes. That's why I always document my regexes well and write unit tests with all of the positive/negative test cases that I used whilst building the regex.
That makes them pretty maintainable.
Why go to this effort? Because they're beautiful and a very powerful tool when used for the right problem. Unfortunately they're widely misused (e.g. email validation).
Unless it is overly complicated, it is possible to write a regular expression in such a way that you can later understand what you were thinking at the time you wrote it (using the extended syntax available in Perl and elsewhere - it allows you to break up the regex into multiple lines and embed comments).
It takes a little more effort, and one does not get the "look how dense it is"-rush, but I guess the same can be said for all code. There is a point where it breaks down, but there is a range of text-matching/extracting problems where regexes are both useful and - if one makes the effort - maintainable.
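For example, Python's re.VERBOSE flag gives the same kind of extended syntax; a minimal sketch (the phone-number format is just an example):

    import re

    phone = re.compile(r"""
        (\d{3})      # area code    (group 1)
        [-.\s]?      # optional separator
        (\d{3})      # prefix       (group 2)
        [-.\s]?      # optional separator
        (\d{4})      # line number  (group 3)
    """, re.VERBOSE)

    assert phone.search("call 555-867-5309").groups() == ("555", "867", "5309")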
Unimpressive. The author of this article obviously didn't have a compiler class where one learns how regexes are basically glorified NFAs that are deterministically convertible into much more efficient DFA state machines (read: PCRE JIT), instead of assuming regexes are processed by O(N^2) algorithms.
'convertible into much more efficient DFA state machines (read: PCRE JIT)'
pcre supports reduction to dfa and also jit, but not only are they not the same thing, they are mutually exclusive. also, every regexp engine i've seen that supports capturing and backreferences uses not worst-case quadratic-time but actually worst-case exponential-time algorithms, although i'm pretty sure this isn't actually unavoidable.
may i suggest that the next time you think about posting a comment that begins with 'Unimpressive. The author of this article obviously didn't', that you include less than one major technical error per sentence in it.
Reminds me of this quote from Jamie Zawinski: "Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems."
Why does this remind you of that quote? Because it's about regular expressions? Because it seems to me that if you have the problem described in this post, and you solve it the way the post describes, using regular expressions, you have solved your problem, regardless of what jwz says.
When I'm faced with a problem, I think about using recursion. Then I have two problems. Then I think about using recursion. Then I have three problems. Then I ...