How "junior" developers can become regex wizards (joshuakemp.blogspot.com)
39 points by joshuakemp1 on Nov 28, 2013 | 44 comments



A couple things:

1. Be careful what you use regexes for. Email addresses are very difficult[0]. HTML is impossible[1].

2. There are a number of tools that make it easier to understand, including [2].

[0] http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html

[1] http://stackoverflow.com/a/1732454/2363

[2] http://ivanzuzak.info/noam/webapps/fsm_simulator/


> HTML is impossible.

This comes up a lot. Most languages' "regular expressions" aren't, in fact, regular. A true regular expression wouldn't be able to match HTML, but Perl regular expressions (the de facto standard) can because of backreferences.

Edit: I'm not saying this is a good idea; it most certainly isn't. I'm just saying it's possible.


Perl's regex support can match HTML because of recursive matching, not only because of backreferences (the latter of which is widely implemented, the former not quite so).

That being said, there surely are some fun languages that can be matched by what's commonly called regular expressions. Notepad++ was notable (before switching to PCRE) in that its "regular expressions" could not even match every regular (or even finite) language (http://stackoverflow.com/a/4815422/73070). Many regex engines allow matching languages that are context-sensitive, while at the same time not accepting all context-free languages.
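To make the backreference point concrete, here is a minimal sketch in Python (my choice of language; the names are made up for illustration). It matches "squares", i.e. some string written twice in a row, which is a classic example of a language that is neither regular nor context-free. It says nothing about HTML; it only shows that a backreference already takes you outside the regular languages.

```python
import re

# \1 must repeat exactly what group 1 captured, so the whole pattern
# accepts any non-empty string written twice in a row ("squares").
# No finite state machine can do this for arbitrary lengths.
SQUARE = re.compile(r"(.+)\1")


def is_square(text: str) -> bool:
    """Return True if text is some non-empty string repeated twice."""
    return SQUARE.fullmatch(text) is not None


print(is_square("abcabc"))  # True
print(is_square("abcabd"))  # False
```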


[0] Yep. Although, to be fair, that is simply a generated expression based on what I assume are much simpler starting rules. That said, and this is a topic that never ceases to come up, be simple and be lenient.

For every person who thinks their use case is special, it probably isn't. It is almost always better to do trivial validation that allows false positives than to try to be exact and forget a corner case (and, most likely, you'll end up forgetting multiple corner cases).
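As a rough sketch of what "trivial validation that allows false positives" could look like in Python: the pattern and function name below are my own illustration, not anything from the linked pages. It assumes you want a dot in the domain part, which will reject something like user@localhost, so loosen it further if that matters to you.

```python
import re

# Deliberately loose: something, an "@", then something containing a dot.
# It accepts plenty of addresses that are not strictly valid; the real
# check is whether a confirmation mail arrives.
LOOSE_EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def looks_like_email(address: str) -> bool:
    """Cheap sanity check that rejects only obvious non-addresses."""
    return LOOSE_EMAIL.fullmatch(address) is not None


print(looks_like_email("user+tag@sub.example.co.uk"))  # True
print(looks_like_email("no-at-sign.example.com"))      # False
```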

Want to help users with typos in email addresses without rejecting them? Use something like mailcheck:

https://github.com/Kicksend/mailcheck

And, tons more discussion here: http://stackoverflow.com/questions/201323/using-a-regular-ex...

[1] Yep. That said, it is apparently useful for certain parts of HTML parsing/validation, as AntiSamy I believe uses regular expressions frequently. https://www.owasp.org/index.php/Category:OWASP_AntiSamy_Proj...


This might be overkill but I found I never "got" regular expressions until a class made me think about them as state machines. The additional bashing over the head of having to implement a parser/matcher made it really stick. The quirks and syntax make much more sense when you know why and how a regex engine works.

That said, anything involving extended/perl regex I wind up googling.
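For anyone who wants to try the state machine view without writing a full engine, here is a toy sketch in Python. It hand-codes the machine for one small expression, ab*c, so the states and the transition table are assumptions made for this example rather than output of any real regex compiler.

```python
# Hand-built deterministic state machine for the expression  ab*c
# States: 0 = start, 1 = seen "a" (loops on "b"), 2 = accepted.
TRANSITIONS = {
    (0, "a"): 1,
    (1, "b"): 1,
    (1, "c"): 2,
}
ACCEPTING = {2}


def dfa_matches(text: str) -> bool:
    """Return True if the whole text is accepted by the machine."""
    state = 0
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no transition defined: reject
            return False
    return state in ACCEPTING


print(dfa_matches("abbbc"))  # True
print(dfa_matches("abca"))   # False
```

Each character consumes exactly one transition, which is why a matcher built this way never backtracks.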


Huh, to me I "got" regexp the first time I used it (although it took a while to learn the details). To me if you understand the idea of wildcards, you understand regexp.


I began to understand Regex after reading the awk chapter in Masterminds of Programming. After that I understood how "the machine" inside might work. I really understood how to apply Regex to single strings after I started to see a regex as a "mask", so very similar to your wildcard approach.


How to become a regular expression wizard:

1. Write a bunch of regular expressions.

2. Fix them when they break.

Anybody can do this, however junior they may be. (And yes, it does grant you a superpower.)


3. Write tests.

Don't change your existing regular expressions without tests, or Bad Stuff happens.
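A minimal sketch of what that might look like in Python, using a made-up date-shaped pattern and a plain table of cases (both are hypothetical; use whatever test framework you already have):

```python
import re

# Hypothetical pattern under test: a YYYY-MM-DD shaped string.
DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")

# Each case pairs an input with whether the pattern should accept it.
# Add a new row every time a regex bug is found, before fixing it.
CASES = [
    ("2013-11-28", True),
    ("2013-1-28", False),
    ("2013-11-28T00:00", False),
    ("", False),
]


def failing_cases():
    """Return the cases where the pattern disagrees with the expectation."""
    return [
        (text, expected)
        for text, expected in CASES
        if (DATE_SHAPE.fullmatch(text) is not None) != expected
    ]


print(failing_cases())  # [] when every case passes
```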


I don't usually write regular expressions that are complex enough to need maintenance. If you are writing one that is enough effort that it isn't disposable, then you might want to reconsider whether you're using the right tool.


Assuredly, like most/anything in life, repetition and dedication will be your regex salvation.

Tools such as this are shortcuts for newcomers to grok the various operators rapidly. I think they are very worthwhile.


http://swtch.com/~rsc/regexp/regexp1.html should be required reading before any developer tries to become a "regex wizard".


Is it worth mastering? No. Worth understanding? Yes.

Regex is used in so many applications and commands, it would be silly not to learn it.

You don't need to be a wizard, but do understand the basics and it will get you far.


Somewhere in between "the basics" and "wizard" is probably best. Regexp is powerful, extremely valuable in the right places, and knowing more than the basics will be useful. On the other hand wizard-level expertise is not necessary to net 99% of the value of regexp.


That depends. The skills I developed with sed and awk have paid dividends. But when I hear someone speak of becoming a regex master, as if a regex was an entire tool unto itself and not just a way of representing a pattern, that makes me think of Perl regexes in all their absurdity.


Also, check for your language's options for white space and comments within regular expressions [0]. Regexps don't have to be blobs of characters -- you can use white space to make them more readable and use comments embedded within a multi-line regexp to describe what/why you are doing.

Bonus: it makes them easier to diff, too!

We don't write our code on one line with no comments, writing regexps should be no different.

[0] Python example: http://docs.python.org/2/library/re.html#re.VERBOSE
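A short sketch of the idea with Python's re.VERBOSE flag. The version-number pattern and the group names are invented for illustration:

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and "#" starts
# a comment, so the expression can be laid out one piece per line.
VERSION = re.compile(
    r"""
    (?P<major>\d+)   # major version
    \.               # literal dot
    (?P<minor>\d+)   # minor version
    \.               # literal dot
    (?P<patch>\d+)   # patch version
    """,
    re.VERBOSE,
)

match = VERSION.fullmatch("2.7.18")
print(match.group("major"), match.group("minor"), match.group("patch"))  # 2 7 18
```

The one-line equivalent would be (\d+)\.(\d+)\.(\d+), which works the same but gives a reviewer nothing to read in a diff.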


Making "junior" developers solve problems with regex sounds like a recipe for terrible maintainability, unless it is necessary.


I do actually somewhat agree here. If one doesn't know what they are doing, it is very easy to do regular expressions incorrectly.

That said, regex is sometimes necessary, so it is important for developers to be competent in this realm. In my opinion, ideally, junior developers would start with using them in non-production environments to become familiar, then go from there. It is also important to be able to distinguish which problems should be solved by regular expressions, and which shouldn't. A good mentor here can be great.


If you really want to master Regular Expressions in a way you are not likely to forget do what I did in the '80s. Write a regex interpreter and then a compiler. I was doing a contract for an embedded controller and found a great use for compiled regular expressions within it (I remember every nuance of regular expressions but can't for the life of me remember what my use case was.)

Debugging it was the really fun part but my earlier career with IBM had taught me how to test software effectively and some embedded work with PAL's (an early form of programmable logic) taught me how to really use state machines. I was surprised how little time it took to write and debug. I don't think you can really appreciate the elegant logic of the regular expression language without implementing it.


I found http://www.regular-expressions.info/ to be an invaluable resource when learning or understanding regular expressions. Basically every page that explains a feature in the reference also explains what's going on within the engine, how it works, when it backtracks, etc. Those things are sometimes hard to see in applications that just give you the matches from a text (as rubular.com seems to do).


You don't need to be a developer or a wizard to be proficient in writing regular expressions. Regular expressions are used by all kinds of professions; that's why regular expression capability is included in almost every text editor.

The best way to learn to love regular expressions is to use them outside of a programming context, where you can get real-time feedback with actual test data. Some text editors will even highlight matches as you type the expression out.


During University one of the projects we were given was to write a regular expression parser and evaluator. From the moment that project was complete I have never had trouble understanding regular expressions. I thought it was an excellent way to learn them.


I'm torn a little on my opinion of this post. On the one side I applaud anyone who's willing to self-learn and isn't an "I'll just Google it" developer. On the flip side I tire of hearing people tell me that they've written a CMS of their own; in some cases, piggybacking on someone who's solved an identical problem pays dividends and will be more mature than your solution.

Perhaps for a self-learner a nice approach is to write it yourself and then go searching for a solution. That way you'll likely: validate what you did, figure out that you missed a detail or perhaps learn some new regexp foo or different way of writing the same thing.


I love these types of things. Kudos to the author.

This one in particular reminds me of a tool I've always found useful. It's an interactive Regex builder just like the one linked in the OP. I would say it's got some additional compelling features: like mouse-over breakdowns of each expression as you build it, a handy reference list as well, but also a community concept and saved expressions. Really, I've never found anything better. My only complaint is that it's flash-based, but it's an amazing tool, so can't really complain too much.

http://gskinner.com/RegExr/


Regular expressions were simple and straightforward tools until Perl. They were originally intended to be equivalent to finite state machines, but, thanks to Perl, you can write regexes that may or may not halt, and no one can ever prove one way or the other. If you want to take the best parts of regexes and leave the rest, don't bother with all of the Perl extensions and just study the basics.


Go read Mastering Regular Expressions.


This is the bible of regex. Its explanation of "unrolling the loop" will change how you write regular expressions if you don't already use that technique. Its discussion of the operation of NFA and DFA engines is great too.
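For those who haven't seen it, here is my rough reading of "unrolling the loop", sketched in Python for a double-quoted string with backslash escapes. The pattern names are mine, and this is a paraphrase of the general shape normal* (special normal*)* rather than a quote from the book.

```python
import re

# Naive form: an alternation tried again for every single character.
NAIVE = re.compile(r'"(?:\\.|[^"\\])*"')

# Unrolled form: normal* (special normal*)*
#   normal  = [^"\\]  (anything but a quote or a backslash)
#   special = \\.     (a backslash followed by any character)
# Long runs of ordinary characters are consumed in one step.
UNROLLED = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

text = r'say "hello \"world\"" now'
print(NAIVE.search(text).group(0))     # "hello \"world\""
print(UNROLLED.search(text).group(0))  # "hello \"world\""
```

Both patterns accept the same strings; the unrolled one just gives a backtracking engine fewer choice points to revisit.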


Aye. I thought of myself as a regex wizard, and had answered many regex questions on perlmonks.org, but even then I learned a lot about regexes while reading that book.


Can't remember the last time I had to actually hand craft a regex.

Any serious developer these days will be using standardised libraries for this sort of validation and not reinventing the wheel with some half baked do-it-yourself regex.


ugh.

do they really need to be good at regex? we don't all work on the internet you know... regex is basically pointless for most application development. most programmers i consider to be exceptionally talented can not write a regex without reference (although they will do it when necessary by using reference - and very well too).

on the other hand the general approach to problem solving advocated here is quite sound. "find good tools" "don't rush pointlessly" "measure don't guess" "google is good, copy-paste blindly is bad"


> we don't all work on the internet you know...

And the author didn't say that you must know regex to be a good dev. He said that he was learning it, and that he was putting more effort into it and finding it was paying off.

"I want to be a good developer one day, and I think as a young developer we should put in the extra time to try and really understand something and not just always do what is the quickest."


RegEx probably has most use in data validation, which has nothing to do with the internet.


i generally backlash against this. i've seen articles where people are stupid enough to say "if you don't know regex, you aren't good". my experience tells me the converse is true. good programmers are so good they end up working in environments where regex is an irrelevant curiosity and almost never a tool.





You don't need to be a developer to understand or use regex. I use regex on a daily basis as I have to deal with a lot of text manipulation.

--edit

I am not a developer.


Please don't use regexes.


Ever? For any purpose whatsoever? That seems overly prescriptive.


Sure, for scripts, adhoc analysis, command line hacking and whatnot, go ahead.

Regexes should be seen more like a last resort than a good software engineering choice. They're a code smell. I have seen so many incorrect, slow regexes from people who don't know what they are doing that I have to recommend as a best practice that you don't use them unless you are going to study automata, read Friedl's book, the Dragon book, and study with Tibetan regex monks for years, and mentor everyone who has to maintain your code until you die.


so how do you e.g. remove trailing whitespace?


Is that a serious question? Almost every major language has some kind of string trim/strip function for that purpose.


good point. missed that. but does your text editor have a trim function, too?

i've thought of a case i used a regex the last time. and it was removing whitespace in some code file.
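Both sides of this exchange fit in a few lines of Python. This is only a sketch with made-up function names: the built-in method covers a single string, and the regex covers the editor-style job of cleaning every line of a file at once.

```python
import re


def strip_line(line: str) -> str:
    """Single string: the built-in method is enough, no regex needed."""
    return line.rstrip()


def strip_trailing_whitespace(text: str) -> str:
    """Whole file: remove spaces and tabs at the end of every line.

    With re.MULTILINE, "$" matches before each newline as well as at
    the very end, so line breaks themselves are left alone.
    """
    return re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)


print(repr(strip_line("x = 1  \t")))                    # 'x = 1'
print(repr(strip_trailing_whitespace("a  \nb\t\n")))    # 'a\nb\n'
```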


a regex tester online?!??!? what "hacker news"!!! This has never been done!



