Common Regular Expressions Made Simple (github.com/madisonmay)
26 points by madisonmay on Dec 17, 2013 | 24 comments



I am reminded of VerbalExpressions (https://github.com/VerbalExpressions), which hit HN in early August. Interested in learning more about regexes, I did some work on a Racket port.

That project shows there are two tracks for tackling the problem that regex syntax is incoherent gobbledygook. The first is creating a regex version of PHP, with 5000 functions whose names you mix and match piece by piece. The second is renaming regex symbols to something that is easier for humans to parse.
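
To make the two tracks concrete, here's a minimal Python sketch. The builder class and its method names in the first half are hypothetical, just in the spirit of VerbalExpressions; the second half uses only the standard re module:

    import re

    # Track one: a builder with one function per regex concept,
    # names mixed and matched piece by piece (hypothetical API).
    class Rx:
        def __init__(self):
            self.parts = []
        def then(self, literal):
            self.parts.append(re.escape(literal))
            return self
        def digits(self, at_least=1):
            self.parts.append(r"\d{%d,}" % at_least)
            return self
        def compile(self):
            return re.compile("".join(self.parts))

    price = Rx().then("$").digits().then(".").digits().compile()
    assert price.search("costs $12.50 today")

    # Track two: keep regex syntax, but annotate the symbols so a
    # human can parse them (re.VERBOSE ignores whitespace/comments).
    price2 = re.compile(r"""
        \$          # a literal dollar sign
        \d+         # one or more digits: the whole dollars
        \.          # a literal decimal point
        \d{2}       # exactly two digits: the cents
    """, re.VERBOSE)
    assert price2.search("costs $12.50 today")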


Very good point! It's not the reasoning and logic behind regex that would turn a beginner away from it, it's the syntax, which often results in expressions turning into impenetrable goo!

(thanks for the verbal expressions link btw, seems really promising)


Unless I'm not finding my way around the project, this seems to be a very small library of regular expressions. As one would expect, Perl's offerings blow anything else out of the water: http://search.cpan.org/~abigail/Regexp-Common-2013031301/lib....


I'm gonna be That Guy and say that the e-mail regex isn't up to scratch and probably shouldn't be included.

Generally, though, very good.
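
To be concrete, here's a hypothetical naive pattern, similar in spirit to many found in the wild (not necessarily what this module ships), failing on perfectly valid addresses:

    import re

    # A hypothetical naive email pattern -- not necessarily
    # the module's actual regex.
    naive = re.compile(r"\w+@\w+\.\w+")

    # Valid addresses the naive pattern fails to fully match:
    for addr in ["user+tag@example.com",           # plus-addressing
                 "first.last@sub.example.co.uk",   # dots, multiple labels
                 "o'brien@example.org"]:           # ' is legal in the local part
        print(addr, bool(naive.fullmatch(addr)))   # all print False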


I'm gonna be That Other Guy and say that the date and phone regexes are, respectively, English-language-specific and US-specific. So it's common for a narrow definition of common.
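
A minimal illustration, with a hypothetical US-style pattern rather than the module's actual regex:

    import re

    # Hypothetical US-style date pattern: MM/DD/YYYY.
    us_date = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")

    print(bool(us_date.search("Meeting on 12/05/2013")))       # True
    print(bool(us_date.search("Treffen am 05.12.2013")))       # False: German uses dots
    print(bool(us_date.search("Meeting on 5 December 2013")))  # False: written-out month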


Yep. This library is a textbook application of H. L. Mencken's:

> There is always a well-known solution to every human problem — neat, plausible, and wrong.

often paraphrased as "For every complex problem there is an answer that is clear, simple, and wrong."


The time regex is, too. In German you can expect to encounter the text fragment

> um 6:00 am 05.12. (at 6:00 on 12/05/..)

If I read it correctly, the time regex would extract "6:00 am" as the time, but that "am" is the German preposition ("on the"), not a meridiem marker; German uses the 24h format.
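
A quick sketch of the misparse, using a hypothetical English-centric time pattern (not necessarily the module's exact regex):

    import re

    # Hypothetical English-centric time pattern with optional am/pm.
    time_re = re.compile(r"\d{1,2}:\d{2}\s?(?:am|pm)?", re.IGNORECASE)

    text = "um 6:00 am 05.12."  # German: "at 6:00 on the 5th of December"
    print(time_re.search(text).group())  # "6:00 am" -- the "am" is really "on the"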


Haha that's excellent. Reminds me of a normalisation rule I wrote as part of a larger system to convert "Joe Bloggs Md." into "Dr. Joe Bloggs MD" (where MD is Medical Doctor). TIL that "Md." is a common abbreviation for "Mohammed" in large parts of the world...


The text-to-speech system in use at my local GP's surgery (that announces to patients which rooms they need to go to) pronounces 'Dr' as 'Drive', rather than 'Doctor'. I thought someone would have tested that!


When dealing with this general problem, the proper tool would more likely be language/culture detection + Named-Entity Recognition.

Simple regular expressions can be good enough if you're aware of the domain restriction, though.


Yeah, I threw together this module intending to supplement NER on a text classification project I'm currently working on, not as a replacement for NER.


"Um 12:30 am..." seems like a better example, since 6:00 is still 0600, but one parsing would make 12:30 am into 0030 rather than the intended 1230.


Heh, thanks. I only looked at misparsing and didn't think about the consequences :D


I'm gonna be That Other Other Guy and say... stuff like this is the reason why most programmers I meet take ages to do anything custom. Everybody uses this framework and that library and makes bulky code that could actually be implemented in two far more efficient lines, and they struggle when the need to customize presents itself. Learn regex... you will have a crazy powerful weapon in your arsenal.


Why is this being downvoted?

This is an excellent point. Petty substitutions like verbal expressions might be useful if you're just getting started, but ultimately it's a crutch and it's best just to learn pure regular expressions. They're not that difficult.

Same with bundling a ton of dependencies. Lots of people (especially contemporary programmers, primarily web developers) seem to be deathly afraid of writing custom code to handle a job. It's not "reinventing the wheel", it's implementing logic easily extensible within your application without the hassle of upstream, especially if you're only using a small portion of a library or framework. Using 15 libraries for a 600-line script isn't best practice, it's cowardice.


There are pros and cons to both approaches. If I see a junior developer trying to reinvent the wheel, he's probably going to build a pretty shitty wheel. The whole point of using dependencies isn't laziness or cowardice, it's leveraging others' work to save time, energy, and potential headaches caused by subpar custom implementations. Of course, that doesn't absolve us from learning the guts of our dependencies, but chastising developers for avoiding unnecessary work is absurd. The key is in learning when to import and when to DIY.


It really shouldn't have been downvoted.


The author does explicitly note that in the readme:

> Please note that this module is currently English/US specific.


Previous discussion & article about why you just shouldn't bother with email regexes at https://news.ycombinator.com/item?id=5763327


I'm the guy who says "poke the SMTP server" rather than validate the email address.


Doesn't really apply here; the goal is to detect the existence of a (potential) email address in the text. You can poke the SMTP server once you've found something address-looking, to ensure that it's an address and not, say, a Twitter handle.
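
A rough sketch of that two-step flow, assuming the third-party dnspython package for the MX lookup. Note that many servers accept every RCPT, so the poke is only a weak signal:

    import re
    import smtplib
    import dns.resolver  # third-party: dnspython

    # Step 1: cheap detection -- find address-looking substrings.
    candidate_re = re.compile(r"[^\s@]+@[^\s@]+\.[^\s@]+")
    candidates = candidate_re.findall("mail me at jane.doe@example.com thanks")

    # Step 2: poke the domain's mail server for each candidate.
    def smtp_poke(addr):
        domain = addr.rsplit("@", 1)[1]
        mx = sorted(dns.resolver.resolve(domain, "MX"),
                    key=lambda r: r.preference)[0].exchange.to_text()
        with smtplib.SMTP(mx, 25, timeout=10) as server:
            server.helo()
            server.mail("probe@example.org")
            code, _ = server.rcpt(addr)
        return code == 250  # 250 = recipient accepted (maybe; servers lie)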


Isn't there an i18n extension built on libICU in Python, like the Intl extension in PHP?

http://site.icu-project.org/

I'm sure there are better alternatives than just regexes to validate numbers, emails, etc.
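
For what it's worth, PyICU wraps libICU, and the pure-Python babel package ships with CLDR data. A minimal sketch of locale-aware date parsing with babel (assuming babel is installed; parse_date picks the string apart using the locale's date format):

    from babel.dates import parse_date

    # Locale-aware parsing, driven by CLDR data rather than a
    # hand-rolled regex per locale.
    print(parse_date("05.12.2013", locale="de"))     # 2013-12-05
    print(parse_date("12/05/2013", locale="en_US"))  # 2013-12-05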


It's not validation in this case, it's finding them in a body of text. Validation is generally easier because you often only deal with a single datum in a single field (and thus a single string).
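
The distinction in code, with a deliberately simple phone-number pattern:

    import re

    phone = r"\d{3}-\d{3}-\d{4}"  # deliberately simple US-style pattern

    # Validation: one field, one string -- anchor the whole thing.
    print(bool(re.fullmatch(phone, "555-867-5309")))       # True
    print(bool(re.fullmatch(phone, "call 555-867-5309")))  # False

    # Detection: scan a whole body of text for every occurrence.
    text = "Call 555-867-5309 or 555-123-4567 after 5pm."
    print(re.findall(phone, text))  # ['555-867-5309', '555-123-4567']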


The CLDR provides locale-specific formats for some datum types (IIRC numbers, dates, and durations). These formats can probably be reversed into the corresponding regular expression in order to perform locale-aware data detection.

Of course, further complexity comes from people providing partial and context-dependent information, which is much harder to detect, e.g. "December 3rd".
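
A minimal sketch of that reversal. The token table is simplified, and the two short patterns are hard-coded for illustration rather than pulled from CLDR data:

    import re

    # Map CLDR date-pattern tokens to regex fragments.
    TOKEN_RE = {"dd": r"\d{2}", "d": r"\d{1,2}",
                "MM": r"\d{2}", "M": r"\d{1,2}",
                "yyyy": r"\d{4}", "yy": r"\d{2}", "y": r"\d{1,4}"}

    def pattern_to_regex(cldr_pattern):
        out = []
        for token in re.findall(r"d+|M+|y+|.", cldr_pattern):
            out.append(TOKEN_RE.get(token, re.escape(token)))
        return re.compile("".join(out))

    # CLDR short date patterns for two locales (hard-coded here).
    de_short = pattern_to_regex("dd.MM.yy")  # German
    us_short = pattern_to_regex("M/d/yy")    # US English

    print(de_short.search("Treffen am 05.12.13").group())  # 05.12.13
    print(us_short.search("meeting on 12/5/13").group())   # 12/5/13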



