I am reminded of VerbalExpressions https://github.com/VerbalExpressions, which hit HN in early August. Interested in learning more about regexes, I did some work on a Racket port.
That project shows there are two tracks by which to tackle the problem that regex syntax is incoherent gobbledygook. The first is creating a regex version of PHP, with 5000 functions whose names you mix and match piece by piece. The second is to rename regex symbols to something that is easier for humans to parse.
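To make the first track concrete, here's a minimal sketch of a fluent, function-per-concept builder in the VerbalExpressions spirit. This is a hypothetical mini-API invented for illustration, not the actual VerbalExpressions interface:

```python
import re

# Hypothetical sketch of the "named functions" track: each method
# appends an escaped fragment and returns a new builder.
class Expr:
    def __init__(self, pattern=""):
        self.pattern = pattern

    def then(self, text):
        # Match this literal text next (escaped, so no regex surprises).
        return Expr(self.pattern + re.escape(text))

    def anything_but(self, chars):
        # Match one or more characters not in `chars`.
        return Expr(self.pattern + "[^" + re.escape(chars) + "]+")

    def compile(self):
        return re.compile(self.pattern)

url = Expr().then("http").then("://").anything_but(" ").compile()
print(bool(url.match("http://example.com")))  # True
```

The readability win is real for beginners, but note that the builder is just concatenating pattern fragments; the underlying regex engine and semantics are unchanged.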
Very good point! It's not the reasoning and logic behind regex that would turn a beginner away from it, it's the syntax being used that often results in expressions turning into impenetrable goo!
(thanks for the verbal expressions link btw, seems really promising)
Unless I'm misunderstanding the project, this seems to be a very small library of regular expressions. As one would expect, Perl's offerings blow anything else out of the water: http://search.cpan.org/~abigail/Regexp-Common-2013031301/lib....
I'm gonna be That Other Guy and say that the date and phone regexes are respectively English-language and US-specific. So it's common for a narrow definition of common.
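A quick illustration of the point: a typical "common" US phone pattern (this particular regex is my own example, not taken from the library) happily misses perfectly valid numbers in other formats:

```python
import re

# A typical US-style phone pattern: (NNN) NNN-NNNN with optional
# separators. Not the library's actual regex; just an illustration.
us_phone = re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}")

print(bool(us_phone.search("(555) 123-4567")))    # True: US format
print(bool(us_phone.search("+44 20 7946 0958")))  # False: UK format
```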
Haha that's excellent. Reminds me of a normalisation rule I wrote as part of a larger system to convert "Joe Bloggs Md." into "Dr. Joe Bloggs MD" (where MD is Medical Doctor). TIL that "Md." is a common abbreviation for "Mohammed" in large parts of the world...
The text-to-speech system in use at my local GP's surgery (that announces to patients which rooms they need to go to) pronounces 'Dr' as 'Drive' rather than 'Doctor'. I thought someone would have tested that!
Yeah, I threw together this module intending to supplement NER on a text classification project I'm currently working on, not as a replacement for NER.
I'm gonna be That Other Other Guy and say... stuff like this is the reason why most programmers I meet take ages to do anything custom. Everybody uses this framework and that library, producing bulky code that could actually be implemented in two far more efficient lines, and they'll struggle when the need to customize presents itself. Learn regex... you will have a crazy powerful weapon in your arsenal.
This is an excellent point. Petty substitutions like verbal expressions might be useful if you're just getting started, but ultimately it's a crutch and it's best just to learn pure regular expressions. They're not that difficult.
Same with bundling a ton of dependencies. Lots of people (especially contemporary programmers, primarily web developers) seem to be deathly afraid of writing custom code to handle a job. It's not "reinventing the wheel", it's implementing logic easily extensible within your application without the hassle of upstream, especially if you're only using a small portion of a library or framework. Using 15 libraries for a 600-line script isn't best practice, it's cowardice.
There are pros and cons to both approaches. If I see a junior developer trying to reinvent the wheel, he's probably going to build a pretty shitty wheel. The whole point of using dependencies isn't laziness or cowardice, it's leveraging others' work to save time, energy, and potential headaches caused by subpar custom implementations. Of course, that doesn't absolve us from learning the guts of our dependencies, but chastising developers for avoiding unnecessary work is absurd. The key is in learning when to import and when to DIY.
Doesn't really apply here: the goal is to detect the existence of a (potential) email address in the text. You can poke the SMTP server once you've found something address-looking to ensure that it's an address and not, say, a Twitter handle.
It's not validation in this case, it's finding them in a body of text. For validation you generally have it easier because you often only deal with a single datum in a field (and thus string).
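The detection case can use a deliberately loose pattern, since anything it finds gets verified later anyway. A minimal sketch (the pattern here is my own loose approximation, not a spec-compliant email grammar):

```python
import re

# Deliberately loose: find address-looking strings in free text.
# Real validation (e.g. an SMTP probe) would happen afterwards.
EMAIL_LIKE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

text = "Contact joe@example.com or @twitterhandle for details."
print(EMAIL_LIKE.findall(text))  # ['joe@example.com']
```

Note that the bare Twitter-style handle is skipped because the pattern requires local-part characters before the @ and a dotted domain after it.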
The CLDR provides locale-specific formats for several data types (IIRC numbers, dates, and durations). These formats could probably be reversed into the corresponding regular expressions in order to perform locale-aware data detection.
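The reversal idea might look something like this. The field-token-to-pattern mappings below are my own rough assumptions for a few date fields, not the actual CLDR specification:

```python
import re

# Hypothetical: turn a CLDR-style date pattern (e.g. "dd/MM/y") into a
# regex for locale-aware detection. Field mappings are assumptions.
FIELD_PATTERNS = {
    "dd": r"(?:0[1-9]|[12]\d|3[01])",   # two-digit day
    "MM": r"(?:0[1-9]|1[0-2])",         # two-digit month
    "y":  r"\d{4}",                     # four-digit year
}

def format_to_regex(fmt):
    # Replace each known field token with its pattern; escape the rest.
    token = re.compile("|".join(sorted(FIELD_PATTERNS, key=len, reverse=True)))
    parts, pos = [], 0
    for m in token.finditer(fmt):
        parts.append(re.escape(fmt[pos:m.start()]))
        parts.append(FIELD_PATTERNS[m.group()])
        pos = m.end()
    parts.append(re.escape(fmt[pos:]))
    return re.compile("".join(parts))

en_gb = format_to_regex("dd/MM/y")
print(bool(en_gb.search("Invoice dated 03/12/2013")))  # True
```

A real implementation would need the full CLDR field table (variable-width fields, literal quoting, etc.), but the shape of the idea holds.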
Of course, further complexity comes from people being able to provide partial and context-dependent information, which is much harder to detect, e.g. "December 3rd".
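Partial dates like that need their own patterns entirely. A sketch (English month names only, which is itself an example of the locale problem above):

```python
import re

# Catch partial, human-style dates such as "December 3rd".
# English-only month names: an assumption, not a general solution.
MONTHS = ("January February March April May June July August "
          "September October November December").split()
PARTIAL_DATE = re.compile(
    r"\b(" + "|".join(MONTHS) + r")\s+(\d{1,2})(?:st|nd|rd|th)?\b",
    re.IGNORECASE,
)

m = PARTIAL_DATE.search("The meeting is on December 3rd.")
print(m.group(1), m.group(2))  # December 3
```

And even this says nothing about the context-dependence: "December 3rd" of which year depends on when the text was written.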