
Python's Hidden Regular Expression Gems (2015) - Chris2048
http://lucumr.pocoo.org/2015/11/18/pythons-hidden-re-gems/
======
atdt
> (Have a look at what dir() returns on a regex pattern object).
    
    
      Python 2.7.13 (default, Dec 20 2016, 16:00:28)
      [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
      Type "help", "copyright", "credits" or "license" for more information.
      >>> import re; dir(re.compile(''))
      ['__class__', '__copy__', '__deepcopy__', '__delattr__',
       '__doc__', '__format__', '__getattribute__', '__hash__',
       '__init__', '__new__', '__reduce__', '__reduce_ex__',
       '__repr__', '__setattr__', '__sizeof__', '__str__',
       '__subclasshook__', 'findall', 'finditer', 'flags',
       'groupindex', 'groups', 'match', 'pattern', 'scanner',
       'search', 'split', 'sub', 'subn']
    

Looks OK to me. What am I missing?

~~~
js2
That was apparently fixed in Python 2.7:

[https://github.com/python/cpython/commit/6116d4a1d1ecd6e9b94...](https://github.com/python/cpython/commit/6116d4a1d1ecd6e9b94cfaa524b9b7cd5a6b5447)

[http://bugs.python.org/issue12099](http://bugs.python.org/issue12099)

But that commit pre-dates the blog post by many years, so maybe Armin was
referring to something else, or maybe he hadn't realized it had been fixed?
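
For anyone who wants to check this on their own interpreter, here is a minimal sketch. It assumes a CPython where compiled patterns still expose the undocumented `scanner` method. The pattern and sample string are made up for illustration.

```python
import re

# Illustrative pattern: optional whitespace followed by a word.
pattern = re.compile(r"\s*(\w+)")

# True when dir() lists the undocumented method on the pattern object.
has_scanner = "scanner" in dir(pattern)

# The scanner object remembers its position, so repeated match() calls
# continue where the previous match ended.
sc = pattern.scanner("alpha beta gamma")
words = []
while True:
    m = sc.match()
    if m is None:
        break
    words.append(m.group(1))

print(has_scanner, words)
```

Since the method is undocumented, its behaviour could differ in other implementations or versions.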

------
fernly
Couple relevant links. "regex"[1] is an upward-compatible "re" replacement
with better performance and many added features including fuzzy matching.

Second, the tokenizer technique of stacked expressions described in the OP is
also documented in the Python re module doc[2], at least conceptually.

[1] [https://pypi.python.org/pypi/regex/2017.04.29](https://pypi.python.org/pypi/regex/2017.04.29)

[2] [https://docs.python.org/3.5/library/re.html#writing-a-tokeni...](https://docs.python.org/3.5/library/re.html#writing-a-tokenizer)
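
The stacked-expression idea can be sketched roughly as follows: each token kind becomes a named group, the groups are joined with `|`, and `lastgroup` tells you which alternative matched. The token names and the `tokenize` helper are invented for this illustration and are not taken from the article or the docs.

```python
import re

# Hypothetical token table: (name, pattern). The patterns contain no inner
# capturing groups, so lastgroup is always one of these names.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (kind, text) pairs, dropping whitespace."""
    pos = 0
    tokens = []
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError("unexpected character %r at %d" % (text[pos], pos))
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(tokenize("x = 42 + y"))
```

Earlier entries in the table win when two alternatives could match at the same position, so the order matters.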

------
aargh_aargh
Previous HN discussion here:

[https://news.ycombinator.com/item?id=10600520](https://news.ycombinator.com/item?id=10600520)

------
DonaldPShimoda
This is super neat! But... why is it undocumented? Is there a reason it's kept
a "secret"?

~~~
kingosticks
Surely it would have been way more useful to contribute the missing
documentation rather than write the article. Maybe that's harder to do than it
sounds.

------
natch
2015.

"Ignoring Python 3.." that's getting harder and harder to do these days.

~~~
josteink
I'm rather the opposite: I ignore anything ignoring Python 3. That's the only
version of Python I can bother touching these days.

~~~
RUTHLESS_RUFUS
It is official; Netcraft confirms: Python 2.7 is officially dead.

One more crippling bombshell hit the already beleaguered Python 2.7 community
when josteink confirmed that he would not be bothering to touch anything older
than Python 3 code.

All major surveys show that Python 2.7 has steadily declined in market share.
2.7 packages are very sick and their long term survival prospects are very
dim. If legacy Python is to survive at all it will be among trolling
dilettante dabblers. 2.7 continues to decay. Nothing short of a miracle could
save it at this point in time. For all practical purposes, Python 2.7 is dead.

FACT: Python 2.7 is dying.

~~~
zu03776
Wasn't the same said of BSD at one time?

~~~
RUTHLESS_RUFUS
Yes, ha, I was just taking the piss. It's an old Slashdot trope to declare
things dead by Netcraft survey.

------
falsedan
> _[Python's implementation of regular expressions is] one of the best of all
> dynamic languages I would argue_

This sounds dishonest to me; it's deliberately couching the claim with a
qualification ("imo") and disqualifying the 'dynamic' languages with better
regex support (Perl, Ruby, JavaScript). The counter-claim is already prepared
as a "No true Scotsman" argument: these other languages aren't as good because
they don't have a scanner built into the regex language.

I also don't understand why the author needs to qualify the claim on
dynamic/non-dynamic grounds. Can someone please help me understand?

~~~
coldtea
> _This sounds dishonest to me; it's deliberately couching the claim with a
> qualification ("imo")_

He gives his opinion, so he's humble about it ("imo", "I would argue", etc.).

It's not some great weasel-word trick to lure people, nor is it meant as an
absolute scientific conclusion. He just says what he has seen in his experience.

> _and disqualifying the 'dynamic' languages with better regex support (Perl,
> Ruby, JavaScript)._

How's that dishonest? He's not disqualifying them from the comparison he makes
with some trick -- he just explicitly states that Python, in his opinion, is
better than them.

> _The counter-claim is already prepared as a "No true Scotsman" argument:
> these other languages aren't as good because they don't have a scanner built
> into the regex language._

Even if he said that, that's not how the "No true Scotsman" fallacy works.

No true Scotsman involves you adding to the necessary qualities AFTER you've
made an initial argument that doesn't include the new qualities. If you spell
out in advance that you consider feature X important for a good regex engine,
then you might be wrong, but you're not making a No True Scotsman.

Besides, his point is not that Python has a better regex library mainly, or
only, because it has a scanner feature. The latter is just one feature of the
library that he presents, not what he defines as what makes Python's lib
great.

For that he merely says that it's "one of the better designed core systems
from a pure API point of view".

> _I also don't understand why the author needs to qualify the claim on
> dynamic/non-dynamic grounds. Can someone please help me understand?_

Because he is interested only in or experienced only in dynamic languages and
their regex libs?

Or because while he knows that there are better regex engines for non-dynamic
languages, he still thinks Python has the better regex engine among the
dynamic ones.

It's a casual phrase in a blog post, not some huge marketing / FUD conspiracy
that needs to be explained.

~~~
falsedan
Thanks for replying! I'm explicitly challenging the claim 'Python has one of
the best [standard regular expression core library] of all dynamic languages'.
Its regular expression support is not as good as Perl's, for features and
speed. Ruby's support is roughly equivalent to Perl's, and JavaScript's is
roughly the same as Python's. PHP's suffers from a clunky interface but is
competitive on features+speed.

If we open the floor to all languages, clearly lex/yacc is the winner w.r.t.
writing a scanner. There's little difference between Python's and Java's
support, and Boost::Regex makes PHP look good in comparison.

Can you go into more detail for:

> _That's not how the "No true Scotsman" fallacy works._

Here's how I think it would go:

1> Perl has better regex support than Python

2> Well Python has a scanner built into the standard library, and thus can't
be compared to languages which do not have that feature.

1> … but writing a scanner with the standard regex support in these languages
is trivial

2> Doesn't matter, it's built-in and thus those languages aren't comparable.

> _there are better regex engines for non-dynamic languages, he still thinks
> Python has the better regex engine among the dynamic ones_

Why does that matter? It feels exclusionary: in this narrowly-defined group,
the choice I prefer has the best features. Did they mean, "of the languages I
am familiar with, I've most enjoyed using regexes in Python"? That's a far
more honest claim.

> _It's a casual phrase in a blog post, not some huge marketing / FUD
> conspiracy that needs to be explained._

Communication is done every day, everywhere. These blog posts influence
peoples' opinions, and (more importantly) _influence their approach to
writing_. I want more articles to be balanced, honest, and informative, not
rah-rah puff pieces.

~~~
coldtea
> _Here's how I think it would go_

"No True Scotsman" might go like that, but here there was no step (1). The True
Scotsman is all about the extra constraints added in the transition of (1) to
(2).

If one starts at 2 (no prior discussion), they can define whatever they want
as essential attributes for a Scotsman. They may be wrong in what they list,
but they are not making a N-T-S fallacy.

> _Why does that matter?_

Because the post is aimed (by its author) at people using dynamic languages.

If I want to do some regex work, and mostly program in dynamic languages (or
have found that dynamic languages fit my project better, etc.), then I just
need to see an evaluation of regex engines across dynamic languages -- I don't
care whether C++ or Haskell regex engines might have more features.

In other words, it's about what constraints people put in searching for a
language/lib to use.

Some people place the constraint "must be in a dynamic language" higher than
"must be better overall" when they search for a lib/framework (in our case, a
regex engine).

> _Communication is done every day, everywhere. These blog posts influence
> peoples' opinions, and (more importantly) influence their approach to
> writing. I want more articles to be balanced, honest, and informative, not
> rah-rah puff pieces._

Maybe, but between marketing, hidden sponsored posts, fake news and
crappy/agenda-driven "real" news, I wouldn't start from Armin's blog :-)

~~~
falsedan
Oh, I meant that the counter-argument would use "No true Scotsman", not that
the article did.

> _Maybe, but between marketing, hidden sponsored posts, fake news and
> crappy/agenda-driven "real" news, I wouldn't start from Armin's blog :-)_

I think I can comment on this article, and that the few HN commentators here
will see it, and my message will certainly reach a small audience. For those
relatively larger problems, I don't have access to a suitable medium or
audience and thus, even though the privation is greater, I would reach zero
people.

------
threepipeproblm
I don't have a dog in the fight, but... this reminded me of Rob Pike's
suggestion to avoid regular expressions in parsing and lexing tasks.
[https://commandcenter.blogspot.com/2011/08/regular-expressio...](https://commandcenter.blogspot.com/2011/08/regular-expressions-in-lexing-and.html)
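
For comparison, a hand-written lexer without regular expressions might look roughly like this. It is only a sketch of the general approach, not code from the linked post, and the token kinds and the `lex` function are made up for the example.

```python
def lex(text):
    """Split text into (kind, text) tokens using plain character tests."""
    tokens = []
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            # Whitespace separates tokens and is otherwise ignored.
            i += 1
        elif ch.isdigit():
            start = i
            while i < n and text[i].isdigit():
                i += 1
            tokens.append(("NUMBER", text[start:i]))
        elif ch.isalpha() or ch == "_":
            start = i
            while i < n and (text[i].isalnum() or text[i] == "_"):
                i += 1
            tokens.append(("IDENT", text[start:i]))
        elif ch in "+-*/=":
            tokens.append(("OP", ch))
            i += 1
        else:
            raise ValueError("unexpected character %r at %d" % (ch, i))
    return tokens


print(lex("x = 42 + y"))
```

It is longer than the regex version, but every decision is explicit, which seems to be part of the appeal of avoiding regexes for this job.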

------
Neutrion
As of Python 2.7.13 x64 there is no keyword argument "skip" in the Scanner.scan
method.
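
A quick way to check the signature yourself, plus one way to drop tokens without any "skip" argument. This is a sketch assuming CPython's undocumented `re.Scanner`, where a lexicon entry whose action is `None` produces no token. The lexicon below is invented for illustration.

```python
import inspect
import re

# Parameter names of the scan method on this interpreter.
params = list(inspect.signature(re.Scanner.scan).parameters)

# Each lexicon entry is (pattern, action). The action receives the scanner
# and the matched text; an action of None discards the match.
scanner = re.Scanner([
    (r"\d+", lambda s, tok: ("NUMBER", int(tok))),
    (r"[A-Za-z_]\w*", lambda s, tok: ("IDENT", tok)),
    (r"\s+", None),  # whitespace is matched but not reported
])

# scan() returns the collected tokens and whatever input it could not match.
tokens, rest = scanner.scan("foo 42 !bar")
print(params, tokens, rest)
```

Because `re.Scanner` is undocumented, treat the details here as observed behaviour rather than a guaranteed API.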

