
Python's Hidden Regular Expression Gems - temp
http://lucumr.pocoo.org/2015/11/18/pythons-hidden-re-gems/
======
Grue3
Python's re has nothing on CL-PPCRE [1] though. The ability to build up a
"regular expression" from S-expressions is just too useful.

[1] [http://weitz.de/cl-ppcre/](http://weitz.de/cl-ppcre/)

~~~
nanny
It was also twice as fast as Perl in benchmarks at one point or another.

~~~
kbenson
That's not exactly hard to do, depending on features. There's a definite
trade-off between features and the type of regex engine that can be
implemented.

~~~
DasIch
It seems to me that one should be able to detect which features are actually
used in a regular expression and choose one of multiple different underlying
implementations based on that.

~~~
kbenson
Newer versions of Perl actually support a pluggable regex engine system, so
you can use specific regex engines for specific tasks.

------
willvarfar
Another very cool undocumented feature in another regex engine is re2's Set.
It compiles a collection of regexes into a single regex, and so allows you to
very efficiently match a string against an array of patterns.
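
To make the result concrete, here is a rough sketch of the per-pattern loop that a set-style matcher replaces, written with Python's standard `re` module. It only shows what "match a string against an array of patterns" returns (the indices of every pattern that matches); it does not implement the single compiled automaton, and the patterns and the `matching_indices` helper are made up for illustration.

```python
import re

# Hypothetical patterns, chosen only for illustration.
PATTERNS = [r"\d+", r"[a-z]+", r"^foo", r"bar$"]
COMPILED = [re.compile(p) for p in PATTERNS]


def matching_indices(text):
    """Return the index of every pattern that matches somewhere in text.

    This scans the text once per pattern, which is the cost a combined
    set-style matcher is meant to avoid.
    """
    return [i for i, rx in enumerate(COMPILED) if rx.search(text)]


print(matching_indices("foo 42"))  # [0, 1, 2]
print(matching_indices("42bar"))   # [0, 1, 3]
```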

~~~
andreasvc
Which re2? I maintain a fork of re2 but it's not in there [1].

If you mention re2, its main cool feature is efficiency: it matches in linear
time using a DFA. Unfortunately, Unicode strings need to be encoded to UTF-8,
but if you can design your application to work with UTF-8 bytestrings you can
avoid that cost.

[1] [http://github.com/andreasvc/pyre2](http://github.com/andreasvc/pyre2)

~~~
willvarfar
[https://github.com/google/re2/blob/master/re2/set.h](https://github.com/google/re2/blob/master/re2/set.h)
<-- c++ API

[https://github.com/google/re2/blob/master/re2/prog.h#L339](https://github.com/google/re2/blob/master/re2/prog.h#L339)
<-- c

~~~
andreasvc
Neat. I should consider wrapping that.

------
andreasvc
I wonder what the reason is to include code in a release without documenting
it. Maybe this article can form the basis for finally documenting this
feature?

There's also the reverse with Python: useful code in the documentation not
included in the standard library.

~~~
notzorbo2
This happens pretty often in the Python world. There's a bit of an unwritten
rule to leave implementation details public that would be private in other
languages. Many libraries simply don't bother with prefixing privates with '_'
and just leave things undocumented that you probably shouldn't touch/use.

One notable example is that importing libraries in your code automatically
exposes them to the caller.

    
    
        $ cat lib.py
        import re
        
        def somefunc():
            pass
    
        $ python
        >>> import lib
        >>> lib.re
        <module 're' from '/usr/lib/python2.7/re.pyc'>
    

Many packages also do `import *` from files, which pollutes the package
namespace with all kinds of stuff you really don't want in there. For example,
the popular Requests package:

    
    
        >>> import requests
        >>> requests.logging
        <module 'logging' from '/usr/lib/python2.7/logging/__init__.pyc'>
    

The logging module is not a public part of requests' API. It's just there
because requests uses it internally.

So to answer your question, I'd say it's just common practice. If it's
undocumented in Python, you should pretend it doesn't exist.

~~~
orf
Importing libraries from other libraries like this is very useful:

    
    
        $ python
        >>> import lib
        >>> lib.re
    

That's how __init__.py files work, and is part of what makes Python awesome.
Using `import *` is very bad practice in modules (except in very specific
cases) because it brings in a bunch of crap you don't want and didn't expect.
Modules should define an `__all__` list of 'public things' you want to export,
but restricting access is very anti-Python, as we're all consenting adults.
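
A small sketch of what `__all__` does and doesn't do, assuming a throwaway in-memory module called `lib` that stands in for the `lib.py` from the parent comment (the `somefunc` body is made up):

```python
import sys
import types

# Source of a hypothetical lib.py, built in memory so the sketch is self-contained.
source = '''
import re

__all__ = ["somefunc"]

def somefunc():
    return "ok"
'''
lib = types.ModuleType("lib")
exec(source, lib.__dict__)
sys.modules["lib"] = lib

# Star-import into a fresh namespace and look at what arrived.
namespace = {}
exec("from lib import *", namespace)
imported = sorted(name for name in namespace if not name.startswith("__"))

print(imported)            # ['somefunc']
print(hasattr(lib, "re"))  # True
```

So `__all__` only limits what `import *` copies over; `lib.re` is still there for anyone who reaches for it.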

> If it's undocumented in Python, you should pretend it doesn't exist.

I don't agree. It's fairly common to dive into third-party packages' code to
see what's occurring and to use 'undocumented' things (which are mostly
undocumented because the documentation is bad rather than because they are
hidden away). Just look at the Django `_meta` API, which people relied on
because it was the only place you could get some specific model information in
a stable way, despite being undocumented and private. Now it's been formalized
into a proper API.

Python's extensive use of duck typing also makes it a lot easier to work with
undocumented stuff: you can make some wide-ranging changes to internals
(changing types completely, turning properties into functions) but as long as
it quacks roughly the same nothing breaks.

~~~
andreasvc
What is useful about 'import lib; lib.re'? I cannot think of a situation in
which a direct import wouldn't be better, while that has many obvious
advantages. The __init__.py is a special case of course, but you would only
use that for a package's own modules.

> It's fairly common to dive into 3rd party packages code to see what's
> occurring and to use 'undocumented' things

It may be common, but it doesn't convince me that it's a good idea. It seems
to me that it would be better if the language forced you to design the public
API properly, rather than resorting to undocumented/private APIs.

~~~
orf
> What is useful about 'import lib; lib.re'?

Nothing, but it's a side effect of an awesome feature of Python: nothing being
private. Which is incredibly useful. 'lib.re' is exactly the same case as
'lib.actual_library_function', why should Python add the ability to somehow
stop these from being included? It would increase complexity for no gain.

~~~
andreasvc
You're just repeating that it is awesome and useful without saying why. I
think distinguishing public & private variables offers more support for
structured programming and is therefore desirable.

~~~
orf
> You're just repeating that it is awesome and useful without saying why.

Sorry, I thought you were asking why you are able to import other modules'
imports.

> I think distinguishing public & private variables offers more support for
> structured programming and is therefore desirable.

You can prefix attributes and functions with a single underscore to mark them
as private, or a double underscore to make them more private (the attribute
name gets mangled).
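
A quick sketch of both conventions (the `Account` class and its attribute names are made up for illustration):

```python
class Account:
    def __init__(self):
        self.balance = 10      # public
        self._cache = {}       # single underscore: private by convention only
        self.__secret = "s3"   # double underscore: stored as _Account__secret


acct = Account()

print(acct._cache)            # {}  (the convention does not block access)
print(acct._Account__secret)  # s3  (the mangled name is still reachable)

try:
    acct.__secret             # outside the class body the name is not mangled
except AttributeError:
    print("no attribute called __secret from the outside")
```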

Anyway, Python doesn't have an enforced notion of privateness because it's a
bad idea. By marking something as private you're saying "I, as a developer sat
here writing this know better than all of the users of my library. Their lives
may depend on using something I haven't exposed properly in my API, but too
bad. I know best".

So you end up jumping through ridiculous hoops to access private properties
(because even in languages with private, nothing is _truly_ private), all
because some guy thought he knew best a long time ago while writing the
library you are using.

So a better approach (IMO) is to mark something as private with convention (a
prefixed underscore), which means "this is private, don't depend on it",
without restricting your access. You can drive a car, have sex, pay taxes, but
not access a private variable? Bleugh.

That's more of a cultural thing though, I'm sure enforced private makes more
sense in statically typed, compiled languages with lots of classes (and even
then I would argue they are still bad for the reasons above), and matter more
in huge codebases.

------
fulafel
I did a double take upon seeing "The regex module in Python is really old by
now" and had to make sure it's talking about the current re module! re was
introduced alongside the older regex module around Python 1.5; the latter was
finally removed in Python 2.5.

~~~
andreasvc
Additionally there's a newer module also named 'regex':
[https://pypi.python.org/pypi/regex](https://pypi.python.org/pypi/regex)

~~~
rspeer
And this newer 'regex' is actually really good at tricky cases such as
matching word boundaries. (An apostrophe or a non-ASCII character is not
necessarily a word boundary!)
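
To show the problem itself, here is a sketch using only the standard library `re` (not the third-party `regex` module). The hand-written apostrophe pattern is my own workaround for illustration, not what `regex` does internally:

```python
import re

text = "Don't stop"

# \b treats the apostrophe as a boundary, so the contraction is split.
naive = re.findall(r"\b\w+\b", text)
print(naive)    # ['Don', 't', 'stop']

# Hand-written workaround: allow apostrophe-joined word parts.
joined = re.findall(r"\w+(?:'\w+)*", text)
print(joined)   # ["Don't", 'stop']

# Non-ASCII letters: fine for str patterns by default, broken under re.ASCII.
accented = "na\u00efve caf\u00e9"
unicode_words = re.findall(r"\w+", accented)
ascii_words = re.findall(r"\w+", accented, flags=re.ASCII)
print(unicode_words)  # two words, accents intact
print(ascii_words)    # ['na', 've', 'caf']
```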

~~~
andreasvc
> An apostrophe or a non-ASCII character is not necessarily a word boundary!

I don't see how a regular expression library could help with that (other than
proper Unicode support), because word boundaries are a language-specific,
linguistic problem; i.e., you will need to supply a list of possible
contractions anyway.

Tokenization of natural language text may appear like a straightforward and
solved problem, but there are actually lots of messy details to get right.

------
fndrplayer13
Really interesting read. To be honest, I got a little lost around the Scanner
implementation portion. Guess it's time to take that example and play around
with it myself. My one suggestion would be to maybe walk through an example of
how the Scanner would work to demonstrate your point.

Thanks for the great post.

~~~
the_mitsuhiko
I made a larger class in a github repo and an example of what you can do with
it here:
[https://github.com/mitsuhiko/python-regex-scanner/blob/maste...](https://github.com/mitsuhiko/python-regex-scanner/blob/master/examples/wiki.py)
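
For anyone who wants something smaller to poke at first, here is a minimal sketch of the plain `re.Scanner` from the standard library (not the larger class in the repo above). The token names and rules are made up, and since `re.Scanner` is undocumented its behaviour could differ between Python versions:

```python
import re

# Each rule is (pattern, action). The action receives the scanner and the
# matched text; an action of None means "match and throw away".
scanner = re.Scanner([
    (r"[0-9]+", lambda sc, tok: ("INT", int(tok))),
    (r"[A-Za-z_]\w*", lambda sc, tok: ("NAME", tok)),
    (r"[-+*/=]", lambda sc, tok: ("OP", tok)),
    (r"\s+", None),
])

# scan() returns the collected tokens plus whatever it could not match.
tokens, remainder = scanner.scan("x = 40 + 2 ?")

print(tokens)     # [('NAME', 'x'), ('OP', '='), ('INT', 40), ('OP', '+'), ('INT', 2)]
print(remainder)  # ?
```

Scanning stops at the first position where no rule matches, which is why the trailing `?` comes back as the remainder.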

------
ioquatix
There is nothing unique about Python's implementation of regular expressions.
Ruby has an equally (if not more) powerful `StringScanner`. This is a nice
article but it could have done without the "it's one of the best of all
dynamic languages I would argue" tone.

~~~
the_mitsuhiko
> Ruby has an equally (if not more) powerful `StringScanner`

The string scanner is just a step-by-step matching of individual expressions.
Because they are not folded together, the "skip non-matching" part has to be
done in Ruby, which is precisely what the Python scanner avoids.
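
A rough sketch of the difference in plain Python, assuming a few made-up token rules. This is neither Ruby's `StringScanner` nor the internals of `re.Scanner`; it only contrasts trying each expression separately with folding them into one alternation:

```python
import re

# Hypothetical token rules for illustration.
RULES = [("INT", r"[0-9]+"), ("NAME", r"[a-z]+"), ("WS", r"\s+")]


def tokenize_stepwise(text):
    """Try every rule separately at the current position (loop in user code)."""
    compiled = [(name, re.compile(rx)) for name, rx in RULES]
    pos, out = 0, []
    while pos < len(text):
        for name, rx in compiled:
            m = rx.match(text, pos)
            if m:
                out.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"no rule matches at position {pos}")
    return out


# All rules folded into one alternation; the engine picks the alternative.
FOLDED = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in RULES))


def tokenize_folded(text):
    """One match call per token; m.lastgroup names the rule that matched."""
    pos, out = 0, []
    while pos < len(text):
        m = FOLDED.match(text, pos)
        if m is None:
            raise ValueError(f"no rule matches at position {pos}")
        out.append((m.lastgroup, m.group()))
        pos = m.end()
    return out


print(tokenize_stepwise("ab 12"))  # [('NAME', 'ab'), ('WS', ' '), ('INT', '12')]
print(tokenize_folded("ab 12"))    # same tokens, one regex call per token
```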

------
mkesper
A comparison of JavaScript, Perl and Python regex machinery was posted here
recently, IIRC (can't find it right now), and the external regex module was
found to be much better regarding Unicode support.

~~~
berntb
(How could non-3 Python have good Unicode support? :-) )

The default Python regexp engine gets into a tailspin on less well formulated
regexps that worked well with both PCRE and Perl 5. A few years ago I wrote an
application (a specialized query language) where programmers entered regexps,
and sometimes the execution just hung. This was on 2.6 and early 2.7.

I haven't tried the external regexp module, but hope it is better.

------
nichochar
Amazing! This Scanner object seems so beautifully Pythonic and intuitive.
Honestly, it's a shame it's not documented: it would be a great way for
beginners in Python to write language parsers.

------
larkinrichards
I believe there is a small bug in the final example: the tokenize definition
references 'self' where I believe it should reference 'scanner'.

------
stefantalpalaru
> it's one of the best of all dynamic languages I would argue

The author should learn about PCRE. I wrote a Python wrapper for it that
includes a drop-in 're' substitute:
[https://github.com/stefantalpalaru/morelia-pcre](https://github.com/stefantalpalaru/morelia-pcre)

------
lugus35
[http://doc.perl6.org/language/regexes](http://doc.perl6.org/language/regexes)

Read. Recite. Review.

