(defun create-tokenizer (rules)
  (loop for (token rule) in rules
        for regex = (if (stringp rule) `(:regex ,rule) rule)
        collect `(:register ,regex) into alternatives
        collect token into tokens
        finally
          (return
           (let ((scanner (ppcre:create-scanner `(:alternation ,@alternatives)))
                 (tokens (coerce tokens 'vector)))
             (lambda (string &key (start 0))
               ;; Returns a closure that yields (values token match-start match-end)
               ;; on each call, or NIL once the input is exhausted.
               (lambda ()
                 (multiple-value-bind (match-start match-end reg-starts)
                     (ppcre:scan scanner string :start start)
                   (when match-start
                     (setf start match-end)
                     (values (aref tokens (position-if-not #'null reg-starts))
                             match-start
                             match-end)))))))))
(loop with tokenizer = (funcall (create-tokenizer '((ws "\\s+") (num "\\d+")
                                                    (plus "\\+") (mult "\\*")
                                                    (par-open "\\(") (par-close "\\)")))
                                "(1 + 2) * 3")
      for token = (multiple-value-list (funcall tokenizer))
      while (first token)
      collect token)
=> ((par-open 0 1) (num 1 2) (ws 2 3) (plus 3 4) (ws 4 5) (num 5 6)
    (par-close 6 7) (ws 7 8) (mult 8 9) (ws 9 10) (num 10 11))
Since you mention re2: its main selling point is efficiency, matching in linear time using a DFA. Unfortunately, Unicode strings need to be encoded to UTF-8 first, but if you can design your application to work with UTF-8 byte strings you can avoid that cost.
https://github.com/google/re2/blob/master/re2/prog.h#L339 <-- c
There's also the reverse with Python: useful code in the documentation not included in the standard library.
One notable example is that importing libraries in your code automatically exposes them to the caller:
$ cat lib.py
import re
import logging
import requests
$ python
>>> import lib
>>> lib.re
<module 're' from '/usr/lib/python2.7/re.pyc'>
>>> lib.logging
<module 'logging' from '/usr/lib/python2.7/logging/__init__.pyc'>
So to answer your question, I'd say it's just common practice. If it's undocumented in Python, you should pretend it doesn't exist.
Python's extensive use of duck typing also makes it a lot easier to work with undocumented stuff: you can make some wide-ranging changes to internals (changing types completely, turning properties into functions), but as long as it quacks roughly the same, nothing breaks.
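To make that concrete, here's a minimal sketch (class and attribute names are invented for illustration): a plain attribute is swapped for a computed property, and callers never notice because the interface quacks the same.

```python
class Config:
    def __init__(self):
        self.timeout = 30          # plain attribute

class LazyConfig:
    @property
    def timeout(self):             # same name, now computed on access
        return 10 * 3

def connect(cfg):
    # The caller only cares that `cfg.timeout` quacks like a number.
    return cfg.timeout * 2

assert connect(Config()) == 60
assert connect(LazyConfig()) == 60
```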
> It's fairly common to dive into 3rd party packages code to see what's occurring and to use 'undocumented' things
It may be common, but that doesn't convince me it's a good idea. It seems to me it would be better if the language forced you to design the public API properly than to let everyone resort to using undocumented/private APIs.
Nothing, but it's a side effect of an awesome feature of Python: nothing being private. Which is incredibly useful. 'lib.re' is exactly the same case as 'lib.actual_library_function'; why should Python add the ability to somehow stop one of these from being included? It would increase complexity for no gain.
Sorry, I thought you were asking why you are able to import other modules imports.
> I think distinguishing public & private variables offers more support for structured programming and is therefore desirable.
You can prefix attributes and functions with a single underscore to mark them as private, or a double underscore to make them more private (the attribute name gets mangled).
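A small sketch of both conventions (the class here is made up for illustration); note that double-underscore mangling renames the attribute rather than hiding it:

```python
class Account:
    def __init__(self):
        self._balance = 100        # single underscore: private by convention
        self.__token = "s3cret"    # double underscore: name gets mangled

acct = Account()
assert acct._balance == 100                # nothing stops you reading it
assert acct._Account__token == "s3cret"    # mangled to _Account__token
try:
    acct.__token                           # the unmangled name is gone
except AttributeError:
    pass
```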
Anyway, Python doesn't have an enforced notion of privateness because it's a bad idea. By marking something as private you're saying: "I, the developer sitting here writing this, know better than all of the users of my library. Their lives may depend on using something I haven't exposed properly in my API, but too bad. I know best."
So you end up jumping through ridiculous hoops to access private properties (because even in languages with private, nothing is truly private), all because some guy thought he knew best a long time ago while writing the library you are using.
So a better approach (IMO) is to mark something as private with convention (a prefixed underscore), which means "this is private, don't depend on it", without restricting your access. You can drive a car, have sex, pay taxes, but not access a private variable? Bleugh.
That's more of a cultural thing though. I'm sure enforced privateness makes more sense in statically typed, compiled languages with lots of classes (and even then I would argue it is still bad for the reasons above), and matters more in huge codebases.
Like how would a language that uses explicit exports stop someone exporting everything?
I agree that accessing libraries indirectly probably isn't useful, but I think being able to do dir(lib) and see the namespace that is in use is a good thing (at least in the context of Python).
> being able to do dir(lib) and see the namespace that is in use is a good thing.
It is, but it's most useful when what you get is a curated list of members (__all__) intended to be public.
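For illustration, a sketch of the difference (the `shapes` module here is synthesized in-process rather than read from disk): `dir()` shows the whole namespace, helper imports and all, while `__all__` curates what a star-import picks up.

```python
import sys
import types

# Build a throwaway module that imports a helper library.
shapes = types.ModuleType("shapes")
exec(
    "import math\n"
    "__all__ = ['area']\n"
    "def area(r):\n"
    "    return math.pi * r * r\n"
    "def _helper():\n"
    "    pass\n",
    shapes.__dict__,
)
sys.modules["shapes"] = shapes

ns = {}
exec("from shapes import *", ns)
assert "area" in ns            # star-import honors __all__
assert "math" not in ns        # the helper import is filtered out
assert "_helper" not in ns
assert "math" in dir(shapes)   # but dir() still shows everything
```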
I don't know if I'd call this (common) practice an unwritten rule in the Python world.
Rather, there's no real notion of visibility, so the only thing you can do to make something 'look private' is to name it obscurely (i.e., the underscore prefix, as you mentioned) and leave it undocumented.
It's perhaps my least favorite part of Python's module semantics.
Of course, this habit comes with the risk that you might have to do more maintenance on your code when the library updates.
As documentation takes effort to write and constitutes a sort of contract with your users, it makes sense for library writers to only document the truly essential.
This link may do nothing if you are not part of the beta! http://docs-beta.stackexchange.com/documentation
Presumably it's considered an implementation detail and the author didn't realize it was useful for anything else.
> The 2.0 engine provides another (undocumented) feature that can be used to optimize this even further. The scanner method is used to create a scanner object and attach it to a string.
See http://bugs.python.org/issue5337 where the Python developers wondered if it should be documented.
There's a 10 year old notice at https://mail.python.org/pipermail/patches/2006-November/0210... saying that the code would crash if the scanner was called from multiple threads.
There's no doubt much more information about this - the above was all I cared to find out.
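For reference, a minimal sketch of the undocumented feature being discussed: `Pattern.scanner()` exists in CPython's sre engine (it's what the also-undocumented `re.Scanner` class is built on), and the scanner object's `match()` tries the pattern at the current position and advances past each hit.

```python
import re

pattern = re.compile(r"\s+|\d+|[+*()]")
scan = pattern.scanner("(1 + 2) * 3")   # undocumented CPython API

tokens = []
m = scan.match()                        # anchored at the current position
while m:
    tokens.append((m.group(), m.start(), m.end()))
    m = scan.match()

# tokens begins [('(', 0, 1), ('1', 1, 2), (' ', 2, 3), ...]
```

Because it is undocumented, none of this is guaranteed across Python versions, which is exactly the thread's point.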
I don't see how a regular expression library could help with that (other than proper Unicode support), because word boundaries are a language-specific, linguistic problem; i.e., you will need to supply a list of possible contractions anyway.
Tokenization of natural language text may appear like a straightforward and solved problem, but there are actually lots of messy details to get right.
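A tiny example of the contraction problem (the patterns are illustrative, not a recommendation):

```python
import re

# A naive word tokenizer splits English contractions in two:
assert re.findall(r"\w+", "don't stop") == ["don", "t", "stop"]

# Handling them means baking language-specific knowledge into the pattern,
# and even this only covers one apostrophe style:
assert re.findall(r"\w+(?:'\w+)?", "don't stop") == ["don't", "stop"]
```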
Thanks for the great post.
The string scanner is just step-by-step matching of individual expressions. Because they are not folded together into one alternation, the "skip non-matching input" part has to be done by hand in Ruby, which is precisely what the Python scanner avoids.
The default Python regexp engine gets into a tailspin on less well-formulated regexps that worked fine with both PCRE and Perl 5. (I wrote an application, a specialized query language, a few years ago where programmers entered regexps. Sometimes the execution just hung. This was on 2.6 and early 2.7.)
I haven't tried the external regex module, but I hope it is better.
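For anyone curious what such a tailspin looks like, a classic sketch (the pattern is a textbook example, not from the parent's application): nested quantifiers make the backtracking engine try exponentially many ways to partition the input when the overall match fails.

```python
import re

# (a+)+ can split a run of n 'a's in roughly 2^(n-1) ways; the trailing
# 'b' guarantees failure, so the engine tries them all before giving up.
pattern = re.compile(r"(a+)+$")

assert pattern.match("a" * 20 + "b") is None   # ~2^20 steps, still quick
# Each extra 'a' roughly doubles the running time; around 35 of them the
# same call hangs for minutes.
```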
The author should learn about PCRE. I wrote a Python wrapper for it that includes a drop-in 're' substitute: https://github.com/stefantalpalaru/morelia-pcre