Hacker News
Python from scratch- RegEx (hodspot.com)
43 points by hodbby on March 7, 2012 | 10 comments


I applaud your effort. Regex is a valuable skill (really, language) which you will use across languages and programs as it gives you access to an efficient and pretty general method for scanning and extracting things from text. And if you study fundamentals of computer science (like the Chomsky hierarchy) you will also find that regular expressions are important there too.


But, to make life more exciting, the regular expressions you'll see in actual CS/math are strictly less powerful than the Perl-style regexes you see in Python. E.g. the language accepted by /(a+b+)\1/ is clearly not regular.
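To see the backreference at work, here is a minimal sketch using Python's standard `re` module. The helper name `accepts` is made up for illustration; it just checks whether the whole string matches the pattern.

```python
import re

# \1 must repeat exactly the text captured by group 1, so the pattern
# accepts strings of the form XX where X is some a's followed by some b's.
pattern = re.compile(r"(a+b+)\1")

def accepts(s):
    """Hypothetical helper: True if the entire string matches the pattern."""
    return pattern.fullmatch(s) is not None

print(accepts("abab"))    # True: X = "ab"
print(accepts("aabaab"))  # True: X = "aab"
print(accepts("abaab"))   # False: the second half differs from the first
```

Requiring the second half to equal the first for arbitrarily long halves is what a finite automaton cannot do, which is why this language falls outside the regular ones.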


For a good nerdy time, check out the first implementations of glob and regexp. 20 years on, they still work in modern Pythons. Soon after, Guido decided to make globs a special case of regexp, and his elegant recursive code was no more.

http://svn.python.org/view/python/trunk/Lib/glob.py?revision...

http://svn.python.org/view/python/trunk/Lib/sre_compile.py?r...


In case you're curious, here's what I consider a modern pythonic solution - https://gist.github.com/1995010


Looks shorter. I need to study what you wrote and will tell you whether it is OK or has a bug. Anyway, thanks for dropping your comment.


PEP8 recommends 4 spaces for indentation. At least you should try to make it consistent.


Somehow it looks easier and clearer to use tabs over spaces. Now that you linked me to PEP8 (first time I see it), I will start using 4 spaces. Thanks for your comment.


You're welcome. There's a pep8 package on PyPI that validates code against that recommendation, and there are plugins for most popular editors that make it easy to check your code. Editors can also be configured so that pressing the tab key actually inserts 4 spaces.


> I would expect Repetition to act like Wildcards but ' + ' is not a wildcard.

The meaning of the term "wildcard" may be ambiguous. The plus sign, called a "repetition operator", is used to modify what precedes it, like this:

\w+ will match one or more word characters. Word characters are usually in the set A-Z, a-z, 0-9 and the underscore.

In much the same way, \w* will match zero or more word characters.

And \w? will match zero or one word character.

If you want to match one of these repetition operators literally in your search, precede it with a backslash:

"true\?" will match "true" followed by a question mark, while "true?" will match "tru" optionally followed by "e".
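A quick way to check these rules is to try them in Python's standard `re` module. This is only a small sketch; the sample strings and variable names are made up for illustration.

```python
import re

# \w+ : one or more word characters (letters, digits, underscore)
words = re.findall(r"\w+", "foo_1 bar!")
print(words)  # ['foo_1', 'bar']

# \w* : zero or more word characters, so the empty string matches
empty_ok = re.fullmatch(r"\w*", "") is not None
print(empty_ok)  # True

# \w? : zero or one word character, so two characters do not match
two_chars_ok = re.fullmatch(r"\w?", "ab") is not None
print(two_chars_ok)  # False

# Escaped: "true\?" needs a literal question mark after "true"
literal = re.search(r"true\?", "is it true?").group()
print(literal)  # true?

# Unescaped: "true?" makes the final "e" optional
optional_e = [s for s in ("tru", "true", "tr") if re.fullmatch(r"true?", s)]
print(optional_e)  # ['tru', 'true']
```

Raw strings (the `r"..."` prefix) keep Python from interpreting the backslash before the regex engine sees it.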


Thanks for your answer. I wrote it to show an example of my confusion.

Anyhow, I read your explanation and will code it later tonight. Thanks, man.



