> RE2 was designed and implemented with an explicit goal of being able to handle regular expressions from untrusted users without risk. One of its primary guarantees is that the match time is linear in the length of the input string. It was also written with production concerns in mind: the parser, the compiler and the execution engines limit their memory usage by working within a configurable budget – failing gracefully when exhausted – and they avoid stack overflow by eschewing recursion.
Really interesting articles in the second link detailing how this works!
It feels great to know that speed, the only major drawback, is being seriously improved in each release. Here's to the future!
Documentation link for anyone that is interested
I guess it would be nice to know if optimizations like this are already covered in the high-level libraries like `multiprocessing`.
But web frameworks should implement it.
Neither trolling nor beating dead horses:
I recently wanted to do some screen capture and needed tkinter, which of course has different imports for py2 and py3. Adding library aliases would help. I honestly wish people would use transitional deprecation warnings more often to educate users about changed APIs.
As long as 50% of tensorflow projects are still in Python 2 it is always a pain to encounter Python 3.
from urllib2 import urlopen
from urllib import urlretrieve
from urllib.request import urlopen, urlretrieve
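For what it's worth, the usual workaround for the three imports above is a try/except shim. This is only a minimal sketch of that pattern, not something the standard library does for you: it tries the Python 3 location first and falls back to the Python 2 modules.

```python
# Try the Python 3 location first, then fall back to the Python 2 modules.
try:
    from urllib.request import urlopen, urlretrieve  # Python 3
except ImportError:
    from urllib2 import urlopen      # Python 2
    from urllib import urlretrieve   # Python 2

# Either way, the rest of the module uses the same two names.
print(urlopen.__name__, urlretrieve.__name__)
```

The same shape works for tkinter (`import tkinter` on py3, `import Tkinter as tkinter` on py2), assuming Tk is installed at all.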
I'm willing to live with Python's space related insanity. Mandating that indenting is meaningful. Mandating a best practice of using spaces instead of tabs for indent* (which makes tab width matter). Mandating that a tab equals 4 spaces instead of 8. (Honestly, I REALLY hate this, most of the Windows developers are happy with the space usage koolaid, let us have a default of tab equals 8 spaces of indent.) Those are ALL, REALLY ANNOYING, but lint tools can save me.
What will bite me for as long as Python 3 exists, and maybe Python 4, is how they handle Unicode support and encoded outputs. Pretty much the ENTIRE rest of the computing world does one of two things.
1) Doesn't care what the encoding is, and flings around non-validated 8 bit sequences of data.
2) Does care, and defaults to assuming well formed UTF-8 data-streams for text data, BUT still doesn't do anything to enforce it UNLESS ASKED. Most importantly invalid data streams just continue flowing through the processing tools until they reach a point where a human has designated they're willing to care and /do/ something about it.
If Python 4 wants to 'fix' Unicode support:
* Use a binary sequence base storage class with metadata tags.
  * Length (in bytes, in 'display length', and maybe in 'codepoint length')
  * encoding* (including if normalized and how so)
  * If the encoding has been validated.
* Support automatic coercion between standard encodings
* Allow user defined helpers for custom conversions
* Make all input/output default to UTF-8**
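To make the proposal above concrete, here is a rough sketch of what such a tagged byte container could look like. `TaggedBytes` and all of its members are hypothetical names invented for illustration; nothing like this exists in Python, and the sketch assumes UTF-8 whenever no encoding tag is set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaggedBytes:
    """Hypothetical container: raw bytes plus metadata tags."""
    data: bytes
    encoding: Optional[str] = None  # None means "unknown"; UTF-8 is assumed
    validated: bool = False         # set only after an explicit check

    @property
    def byte_length(self) -> int:
        return len(self.data)

    @property
    def codepoint_length(self) -> int:
        # Decoding is the point where invalid data would surface.
        return len(self.data.decode(self.encoding or "utf-8"))

    def validate(self) -> bool:
        """Check the bytes against the tagged encoding, only when asked."""
        try:
            self.data.decode(self.encoding or "utf-8")
            self.validated = True
        except UnicodeDecodeError:
            self.validated = False
        return self.validated

    def coerce(self, target: str) -> "TaggedBytes":
        """Re-encode into another standard encoding (raises on invalid input)."""
        text = self.data.decode(self.encoding or "utf-8")
        return TaggedBytes(text.encode(target), target, validated=True)


raw = TaggedBytes("héllo".encode("utf-8"), "utf-8")
wide = raw.coerce("utf-16-le")
print(raw.byte_length, raw.codepoint_length, wide.byte_length)
```

Note that nothing is checked until `validate()` or `coerce()` is called, which is the "doesn't enforce it UNLESS ASKED" behaviour described earlier.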
People end up using hacks like https://github.com/whitequark/rack-utf8_sanitizer which shouldn't be necessary in the first place, because you should know whether you're receiving raw bytes or text. But what you thought was text that's utf8-compatible actually includes characters from another encoding.
`String.valid_encoding?` is probably the worst programming idea. If you have to ask that, it's too late. An exception should have been raised when the input was initially processed, or the type should not pretend that it's real text.
print(str(b'I DONT WANT TO add "UTF-8" everywhere!!','UTF-8'))
`open` also defaults to your system encoding, which is likely utf8 judging from the message.
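A small sketch of how to sidestep the platform default: pass `encoding` explicitly to `open`. The file name and temp directory here are made up for the example, and `locale.getpreferredencoding(False)` is printed only to show what the fallback would be on the current machine.

```python
import locale
import os
import tempfile

# The encoding `open` falls back to when none is given (platform dependent).
print(locale.getpreferredencoding(False))

# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Being explicit makes the behaviour identical on every platform.
with open(path, "w", encoding="utf-8") as f:
    f.write("naïve café\n")

with open(path, encoding="utf-8") as f:
    text = f.read()

print(text)
```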
not quite default!
The encode (or is it decode? I always forget) defaults to utf8, which is what you should be using.
print('I DONT WANT TO add "UTF-8" everywhere!!')
And get the same result. Or if you really want to start with bytes:
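A minimal sketch of the bytes variant, assuming the point being made is that `bytes.decode()` and `str.encode()` both default to UTF-8 in Python 3, so the encoding name never has to be spelled out:

```python
data = b'I DONT WANT TO add "UTF-8" everywhere!!'

# decode() with no arguments uses UTF-8.
text = data.decode()
print(text)

# encode() with no arguments uses UTF-8 as well, so this round-trips.
round_trip = text.encode()
```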
As the basis of a high-level language's string type, UTF-8 is objectively incorrect. Strings in a high-level language should be as clean an abstraction of Unicode as possible, and leaking implementation details of the particular byte-encoding scheme up to the programmer is not acceptable.
Also, having different string methods (.byte_length, .char_length(?), .codepoint_length) seems like a good idea.
Or string.length(aspect=bytes), but what should the default aspect be?
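For comparison, this is roughly how those different "lengths" can already be obtained in Python 3. The variable names are just labels for the sketch; the sample string uses a combining accent so that the counts actually differ.

```python
import unicodedata

s = "cafe\u0301"  # 'e' followed by a combining acute accent (U+0301)

codepoints = len(s)                             # code points in the str
utf8_bytes = len(s.encode("utf-8"))             # bytes once encoded as UTF-8
utf16_units = len(s.encode("utf-16-le")) // 2   # UTF-16 code units
nfc_codepoints = len(unicodedata.normalize("NFC", s))  # after composing 'é'

print(codepoints, utf8_bytes, utf16_units, nfc_codepoints)
```

None of these is the "display length"; that depends on grapheme clusters and font width, which the standard library does not compute.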
Now, I should point out here that I’m not really knocking the people who were writing, say, command-line and file-handling utilities in Python. For years, Python sort of accepted the status quo of the Unix world, which was mostly to stick its fingers in its ears and shout LA LA LA I CAN’T HEAR YOU I’M JUST GOING TO SET LC_CTYPE TO C AGAIN AND GO BACK TO MY HAPPY PLACE. A bit later on it changed to “just use UTF-8 everywhere, UTF-8 is perfectly safe”, which really meant “just use UTF-8 everywhere because we can continue pretending it’s ASCII up until the time someone gives us a non-ASCII or multi-byte character, at which point do the fingers-in-ears-can’t-hear-you thing again”.
So a lot of what you’ll see in terms of complaints about string handling are really complaints that Unix’s pretend-everything-is-ASCII-until-it-breaks approach was never very good to begin with and just gets worse with every passing year.
I stand by this: we had a couple of decades of Python catering to this brokenness, and it made life miserable for everyone who didn't work in that particular domain. Python 3 changed that. Does it mean life got harder for some people? Yup. But life got a lot easier and more reliable for many more people, and it's a tradeoff I'm willing to accept.
It's the difference between an easy-to-debug exception ("you asked me to read a value here, but your assumptions about the encoding don't match reality") and a hard-to-debug one ("I did a lot of processing; you thought this thing was valid text, but it isn't; have fun tracking down how it got here in the first place").
Proposed "string like" Object: __IF__ you want to turn on debugging, sure, force it to validate assumptions at runtime/compile time. Otherwise, call verify() when you're willing to handle a result indicating "we have a problem".
Maybe the verify() call returns the byte-access-offset of the first non-conforming sequence.
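A sketch of what such a `verify()` could look like on top of today's Python. The function name and its return convention are hypothetical (taken from the suggestion above); the only real API used is `bytes.decode` and the `start` attribute of `UnicodeDecodeError`.

```python
from typing import Optional


def verify(data: bytes, encoding: str = "utf-8") -> Optional[int]:
    """Return the byte offset of the first non-conforming sequence, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start  # index of the first byte that failed to decode
    return None


print(verify(b"all good"))     # valid UTF-8
print(verify(b"abc\xffdef"))   # 0xFF can never appear in UTF-8
```

The caller decides when to run it, so unvalidated bytes can keep flowing until someone cares.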
I think we've got a fundamental disagreement here. I don't believe they're similar at all. They just happened to get confused a lot in the past when it didn't matter that much.
That would be a terrible thing for people who work with strings as strings, and would essentially tell such people -- who make up a lot of Python's user base! -- "go use something else, this language is only for sysadmins to write Unix utilities now".
If you require an eternally frozen implementation of Python 2's behavior in order to get your work done, you can download the source and keep it around as long as you like. Nobody else has an obligation to support you or hold back the rest of the world in order to suit your use case in your preferred way.
That's what the generic array type is for: it MIGHT be handled like a string (if the programmer chooses), and when it is designated as being in a Unicode format of ANY type and processed with data of any valid Unicode type, it will automatically coerce together the correct way and produce a valid result* (assuming valid inputs; the result might still need to be validated again, with the bitflag cleared, if anyone cares).
THAT is what a high level language should be.
Python has actually __REGRESSED__ in that respect. Its handling of Unicode is //low// level. It forces the programmer to need to care about the data types, and deal with legacy invalid data strings in some method that's up to them. Except they can't /ignore/ invalid sequences, because that option isn't supported.
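For reference, Python 3's codecs do take an `errors` argument, so invalid sequences can be dropped, replaced, or carried through untouched. A short sketch with made-up sample bytes (Latin-1 data being read as UTF-8):

```python
raw = b"caf\xe9 ok"  # Latin-1 bytes; 0xE9 is not valid UTF-8 here

# Drop the offending byte entirely.
ignored = raw.decode("utf-8", errors="ignore")

# Substitute U+FFFD REPLACEMENT CHARACTER for it.
replaced = raw.decode("utf-8", errors="replace")

# Smuggle the byte through as a lone surrogate so it can be restored later.
escaped = raw.decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")

print(repr(ignored), repr(replaced), restored == raw)
```

`surrogateescape` is the closest thing to "let the invalid data keep flowing": the original bytes come back unchanged on re-encoding.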
So the following lines are at equivalent indentation:
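A small sketch of the rule in question, assuming the point is the documented one: a tab advances to the next multiple of 8 columns, so one tab and eight spaces land on the same column. Python 3 keeps that column rule but rejects a block whose meaning would depend on the tab width, which can be checked with `compile`:

```python
# One line indented with a tab, the next with eight spaces.
src = "if True:\n\tx = 1\n        y = 2\n"

try:
    compile(src, "<demo>", "exec")
    outcome = "accepted"
except TabError:
    # Python 3: inconsistent use of tabs and spaces in indentation.
    outcome = "TabError"

print(outcome)
```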
Does it add py2 conventions to py3, or
does it add py3 conventions to py2?
Python 3 passed Python 2 on performance years ago.