Since one of the mentioned improvements is performance of the built-in regular expression library... for anyone curious, there's a drop-in replacement, re2, that wraps a Google library to provide linear-time guarantees for arbitrarily complex or user-provided regexes. (You lose lookahead assertions, but most use cases for them can be handled in pre- or post-processing.) It's an incredibly handy tool that lets us confidently use complex regexes in production.

https://pypi.python.org/pypi/re2/

https://github.com/google/re2/wiki/WhyRE2
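A quick sketch, assuming the Python bindings linked above (which aim to mirror the stdlib `re` API):

```python
import re2 as re   # drop-in for the stdlib module, per the bindings above

# A classic catastrophic-backtracking pattern: exponential time in a
# backtracking engine, linear time in RE2.
pattern = re.compile(r"(a+)+$")
print(pattern.match("a" * 64 + "!"))   # None, returned quickly
```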
> RE2 was designed and implemented with an explicit goal of being able to handle regular expressions from untrusted users without risk. One of its primary guarantees is that the match time is linear in the length of the input string. It was also written with production concerns in mind: the parser, the compiler and the execution engines limit their memory usage by working within a configurable budget – failing gracefully when exhausted – and they avoid stack overflow by eschewing recursion.
Really interesting articles in the second link detailing how this works!
Backreferences are one of the features that make regexes non-linear, which defeats the point of being able to trust arbitrary regexes to have good performance.

Absolutely. But for most common uses of backreferences, this is largely mitigated by a post-regex step that compares capturing groups for equality, which adds overhead that is also linear in the length of the string. Of course this falls short of true backreference support, but for tasks where we don't expect potential matches to overlap, it gets the job done.
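A rough sketch of that post-match step, using the stdlib `re` module purely for illustration (the pattern itself contains no backreferences, so it stays linear-safe under RE2 too):

```python
import re

# Instead of a backreference pattern like r"(\w+)=\1", capture both
# sides with an ordinary pattern and compare the groups afterwards.
pattern = re.compile(r"(\w+)=(\w+)")

def find_equal_pairs(text):
    """Keep only matches whose two captured groups are equal,
    emulating the backreference in post-processing."""
    return [ for m in pattern.finditer(text)
            if == ]

print(find_equal_pairs("a=a b=c x=x"))   # ['a=a', 'x=x']
```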
> Also collection before a POSIX fork() call may free pages for future allocation which can cause copy-on-write too so it’s advised to disable gc in master process and freeze before fork and enable gc in child process.
I guess it would be nice to know whether optimizations like this are already covered by high-level libraries like `multiprocessing`.
They aren't, and shouldn't be. This is something you need to opt into specifically, because it changes behaviour. For example, if you have some resource still open, calling gc.freeze() will prevent it from ever being cleaned up.
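For reference, the opt-in pattern being described looks roughly like this (CPython 3.7+, POSIX only):

```python
import gc
import os

# Master process: stop the collector early, then freeze all surviving
# objects into a permanent generation right before fork(), so a later
# collection can't free pages and trigger copy-on-write in children.
gc.disable()
gc.freeze()

pid = os.fork()
if pid == 0:
    gc.enable()    # child: collect its own garbage normally
    os._exit(0)
else:
    os.waitpid(pid, 0)
```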
I recently wanted to do some screen capture and needed tkinter, which of course has different imports for Python 2 and 3. Adding library aliases would help. I honestly wish people would use transitional deprecation warnings more often to educate about changed APIs.
As long as 50% of TensorFlow projects are still in Python 2, it's always a pain to encounter Python 3.
```
try:
    # Python 2
    from urllib2 import urlopen
    from urllib import urlretrieve
except ImportError:
    # Python 3
    from urllib.request import urlopen, urlretrieve
```
Is 15 years to prepare still not long enough to migrate your code incrementally? You can get Py2 into a nearly-ready-for-Py3 state with just a little effort stretched over whatever horizon you desire. Beating a dead horse doesn't do anything about your stagnant code base.
I used to be quite militant about this, but you have to accept that business goals often don't align. You can often do it on the side, during other changes, but convincing your manager that it's worth it is a hard problem.
No, because they did a bunch of things that make automated porting not work well and create a ton of developer friction.
I'm willing to live with Python's whitespace-related insanity. Mandating that indentation is meaningful. Mandating a best practice of using spaces instead of tabs for indentation* (which makes tab width matter). Mandating that a tab equals 4 spaces instead of 8. (Honestly, I REALLY hate this; most Windows developers are happy with the space-usage koolaid, let us have a default of a tab equalling 8 spaces of indent.) Those are ALL REALLY ANNOYING, but lint tools can save me.
What will bite me for as long as Python 3 exists, and maybe Python 4, is how they handle Unicode support and encoded outputs. Pretty much the ENTIRE rest of the computing world does one of two things.
1) Doesn't care what the encoding is, and flings around non-validated 8-bit sequences of data.
2) Does care, and defaults to assuming well-formed UTF-8 data streams for text data, BUT still doesn't do anything to enforce it UNLESS ASKED. Most importantly, invalid data streams just continue flowing through the processing tools until they reach a point where a human has decided they're willing to care and /do/ something about it.
If Python 4 wants to 'fix' Unicode support:
* Use a binary-sequence base storage class with metadata tags:
  - length (in bytes, in 'display length', and maybe in 'codepoint length')
  - encoding* (including whether it has been normalized, and how)
  - whether the encoding has been validated
* Support automatic coercion between standard encodings
* Allow user-defined helpers for custom conversions
* Make all input/output default to UTF-8**, but permit raw output of such types without validation or re-encoding, just like every other UNIX tool that knows what an encoding is.
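A hypothetical sketch of that container; every name here (TaggedBytes, declared_encoding, validated) is illustrative rather than a real API:

```python
from dataclasses import dataclass

@dataclass
class TaggedBytes:
    raw: bytes
    declared_encoding: str = "utf-8"
    validated: bool = False            # stays False until someone asks

    @property
    def byte_length(self) -> int:
        return len(self.raw)

    @property
    def codepoint_length(self) -> int:
        # Only meaningful once the declared encoding actually holds.
        return len(self.raw.decode(self.declared_encoding))

    def validate(self) -> bool:
        """Set the validated flag; invalid data keeps flowing untouched
        until a caller explicitly decides to care."""
        try:
            self.raw.decode(self.declared_encoding)
            self.validated = True
        except UnicodeDecodeError:
            self.validated = False
        return self.validated

    def coerce(self, encoding: str) -> "TaggedBytes":
        """Automatic coercion between standard encodings."""
        text = self.raw.decode(self.declared_encoding)
        return TaggedBytes(text.encode(encoding), encoding, True)
```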
I think this world exists in Ruby, and as far as I can tell it fares worse in comparison. You end up in situations where two layers of gems don't care what they're passing through and operating on; then you get the result and can't tell anything useful about the contents. Was the encoding preserved? Is it even encoded?
People end up using hacks like https://github.com/whitequark/rack-utf8_sanitizer which shouldn't be necessary in the first place, because you should know whether you're receiving raw bytes or text. But what you thought was text that's utf8-compatible actually includes characters from another encoding.
`String#valid_encoding?` is probably the worst programming idea. If you have to ask that, it's too late. An exception should have been raised when the input was initially processed, or the type should not pretend to be real text.
As a serialization and transmission format, UTF-8 is fine.
As the basis of a high-level language's string type, UTF-8 is objectively incorrect. Strings in a high-level language should be as clean an abstraction of Unicode as possible, and leaking implementation details of the particular byte-encoding scheme up to the programmer is not acceptable.
OK, is it reasonable to ask for all string getter/setter and file read/write ops to default to UTF-8?
Also, having distinct string methods (.byte_length, .char_length(?), .codepoint_length) seems like a good idea.
Or string.length(aspect=bytes), but what should the default aspect be?
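For what it's worth, Python 3 already exposes each of those aspects, just not behind a single method:

```python
import unicodedata

s = "héllo"                                   # 'é' is one code point

print(len(s))                                 # 5 code points
print(len(s.encode("utf-8")))                 # 6 bytes ('é' is 2 bytes)
print(len(unicodedata.normalize("NFD", s)))   # 6: 'e' + combining accent
```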
I'm going to quote something I wrote a couple years ago[1]:
Now, I should point out here that I’m not really knocking the people who were writing, say, command-line and file-handling utilities in Python. For years, Python sort of accepted the status quo of the Unix world, which was mostly to stick its fingers in its ears and shout LA LA LA I CAN’T HEAR YOU I’M JUST GOING TO SET LC_CTYPE TO C AGAIN AND GO BACK TO MY HAPPY PLACE. A bit later on it changed to “just use UTF-8 everywhere, UTF-8 is perfectly safe”, which really meant “just use UTF-8 everywhere because we can continue pretending it’s ASCII up until the time someone gives us a non-ASCII or multi-byte character, at which point do the fingers-in-ears-can’t-hear-you thing again”.
So a lot of what you’ll see in terms of complaints about string handling are really complaints that Unix’s pretend-everything-is-ASCII-until-it-breaks approach was never very good to begin with and just gets worse with every passing year.
I stand by this: we had a couple of decades of Python catering to this brokenness, and it made life miserable for everyone who didn't work in that particular domain. Python 3 changed that. Does it mean life got harder for some people? Yup. But life got a lot easier and more reliable for many more people, and it's a tradeoff I'm willing to accept.
Currently in Python you know that you're holding either raw bytes or something that can be successfully serialised to UTF-8. With your proposed solution, you'll only find out which one it is when you try to encode it.
It's the difference between easier to debug: "you asked me to read a value here, but your assumptions about the encoding don't match reality" exception and the hard to debug: "I did a lot of processing; you thought this thing is a valid text, but it isn't; have fun tracking down how it got here in the first place" exception.
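A minimal illustration of the two failure modes, assuming UTF-8 at the boundary:

```python
data = b"caf\xe9"   # Latin-1 bytes, not valid UTF-8

# Easy to debug: the boundary check fails right where the bad input is.
try:
    text = data.decode("utf-8")
except UnicodeDecodeError as exc:
    print(f"bad input at byte {exc.start}: {exc.reason}")

# Hard to debug: smuggle the bytes through as "text" and the error
# surfaces many layers away from its source.
smuggled = data.decode("latin-1")    # silently "works"
try:
    smuggled.encode("ascii")         # ...much later, in unrelated code
except UnicodeEncodeError:
    print("have fun tracking down how that got here")
```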
Python 3: we have two things that are ALMOST the same, and which, if it had been done correctly, could have been converted just by changing what we're willing to call them (in the "downgrade" direction; or with verification as well, if you want to achieve what I hate that Python 3 is forcing on programmers).
Proposed "string like" Object: __IF__ you want to turn on debugging, sure, force it to validate assumptions at runtime/compile time. Otherwise call verify() when you're willing to handle that being some result indicating "we have a problem".
Maybe the verify() call returns the byte-access-offset of the first non-conforming sequence.
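In the same hypothetical vein as the container sketched earlier, verify() could be as simple as:

```python
def verify(raw: bytes, encoding: str = "utf-8"):
    """Return None if the buffer decodes cleanly, else the byte
    offset of the first non-conforming sequence."""
    try:
        raw.decode(encoding)
        return None
    except UnicodeDecodeError as exc:
        return exc.start

print(verify(b"ok"))          # None
print(verify(b"ok\xffbad"))   # 2
```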
I think we've got a fundamental disagreement here. I don't believe they're similar at all. They just happened to get confused a lot in the past when it didn't matter that much.
Which "solution" was that? Your comment suggesting that len() should be the length in bytes of a UTF-8 representation of the string?
That would be a terrible thing for people who work with strings as strings, and would essentially tell such people -- who make up a lot of Python's user base! -- "go use something else, this language is only for sysadmins to write Unix utilities now".
No, the one that re-unifies the sequences of bytes that might be "strings" under one container object which has fields indicating what kind of format it is and automatic coercion methods for common types.
Pretending that strings and sequences of bytes are identical and interchangeable is fundamentally flawed. The only reason people were able to get away with it for as long as they did was because of immense pain inflicted on the rest of the Python programming community.
If you require an eternally frozen implementation of Python 2's behavior in order to get your work done, you can download the source and keep it around as long as you like. Nobody else has an obligation to support you or hold back the rest of the world in order to suit your use case in your preferred way.
For compatibility reasons, len(unicode type) should probably return the size when accessed as a raw 8-bit sequence, i.e. an alias for .length_bytes. I also think the other names should similarly start with .length_.
len() should return the length in individual addressable units, so that obj[len(obj)-1] always returns the last item, whether that's a byte or a unicode character.
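That invariant already holds for both of Python 3's sequence types:

```python
s = "naïve"
b = s.encode("utf-8")

print(s[len(s) - 1])     # 'e' -- last code point
print(b[len(b) - 1])     # 101 -- last byte (ord('e'))
print(len(s), len(b))    # 5 code points, 6 bytes ('ï' is 2 bytes)
```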
What's unacceptable is forcing the programmer to deal with the possibility of 'garbage' strings by default. Also, having major issues when half of the libraries want 'binary string-like sequences' and the other half want 'native Unicode strings'.
That's what support for a generic array type gives you: one which MIGHT be handled like a string (if the programmer chooses), and which, when designated as being in a Unicode format of ANY type and processed with data of any valid Unicode type, will automatically coerce the correct way and produce a valid result* (assuming valid inputs, which might still need actual validation (again, bit-flag cleared) if anyone cares).
THAT is what a high level language should be.
Python has actually __REGRESSED__ in that respect. Its handling of Unicode is //low// level. It forces the programmer to care about the data types and to deal with legacy invalid data strings in some method that's up to them. Except they can't /ignore/ invalid sequences, because that option isn't supported.
A literal tab character is equivalent to 8 spaces in Python source (or 1-7 spaces if the tab is preceded by a number of spaces that is not an even multiple of 8). This follows well-established Unix standards.
So the following lines are at equivalent indentation:
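Something like the following, with a literal tab shown as <TAB> (the code on each line starts at column 9):

```
<TAB>x = 1        # one tab: advances to the next multiple of 8
        x = 1     # eight spaces
    <TAB>x = 1    # four spaces, then the tab still advances to column 9
```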
Neither. It adds wrappers in `six.moves` for the various modules that moved between Python 2 and 3. It will not automatically fix existing code, but it does let you use one import line.
Neither, it just lets you write code that runs on 2 and 3 quite easily, by hiding import differences and changes to builtins. It's worth it if you are distributing code that has to run on both 3 and 2, but if you're upgrading your existing Python 2 code base it's not that great.
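For comparison with the try/except import dance earlier in the thread, six collapses it to one line:

```python
# Resolves to urllib2/urllib on Python 2 and urllib.request on Python 3.
from six.moves.urllib.request import urlopen, urlretrieve
```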
Python 3.0 was slower than Python 2.7, largely because some new modules were initially implemented in pure Python in order to get the APIs out there for people to experiment with. They've since been replaced with implementations in C.
Python 3 passed Python 2 on performance years ago.