> `str` are sequence of unicode codepoints (not UTF-8, which is a specific encoding for unicode codepoints)
It’s worse than that, actually: UTF-8 is a specific encoding for sequences of Unicode scalar values (which means: code points minus the surrogate range U+D800–U+DFFF). Since str is a sequence of Unicode code points, this means you can make strings that cannot be encoded in any standard encoding:
>>> '\udead'.encode('utf-16')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16' codec can't encode character '\udead' in position 0: surrogates not allowed
>>> '\ud83d\ude41'.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
Python 3’s strings are a tragedy. They seized defeat from the jaws of victory.
Maybe we need another PEP that switches the default to WTF-8 [0] aka UTF-8 but let's ignore that a chunk of code points was reserved as surrogates and just encode them like any other code point.
It’s worse than that, actually: UTF-8 is a specific encoding for sequences of Unicode scalar values (which means: code points minus the surrogate range U+D800–U+DFFF). Since str is a sequence of Unicode code points, this means you can make strings that cannot be encoded in any standard encoding:
Python 3’s strings are a tragedy. They seized defeat from the jaws of victory.