

Python's missing string type - mborch
https://maltheborch.com/2014/04/pythons-missing-string-type

======
HelloNurse
Do you want to experiment with rope-style data structures containing variable
length encodings of characters? Fine, but please don't confuse efficient data
structures with correct Unicode handling, which is a major improvement in
Python 3 even if you resent it. Metaphorically speaking, what do you prefer to
have: a painful scar or undiagnosed skin cancer? Assumptions about input
encodings are the latter.

Ropes might or might not perform better than current array-based str and bytes
implementations, but attempting to unify the two data types is just wrong. For
example, would you count lengths and offsets in bytes or in characters? You
need two data types to get the two possible behaviours.
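To illustrate the point, here is a quick Python 3 sketch (my own example, not from the article) of how byte offsets and code-point offsets diverge for a multi-byte encoding:

```python
# Why byte offsets and character offsets are different things
# for multi-byte encodings (Python 3).
s = "naïve"            # str: indexed by code point
b = s.encode("utf-8")  # bytes: indexed by byte

assert len(s) == 5     # 5 code points
assert len(b) == 6     # 'ï' takes two bytes in UTF-8

# The "same" offset means different things in each type:
assert s[:3] == "naï"
assert b[:3] != "naï".encode("utf-8")  # cuts 'ï' in half
```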

~~~
mborch
I would count lengths and offsets with respect to characters (codepoints).
Note that for fixed-length encodings such as "raw", "ascii", or "latin-1",
this always means bytes.
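A quick Python 3 illustration of that claim (example strings are mine):

```python
# For a fixed-width one-byte codec such as latin-1, code-point
# counts and byte counts coincide.
s = "café"
b = s.encode("latin-1")
assert len(s) == len(b) == 4   # one byte per code point

# Not so for a variable-width codec:
u = s.encode("utf-8")
assert len(u) == 5             # 'é' needs two bytes in UTF-8
```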

------
mmerickel
This is a neat little optimization that can help you avoid converting
everything under the hood, leaving things in their raw/encoded forms. This may
help in certain cases, but it would probably also make the C API slightly more
complex.

The real problem with Python 2 is that it implicitly converts between byte
strings and unicode, decoding with the default codec (ASCII). This works as
long as the data happens to be ASCII, but not always. Unfortunately, the fact
that it works at all causes people to ship code that they think is fine...
until later, when a byte string with non-ASCII content comes along and it
blows up. Python 3 fixed this by making it blow up every time.
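The "blows up every time" behaviour is easy to demonstrate in Python 3 (a minimal sketch; the byte string is my own example):

```python
# Python 3 refuses to mix bytes and str implicitly, so the error
# surfaces immediately, not on the first non-ASCII input.
data = b"caf\xc3\xa9"  # UTF-8 bytes, as if read from the wire

try:
    _ = "prefix: " + data        # implicit mixing
except TypeError:
    pass                         # blows up every time
else:
    raise AssertionError("should not concatenate str and bytes")

# The explicit fix: decode at the boundary.
assert "prefix: " + data.decode("utf-8") == "prefix: café"
```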

~~~
mborch
I think HTTP request and response headers make a great example.

You read a raw stream of bytes (an 8-bit fixed-width encoding) and split it on
newlines.

In this protocol we know that each header line is "latin-1", so we can extract
and decode each header with that encoding at virtually zero cost, because we
neither copy nor transcode the actual string data.

And what's more, we stay true to the protocol. These headers were never
encoded in any Unicode transformation format, so it is awkward to suddenly
have unicode strings to deal with in the rest of the program.

The alternative would be to keep them as bytes, but in Python 3 those are
simply impossible to work with as strings.
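The header-parsing flow described above looks roughly like this in today's Python 3 (a sketch with a made-up response; note that here each `.decode("latin-1")` copies, which is exactly the cost the proposed rope would avoid):

```python
# Parse headers from a raw HTTP response, decoding as latin-1.
# latin-1 never fails: every byte maps to exactly one code point.
raw = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>"

header_block, _, body = raw.partition(b"\r\n\r\n")
lines = header_block.split(b"\r\n")

status = lines[0].decode("latin-1")
headers = {}
for line in lines[1:]:
    name, _, value = line.partition(b":")
    headers[name.decode("latin-1").lower()] = \
        value.decode("latin-1").strip()

assert status == "HTTP/1.1 200 OK"
assert headers["content-type"] == "text/html"
assert body == b"<html>"          # the body stays as raw bytes
```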

Why is Python 3 unpopular? It's because it does not really advance Python as a
language. It's a bit cleaner around the edges, but the cost was very high for
very little gain.

------
TheLoneWolfling
Better than that.

Use a rope type that's entirely (modified) UTF-8 throughout, but with the
wrinkle that every character in a node must take the same number of bytes.
(Use overlong encodings internally where appropriate, namely where you have,
for example, a single one-byte character in the middle of a long run of
two-byte characters.)

You can even combine that approach with this.
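The fixed-bytes-per-node idea can be sketched as a toy Python class (the `Node` name is mine; the rope tree and overlong encodings are omitted):

```python
# Each node stores raw bytes plus one fixed bytes-per-character
# width, so finding the i-th character inside a node is a single
# multiplication, not a scan through variable-width sequences.
class Node:
    def __init__(self, data: bytes, width: int):
        assert len(data) % width == 0
        self.data = data      # raw encoded bytes
        self.width = width    # bytes per character, fixed per node

    def __len__(self):
        return len(self.data) // self.width

    def char_bytes(self, i: int) -> bytes:
        # O(1) within the node: no decoding needed
        start = i * self.width
        return self.data[start:start + self.width]

ascii_run = Node(b"hello", 1)
two_byte_run = Node("éé".encode("utf-8"), 2)

assert len(ascii_run) == 5
assert ascii_run.char_bytes(1) == b"e"
assert two_byte_run.char_bytes(1).decode("utf-8") == "é"
```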

~~~
mborch
I can see how that could be a useful optimization for nodes already in UTF-8,
but in my scheme, the rope would have to accommodate fixed-width codecs such
as "raw" and "latin-1", too.

For instance, if you read bytes from a file with no encoding specified, the
resulting rope would consist only of "raw" nodes.

If it's PNG image data, then it wouldn't make sense to decode them as UTF-8
:-)

~~~
TheLoneWolfling
Well, yeah. The above approach just means that you can (almost) always index
in O(log(n)) time or better.

If you want to handle multiple encodings, you drop the length field entirely:
you have an encoding field, with the different lengths of variable-length
charsets internally being expanded out to different encodings.

(So, instead of having UTF-8, you have UTF-8-1-byte, UTF-8-2-byte,
UTF-8-3-byte, and UTF-8-4-byte (as well as plain UTF-8). Probably internally
mapped to a 16-bit enum or something. Again, this is all internal: to write
out to a file or something you map it back to the underlying encoding.)
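A small sketch of that "expanded encodings" idea in Python (the enum and names are illustrative, not from any real implementation):

```python
# Variable-width UTF-8 is split into one internal encoding per
# byte width, so every internal encoding is fixed-width.
from enum import Enum

class NodeEncoding(Enum):
    RAW = 0       # uninterpreted bytes, width 1
    LATIN1 = 1    # width 1
    UTF8_1 = 2    # ASCII subset of UTF-8, width 1
    UTF8_2 = 3    # two-byte UTF-8 sequences, width 2
    UTF8_3 = 4    # three-byte sequences, width 3
    UTF8_4 = 5    # four-byte sequences, width 4

def classify_utf8(ch: str) -> NodeEncoding:
    # Map a code point to the internal fixed-width encoding
    # its UTF-8 form lands in.
    n = len(ch.encode("utf-8"))
    return [NodeEncoding.UTF8_1, NodeEncoding.UTF8_2,
            NodeEncoding.UTF8_3, NodeEncoding.UTF8_4][n - 1]

assert classify_utf8("a") is NodeEncoding.UTF8_1
assert classify_utf8("é") is NodeEncoding.UTF8_2
assert classify_utf8("€") is NodeEncoding.UTF8_3  # U+20AC: 3 bytes
```

Writing back out to a file would then just concatenate the nodes' raw bytes, since each internal UTF-8-n encoding is already valid UTF-8.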

------
awaretek
I don't think your suggestions would work very well as stated. However, I hope
something will be in Python 3.5 to add a second string type that makes it
easier to port old libraries to Python 3 AND easier to use Python 3 in use
cases that are currently more complicated than they need to be.

Practicality beats purity.

~~~
mborch
The Zen of Python claims "There should be one-- and preferably only one
--obvious way to do it."

I think multiple string types go against that. It is the simplicity that makes
it practical, not purity, in my opinion.

To me, what's important is that encoding is handled at the I/O boundary, and
nowhere else.

