> And with that basic idea that strings are just a different view on a bytestrea...

flohofwoe · 2023-10-20T06:49:58

> Much of the pain of the transition was figuring out which strings were bytes and which were Unicode data.

And for a lot of code (that which just passes data around), this shouldn't matter.

It's basically "Schroedinger's strings", you don't need to know if some data is valid string data until you actually need it as a string, and often this isn't needed at all (IMHO all encodings/decodings should be explicit, not just between bytestreams and strings, but also between different string encodings - and those should arguably go into different string types which cannot be assigned directly to each other - e.g. the standard string type should always only be UTF-8). Also, file operations should always work on bytestreams (same in the IO functions of the C stdlib btw).

amluto · 2023-10-20T23:31:05

> It's basically "Schroedinger's strings", you don't need to know if some data is valid string data until you actually need it as a string, and often this isn't needed at all

Then you can pass around an untyped value, which is the default in all versions of Python. With type annotations, one can spell this typing.Any.

When you finally do need your value to be a string, you need to decide whether a bad value becomes a runtime error at the point where it needs to be a string, or a runtime error way up the call stack. Especially if databases are involved (or network calls, etc.), this decision matters.
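
A rough sketch of that trade-off, with made-up function names (`forward` and `store_name` are hypothetical, not from any library): the value travels untyped, and the decode error only shows up at the one place that needs text.

```python
from typing import Any


def forward(payload: Any) -> Any:
    """Hypothetical pipeline stage: passes data along without inspecting it."""
    return payload


def store_name(name: Any) -> str:
    """Hypothetical consumer: the point where the value finally has to be text."""
    if isinstance(name, bytes):
        # Invalid UTF-8 raises UnicodeDecodeError here, far from where the data came in.
        name = name.decode("utf-8")
    return name.strip()


good = store_name(forward(b"  Ada \n"))

try:
    store_name(forward(b"\xff\xfe"))  # not valid UTF-8
except UnicodeDecodeError:
    failed_late = True
else:
    failed_late = False
```

Whether failing this late is acceptable depends on what has already happened (rows written, requests sent) by the time `store_name` runs.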

> e.g. the standard string type should always only be UTF-8

It almost kind of sounds like you’re arguing in favor of Python 3’s design, where str is indistinguishable from UTF-8 except insofar as you need to actually ask for bytes (e.g. call encode()) to get the UTF-8 bytes.
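
A minimal illustration of what I mean, using only built-in `str` and `bytes` methods (the sample text is arbitrary):

```python
text = "naïve café"

# UTF-8 bytes only exist once you ask for them explicitly.
utf8 = text.encode("utf-8")
round_trip = utf8.decode("utf-8")

# str and bytes are separate types; mixing them is an error, not a silent conversion.
try:
    text + utf8
except TypeError:
    mixing_allowed = False
else:
    mixing_allowed = True
```

Here `len(text)` counts code points and `len(utf8)` counts bytes, so the two differ whenever the text is not pure ASCII.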

> Also, file operations should always work on bytestreams

So how do you read a line from a text file?

> (same in the IO functions of the C stdlib btw).

Are we talking about the same C? The language where calling gets() at all is a severe security bug, where fgetc returns int, and where fgetwc exists?

flohofwoe · 2023-10-21T10:05:17

> So how do you read a line from a text file?

In that case you need to know upfront how the text file is encoded anyway, since text files don't carry that information around.

If it is a byte-stream encoding from the "ASCII heritage" (UTF-8, 7-bit ASCII, or codepaged 8-bit "ASCII", whatever that is actually called...): load bytes until you encounter a 0x0A or 0x0D (and skip those when continuing); what has been loaded until then is a line in the text file's encoding.

If the original encoding was codepaged 8-bit ASCII, you probably want to convert that to UTF-8 next, and for that you also need to know the proper codepage. This is not needed for 7-bit ASCII, since that already is valid UTF-8: in UTF-8, every byte with the topmost bit cleared is guaranteed to be a standalone 7-bit ASCII character, and every byte with the topmost bit set is part of a multi-byte sequence for a codepoint above 127. That's why one can simply iterate byte by byte over a UTF-8 encoded byte stream when looking for 7-bit ASCII characters (such as newline and carriage return).
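
Roughly, in Python (a sketch, not a robust implementation): `iter_lines` is a hypothetical helper that only ever looks at bytes, and decoding happens afterwards. Treating CR LF as a single line break is my assumption here, so that Windows files do not produce spurious empty lines.

```python
import io


def iter_lines(stream):
    """Yield raw lines (as bytes) split on 0x0A / 0x0D; no text decoding here."""
    buf = bytearray()
    prev_cr = False
    while True:
        b = stream.read(1)
        if not b:  # end of stream
            if buf:
                yield bytes(buf)
            return
        if b == b"\n" and prev_cr:
            prev_cr = False  # second half of CR LF; the line was already emitted
            continue
        if b in (b"\n", b"\r"):
            yield bytes(buf)
            buf.clear()
            prev_cr = b == b"\r"
            continue
        prev_cr = False
        buf += b  # multi-byte UTF-8 sequences pass through untouched


data = "héllo\r\nwörld\n\nlast".encode("utf-8")
lines = [raw.decode("utf-8") for raw in iter_lines(io.BytesIO(data))]

# A codepaged file needs its codepage named before it can become UTF-8.
legacy = "café".encode("cp1252")
converted = legacy.decode("cp1252").encode("utf-8")
```

The byte-wise scan is safe for UTF-8 precisely because 0x0A and 0x0D can never occur inside a multi-byte sequence.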

The gist is that the file IO functions themselves should never be aware of text encodings, they should only work on bytes. The "text awareness" should happen in higher level code above the IO layer.

> Are we talking about the same C?

What I meant here - but expressed poorly - was that C also got that wrong (or rather the C stdlib, C itself isn't involved). There should be no "text mode IO" in the C stdlib IO functions either, only raw byte IO. And functions like gets(), fgets() etc... shouldn't be in the C stdlib in the first place.

amluto · 2023-10-22T02:57:37

Python 3 actually works approximately the way you’re describing:

https://docs.python.org/3/library/io.html#io.TextIOWrapper

open is just a factory function, conceptually inherited (I think) from C.
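
A small sketch of that layering, using only the standard `io`, `os` and `tempfile` modules (the file name and sample text are arbitrary):

```python
import io
import os
import tempfile

# Byte layer: an in-memory stream that knows nothing about text.
raw = io.BytesIO("first línea\r\nsecond line\n".encode("utf-8"))

# Text layer: decoding and newline handling live in the wrapper on top.
text = io.TextIOWrapper(raw, encoding="utf-8")
first = text.readline()

# open() picks the layers for you depending on the mode.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "demo.txt")
    with open(path, "wb") as f:
        f.write("grüß\n".encode("utf-8"))
    with open(path, "rb") as fb:
        binary_type = type(fb)        # buffered byte reader
    with open(path, "r", encoding="utf-8") as ft:
        text_type = type(ft)          # text wrapper
        underlying = type(ft.buffer)  # the byte layer underneath it
```

So the "text awareness" sits in a wrapper above the byte IO, and binary mode gives you the byte layer directly.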

altfredd · 2023-10-20T02:36:02

Unless you enjoy getting hacked, all strings received from outside sources are bytes.
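
One way to act on that, as a sketch only: `decode_untrusted` is a hypothetical helper name, and strict UTF-8 is an assumed policy, not the only reasonable one.

```python
def decode_untrusted(data: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical boundary helper: outside input arrives as bytes and
    becomes str only through an explicit, strict decode."""
    if not isinstance(data, bytes):
        raise TypeError("external input must arrive as bytes")
    # errors="strict" rejects invalid sequences instead of guessing or replacing.
    return data.decode(encoding, errors="strict")


accepted = decode_untrusted(b"user=alice")

try:
    decode_untrusted(b"user=\xff\xfe")
except UnicodeDecodeError:
    rejected = True
else:
    rejected = False
```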