OK, is it reasonable to ask that all string getters/setters and file read/write operations default to UTF-8?

Also, having distinct string methods (.byte_length, .char_length(?), .codepoint_length) seems like a good idea. Or string.length(aspect=bytes), but then what should the default aspect be?
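For a rough idea of how the different "lengths" diverge, here is a minimal Python 3 sketch. The variable names mirror the method names floated above and are purely illustrative; Python has no such attributes on str. A user-perceived character count (grapheme clusters) is left out because, as far as I know, it needs a third-party library.

```python
# 'e' followed by a combining acute accent, then an emoji outside the BMP.
s = "e\u0301\U0001F600"

# Python 3's len() on str counts Unicode code points.
codepoint_length = len(s)

# Byte length depends on the encoding you pick; here UTF-8.
byte_length = len(s.encode("utf-8"))

# UTF-16 code units (what some other languages report as string length).
utf16_unit_length = len(s.encode("utf-16-le")) // 2

print(codepoint_length, byte_length, utf16_unit_length)
```

The same string gives three different answers, which is why a single default "aspect" is hard to agree on.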




I'm going to quote something I wrote a couple years ago[1]:

Now, I should point out here that I’m not really knocking the people who were writing, say, command-line and file-handling utilities in Python. For years, Python sort of accepted the status quo of the Unix world, which was mostly to stick its fingers in its ears and shout LA LA LA I CAN’T HEAR YOU I’M JUST GOING TO SET LC_CTYPE TO C AGAIN AND GO BACK TO MY HAPPY PLACE. A bit later on it changed to “just use UTF-8 everywhere, UTF-8 is perfectly safe”, which really meant “just use UTF-8 everywhere because we can continue pretending it’s ASCII up until the time someone gives us a non-ASCII or multi-byte character, at which point do the fingers-in-ears-can’t-hear-you thing again”.

So a lot of what you’ll see in terms of complaints about string handling are really complaints that Unix’s pretend-everything-is-ASCII-until-it-breaks approach was never very good to begin with and just gets worse with every passing year.

I stand by this: we had a couple of decades of Python catering to this brokenness, and it made life miserable for everyone who didn't work in that particular domain. Python 3 changed that. Does it mean life got harder for some people? Yup. But life got a lot easier and more reliable for many more people, and it's a tradeoff I'm willing to accept.

[1] https://www.b-list.org/weblog/2016/jun/10/python-3-again/


Can you answer what ways things got easier that are /not/ equal or better in the solution I proposed?


Currently in Python you know that you're holding either raw bytes or something that can be successfully serialised to UTF-8. With your proposed solution you'll only find out which one it is when you try to encode it.

It's the difference between the easier-to-debug exception, "you asked me to read a value here, but your assumptions about the encoding don't match reality", and the hard-to-debug one: "I did a lot of processing; you thought this thing was valid text, but it isn't; have fun tracking down how it got here in the first place".
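A small sketch of the "fail at the boundary" behaviour, assuming the input is supposed to be UTF-8. The helper name read_text is hypothetical; the point is only that the error surfaces where the bytes enter the program, not after later processing.

```python
# Latin-1 encoded "café": the final byte 0xE9 is not valid UTF-8 on its own.
raw = b"caf\xe9"


def read_text(data: bytes) -> str:
    """Hypothetical boundary function: decode once, right where data comes in."""
    return data.decode("utf-8")


failed_at_boundary = False
bad_offset = None
try:
    read_text(raw)
except UnicodeDecodeError as exc:
    # The exception points at the offending input immediately.
    failed_at_boundary = True
    bad_offset = exc.start

print(failed_at_boundary, bad_offset)
```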


Python 3: We have two things that are ALMOST the same and which, if we'd done it correctly, could have been converted just by changing what we're willing to call them (in the "downgrade" direction; going the other way you would also verify, if you want the guarantee that I hate Python 3 forcing on programmers).

Proposed "string like" Object: __IF__ you want to turn on debugging, sure, force it to validate assumptions at runtime/compile time. Otherwise call verify() when you're willing to handle that being some result indicating "we have a problem".

Maybe the verify() call returns the byte-access-offset of the first non-conforming sequence.
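One way such a verify() could look in today's Python, as a sketch only: it is written as a free function over bytes rather than a method on the proposed object, it assumes UTF-8 unless told otherwise, and it returns None when everything decodes cleanly (that return convention is my assumption, not part of the proposal).

```python
from typing import Optional


def verify(data: bytes, encoding: str = "utf-8") -> Optional[int]:
    """Return the byte offset of the first non-conforming sequence, or None.

    Hypothetical helper: leans on the offset that UnicodeDecodeError
    already carries in its `start` attribute.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start
    return None


print(verify(b"hello"))        # valid ASCII is valid UTF-8
print(verify(b"ab\xffcd"))     # 0xFF can never appear in UTF-8
```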


> We have two things that are ALMOST the same,

I think we've got a fundamental disagreement here. I don't believe they're similar at all. They just happened to get confused a lot in the past when it didn't matter that much.


Which "solution" was that? Your comment suggesting that len() should be the length in bytes of a UTF-8 representation of the string?

That would be a terrible thing for people who work with strings as strings, and would essentially tell such people -- who make up a lot of Python's user base! -- "go use something else, this language is only for sysadmins to write Unix utilities now".


No, the one that re-unifies the sequences of bytes that might be "strings" under one container object which has fields indicating what kind of format it is and automatic coercion methods for common types.


Pretending that strings and sequences of bytes are identical and interchangeable is fundamentally flawed. The only reason people were able to get away with it for as long as they did was because of immense pain inflicted on the rest of the Python programming community.

If you require an eternally frozen implementation of Python 2's behavior in order to get your work done, you can download the source and keep it around as long as you like. Nobody else has an obligation to support you or hold back the rest of the world in order to suit your use case in your preferred way.


For compatibility reasons, len(unicode type) should probably return the size when accessed as a raw 8-bit sequence, i.e. be an alias for .length_bytes. I also think the other names should similarly start with .length_.


len() should return the length in individual addressable units, so that obj[len(obj)-1] always returns the last item, whether that's a byte or a unicode character.
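This is how Python 3 behaves today for both types; a quick check, with the sample string chosen arbitrarily:

```python
text = "na\u00efve"            # "naïve", 5 code points
data = text.encode("utf-8")    # 6 bytes, because ï takes two bytes in UTF-8

# Indexing with len(obj) - 1 yields the last addressable unit in each case.
last_char = text[len(text) - 1]   # a one-character str
last_byte = data[len(data) - 1]   # an int in range(256)

print(len(text), len(data), last_char, last_byte)
```

If len(text) reported the UTF-8 byte count instead, text[len(text) - 1] would raise IndexError here.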


Unfortunately, defaulting to UTF-8 is often a bad choice on Windows. Curse Microsoft!
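For what it's worth, Python lets you sidestep the platform default by naming the encoding yourself. A minimal sketch, using a throwaway temp file; the locale default it prints is often a legacy code page such as cp1252 on Windows unless UTF-8 mode is enabled.

```python
import locale
import os
import tempfile

# What open() would fall back to when no encoding is given.
default_encoding = locale.getpreferredencoding(False)

# Passing encoding= explicitly makes the result independent of that default.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("na\u00efve")
with open(path, encoding="utf-8") as f:
    round_tripped = f.read()

print(default_encoding, round_tripped == "na\u00efve")
```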



