
You just made the GP's point: Ruby does handle encoding, but it does so badly:

    > str = "\xa4"
"" creates a UTF-8 string, and this one happens to contain an invalid byte sequence right off the bat

    > str.split('')
        ArgumentError: invalid byte sequence in UTF-8
... which can only be detected at the worst possible time: when processed at runtime

    > puts str
    �
... and not even always, as sometimes ruby decides to handle strings as just bytes

    > str.force_encoding("iso-8859-1").encode("utf-8")
... and leaves the responsibility for deciding whether they're bytes or strings (and in which encoding) solely to the programmer. "force_encoding" is a thing that just should not exist on a string.
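
To spell out what that last line does, here is a rough sketch, assuming Ruby 2.0+ with the default UTF-8 source encoding. The helper name `latin1_bytes_to_utf8` is made up, and treating the input as Latin-1 is a guess the programmer has to make:

```ruby
# Hypothetical helper: the name and the Latin-1 assumption are illustrative only.
def latin1_bytes_to_utf8(raw)
  # force_encoding only changes the label (on a copy here); the bytes stay the same.
  labeled = raw.dup.force_encoding(Encoding::ISO_8859_1)
  # encode actually transcodes: byte 0xA4 becomes U+00A4, which is 2 bytes in UTF-8.
  labeled.encode(Encoding::UTF_8)
end

raw = "\xa4"
fixed = latin1_bytes_to_utf8(raw)
p raw.bytes              # [164]
p fixed.bytes            # [194, 164]
p fixed.valid_encoding?  # true
```

Nothing in the types says which of the two calls a given string still needs, which is the complaint.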

Python 3 and Go do it the correct way: arrays of bytes are arrays of bytes, and strings are {array of bytes + encoding}. Ruby could have done it too by introducing a new type or something, but chose to do otherwise in the name of a theoretical backwards compatibility that doesn't exist in practice, since every single release of Ruby introduced backwards-incompatible changes anyway.




AFAIK you are incorrect about Python 3. Python 3 strings are internally represented as ASCII, Latin-1(since 3.3), UCS-2, or UCS-4. So for an arbitrary encoding this statement would be incorrect.

Ruby's model looks much more like array of bytes + encoding to me. How else could I use force_encoding to relabel, for example, a Latin-1 string as UTF-8 and end up with an invalid UTF-8 string? Ruby holds off on validating the string until it's really needed. Lazy validation doesn't strike me as odd for a dynamic language.

I agree that String#force_encoding would be nice if it were not needed but in the real world you get all kinds of broken encoded data. String#encode most of the time is enough but as a last resort, String#force_encoding is there to use.

I'm a bit surprised that you mention Go, because https://blog.golang.org/strings says: "Similarly, unless [the literal string] contains UTF-8-breaking escapes like those from the previous section, a regular string literal will also always contain valid UTF-8."

This is exactly the same behavior as in Ruby. I put the invalid byte sequence there deliberately and Ruby and apparently also Go don't have a problem with that.


Regarding Go, what matters is that you read/write []byte and transform it to string explicitly.

> in the real world you get all kinds of broken encoded data

> String#encode most of the time is enough but as a last resort, String#force_encoding is there to use.

When your byte arrays and strings get passed around your code, you soon have no way to tell which one is safe and which is not, or which one is a byte array and which is a string.

The thing is not about content, it's about communication and contracts. #force_encoding should not be available on a string, and #encode should not be available on a byte array. If you receive something that may be broken, it should be received into a byte array (b"" in Python, []byte in Go), then "cross the gate" to become a string (u"" in Python, string in Go), and that moment is where you perform any sanitisation. From then on a consumer of your string will be able to assume its content and its declared encoding match.
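
A rough sketch of such a gate, written in Ruby since that's what the snippets upthread use. Everything named here (`bytes_to_text`, `NotUtf8Error`) is made up for illustration, and I'm assuming Ruby 2.0+ for `String#b`:

```ruby
# Hypothetical "gate": these names are illustrative, not a standard API.
class NotUtf8Error < StandardError; end

# Takes untrusted bytes and returns a frozen, validated UTF-8 string, or raises.
def bytes_to_text(raw_bytes)
  candidate = raw_bytes.dup.force_encoding(Encoding::UTF_8)
  raise NotUtf8Error, "invalid UTF-8 input" unless candidate.valid_encoding?
  candidate.freeze
end

incoming = "caf\xC3\xA9".b       # pretend this came off a socket, tagged binary
text = bytes_to_text(incoming)   # validated once, at the boundary
puts text.length                 # 4

begin
  bytes_to_text("\xa4".b)
rescue NotUtf8Error => e
  puts "rejected: #{e.message}"
end
```

Ruby has no separate bytes type, so the gate still has to relabel internally. The difference from sprinkling `force_encoding` around is only that it happens in one place and nothing unvalidated gets past it, which is a convention the language won't enforce for you.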

> I'm a bit surprised that you mention Go [...] This is exactly the same behavior as in Ruby.

Indeed, but you have two types to use and discriminate between, with assorted funcs corresponding to each level of abstraction.

From the Go doc [0]:

> It's important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes.

I am definitely not concerned about what is in a string. I am concerned about distinguishing between "I just read that bunch of data from somewhere and I'm not yet sure what it is" vs "OK, that data has been through some earlier process that made guarantees as to what format I expect it to be in". Different abstraction levels. Where Go gets it right is that io.Reader/Writer reads/writes []byte, not string. So you're explicitly acting on your own responsibility when you do:

    var b = []byte{0xA4}
    var s = string(b)
    foo.IExpectSomeSpecificEncoding(s)

So it becomes obvious when to insert a sanity check, and relying on the static type system/method dispatch to check things for you becomes incredibly useful. foo.IExpectSomeSpecificEncoding can then use the string as an opaque container.

[0]: https://blog.golang.org/strings


A Python 3 string is not an array of "bytes + encoding". A Python 3 string is a sequence of Unicode code points, and you neither know nor are supposed to care what the underlying bytes in memory are. All operations on strings in Python 3 are in terms of code points: iterating yields the characters corresponding to the code points, indexing yields the character corresponding to the code point at the index, and length reports the number of code points.

If you want bytes + encoding in Python 3, you want the 'bytes' type, not the 'str' type.


This was a shortcut way of summing things up. Indeed I don't care what it is behind the scenes, it's a string object and that's what matters vs being raw bytes that I can liberally interpret or fix. "A Python 3 string is a sequence of Unicode code points" is just what I mean, those are just bytes in memory that are interpreted as Unicode code points by Python, and it's opaquely exposed to the developer as a string.


But that's quite a different thing. In Python, you only get Unicode strings.

In Ruby you get strings that are arrays of bytes, lazily interpreted as strings in the set encoding.

In that regard, Ruby strings are more powerful than Python strings as you can handle different encodings. In Python, you have to work around that with the bytes objects to handle non Unicode encodings.


As a Python person, I'd say it's not so much about "power" as it is about correctness: equating "sequence of bytes in an encoding" with "string" is the source of a lot of (potentially hard-to-find) bugs and annoyances. So the approach of saying that a string is a sequence of Unicode codepoints, and that there's a separate type to represent sequences of bytes (plus codecs to handle translating between the two types), is a better way to do things.

And I think this is borne out by the sheer number of articles that have to be written to remind people who use "strings are bytes in an encoding" languages of just how many assumptions they end up making that turn out to be wrong (most of which boil down to expecting bytes-in-an-encoding types to behave the way Python's strings really do behave, as sequences of codepoints rather than as sequences of bytes where there may or may not be a one-to-one correspondence between bytes and codepoints).
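
A small sketch of that bytes-versus-codepoints mismatch, shown in Ruby only because it's the language used upthread. The helper name `char_and_byte_counts` is made up, and Ruby 2.0+ with UTF-8 source encoding is assumed:

```ruby
# Returns [character count, byte count] for a string under its current encoding label.
def char_and_byte_counts(str)
  [str.length, str.bytesize]
end

text = "na\u00EFve"              # "naïve" as UTF-8; the ï takes two bytes
p char_and_byte_counts(text)     # [5, 6]
p char_and_byte_counts(text.b)   # [6, 6]: same bytes, labeled binary
p text[2]                        # "ï"
p text.b[2]                      # "\xC3": half of a character
```

Same bytes, two different answers to "how long is it" and "what is at index 2", depending on a label that travels with the object rather than with its type.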


> Ruby strings are more powerful than Python strings as you can handle different encodings. In Python, you have to work around that with the bytes objects to handle non Unicode encodings.

True, but that's another debate entirely. The counterpoint is that Python's stance makes it extremely powerful, since a consumer of strings has a lot of guarantees about the string they're receiving, which is great for defensive programming.


It's documented that the default encoding in ruby is UTF-8.

    > "\xAA".force_encoding('binary').split('')
    # => ["\xAA"]
If you want a binary string, you can do

    str = String.new('...', encoding: 'binary')
NOTE: `'binary'` is an alias for `Encoding::ASCII_8BIT`


> It's documented that the default encoding in ruby is UTF-8.

The complaint is not that the default string has no encoding; it's that the encoding is not validated when the string is created, but only when it is processed (and even then it might depend on the actual operation being applied).

So given (1) a string object with (2) a proper encoding, it may still blow up in your face at any point unless you defensively check every string you get with `valid_encoding?`
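
A rough sketch of the "depends on the operation" part, assuming Ruby 2.0+ with UTF-8 source encoding. `raises?` is a made-up helper for the demo:

```ruby
# Hypothetical helper: runs a block and reports whether it raised an encoding-related error.
def raises?
  yield
  false
rescue ArgumentError, EncodingError
  true
end

str = "\xa4"                          # creation succeeds, labeled UTF-8
p str.valid_encoding?                 # false
p raises? { str.bytesize }            # false: byte-level, never validates
p raises? { str + "abc" }             # false: concatenation does not validate
p raises? { str.encode("UTF-8") }     # false: same source and target encoding, a no-op
p raises? { str.encode("UTF-16") }    # true: Encoding::InvalidByteSequenceError
```

So the same broken string passes through several calls untouched and only fails once something actually has to decode it, which can be far from where it was created.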



