
AFAIK you are incorrect about Python 3. Python 3 strings are internally represented as ASCII, Latin-1 (since 3.3), UCS-2, or UCS-4, so for an arbitrary encoding this statement would be incorrect.

Ruby's model looks much more like an array of bytes plus an encoding to me; how else could I change the encoding of, say, a Latin-1 string into an invalid UTF-8 string with force_encoding? Ruby holds off validating the string until it's really needed. Lazy validation doesn't strike me as odd for a dynamic language.

I agree it would be nice if String#force_encoding were not needed, but in the real world you get all kinds of badly encoded data. String#encode is enough most of the time, but as a last resort, String#force_encoding is there to use.

I'm a bit surprised that you mention Go, because if I look at https://blog.golang.org/strings: "Similarly, unless [the literal string] contains UTF-8-breaking escapes like those from the previous section, a regular string literal will also always contain valid UTF-8."

This is exactly the same behavior as in Ruby. I put the invalid byte sequence there deliberately, and apparently neither Ruby nor Go has a problem with that.
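A minimal sketch of that Go behavior (\xA4 is just an arbitrary UTF-8-breaking escape, and utf8.ValidString is the stdlib check):

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        s := "\xA4"                      // compiles fine; the string just holds that byte
        fmt.Println(len(s))              // 1: the invalid byte is in there
        fmt.Println(utf8.ValidString(s)) // false: nobody validated it for you
    }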




Regarding Go, what matters is that you read/write []byte and transform it to string explicitly.
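A minimal sketch, assuming any io.Reader as the byte source:

    package main

    import (
        "fmt"
        "io"
        "strings"
    )

    func main() {
        var r io.Reader = strings.NewReader("h\xC3\xA9llo") // stands in for any input
        buf, err := io.ReadAll(r) // I/O hands you []byte, never string
        if err != nil {
            panic(err)
        }
        s := string(buf) // the []byte -> string step is explicit, and on you
        fmt.Println(s)   // héllo
    }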

> in the real world you get all kinds of broken encoded data

> String#encode most of the time is enough but as a last resort, String#force_encoding is there to use.

When your byte arrays and strings get passed around your code, you soon have no way to tell which ones have been validated and which have not, which one is a byte array and which is a string.

The thing is not about content, it's about communication and contracts. #force_encoding should not be available on a string, and #encode should not be available on a byte array. If you receive something that may be broken, it should be received into a byte array (b"" in Python, []byte in Go), then "cross the gate" to become a string, and that moment is where you eventually perform sanitisation (u"" in Python, string in Go). From then on, a consumer of your string will be able to assume its content and its declared encoding match.
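In Go terms, a minimal sketch of that gate (crossTheGate is a hypothetical name, and strings.ToValidUTF8 is just one possible sanitisation policy):

    package main

    import (
        "fmt"
        "strings"
        "unicode/utf8"
    )

    // crossTheGate: raw bytes in, a string whose content matches its
    // declared encoding out.
    func crossTheGate(raw []byte) string {
        if utf8.Valid(raw) {
            return string(raw)
        }
        // sanitise: replace invalid sequences with U+FFFD
        return strings.ToValidUTF8(string(raw), "\uFFFD")
    }

    func main() {
        s := crossTheGate([]byte{'o', 'k', 0xA4})
        fmt.Println(s) // "ok" + U+FFFD; consumers may now assume valid UTF-8
    }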

> I'm a bit surprised that you mention Go [...] This is exactly the same behavior as in Ruby.

Indeed, but you have two types to use and discriminate between, with assorted funcs corresponding to each level of abstraction.
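For example, the stdlib's bytes and strings packages deliberately mirror each other, one func per level (a trivial sketch):

    package main

    import (
        "bytes"
        "fmt"
        "strings"
    )

    func main() {
        raw := []byte("hello world") // not-yet-trusted level
        fmt.Println(bytes.Contains(raw, []byte("world")))
        s := string(raw)             // trusted level
        fmt.Println(strings.Contains(s, "world"))
    }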

From the Go doc [0]:

> It's important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes.

I am definitely not concerned about what is in a string. I am concerned about distinguishing "I just read this bunch of data from somewhere and I'm not yet sure what it is" from "OK, this data has been through some earlier process that established guarantees about the format I expect it to be in". Different abstraction levels. Where Go gets it right is that io.Reader/io.Writer read/write []byte, not string. So you're explicitly acting on your own responsibility when you do:

    var b = []byte{0xA4}
    var s = string(b)
    foo.IExpectSomeSpecificEncoding(s)

So it becomes obvious where to insert a sanity check, and relying on the static type system and method dispatch to check things for you becomes incredibly useful. foo.IExpectSomeSpecificEncoding can then use the string as an opaque container.
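To make that concrete (a sketch; consumeUTF8 stands in for the hypothetical foo.IExpectSomeSpecificEncoding):

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    // consumeUTF8 trusts its argument and treats the string
    // as an opaque container.
    func consumeUTF8(s string) {
        fmt.Println(len(s), "bytes of presumed-valid UTF-8")
    }

    func main() {
        b := []byte{0xE2, 0x82, 0xAC} // "€" encoded as UTF-8
        // consumeUTF8(b)  // would not compile: []byte is not string
        if utf8.Valid(b) { // the obvious spot for the sanity check
            consumeUTF8(string(b))
        }
    }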

[0]: https://blog.golang.org/strings



