
The author found it hard to "find the right API entry point in Go documentation".

For the record, Go produces one U+FFFD per byte, not per maximal contiguous run, when iterating over bad UTF-8. This is part of the language specification, not just a library, although the standard libraries follow this behavior. For example, in the standard UTF-8 library, https://golang.org/pkg/unicode/utf8/#DecodeRune says that the size returned is 1 (i.e. 1 byte) for invalid UTF-8.
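To make the DecodeRune behavior concrete, here is a minimal sketch. The helper name decodeAll and the sample bytes are mine, chosen for illustration; only utf8.DecodeRune itself comes from the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeAll (a hypothetical helper) walks b with utf8.DecodeRune and returns
// every decoded rune alongside the number of bytes each step consumed.
func decodeAll(b []byte) (runes []rune, sizes []int) {
	for len(b) > 0 {
		r, size := utf8.DecodeRune(b)
		runes = append(runes, r)
		sizes = append(sizes, size)
		b = b[size:]
	}
	return runes, sizes
}

func main() {
	// 'a', then two bytes that can never appear in valid UTF-8, then 'b'.
	input := []byte{'a', 0xff, 0xfe, 'b'}
	runes, sizes := decodeAll(input)
	for i, r := range runes {
		// Each invalid byte is reported as U+FFFD with size 1.
		fmt.Printf("%U size=%d\n", r, sizes[i])
	}
}
```

Each bad byte comes back as its own U+FFFD with a size of 1, so the loop always makes progress one byte at a time.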

The relevant language spec section is https://golang.org/ref/spec#For_statements; look for "If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string."

Example code: https://play.golang.org/p/OLIWcjLIvF
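In case the playground link goes stale, here is a small sketch of the same for-range behavior. It is not necessarily the code behind that link; the helper name rangeRunes and the test string are my own.

```go
package main

import "fmt"

// rangeRunes (a hypothetical helper) collects the byte offset and rune
// produced by each step of a for-range loop over s.
func rangeRunes(s string) (offsets []int, runes []rune) {
	for i, r := range s {
		offsets = append(offsets, i)
		runes = append(runes, r)
	}
	return offsets, runes
}

func main() {
	// A truncated three-byte sequence (the first two bytes of U+20AC)
	// sandwiched between 'a' and 'b'.
	s := "a\xe2\x82b"
	offsets, runes := rangeRunes(s)
	for k := range runes {
		fmt.Printf("offset %d: %U\n", offsets[k], runes[k])
	}
}
```

The two stray bytes yield two separate U+FFFD values at offsets 1 and 2, rather than a single replacement for the whole truncated sequence.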

I'll note that both Go and UTF-8 were invented by Ken Thompson and Rob Pike, so I'm sure the Go authors were aware of UTF-8's details. (Go also involved Robert Griesemer, but that's tangential.)




Thank you. I was looking for something that takes a potentially invalid buffer of UTF-8 and returns a guaranteed-valid buffer and failed to find a function like that.

(And, indeed, Go is an interesting case due to its creators being the inventors of UTF-8, too.)


Yeah, there's not really a guaranteed-valid buffer concept in Go. Even if you have valid UTF-8, you still have to iterate over it to e.g. rasterize glyphs, and iterating over possibly-bad UTF-8 is no harder than iterating over known-good UTF-8.

If you want to compare it against other UTF-8 text, validity alone isn't always sufficient. You often have to normalize anyway, and normalization should fix up bad UTF-8. Again, a guaranteed-valid buffer type wouldn't win you much.


Oh, in case you were wondering, the ubiquitous term "rune" in the Go documentation is simply shorthand for "Unicode code point".
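A quick sketch of the distinction between bytes and runes; codePoints is a hypothetical helper name and the sample string is mine.

```go
package main

import "fmt"

// codePoints (a hypothetical helper) converts s to a slice of runes,
// one element per Unicode code point.
func codePoints(s string) []rune {
	return []rune(s)
}

func main() {
	s := "h\u00e9llo" // U+00E9 takes two bytes in UTF-8
	fmt.Println(len(s))             // number of bytes
	fmt.Println(len(codePoints(s))) // number of code points (runes)
}
```

len on a string counts bytes, while the rune slice has one entry per code point, so the two numbers differ as soon as the text leaves ASCII.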



