
How many replacement characters? - hsivonen
https://hsivonen.fi/broken-utf-8/
======
kbenson
A well laid out case. Given the churn the new change in preference might
cause in the ecosystem, it's probably worth reverting, especially since
apparently no real benefits beyond "it feels right" (according to this
accounting) have been put forth.

It may cause the committee to lose a little face, but less than digging in
your heels over a decision that has no ramifications for you but does for
others, and that you have no real justification for. Hopefully they see that
and acquiesce, or at least come back with a well thought out rebuttal that
isn't dismissive.

------
rossy
I completely agree with the author on this. The old Unicode 9.0 best practice
made sense for a UTF-8 decoder that consumed input using a state machine, and
like the article says, if you accept the correct byte-ranges in each state as
in Table 3-7, your state-machine-based decoder will implicitly reject every
kind of invalid sequence, including overlong encodings, encoded UTF-16
surrogates and encoded out of range characters. A state machine also makes
other things trivial, like validating UTF-8 input without outputting
characters and writing a streaming UTF-8 to UTF-16 converter. Another property
is that it rejects invalid input as soon as possible, e.g. as soon as it
consumes input that can't possibly be part of a valid sequence.

This is an example of a particularly elegant UTF-8 decoder which uses a state
machine:
[http://bjoern.hoehrmann.de/utf-8/decoder/dfa/](http://bjoern.hoehrmann.de/utf-8/decoder/dfa/)
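
For illustration, here is a minimal sketch (in Go, my own choice of language;
not taken from the article or the linked decoder) of a validator that follows
the Table 3-7 byte ranges directly, so overlongs, encoded surrogates,
out-of-range values and impossible lead bytes like 0xC0 or 0xF5 are all
rejected at the first offending byte:

    package main

    import "fmt"

    // validUTF8 checks the byte ranges from Table 3-7 directly: each state
    // only accepts bytes that can still lead to a valid sequence, so
    // overlongs, encoded surrogates and out-of-range values fail on the
    // first offending byte.
    func validUTF8(b []byte) bool {
        for i := 0; i < len(b); {
            c := b[i]
            var n int       // number of continuation bytes
            var lo, hi byte // allowed range for the first continuation byte
            switch {
            case c <= 0x7F: // ASCII
                i++
                continue
            case c >= 0xC2 && c <= 0xDF:
                n, lo, hi = 1, 0x80, 0xBF
            case c == 0xE0:
                n, lo, hi = 2, 0xA0, 0xBF // excludes overlong encodings
            case c >= 0xE1 && c <= 0xEC, c == 0xEE, c == 0xEF:
                n, lo, hi = 2, 0x80, 0xBF
            case c == 0xED:
                n, lo, hi = 2, 0x80, 0x9F // excludes UTF-16 surrogates
            case c == 0xF0:
                n, lo, hi = 3, 0x90, 0xBF // excludes overlong encodings
            case c >= 0xF1 && c <= 0xF3:
                n, lo, hi = 3, 0x80, 0xBF
            case c == 0xF4:
                n, lo, hi = 3, 0x80, 0x8F // excludes values above U+10FFFF
            default: // 0x80-0xC1 and 0xF5-0xFF never start a valid sequence
                return false
            }
            if i+n >= len(b) {
                return false // truncated sequence
            }
            if b[i+1] < lo || b[i+1] > hi {
                return false
            }
            for j := 2; j <= n; j++ {
                if b[i+j] < 0x80 || b[i+j] > 0xBF {
                    return false
                }
            }
            i += n + 1
        }
        return true
    }

    func main() {
        fmt.Println(validUTF8([]byte("héllo")))          // true
        fmt.Println(validUTF8([]byte{0xC0, 0xAF}))       // false: overlong
        fmt.Println(validUTF8([]byte{0xED, 0xA0, 0x80})) // false: surrogate
    }

The decoder linked above encodes essentially the same per-state byte ranges,
but compressed into lookup tables driving a DFA rather than explicit
comparisons.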

The proposed Unicode 11 best practice suits a decoder that checks for overlong
sequences, encoded surrogates and out of range values as a sort of post-
validation step, after consuming all the bytes of a potentially valid UTF-8
sequence. Not only is this different to the behaviour of every existing
decoder except for ICU, it also seems less elegant, more complicated and more
error prone to me. If I understand correctly, even bytes like 0xc0 and 0xf5,
which never form part of a valid UTF-8 sequence, won't be rejected immediately
in this kind of decoder.

The article makes a pretty solid argument for why this difference in behaviour
matters, even though it's just a best practice and not an official
requirement. The two key points in this for me are that most existing UTF-8
decoders produce identical results matching the Unicode 9.0 best practice, and
that there was an actual bug in Chrome when two internal UTF-8 decoders
produced differing results. I think I'd make a stronger conclusion though: Not
only should they keep the current recommended best practice, they should also
elevate it to a requirement in order to prevent bugs like the one in Chrome
from happening in future. Most existing UTF-8 decoders are already compliant,
and it would be a nice property of UTF-8 if all byte sequences, including
invalid ones, decoded to the same sequence of codepoints in every decoder.

------
kazinator
How I have implemented things is that when the UTF-8 decoder encounters an
invalid sequence, it retreats to the beginning of that sequence. It then
converts the first byte of that sequence to a replacement character, and
consumes it. Then it resets to its initial state and begins decoding starting
at the following byte.

Basically we are saying "no match occurs for a valid UTF-8 pattern at this
input position; let's do error recovery by dropping a byte, pooping it out as
a replacement character and trying again."

I have used the low surrogate range U+DC00–U+DCFF for replacement
characters. On output, I convert these back to individual bytes. Thus the end-
to-end decode+encode is binary transparent: any byte string can be decoded to
a sequence of code points, some of which may be replacement characters, and
that sequence will encode back to the original byte string. (This requirement
cannot be achieved if multiple bogus bytes are collapsed into one replacement
character.)

Well, that's not the full story: to have this transparency property, we also
need to ensure that when some U+DCXX occurs by means of a valid UTF-8 pattern,
we nevertheless treat it as invalid. I.e. there is a rule that if the UTF-8
decode works, but a U+DCXX code-point emerges, then we retreat to the start
and drop a byte as a replacement character, as if a bad code had been seen.
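
For comparison, here is a minimal Go sketch of that round-trip idea (my own
illustration, not the commenter's actual implementation, which is in TXR and
differs in details such as how NUL bytes are handled): each bogus byte becomes
U+DC00 plus the byte value, and encoding maps those code points back to the
original bytes. In Go, a byte sequence that would decode to U+DCXX already
fails utf8.DecodeRune, so the rule in the last paragraph is handled by the
error branch.

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    // decodeTransparent turns each byte of an invalid sequence into
    // 0xDC00+byte; valid sequences decode normally. Encoded surrogates are
    // already rejected by utf8.DecodeRune, so they land in the error branch.
    func decodeTransparent(b []byte) []rune {
        var out []rune
        for i := 0; i < len(b); {
            r, size := utf8.DecodeRune(b[i:])
            if r == utf8.RuneError && size <= 1 {
                out = append(out, 0xDC00+rune(b[i]))
                i++
                continue
            }
            out = append(out, r)
            i += size
        }
        return out
    }

    // encodeTransparent reverses the mapping, restoring the original bytes.
    func encodeTransparent(rs []rune) []byte {
        var out []byte
        for _, r := range rs {
            if r >= 0xDC00 && r <= 0xDCFF {
                out = append(out, byte(r-0xDC00))
                continue
            }
            out = utf8.AppendRune(out, r)
        }
        return out
    }

    func main() {
        in := []byte("ok\xff\xf0\x9f") // a stray 0xFF and a truncated sequence
        fmt.Printf("%q\n", encodeTransparent(decodeTransparent(in))) // "ok\xff\xf0\x9f"
    }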

~~~
eridius
Your UTF-8 decoder is identical to just "emit a REPLACEMENT CHARACTER for
every bogus byte".

Also, the use of surrogate pairs means you actually don't have a UTF-8 decoder
at all, you just have something that's similar but produces a sequence of
potentially-invalid Unicode scalars.

> _we also need to ensure that when some U+DCXX occurs by means of a valid
> UTF-8 pattern_

It can't. Trying to encode surrogate pair codepoints in UTF-8 is strictly
invalid.

~~~
Dylan16807
> It can't.

Why did you cut off "we nevertheless treat it as invalid."? You're misreading
that line, and actually in agreement on what to do. It's the bit pattern that
is valid, which is how you can parse the code point and see that the code
point is invalid for UTF-8.

~~~
eridius
I cut you off because you basically said "and if we see this thing that is
defined as invalid, we treat it as invalid". Which is something that all UTF-8
parsers do, so there's no point in calling it out like this is special
behavior.

~~~
Dylan16807
That's not me.

The point is that this variant parser accepts many things that are normally
errors and turns them into faux-surrogates, so it's worth restating that it
rejects surrogates in the source file.

~~~
eridius
My apologies.

In any case, here's what OP said:

> _Well, that's not the full story: to have this transparency property, we
> also need to ensure that when some U+DCXX occurs by means of a valid UTF-8
> pattern, we nevertheless treat it as invalid. I.e. there is a rule that if
> the UTF-8 decode works, but a U+DCXX code-point emerges, then we retreat to
> the start and drop a byte as a replacement character, as if a bad code had
> been seen._

But this whole paragraph is wrongheaded. A U+DCXX codepoint _cannot occur_ as
a result of a UTF-8 decode, because it is defined as invalid. Even just that
last bit there, "as if a bad code had been seen"... a bad code _was seen_!
This paragraph makes me question whether OP actually understands UTF-8
decoding at all.

~~~
Dylan16807
I think the meaning is clear. They're talking about _part_ of the UTF-8
decoding process, the part that actually decodes leading/trailing bytes into
arbitrary 21 bit numbers. What term would you use for that?

~~~
eridius
Well, I wouldn't even bring it up, because the UTF-8 decoding process by
definition cannot produce code points in the surrogate pair range, as those
are illegal to encode in UTF-8. But if I must, I might talk about the "bit
pattern" and the integral value that results from interpreting it. I certainly
wouldn't talk about code points resulting from a UTF-8 decode.

~~~
eridius
Somebody is apparently upset about what I said. Why?

------
nigeltao
The author found it hard to "find the right API entry point in Go
documentation".

For the record, Go produces one U+FFFD per byte, not per maximal contiguous
run, when iterating over bad UTF-8. This is part of the language
specification, not just a library choice, although the standard library follows this
behavior. For example, in the standard UTF-8 library,
[https://golang.org/pkg/unicode/utf8/#DecodeRune](https://golang.org/pkg/unicode/utf8/#DecodeRune)
says that the size returned is 1 (i.e. 1 byte) for invalid UTF-8.

The relevant language spec section is
[https://golang.org/ref/spec#For_statements](https://golang.org/ref/spec#For_statements)
and look for "If the iteration encounters an invalid UTF-8 sequence, the
second value will be 0xFFFD, the Unicode replacement character, and the next
iteration will advance a single byte in the string."

Example code:
[https://play.golang.org/p/OLIWcjLIvF](https://play.golang.org/p/OLIWcjLIvF)
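
As a self-contained illustration of the spec wording above (not necessarily
the same snippet as the playground link), a truncated four-byte sequence
yields one U+FFFD per byte, where the Unicode 9.0 best practice would produce
a single U+FFFD for the whole maximal subpart:

    package main

    import "fmt"

    func main() {
        // "\xf0\x9f\x92" is the first three bytes of U+1F496's encoding.
        // Each iteration advances a single byte and yields U+FFFD.
        for i, r := range "\xf0\x9f\x92" {
            fmt.Printf("%d: %U\n", i, r) // 0: U+FFFD, 1: U+FFFD, 2: U+FFFD
        }
    }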

I'll note that both Go and UTF-8 were invented by Ken Thompson and Rob Pike.
I'm sure that the Go authors were aware of UTF-8's details. (Go also involved
Robert Griesemer, but that's tangential).

~~~
hsivonen
Thank you. I was looking for something that takes a potentially invalid buffer
of UTF-8 and returns a guaranteed-valid buffer and failed to find a function
like that.

(And, indeed, Go is an interesting case due to its creators being the
inventors of UTF-8, too.)

~~~
nigeltao
Yeah, there's not really a guaranteed-valid buffer concept in Go. Even if you
have valid UTF-8, you still have to iterate over it to e.g. rasterize glyphs,
and iterating over possibly-bad UTF-8 is no harder than iterating over known-
good UTF-8.

If you want to compare to other UTF-8, validity alone isn't always sufficient.
You often have to e.g. normalize anyway, and normalization should fix up bad
UTF-8. Again, a guaranteed-valid buffer type wouldn't win you much.
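
For illustration, a function with that shape would only be a few lines anyway;
this is a hypothetical helper (not an existing standard-library function at
the time of this thread), shown only to sketch the idea, and it keeps Go's
one-U+FFFD-per-invalid-byte convention:

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    // toValidUTF8 returns a copy of b in which every byte that is part of an
    // invalid sequence is replaced with U+FFFD (one replacement per byte).
    func toValidUTF8(b []byte) []byte {
        out := make([]byte, 0, len(b))
        for i := 0; i < len(b); {
            r, size := utf8.DecodeRune(b[i:])
            if r == utf8.RuneError && size <= 1 {
                out = append(out, "\uFFFD"...)
                i++
                continue
            }
            out = append(out, b[i:i+size]...)
            i += size
        }
        return out
    }

    func main() {
        fmt.Printf("%q\n", toValidUTF8([]byte("a\xf0\x9f\x92z"))) // "a\ufffd\ufffd\ufffdz"
    }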

------
derefr
Thought: what if Unicode decoding was "lossless" in the face of errors, such
that the replacement characters _represented_ the bitstring of non-decodable
bytes? (E.g. 256 reserved codepoints, one for each possible octet value, that
render as e.g. "[FF]" in a box; and then another 255 for the set of 7-bit,
6-bit, 5-bit, etc. overhangs.)

~~~
kazinator
been there, done that; it is very useful.

    
    
      1> [(file-get-string "/bin/ls") 0..15]
      "\x7F;ELF\x02\x01\x01\xDC00\xDC00\xDC00\xDC00\xDC00\xDC00\xDC00\xDC00"
      2> (file-put-string "foo" (file-get-string "/bin/ls"))
      t
      3> (sh "cmp /bin/ls foo")
      0
      4> (sh "sha256sum /bin/ls foo")
      a90ba058c747458330ba26b5e2a744f4fc57f92f9d0c9112b1cb2f76c66c4ba0  /bin/ls
      a90ba058c747458330ba26b5e2a744f4fc57f92f9d0c9112b1cb2f76c66c4ba0  foo
      0
    

There is no need to handle any fractional bytes if the original input is a
sequence of bytes; only so many whole bytes have to be recovered, never so
many whole bytes plus three bits or whatever.

~~~
1_player
I might be sleep deprived... but what exactly is this script doing?

You're making a copy of /bin/ls into foo, and running sha256sum on the copy
and the original.

    
    
        $ head -c 15 /bin/ls
        $ cat > foo < /bin/ls
        $ cmp /bin/ls foo
        $ sha256sum /bin/ls foo
    

I don't get it.

~~~
kazinator
> _You're making a copy of /bin/ls into foo_

By getting its contents as a character string formed by passing the binary
through a UTF-8 decoder, and writing out that string via the UTF-8 encoder.

------
vorg
> The proposal is ambiguous about whether to do the same thing for five and
> six-byte sequences whose bit pattern is not defined as existing in Unicode
> but was defined in now-obsolete RFCs for UTF-8 [...] If five and six-byte
> sequences are treated according to the logic of the newly-accepted proposal,
> the newly-accepted proposal matches the behavior of ICU.

Regarding 5- and 6-byte sequences, perhaps the Unicode Consortium in its
ambiguity and ICU in its implementation are allowing for their possible
return to Unicode. One day in the far-off future when UTF-16 finally dies, it
will be feasible to increase the codepoint repertoire back up from 1 million to
2 billion, which is easy to implement in both UTF-8 and UTF-32.

------
jfk13
This raises what appear to be important points. Has it been formally submitted
to Unicode in some way?

~~~
captaincrowbar
Yeah, there's been a long discussion about this on the official Unicode
mailing list recently. You can read it at
[http://unicode.org/pipermail/unicode/](http://unicode.org/pipermail/unicode/)

~~~
beerbajay
Link to the thread:
[http://unicode.org/pipermail/unicode/2017-May/thread.html#53...](http://unicode.org/pipermail/unicode/2017-May/thread.html#5389)

------
skybrian
It seems like the only harm is that "implementations have to explain
themselves" and one Chromium bug.

~~~
eridius
The Chromium bug is a demonstration of the fact that different behaviors in
different implementations can lead to real bugs. There's no reason to think
that one Chromium bug is the only time this will ever matter.

~~~
Dylan16807
It says in the spec that the number of replacement characters can vary, so
it's hard to blame Unicode for a bug caused by two parsers producing different
numbers of them.

(Unless the argument is that there should be an official required number,
which is a different discussion entirely.)

~~~
eridius
No, you're right in that it's perfectly legitimate for multiple parsers to
behave differently here. But if there's one behavior that nearly all parsers
have standardized on, that's very valuable because it makes it a lot easier to
use two different parsers without a problem.

