Since that's the point of comparison, what's the Go standard library's strategy? Is it inherently slower than this or does it behave better for different scenarios?
How does this compare to using SIMD (like simdutf8)?
Unrelated, but is it my imagination, or would all string algorithms benefit from UTF-32 over UTF-8?
UTF-32 always uses 32 bits per character, so you can easily parallelize it because you can always tell where a character starts in the byte stream.
UTF-16 characters can be 16 or 32 bits, so you can have a sequence of 2 characters that are 16 and 32 bits, or in UTF-8's case, a sequence of 3 characters that are 8, 16, and 16 bits; they won't fit neatly in a box. You can't just say char_ptr + 32 to land on the 32nd character, which makes it much harder to parallelize.
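To make that concrete, here is a minimal Go sketch (my own illustration, not from any particular library) contrasting constant-time indexing into a fixed-width UTF-32 buffer with the linear scan that variable-width UTF-8 forces:

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    // With UTF-32, every code point occupies exactly one 32-bit slot,
    // so the nth code point is a plain array index.
    func nthUTF32(buf []uint32, n int) rune {
        return rune(buf[n]) // O(1)
    }

    // With UTF-8, code points are 1-4 bytes wide, so you have to decode
    // from the start to find the nth one.
    func nthUTF8(s string, n int) rune {
        for _, r := range s { // O(n)
            if n == 0 {
                return r
            }
            n--
        }
        return utf8.RuneError
    }

    func main() {
        s := "héllo, 世界"
        u32 := []uint32{'h', 'é', 'l', 'l', 'o', ',', ' ', '世', '界'}
        fmt.Printf("%c %c\n", nthUTF8(s, 7), nthUTF32(u32, 7)) // prints: 世 世
    }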
In addition, UTF-32 can cause issues on the network due to byte ordering (little endian vs. big endian). There are really two encodings, UTF-32LE and UTF-32BE.
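A quick illustration of the byte-ordering point (again just a sketch using the Go standard library): the same code point serializes to opposite byte orders.

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    func main() {
        // U+20AC (€) encoded as UTF-32 in both byte orders.
        le := make([]byte, 4)
        be := make([]byte, 4)
        binary.LittleEndian.PutUint32(le, 0x20AC)
        binary.BigEndian.PutUint32(be, 0x20AC)
        fmt.Printf("UTF-32LE: % X\n", le) // AC 20 00 00
        fmt.Printf("UTF-32BE: % X\n", be) // 00 00 20 AC
    }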
For all of these reasons, the world has largely standardized on UTF-8, plus for historical reasons UTF-16 in the JVM, CLR, Windows, and JavaScript.
Reminds me of a developer who insisted that his service crashing when someone entered an address longer than 1024 characters in a form was no big deal, since "no one has an address that long".
Not all sequences of bytes are valid UTF-8. A UTF-8 decoder should be prepared for (one example of each in the sketch after this list):
invalid bytes
an unexpected continuation byte
a non-continuation byte before the end of the character
the string ending before the end of the character (which can happen in simple string truncation)
an overlong encoding
a sequence that decodes to an invalid code point
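To give a concrete feel for these, here is a small sketch using only Go's standard unicode/utf8 package (not the package under discussion), with one hand-picked byte sequence per failure class above; a conforming validator rejects all of them:

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        cases := []struct {
            name string
            b    []byte
        }{
            {"invalid byte", []byte{0xFF}},
            {"unexpected continuation byte", []byte{0x80}},
            {"non-continuation byte mid-sequence", []byte{0xE2, 0x41, 0xAC}},
            {"string truncated mid-character", []byte{0xE2, 0x82}}, // first 2 bytes of U+20AC
            {"overlong encoding", []byte{0xC0, 0xAF}},              // '/' encoded in 2 bytes
            {"invalid code point (surrogate U+D800)", []byte{0xED, 0xA0, 0x80}},
        }
        for _, c := range cases {
            fmt.Printf("%-38s valid=%v\n", c.name, utf8.Valid(c.b)) // all print valid=false
        }
    }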
OP is arguing against checking as a matter of course, on the hot path, not saying that you should never test at all. That particular counterargument is a bad one.
Clearly, that's not what they wrote. They told people to fix their code that generates broken UTF-8 instead, implying that validating fast is pointless since you don't need to validate at all. I don't understand how you came up with this interpretation. I'm also not sure what you're calling out as a bad counterargument, or how you would do what you suggest. And even if you don't validate on the hot path, it doesn't hurt to do it faster, so I don't understand why one would be against a more efficient computation.
But that's not what I'm responding to, and I was not taking sides, so I don't know why you would reply to my comment with this. I was just clarifying something. I'm answering a comment that may have misread what it was replying to, that's it.
I replied to you because you made the implication much more direct in your wording, but the implication is wrong. I'm not saying it's your opinion, you just made it easier to reply to.
OP was saying not to validate as part of your normal pipeline. They weren't saying not to validate ever, such as while poking around or testing, which is how you could still figure out that bad utf-8 exists and where it is coming from.
> And even if you don't validate on the hot path, it doesn't hurt to do it faster, so I don't understand why one would be against a more efficient computation.
It's not about being for or against the speed of calculation. It's that removing unnecessary calculations is even faster. So questioning the need is good.
If 'hot path' was unclear and you think I'm making things up, let me try a different wording. As a hypothetical: If I don't think something needs to be validated in production, I might question a library dedicated to doing it really fast, and say your producers should be fixed instead. Me saying that doesn't mean I think you should be stumbling around blind, never testing those producers.
That all depends on context. If users are submitting data then you don't have access to the producers. I think user data is a good objection, and other people brought that up immediately, I just think it's a different objection than the idea of "if you never check, how will you know?".
Fair enough, though it seemed disingenuous of OP to question the optimization of a hugely common process and say "just fix your code" when that advice doesn't apply to a vast (majority?) number of (obvious?) cases. A huge part of today's software runs on servers that need to check user input, including the very website we are commenting on. Browsers need to do it too, because they can't blindly trust what the server sends them. It almost takes effort to ignore those cases, to the point that people commenting here forget there are other things than the web.
If OP wanted to share the idea of moving validation to another step in the process, giving a case where it would be applicable, that would have been fine, but why question an optimization in the first place? Why not both? It just feels like they are rejecting stuff for the sake of rejecting stuff, and I assume that's why they got downvoted.
UTF-8 validation is something that happens constantly. And even if it weren't, this optimization is probably an interesting technical feat, and we are in a Show HN post too; rejecting stuff like this is just not going to fly on HN. You can question things, but you need to be insightful and nice.
So you asked what I thought, but I never gave an opinion on UTF-8, I gave a generic hypothetical.
If you're asking about OP's idea, then I agree that validating the UTF-8 inside of JSON is needed sometimes, though it's not required in all situations. But I wasn't commenting on that, I was commenting on the specific objection of "how do you know if you don't check?", because I don't think that specific objection is very strong.
There are various contexts where the inputs are defined to be UTF-8 and nonconforming inputs are rejected. For example, a `string` field in a protocol buffer.
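For instance, with the Go protobuf runtime (google.golang.org/protobuf), unmarshalling a proto3 `string` field that isn't valid UTF-8 fails. A rough sketch, hand-encoding the wire bytes of the well-known StringValue message (the exact error text may vary):

    package main

    import (
        "fmt"

        "google.golang.org/protobuf/proto"
        "google.golang.org/protobuf/types/known/wrapperspb"
    )

    func main() {
        // Wire bytes for a StringValue whose field 1 (a proto3 `string`)
        // holds the invalid byte 0xFF: tag 0x0A (field 1, length-delimited), length 1.
        raw := []byte{0x0A, 0x01, 0xFF}

        var msg wrapperspb.StringValue
        if err := proto.Unmarshal(raw, &msg); err != nil {
            fmt.Println("rejected:", err) // expect an invalid-UTF-8 error
        }
    }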
I don't see how. A user of a conforming protobuf library can assume, after a successful decode, that a string field contains valid UTF-8. That doesn't generalize to everything called "string".
I was trying to make the point that string validation isn't limited to protobuf. You need to do it for any network protocol that uses strings, which is most of them.