Show HN: Charcoal – Faster utf8.Valid using multi-byte processing without SIMD (github.com/sugawarayuuta)
58 points by sugawarayuuta 8 months ago | 33 comments



The write-up is pretty nice: https://sugawarayuuta.github.io/charcoal/

Since that's the point of comparison, what's the Go standard library's strategy? Is it inherently slower than this or does it behave better for different scenarios?

How does this compare to using SIMD (like simdutf8)?


What is the gist of the algorithm? I'm not proficient at reading Go. Is it perhaps doing https://en.wikipedia.org/wiki/SWAR which is indeed SIMD?


A description of the algorithm is here [1] and it is indeed SWAR.

[1] https://github.com/sugawarayuuta/charcoal/blob/main/docs/ind...



Yes. The algorithm is very much using SIMD in the abstract sense but it isn't using SIMD instructions.

It is basically using 64-bit integer operations to check 8 bytes at a time.
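Not the library's actual code, but a minimal sketch of the SWAR idea, checking whether 8 bytes at a time are pure ASCII with a single 64-bit AND (the helper names are made up for illustration):

```go
package main

import "fmt"

// hasHighBit reports whether any of the 8 bytes packed into w has its
// top bit set, i.e. is a non-ASCII byte. One 64-bit AND replaces eight
// per-byte comparisons; this is the basic SWAR trick.
func hasHighBit(w uint64) bool {
	return w&0x8080808080808080 != 0
}

// asciiOnly scans a byte slice 8 bytes at a time using the mask above,
// falling back to a plain byte loop for the tail.
func asciiOnly(b []byte) bool {
	for len(b) >= 8 {
		w := uint64(b[0]) | uint64(b[1])<<8 | uint64(b[2])<<16 | uint64(b[3])<<24 |
			uint64(b[4])<<32 | uint64(b[5])<<40 | uint64(b[6])<<48 | uint64(b[7])<<56
		if hasHighBit(w) {
			return false
		}
		b = b[8:]
	}
	for _, c := range b {
		if c >= 0x80 {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(asciiOnly([]byte("plain ASCII text"))) // true
	fmt.Println(asciiOnly([]byte("héllo")))            // false: é is two bytes, both >= 0x80
}
```

Full UTF-8 validation needs more masks than this (continuation bytes, overlong forms, surrogates), but they compose from the same building block.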


Unrelated, but is it my imagination, or would all string algorithms benefit from UTF-32 over UTF-8?

UTF-32 always uses 32 bits per character, so you can easily parallelize it, because you can easily tell where a character starts in the byte stream.

In UTF-16 a character can take 16 or 32 bits, and in UTF-8 anywhere from 8 to 32, so a sequence of characters (say, 16-32 bits in UTF-16, or 8-16-16 bits in UTF-8) won't fit neatly in fixed-size boxes. You can't just say char_ptr + 32 to land on the 32nd character, which makes it much harder to parallelize.
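In Go terms, converting a string to []rune gives the fixed-width view (one code point per element, like UTF-32), where indexing by character is trivial; indexing the UTF-8 bytes directly can land mid-character. A quick illustration:

```go
package main

import "fmt"

func main() {
	s := "naïve" // 5 characters, but 6 bytes in UTF-8 (ï takes 2 bytes)

	fmt.Println(len(s))         // 6: length in bytes
	fmt.Println(len([]rune(s))) // 5: length in characters (code points)

	// With fixed-width code points, the 3rd character is just r[2]:
	r := []rune(s)
	fmt.Println(string(r[2])) // ï

	// Indexing the UTF-8 bytes lands in the middle of ï:
	fmt.Println(s[2]) // 195, the first byte of ï's two-byte encoding
}
```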


Read https://unicode.org/faq/utf_bom.html#utf32-2.

In addition, UTF-32 can cause issues on the network due to byte ordering (little-endian vs. big-endian). There are really two encodings, UTF-32LE and UTF-32BE.

For all of these reasons, the world has largely standardized on UTF-8, plus for historical reasons UTF-16 in the JVM, CLR, Windows, and JavaScript.


Great work! I also liked reading about your journey with sonnet in the Gophers Slack :^)

Do you have any plans on proposing to upstream some of your work?


Yep. The implementation looks simple enough that, if it's a drop-in replacement, it should be upstreamed.

Curious to know if there are any known caveats?

Otherwise, go for it. May seem daunting, but it's not. I've submitted a couple of proposals and patches; pretty happy with the process.


See also "Ridiculously fast unicode (UTF-8) validation" [1] from three years ago.

[1] https://news.ycombinator.com/item?id=24839113


Why do we need to check utf-8 validity fast?

If you have invalid utf-8, go fix whatever code it is that is producing invalid utf-8...


This is about input validation, something you can't trust to be valid.

And obviously you don't want to spend more time doing this than necessary.


Reminds me of a developer who insisted that his service crashing when someone entered an address longer than 1024 characters in a form was no big deal since "no one has an address that long"


How do you know if you have invalid utf8?


From the Wikipedia article...

Not all sequences of bytes are valid UTF-8. A UTF-8 decoder should be prepared for:

    invalid bytes
    an unexpected continuation byte
    a non-continuation byte before the end of the character
    the string ending before the end of the character (which can happen in simple string truncation)
    an overlong encoding
    a sequence that decodes to an invalid code point
https://en.wikipedia.org/wiki/UTF-8


I think the question of your parent commenter is rather "how do you know if you have invalid UTF-8 if you don't check?"


OP is arguing against checking as a matter of course, on the hot path, not saying that you should never test. That particular counterargument is a bad one.


Clearly, that's not what they wrote. They told people to fix the code that generates broken utf-8 instead, implying that validating fast is pointless since you don't need to validate at all. I don't understand how you came up with this interpretation. I'm also not sure what you're calling out as a bad counterargument, or how you would do what you suggest. And even if you don't validate on the hot path, it doesn't hurt to do it faster, so I don't understand why one would be against a more efficient computation.

But that's not what I'm answering to, and I was not taking sides, so I don't know why you would reply to my comment with this. I was just clarifying something. I'm answering a comment that might have misread what they answered to, that's it.


I replied to you because you made the implication much more direct in your wording, but the implication is wrong. I'm not saying it's your opinion, you just made it easier to reply to.

OP was saying not to validate as part of your normal pipeline. They weren't saying not to validate ever, such as while poking around or testing, which is how you could still figure out that bad utf-8 exists and where it is coming from.

> And even if you don't validate on the hot path, it doesn't hurt to do it faster, so I don't understand why one would be against a more efficient computation.

It's not about being for or against the speed of calculation. It's that removing unnecessary calculations is even faster. So questioning the need is good.

If 'hot path' was unclear and you think I'm making things up, let me try a different wording. As a hypothetical: If I don't think something needs to be validated in production, I might question a library dedicated to doing it really fast, and say your producers should be fixed instead. Me saying that doesn't mean I think you should be stumbling around blind, never testing those producers.


> you just made it easier to reply to

Ah, makes sense. Thanks!

Well, the implication is not wrong. If you validate only during testing, you never validate some inputs, the ones that matter: user input.

I now see what you mean by hot path, I was imagining something like validating asynchronously, later, when not serving a request for instance.


That all depends on context. If users are submitting data then you don't have access to the producers. I think user data is a good objection, and other people brought that up immediately, I just think it's a different objection than the idea of "if you never check, how will you know?".


Fair enough, though it seemed disingenuous of OP to question the optimization of a hugely common process and say "just fix your code" when that's not applicable to a vast (majority?) number of cases. A huge part of today's software runs on servers that need to check user input, including the very website we are commenting on. Browsers need to do it too, because they can't blindly trust what the server sends them. It almost takes effort to ignore those cases, to the point that people commenting here forget there are other things than the web.

If OP wanted to share with us the idea of moving validation to another step in the process, giving a case where it would be applicable, that would have been fine but why question an optimization in the first place? Why not both? It just feels like they are rejecting stuff for the sake of rejecting stuff and I assume this is the reason they got downvoted.

UTF-8 validation is something that happens constantly. And even if it weren't, this optimization is probably an interesting technical feat, and we are in a Show HN post; rejecting stuff like this is just not going to fly on HN. You can question things, but you need to be insightful and nice.

You make good points but OP didn't make them.

Anyways.


You don't think JSON should be validated in production?

Presumably you think you should always be able to trust anyone who sends you JSON?

I mean Go already does a bunch of UTF-8 validations in its JSON encoder/decoder [1]. No need to make those faster.

[1] https://pkg.go.dev/encoding/json#Marshal


Hi, what?

I said hypothetical, and I did not use the word JSON.


Both JSON encoding and decoding require UTF-8 validation.

If there's a workload that makes sense to optimize because it's used in production, it's UTF-8 validation.

So why are we even arguing whether UTF-8 validation is necessary/useful in the "hot path" or in "production"?


So you asked what I thought, but I never gave an opinion on UTF-8, I gave a generic hypothetical.

If you're asking about OP's idea, then I agree that validating the UTF-8 inside of JSON is needed sometimes, though it's not required in all situations. But I wasn't commenting on that, I was commenting on the specific objection of "how do you know if you don't check?", because I don't think that specific objection is very strong.


There are various contexts where the inputs are defined to be UTF-8 and nonconforming inputs are rejected. For example, a `string` field in a protocol buffer.


This could be generalized to any `string` field in any network packet.


I don't see how. A user of a conforming protobuf library can assume, after a successful decode, that a string field contains valid UTF-8. That doesn't generalize to everything called "string".


I was trying to make the point that string validation isn't limited to protobuf. You need to do it for any network protocol that uses strings, which is most of them.


Your question is valid, but your subsequent statement is silly and dangerous.


Man, I would love to live in your perfect world where you never have to interact with 3rd-party data.


Let alone malicious third parties.



