Why would you index into a "character array"? Also, what is a "character" to you...

throwaway09223 · on Oct 4, 2022

> Why would you index into a "character array"?

Obviously, because you need to examine characters. Here are a couple examples:

* You're implementing any software that deals with RFC 6531-compliant email. You'll need to parse strings as Unicode to understand where to route mail, for example to understand 你好@example.com.

* You're implementing a database, or any software with a search function which indexes Unicode fields. You'll have to understand the logical character structure of the data, or your searches won't match properly.
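
To illustrate the search case, here is a minimal sketch using only Python's standard library. The helper name `search_key` is hypothetical, and NFC plus case folding is a simplifying assumption; a real search index would likely apply fuller rules (collation, compatibility forms).

```python
import unicodedata

def search_key(text: str) -> str:
    """Fold text to a comparison key: canonical normalization plus case folding."""
    # NFC composes sequences such as "e" + U+0301 into a single code point
    # where one exists; casefold() then removes case distinctions.
    return unicodedata.normalize("NFC", text).casefold()

composed = "caf\u00e9"      # "é" stored as one code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

print(composed == decomposed)                                       # False
print(search_key(composed) == search_key(decomposed))               # True
print(search_key("CAFE\u0301") in search_key("Un caf\u00e9 noir"))  # True
```

The two spellings render identically but compare unequal as raw strings, which is the kind of mismatch a naive search runs into.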

> Also, what is a "character" to you?

Characters are defined by the Unicode specification. In UTF-8 they can be represented by multi-byte sequences. Using the example of 你好@example.com, the @ is at the seventh byte position, but the third character position. You cannot implement software that handles internationalized emails without parsing and understanding strings like this.
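
The byte and character positions can be checked with a few lines of Python. This is only a sketch, and "character" here means a Unicode code point, which is how Python's `str` indexes.

```python
address = "你好@example.com"
encoded = address.encode("utf-8")

char_index = address.index("@")    # offset in code points, 0-based
byte_index = encoded.index(b"@")   # offset in UTF-8 bytes, 0-based

print(char_index + 1)               # 3: third character position
print(byte_index + 1)               # 7: seventh byte position
print(len(address), len(encoded))   # 14 18
```

Each of the two Chinese characters takes three bytes in UTF-8, which accounts for the gap between the two offsets.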

tialaramex · on Oct 4, 2022

> Characters are defined by the unicode specification.

Unicode's glossary offers four definitions for this word, which ought to be a strong hint that this is not a technical term but instead a vague idea people have which doesn't align very well to the technical problem.

throwaway09223 · on Oct 4, 2022

As I said initially: The interface (including the underlying unicode specification) is not good.

This is precisely my point. Internationalization is building on a foundation of sand.

dontlaugh · on Oct 4, 2022

That's not how you'd parse email. In fact, the bit before @ is encoded differently from the bit after @.
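
One reading of this, assumed here, is that the local part travels as UTF-8 while the domain may be converted to an ASCII form under IDNA. A rough sketch with Python's built-in `idna` codec (which implements the older IDNA 2003 rules); the domain `你好.example` and the helper `to_ascii_domain` are made up for illustration:

```python
def to_ascii_domain(domain: str) -> bytes:
    """Encode a domain with Python's built-in "idna" codec (IDNA 2003)."""
    return domain.encode("idna")

local_part = "你好"        # carried as UTF-8
domain = "你好.example"    # hypothetical internationalized domain

local_bytes = local_part.encode("utf-8")
domain_bytes = to_ascii_domain(domain)

print(local_bytes)                  # raw UTF-8 bytes
print(domain_bytes)                 # pure ASCII, first label starts with b"xn--"
print(domain_bytes.decode("idna"))  # round-trips to the original domain
```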

You're much more likely to iterate over elements in a string than index. You might iterate over code points, abstract characters, glyphs, grapheme clusters, etc. You might do it before or after a certain kind of normalisation. Since there are many ways to split a string (depending on what you're doing), it won't be possible to make all of them have O(1) indexing.
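
As a sketch of why iteration is the natural operation, here is a hand-rolled walk over code points in a UTF-8 buffer. The function names are hypothetical, and it assumes well-formed input; it covers code points only, not grapheme clusters or normalisation.

```python
from typing import Iterator, Tuple

def iter_code_points(buf: bytes) -> Iterator[Tuple[int, str]]:
    """Yield (byte_offset, character) pairs from well-formed UTF-8."""
    i = 0
    while i < len(buf):
        lead = buf[i]
        # The lead byte alone tells us how long the sequence is.
        if lead < 0x80:
            width = 1
        elif lead >> 5 == 0b110:
            width = 2
        elif lead >> 4 == 0b1110:
            width = 3
        elif lead >> 3 == 0b11110:
            width = 4
        else:
            raise ValueError(f"unexpected byte {lead:#x} at offset {i}")
        yield i, buf[i:i + width].decode("utf-8")
        i += width

def nth_code_point(buf: bytes, n: int) -> str:
    """Return the n-th code point (0-based); O(n) because it must scan."""
    for index, (_, ch) in enumerate(iter_code_points(buf)):
        if index == n:
            return ch
    raise IndexError(n)

data = "你好@example.com".encode("utf-8")
print(list(iter_code_points(data))[:3])  # [(0, '你'), (3, '好'), (6, '@')]
print(nth_code_point(data, 2))           # @
```

Getting to the n-th code point means scanning everything before it, unless you keep a separate offset table, and that table would only serve one way of splitting the string.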

throwaway09223 · on Oct 4, 2022

What do you mean "that's not how you'd parse email?" I didn't describe an algorithm.

Why don't you explain how you would parse the above email address without understanding the unicode encoding?

colejohnson66 · on Oct 4, 2022

If you're asking how to get everything after the '@' symbol, you can use knowledge of UTF-8 (assuming your byte array is UTF-8) to just say `FindIndexOfByte(utf8Buffer, (byte)'@') + 1`. But treat everything else as an opaque blob. The URL portion of that email address, as far as you should be concerned, is just the bytes after the '@' symbol. If you want to verify if it's a valid URL, you should use a URL library, not some hack that assumes a TLD is 3+ characters. However, even that naive method of a byte search is wrong. According to RFC 2822, the '@' symbol is actually allowed on the left-hand side if escaped. You should be using an email parsing library that implements the spec properly.
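
A sketch of that naive byte search in Python, with `after_first_at` as a hypothetical stand-in for the `FindIndexOfByte` call above. It also shows the quoted-local-part case that breaks it:

```python
def after_first_at(utf8_buffer: bytes) -> bytes:
    """Return the bytes following the first '@' byte, as an opaque blob."""
    # Every byte of a multi-byte UTF-8 sequence is >= 0x80, so the ASCII
    # byte 0x40 ('@') can only ever be a literal '@'.
    index = utf8_buffer.find(b"@")
    if index == -1:
        raise ValueError("no '@' in address")
    return utf8_buffer[index + 1:]

print(after_first_at("你好@example.com".encode("utf-8")))  # b'example.com'

# A quoted local part may itself contain '@', which is where the naive
# search goes wrong:
print(after_first_at(b'"a@b"@example.com'))  # b'b"@example.com'
```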

Naming the smallest unit of information in C a 'char' was a terrible mistake made decades ago. A character has never been a single byte in many countries. Shift-JIS, for example, can be one or two bytes, and it's roughly four decades old. Even in UTF-16 (used by Java and C#), a 'char' isn't actually a character, but a code unit. One that could even be an unpaired surrogate.
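
The variable-width point can be checked from Python, which ships codecs for both encodings. A small sketch; `utf16_code_units` is a hypothetical helper that counts the 16-bit units a Java or C# string would hold:

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

# Shift-JIS: ASCII letters take one byte, kana and kanji take two.
print(len("A".encode("shift_jis")))    # 1
print(len("あ".encode("shift_jis")))   # 2

# UTF-16: code points above U+FFFF need a surrogate pair.
print(utf16_code_units("A"))           # 1
print(utf16_code_units("\U0001F600"))  # 2

# A lone surrogate is not valid text, so Python's strict encoder rejects it.
try:
    "\ud83d".encode("utf-16-le")
except UnicodeEncodeError as exc:
    print("unpaired surrogate rejected:", exc.reason)
```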

Really, you should never be indexing into the backing byte/char array unless you know what you're doing. If you think you need to, you should be asking yourself, "what is it I really need to do?" Because you, as a programmer dealing with internationalization, must not assume English/Western ideas about language. For example, in abugidas[0], a single word can be a whole sentence. For a more common language, Japanese typically doesn't use spaces because they're not needed when writing kanji. But that's not foolproof because hiragana and katakana aren't for words, but sounds. How do you do a word count in those languages? You don't.

[0]: https://en.wikipedia.org/wiki/Abugida

throwaway09223 · on Oct 4, 2022

> You should be using an email parsing library that implements the spec properly.

> you should use a URL library, not some hack

I think what you're somehow missing is that I am specifically offering the task of writing an email parsing library (or a URL library, or whatever) as an example of where things break down.

Your answer is essentially "let someone else program it, because it's complicated and fraught with peril" -- but my entire point is that it's complicated and fraught with peril.

You are agreeing with me, and your post is a fine example of why the interfaces are bad and error prone.

colejohnson66 · on Oct 4, 2022

> Your answer is essentially "let someone else program it, because it's complicated and fraught with peril" -- but my entire point is that it's complicated and fraught with peril.

Are you asking how one would write code for the libraries you should use? In general, there already is one: ICU.[0] But if you had to write your own, as I said above, Unicode provides algorithms[1] that one might want to implement, in addition to a massive (over 1000 page) spec[2] detailing how to handle the various languages. For example, if you need to write the code to handle bidirectional text, there's UAX/TR #9[3]. Want to implement case changing? See Chapter 3.[4]

There's also the Unicode Character Database[5] that contains information on every code point. Your library/program would need to hold a copy of this whole thing.

[0]: https://icu.unicode.org/

[1]: https://unicode.org/reports/

[2]: https://www.unicode.org/versions/Unicode15.0.0/

[3]: https://unicode.org/reports/tr9/

[4]: https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf

[5]: https://www.unicode.org/reports/tr44/tr44-30.html
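
For a feel of what the Character Database holds, Python's standard `unicodedata` module bundles a copy of it. A small sketch (the `describe` helper is hypothetical) that looks up a few properties per code point:

```python
import unicodedata

def describe(ch: str) -> str:
    """Summarize a few Unicode Character Database properties of one code point."""
    name = unicodedata.name(ch, "<unnamed>")
    category = unicodedata.category(ch)    # e.g. "Lu", "Lo", "Mn"
    bidi = unicodedata.bidirectional(ch)   # bidi class, as used by UAX #9
    return f"U+{ord(ch):04X} {name} category={category} bidi={bidi}"

for ch in "A好\u0301":
    print(describe(ch))

print(unicodedata.unidata_version)  # UCD version bundled with this Python
```

This covers only a slice of the database; properties such as scripts or line-breaking classes are not exposed by this module.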

throwaway09223 · on Oct 4, 2022

No. Let's do a quick recap:

You said that people ought never address text by characters or bytes.

I'm pointing out that this is nonsense and I offered several examples.

You responded saying basically that those tasks are complex and someone else should do it.

In your followup response, you recommend using ICU which ... has an interface for iterating by character. Which is all I had mentioned to begin with. Sigh.

withinboredom · on Oct 4, 2022

Why the hell do you need to route the email unless you’re working on an email server? Send the email and have the user validate it. If they validate it, it’s a valid email. Don’t be clever, be smart.

Even if you’re working on an email server, you shouldn’t be touching that code unless it’s broken or you’re writing an implementation from scratch. In that case, treat it as an opaque blob.

throwaway09223 · on Oct 4, 2022

> unless you’re working on an email server?

We are talking about email software, inclusive of servers, clients, plugins, etc. It's literally the example I'm offering (but there are nigh-infinite others).

> Send the email and have the user validate it.

But I haven't proposed a problem of validating an email. Besides, even if the problem is validating an email this is not necessarily a workable answer.

> Even if you’re working on an email server, you shouldn’t be touching that code unless it’s broken

A lot of the arguments supporting Unicode seem to boil down to acknowledging that all these things are broken, yet disregarding them because the problems exist at a more technical level than most HN commenters are familiar with.

You might not write this kind of software for a living, but some of us do and we know that these interfaces are not good.