Ugh, please don't capitalize black. That's the kind of stuff Berliner was talking about.


Um, no. In general, if you tell someone to stop messaging you, they get to send one more message to react to that and tie up the conversation. "OK. You still haven't addressed points A, B, and C, so I still disagree. Let's wrap it up here then." That's perfectly reasonable and polite.


>How passionate do you need your accountant to be?

Oof. Yes.


>Do you read the docs of every basic feature you use?

Yes? You don't?


No.


I feel sorry for individuals who struggle because of this, but collectively, the schadenfreude is delicious. You get what you vote for.


> You get what you vote for.

That's not often the case; economic and social maps don’t align with voting maps. It's especially stark in a landlocked environment like the Peninsula, which can be totally dominated by SF: that's where you have to go for services, business, jobs, school, etc.

Another point of influence is SFO – owned and operated by San Francisco. They can screw up air travel for many of their non-constituents.


go.mod lists minimum versions. Minimum Version Selection may increase the versions used as required by other packages in the build. go.mod isn't a lock file.
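
A minimal sketch (module paths are hypothetical) of what those minimums look like:

    module example.com/myapp

    go 1.21

    require (
        example.com/libfoo v1.2.0 // a floor, not a pin
        example.com/libbar v1.4.3
    )

If anything else in the build graph requires example.com/libfoo v1.3.0, MVS selects v1.3.0 for the whole build; a real lock file would have kept v1.2.0.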


>Reversing a string is a useless operation in the real world

I'm not sure why you focused on this one example, which was just meant to illustrate the nature of the issue, not to cite a broad concrete problem. There are plenty of situations where you'd want to operate on graphemes, not code points, like deleting the previous grapheme in a text editor. It would certainly help programmers write correct code if the two were the same.
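
A quick Go sketch of how code-point-level reversal mangles combining marks (assuming a decomposed input string):

    package main

    import "fmt"

    func main() {
        s := "cafe\u0301" // "café" as 5 code points: c a f e + combining acute
        r := []rune(s)
        for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
            r[i], r[j] = r[j], r[i]
        }
        // The combining acute is now orphaned at the front and the "e" is bare.
        fmt.Println(string(r))
    }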

>doing away with combining marks and encoding everything as precomposed would be impossible because you cannot have a definitive list of every single combination of letters and diacritics that may mean something to someone

It seems to me it would be trivial to enumerate these combinations, and assign code points to them. For example, the Germanic umlaut is only used with vowels, so that's at most 5 code points.
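
Unicode already assigns precomposed code points for cases like this, and normalization maps between the two forms. A small Go sketch using the golang.org/x/text/unicode/norm package:

    package main

    import (
        "fmt"

        "golang.org/x/text/unicode/norm"
    )

    func main() {
        decomposed := "u\u0308" // 'u' + combining diaeresis: 2 code points
        composed := norm.NFC.String(decomposed)
        fmt.Println(composed, len([]rune(composed))) // ü 1
    }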


> It seems to me it would be trivial to enumerate these combinations, and assign code points to them. For example, the Germanic umlaut is only used with vowels, so that's at most 5 code points.

Well, 10 code points, because vowels can be capitalized, and 12 once you count ÿ and Ÿ, which are used in other languages.

That's one of the easiest cases. Now you need to go through _every_ other language which has _ever_ been used in human history and repeat that process for every combining character. Note also that in some languages it's valid to keep stacking a fair number of combining modifiers, so you'd need to cover every permutation allowed in each of them, and spend a lot of time working with linguists and classicists to make sure you weren't removing obscure combinations which are actually needed.

At the end of years of work, you'd have an encoding which is easier for C programmers to think about but means all of your documents require substantially more storage than they used to.


>Now you need to go through _every_ other language which has _ever_ been used in human history and repeat that process for every combining character. Note also that in some languages it's valid to keep stacking a fair number of combining modifiers so you'd need to cover every permutation allowed in each of them, and spend a lot of time working with linguists and classicists to make sure you weren't removing obscure combinations which are actually needed.

Perhaps this is just my ignorance talking, but it can't be that many permutations, can it? Somebody linked to https://en.wikipedia.org/wiki/Zalgo_text, which I doubt anyone would seriously want to enable. There's, what, maybe 3-4 marks typically added to chars in the most complex of cases, mostly for vowels, like Vietnamese. With 4 billion code points to work with, that seems doable. We could just throw in all permutations, regardless of past utility, to accommodate future expansions of acceptable marks. Chinese has, what, 10K chars? It doesn't seem like a big deal for Latin-based chars to have a similar set size when accounting for all mark variations.
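
A back-of-envelope under generous assumptions (26 base letters in two cases, around 30 combining marks, ordered stacks of up to 4 marks):

    52 × (30 + 30^2 + 30^3 + 30^4) = 52 × 837,930 ≈ 43.6 million

That fits easily in a 32-bit space, though for what it's worth it would blow past Unicode's actual ceiling of 1,114,112 code points (U+10FFFF).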

>but means all of your documents require substantially more storage than they used to.

Good point! But that comes down to a trade-off analysis between design and space. High 32-bit code point values are meant to be used too, and not shied away from.


> Chinese has, what, 10K chars? It doesn't seem like a big deal for Latin-based chars to have a similar set size when accounting for all mark variations.

I believe it's over a hundred thousand (don't forget scholars need to work with classical and/or obscure characters which aren't in common usage), and, while not common, new ones are still being added. Han unification is a good cautionary example to consider anytime you think something related to Unicode is easy: https://en.wikipedia.org/wiki/Han_unification

Now, there are on the order of 150K characters in Unicode, so there is definitely a lot of room even for Chinese. I'm not so sure about the combinations, though: there are languages which use combining marks extensively (e.g. Navajo), and things like emoji skin tone modifiers (multiply everything with skin by 5 variants) or zero-width joiners to handle things like gender mean you get a lot of permutations if you were trying to precompose those to individual code points.
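
For a concrete sense of the modifier math, a tiny Go sketch showing that even a single skin-toned emoji is already two code points:

    package main

    import "fmt"

    func main() {
        thumbsUp := "\U0001F44D\U0001F3FD" // 👍 + medium skin tone modifier
        for _, r := range thumbsUp {
            fmt.Printf("U+%04X\n", r) // prints U+1F44D, then U+1F3FD
        }
    }

Precomposing would need a separate code point for every such combination, and ZWJ sequences multiply that further.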

This is already sounding like a ton of work, even before you get to the question of getting adoption, and then you have to remember that the Unicode Consortium explicitly says that diacritic marks aren't specific to a particular known language's usage, so you either have to encode every permutation or prove that nobody uses a particular one.

https://unicode.org/faq/char_combmark.html#10

The big question here is what the benefit would be, and it's hard to come up with one other than that everyone could treat strings as arrays of 32-bit integers. While nice, that doesn't seem compelling enough to take on a task of that order of magnitude.


> It seems to me it would be trivial to enumerate these combinations, and assign code points to them.

Far from it. Even if you limit yourself to just Latin, the number of valid (whatever “valid” even means) combinations is already unmanageably gargantuan. Just look at phonetic notation as one example of many. The basic IPA alone uses over 100 letters for consonants and vowels, plus dozens of different diacritics, many of which need to be present concurrently on the same base letter. Make the jump to extended IPA or any number of other, more specialised transcription systems – and there are plenty – and you’ll never see the end of it.

Sure, it may be technically possible to create an exhaustive list of letter-and-diacritic combinations, just like you can technically create an exhaustive list of every single human on Earth, but good luck getting there. And good luck making sure you didn’t miss anything in the process.

Of course, you don’t need to limit yourself to Latin, because Unicode has 160 other writing systems to offer.

Writing systems like Tibetan and Newa where consonants can be stacked vertically to arbitrary heights and then have vowel signs and other marks attached as a bonus as well.

Or Hangul, which would occupy no less than 1,638,750 code points if all possible syllable blocks were encoded atomically, and that doesn’t even account for the archaic tone marks, or those novel letters that North Korea once tried to establish that aren’t even in Unicode yet.
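
(For the arithmetic: Unicode already encodes the 19 × 21 × 28 = 11,172 modern syllable blocks precomposed at U+AC00..U+D7A3; the 1,638,750 figure presumably counts archaic jamo as well, on the order of 125 initials × 95 medials × 138 finals = 1,638,750.)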

Or Sutton SignWriting whose system of combining marks and modifiers is so complex that I’m not even gonna explain it here.

If you eschew combining characters then yes, you will create an encoding where every code point is at the same time a full grapheme cluster, and that definitely has concrete advantages. But as a consequence you have now assigned yourself the unenviable task of having to possess perfect, nigh-omniscient knowledge of every single thing that a person has ever written down in the entirety of human history. Because unless you possess that knowledge, you will leave out things that some people need to type on a computer under some circumstances.

Every time some scholar discovers a previously forgotten vowel sign in an old Devanagari manuscript, you need to encode not only that one new character, but every combination of that vowel sign with any of the (currently) 53 Devanagari consonants, plus Candrabindu, Anusvara, and Visarga at the very least, just in case these combinations pop up somewhere, because they’re all linguistically meaningful and well-defined.

It’s doable, in a sense, but why would you subject yourself to that if you can just make characters combine with each other instead?


>Why would your data structure make that easy to do?

Reversing a string merely illustrates the problem. There are many cases that call for operating on graphemes instead of code points: for example, deleting the previous grapheme in a text editor when the user presses backspace/delete. I think most programmers assume they're dealing with graphemes when they're actually dealing with code points. See, for example, the rune type in Go.
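
A minimal Go sketch (hypothetical backspace helper, standard library only) of that mismatch:

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    // Naive backspace: drop the last code point. On "e" + combining acute
    // this strips only the accent, leaving a bare "e" instead of deleting
    // the whole grapheme the user sees.
    func backspace(s string) string {
        _, size := utf8.DecodeLastRuneInString(s)
        return s[:len(s)-size]
    }

    func main() {
        s := "fiance\u0301" // "fiancé": 7 code points, 6 graphemes
        fmt.Println(backspace(s)) // "fiance", not "fianc"
    }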

