Truncating at codepoint boundaries at least avoids generating invalid (non-UTF-8) strings, but it can still result in confusing or incorrect displays for human readers. For best results, the truncation algorithm should take extended grapheme clusters into account; they are probably the closest thing Unicode has to what most people think of as "characters".
To avoid this and a bunch of other confusion, when accepting user input I recommend normalizing it to the composed form before writing it to a DB or file. While Unicode-aware tools and software should handle either form just fine, you probably want to guard against something in the pipeline somewhere treating the decomposed and composed forms of the same string as different.
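A minimal sketch of that normalization step in Python (stdlib unicodedata; the helper name is just for illustration):

    import unicodedata

    def normalize_input(text):
        # Collapse decomposed sequences (e.g. 'e' + U+0301) into their
        # precomposed forms (e.g. U+00E9) before storing the string.
        return unicodedata.normalize("NFC", text)

    decomposed = "re\u0301sume\u0301"      # 'résumé' spelled with combining accents
    composed = normalize_input(decomposed)
    print(decomposed == composed)          # False -- different code points...
    print(composed == "r\u00e9sum\u00e9")  # ...but NFC yields the precomposed spelling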
The emoji "" can't be normalized further -- it's a "" followed by a "". If you just split on code points rather than grapheme clusters, even after normalizing, your naïve truncation algorithm will have accidentally changed the skin colors of emoji. Or turned the flag of Norway into an "". Or turned the rainbow flag into a white flag .
EDIT: oh lord Hacker News strips emoji. You get the idea even though HN ruined the illustrations. Not my fault HN is broken.
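Since the emoji get stripped here, a Python sketch with escape sequences makes the same point (plain code-point slicing, no grapheme awareness):

    thumbs_up = "\U0001F44D\U0001F3FD"   # THUMBS UP SIGN + medium skin tone modifier
    norway = "\U0001F1F3\U0001F1F4"      # regional indicators N + O = the flag of Norway

    # A naive "keep the first character" on code points:
    print(ascii(thumbs_up[:1]))   # '\U0001f44d' -- skin tone gone, back to the default
    print(ascii(norway[:1]))      # '\U0001f1f3' -- a lone regional indicator N, not a flag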
Whether the absence of emojis on HN is a feature or a bug is arguable. But if you can't figure out a way to work around this constraint (e.g. put your example literally anywhere else on the web and post a link here) HN is probably not a good fit for you.
Reading a discussion thread where each message is just a link to some pastebin with the actual message isn't very nice. Besides, I wasn't going to write the message again after HN removed arbitrary parts of it, hence the edit; I think people got the gist. You may feel that discussion about Unicode doesn't belong on HN but I feel otherwise.
Reading discussion threads full of silly emojis isn't "very nice" either, at least for a certain kind of audience. It's a tradeoff, and the powers that be at HN have decided to optimize for sober discussion over expressivity. It's a defensible decision. Keeping HN from degenerating into Reddit is already hard enough.
I saw something about Arabic text, where that naive truncation at codepoint boundaries turns one word into a different word! The sequence of codepoints generates something that is represented as a single glyph in fonts, but truncated it's totally different glyphs. I don't remember more details, and I don't know any Arabic, but grapheme clusters aren't just about adding diacritics to Latin characters; in other languages it all might work quite differently. So truncating at word boundaries (at breakable white-space or punctuation) is probably best. Though of course that way you might truncate the string by a lot. shrug-emoji
(I don't think the talk where the stuff about Arabic was mentioned was Plain Text by Dylan Beattie, but I haven't re-watched it to confirm. So maybe it is. Can't remember the name of any other talk about the subject right now.)
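A rough Python sketch of that word-boundary approach under a byte limit (it assumes whitespace-delimited text, so it still won't help for scripts written without spaces):

    def truncate_utf8_at_word(text, max_bytes):
        """Fit text into max_bytes of UTF-8 without ending mid-word."""
        data = text.encode("utf-8")
        if len(data) <= max_bytes:
            return text
        # Decode the prefix, silently dropping any code point torn in half by the cut.
        prefix = data[:max_bytes].decode("utf-8", errors="ignore")
        # Unless the cut landed exactly on whitespace, drop the (possibly mangled) last word.
        if not data[max_bytes:max_bytes + 1].isspace():
            parts = prefix.rsplit(None, 1)
            if len(parts) == 2:   # but keep at least one word
                prefix = parts[0]
        return prefix.rstrip()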
Randomly truncating words can have the same effect in any language; it's outright trivial to find examples in English or German. I don't understand why one has to invoke Arabic script for a good example.
Yes, but you don’t end up with different glyphs. Arabic script has letter shaping, that means a letter can have up to 4 shapes based on its position within the word. If you chop off the last letter, the previous one which used to have a “middle” position shape suddenly changes into “terminal” position shape.
I'm thinking even bog-standard European umlauts, cedillas, etc go multi-byte in Unicode? (Take a string of ÅÄÖåäöÜü and chop it off at various byte limits and see.)
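(They are indeed two bytes each in UTF-8.) A quick Python sketch of that chop-it-off-and-see experiment:

    s = "ÅÄÖåäöÜü"
    b = s.encode("utf-8")
    print(len(s), len(b))                           # 8 code points, 16 bytes
    print(b[:5])                                    # b'\xc3\x85\xc3\x84\xc3' -- ends mid-character
    print(b[:5].decode("utf-8", errors="replace"))  # 'ÅÄ' plus a replacement character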
IIRC the lower levels of Windows will happily work with filenames that are not valid Unicode strings, for example if you use the kernel API rather than Win32.
But what about Win32? If you create a file before normalization and then open it using the normalized form, will it open the same file or return file not found?
What about other systems? For example AWS' S3 allows UTF-8 keys, with no mention of normalization[1].
On the phone so can't try myself right now.
Anyway for general text I agree, but for identifiers, filenames and such I prefer to treat them as opaquely as possible.
They're both valid UTF-16, though. Can you create a filename with only half of a surrogate pair in it?
I don't use Windows, so I can't check. Linux literally allows any arbitrary byte except for 0x00 and 0x2F ('/' in ASCII/UTF-8). It's a problem for programming languages that want to only use valid Unicode strings, like Python. Rust has a separate type "OsString" to handle that, with either lossy conversion to "String" or a conversion method that can fail. Python uses a reserved Unicode range (lone surrogates, via the "surrogateescape" error handler) to represent invalid byte sequences in filenames. It's all a mess. JavaScript doesn't give a damn about the validity of its UTF-16 strings.
(Note that Rust's OsString is different from its CString type. Well, I guess under Unix they're the same, but under Windows OsString is UTF-16 (or "WTF-16", because it isn't actually valid UTF-16 in all cases).)
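A small sketch of the Python side of that on a typical Linux system (the surrogateescape mechanism behind os.fsdecode/os.fsencode; the filename is made up):

    import os

    raw = b"report-\xff.txt"   # not valid UTF-8

    # Invalid bytes come through as lone surrogates instead of raising...
    name = os.fsdecode(raw)
    print(repr(name))          # 'report-\udcff.txt'

    # ...and they encode straight back to the original bytes.
    assert os.fsencode(name) == raw

    # A strict decode of the same bytes would simply fail:
    # raw.decode("utf-8")  ->  UnicodeDecodeError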
I tried using U+13161 EGYPTIAN HIEROGLYPH G029[1], which resulted in a string of length 2 as expected.
Creating a filename with both chars (code units) and one with just the first char (code unit) worked equally fine. In Windows Explorer the first one shows the stork as expected, while the second shows that "invalid character" rectangle.
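The code-unit arithmetic behind that, as a Python sketch (Python itself counts code points, so the UTF-16 view has to be made explicit):

    ch = "\U00013161"               # EGYPTIAN HIEROGLYPH G029
    print(len(ch))                  # 1 code point...
    units = ch.encode("utf-16-le")
    print(len(units) // 2)          # ...but 2 UTF-16 code units
    print(units[:2].hex(), units[2:].hex())   # '0cd8' '61dd' -- high + low surrogate
    # Keeping only the first unit leaves a lone high surrogate (U+D80C), which is not
    # a valid character on its own -- hence Explorer's "invalid character" rectangle.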
So yeah, treating filenames as nearly-opaque byte sequences is probably the best approach.
The Linux kernel doesn't validate filenames in any way, so a filename in Linux can contain any byte except 0x2F ('/', which is interpreted as directory separator) and 0x00 (which signals the end of the byte string).
ETA: of course some file systems have other limitations, for example '\' is not valid in FAT32.
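A sketch of what that permits in practice (Python on ext4 or a similarly permissive filesystem; the name here is made up and some filesystems will reject it):

    import os, tempfile

    d = tempfile.mkdtemp()
    # Anything except b'/' and b'\x00' is accepted by the kernel.
    weird = os.path.join(d.encode(), b"caf\xe9\xff\x01 name")   # Latin-1-ish bytes, not UTF-8
    with open(weird, "wb"):
        pass
    print(os.listdir(d.encode()))   # [b'caf\xe9\xff\x01 name']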
Even if you only support scripts for which Unicode has composed codepoints, these days you likely can't get away without properly handling emoji, and there are no precomposed versions of the numerous emojis that are made of multiple code points (e.g. skin color and gender variants, as well as flags).
It's good advice to normalise to the pre-composed form, but that doesn't solve the problem the previous poster mentioned, as not everything exists in a composed form. That said: most things do have a composed form, so you can probably get away with it – right up to when you can't.
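Concretely, NFC only composes sequences that have a precomposed code point, so in this Python sketch both the skin-toned emoji and the q-with-tilde stay at two code points:

    import unicodedata

    thumbs = "\U0001F44D\U0001F3FD"   # thumbs up + skin tone modifier
    q_tilde = "q\u0303"               # q + combining tilde: no precomposed letter exists

    for s in (thumbs, q_tilde):
        print(len(unicodedata.normalize("NFC", s)))   # 2 and 2 -- nothing to compose into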
Yeah, working on a library system our path was to compose everything (taking into account, of course, that the octet sizes specified in the directory may or may not actually be accurate depending on whatever system produced the record) and around the same time deprecate any pretense we had of supporting MARC-8.
This will probably fail if the thing being chopped off is a composed emoji, like the flag emoji (where it can chop off the second letter of the ISO code and leave a bewildering but completely valid first letter for the user) or the ZWJ-sequence emojis, which will leave a color, or half a family, or other shenanigans, depending on where it cuts.
Well, if the user entered a French flag, and then you show it back to them as a white flag, you may cause a bit of an international incident. Or worse, accusations of telling very old jokes.
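The white-flag case and the half-a-family case, spelled out as a Python sketch (escape sequences, since the emoji won't survive here):

    rainbow_flag = "\U0001F3F3\uFE0F\u200D\U0001F308"      # white flag + VS16 + ZWJ + rainbow
    family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # man + ZWJ + woman + ZWJ + girl

    # Chopping code points mid-sequence yields valid but very different strings:
    print(ascii(rainbow_flag[:2]))   # white flag with emoji presentation -- surrender!
    print(ascii(family[:3]))         # man + ZWJ + woman -- half the family is gone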
Don't do this. Use a language (like C#) or library (like libunistring) that can do grapheme cluster segmentation. In .NET it's StringInfo.GetTextElementEnumerator(). In libunistring it's u8_grapheme_breaks(). In ICU4C it's icu::BreakIterator::createCharacterInstance(). In Ruby it's each_grapheme_cluster(). Other ecosystems with rich Unicode support should have similar functionality.
I pasted your comment here into GPT-4o and asked for the Python equivalent; it suggested this, which seems to work well:
    import regex as re

    def grapheme_clusters(text):
        # \X is the regex pattern that matches a grapheme cluster
        pattern = re.compile(r'\X')
        return [match.group(0) for match in pattern.finditer(text)]
Note that regex is not the re module from the stdlib; it's a separate third-party module that directly exposes more powerful Unicode features, like matching grapheme clusters with \X, which the stdlib engine doesn't support.
That's fine unless you are a language or library creator, in which case knowing how to do it properly can't be deferred to someone else. Porting someone else's correct implementation may be good enough, but someone somewhere has to implement this, and unless those who do that kind of work share their knowledge and experience, it will remain esoteric knowledge locked away. Most of us are not those people, but some are.
As it turns out, I am writing my own language, and my language supports grapheme cluster segmentation. I just used libunistring (and before that, I used ICU). TFA is not doing this correctly at all; the Unicode specification provides the rules for grapheme cluster segmentation if you wish to implement it yourself[0]. There's nothing to be learned from TFA's hacky and fundamentally incorrect approach. OP's technique will freely chop combining code points that needed to be kept.
Now, extended grapheme cluster enumeration is much more complex than finding the next non-continuation byte (or counting such), but to do it correctly you would ultimately end up reading the official spec at unicode.org and perusing reference implementations like ICU (which is painful to read) or the standard libraries/popular packages for Rust/Java/C#/Swift (the decent ones I'm aware of; do not look at C++).
In Perl (OP's chosen language) you can use the Unicode::Util package. That's why I was pretty clear that you can use a different language or a different library. This seems to be a pretty uncharitable reading of my post. Use the right tool for the job.
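To tie it back to the article's actual task, here is a hedged sketch of byte-limited truncation on grapheme boundaries, built on the third-party regex module shown above rather than the article's byte-scanning approach:

    import regex

    def truncate_utf8_at_grapheme(text, max_bytes):
        """Keep whole extended grapheme clusters while staying within max_bytes of UTF-8."""
        out, used = [], 0
        for match in regex.finditer(r"\X", text):
            cluster = match.group()
            size = len(cluster.encode("utf-8"))
            if used + size > max_bytes:
                break
            out.append(cluster)
            used += size
        return "".join(out)

    # A skin-toned emoji (2 code points, 8 bytes) is kept or dropped as a unit, never split:
    print(ascii(truncate_utf8_at_grapheme("\U0001F44D\U0001F3FD!", 9)))   # emoji plus '!'
    print(ascii(truncate_utf8_at_grapheme("\U0001F44D\U0001F3FD!", 7)))   # '' -- it doesn't fit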
Fun fact: part of why TOML 1.1 has taken so long to land is because of open questions around unicode key normalization. That in itself sounds dry and boring, but the discussion threads are anything but.
I cut my programmer teeth on Koha years and years ago. Still one of the warmest open-source communities I've ever been involved in, especially to a shy teenager with a lot of opinions.
Great to see new faces in the community, sad to see the sheer insanity of MARC21 still causing chaos. MARCXML is gonna make it obsolete Any Day Now!
MARCXML is just a new format for encoding the vast majority of the MARC 21 standard (or, for that matter, any other MARC variety).
BIBFRAME is gonna make it obsolete Any Day Now!
("any day now" in this sector means that librarians have been talking about it for two decades and in about two decades something might actually happen)
Yeah, I don't know what the sell is there... throw away the fidelity of your data because WEMI/FRBR/BIBFRAME/semantic web is coming any ~year~ decade now, soon (lol), while re-learning everything, definitely going out to tender because your current system won't do it, and shifting your processes and integrations. All so you can end up halfway to DC. Yeah, no.
The reason libraries no longer do anywhere near as much cataloguing hasn't got much to do with MARC 21 being hard.
The fidelity seems pretty good, at least if you convert to MARCXML and then use the XSLT from the Library of Congress to generate it. IIRC it has record types for all the FRBR levels, and it is also not flat like DC. It was a joy to work with from a record aggregator's perspective, especially if you were generating it from MARC; you can even put the full table of contents into it. At the time one of my colleagues wrote the "MARC Must Die" article (at least if I remember correctly), the teams working on RDA and MODS had a lot of overlap, and MODS was being designed with the era's cataloging theory in mind. There was a moment in time when it seemed like a "new MARC" might go in that direction.
Having catalogers or metadata librarians write MODS XML directly by hand never made sense (although some folks tried this), but as far as something usable to ship around, I'd rather get MODS than MARC or Dublin Core. I really don't want to have to query a triple store to aggregate records.
Catalogers ideally would have tools that make it easy for them to follow RDA/AACR2 descriptive practices without having to think about the details of MARC or MODS or linked data.
I've been out of the business for a couple of years, so I have not been following Library of Congress' BIBFRAME transition.
There are some hard-to-handle edge cases when doing display-length truncation in Unicode, e.g. the character U+FDFD or "﷽" is a single code point (three bytes in UTF-8) but can be very long depending on the typeface*, so "completely" solving it is quite hard and has to depend on feedback from your rasterization engine.
This is a completely unrelated problem, since the article is quite clearly about limiting to a certain maximum byte length, not display length. And display length depends on the font and shaping engine even without Unicode in the picture.
MARC can do all kinds of crazy things. I used to work with folks who had been hacking on MARC since the 1960s. If I remember correctly, at one point it got punched onto dangling chad cards (and of course was used to print the cards in the card catalog in the library).
> The real problem is that USMARC uses an int with 4 digits to store the size of a field, followed by 5 digits for the offset.
A colleague told me they used to exploit this "feature" to leave hidden messages in MARC records between fields.
Well, until some system comes along that relies on the directory for the tags only and just splits the record using the separator characters. Which is a valid enough approach, either to work around bad encoding or when your record lives on something other than magnetic tape and you don't need to know the exact offset to seek to.
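A rough Python sketch of those two reading strategies (assuming standard MARC 21 structure: a 24-byte leader, 12-byte directory entries, field terminator 0x1E, record terminator 0x1D; error handling omitted):

    FT, RT = b"\x1e", b"\x1d"   # field and record terminators

    def fields_via_directory(record: bytes):
        """Trust the directory: 3-char tag, 4-digit length, 5-digit offset per entry."""
        base = int(record[12:17])                  # base address of data, from the leader
        directory = record[24:record.index(FT)]
        for i in range(0, len(directory), 12):
            tag = directory[i:i + 3].decode()
            length = int(directory[i + 3:i + 7])
            offset = int(directory[i + 7:i + 12])
            yield tag, record[base + offset:base + offset + length].rstrip(FT)

    def fields_via_terminators(record: bytes):
        """Take only the tags from the directory; get the data by splitting on 0x1E."""
        base = int(record[12:17])
        directory = record[24:record.index(FT)]
        tags = [directory[i:i + 3].decode() for i in range(0, len(directory), 12)]
        data = record[base:].rstrip(RT).rstrip(FT).split(FT)
        return list(zip(tags, data))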
I never really worked with MARC much (except for a script for generating patron records once a quarter to load new students and staff into III, and somehow we marked obsolete users to change their status), but I used to work at the successor organization to the University of California Division of Library Automation (née University Library Automation Program), and one of the folks telling MARC war stories described doing this with a tool he created specifically for creating pathological MARC records. They aggregated records from local systems into the systemwide "Melvyl" (during ULAP they produced microfiche binders of the union catalog) -- I don't know that they ever redistributed the MARC to other display systems.
* https://hoytech.github.io/truncate-presentation/ / https://metacpan.org/pod/Unicode::Truncate
* https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundarie...