
How do I allow "stępień" while detecting Zalgo-isms?



Zalgo is largely the result of abusing combining modifiers. Declare that any string with more than n combining modifiers in a row is invalid.

n=1 is probably a reasonable falsehood to believe about names, until someone points out that language X regularly has multiple combining modifiers in a row. At that point you can bump n up to around the maximum number of combining modifiers language X is likely to use, add a special case to say "this is probably language X, so don't look for Zalgo here", or just give up: put some Zalgo in your test corpus, look for places where it breaks things, and fix whatever breaks in a way that isn't funny.
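That rule is easy to sketch with Python's standard `unicodedata` module (a sketch, treating any code point in a Mark category as a "combining modifier"; the function names are mine):

```python
import unicodedata

def max_combining_run(text: str) -> int:
    """Return the length of the longest run of combining code points."""
    longest = run = 0
    for ch in text:
        if unicodedata.category(ch).startswith("M"):  # Mn/Mc/Me: combining marks
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest

def looks_zalgo(text: str, n: int = 2) -> bool:
    """Reject strings whose longest combining run exceeds n."""
    return max_combining_run(text) > n
```

Note this only sees combining marks that are literally present in the string; precomposed characters count as a run of zero unless you decompose first.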


n=2 is common in Vietnamese (vowel sound + tone mark). See "Việt Nam" itself.


Yet Vietnamese can be written in Unicode without any combining characters whatsoever: in NFC normalization each character is a single code point, just like U+1EC7 LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW in your example.
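A quick check in Python (a sketch; any Unicode-aware library will show the same) confirms that the precomposed and decomposed forms round-trip:

```python
import unicodedata

e = "\u1ec7"  # U+1EC7, a single precomposed code point (NFC)
nfd = unicodedata.normalize("NFD", e)

assert len(e) == 1    # NFC: one code point
assert len(nfd) == 3  # NFD: e + combining dot below + combining circumflex
assert unicodedata.normalize("NFC", nfd) == e
```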


u/egypurnash's point was about limiting glyph complexity. You could canonically decompose, then look for more than N (say, N=3) combining code points in a row and reject the string if any such run is found. Canonical forms have nothing to do with actual glyph complexity, but conceptually[0] normalizing first might be a reasonable starting point.

[0] I say conceptually because you could implement a form-insensitive Zalgo detector without normalizing the whole string: for each non-combining code point, look it up in the Unicode database to find how many combining code points its canonical decomposition contains, and call that `n`. Then add every combining code point that follows, and reject if the total exceeds `N`. This is close to optimal because most characters in most strings don't decompose to more than one code point, and even when they do, you avoid allocating a buffer to normalize into and the associated memory stores.
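A sketch of that idea in Python (approximating the database lookup with a per-character NFD, so no whole-string buffer is needed; the function name and `limit` default are mine):

```python
import unicodedata

def zalgo_suspect(text: str, limit: int = 3) -> bool:
    """Form-insensitive check: count combining marks per base character
    as if the string were canonically decomposed."""
    count = 0
    for ch in text:
        if unicodedata.category(ch).startswith("M"):
            count += 1  # explicit combining mark in the input
        else:
            # Seed the count with the marks this character would contribute
            # under NFD, e.g. U+1EC7 decomposes to one base letter + two marks.
            decomposed = unicodedata.normalize("NFD", ch)
            count = sum(unicodedata.category(c).startswith("M")
                        for c in decomposed)
        if count > limit:
            return True
    return False
```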


I can point out that Greek needs n=2: accent plus breathing mark.
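For instance, the single code point ἄ (alpha with smooth breathing and acute accent) decomposes to one base letter plus two combining marks, shown here with Python's `unicodedata` as a sketch:

```python
import unicodedata

# U+1F04 GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA
nfd = unicodedata.normalize("NFD", "\u1f04")
marks = [c for c in nfd if unicodedata.category(c).startswith("M")]
print(len(marks))  # breathing + accent: a run of two combining marks
```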


There's nothing special about "Stępień": it has no combining characters, just the usual diacritics that have their own code points in the Basic Multilingual Plane (U+0119 and U+0144). I bet there are some names out there that would make it harder, but this isn't one.


If you decompose it, then it uses combining code points. Still nothing special.
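Concretely, in Python (the escapes below spell "Stępień" with its precomposed ę and ń):

```python
import unicodedata

name = "St\u0119pie\u0144"                # "Stępień", precomposed (NFC)
nfd = unicodedata.normalize("NFD", name)  # ę -> e + U+0328, ń -> n + U+0301

assert len(name) == 7  # seven precomposed code points
assert len(nfd) == 9   # two extra combining marks after decomposition
assert unicodedata.normalize("NFC", nfd) == name
```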


I could answer your question better if I knew why you need to detect Zalgo-isms.


Because they are an attack vector. They can be used to hide important information as they overflow their bounds (clipping solves that, but then you need to clip everywhere it matters), and they can slow text renderers to a crawl.


Why do you need to detect Zalgo-isms and why is it so important that you want to force people to misspell their names?


For the unaware (including myself): https://en.wikipedia.org/wiki/Zalgo_text

If you really think you need to programmatically detect and reject these (I'm dubious), there is probably a reasonable limit on the number of diacritics per character.

https://stackoverflow.com/a/11983435


We have a whitelist of allowed characters, which is a pretty big list.

I think we based it on Lodash's deburr source code. If deburr's output is a-z and some common symbols, it passes (and we store the original value).

https://www.geeksforgeeks.org/lodash-_-deburr-method/
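For reference, a rough Python analogue of that check (a sketch only: the whitelist regex below is an assumption, not the actual list, and this approximates lodash's `_.deburr` by dropping combining marks after NFD rather than using lodash's mapping tables):

```python
import re
import unicodedata

# Hypothetical whitelist: letters, digits, and a few common symbols.
ALLOWED = re.compile(r"^[a-zA-Z0-9 .,'\-]*$")

def deburr(text: str) -> str:
    """Rough analogue of lodash's _.deburr: strip combining marks after NFD."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(c for c in nfd
                   if not unicodedata.category(c).startswith("M"))

def passes(text: str) -> bool:
    """Accept (and store the original) if the deburred form is whitelisted."""
    return ALLOWED.match(deburr(text)) is not None
```

One caveat: stripping marks means a heavily Zalgo'd string also deburrs down to its base letters, so this filters the character repertoire rather than counting marks.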



