My rule of thumb is to treat strings as opaque blobs most of the time. The only ...

rafram · 2024-11-25T04:37:34 1732509454

Sanitizing your strings immediately before display is all well and good until you need to pass them to some piece of third-party software that is very dumb and doesn’t sanitize them. You’ll argue that it’s the vendor’s fault, but the vendor will argue that nobody else allows characters like that in their name inputs!

See the Companies House XSS injection situation, where their rationale for forcing a business to change its name was that others using their database could be vulnerable: https://www.theregister.com/2020/10/30/companies_house_xss_s...

arkh · 2024-11-25T09:05:43 1732525543

You sanitize at the frontier of what your code controls.

Sending data to a database: parametrized queries to sanitize as it is leaving your control.

Sending to display to the user: sanitized for a browser

Sending to an API: sanitize for whatever rules the API has

Sending to a legacy system: sanitize for it

Writing a file to the system: sanitize the path

The common point is you don't sanitize before you have to send it somewhere. And the advantage of this method is that you limit the chances of getting bit by reflected injections. You interrogate some API you don't control, you may just get malicious content, but you sanitize when sending it so all is good. Because you're sanitizing on output and not on input.

account42 · 2024-11-26T14:37:56 1732631876

What if the legacy API doesn't support escaping? Just drop characters? Implement your own ad-hoc transform? What if you need to interoperate with other API users.

Limting the character set at name input gives the user the chance to use the same ASCII-encoding of their name in all places.

shaky-carrousel · 2024-11-25T13:22:53 1732540973

Be liberal in what you accept, and conservative in what you send.

int_19h · 2024-11-25T18:21:33 1732558893

Please don't. This is how standards die.

https://datatracker.ietf.org/doc/html/rfc9413

afiori · 2024-11-25T09:05:56 1732525556

Forbidding users to use your service to propagate "litte bobby tables" pseudo-pranks is likely a good choice.

The choice is different if like most apps you are almost only a data sink, but if you are also a data source for others it pays to be cautious.

dcow · 2024-11-25T09:49:38 1732528178

I think it’s more of an ethical question than anything. There will always be pranksters and there will never be perfect input validation for names. So who do you oppress? The people with uncommon names? Or the pranksters? I happen to think that if you do your job right, the pranksters aren’t really a problem. So why oppress those with less common names?

afiori · 2024-11-25T10:04:57 1732529097

I am not saying to only allow [a-zA-Z ]+ in names, what I am Saying is that it is ok to block names like "'; drop table users;" or "<script src="https://bad.site.net/></script>" if part of your business is to distribute that data to other consumers.

dcow · 2024-11-25T10:18:58 1732529938

And I’m arguing, rhetorically, what if your name produces a syntax error—or worse means something semantically devious—in the query language I’m using? Not all problems look like script tags and semicolons.

foldr · 2024-11-25T12:20:17 1732537217

It's a question of intent. There aren't any hard and fast rules, but if someone has chosen their company name specifically in order to cause problems for other people using your service, then it's reasonable to make them change it.

account42 · 2024-11-26T14:44:19 1732632259

This is getting really absurd. Are you also going to complain that Unicode is too restrictive or are you going to demand being able to use arbitrary bytes as names. Images? If Unicode is enough, then which version.

There is always a somewhat arbitrary restriction. It's not unreasonable to also take other people into account besides the user wanting to enter his special snowflake name.

account42 · 2024-11-26T14:41:33 1732632093

No one is being oppressed. Having to use an ASCII version of your name is literally a non-issue unless you WANT to be offended.

Maybe also think of the other humans that will need to read and retype the name. Do you expect everyone to understand and be able to type all characters? That's not reasonable. The best person to normalize the name to something interoperable is the user himself, so make him do it at data entry.

mabster · 2024-11-27T12:49:08 1732711748

I was saying the exact same thing about how I don't understand why people get offended when they have to transcribe their name to use Hanzi!

We should have a world vote to settle which alphabet we use.

rob74 · 2024-11-25T08:16:59 1732522619

> but the vendor will argue that nobody else allows characters like that in their name inputs

...and maybe they will even link to this page to support that statement! But, seeing that most of the pages are German, I bet they do accept the usual German "special" letters (ÄÖÜß) in names?

account42 · 2024-11-26T14:49:38 1732632578

So? Have you considered that the names may need to be eventually processed by people who understand the German alphabet but not all French accents (and certinly won't be able to type Hanzi or arabic or whatever else you expect everyone to support)? Will every system they interact with be able to deal with arbitrary symbols. Does the font of their letterhead support every script?

It's reasonable to expect a German company to deal with German script, less reasonable to expect them to deal with literally every script that someone once thought would be funny to include in Unicode.

lyu07282 · 2024-11-25T11:45:42 1732535142

> The only validation I'd always enforce is some sane length limit, [..]

Venture into the abyss of UTF-8 and behold the madness of multibyte characters. Diacritics dance devilishly upon characters, deceiving your simple count. Think a letter is but a single entity? Fools! Combining characters lurk in the shadows, binding invisibly, elongating the uninitiated's count into chaos. Every attempt to enumerate the true length of a string in UTF-8 conjures a specter of complications. Behold, a single glyph, yet multiple bytes cackle beneath, a multitude of codepoints coalesce in arcane unison. It is beautiful t he final snuffing of the lie s of Man ALL IS LOST ALL I S LOST the pony he comes he comes he comes the ich or permeates all MY FACE MY FACE ᵒh god no NO NOOO O NΘ stop the an * gles are n ot real ZALGΌ IS TOƝȳ THE PO NY HE COMES

hnfong · 2024-11-26T20:38:48 1732653528

A nit - it's not UTF-8 or "multibyte" characters that's the main problem. The UTF-8 issue can be trivially resolved by decoding it into unicode code points. As long as you're fine with the truncated length not always corresponding to what you'd expect for Latin based alphabets it should be fine. (FWIW, if you are concerned with the displayed length, you'd need a font and a text layout engine to calculate the display length of displayed text)

The main issue with naïve truncation is that not every code point is a character (and I guess not every character is a glyph?). If you truncate the Unicode code point array at some unfortunate places like https://en.wikipedia.org/wiki/Ideographic_Description_Charac... , you'd just get gibberish or potentially very unintended results. (especially if you joined the truncated string with some other string)

ddulaney · 2024-11-25T01:29:54 1732498194

There's at least one major exception to this: Unicode normalization.

It's possible for the same logical character to have two different sets of code points (for example, a-with-umlaut as a single character, vs a followed by umlaut combining diacritic). Related, distinguishing between the "a" character in Latin, Greek, Cyrillic, and the handful of other times it shows up throughout Unicode.

This comes up in at least 3 ways:

1. A usability issue. It's not always easy to predict which identical-looking variant is produced by different input methods, so users enter the identical-looking characters on different devices but get an account not found error.

2. A security issue. If some of your backend systems handle these kinds of characters differently, that can cause all kinds of weird bugs, some of which can be exploited.

3. An abuse issue. If it's possible to create accounts with the same-looking name as others that aren't the same account, there can be vectors for impersonation, harassment, and other issues.

So you have to make a policy choice about how to handle this problem. The only things that I've seen work are either restricting the allowed characters (often just to printable ASCII) or being very clear and strict about always performing one of the standard Unicode transformations. But doing that transformation consistently across a big codebase has some real challenges: in particular, it can change based on Unicode version, and guaranteeing that all potential services use the same Unicode version is really non-trivial. So lots of people make the (sensible) choice not to deal with it.

But yeah, agreed that parenthesis should be OK.

speleding · 2024-11-25T11:17:56 1732533476

Something we just ran in to: There are two UTF-8 codepoints for the @ character, the normal one and "Full width At Sign U+FF20". It took a lot of head scratching to understand why several Japanese users could not be found with their email address when I was seeing their email right there in the database.

teddyh · 2024-11-25T12:55:48 1732539348

There are actually two more: U+FE6B and U+E0040.

tugu77 · 2024-11-25T07:16:48 1732519008

Type system ftw? As long as it's a blob (unnormalized), it should have a blob type which can do very little besides storing and retrieving, perhaps printing. Only the normalized version should be even comparable.

int_19h · 2024-11-25T18:24:28 1732559068

Why wouldn't blobs be comparable? A blob is just a byte array, and those have fairly natural equality semantics. They're wrong for Unicode strings, sure, but this is akin to complaining about string "1" not being equal to "01".

Muromec · 2024-11-25T02:11:59 1732500719

You can treat names as byte blobs for as long as you don't use them for their purpose -- naming people.

Suppose you have a unicode blob of my name in your database and there is a problem and you need to call me and say hi. Would your customer representative be able to pronounce my name somewhat correctly?

>I think there're very few exceptions to this, probably something law-related, or if you have to interact with some legacy service.

Few exceptions for you is entirety of the service for others. At the very least you interact with legacy software of payment systems which have some ideas about what names should be.

kmoser · 2024-11-25T02:31:45 1732501905

> Would your customer representative be able to pronounce my name somewhat correctly?

Are you implying the CSR's lack of familiarity with the pronunciation of your name means your name should be stored/rendered incorrectly?

Muromec · 2024-11-25T04:38:28 1732509508

Quite the opposite actually. I want it stored correctly and in a way that both me and CSR can understand and so it can be used to interface with other systems.

I don’t however know which unicode subset to use, because you didn’t tell me in the signup form. I have many options, all of them correct, but I don’t know whether your CSR can read Ukrainian Cyrillic and whether you can tell what vocative case is and not use that when inerfacing with the government CA which expects nominative.

ACS_Solver · 2024-11-25T09:13:42 1732526022

I think you're touching on another problem, which is that we as users rarely know why the form wants a name. Is it to be used in emails, or for sending packages, or for talking to me?

My language also has a separate vocative case, but I live in a country that has no concept of it and just vestiges of a case system. I enter my name in the nominative, which then of course looks weird if I get emails/letters from them later - they have no idea to use the vocative. If I knew the form is just for sending me emails, I'd maybe enter my name in the vocative.

Engineers, or UX designers, or whoever does this, like to pretend names are simple. They're just not (obligatory reference to the "falsehoods about names" article). There are many distinct cases for why you may want my name and they may all warrant different input.

- Name to use in letters or emails. It doesn't matter if a CSR can pronounce this if it's used in writing, it should be a name I like to see in correspondence. Maybe it's in a script unfamiliar to most CSRs, or maybe it's just a vocative form.

- Name for verbal communication. Just about anything could be appropriate depending on the circumstances. Maybe an anglicized name I think your company will be able to pronounce, maybe a name in a non-Latin script if I expect it to be understood here, maybe a name in a Latin-extended script if I know most people will still say it reasonably well intuitively. But it could also be an entirely different name from the written one if I expect the written one to be butchered.

- Name for package deliveries. If I'm ordering a package from abroad, I want my name (and address) written in my local convention - I don't care if the vendor can't read it, first the package will make its way to my country using the country and postal code identifiers, and then it should have info that makes sense to the local logistics companies, not to the seller's IT system.

- Legal name because we're entering a contract or because my ID will be checked later on for some reason.

- Machine-readable legal name for certain systems like airlines. For most of the world's population, this is not the same as the legal name but of course English-language bias means this is often overlooked.

pezezin · 2024-11-26T00:32:34 1732581154

> - Name for package deliveries. If I'm ordering a package from abroad, I want my name (and address) written in my local convention - I don't care if the vendor can't read it, first the package will make its way to my country using the country and postal code identifiers, and then it should have info that makes sense to the local logistics companies, not to the seller's IT system.

I am not sure that what you ask is possible, there might be local or international regulations that force them to write all the addresses in a certain way.

But on the positive side, I have found that nowadays most online shops provide a free-from field for additional delivery instructions. I live in Japan, and whenever I order something from abroad I write my address in Japanese, and most sellers are nice enough to print it and put it on the side of the box, to make the life of the delivery guys easier.

account42 · 2024-11-26T14:57:56 1732633076

The thing is "printable ASCII letters" is something usable for all of those cases. It may not be 100% perfect for the user's feelings but it just works.

ACS_Solver · 2024-11-26T15:33:38 1732635218

This is patently wrong and it's the sort of thinking that still causes inconvenience to people using non-ASCII languages, years after it's technically justifiable.

The most typical problem scenario is getting some package or document with names transformed to ASCII and then being unable to actually receive the package or use the document because the name isn't your name. Especially when a third party is involved that doesn't speak the language that got mangled either.

Åke Källström is not the same name as Ake Kallstrom. Domestically the latter just looks stupid but then you get a hotel booking with that name, submit it as part of your visa application and the consulate says it's invalid because that's not your name.

Or when Rūta Lāse gets some foreign document or certificate, nobody in her country treats is authentic because the name written is Ruta Lase, which is also a valid and existing name - but a different one. She ends up having to request another document that establishes the original one is issued to her, and paying for an apostille on that so the original ASCII document is usable. While most languages have a standard way of changing arbitrary text to ASCII, the conversion function is often not bijective even for Latin-based alphabets.

These are real examples of real problems people still encounter because lots of English-speaking developers insist everyone should deal with an ASCII-fied version of their language. In the past I could certainly understand the technical difficulties, but we're some 20-25 years past the point where common software got good Unicode support. ASCII is no longer the only simple solution.

dgfitz · 2024-11-25T07:21:09 1732519269

In this specific case, it seems like your concerns are a hypothetical, no?

swiftcoder · 2024-11-25T07:46:21 1732520781

Not really, no. A lot of us only really have to deal with English-adjacent input (i.e. European languages that share the majority of character forms with English, or cultures that explicitly Anglicise their names when dealing with English folks).

As soon as you have to deal with users with a radically different alphabet/input-method, the wheels tend to come off. Can your CSR reps pronounce names written in Chinese logographs? In Arabic script? In the Hebrew alphabet?

cowsandmilk · 2024-11-25T08:26:52 1732523212

You can analyze the name and direct a case to a CSR who can handle it. May be unrealistic for a 1-2 person company, but every 20+ person company I’ve worked at has intentionally hired CSRs with different language abilities.

Muromec · 2024-11-25T09:55:01 1732528501

First of, no you can't infer language preference from a name. The reasonable and well meaning assumption about my name on a good day makes me only sad and irritated.

And even if you could, I don't know if you actually do it by looking at what you signup form asks me to input.

michaelt · 2024-11-25T09:08:51 1732525731

A requirement to do that is an extremely broad definition of "treat strings as opaque blobs most of the time" IMHO :)

int_19h · 2024-11-25T18:28:21 1732559301

For one thing, this concern applies equally to names written entirely in Latin script. Can your CSR reps correctly pronounce a French name? How about Polish? Hungarian?

In any case, the proper way to handle this is to store the name as originally written, and have the app that CSRs use provide a phonetic transcription. Coincidentally, this kind of stuff is something that LLMs are very good at already (but I bet you could make it much more efficient by training a dedicated model for the task).

account42 · 2024-11-26T15:13:48 1732634028

This situation is not the same at all. The CSR might mangle a name in latin script but can at least attempt to pronounce it and will end up doing so in a way that the user can understand.

Add to that that natives of non-latin languages are already used to this.

For better or worse, English and therefore the basic latin script is the lingua franca of the computing age. Having something universal for internation communication is very useful.

GoblinSlayer · 2024-11-27T15:46:51 1732722411

FWIW, proquint encoding allows you to pronounce any sequence of bits, though the need for pronunciation eludes me, just copypaste it.

arghwhat · 2024-11-25T09:48:32 1732528112

> Suppose you have a unicode blob of my name in your database and there is a problem and you need to call me and say hi. Would your customer representative be able to pronounce my name somewhat correctly?

You cannot pronounce the name regardless of whether it is written in ASCII. Pronouncing a name requires at the very least knowledge of the language it originated in, and attempts at reading it with an English pronunciation can range from incomprehensible to outright offensive.

The only way to correctly deal with a name that you are unfamiliar with the pronunciation of is to ask how it is pronounced.

You must store and operate on the person's name as is. Requiring a name modified, or modifying it automatically, is unacceptable - in many cases legal names must be represented accurately as your records might be used for e.g. tax or legal reasons later.

Muromec · 2024-11-25T13:10:04 1732540204

>You must store and operate on the person's name as is. Requiring a name modified, or modifying it automatically, is unacceptable

But this is simply not true in practice and at times it's just plain wrong in theory too. The in practice part is trivially discoverable in the real world.

As to in theory -- I do in fact want a properly functioning service to use my name in a vocative case (which requires modifying it automatically or having a dictionary of names) in their communications that are sent in my native language. Not doing that is plainly grammatically wrong and borderline impolite. In fact I use services that do it just right. I also don't want to know to specify the correct version myself, as it's trivially derivable through established rules of the languages.

arghwhat · 2024-11-25T14:33:51 1732545231

Sure, there are sites that mistreat names in ways you describe, but that does not make it correct.

> I do in fact want a properly functioning service to use my name in a vocative case. ... I also don't want to know to specify the correct version myself, as it's trivially derivable through established rules of the languages.

There would be nothing to discuss if this was trivial.

> Not doing that is plainly grammatically wrong and borderline impolite.

Do you know what's more than borderline impolite? Getting someone's name wrong, or even claiming that their legal name is invalid and thereby making it impossible for them to sign up.

If getting a name right and using a grammatical form are mutually exclusive, there is no argument to be had about which to prioritize.

account42 · 2024-11-26T15:18:07 1732634287

> You cannot pronounce the name regardless of whether it is written in ASCII. Pronouncing a name requires at the very least knowledge of the language it originated in, and attempts at reading it with an English pronunciation can range from incomprehensible to outright offensive.

Offensive if you WANT to be offended perhaps but definitely understandable, which is the main thing that matters.

throw_a_grenade · 2024-11-25T13:13:44 1732540424

Sorry to nitpick, but you underestimated: "many cases" is really "all cases", no exception, because under GDPR you have right to correct your data (this is about legal name, so obviously covered). So if user requests under GDPR art. 16 that his/her name is to be represented in a way that matches ID card or whatever legal document, then you either do it, or you pay a fine and then you do it.

That a particular technical solution is incapable of storing it in the preferred way is not an excuse. EBCDIC is incompatible with GDPR: https://news.ycombinator.com/item?id=28986735

hobs · 2024-11-25T03:42:25 1732506145

Absolutely not - do not build anything based on "would your CSR be able to pronounce" something - that's an awful bar - most CSRs cant pronounce my name - would I be excluded from your database?

Seriously, what are you going for here?

Muromec · 2024-11-25T05:12:01 1732511521

That’s the most basic consideration for names, unless you only show it to the user themselves — other people have to be able to read it at least somehow.

Which one is why the bag of unicode bytes approach is as wrong as telling Stęphań he has an invalid name.

hobs · 2024-11-25T05:25:00 1732512300

Absolutely not. There's no way to understand what a source user's reading capability is. There's no way to understand how a person will pronounce their name by simply reading it, this only works for common names.

soco · 2024-11-25T09:06:58 1732525618

And here we go again, engineers expecting the world should behave fitting their framework du jour. Unfortunately, the real world doesn't care about our engineering bubble and goes on with life - where you can be called !xóõ Kxau or ꦱꦭꦪꦤ or X Æ A-12.

Muromec · 2024-11-25T12:40:40 1732538440

I can be called what I want and in fact I have perfectly reasonable name that doesn't fit neither ASCII nor FN+LN convention. The thing is, your website accepting whatever utf8 blob my name can be serialized to today, without actually understanding it, makes my life worse, not better.

hobs · 2024-11-25T13:30:04 1732541404

No, it allows an exact representation of your name, it doesn't do anything to your life.

If you dont like your name, either change it or go complain to your parents. They might tell you that you cultural reference point is more important than some person being able to read your name off of a computer screen.

If you want to store a phonetic name for the destination speaker that's not a bad idea, but a name is a name is a name. It is your unique identifier, do not munge it.

Muromec · 2024-11-25T13:56:48 1732543008

But it does affect my life in a way you refuse to understand. That's the problem -- there isn't a true canonical representation of a name (any name really) that fits all practical purposes. Storing a bag of bytes to display back to user is the easiest of practical purposes and suggesting the practice that solve that is worse than rejecting Stępień, it's refusal to understand complexities, that leads to eventually doing the wrong thing and failing your user without even telling them.

>It is your unique identifier, do not munge it.

It's not a good identifier either. Nobody uses names as identifiers at any scale that matters for computers. You can't assume they don't have collisions, you can't tell whether two bags of bytes identify the same person or two different, they aren't even immutable and sometimes are accidentally mutable.

soco · 2024-11-25T14:15:09 1732544109

Then where is the problem? If the support can read Polish they will pronounce your name properly, if they're from India they will mess it up, why should we have different expectations? Nobody will identify you by name anyway, they will ask how to call you (chatbots do this already) and then use for proper identification all kind of ids and pins and whatnot. So we are talking here about a complexity that nobody actually needs, not even you. So let the name be saved and displayed in the nice native way, and you as programmer make sure you don't go Bobby Tables with the strings.

Muromec · 2024-11-25T15:24:48 1732548288

>if they're from India they will mess it up

Or not able to read at all.

>Then where is the problem?

Since you don't indicate for what purpose my name is stored, which may actually be display only, any of the following can happen:

A name as entered in your system is compared to a name entered in a different system or when you interface (maybe indirectly and unknowingly) with a system using different constrains or a different script, maybe imposed by their jurisdiction. As a result, the intended operation does not come through.

This may happen in the indirect way and invisible to you -- e.g. you produce an artifact, say and invoice or issue a payment card using $script a, which I will only later figure out I can't use, because it's expected to be in $script b, or even worse be in $script a presumed to match $script b they have on record. One of the non-obvious ways it can fail, is when you try to determine whether two names in the same script are actually the same to infer family relationship or something other that you should not do anyway.

It may happen within your system in a way your CSR will deny is possible as well.

That's on a more severe side, which means I will not try to use the name in any rendering that doesn't match MRZ of my identity document. Which was probably the opposite of what you intended allowing arbitrary bag of bytes to be entered. No, that is not made up problem, because I'm bored, it's a thing.

On a less sever side, not understanding names is a failure in i18n department, because you can't support my language properly without understanding how my name should be changed when you address me, when you simply show it near user icon and when you describe relations between me and objects and people. If you can't do proper i18n and a different provider can, you may lose me as a customer, because your attitude is presumed to be "everyone can just use ASCII and English". Yes, people exist that actually get it right because they put an effort in this human aspect.

On a mildly annoying, but inconsequential side people also have a habit of trying to infer gender based on names despite having gender clearly marked in their system.

hobs · 2024-11-25T16:39:22 1732552762

Managing the canonical representation of your name in my system is one of the few things you are responsible for.

The number of times I have had people ask me to customize name rendering, capitalize things, trying to build phonetic maps, all of these things to avoid data entry or confusion and all they do is prove out that you can't have a general solution to human names, you can hit a big percentage in a cultural context, but there's always exceptions and edge cases to the problem we're solving which can be described as "please tell me your name when you call or whatever so I can pronounce it right"

soco · 2024-11-26T08:36:21 1732610181

>Or not able to read at all.

"Hello, how should we address you?". Not everything must be done in code.

>when you interface (maybe indirectly and unknowingly) with a system using different constrains

I have yet to encounter a system recognizing assets and making automatic decisions based on name. It would fail already if the user switched first/last name.

>people exist that actually get it right

You could have started by explaining this right way and we'd be all smarter.

hobs · 2024-11-25T16:36:51 1732552611

There's no such thing as a data structure that fits "all practical purposes" that is correct.

There's no wrong thing - this is the best representation we can make given the system of record for the person's name.

They are definitely mutable, context dependent, and effectively data you cannot make assumptions about because of all those things.

If you want to do more than that you need a highly constrained use case, and its going to fail for "all practical purposes".

account42 · 2024-11-26T15:25:18 1732634718

Your examples are a great argument why we should not allow people to use arbitrary characters. Special snowflakes will be able to cope, I assure you.

kgeist · 2024-11-25T07:06:43 1732518403

>Would your customer representative be able to pronounce my name somewhat correctly?

Typical input validation doesn't really solve the problem. For instance, I could enter my name as 'Vrdtpsk,' which is a perfectly valid ASCII string that passes all validation rules, but no one would be able to pronounce it correctly. I believe the representative (if on a call) should simply ask the customer how they would like to be addressed. Unless we want to implement a whitelist of allowed names for customers to choose from...

manarth · 2024-11-25T07:40:27 1732520427

Derek would like a word.

https://www.youtube.com/watch?v=hNoS2BU6bbQ

Intermernet · 2024-11-25T08:18:02 1732522682

Many Japanese companies require an alternative name entered in half width kana to alleviate this exact problem. Unfortunately, most Japanese websites have a million other UX problems that overshadow this clever solution to the problem.

arghwhat · 2024-11-25T09:55:15 1732528515

This is a problem specific to languages using Chinese characters where most only know some characters and therefore might not be able to read a specific one. Furigana (which is ultimately what you're providing in a separate field here) is often used as a phonetic reading aid, but still requires you to know Japanese to read and pronounce it correctly.

The only generic solution I can think of would be IPA notation, but it would be entirely unreasonable to expect someone to know the IPA for their name, just as it would be unreasonable to expect a random third party to know how to read IPA and replicate the sounds it described.

red_admiral · 2024-11-25T11:29:58 1732534198

> Would your customer representative be able to pronounce my name somewhat correctly?

If the user is Chinese and the CSR is not - probably no, and that's not a Unicode issue.

account42 · 2024-11-26T15:32:54 1732635174

Yet the CSR will be able to adequately pronounce a romanized ASCII-only version of the chinese name. And that's an entirely reasonable thing to do for western organizations and governments just like you might need to get a chinese name to interact with the chinese bureaucracy.

benatkin · 2024-11-25T05:34:25 1732512865

> Would your customer representative be able to pronounce my name somewhat correctly?

Worse case, just drop to hexadecimal.

wvh · 2024-11-25T10:38:34 1732531114

Because you don't want to ever store bad data. There's not point to that, it will just create annoying situations and potential security risks. And the best place to catch bad data is when the user is still present so they can be made aware of the issue (in case they care and are able to solve it). Once they're gone, it becomes nearly impossible and/or very expensive to check what they meant.

77pt77 · 2024-11-25T01:45:29 1732499129

> If you treat your strings as opaque blobs, and use UTF8, most of internationalization problems go away

This is laughably naive.

So many things can go wrong.

Strings are not arrays of bytes.

There is a price to pay if someone doesn't understand that or chooses to ignore it.

shakna · 2024-11-25T06:28:48 1732516128

> Strings are not arrays of bytes.

That very much depends on the language that you are using. In some, they are.

77pt77 · 2024-11-26T04:59:33 1732597173

No.

Those languages don't have strings.

shakna · 2024-11-26T09:58:52 1732615132

So Lua doesn't have strings? The type is called a string. The documentation calls it a string. It's certainly not a buffer.

lelandbatey · 2024-11-25T05:35:40 1732512940

And yet when stored on any computer system, that string will be encoded using some number of bytes. Which you can set a limit on even though you cannot cut, delimit, or make any other inference about that string from the bytes without doing some kind of interpretation. But the bytes limit is enough for the situation the OP is talking about.

hughesjj · 2024-11-25T01:49:11 1732499351

RTL go brrr

rpigab · 2024-11-25T09:11:28 1732525888

RTL is so much fun, it's the gift that keeps on going, when I first encountered it I thought, ok, maybe some junior web app developers will sometimes forget that it exists and a fun bug or two will get into production, but it's everywhere, Windows, GNU/Linux, automated emails, it can make malware hardware to detect by users in Windows because you can hide the dotexe at the beginning of the filename, etc.

Here it is today in GNOME 46.0, after so many years, this should say "selected": https://github.com/user-attachments/assets/306737fb-6b01-467... In previous GNOME versions it would mess up even more text in the file properties window.

Here's an article about it, but I couldn't find the more interesting blogpost about RTL: https://krebsonsecurity.com/2011/09/right-to-left-override-a...

JodieBenitez · 2024-11-25T10:23:27 1732530207

> or if you have to interact with some legacy service.

Which happens almost every day in the real world.

beagle3 · 2024-11-25T10:47:28 1732531648

You do need to use a canonical representation, or you will have two distinct blobs that look exactly the same, tricking other users of the data (other posters in a forum, customer service people in a company, etc)

kazinator · 2024-11-25T19:03:33 1732561413

> treat strings as opaque blobs most of the time

While being transparent is great and better than stupidly mangling or rejecting data, the problem is that if we just pass through anything, there are situations and contexts in which some kinds of software could be used as part of a chain of deception involving Unicode/font tricks.

Passing through is mostly good. Even in software that has to care about this, not every layer through which text passes should have the responsibility.