
In a while loop, if the index is not at the end of the string and the top two bits of the character at the index are both one, advance the index by one.
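For reference, a minimal sketch of that loop in Python, working on `bytes`. One assumption to flag: UTF-8 continuation bytes have the form `10xxxxxx` (top bit one, next bit zero), so the sketch tests for that pattern rather than for two set bits. The name `next_boundary` is made up for illustration, and the input is assumed to be valid UTF-8.

```python
def next_boundary(data: bytes, index: int) -> int:
    """Advance index past any UTF-8 continuation bytes (10xxxxxx)."""
    # Stop at the end of the buffer or at the first byte that starts a character.
    while index < len(data) and (data[index] & 0xC0) == 0x80:
        index += 1
    return index


text = "日本語".encode("utf-8")      # 9 bytes, 3 per character
start = next_boundary(text, 1)       # index 1 falls inside the first character
print(start)                         # 3
print(text[start:].decode("utf-8"))  # 本語
```

An index that already sits on a character boundary is returned unchanged, so the check costs at most three extra byte comparisons per call.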

So what you're saying is that it's very easy to get corrupted strings by anyone who doesn't have an understanding of UTF-8 at the bit level - which in my experience seems to be the majority of programmers.




Not at all. The language/library implementer should handle the details. My example was the argument checking of the slice function.

And indexing by code points doesn't solve the problem either. The majority of programmers don't know what a grapheme is or how to collate or sort Unicode strings.
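A small illustration of the grapheme point, as a sketch in Python (whose `str` indexes by code point). The variable names are just for this example.

```python
import unicodedata

decomposed = "e\u0301"      # 'e' followed by COMBINING ACUTE ACCENT
print(len(decomposed))      # 2 code points, but one visible character

head = decomposed[:1]       # a slice on a code point boundary...
print(head == "e")          # True: the accent has been cut off anyway

# Normalizing to NFC merges this particular pair into one code point,
# but many grapheme clusters have no precomposed form.
composed = unicodedata.normalize("NFC", decomposed)
print(len(composed))        # 1
```

So a slice can be perfectly valid at the code point level and still split what the user sees as a single character.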


Except what happens in the real world is that people who are used to indexing and slicing ASCII strings however they please don't think "I should use a library for this", instead they just keep indexing and slicing as per usual and don't think anything of it until their Chinese customers start complaining of random program crashes, or missing text - which the developer then has difficulty trying to reproduce because hey, it works for them.

My only gripe with your argument is that I don't think it's easy to avoid corrupted text in modern text processing - which is precisely why there are libraries for it because it's actually really easy to get it wrong - even if you know what you're doing.


Which is why we have languages like Go where we can put those types of developers. Incidentally, Go uses UTF-8. Higher-level languages like Go, Python, etc. were designed so newbie and/or ignorant programmers could do less damage.

When I was working on a project before Unicode we would switch our dev PCs to the other languages we supported. What a pain that was. The only issues we had were when a translated string was much longer than the screen space allocated to it. I believe Swedish was the main culprit. No problems with simplified and traditional Chinese as those were more compact. I have no sympathy for dev shops that can't get internationalization right. As with everything else in the corporate dev world, management doesn't seem to want to hire/retain the more experienced programmers.

I think you have a gripe with my argument because you may be missing my point. If a high-level language chooses to let a programmer index into a UTF-8 string at the byte level (for performance and other reasons), it's very easy for it to prevent the programmer from slicing in the middle of a code point.

The reason is that the language's function for slicing a Unicode string would either throw an exception or just advance to the next valid index. There wouldn't be a way for the programmer to slice a Unicode string in the middle of a code point.
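Roughly what such a slice function could look like, as a Python sketch over `bytes`. Everything here is an assumption for illustration: the names `safe_slice` and `is_continuation`, the `strict` flag that switches between the two behaviours, and the requirement that the input is valid UTF-8 with `0 <= start <= end`.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which look like 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def safe_slice(data: bytes, start: int, end: int, strict: bool = False) -> bytes:
    """Slice UTF-8 bytes without cutting through a multi-byte character."""
    end = min(end, len(data))
    start = min(start, end)
    for name, pos in (("start", start), ("end", end)):
        if pos < len(data) and is_continuation(data[pos]):
            if strict:
                # Behaviour 1: refuse an index that is not on a boundary.
                raise ValueError(f"{name} index {pos} is inside a character")
    # Behaviour 2: move each index forward to the next boundary.
    while start < len(data) and is_continuation(data[start]):
        start += 1
    while end < len(data) and is_continuation(data[end]):
        end += 1
    return data[start:end]


data = "aé日".encode("utf-8")                    # 1 + 2 + 3 = 6 bytes
print(safe_slice(data, 0, 2).decode("utf-8"))   # aé
print(safe_slice(data, 2, 6).decode("utf-8"))   # 日
```

With `strict=True` the first call raises `ValueError` instead of widening the slice. Which behaviour is the better default is a design choice; the point is only that the check lives in one place.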


> I think you have a gripe with my argument because you may be missing my point

I get your point, it just doesn't apply to many real world situations I've seen where you don't have the luxury of just using a higher level language or a library that takes care of all these things, or keeping programmers who don't understand what they are doing away from that sort of thing.

The most egregious example that I've personally seen was a developer working on a legacy COBOL banking program that needed Chinese support retrofitted to it.

The app was originally only developed with ASCII in mind and so sliced through strings willy-nilly, which naturally caused problems with Chinese text.

The developer working on the "fix" before me was calling out to ICU through the C API of the version of COBOL that we used and was still messing things up - he'd actually modified ICU in some custom way to prevent the bug from crashing the program, but it was still causing corrupted text.

I basically undid all his changes, and wrapped all COBOL string splicing to call a function that always split a string at a valid position - truncating invalid bytes at the start/end as necessary. Much simpler and resulted in the removal of an unnecessary dependency on ICU.
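The general idea of trimming a chunk back to valid positions can be sketched like this. This is not the COBOL code described above: it is a Python illustration that assumes the text is UTF-8 (the comment doesn't say which encoding was involved) and that the bytes are valid apart from the cut ends. The name `trim_partial_utf8` is hypothetical.

```python
def trim_partial_utf8(chunk: bytes) -> bytes:
    """Drop the partial characters left at either end of a byte-sliced chunk."""
    # Leading continuation bytes (10xxxxxx) belong to a character cut off earlier.
    start = 0
    while start < len(chunk) and (chunk[start] & 0xC0) == 0x80:
        start += 1

    # Walk back from the end to the lead byte of the last character.
    end = len(chunk)
    i = end - 1
    while i >= start and (chunk[i] & 0xC0) == 0x80:
        i -= 1

    if i >= start:
        lead = chunk[i]
        if lead < 0x80:
            need = 1            # 0xxxxxxx: ASCII
        elif lead >> 5 == 0b110:
            need = 2            # 110xxxxx
        elif lead >> 4 == 0b1110:
            need = 3            # 1110xxxx
        else:
            need = 4            # assumed 11110xxx
        if end - i < need:      # last character is incomplete, so drop it
            end = i
    return chunk[start:end]


data = "日本語".encode("utf-8")
chunk = data[1:8]                                   # cuts into the first and last character
print(trim_partial_utf8(chunk).decode("utf-8"))     # 本
```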

This bug had been outstanding for several months when I first joined that company, and it was the first one I was assigned to work on - and luckily for them they'd accidentally hired someone who had done lots of multilingual programming before.

> it's very easy for it to prevent the programmer from slicing in the middle of a code point.

Okay, but even you made a mistake in your first example of what to do, and that's the sort of code that someone who knows what they are doing could write, and will seem to work in the conditions under which it was tested (working on my machine, ship it!), but that will cause seemingly random problems once it hits users.


> I get your point, it just doesn't apply to many real world situations I've seen where you don't have the luxury of just using a higher level language or a library that takes care of all these things

No, I still think you're missing some of it. I am not advocating that what I said is the solution for everything.

Someone said that slicing UTF-8 strings leads to string corruption and endorsed the Python 3 Frankenstein Unicode type as a way to avoid it. I just gave a way of preventing that.

Now you argued that a novice programmer would fail to implement it properly. So you're comparing my method implemented by a novice programmer to a method implemented by professional compiler writers. That hardly seems fair. :)

So my argument is that if my method were to be implemented by professional compiler writers it would prevent corrupted strings while still using UTF-8 as the internal representation.

> I basically undid all his changes, and wrapped all COBOL string splicing to call a function that always split a string at a valid position - truncating invalid bytes at the start/end as necessary.

> luckily for them they'd accidentally hired someone who had done lots of multilingual programming before.

So an expert programmer implemented a string splitting function that didn't corrupt strings. :D

> but even you made a mistake in your first example of what to do

I'm writing this on an iPad while watching TV and playing a game on another Android tablet while looking at the Wikipedia UTF-8 article on a tiny phone screen while a little white dog is trying to bite my fingers (wish I was making this up). Not exactly my usual programming environment. ;)


> Now you argued that a novice programmer

sigh if only it was novice programmers making these mistakes :-/


Upvote for that comment.

The stuff I've seen in some people's multithreaded code just makes me want to cry.



