Hacker News

Are people seriously still deliberately using ASCII-reliant code?

It's interesting. Doesn't mean it's worthy of being put in production code.

It does if the text you are dealing with is specified as ASCII-only.

For file names, URLs, domain names, etc. it's usually the safe thing to do.
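For what it's worth, a minimal sketch of that kind of ASCII-only policy check in Python (assuming Python 3.7+ for str.isascii(); the function name here is made up for illustration):

```python
def is_ascii_name(name: str) -> bool:
    # str.isascii() (Python 3.7+) gives a cheap ASCII-only policy check;
    # isprintable() additionally rejects control characters.
    return name.isascii() and name.isprintable()

print(is_ascii_name("report.txt"))   # True
print(is_ascii_name("résumé.txt"))   # False
```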

Whose filenames aren't Unicode? Also, domains and URLs can be Unicode too.

> Whose filenames aren't Unicode?

Many filesystems don't support Unicode, or support only a subset of it.


> Also, domains and URLs can be Unicode too.

Domains: it depends at which level you are dealing with them. See https://en.wikipedia.org/wiki/Internationalized_domain_name

    Internationalized domain names are stored in the Domain 
    Name System as ASCII strings using Punycode transcription. 
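As a quick illustration, Python's built-in idna codec (which implements the older IDNA 2003 rules) shows the ASCII form that actually gets stored in DNS:

```python
# The stdlib "idna" codec converts a Unicode label to the
# ASCII-compatible Punycode form used in the Domain Name System.
label = "bücher"
ascii_form = label.encode("idna")
print(ascii_form)  # b'xn--bcher-kva'
```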
URLs: Unicode characters are not allowed in URLs. See http://www.faqs.org/rfcs/rfc1738.html and http://www.blooberry.com/indexdot/html/topics/urlencoding.ht...

    only alphanumerics, the special characters "$-_.+!*'(),", and
    reserved characters used for their reserved purposes may be used
    unencoded within a URL.
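A short sketch of that rule in Python: anything outside the allowed set has to be percent-encoded, and urllib.parse.quote encodes non-ASCII characters as UTF-8 bytes first:

```python
from urllib.parse import quote

# quote() leaves unreserved ASCII characters alone and
# percent-encodes everything else (non-ASCII as UTF-8 bytes).
print(quote("café du monde"))  # caf%C3%A9%20du%20monde
```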

exists in DNS as xn--ebkur-tra.is

Not as often as handling all characters.

Every time I've had to deal with Unicode and internationalization, it's been a problem.

For example, a few years ago I grabbed a source tarball from somewhere, I forget what or where. It had the author's name in a comment, which included an O with dots over it. That was the only non-ASCII character in the source code. No matter what I did, both Eclipse and command-line javac refused to compile the source.

Finally I wrote a script to delete his name from every source file. It compiled flawlessly.

Then there's the time I found some text files with two bytes of binary junk at the beginning, followed by completely normal text. Again, I forget what I was doing, but some program was refusing to process them correctly. It turned out to be an internationalization-related thing called the BOM (byte-order mark). Eventually I ended up writing a script to walk a directory and remove the first two bytes of every file. (This can probably be done with dd and xargs on UNIX, but I was using Windows at the time, which means that something like this will require spending an hour or so in your favorite programming language.)

These experiences lead me to believe that, for bootstrapped USA startups at least, you shouldn't worry about a market outside the English-speaking world.

If you need to worry about junk like accented characters or moon runes (Chinese/Japanese/Korean characters), it means you're big enough to afford to hire someone specifically to address the problem.

I assume this is a not-very-subtle troll? Java source is Unicode. (The offhand reference to dd and xargs is a bit too much.)

How do you define "English-speaking world", btw? Those too ignorant to have heard of non-ASCII characters (i.e. excluding Canada, as anyone doing business there should at least have heard of French)?

Anyway, for anyone actually burnt by something similar on a GNU system try looking up recode(1).

What? You suffered from other people's bad internationalization, which implies that people shouldn't care about internationalization?

BOM sounds more like an issue with you switching Unicode documents between Windows and Unix, rather than a problem with internationalisation.


And personally I think excluding internationalisation entirely because it's harder is a terrible attitude to have. Particularly these days, when there are online tutorials for pretty much any job imaginable (not to mention the number of helpful experts willing to give up their time for free on various forums and communities).

> which means that something like this will require spending an hour or so in your favorite programming language

OK, this is where I stop worrying about how quickly I write code. I've done this (removing BOMs) quite a few times and it took just a few minutes in Python (under Windows). Heck, this could be a two-liner, I think :)
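For anyone curious, a sketch of that kind of BOM-stripping helper in Python (the function name is made up; the BOM byte sequences come from the stdlib codecs module):

```python
import codecs

# Recognised byte-order marks: UTF-8 (3 bytes), UTF-16 LE/BE (2 bytes each).
BOMS = (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

def strip_bom(data: bytes) -> bytes:
    """Return data with any leading byte-order mark removed."""
    for bom in BOMS:
        if data.startswith(bom):
            return data[len(bom):]
    return data
```

Walking a directory and rewriting each file on top of this is then just os.walk plus reading/writing the files in binary mode.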

I, for one, applaud this attitude. It gives programmers and companies that know what they're doing a leg up over people who couldn't even bother to figure out UTF-8. Natural segmentation of a target market is a good thing.

I sense some Daily WTF material here.

Yes, when dealing with RFCs that do.

I think HN is written in Arc, which is not very Unicode-friendly.


EDIT: It works fine for comments, at least.

Did duct tape stop being sticky?
