Hacker News new | past | comments | ask | show | jobs | submit login

It's not an ASCII v EBCDIC thing, its an ASCII vs Unicode thing.



It's not just Unicode either. I just mentioned EBCDIC because that particular regex has bit me before when I was translating perl scripts from Linux to zOS USS. Take a look at the code page for EBCDIC, you'll see quickly why it's a massive pain to sort through regexes like that.


I honestly thought you were being sarcastic. I've never heard of someone who has actually used EBCDIC.


I'm sure you've heard of the IBM AS/400 which is still firmly entrenched in MANY Fortune 500 companies. Not to mention tons of state and county government installations handling payroll, inventory, taxrolls, etc. I had to deal with a Perl script which dealt with ASCII to EBCDIC to port data to an Oracle database. If you're a Windows only shop, that's fine, but don't assume that anyone whom isn't is ancient.


So did I! Now that's a war story....


More generally, it's a characterset / collate sequence thing. Specifying a range with a start and end point requires understanding what that range specifies. Which can change depending on context, locale, characterset, etc.


Also in the 32-127 ASCII range? I thought they just differ in 128-255 with the code pages and such?


In the case of EBCDIC, there are several places in the alphabetic collation sequence in which non-alpha characters are interspersed among the letter codes. Most notably between R & S, though it appears that I-J also includes a standout. The fact that there are multiple incompatible forms of EBCDIC doesn't help matters much.

Makes sorts really tweaky.

http://en.wikipedia.org/wiki/Ebcdic


It's both, but ASCII vs. EBCDIC is worse. Even in Unicode, the regex will still grab the printable characters that also happen to be part of ASCII: you won't see anything wrong until you get to characters outside that range. In EBCDIC, things get much hairier: it won't get capital letters, nor lowercase letters from r through z (but it will get all the other lowercase letters), nor brackets or braces (though it will get parens).




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: