
or even better # Match only printable ASCII characters.

The writer of said script needing rehabilitation probably doesn't have that much insight. Just try to tell me what you were trying to get done and that will be enough.

The worst case is when the original author never really had it clear in his/her mind what exactly that compound regex was trying to accomplish. They just kind of bodged and hacked till the usual input stream started coming out right. Trying to write a clear comment on the purpose of the regex helps with that too.

Thank you for that. You have no idea how annoying it is to port Perl scripts from ASCII to EBCDIC when they do that kind of thing.

It's not an ASCII vs. EBCDIC thing, it's an ASCII vs. Unicode thing.

It's not just Unicode either. I just mentioned EBCDIC because that particular regex has bitten me before when I was translating Perl scripts from Linux to z/OS USS. Take a look at the code page for EBCDIC and you'll quickly see why it's a massive pain to sort through regexes like that.

I honestly thought you were being sarcastic. I've never heard of someone who has actually used EBCDIC.

I'm sure you've heard of the IBM AS/400, which is still firmly entrenched in MANY Fortune 500 companies, not to mention tons of state and county government installations handling payroll, inventory, tax rolls, etc. I had to deal with a Perl script that handled ASCII-to-EBCDIC conversion to port data to an Oracle database. If you're a Windows-only shop, that's fine, but don't assume that anyone who isn't is ancient.

So did I! Now that's a war story....

More generally, it's a character set / collating sequence thing. Specifying a range with a start and end point requires understanding what that range covers, and that can change depending on context, locale, character set, etc.

Also in the 32-127 ASCII range? I thought they just differ in 128-255 with the code pages and such?

In the case of EBCDIC, there are several places in the alphabetic collation sequence in which non-alpha characters are interspersed among the letter codes. Most notably between R & S, though it appears that I-J also includes a standout. The fact that there are multiple incompatible forms of EBCDIC doesn't help matters much.

Makes sorts really tweaky.
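
For anyone who wants to see those gaps rather than squint at a code page chart, here is a minimal sketch. It assumes Python's built-in cp037 codec as a stand-in for "EBCDIC" (other EBCDIC variants move some punctuation around), and alphabet_gaps is just a name made up for this illustration.

```python
import string

def alphabet_gaps(letters, codec="cp037"):
    """Return (prev, cur, skipped) for each pair of consecutive letters
    whose code points are not adjacent in the given single-byte codec."""
    gaps = []
    for prev, cur in zip(letters, letters[1:]):
        skipped = cur.encode(codec)[0] - prev.encode(codec)[0] - 1
        if skipped > 0:
            gaps.append((prev, cur, skipped))
    return gaps

# Assumption: cp037 is the EBCDIC code page of interest.
upper_gaps = alphabet_gaps(string.ascii_uppercase)
lower_gaps = alphabet_gaps(string.ascii_lowercase)
ascii_gaps = alphabet_gaps(string.ascii_uppercase, codec="ascii")

# Sorting by EBCDIC byte value puts lowercase before uppercase before digits.
ebcdic_order = sorted(["A", "a", "1"], key=lambda s: s.encode("cp037"))

print(upper_gaps, lower_gaps, ascii_gaps, ebcdic_order)
```

Under that assumption the alphabet breaks after I and after R in both cases, while the ASCII run has no breaks at all, which is why a range like A-Z means different things depending on the encoding.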


It's both, but ASCII vs. EBCDIC is worse. Even in Unicode, the regex will still grab the printable characters that also happen to be part of ASCII: you won't see anything wrong until you get to characters outside that range. In EBCDIC, things get much hairier: it won't get capital letters, nor lowercase letters from s through z (but it will get all the other lowercase letters), nor brackets or braces (though it will get parens).
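
A rough way to check this yourself, sketched under two assumptions: that the regex under discussion is a space-to-tilde range like [ -~] (it isn't quoted in this part of the thread), and that "EBCDIC" means code page 037 as shipped with Python. The helper name in_native_range is hypothetical.

```python
def in_native_range(ch, lo=" ", hi="~", codec="cp037"):
    """True if ch falls between lo and hi when all three are compared
    by their byte value in the given single-byte codec."""
    code = ch.encode(codec)[0]
    return lo.encode(codec)[0] <= code <= hi.encode(codec)[0]

# The 95 printable ASCII characters, space through tilde.
printable_ascii = [chr(c) for c in range(0x20, 0x7F)]

# Characters the range would still match on an EBCDIC (cp037) system...
matched = "".join(ch for ch in printable_ascii if in_native_range(ch))
# ...and the printable ASCII characters it would silently drop.
missed = "".join(ch for ch in printable_ascii if not in_native_range(ch))

# Sanity check: under ASCII itself the range covers everything.
missed_in_ascii = [ch for ch in printable_ascii
                   if not in_native_range(ch, codec="ascii")]

print("matched:", matched)
print("missed: ", missed)
```

The point of comparing byte values instead of running a real regex is that Python's re module always works on Unicode code points, so it can't reproduce what a Perl built for a native EBCDIC platform would do with the same range.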
