My favorite regex of all time

noonespecial · on Nov 12, 2012

As someone who makes much of his living rehabilitating old perl scripts, please, if you must use such things, use them like this:

[ -~] #match only printable characters

It takes 5 seconds longer and with regexes, just knowing what the damn thing is trying to do is half the battle. When you use a regex, use a comment. Its the civil thing to do.

atsaloli · on Nov 12, 2012

I recommend using the /x suffix to extend your pattern's legibility by permitting whitespace and comments.

/x allows you to break up your regex into its component parts, one part per line, and then comment each part.

Here is what the manual says about /x:

/x tells the regular expression parser to ignore most whitespace that is neither backslashed nor within a character class. You can use this to break up your regular expression into (slightly) more readable parts. The # character is also treated as a metacharacter introducing a comment, just as in ordinary Perl code. This also means that if you want real whitespace or # characters in the pattern (outside a character class, where they are unaffected by /x), then you'll either have to escape them (using backslashes or \Q...\E ) or encode them using octal, hex, or \N{} escapes. Taken together, these features go a long way towards making Perl's regular expressions more readable.

http://perldoc.perl.org/perlre.html

sbochins · on Nov 12, 2012

Yea, anytime I use a regex that isn't immediately obvious I put it in a function called get_<something>. Unfortunately people that write overly complicated and error prone regexes usually don't choose to document them.

laumars · on Nov 13, 2012

If a regex is going to be reusable, then yeah, I'd agree. But dumping single lines of code into their own functions just for readability isn't practical for real time systems. In those cases you really should be using comments as they get stripped out by the compiler.

entropy_ · on Nov 13, 2012

Couldn't those functions just be inlined by the compiler if they're simple regex-wrappers anyway?

I do agree that it might be overkill to move regexes to their own functions just for readability's sake but I don't buy the performance argument. Furthermore, regexes are most popular in scripting languages that no sane person would use for real time performance-critical systems anyway.

laumars · on Nov 13, 2012

1. Ahh right. I wasn't aware that happened.

2. Web sites are a classic example of scripting languages being used for real time performance critical systems (though I'm not arguing that all web sites are real time).

Sometimes the ability to modify code easily is as important to the choice of languages as the raw execution speed of the compiled binaries.

jbooth · on Nov 13, 2012

REGEXES aren't practical for real time systems.

If you're using a regex, and certainly if you're using a language other than C, you probably have space for the function call overhead.

laumars · on Nov 13, 2012

I don't really agree with that.

Sometimes C is inappropriate (eg you'd be nuts to build a website in C yet some sites do offer real time services)

Often the data set and/or logic required makes C an inappropriate language (eg you wouldn't use C for AI nor for some types of database operations).

And even in the cases where you're just building a standard procedural system, sometimes the interface lends itself better to other languages (eg C would be possibly the worst language for real time websites.)

But even in the cases where you're building a solution that's suited for C, there are still other performance languages which could be used.

"Real time" is quite a general term and as such, sometimes it makes more sense to use scripting languages which are performance tuned. Which is where writing 'good' PCRE is critical as RegEx can be optimised and compiled - if you understand the quirks of the language well enough to avoid easy pitfalls, eg s/^\s//; s/\s$//; outperforms s/(^\s|\s$)//; despite it being two separate queries as opposed to one.

jbooth · on Nov 13, 2012

"Real time" is commonly assumed to mean that you can't use a garbage collected language or need to be extremely careful doing so because random pauses of 100ms break your constraints.

If you're in a situation where the overhead of a couple of function calls is unacceptable, regexes are totally unacceptable and you need to write custom character manipulation.

This situation is really rare and in almost all business cases, using C is inappropriate.

hellrich · on Nov 13, 2012

Shouldn't most compilers (jit-)inline it?

yxhuvud · on Nov 12, 2012

or even better # Match only printable ASCII characters.

noonespecial · on Nov 12, 2012

The writer of said script needing rehabilitation probably doesn't have that much insight. Just try to tell me what you were trying to get done and that will be enough.

The worst case is when the original author never really had it clear in his/her mind what exactly that compound regex was trying to accomplish. They just kind of bodged and hacked till the usual input stream started coming out right. Trying to write a clear comment on the purpose of the regex helps with that too.

CountHackulus · on Nov 12, 2012

Thank you for that. You have no idea how annoying it is to port perl scripts from ASCII to EBCDIC when they do that kind of thing.

rmc · on Nov 12, 2012

It's not an ASCII v EBCDIC thing, its an ASCII vs Unicode thing.

CountHackulus · on Nov 13, 2012

It's not just Unicode either. I just mentioned EBCDIC because that particular regex has bit me before when I was translating perl scripts from Linux to zOS USS. Take a look at the code page for EBCDIC, you'll see quickly why it's a massive pain to sort through regexes like that.

natrius · on Nov 13, 2012

I honestly thought you were being sarcastic. I've never heard of someone who has actually used EBCDIC.

nobleach · on Nov 14, 2012

I'm sure you've heard of the IBM AS/400 which is still firmly entrenched in MANY Fortune 500 companies. Not to mention tons of state and county government installations handling payroll, inventory, taxrolls, etc. I had to deal with a Perl script which dealt with ASCII to EBCDIC to port data to an Oracle database. If you're a Windows only shop, that's fine, but don't assume that anyone whom isn't is ancient.

rmc · on Nov 13, 2012

So did I! Now that's a war story....

dredmorbius · on Nov 12, 2012

More generally, it's a characterset / collate sequence thing. Specifying a range with a start and end point requires understanding what that range specifies. Which can change depending on context, locale, characterset, etc.

tripzilch · on Nov 14, 2012

Also in the 32-127 ASCII range? I thought they just differ in 128-255 with the code pages and such?

dredmorbius · on Nov 17, 2012

In the case of EBCDIC, there are several places in the alphabetic collation sequence in which non-alpha characters are interspersed among the letter codes. Most notably between R & S, though it appears that I-J also includes a standout. The fact that there are multiple incompatible forms of EBCDIC doesn't help matters much.

Makes sorts really tweaky.

http://en.wikipedia.org/wiki/Ebcdic

Millennium · on Nov 13, 2012

It's both, but ASCII vs. EBCDIC is worse. Even in Unicode, the regex will still grab the printable characters that also happen to be part of ASCII: you won't see anything wrong until you get to characters outside that range. In EBCDIC, things get much hairier: it won't get capital letters, nor lowercase letters from r through z (but it will get all the other lowercase letters), nor brackets or braces (though it will get parens).

perlgeek · on Nov 13, 2012

Or just use [[:print:]]

tomjen3 · on Nov 12, 2012

Unless you want to seem clever and impresse the PHB.

It is the selfish (but smart) thing to do.

nathan_long · on Nov 12, 2012

I'm not sure who the PHB is, but I'm certainly not impressed by anything cryptic in a codebase. Deliberately writing code that's hard to understand should be a firing offense.

rada · on Nov 12, 2012

PHB: Pointy Haired Boss

nathan_long · on Nov 13, 2012

In that case, the smart thing to do is not to work for the PHB, rather than pervert your craft in an attempt to impress him/her.

Shorel · on Nov 13, 2012

In this case, the entire post was the comment.

You are right anyway.

hnriot · on Nov 12, 2012

Google is by far the best "comment"

citricsquid · on Nov 12, 2012

Are you saying people should google regular expressions? in my experience (correct me if I'm wrong) that doesn't work, I've never been able to get google to return relevant results even with quotation marks.

boyter · on Nov 12, 2012

Agreed, Google fails at this, however alternative search engines,

http://searchco.de/?q=%5B+-~%5D+ext%3Apod&cs=on http://symbolhound.com/?q=%5B+-~%5D

Symbolhound gives the answer quite well, and searchco.de has some examples of its use in the results.

hnriot · on Nov 12, 2012

I'm saying that usually comments are either wrong or out of date, developers code one regex, comment it, then fix a bug later and don't, then there's a discrepancy between the comment and the code. It's nearly always easier to just google the code and see what it does, if (as in this case) it's not obvious.

cjfont · on Nov 13, 2012

Your response doesn't address what citricsquid said, googling for a regex will almost never return helpful results.

hnriot · on Nov 13, 2012

Google regex and you'll find plenty of resources including tools to testing patterns. You won't find much for any specific pattern but read the docs and it will be apparent what this regex does. Familiariy and competence with regex is a basic component of being a developer.

kstenerud · on Nov 12, 2012

or make a function regexMatchingAllPrintableASCIIChars() and have it return the regex.

missing_cipher · on Nov 12, 2012

Search "[ -~]" (with or without quotes) to see how good Google's comment is.

logn · on Nov 12, 2012

The only ambiguous thing about this regex is knowing what's between space and tilde. Otherwise this is a pretty ordinary regex.

darklajid · on Nov 13, 2012

Hey, might just be me. I'm usually the 'Ben, can you help me with a regular expression' guy over here, but I stumbled, hard, and failed to connect the '-' with a range of characters (probably because I never thought of 'space to .. something').

So I read the snippet, thought 'Yeah, a character class of space, -, ~' and fell on my face in the next couple of lines.

Yeah, I should've known better, I know how to read it. If .. I invest the time and don't glance over a construct and hope to just get it instantly.

I wouldn't want to see this in a code base without proper documentation (be it a comment, a function name or whatever. Something).

wpietri · on Nov 13, 2012

The only thing ambiguous about it is most of it?

masterzora · on Nov 13, 2012

Not that I agree with the "expect people reading your code to Google things" mindset, but to be fair the only ambiguous thing is the ASCII table which is Googleable.

DanBC · on Nov 12, 2012

There is one author of the code, and potentially many readers.

That one author is the only person who knows what s/he is trying to achieve.

That author taking a few minutes to add some comments will save other people the time to search for answers and the time it takes to grok everything.

bmelton · on Nov 13, 2012

The best code is readable. Readability includes comments. If you're going to comment anything in your code at all, RegExes should be at the very top of that list.

Even if I can figure out what the regex matches (with Google or something else), that doesn't necessarily tell me WHY I'm matching on that particular pattern, or why I needed a RegEx in this spot, or what the intent was at the time of writing it.

jimwise · on Nov 12, 2012

This will not only miss non-ascii printing characters, but it's not even much shorter than typing

  [[:print:]]

to use the explicit character class.

DrCatbox · on Nov 12, 2012

The [[:print:]] will match any printable characters like åä, while the [ -~] will not.

I used this once as another safeguard against pushing binary data into the database. It was a poor system to begin with where you even have that possibility... and it happened at least once before the fix and my safeguard was in place.

e12e · on Nov 13, 2012

"å" is perfectly valid text input in my locale.

EvilTerran · on Nov 13, 2012

There will be situations where you need to check specifically for 7-bit ASCII printable characters only. I've worked with APIs that require everything outside that range to be escaped/encoded into it.

Email could be an example, I guess, although I haven't worked with it enough to know whether the whole "7-bits only" thing is still an issue these days.

coderintherye · on Nov 13, 2012

I think that was his point, that he had a good use for :print: over just -~

jack-r-abbit · on Nov 12, 2012

Jeepers... cut the guy some slack. He didn't say this is the bullet proof way of doing everything YOU want to do in all situations, every time, forever. He said "I thought I'd share my favorite regex of all time". And then explained what it does. Why does everyone have to poop on his favorite thing?

boyter · on Nov 12, 2012

My favorite regex is the following,

/^1?$|^(11+?)\1+$/

Which finds prime numbers. Although, I can't for the life of me think of a reason for using it.

http://stackoverflow.com/questions/3296050/how-does-this-reg...

lubutu · on Nov 13, 2012

I do dislike people calling that expression a "regex", because it isn't: regular expressions cannot contain backreferences, and must be computable in linear time, whereas primality tests are polynomial.

boyter · on Nov 13, 2012

While I agree I believe this comment by _delirium sums this up rather well,

http://news.ycombinator.com/item?id=1486502

full comment thread here http://news.ycombinator.com/item?id=1486158

lubutu · on Nov 13, 2012

I agree more with philh's response that there is no alternative term for the true meaning of "regular expression" — a regular language, as suggested by _delirium, is not the same thing.

I suppose I could accept "regex" as not being a regular expression as such, but the two are used so interchangeably that maintaining a distinction isn't very realistic. I'd personally rather a regular expression described a regular language, and "PCRE" (or so) used for the Turing-complete expressions with a similar syntax.

tshaddox · on Nov 13, 2012

I'm not a big fan of your explanation. To be more precise, true "regular expressions" are computationally equivalent to deterministic finite automata, which indeed can test an n-character string in O(n) time.

MileyCyrax · on Nov 13, 2012

NFAs and DFAs both recognise the regular languages (and only them).

laumars · on Nov 13, 2012

It's PCRE (Perl Compatible Regular Expressions) which is one of the most popular dialects of regex. But AFAIK there's isn't a hard and fast RegEx standard.

So I'd argue that code is RegEx.

I guess it's just a matter of perspective though.

mvzink · on Nov 12, 2012

I had to prove that in a formal languages class once and I still have no idea how it works.

emillon · on Nov 13, 2012

First, it does not match "prime numbers". It matches composite numbers in unary notation (n is represented by n '1' characters).

The first part (^1?$) allows "" and "1" to match (so that 1 is not detected as a prime).

The second part matches groups of two or more ones (11+?), repeated twice or more, ie products n*m, n ≥ 2, m ≥ 2.

The backreference means that \1 should match the exact same string as the first (11+?). It's different from using (11+?){2,} which would match n_1+n_2+n_3..., n_1 ≥ 2, n_2 ≥ 2, n_3 ≥ 2 (where submatch is independent).

lambada · on Nov 12, 2012

Are people seriously still deliberately using ASCII-reliant code?

manojlds · on Nov 12, 2012

It's interesting. Doesn't mean it's worthy of being put in production code.

easytiger · on Nov 13, 2012

it does if the text you are dealing with is specified as ascii only

olalonde · on Nov 12, 2012

For file names, URLs, domain names, etc. it's usually the safe thing to do.

UnoriginalGuy · on Nov 12, 2012

Who's filenames aren't unicode? Also domains and URLs can be unicode too.

olalonde · on Nov 13, 2012

> Who's filenames aren't unicode?

Many filesystems don't support unicode or support only a subset of it:

https://en.wikipedia.org/wiki/Filename#Comparison_of_filenam...

> Also domains and URLs can be unicode too.

Domains: it depends at which level you are dealing with them. See https://en.wikipedia.org/wiki/Internationalized_domain_name

    Internationalized domain names are stored in the Domain 
    Name System as ASCII strings using Punycode transcription.

URLs: Unicode characters are not allowed in URLs. See http://www.faqs.org/rfcs/rfc1738.html and http://www.blooberry.com/indexdot/html/topics/urlencoding.ht...

    only alphanumerics, the special characters "$-_.+!*'(),", and
    reserved characters used for their reserved purposes may be used
    unencoded within a URL.

benregn · on Nov 13, 2012

http://www.ebækur.is

jarek · on Nov 13, 2012

exists in DNS as xn--ebkur-tra.is

esrauch · on Nov 12, 2012

http://en.wikipedia.org/wiki/IDN_homograph_attack

recursive · on Nov 12, 2012

Not as often as often as handling all characters.

csense · on Nov 12, 2012

Every time I've had to deal with Unicode and internationalization, it's been a problem.

For example, a few years ago I grabbed a source tarball from somewhere, I forget what or where. It had the author's name in a comment, which included an O with dots over it. That was the only non-ASCII character in the source code. No matter what I did, both Eclipse and command-line javac refused to compile the source.

Finally I wrote a script to delete his name from every source file manually. It compiled flawlessly.

Then there's the time I found some text files with two characters of binary junk at the beginning, followed by completely normal text. Again, I forget what I was doing, but some program was refusing to process them correctly. It was something internationalization-related called the BOM. Eventually I ended up writing a script to walk a directory and remove the first two bytes of every file. (This can probably be done with dd and xargs on UNIX, but I was using Windows at the time, which means that something like this will require spending an hour or so in your favorite programming language.)

These experiences lead me to believe that, for bootstrapped USA startups at least, you shouldn't worry about a market outside the English-speaking world.

If you need to worry about junk like accented characters or moon runes (Chinese/Japanese/Korean characters), it means you're big enough to afford to hire someone specifically to address the problem.

e12e · on Nov 13, 2012

I assume this is a not very subtle troll? Java source is unicode? (The offhand reference to dd and xargs is a bit too much).

How do you define "English-speaking world", btw? Those too ignorant to have heard of non-ascii-characters (ie: excluding Canada, as anyone doing business there should at least have heard of French)?

Anyway, for anyone actually burnt by something similar on a GNU system try looking up recode(1).

andrewflnr · on Nov 13, 2012

What? You suffered from other peoples' bad internationalization, which implies that people shouldn't care about internationalization?

laumars · on Nov 13, 2012

BOM sounds more like an issue with you switching Unicode documents between Windows and Unix, rather than a problem with internationalisation.

http://en.wikipedia.org/wiki/UTF-8#Byte_order_mark

And personally I think to exclude all internationalisations because they're harder is a terrible attitude to have. Particularly these days when there's an online tutorials for pretty much any job imaginable (not to mention the numbers of helpful experts willing to give up their time for free on various forums and communities).

klibertp · on Nov 13, 2012

> which means that something like this will require spending an hour or so in your favorite programming language

Ok, this is where I stop worrying about how quickly I write code. Did this (removing BOM) quite a few times and it took just a few minutes in Python (under Windows). Heck, this could be two-liner I think :)

jarek · on Nov 13, 2012

I, for one, applaud this attitude. It gives programmers and companies that know what they're doing a leg up over people who couldn't even bother to figure out UTF-8. Natural segmentation of a target market is a good thing.

kolinko · on Nov 13, 2012

I sense a daily wtf material here.

yxhuvud · on Nov 12, 2012

Yes, when dealing with RFC's that do.

kamaal · on Nov 13, 2012

I think HN is written Arc, which is not very Unicode friendly.

stavros · on Nov 13, 2012

Δοκιμή.

EDIT: It works fine for comments, at least.

IheartApplesDix · on Nov 12, 2012

Did ducttape stop being sticky?

skrebbel · on Nov 12, 2012

It's clever, but it's also completely unreadable for anyone who didn't read this article. Regexes have serious maintainability issues as it is; let's not make it worse by putting clever tricks in them.

Terretta · on Nov 12, 2012

I don't understand why this is "completely unreadable".

What else could this have been besides match the character range from space to tilde?

michaelt · on Nov 12, 2012

The main risk in my mind was that it was some sort of control sequence for a feature I hadn't memorized.

I know there's some syntax I can use to create a zero width negative look behind recursive greedy named capture group back reference. Perhaps hyphen-tilde triggers something like that.

mikeash · on Nov 12, 2012

Most people would have to check an ASCII table to know what that range is, though.

slavak · on Nov 13, 2012

Which takes for granted the fact that your input stream is even ASCII to begin with. I'm too lazy to check, but I'm pretty sure this isn't going to catch all printable Unicode characters, for example - and then you're left scratching your head over what the hell the original author was trying to achieve.

jlgreco · on Nov 12, 2012

Presumably the space, commonly having no meaning in and of itself, could throw you for a moment or two. This isn't a regex `foo_[a-z]`, you have to stop and think about it for a moment.

I don't think it is particularly bad though. It's just not the most trivial of regexes.

akdetrick · on Nov 12, 2012

Every regex seems like a clever trick.

ceejayoz · on Nov 12, 2012

My favourite is this one: http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html

jballanc · on Nov 12, 2012

Ah, this is my favorite also. If seeing this doesn't make you second guess using a RegExp when a parser is more appropriate, well...you might be a Perl programmer?

l-p · on Nov 13, 2012

I think this one is more appropriate: http://nikic.github.com/2012/06/15/The-true-power-of-regular...

carlio · on Nov 12, 2012

This seems to be a T-shirt advert, why am I reading this on HN?

rfrey · on Nov 12, 2012

why am I reading this on HN

Because enough people voted it up within a set time window.

pkrumins · on Nov 12, 2012

I'm sorry that it sounds like it. It's really not. I commented about it on this thread http://news.ycombinator.com/item?id=4775100.

sk5t · on Nov 12, 2012

I suppose a single regex can be both "favorite" and "worst" at the same time... it's only slightly interesting to know where ~ appears in the ASCII character set, and while someone might recall that space is kinda near the beginning but after the control characters, is it the first helpful printable character? Who knows?

NegativeK · on Nov 12, 2012

> I suppose a single regex can be both "favorite" and "worst" at the same time...

We definitely aren't the only ones who appreciate horrible things.

INTERCAL comes to mind here.

gpvos · on Nov 12, 2012

My favourite regex is actually:

[^ -~]

Not to be used in a serious program, but only in an editor (or maybe one-shot data massage perl scripts), to find possible errors or unexpected stuff.

tathastu · on Nov 12, 2012

Also it's more interesting to put unprintable characters on a t-shirt.

IsTom · on Nov 12, 2012

This works for ASCII only, use unicode character classes instead.

csense · on Nov 12, 2012

That only matters if you need to process Unicode.

See my comments [1] [2] [3] for why Unicode / internationalization should be avoided.

[1] http://news.ycombinator.com/item?id=4369323

[2] http://news.ycombinator.com/item?id=4541039

[3] http://news.ycombinator.com/item?id=4775440

kolinko · on Nov 13, 2012

So, do you propose that u.s. bootstrapped startups have a disclaimer on the registration page saying: "you cannot put foreign characters anywhere in our system"?

Even if you focus on u.s., you will have problems. If you're doing a CRM, even u.s. users will put in foreign names from time to time. If you're building a CMS, users may want to put in a quotation in french, or will simply use copy&paste from Word, which replaces "-" with "—"...

I honestly have a hard time finding a u.s. centric startup which could afford to ignore unicode. The support requests, the fires caused by errors, and the disclaimer that you'd have to put on the registration page, would cost much more than simply learning how to code the f'n utf.

Building MVP is good practice in Lean. Saying "I'm bootstrapping hence I don't have the time to learn the programming tools" is just ignorance and incompetence. It's not like Unicode gives you extra work, it just requires you to learn a few basic concepts. If you try to build a site which doesn't support Unicode, you'll have to put a lots of safeguards everywhere to cover up for your incompetence.

smegel · on Nov 12, 2012

unjinxable · on Nov 12, 2012

I like to use a variation of this in vim to quickly see if an html doc I'm working on contains weird characters that I might want to replace with &html; entities:

  /[^ -~]

Mahn · on Nov 12, 2012

[\p{L&}] <- unicode version, in case you were wondering.

e12e · on Nov 13, 2012

No, you don't need '[' and ']' ? Also you don't need the '&' to get the equivalent of the above, but might have to add support for spaces?

  > \p{L} or \p{Letter}: any kind of letter from any language.

vs

  > \p{L&} or \p{Letter&}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).

Along with:

  > \p{Z} or \p{Separator}: any kind of whitespace or invisible separator.

Considering the op matches everything printable, including whitespace (or actually just space, not tab), numbers and punctation, I think the equivalent would be "\X" ?

All this based on glancing at:

   http://www.regular-expressions.info/unicode.html

tripzilch · on Nov 14, 2012

Wouldn't it be smarter, in that case, match something like [\x20-\x7F] (unsure if that's valid regex, but you get the point), it's more explicit that way, being very obvious about what characters are included, as well as immediately being clear (to me) about the intention of the character class. "0x20 to 0x7F" triggers the idea of "Printable ASCII" a lot sooner than <space> to <tilde>.

kamakazizuru · on Nov 12, 2012

what a plug - obviously the only real purpose of this was to sell those t-shirts.

pkrumins · on Nov 12, 2012

I'm sorry that was not the purpose. I just wanted to share this regex trick that I had in my mind. I added the shirts only later when I saw that the article is getting very popular. People really seemed to like my previous tees and it's great to make a little extra money and continue doing what I love - coding and writing blog posts.

bhanks · on Nov 12, 2012

well obviously people are interested in it. Can't fault you for trying to make a little scratch

kaokun · on Nov 13, 2012

Not all strings are ASCII! :-(

jervisfm · on Nov 13, 2012

Can anyone explain how this regex [- ~] matches ASCII characters ?

boyter · on Nov 13, 2012

It's pretty simple. Assuming you know regex... Im going to assume you don't since you are asking.

The bracket expression [ ] defines single characters to match, however you can have more then 1 character inside which all will match.

  [a] matches a
  [ab] matches either a or b
  [abc] matches either a or b or c
  [a-c] matches either a or b or c.

The - allows us to define the range. You can just as easily use [abc] but for long sequences such as [a-z] consider it short hand.

In this case [ -~] it means every character between <space> and <tilde>, which just happens to be all the ASCII printable characters (see chart in the article). The only bit you need to keep in mind is that <space> is a character as well, and hence you can match on it.

You could rewrite the regex like so (note I haven't escaped or anything in this so its probably not valid)

  [ !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~]

but that's not quite as clever or neat.

mutation · on Nov 13, 2012

It doesn't. Space is significant here, and if '-' is a the front of the matching character class it matches literal '-'. Your regex '[- ~]' matches either '-' or ' ' or '~'.

_xhok · on Nov 13, 2012

I don't want to be "that guy" and it's probably just my own stupidity, but what's so special about this for it to frontpage HN?