
How many errors? - ingve
http://nedbatchelder.com//blog/201509/how_many_errors.html
======
thaumasiotes
> The regex here is using an r"" string, as all regexes should.

I don't understand how \ ever became the escape character for regular
expressions (and JSON!) in the first place. Every language gets its own unique
escape character -- that way, you don't need an exponential number of
characters in the source code to represent a single character in the model.
This seems to have been understood early on; C uses \ as the escape character
for strings and % as the escape character for printf directives. Common Lisp
uses \ for strings and ~ for format directives. HTML uses &. When I wrote a
regex parser, I made the escape character /, making a clear distinction
between \n (a raw newline - just character data to the regex parser) and /n
(an escaped n - special to the regex parser). But somehow every mainstream
regex library uses \, gratuitously adding a lot of mental load to figuring out
how much escaping you need. String.split("\\s+") is a usability failure, not
something to be proud of, and definitely not something to imitate. What
happened?

Imagine you're looking for use of a single backslash in some JSON data. It'll
be escaped, appearing literally in the data as \\. If (because this is an
example to show why sharing escape characters between languages is idiotic)
you want to search for it with a regex, you need \\\\. But to get that
regex, you need to escape each of those four backslashes yet again, because
you construct them from strings. So you've got re.compile('\\\\\\\\')
to match what is notionally a single backslash in the data you're searching
through. Compare a hypothetical world where \ is the escape character for
strings, @ is the escape character for regexes, and + is the escape character
for JSON:

    
    
        re.compile('++') # matches escaped + in the data
        re.compile('@@') # matches @ in the data
        re.compile('\\') # matches \ in the data
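
For comparison, the layering in today's shared-backslash world can be checked directly in Python. This is a minimal sketch that assumes the data is a JSON string holding a single backslash; the variable names are made up for illustration.

```python
import json
import re

# One literal backslash in the model (the data we actually care about).
value = "\\"
assert len(value) == 1

# JSON layer: the backslash is escaped, so the raw JSON text holds two.
raw_json = json.dumps(value)
assert raw_json.count("\\") == 2

# Regex layer: each of those two backslashes is escaped again (four),
# and a plain string literal doubles them once more (eight in the source).
plain = re.compile('\\\\\\\\')

# A raw string removes the string-literal layer, leaving four in the source.
raw = re.compile(r'\\\\')

assert plain.pattern == raw.pattern
assert len(plain.pattern) == 4
assert raw.search(raw_json) is not None
```

Both compiled patterns are identical; only the number of backslashes typed in the source differs.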

~~~
eridius
Why are you using regexes to search through raw JSON? That's pretty messy.

The short answer for "why does everything use \" is because using a different
escape character for each different grammar very quickly exhausts the
potential space of uncommon punctuation, and becomes exceedingly confusing as
anyone who works with more than one of these grammars now has to remember
which one uses which escape character.

~~~
thaumasiotes
> Why are you using regexes to search through raw JSON?

>> (because this is an example to show why sharing escape characters between
languages is idiotic)

I turned one backslash into eight backslashes with a contrived example. While
it's not common to go through so many layers, this problem is very real, and,
in the article, is directly responsible for a bug.

> using a different escape character for each different grammar [...] becomes
> exceedingly confusing as anyone who works with more than one of these
> grammars now has to remember which one uses which escape character

Are you really arguing that we could make C _less_ confusing by switching
printf("%d", x) to printf("\\d", x)? I'd say that C strings and C format
directives use different escape characters _specifically because_ they're
likely to be used together. But regexes are intended to be used with strings,
too.

~~~
eridius
Since you had to pick a contrived example, that suggests the problem isn't
particularly common. If it was "real" enough, you wouldn't need a contrived
example.

> _and, in the article, is directly responsible for a bug._

Nope! It was not responsible for anything whatsoever. Yeah, the backslashes
got escaped, but they were _already_ broken, so it didn't actually break
anything. All it did was make "btest" match instead of "\x08test" match (both
of which are equally buggy, though "btest" is more likely to be triggered in
the wild).
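
The difference between the two behaviours can be sketched in Python. The patterns below are hypothetical stand-ins, not the exact regex from the article; they only assume that the `\b` sits inside a character class, where it means backspace rather than word boundary.

```python
import re

# Hypothetical patterns for illustration; not the regex from the article.
# Inside [...], \b means the backspace character (\x08), not a word boundary.
in_class = re.compile(r'[\b_]test')

# With the backslash itself escaped, the class holds a literal backslash,
# the letter b, and an underscore.
escaped = re.compile(r'[\\b_]test')

assert in_class.search('\x08test') is not None
assert in_class.search('btest') is None

assert escaped.search('btest') is not None
assert escaped.search('\x08test') is None
```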

> _Are you really arguing that we could make C_ less _confusing by switching
> printf("%d", x) to printf("\\d", x)?_

No. I did not say that at all, and I don't appreciate straw man arguments.

> _I'd say that C strings and C format directives use different escape
> characters_ specifically because _they're likely to be used together_

I doubt it. Neither of us was present (I'm assuming) when the printf format
spec was decided upon, but I'd wager that % was chosen because this has a
semantically different meaning than backslash-escapes. Backslash escapes are
just that: escapes. They're either stand-ins for certain un-typeable
characters, or they disable interpretation of the following character. But %
in printf isn't used as a character escape, it's used as a _token_. It's a
stand-in for a more complicated representation of a separate argument. And the
fact that it behaves differently than \ does warrants it using a different
character.

Whereas in other languages, \ typically has the exact same semantic meaning as
\ does in C strings (the set of character escapes may vary slightly, although
most languages tend to copy C's character escapes verbatim). And that's why
they use \. It's the same reason most languages use "" to denote strings, or
[] to denote subscripting. It's common punctuation that behaves the same
across multiple languages, which makes it easy for the programmer to remember
and work with.

~~~
thaumasiotes
There is also a non-contrived example in my original comment,
String.split("\\s+"). And the advice that regexes in python should always be
constructed from r"" strings is a direct response to this problem
specifically. I picked a contrived example to make clear that sharing escape
characters causes exponential growth as you add layers.
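
The doubling per layer can be sketched in a few lines of Python. The helper names are hypothetical, and the sketch assumes each layer escapes a backslash by prefixing it with another backslash.

```python
import re


def add_escape_layer(text: str) -> str:
    """Escape every backslash once, as one more backslash-escaped layer would."""
    return text.replace("\\", "\\\\")


def backslashes_after(layers: int) -> int:
    """Count the backslashes needed for one backslash wrapped in `layers` layers."""
    text = "\\"
    for _ in range(layers):
        text = add_escape_layer(text)
    return text.count("\\")


# Zero to three layers: the count doubles each time.
counts = [backslashes_after(n) for n in range(4)]
assert counts == [1, 2, 4, 8]

# A raw string drops the string-literal layer, so these are the same pattern.
assert r"\s+" == "\\s+"
assert re.split(r"\s+", "a b   c") == ["a", "b", "c"]
```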

>> Are you really arguing that we could make C less confusing by switching
printf("%d", x) to printf("\\d", x)?

> No. I did not say that at all, and I don't appreciate straw man arguments.

This is an argument in a very standard form, reduction to absurdity.

I see printf escapes and C string backslash-based escapes as semantically
identical; they are both ways of giving the parser instructions directly
rather than giving it a literal representation of what you want (this analysis
also nicely explains why they're called "escapes"). But since there is a good
amount of evidence that a lot of other programmers feel the same way you do,
let me ask this: HTML escapes (with &) have exactly the semantics you
attribute to \. Do you think HTML would be better if it used backslashes
instead?

------
js2
Comment on the post links to
[https://github.com/nose-devs/nose/issues/11](https://github.com/nose-devs/nose/issues/11)
which reported the issue in 2011. Ugh.

~~~
bobwaycott
That is deceptive, as it is a mirrored issue from when the project was brought
to Github (which is why it reads so oddly, all from the same author). It'd
been reported nearly 20 months prior on googlecode:
[https://code.google.com/p/python-nose/issues/detail?id=335](https://code.google.com/p/python-nose/issues/detail?id=335)

This issue has been open since April 2010. What a shame.

------
monochromatic
I'm not 100% sure what the intent of the code is, but it sounds like something
that doesn't need a regex at all. Would

    
    
        'test' in x.lower()
    

work? Or maybe

    
    
        x[:4].lower() == 'test'
    

if you want it to be at the beginning.

~~~
lgas

        x.startswith("test")

~~~
monochromatic
Even better, forgot about that one. Still needs a lower() though.
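
Putting the two suggestions together, a minimal sketch might look like the following. The function name is hypothetical, and this is only a prefix check; it is not claimed to match what the original regex was meant to accept.

```python
def looks_like_test(name: str) -> bool:
    """Hypothetical check: does the name start with 'test', ignoring case?"""
    return name.lower().startswith("test")


assert looks_like_test("test_parse")
assert looks_like_test("TestCase")
assert not looks_like_test("contest")
```

Note that lowercasing the whole name also accepts spellings such as "TEST_x", which a pattern like [Tt]est would not.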

------
TheLoneWolfling
You have a problem. You use a regex. Now you have two problems.

~~~
miander
Please do not use that phrase...
[http://regex.info/blog/2006-09-15/247](http://regex.info/blog/2006-09-15/247)

EDIT: I take that back, you're actually using it in its original context :)

------
jheriko
that code stinks on many levels.

matching test in a string with a regex is so far away from being part of a
good solution to this problem on any level imo... it's a recipe for pain. i'm
sorry anyone has to work in that codebase.

~~~
eridius
You're taking regex hate way too far. This is actually pretty much an _ideal_
use-case for regular expressions. The correct regular expression looks like
/\b[Tt]est/, which is trivial to write by anyone who has a passing familiarity
with regex, easy to read, and will do exactly what it's trying to do.
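
As a rough check of what that pattern accepts, here is a small Python sketch. The list of names is invented for illustration.

```python
import re

# The pattern quoted above, written as a raw string so \b stays a word boundary.
TEST_NAME = re.compile(r"\b[Tt]est")

# Made-up names for illustration.
names = ["test_parse", "TestCase", "contest", "latest_value", "run_test", "my-test"]
matches = [n for n in names if TEST_NAME.search(n)]

assert matches == ["test_parse", "TestCase", "my-test"]
```

"run_test" is not matched because the underscore counts as a word character, so there is no word boundary between "_" and "t".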

~~~
jheriko
it's nothing to do with the regex. it's the algorithm and its suitability.
'any function containing test in its name' being a test is a stinky practice.

using something well defined is what i would expect from any programmer at any
level.

this is what we call a 'hack'... and it's a particularly revolting one.

~~~
eridius
I'm going to go out on a wild limb here and guess that you have _no clue
whatsoever_ what a regex is, because your comment makes absolutely no sense.
The original regex from the article is wildly broken, yes. But the trivial
regex /\b[Tt]est/ does _not_ match "any function containing test in its name".
And this regex is _extremely_ well-defined. The fact that you can't understand
it, and the fact that you are making no effort whatsoever to learn even the
most basic stuff about regexes, does not make this a 'hack', it just makes you
one.

~~~
jheriko
nope. use them every day, but not as a pro expert because it's rarely
necessary. i don't know what \b is off the top of my head but i'd guess at
word boundary so it's only checking for functions starting with test, which is
equally bad...

matching on the string contents of function names is not a good way to work.
using well defined constructs is.

thank you for the hate though.

