
Don’t Use ISO/IEC 14977 Extended Backus-Naur Form - lelf
https://dwheeler.com/essays/dont-use-iso-14977-ebnf.html
======
ModernMech
> Clearly expressions like [a-zA-Z0-9] are shorter and clearer.

Obviously the above is shorter, but is it really clearer? The ISO/IEC
14977:1996 form is longer but very explicit. The range definition relies on an
implicit understanding of the ASCII encoding, i.e. that a-z, A-Z, and 0-9 are
three disjoint ranges. Someone without that understanding might read a-zA
as one range, zA-Z0 as another, and Z0-9 as a third. Further,
one might wonder whether a-9 is a range that includes all alphanumeric
characters. That is not exactly clearer than the ISO/IEC 14977:1996 notation,
which spells everything out in explicit terms.
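
To make the implicit ASCII assumptions concrete, here is a rough Python sketch that expands a character class by brute force. The helper name expand is made up for illustration, and Python's re module is only a stand-in for whatever tool would actually read the grammar.

```python
import re
import string

def expand(char_class: str) -> str:
    """Return every ASCII character matched by a regex character class."""
    pattern = re.compile(char_class)
    return "".join(chr(c) for c in range(128) if pattern.fullmatch(chr(c)))

# [a-zA-Z0-9] is three disjoint ranges; in ASCII order the digits come first,
# then uppercase, then lowercase.
alnum = expand("[a-zA-Z0-9]")
assert alnum == string.digits + string.ascii_uppercase + string.ascii_lowercase

# A single range spanning "0" to "z" is NOT the same set: it also picks up
# the punctuation that sits between the three blocks in ASCII.
extras = set(expand("[0-z]")) - set(alnum)

# "a-9" runs backwards in ASCII (ord("a") > ord("9")), so Python rejects it.
try:
    re.compile("[a-9]")
    reversed_range_accepted = True
except re.error:
    reversed_range_accepted = False
```

None of this is visible in the notation itself; the reader has to know the code point order to see why the ranges behave this way.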

> Ranges also make exceptions clearer, e.g., if you omitted the letter O it
> would be obvious in a range but not obvious in a long list.

This would be written:

letter_except_O = letter - "O"

I don't see how that's not obvious.
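
The exception operator reads much like ordinary set difference. A minimal Python analogy, assuming letter means the 26 uppercase ASCII letters (the grammar above does not say which letters it covers):

```python
import string

# Assumption: "letter" is the uppercase ASCII alphabet.
letter = set(string.ascii_uppercase)

# Mirrors the rule: letter_except_O = letter - "O"
letter_except_O = letter - {"O"}
```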

~~~
wahern
I tend to agree.

> Ranges also make exceptions clearer, e.g., if you omitted the letter O it
> would be obvious in a range but not obvious in a long list.

That feels true for 0-9A-Za-z, specifically. But that's only because of the
ubiquity of ASCII. In EBCDIC A-Z and a-z aren't contiguous, and POSIX only
requires 0-9 to be contiguous.
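
The EBCDIC point can be checked with a small Python sketch. It uses the cp037 codec that ships with Python as one representative EBCDIC code page (other EBCDIC variants are not tested here), and the helper name is_contiguous is invented for the example.

```python
import string

def is_contiguous(chars: str, encoding: str) -> bool:
    """True if consecutive characters map to consecutive byte values."""
    codes = [ch.encode(encoding)[0] for ch in chars]
    return all(b - a == 1 for a, b in zip(codes, codes[1:]))

ascii_upper = is_contiguous(string.ascii_uppercase, "ascii")    # A-Z in ASCII
ebcdic_upper = is_contiguous(string.ascii_uppercase, "cp037")   # A-Z in EBCDIC
ebcdic_lower = is_contiguous(string.ascii_lowercase, "cp037")   # a-z in EBCDIC
ebcdic_digits = is_contiguous(string.digits, "cp037")           # 0-9 in EBCDIC
```

Under cp037 the letters are split into separate blocks with other code points in between, so a byte-value range from A to Z would match more than the 26 letters; the digits stay contiguous.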

In practice those are esoteric caveats we can all ignore, but for _everything_
else (every other potential range) they can't be ignored at all. The range
shorthand is actually quite useless except in very terse scripts (e.g. sed),
and even then only for A-Za-z (or A-Fa-f).[1]

Similarly, defining letter using the long form may be more error-prone than
the idiomatic short form, but that's the _only_ such case. In practice
letter is often predefined, and in any event it's trivial to verify
yourself: step through the alphabet, then double-check by counting the letters
to 26. I don't think I've ever forgotten a letter in a long-form A-Za-z set,
but I _have_ forgotten a letter in a long-form A-Fa-f set and even done stupid
stuff like a-e in short form, precisely because it's easier to be sloppy when
you _think_ it's difficult to get wrong. In terse code, big errors are
inconspicuous.
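
One cheap guard against the a-e kind of slip is to expand the class and compare it with a known-good set. A sketch in Python, with a made-up helper name (class_members) and Python's re standing in for the real tool:

```python
import re
import string

def class_members(char_class: str) -> set:
    """Return the set of ASCII characters matched by a regex character class."""
    pattern = re.compile(char_class)
    return {chr(c) for c in range(128) if pattern.fullmatch(chr(c))}

expected = set(string.hexdigits)  # 0-9, a-f, A-F

# The sloppy short form: "a-e" where "a-f" was intended.
missing = expected - class_members("[0-9A-Fa-e]")

# The intended short form matches the expected set exactly.
correct = class_members("[0-9A-Fa-f]") == expected
```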

[1] In writing portable sed and tr code I typically list the letters
individually, as IME neither [[:alpha:]] nor [A-Za-z] is universally supported
by the _default_ implementations. /bin/sed and /bin/tr may not be the
system's POSIX-compliant implementations or may require special environment
presets, and it's usually easier to live within de facto limitations than to
debug and maintain a complex preamble that attempts to locate the
POSIX-compliant utilities.

------
sonofgod
Title woefully inaccurate: original is "Don’t Use ISO/IEC 14977 Extended
Backus-Naur Form (EBNF)" and the article can be summarised as "Use W3C EBNF
instead".

~~~
smarks
Agree. The article has several criticisms of ISO/IEC 14977 specifically, which
seem quite reasonable to me. The article isn't a criticism of EBNF in general.

------
dragontamer
Good article, bad Hacker News title.

