Don’t Use ISO/IEC 14977 Extended Backus-Naur Form

ModernMech · on April 4, 2019

> Clearly expressions like [a-zA-Z0-9] are shorter and clearer.

Obviously the above is shorter, but is it really clearer? The ISO/IEC 14977:1996 notation is longer but very explicit. The above definition relies on an implicit understanding of ASCII encodings, i.e., that a-z, A-Z, and 0-9 are disjoint ranges. Someone without that understanding might read a-zA as one range, zA-Z0 as another, and Z0-9 as a third. Further, one might wonder whether a-9 is a range that would include all alphanumeric characters. This is not exactly clearer than the ISO/IEC 14977:1996 notation, which spells everything out in explicit terms.

> Ranges also make exceptions clearer, e.g., if you omitted the letter O it would be obvious in a range but not obvious in a long list.

This would be written:

letter_except_O = letter - "O"

I don't see how that's not obvious.
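For what it's worth, the exception operator reads like plain set difference. A rough Python sketch of the same idea, assuming `letter` means only the uppercase letters A through Z (the variable names just mirror the rule above and are not part of any standard):

```python
import string

# Assumption: "letter" covers only uppercase A-Z for this illustration.
letter = set(string.ascii_uppercase)

# Mirrors the grammar rule: letter_except_O = letter - "O"
letter_except_O = letter - {"O"}

print(len(letter), len(letter_except_O))
print("O" in letter_except_O)
```

The exclusion is stated once, by name, rather than being inferred from a gap in a range.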

wahern · on April 4, 2019

I tend to agree.

> Ranges also make exceptions clearer, e.g., if you omitted the letter O it would be obvious in a range but not obvious in a long list.

That feels true for 0-9A-Za-z, specifically. But that's only because of the ubiquity of ASCII. In EBCDIC A-Z and a-z aren't contiguous, and POSIX only requires 0-9 to be contiguous.
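The EBCDIC point is easy to check. A small sketch, assuming Python's `cp037` codec as a representative EBCDIC code page (other EBCDIC variants may differ in detail; `is_contiguous` is a helper made up for this example):

```python
import string

def is_contiguous(chars, encoding):
    """Return True if each character's encoded byte is exactly one more than the previous."""
    codes = [c.encode(encoding)[0] for c in chars]
    return all(b - a == 1 for a, b in zip(codes, codes[1:]))

# cp037 is used here as one example of an EBCDIC code page.
for enc in ("ascii", "cp037"):
    print(
        enc,
        is_contiguous(string.ascii_uppercase, enc),
        is_contiguous(string.ascii_lowercase, enc),
        is_contiguous(string.digits, enc),
    )
```

Under `cp037` the letters fall into several separated blocks, so a byte-value range from A to Z would also cover non-letter code points, while the digits stay contiguous.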

In practice those are esoteric caveats we can all ignore, but for everything else (every other potential range) it can't be ignored at all. The range short-hand is actually quite useless except for very terse scripts (e.g. sed) and only then for A-Za-z (or A-Fa-f).[1]

Similarly, defining letter using the long form may be more error prone as compared to the idiomatic short form, but that's the only case. In practice letter is often predefined, and in any event it's rather trivial to verify oneself--step through the alphabet, then double check by counting letters to 26. I don't think I've ever had an error where I forgot a letter in a long-form A-Za-z set, but I have forgotten a letter in a long-form A-Fa-f set and even done stupid stuff like a-e in short-form precisely because it's easier to be sloppy when you think it's difficult to get wrong. In terse code big errors are inconspicuous.
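The a-e kind of slip is cheap to catch mechanically. A minimal sketch, assuming Python's `re` bracket expressions as a stand-in for whatever tool the class is written for (`expand` is a hypothetical helper, and it only looks at ASCII):

```python
import re
import string

def expand(char_class):
    """Return the set of ASCII characters matched by a bracket expression."""
    pattern = re.compile(char_class)
    return {chr(i) for i in range(128) if pattern.fullmatch(chr(i))}

expected = set(string.hexdigits)      # 0-9, a-f, A-F
sloppy = expand("[0-9A-Fa-e]")        # the a-e slip described above

# Characters the sloppy class fails to cover.
print(sorted(expected - sloppy))
```

Expanding the short form and diffing it against an explicit list is the same "count to 26" check, just automated.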

[1] In writing portable sed and tr code I typically list the letters individually, as IME neither [[:alpha:]] nor [A-Za-z] is universally supported by the default implementations. /bin/sed and /bin/tr may not be the system's POSIX-compliant implementations or may require special environment presets, and it's usually easier to live within de facto limitations than to debug and maintain a complex preamble that attempts to locate the POSIX-compliant utilities.

sonofgod · on April 4, 2019

W3C is explicit: ranges are Unicode code points.

[a-9] is not a valid range (since 9<a). [0-z] is, but as you say it is misleading and you probably shouldn't.
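Python's `re` module happens to follow the same code point rule, so it can serve as a quick illustration (an assumption for demonstration only; it is not the W3C notation itself):

```python
import re

# A reversed range is rejected because "9" (U+0039) sorts before "a" (U+0061).
try:
    re.compile("[a-9]")
    reversed_range_ok = True
except re.error:
    reversed_range_ok = False

# [0-z] is accepted, but it covers every code point from "0" through "z".
in_range = [chr(c) for c in range(ord("0"), ord("z") + 1)]
extras = [c for c in in_range if not c.isalnum()]
matched = [c for c in extras if re.fullmatch("[0-z]", c)]

print(reversed_range_ok)
print("".join(extras))
```

The second line of output lists the punctuation that sits between the digit and letter blocks, all of which `[0-z]` silently accepts.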

There is an ambiguity in that [A-Z] can be legally interpreted as "any of the characters A, hyphen, or Z". A rule specifying that a literal hyphen must be first or last in the list would resolve this problem.
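Regex engines already use that first-or-last convention. A sketch with Python's `re`, again only as an analogy to the grammar notation (the sample string is made up):

```python
import re

text = "A-M-Z"

# Hyphen between two characters: read as the range A through Z.
as_range = re.findall(r"[A-Z]", text)

# Hyphen placed last: read as a literal, so the class is {A, Z, -}.
as_list = re.findall(r"[AZ-]", text)

print(as_range)
print(as_list)
```

The two classes use the same three characters and differ only in where the hyphen sits.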

Also, good luck encoding

without ranges.

sonofgod · on April 4, 2019

Title woefully inaccurate: original is "Don’t Use ISO/IEC 14977 Extended Backus-Naur Form (EBNF)" and the article can be summarised as "Use W3C EBNF instead".

smarks · on April 4, 2019

Agree. The article has several criticisms of ISO/IEC 14977 specifically, which seem quite reasonable to me. The article isn't a criticism of EBNF in general.

dang · on April 4, 2019

Oops. My mistake. Fixed. Thanks!

dragontamer · on April 4, 2019

Good article, bad Hacker-news title.