Some examples of ambiguities:
1. It does not specify precedence. For example, if a line like "~~~" (or "[ref]: /url") is followed by a setext underline, is that a header, or is that the start of a fenced code block (or ref definition)?
2. The spec says: "Code span backticks have higher precedence than any other inline constructs except HTML tags and autolinks". It gives "<a href="`">`" as an example of an HTML tag. What happens for other placements of backticks, like "<a `href=""`>" or even "`<a href="">`", is left unspecified.
3. What is the precedence or associativity of span-level constructs? For example, does "<asterisk>a[b<asterisk>](url)" result in "a[b" being emphasised or "b<asterisk>" being linked?
Thing is, a specification-by-example like this would have to keep an ever-growing list of corner cases and give examples for each of them. To get completely unambiguous, the list needs to be very long, and when it gets very long, it becomes unwieldy to handle for an implementer of the spec.
Hence the need for a formal grammar, which is the shortest way of expressing something unambiguously. But it's not possible to write a CFG for Markdown because of Markdown's requirement that anything is valid input. So the next best thing is to define a parsing algorithm, like the HTML5 spec. (Shameless plug: vfmd (http://www.vfmd.org/) is one such Markdown spec which specifies an unambiguous way to parse Markdown, with tests and a reference implementation.)
So if "Standard Markdown" is NOT unambiguous and wouldn't be, then it's not a "standard", so calling it "Standard Markdown" is not quite proper.
Re (3): we have an asterisk which can open emphasis. So, to see if we have emphasis, the rules say to parse inlines sequentially until an asterisk that can close emphasis is reached. The first inline we come to is [b*](url), which is a link. There's no closing asterisk, so we don't have emphasis, but a literal asterisk followed by a link.
Re (1): I believe you are right that the case of a reference definition before a setext header line should be clarified. However, the other case seems clear enough. ~~~ starts a fenced code block, which ends with a closing string of tildes or the end of the enclosing container. The underline would be included in that code block either way.
Re (2): I believe the talk of precedence may be misleading here (I thought it would be useful heuristically). The basic principle of inline parsing is to go left to right, consuming inlines that match the specs. This resolves all of these cases. Perhaps the talk of precedence should be removed.
I am no stranger to formal specifications. I wrote what I think was the first PEG grammar for markdown (peg-markdown, which came to be used as the basis for multimarkdown and several other implementations). PEG isn't a good fit, especially for block-level parsing. It almost works for inline-level parsing, but there are some constructs (like code spans) that can't be done in PEGs. It might be worth specifying inline parsing in a pseudo-PEG format to avoid worries like those you've expressed.
I understand what you have now is a provisional spec, but I have reason to believe that a specification based on declaring constructs and defining by examples is never going to get completely unambiguous. A lot of the ambiguity in parsing Markdown lies in the interplay between different syntax constructs. A spec like yours doesn't address them at all, so they remain as ambiguities. All examples of ambiguities I gave involve the interplay of different constructs (more on them below).
It's debatable whether translating your code to English is "simple" without talking about memory addresses, pointers and arrays. In any case, vfmd is _not_ such a translation (I'm not saying that you imply that it is). vfmd was first written as a spec, then tests written to match the spec, and then implemented, followed by more tests. (However, the spec did get fixes during testcase development and implementation.)
> But it seemed to us that there was value in giving a declarative specification of the syntax, one that was closer to the way a human reader or writer would think, as opposed to a computer.
I agree there is value in making an easy-to-read syntax description. However, making a readable specification for document-writers and making an unambiguous specification for parser-developers are opposing objectives. The document writer asks "What should I do to get a heading?", while a parser developer asks "How should I interpret a line starting with a hash?". Your spec is good if you target only document writers, but falls short as a spec for parser developers, because of the ambiguities.
In vfmd, I addressed this by creating two documents - one for document-writers and one for parser-developers - that are consistent with each other.
On the specific examples:
> Re (3) ... the rules say to parse inlines sequentially until an asterisk that can close emphasis is reached
Yes, but where does your spec say that an asterisk can not close emphasis if it's contained within a link? As it stands now, going by the rules in the emphasis part of the spec (section 6.4), it should be treated as emphasis, and going by the rules in the link part of the spec (section 4.7), it should be treated as a link. The spec is silent on corner cases where multiple constructs overlap: Does the leftmost construct always win? What happens if it's not a well-formed link? What if three syntax constructs interleave?
> Re (1): ... ~~~ starts a fenced code block, which ends with a closing string of tildes or the end of the enclosing container. The underline would be included in that code block either way.
Going by the setext headers section of your spec (section 4.3), I'm not at all sure why a "~~~" line followed by a "===" line is not a setext header. Yes, your implementation interprets it as a code block, but your spec is ambiguous on how this _should_ be interpreted.
> Re (2): ... The basic principle of inline parsing is to go left to right, consuming inlines that match the specs. This resolves all of these cases. ...
If the basic principle of inline parsing is to go left to right and if all inline constructs should be parsed like that, then "[not a `link](/foo`)" should be interpreted as a link (which is contrary to Example 240 in your spec). Clearly, code spans should have a higher priority, but that needs more than a couple of examples to define correctly.
This principle also looks contrary to your reply to (3) above, where you say "<asterisk>a[b<asterisk>](url)" is a link, not emphasis.
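To make the "[not a `link](/foo`)" case concrete, here is a minimal Python sketch of two readings of the rules. It is not taken from either spec or from any implementation: the two regexes are deliberately crude stand-ins for links and code spans, and both function names are made up for illustration. One function takes "go left to right, consume whatever matches" literally; the other claims code spans first and only then looks for links in what is left.

```python
import re

# Simplified, hypothetical patterns; real link and code-span rules are richer.
LINK = re.compile(r"\[([^\]]*)\]\(([^)]*)\)")
CODE = re.compile(r"`([^`]+)`")

def parse_left_to_right(text):
    """At each position try a link, then a code span; consume the first match."""
    found, pos = [], 0
    while pos < len(text):
        for name, pattern in (("link", LINK), ("code", CODE)):
            m = pattern.match(text, pos)
            if m:
                found.append((name, m.group(0)))
                pos = m.end()
                break
        else:
            pos += 1  # no construct starts here; treat the character as text
    return found

def parse_code_spans_first(text):
    """Claim code spans first, then look for links only outside them."""
    found = [("code", m.group(0)) for m in CODE.finditer(text)]
    # Blank out the code spans so their brackets cannot take part in a link.
    masked = CODE.sub(lambda m: "\0" * len(m.group(0)), text)
    found += [("link", m.group(0)) for m in LINK.finditer(masked)]
    return found

sample = "[not a `link](/foo`)"
print(parse_left_to_right(sample))     # the whole string is taken as a link
print(parse_code_spans_first(sample))  # only the code span is recognised
```

The two strategies disagree on the same input, which is the kind of gap that a handful of examples does not close.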
As noted above, the problem with a declarative spec for Markdown is the ambiguity (which is quite similar to the ambiguity in defining Markdown as a CFG, for example). As long as the spec is declarative, there will be multiple ways of interpreting an input (which, ironically, was the problem that parser-developers found with John Gruber's original Markdown syntax description too). Problems like this cannot be completely solved by providing examples because the combinations between the different constructs are too many to list as examples in a spec.
I only listed these items to illustrate the bigger problems in the design or style of the spec itself. Even if these individual items are addressed, there will always be more coming up, so I don't think it would make sense for me to keep finding and reporting ambiguities to your Discourse forum.
We considered writing the spec in the state machine vein, but I advocated for the declarative style. It may be worth rethinking that and rewriting it, essentially spelling out the parsing algorithm. As you suggest, a parallel document could be created for writers.
I'll need to study your spec further to see what the substantive differences are.
I'll be happy to open a post in talk.commonmark.org on the ambiguity problems caused by using a declarative style for the stmd spec. I'll do that once the forum is back (I can't seem to access it right now).
In parallel, I too will try to work out what the syntax differences are between stmd and vfmd. Meanwhile, please see: http://www.vfmd.org/differences/ (in case you haven't already).
Post on declarative style: http://talk.commonmark.org/t/571
Syntax diff: https://github.com/vfmd/vfmd-spec/wiki/commonmark-vs-vfmd
Honest question here: how do CFGs prevent you from parsing anything as valid input? E.g. AFAICT this CFG in BNF accepts anything as valid input (including no input!)
<s> ::= <x> EOF
<x> ::= CHAR <x> | ""
Also, isn't there a compromise between HTML's crazy parsing strategy and a CFG? A formal grammar, even if not context-free.
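As a sanity check on the grammar above, here is a minimal Python recogniser for it. It assumes `<x>` also has an empty alternative, which the "including no input" claim requires; the function names are made up for illustration.

```python
def match_x(text: str, pos: int = 0) -> int:
    """<x> ::= CHAR <x> | (empty). Returns the position just past the match."""
    # The right recursion is written as a loop so long inputs do not hit
    # Python's recursion limit; each iteration consumes one CHAR.
    while pos < len(text):
        pos += 1
    return pos

def match_s(text: str) -> bool:
    """<s> ::= <x> EOF. True when <x> consumes the input up to EOF."""
    return match_x(text) == len(text)

for sample in ["", "_a", "*a[b*](url)", "~~~\n==="]:
    print(repr(sample), match_s(sample))
```

Every string is accepted, so acceptance alone is easy to get from a CFG. What this grammar does not do is say anything about which characters form emphasis, links or code spans.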
Actually, I should have said it's not possible to write an _unambiguous_ CFG for Markdown.
Say we need to parse emphasis in span elements. "_a_" is em and "__a__" is strong, but "_a", "a_", "__a" and "a__" are normal text. If we write the rules for all these, we end up with a grammar that can generate the same string in many different ways. To determine whether an "_" is the syntax qualifier of an em or just part of normal text, we might have to look ahead an arbitrary number of characters, and potentially till the end of the input. This is why it's not possible to write a useful (or unambiguous) CFG for Markdown, and this is because of the requirement to not throw an error on any input.
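The ambiguity can be counted directly. The following Python sketch uses a toy grammar that is an assumption made for illustration, not a real Markdown grammar: an inline sequence is any number of items, and an item is either one literal character (including "_") or "_" followed by a non-empty inline sequence followed by "_". Because "_" must also be allowed as plain text so that no input is rejected, the same string gets several derivations.

```python
from functools import lru_cache

def count_parses(s: str) -> int:
    """Count derivations of s under the toy grammar:
         inline ::= item*
         item   ::= CHAR | "_" inline+ "_"
    """
    @lru_cache(maxsize=None)
    def count(i: int, j: int) -> int:
        if i == j:
            return 1  # the empty sequence has exactly one derivation
        total = count(i + 1, j)  # s[i] taken as a literal character
        if s[i] == "_":
            # s[i] taken as an opener: try every later "_" as its closer,
            # leaving at least one character in between.
            for k in range(i + 2, j):
                if s[k] == "_":
                    total += count(i + 1, k) * count(k + 1, j)
        return total

    return count(0, len(s))

for sample in ["_a", "_a_", "_a_b_"]:
    print(repr(sample), count_parses(sample))
```

"_a" has a single derivation, but "_a_" already has two (emphasis, or three literal characters), and the count grows as more underscores appear. Picking one derivation needs extra rules that the grammar itself does not express.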
> Also, isn't there a compromise between HTML's crazy
> parsing strategy and a CFG?
PEGs have been written for Markdown, and they work because a PEG is unambiguous by construction: ordered choice always takes the first alternative that matches, at the cost of backtracking. But those PEGs don't handle nested blocks cleanly.
My own HTML5-ish Markdown spec (http://www.vfmd.org/vfmd-spec/specification/) is not as crazy as HTML5's, but admittedly, is not trivial to implement either.
> But those PEGs don't handle nested blocks cleanly.
What's the problem exactly?
More details on PEG for Markdown:
- from the author of the PEG grammar, who's also the spec-writer of "Common Markdown": http://talk.standardmarkdown.com/t/standard-markdown-formal-...
- from myself: http://www.vfmd.org/introduction/#prior-work
Also, I wrote an expanded explanation of essentially what I said about CFGs and Markdown: http://roopc.net/posts/2014/markdown-cfg/
> A specification-by-example like this would have to keep an ever-growing list of corner cases and give examples for each of them. To get completely unambiguous, the list needs to be very long, and when it gets very long, it becomes unwieldy to handle for an implementer of the spec.
Even more troubling, they skipped the chance for some basic innovations which will probably ultimately result in a Standard Markdown 2 spec. So, for example, they are defining Markdown as a mapping to HTML, rather than a mapping to an internal tree structure which can then be serialized to HTML. If you make that change in perspective, then you can have Markdown for other languages too: not just HTML but also literate code in an arbitrary language, for example.
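A minimal Python sketch of that change in perspective, under stated assumptions: the `Node` type, its three node kinds and both serializer functions are hypothetical and far smaller than any real Markdown tree. The point is only that once the parser's output is a tree, HTML is one serializer among several.

```python
from dataclasses import dataclass, field
from html import escape
from typing import List

@dataclass
class Node:
    kind: str                      # "document", "paragraph" or "code"
    text: str = ""
    children: List["Node"] = field(default_factory=list)

def to_html(node: Node) -> str:
    """Serialize the tree to HTML."""
    if node.kind == "document":
        return "\n".join(to_html(child) for child in node.children)
    if node.kind == "paragraph":
        return f"<p>{escape(node.text)}</p>"
    if node.kind == "code":
        return f"<pre><code>{escape(node.text)}</code></pre>"
    raise ValueError(f"unknown node kind: {node.kind}")

def to_literate_python(node: Node) -> str:
    """Serialize the same tree to a Python file: prose becomes comments."""
    if node.kind == "document":
        return "\n".join(to_literate_python(child) for child in node.children)
    if node.kind == "paragraph":
        return "\n".join("# " + line for line in node.text.splitlines())
    if node.kind == "code":
        return node.text
    raise ValueError(f"unknown node kind: {node.kind}")

doc = Node("document", children=[
    Node("paragraph", "Add two & two."),
    Node("code", "print(2 + 2)"),
])
print(to_html(doc))
print(to_literate_python(doc))
```

Neither serializer needs to know how the tree was parsed, so adding another target language would not touch the parser.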
Another innovation which should probably work its way into Markdown as it becomes more of a file format is metadata. It's a little hard to remember, but acceptable metadata tagging was one of the killer features of MP3s, leading ultimately to their global rise. We don't have a good metadata expression for text files, and Markdown's embedded link references are, essentially, a sort of metadata already. Do this before it gets to the W3C so that we can start off a document with a simple
@author: Chris Drost
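How such a header might be read, as a rough Python sketch: the `@key: value` line syntax is only the one suggested above, and the rule that metadata lines must come first in the document, the regex and the function name are all assumptions made for illustration.

```python
import re

# Hypothetical syntax: "@key: value", one entry per line, at the very top.
META_LINE = re.compile(r"^@([A-Za-z][\w-]*):\s*(.*)$")

def split_metadata(source: str):
    """Split leading "@key: value" lines from the rest of the document."""
    meta = {}
    lines = source.splitlines()
    i = 0
    while i < len(lines):
        m = META_LINE.match(lines[i])
        if not m:
            break  # the first non-metadata line ends the header
        meta[m.group(1)] = m.group(2).strip()
        i += 1
    body = "\n".join(lines[i:]).lstrip("\n")
    return meta, body

meta, body = split_metadata("@author: Chris Drost\n\nHello *world*")
print(meta)
print(body)
```

The body is handed to the Markdown parser unchanged, so documents without any metadata lines behave exactly as before.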
 This isn't a huge change in the language but it's a huge change in perspective. The main decision needed to fix this is to say that the "embedded HTML blocks" should have a special sigil at the beginning which is not the < character of the first tag; those "raw" blocks are then held separately in the Markdown tree, and the serializer to HTML passes the raw blocks through without HTML escaping or embedding in another tag.
 Why not just use backticks? We could, of course. One problem here though is that there is no good way to distinguish those literate-code blocks which are commentary and those literate-code blocks which are code to be executed. If you don't fix that now, it will probably be fixed in SM2.