
The problem is that there's no formal grammar and the spec of "Standard Markdown", while being more specific than John Gruber's, is still full of ambiguities.

Some examples of ambiguities:

1. It does not specify precedence. For example, if a line like "~~~" (or "[ref]: /url") is followed by a setext underline, is that a header, or is that the start of a fenced code block (or ref definition)?

2. The spec says: "Code span backticks have higher precedence than any other inline constructs except HTML tags and autolinks". It says as an example that "<a href="`">`" is an HTML tag. What happens for different placements of backticks, like "<a `href=""`>" or even "`<a href="">`", is left unspecified.

3. What is the precedence or associativity of span-level constructs? For example, does "*a[b*](url)" result in "a[b" being emphasised or "b*" being linked?

Thing is, a specification-by-example like this would have to keep an ever-growing list of corner cases and give examples for each of them. To get completely unambiguous, the list needs to be very long, and when it gets very long, it becomes unwieldy to handle for an implementer of the spec.

Hence the need for a formal grammar, which is the shortest way of expressing something unambiguously. But it's not possible to write a CFG for Markdown because of Markdown's requirement that anything is valid input. So the next best thing is to define a parsing algorithm, like the HTML5 spec. (Shameless plug: vfmd (http://www.vfmd.org/) is one such Markdown spec which specifies an unambiguous way to parse Markdown, with tests and a reference implementation.)

So if "Standard Markdown" is NOT unambiguous and wouldn't be, then it's not a "standard", so calling it "Standard Markdown" is not quite proper.




If you think there are ambiguities, please comment at http://talk.standardmarkdown.com. This is meant to be a provisional spec, up for comment. There is undoubtedly room for improvement.

The C and javascript implementations use a parsing algorithm that we could have simply translated into English and called a spec. (That's the sort of spec vfmd gives.) But it seemed to us that there was value in giving a declarative specification of the syntax, one that was closer to the way a human reader or writer would think, as opposed to a computer.

Re (3): we have an asterisk which can open emphasis. So, to see if we have emphasis, the rules say to parse inlines sequentially until an asterisk that can close emphasis is reached. The first inline we come to is [b*](url), which is a link. There's no closing asterisk, so we don't have emphasis, but a literal asterisk followed by a link.
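
To make that concrete, here is a rough Python sketch of that left-to-right idea (a toy sketch only to illustrate the principle, not the actual stmd algorithm; the regex-based link matching is a deliberate simplification):

    import re

    def parse_inlines(text):
        # Scan left to right, consuming whichever inline construct matches
        # at the current position. A link is consumed as a single unit, so
        # an "*" inside its link text can never close emphasis opened
        # outside the link.
        link = re.compile(r'\[([^\]]*)\]\(([^)]*)\)')
        tokens, i = [], 0
        while i < len(text):
            m = link.match(text, i)
            if m:
                tokens.append(('link', m.group(1), m.group(2)))
                i = m.end()
            else:
                tokens.append(('char', text[i]))
                i += 1
        return tokens

    print(parse_inlines('*a[b*](url)'))
    # [('char', '*'), ('char', 'a'), ('link', 'b*', 'url')]
    # The opening "*" never finds a closer, so it stays a literal asterisk.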

Re (1): I believe you are right that the case of a reference definition before a setext header line should be clarified. However, the other case seems clear enough. "~~~" starts a fenced code block, which ends with a closing string of tildes or the end of the enclosing container. The underline would be included in that code block either way.

Re (2): I believe the talk of precedence may be misleading here (I thought it would be useful heuristically). The basic principle of inline parsing is to go left to right, consuming inlines that match the specs. This resolves all of these cases. Perhaps the talk of precedence should be removed.

I am no stranger to formal specifications. I wrote what I think was the first PEG grammar for markdown (peg-markdown, which came to be used as the basis for multimarkdown and several other implementations). PEG isn't a good fit, especially for block-level parsing. It almost works for inline-level parsing, but there are some constructs (like code spans) that can't be done in PEGs. It might be worth specifying inline parsing in a pseudo-PEG format to avoid worries like those you've expressed.


jgm, thanks for your comment. I have a lot of respect for your work in Markdown parsing (especially your PEG grammars and Babelmark2), which I find useful and valuable.

I understand what you have now is a provisional spec, but I have reason to believe that a specification based on declaring constructs and defining by examples is never going to get completely unambiguous. A lot of the ambiguity in parsing Markdown lies in the interplay between different syntax constructs. A spec like yours doesn't address them at all, so they remain as ambiguities. All examples of ambiguities I gave involve the interplay of different constructs (more on them below).

> The C and javascript implementations use a parsing algorithm that we could have simply translated into English and called a spec. (That's the sort of spec vfmd gives.)

It's debatable whether translating your code to English is "simple" without talking about memory addresses, pointers and arrays. In any case, vfmd is _not_ such a translation (I'm not saying that you imply that it is). vfmd was first written as a spec, then tests written to match the spec, and then implemented, followed by more tests. (However, the spec did get fixes during testcase development and implementation.)

> But it seemed to us that there was value in giving a declarative specification of the syntax, one that was closer to the way a human reader or writer would think, as opposed to a computer.

I agree there is value in making an easy-to-read syntax description. However, making a readable specification for document-writers and making an unambiguous specification for parser-developers are opposing objectives. The document writer asks "What should I do to get a heading?", while a parser developer asks "How should I interpret a line starting with a hash?". Your spec is good if you target only document writers, but falls short as a spec for parser developers, because of the ambiguities.

In vfmd, I addressed this by creating two documents - one for document-writers and one for parser-developers - that are consistent with each other.

On the specific examples:

> Re (3) ... the rules say to parse inlines sequentially until an asterisk that can close emphasis is reached

Yes, but where does your spec say that an asterisk cannot close emphasis if it's contained within a link? As it stands now, going by the rules in the emphasis part of the spec (section 6.4), it should be treated as emphasis, and going by the rules in the link part of the spec (section 4.7), it should be treated as a link. The spec is silent on corner cases where multiple constructs overlap: Does the leftmost construct always win? What happens if it's not a well-formed link? What if three syntax constructs interleave?

> Re (1): ... ~~~ starts a fenced code block, which ends with a closing string of tildes or the end of the enclosing container. The underline would be included in that code block either way.

Going by the setext headers section of your spec (section 4.3), I'm not at all sure why a "~~~" line followed by a "===" line is not a setext header. Yes, your implementation interprets it as a code block, but your spec is ambiguous on how this _should_ be interpreted.

> Re (2): ... The basic principle of inline parsing is to go left to right, consuming inlines that match the specs. This resolves all of these cases. ...

If the basic principle of inline parsing is to go left to right and if all inline constructs should be parsed like that, then "[not a `link](/foo`)" should be interpreted as a link (which is contrary to Example 240 in your spec). Clearly, code spans should have a higher priority, but that needs more than a couple of examples to define correctly.
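
A quick Python illustration of the two readings (toy regexes, only to show the conflict, not how either spec actually parses):

    import re

    text = '[not a `link](/foo`)'

    # Reading A: pure left-to-right, the link construct is matched first.
    link = re.search(r'\[([^\]]*)\]\(([^)]*)\)', text)
    print(link.group(0))   # '[not a `link](/foo`)' -- the whole thing is a link

    # Reading B: code spans are claimed first (as Example 240 apparently intends).
    code = re.search(r'`([^`]*)`', text)
    print(code.group(0))   # '`link](/foo`' -- the backticks form a code span,
                           # and no well-formed link is left over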

This principle also looks contrary to your reply to (3) above, where you say "*a[b*](url)" is a link, not emphasis.

As noted above, the problem with a declarative spec for Markdown is the ambiguity (which is quite similar to the ambiguity in defining Markdown as a CFG, for example). As long as the spec is declarative, there will be multiple ways of interpreting an input (which, ironically, was the problem that parser-developers found with John Gruber's original Markdown syntax description too). Problems like this cannot be completely solved by providing examples because the combinations between the different constructs are too many to list as examples in a spec.

I only listed these items to illustrate the bigger problems in the design or style of the spec itself. Even if these individual items are addressed, there will always be more coming up, so I don't think it would make sense for me to keep finding and reporting ambiguities to your Discourse forum.


Your comments (coming from someone who has actually tackled this surprisingly difficult task) are some of the most valuable we've received; having them on the Discourse forum would be great.

We considered writing the spec in the state machine vein, but I advocated for the declarative style. It may be worth rethinking that and rewriting it, essentially spelling out the parsing algorithm. As you suggest, a parallel document could be created for writers.

I'll need to study your spec further to see what the substantive differences are.


Thanks. Really happy to see that you're open to a complete rewrite of the stmd spec (to a possible algorithm-based style).

I'll be happy to open a post in talk.commonmark.org on the ambiguity problems caused by using a declarative style for the stmd spec. I'll do that once the forum is back (I can't seem to access it right now).

In parallel, I too will try to work out what the syntax differences are between stmd and vfmd. Meanwhile, please see: http://www.vfmd.org/differences/ (in case you haven't already).



I (for one) can't wait to see what the two of you can do together. I highly respect the work you've both done around Markdown, and I think you could easily accomplish your goals (of a consistent and sensible Markdown parsing rule-set) as a team.


> But it's not possible to write a CFG for Markdown because of Markdown's requirement that anything is valid input.

Honest question here: how do CFGs prevent you from parsing anything as valid input? E.g. AFAICT this CFG in BNF accepts anything as valid input (including no input!)

    <s> ::= <x> EOF
          | EOF

    <x> ::= CHAR <x>
          | CHAR

Isn't it just a matter of crafting a proper grammar which covers all cases? I can see why HTML needs other parsing strategies, but it should be easy for Markdown, since anything that doesn't match some other block construct is a paragraph (or a blank line, i.e. a paragraph separator).

Also, isn't there a compromise between HTML's crazy parsing strategy and a CFG? A formal grammar, even if not context-free.


True, we can write a CFG that can accept any input, but not one that can parse Markdown.

Actually, I should have said it's not possible to write an _unambiguous_ CFG for Markdown.

Say we need to parse emphasis in span elements. "_a_" is em and "__a__" is strong, but "_a", "a_", "__a" and "a__" are normal text. If we write the rules for all these, we end up with a grammar that can generate the same string in many different ways. To determine whether an "_" is the syntax qualifier of an em or just part of normal text, we might have to look ahead an arbitrary number of characters, potentially till the end of the input. This is why it's not possible to write a useful (or unambiguous) CFG for Markdown, and it follows from the requirement to not throw an error on any input.
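
For instance (a deliberately naive Python check that ignores all the real emphasis rules), deciding whether the very first "_" opens emphasis can require scanning all the way to the end of the input:

    def first_underscore_opens_em(text):
        # The first "_" can open emphasis only if a closing "_" turns up
        # somewhere later -- and we may not know that until end of input.
        first = text.find('_')
        return first != -1 and '_' in text[first + 1:]

    print(first_underscore_opens_em('_emphasised_ text'))      # True
    print(first_underscore_opens_em('_plain text, no close'))  # False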

> Also, isn't there a compromise between HTML's crazy parsing strategy and a CFG?

PEGs have been written for Markdown, and they work because PEGs are inherently unambiguous: they resolve alternatives by ordered choice with backtracking. But those PEGs don't handle nested blocks cleanly.

My own HTML5-ish Markdown spec (http://www.vfmd.org/vfmd-spec/specification/) is not as crazy as HTML5's, but admittedly, is not trivial to implement either.


Interesting reply.

> But those PEGs don't handle nested blocks cleanly.

What's the problem exactly?


With PEGs, it's not possible to express arbitrarily nested content directly, so we have to, for example, collate the quoted content of a blockquote and process it separately. Per my understanding (and I might be wrong here), part of the problem is that a PEG includes tokenization, so it's not possible to handle blockquote levels separately in the tokeniser.
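
Roughly, the workaround looks like this (a Python sketch of the idea as I understand it, not taken from any actual PEG-based implementation):

    def parse_blocks(lines):
        # Collect the lines belonging to a blockquote, strip their ">"
        # markers, and feed the result back through the same block parser --
        # a second pass that the PEG itself cannot express.
        blocks, i = [], 0
        while i < len(lines):
            if lines[i].startswith('>'):
                quoted = []
                while i < len(lines) and lines[i].startswith('>'):
                    quoted.append(lines[i][1:].lstrip())
                    i += 1
                blocks.append(('blockquote', parse_blocks(quoted)))
            else:
                blocks.append(('paragraph', lines[i]))
                i += 1
        return blocks

    print(parse_blocks(['> > nested', '> quote', 'plain text']))
    # The nested ">" levels come out as nested blockquote nodes.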

More details on PEG for Markdown:

- from the author of the PEG grammar, who's also the spec-writer of "Common Markdown": http://talk.standardmarkdown.com/t/standard-markdown-formal-...
- from myself: http://www.vfmd.org/introduction/#prior-work

Also, I wrote an expanded explanation of essentially what I said about CFGs and Markdown: http://roopc.net/posts/2014/markdown-cfg/


Well, they are working on it. Rome wasn't built in a day.


It wasn't, but this is not a good way to build it either because (from my previous post):

> A specification-by-example like this would have to keep an ever-growing list of corner cases and give examples for each of them. To get completely unambiguous, the list needs to be very long, and when it gets very long, it becomes unwieldy to handle for an implementer of the spec.


I mean, they've explicitly disclaimed that this is version 1.0, but they've also explicitly claimed that this is complete, which it doesn't seem to be. This spec is extremely repetitive (0-3 spaces is okay, 4 spaces is too much) and doesn't actually follow a "top-down" approach which would resolve things like precedence. What they've released is something more like a testing suite.

Even more troubling, they skipped the chance for some basic innovations which will probably ultimately result in a Standard Markdown 2 spec. So, for example, they are defining Markdown as a mapping to HTML, rather than a mapping to an internal tree structure which can then be serialized to HTML.[1] If you make that change in perspective, then you can have Markdown for other languages too: not just HTML but also literate code in an arbitrary language, for example.
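
For instance (a hypothetical Python sketch of that tree-first view, with made-up node and method names):

    class Emph:
        # A language-neutral tree node; HTML is just one serializer among many.
        def __init__(self, text):
            self.text = text

        def to_html(self):
            return '<em>' + self.text + '</em>'

        def to_latex(self):
            return '\\emph{' + self.text + '}'

    node = Emph('hello')
    print(node.to_html())   # <em>hello</em>
    print(node.to_latex())  # \emph{hello}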

Another innovation which should probably work its way into Markdown as it becomes more of a file format is metadata. It's a little hard to remember, but acceptable metadata tagging was one of the killer features of MP3s, leading ultimately to their global rise. We don't have a good metadata expression for text files, and Markdown's embedded link references are, essentially, a sort of metadata already. Do this before it gets to the W3C so that we can start off a document with a simple

    @author: Chris Drost 

because once the W3C gets a hold of it that's likely to become something more messy like:

    @author[http://www.w3.org/TR/markdown/2/] "Chris\u00a0Drost"

Another nice change would be an implementation-dependent $extra sigil$ for inline text.[2] Some LaTeX math sites use Markdown with precisely this extension for inline equations; it might be nice to say "this is mapped to a $ node, but the meaning of that is dependent on the tree-reader; the HTML reader simply prepends and appends a literal dollar sign to the text without embedding it in a tag."

[1] This isn't a huge change in the language but it's a huge change in perspective. The main decision needed to fix this is to say that the "embedded HTML blocks" should have a special sigil at the beginning which is not the < character of the first tag; those "raw" blocks are then held separately in the Markdown tree, and the serializer to HTML passes the raw blocks through without HTML escaping or embedding in another tag.

[2] Why not just use backticks? We could, of course. One problem, though, is that there is no good way to distinguish literate-code blocks which are commentary from literate-code blocks which are code to be executed. If you don't fix that now, it will probably be fixed in SM2.


They worked on it for two years before going public.



