
Stackoverflow, HTML by Regex, topmost answer - s2r2
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
======
jgrahamc
First: please don't editorialize in the title. The "Got to see this" smacks of
other places I will not mention.

Second: this is not at all interesting. The person asks a sensible question
and then gets some ridiculous replies.

Third: it made me remember my spat with ESR about HTML parsing:
<http://news.ycombinator.com/item?id=923775> and now I feel sad.

~~~
pclark
I really dislike all this armchair moderation of Hacker News. I like the
occasional commentary in submission titles, and I also like the fact that the
community decides what is or isn't interesting, like the ~60 people who found
this article interesting.

~~~
jgrahamc
The trouble is "Got to see this" is purely the judgement of the submitter and
not of the community. The 60 upvotes are the community judgement.

~~~
CodeMage
_The trouble is "Got to see this" is purely the judgement of the submitter_

Yes, just like "Second: this is not at all interesting" is your judgment in
your comment. I know we all want to keep HN different from reddit, but a
little tolerance is good.

------
euroclydon
Some person explained that HTML is a Chomsky type 2 grammar and regular
expressions are a Chomsky type 3 grammar, and provided this link:
<http://en.wikipedia.org/wiki/Chomsky_hierarchy>

Can anyone here provide a link that makes the discussion of these typed
grammars available to laymen?

~~~
bad_user
Oh, I don't know if reducing the grammars to Chomsky is really necessary.

Regular expressions, in their original version, are equivalent to Finite-state
Machines (i.e. ... regular grammars, no recursion, no stack, no memory beyond
the current state). You can't describe the rules of HTML with a
FSM.

Perl's regular expressions contain various enhancements. Newer versions of
Perl's regexes also contain direct support for recursion (but frankly, you
can't call those "regular expressions" anymore).

So ... if your regex library has recursion support, then you can parse HTML
(since with recursion you can parse context-free / Chomsky type-2 grammars).
If it doesn't support recursion, then you can't.

Btw ... the equivalent for a context-free grammar would be a Push-down
Automaton ... <http://en.wikipedia.org/wiki/Pushdown_automaton> , which is a
FSM + a stack.
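
To make the "FSM + a stack" point concrete, here is a minimal Python sketch (not from the thread). The name `is_balanced` and the toy tag syntax (bare `<name>` and `</name>` only, no attributes, comments, or void elements) are assumptions for illustration. The regex only tokenizes individual tags, which is the regular part; the explicit stack does the nesting check that a plain FSM cannot.

```python
import re

# Toy tokenizer: matches <name> or </name> only. Recognizing a single
# tag like this is a regular problem, so a plain regex is enough here.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)>")

def is_balanced(text):
    """Return True if every open tag is closed in properly nested order."""
    stack = []  # the extra memory a finite-state machine does not have
    for match in TAG.finditer(text):
        closing, name = match.group(1), match.group(2).lower()
        if not closing:
            stack.append(name)          # open tag: remember it
        elif not stack or stack.pop() != name:
            return False                # close tag with no matching open tag
    return not stack                    # leftover open tags mean unbalanced

print(is_balanced("<div><p>hi</p></div>"))
print(is_balanced("<div><p>hi</div></p>"))
```

The stack can grow without bound as the nesting gets deeper, which is exactly what a fixed, finite set of states cannot track.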

------
friism
For more discussion (of a blog-post about the answer by Atwood), see here:
<http://news.ycombinator.com/item?id=944673>

------
man1sh
I think this is pretty old and has been discussed everywhere many times. Check
the date of the question/answer too.

------
albertzeyer
Ehm, I wonder a bit: the discussion always goes that HTML is not regular. The
poster, though, asked just to match any open tags. The language of HTML tags
clearly is regular, isn't it?

~~~
jauco
For one thing: <br> is valid html, and this too:

    
    
        <!DOCTYPE html>
        <html>
        <head>
           <title>I AM YOUR DOCUMENT TITLE REPLACE ME</title>
        </head>
        <body>
           <div>
              <br id="<bl>">
           </div>
        </body>
        </html>

~~~
albertzeyer
All kinds of ways you could write the br-tag are regular.

~~~
jauco
No, because as the example shows I can put ordinary HTML in the id attribute.
Once you do nesting, it's not regular anymore.

~~~
albertzeyer
You cannot nest the quotes. I.e. this is invalid:

    
    
        <br id="<br id="<br>">">
    

I.e. it doesn't matter what is inside the quotes.
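
A rough Python sketch of that argument (not from the thread; the pattern and the name `open_tags` are assumptions, and this is not a full HTML tokenizer). Since quoted attribute values cannot nest, an open tag can be described as a name followed by any mix of quoted strings and characters that are neither a quote nor `>`:

```python
import re

# "<", a tag name, then any run of: a double-quoted string, a
# single-quoted string, or a character that is not a quote or ">",
# and finally the closing ">". No recursion is needed.
OPEN_TAG = re.compile(r"""<[a-zA-Z][a-zA-Z0-9]*(?:"[^"]*"|'[^']*'|[^'">])*>""")

def open_tags(text):
    """Return every substring that looks like an open tag."""
    return OPEN_TAG.findall(text)

print(open_tags('<div><br id="<bl>"></div>'))
```

On the example document above this treats `<br id="<bl>">` as a single tag, because the `<bl>` is consumed as part of the quoted value. The sketch does not exclude self-closing tags, and it ignores comments, CDATA sections, and script contents, where a `<` can appear without starting a tag.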

------
jvdh
> HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq​uid pain

Just reading that makes me wince.

------
arethuza
I'm tempted to change our bug tracker to have a bug status of "it is too late
it is too late we cannot be saved".

~~~
gvb
The terse term is "overcome by events" (OBE), applied literally.

------
buro9
The revisions for it are pretty cool too:
<http://stackoverflow.com/posts/1732454/revisions>

Someone slightly not getting the joke edited it out on the basis of it being
troll/rambling, then someone put it all back. The nice bit... the actual point
is emphasised as a result.

------
ars
What is necessary to make regex Turing complete?

~~~
steveklabnik
You need to be able to provide 'context.' I linked this above, but I'll link
it to you, too:
[http://www.reddit.com/r/programming/comments/cm02a/you_cant_...](http://www.reddit.com/r/programming/comments/cm02a/you_cant_parse_html_with_regular_expressions/c0tjedt)

------
earcar
Got to read that in a Cylon Hybrid or GLaDOS voice.

------
ilkhd2
Can anybody suggest fiction authors who write in this style: sane text
morphing into gibberish and probably back to sanity?

~~~
friism
The prose is almost certainly inspired by H. P. Lovecraft. "At the Mountains
of Madness" may be a good start, if you like this sort of stuff:
[http://www.dustylibrary.com/horror/5-at-the-mountains-of-mad...](http://www.dustylibrary.com/horror/5-at-the-mountains-of-madness.html)

~~~
steveklabnik
Yes, the entire Zalgo meme is vaguely inspired by Lovecraft.

