

Regexes Parse XML Just Fine, Actually - evincarofautumn
http://evincarofautumn.blogspot.com/2011/10/regexes-parse-xml-just-fine-actually.html

======
groovy2shoes
You've already said, "Of course, they’re not regular expressions in the strict
mathematical sense, but rather in the sense with which most of us are
familiar—Perl regexen." And that's just it. Perl's "regular expressions"
aren't regular at all! They match a _superset_ of the regular languages.

What are you going to do with the output of this script? If you want to do
anything useful with XML, you usually build a DOM out of it during the parse.

What do you mean by this? "I would call this a good ... practical solution to
a problem that’s theoretically unsolvable." I'd say that parsing XML is
definitely a solvable (and solved) problem.

~~~
evincarofautumn
The problem is “parsing XML with a regex”, which you can’t do unless you take
“regex” to mean “Perl regex”. This isn’t a useful script; it only serves to
illustrate a point.

~~~
timtadh
Is the point a modernized "You can write Fortran in any language"?

I am getting a little tired of these types of responses to "you can't parse X
with regular expressions." Someone always feels the need to come up with a
tremendously complex solution that mostly works. Usually using backref
matchings, (making the regular expression not so ... regular). Meanwhile, the
author never seems to talk about the parse time. Which, will be exponential.
See <http://swtch.com/~rsc/regexp/> for all the glory details.

Specific points:

> I’m talking about parsing XML, not checking whether some input actually is
> XML. Correctness is a Boolean, after all: invalid XML is not XML.

However, from a particularly pedantic point of view: To parse is to check
whether the input is in your language. If your parser excepts input that is
not XML it is not an XML parser. It is a parser which accepts some language
which largely overlaps with XML, but is not XML.

I know it's an academic point, but it _is_ an important one. When you have
properly parsed your input, you should be sure that the input is what you were
expecting. A parser that excepts a different language than the intended one
can be highly misleading.

So I am left to wonder what is the point of it all. You never say in the
article.

 _tl;dr_

Read the dragon book. Learn what a Language actually is. Study some Chomsky.
Write a recursive descent parser. Lay in the green green grass. Think about
Kurt Gödel. Write a parsing framework for LALR grammars. Think. Then, I
believe, the desire for using regular expressions for in-appropriate purposes
will have left you. Your other tools are just _so much cooler_.

~~~
evincarofautumn
I’ve actually done all of those things, and using regular expressions for
inappropriate purposes remains entertaining, especially for the reactions.
Perhaps you need to revisit the joy of making something work in _completely
the wrong way_.

~~~
SimHacker
It's the job of the XML parser to answer the question "is this properly formed
XML?" -- so you have indeed made something that "works" in completely the
wrong way, by not working correctly.

------
perlgeek
Yes, you can parse XML that way, but you shouldn't. It's more work than you
originally thought.

Example: I've tried two parts of the xml syntax, <?xml version="1.0"?> and
namespaces <foo:bar />, and it didn't seem to parse either. (At least if I
understood the output right, it doesn't say "parsed" or "not parsed").

So it's the typical "author's idea of xml" parser, not xml parser.

~~~
evincarofautumn
It doesn’t handle self-closing tags, but that’s not the point. It would be
trivial to add that, but again, that’s not the point.

The point is that you probably don’t want to do this with regular expressions,
even though, contrary to popular opinion, you pretty much can. The article
even calls the solution “slow and overspecified, but still kinda neat”.

~~~
Ives
Self closing tags are to the best of my knowledge not a part of XML.

~~~
jeffcoat
The XML spec refers to them as "empty-element" tags.

<http://www.w3.org/TR/REC-xml/#sec-starttags>

~~~
Ives
Oh ofcourse, I thought evincarofautumn meant the SGML/HTML thing where you
don't need to close certain tags.

------
winestock
Whenever this topic arises, we should never forget one of the best answers
ever given concerning it.

[http://stackoverflow.com/questions/1732348/regex-match-
open-...](http://stackoverflow.com/questions/1732348/regex-match-open-tags-
except-xhtml-self-contained-tags/1732454#1732454)

~~~
evincarofautumn
This is why I wrote the article.

------
Ives
No one claimed that it's impossible to use regexes in the process of parsing
XML. It is however impossible to parse arbitrary XML with a _single_ regex.

For example: write a regular expression that will return the contents of the
first <div> that is found in the following XML:

For example: <div> whatever <div> more stoff </div> foo </div>

or

<div> whatever foo </div>

You can't because XML is not a regular language and regexes only match regular
languages.

~~~
evincarofautumn
Every specific counterexample can be countered with a specifically tailored
countersolution. If you only need to parse a certain subset of an XML
document, then you don’t need to parse XML, and (ordinary) regular expressions
will work.

Also, the article does say that you can parse well-formed XML with a single
regex, you just can’t validate it that way.

~~~
SimHacker
By "just fine" do you mean "not quite"?

You missed the part about the first job of a parser being "check for correct
syntax":

<http://en.wikipedia.org/wiki/Parser>

Parser

In computing, a parser is one of the components in an interpreter or compiler,
which checks for correct syntax and builds a data structure (often some kind
of parse tree, abstract syntax tree or other hierarchical structure) implicit
in the input tokens. The parser often uses a separate lexical analyser to
create tokens from the sequence of input characters. Parsers may be programmed
by hand or may be (semi-)automatically generated (in some programming
languages) by a tool.

------
wsgrah
We have literally tens-of-thousands of XML documents that had been delivered
with one-off regexes and Perl for almost 20 years. Worked like a champ until
we wanted to do an update to the look and feel of these pages, then spent the
last six years regularizing the SGML and XML, as well as the transforms to
just be able to manage these documents in some way.

------
perfunctory
Jeez, why is this on the home page. This drives me crazy. Please, stop.

~~~
evincarofautumn
I’ve been writing a lot lately, and I’ve gotten in the habit of submitting all
of my blog posts to HN in order to get feedback and see what interests the
people for whom I’m trying to write. It’s on the home page because enough
people thought it was interesting that they upvoted it. If it drives you
crazy, all you have to do to make it go away is refrain from upvoting.

~~~
buster
Could you please stop submitting everything you write, just because you can?
If everyone would do that, it'd be terrible!

~~~
evincarofautumn
There’s a huge difference between “everything I write” and “everything I write
on that blog”, but I agree. The readers here have shown some interest in what
I write, and for that reason alone I continue to submit it for your judgement.
Hopefully my writing will inspire someone to make something cool, or change
their teaching style, or design a better programming language. Besides, this
is the best platform on which I can promote the work I’ve been doing lately,
and I have nothing else to promote: most of the things I find online that are
worthy of posting on HN are things that I found _via_ HN. What else can I do
but continue to throw links into the void and quietly ask that someone care?

------
pohl
Bolting a stack onto your regular language just makes a Frankenstein context
free one, just as bolting a stack onto a DFA just makes a push down automata.
This whole blog post is knocking down a straw man.

------
buster
Can one really call this sort searching parsing? That's certainly somewhat
subjective but this would never be called a "parser" in my world.

Everyone who thinks that <http://www.w3.org/TR/xml/> can be parsed with a few
lines of perl is just mislead. Example: Does the parser handle XML namespaces?

At best it's a trickier way of searching xml files.

~~~
evincarofautumn
All parsing is is discerning structure in flat input, so yes, it’s a parser.

------
gregjor
I've had to do this a few times because PHP's XML parser can't parse Microsoft
Office MSXML. Works good enough for stuff like uploading a list of hotels and
restaurants to a tourism website when the office manager only knows how to use
Excel. And the PCRE code is a lot shorter and simpler than the equivalent XML
parser code.

------
dustingetz
> This isn’t a problem if you assume your input is well-formed. I’m talking
> about parsing XML, not checking whether some input actually is XML.
> Correctness is a Boolean, after all: invalid XML is not XML. ... [proceeds
> to parse HTML]

even validated HTML isn't XML, let alone tag soup.

~~~
evincarofautumn
That string does serve to show what you get when you throw some tag soup at
the thing, though.

