
Coding Horror: Parsing Html The Cthulhu Way - Anon84
http://www.codinghorror.com/blog/archives/001311.html
======
michael_dorfman
The original (Zalgo'd) version of the StackOverflow answer is even funnier
than Jeff's excerpt: [http://stackoverflow.com/questions/1732348/regex-match-
open-...](http://stackoverflow.com/questions/1732348/regex-match-open-tags-
except-xhtml-self-contained-tags/1732454#1732454)

~~~
bonaldi
It is, but it doesn't answer the question. The asker isn't asking how to parse
HTML; he wants to match certain tag-like patterns, and not others. He doesn't
define the document or the context, and could plausibly want to remove certain
tags from a non-HTML text document or the like.

There are plenty of potential areas involving HTML tags where you can
manipulate the document with regexes, so long as you're not actually trying to
do a proper parsing job of it. But everyone was so keen to show that they know
you don't use regexes to parse HTML that they blew the actual question away.
The asker did try to point out that he didn't want to parse the HTML, but
still the points kept rolling up for an answer that ... told him not to try
to parse the HTML.

~~~
jordanb
While I agree that using regexes for one-off jobs is not the kind of thing
that would summon Cthulhu, I still think it's the wrong thing to do and that
you would be better off using a parser.

I think, to sum up this whole drama, we have one group saying "you should use
a parser" and the other group saying "but sometimes regexes aren't totally
horrible, and this is just a quick job. I don't want to take the effort to use
the proper tool."

The latter group sees themselves as being pretty reasonable. And perhaps they
are. But the thing that exasperates the former group is that it is generally
_not_ quicker or easier to use regexps -- even for one-off jobs -- than a
nice parser like BS or lxml.

Those tools give you an object view of the document which, as JGC demonstrated,
allows you to express what you want to do concisely and easily, even
(especially) when you're picking just a few tags out of the document.
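
As a rough illustration of that object view (a minimal sketch using Python's
stdlib `html.parser` rather than BS or lxml, whose APIs are richer; the sample
document here is invented):

```python
from html.parser import HTMLParser

# Collect href attributes from <a> tags by walking parse events,
# rather than pattern-matching on the raw markup string.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive already lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

doc = '<p>See <a href="/one">one</a> and <A HREF="/two">two</A>.</p>'
collector = LinkCollector()
collector.feed(doc)
print(collector.links)  # -> ['/one', '/two']
```

Note that the mixed-case `<A HREF=...>` tag is handled for free -- one of the
many variations a hand-rolled regex would have to anticipate explicitly.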

The reason regex hacking on X/HTML documents remains popular isn't that it's
'easy', but that it doesn't require knowing how to use the proper tools, nor
does it require making the conceptual leap from the document as a long string
of unstructured text to the document as a tree structure of objects.

It's a classic "The Wrong Way" choice made by people who don't have a
mechanical or conceptual understanding of the right way.

It's like this. Imagine you come upon someone banging in screws with a hammer.
And lying next to them is a nice power drill with a Phillips bit. You say to
them, "Why aren't you using the drill for that? It'd do a better job and be a
hell of a lot easier too."

And the guy responds "This is a quick job. These screws don't need to hold
very much weight, or for very long. If this was a serious piece of
construction I would totally use the drill, but for my needs this is just
easier."

Now, you might look at him with some incredulity. No way is his way "easier."
It's _much_ harder, on top of doing a worse job. The real reason why he's not
using the drill is, quite obviously, he doesn't know how!

So you say to him "seriously, try the drill" and he keeps on insisting that
his approach is "totally sufficient" (which it may well be, that's not the
point) and that using the drill would be "too much work."

And then that's how we got here in this discussion.

~~~
bonaldi
I agree with you, but I may not have made my point clearly enough: it's not
about "should he use regexes?" -- it's the other classic mistake of _not
clarifying the problem domain first_.

Everybody is assuming he wants to _parse_ this document, which isn't proven
and is mostly unfounded. What if it's the text of a book, including all sorts
of mentalism that will blow up any parser? Nowhere does it say that he's got
valid (or invalid) HTML/XML here.

It's actually as if he said "hey, can I borrow a screwdriver?" and everyone
went "He wants a screwdriver? He must be building a box, and that means nails!
Give him a hammer! A hammer! You can't put in nails with a screwdriver, Zalgo
wouldn't do that!"

So now he's off trying to put his screws in with a hammer, because in the race
for points people wouldn't first ask him to clarify the problem domain and
work out what he's actually trying to achieve.

~~~
jordanb
He narrowed his problem domain to the one suited to parsers when he said he
had a structured document. He did that when he said "HTML."

Regular expressions are a tool for extracting (semi)-structured patterns from
unstructured text. For instance, it is unfortunately the case that we must
treat English text as unstructured. It is not possible (at least not yet) to
build a parser that will be able to correctly discern every syntactical
structure in English text.

So if you want to operate on that text with a computer (say, slicing it into
sentences), then doing something incredibly superficial -- say, looking for
periods -- is the best that can be done. It may need special-casing for
abbreviations, etc., but it would still produce _something_, even when a
parser would get confused because it couldn't find the predicate of half the
sentences.
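
That superficial approach might look something like this sketch (the
abbreviation list and sample text are invented, and real prose will still
defeat it in plenty of cases):

```python
import re

# Deliberately naive sentence splitter: break after . ! or ? followed by
# whitespace and a capital letter, then patch up a few known abbreviations.
ABBREVS = {"Mr.", "Mrs.", "Dr.", "St.", "etc."}

def split_sentences(text):
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    out = []
    for part in parts:
        # Re-join pieces that were wrongly split after an abbreviation.
        if out and out[-1].split()[-1] in ABBREVS:
            out[-1] += " " + part
        else:
            out.append(part)
    return out

text = "Dr. Smith arrived. He sat down. The meeting began!"
print(split_sentences(text))
# -> ['Dr. Smith arrived.', 'He sat down.', 'The meeting began!']
```

It produces _something_ useful, which is the point: for genuinely unstructured
text, heuristics like this are all you have.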

Now, granted, HTML (or SGML) is not the same thing as XML. But the fact is
that HTML is _by definition_ structured. Even bad HTML is still structured,
and it's still machine-parseable, because it has to be renderable by at
least one browser (and therefore understandable by at least one quirks-mode
SGML parser). Even the most technically incompetent designer in the world is
aware that he has to load the page he's butchered into at least one browser
and make sure it still renders.

Trying to use a strict XML parser on HTML will likely end in tears, because it
has a different definition of validity than the quirks-mode SGML parser the
designer used to check his work. But Beautiful Soup is not a strict XML
parser, it's an SGML tool. One that, in my experience, is a pretty good
approximation of the parsers the browsers use.
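
The difference is easy to demonstrate. Here's a hedged sketch using Python's
stdlib `html.parser` as a stand-in for a lenient SGML-ish parser (Beautiful
Soup's API differs), with invented broken markup:

```python
from html.parser import HTMLParser
from xml.etree import ElementTree

# Unclosed tags, mixed case, and an unquoted attribute value.
broken = '<P><FONT color=red>hi<br><b>bold</P>'

# A strict XML parser rejects this outright.
try:
    ElementTree.fromstring(broken)
except ElementTree.ParseError as e:
    print("XML parser choked:", e)

# A lenient HTML parser still recovers what structure it can.
class TagLogger(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

logger = TagLogger()
logger.feed(broken)
print(logger.tags)  # -> ['p', 'font', 'br', 'b']
```

The lenient parser's definition of validity is much closer to what the
browser (and the designer checking his work in it) actually accepts.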

~~~
bonaldi
_He did that when he said "HTML."_

He _didn't_ say that. There's not a single "HTML" anywhere in his question.
(There's an "XHTML", but it's in the title as something he wants to
_exclude_.)

Everyone is making that assumption on his behalf, which is my point.

~~~
jordanb
You're right. :x

It was a while since I looked at the original question.

It would be interesting to hear what he's actually trying to do. Looking
closely at the question, he wants all open tags, but not ones that are self-
closing.

My guess is that he is trying to find HTML elements that have content. That
should, of course, be done with a parser. But you're right that he could be
doing something completely tangential to the structure.
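
If that guess is right, the parser version is short. A hypothetical sketch
(stdlib `html.parser`, invented sample markup) of reporting which elements
actually enclose text content, as opposed to void/self-closing ones:

```python
from html.parser import HTMLParser

# Track open elements; an element "has content" if text appears while
# it is the innermost open element.
class ContentFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []
        self.with_content = []

    def handle_starttag(self, tag, attrs):
        self.stack.append([tag, False])

    def handle_data(self, data):
        if self.stack and data.strip():
            self.stack[-1][1] = True

    def handle_endtag(self, tag):
        if self.stack:
            name, has_text = self.stack.pop()
            if has_text:
                self.with_content.append(name)

finder = ContentFinder()
finder.feed('<div><p>text</p><br/><span></span></div>')
print(finder.with_content)  # -> ['p']
```

The self-closing `<br/>` and the empty `<span>` drop out without any
special-case pattern work.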

------
drtse4
See [http://www.jgc.org/blog/2009/11/parsing-html-in-python-
with....](http://www.jgc.org/blog/2009/11/parsing-html-in-python-with.html)
and <http://news.ycombinator.com/item?id=923775>. By the way, as someone who
has sometimes parsed HTML using regular expressions (for small, localized
HTML snippets extracted from bigger pages) and has dealt with the numerous
patches needed every time something changes (structural changes, or random
\n, \r, etc. added to the source HTML), I definitely agree that an HTML/XHTML
parser is the way to go. The resulting implementation will be cleaner, and the
typical issues can be addressed by slightly changing the parsing code or
fixing bugs in the HTML cleaning/validation section of the parser.

------
JulianMorrison
It's actually really simple.

You can't parse HTML with regex.

You can parse text blobs with regex.

If your input looks like HTML but is really a text blob, you can parse it with
regex.

~~~
btilly
No, it is even simpler than that. Regular expressions are for matching
patterns, not parsing. They don't parse HTML. They don't parse text blobs.
They don't parse anything. They just match.

------
thaumaturgy
Nice, codinghorror has audio ads for Mastercard which cause Safari to hang and
then finally crash.

I loooove it when a site justifies my use of AdBlock.

------
cia_plant
Don't HTML parsers use regexes for their lexical scan phase?

~~~
statictype
I suppose that depends entirely on how the parser was implemented.

You could write a scanner that doesn't use regexes. It could just read in
characters one at a time and hard-code the logic for matching tokens.

On the other hand, I wouldn't be surprised if many of the standard html
parsing libraries use regex libraries internally to match tokens.

If you used a scanner generator like lex, you would specify your scanner using
regexes and lex would generate code that compiles those regexes down into an
automaton. I guess that also counts as using regexes, only they're compiled
at compile-time instead of at runtime.
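
The runtime-regex variant of that idea can be sketched in a few lines (a toy
lexer, not how any particular HTML library actually does it): regexes match
individual tokens, and the nesting is left to a later parse phase.

```python
import re

# One alternation per token kind; the named group that matched tells us
# which kind of token we found. Nesting is NOT handled here -- that's
# the parser's job, which is exactly the point of the distinction.
TOKEN = re.compile(r"""
    (?P<comment>  <!--.*?-->       )
  | (?P<endtag>   </[a-zA-Z][^>]*> )
  | (?P<starttag> <[a-zA-Z][^>]*>  )
  | (?P<text>     [^<]+            )
""", re.VERBOSE | re.DOTALL)

def tokenize(src):
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(src)]

print(tokenize("<p>hi</p>"))
# -> [('starttag', '<p>'), ('text', 'hi'), ('endtag', '</p>')]
```

So "you can't parse HTML with regexes" is compatible with regexes doing the
lexical scan: matching flat token patterns is what they're good at.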

------
PHP-TROLL
This is total bullshit. This is assuming that you are parsing well formatted,
modern HTML.

Anyone who has had to data mine HTML on a regular basis knows regular
expressions are the only way to go because most web pages are clusterfucks of
invalid, font-tag laden, id-less, ancient piles of shit.

Any XML parser I threw at the sites I was mining ran away crying with its
pants soaked with urine. So climb down from your ivory tower, because XML
parsers may be the "correct" way to get data out of HTML, but they are near
worthless for most situations.

~~~
psadauskas
That's because you're using an XML parser to parse HTML. You should use an
HTML parser, like Hpricot or BeautifulSoup.

~~~
PHP-TROLL
Yeah, when I said XML parsers I was speaking generically. I tried many HTML
and XML parsers, as well as HTML sanitizers. They worked maybe 60% of the
time. Hpricot regularly segfaulted; BeautifulSoup didn't even begin to work.
Even if I ran the HTML through tidy first, it would fail to parse.

Most people are ignorant of how bad most of the HTML on the web is; it is
simply unparseable.

~~~
coderdude
>> Most people are ignorant of how bad most of the HTML on the web is; it is
simply unparseable.

And somehow using a regex to guess at all the possible combinations of
"badness" is the correct way to go? You seem lost, my friend. Everything
you're saying is backwards from what someone who actually parses a lot of
HTML "in the wild" would say on the topic. It sounds more like you're trying
to cuss your way into being correct.

~~~
PHP-TROLL
I'm not talking about parsing HTML. I'm talking about getting data out of
HTML, which is completely different.

One doesn't need to guess at all the possible combinations of badness with a
regex; you just need to find the data you need and adjust the regex to the
badness of that particular page.
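
For what it's worth, the approach being described looks something like this
sketch (the target markup is invented; the whole point is that the pattern is
tuned to one page's particular mess):

```python
import re

# Page-specific scraping: a regex adjusted to this one page's markup,
# sloppy casing and all. Fragile by design -- it breaks when the page changes.
page = '<FONT size=2>Price: <b>$19.99</B></font> <font>Price: $5'
prices = re.findall(r"Price:\s*(?:<b>)?\$([\d.]+)", page, re.IGNORECASE)
print(prices)  # -> ['19.99', '5']
```

It works on exactly this page, which is both its appeal and its maintenance
cost: every structural change means readjusting the pattern.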

Do you know anyone who mines data out of HTML "in the wild"? I didn't think
so.

~~~
DrJokepu
Uh, Google?

~~~
coderdude
I know, I know. It blows your mind. I think we're feeding the troll.

------
zyb09
<obligatory anti-Jeff-Atwood comment>

~~~
krakensden
the title was good though

