
Parsing HTML in Python with BeautifulSoup - jgrahamc
http://www.jgc.org/blog/2009/11/parsing-html-in-python-with.html
======
spatulon
I think there's an important distinction between telling someone they've
written "poor code" and telling them they're a "poor programmer". Whether the
latter is true or not is irrelevant to the discussion at hand and comes across
as an ad hominem argument.

I more or less agree with your criticism of parsing via regular expressions,
but there was no need to distract us all with insulting Eric (and he should
know better than to respond in kind).

~~~
jgrahamc
One thing I asked myself after I wrote that comment was how I would feel if I
had been on the receiving end. I think there would have been two reactions.

First, I would have had the sort of anger and upset reaction that would have
come with someone describing my code as "looking like it was written by a poor
programmer". I'm sure that I would have wanted to lash out at that person.

But that reaction would have quickly been replaced with deep shame that my
code had been examined by someone and found to be very poor. Once I had
examined the code I would have felt very bad because I had put something out
there of that quality.

~~~
gonzo
esr is a poor programmer. there are dozens of examples.

~~~
jacquesm
Indeed, there are dozens of examples of poor programmers.

------
rhymes
The only problem with BeautifulSoup is that it's kind of slow, but if it's fast
enough for you, go for it; otherwise you can try lxml with its lxml.html
module. See also: [http://blog.ianbicking.org/2008/03/30/python-html-parser-
per...](http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/)
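To make the comparison concrete, here's a rough sketch of the two APIs side by side, assuming both libraries are installed (the modern BeautifulSoup package is named bs4; this thread predates that) and using made-up markup:

```python
from bs4 import BeautifulSoup  # modern package name for BeautifulSoup
import lxml.html

html = '<html><body><p class="intro">Hello</p><a href="/x">link</a></body></html>'

# BeautifulSoup: forgiving and convenient, but pure Python, hence slower
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('p', class_='intro').get_text())      # Hello

# lxml.html: libxml2 underneath, typically much faster on big documents
doc = lxml.html.fromstring(html)
print(doc.find_class('intro')[0].text_content())      # Hello
```

Both tolerate imperfect markup; the speed gap mostly matters once you're parsing lots of pages.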

~~~
jamongkad
Upvoted. I used to dig BeautifulSoup, but its latest release is slower than its
predecessors. That's why I use PyQuery nowadays; it's based on lxml and uses a
jQuery-like API to access the DOM.

~~~
joshu
I'm always terrified that these conversations will unearth a library or
technique that will immediately obsolete most of the code in my current pet
project.

Thanks!

------
ivankirigin
At Tipjoy, I needed to parse pages to find Tipjoy widgets to validate the
owner of the widget. The configuration made sense, but left me in an
undesirable position of parsing lots of pages.

At first I used regexes. I would find bugs, and fix them. The bugs affected
the product by delaying confirmation of content ownership, which stinks.

I noticed that the bugs didn't stop coming, so I switched to BeautifulSoup. It
was faster and better. I highly recommend it for anyone using Python.

------
transmit101
This guy is correct, of course. But he comes across quite rudely. Sometimes it
is better to not be right at all than to be right at any cost.

Actually, I'm a bit surprised that this article has been voted to the top of
HN. It's not particularly interesting or challenging from any perspective that
I can think of.

~~~
jgrahamc
I was a bit rude, but it's not always inappropriate to be rude. Recall that I
was addressing a person who claims, literally, to be a God:
<http://catb.org/~esr/writings/dancing.html>

~~~
statictype
I disagreed with you on the previous thread, but I would agree that if anyone
deserved a harsh reality check on his skills, it would probably be esr:

<http://esr.ibiblio.org/?p=1350#comment-241727>

"On the other hand, every once in while I am reminded that “programmers I’ve
known” are clustered in the top 5% of ability, usually the top 1% of ability."

------
jim_lawless
The author doesn't state which version of Python / BeautifulSoup he's using,
but based on this page, the older versions parse HTML more reliably.

[http://www.crummy.com/software/BeautifulSoup/3.1-problems.ht...](http://www.crummy.com/software/BeautifulSoup/3.1-problems.html)

~~~
jgrahamc
I was using 3.1.0.1

------
pkulak
I've tried to use BeautifulSoup, but I wasn't impressed. Coming from Hpricot
and Nokogiri, it leaves a lot to be desired. Mostly because it's not very
tolerant of bad markup, which is a deal breaker when you're trying to parse
random HTML from around the web. I'm also pretty sure it's a dead project.

------
derwiki
I started using BeautifulSoup a few weeks ago to help write more exact unit
tests for front end design. Instead of saying "make sure this page contains
abcdef", I can say "pull the exact section that abcdef is supposed to be in,
and make sure it's there (or conversely, make sure it's NOT showing up when
it's not supposed to)". If you have access to the HTML code, you can make it a
lot easier on yourself by putting eyecatcher IDs or classes in elements and
starting from there.
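A minimal sketch of that idea, assuming bs4 is installed; `render_page` and the "promo-banner" eyecatcher id are hypothetical stand-ins for whatever your real templates produce:

```python
from bs4 import BeautifulSoup

def render_page(show_promo):
    # stand-in for whatever actually renders the page in your test suite
    promo = '<div id="promo-banner">Sale today!</div>' if show_promo else ''
    return '<html><body>%s<p>regular content</p></body></html>' % promo

def assert_promo(html, expected):
    # pull the exact section rather than substring-matching the whole page
    soup = BeautifulSoup(html, 'html.parser')
    found = soup.find(id='promo-banner') is not None
    assert found == expected

assert_promo(render_page(True), True)    # banner present when it should be
assert_promo(render_page(False), False)  # and absent when it shouldn't be
```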

------
cookiecaper
I've used BeautifulSoup before and I've gotta say that it was kind of a pain;
certainly easier than writing regexes myself, but I've had a much more
pleasant parsing experience with other methods, like XPath in Scrapy or a
jQuery-style API.

BeautifulSoup is also only kinda-sorta-maintained from what I recall, so it's
probably better to use something else.

~~~
Erwin
I've found this addition to BeautifulSoup useful:
<http://code.google.com/p/soupselect/>

which lets you use CSS expressions to find what you want.
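For context, soupselect adds a `select(soup, expression)` helper on top of BeautifulSoup; newer releases (the bs4 package) ship the same idea as a built-in `.select()` method, which is what this sketch uses on made-up markup:

```python
from bs4 import BeautifulSoup

html = ('<div class="post"><a href="/1">one</a></div>'
        '<div class="aside"><a href="/2">two</a></div>')
soup = BeautifulSoup(html, 'html.parser')

# one CSS expression instead of a chain of nested find() calls
links = soup.select('div.post a')
print([a['href'] for a in links])    # ['/1']
```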

------
DanielBMarkham
It's a shame we don't have something like Beautiful Soup for .NET or C or any
of a dozen other languages -- it would have made this article more applicable.

But the thrust is good -- for easy-to-describe yet tough-to-implement
problems, always steal somebody else's work if you can.

~~~
arihelgason
There's a port for Ruby, Rubyful Soup
<http://www.crummy.com/software/RubyfulSoup/>

Apparently it's slower, though. I've only used the python version, which works
very well if you use it with the old SGMLParser.

~~~
DanielBMarkham
UPDATE:

For those of you doing .NET, the HtmlAgilityPack looks very interesting. You
can search using multiple paradigms such as XPath, XSLT, and Linq.

<http://www.codeplex.com/htmlagilitypack>

------
eli
Another option is to run the source through Tidy with XHTML output and then
treat it as XML.
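A sketch of that pipeline, assuming the HTML Tidy command-line tool is installed as `tidy`; once Tidy has produced well-formed XHTML, the stdlib XML parser is enough:

```python
import subprocess
import xml.etree.ElementTree as ET

def tidy_to_xhtml(messy_html):
    """Run messy HTML through Tidy, asking for well-formed XML output."""
    result = subprocess.run(
        ['tidy', '-asxml', '-quiet', '--show-warnings', 'no'],
        input=messy_html, capture_output=True, text=True,
    )
    return result.stdout

def links_in(xhtml):
    """Extract hrefs from XHTML using only the standard library."""
    # Tidy puts XHTML elements in the XHTML namespace
    ns = {'x': 'http://www.w3.org/1999/xhtml'}
    root = ET.fromstring(xhtml)
    return [a.get('href') for a in root.findall('.//x:a', ns)]
```

Usage would be something like `links_in(tidy_to_xhtml(raw_page))`. The catch is that Tidy can refuse sufficiently mangled input, whereas a tag-soup parser will always hand you some tree.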

------
jgrahamc
The other thing that I wondered about is why he goes to all the trouble of web
scraping when SourceForge (at least) offers an XML export option:
<https://sourceforge.net/export/>

------
nvn1
Is there an equivalent library for PHP? It might save me a lot of time.

~~~
qeorge
Yes, try Simple HTML DOM:

<http://simplehtmldom.sourceforge.net/>

It's a little light on documentation, but has a familiar syntax and handles
malformed HTML. I've used it in a number of projects and it's been great.

------
qeorge
So someone open-sources code they wrote, and the author decides to insult them
for it? And as if being an ass on HN wasn't enough, he regurgitated it on his
blog and then resubmitted _that_ to HN? Ugh.

Parsing malformed HTML is a nightmare; I'm impressed the programmer even gave
it a shot.

~~~
natrius
You're not supposed to give it a shot. That's the point.

~~~
qeorge
I fundamentally disagree with this. If a solution already exists, one
shouldn't try to create his own? Then why did Linus build Linux?

There's also a lot to be said for curiosity. I'm currently building an email
client in my spare time, not because I don't think there are plenty of great
ones already, but because I'm interested in programming with IMAP. I'll
probably open source the final result, and I don't think there's anything
wrong with doing so.

~~~
natrius
If you want to build an HTML parser, go ahead. From the comments in the code,
it's clear that ESR tried to parse HTML because he thought he had to, not
because he wanted to.

~~~
qeorge
Fair enough, good point.

------
jacquesm
The question is: would you have spent that much time picking apart his code if
you had liked the guy?

------
d0m
Yeah, good job destroying his code publicly like that; that's pretty cool and
I hope it makes you feel good. Just send him a patch email with the
BeautifulSoup version instead.

And I can't believe this is ranked first on hacker news.

~~~
Torn
Peer-review and constructive criticism are valuable forces for change and
improvement.

Clearly there's some personal politics going on (insults having been traded),
but jgc does raise some good points: the original code is very tightly coupled
to whatever Berlios outputs, and would incur a large maintainability cost.

It's also brought BeautifulSoup to my attention, which seems quite a neat
utility built _exactly_ for this sort of thing.

For these reasons it's interesting, and is why I've upvoted it.

~~~
jgrahamc
It would be foolish of me to pretend and try to hide my dislike for Eric
Raymond, but it comes down not to a personal problem, but to a problem with
the way he presents himself.

In this case, the juxtaposition of the quality of the code (which was truly
poor) and Raymond's opinion of himself (recall that he claimed to be a Core
Linux Developer at one point) made plain the problem that many people have
with the man.

Interestingly, his response to my criticism was not to say something like "You
are right, but don't be nasty about it". Instead he wrote a blog posting going
on about how right he is. Oddly, I share his concerns about HTML parsing, but
I think he's wrong not to use an HTML parser and to do everything by hand
instead. Having done a lot of screen scraping work, I know that dealing with
all the edge cases is a pain.

And, also, Raymond claims to be very thick-skinned:
<http://catb.org/~esr/writings/take-my-job-please.html>

~~~
statictype
>Oddly, I share his concerns about HTML parsing, but I think he's wrong to not
use an HTML parser and do everything by hand

It looks like he didn't really understand how BeautifulSoup (or, for that
matter, XPath) works. For example, he seems completely unaware of the '//node'
syntax, which would sidestep the issue of encoding structure in the code.
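For illustration, here's the descendant-axis idea using the stdlib ElementTree on well-formed, made-up markup (lxml's full XPath engine takes the bare '//a' form directly):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<html><body><div><table><tr><td>'
    '<a href="/x">deep link</a>'
    '</td></tr></table></div></body></html>'
)

# './/a' matches every <a> at any depth, so the code never has to spell
# out the div/table/tr/td nesting -- layout changes can't break it
links = [a.get('href') for a in doc.findall('.//a')]
print(links)    # ['/x']
```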

~~~
youngian
And in his defense, someone who has not used BeautifulSoup might not realize
how robust it is. I was deeply impressed with it when I used it for a scraping
project - it's really good at handling tag soup _and_ giving you powerful
access to the parse tree. I would not have expected one tool to perform well
in both those areas.

ESR probably just made some assumptions about the capabilities of available
parsing options that would be correct but for a few exceptional tools. Let's
fault him for not checking his assumptions rather than name-calling about poor
programming.

