Parsing HTML in Python with BeautifulSoup (jgc.org)
86 points by jgrahamc 2186 days ago | 52 comments



I think there's an important distinction between telling someone they've written "poor code" and telling them they're a "poor programmer". Whether the latter is true or not is irrelevant to the discussion at hand and comes across as an ad hominem argument.

I more or less agree with your criticism of parsing via regular expressions, but there was no need to distract us all with insulting Eric (and he should know better than to respond in kind).

-----


One thing I asked myself after I wrote that comment was how I would feel if I had been on the receiving end. I think there would have been two reactions.

First, I would have had the sort of anger and upset reaction that would have come with someone describing my code as "looking like it was written by a poor programmer". I'm sure that I would have wanted to lash out at that person.

But that reaction would have quickly been replaced with deep shame that my code had been examined by someone and found to be very poor. Once I had examined the code I would have felt very bad because I had put something out there of that quality.

-----


The irony is that had he actually contributed significant amounts of code to the larger open source projects, he'd be used to this sort of criticism. Hell hath no fury like a maintainer's scorn. Incidentally, I think that's why developers in the major open source projects tend to be pretty good: the cycle of committing crappy code, getting flamed, fixing it, and committing again tends to be instructive.

-----


esr is a poor programmer. there are dozens of examples.

-----


Indeed, there are dozens of examples of poor programmers.

-----


http://news.slashdot.org/article.pl?sid=99/12/10/0821224&...

-----


>and he should know better than to respond in kind

It's the usual response he has; we should probably be used to it by now :)

I think "poor programmer" was a fair comment in the end; the code does demonstrate poor programming (rather than just poor code). On the other hand, "snotty kid" struck me as going too far. Questioning programming ability is fair enough, and being rude about it isn't nice but is at least reasonably acceptable. Personal attacks are just throwing toys out of the pram.

-----


The only problem with BeautifulSoup is that it's kind of slow. If it's fast enough for you, go for it; otherwise you can try lxml with its lxml.html module. See also: http://blog.ianbicking.org/2008/03/30/python-html-parser-per...

-----


I used to dig BeautifulSoup. Its latest release is slower than its predecessors. That's why I use PyQuery nowadays; it's based on lxml and uses a jQuery-like API to access the DOM.

-----


I'm always terrified that these conversations will unearth a library or technique that will immediately obsolete most of the code in my current pet project.

Thanks!

-----


I highly recommend PyQuery. It's basically the jQuery API on top of lxml.

-----


any chance of this running in IronPython?

-----


Don't quote me on this, but I believe you can get this running by way of IronClad. (AFAIK lxml doesn't run on IronPython without using IronClad right now.)

http://www.resolversystems.com/products/ironclad/

-----


I forgot what BeautifulSoup couldn't do that made me look for something else, but I've been using html5lib for those purposes these days with good results, especially if you need to output modified HTML.

-----


At Tipjoy, I needed to parse pages to find Tipjoy widgets in order to validate the owner of each widget. The configuration made sense, but it left me in the undesirable position of parsing lots of pages.

At first I used regexes. I would find bugs, and fix them. The bugs affected the product by delaying confirmation of content ownership, which stinks.

I noticed that the bugs didn't stop coming, so I switched to BeautifulSoup. It was faster and better. I highly recommend it for anyone using Python.
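To illustrate why the regex bugs keep coming, here is a minimal sketch using only Python's standard-library html.parser as a stand-in for a real HTML parser such as BeautifulSoup. The widget markup, the class name "widget", and the data-owner attribute are all hypothetical; the actual Tipjoy embed format is not given in this thread.

```python
import re
from html.parser import HTMLParser

# Hypothetical widget markup (assumption: not the real Tipjoy embed code).
PAGE_A = '<div><a class="widget" data-owner="alice" href="/tip">Tip</a></div>'
# The same widget with attributes reordered, single quotes, and an upper-case tag.
PAGE_B = "<div><A href='/tip' data-owner='alice' class='widget'>Tip</A></div>"

# A typical first-attempt regex: it silently assumes attribute order and quote style.
NAIVE = re.compile(r'<a class="widget" data-owner="([^"]+)"')


def owners_by_regex(html):
    return NAIVE.findall(html)


class WidgetFinder(HTMLParser):
    """Collects data-owner values from <a> tags whose class list contains "widget"."""

    def __init__(self):
        super().__init__()
        self.owners = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names and strips the quotes.
        attr = dict(attrs)
        if tag == "a" and "widget" in (attr.get("class") or "").split():
            owner = attr.get("data-owner")
            if owner:
                self.owners.append(owner)


def owners_by_parser(html):
    finder = WidgetFinder()
    finder.feed(html)
    finder.close()
    return finder.owners
```

The regex finds the owner in PAGE_A but misses the equivalent markup in PAGE_B, while the parser-based version handles both, which is roughly the class of bug that goes away when you stop matching on the raw text.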

-----


This guy is correct, of course. But he comes across quite rudely. Sometimes it is better to not be right at all than to be right at any cost.

Actually, I'm a bit surprised that this article has been voted to the top of HN. It's not particularly interesting or challenging from any perspective that I can think of.

-----


I was a bit rude, but it's not always inappropriate to be rude. Recall that I was addressing a person who claims, literally, to be a God: http://catb.org/~esr/writings/dancing.html

-----


I disagreed with you on the previous thread, but I would agree that if anyone deserved a harsh reality check on his skills, it would probably be esr:

http://esr.ibiblio.org/?p=1350#comment-241727

"On the other hand, every once in while I am reminded that “programmers I’ve known” are clustered in the top 5% of ability, usually the top 1% of ability."

-----


Even worse - it was submitted by the author himself. Can we just flag it and get rid of personal rants on HN?

-----


Seriously - can someone explain what was so wrong with this comment that it deserves to go down to -3? I'm amused that this comment caused so much... anger?

-----


Projection.

-----


The author doesn't state which version of Python / BeautifulSoup he's using, but based on this page, the older versions parse HTML more reliably.

http://www.crummy.com/software/BeautifulSoup/3.1-problems.ht...

-----


I was using 3.1.0.1

-----


I've tried to use BeautifulSoup, but I wasn't impressed. Coming from Hpricot and Nokogiri, it leaves a lot to be desired. Mostly because it's not very tolerant of bad markup, which is a deal breaker when you're trying to parse random HTML from around the web. I'm also pretty sure it's a dead project.

-----


I started using BeautifulSoup a few weeks ago to help write more exact unit tests for front end design. Instead of saying "make sure this page contains abcdef", I can say "pull the exact section that abcdef is supposed to be in, and make sure it is there" (or conversely, make sure it's NOT showing up when it's not supposed to). If you have access to the HTML code, you can make it a lot easier on yourself by putting eyecatcher IDs or classes in elements and starting from there.
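A rough sketch of that kind of test, written against the standard-library html.parser rather than BeautifulSoup so it runs without extra installs. The page content and the IDs "sidebar" and "cart-total" are made up for illustration, and the helper only handles the simple case of one element per ID.

```python
from html.parser import HTMLParser


class SectionText(HTMLParser):
    """Collects the text inside the first element whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.tag = None   # tag name of the matched element
        self.depth = 0    # nesting depth of same-named tags inside the match
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Only same-named tags affect depth, so void tags like <br> are harmless.
            if tag == self.tag:
                self.depth += 1
        elif self.tag is None and dict(attrs).get("id") == self.target_id:
            self.tag = tag
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def section_text(html, target_id):
    """Return the whitespace-normalized text of the element with the given id."""
    parser = SectionText(target_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Hypothetical page: "abcdef" appears in two places, but only one is the right one.
PAGE = """
<div id="sidebar">abcdef is mentioned here by accident</div>
<div id="cart-total"><span>Total:</span> <b>abcdef</b></div>
"""
```

A test can then assert on section_text(PAGE, "cart-total") instead of on the whole page, so a stray "abcdef" in the sidebar no longer makes the test pass by accident.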

-----


Another option is to run the source through Tidy with XHTML output and then treat it as XML.
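For the second half of that approach, a small sketch of treating XHTML as XML with the standard-library ElementTree. The XHTML string below is a hand-written stand-in for what Tidy might emit (Tidy itself is not run here), and the file names are invented. The main thing to watch is that XHTML elements live in a namespace, so lookups need a prefix mapping.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for Tidy's XHTML output (assumption: not real Tidy output).
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Project files</title></head>
<body><ul><li><a href="/a.tar.gz">a</a></li><li><a href="/b.tar.gz">b</a></li></ul></body>
</html>"""

# Prefix "x" is arbitrary; it just maps to the XHTML namespace for path lookups.
NS = {"x": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(XHTML)
title = root.find("./x:head/x:title", NS).text
hrefs = [a.get("href") for a in root.iterfind(".//x:a", NS)]
```

Without the namespace mapping, a plain ".//a" lookup on this document finds nothing, which is an easy mistake to make the first time.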

-----


The other thing that I wondered about is why he goes to all the trouble of web scraping when SourceForge (at least) offers an XML export option: https://sourceforge.net/export/

-----


Is there an equivalent library for PHP? It might save me a lot of time.

-----


Yes, try Simple HTML DOM:

http://simplehtmldom.sourceforge.net/

It's a little light on documentation, but has a familiar syntax and handles malformed HTML. I've used it in a number of projects and it's been great.

-----


I don't think there is, but a way to work around it is to use Tidy to clean up malformed HTML, then use SimpleXML or another parser library.

-----


So someone open-sources code they wrote, and the author decides to insult them for it? And as if being an ass on HN wasn't enough, he regurgitated it on his blog and then resubmitted that to HN? Ugh.

Parsing malformed HTML is a nightmare, I'm impressed the programmer even gave it a shot.

-----


You're not supposed to give it a shot. That's the point.

-----


I fundamentally disagree with this. If an existing solution exists, one shouldn't try to create his own? Then why did Linus build Linux?

There's also a lot to be said for curiosity. I'm currently building an email client in my spare time, not because I don't think there are plenty of great ones already, but because I'm interested in programming with IMAP. I'll probably open source the final result, and I don't think there's anything wrong with doing so.

-----


If you want to build an HTML parser, go ahead. From the comments in the code, it's clear that ESR tried to parse HTML because he thought he had to, not because he wanted to.

-----


Fair enough, good point.

-----


The question is, would you have spent that much time picking apart his code if you had liked the guy?

-----


I've used BeautifulSoup before and I've gotta say that it was kind of a pain; surely easier than writing regex myself, but I've had a much more pleasant parsing experience with other methods, like XPath in scrapy or jQuery API.

BeautifulSoup is also only kinda-sorta-maintained from what I recall, so it's probably better to use something else.

-----


I've found this addition to BeautifulSoup useful: http://code.google.com/p/soupselect/

which lets you use CSS expressions to find what you want.

-----


It's a shame we don't have something like Beautiful Soup for .NET or C or any other of a dozen languages -- it would have made this article more applicable.

But the thrust is good -- for easy-to-describe yet tough-to-implement problems, always steal somebody else's work if you can.

-----


There's a port for Ruby, Rubyful Soup http://www.crummy.com/software/RubyfulSoup/

Apparently it's slower, though. I've only used the Python version, which works very well if you use it with the old SGMLParser.

-----


Ruby has two popular HTML parsers, Nokogiri and Hpricot, both of which are awesome and fast.

-----


UPDATE:

For those of you doing .NET, the HtmlAgilityPack looks very interesting. You can search using multiple paradigms such as XPath, XSLT, and LINQ.

http://www.codeplex.com/htmlagilitypack

-----


Yeah, good job destroying his code publicly like that. That's pretty cool and I hope it makes you feel good. Just send him a patch email with the BeautifulSoup version instead.

And I can't believe this is ranked first on hacker news.

-----


Peer-review and constructive criticism are valuable forces for change and improvement.

Clearly there's some personal politics going on (insults having been traded) but jgc does raise some good points: the original code is very tightly coupled to whatever Berlios outputs, and would incur a large maintainability cost.

It's also brought BeautifulSoup to my attention, which seems quite a neat utility built exactly for this sort of thing.

For these reasons it's interesting, and is why I've upvoted it.

-----


It would be foolish of me to pretend and try to hide my dislike for Eric Raymond, but it comes down not to a personal problem, but a problem with the way he presents himself.

In this case, the juxtaposition of the quality of the code (which was truly poor) and Raymond's opinion of himself (recall that he claimed to be a Core Linux Developer at one point) made plain the problem that many people have with the man.

Interestingly, his response to my criticism was not to say something like "You are right, but don't be nasty about it". Instead he wrote a blog posting going on about how right he is. Oddly, I share his concerns about HTML parsing, but I think he's wrong to not use an HTML parser and do everything by hand. Having done a lot of screen scraping work, I know that dealing with all the edge cases is a pain.

And, also, Raymond claims to be very thick-skinned: http://catb.org/~esr/writings/take-my-job-please.html

-----


>Oddly, I share his concerns about HTML parsing, but I think he's wrong to not use an HTML parser and do everything by hand

It looks like he didn't really understand how BeautifulSoup (or for that matter, XPath) works. For example, he seems completely unaware of the '//node' syntax, which would completely sidestep the issue of encoding structure in the code.
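A quick sketch of the difference, using the standard-library ElementTree on well-formed markup as a stand-in (it supports only a subset of XPath, and on an element the descendant search is spelled './/node' rather than '//node'). The two page layouts are hypothetical; the real project page structure is not shown in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed listing page (assumption: not the real site layout).
V1 = ("<html><body><table><tr><td>"
      "<a href='/files/1'>release-1.0</a>"
      "</td></tr></table></body></html>")
# The same content after a redesign wraps the table in an extra <div>.
V2 = ("<html><body><div><table><tr><td>"
      "<a href='/files/1'>release-1.0</a>"
      "</td></tr></table></div></body></html>")


def links_fixed_path(doc):
    # Encodes the full nesting in the path: breaks as soon as the layout changes.
    root = ET.fromstring(doc)
    return [a.get("href") for a in root.findall("./body/table/tr/td/a")]


def links_anywhere(doc):
    # ".//a" matches <a> elements at any depth below the root.
    root = ET.fromstring(doc)
    return [a.get("href") for a in root.findall(".//a")]
```

The fixed path stops finding the link once the extra div appears, while the descendant search returns the same result for both layouts.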

-----


And in his defense, someone who has not used BeautifulSoup might not realize how robust it is. I was deeply impressed with it when I used it for a scraping project - it's really good at handling tag soup and giving you powerful access to the parse tree. I would not have expected one tool to perform well in both those areas.

ESR probably just made some assumptions about the capabilities of available parsing options that would be correct but for a few exceptional tools. Let's fault him for not checking his assumptions rather than name-calling about poor programming.

-----


but in reality, his ego is quite fragile. My theory is that the left-over dregs of being a kid with CP has left him a) prone to overstatement and b) quite shy of being challenged.

I actually got him to admit (in public, on his blog) that half of all he claims is untrue.

-----


I can believe that ESR was responsible for many cores on Linux.

-----


Constructive criticism != demolishing someone publicly for a small snippet of code.

+ The word maintenance is incorrectly used there.

-----


Granted, esr's a bit of a strawman here because it's clear he rarely if ever has to parse HTML or deal with it at all, but his counter attack at John's corrections was way, way off base. Watching a programmer as prolific as esr defend hand-parsing HTML with trivially fragile regex is painful and downright surprising, but worse, he defended it with a weak and garbled argument about "semantics". There's a disconnect here.. :)

-----


Is he prolific? Sure, he is famous. But the amount of code he has written is not much, if I remember right.

-----



