I think there's an important distinction between telling someone they've written "poor code" and telling them they're a "poor programmer". Whether the latter is true or not is irrelevant to the discussion at hand and comes across as an ad hominem argument.
I more or less agree with your criticism of parsing via regular expressions, but there was no need to distract us all with insulting Eric (and he should know better than to respond in kind).
One thing I asked myself after I wrote that comment was how I would feel if I had been on the receiving end. I think there would have been two reactions.
First, I would have had the sort of anger and upset reaction that would have come with someone describing my code as "looking like it was written by a poor programmer". I'm sure that I would have wanted to lash out at that person.
But that reaction would have quickly been replaced with deep shame that my code had been examined by someone and found to be very poor. Once I had examined the code I would have felt very bad because I had put something out there of that quality.
The irony is that had he actually contributed significant amounts of code to the larger open source projects, he'd be used to this sort of criticism. Hell hath no fury like a maintainer's scorn. Incidentally, I think that's why developers in the major open source projects tend to be pretty good: the commit-crappy-code, get-flamed, fix, re-commit cycle tends to be instructive.
It's the usual response he has; we should probably be used to it by now :)
I think "poor programmer" was a fair comment in the end; the code does demonstrate poor programming (rather than just poor code). On the other hand, "snotty kid" struck me as going too far; questioning programming ability is fair enough - being rude about it isn't nice, but it's at least reasonably acceptable. Personal attacks are just throwing your toys out of the pram.
I used to dig BeautifulSoup, but its latest release is slower than its predecessors. That's why I use PyQuery nowadays: it's based on lxml and uses a jQuery-like API to access the DOM.
I'm always terrified that these conversations will unearth a library or technique that will immediately obsolete most of the code in my current pet project.
Don't quote me on this, but I believe you can get this running by way of IronClad. (AFAIK lxml doesn't run on IronPython without using IronClad right now.)
I forgot what BeautifulSoup couldn't do that made me look for something else, but I've been using html5lib for those purposes these days with good results, especially if you need to output modified HTML.
At Tipjoy, I needed to parse pages to find Tipjoy widgets and validate the owner of each widget. The configuration made sense, but it left me in the undesirable position of parsing lots of pages.
At first I used regexes. I would find bugs, and fix them. The bugs affected the product by delaying confirmation of content ownership, which stinks.
I noticed that the bugs didn't stop coming, so I switched to BeautifulSoup. It was faster and better. I highly recommend it for anyone using Python.
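To illustrate the switch: a hedged sketch of widget detection with BeautifulSoup rather than a regex. The page and the `tipjoy.com` script URL are hypothetical stand-ins (the real widget markup may have differed):

```python
from bs4 import BeautifulSoup

# Hypothetical page embedding a Tipjoy-style widget via a <script> tag.
page = """
<html><body>
  <p>A blog post.</p>
  <script src="http://static.tipjoy.com/widget.js?u=alice"></script>
</body></html>
"""

soup = BeautifulSoup(page, "html.parser")

# Find every <script> whose src points at tipjoy.com. The parser, not a
# hand-rolled regex, deals with attribute order, quoting, and whitespace.
widgets = [tag["src"] for tag in soup.find_all("script", src=True)
           if "tipjoy.com" in tag["src"]]
print(widgets)
```

A regex doing the same job has to anticipate every quoting and spacing variation an author might use; the parser version simply doesn't care.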
This guy is correct, of course. But he comes across quite rudely. Sometimes it is better to not be right at all than to be right at any cost.
Actually, I'm a bit surprised that this article has been voted to the top of HN. It's not particularly interesting or challenging from any perspective that I can think of.
I was a bit rude, but it's not always inappropriate to be rude. Recall that I was addressing a person who claims, literally, to be a God: http://catb.org/~esr/writings/dancing.html
"On the other hand, every once in while I am reminded that “programmers I’ve known” are clustered in the top 5% of ability, usually the top 1% of ability."
Seriously - can someone explain what was so wrong with this comment that it deserves to go down to -3? I'm amused that this comment caused so much... anger?
I've tried to use BeautifulSoup, but I wasn't impressed. Coming from Hpricot and Nokogiri, it leaves a lot to be desired, mostly because it's not very tolerant of bad markup, which is a deal-breaker when you're trying to parse random HTML from around the web. I'm also pretty sure it's a dead project.
I started using BeautifulSoup a few weeks ago to help write more exact unit tests for front-end design. Instead of saying "make sure this page contains abcdef", I can say "pull the exact section that abcdef is supposed to be in, and make sure it's there (or conversely, make sure it's NOT showing up when it's not supposed to)." If you have access to the HTML code, you can make it a lot easier on yourself by putting eyecatcher IDs or classes in elements and starting from there.
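A minimal sketch of that pattern, with made-up `id` values serving as the eyecatchers:

```python
from bs4 import BeautifulSoup

# Toy page under test; "promo" and "footer" are the eyecatcher IDs.
html = """
<div id="promo"><span>abcdef</span></div>
<div id="footer">unrelated text</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Pull the exact section the text should live in, rather than
# grepping the whole page.
promo = soup.find(id="promo")
assert "abcdef" in promo.get_text()

# And conversely: make sure it is NOT showing up where it shouldn't.
assert "abcdef" not in soup.find(id="footer").get_text()
```

The payoff over a substring check on the raw page is that a test only passes when the content lands in the right structural slot, not merely somewhere in the output.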
The other thing that I wondered about is why he goes to all the trouble of web scraping when SourceForge (at least) offers an XML export option: https://sourceforge.net/export/
So someone open-sources code they wrote, and the author decides to insult them for it? And as if being an ass on HN wasn't enough, he regurgitated it on his blog and then resubmitted that to HN? Ugh.
Parsing malformed HTML is a nightmare, I'm impressed the programmer even gave it a shot.
I fundamentally disagree with this. If a solution already exists, one shouldn't try to create one's own? Then why did Linus build Linux?
There's also a lot to be said for curiosity. I'm currently building an email client in my spare time, not because I don't think there are plenty of great ones already, but because I'm interested in programming with IMAP. I'll probably open source the final result, and I don't think there's anything wrong with doing so.
If you want to build an HTML parser, go ahead. From the comments in the code, it's clear that ESR tried to parse HTML because he thought he had to, not because he wanted to.
I've used BeautifulSoup before and I've gotta say that it was kind of a pain; certainly easier than writing regexes myself, but I've had a much more pleasant parsing experience with other approaches, like XPath in Scrapy or a jQuery-style API.
BeautifulSoup is also only kinda-sorta-maintained from what I recall, so it's probably better to use something else.
It's a shame we don't have something like Beautiful Soup for .NET or C or any other of a dozen languages -- it would have made this article more applicable.
But the thrust is good -- for easy-to-describe yet tough-to-implement problems, always steal somebody else's work if you can.
Yeah, good job destroying his code publicly like that; that's pretty cool and I hope it makes you feel good. Just send him a patch email with the BeautifulSoup version instead.
And I can't believe this is ranked first on hacker news.
Peer-review and constructive criticism are valuable forces for change and improvement.
Clearly there's some personal politics going on (insults having been traded), but jgc does raise some good points: the original code is very tightly coupled to whatever Berlios outputs, and would incur a large maintainability cost.
It's also brought BeautifulSoup to my attention, which seems quite a neat utility built exactly for this sort of thing.
For these reasons it's interesting, and is why I've upvoted it.
It would be foolish of me to try to hide my dislike for Eric Raymond, but it comes down not to a personal problem but to a problem with the way he presents himself.
In this case, the juxtaposition of the quality of the code (which was truly poor) and Raymond's opinion of himself (recall that he claimed to be a Core Linux Developer at one point) made plain the problem that many people have with the man.
Interestingly, his response to my criticism was not to say something like "You are right, but don't be nasty about it". Instead he wrote a blog posting going on about how right he is. Oddly, I share his concerns about HTML parsing, but I think he's wrong not to use an HTML parser and to do everything by hand instead. Having done a lot of screen-scraping work, I know that dealing with all the edge cases is a pain.
>Oddly, I share his concerns about HTML parsing, but I think he's wrong to not use an HTML parser and do everything by hand
It looks like he didn't really understand how BeautifulSoup (or for that matter, xpath) works. For example, he seems completely unaware of the '//node' syntax which would completely sidestep the issue of encoding structure in the code.
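To show the descendant-search idea concretely without depending on lxml, here's a sketch using the stdlib's `xml.etree.ElementTree`, whose `.//tag` path is the same idea as XPath's `//tag` (the XML shape here is invented for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical project-stats export, several levels deep.
xml = """
<export>
  <project><stats><downloads>1200</downloads></stats></project>
  <project><stats><downloads>34</downloads></stats></project>
</export>
"""
root = ET.fromstring(xml)

# './/downloads' matches <downloads> at any depth, like XPath '//downloads'.
# Nothing about the intermediate <project>/<stats> structure appears in code.
counts = [int(d.text) for d in root.findall(".//downloads")]
print(counts)  # → [1200, 34]
```

That's the point of `//node`: if the site wraps the data in one more layer of markup tomorrow, this query still finds it, whereas a path (or regex) encoding the full structure breaks.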
And in his defense, someone who has not used BeautifulSoup might not realize how robust it is. I was deeply impressed with it when I used it for a scraping project - it's really good at handling tag soup and giving you powerful access to the parse tree. I would not have expected one tool to perform well in both those areas.
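A quick sketch of the tag-soup handling, using deliberately broken markup (the input here is contrived, and the stdlib `html.parser` backend is used so nothing beyond `bs4` is required):

```python
from bs4 import BeautifulSoup

# Deliberately awful markup: unclosed <b> and <i>, unquoted attribute value.
soup = BeautifulSoup("<b><i>tag soup<p class=msg>still works", "html.parser")

# BeautifulSoup builds a usable tree anyway, and you still get structured
# access to it -- no regex contortions needed.
para = soup.find("p", class_="msg")
print(para.get_text())  # → still works
```

That combination -- surviving broken input while still offering real tree navigation -- is exactly what makes it suited to scraping pages you don't control.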
ESR probably just made some assumptions about the capabilities of available parsing options that would be correct but for a few exceptional tools. Let's fault him for not checking his assumptions rather than name-calling about poor programming.
But in reality, his ego is quite fragile. My theory is that the left-over dregs of being a kid with CP have left him a) prone to overstatement and b) quite shy of being challenged.
I actually got him to admit (in public, on his blog) that half of all he claims is untrue.
Granted, esr's a bit of a straw man here because it's clear he rarely if ever has to parse HTML or deal with it at all, but his counterattack at John's corrections was way, way off base. Watching a programmer as prolific as esr defend hand-parsing HTML with trivially fragile regexes is painful and downright surprising; worse, he defended it with a weak and garbled argument about "semantics". There's a disconnect here... :)