Reading it, it looks like a total hack job by a poor programmer. For example, HTML parsing is done by a bunch of regular expressions, which include stuff like
# Yes, Berlios generated \r<BR> sequences with no \n
text = text.replace("\r<BR>", "\r\n")
# And Berlios generated doubled </TD>s
text = text.replace("</TD></TD>", "</TD>")
Also, then you get stuff like:
# First, strip out all attributes for easier parsing
text = re.sub('<TR[^>]+>', '<TR>', text, re.I)
text = re.sub('<TD[^>]+>', '<TD>', text, re.I)
text = re.sub('<tr[^>]+>', '<TR>', text, re.I)
text = re.sub('<td[^>]+>', '<TD>', text, re.I)
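Worth noting in passing: re.sub's fourth positional argument is count, not flags, so the re.I above is silently treated as a replacement count and the matches stay case-sensitive. A minimal sketch of a single case-insensitive call that covers both spellings (the sample string is invented for illustration):

```python
import re

text = '<tr class="x"><TD width="10">cell</TD></tr>'

# One case-insensitive substitution covers <TR>, <tr>, <Tr>, etc.
# Flags must be passed by keyword, because the fourth positional
# argument of re.sub is count, not flags.
text = re.sub(r'<TR[^>]+>', '<TR>', text, flags=re.IGNORECASE)
text = re.sub(r'<TD[^>]+>', '<TD>', text, flags=re.IGNORECASE)
print(text)  # <TR><TD>cell</TD></tr>
```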
Hmm. The rest of the code is really pretty crappy. He could have just read the thing into lxml and done XPath queries to extract the data. That would have had the advantage of making clear which parts of the page he was extracting, and it would have made maintenance easy.
In your post you say:
So instead, I walk through the page looking for anything that looks like a hotlink wrapping literal text of the form #[0-9]. Actually, that oversimplifies; I look for a hotlink with a specific pattern of URL in a hotlink that I know points into the site bugtracker.
I don't have any problem with you using regexps to identify the particular fragments you are looking for; my criticism is that you use them for the HTML parsing itself. For example, in your code you use a regexp like this
This would not happen if you parsed the HTML into a DOM tree and then ran queries against it. You could quickly extract all the <A> tags in the page with a //A query (or just those that have an HREF) and get the actual HREF robustly. Or you could skip XPath and use a parser that does callbacks with robust lists of attributes.
Doing that would be robust both against changes in page structure and against changes in the attributes, or the placement of attributes, in the <A>.
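For what that second option looks like in practice, here's a minimal sketch using Python's stdlib html.parser (the markup and bug URL are invented for illustration). The parser normalizes tag and attribute names to lowercase and hands over a clean attribute list, so tag case and attribute order stop mattering:

```python
from html.parser import HTMLParser

class LinkFinder(HTMLParser):
    """Collect the href of every <a>, however the tag is cased."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser hands us lowercased tag/attribute names and a
        # clean (name, value) list -- no regexp fragility.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.hrefs.append(value)

finder = LinkFinder()
finder.feed('<TABLE><TR><TD>'
            '<A CLASS="x" HREF="/bugs/?bug_id=42">#42</A>'
            '</TD></TR></TABLE>')
print(finder.hrefs)  # ['/bugs/?bug_id=42']
```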
PS Is your name-calling really necessary? In your post you refer to me as a 'snotty kid' and a 'fool'.
PPS Also you say:
If I did what my unwise critic says is right, I’d take a parse tree of the entire page and then write a path query down to the table cell and the embedded link.
I never said that. I said that you could use XPath queries to extract the data; I never implied that you use a complete rooted path to get where you want in the document. You've twisted what I actually said to fit your 'structure and meaning' blog post.
Well, you said his code looks like a hack job by a poor programmer. That's fairly insulting too.
About the actual mechanics of screen scraping, I guess your only issue is that he could have used an HTML parser to sanitize the page before querying it.
Well, it does. Specifying a case-insensitive match and then handling both cases yourself reeks of someone who's never parsed anything before. The second call won't even do anything; its work has already been done! And even if the language or the libraries worked that way, what about mixed case? Let alone parsing HTML with regexps in the first place. Python has BeautifulSoup for not-well-formed documents.
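The mixed-case point is easy to demonstrate. In the quoted calls, re.I actually lands in re.sub's count slot, so both substitutions are case-sensitive and a mixed-case tag (the sample here is invented) slips past both:

```python
import re

text = '<Tr class="x"><TR align="l">cell'
# Reproducing the quoted approach: re.I here is the *count*
# argument (re.I == 2), so each call is case-sensitive and
# replaces at most two occurrences.
text = re.sub('<TR[^>]+>', '<TR>', text, re.I)
text = re.sub('<tr[^>]+>', '<TR>', text, re.I)
print(text)  # <Tr class="x"><TR>cell -- the mixed-case tag survives
```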
It's reasonable from that code snippet to infer the author understands neither the language he's using nor the problem domain he's working in.
Well, does the program actually work as intended? If so, that's a pretty good indication that the author does understand the language and the problem domain, even if the code isn't pristine.
I parse HTML with regexps when I know it's been machine-generated with a certain structure and I'm only interested in pulling out the bits using re.search(r'blah(%s)blah') that got put in there with a printf "blah$(x)blah".
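In that machine-generated case a regexp really is the right tool, since the surrounding text is fixed by the producer. A tiny sketch under those assumptions (the field name is invented):

```python
import re

# The producer emitted this with something like printf("bug_id=%s;\n", x),
# so the text around the value is constant and a regexp is safe.
line = 'bug_id=1234;'
m = re.search(r'bug_id=(\d+);', line)
if m:
    print(m.group(1))  # 1234
```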
You're going to be tarring a lot of not-incorrect code with that brush of yours.
I fail to see the difference between that and calling someone a poor programmer based on 1000 lines of prototype code.
Likewise, someone looking at an abstract painting might say "that looks like it was painted by a 5 year old", which doesn't mean they're calling the actual painter a 5 year old. It's commentary on the work.
Or is that what you meant by 'bazaar-style development'?
Too many egos getting in the way here guys. This is an all right project. Don't let disagreements on how to do things make it all fail. There are always better ways to do things and there are prototypes that do a lot of wrong things.
Now's the time to make the bad parts good. The code was just released, okay, now make it right.
Isn't the whole point of open source to show the code and let community involvement make it better?
Since you have a lot of experience with HTML parsing, it'd be better to contribute that experience to the project than to bash someone for not doing something as well as you could have.
CML2 is yet another, though it failed to gain acceptance by the kernel group.
The GNU ncurses library was maintained by Mr. Raymond for a while.
fetchmail wasn't an original idea, it came out of UC Berkeley's "popclient".
ESR has also famously over-claimed his experience (several times).
Truth is, he is loud, but... he's not very good at writing software.
He's not bad at books, provided he has an editor working with him. The writing on his blog is a mess.
Jgrahamc commented that ESR's code here is pretty crappy. I agree. Fnid comes in and accuses jgrahamc of being too harsh. It appears to my mind's eye that Fnid doesn't know who ESR is and naively assumes that he is some random newbie who would appreciate help, rather than a notorious figure with a number of projects and books under his belt. I'm pointing out that it is not wrong for jgrahamc to call him on it. If ESR's reputation is mostly self-publication, then I am even more correct.
To further elaborate: if Bjarne Stroustrup wrote a C compiler that was ineptly coded and someone said the code was lousy, one would not reply, 'lighten up, this was a good first try for a newbie,' unless one did not know who Bjarne Stroustrup was.
Unsurprisingly, most programmers are poor.