Reading it, it looks like a total hack job by a poor programmer. For example, HTML parsing is done by a bunch of regular expressions, which include stuff like
# Yes, Berlios generated \r<BR> sequences with no \n
text = text.replace("\r<BR>", "\r\n")
# And Berlios generated doubled </TD>s
text = text.replace("</TD></TD>", "</TD>")
All the pain of maintaining these little cases could have been removed by simply running the page being walked through an actual HTML parser that produces a DOM tree. Similarly, the function dehtmlize contains a limited set of HTML attributes that it will convert (e.g. it does not convert ).
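For what it's worth, here's a minimal sketch (the tag soup is invented) of how a real parser soaks up exactly the malformed markup those replace() calls patch by hand:
import lxml.html

# Made-up sample with the Berlios quirks quoted above: \r<BR> with no \n,
# and a doubled </TD>. lxml's lenient HTML parser normalizes both unaided.
page = '<table><tr><td>bug #42\r<BR></td></td><td>open</td></tr></table>'
tree = lxml.html.fromstring(page)
print([td.text_content() for td in tree.xpath('//td')])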
Also, then you get stuff like:
# First, strip out all attributes for easier parsing
text = re.sub('<TR[^>]+>', '<TR>', text, re.I)
text = re.sub('<TD[^>]+>', '<TD>', text, re.I)
text = re.sub('<tr[^>]+>', '<TR>', text, re.I)
text = re.sub('<td[^>]+>', '<TD>', text, re.I)
Why have you got four expressions there? They are all doing case-insensitive matching, yet there are upper- and lowercase versions of each.
Hmm. The rest of the code is really pretty crappy. He could have just read the thing into lxml and done XPath queries to extract the data. That would have had the advantage of making it clear which parts of the page he was extracting, and it would have made maintenance easy.
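To make that concrete, a rough sketch of what I mean -- the URL and the table layout are hypothetical, the point is the queries:
import lxml.html

# Hypothetical tracker URL; lxml.html.parse accepts a URL directly.
tree = lxml.html.parse('https://developer.berlios.de/bugs/?group_id=1234')
# One XPath per thing extracted makes it obvious which parts of the page
# the scraper actually depends on.
for row in tree.xpath('//table//tr[td]'):
    print([td.text_content().strip() for td in row.xpath('td')])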
Your reply doesn't address why you do the things I highlighted in my original comment. For example, in the middle of your generic table parser you have code specific to Berlios, making it non-generic. That's there because you haven't actually parsed the HTML; you are trying to hack around it with regexps.
In your post you say:
So instead, I walk through the page looking for anything that looks like a hotlink wrapping literal text of the form #[0-9]. Actually, that oversimplifies; I look for a hotlink with a specific pattern of URL in a hotlink that I know points into the site bugtracker.
I don't have any problem with you using regexps to identify the particular fragments you are looking for; my criticism is that you use them for HTML parsing. For example, in your code you use a regexp like this:
This mixes structure and meaning, to use your terms. You assume that there's an HREF attribute right after the A. Your code is brittle when it comes to a change in the HTML (e.g. suppose the page author adds a CLASS= attribute).
This would not happen if you parsed the HTML into a DOM tree and then did queries against it. You could quickly extract all the <A> tags in the page with a //A query (or even just those that have an HREF) and get the actual HREF robustly. Or you could skip XPath and use a parser that does callbacks with robust lists of attributes.
Doing that would be both robust against changes in page structure, and robust against changes in the attributes or placement of attributes in the <A>.
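With lxml the query is literally tree.xpath('//a/@href'); with nothing but the standard library, a callback parser does the same job. A sketch -- the bug-link pattern is a stand-in for whatever the tracker really emits:
import re
from html.parser import HTMLParser

BUG_LINK = re.compile(r'func=detailbug')  # hypothetical URL pattern

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []
    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs, so an added
        # CLASS= attribute or a reshuffled attribute order costs nothing.
        if tag == 'a':
            href = dict(attrs).get('href', '')
            if BUG_LINK.search(href):
                self.hrefs.append(href)

p = LinkCollector()
p.feed('<td><a class="new" href="/bugs/?func=detailbug&bug_id=77">#77</a></td>')
print(p.hrefs)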
PS Is your name-calling really necessary? In your post you refer to me as a 'snotty kid' and a 'fool'.
PPS Also you say:
If I did what my unwise critic says is right, I’d take a parse tree of the entire page and then write a path query down to the table cell and the embedded link.
I never said that; I said that you could use XPath queries to extract the data. I never implied that you use a complete rooted path to get where you want in the document. You've twisted what I actually said to fit your 'structure and meaning' blog post.
>PS Is your name-calling really necessary? In your post you refer to me as a 'snotty kid' and a 'fool'.
Well, you said his code looks like a hack job by a poor programmer. That's fairly insulting too.
About the actual mechanics of screen scraping, I guess your only issue is that he could have used an HTML parser to sanitize the page first before querying it.
>his code looks like a hack job by a poor programmer
Well, it does. Specifying a case-insensitive match and then handling both cases yourself reeks of someone who's never parsed anything before. The second case won't even do anything; it's already been done! And even if the language or the libraries worked that way, what about mixed case! Let alone parsing HTML with regexps in the first place. Python has BeautifulSoup for not-well-formed documents (see the sketch below).
It's reasonable from that code snippet to infer the author understands neither the language he's using nor the problem domain he's working in.
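Since BeautifulSoup came up: a minimal sketch (the tag soup is invented) of it coping with exactly this kind of markup:
from bs4 import BeautifulSoup

# Mixed-case tags, a doubled </TD>, an unclosed last cell -- the parser
# normalizes it all, and tag matching needs no case gymnastics.
soup = BeautifulSoup('<TR><TD>open</TD></TD><td>bug #7', 'html.parser')
print([cell.get_text() for cell in soup.find_all('td')])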
>It's reasonable from that code snippet to infer the author understands neither the language he's using nor the problem domain he's working in.
Well, does the program actually work as intended? If so, that's a pretty good indication that the author does understand the language and the problem domain, even if the code isn't pristine.
Errr, no, it's not. There are other reasonable inferences: that the programmer isn't concerned with the efficiency of the running code but with his own efficiency in coding; that the programmer originally had the code doing one thing but changed it to do another and hasn't (yet) noticed that it's now redundant; that it's not wrong code, just redundant code, so qwitcherbellyaching and submit a patch.
I parse HTML with regexps when I know it's been machine-generated with a certain structure, and I'm only interested in pulling out, with a re.search(r'blah(%s)blah'), the bits that got put in there by a printf "blah$(x)blah" (see the sketch below).
You're going to be tarring a lot of not-incorrect code with that brush of yours.
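For the record, a sketch of the narrow case being defended -- output stamped out by a known template, where a regexp really is safe enough (the template is invented):
import re

# Line known to come from: printf("<td>build %d ok</td>\n", n)
line = '<td>build 1492 ok</td>'
m = re.search(r'<td>build (\d+) ok</td>', line)
if m:
    print(m.group(1))  # -> 1492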
No, the only reason to a) specifically ask for case-insensitive matching and then b) copy-paste the same regexp with just the case changed is that you really don't understand what it means.
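For comparison, a sketch of what the four substitutions collapse to once the flag is actually applied -- note that in the quoted calls, re.I is passed positionally into re.sub's count slot, not as a flag:
import re

text = '<tr class="row"><TD ALIGN=left>open</TD></tr>'
# Uppercase the captured tag name so a lowercase <td ...> still becomes <TD>:
text = re.sub(r'<(tr|td)[^>]+>',
              lambda m: '<%s>' % m.group(1).upper(),
              text, flags=re.I)
print(text)  # -> <TR><TD>open</TD></tr>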
The difference between "fool" and "looks like it's a hack job by a poor programmer" is that one critiques the actual work, and one assaults the person. Insulting or not, critiques of work improve discourse and name calling usually doesn't.
What you're failing to see is the meaning of the phrase "looks like it was written by a poor programmer". That's a comment on the code only, given that it's possible for a fantastic programmer to write code like that. It's not, however, the same as "calling someone a poor programmer".
Likewise, someone looking at an abstract painting might say "that looks like it was painted by a 5 year old", which doesn't mean they're calling the actual painter a 5 year old. It's commentary on the work.
Except it doesn't appear to be. The original author has responded to my criticism of his code by calling me a 'fool' and a 'snotty kid', ignoring the actual criticisms that I made and writing a blog post explaining his superiority.
Or is that what you meant by 'bazaar-style development'?
And that response isn't very conducive to creating a community of people willing to help you on your open source project.
Too many egos getting in the way here, guys. This is an all-right project. Don't let disagreements over how to do things make it all fail. There are always better ways to do things, and there are prototypes that do a lot of things wrong.
Now's the time to make the bad parts good. The code was just released, okay, now make it right.
I think you are being a little over-critical. There are always better ways to do everything. This contributor was doing something that had never been done before -- by himself.
Isn't the whole point of open source to show the code and let community involvement make it better?
Since you have a lot of experience with HTML parsing, it'd be better to contribute that experience to the project than to bash someone for not doing something you know well as well as you could have done it.
ESR is a well known hacker who has maintained several large-scale OSS projects and done lots of coding for several decades. He literally wrote the book about OSS. If ESR's code was markedly bad, it would be news.
Hey there, Jim Thompson! How surprising to see you here, posting under your usual "I have a reputation to protect so I post under a pseudonym" name. You should put more effort into libelling people -- maybe your fine work will rise to the level of a lawsuit someday?
I'm aware. fnid just seemed to be under the impression that ESR was a newbie that was genuinely looking for help from more experienced hackers. I was just showing that he (ESR) is a noted (notorious) figure that should know better than to make noobish mistakes in code.
I never said that he was any good, merely that he has notoriety. I thought that made it clear. I guess not, so let me reiterate:
Jgrahamc commented that ESR's code here is pretty crappy. I agree. Fnid comes in and accuses jgrahamc of being too harsh. It appears to my mind's eye that Fnid doesn't know who ESR is and naively assumes that he is some random newbie who would appreciate help, rather than a notorious figure with a number of projects and books under his belt. I'm pointing out that it is not wrong for jgrahamc to call him on it. If ESR's notoriety is mostly self-publication, then I am even more correct.
To further elaborate: if Bjarne Stroustrup wrote a C compiler that was ineptly coded and someone said the code was lousy, one would not reply, 'lighten up, this was a good first try for a newbie,' unless one did not know who Bjarne Stroustrup was.