Hacker News new | comments | show | ask | jobs | submit login

It's amusing to actually look at the source code of this. It's a single file, here: http://svn.gna.org/viewcvs/forgeplucker/trunk/bugplucker.py?...

Reading it, it looks like a total hack job by a poor programmer. For example, HTML parsing it done by a bunch of regular expressions. Which include stuff like

  # Yes, Berlios generated \r<BR> sequences with no \n
  text = text.replace("\r<BR>", "\r\n")
  # And Berlios generated doubled </TD>s
  text = text.replace("</TD></TD>", "</TD>")
All the pain of maintaining these little cases could have been removed by simply running the page being walked through an actual HTML parser that produces a DOM tree. Similarly the function dehtmlize contains a limit set of HTML attributes that it will convert (e.g. it does not convert &nbsp;).

Also, then you get stuff like:

  # First, strip out all attributes for easier parsing
  text = re.sub('<TR[^>]+>', '<TR>', text, re.I)
  text = re.sub('<TD[^>]+>', '<TD>', text, re.I)
  text = re.sub('<tr[^>]+>', '<TR>', text, re.I)
  text = re.sub('<td[^>]+>', '<TD>', text, re.I)
Why have you got four expressions there? They are all doing case insensitive matching yet there are upper and lowercase versions of each.

Hmm. The rest of the code is really pretty crappy. He could have just read the thing into lxml and have done XPath queries to extract data. It would have had the advantage of making clear what parts of the page he was extracting and made maintenance easy.




It's a proof-of-concept. Don't get hung up on the parsing details, I expect I'm going to have to rewrite it at least twice.


I've written something relevent here: "Structure Is Not Meaning".

http://esr.ibiblio.org/?p=1387


Your reply doesn't address why you do the things I highlighted in my original comment. For example, in the middle of your generic table parser you have code specific to Berlios making it non-generic. That's there because you haven't actually parsed the HTML you are trying to hack around it with regexps.

In your post you say:

So instead, I walk through the page looking for anything that looks like a hotlink wrapping literal text of the form #[0-9]. Actually, that oversimplifies; I look for a hotlink with a specific pattern of URL in a hotlink that I know points into the site bugtracker.

I don't have any problem with you using regexps to identify the particular fragments you are looking for, my criticism is that you use them for HTML parsing. For example, in your code you use a regexp like this

  <A HREF="/bugs/\?func=detailbug&bug_id=([0-9]+)&group_id=%s">
This mixes structure and meaning to use your terms. You assume that there's an HREF attribute after the A. Your code is brittle when it comes to a change in the HTML (e.g. suppose the page author adds a CLASS= attribute).

This would not happen if you parsed the HTML into a DOM tree and then did queries against it. You could quickly extract all the <A> tags in the path with a //A query (or even all those that have an HREF) and get the actual HREF robustly. Or you could not use XPath and use a parser that does calls backs with robust lists of attributes.

Doing that would be both robust against changes in page structure, and robust against changes in the attributes or placement of attributes in the <A>.

PS Is your name calling really necessary? In your post you refer to me as a 'snotty kid' and a 'fool'

PPS Also you say:

If I did what my unwise critic says is right, I’d take a parse tree of the entire page and then write a path query down to the table cell and the embedded link.

I never said that, I said that you could use XPath queries to extract the data. I never implied that you use a complete rooted path to get where you want in the document. You've twisted what I actually said to fit into your 'structure and meaning' blog post.


>PS Is your name calling really necessary? In your post you refer to me as a 'snotty kid' and a 'fool'

Well, you said his code looks like a hack job by a poor programmer. That's fairly insulting too.

About the actual mechanics of screen scraping, I guess your only issue is that he could have used an html parser to sanitize the code first before querying it.


his code looks like a hack job by a poor programmer

Well, it does. Specifying case insensitive match and then handling both cases yourself reeks of someone who's never parsed anything before. The second case won't even do anything, it's already been done! And even if the language or the libraries worked that way, what about mixed case! Let alone parsing HTML with regexps in the first place. Python has BeautifulSoup for not-well-formed documents.

It's reasonable from that code snippet to infer the author understands neither the language he's using nor the problem domain he's working in.


It's reasonable from that code snippet to infer the author understands neither the language he's using nor the problem domain he's working in.

Well, does the program actually work as intended? If so, that's a pretty good indication that the author does understand the language and the problem domain, even if the code isn't pristine.


Errr, no, it's not. There are other reasonable inferences: that the programmer isn't concerned about efficiency of the running code but instead with his efficiency in coding; that the programmer originally had the code doing one thing, but changed it to do another and hasn't (yet) noticed that it's now redundant; that it's not wrong code, but instead redundant code so qwitcherbellyaching and submit a patch.

I parse HTML with regexps when I know it's been machine-generated with a certain structure and I'm only interested in pulling out the bits using re.search(r'blah(%s)blah') that got put in there with a printf "blah$(x)blah".

You're going to be tarring a lot of not-incorrect code with that brush of yours.


he obviously hates dealing with regular expressions and I do too. It's only natural.


So he used double the number of regexps he needed?


double the expressions but they're simpler to understand perhaps.


No, the only reason to a) specifically add the option for case insensitive matching and then b) copy-paste the same regexp just with the case changed is if you really don't understand what it means.


The difference between "fool" and "looks like it's a hack job by a poor programmer" is that one critiques the actual work, and one assaults the person. Insulting or not, critiques of work improve discourse and name calling usually doesn't.


esr calls him a fool for his position that meaning can be derived from structure.

I fail to see the difference between that and calling someone a poor programmer based on 1000 lines of prototype code.


What you're failing to see is the meaning of the phrase "looks like it was written by a poor programmer". That's a comment on the code only, given that it's possible for a fantastic programmer to write code like that. It's not, however, the same as "calling someone a poor programmer".

Likewise, someone looking at an abstract painting might say "that looks like it was painted by a 5 year old", which doesn't mean they're calling the actual painter a 5 year old. It's commentary on the work.


This is bazaar-style development in action: do a crap initial job and troll up helpful comments from people like you.


Except it doesn't appear to be. The original author has responded to my criticism of his code by calling me a 'fool' and 'snotty kid', ignoring the actual criticisms that I made and that writing a blog post explaining his superiority.

Or is that what you meant by 'bazaar-style development'?


exactly so.


And that response isn't very conducive to creating a community of people willing to help you on your open source project.

Too many egos getting in the way here guys. This is an all right project. Don't let disagreements on how to do things make it all fail. There are always better ways to do things and there are prototypes that do a lot of wrong things.

Now's the time to make the bad parts good. The code was just released, okay, now make it right.


I think you are being a little over-critical. There are always better ways to do everything. This contributor was doing something that had never been done before -- by himself.

Isn't the whole point of open source to show the code and let community involvement make it better?

Since you have a lot of experience with HTML parsing, it'd be better to contribute your experience to the project than to bash someone for not doing something you know well as well as you could do it.


ESR is a well known hacker who has maintained several large-scale OSS projects and done lots of coding for several decades. He literally wrote the book about OSS. If ESR's code was markedly bad, it would be news.


He's certainly a talented self-publicist, but I'd go no further than that. He's the Michael Moore of software. Or the Paris Hilton.


He's more Nichole Richie. He doesn't rise to the level of Paris, who was surprised by jail, but never surprised by wealth.


He's like Zed Shaw only without, y'know, actual code


Hey there, Jim Thompson! How surprising to see you here, posting under your usual "I have a reputation to protect so I post under a pseudonym" name. You should put more effort into libelling people -- maybe your fine work will rise to the level of a lawsuit someday?


Which large scale OSS projects?


Fetchmail is the most famous one. http://fetchmail.berlios.de/


gpsd is another, but esr seems to have trouble understanding parts of it.

CML2 is yet another, though it failed to gain acceptance by the kernel group.

the GNU ncurses library was maintained by Mr. Raymond for a while.

fetchmail wasn't an original idea, it came out of UC Berkeley's "popclient".

ESR has also famously over-claimed his experience (several times).


I'm aware. fnid just seemed to be under the impression that ESR was a newbie that was genuinely looking for help from more experienced hackers. I was just showing that he (ESR) is a noted (notorious) figure that should know better than to make noobish mistakes in code.


You're assuming that because he's loud, he's good.

Truth is, he is loud, but... he's not very good at writing software.

He's not bad at books, provided he has an editor working with him. The writing on his blog is a mess.


I never said that he was any good; merely that he has notoriety. I thought that made that clear. I guess not, so let me reiterate:

Jgrahamc commented that ESR's code here is pretty crappy. I agree. Fnid comes in and accuses jgrahamc of being too harsh. It appears to my mind's eye that Fnid doesn't know who ESR is and naivly assumes that he is some random newbie that would appreciate help, rather than a notorious figure with a number of projects books under his belt. I'm pointing out that it is not wrong for jgrahamc to call him on it. If ESR's is mostly self publication, them I am even more correct.

To further elaborate, if Bjarne Stroustrup wrote an C compiler that was ineptly coded and someone said the code was lousy, one would not reply, 'lighten up, this was a good first try for a newbie,' unless one did not know who Bjarne Strousup was.


Reading it, it looks like a total hack job by a poor programmer.

Unsurprising, most programmers are poor.


especially those who aren't self-confident enough to say, "Oh, you're right!"




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: