
ESR announces ForgePlucker: solving the data-jail problems of OSS hosting sites - Luyt
http://esr.ibiblio.org/?p=1369
======
jgrahamc
It's amusing to actually look at the source code of this. It's a single file,
here:
[http://svn.gna.org/viewcvs/forgeplucker/trunk/bugplucker.py?...](http://svn.gna.org/viewcvs/forgeplucker/trunk/bugplucker.py?rev=7&view=markup)

Reading it, it looks like a total hack job by a poor programmer. For example,
HTML parsing is done by a bunch of regular expressions, which include stuff
like

      # Yes, Berlios generated \r<BR> sequences with no \n
      text = text.replace("\r<BR>", "\r\n")
      # And Berlios generated doubled </TD>s
      text = text.replace("</TD></TD>", "</TD>")

All the pain of maintaining these little special cases could have been avoided
by simply running the page being walked through an actual HTML parser that
produces a DOM tree. Similarly, the function dehtmlize converts only a limited
set of HTML entities (e.g. it does not convert &nbsp;).
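
For what it's worth, a real parser makes both problems disappear. A minimal
sketch with lxml.html, which is one option among several; the sample markup is
invented for illustration:

    # The parser absorbs the malformed markup (the stray </TD>) and
    # decodes entities like &nbsp; for free.
    from lxml import html

    broken = '<table><tr><td>R&amp;D&nbsp;team</td></td></tr></table>'
    doc = html.fromstring(broken)
    print(doc.text_content())  # -> 'R&D team' (with a non-breaking space)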

Then you also get stuff like:

      # First, strip out all attributes for easier parsing
      text = re.sub('<TR[^>]+>', '<TR>', text, re.I)
      text = re.sub('<TD[^>]+>', '<TD>', text, re.I)
      text = re.sub('<tr[^>]+>', '<TR>', text, re.I)
      text = re.sub('<td[^>]+>', '<TD>', text, re.I)

Why have you got four expressions there? They are all doing case-insensitive
matching, yet there are upper- and lowercase versions of each.
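
To illustrate the point, here is what the same cleanup could look like with a
single case-insensitive pattern per tag (a sketch, not the author's code):

    import re

    text = '<TR class="row"><td width="5">x</td></TR>'
    # One case-insensitive pattern covers <TR>, <tr>, and any mixed case.
    # Note that re.sub's fourth positional argument is count, so the flag
    # has to be passed as flags=... to take effect.
    text = re.sub(r'<tr[^>]+>', '<TR>', text, flags=re.IGNORECASE)
    text = re.sub(r'<td[^>]+>', '<TD>', text, flags=re.IGNORECASE)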

Hmm. The rest of the code is really pretty crappy. He could have just read the
thing into lxml and done XPath queries to extract data. That would have had
the advantage of making it clear what parts of the page he was extracting, and
would have made maintenance easy.
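
Something along these lines, say (a sketch; the sample page is invented for
illustration):

    from lxml import html

    # Invented sample page standing in for the fetched tracker page.
    page_text = '<table><tr><td> 17 </td><td>open</td></tr></table>'
    doc = html.fromstring(page_text)
    # Pull the cells out of every table row; no attribute-stripping
    # regexp pass is needed, because the tree already has clean elements.
    for row in doc.xpath('//tr'):
        cells = [td.text_content().strip() for td in row.xpath('td')]
        print(cells)  # -> ['17', 'open']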

~~~
esr
It's a proof-of-concept. Don't get hung up on the parsing details; I expect
I'm going to have to rewrite it at least twice.

~~~
esr
I've written something relevant here: "Structure Is Not Meaning".

<http://esr.ibiblio.org/?p=1387>

~~~
jgrahamc
Your reply doesn't address why you do the things I highlighted in my original
comment. For example, in the middle of your generic table parser you have code
specific to Berlios, making it non-generic. That's there because you haven't
actually parsed the HTML; you're trying to hack around it with regexps.

In your post you say:

 _So instead, I walk through the page looking for anything that looks like a
hotlink wrapping literal text of the form #[0-9]. Actually, that
oversimplifies; I look for a hotlink with a specific pattern of URL in a
hotlink that I know points into the site bugtracker._

I don't have any problem with you using regexps to identify the particular
fragments you are looking for; my criticism is that you use them for HTML
parsing. For example, in your code you use a regexp like this:

      <A HREF="/bugs/\?func=detailbug&bug_id=([0-9]+)&group_id=%s">

This mixes structure and meaning, to use your terms. You assume that the HREF
attribute comes immediately after the A. Your code is brittle in the face of
any change to the HTML (e.g. suppose the page author adds a CLASS= attribute).

This would not happen if you parsed the HTML into a DOM tree and then ran
queries against it. You could quickly extract all the <A> tags in the page
with a //A query (or even just those that have an HREF) and get the actual
HREF robustly. Or you could skip XPath and use a parser that does callbacks
with robust lists of attributes.

Doing that would be both robust against changes in page structure, and robust
against changes in the attributes or placement of attributes in the <A>.
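
Concretely, something like this (a sketch with lxml; the sample page and
group id are invented for illustration):

    import re
    from lxml import html

    # Invented sample: note the extra class= attribute before the HREF.
    page_text = ('<table><tr><td><a class="x" '
                 'href="/bugs/?func=detailbug&bug_id=7&group_id=1234">#7</a>'
                 '</td></tr></table>')
    group_id = '1234'  # placeholder project id
    pattern = re.compile(r'func=detailbug&bug_id=([0-9]+)&group_id=%s'
                         % group_id)
    # The parser hands over every link's HREF regardless of attribute order
    # or extra attributes; the regexp only has to recognize the URL
    # fragment, not the surrounding markup.
    doc = html.fromstring(page_text)
    for a in doc.xpath('//a[@href]'):
        m = pattern.search(a.get('href'))
        if m:
            print(m.group(1))  # -> 7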

PS Is your name-calling really necessary? In your post you refer to me as a
'snotty kid' and a 'fool'.

PPS Also you say:

 _If I did what my unwise critic says is right, I’d take a parse tree of the
entire page and then write a path query down to the table cell and the
embedded link._

I never said that; I said that you could use XPath queries to extract the
data. I never implied that you use a complete rooted path to get where you
want in the document. You've twisted what I actually said to fit your
'structure and meaning' blog post.
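
For the record, the difference looks like this (reusing doc from the sketch
above; the rooted path is invented for illustration):

    # A fully rooted path, which I did not suggest: brittle, because it
    # encodes the entire page layout.
    doc.xpath('/html/body/table[2]/tr[3]/td[1]/a')
    # An unrooted query, which I did suggest: finds the links wherever
    # they sit in the document.
    doc.xpath('//a[@href]')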

~~~
statictype
> PS Is your name-calling really necessary? In your post you refer to me as a
> 'snotty kid' and a 'fool'.

Well, you said his code looks like a hack job by a poor programmer. That's
fairly insulting too.

About the actual mechanics of screen scraping, I guess your only issue is that
he could have used an HTML parser to sanitize the markup before querying it.

~~~
gaius
_his code looks like a hack job by a poor programmer_

Well, it does. Specifying a case-insensitive match and then handling both
cases yourself reeks of someone who's never parsed anything before. The second
pair of expressions won't even do anything; the work has already been done!
And even if the language or the libraries did work that way, what about mixed
case? Let alone parsing HTML with regexps in the first place. Python has
BeautifulSoup for not-well-formed documents.
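
A sketch of that option, using the modern bs4 package name (sample markup
invented):

    # BeautifulSoup tolerates malformed markup such as doubled </TD>s,
    # so no hand-written cleanup passes are needed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<TABLE><TR><TD>42</TD></TD></TR>', 'html.parser')
    print(soup.find('td').get_text())  # -> 42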

It's reasonable from that code snippet to infer the author understands neither
the language he's using nor the problem domain he's working in.

~~~
omouse
He obviously hates dealing with regular expressions, and I do too. It's only
natural.

~~~
gaius
So he used double the number of regexps he needed?

~~~
omouse
Double the expressions, but perhaps they're simpler to understand.

~~~
gaius
No, the only reason to a) specifically add the option for case-insensitive
matching and then b) copy-paste the _same_ regexp just with the case changed
is that you really don't understand what it means.

------
omouse
Jesus fuck, the first comments I see here are complaints about the code.

What about the concept? Is it any good? Is it worth working on?

~~~
cdibona
Yes and no... for most of the sites, syncing svn from site to site is pretty
simple and usable, even if it can take a bit of time.

For bugs, code.google.com has an export mechanism, and you can kind of gin one
up for sf.net via its feeds (I've not done it, so ymmv).

For wikis, we (Google) store wiki content in version control, so that's easy,
and we're pretty happy with the ability of people to pull info out of
code.google.com.

I think it is a non-trivial problem to take bugs from any bug tracking system
to any other, different one. There are too many customizations and changes,
even within single projects, to make it a simple task. I think of taking a bug
from a Demetrius project to, say, JIRA, and that strikes me as a poor match
from a data perspective.

Backing up your data off of the hosting site makes a lot of sense to me; if
Eric can make his program do that, it would be more immediately useful than
host-to-host migration. In my experience, people tend to stick with hosts once
they have an established project there.

------
9turningmirrors
"Currently there is one area director, for the Open Source Awards, John
Graham-Cummings" from: <http://catb.org/~esr/roles.html>

Google result for "Eric Raymond" "John Graham-Cumming":
<http://bit.ly/3eCOSZ>

not that I'm trying to prove anything ...... rofl

