
Robula+: an algorithm to generate robust XPath-based locators - kamocyc
https://github.com/cyluxx/robula-plus
======
neilv
This looks like it might come in handy.

I started working on Web scraping around '95 (initially for a
personalized-newspaper metaphor for Web software agent reporting), and wrote
HtmlChewer, an HTML parser in Java designed for that purpose. A while later, I
moved my rapid R&D work to Scheme, where I wrote the `htmlprag` permissive
parser, now known as the `html-parsing` package in Racket and other Scheme
dialects.

By the time I was using Scheme, my scraping usually started with XPath, to get
a starting-point subtree of the DOM, then used a mix of arbitrary code and
sometimes a proprietary pattern-based destructuring DSL to extract info from
the subtree. Sometimes I also ran filtering/transformation algorithms across a
big free-form-ish text subtree (e.g., to simplify the articles a custom
crawler scraped from a site, when building a labeled corpus for an ML research
project).

Of course, we've always had resilience problems with Web scraping, even as the
Web changed dramatically.

In general, my scraping methods usually end up hand-crafted (and I started
this before in-browser development tools with element pickers and DOM editors
existed), and much of the guesswork/art of it was in coming up with queries
and transforms that seemed like they might keep working the next time the site
changed its HTML. In 2004 I did make a small tool to automate a "starting
point" for hand-crafting such an XPath query:
[https://www.neilvandyke.org/racket/webscraperhelper/](https://www.neilvandyke.org/racket/webscraperhelper/)

~~~
kabacha
> In general, my scraping methods usually end up hand-crafted

I've tried many of these XPath generators and even built a few myself. There's
still nothing that matches human-built ones. The best and most stable
selectors are context-aware. For example, to get a comment's text a human
would build a CSS selector like `.article .comments-box .comment p::text`, and
there's no way for a generator to know this object-relation structure without
AI involvement or training on some big sample.

This becomes especially noticeable when parsing complex webpages whose HTML
can be highly dynamic. While the tree structure is often unstable, the core
object relationships almost always are stable; in other words, the comment
text will always be under a comment paragraph, under a comment box, under an
article.
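
A quick sketch of that point (markup and selectors invented for illustration;
since `::text` is a Scrapy/parsel extension rather than standard CSS, plain
`textContent` stands in for it here):

    // Sample markup: the site wrapped the comment in a new ad-bearing
    // container after the scraper was written.
    const html = `
      <div class="article">
        <div class="comments-box">
          <div class="ad-banner"><p>buy stuff</p></div>
          <div class="comment"><p>Nice post!</p></div>
        </div>
      </div>`;
    const doc = new DOMParser().parseFromString(html, 'text/html');

    // A positional XPath recorded against the old markup now hits the ad:
    const byPosition = doc.evaluate('string(/html/body/div/div/div[1]/p)',
      doc, null, XPathResult.STRING_TYPE, null).stringValue;
    console.log(byPosition); // "buy stuff" -- wrong element

    // The relationship-based selector still finds the comment text:
    const byRelation = doc.querySelector('.article .comments-box .comment p');
    console.log(byRelation?.textContent); // "Nice post!"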

------
santa_boy
I recently discovered
[ScrapeMate](https://github.com/hermit-crab/ScrapeMate#readme) and
[selectorgadget](https://github.com/cantino/selectorgadget), both available as
Chrome extensions, which can come in handy for quick scraping.

There are opportunities for better selectors that could possibly be found
using machine learning (?)

------
kabacha
Would be nice if the readme showed some actual example; right now it just
trails off:

    let element = robulaPlus.getElementByXPath('/html/body/div/span/a', document);
    robulaPlus.getRobustXPath(element, document);
    // what's the result?

~~~
kamocyc
It depends on the content of the HTML document, but I agree with you.

I tried running the algorithm in Chrome. It output a much shorter XPath than
the one copied from the Chrome developer tools.

    e.g. "//*[@id="rso"]/div[4]/div/div[1]/a/h3" => "//*[contains(text(),'(text of the element)')]"

~~~
kabacha
Your example seems pretty awful, though. You'd rarely want to select an
element based on its text.

~~~
kamocyc
I think the author claims the XPath is more "robust" because it doesn't depend
on the indices of the elements, so you can add elements without breaking the
XPath. (But it is arguable which one is better ...)
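
As a quick check of that claim (the markup here is made up): inserting a new
sibling shifts positional indices but leaves a text-based locator intact.

    const doc = new DOMParser().parseFromString(
      '<body><div>new ad</div><div><a href="#"><h3>My result</h3></a></div></body>',
      'text/html');
    const first = (xp: string) => doc.evaluate(xp, doc, null,
      XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;

    // Written before the ad div existed, the positional path now misses:
    console.log(first('//body/div[1]/a/h3'));                // null
    // The text-based locator still matches the same element:
    console.log(first("//*[contains(text(),'My result')]")); // <h3>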

------
WalterGR
Is there a name for the concept of automatically generating (potentially with
machine learning?) selectors?

I feel like I’ve seen similar projects come across HN, but I’m at a loss for
what to search for.

~~~
onli
It's not that easy to reuse those because projects often differ in what they
need (do you really want unique selectors for one single element?). But there
are a couple of CSS selector generators, maybe you are referring to something
like
[https://github.com/fczbkk/css-selector-generator](https://github.com/fczbkk/css-selector-generator)
or [https://github.com/antonmedv/finder](https://github.com/antonmedv/finder)?
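
If I remember its API correctly (double-check the readme), finder is basically
a one-liner to use:

    import { finder } from '@medv/finder';

    // Log a unique CSS selector for whatever element gets clicked:
    document.addEventListener('click', (event) => {
      console.log(finder(event.target as Element));
    });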

~~~
benibela
I made a Greasemonkey script for this purpose. It creates XPath, CSS, or
pattern-matching selectors. Unfortunately it stopped working when Firefox got
its new extension API. But here is a video of it:
[https://youtu.be/PUrBJ6wOXvE?t=50](https://youtu.be/PUrBJ6wOXvE?t=50)

~~~
onli
A very structured and guided approach/interface you built there. I can see why
it does not work anymore with the new extension API, since it's a complete
interface. Nice work!

------
neilv
Non-paywall article PDF:
[https://www.researchgate.net/publication/299336358_Robula_An...](https://www.researchgate.net/publication/299336358_Robula_An_algorithm_for_generating_robust_XPath_locators_for_web_testing)

------
kamocyc
The algorithm described in the paper is outlined as follows (just for my
curiosity):

"The algorithm starts with a generic XPath locator that returns all nodes
(‘//*’) and then it iteratively refines the locator until only the element of
interest is selected. In such iterative refinement, ROBULA+ applies seven
refinement transformations, according to a set of heuristic XPath
specialization steps."

The algorithm seems to be a set of specialized heuristics for XPath generation.
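
Out of the same curiosity, here is a much-simplified sketch of that loop (not
the actual Robula+ code; the helpers and the fixed priority list below are my
own illustration, whereas the real algorithm searches over many candidate
locators using its seven transformations):

    // Keep specializing until the locator selects only the target element.
    function uniquelySelects(xpath: string, target: Element, doc: Document): boolean {
      const snap = doc.evaluate(xpath, doc, null,
        XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
      return snap.snapshotLength === 1 && snap.snapshotItem(0) === target;
    }

    function refine(target: Element, doc: Document): string {
      const tag = target.tagName.toLowerCase();
      const text = target.textContent?.trim();
      // Candidate specializations, ordered from generic to specific:
      const candidates = [
        '//*',                                           // generic start
        `//${tag}`,                                      // add tag name
        target.id ? `//${tag}[@id='${target.id}']` : '', // add id predicate
        text ? `//*[contains(text(),'${text}')]` : '',   // add text predicate
      ].filter(s => s !== '');
      for (const xpath of candidates) {
        if (uniquelySelects(xpath, target, doc)) return xpath;
      }
      throw new Error('the real algorithm would keep applying transformations');
    }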

------
benibela
I have been using pattern matching for web scraping. I think it is more robust
than XPath; at the least, it is more reliable at detecting invalid input.

Let's look at some of the Robula test cases:

Input:

    <head></head><body><h1 class="false"></h1><h1 class="false"></h1><h1 class="true"></h1><h1 class="false"></h1></body>

Task: get the true element, <h1 class="true"></h1>

XPath:

    //*[@class='true']

Pattern matching:

    <h1 class="true">{.}</h1>

Input:

    <head></head><body><h1 class="false" title="foo"></h1><h1 class="false" title="bar"></h1><h1 class="true" title="foo"></h1><h1 class="true" title="bar"></h1></body>

Get <h1 class="true" title="foo"></h1>

XPath:

    //*[@class='true' and @title='foo']

Pattern matching:

    <h1 class="true" title="foo">{.}</h1>

As you can see, you do not need new syntax for attributes: the input and the
pattern are the same!

Input:

    <h1></h1><h1></h1><h1></h1><h1></h1>

Get the third element.

XPath:

    //*[3]

Pattern matching:

    <h1></h1><h1></h1><h1>{.}</h1>

Input:

    <head></head><body><h1></h1><h1></h1><div><h1></h1></div><h1></h1></body>

Get the h1 in the div

XPath:

    //div/*

Pattern matching:

    <div><h1>{.}</h1></div>

This last example gets to the real point of pattern matching: every part of
the pattern must match. If the div is missing, it will report "div not found".
If the h1 is missing in the div, it will report "h1 not found". But the XPath
will just report "found these elements" or "found nothing".

