Hacker News new | comments | show | ask | jobs | submit login
Portia, an open-source visual web scraper (scrapinghub.com)
367 points by pablohoffman on Apr 1, 2014 | hide | past | web | favorite | 67 comments

The problem with these sorts of solutions is that they work perfectly for 'simple' sites like the register, but fail hard with 'modern' sites like, e.g. ASOS.com. Just tried ASOS and the web front end failed to request a product page correctly...

All the dynamic JS and whatnot just plays havoc with these projects. In my experience you have to run through webdriver or something like phantomjs and parse the JS...

There are multiple internal tools I use at work (JIRA, our ticketing system, our code review tool) that won't work because of this issue.

In the meantime, I've written Tampermonkey scripts that will scrape and embedd multiple pages all hack-like, but at least I get a good CSV of the data I need.

To me, the value in this tool is the user interface for creating the scrape logic. If this ran as an embeddable JS app, that you could place inside any page and utilize local storage, you could scrape these dynamic sites by viewing the page first, and still get all of the cool gadetry provided by this tool.

In essence, the value of this tool could be built as a bookmarklet. THAT SIR - I would use every, single, day.

Great idea on the bookmarklet. I could see a tool for building custom readers with clippings from various sites. Say I want to organize JavaScript array patterns and ideas. Throw in a way to clip parts of my PDF books into this "reader" and you have an amazing product worth millions.

Can you explain this more, how do you see this being operated? See a pdf, clip it, create your own reader with your own clips?

Why would scrape JIRA when they have a perfectly workable API?

While their APIs are nice, they require separate permissions and having access to them isn't always a given depending on the company that runs/owns the Jira instance.

This. I also would rather just grab it in the browser instead of having to run a server or something else.

how would a bookmarklet be able to crawl & scrape a website?

You can have javascript code as a bookmarklet

yes I know that but how would you make it crawl links in your browser?

At first, this seems correct. It's definately easier to get scraping with something like Capybara and a suitable js enabled driver, but in my experience, this solution is less reliable. Async loaded data can time out and don't get me started on the difficulties of running the scraper with cron jobs. In the end, I migrated even my JS heavy pages to Mechanize based solutions. It takes a few extra requests to get the async data, but once you get that figured out, it's rock solid - till they update the site design ;-)

Use the tools suitable for the task. There are a lot of those "simple" sites and I'm pretty sure a lot of people will stick to those "dated" methods of building sites, because search traffic still matters.

The long tail is tough, but rules are useful when you only need to work with a small number of sites. And assuming, as you point out, less "modern" sites. (News sites tend to be mostly consistently manageable but, yes, smaller e-commerce players tend to adopt more modern techniques -- as befitting fashion-forward product lines, naturally).

Our (Diffbot) approach is to learn what news and product (and other) pages look like, and obviate the rules-management -- we also fully execute JS when rendering.

The web keeps evolving though, dang it. Tricky thing!

Unfortunately Diffbot is not open source. Are you planning any F/OSS offerings?

I built SnapSearch for JS/SPA sites that need SEO. But it works for scraping as well. https://snapsearch.io/ You can try the demo. I tried it with "http://www.asos.com/" and it worked properly. Note that empty content actually means that the webserver returned with no body content. The real API will return the headers as well the body.

It works via Firefox, and it's load balanced and multithreaded. It takes care of all the thorny issues regarding async content... etc.

It also depends on a coherent structure in HTML websites.

Domains running websites which are more like javascript frontend modules shouldn't be scraped at all, it screams for a public API.

Try using https://snapsearch.io/ It is designed for JS sites.

"it screams for a public API"

But many content owners would never provide their data in this format even if doing-so would be trivial.

These single page sites do have a public, albeit, undocumented API. If you analyze the network requests via the dev tools in your browser you'll have an XML/JSON data source that is probably structured better than the markup.

Of course, I should have thought about it that way.

Anybody know of any tools that would work with JS-rendered sites, and not have to "parse the JS"?

Answering my own question:

CasperJS is an open source navigation scripting & testing utility written in Javascript for the PhantomJS WebKit headless browser and SlimerJS (Gecko). It eases the process of defining a full navigation scenario and provides useful high-level functions, methods & syntactic sugar for doing common tasks such as:

    defining & ordering browsing navigation steps
    filling & submitting forms
    clicking & following links
    capturing screenshots of a page (or part of it)
    testing remote DOM
    logging events
    downloading resources, including binary ones
    writing functional test suites, saving results as JUnit XML
    scraping Web contents

I wrote a blog post on my experiences using CasperJS to parse a single page site which used angular. http://www.andykelk.net/tech/web-scraping-with-casperjs

I recently created a service designed to make JS sites crawlable by search engines and other robots. However it works for scraping as well. Try the demo: https://snapsearch.io/


>Is it also a webscraper that can pull data out of a page for me?

No, Phantom will only recreate the page as it would look like to a human user (i.e. after all javascript is parsed and executed). It will not help you parse or slice the page - you would have to do that part programatically with other dom-parsing tools.

Sorry. I guess you're right. As a programmer, both of those look the same to me.

"PhantomJS is a headless WebKit scriptable with a JavaScript API", so it's a browser.

Is it also a webscraper that can pull data out of a page for me?

I've heard of people using PhantomJS with CasperJS to scrape, not sure if it can be done solely with PhantomJS.

CasperJS is a higher-level wrapper for PhantomJS, so - yes, it could be done with PhantomJS solely... But you wouldn't want to, because CasperJS makes automation easier.

Are there any libraries to facilitate database-connectivity to SQL Server or MySQL from javascript? I've used CasperJS to scrape some sites, but always fall back on post-processing the scraped data with another program in order to get it into my database. I'd love to be able to do it all from one piece of code.

You could always have your CasperJS scraping script make an AJAX request to a RESTful API for your MySQL DB. You won't be doing it all from one piece of code, but you'll be doing about 90% of it.

SlimerJS too

I expected an April Fool's joke and found something pleasantly awesome and useful instead.

Source is here: https://github.com/scrapinghub/portia

I've used Scrapy and it is the easiest and most powerful scraping tool I've used. This is so awesome. Since it is based on Scrapy I guess it should be possible to do the basic stuff with this tool and then take care of the nastier details directly on the code. I'll try it for my next scraping project.

I like that there's people working to make scraping easier and friendly for everyone. Sadly (IMHO) the cases where these tools will probably fail are at the same time the same not really open on providing the data directly. Most scraper-unfriendly sites would make you request another page before to capture a cookie, set cookies on the request headers or a referer entry, or manually using regex magic to extract information from javascript code on the html. I guess it's just time one tool will provide such methods, though.

For my project I do write all the scrapers manually (that is, in python, including requests and the amazing lxml) because there's always one source that will make you build all the architecture around it. Something that I find that is needed for public APIs is a domain specific language that can work around building intermediate servers by explaining the engine how to understand a data source:

An API producer wants to keep serving the data themselves (traffic, context and statistics), but someone wants an standard way of accessing more than one source (let's say, 140 different sources). If only instead of making an intermediate service providing this standardized version, one could be able to provide templates that a client module would use to understand the data under the same abstraction.

The data consumer would be accessing the source server directly, and the producer would not need to ban over 9000 different scrapers. Of course this would only make sense for public APIs. (real) scraping should never be done on the client: it is slow, crashes and can breach security on the device.

Surely there are difficulties in expecting data providers to produce their data in standard formats across industries and countries? I am naive as to how much and what data is available but that seems a stretch

If interested, take a look at my project on unifying bike sharing networks data. Besides providing a public API, we are also providing a python library that accesses and abstracts different sources under the same model [1, 2]

There are a lot of accessible sources (though, not documented), but there are also clear examples on how one would never provide a service! Some examples [3, 4]

What I was referring, though, was in a way to avoid having to build an intermediate server scraping services that are perfectly usable (JSON, XML) just because we (all) prefer to build clients that understand one type of feed (standard).

Maybe it's not about designing a language, but just as a new way of doing things. Let's say I provide the client with the clear instructions on how to use a service (its format, and where are the fields that the client understands (in an XPath-like syntax)).

That should be enough to avoid periodically scraping good-player servers, but at the same time being able to build client apps without having to implement all the differences between feeds. Besides, it would avoid being banned for accessing too much times a service, and would give data providers insight on who is really using their data.

Let's say we want to unify the data in Feed A and Feed B. The model is about foos and bars:

    Feed A:
      "status": "ok",
      "foobars": [
          "name": "Foo",
          "bar": "Baz"
        }, ...

    Feed B
    [{"n": "foo","info": {"b": "baz"}},...]

    We could provide:
      "feeds": [
          "name": "Feed A",
          "url": "http://feed.a",
          "format": "json",
          "fields": {
            "name": "/foobars//name",
            "bar": "/foobars//bar"
          "name": "Feed B",
          "url": "http://feed.b",
          "format": "json",
          "fields": {
            "name": "//n",
            "bar": "//info/b"
    Instead of providing a service ourselves that accesses Feed A and Feed B
    every minute just because we want to ease things on the client.
Not sure if that's what you asked, though.

[1]: http://citybik.es

[2]: http://github.com/eskerda/pybikes

[3]: https://github.com/eskerda/PyBikes/blob/experimental/pybikes...

[4]: https://github.com/eskerda/PyBikes/blob/experimental/pybikes...

Ok different feeds, same domain, unifying the model sis feasible, either as an intermediate or as a client "template thing"

Thank you - makes sense. I was thinking different data feeds different domains.

Cool tool for developers, but since this one is open source, I think it opens up even more interesting possibilities for these tools to be integrated into part of a consumer app. Curation is the next big trend, right? I think I'll give that a try.

I just took it for a testdrive and it was an absolute pleasure. I tried to scrape all job listings at https://hasjob.co hoping to find trends.

There is one small pain, the output is being printed to the console and piping output to file is not figuring. But it did fetch all the pages and printed a nice json.

UPDATE: there is a logfile setting to dump output to file

I have a project which includes a huge list of websites which must be scraped heavily. My question is... Are these kind of tools suitable for 'heavy lifting', scraping hundreds of thousands of pages?

Yep. It's just a GUI that generates scrapy (python) code.

Can anyone give a real-life example where this visual tool would be useful? Not that I dont believe in scraping (we do it too: https://github.com/brandicted/scrapy-webdriver). I know Google has a similar tool called Data Highlighter (in Google Webmaster Tools) which is used by non-technical webmasters to tell Google bot where to find the structured data in the page source of a website. It makes sense at Google's scale however I fail to see in which other cases this would be useful considering the drawbacks: some pages may have a different structure, javascript not always properly loaded, etc. Therefor requiring the intervention of a technical person...

This is great. However I have one bone to pick(or rather know if its been taken care of) Scrapy uses xpaths or equivalent representations to scrape. However there are many alternate xpaths to represent the same div. For e.g. Suppose data is to be extracted from the fifth div in a sequence of divs. So it would use that as the xpath. But now say it also has a meaningful class or id attribute. An xpath based on this attribute might be a better choice because this content may not be in the fifth div across all the pages in a site I want to scrape. Is this taken care of by taking the common denominator from many sample pages?

Portia uses https://github.com/scrapy/scrapely library for data extraction. It doesn't use XPaths for learning. There are some links to papers in scrapely README; scrapely is largely based on ideas from these papers, but there are many improvements. In short - yes, this is taken care of.

Excellent. But the example presented in the video (scraping new articles) is a actually a case better solved with other technologies.

I imagine this will be useful when scraping sites like IMDB in case they don't have an API or their API is not useful enough.

Although this is cool, the ultimate scraper would probably need to be somehow embedded in a browser and be able to access the JS engine and DOM. Embedded as a plugin, or some other extension depending on the browser.

Totally off topic, but what's the name of the song in the video? :)

It's been made just for the video :)

Could you please upload it somewhere? It's really catchy :)

From the video, I noticed that the HTML tags were also scraped in the large article text. Is there some way to remove those automatically? Or perform further processing?

Yes, you just need to select a different field type ("text", instead of "html").

This is cool. Can I use it locally on internal sites too?

Why wouldnt you??

I believe the GUI is run locally, but if it was run as a web application from the developers site it would only be able to scrape sites accessible to the public internet.

Outside of this tool, or a tool that uses a scripted browser, another option could be Sikuli in a VM.

I really dig these scrapers, but most of them seem to only work well for simple sites as someone has already noted.

Just want to point out a (commercial but reasonable) program that really works well for all our odd edge case customer site issues.


This solution remembers Pyquery, but using a visual interface.

Love this, been using Scrapy for all my scraping needs.

Is there a live demo available?

Not affiliated with the OP or the project, but I threw up a sandbox to play with this at http://awstest123.notesies.com:9001/static/main.html

Not yet.

Import.io, Kimono Labs, and now this. Web scraper -> data area is heating up.

It's always pretty hot! Some still around - like Kapow, Connotate, Mozenda, ScraperWiki (which I run). Some not - Needlebase.

Portia is more interesting because it is an open source scraping GUI - the GUIs tend to be very proprietary.

I also wrote http://scrape.ly, which let's you write web scrapers via a URL syntax.

Here's an open source web scraping GUI I wrote a while back https://github.com/jjk3/scrape-it-screen-scraper

I'm still integrating the browser engine which I was able to procure for open source purposes.

The video is quite old.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact