
Artoo, the client-side scraping companion - jacomyal
http://medialab.github.io/artoo/
======
EamonLeonard
Another "Artoo" [http://artoo.io/](http://artoo.io/)

------
zak_mc_kracken
Still not convinced by the reasons offered for client-side scraping. If I'm in
my browser, I'm not interested in consuming JSON.

Scraping is really something that's better done in the back end, and today,
there are a lot of libraries that let you access web sites from Java and run
all the Javascript you need in order to display the page properly.

~~~
rektide
To each their own. I'm not interested in systematic scraping. I just want to
take back, take home the web experience I've had, and be able to digest and
work with it later. The things that I want to work with are the sights and
experiences I've had. Client side is perfect.

Second, if I was trying to scrape, I'd rather do scraping with WebDriver than
anything else, and injecting some client side scraping tools and using
WebDriver as a driver, not a driver/scraper, sounds remarkably better.

I see no reason to ever not use a browser to consume html content.

~~~
rektide
For example, favoriting a tweet on twitter is lossy: there's no after-the-fact
scraping I can do to know where I was, what time it was when I favorited the
thing.

If we want to Publish Everywhere Syndicate to Own Site (#IndieWeb dubs this
PESOS), if we want to have our own experiences we can talk about, client side
is the way to go.

------
brucehart
Great work! I really like this! I typically use the JavaScript console
bookmarklet for tasks like this, but it is not specifically designed for
scraping. I would love to see an option that would allow Artoo commands to be
packaged into a PhantomJS script. Developers could use Artoo manually to
figure out what elements should be targeted and then the PhantomJS script to
run it in an automated fashion.

~~~
Yomguithereal
This would indeed be nice and this is precisely what we intend to code next.

------
ghkbrew
What advantages does this have over Phantom.js[1] ?

[1] [http://phantomjs.org](http://phantomjs.org)

~~~
jacomyal
The two are really different. Phantom.js is a headless browser, while artoo is
a tool to easily scrape data from websites.

But combining both would be nice: it would make it possible to automate
scrapers that have been developed quickly, directly in the browser, with artoo.

------
fiatjaf
This is awesome. I've been dreaming about this for weeks.

I don't know if it is possible, but could this run as a Chrome Extension, in a
background script, loading various pages, executing code on them and
continuing, storing the data in the extension's localStorage?

It could also store the code of the scrapers, for reusing.

~~~
fiatjaf
Well, I see you already have almost all I suggested. Now I would want
something to make the ajaxSpider render the pages using the browser engine,
instead of just getting pure HTML.

~~~
Yomguithereal
This is an interesting point. I created an issue on the github repository
concerning this matter. Maybe you'd like to comment on it about your use case
so we can improve the tool?

------
nnnnni
I would like to see something that helps create useful, specific scrapers for
languages like Ruby and Python.

It's annoying to have to run scripts multiple times, tweaking them after each
run to get _exactly_ what you need. It's a waste of time...

~~~
the_cat_kittles

      >> ipython
    
      In [1]: from pyquery import PyQuery as pq
      In [2]: pq("http://www.foo.com")("<some jquery selectors>")
    

(inspect output, repeat till right)

... or do it with requests + lxml.etree, or whatever you want

when you have what you need, copy and paste into a file
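
The inspect-and-repeat loop above can also be tried offline with only the
Python standard library. The following is a minimal sketch, not the
commenter's code: `LinkCollector`, `extract_links`, and the `SAMPLE` markup
are hypothetical names made up for illustration, and `html.parser` stands in
for pyquery or lxml.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html):
    """Return a list of (link text, href) tuples found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup; in practice this would be a saved page.
SAMPLE = '<ul><li><a href="/a">First</a></li><li><a href="/b"> Second </a></li></ul>'

print(extract_links(SAMPLE))  # prints [('First', '/a'), ('Second', '/b')]
```

Working against a saved HTML string keeps each tweak fast and avoids hitting
the site again on every run; once the output looks right, the same function
can go into a script.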

~~~
nnnnni
PERFECT! Thanks a lot =-)

------
dfischer
This is in direct conflict with another library:
[http://artoo.io](http://artoo.io)

Might be better to use another name?

------
benmmurphy
this jquery injection looks kind of dangerous. Looks like code from
code.jquery.com is loaded into any page. Say I go to
[https://secretsquirrel.com](https://secretsquirrel.com) and they have been
very careful to only load javascript from their own domain but now it can also
load malicious javascript from
[https://code.jquery.com](https://code.jquery.com).

It also disables CSP. I'm not exactly sure how the extension works. Maybe it
is turned on/off on a per-tab basis and defaults to off, which would be quite
safe. But if it defaults to on, then it can be kind of risky.
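
The concern above can be illustrated with a deliberately simplified model of
a Content Security Policy check. This is a sketch under stated assumptions,
not how any browser or the artoo extension is implemented: `script_allowed`
and `origin` are hypothetical helpers, and only `'self'` and exact
scheme-plus-host sources are handled (no wildcards, nonces, hashes, or
paths).

```python
from urllib.parse import urlsplit


def origin(url):
    """Return the lowercased scheme://host[:port] part of a URL."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}".lower()


def script_allowed(policy, page_url, script_url):
    """Simplified CSP check: may page_url load a script from script_url?

    Assumption: only 'self' and exact origin sources are understood.
    """
    directives = {}
    for chunk in policy.split(";"):
        tokens = chunk.split()
        if tokens:
            # The first occurrence of a directive wins, as in CSP.
            directives.setdefault(tokens[0].lower(), tokens[1:])

    # script-src falls back to default-src when it is absent.
    sources = directives.get("script-src", directives.get("default-src"))
    if sources is None:
        return True  # no applicable policy: nothing is restricted

    target = origin(script_url)
    for source in sources:
        if source == "'self'" and target == origin(page_url):
            return True
        if source.lower() == target:
            return True
    return False


PAGE = "https://secretsquirrel.com/"
JQUERY = "https://code.jquery.com/jquery.min.js"

# A site that only trusts its own scripts blocks the injected library...
print(script_allowed("script-src 'self'", PAGE, JQUERY))  # prints False
# ...but with the policy removed, any origin is accepted.
print(script_allowed("", PAGE, JQUERY))                   # prints True
```

With the policy stripped, the page trusts every script origin, which is the
risk described if the override were on by default.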

~~~
Yomguithereal
jQuery is injected carefully by artoo so it does not break anything on the
host page. However, CSP override is not the default in artoo, and you have to
install the Chrome extension to enable it. This extension should be activated
only when scraping, and only developers who understand its effects should use
it.

------
thebiglebrewski
Yeaaaah you might wanna rename that. I think the other Artoo already has
enough traction and this will just confuse people.

~~~
kej
They seem different enough that anyone interested in these would be able to
tell them apart.

------
notastartup
This is great for simple, quick jobs. However, you can do only so much in a
local browser itself.

I basically built a bookmarklet that lets you define the actions locally in
your browser, and then run the scrapes on your own box, essentially allowing
unmetered scraping without charging per page.

[http://scrape.ly](http://scrape.ly)

~~~
sogen
closed beta?

~~~
notastartup
I'm still putting on the finishing touches. Will email everyone when it's
ready to use.

~~~
mendicantB
Looks really awesome, I'll await its arrival.

