

Morph.io – search over 3000 scrapers - dkarapetyan
https://morph.io/

======
btown
TL;DR it's cron jobs as a service + a managed result database + GUI + API to
the database, all for free (donation-supported) and intended for nonprofits
trying to expose government data.

For instance, the table at [https://morph.io/planningalerts-scrapers/city_of_sydney](https://morph.io/planningalerts-scrapers/city_of_sydney) is created by running [https://github.com/planningalerts-scrapers/city_of_sydney/blob/master/scraper.rb](https://github.com/planningalerts-scrapers/city_of_sydney/blob/master/scraper.rb) daily, and the PlanningAlerts organization uses this API to send email alerts when scrape results change. They've created dozens of these scrapers: [https://github.com/planningalerts-scrapers](https://github.com/planningalerts-scrapers).
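The model is simple: morph.io runs your scraper on a schedule, and whatever rows the script writes into a local SQLite file become queryable through the web GUI and API. A minimal sketch of that storage side in Python (morph.io also supports Python scrapers) might look like the following; the table columns and sample row are made up for illustration, and the actual HTTP fetch/parse step is omitted:

```python
import sqlite3

def save_rows(rows, db_path="data.sqlite"):
    """Write scraped rows into the SQLite file that morph.io serves.

    By convention the service picks up a file named data.sqlite with a
    table named "data" in the scraper's working directory.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS data (
               council_reference TEXT PRIMARY KEY,
               address TEXT,
               description TEXT
           )"""
    )
    # INSERT OR REPLACE keyed on a stable ID makes re-runs idempotent,
    # so a daily cron run only changes rows whose source data changed --
    # which is what lets a consumer diff results and send alerts.
    conn.executemany(
        "INSERT OR REPLACE INTO data "
        "VALUES (:council_reference, :address, :description)",
        rows,
    )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    # Hypothetical row; a real scraper would parse these from a page.
    save_rows([
        {"council_reference": "D/2015/1",
         "address": "1 Example St",
         "description": "Hypothetical development application"},
    ])
```

The primary-key-plus-upsert pattern is the design choice worth noting: it turns a repeatedly-run scrape into a stable dataset rather than an append-only log.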

It's great to see services like this. The need for so many hand-written scrapers does underscore, however, how difficult it is to write a generalized scraper that works across multiple websites.

Google has been attempting to do this structured-scraping-at-scale with its WebTables team: [http://googleresearch.blogspot.com/2014/09/introducing-structured-snippets-now.html](http://googleresearch.blogspot.com/2014/09/introducing-structured-snippets-now.html). I remember seeing a talk on the underlying technology - there's a lot of machine learning used to determine whether a <table> is actually structured data, and how to associate things with Google's Knowledge Graph. Solving structured scraping in 100% of cases is an "AI-complete" problem, but there's definitely progress on getting partially there in an automated fashion.

