Import.io – Structured Web Data Scraping (import.io)
100 points by steeples on April 13, 2014 | 34 comments



If you're concerned about using a hosted scraping platform because it might disappear, check out https://morph.io - it's open source as well: https://github.com/openaustralia/morph


As a PoC, I would be willing to "turn the web into data", i.e., produce one of the formats offered by these "services": CSV.

I will use only standard UNIX utilities, no Python, etc. As such, you "own" the code. No SaaS. The result will be portable and run on any UNIX.

I believe I can deliver in fewer words of code and that the result will be easier to modify when sites change.

You pay nothing. Post your scraping "challenges" to HN.

I enjoy turning web into data.

Some people enjoy working with HTML, CSS, Javascript, etc. I prefer working with raw data.

It is interesting to hear that some people are willing to pay to have the HTML, CSS, Javascript, etc. stripped out.


For anyone that wants to do this full time and work with a really cool team, shoot me an email: kevin@mattermark.com


HN,

So who do you guys use more? Import.io or Kimono? I have heard good things about both.


I write my own custom scrapers; I prefer the flexibility and feel safer knowing the service isn't going to disappear at any minute.

If anybody is interested, I wrote a detailed article on scraping not so long back that was well received here: http://jakeaustwick.me/python-web-scraping-resource/


I tried Kimono, but it cannot authenticate into the sites I want to pull the data from...

Just grabbed import.io - will see if it can log into sites and grab the data from services I am already paying thousands per month for.

EDIT:

To add some context: I pay about $3,000 per month for some monitoring services which do not have any real reporting mechanisms. So for my daily and weekly reports, I have to manually compile them, screenshot a ton of things, compose an email, and send it.

I want to configure a scraper to automatically grab screens of things I want regularly and email them.

I want to have a script that will grab many different pieces of data (visual graphs, typically) and put them all into one email.

I am working with my monitoring vendors to get them to add reporting... but until that happens, I am tired of spending a couple of hours per week screen-capping graphs...
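
Roughly the kind of glue script I have in mind - a sketch only, assuming the vendor's charts can be pulled as plain image URLs with some kind of API key (every URL, key and address below is a placeholder):

    import smtplib
    import urllib.request
    from email.mime.image import MIMEImage
    from email.mime.multipart import MIMEMultipart

    # Placeholder chart URLs: in reality, whatever the monitoring vendor
    # exposes (assumed here to accept a simple API key in the query string).
    CHARTS = {
        "cpu.png": "https://monitoring.example.com/charts/cpu.png?apikey=XXX",
        "latency.png": "https://monitoring.example.com/charts/latency.png?apikey=XXX",
    }

    msg = MIMEMultipart()
    msg["Subject"] = "Weekly monitoring report"
    msg["From"] = "reports@example.com"
    msg["To"] = "me@example.com"

    for name, url in CHARTS.items():
        # Download each chart and attach it to the one report email.
        with urllib.request.urlopen(url) as resp:
            img = MIMEImage(resp.read())
        img.add_header("Content-Disposition", "attachment", filename=name)
        msg.attach(img)

    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

If the graphs only render inside a logged-in browser session, you'd still need a headless browser (or one of the services in this thread) to produce the screenshots first.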


I'm evaluating these to augment a system I'm building on top of casper. This is the first I've seen of this one, but right out of the gate I think I prefer Kimono.


I prefer to rely on code that doesn't rely on an API that could just vanish the next day or cost a bucket to run.


> rely on an API that could just vanish the next day

Kind of ironic that you are saying this about web scraping ...


But then his data source is gone and what he was doing is pointless anyway. Losing your processor for a data source while that source is still available is what's frustrating.


What do you use for scraping? I may have a scraping project later this year and would love recommendations.


I've written a couple of "polite" crawlers in Go (i.e., they obey robots.txt and delay between requests to the same host).

- Fetchbot: https://github.com/PuerkitoBio/fetchbot

Flexible, similar API to net/http (uses a Handler interface with a simple mux provided, supports middleware, etc.)

- gocrawl: https://github.com/PuerkitoBio/gocrawl

Higher-level, more framework than library.

Coupled with goquery (https://github.com/PuerkitoBio/goquery) to scrape the DOM (well, the net/html nodes), this makes custom scrapers trivial to write.

(sorry for the self-promoting comment, but this is quite on topic)

edit: polite crawlers, not scrapers.
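
For anyone wondering what "polite" amounts to in practice, it's basically a robots.txt check plus a per-host delay. A minimal stand-alone sketch of that pattern (Python standard library, nothing to do with the Go packages above; the user-agent name is made up):

    import time
    import urllib.request
    import urllib.robotparser
    from urllib.parse import urlparse

    DELAY = 2.0     # seconds to wait between requests to the same host
    _last_hit = {}  # host -> time of the last request we made to it
    _robots = {}    # host -> parsed robots.txt for that host

    def polite_fetch(url, agent="example-crawler"):
        """Fetch url only if robots.txt allows it, with a per-host delay."""
        host = urlparse(url).netloc

        # Fetch and cache robots.txt the first time we see a host.
        if host not in _robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url("http://%s/robots.txt" % host)
            rp.read()
            _robots[host] = rp
        if not _robots[host].can_fetch(agent, url):
            return None  # disallowed by robots.txt

        # Respect a fixed delay between requests to the same host.
        wait = DELAY - (time.time() - _last_hit.get(host, 0))
        if wait > 0:
            time.sleep(wait)
        _last_hit[host] = time.time()

        req = urllib.request.Request(url, headers={"User-Agent": agent})
        with urllib.request.urlopen(req) as resp:
            return resp.read()

The packages above take care of the same bookkeeping for you.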


Scrapy gets a solid recommendation from me. http://scrapy.org/
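
For anyone who hasn't tried it, a complete spider is only a handful of lines. A rough sketch (the demo site and CSS selectors are just placeholders):

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            # Pull one record per quote block on the page.
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # Follow pagination until there is no "next" link.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

Running it with "scrapy runspider quotes_spider.py -o quotes.csv" writes the CSV directly.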


We've got quite an old mailing list full of geeks hand-coding web scrapers, if you want somewhere to ask questions:

https://groups.google.com/forum/#!forum/scraperwiki


I use custom node.js scripts with these libraries:

* request - https://github.com/mikeal/request

* async - https://github.com/caolan/async

* cheerio - https://github.com/cheeriojs/cheerio

* nedb - https://github.com/louischatriot/nedb


There are a bunch of comments about rolling your own scraper instead of relying upon a possibly unreliable SaaS app.

That makes me think: would it be viable to run a service that, instead of running the scraping on its own servers, simply gave you a custom binary to run?

Assuming that you trusted the executable, you would never have to worry about the company failing. It'd just be a one-time fee, and yours to use in perpetuity. Presumably updates would be free.


That's a really neat idea I'd pay for. Not sure about the sustainability of the model though.


If you use Scrapy (which is an awesome Python scraping framework) you can plug in third-party solutions such as http://crawlera.com/

Not really server-level hosting, but you get the benefits of their network.
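
To make "plugging in" concrete: Scrapy's built-in HttpProxyMiddleware will route any request that sets request.meta["proxy"], so the generic form is just the sketch below (the proxy URL and credentials are placeholders; Crawlera also offers its own dedicated Scrapy middleware, which is the more idiomatic route):

    import scrapy

    class ProxiedSpider(scrapy.Spider):
        name = "proxied"

        def start_requests(self):
            # HttpProxyMiddleware routes any request that carries a
            # "proxy" key in its meta dict through that endpoint.
            yield scrapy.Request(
                "http://example.com/",  # placeholder target
                meta={"proxy": "http://user:pass@proxy.example.com:8010"},
            )

        def parse(self, response):
            self.logger.info("fetched %s (%s)", response.url, response.status)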


Great to see these guys are now out of Beta!

While their real-time Extractors aren't quite as quick as doing it yourself, we've found them to be particularly useful for sites that require JavaScript and/or cookies.

It's also worth mentioning that it's quick to get started. You can start playing around with real data without having to dig into a site's URL structure, and then write your own scraper later if needed.


Isn't it illegal to scrape without permission? How would import.io handle the case where a large site comes back with legal threats after one of their users has scraped the wrong site? Can they claim non-responsibility?

Also, what happens when sites start blocking their IPs due to repeated scraping, or is that unlikely to happen?


Heads up, the application is placed in ~/Desktop and not /Applications


They presented last year at Yahoo!'s Hack Europe: London hackathon. It's an interesting concept; they've come far since their initial presentation, and while the app has its quirks, I have come to use it occasionally for some tasks.

I hope they'll manage to properly monetize this - I don't see why I should pay to use a scraping rule if I can just write the scraper myself, which doesn't cost me that much more time.


What kind of legitimate uses are there for something like this? This is not a sarcastic question. It seems like an obvious spam magnet, but if people are using it legitimately, wouldn't their sources already be providing an API or RSS feed?


I have my own use case for it, and it probably mirrors that of other sites. I run my own blog and thus have ads and affiliate links there. The thing is, as good as Google AdSense is, it's shitty for my site and my topic (Web Dev).

What am I left with? Great affiliates like Team Treehouse, Lynda.com, framework themes, and Udemy. The problem is that none of those offer any kind of good API. All they have is a link and possibly an image that they provide.

By using Kimono, I can scrape (but I don't) all of Udemy's programs, categorize them with custom categories, build a full-text search engine around it, and serve relevant ads per post. For instance, my "Best Bootstrap Themes" post would yield the "Learn Bootstrap" Udemy course and an on-the-fly-but-cached image for it, thus serving relevant ads to my users.

Same goes for Lynda. If someone lands on "Why C# is a great language to learn" (one of my unreleased articles), my custom API built on top of scraped data could serve them an "ASP.NET Essentials" course.

So why use something like this for framework themes? Take Wrapbootstrap.com: they have a great affiliate program. Using Kimono, you can easily get daily refreshes of their main page, which usually has sale-priced themes, featured themes, and new rising themes. This way, you can serve users an ad with up-to-date prices and themes that are hot right now.

What about non-ad uses? You can create custom search weighted according to YOUR metrics, build your own marketplace front end, and aggregate several sources in order to serve users better content.
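
To make the "relevant course per post" step concrete, here is a toy sketch of the matching logic - a crude keyword-overlap stand-in for the full-text search described above, with invented course data:

    # Toy matcher: pick the scraped course whose title shares the most
    # words with a blog post title. All data here is invented.
    SCRAPED_COURSES = [
        {"title": "Learn Bootstrap", "url": "https://example.com/learn-bootstrap"},
        {"title": "ASP.NET Essentials", "url": "https://example.com/aspnet-essentials"},
        {"title": "Python for Web Scraping", "url": "https://example.com/python-scraping"},
    ]

    def best_course_for(post_title):
        post_words = set(post_title.lower().split())

        def overlap(course):
            return len(post_words & set(course["title"].lower().split()))

        best = max(SCRAPED_COURSES, key=overlap)
        return best if overlap(best) > 0 else None

    print(best_course_for("Best Bootstrap Themes"))  # -> the "Learn Bootstrap" entry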


We use scraping to gather product prices from online shops for a price comparison site. I have permission from the sites, which can't be bothered to provide us with a price list other than their public website. Legal and necessary - so I believe there is a market for this, though I am not sure about its size.


Doing something like http://openstates.org/ is a perfect example. State government data is shitty most of the time and doesn't have a public API you can query, so Open States runs 50+ scrapers to get the data and normalize it.


Very few companies can figure out how to provide proper APIs. Unless it's part of their core business, it'll always be lacking.


I suspect the real, top-secret business behind import.io is either training a system to crawl the web and see structured data, or gathering over time a very rich crowd-sourced database of structured data (or both).


We've posted answers to some of your questions on our blog: http://blog.import.io/post/you-ask-we-answer


Can someone tell me more about the law and scraping websites?


See my blog post about this on the ScraperWiki blog https://blog.scraperwiki.com/2012/04/is-scraping-legal/


Unfortunately, this doesn't seem to work too well on my Mac. And why do you want to know who my friends on Facebook are?



I wrote http://scrape.ly if you want to have a look; it's a URL-based API for web scraping.



