
Import.io – Structured Web Data Scraping - steeples
https://import.io/
======
mlandauer
If you're concerned about using a hosted scraping platform because it might
disappear check out [https://morph.io](https://morph.io) \- it's open source
as well
[https://github.com/openaustralia/morph](https://github.com/openaustralia/morph)

------
uuid_to_string
As a PoC, I would be willing to "turn the web into data", i.e., produce one of
the formats offered by these "services": CSV.

I will use only standard UNIX utilities, no Python, etc. As such, you "own"
the code. No SaaS. The result will be portable and run on any UNIX.

I believe I can deliver it in fewer lines of code and that the result will be
easier to modify when sites change.

You pay nothing. Post your scraping "challenges" to HN.

I enjoy turning the web into data.

Some people enjoy working with HTML, CSS, JavaScript, etc. I prefer working
with raw data.

It is interesting to hear that some people are willing to pay to have the
HTML, CSS, JavaScript, etc. stripped out.

~~~
kevin_morrill
For anyone that wants to do this full time and work with a really cool team,
shoot me an email: kevin@mattermark.com

------
ycmike
HN,

So which do you guys use more: Import.io or Kimono? I have heard good things
about both.

~~~
thejosh
I prefer code that doesn't rely on an API that could just vanish the next day
or cost a fortune to run.

~~~
ejstronge
What do you use for scraping? I may have a scraping project later this year
and would love recommendations.

~~~
PuerkitoBio
I've written a couple of "polite" crawlers in Go (i.e., they obey robots.txt
and delay between requests to the same host).

\- Fetchbot:
[https://github.com/PuerkitoBio/fetchbot](https://github.com/PuerkitoBio/fetchbot)

Flexible, similar API to net/http (uses a Handler interface with a simple mux
provided, supports middleware, etc.) - see the sketch after the list.

\- gocrawl:
[https://github.com/PuerkitoBio/gocrawl](https://github.com/PuerkitoBio/gocrawl)

Higher-level, more framework than library.
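
From memory, basic fetchbot usage looks roughly like this (an untested
sketch, so double-check against the README; example.com is a placeholder):

    package main

    import (
        "fmt"
        "net/http"

        "github.com/PuerkitoBio/fetchbot"
    )

    // handler is invoked for every completed request.
    func handler(ctx *fetchbot.Context, res *http.Response, err error) {
        if err != nil {
            fmt.Printf("error: %s\n", err)
            return
        }
        fmt.Printf("[%d] %s %s\n", res.StatusCode, ctx.Cmd.Method(), ctx.Cmd.URL())
    }

    func main() {
        // The fetcher obeys robots.txt and delays requests per host.
        f := fetchbot.New(fetchbot.HandlerFunc(handler))
        queue := f.Start()
        queue.SendStringGet("http://example.com/") // placeholder seed URL
        queue.Close()                              // done queuing; wait and stop
    }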

Coupled with goquery
([https://github.com/PuerkitoBio/goquery](https://github.com/PuerkitoBio/goquery))
to scrape the DOM (well, the net/html nodes), this makes custom scrapers
trivial to write.
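
E.g., listing every link on a page is a few lines with goquery (an untested
sketch; swap in your own URL and selectors):

    package main

    import (
        "fmt"
        "log"
        "net/http"

        "github.com/PuerkitoBio/goquery"
    )

    func main() {
        // Placeholder URL; substitute the page you want to scrape.
        res, err := http.Get("http://example.com/")
        if err != nil {
            log.Fatal(err)
        }
        defer res.Body.Close()

        doc, err := goquery.NewDocumentFromReader(res.Body)
        if err != nil {
            log.Fatal(err)
        }

        // Print the text and href of every anchor on the page.
        doc.Find("a").Each(func(i int, s *goquery.Selection) {
            href, _ := s.Attr("href")
            fmt.Printf("%s -> %s\n", s.Text(), href)
        })
    }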

(sorry for the self-promoting comment, but this is quite on topic)

edit: polite crawlers, not scrapers.

------
RaphiePS
There are a bunch of comments about rolling your own scraper instead of
relying upon a possibly unreliable SaaS app.

That makes me think -- would it be viable to run a service that, instead of
running the scraping on its own servers, simply gave you a custom binary to
run?

Assuming that you trusted the executable, you would never have to worry about
the company failing. It'd just be a one-time fee, and yours to use in
perpetuity. Presumably updates would be free.

~~~
caio1982
That's a really neat idea I'd pay for. Not sure about the sustainability of
the model though.

------
robotfelix
Great to see these guys are now out of Beta!

While their real-time Extractors aren't quite as quick as doing it yourself,
we've found them particularly useful for sites that require JavaScript and/or
cookies.

It's also worth mentioning that it's quick to get started. You can start
playing around with real data without having to dig into a site's URL
structure, and then write your own scraper later if needed.

------
chrisherring
Isn't it illegal to scrape without permission? How would import.io handle the
case where a large site comes back with legal threats because one of their
users has scraped the wrong site? Can they disclaim responsibility?

Also, what happens when sites start blocking their IPs due to repeated
scraping, or is this unlikely to happen?

------
seivan
Heads up: the application is placed in ~/Desktop, not /Applications.

------
th0br0
They presented last year at Yahoo!'s Hack Europe: London hackathon. It's an
interesting concept; they've come far since their initial presentation, and
while the app has its quirks, I have come to use it occasionally for some
tasks.

I hope they'll manage to monetize this properly - I don't see why I should
pay to use a scraping rule when I can just write the scraper myself, which
doesn't cost me that much more time.

------
fibertera
What kind of legitimate uses are there for something like this? This is not a
sarcastic question. It seems like an obvious spam magnet, but if people are
using it legitimately, wouldn't their sources already be providing an API or
RSS feed?

~~~
antjanus
I have my own use case for it, and it probably mirrors that of other sites. I
run my own blog and thus have ads and affiliate links there. The thing is, as
good as Google AdSense is, it's shitty for my site and my topic (web dev).

What am I left with? Great affiliates like Team Treehouse, Lynda.com,
framework themes, and Udemy. The problem is that none of those offer any kind
of good API. All they have is a link and possibly an image.

By using Kimono, I can scrape (but I don't) all of Udemy's programs,
categorize them with custom categories, build a full-text search engine around
it, and serve relevant ads per post. For instance, my "Best Bootstrap Themes"
post would yield a "Learn Bootstrap" Udemy course and an on-the-fly-but-cached
image for it, thus serving relevant ads to my users.

Same goes for Lynda. If someone lands on "Why C# is a great language to
learn" (one of my unreleased articles), my custom API built on top of scraped
data could serve them an "ASP.NET Essentials" course.

So why use something like this for framework themes? Take Wrapbootstrap.com:
they have a great affiliate program. Using Kimono, you can easily get daily
refreshes of their main page, which usually has sale-priced themes, featured
themes, and new rising themes. This way, you can serve users an ad with
up-to-date prices and themes that are hot right now.

What about non-ad uses? You can create custom search weighted according to
YOUR metrics, build your own marketplace front end, and aggregate several
sources to serve users better content.
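
To make the matching step concrete, here's a rough sketch in Go (all data,
names, and fields are made up) of picking a relevant course for a post:

    package main

    import (
        "fmt"
        "strings"
    )

    // Course is one record from a scraped catalog (hypothetical fields).
    type Course struct {
        Title, URL string
        Tags       []string
    }

    // bestMatch returns the first course sharing a tag with the post,
    // or nil when nothing fits (fall back to a generic ad).
    func bestMatch(postTags []string, courses []Course) *Course {
        for i := range courses {
            for _, t := range courses[i].Tags {
                for _, p := range postTags {
                    if strings.EqualFold(t, p) {
                        return &courses[i]
                    }
                }
            }
        }
        return nil
    }

    func main() {
        // Hypothetical rows, as if refreshed daily from a scrape.
        courses := []Course{
            {"Learn Bootstrap", "http://example.com/bootstrap", []string{"bootstrap", "css"}},
            {"ASP.NET Essentials", "http://example.com/aspnet", []string{"c#", "asp.net"}},
        }
        if c := bestMatch([]string{"bootstrap"}, courses); c != nil {
            fmt.Println(c.Title, c.URL) // the ad to serve on a Bootstrap post
        }
    }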

------
thom
I suspect the real, top-secret business behind import.io is either training a
system to crawl the web and recognize structured data, or gathering over time
a very rich crowd-sourced database of structured data (or both).

------
jmethvin
We've posted answers to some of your questions on our blog:
[http://blog.import.io/post/you-ask-we-answer](http://blog.import.io/post/you-ask-we-answer)

------
pmtarantino
Can someone tell me more about the law and scraping websites?

~~~
frabcus
See my blog post about this on the ScraperWiki blog
[https://blog.scraperwiki.com/2012/04/is-scraping-legal/](https://blog.scraperwiki.com/2012/04/is-scraping-legal/)

------
late2part
Unfortunately, this doesn't seem to work too well on my Mac. And why do you
want to know who my friends on Facebook are?

------
notduncansmith
Reminds me of [https://www.kimonolabs.com/](https://www.kimonolabs.com/)

------
notastartup
I wrote [http://scrape.ly](http://scrape.ly) if you wanna have a look; it's a
URL-based API for web scraping.

