
Show HN: Kimono – Never write a web scraper again - pranade
http://kimonify.kimonolabs.com/kimload?url=http%3A%2F%2Fwww.kimonolabs.com%2Fwelcome.html
======
randomdrake
The presentation is beautiful and the website is great, but the tech broke so
I have no idea how or if this even works. This is a wonderful concept and one
I've talked about doing with others. I was really excited to try this. I
watched the demo video and it seemed straightforward.

I went to try and use it on the demo page it provides, going through and
adding things, but when I went to save it, I just received an error that
something went wrong. Well, crap. That was a waste of time. Oh well, maybe
it's just me.

Alright, I'll give it another shot using the website they used in the demo.
Opened up a Hacker News discussion page and started to give it a try.
Immediately it was far less intelligent than the demo. Clicking on a title
proceeded to select basically every link on the page. Somehow I clicked on
some empty spots as well. Nothing was being intelligently selected like it was
in the demo. Fine, that wasn't working tremendously well, but I wanted to at
least see the final result.

Same thing: just got an error that something went wrong and it couldn't save
my work.

Disappointing. I still might try it again when it works 'cause it's a great
idea if they really pulled it off. So far: doesn't seem to be the case.

~~~
ricardobeat
HN pages are possibly the worst case: they're very hard to infer structure
from due to their 1998 coding standards. You'll have a better chance with an
alternative interface like [http://ihackernews.com/](http://ihackernews.com/)
or [http://hckrnews.com](http://hckrnews.com) (no comments though).

~~~
wilg
I really wish that HN wasn't even in the running for a "worst case". For a
community that seems to be all about UX and innovation, shouldn't it run on at
least a marginally user-friendly piece of software with this-century markup?

I get that it has a sort of kitschy or retro appeal, but it's just basically a
pain to use and looks terrible.

I can't tell you how often I click next page to find that I've taken too long
and my session or whatever has expired.

~~~
tinco
No. This site isn't about UX or innovation, it's about tech startups. It's a
constant reminder that something can be successful even if it was written in a
LISP dialect and has a bunch of UX misses, as long as the core idea is valid
and the product is usable enough.

I've used a bunch of HN skins that were supposedly better designed, but none
of them stuck. Apparently it's just plain unnecessary for HN to be better.

~~~
theg2
So because it's written in LISP it gets a free pass in other areas? That seems
like a dubious claim.

~~~
SkyMarshal
No, despite being written in an obscure language and having a sparse UI, it
has massive traction anyway. There's a lesson in there somewhere.

~~~
bmelton
> There's a lesson in there somewhere

The website that lets technologists hang out with people who might give them
large sums of money to see their ideas succeed doesn't have to be good, or
pretty.

If the website were better, or prettier, that would not add any additional
value to the previously mentioned large sums of money.

(Oh, and advice, and being able to chat with industry leaders and experts on
diverse arrays of topics, etc.)

------
dunham
The Simile group at MIT did something similar back around 2006. Automatic
identification of collections in web pages (repeated structures), detection of
fields by doing tree comparisons between the repeated structures, and fetching
of subsequent pages.

The software is abandoned, but their algorithms are described in a paper:

    http://people.csail.mit.edu/dfhuynh/research/papers/uist2006-augmenting-web-sites.pdf

~~~
losvedir
Oh, hey, memories. I worked one summer with David Huynh (who you're linking to
there) and David Karger (his thesis advisor) on one of the Simile projects.

I vaguely remember playing around with this tool you mentioned. I thiiiiink it
was this one[0], although it seems to be superseded by this one[1] now.

[0]
[http://simile.mit.edu/wiki/Piggy_Bank](http://simile.mit.edu/wiki/Piggy_Bank)
[1] [http://simile.mit.edu/wiki/Sifter](http://simile.mit.edu/wiki/Sifter)

~~~
danso
Just had to chime in and say that David Huynh and his fellow programmers will
be forever heroes to me and a small group of data journalists who depended on
Gridworks/Google Refine/OpenRefine.

------
DanBlake
Show me it working with authentication and you will have a customer. Scraping
is always something you need to write because the shit you want to get is only
shown when you are logged in.

~~~
pranade
Yes, it's one of the most popular feature requests. We don't support auth yet,
but it's on our shortlist and we hope to have it ready soon.

~~~
chii
how are you going to do it without having to know the actual authentication
key(s)? if i don't trust anyone enough to give my auth away, then unless the
site being scraped has some sort of oauth support, how are you going to get
any data?

of course, if this were an offline or self-hosted product, it would solve that
auth problem instantly.

~~~
the_french
Would there be any way to fake the beginning of an OAuth session with
Facebook, Google or any other OAuth-authenticated site? Kind of like replaying
cookies to hijack sessions?

~~~
garyjob
Proxying the web page makes it very difficult to do actual authentication on
Facebook's or Google's websites via the proxied page without first rewriting
most of the JavaScript and hijacking their Ajax calls on the fly.

The approach I took was to hijack the cookies from the browser, via a browser
extension, once the user has signed in on e.g. Facebook.

The proxying route does, in fact, do away with the need to install any
third-party software.

The browser extension I built, coupled with the web service it's integrated
with, does allow scraping of logged-in pages from Facebook, Google and
LinkedIn as well.

[https://chrome.google.com/webstore/detail/krakeio/ofncgcgajh...](https://chrome.google.com/webstore/detail/krakeio/ofncgcgajhgnbkbmkdhbgkoopfbemhfj)

~~~
davedx
Hah, I've been working on this recently with Facebook, on a TV set-top-box. It
was painful and I ended up giving up. xd_arbiter.php is the key, I think.

------
georgemcbay
I've written more web scraping code than I care to admit. A lot of the apps
that ran on chumby devices used scraping to get their data (usually(!) with
the consent of the website being scraped) since the device wasn't capable of
rendering html (it eventually did get a port of Qt/WebKit, but that was right
before it died and it wasn't well integrated with the rest of the chumby app
ecosystem).

This service looks great, good work! But since you seem to host the APIs
created, how do you plan to get around the centralized access issues? On the
chumby we had to do a lot of web scraping on the device itself (even though
the string processing needed for scraping required a lot of hoop-jumping
optimization to run well in ActionScript 2 on a slow ARMv5 chip with 64MB
total RAM) to avoid all the requests coming from the same set of
chumby-server IP addresses. Companies tend to notice lots of requests coming
from the same server block really quickly and will often rate limit the hell
out of you, which could leave one heavy-usage scraper destroying access for
every other client trying to scrape the same source.

~~~
pranade
Access, legality and rate limiting issues come up a lot. We're working on a
couple of things to address them. The first is an intelligent job distribution
system that consolidates scrapes across users and hits sites (and pages) at
human-like intervals. The second is a portal for webmasters that gives them
privileged access to analytics on the data being extracted from their sites,
and the ability to turn kimono APIs on or off if they see fit. This way, via
kimono, a webmaster at chumby could "provision" certain kimono users. We've
yet to see whether the latter works out. Thanks for the input!

~~~
Kudos
Use a user-agent containing a URL where I can find out who and what you are,
and honor my robots.txt.

Having a panel for webmasters _along_ with that would be fine.
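
Both halves are a few lines of stdlib Python; a minimal sketch (the bot name
and info URL here are made up):

    import urllib.request
    import urllib.robotparser

    # Hypothetical bot: a name plus a URL where webmasters can read who/what you are
    UA = "examplebot/0.1 (+http://example.com/about-our-bot)"

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()  # fetch and parse robots.txt

    page = "http://example.com/some/page"
    if rp.can_fetch(UA, page):
        req = urllib.request.Request(page, headers={"User-Agent": UA})
        html = urllib.request.urlopen(req).read()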

~~~
pranade
Great suggestion... thanks for this one. We're putting this on our list

~~~
thatthatis
Please tell me that the robots.txt suggestion is something you're already
doing and the user-agent part is what's going on the list.

~~~
garyjob
You could try doing IP address rotation by rotating EC2 instances or some
other cloud service.

I wrote a library for that.

[https://github.com/KrakeIO/resque-my-aws](https://github.com/KrakeIO/resque-my-aws)

~~~
Kudos
That has nothing to do with what's being discussed on this thread.

------
GigabyteCoin
I'm curious how you plan to avoid/circumvent the inevitable hard IP ban that
the largest (and most sought-after) targets will place on you and your
services once you begin to take off.

I could have really used a service like this just yesterday, actually. I ended
up fiddling around with iMacros and got about 80% of what I was trying to
achieve.

~~~
pranade
It's a great question. What we're really trying to do is make data accessible
programmatically and at scale. We want to connect data providers and data
consumers with APIs in a way that's mutually beneficial vs. being a tool for
data theft. Our hope is to (once we scale) actually work with data providers
directly on the distribution of their data so the IP ban becomes a non-issue.

~~~
Nilzor
But isn't the point kinda to let the users come up with data providers
themselves? If you say "Only these 500 data providers are available for
scraping", you don't have a business. If you _don't_ have such a limitation,
you won't be able to work directly with all data providers. You'll have IP
problems.

------
hcarvalhoalves
This is excellent. Even if it doesn't work for scraping all sites, it
simplifies the average use case so much that it's not even funny.

Feature proposal: deal with pagination.

~~~
scotty79
Another feature, a simple one: allow adding filters to the data stream. For
example: only posts that contain the word "bitcoin" in the name, or only those
with 50 upvotes or more.
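
Even a dumb post-processing step on the returned JSON would do; a rough sketch
of the kind of filter I mean (field names are invented, since they'd depend on
the API you build):

    # Rough filter sketch; "name" and "upvotes" are invented field names
    posts = [
        {"name": "Bitcoin hits new high", "upvotes": 120},
        {"name": "Show HN: my weekend project", "upvotes": 12},
    ]

    filtered = [
        p for p in posts
        if "bitcoin" in p["name"].lower() or p["upvotes"] >= 50
    ]
    print(filtered)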

~~~
pranade
Thanks for the suggestion... adding to the list :)

~~~
walden42
Make sure to include regex matching =)

~~~
pranade
Thanks. We support regex matching now. Try dragging to select text; if there's
a relevant regex pattern, kimono will find it (there's an example in the blog
post). You can also preview (and soon, you'll be able to edit) the CSS and
regex in advanced mode.

~~~
scotty79
Too bad I couldn't edit the selectors and regexes at this step. That way I
could have implemented the filters I needed manually.

------
fsckin
Constructive Tone: I figured that it might be nifty to scrape cedar pollen
count information from a calendar and then shoot myself an email when it was
higher than 100 gr/m3.

This would be a pretty difficult thing to grab when scraping normally, but the
app errors before loading the content:

[https://www.keepandshare.com/calendar/show_month.php?i=19409...](https://www.keepandshare.com/calendar/show_month.php?i=1940971)

JS error: An error occurred while accessing the server, please try again.
Error Reference: 6864046a

~~~
pranade
Thanks for letting us know. Just tried and am getting the same error. The page
is loading content dynamically from another source... We'll look into this and
see if we can get it working on this page

~~~
fsckin
Do you support POSTs for fetching dynamic data? I found where it's pulling
from, here's the curl command:

    curl "https://www.keepandshare.com/calendar/fns_asynch_api.php?r=0.04834879608824849&fromapp=calendar" \
      --data "action=getrange&i=1940971&from=2013-12-26&to=2014-02-06"
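
The same request in Python with requests, as a sketch, if that's easier to
test against:

    # Equivalent of the curl command above
    import requests

    r = requests.post(
        "https://www.keepandshare.com/calendar/fns_asynch_api.php",
        params={"r": "0.04834879608824849", "fromapp": "calendar"},
        data={
            "action": "getrange",
            "i": "1940971",
            "from": "2013-12-26",
            "to": "2014-02-06",
        },
    )
    print(r.text)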

~~~
pranade
No, we don't have POST support quite yet. We're working on a solution.

------
thinkzig
Great work so far. The tool was very intuitive and easy to use.

My suggestion: once I've defined an API, let me apply it to multiple targets
that I supply to you programmatically.

The use case driving my suggestion: I'm an affiliate for a given eCommerce
site. As an affiliate, I get a data feed of items available for sale on the
site, but the feed only contains a limited amount of information. I'd like to
make the data on my affiliate page richer with extra data that I scrape from a
given product page that I get from the feed.

In this case, the page layout for all the various products for sale is exactly
the same, but there are thousands of products.

So I'd like to be able to define my Kimono API once - let's call it the
CompanyX.com Product Page API - then use the feed from my affiliate partner to
generate a list of target URLs that I feed to Kimono.

Bonus points: the list of products changes all the time. New products are
added, some go away, etc. I'd need to be able to add/remove target URLs from
my Kimono API individually as well as adding them in bulk.

Thanks for listening. Great work, again. I can't wait to see where you go with
this.

Cheers!

~~~
pranade
Thanks a ton for the feedback. Getting data from multiple similarly structured
URLs programmatically is something we're working on now. We love hearing about
your use cases so we can make sure we build out the right features to make
kimono useful for you.

------
tectonic
Just use [http://selectorgadget.com](http://selectorgadget.com)

~~~
fizx
You should write a blog post on lessons learned when we spent a year making
~this in 2008.

~~~
deskglass
Thanks so much for creating SelectorGadget! I used it a lot when scraping some
fanfiction and Wikipedia data.

------
sync
Undo button is awesome.

More web apps need an undo button.

------
rlpb
Are you familiar with ScraperWiki? I'm wondering how your work fits in with
it.

Edit: looks like they've moved away from that space, but have an old version
available at:
[https://classic.scraperwiki.com/](https://classic.scraperwiki.com/)

~~~
Maxious
The people who scrape data to avoid paying for APIs are the same people who
will not pay for a service to make scraping easier ;)

~~~
handelaar
The people who scrape data at Scraperwiki -- which was made by the same people
who opened up parliamentary transcripts in the UK for the first time, and the
UN's proceedings, and data about how MPs in London vote -- generally don't
have an option to buy anything because the data's hidden by governments from
the people who paid for it, on purpose.

But by all means take this opportunity to dismiss all of us as freeloaders.

~~~
Maxious
I have 17 scrapers on ScraperWiki Classic for government data:
[https://classic.scraperwiki.com/profiles/maxious/](https://classic.scraperwiki.com/profiles/maxious/)

It would have cost me $348/year to move those to the new ScraperWiki.

~~~
yahelc
That reads as "Less than $1/day" to me...

------
trey_swann
This is a great tool! In a past life we needed a web scraper to pull single
game ticket prices from NBA, MLB, and NHL team pages (e.g.
[http://www.nba.com/warriors/tickets/single](http://www.nba.com/warriors/tickets/single)).
We needed the data. But, when you factor in dynamic pricing and frequent page
changes you are left with a real headache. I wish Kimono was around when we
were working on that project.

I love how you can actually use their "web scraper for anyone" on the blog
post. Very cool!

------
pknight
That UI made me go wow; this could be an awesome tool. An idea that pops into
my mind is being able to grab data from those basic local sites run by
councils, local newspapers etc. and putting it into a useful app.

How dedicated are you guys to making this work? I'd imagine there are quite a
few technical hurdles in keeping a service like this running long term while
not getting blocked by various sites.

~~~
pranade
Love your suggestion. We're committed to making kimono better and we're
working on it all the time. We want to make sure it's a responsible scraper,
so we want to work together with webmasters in cases where there might be
blocking but the data is legal to share...

------
fnordfnordfnord
>Sorry, can't kimonify

>According that web site's data protection policy, we were unable to kimonify
that particular page.

Sigh... Oh well... Back to scraping.

~~~
pranade
What page were you trying to hit? We'll check it out

~~~
fnordfnordfnord
Pages buried in here: [https://fannin4.wcjc.edu/](https://fannin4.wcjc.edu/)

The course catalog is public, so no login is needed. I want to scrape various
data related to courses, to populate forms automatically and such.

~~~
nobodysfool
Yeah, of course it won't work with HTTPS sites. They'd have to proxy those
HTTPS sites and perform a MITM just to do it.

~~~
fnordfnordfnord
I kind of just assumed that's what they were doing.

~~~
garyjob
HTTPS is definitely a problem for proxy servers, unless the proxy server
rewrites all the URLs in the loaded HTML pages, as well as all the URLs of the
Ajax calls, to point back to the proxy server.
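
A bare-bones sketch of that rewriting pass with BeautifulSoup (the proxy
endpoint is made up; this only touches static href/src attributes, and the JS
and Ajax URLs are the hard part):

    # Sketch: rewrite static page URLs to point back at a proxy.
    # PROXY is a made-up endpoint; JS and Ajax URLs are NOT handled here.
    from urllib.parse import quote
    from bs4 import BeautifulSoup

    PROXY = "https://proxy.example.com/fetch?url="

    def rewrite(html):
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup.find_all(href=True):
            tag["href"] = PROXY + quote(tag["href"], safe="")
        for tag in soup.find_all(src=True):
            tag["src"] = PROXY + quote(tag["src"], safe="")
        return str(soup)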

~~~
fnordfnordfnord
It may be to their advantage to come up with a solution for this, given the
popularity of https these days.

------
bambax
> _Web scraping. It's something we all love to hate. You wish the data you
> needed to power your app, model or visualization was available via API. But,
> most of the time it's not. So, you decide to build a web scraper. You write
> a ton of code, employ a laundry list of libraries and techniques, all for
> something that's by definition unstable, has to be hosted somewhere, and
> needs to be maintained over time._

I disagree. Web scraping is mostly fun. You don't need "a ton of code" and "a
laundry list of libraries", just something like Beautiful Soup and maybe XSLT.
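
To be concrete, a complete scraper is often as small as this (a sketch with
requests + Beautiful Soup; the URL and selectors are invented):

    # Sketch: an entire scraper in ~10 lines. URL and CSS selectors
    # are invented; adjust for the page you're actually scraping.
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("http://example.com/listings").text
    soup = BeautifulSoup(html, "html.parser")

    for row in soup.select("div.listing"):
        title = row.select_one("a.title")
        price = row.select_one("span.price")
        if title and price:
            print(title.get_text(strip=True), price.get_text(strip=True))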

The end of the statement is truer: it's not really a problem that your web
scraper will have to be hosted somewhere, since the thing you're using it for
also has to be hosted somewhere, but yes, it needs to be maintained and it
will break if the source changes.

But I don't see how this solution could ever automatically evolve with the
source, without the original developer doing anything.

~~~
littledot5566
Perhaps this could be automated by finding the same content in two versions of
the DOM, doing a diff on the structure, and updating the rules?

~~~
pranade
It would be great to automate this eventually. For now, we're trying to make
it really easy to set up and rebuild the scraper. If it goes down, you'll see
it in the status on your user dashboard. We're also implementing alerts, so
you can opt to get an email notification if a scrape fails

------
IbJacked
Wow, this is looking good, I wish I had it available to me 6 months ago! Nice
job :D

I don't know if it's just me or not, but it's not working for me in Firefox
(OS X Mavericks 10.9.1 and Firefox v26). The X's and checkmarks aren't showing
up next to the highlighted selections. Works fine in Safari.

~~~
pranade
Thanks for letting us know. We've tested on some versions of Firefox, but not
v26 on Mavericks. We'll look into this

------
eth
Great tool!

I'm coming at things from a non-coder perspective and found it easy to use,
and easy to export the data I collected into a usable format.

For my own enjoyment, I like to track and analyze Kickstarter project
statistics. Options up until now have been either labor intensive (manually
entering data into spreadsheets) or tech heavy (JSON queries, KickScraper,
etc. pull too much data and my lack of coding expertise prevents me from
paring it down/making it useful quickly and automagically) as Kickstarter
lacks a public API. Sure, it is possible to access their internal API or I
could use KickScraper, but did I mention the thing about how I don't, as many
of you say, "code"?

What I do understand is auto-updating .CSV files, and that's what I can get
from Kimono. Looking forward to continued testing/messing about with Kimono!

------
alternize
looks promising!

to be fully usable for me, there are some features missing:

- it lacks manual editing/correcting possibilities: i've tried to create an
api for
[http://akas.imdb.com/calendar/?region=us](http://akas.imdb.com/calendar/?region=us)
with "date", "movie", "year". unfortunately, it failed to group the date
(title) with the movies (list entries) and instead created two separate,
unrelated collections (one for the dates, one for the movies).

- it lacks the ability to edit an api; the recommended way is to delete and
recreate it.

small bug report: there was a problem saving the api, or at least i was told
saving failed - it nevertheless seems to be stored in my account

~~~
pranade
Thanks for the feedback. We're working on a feature that will allow you to
edit APIs you've created and also edit the selectors and regex (right now, in
advanced mode, you can see them, but cannot edit). We're looking into your bug
now...

------
aqme28
I would seriously consider rethinking that Favicon.

~~~
misuba
Seconded. I can't show that to anyone at work.

~~~
roryokane
To me, the favicon merely looks like a sumo wrestler’s head with a short
ponytail and scowling/serious eyebrows. I can’t tell what NSFW thing you see
it as.

~~~
prolways
Until I read this thread I also saw a sumo or an angry onion, but I believe
the picture is actually a person facing away from us undoing their kimono.

~~~
GBond
You see all that in a 32x32 pixel image?

~~~
Kudos
It's much larger than that
[http://kimonify.kimonolabs.com/favicon.ico](http://kimonify.kimonolabs.com/favicon.ico)

------
lips
I'm experiencing login errors (PEBKAC caveat: password manager, 2x checked,
reset), but the support confirmation page is a nice surprise.

[http://i.imgur.com/w01CoUy.jpg](http://i.imgur.com/w01CoUy.jpg)

------
guptaneil
Nice work, this is much better than I expected! Does it require Chrome? It
doesn't seem to work in Safari for me. Also, does Kimono work for scraping
multiple pages or anything that requires authentication?

~~~
pranade
Great, it should work well on WebKit browsers. What version of Safari are you
using?

~~~
guptaneil
7.0.1, the latest. I also don't have Flash installed, but it doesn't look like
you're using Flash. The entire top bar doesn't show for me. Feel free to email
me and I can send you screenshots.

~~~
pranade
Thanks - we're not using flash, so it must be something else. Will follow up
over email

------
garyjob
I found the one-click action for selecting an entire column of values, as well
as the UI/UX on the top column of the page, to be very impressive. We were
thinking of a nice clean way to represent that particular UI/UX flow in this
browser extension we built as well. Will incorporate that in our next release.

[https://chrome.google.com/webstore/detail/krakeio/ofncgcgajh...](https://chrome.google.com/webstore/detail/krakeio/ofncgcgajhgnbkbmkdhbgkoopfbemhfj)

Would love to meetup and exchange some ideas if you are based in Bay area.

------
jlees
I like how you've thought through the end to end use case: not just generating
an API, but actually making it usable. I've done my fair share of web scraping
and it's not an easy task to make accessible and reliable -- good luck!

It makes me wonder if there isn't a whole "API to web/mobile app with custom
metadata" product in there somewhere. I can imagine a lot of folks starting to
get into data analysis and pipelines having an easier time of it if they could
just create a visual frontend in a few clicks.

~~~
pranade
Yes, we're excited about the possibilities of an end-to-end use case as
well... in fact, we were surprised to find more interest in front-end output
layers on top of the APIs than in the APIs themselves. Would be curious to
know what output features would be most valuable for you.

~~~
jlees
Well. Think about spreadsheets. Think about live spreadsheets powered directly
by APIs. Bingo!

I'm doing a couple of data mining projects right now and simply being able to
query and look into the API outputs, as well as my local database, without
building a custom frontend would've saved me a bunch of time. But I'm thinking
more of the knowledge worker, or even the power user who wants to view their
Fitbit, Up and Lark data all in the same dashboard. Can't help but think
this already exists somewhere though.

~~~
pranade
Love your idea. Would love to follow up on this with you

------
chevreuil
We all know there are a lot of existing tools that do the same things. But
I've not met one with such a polished UX. Kudos to the Kimono team, I'll
definitely recommend your product.

------
ph4
Very nice job. What about scraping data from password-protected pages?

~~~
pranade
Great request... it's on our feature shortlist. Definitely a feature we want
to implement as soon as we can (after we tackle some basics like pagination
and getting images)

------
shekyboy
Like the parameter passthrough feature. Take a look at places where the
parameters are part of the URL structure. For example, a Target product page:
[http://www.target.com/p/men-s-c9-by-champion-impact-athletic...](http://www.target.com/p/men-s-c9-by-champion-impact-athletic-shoe-black/-/A-14656388#prodSlot=medium_1_1)

In order to get data for a different product, I have to modify the URL itself.
I think the same holds true for blog posts.

~~~
pranade
Yes, it's a great point. We're working on updating the query param passthrough
to handle params within the URL structure.

~~~
shekyboy
Apart from that, here are other items. You may have these on your list, but
you can count my vote to prioritize:

1. Pagination

2. Image URLs

3. Focus on page types such as product pages, posts etc. That way it's easy to
go from content to content. Will help crawling too.

4. Link back to the original page included in the JSON

Finally, common sites/pages used by multiple users of your system should not
count against the API count limit under pricing. You may want to charge by
total calls, like Parse.

~~~
pranade
Thanks for the suggestions!

------
ameister14
I really like how you guided me into demoing. Nice job.

------
rafeed
This is awesome. Really nice implementation and so useful for many different
applications. Just signed up and looking forward to trying this out.

------
jfoster
Cool concept. One concern I'd have about this type of tool is that when it
encounters something it can't handle, I'm stuck. Writing your own scraper
means that you can modify it when you need to. I think the ultimate solution
would be something like Kimono with the ability to write snippets of custom
javascript to pull out anything that it can't handle by default.

~~~
pranade
We're in the middle of implementing a more power developer version of the tool
to handle the use cases you're talking about. The beginning of this is
surfaced under the "advanced" tab in the data model view where we show the
selectors and regular expressions that are produced. we want to ultimately let
you edit those to customize the extractor. From there it'd be super cool to
implement the javascript snippet feature you suggested.

------
dmunoz
I'm normally a bit worried when a thread quickly fills up with praise, but
this looks very nice.

It's something I have thought about, as I'm sure many people who have done any
amount of scraping have, but never went forward and tried to implement. The
landing page with video up top and in-line demo is a pretty slick presentation
of the solution you came up with. Good job.

~~~
pranade
Thanks we were pretty surprised as well, but we're really grateful for the
encouragement

------
critium
Please get this off the ground. I would also suggest a possible separate
business: website regression testing.

Selenium is WAAAY too painful.

~~~
pranade
Thanks for the suggestion... we're working hard to get auth up and running!

------
ThomPete
Thank you for building a tool I've been wanting, so I don't have to build it
myself!

Can't wait to play around with this tonight.

Suggestion: allow one to select images.

~~~
pranade
Great add. It's on our shortlist - a popular request!

~~~
ThomPete
Also, the ability to style it myself would be nice :)

------
ForHackernews
This looks really slick. What happens if a website you're scraping changes its
design? Do you respect robots.txt?

~~~
pranade
If it changes the format significantly, the scraper will break, so for now
you'll have to use the tool to rebuild. You will see on your API status page
that it's down. As for robots.txt, we do respect it... for now we're leaving
that to the user, but we're trying to implement a proactive way of checking
for disallows and stopping those scrapers from being built.

~~~
thatthatis
Please clarify: are you saying that right now you leave respecting robots.txt
to the user?

~~~
pranade
At the moment, we rely on users to be responsible. We spell it out in the
terms and FAQ. We've been in private beta, keeping usage very limited until
today. We fully understand the seriousness of the issue as we scale. We're
committed to becoming a responsible bot that respects robots.txt

~~~
loceng
I would say how you're scraping differs from how, say, Google, a search
engine, scrapes. I'm not sure there is a way in robots.txt to define rules for
each use? Knowing the data in a structured way but then allowing it to be
displayed in full off-site is quite different from using the scraped data to
link into a website.

~~~
thatthatis
But robots.txt provides minimums: don't scrape this page, don't refresh more
than once every x, these crawlers are allowed this access, etc.
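
Concretely, a robots.txt already has vocabulary for all of those minimums,
e.g. (Crawl-delay being a de facto extension rather than part of the original
standard):

    User-agent: somebot
    Disallow: /private/
    Crawl-delay: 10

    User-agent: *
    Disallow: /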

------
blazespin
There's a huge business here if you keep at it. I'll throw money at the screen
if you can make this work.

------
rpedela
Definitely awesome presentation and product.

The example doesn't seem to work right on Firefox. On Chrome, if I click
"Character" in the table then it highlights the whole column and asks if I
want to add the data in the column. On Firefox, clicking "Character" just
highlights "Character" and that is it.

Ubuntu 12.04

Firefox 25.0.1

~~~
pranade
Thanks for flagging. We'll get on this to figure out what's going on when
running on ubuntu

------
tlrobinson
I built something very similar last year, but sadly never got around to
polishing and launching it: [http://exfiltrate.org/](http://exfiltrate.org/)

(There's a prototype of an API generator hidden in a menu somewhere but it's
nowhere near production ready)

~~~
pranade
Yeah, we've been working on this for a while too... took a while to polish it
a bit before we could put it out there. Will check out exfiltrate.org - looks
cool!

------
BinaryBird
Nice tool, slick UI. It worked for some pages and not for others. Currently
I'm using Feedity: [http://feedity.com](http://feedity.com) for all business-
centric data extraction and it has been working great (although not as
flexible as kimono).

------
jval
Great job guys.

One problem I've had though is that I think you guys are hosted on AWS - a lot
of websites block incoming connections from AWS.

Are there plans to add an option in future to route through clean IPs? Premium
or default, this would be cool and make it a lot more useful.

------
lucasnemeth
Nice job! I really liked it, it's a fantastic idea! And your UX is great! Just
one thing I've found when testing: I've had some problems with non-ASCII
characters when visiting Brazilian websites such as this: www.folha.com.br.

------
twog
Well done on the product & solving a clear need! This is extremely useful for
hackathons/prototyping. I also loved the live demo in the blog post and you
did a wonderful job with the design/layout/colorscheme of the site.

------
jjcm
Very cool, and I like that the link is your announcement page running inside
of the demo. Really drives home the idea.

That said, it looks like it can't do media right now. I would love it if it
could at least give me a url for images/other media.

~~~
pranade
It's a great suggestion, thanks! ... image extraction would be cool, and it's
on our shortlist of features to build next

------
cbaleanu
Does it do logging in to websites then fetching? Do you plan to add scripting
to it?

~~~
pranade
We don't support logging in yet, but it's a feature we're working on adding.
Scripting will also be cool, but it's right now further down our feature queue

------
dikei
I don't think this can beat the speed of a hand-tuned crawler. When I write
crawlers, I skip page rendering and JavaScript execution when they aren't
needed, which massively speeds up the crawling process.

~~~
pranade
There are definitely things that a custom-built scraper can do more
efficiently than kimono, but our focus right now is making scraping accessible
across a broad enough range of web sites.

------
pranade
Thanks guys, glad you like it. Welcome any feedback so we can make it better!

------
ewebbuddy
Really cool idea and tool. Still need to test this out properly. Is it
possible to scrape not just one page but a stack of them? For example, a
product catalog of 1000 SKUs extending up to 50 pages.

~~~
pranade
We don't support that quite yet. It's our #1 feature request though, and we're
working to get it ready soon

------
shamsulbuddy
Is such web scraping legally allowed? Since it is not done directly from our
servers, if any legal action is taken by the scraped website, will it fall on
kimonolabs or on the user?

------
catshirt
really excited to see this. i've had the idea (and nearly this execution) in
mind for years but no use or ambition to get it done.

given the pricing though i'm almost motivated to make my own. as a hosted
service the fees make sense with the offerings. but not only would i rather
host my own, it would be cheaper all around. would you consider adding a free
or cheap self-hosted option?

aside: i think there is a mislabel on the pricing page. i'm guessing the free
plan should not have 3 times as many "apis" as the lite plan.

~~~
pranade
Yes, we're in beta right now, while we're still working out the bugs. For
beta, it's free for 30 APIs.

~~~
catshirt
thanks for the response! any chance at a self hosted option? i'd even still
pay (once) for a self hosted version.

------
jmcgough
Really sleek interface, and looks like it could be extremely useful (I just
spent a few hours cranking out Nokogiri this morning).

Oh, typo: "Notice that toolbar at the toop of the screen?"

~~~
pranade
Awesome, thanks for the kind words. And for catching that typo, will fix that
now

~~~
lstamour
I thought it was intentional. Swedish chef style. I like it, but I'll need to
go back and re-read to understand how I can use this on other pages than the
homepage/demo page. I've nothing immediate to try it with right now. :)

Edit: I'll watch the video after work, probably will clear everything up for
me.

------
tchadwick
This looks really useful, and I'm trying to figure out if I could use it on a
project I'm working on, but I'm hitting an issue. I sent a support message.
Nice job!

~~~
pranade
Thanks, the support tickets really help us debug. We'll look into it and get
back to you

------
kenrikm
Looks awesome, however I keep getting errors and 404s. Could this be an issue
on my end (seems to be working for others) or just HN making the servers beg
for mercy?

~~~
pranade
Where are you getting the 404s? We will check into it now

------
paul1664
Reminds me of Dapper

[http://open.dapper.net/](http://open.dapper.net/)

This allowed you to do something similar, before being consumed by Yahoo.
Might be worth a look.

------
keyurfaldu
Awesome! Hats off.. How about extracting the hashtag/GID of any record, if
applicable? These are typically not rendered on the page, but hidden under the
hood.

------
cullenmacdonald
the reason i ever have to write a scraper is because of pagination. while this
looks awesome, i'll have to stick to scraping until that is solved. :(

~~~
pranade
It's probably our #1 feature request at the moment. We're working on it and
hope to have it ready for you to try soon

------
rmason
I thought to myself, oh boy, yet another web scraper as a service, but I was
surprised. I haven't been this impressed with a product video since Dropbox.

------
xux
Wow looks amazing. I tried doing some queries on public directories, and it
even supports parameter passing. Will be using this for some side projects.

------
bluejellybean
How (if at all) does this run on javascript heavy sites?

~~~
pranade
JS-heavy sites can be tricky. We position it so it should execute after most
of the on-page JS, so it handles a lot of cases. There are still sites that
break it though... we're trying to tackle those one by one right now, as we
try to generalize a broader solution.

------
mhluongo
Any chance you guys plan to add link hrefs to CSVs? I'd love to use this now,
but I need the href for backlinks and future inference.

~~~
pranade
Thanks for the suggestion, we're adding to our list

------
phillmv
The UX is great and journalists everywhere will thank you.

But outside of government websites I don't see how a lot of this is even
legal, per se?

~~~
pranade
Thanks... yes, public data from governments is a great use case. Often apps
built using scrapers wind up driving traffic/sales to the source site, so it's
okay. We want to do responsible web scraping, so we will respect webmasters'
robots.txt files to make sure it's legal.

~~~
hartard
I love the execution, but I also see inherent problems.

Robots.txt is just a convention to advise crawlers. I'm confident most sites
explicitly state this is against their terms of service.

You will encounter terms along the lines of:

 _" Unauthorized uses of the Site also include, without limitation, those
listed below. You agree not to do any of the following, unless otherwise
previously authorized by us in writing: Use any robot, spider, scraper, other
automatic device, or manual process to monitor, copy, or keep a database copy
of the content or any portion of the Site."_

~~~
pranade
You've got a valid point. We want to eventually create a space that allows
responsible scraping - so webmasters can have access to analytics on what's
being scraped and can explicitly turn off kimono APIs for their domains if
they see fit. We also think there are use cases for people who own their own
data. Often, APIs will provide a way for companies to streamline their
internal app development and figure out what to expose to the developer
community before investing in an expensive API deployment.

------
cycnusx
One of my favorites:
[http://htmlagilitypack.codeplex.com](http://htmlagilitypack.codeplex.com)

------
diegolo
It would be nice to also have a view of the raw HTML code, e.g. to create a
field containing the URL of an image on the page.

~~~
pranade
Thanks, for the suggestion. We're rolling out advanced mode soon, which will
allow you to edit the CSS selectors and RegEx operating on the page's HTML to
define the selected data elements

------
PhilipA
It looks cool, but very expensive compared to Visual Web Ripper, which you pay
way less for (but have to host yourself).

------
dmritard96
As someone building a home-grown proprietary scraping engine: consider
alternative locations of elements. Most sites are using templating engines, so
it's fairly reliable to find things in the same place, but more often than you
might expect, things move around ever so slightly. Navigation is a fun one
also. ;)

------
thatthatis
This is my third time trying to get an answer to this question: does your
crawler automatically respect robots.txt?

~~~
pranade
So sorry for missing this earlier. See our response in comments below: "At the
moment, we rely on users to be responsible. We spell it out in the terms and
FAQ. We've been in private beta, keeping usage very limited until today. We
fully understand the seriousness of the issue as we scale. We're committed to
becoming a responsible bot that respects robots.txt"

------
timov
You can use the utility without registration or login by blocking the login
prompt with, for example, AdBlock.

------
toddwahnish
This is fantastic. Congrats on launching it! Once it has pagination & auth
I'll be all over this :)

------
iurisilvio
What about some navigation tools there?

Looks pretty good, but it does not really replace my scrapers. Maybe some of
them...

~~~
trip41
re: nav tools -- you mean the ability to crawl multiple pages?

------
bigd
Seems it can't see the stuff inside angular views.. well, at least mine..

But for the rest, awesome product. Thanks.

~~~
pranade
Yes, you're right... we can't handle angular quite yet. We're working on this.

------
yummybear
Looks very nice. There seems to be an issue with international characters
though (æ/ø/å).

~~~
pranade
Yes, thanks for spotting. We'd discovered that Chinese, Japanese and Korean
failed but didn't know about these characters. Thanks!

------
kyriakos
It appears that it doesn't work with websites containing international
characters.

~~~
pranade
Yes, thanks for noting. This is a bug and we're working to get these
characters supported as soon as we can

------
szidev
great idea. i'll have to keep this in mind for future projects.

------
wprl
It's easy not to write web scrapers even without this tool ;)

------
narzero
I like the concept. Would love to see page authentication

------
joshmlewis
Is there an ability to scrape more than one page of data?

~~~
pranade
Not yet, but it's our #1 feature request, so we're working on it now. For now,
you can make multiple APIs (one for each URL). If the URL takes query
parameters though, you can re-use the same API and programmatically cycle
through query parameters.
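
A sketch of what that cycling could look like (the endpoint, key and parameter
names here are all invented placeholders; the real URL would come from your
API's detail page):

    # Sketch: re-using one API across many query parameter values.
    # Endpoint, "apikey" and "p" are invented placeholders.
    import requests

    API = "https://api.example.com/my-scraper"

    for page in range(1, 51):
        r = requests.get(API, params={"apikey": "YOUR_KEY", "p": page})
        for item in r.json().get("results", []):
            print(item)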

------
NicoJuicy
This is really slick! Btw, who made your intro video?

~~~
pranade
It's pretty homegrown right now, we did the intro video ourselves on our
laptops

------
dome82
I like the concept, and it looks similar to Import.io

~~~
pranade
Yes, in concept quite similar. We wanted to make something that you can use
from within your browser as part of your natural workflow, without installing
any other software. We also really wanted to figure out the right data
association intelligently based on user selections vs. asking users to think
through a data model up front.

------
abvdasker
I kind-of enjoy writing web scrapers.

------
mswen
How does this compare with Mozenda?

------
aaronsnoswell
Man that demo is impressive!

~~~
dclara
Agreed, the demo is awesome. But I don't think scraping any web page is that
simple. Lots of exceptional cases.

------
taternuts
That looks quite swift

------
byteface
use any chrome xpath plugin and give that to YQL
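
e.g. with YQL's html table it's roughly (going from memory, so the syntax may
be slightly off; the xpath is just an example):

    select * from html
    where url="https://news.ycombinator.com/"
    and xpath="//td[@class='title']/a"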

------
harryovers
so what do you do that import.io doesn't?

------
iamkoby
i love this! and amazing video!

------
rismay
OMFG.

------
tonystark
neat.

------
nnnn
"Never write a web scraper again"... yea right.. sick and tired of such
gimmicks and self promotion on the net today.

------
pyed
actually I love scraping :(

