Show HN: Kimono – Never write a web scraper again (kimonolabs.com)
717 points by pranade on Jan 15, 2014 | 230 comments



The presentation is beautiful and the website is great, but the tech broke so I have no idea how or if this even works. This is a wonderful concept and one I've talked about doing with others. I was really excited to try this. I watched the demo video and it seemed straightforward.

I went to try and use it on the demo page it provides, going through and adding things, but when I went to save it, I just received an error that something went wrong. Well, crap. That was a waste of time. Oh well, maybe it's just me.

Alright, I'll give it another shot using the website they used in the demo. Opened up a Hacker News discussion page and started to give it a try. Immediately it was far less intelligent than the demo. Clicking on a title proceeded to select basically every link on the page. Somehow I clicked on some empty spots as well. Nothing was being intelligently selected like it was in the demo. Fine, that wasn't working tremendously well, but I wanted to at least see the final result.

Same thing: just got an error that something went wrong and it couldn't save my work.

Disappointing. I still might try it again when it works 'cause it's a great idea if they really pulled it off. So far: doesn't seem to be the case.


Sorry you had a bad first experience. We've tested this on a lot of sites, and it works stably across a lot of different cases, but haven't solved it everywhere. Thanks for letting us know about the discussion page. We'll look into the bugs right now


Having written a whole lot of crawling code throughout the years, I can totally understand how monumental of a task this is. This really does look like a cool product. Glad you're actively hunting down ways to improve the demo before asking for money :)

Feel free to drop me a line if you want any specifics about the troubles I had.


hacker news has probably been scraped too many times. that's why it's not working. it didn't work for me, but other sites worked easily.

...i will comment that the user experience for first adding elements is not what i expected. once u understand it's great overall, but i'd have some popup tooltip type guide thing going on for the first time a user uses it explaining that you should choose one element first, and then check a matching element. ...Then explain the number bubbles next to the property field. You don't get what those numbers are at first. ..Then for the property field, somehow make it more obvious that you should fill it out.

Basically you have the interface setup so any action can be done at any time, but it should be presented as a forced sequential set of steps to the end user, at least at first.


Thanks for the suggestions on tool tips... We can also expand the tutorials section to make this more clear.


HN pages are possibly the worst case, very hard to infer structure from due to its 1998 coding standards. You'll have a better chance with an alternative interface like http://ihackernews.com/ or http://hckrnews.com (no comments though).


I really wish that HN wasn't even in the running for a "worst case". For a community that seems to be all about UX and innovation, shouldn't it run on at least a marginally user-friendly piece of software with this-century markup?

I get that it has a sort of kitschy or retro appeal, but it's just basically a pain to use and looks terrible.

I can't tell you how often I click next page to find that I've taken too long and my session or whatever has expired.


No. This site isn't about UX or innovation, it's about tech start ups. It's a constant reminder that something can be successful even if it was written in a LISP dialect and has a bunch of UX misses as long as the core idea is valid and the product usable enough.

I've used a bunch of HN skins that were supposedly better designed, but none of them stuck. Apparently it's just plain unnecessary for HN to be better.


Couldn't agree more. E.g., when Digg got popular and redesigned their site everyone switched to the bare-bones Reddit.


So because it's written in LISP it gets a free pass in other areas? That seems like a dubious claim.


No, despite it being written in an obscure language and sparse UI it has massive traction anyway. There's a lesson in there somewhere.


> There's a lesson in there somewhere

The website that lets technologists hang out with people who might give them large sums of money to see their ideas succeed doesn't have to be good, or pretty.

If the website were better, or prettier, that would not add any additional value to the previously mentioned large sums of money.

(Oh, and advice, and being able to chat with industry leaders and experts on diverse arrays of topics, etc.)


That's not what he said.


Indubitably.


I like HN because it gives me exactly what I want. A list of interesting links, ranked by what people that I trust more than news sites think of them. Oh and it offers comment sections for each of them. HN is wonderful link discovery during compile time or if you need a break.


I think we all like HN, otherwise we wouldn't be here. That doesn't mean it can't improve.


If "improve" would involve turning it into AJAX-heavy app with images, CSS effects and some weird-ass scrolling interceptors, that requires me to load 200+ files for seeing the main page, I would rather live without any improvements.

I think better terms might be "clean up" or "restructure".


And why do you assume that is what I meant by "improve"?


A minimalist interface is fine but parts of HN are simply broken or suboptimal.


A lot of the time when I'm writing a scraper it's because of bad/old code or incorrect HTML. So if Kimono has issues with that its utility is reduced.


Kimono can handle several pages with malformed and old/bad HTML. We're still in beta though, so we're handling more edge cases as we encounter them. Should work on HN main page.


I developed some hot strategies for the next level ...

http://edinburghhacklab.com/2013/09/probabalistic-scraping-o...


Have you thought of implementing a system for custom handling of edge cases? Realistically you can't handle all edge cases, so leaving the last 2% up to the user (which is a developer) would be a good idea. It'd still save him 98% of the work, but also give him the comfort of knowing it won't break even if site X is doing something weird with their HTML in the future.

I'll leave it to you to figure out how one would implement such a feature ;-)
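One way such an escape hatch could look, as a rough sketch only: generic rules handle the common case, and a user-supplied override runs for a field when the generic rule finds nothing. The names `regex_rule`, `scrape` and `overrides` are made up for illustration and are not part of Kimono.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical design: an extractor takes raw HTML and returns a value or None.
Extractor = Callable[[str], Optional[str]]

def regex_rule(pattern: str) -> Extractor:
    """Build a generic extractor that returns the first capture group."""
    compiled = re.compile(pattern, re.S)

    def extract(html: str) -> Optional[str]:
        match = compiled.search(html)
        return match.group(1).strip() if match else None

    return extract

def scrape(html: str,
           rules: Dict[str, Extractor],
           overrides: Optional[Dict[str, Extractor]] = None) -> Dict[str, Optional[str]]:
    """Run the generic rules; fall back to a user override when a rule fails."""
    overrides = overrides or {}
    result = {}
    for field, rule in rules.items():
        value = rule(html)
        if value is None and field in overrides:
            value = overrides[field](html)  # the developer's "last 2%"
        result[field] = value
    return result

# The generic price rule expects a quoted class attribute; this page omits the quotes.
PAGE = '<h1>Widget</h1><span class=price>USD 9.99</span>'
RULES = {
    "title": regex_rule(r"<h1>(.*?)</h1>"),
    "price": regex_rule(r'<span class="price">(.*?)</span>'),
}

def price_override(html: str) -> Optional[str]:
    """Site-specific fix written by the user for the unquoted attribute."""
    marker = "class=price>"
    return html.split(marker)[1].split("<")[0] if marker in html else None

print(scrape(PAGE, RULES, {"price": price_override}))
```

The override only runs when the generic rule returns nothing, so it stays out of the way on pages that already work.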


Sounds good. I've made a note of it for the next time I need to screen scrape something.


OK so you're saying that instead of using a scraper to deal with malformed data out there (the whole reason for its existence), we should use a format that is better suited for machine representation? That's like saying 'yeah I've got this car here to take you to places that are very far away, too far to walk; except it doesn't work very well if you want to go far away, so you're better off just staying at home. Or walk a long time.'.


Scrapers exist to take data out of HTML, malformed or not, when an API or feed is not available. Yes, ideally HTML is written in a semantic, annotated machine-parseable way; that also enables smarter search engines, better accessibility, interoperability and so on. That's one of the main reasons behind the changes to the standard made in HTML5.

A better example would be "hey I have this cool device that will improve your mileage by 100%, but it only works on cars built according to current emission standards...".


Kimono can handle many pages with malformed/old HTML. Of course, it's still beta and there are still pages that break it, so we're improving it as we go with the help of early adopters like everyone on this thread. Of course, our goal is an ideal state where it works everywhere perfectly :)


This is awkward, but I disagree.

HN pages are very well structured and are probably an ideal case for an automatic scraper.

If your automatic scraper doesn't work on HN, it is unlikely to work generally.


I'd love to argue that PG probably thinks this is a good thing. (no scraping). And for good reasons. Who is paying the bandwidth bills in the end, and for whom, you might ask. And then if those users will end up contributing to HN in any intellectual way in the end, or are just scraping content to spin in their auto-blogs filled with ads.


hn pages are not too hard, each story is a <tr>, title is in 'tr .title' and link is in 'tr .title a'.

There are some irregularities (i.e. YC announcements without score, more button, self posts) but it's not really structurally complicated (compare reddit, which has 3 "score" fields per link)
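To make the 'tr .title a' idea concrete, here is a small sketch using only Python's standard library. The sample markup is an assumption modelled on the description above, not a copy of a real HN page, and the parser assumes reasonably well-formed nesting inside the title cells.

```python
from html.parser import HTMLParser

class TitleLinkParser(HTMLParser):
    """Collect (text, href) for every <a> nested inside an element with class "title"."""

    def __init__(self):
        super().__init__()
        self.stories = []       # collected (title, href) pairs
        self._title_depth = 0   # > 0 while inside a class="title" element
        self._href = None       # href of the link currently being read
        self._text = []         # text fragments of that link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._title_depth:
            self._title_depth += 1
            if tag == "a" and self._href is None:
                self._href = attrs.get("href")
        elif "title" in (attrs.get("class") or "").split():
            self._title_depth = 1

    def handle_endtag(self, tag):
        if not self._title_depth:
            return
        if tag == "a" and self._href is not None:
            self.stories.append(("".join(self._text).strip(), self._href))
            self._href, self._text = None, []
        self._title_depth -= 1

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

# Assumed sample markup: rank cell and title cell share class "title".
SAMPLE = """
<table>
<tr><td class="title">1.</td><td class="title"><a href="http://example.com/a">First story</a></td></tr>
<tr><td class="subtext">10 points</td></tr>
<tr><td class="title">2.</td><td class="title"><a href="http://example.com/b">Second story</a></td></tr>
</table>
"""

parser = TitleLinkParser()
parser.feed(SAMPLE)
print(parser.stories)
```

The irregular rows mentioned above (announcements without a score, the "more" link) would need extra checks that this sketch leaves out.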


In my case, worked for some pages and not for others. Currently I'm using Feedity: http://feedity.com for all business-centric data extraction and it has been working great (although not as flexible as kimono).


It worked great for me (http://www.kimonolabs.com/kimonoapp/aws-status-check). It would be helpful if you shared what you tried to do when it failed. I don't work for or with Kimono, but am curious what does and does not work.


You must check http://webscrapemaster.com/ Doing one thing well.


The Simile group at MIT did something similar back around 2006. Automatic identification of collections in web pages (repeated structures), detection of fields by doing tree comparisons between the repeated structures, and fetching of subsequent pages.

The software is abandoned, but their algorithms are described in a paper:

    http://people.csail.mit.edu/dfhuynh/research/papers/uist2006-augmenting-web-sites.pdf
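A much-simplified illustration of the "repeated structures" idea, not the algorithm from the paper: give every element a structural signature (its tag plus the signatures of its children) and look for the parent whose children share one signature most often. It assumes well-formed markup without void tags such as `<br>`.

```python
from collections import Counter
from html.parser import HTMLParser

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []

    def shape(self):
        """Structural signature: tag name plus the shapes of all children."""
        return self.tag + "(" + ",".join(c.shape() for c in self.children) + ")"

class TreeBuilder(HTMLParser):
    """Build a bare tag tree; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self._current)
        self._current.children.append(node)
        self._current = node

    def handle_endtag(self, tag):
        if self._current.parent is not None:
            self._current = self._current.parent

def find_collection(html):
    """Return (count, parent_tag, child_shape) for the most-repeated sibling shape."""
    builder = TreeBuilder()
    builder.feed(html)
    best = (0, None, None)
    stack = [builder.root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        counts = Counter(child.shape() for child in node.children)
        if counts:
            shape, n = counts.most_common(1)[0]
            if n > best[0]:
                best = (n, node.tag, shape)
    return best

PAGE = ("<div><h1>t</h1><ul>"
        "<li><a>x</a><span>1</span></li>"
        "<li><a>y</a><span>2</span></li>"
        "<li><a>z</a><span>3</span></li>"
        "</ul></div>")
print(find_collection(PAGE))
```

Once the repeated items are found, comparing them child by child is what would turn "three similar `<li>` elements" into named fields.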


Oh, hey, memories. I worked one summer with David Huynh (who you're linking to there) and David Karger (his thesis advisor) on one of the Simile projects.

I vaguely remember playing around with this tool you mentioned. I thiiiiink it was this one[0], although it seems to be superseded by this one[1] now.

[0] http://simile.mit.edu/wiki/Piggy_Bank
[1] http://simile.mit.edu/wiki/Sifter


Just had to chime in and say that David Huynh and his fellow programmers will be forever heroes to me and a small group of data journalists who depended on Gridworks/Google Refine/OpenRefine


Thanks a ton for sharing... the association algorithms have been where we've been spending a good chunk of time. Will read through this


If you're interested in hosted solutions that try to do automatic identification of pages, diffbot is worth a look. We've had some good experiences: http://diffbot.com/


Show me it working with authentication and you will have a customer. Scraping is always something you need to write because the shit you want to get is only shown when you are logged in.


Yes, it's one of the most popular feature requests. We don't support auth yet, but it's on our shortlist and we hope to have it ready soon.


how are you going to do it without having to know the actual authentication key(s)? if i don't trust anyone enough to give my auth away, and so unless the site being scraped has some sort of oauth support, how are you going to get any data?

of course, if this was an offline product, or self-hosted product, then it would solve that problem of auth instantly.


Would there be any way to fake the beginning of an OAuth session with Facebook, Google or any other OAuth authenticated site? Kind of like replaying cookies to hijack sessions?


The route of proxying the web page presents much difficulty in doing actual authentication on Facebook or Google's website via the proxied webpage without first rewriting most of the javascript and hijacking their Ajax calls on the fly.

The approach I took was to hijack the cookies from the browser, via the browser extension, once the user has signed in on e.g. Facebook.

The route of proxying the website does in fact do away with the need to install any external 3rd-party libraries.

This browser extension I built, coupled with the web service it's integrated with, does allow for scraping of logged-in pages from Facebook, Google and LinkedIn as well.

https://chrome.google.com/webstore/detail/krakeio/ofncgcgajh...


Hah, I've been working on this recently with Facebook, on a TV set-top-box. It was painful and I ended up giving up. xd_arbiter.php is the key, I think.


I hope that the lite plan will feature auth handling, I can't imagine the service being useful in most cases without it.


We're working on auth... it's the most requested feature at the moment. And we're still in beta at the moment, so all usage is free


Glad to hear it. I was just saying that I think auth should be included as a basic feature in all paid for plans.


I'm wondering how you will be able to deal with the numerous CSRF protection implementations.


Creator of Automately here, our service could definitely be something you might be interested in. While we aren't directly in the business of web scraping, we do have a powerful automation service that can accomplish those needs using simple javascript and our powerful scalable automation API.

We are accepting early access requests right now. Check us out! http://automate.ly/


I've written more web scraping code than I care to admit. A lot of the apps that ran on chumby devices used scraping to get their data (usually(!) with the consent of the website being scraped) since the device wasn't capable of rendering html (it eventually did get a port of Qt/WebKit, but that was right before it died and it wasn't well integrated with the rest of the chumby app ecosystem).

This service looks great, good work! But since you seem to host the APIs created how do you plan to get around the centralized access issues? Like on the chumby we had to do a lot of web scraping on the device itself (even though doing string processing operations needed for scraping required a lot of hoop jumping optimization to run well in ActionScript 2 on a slow ARMv5 chip with 64mb total RAM) to avoid all the requests coming from the same set of chumby-server IP addresses, because companies tend to notice lots of requests coming from the same server block really quick and will often rate limit the hell out of you, which could result in a situation where one heavy-usage scraper destroys access for every other client trying to scrape from that same source.


Access, legality and rate limiting issues come up a lot. We're working on a couple of things to address them. The first is an intelligent job distribution system that consolidates scrapes across users and hits sites (and pages) at human-like intervals. The second is to create a portal for webmasters that allows them special privileged access to analytics on data being extracted from their sites, and the ability to "turn on or off" kimono APIs if they see fit. This way, via kimono, a webmaster at chumby could "provision" certain kimono users. We're yet to see whether the latter works out. Thanks for the input
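A rough sketch of what "consolidating scrapes across users" and "human-like intervals" could mean in practice. This is an illustration under assumptions, not Kimono's scheduler: `plan_fetches`, `min_gap` and `jitter` are invented names, and the timings are arbitrary.

```python
import random
from collections import OrderedDict
from urllib.parse import urlsplit

def plan_fetches(requests, min_gap=5.0, jitter=3.0, seed=None):
    """Plan one fetch per distinct URL, spaced out per host.

    requests: iterable of (user, url) pairs.
    Returns (schedule, subscribers): schedule is a list of (start_time, url),
    subscribers maps each URL to the users who will share its result.
    """
    rng = random.Random(seed)
    subscribers = OrderedDict()
    for user, url in requests:
        subscribers.setdefault(url, []).append(user)  # dedupe identical scrapes

    next_free = {}  # host -> earliest time the next fetch may start
    schedule = []
    for url in subscribers:
        host = urlsplit(url).netloc
        start = next_free.get(host, 0.0)
        schedule.append((start, url))
        # A fixed gap plus random jitter avoids a machine-like, regular cadence.
        next_free[host] = start + min_gap + rng.uniform(0, jitter)
    return schedule, dict(subscribers)

REQUESTS = [
    ("alice", "http://a.example/p1"),
    ("bob", "http://a.example/p1"),    # same page as alice: fetched once
    ("carol", "http://a.example/p2"),
    ("dave", "http://b.example/x"),    # different host: no need to wait
]
schedule, subscribers = plan_fetches(REQUESTS, seed=1)
for start, url in schedule:
    print(f"{start:6.2f}s  {url}  -> {subscribers[url]}")
```

Spacing is tracked per host, so one heavily scraped site slows down only its own queue rather than every job in the system.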


Use a user-agent containing a URL to find out who and what you are, and honor my robots.txt.

Having a panel for webmasters along with that would be fine.
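For anyone wondering what that looks like in code, here is a minimal sketch with Python's standard `urllib.robotparser`. The bot name and info URL are placeholders, and the robots.txt text is passed in directly so the example runs without touching the network.

```python
import urllib.request
from urllib.robotparser import RobotFileParser

# Placeholder identity: the URL should point to a page explaining who runs the bot.
USER_AGENT = "ExampleScraper/0.1 (+http://example.com/bot-info)"

def allowed(robots_txt: str, url: str, agent: str = USER_AGENT) -> bool:
    """Check a URL against the site's robots.txt rules before fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

def build_request(url: str) -> urllib.request.Request:
    """Prepare (but do not send) a request that identifies the scraper."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

ROBOTS = """\
User-agent: *
Disallow: /private/
"""

for candidate in ("http://example.com/public/page", "http://example.com/private/page"):
    verdict = "fetch" if allowed(ROBOTS, candidate) else "skip"
    print(verdict, candidate)
```

In a real crawler the robots.txt body would be downloaded once per host and cached; `RobotFileParser.set_url` and `read` cover that case.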


Great suggestion... thanks for this one. We're putting this on our list


Please tell me that the robots.txt suggestion is something that you're already doing and the user agent part is what's going on the list.


You could try doing IP address rotation by rotating EC2 instances or some other cloud services.

I wrote a library for that.

https://github.com/KrakeIO/resque-my-aws


That has nothing to do with what's being discussed on this thread.


I'm curious how you plan to avoid/circumvent the inevitable hard IP ban that the largest (and most sought after targets) will place on you and your services once you begin to take off?

I could have really used a service like this just yesterday actually, I ended up fiddling around with iMacros and got about 80% of what I was trying to achieve.


It's a great question. What we're really trying to do is make data accessible programmatically and at scale. We want to connect data providers and data consumers with APIs in a way that's mutually beneficial vs. being a tool for data theft. Our hope is to (once we scale) actually work with data providers directly on the distribution of their data so the IP ban becomes a non-issue.


But isn't the point kinda to let the users come up with data providers themselves? If you say "Only these 500 data providers are available for scraping", you don't have a business. If you don't have such a limitation, you'll not be able to work directly with all data providers. You'll have IP problems.


This is excellent. Even if it doesn't work for scraping all sites, it simplifies the average use case so much that it's not even funny.

Feature proposal: deal with pagination.


Another feature, simple one: Allow to add some filters to the data stream. For example: only posts that contain word "bitcoin" in the name or only those with 50 upvotes or more.
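The kind of filter meant here is tiny once the data is already structured. A sketch, assuming each scraped row is a dict with `title` and `points` keys (those field names are an assumption for the example):

```python
def make_filter(keyword=None, min_points=None):
    """Build a predicate over scraped rows.

    keyword: keep only rows whose title contains it (case-insensitive).
    min_points: keep only rows with at least this many upvotes.
    """
    def keep(row):
        if keyword is not None and keyword.lower() not in row["title"].lower():
            return False
        if min_points is not None and row["points"] < min_points:
            return False
        return True
    return keep

ROWS = [
    {"title": "Bitcoin hits new high", "points": 120},
    {"title": "Show HN: my weekend project", "points": 75},
    {"title": "Why bitcoin mining is hard", "points": 12},
]

bitcoin_only = list(filter(make_filter(keyword="bitcoin"), ROWS))
popular_only = list(filter(make_filter(min_points=50), ROWS))
print([r["title"] for r in bitcoin_only])
print([r["title"] for r in popular_only])
```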


Thanks for the suggestion... adding to the list :)


Make sure to include regex matching =)


Thanks. We support regex matching now. Try dragging to select text; if there's a relevant regex pattern, kimono will find it (there's an example in the blog post). You can also preview (and soon, you'll be able to edit) the CSS and regex in advanced mode.
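How the pattern is actually chosen isn't described in this thread. As a guess at the general idea only: take the text the user dragged over, test it against a few candidate patterns, and reuse the first one that matches the whole selection on the other rows. The candidate list and function names below are invented for the example.

```python
import re

# Hypothetical candidate patterns, tried in order.
CANDIDATES = {
    "integer": r"\d+",
    "price": r"\$\d+(?:\.\d{2})?",
    "date": r"\d{4}-\d{2}-\d{2}",
}

def infer_pattern(selection):
    """Return (name, pattern) for the first candidate matching the whole selection."""
    for name, pattern in CANDIDATES.items():
        if re.fullmatch(pattern, selection):
            return name, pattern
    return None

def apply_pattern(pattern, texts):
    """Pull the first match of the pattern out of each text, or None if absent."""
    results = []
    for text in texts:
        match = re.search(pattern, text)
        results.append(match.group(0) if match else None)
    return results

# The user drags over "$19.99" in one row; the pattern is reused for the others.
name, pattern = infer_pattern("$19.99")
print(name, apply_pattern(pattern, ["Now $19.99 only", "Free", "Was $5"]))
```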


Too bad I couldn't edit selectors and regex-es at this step. I could implement the filters I needed myself manually like this.


Besides pagination, you might want to handle nested pages as well as deep linking too.

https://krake.io/docs/define-data#next_page_object


thanks. glad you like it. pagination, dynamic tabs (and crawling in general) is a big feature we really want to add soon. a lot of people are asking for it. the challenge will be integrating it with the current UEX which we're trying to keep super simple.


+1 for paging. really important


Constructive Tone: I figured that it might be nifty to scrape cedar pollen count information from a calendar and then shoot myself an email when it was higher than 100 gr/m3.

This would be a pretty difficult thing to grab when scraping normally, but the app errors before loading the content:

https://www.keepandshare.com/calendar/show_month.php?i=19409...

JS error: An error occurred while accessing the server, please try again. Error Reference: 6864046a


Thanks for letting us know. Just tried and am getting the same error. The page is loading content dynamically from another source... We'll look into this and see if we can get it working on this page


Do you support POSTs for fetching dynamic data? I found where it's pulling from, here's the curl command:

curl "https://www.keepandshare.com/calendar/fns_asynch_api.php?r=0... --data "action=getrange&i=1940971&from=2013-12-26&to=2014-02-06"
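For reference, the same kind of form-encoded POST can be built with Python's standard library. The endpoint below is a placeholder (the real URL is truncated above); the form fields are the ones from the curl command. The request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_post(url, fields):
    """Prepare a form-encoded POST request without sending it."""
    body = urlencode(fields).encode("ascii")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )

# Placeholder endpoint; the fields mirror the --data string above.
req = build_post(
    "https://example.com/calendar/endpoint",
    {"action": "getrange", "i": "1940971", "from": "2013-12-26", "to": "2014-02-06"},
)
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
# Sending it would be: urllib.request.urlopen(req).read()
```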


No, we don't have POST support quite yet. We're working on a solution.


Great work so far. The tool was very intuitive and easy to use.

My suggestion: once I've defined an API, let me apply it to multiple targets that I supply to you programmatically.

The use case driving my suggestion: I'm an affiliate for a given eCommerce site. As an affiliate, I get a data feed of items available for sale on the site, but the feed only contains a limited amount of information. I'd like to make the data on my affiliate page richer with extra data that I scrape from a given product page that I get from the feed.

In this case, the page layout for all the various products for sale is exactly the same, but there are thousands of products.

So I'd like to be able to define my Kimono API once - let's call it CompanyX.com Product Page API - then use the feed from my affiliate partner to generate a list of target URLs that I feed to Kimono.

Bonus points: the list of products changes all the time. New products are added, some go away, etc. I'd need to be able to add/remove target URLs from my Kimono API individually as well as adding them in bulk.
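Roughly the shape I have in mind, sketched in Python. Everything here is hypothetical (the `PageApi` class, its methods, the regex rule format, the example URLs); it only illustrates "define once, apply to a list of URLs that changes over time".

```python
import re

class PageApi:
    """One extraction definition applied to a changing list of target URLs."""

    def __init__(self, name, field_patterns):
        self.name = name
        # Hypothetical rule format: field name -> regex with one capture group.
        self.rules = {f: re.compile(p, re.S) for f, p in field_patterns.items()}
        self.targets = []

    def add_targets(self, urls):
        """Bulk-add URLs, skipping ones already registered."""
        for url in urls:
            if url not in self.targets:
                self.targets.append(url)

    def remove_target(self, url):
        """Drop a single URL, e.g. when a product disappears from the feed."""
        if url in self.targets:
            self.targets.remove(url)

    def run(self, fetch):
        """Apply the same rules to every target; `fetch` maps a URL to its HTML."""
        results = {}
        for url in self.targets:
            html = fetch(url)
            row = {}
            for field, rule in self.rules.items():
                match = rule.search(html)
                row[field] = match.group(1).strip() if match else None
            results[url] = row
        return results

# Offline stand-in for real HTTP fetching, so the sketch runs as-is.
FAKE_SITE = {
    "http://companyx.example/p/1": "<h1>Red Shoe</h1><b class='price'>$40</b>",
    "http://companyx.example/p/2": "<h1>Blue Hat</h1><b class='price'>$15</b>",
}

api = PageApi("CompanyX.com Product Page API",
              {"name": r"<h1>(.*?)</h1>", "price": r"class='price'>(.*?)<"})
api.add_targets(FAKE_SITE)                         # bulk add from the feed
api.remove_target("http://companyx.example/p/2")   # product went away
print(api.run(FAKE_SITE.__getitem__))
```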

Thanks for listening. Great work, again. I can't wait to see where you go with this.

Cheers!


Thanks a ton for the feedback. Getting data from multiple similarly structured URLs programmatically is something we're working on now. We love hearing about the use cases you want to use this for so we can make sure we build out the right features to make kimono useful for you.



You should write a blog post on lessons learned when we spent a year making ~this in 2008.


Thanks so much for creating SelectorGadget! I used it a lot when scraping some fanfiction and Wikipedia data.


Undo button is awesome.

More web apps need an undo button.


Are you familiar with ScraperWiki? I'm wondering how your work fits in with it.

Edit: looks like they've moved away from that space, but have an old version available at: https://classic.scraperwiki.com/


The people who scrape data to avoid paying for APIs are the same people who will not pay for a service to make scraping easier ;)


The people who scrape data at Scraperwiki -- which was made by the same people who opened up parliamentary transcripts in the UK for the first time, and the UN's proceedings, and data about how MPs in London vote -- generally don't have an option to buy anything because the data's hidden by governments from the people who paid for it, on purpose.

But by all means take this opportunity to dismiss all of us as freeloaders.


I have 17 scrapers on ScraperWiki classic for government data https://classic.scraperwiki.com/profiles/maxious/

It would have cost me $348/year to move those to new scraperwiki.


That reads as "Less than $1/day" to me...


Actually, I usually scrape data because there is NO API I can pay for.


This is a great tool! In a past life we needed a web scraper to pull single game ticket prices from NBA, MLB, and NHL team pages (e.g. http://www.nba.com/warriors/tickets/single). We needed the data. But, when you factor in dynamic pricing and frequent page changes you are left with a real headache. I wish Kimono was around when we were working on that project.

I love how you can actually use their "web scraper for anyone" on the blog post. Very cool!


That UI made me go wow, this could be an awesome tool. Idea that pops into my mind is being able to grab data from those basic local sites run by councils, local news papers etc and putting it into a useful app.

How dedicated are you guys to making this work because I'd imagine there are quite a few technical hurdles in keeping a service like this working long term while not getting blocked by various sites?


Love your suggestion. We're committed to making kimono better and we're working on it all the time. We want to make sure it's a responsible scraper, so want to work together with webmasters in cases where there might be blocking but the data is legal to share...


>Sorry, can't kimonify

>According that web site's data protection policy, we were unable to kimonify that particular page.

Sigh... Oh well... Back to scraping.


What page were you trying to hit? We'll check it out


Pages buried in here: https://fannin4.wcjc.edu/

The course catalog is public, so no login is needed. I want to scrape various data related to courses, to populate forms automatically and such.


Yeah, of course it won't work with HTTPS sites. They'd have to proxy those HTTPS sites and perform a MITM just to do it.


As opposed to HTTP where they proxy it and MITM it? I don't understand the objection.


I kind of just assumed that's what they were doing.


HTTPS is definitely a problem for proxy servers unless the proxy server rewrites all the URLs in the HTML pages loaded, as well as all the URLs of the Ajax calls, to point back to the proxy server.
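To show the static half of that rewriting, here is a small sketch that points `href` and `src` attributes back at a proxy. The proxy address is a placeholder, the regex is deliberately naive (quoted attributes only), and URLs assembled at runtime by JavaScript for Ajax calls are not touched by it, which is the hard part.

```python
import re
from urllib.parse import quote, urljoin

# Placeholder proxy endpoint that takes the real target as a query parameter.
PROXY = "http://proxy.example/fetch?url="

def rewrite_links(html, base_url, proxy=PROXY):
    """Rewrite quoted href/src attributes so they load through the proxy."""
    def replace(match):
        attr, quote_char, url = match.group(1), match.group(2), match.group(3)
        absolute = urljoin(base_url, url)  # resolve relative links first
        return f"{attr}={quote_char}{proxy}{quote(absolute, safe='')}{quote_char}"

    return re.sub(r'\b(href|src)=(["\'])(.*?)\2', replace, html)

print(rewrite_links('<a href="/courses">Catalog</a>', "https://secure.example/"))
```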


It may be to their advantage to come up with a solution for this, given the popularity of https these days.


> Web scraping. It's something we all love to hate. You wish the data you needed to power your app, model or visualization was available via API. But, most of the time it's not. So, you decide to build a web scraper. You write a ton of code, employ a laundry list of libraries and techniques, all for something that's by definition unstable, has to be hosted somewhere, and needs to be maintained over time.

I disagree. Web scraping is mostly fun. You don't need "a ton of code" and "a laundry list of libraries", just something like Beautiful Soup and maybe XSLT.

The end of the statement is truer: it's not really a problem that your web scraper will have to be hosted somewhere, since the thing you're using it for also has to be hosted somewhere, but yes, it needs to be maintained and it will break if the source changes.

But I don't see how this solution could ever be able to automatically evolve with the source, without the original developer doing anything?


Perhaps this could be automated by finding the same content in two versions of the dom and then doing a diff on the structure, updating the rules?
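Something like the following, as a toy version of the idea: index where each piece of text lives in the old and new markup, and if a value the old rule used to extract still exists, adopt its new location as the rule. Paths here are plain tag chains such as `div/span`; the names are made up and real selectors would need more than this.

```python
from html.parser import HTMLParser

class PathIndex(HTMLParser):
    """Map each non-empty text node to the chain of tags leading to it."""

    def __init__(self):
        super().__init__()
        self._stack = []
        self.paths = {}

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.setdefault(text, "/".join(self._stack))

def text_paths(html):
    index = PathIndex()
    index.feed(html)
    return index.paths

def repair_rule(old_html, new_html, old_path):
    """Find content the old rule matched and return its path in the new page."""
    old, new = text_paths(old_html), text_paths(new_html)
    for text, path in old.items():
        if path == old_path and text in new:
            return new[text]
    return None  # content changed too; a human has to redo the rule

OLD = "<div><span>Acme Widget</span><b>$5</b></div>"
NEW = "<div><section><h2>Acme Widget</h2></section><b>$5</b></div>"
print(repair_rule(OLD, NEW, "div/span"))
```

It only works while some of the old content is still on the page, so it needs a cached copy of the last successful scrape to compare against.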


It would be great to automate this eventually. For now, we're trying to make it really easy to set up and rebuild the scraper. If it goes down, you'll see it in the status on your user dashboard. We're also implementing alerts, so you can opt to get an email notification if a scrape fails


I assume you get an error on the hourly, daily, monthly, whatever update which you are notified about. Then you can redo the semi-manual setup of the scraper.


Wow, this is looking good, I wish I had it available to me 6 months ago! Nice job :D

I don't know if it's just me or not, but it's not working for me in Firefox (OSX Mavericks 10.9.1 and Firefox v26). The X's and checkmarks aren't showing up next to the highlighted selections. Works fine in Safari.


Thanks for letting us know. We've tested on some versions of Firefox, but not v26 on Mavericks. We'll look into this


Great tool!

I'm coming at things from a non-coder perspective and found it easy to use, and easy to export the data I collected into a usable format.

For my own enjoyment, I like to track and analyze Kickstarter project statistics. Options up until now have been either labor intensive (manually entering data into spreadsheets) or tech heavy (JSON queries, KickScraper, etc. pull too much data and my lack of coding expertise prevents me from paring it down/making it useful quickly and automagically) as Kickstarter lacks a public API. Sure, it is possible to access their internal API or I could use KickScraper, but did I mention the thing about how I don't, as many of you say, "code"?

What I do understand is auto-updating .CSV files, and that's what I can get from Kimono. Looking forward to continued testing/messing about with Kimono!


looks promising!

to be fully usable for me, there are some features missing:

- it lacks manual editing/correcting possibilities: i've tried to create an api for http://akas.imdb.com/calendar/?region=us with "date", "movie", "year". unfortunately, it failed to group the date (title) with the movies (list entries) but rather created two separate, unrelated collections (one for the dates, one for the movies).

- it lacks the ability to edit an api, the recommended way is to delete and recreate.

small bugreport: there was a problem saving the api, or at least i was told saving failed - it nevertheless seems to be stored in my account


Thanks for the feedback. We're working on a feature that will allow you to edit APIs you've created and also edit the selectors and regex (right now, in advanced mode, you can see them, but cannot edit). We're looking into your bug now...


I would seriously consider rethinking that Favicon.


Seconded. I can't show that to anyone at work.


To me, the favicon merely looks like a sumo wrestler’s head with a short ponytail and scowling/serious eyebrows. I can’t tell what NSFW thing you see it as.


Spitballing here, but it could be a POV dick's eye view. I am really reaching however.

edit: Interesting downvotes, I don't mean to be rude, I am just trying to see what someone could find offensive. (I don't see the problem with the favicon)


Until I read this thread I also saw a sumo or an angry onion, but I believe the picture is actually a person facing away from us undoing their kimono.


You see all that in a 32x32 pixel image?


It's much larger than that http://kimonify.kimonolabs.com/favicon.ico


Why? It looks like an onion with angry eyebrows.

Do you have something against vegetables?


I'm experiencing login errors (PEBKAC caveat: password manager, 2x checked, reset), but the support confirmation page is a nice surprise.

http://i.imgur.com/w01CoUy.jpg


Nice work, this is much better than I expected! Does it require Chrome? It doesn't seem to work in Safari for me. Also, does Kimono work for scraping multiple pages or anything that requires authentication?


Great, it should work well on webkit browsers, what version of Safari are you using?


7.0.1, the latest. I also don't have Flash installed, but it doesn't look like you're using Flash. The entire top bar doesn't show for me. Feel free to email me and I can send you screenshots.


Thanks - we're not using flash, so it must be something else. Will follow up over email


I found the one click action for selecting an entire column of values as well as the UI/UX on the top column of the page to be very impressive. We were thinking of a nice clean way to represent that particular UI/UX flow in this browser extension we built as well. Will incorporate that in our next release.

https://chrome.google.com/webstore/detail/krakeio/ofncgcgajh...

Would love to meetup and exchange some ideas if you are based in Bay area.


I like how you've thought through the end to end use case: not just generating an API, but actually making it usable. I've done my fair share of web scraping and it's not an easy task to make accessible and reliable -- good luck!

It makes me wonder if there isn't a whole "API to web/mobile app with custom metadata" product in there somewhere. I can imagine a lot of folks starting to get into data analysis and pipelines having an easier time of it if they could just create a visual frontend in a few clicks.


Yes, we're excited about the possibilities of an end-to-end use case as well... in fact we were surprised to find more interest in front-end output layers on top of the APIs than in just the APIs themselves. Would be curious to know what output features would be most valuable for you.


Well. Think about spreadsheets. Think about live spreadsheets powered directly by APIs. Bingo!

I'm doing a couple of data mining projects right now and simply being able to query and look into the API outputs, as well as my local database, without building a custom frontend would've saved me a bunch of time. But I'm thinking more of the knowledge worker, or even the power user who wants to view their Fitbit, Up and Lark data all in the same dashboard. Can't help but think this already exists somewhere though.


Love your idea. Would love to follow up on this with you


We all know there are a lot of existing tools that do the same things. But I've not met one with such a polished UX. Kudos to the Kimono team, I'll definitely recommend your product.


Very nice job. What about scraping data from password-protected pages?


Great request... it's on our feature shortlist. Definitely a feature we want to implement as soon as we can (after we tackle some basics like pagination and getting images)


Like the parameter passthrough feature. Take a look at places where the parameters are part of the URL structure. For example a Target product page http://www.target.com/p/men-s-c9-by-champion-impact-athletic...

In order to get data for a different product, I will have to modify the URL itself. I think same holds true for blog posts.
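A sketch of what I mean by treating part of the path as the parameter. The URL is a made-up stand-in for a product page, and `make_template` is just an illustrative helper.

```python
from urllib.parse import urlsplit, urlunsplit

def make_template(url, segment_index):
    """Turn one path segment of a URL into a {param} placeholder."""
    parts = urlsplit(url)
    segments = parts.path.split("/")      # "/p/shoe" -> ["", "p", "shoe"]
    segments[segment_index] = "{param}"
    return urlunsplit((parts.scheme, parts.netloc, "/".join(segments),
                       parts.query, parts.fragment))

# Hypothetical product URL where the last path segment identifies the product.
template = make_template("http://shop.example/p/mens-running-shoe", 2)
print(template)
print(template.format(param="womens-rain-jacket"))
```

With the product slug exposed as a parameter, the same extraction rules can be pointed at any product page without redefining the API.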


Yes, it's a great point. We're working on updating the query param passthrough to handle params within the URL structure.


Apart from that here are other items. You may have these on your list, but can count my vote to prioritize.

1. Pagination
2. Image URLs
3. Focus on page types such as product pages, posts, etc. That way it's easy to go from content to content. Will help crawling too.
4. Link back to original page included in JSON

Finally common sites/pages used by multiple users of your systems should not count against the API count requirement under pricing. You may want to charge against total calls, like Parse.


Thanks for the suggestions!


I really like how you guided me in to demoing. Nice job.


This is awesome. Really nice implementation and so useful for many different applications. Just signed up and looking forward to trying this out.


Cool concept. One concern I'd have about this type of tool is that when it encounters something it can't handle, I'm stuck. Writing your own scraper means that you can modify it when you need to. I think the ultimate solution would be something like Kimono with the ability to write snippets of custom javascript to pull out anything that it can't handle by default.


We're in the middle of implementing a more powerful developer version of the tool to handle the use cases you're talking about. The beginning of this is surfaced under the "advanced" tab in the data model view where we show the selectors and regular expressions that are produced. We want to ultimately let you edit those to customize the extractor. From there it'd be super cool to implement the javascript snippet feature you suggested.


I'm normally a bit worried when a thread quickly fills up with praise, but this looks very nice.

It's something I have thought about, as I'm sure many people who have done any amount of scraping have, but never went forward and tried to implement. The landing page with video up top and in-line demo is a pretty slick presentation of the solution you came up with. Good job.


Thanks we were pretty surprised as well, but we're really grateful for the encouragement


Please get this off the ground. I would also possibly suggest a separate business, website regression testing.

Selenium is WAAAY too painful.


Thanks for the suggestion... we're working hard to get auth up and running!


Thank you for building a tool I've been wanting so I don't have to!

Can't wait to play around with this tonight.

Suggestion. Allow one to select images.


Great add. It's on our shortlist - a popular request!


Also, the ability to style it myself would be nice :)


This looks really slick. What happens if a website you're scraping changes its design? Do you respect robots.txt?


If it changes the format significantly, the scraper will break, so for now you'll have to use the tool to rebuild. You will see on your API status page that it's down. As for robots.txt, we do respect it... for now we're leaving that to the user, but we're trying to implement a proactive way of checking for disallows and stopping those scrapers from being built.


Please clarify: are you saying that right now you leave respecting robots.txt to the user?


At the moment, we rely on users to be responsible. We spell it out in the terms and FAQ. We've been in private beta, keeping usage very limited until today. We fully understand the seriousness of the issue as we scale. We're committed to becoming a responsible bot that respects robots.txt


I would say how you're scraping differs from say how Google, a search engine, scrapes. I'm not sure there is a way in robots.txt to define for each use? Knowing the data in a structured way, but then allowing it to be displayed in full off-site is quite different than using the scraped data for linking into a website.


But robots.txt provides minimums: don't scrape this page, don't refresh more than once every x, these crawlers are allowed this access, etc.
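
For what it's worth, here is a rough sketch of checking those minimums with Python's standard `urllib.robotparser`. The robots.txt content, the bot names and the URLs are invented for illustration, `Crawl-delay` is a widely used extension rather than part of the original convention, and none of this is a claim about how Kimono does it.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt: a default rule set plus one crawler given full access
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10

User-agent: friendlybot
Allow: /
"""

def load_rules(text):
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

rules = load_rules(ROBOTS_TXT)

# "don't scrape this page"
blocked = rules.can_fetch("examplebot", "http://example.com/private/data")
allowed = rules.can_fetch("examplebot", "http://example.com/public")
# "don't refresh more than once every x" (seconds between requests)
delay = rules.crawl_delay("examplebot")
# "these crawlers are allowed this access"
special = rules.can_fetch("friendlybot", "http://example.com/private/data")

print(blocked, allowed, delay, special)
```

A scraper builder could run a check like this before creating an API and refuse when `can_fetch` returns `False`.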


Feedback from webmasters is really helpful for us. We want to make sure we're making data available via API responsibly, so would love to hear your suggestions/ thoughts as we define a scalable solution.


There's a huge business here if you keep at it. I'll throw money at the screen if you can make this work.


Definitely awesome presentation and product.

The example doesn't seem to work right on Firefox. On Chrome, if I click "Character" in the table then it highlights the whole column and asks if I want to add the data in the column. On Firefox, clicking "Character" just highlights "Character" and that is it.

Ubuntu 12.04

Firefox 25.0.1


Thanks for flagging. We'll get on this to figure out what's going on when running on Ubuntu


I built something very similar last year, but sadly never got around to polishing and launching it: http://exfiltrate.org/

(There's a prototype of an API generator hidden in a menu somewhere but it's nowhere near production ready)


Yeah, we've been working on this for a while too... took a while to polish it a bit before we could put it out there. Will check out exfiltrate.org - looks cool!


Nice tool, slick UI. It worked for some pages and not for others. Currently I'm using Feedity: http://feedity.com for all business-centric data extraction and it has been working great (although not as flexible as kimono).


Great job guys.

One problem I've had though is that I think you guys are hosted on AWS - a lot of websites block incoming connections from AWS.

Are there plans to add an option in future to route through clean IPs? Premium or default, this would be cool and make it a lot more useful.


Nice job! I really like it, it's a fantastic idea! And your UX is great! Just one thing I've found when testing: I've had some problems with non-ASCII characters when visiting Brazilian websites, such as www.folha.com.br.


Well done on the product & solving a clear need! This is extremely useful for hackathons/prototyping. I also loved the live demo in the blog post and you did a wonderful job with the design/layout/colorscheme of the site.


Very cool, and I like that the link is your announcement page running inside of the demo. Really drives home the idea.

That said, it looks like it can't do media right now. I would love it if it could at least give me a url for images/other media.


It's a great suggestion, thanks! ... image extraction would be cool, and it's on our shortlist of features to build next


Does it do logging in to websites then fetching? Do you plan to add scripting to it?


We don't support logging in yet, but it's a feature we're working on adding. Scripting will also be cool, but it's right now further down our feature queue


I don't think this can beat the speed of a hand-tuned crawler. When I write crawlers, I skip page rendering and javascript execution if they aren't needed, which massively speeds up the crawling process.
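
As a minimal sketch of what I mean (the sample HTML and the `LinkExtractor` name are made up, and a real crawler would fetch the page over HTTP first): pull the links straight out of the raw markup with the standard library, never building a rendered page or running any script.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags while streaming over raw HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links

# Script content is treated as opaque text; nothing is executed
SAMPLE = (
    '<html><body><script>document.title = "x";</script>'
    '<a href="/item/1">One</a> <a name="anchor">no href</a> '
    '<a href="/item/2">Two</a></body></html>'
)
links = extract_links(SAMPLE)
print(links)
```

This only works when the data is already in the served HTML; pages that build their content in javascript still need a real browser.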


There are definitely things that a custom-built scraper can do more efficiently than kimono, but our focus right now is making scraping accessible across a broad enough range of web sites.


Thanks guys, glad you like it. Welcome any feedback so we can make it better!


Really cool idea and tool. Still need to test this out properly. Is it possible to scrape not just one page but a stack of them? For example, a product catalog of 1000 SKUs extending up to 50 pages.


We don't support that quite yet. It's our #1 feature request though, and we're working to get it ready soon


Is this kind of web scraping legally allowed? Since the scraping is not done directly from our servers, if the scraped website takes legal action, will it be against kimonolabs or the user?


really excited to see this. i've had the idea (and nearly this execution) in mind for years but no use or ambition to get it done.

given the pricing though i'm almost motivated to make my own. as a hosted service the fees make sense with the offerings. but not only would i rather host my own- it would be cheaper all around. would you consider adding a free or cheap self hosted option?

aside, i think there is a mislabel on the pricing page. i'm guessing the free plan should not have 3 times the "apis" than the lite plan.


Yes, we're in beta right now, while we're still working out the bugs. For beta, it's free for 30 APIs.


thanks for the response! any chance at a self hosted option? i'd even still pay (once) for a self hosted version.


Really sleek interface, and looks like it could be extremely useful (I just spent a few hours cranking out Nokogiri this morning).

Oh, typo: "Notice that toolbar at the toop of the screen?"


Awesome, thanks for the kind words. And for catching that typo, will fix that now


I thought it was intentional. Swedish chef style. I like it, but I'll need to go back and re-read to understand how I can use this on other pages than the homepage/demo page. I've nothing immediate to try it with right now. :)

Edit: I'll watch the video after work, probably will clear everything up for me.


This looks really useful, and I'm trying to figure out if I could use it on a project I'm working on, but hitting an issue. I sent a support message. Nice job!


Thanks, the support tickets really help us debug. We'll look into it and get back to you


Looks awesome, however I keep getting errors and 404s. Could this be an issue on my end (seems to be working for others) or just HN making the servers beg for mercy?


Where are you getting the 404s? We will check into it now


Reminds me of Dapper

http://open.dapper.net/

This allowed you to do similar, before being consumed by Yahoo. Might be worth a look.


Awesome! Hats off.. How about extracting hashtag/GID of any record if applicable, which are typically not rendered on page, but hidden under the hood.


the reason i ever have to write a scraper is because of pagination. while this looks awesome, i'll have to stick to scraping until that is solved. :(


It's probably our #1 feature request at the moment. We're working on it and hope to have it ready for you to try soon


I thought to myself oh boy yet another web scraper as a service but got surprised. I haven't been this impressed with a product video since Dropbox.


Wow looks amazing. I tried doing some queries on public directories, and it even supports parameter passing. Will be using this for some side projects.


How (if at all) does this run on javascript heavy sites?


JS-heavy sites can be tricky. We position it so it should execute after most of the on-page JS, so it handles a lot of cases. There are still sites that break it though.... we're trying to tackle these guys one by one right now, as we try to generalize a broader solution


Any chance you guys plan to add link hrefs to CSVs? I'd love to use this now, but I need the href for backlinks and future inference.


Thanks for the suggestion, we're adding to our list


The UX is great and journalists everywhere will thank you.

But outside of government websites I don't see how a lot of this is even legal, per se?


I'm not sure scraping itself is illegal, depending on what you're doing with the data. (Though it may be against a site's Terms of Use which may be binding. IANAL.)

I can tell you that on several occasions I've scraped commercial sites with the permission of the owner. They want me to have access to the data but don't have the time or ability to create a proper API.


Thanks... yes, public data from governments is a great use case. Often a lot of apps built using scrapers will wind up driving up traffic/sales at the source site so it's okay. We want to do responsible web scraping, so will respect webmasters' robots.txt files to make sure it's legal.


I love the execution, but I also see inherent problems.

Robots.txt is just a convention to advise crawlers. I'm confident most sites explicitly state this is against their terms of service.

You will encounter terms along the lines of:

"Unauthorized uses of the Site also include, without limitation, those listed below. You agree not to do any of the following, unless otherwise previously authorized by us in writing: Use any robot, spider, scraper, other automatic device, or manual process to monitor, copy, or keep a database copy of the content or any portion of the Site."


The law isn't entirely blind to conventions, though. They don't guarantee anything, but if a court understood that there exists a convention for saying "no robots, please", and the robot operator in question followed it, then a court could well look less favorably on the damages claims of a website operator who didn't make use of the widely known convention.


You've got a valid point. We want to eventually create a space that allows responsible scraping - so webmasters can have access to analytics on what's being scraped and can explicitly turn off kimono APIs for their domains if they see fit. We also think there are use cases for people who own their own data. Often, APIs will provide a way for companies to streamline their internal app development and figure out what to expose to the developer community before investing in an expensive API deployment.



It would be nice to have a view also on the raw html code, e.g., to create a field containing the url of an image in the page.


Thanks for the suggestion. We're rolling out advanced mode soon, which will allow you to edit the CSS selectors and RegEx operating on the page's HTML to define the selected data elements
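
To illustrate the regex half of that idea only, here is a small sketch of narrowing an already-selected piece of text down to the value of interest. The pattern, the `extract_price` helper and the sample strings are hypothetical and are not the expressions the tool generates.

```python
import re

# Hypothetical pattern: a dollar amount such as $19 or $19.99
PRICE_RE = re.compile(r"\$(\d+(?:\.\d{2})?)")

def extract_price(text):
    """Return the first dollar amount in text as a float, or None if absent."""
    match = PRICE_RE.search(text)
    return float(match.group(1)) if match else None

price = extract_price("Now only $19.99 (was $24.99)")
missing = extract_price("Out of stock")
print(price, missing)
```

Editing the pattern is what would let a user fix an extractor that grabs too much or too little of an element's text.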


It looks cool, but very expensive compared to Visual Web Ripper, which you pay way less for (but have to host yourself).


As someone building a home-grown proprietary scraping engine: consider alternative locations of elements. Most sites are using templating engines so it's fairly reliable to find things in the same place, but more often than you might expect, things move around ever so slightly. Navigation is a fun one also. ;)


This is my third time trying to get an answer to this question: does your crawler automatically respect robots.txt?


So sorry for missing this earlier. See our response in comments below: "At the moment, we rely on users to be responsible. We spell it out in the terms and FAQ. We've been in private beta, keeping usage very limited until today. We fully understand the seriousness of the issue as we scale. We're committed to becoming a responsible bot that respects robots.txt"


You can use the utility without registration or login by blocking the login prompt with, for example, AdBlock.


This is fantastic. Congrats on launching it! Once it has pagination & auth I'll be all over this :)


What about some navigation tools there?

Looks pretty good, but it does not really replace my scrapers. Maybe some of them...


re: nav tools -- you mean the ability to crawl multiple pages?


Seems it can't see the stuff inside angular views.. well at least mine..

But for the rest, awesome product. Thanks.


Yes, you're right... we can't handle angular quite yet. We're working on this.


Looks very nice. There seems to be an issue with international characters though (æ/ø/å).


Yes, thanks for spotting. We'd discovered that Chinese, Japanese and Korean failed but didn't know about these characters. Thanks!


It appears that it doesn't work with websites containing international characters.


Yes, thanks for noting. This is a bug and we're working to get these characters supported as soon as we can


great idea. i'll have to keep this in mind for future projects.


It's easy not to write web scrapers even without this tool ;)


I like the concept. Would love to see page authentication


Is there an ability to scrape more than one page of data?


Not yet, but it's our #1 feature request, so we're working on it now. For now, you can make multiple APIs (one for each URL). If the URL takes query parameters though, you can re-use the same API and programmatically cycle through query parameters.
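
A rough sketch of that workaround, assuming a made-up endpoint and parameter names (`api.example.com`, `apikey` and `page` are placeholders, not the real kimono URL scheme): generate one request URL per query-parameter value and call them in turn.

```python
from urllib.parse import urlencode

def build_urls(base_url, param_name, values, extra=None):
    """Return one URL per value of param_name, keeping any fixed extra params."""
    urls = []
    for value in values:
        params = dict(extra or {})
        params[param_name] = value
        urls.append(f"{base_url}?{urlencode(params)}")
    return urls

# Hypothetical endpoint and key; substitute the API URL you were given
urls = build_urls(
    "https://api.example.com/my-api",
    "page",
    range(1, 4),
    extra={"apikey": "YOUR_KEY"},
)
for url in urls:
    print(url)
```

Each URL can then be fetched with any HTTP client and the JSON results concatenated on the caller's side.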


This is really slick! Btw. Who made your intro video?


It's pretty homegrown right now, we did the intro video ourselves on our laptops


I like the concept and it looks similar to Import.io


Yes, in concept quite similar. We wanted to make something that you can use from within your browser as part of your natural workflow, without installing any other software. We also really wanted to figure out the right data association intelligently based on user selections vs. asking users to think through a data model up front


I kind-of enjoy writing web scrapers.


How does this compare with Mozenda?


Man that demo is impressive!


Agree demo is awesome. But I don't think scraping for any web page is that simple. Lots of exceptional cases.


That looks quite swift


use any chrome xpath plugin and give that to YQL


so what do you do that import.io doesn't?


i love this! and amazing video!


actually I love scraping :(


OMFG.


neat.


"Never write a web scraper again"... yea right.. sick and tired of such gimmicks and self promotion on the net today.



