I went to try and use it on the demo page it provides, going through and adding things, but when I went to save it, I just received an error that something went wrong. Well, crap. That was a waste of time. Oh well, maybe it's just me.
Alright, I'll give it another shot using the website they used in the demo. Opened up a Hacker News discussion page and started to give it a try. Immediately it was far less intelligent than the demo. Clicking on a title proceeded to select basically every link on the page. Somehow I clicked on some empty spots as well. Nothing was being intelligently selected like it was in the demo. Fine, that wasn't working tremendously well, but I wanted to at least see the final result.
Same thing: just got an error that something went wrong and it couldn't save my work.
Disappointing. I still might try it again when it works 'cause it's a great idea if they really pulled it off. So far: doesn't seem to be the case.
Feel free to drop me a line if you want any specifics about the troubles I had.
...I will comment that the user experience for first adding elements is not what I expected. Once you understand it, it's great overall, but I'd add some popup tooltip-type guide the first time a user uses it, explaining that you should choose one element first and then check a matching element. ...Then explain the number bubbles next to the property field; you don't get what those numbers are at first. ...Then, for the property field, somehow make it more obvious that you should fill it out.
Basically you have the interface set up so any action can be done at any time, but it should be presented as a forced sequential set of steps to the end user, at least at first.
I get that it has a sort of kitschy or retro appeal, but it's just basically a pain to use and looks terrible.
I can't tell you how often I click next page to find that I've taken too long and my session or whatever has expired.
I've used a bunch of HN skins that were supposedly better designed, but none of them stuck. Apparently it's just plain unnecessary for HN to be better.
The website that lets technologists hang out with people who might give them large sums of money to see their ideas succeed doesn't have to be good, or pretty.
If the website were better, or prettier, that would not add any additional value to the previously mentioned large sums of money.
(Oh, and advice, and being able to chat with industry leaders and experts on diverse arrays of topics, etc.)
I think better terms might be "clean up" or "restructure".
I'll leave it to you to figure out how one would implement such a feature ;-)
A better example would be "hey I have this cool device that will improve your mileage by 100%, but it only works on cars built according to current emission standards...".
HN pages are very well structured and are probably an ideal case for an automatic scraper.
If your automatic scraper doesn't work on HN, it is unlikely to work generally.
There are some irregularities (e.g. YC announcements without a score, the "more" button, self posts), but it's not really structurally complicated (compare reddit, which has 3 "score" fields per link).
The software is abandoned, but their algorithms are described in a paper:
I vaguely remember playing around with this tool you mentioned. I thiiiiink it was this one, although it seems to be superseded by this one now.
Of course, if this were an offline or self-hosted product, it would solve that auth problem instantly.
The approach I took was to hijack the cookies from the browser, via the browser extension, once the user has signed in on e.g. Facebook.
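Roughly, the server-side half of that idea looks something like the sketch below in Python. It assumes the extension has already forwarded the user's session cookies; the cookie names and target URL are placeholders, not any real service's API.

    # Sketch: reuse session cookies captured by a browser extension to
    # fetch a logged-in page server-side. Cookie names and URL are placeholders.
    import requests

    def fetch_logged_in_page(url, captured_cookies):
        """captured_cookies: dict of cookie name -> value forwarded by the extension."""
        session = requests.Session()
        for name, value in captured_cookies.items():
            session.cookies.set(name, value)
        # Send a browser-like User-Agent so the request looks like the original session.
        resp = session.get(url, headers={"User-Agent": "Mozilla/5.0"})
        resp.raise_for_status()
        return resp.text

    # Hypothetical usage:
    # html = fetch_logged_in_page("https://www.facebook.com/me",
    #                             {"c_user": "...", "xs": "..."})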
The route of proxying the website does in fact do away with the need to install any external third-party libraries.
The browser extension I built, coupled with the web service it's integrated with, also allows scraping logged-in pages from Facebook, Google, and LinkedIn.
We are accepting early access requests right now.
Check us out! http://automate.ly/
This service looks great, good work! But since you seem to host the APIs that get created, how do you plan to get around the centralized-access issues? On the chumby, for example, we had to do a lot of web scraping on the device itself (even though the string processing needed for scraping required a lot of hoop-jumping optimization to run well in ActionScript 2 on a slow ARMv5 chip with 64 MB of total RAM) to avoid all the requests coming from the same set of chumby-server IP addresses. Companies tend to notice lots of requests coming from the same server block really quickly and will often rate-limit the hell out of you, which can leave one heavy-usage scraper destroying access for every other client trying to scrape the same source.
Having a panel for webmasters along with that would be fine.
I wrote a library for that.
I could really have used a service like this just yesterday, actually. I ended up fiddling around with iMacros and got about 80% of what I was trying to achieve.
Feature proposal: deal with pagination.
This would be a pretty difficult thing to grab when scraping normally, but the app errors before loading the content:
JS error: An error occurred while accessing the server, please try again. Error Reference: 6864046a
curl "https://www.keepandshare.com/calendar/fns_asynch_api.php?r=0... --data "action=getrange&i=1940971&from=2013-12-26&to=2014-02-06"
My suggestion: once I've defined an API, let me apply it to multiple targets that I supply to you programmatically.
The use case driving my suggestion: I'm an affiliate for a given eCommerce site. As an affiliate, I get a data feed of items available for sale on the site, but the feed only contains a limited amount of information. I'd like to make the data on my affiliate page richer with extra data that I scrape from a given product page that I get from the feed.
In this case, the page layout for all the various products for sale is exactly the same, but there are thousands of products.
So I'd like to be able to define my Kimono API once - let's call it the CompanyX.com Product Page API - then use the feed from my affiliate partner to generate a list of target URLs that I feed to Kimono.
Bonus points: the list of products changes all the time. New products are added, some go away, etc. I'd need to be able to add/remove target URLs from my Kimono API individually as well as adding them in bulk.
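To make the ask concrete, here's a rough sketch of what that bulk-target workflow could look like from my side. Every name and endpoint in it is invented for illustration; it is not an actual Kimono API.

    # Hypothetical sketch: register every product URL from an affiliate feed
    # as a target of one predefined API. Endpoint, parameters, and field
    # names are all made up for illustration.
    import csv
    import requests

    API_ID = "companyx-product-page"  # the API defined once in the UI
    ADD_TARGET_URL = "https://api.example.com/apis/{}/targets".format(API_ID)

    def sync_targets_from_feed(feed_path, api_key):
        """POST each product URL in the feed as a new target of the API."""
        with open(feed_path, newline="") as f:
            for row in csv.DictReader(f):
                resp = requests.post(
                    ADD_TARGET_URL,
                    params={"apikey": api_key},
                    json={"url": row["product_url"]},
                )
                resp.raise_for_status()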
Thanks for listening. Great work, again. I can't wait to see where you go with this.
More web apps need an undo button.
Edit: looks like they've moved away from that space, but have an old version available at: https://classic.scraperwiki.com/
But by all means take this opportunity to dismiss all of us as freeloaders.
It would have cost me $348/year to move those to new scraperwiki.
I love how you can actually use their "web scraper for anyone" on the blog post. Very cool!
How dedicated are you guys to making this work? I'd imagine there are quite a few technical hurdles in keeping a service like this working long term without getting blocked by various sites.
>According to that web site's data protection policy, we were unable to kimonify that particular page.
Sigh... Oh well... Back to scraping.
The course catalog is public, so no login is needed. I want to scrape various data related to courses, to populate forms automatically and such.
I disagree. Web scraping is mostly fun. You don't need "a ton of code" and "a laundry list of libraries", just something like Beautiful Soup and maybe XSLT.
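For instance, pulling the titles off the HN front page discussed elsewhere in this thread is only a few lines (the CSS selector is an assumption about HN's current markup and may need adjusting):

    # Minimal scrape of Hacker News front-page titles with requests + Beautiful Soup.
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://news.ycombinator.com/").text
    soup = BeautifulSoup(html, "html.parser")

    # Selector assumes titles live in a .titleline span; tweak if HN's markup changes.
    for link in soup.select(".titleline > a"):
        print(link.get_text(), "->", link["href"])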
The end of the statement is truer: it's not really a problem that your web scraper will have to be hosted somewhere, since the thing you're using it for also has to be hosted somewhere, but yes, it needs to be maintained and it will break if the source changes.
But I don't see how this solution could ever automatically evolve with the source, without the original developer doing anything?
I don't know if it's just me or not, but it's not working for me in Firefox (OSX Mavericks 10.9.1 and Firefox v26). The X's and checkmarks aren't showing up next to the highlighted selections. Works fine in Safari.
I'm coming at things from a non-coder perspective and found it easy to use, and easy to export the data I collected into a usable format.
For my own enjoyment, I like to track and analyze Kickstarter project statistics. Options up until now have been either labor intensive (manually entering data into spreadsheets) or tech heavy (JSON queries, KickScraper, etc. pull too much data, and my lack of coding expertise prevents me from paring it down/making it useful quickly and automagically), as Kickstarter lacks a public API. Sure, it is possible to access their internal API, or I could use KickScraper, but did I mention the thing about how I don't, as many of you say, "code"?
What I do understand is auto-updating .CSV files, and that's what I can get from Kimono. Looking forward to continued testing/messing about with Kimono!
To be fully usable for me, there are some features missing:
- It lacks manual editing/correcting possibilities: I tried to create an API for http://akas.imdb.com/calendar/?region=us with "date", "movie", "year". Unfortunately, it failed to group the date (title) with the movies (list entries) and instead created two separate, unrelated collections (one for the dates, one for the movies).
- It lacks the ability to edit an API; the recommended way is to delete and recreate it.
Small bug report: there was a problem saving the API, or at least I was told saving failed - it nevertheless seems to be stored in my account.
Edit: Interesting downvotes. I don't mean to be rude, I'm just trying to see what someone could find offensive. (I don't see the problem with the favicon.)
Do you have something against vegetables?
Would love to meet up and exchange some ideas if you are based in the Bay Area.
It makes me wonder if there isn't a whole "API to web/mobile app with custom metadata" product in there somewhere. I can imagine a lot of folks starting to get into data analysis and pipelines having an easier time of it if they could just create a visual frontend in a few clicks.
I'm doing a couple of data mining projects right now, and simply being able to query and look into the API outputs, as well as my local database, without building a custom frontend would've saved me a bunch of time. But I'm thinking more of the knowledge worker, or even the power user who wants to view their Fitbit, Up, and Lark data all in the same dashboard. Can't help but think this already exists somewhere, though.
In order to get data for a different product, I'll have to modify the URL itself. I think the same holds true for blog posts.
2. Image URLs
3. Focus on page types such as product pages, posts, etc. That way it's easy to go from content to content. It will help crawling too.
4. Link back to original page included in JSON
Finally, common sites/pages used by multiple users of your system should not count against the API count limit in the pricing. You may want to charge based on total calls, like Parse.
It's something I have thought about, as I'm sure many people who have done any amount of scraping have, but never went forward and tried to implement. The landing page with video up top and in-line demo is a pretty slick presentation of the solution you came up with. Good job.
Selenium is WAAAY too painful.
Can't wait to play around with this tonight.
Suggestion. Allow one to select images.
The example doesn't seem to work right on Firefox. On Chrome, if I click "Character" in the table, it highlights the whole column and asks if I want to add the data in the column. On Firefox, clicking "Character" just highlights "Character" and that is it.
(There's a prototype of an API generator hidden in a menu somewhere but it's nowhere near production ready)
One problem I've had though is that I think you guys are hosted on AWS - a lot of websites block incoming connections from AWS.
Are there plans to add an option in the future to route through clean IPs? Premium or default, this would be cool and would make it a lot more useful.
That said, it looks like it can't do media right now. I would love it if it could at least give me a URL for images/other media.
Given the pricing, though, I'm almost motivated to make my own. As a hosted service, the fees make sense with the offerings. But not only would I rather host my own - it would be cheaper all around. Would you consider adding a free or cheap self-hosted option?
As an aside, I think there is a mislabel on the pricing page. I'm guessing the free plan should not have three times as many "APIs" as the Lite plan.
Oh, typo: "Notice that toolbar at the toop of the screen?"
Edit: I'll watch the video after work, probably will clear everything up for me.
This allowed you to do something similar before being consumed by Yahoo. Might be worth a look.
But outside of government websites I don't see how a lot of this is even legal, per se?
I can tell you that on several occasions I've scraped commercial sites with the permission of the owner. They want me to have access to the data but don't have the time or ability to create a proper API.
Robots.txt is just a convention to advise crawlers. I'm confident most sites explicitly state this is against their terms of service.
You will encounter terms along the lines of:
"Unauthorized uses of the Site also include, without limitation, those listed below. You agree not to do any of the following, unless otherwise previously authorized by us in writing:
Use any robot, spider, scraper, other automatic device, or manual process to monitor, copy, or keep a database copy of the content or any portion of the Site."
Looks pretty good, but it does not really replace my scrapers. Maybe some of them...
But for the rest, awesome product. Thanks.