Thanks for linking to this, chipperyman573! The parent comment used to be valid, but we added a dollar amount for the "Business" plan based on this and a few other comments; we are still willing to discuss discounts for non-profits, smaller companies, etc.
I've always felt that a service like this would be great for Code for America projects. A big problem I have with creating technology applications for the civic good is that the government is terrible at providing open data. Even when they do open up civic data, they do a terrible job of it.
An example of this is California drought data: automatically grabbing data on the drought is incredibly difficult because it involves scraping HTML tables. I tried to build an API that presents drought data so volunteers would have an easier time building out data visualizations. I ended up just getting exhausted doing all the scraping work.
I then moved on to a new project: building a free-to-use Padmapper for affordable housing. The data for income-restricted apartment units is managed by a government-contracted vendor: a city or county declares income stabilization policies and legally enforces them against landowners, and the landowners then send their list of units to the vendor.
This would be great except the vendor does the bare minimum. Padmapper looks amazing but, really, it's only applicable for the upper middle class due to explosive housing costs in the Bay Area. So, in order to provide a more modern website and mobile application for the community, I started to scrape the vendor's website. It was terrible. I kept getting throttled. So I gave up.
@OhSoHumble: this actually came to mind for us too in terms of building better experiences on top of common government services. I'd love to chat and see how I can help, so shoot me an email at peter@wrapapi.com
Hi everyone! We just released the second version of WrapAPI.
We have a new WrapAPI API Builder that looks like a browser, and is as easy to use as one too. You can define your API's inputs with a quick tap on the address bar, and point and click at the data you want to extract.
We also have a Chrome extension that is smarter and better integrated than ever. It records your requests and automatically creates parameter inputs for the values that change between requests to the same endpoint. The contents of your captures are immediately ready for you to start defining outputs and the data to extract, too.
Let me know if you have any questions or feedback!
If you take a screenshot of all the items being scraped, you could build a dataset for a pretty powerful AI: something that takes an image of a webpage and outputs machine-readable data. I'm not saying there's a NN that can do it right now, but it seems like it could eventually get there.
Also, since that is the case, you could build this in a few hours using something like https://github.com/bda-research/node-crawler. Yes, it would have no GUI, so you lose that.
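For a rough idea, here's a minimal node-crawler sketch of that approach (the target URL and selector are just placeholders):

    // npm install crawler
    const Crawler = require('crawler');

    const c = new Crawler({
      maxConnections: 10,
      // Called once for each fetched page
      callback: (error, res, done) => {
        if (error) {
          console.error(error);
        } else {
          const $ = res.$; // cheerio (server-side jQuery) injected by node-crawler
          $('table tr').each((i, row) => {
            console.log($(row).text().trim());
          });
        }
        done();
      },
    });

    // Queue whatever pages you want scraped
    c.queue('https://example.com/page-with-data');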
For sites that load data using AJAX, we recommend you take a look at our Chrome extension (https://wrapapi.com/#/chromePlugin). Our philosophy isn't to run a full headless browser (similar to Phantom), but rather to make it really easy to find the AJAX requests that actually load the data you need.
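To illustrate the idea: once you've found the underlying request, you can often hit the JSON endpoint directly and skip HTML parsing entirely. A minimal sketch, with a hypothetical URL and response shape (you'd find the real ones in your browser's network tab or with the extension):

    // Node 18+ has a global fetch, so no headless browser is needed.
    (async () => {
      // Hypothetical JSON endpoint that the page's own JavaScript requests via AJAX
      const response = await fetch('https://example.com/api/listings?page=1');
      const data = await response.json();

      // Structured data directly, no HTML parsing required
      console.log(data.results);
    })();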
If JS is a problem for you, try Kantu. It works with screenshots and uses OCR for scraping. The beauty is that it works with any kind of site. But clearly, the speed cannot match a node.js or Perl-based scraper (Mechanize, etc.), so it is not suitable for high volumes.
Yeah, the concept is the same as Sikuli, but all inside Chromium (and the OCR is better).
>Do you find it better than Phantom?
It depends. Once you have a working script, web scraping with Phantom is much faster and much more resource efficient. But since Kantu works visually, you do not have to touch any page source code. That makes it much easier/faster to create the automation in the first place, especially for complex sites with date controls, drag & drop and other Javascript.
By the way, your onboarding step-by-step wizard[1] is really awesome. I've used similar scripts on my sites before, but they keep breaking because users often click on some div or button they weren't meant to (they're only learning, or they're on mobile), and then the wizard can't sync to the next step and the whole thing breaks :/
Is this happening on your site? If not, I'd appreciate some tips on coding it and on handling cases where the wizard can't stay in sync or the user clicks on unintended page elements.
Thanks! We used this awesome library called React-Joyride (https://github.com/gilbarbara/react-joyride) which made setting up the product tour a breeze. Since our product tour is on a single-page app, it works quite well.
The most helpful part is that you can pass a callback which will trigger before/during/after each step, which can let you ensure that the state of the page matches what you're expecting. In our case, we use it to make sure that you're switched to the right tab, etc. Take a look! I highly recommend it.
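A minimal sketch of that callback pattern (the selectors, step content, and the ensurePageStateForStep helper are made-up placeholders, and the exact callback fields can differ between react-joyride versions):

    // npm install react-joyride
    import React, { useState } from 'react';
    import Joyride, { STATUS, EVENTS } from 'react-joyride';

    const steps = [
      { target: '.address-bar', content: 'Type a URL here to start building your API.' },
      { target: '.results-tab', content: 'Your extracted data shows up in this tab.' },
    ];

    // Hypothetical app-specific helper: switch tabs, open panels, etc.
    function ensurePageStateForStep(stepIndex) {
      console.log('Preparing the page for tour step', stepIndex);
    }

    export default function ProductTour() {
      const [run, setRun] = useState(true);

      const handleCallback = (data) => {
        const { status, type, index } = data;

        // End the tour cleanly when the user finishes or skips it
        if (status === STATUS.FINISHED || status === STATUS.SKIPPED) {
          setRun(false);
          return;
        }

        // Before each step, make sure the page is in the expected state
        // (e.g. the right tab is open) so the tour can't get out of sync.
        if (type === EVENTS.STEP_BEFORE) {
          ensurePageStateForStep(index);
        }
      };

      return <Joyride steps={steps} run={run} continuous callback={handleCallback} />;
    }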
This tool is really well thought out and useful! I made a working API in less than an hour. It has a much better design and implementation than Kimono, and it's easier than using Python 3 + Beautiful Soup 4, which is how I made my previous web scrapers. It also works for POSTing to web forms.
No offense, but your comment sounds like astroturfing (I'm not saying you are, just that it's part of a pattern I see).
I often see one or more commenters write what seems like an excessively positive thought dump on Show HNs. It just doesn't seem like the natural conversational tone everyone uses, but I can't quite put my finger on it.
Has anyone else noticed it? Is there a term for this sort of writing style?
It could be that I need to work on my writing skills. I'll admit, I'm a systems engineer, not a writer. On the other hand, HN commenters tend to convey a healthy dose of cynicism and skepticism, and it's known that negative comments come across as more trustworthy than positive ones on the internet. I simply used this tool and it did what it said it did; I don't give a positive review unless I've had a good experience. But it is easier for all of us to believe people who show some degree of skepticism and cynicism.
Thanks webninja! The POSTing part is one of the biggest things we were trying to get right while not making it any harder to use than Kimono. Is there anything that was confusing when trying to learn it for the first time? If so, we're still trying to make it easier =)
I think it's pretty close to perfect. I made a sample WrapAPI for http://etfdailynews.com/etf/{{symbol}}/, where the symbol is a 3-4 character ETF ticker like VYM. It's one of the only websites I've found that provides the full breakdown of an ETF's contents. Initially I wasn't sure where to provide sample input, such as a handful of 3-character ticker strings, since /{{symbol}}/ isn't a GET or POST value. Under the JSON and Table column I entered multiple symbols separated by commas, and it took me a little while to realize that you're only supposed to supply one test value, and where to supply it. But it "clicked" shortly after.
That endpoint will then emit a state token, which includes the session cookies. You can feed that state token into your next request and it'll authenticate you.
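For a rough idea of how that looks from the calling side, here's a hedged sketch; the endpoint URLs, parameter names, and response fields below are hypothetical placeholders, not the actual request format:

    // Hypothetical two-step flow: log in, then fetch content as the logged-in user.
    (async () => {
      // 1. Call the "login" endpoint; assume it returns a state token that wraps
      //    the session cookies.
      const loginRes = await fetch('https://wrapapi.com/use/you/mysite/login/latest', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ wrapAPIKey: 'YOUR_KEY', user: 'me', password: 'secret' }),
      });
      const { stateToken } = await loginRes.json();

      // 2. Feed that state token into the "get content" endpoint so the request
      //    carries the authenticated session.
      const contentRes = await fetch('https://wrapapi.com/use/you/mysite/content/latest', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ wrapAPIKey: 'YOUR_KEY', stateToken }),
      });
      console.log(await contentRes.json());
    })();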
Yes, you can get the content you want even if it's behind a login page! Expect to create 2 APIs, one for logging in and another for getting the content. An example is provided on their homepage: https://wrapapi.com/v2#/caseStudies/cj
Likely because they have custom pricing based loosely on how much business value they create for the customer. E.g. if a philatelist wanted to scrape stamp catalogs, and if an industry-specific analytics platform wanted to scrape a directory of prospects - you'd want two different prices. Otherwise, you'd either 1) leave stamp enthusiasts out in the rain, or 2) leave a whole lot of meat on the bone w/r/t enterprise pricing. There might also be a consulting upsell!
Eh, that seems like a non-problem. The solution to me is to leave stamp enthusiasts out in the rain. If your SaaS product can provide a lot of value in enterprise companies, $500 a month is not a lot to ask. And many people just want to see the price of a line item if it's a productized service so they can go back to someone higher with a purchase request.
When I was last working inside an organization and reviewing vendors for a product, it really left a bad taste in my mouth when they had "Ask for Pricing." I get it, my consulting work is basically Ask for Pricing, I understand the business strategy. But it's such a headache to sit through bullshit product demos for multiple vendors over a few weeks just to hear that their pricing structure is way out of line.
There is this idea that a lot of companies have, where they're more "professional" or conversion-optimized by removing public pricing and putting everyone through a sales funnel. But that concept only works if 1) you have a great product and 2) you have a great sales team, capable of making my time to failure in the conversion process fast and painless. Every company thinks they have this, but they almost never do. I really don't think you want to optimize your business for keeping stamp enthusiasts happy.
The problem with opaque pricing is this: people don't want to start experimenting with something if it could be infinitely bad, i.e. if they can imagine the worst.
In the back of their heads, some people imagine the service is going to be huge, and then they worry that all the profits will be paid out to wrapapi.
Better to have a high headline number and then offer discounts for certain uses (non-profit, open source, students, etc). People are optimistic about how much money they might make so a high headline future price for when you graduate from the free tier is not necessarily bad.
WrapAPI is meant to not only do scraping (reading information), but also to (1) perform actions with side effects and (2) allow for complex chaining.
Let's say you have a web-based inventory management system or CRM that requires a login, but you want to take data a customer has sent you in a spreadsheet and automatically batch enter it into the CRM, which doesn't have that functionality. You could then:
1. Create an API endpoint that allows you to log into that system and return a state token
2. Create a second API endpoint that's parametrized on the inputs of the form used to create a new inventory entry
3. Chain those 2 API endpoints together so that the 2 actions are actually combined into one API call
Our focus is not only on getting data, but on automating the many things that you or your company do with websites, to save time.
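As a rough illustration of that CRM example, here's a hedged sketch of driving such a chained endpoint from a spreadsheet export. Everything here, including the endpoint URL, parameter names, and CSV layout, is a hypothetical placeholder:

    // Hypothetical batch run: one call per spreadsheet row to a chained endpoint
    // that logs in and creates an inventory entry in a single request.
    const fs = require('fs');

    (async () => {
      // Parse a simple CSV export with the layout: sku,name,quantity
      const rows = fs.readFileSync('inventory.csv', 'utf8')
        .trim()
        .split('\n')
        .slice(1) // skip the header row
        .map((line) => {
          const [sku, name, quantity] = line.split(',');
          return { sku, name, quantity: Number(quantity) };
        });

      for (const row of rows) {
        const res = await fetch('https://wrapapi.com/use/you/crm/addEntry/latest', {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({ wrapAPIKey: 'YOUR_KEY', user: 'me', password: 'secret', ...row }),
        });
        console.log(row.sku, res.status);
      }
    })();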
I'm asking since the term API is mentioned. Is this designed for technical or non-technical people? I'm a non-coder but could really do with the scraper, so would this work for me?
This is designed for at least semi-technical people, but it's really not that hard to give it a try for simple sites. Try watching the video, and shoot me an email at peter@wrapapi.com if you run into any issues!
Kimono shut down on February 29th, 2016 and the cloud service has been discontinued. It only exists as a desktop app now.
Bought by Palantir, they wound down in a good way, keeping people's data available for a while and communicating well.
It was a great product, but it was still complicated to find a practical business model.
This WrapAPI v2 is an alternative, I think, but I would use it with care since the economic model is uncertain and it seems to be really new. Still promising! :)
The desktop app + browser plugin still seem to work fine. I've run into a few things that don't quite work well, like pages that have a combination of click-to-paginate and auto-scroll-paginate, but in general, it's good.
We're quite inspired by Kimono, and aim to be just as easy to use while handling use cases beyond scraping (e.g., fully automated form-filling, POST requests, etc.). One of the big feature requests we've been getting has been RSS feeds though, so we're definitely trying to get to full feature parity!
The software itself probably wouldn't, but the use of it for anything anyone cares about probably would. The CFAA, etc., make unwanted scraping illegal and this has been tested repeatedly in court.
The company that runs this software as a service needs to be very careful. 3Taps was similar and got destroyed for relaying data scraped from Craigslist.
Contacting the server after its operator has expressed its wish for you to stop is a violation of the CFAA (in that you are "exceeding authorized access" and/or gaining "unauthorized access" to a protected computer system). If it's found that the site's ToS is binding upon you, which it typically would be, you don't really even need separate notice to be held liable.
Storing a copy of a web page in RAM creates a copy that is eligible for copyright protection, and it is likely that any implied license to read that page will be invalidated by the access revocation.
Another court held that copying data into a RAM buffer for under 1.2 seconds was permissible. Depending on how they structure this, it might be legal.
Thanks for that! Like I said, I'm not a lawyer and I'm sure there are other gaps in my case knowledge. It's certainly positive to see the Second Circuit recognizing that there is some need to consider the transient nature of RAM copies before ruling them infringing.
The ruling suggests that MAI v. Peak did not address the transitory argument merely because it was not raised by the litigants, and that the precedent set there (which wouldn't have necessarily been binding anyway) is therefore not abrogated by ruling that some RAM copies are transient enough to fail to qualify.
Importantly, the durations listed here describe the runtime of the content, not the amount of time the data is held in the RAM. It is said that the system would buffer 0.1 seconds (100ms) of content at one point and 1.2 seconds of content at another point.
The Court does not seem to establish "1.2 seconds" as a general benchmark for RAM transience, but rather it suggests that transience should be considered on a case-by-case basis, per the language of the statute.
However, the general rule of thumb is that if a copy exists long enough to derive any value from it, it is non-transient. Guidance from the Copyright Office [0] reads:
>[...] we believe that Congress intended the copyright owner’s exclusive right to extend to all reproductions from which economic value can be derived. The economic value derived from a reproduction lies in the ability to copy, perceive or communicate it. Unless a reproduction manifests itself so fleetingly that it cannot be copied, perceived or communicated, the making of that copy should fall within the scope of the copyright owner’s exclusive rights. The dividing line, then, can be drawn between reproductions that exist for a sufficient period of time to be capable of being "perceived, reproduced, or otherwise communicated" and those that do not. As a practical matter, as discussed above, this would cover the temporary copies that are made in RAM in the course of using works on computers and computer networks.
and scrapers have been held liable for copyright infringement via RAM copies on multiple occasions. Ticketmaster v. RMG states:
>[...] copies of ticketmaster.com webpages automatically stored on a viewer's computer are “copies” within the meaning of the Copyright Act.
despite the fact that they likely would've been held for a much shorter time than either 100ms or 1.2 seconds.
Notably, this was before the case referenced above, but it's typical of later cases, and it succinctly demonstrates that courts are more likely to find RAM copies of an entire work (the web page) non-transitory than snippets of ~1/1500th of a work, regardless of how long they're stored in RAM.