Show HN: ParseHub – Extract data from dynamic websites (parsehub.com)
143 points by tsergiu on Sept 23, 2014 | hide | past | favorite | 69 comments

One of the founders here. Our goal with ParseHub is to enable data extraction not only from static websites, but from highly dynamic or complicated ones as well.

We've done this by separating the concepts of selecting and doing something to a selection. Specifically, we've created tools like click, input, hover, etc. that you can combine with any selection, and with each other. This keeps a lot of the power that you have with programming.

This also applies to the data structures that are created: it's easy to express nested lists, or even recursive lists, because tools can be combined freely.

If you have any questions I'd love to answer them.

The demo is magical. Excellent work!

One question about this FAQ:

   > Does ParseHub respect robots.txt?
   > We're working on an admin panel to give webmasters full transparency and control. We'll have more info soon.
What does that mean? I read it as ParseHub does not respect robots.txt, which as a content owner is a bit disappointing. Would you elaborate on your thinking?

Sorry, we'll make the wording more clear.

At the moment, ParseHub does not respect robots.txt. We do expect to add this + features for webmasters in the future, but have not had the developer cycles to do this yet.

Please add this sooner than "in the future". Obeying the robots.txt is a _MUST_ for any robots out there. This also from a content owner. Thank you.

The tutorial was quite excellent. Unfortunately I get stuck at the "zooming" portion. Pressing ctrl+[ does nothing for me. (Firefox 32.0 on Ubuntu 12.04)

Yeah, I'm having the same problem. It might be related to the fact that Ctrl + ] is also the keyboard shortcut to navigate back in Firefox [1], so when I try it I end up going back to the previous page. Ctrl + [ does nothing.

Or it might be that in my keyboard layout, [ and ] require pressing RIGHT ALT + 8 and RIGHT ALT + 9, respectively.

[1] https://support.mozilla.org/en-US/kb/keyboard-shortcuts-perf...

PS: Indeed, changing the layout to en-US fixes the problem, but that's not a real solution.

That's odd. I wasn't aware of that shortcut. It also doesn't work for me (and my layout is already en-US AFAIK)

Do you have some time to show me over Skype? We've done extensive user testing, but may have missed a few bugs. Email me at serge@parsehub.com, Skype id is t.sergiu

More than a bug, maybe you should look for a more universal key binding. On my European keyboard I have to press AltGr-5 to type a [, and Ctrl-AltGr-5 doesn't do what it's supposed to.

Did you choose Airbnb for your demo because they are well known for scraping Craigslist in their origins?

We chose Airbnb because the complexity of their site highlights the flexibility of our technology. That being said, we also assumed they wouldn't mind because of their origins :)

That's a good answer :)

Does this work on a website with a password? e.g. can I fill a form and give user/pwd as the first step? (I did the tutorials but couldn't figure it out)

Also, are you thinking of allowing it to run locally? (i.e. I have some websites that only work for my IPs)

Yes, ParseHub works with login forms (they are no different from regular forms). Check out the interactivity tutorial. If you still don't get it after that, I'd be happy to show you 1-on-1 over Skype.

Please note that the password will be accessible by ParseHub, since it needs to enter it on the web page.

Currently, we support local deployments only in our custom enterprise plan. That may change in the future.


Sorely needed. I've been trying to find a tool that'll grab data from behind a POST form and none of the other scrapers as a service do it. It's so simple! Any plans for adding crawling to the service?

Crawling already works :)

One of the things that has been heavily marketed by other web scrapers is "crawling" as a separate feature.

With ParseHub, all the tools easily combine, so you don't need that distinction. You can use the navigate tool to jump to another page (see our interactive navigation tutorial in the extension for the details).

And you can combine multiple navigations to go as deep in the website structure as you like. For example, say you have a forum that links to subforums that link to posts that link to users. You can easily model the structure of such a site by using a few navigation nodes (one from forum to its subforums, another from subforum to posts, etc.). The result would be a big json (or csv) dump of all the data on the forum, in the proper hierarchy.
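As a rough illustration of the nesting (a toy sketch with a made-up in-memory site, not ParseHub internals), chaining navigations is essentially a recursion over links:

```python
# Toy sketch of chained navigation producing nested output. SITE stands
# in for real web pages; an actual scraper would fetch and render pages.
SITE = {
    "/forum":  {"title": "Forum", "links": ["/sub/a", "/sub/b"]},
    "/sub/a":  {"title": "Sub A", "links": ["/post/1"]},
    "/sub/b":  {"title": "Sub B", "links": []},
    "/post/1": {"title": "Post 1", "links": []},
}

def navigate(url):
    """One navigation node: extract this page, then recurse into its links."""
    page = SITE[url]
    return {
        "title": page["title"],
        "children": [navigate(link) for link in page["links"]],
    }

result = navigate("/forum")
# result is a nested dict mirroring the site hierarchy:
# forum -> subforums -> posts, ready to dump as JSON.
```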

We've really tried to make our tools as general as possible. A side effect of the navigate tool is that you can use it to get "pagination" for free as well (another feature that's been heavily marketed).

Hi, kudos on the tool. Does it work with sites where some content is revealed only if the user scrolls down and/or has to click a "load more" button at the bottom? That's a major pain point.

Yes, it does! Check out the interactivity tutorial in the extension.

Cool stuff!

> Easily turn websites into APIs or tables of data you can download in JSON or CSV

Do you need to download, or can you call these APIs from an application?

We have a fully-documented API that you can use to integrate directly with your app. https://www.parsehub.com/docs/ref/api
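Roughly, integration means fetching run results over HTTP. The endpoint path and parameter names in this sketch are illustrative guesses, not the documented interface; see the docs link above for the real one.

```python
from urllib.parse import urlencode

# Illustrative only: the endpoint path and parameter names below are
# assumptions, not the documented API. Consult the official docs.
BASE = "https://www.parsehub.com/api/v2"

def run_data_url(run_token, api_key, fmt="json"):
    """Build a URL for fetching a run's extracted data."""
    query = urlencode({"api_key": api_key, "format": fmt})
    return "{}/runs/{}/data?{}".format(BASE, run_token, query)

url = run_data_url("trun_example", "tkey_example")
# An HTTP GET on a URL like this would return the run's data as JSON.
```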

Love the tool, wish I had this over a lot of other scrapers on multiple projects.

Is a chrome extensions in the works at all?

Chrome extensions run in a severely restricted environment. While this is arguably good for security, it prevents us from building some of the powerful tools we can build in Firefox. We do plan to eventually release as a standalone app with no browser dependency.

can you be more specific please? as far as i know, you can create a chrome extension that executes custom javascript code on the current page and passes the results to the extension. so what exactly is the problem?

Sure, I'll give you one example.

We want to show a sample immediately as a user changes what they extract. On a static website, this is fairly easy. You simply run what the user created on the currently visible page.

However, when you involve interactivity, you can no longer do that. The major problem is non-idempotent operations. Imagine a click that changes the DOM of a page, and now imagine running the sample on that same page. Re-running the sample may no longer work, because the click could have changed the page in such a way that the extraction fails (e.g. it deletes an element from the page).

To solve this issue, we actually reset a "hidden tab" to the starting state of the page you're on. This happens every time you re-run a sample. Unfortunately, it's not possible in Chrome to create such hidden tabs. We also mess with the cache to make sure that this tab can be reset really quickly, something we couldn't find an API for in Chrome.
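The failure mode can be shown abstractly (a toy sketch, not actual ParseHub code; the page is just a dict here):

```python
import copy

page = {"items": ["buy-button", "price"]}
snapshot = copy.deepcopy(page)  # the "hidden tab": a pristine copy of the start state

def click_buy(dom):
    # A non-idempotent interaction: the click removes the button.
    dom["items"].remove("buy-button")

def run_sample(dom):
    click_buy(dom)          # the sample includes the click...
    return dom["items"][0]  # ...then extracts what's left

first = run_sample(page)    # fine on a fresh page
# Calling run_sample(page) again now would raise ValueError, because
# "buy-button" is already gone. Resetting from the snapshot fixes it:
page = copy.deepcopy(snapshot)
second = run_sample(page)
```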

Hope that answers your question.

Did you have a look at https://developer.chrome.com/extensions/background_pages ?

Not sure if it does fit all your needs.

Could you clone the DOM into a virtual HTML element and save/serialize that?

That doesn't work because you need the javascript as well.

i see. thanks for taking time to explain.

https://scrape.it requires a chrome extension. We have had no problems with it, nor found it "restrictive" in any way.

I think many websites people are going to want to extract from will have anti-scraping/anti-robot traffic controls that will try to keep out a scraper like this. Amazon.com for instance. Probably Google properties.

That they plan to respect robots.txt in the future suggests they don't mean to go places content owners don't want them. On the other hand, automatic IP rotation kind of suggests they do mean to (what other purpose is there for that?).

Either way, it might be a limitation on what you might dream of using it for.

My own experiments with scraping Amazon and Google have been stopped dead in the water by their anti-bot traffic controls. (Amazon recently improved theirs.)

Having built scrapers that work against some of these measures, I can tell you that you'd be really surprised at how often they are accidental. Shared hosting providers often set them as defaults, at least as far as I can see from the work I've done.

The real barrier I find is the current case law in the US, which seems to be the jurisdiction of choice for many web companies. It's currently a real possibility that you will be criminally in breach of the law and suffer the cost if you blatantly and knowingly continue after being notified of their ToS. Yes, Google and other big companies have nothing to fear, but it's pretty much a case of "how many people are dumb enough to pick a fight with Mike Tyson?"

If you target your scraping to further your own business, and impinge on someone else's business model, you're in water that is currently murky. It really needs to be settled, but until another lawsuit rises to the Supreme Court in the US, we won't have that. So it's just a matter of being aware that while you're not trying to be an evil criminal, you may still be viewed as such by someone you scrape.

Is being blocked the worst thing that can happen? Can't they sue you for scraping and using their stuff?

Scraping and selling their content without consent. I would assume that you can definitely sue for that.

I see a lot of sites selling Google Search results (like services tracking your SERP positions). Could Google sue them?

We've invested very heavily in building out a solid infrastructure for extracting data. We want to make sure that the product Just Works for our users, and that includes rotating IP addresses (you don't have to fiddle with your own, we have access to a pool of thousands).

Robots.txt is a tricky balancing act. It was first conceived in 1994, and was designed for crawlers that tried to suck up all the pages on the web. ParseHub, on the other hand, is very specifically targeted by a human. A human tells ParseHub exactly which pages and which pieces of data to extract. From that point of view, ParseHub is more like a "bulk web browser" than a robot.

Here are some examples that make this line blurry. If I tell ParseHub to log into a site, visit a single page, and extract one piece of information on it, does that violate robots.txt? If yes, then your browser has been violating robots.txt for years. The screenshots of your most visited websites are updated by periodically polling those sites (and ignoring robots.txt). My browser is currently showing a picture of my gmail inbox, which is blocked by robots.txt https://mail.google.com/robots.txt
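For what it's worth, mechanically honoring robots.txt is simple. Here is a generic sketch using Python's standard library, with a made-up robots.txt rather than any real site's file:

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration (not any real site's policy).
robots_txt = """\
User-agent: *
Disallow: /mail/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Rules are checked in order: /mail/ is disallowed, everything else allowed.
blocked = parser.can_fetch("AnyBot", "https://example.com/mail/u/0/")  # False
allowed = parser.can_fetch("AnyBot", "https://example.com/about")      # True
```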

More importantly, your computer and browser already do a lot of robot-like stuff to turn your mouse click into a request that's sent to the server. You don't have to write out the full request yourself. Is that then considered a robot? If not, then why is it considered a robot when ParseHub does the same (again, assuming a single request) thing?

Furthermore, some sites don't specify rate limits in robots.txt, but still actively block IP addresses when they cross some threshold.

It is far from a perfect standard, so it makes a lot of practical sense to have the ability to rotate IPs, even if it's not appropriate to use that ability all the time.

Our goal here is to be able to distinguish between the good type and bad type of scraping and give webmasters full transparency. Obviously this is a hard problem. If you have any feedback on any of this we'd love to hear it.

ps. we've tested our infrastructure on many Alexa top 100 sites and can say with moderate confidence that it will Just Work.

pps. if you're a webmaster, having ParseHub extract data from your site is probably far preferable to the alternative. People usually hack together their own scripts if their tools can't do the job. ParseHub does very aggressive caching of content and tries to figure out the traffic patterns of the host so that we can throttle based on the traffic the host is receiving. Hacked together scripts rarely go through the trouble of doing that.
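The per-host throttling idea reduces to tracking when each host was last hit (a simplified sketch of the general approach, not the actual implementation, which would also adapt to observed traffic):

```python
# Simplified sketch of per-host throttling: given when a host was last
# requested, compute how long to wait before the next request.
def delay_needed(last_request_at, now, min_interval):
    """Seconds to wait before hitting the same host again (0.0 if none due)."""
    return max(0.0, min_interval - (now - last_request_at))

# With one request allowed every 2.0s to a host last hit at t=10.0s:
# a request at t=10.5s must wait 1.5s; by t=13.0s it can go immediately.
```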

"Our goal here is to be able to distinguish between the good type and bad type of scraping and give webmasters full transparency. Obviously this is a hard problem. If you have any feedback on any of this we'd love to hear it."

Yes as said before plus:

- Obey robots.txt to the full extent

- Name your access, i.e. label your bot

- Don't use shady tactics such as IP rotation

- Provide web site owners the option to fully block access of your bots (yes, communicate your full IP ranges)

Again - this is from a content owner who paid for his content.

Neither of us is a lawyer (as far as I know), and I assume you have legal counsel for a business like this. I wish you luck in your business and hope it doesn't come to anything legal.

Actually I hope even more it does come to something legal and you win, because I'd love to expand and make concrete fair use rights for scraping. I like scraping, scraping is both fun and very useful for the business domain I work in, and very frustrating when content providers don't allow it by either terms of service (which may or may not be legally enforceable if you haven't agreed to them, it's unclear, but scary enough with all the CFAA over-enforcement) or technological protections.

But I think you're being disingenuous about the difference between a bot and an interactive web browser. I think it's pretty straightforward to most people, and will be to the courts if it comes to that.

Interestingly, the latest enhanced Amazon anti-bot protections I ran into say "To discuss automated access to Amazon data please contact...", but don't explicitly try to say "you are forbidden from automated access."

It's fair to say that robots.txt is a balancing act in this case, given its intended use. However, a website's terms of use are non-negotiable. Clauses banning any form of automated access or data gathering (especially for non-personal use) are fairly popular amongst sites with "deny everything" robots.txt files. There's a very real risk here for both you and your customers.

In the long run it'd be nice to see some sort of "fair access" to websites introduced into law; unfortunately, we don't live in that world.

Very cool!

Another useful tool is http://selectorgadget.com

Excellent! How about creating a Google Now replica, but for offline use, plus a more private data dashboard?

- Use your platform to parse dynamic websites that are completely under the user's control, so no privacy issue

- Don't store data in the cloud, so no security issue

- Create common parsing jobs and distribute them through a central server / store

- For example, parsing your bank account from Bank of America / Chase / Wells Fargo etc., or parsing your stock portfolio

Your platform can act as the job creator and people can crowdsource the job scripts! You could create an amazing private dashboard where users can see all their private data at once, with no cloud interference, so no worries about security and privacy.

Great ideas to consider for the future! We may decide to release an offline version of ParseHub eventually.

For these kinds of tasks I've always used an app I wrote myself with an embedded browser control (the WebKit engine, or the IE ActiveX control on Windows). So I load the page and just call the control's methods (usually just to convert the output to plain text).

How is this tool (and similar tools) more efficient than the highly optimized browser engines? Am I missing something here?

ParseHub itself runs on a highly optimized browser engine. The idea is to give a visual interface so that you don't have to worry about low-level details of how to control the browser. This also makes it easier to reason about the logic of what's happening since humans have stronger spatial reasoning than symbolic reasoning.

Are you using phantomjs/selenium and friends or did you write your own?

How does ParseHub compare to Kimono [1]?

I use Kimono daily (scraping government expense docs), and I love how it works on static sites. If you've managed to replicate that on dynamic ones, I'll be a very happy customer.

[1]: https://www.kimonolabs.com/

Sorry for the delayed response. See my answer to a similar question in this thread: https://news.ycombinator.com/item?id=8356999

If you have any more questions or just want to chat, feel free to shoot me an email directly at serge@parsehub.com

This looks like a great utility for data analysis. Thanks.

Doesn't seem to be functioning on the Tor browser bundle by default, but maybe there are instructions somewhere on how to do that.

Will continue to test it out, and see how everything works.

Sorry about that. We haven't tested with Tor.

I hope disabling Tor is suitable for your use case.

Scraping is always such a pain; this looks incredibly well done. I've personally had a really good experience with PhantomJS. What does your backend look like? (What happens when an API request is made?)

We run a headless browser to execute your extraction.

Is this legal? Can I use this to scrape and store Google Search results?

According to Google's TOS, you cannot, and Google will ban you as soon as the traffic from your host(s) looks more like monkeys on typewriters than actual people looking for search results.

Will it have legal consequences? Most likely no. Will your coworkers or employees stab you with a rusty fork for getting their (most likely) favorite search engine to block them? Absolutely.

Looks powerful. Just curious, how does it compare to Kimono Labs?

We think Kimono is a great tool, but it is very limited in capability.

We specifically focus on handling highly dynamic or interactive websites. Our toolset gives you more flexibility over how you can extract data. For example, you can extract all the nested comment data from a reddit post, or you can extract data from maps without having to fumble around in the web inspector.

After using both, there is a very large brick wall that prevents usability of Kimono Labs on most well-known web applications (airbnb, craigslist, etc). An example is data extraction from content visible via hover (airbnb calendar pricing), which ParseHub is able to handle.

Thanks a lot for building this, I am excited to save server costs/time from scraping data for projects.

Pricing page not found:

    404 Not Found
    The resource could not be found.

Thanks! Pushing a fix now.

Edit: Fixed

The website looks really interesting and very useful if it works like in the demo.

My only issue is that the $79 / mo starting plan looks a bit steep for casual development (I know there is a free option), but if you want to attract more buyers I think something like GitHub's pricing would be really attractive for prospective buyers like me (something like $22 / month for 10 private projects with 10 pages / min). Just a suggestion in case you're looking for feedback.

Didn't find an email in your profile. I'd love to discuss with you further. Can you email me at serge@parsehub.com?

Thanks! I've sent you an email. Also, I couldn't figure out how to use select boxes, radio buttons, or fill a specific text field on the page. The "expression" page in the docs is also a 404. I guess you must have too much on your plate right now, so all this is just FYI.

Agreed on this point. The automatic IP rotation is one of the bigger features I think.

Unlimited scraping at https://scrape.it is $20/month, with no restrictions on how many projects you can create or how many pages you can scrape.

A fix is on the way!

FWIW, the Facebook SVG in the home page doesn't seem to be loading for me. Rest looks great. Congrats!

Thanks, we'll look into it.

This reminds me of the wonderfully cool Web 2.0-era startup Dapper.

really nice. unfortunately it gets a bit slow on long pages, but it does work well and I like the fact that you offer an unlimited free plan.
