
One of the founders here. Our goal with ParseHub is to enable data extraction not only from static websites, but from highly dynamic or complicated ones as well.

We've done this by separating the concepts of selecting and doing something to a selection. Specifically, we've created tools like click, input, hover, etc. that you can combine with any selection, and with each other. This keeps a lot of the power that you have with programming.

This also applies to the data structures that are created. So it's easy to express nested lists or even recursive lists, because of the ability to combine tools easily.

If you have any questions I'd love to answer them.

The demo is magical. Excellent work!

One question about this FAQ:

   > Does ParseHub respect robots.txt?
   > We're working on an admin panel to give webmasters full transparency and control. We'll have more info soon.
What does that mean? I read it as ParseHub does not respect robots.txt, which as a content owner is a bit disappointing. Would you elaborate on your thinking?

Sorry, we'll make the wording more clear.

At the moment, ParseHub does not respect robots.txt. We do expect to add this + features for webmasters in the future, but have not had the developer cycles to do this yet.

Please add this sooner than "in the future". Obeying robots.txt is a _MUST_ for any robot out there. This is also coming from a content owner. Thank you.
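For what it's worth, honoring robots.txt doesn't take many developer cycles. Here's a minimal sketch using only Python's standard library (the rules are inlined so the example is self-contained; a real crawler would fetch the live file with set_url/read):

```python
# Sketch: check robots.txt rules before fetching a URL, using the
# Python stdlib. Inline rules stand in for a fetched robots.txt.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Real crawler:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

allowed = rp.can_fetch("MyBot", "https://example.com/public/page")    # True
blocked = rp.can_fetch("MyBot", "https://example.com/private/page")   # False
```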

The tutorial was quite excellent. Unfortunately I get stuck at the "zooming" portion. Pressing ctrl+[ does nothing for me. (Firefox 32.0 on Ubuntu 12.04)

Yeah, I'm having the same problem. It might be related to the fact that Ctrl + ] is also the keyboard shortcut to navigate back in Firefox [1], so when I try it I end up going back to the previous page. Ctrl + [ does nothing.

Or it might be that in my keyboard layout, [ and ] require pressing RIGHT ALT + 8 and RIGHT ALT + 9, respectively.

[1] https://support.mozilla.org/en-US/kb/keyboard-shortcuts-perf...

PS: Indeed, changing the layout to en-US fixes the problem, but that's not a real solution.

That's odd. I wasn't aware of that shortcut. It also doesn't work for me (and my layout is already en-US AFAIK)

Do you have some time to show me over Skype? We've done extensive user testing, but may have missed a few bugs. Email me at serge@parsehub.com; my Skype ID is t.sergiu

More than a bug, maybe you should look for a more universal key binding. On my European keyboard I have to press AltGr-5 to type [, and Ctrl-AltGr-5 does not do what it's supposed to.

Did you choose Airbnb for your demo because they are well known for scraping Craigslist in their origins?

We chose Airbnb because the complexity of their site highlights the flexibility of our technology. That being said, we also assumed they wouldn't mind because of their origins :)

That's a good answer :)

Does this work on a website with a password? e.g. can I fill a form and give user/pwd as the first step? (I did the tutorials but couldn't figure it out)

Also, are you thinking of allowing it to run locally? (i.e. I have some websites that only work from my IPs)

Yes, ParseHub works with login forms (they are no different from regular forms). Check out the interactivity tutorial. If you still don't get it after that, I'd be happy to show you 1-on-1 over Skype.

Please note that the password will be accessible by ParseHub, since it needs to enter it on the web page.

Currently, we support local deployments only in our custom enterprise plan. That may change in the future.


Sorely needed. I've been trying to find a tool that'll grab data from behind a POST form, and none of the other scraper-as-a-service offerings do it. It's so simple! Any plans for adding crawling to the service?

Crawling already works :)

One of the things that has been heavily marketed by other web scrapers is "crawling" as a separate feature.

With ParseHub, all the tools easily combine, so you don't need that distinction. You can use the navigate tool to jump to another page (see our interactive navigation tutorial in the extension for the details).

And you can combine multiple navigations to go as deep in the website structure as you like. For example, say you have a forum that links to subforums that link to posts that link to users. You can easily model the structure of such a site by using a few navigation nodes (one from forum to its subforums, another from subforum to posts, etc.). The result would be a big json (or csv) dump of all the data on the forum, in the proper hierarchy.

We've really tried to make our tools as general as possible. A side effect of the navigate tool is that you can use it to get "pagination" for free as well (another feature that's been heavily marketed).
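To make the forum example concrete, here is a toy sketch (not ParseHub internals) of how chained navigation steps produce a nested hierarchy; the in-memory SITE dict is a hypothetical stand-in for fetched pages and links:

```python
# Toy sketch: chaining "navigate" steps to build a nested
# forum -> subforum -> post structure, as one big dict (JSON-like).
SITE = {
    "/forum": ["/sub/a", "/sub/b"],   # forum links to subforums
    "/sub/a": ["/post/1"],            # subforums link to posts
    "/sub/b": ["/post/2", "/post/3"],
}

def crawl(url, depth):
    """Follow links recursively, nesting children under each page."""
    node = {"url": url}
    if depth > 0:
        node["children"] = [crawl(link, depth - 1)
                            for link in SITE.get(url, [])]
    return node

result = crawl("/forum", depth=2)
# result nests every subforum under the forum, and every post
# under its subforum, mirroring the site's structure.
```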

Hi, kudos on the tool. Does it work with sites where some content is revealed only if the user scrolls down and/or has to click a "load more" button at the bottom? That's a major pain point.

Yes, it does! Check out the interactivity tutorial in the extension.

Cool stuff!

> Easily turn websites into APIs or tables of data you can download in JSON or CSV

Do you need to download, or can you call these APIs from an application?

We have a fully-documented API that you can use to integrate directly with your app. https://www.parsehub.com/docs/ref/api
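A hedged sketch of calling a run-data endpoint over HTTP with the Python stdlib; the exact path and parameter names below are assumptions from memory, so confirm them against the linked docs before use:

```python
# Sketch: fetch extracted data from a hosted scraping API.
# PROJECT_TOKEN, API_KEY, and the URL path are placeholders /
# assumptions -- check the official API reference.
import json
import urllib.parse
import urllib.request

API_KEY = "your_api_key"          # placeholder
PROJECT_TOKEN = "your_project"    # placeholder

url = (
    "https://www.parsehub.com/api/v2/projects/"
    + PROJECT_TOKEN
    + "/last_ready_run/data?"
    + urllib.parse.urlencode({"api_key": API_KEY, "format": "json"})
)

def fetch_data(url):
    """Download and decode the JSON results of the latest run."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)
```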

Love the tool, wish I had this over a lot of other scrapers on multiple projects.

Is a Chrome extension in the works at all?

Chrome extensions run in a severely restricted environment. While this is arguably good for security, it prevents us from building some of the powerful tools we can build in Firefox. We do plan to eventually release as a standalone app with no browser dependency.

Can you be more specific, please? As far as I know, you can create a Chrome extension that executes custom JavaScript code on the current page and passes the results to the extension. So what exactly is the problem?

Sure, I'll give you one example.

We want to show a sample immediately as a user changes what they extract. On a static website, this is fairly easy. You simply run what the user created on the currently visible page.

However, when you involve interactivity, you can no longer do that. The major problem is non-idempotent operations. Imagine a click that changes the DOM of a page, and now imagine re-running the sample on that same page. The re-run may no longer work, because the click could have changed the page in such a way that the extraction breaks (e.g. it deletes an element from the page).

To solve this issue, we actually reset a "hidden tab" to the starting state of the page you're on. This happens every time you re-run a sample. Unfortunately, it's not possible to create such hidden tabs in Chrome. We also mess with the cache to make sure that this tab can be reset really quickly, something we couldn't find an API for in Chrome.
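The non-idempotence problem can be shown with a toy illustration (not ParseHub code): re-running a sample mutates the page, so a second run gives the wrong answer unless you reset from a pristine snapshot first, which is what the hidden tab provides.

```python
# Toy illustration: a "click" that deletes an element makes the
# sample non-idempotent; resetting from a snapshot fixes the re-run.
import copy

pristine = {"items": ["a", "b", "c"]}   # starting "DOM" state

def run_sample(page):
    # The click removes the first item, then the extraction reads it.
    return page["items"].pop(0)

live = copy.deepcopy(pristine)
first = run_sample(live)             # "a" -- correct on first run
rerun_on_live = run_sample(live)     # "b" -- wrong, page was mutated

reset = copy.deepcopy(pristine)      # reset the hidden tab
rerun_on_reset = run_sample(reset)   # "a" -- correct again
```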

Hope that answers your question.

Did you have a look at https://developer.chrome.com/extensions/background_pages ?

Not sure if it fits all your needs.

Could you clone the DOM into a virtual HTML element and save/serialize that?

That doesn't work because you need the JavaScript as well.

I see. Thanks for taking the time to explain.

https://scrape.it requires a Chrome extension. We have had no problems with it, nor found it "restrictive" in any way.

