
Show HN: ParseHub – Extract data from dynamic websites - tsergiu
https://www.parsehub.com/
======
tsergiu
One of the founders here. Our goal with ParseHub is to enable data extraction
not only from static websites, but from highly dynamic or complicated ones as
well.

We've done this by separating the concepts of selecting and doing something to
a selection. Specifically, we've created tools like click, input, hover, etc.
that you can _combine_ with any selection, and with each other. This keeps a
lot of the power that you have with programming.

This also applies to the data structures that are created. So it's easy to
express nested lists or even recursive lists, because of the ability to
combine tools easily.
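
To make this concrete, here's a rough sketch in Python of the hand-written recursion that "combine a selection with the same tools" replaces, for a nested comment thread. The selectors, file name, and page structure are all made up, and this is not our actual project format:

    # Hand-rolled equivalent of "apply the same selection + tools
    # recursively". All selectors and the file name are hypothetical.
    from bs4 import BeautifulSoup

    def extract_comment(node):
        replies = node.find("div", class_="children", recursive=False)
        return {
            # The first .body under this node is the comment's own text.
            "text": node.find(class_="body").get_text(strip=True),
            "replies": [
                extract_comment(child)
                for child in (replies.find_all("div", class_="comment",
                                               recursive=False)
                              if replies else [])
            ],
        }

    soup = BeautifulSoup(open("thread.html").read(), "html.parser")
    # Top-level comments only; replies are picked up by the recursion.
    data = [extract_comment(c) for c in soup.select("div.thread > div.comment")]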

If you have any questions I'd love to answer them.

~~~
michaelmior
The tutorial was excellent. Unfortunately I get stuck at the "zooming"
portion: pressing Ctrl+[ does nothing for me. (Firefox 32.0 on Ubuntu 12.04)

~~~
jotaass
Yeah, I'm having the same problem. It might be related to the fact that Ctrl +
] is also the keyboard shortcut to navigate back in Firefox [1], so when I try
it I end up going back to the previous page. Ctrl + [ does nothing.

Or it might be that in my keyboard layout, [ and ] require pressing RIGHT ALT
+ 8 and RIGHT ALT + 9, respectively.

[1] [https://support.mozilla.org/en-US/kb/keyboard-shortcuts-perf...](https://support.mozilla.org/en-US/kb/keyboard-shortcuts-perform-firefox-tasks-quickly)

PS: Indeed, changing the layout to en-US fixes the problem, but that's not a
real solution.

~~~
michaelmior
That's odd. I wasn't aware of that shortcut. It also doesn't work for me (and
my layout is already en-US, AFAIK).

------
jrochkind1
I think many of the websites people will want to extract from have
anti-scraping/anti-robot traffic controls that will try to keep out a scraper
like this. Amazon.com, for instance. Probably Google properties.

That they plan to respect robots.txt in the future suggests they don't mean to
go places content owners don't want them. On the other hand, automatic IP
rotation rather suggests they do (what other purpose is there for it?).

Either way, it might be a limitation on what you might dream of using it for.

My own experiments with scraping Amazon and Google have been stopped in their
tracks by their anti-bot traffic controls. (Amazon recently improved theirs.)

~~~
tsergiu
We've invested very heavily in building out a solid infrastructure for
extracting data. We want to make sure that the product Just Works for our
users, and that includes rotating IP addresses (you don't have to fiddle with
your own, we have access to a pool of thousands).

Robots.txt is a tricky balancing act. It was first conceived in 1994 and was
designed for crawlers that tried to suck up every page on the web. ParseHub,
on the other hand, is directed very specifically by a human: a human tells
ParseHub exactly which pages and which pieces of data to extract. From that
point of view, ParseHub is more like a "bulk web browser" than a robot.

Here are some examples that make this line blurry. If I tell ParseHub to log
into a site, visit a single page, and extract one piece of information on it,
does that violate robots.txt? If yes, then your browser has been violating
robots.txt for years: the new-tab screenshots of your most visited websites
are updated by periodically polling those sites (and ignoring robots.txt). My
browser is currently showing a picture of my Gmail inbox, which is blocked by
robots.txt:
[https://mail.google.com/robots.txt](https://mail.google.com/robots.txt)

More importantly, your computer and browser already do a lot of robot-like
stuff to turn your mouse click into a request that's sent to the server. You
don't have to write out the full request yourself. Is that considered a
robot? If not, then why is it considered a robot when ParseHub does the same
thing (again, assuming a single request)?

Furthermore, some sites don't specify rate limits in robots.txt, but still
actively block IP addresses when they cross some threshold.

It is far from a perfect standard, so it makes a lot of practical sense to
have the ability to rotate IPs, even if it's not appropriate to use that
ability all the time.

Our goal here is to be able to distinguish between the good type and bad type
of scraping and give webmasters full transparency. Obviously this is a hard
problem. If you have any feedback on any of this we'd love to hear it.

ps. we've tested our infrastructure on many Alexa top 100 sites and can say
with moderate confidence that it will Just Work.

pps. if you're a webmaster, having ParseHub extract data from your site is
probably far preferable to the alternative. People usually hack together their
own scripts if their tools can't do the job. ParseHub caches content very
aggressively and tries to figure out the host's traffic patterns so that we
can throttle based on the traffic the host is receiving. Hacked-together
scripts rarely go to the trouble of doing that.
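
As a sketch of that throttling idea (ours is more involved, and the numbers here are made up): cache everything, and treat slow responses as a signal to back off:

    import time
    import requests

    class PoliteFetcher:
        """Illustrative only: cache responses, back off when the host is slow."""

        def __init__(self, min_delay=1.0):  # starting delay is an assumption
            self.cache = {}
            self.min_delay = min_delay

        def get(self, url):
            if url in self.cache:  # aggressive caching: never re-fetch a page
                return self.cache[url]
            start = time.time()
            resp = requests.get(url)
            elapsed = time.time() - start
            # A slow response is a crude proxy for host load: slow down too.
            self.min_delay = max(self.min_delay, 2 * elapsed)
            time.sleep(self.min_delay)
            self.cache[url] = resp.text
            return resp.text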

~~~
spacefight
"Our goal here is to be able to distinguish between the good type and bad type
of scraping and give webmasters full transparency. Obviously this is a hard
problem. If you have any feedback on any of this we'd love to hear it."

Yes, as said before, plus:

\- Obey robots.txt to the full extent

\- Name your access, i.e. label your bot

\- Don't use shady tactics such as IP rotation

\- Provide website owners the option to fully block access by your bots (yes,
communicate your full IP ranges)

Again - this is from a content owner who paid for his content.
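
On the first point, honoring robots.txt is cheap; Python's standard library even ships a parser. A minimal sketch (the bot name is a placeholder):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()
    # A labeled bot would use its real name here, so site owners can
    # identify it and block it if they choose.
    if rp.can_fetch("ExampleBot/1.0", "https://example.com/some/page"):
        pass  # fetch the page
    delay = rp.crawl_delay("ExampleBot/1.0")  # None if the site sets no limit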

------
tectonic
Very cool!

Another useful tool is [http://selectorgadget.com](http://selectorgadget.com)

------
nextbig
Excellent! How about creating a Google Now replica, but for offline use: a
more private data dashboard?

\- Use your platform to parse dynamic websites that are completely under the
user's control, so no privacy issue

\- Don't store data in the cloud, so no security issue

\- Create common parsing jobs and distribute them through a central server /
store

\- Such as parsing bank accounts from Bank of America / Chase / Wells Fargo,
etc., or parsing a stock portfolio

Your platform can act as the job creator and people can crowdsource the job
scripts! You could create an amazing private dashboard where users see all
their private data at once, with no cloud interference, so no worries about
security and privacy.

~~~
tsergiu
Great ideas to consider for the future! We may decide to release an offline
version of ParseHub eventually.

------
fenesiistvan
For these kinds of tasks I've always used an app I wrote myself, using an
embedded browser control (a WebKit engine, or the IE ActiveX control on
Windows). I load the page and just call the control's methods (usually just
to convert the output to plain text).

How is this tool (and similar tools) more efficient than the highly optimized
browser engines? Am I missing something here?

~~~
tsergiu
ParseHub itself runs on a highly optimized browser engine. The idea is to give
you a visual interface so that you don't have to worry about the low-level
details of how to control the browser. This also makes the logic of what's
happening easier to reason about, since humans have stronger spatial reasoning
than symbolic reasoning.

~~~
nhjk
Are you using phantomjs/selenium and friends or did you write your own?

------
fnbr
How does ParseHub compare to Kimono [1]?

I use Kimono daily (scraping government expense docs), and I love how it works
on static sites. If you've managed to replicate that on dynamic ones, I'll be
a very happy customer.

[1]: [https://www.kimonolabs.com/](https://www.kimonolabs.com/)

~~~
tsergiu
Sorry for the delayed response. See my answer to a similar question in this
thread:
[https://news.ycombinator.com/item?id=8356999](https://news.ycombinator.com/item?id=8356999)

If you have any more questions or just want to chat, feel free to shoot me an
email directly at serge@parsehub.com

------
HashBrowns
This looks like a great utility for data analysis. Thanks.

It doesn't seem to be functioning with the Tor Browser Bundle by default, but
maybe there are instructions for how to get that working somewhere.

Will continue to test it out, and see how everything works.

~~~
tsergiu
Sorry about that. We haven't tested with Tor.

I hope disabling Tor is suitable for your use case.

------
plingamp
Scraping is always such a pain; this looks incredibly well done. I've
personally had a really good experience with PhantomJS. What does your
backend look like? (What happens when an API request is made?)

~~~
tsergiu
We run a headless browser to execute your extraction.
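
Roughly, an API request queues a job, and a headless browser replays the recorded steps. Here's a sketch with Selenium (not a statement about our actual stack; the site and selectors are invented):

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    opts = Options()
    opts.add_argument("--headless")
    driver = webdriver.Chrome(options=opts)

    driver.get("https://example.com/listings")              # "go to" step
    driver.find_element("css selector", "a.next").click()   # "click" tool
    rows = driver.find_elements("css selector", "div.listing h2")
    results = [r.text for r in rows]                        # "extract" tool
    driver.quit()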

------
Kiro
Is this legal? Can I use this to scrape and store Google Search results?

~~~
sqrt17
According to Google's TOS, you cannot, and Google will ban you as soon as the
traffic from your host(s) looks more like monkeys on typewriters than actual
people looking for search results.

Will it have legal consequences? Most likely no. Will your coworkers or
employees stab you with a rusty fork for getting their (most likely) favorite
search engine to block them? Absolutely.

------
ar7hur
Looks powerful. Just curious, how does it compare to Kimono Labs?

~~~
tsergiu
We think Kimono is a great tool, but it is very limited in capability.

We specifically focus on handling highly dynamic or interactive websites. Our
toolset gives you more flexibility over how you can extract data. For example,
you can extract all the nested comment data from a reddit post, or you can
extract data from maps without having to fumble around in the web inspector.
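
For the hover case, the hand-written equivalent looks something like this (again a sketch with Selenium; all selectors are invented):

    from selenium import webdriver
    from selenium.webdriver.common.action_chains import ActionChains

    driver = webdriver.Chrome()
    driver.get("https://example.com/calendar")
    prices = []
    for day in driver.find_elements("css selector", "td.calendar-day"):
        # Hovering makes the tooltip render; only then can we read it.
        ActionChains(driver).move_to_element(day).perform()
        tooltip = driver.find_element("css selector", "div.tooltip-price")
        prices.append(tooltip.text)
    driver.quit()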

~~~
blklane
After using both, there is a very large brick wall that prevents using Kimono
Labs on most well-known web applications (Airbnb, Craigslist, etc.). An
example is extracting content only visible via hover (Airbnb calendar
pricing), which ParseHub is able to handle.

Thanks a lot for building this, I am excited to save server costs/time from
scraping data for projects.

------
nivertech
Pricing page not found:

    
    
        404 Not Found
        
        The resource could not be found.
        
        /pricing

~~~
tsergiu
Thanks! Pushing a fix now.

Edit: Fixed

~~~
superasn
The website looks really interesting and very useful if it works like in the
demo.

My only issue is that the $79/mo starting plan looks a bit steep for casual
development (I know there is a free option), but if you want to attract more
buyers I think something like GitHub's pricing would be really attractive for
prospective buyers like me (something like $22/month for 10 private projects
at 10 pages/min). Just a suggestion in case you're looking for feedback.

~~~
tsergiu
Didn't find an email in your profile. I'd love to discuss with you further.
Can you email me at serge@parsehub.com?

~~~
superasn
Thanks! I've sent you an email. Also, I couldn't figure out how to use select
boxes or radio buttons, or how to fill in a specific text field on the page.
The "expression" page in the docs also 404s. I guess you have too much on
your plate right now, so all this is just FYI.

------
ssiddharth
FWIW, the Facebook SVG on the home page doesn't seem to be loading for me.
The rest looks great. Congrats!

~~~
tsergiu
Thanks, we'll look into it.

------
adelevie
This reminds me of the wonderfully cool Web 2.0-era startup Dapper.

------
kyriakos
Really nice. Unfortunately it gets a bit slow on long pages, but it does work
well, and I like the fact that you offer an unlimited free plan.

