
Show HN: Apifier – hosted web crawler for developers - jancurn
https://www.apifier.com/
======
thecodemonkey
I don't quite understand why you would use a full-blown browser like PhantomJS
for crawling (I've seen a lot of projects taking this approach recently, so
this critique isn't directed at Apifier specifically).

Yes, I get that in some specific circumstances it's nice to be able to execute
the JavaScript on the page, but think about the trade-off here.

In the vast majority of cases a simple HTTP GET request with a DOM parser is
all you need -- in fact, not a single one of the examples on the Apifier
homepage actually needs PhantomJS.
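
For instance, something like this covers most of those cases (a quick sketch
in Node; request and cheerio are just one possible stack, and the URL and
selector are made up):

    // Plain GET + DOM parsing -- no browser involved.
    var request = require('request');
    var cheerio = require('cheerio');

    request('http://www.example.com/listing', function (err, res, body) {
      if (err) { console.error(err); return; }
      var $ = cheerio.load(body);          // parse the static HTML
      $('.item-title').each(function () {  // hypothetical selector
        console.log($(this).text().trim());
      });
    });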

Wouldn't it be much, much cheaper, simpler and faster to ditch PhantomJS? Or
is there something I'm missing here?

~~~
jancurn
You're right that most of the time you don't need to use JavaScript.

But look at Google Groups, for example: there's an infinite scroll to load all
the topics, and the posts themselves are loaded dynamically, so you have to
wait for them to appear.

In the SFO flights example you also have to deal with pagination that is
driven by JavaScript.
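
To give an idea of what the crawler does under the hood for such pages,
handling an infinite scroll in PhantomJS boils down to something like this (a
rough sketch; the URL, selector and scroll count are placeholders):

    var page = require('webpage').create();

    page.open('https://groups.google.com/forum/#!forum/nodejs', function () {
      var scrolls = 0;
      var timer = setInterval(function () {
        // Scroll to the bottom so the page loads the next batch of items.
        page.evaluate(function () {
          window.scrollTo(0, document.body.scrollHeight);
        });
        if (++scrolls >= 10) {  // stop after a fixed number of scrolls
          clearInterval(timer);
          var count = page.evaluate(function () {
            return document.querySelectorAll('a').length;
          });
          console.log('links after scrolling: ' + count);
          phantom.exit();
        }
      }, 1000);  // give each batch a second to load
    });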

We wanted to build a powerful tool that can crawl and scrape almost any
website out there. It's slower, but you can use a bunch of our nodes to do it
in parallel.

~~~
thecodemonkey
I agree that the Google Groups example is much simpler when using PhantomJS,
but I would argue that it's an outlier.

The SFO flights example is actually heavily over-engineered. From quickly
glancing over the XHR tab in the Chrome Network tools, it was pretty obvious
that all of the data is located in this very nice JSON blob:
[http://www.flysfo.com/flightprocessing/fullFlightData.txt](http://www.flysfo.com/flightprocessing/fullFlightData.txt)

(I assume that the SF Flight Info was just meant as a demo of the platform,
and the fact that the data is already a JSON blob was ignored for the sake of
the example.)

------
jancurn
Hello HN! Today we're launching what we've been building for the past couple
of months. Apifier is a hosted web crawler for developers that lets them
extract data from any website using a few simple lines of JavaScript. We built
it because we realized that many existing web scrapers trade away their
ability to scrape complex websites for the "simplicity" of their user
interface. We thought: we're programmers and we already use JavaScript for
client-side development, so why not use it for scraping?
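
To give a flavour, a pageFunction is just client-side JavaScript that runs
inside each crawled page and returns the extracted data. A minimal sketch (the
selector and fields are made up; see the homepage examples for the real
thing):

    function pageFunction(context) {
      // Runs inside the loaded page, like any client-side script.
      var results = [];
      var rows = document.querySelectorAll('.listing-row');  // hypothetical selector
      for (var i = 0; i < rows.length; i++) {
        results.push({
          title: rows[i].querySelector('h2').textContent.trim(),
          link: rows[i].querySelector('a').href
        });
      }
      return results;
    }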

Please have a look at the service, play with the examples and maybe set up
your own crawl. My co-founder jakubbalada and I will be around here to answer
your questions. We'd love to hear what you guys think!

~~~
jacquesm
Do you respect robots.txt? Do you publish your IP ranges?

~~~
jakubbalada
Yes, by default we respect robots.txt. There is a switch to disable it, at
your own risk. We don't fully respect Crawl-delay, but the minimum delay
between requests for all our crawlers is 2000 ms. We don't publish our IP
ranges yet.
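
For context, the directives in question live in a site's robots.txt and look
like this (an illustrative file, not any real site's). We honor the Disallow
rules by default; a Crawl-delay like the one below isn't followed exactly,
since our crawlers simply keep at least 2 seconds between requests:

    User-agent: *
    Crawl-delay: 10
    Disallow: /private/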

------
rgbrgb
Looks really cool! Pricing is the big stickler for me. I've been burned too
many times to build any critical piece of my app with it without knowing how
much it'll cost if it gets popular.

~~~
jancurn
Many thanks! You're absolutely right; we'll publish the pricing ASAP and also
provide some long-term guarantee for users who depend on our service. BTW, do
you think pricing per GB downloaded is reasonable, or would you prefer a flat
monthly fee?

~~~
nsp
Can't speak for the OP, but I'd vastly prefer either a flat fee (maybe tiered
based on parallelism) or something like a price per page; that makes it much
easier to estimate usage.

~~~
jakubbalada
You're right, a price per request would be easier to estimate. But since you
can use JavaScript, you can scrape a whole website with just one page request
(see the SFO flights example). In other words, our costs don't correlate with
page requests, but with data transfer.

A flat fee is also possible, but we think it's fair that users pay based on
their consumption.

------
necrodome
How do you access the latest crawl results programmatically? I hope you're not
expecting me to click a results link for a developer tool.

~~~
jakubbalada
Of course not, an API will be available soon, in a week or two. If you have
other feature requests, please let us know; we need help with prioritization.

------
danielharan
I'm hoping this could save me some work.

A few questions, if founders are still around:

-Can you cache pages / download entire sites?

-If caching, can you detect changes on a given schedule, trigger the
extraction "pageFunction" and save versioned data?

-How do you handle errors?

-Will you handle database extractions and other sites that require multiple levels of what you have as pseudo-URLs?

~~~
jancurn
We're still here :)

\- at the moment we don't store the HTML content of the visited pages (except
for the last one), so the only way to determine whether something changed is
to run the 'pageFunction' on each page again and compare the results. This can
be optimized in certain situations, e.g. you can crawl a product listing and
only go to the product detail page if some basic property changed. Saving the
HTML for each page is certainly possible, but once the crawler has finished
loading a page, running a pageFunction adds very little extra overhead.

\- if a page cannot be loaded for any reason, a detailed description of the
error will be present in the JSON results. We want to implement a limited
number of retries for these pages, for situations where the error is just
temporary.

\- certainly. If your crawling strategy cannot be expressed using simple
pseudo-URLs, you can use the low-level 'interceptRequest' function to control
exactly how each new page navigation request is handled (enqueued/ignored),
tell the crawler which URLs refer to the same page and shouldn't be visited
again, etc. You can also enqueue arbitrary pages to crawl using
'context.enqueuePage'. In fact, you don't need to use pseudo-URLs at all and
can control everything from your code.
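
To sketch the idea (treat the exact signature as illustrative; the real one is
in our docs):

    // Inspect each navigation request and decide what the crawler does with it.
    function interceptRequest(context, newRequest) {
      // Naive normalization: treat URLs that differ only in tracking
      // parameters as the same page, so it's only visited once.
      newRequest.url = newRequest.url.replace(/[?&]utm_[^=&]+=[^&]*/g, '');

      // Ignore anything outside the section we care about.
      if (newRequest.url.indexOf('/topics/') === -1) {
        return null;  // null = don't enqueue this request
      }
      return newRequest;
    }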

------
benjmn
As a website owner, is it easy to block a rude crawler by contacting you?
(How would I identify in the first place that the crawler is operated by you?
Would my server logfile have enough data to point back to you?)

Nice & useful demo. I'll give it a try.

~~~
jancurn
Currently, there is no way to distinguish traffic from our crawlers, but of
course, let us know at support@apifier.com and we will blacklist your websites
from our crawlers.

------
Eridrus
Why should I use this instead of just firing up some spot instances with
phantomjs?

~~~
jancurn
You could definitely do that, but then you might also need to:

\- implement a mechanism to find and click active page elements and track the
browser's actions

\- recompile PhantomJS to support POST requests

\- implement a page queue with checks for duplicate URLs (sketched below)

\- implement some parallelization and a failover mechanism for when PhantomJS
crashes (it does)

\- possibly implement support for infinite scroll

\- set up your own pool of proxy servers

\- set up a database to store the results

and finally make the whole thing simple to set up and use :)
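
To give a feel for just the queue item, its core looks something like this (a
naive Node sketch; a real crawler needs smarter URL normalization and
persistence):

    var seen = {};   // URLs we've already enqueued
    var queue = [];  // URLs waiting to be crawled

    function enqueue(url) {
      var key = url.replace(/#.*$/, '');  // drop fragments; real code needs more
      if (seen[key]) return false;        // duplicate, skip it
      seen[key] = true;
      queue.push(key);
      return true;
    }

    function next() {
      return queue.shift();  // undefined when the crawl is done
    }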

------
bentpins
I love the demos, and that you can use them without registering. One thing I
couldn't find without making an account was what happens after you've used a
gigabyte. That would be a helpful addition, I think.

~~~
jancurn
TBH we don't have the pricing defined yet, because we don't know what our
server costs will be. In a few days we'll know more and put up the prices.

~~~
jancurn
I forgot to add: when you reach 1 GB we'll get in touch with you.

------
aakilfernandes
Cool! How do you stop users from trying to run malicious code?

~~~
jancurn
Thank you! The user's JavaScript code runs in the context of the web pages, in
the same restricted environment as a normal web page's JavaScript. Also, the
crawling processes are sandboxed.

------
thomasfromcdnjs
Also similar to [https://morph.io/](https://morph.io/), which is geared more
towards open data sets.

~~~
jancurn
morph.io is a great tool but it requires a non-trivial setup on your machine
in order to get things running. We wanted to enable people to create scrapers
with no prerequisites.

------
misiti3780
I like the idea - would probably use it in the future - can you talk a little
bit about what technologies you are using?

~~~
jancurn
Thank you! The crawler is currently based on PhantomJS; a pool of worker nodes
distributes the crawling across multiple machines. We use a Node.js + MongoDB
backend and Meteor for the front-end.

------
asterfield
I was just thinking yesterday of creating a similar service. I'm glad to see
someone else has already made it :D

~~~
jancurn
Cool, if things go well, we'll be hiring soon :)

------
Raphmedia
Exactly what I was looking for in order to efficiently improve my searches for
a new home. Thanks!

~~~
rgbrgb
Raph, awesome that you're hacking your own tools for homebuying! This is how
our company started.

If you're looking in California, let us know if there's a feature we could add
to Open Listings to improve your search. Always excited to help HNers hack
homebuying. One thing we're adding soon is the ability to filter your property
feed by running regular expressions over the descriptions. We'd love to hear
other ideas for hacker-friendly homebuying :).

