
Show HN: Distributed Scraper - Nimsical
http://stdlib.com/nemo/scrape
======
afandian
Is it me or is 'stdlib' not the best name for something that's not the
<stdlib.h>?

~~~
keithwhor
Founder here; as much as it pains many of us to hear, the vast majority of
software developers never interface with <stdlib.h> directly. You can think of
our choice of name more as an homage to a new generation of web-based
developers.

The question we asked ourselves was: if I'm a developer today and I want to
release services for people to easily use and compose, taking advantage of
new "serverless" technology, where do I put them and how do I distribute
them? How do I keep my services organized? Can I build a business around an
API? What does that look like?

The name "Standard Library" was a natural fit; I consider it a huge stroke of
serendipity that the domain was available. We've also been lucky to have a
huge amount of support from our developer community. We got a few questions
like this to begin with, which naturally caused some unease, but far more
positive reactions in total. I'm very happy we've been able to build something
people like. Thankfully, the name choice has not negatively impacted our SEO
or anything like that! :)

If you happen to play with it at all, let me know what you think!

(Humorous aside: newer developers ask us about "sexually transmitted
diseases" more often than they get the C reference, which is totally fine; we
don't mind sharing knowledge! But when I told my non-CS mother about the
business we were building, I definitely had to explain the background of the
name very quickly.)

~~~
afandian
Thanks for your reply! So is this a library or a service? It looks like a
library, but why would I need to sign up to use a library?

~~~
keithwhor
It's a library of services, but we provide deployment, authorization, rate
limiting, billing, and documentation layers for developers. Think of it like
GitHub for services instead of source control. (Plus, we make the usage-based
billing model highly accessible to every developer.)
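
To make the "library of services" idea concrete, here is a minimal sketch of what consuming such a service over HTTP could look like. The endpoint pattern, base URL, and bearer-token header below are hypothetical illustrations, not StdLib's documented API; the point is just that a registry entry like `nemo/scrape` resolves to an authenticated HTTP call, with auth, rate limiting, and billing handled by the platform's gateway.

```python
import json
import urllib.request

# Hypothetical gateway base URL -- StdLib's real routing may differ.
BASE = "https://functions.example.com"

def build_service_call(user, service, params, token):
    """Compose an authenticated POST to a hypothetical '<user>/<service>' function."""
    return urllib.request.Request(
        f"{BASE}/{user}/{service}",
        data=json.dumps(params).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The gateway would use this token for auth, rate limiting, billing.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )

# Usage: build (but don't send) a call to a scraper service.
req = build_service_call("nemo", "scrape",
                         {"url": "https://news.ycombinator.com"}, "my-token")
```

Sending `req` with `urllib.request.urlopen(req)` would then invoke the remote function; the sketch stops short of the network call.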

------
djyaz1200
Does anyone know of a scraper with a simple UI that's usable by less technical
people? Similar to Kimono (bought by Palantir and shut down).
[https://techcrunch.com/2016/02/15/palantir-acquires-kimono-labs-for-its-web-scraping-service/](https://techcrunch.com/2016/02/15/palantir-acquires-kimono-labs-for-its-web-scraping-service/)

~~~
jusob
I tried [https://dexi.io/](https://dexi.io/) (free account) about a year ago,
and it looked very good. One downside is that it issued requests from Europe
by default, so you had to bring your own proxy for the US, for example.

------
Nimsical
Mainly built this as an experiment to pull a bunch of data for some ML work
I've been doing. Wrote about it more extensively here:
[https://hackernoon.com/microservice-series-scraper-ee970df3e81f#.ex6qh4aek](https://hackernoon.com/microservice-series-scraper-ee970df3e81f#.ex6qh4aek)

AMA!

------
wenbert
"Distributed" as in using proxies? Where do the requests come from when I
scrape a page?

~~~
Nimsical
I guess it's not truly distributed in that sense. StdLib uses AWS Lambda,
whose workers have widely varying IPs, and I believe they're multi-region.

I haven't hit a wall or gotten caught doing any scraping. But then again, I
haven't been doing it at 10k pages/sec or anything like that.
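
For anyone curious what the Lambda side of this pattern looks like, here is a minimal sketch of a scrape handler. The event shape (`{"url": ...}`) and the injectable `fetcher` parameter are assumptions for illustration and testing, not StdLib's or Nimsical's actual implementation; the idea is simply that each invocation downloads one page, and AWS's worker pool supplies the IP variety.

```python
import json
import urllib.request

def fetch(url, timeout=10):
    """Download a page. Each Lambda worker has its own outbound IP,
    which is where the loose "distributed" property comes from."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")

def handler(event, context=None, fetcher=fetch):
    # Assumed event shape: {"url": "https://example.com"}
    url = event.get("url")
    if not url:
        return {"statusCode": 400, "body": json.dumps({"error": "missing url"})}
    status, body = fetcher(url)
    return {"statusCode": status, "body": body}
```

The `fetcher` argument exists so the handler can be exercised locally without a network call; in a real deployment you would drop it and let the default `fetch` run.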

~~~
wenbert
Thank you! This is very interesting.

I have wanted to implement something like this, i.e. Lambda doing the
downloading of the page itself.

I have wondered how it would work with very strict sites like Yelp, with
limits similar to what you would get in their API (so it would make sense to
just use their API).

What are your stats like, if you don't mind sharing? How many people are
using it, and how many requests are getting blocked (404 or 500 after 1,000
requests, etc.)?

Edit: Is it possible to use my own credentials for AWS?

~~~
Nimsical
I don't track that information on the function – but I really should!

Based on StdLib's dashboards, a bunch of folks have been using it each month,
at a steady pace of a few hundred scrapes a day.

We've been using it internally for quite a while now.

And as far as I know, StdLib doesn't allow you to use your own AWS
credentials. They have their own gateway and a bunch of stuff on top of
Lambda that makes the whole experience a lot easier and more powerful (e.g. a
128MB payload limit vs. 5MB for Lambda).

