
Show HN: Faast.js – Serverless Batch Computing Made Simple - achou
https://faastjs.org
======
achou
Hi everyone, faast.js is a library that allows you to use serverless to run
batch processing jobs. It makes it super easy to run regular functions as
serverless functions. This is one of my first open source projects and I'd be
happy to answer any questions here.

~~~
m00dy
Hi,

Do you think the serverless pricing model conflicts with batch operations? I
think you normally pay for task duration, RAM usage, etc., and batch jobs are
supposed to run for a long time. I'm probably missing something here. Would
you tell me a little bit more?

~~~
achou
It depends on the specific use case. Some of the use cases I envision have
sharp spikes in demand, and serverless can provide better service and
price/performance. Part of faast.js is a cost analyzer that can tell you in
real time how much your workload costs. What I found is that most people are
probably using the wrong memory size for their Lambda functions when
optimizing for price/performance. More on that when I write my next blog
post... If you want a preview, check out this chart from the documentation:
[https://faastjs.org/docs/cost-estimates](https://faastjs.org/docs/cost-estimates)
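As a rough illustration of why memory size matters here (the rates below are approximate, illustrative numbers, not current AWS pricing): Lambda bills per GB-second plus a per-request fee, and CPU scales with memory, so a larger memory size that shortens runtime can lower the total bill. A minimal back-of-envelope sketch:

```javascript
// Back-of-envelope Lambda cost model (illustrative prices, not current AWS rates).
const PRICE_PER_GB_SECOND = 0.0000166667; // USD, approximate
const PRICE_PER_REQUEST = 0.0000002;      // USD per invocation, approximate

function estimateCost({ invocations, avgSeconds, memoryMB }) {
  const gbSeconds = invocations * avgSeconds * (memoryMB / 1024);
  return gbSeconds * PRICE_PER_GB_SECOND + invocations * PRICE_PER_REQUEST;
}

// Doubling memory doubles the GB-second rate, but if it more than halves
// runtime (CPU scales with memory on Lambda), total cost can go *down*.
const slow = estimateCost({ invocations: 1000, avgSeconds: 10, memoryMB: 512 });
const fast = estimateCost({ invocations: 1000, avgSeconds: 4, memoryMB: 1024 });
console.log(slow.toFixed(4), fast.toFixed(4));
```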

------
gregmac
From what I can tell, it's the invocation model and deployment that is unique
here?

You invoke faast from your local machine (or build server, or cron job,
whatever), and in turn it deploys some functions to a serverless platform and
runs them, then tears them all down when complete. Eg, from the site, this
code runs locally:

    import { faast } from "faastjs";
    import * as funcs from "./functions";

    (async () => {
        const m = await faast("aws", funcs);
        try {
            // m.functions.hello: string => Promise<string>
            const result = await m.functions.hello("world");
            console.log(result);
        } finally {
            await m.cleanup();
        }
    })();

You wouldn't want to run _this code_ on serverless, as you'd be paying for
compute time of just waiting for all the other tasks to complete.

It would be useful to see a discussion about how and where to host this entry
code, maybe even a topic on "Running in production".

It's definitely a neat idea because if you control the event that kicks
everything off anyway (eg: "create monthly invoices" or "build daily reports")
you can deploy the latest version of everything, run it and clean it up in
essentially a single step.

(Please correct me if I've misunderstood any of the details here!)

~~~
achou
You're basically correct, and thanks for the suggestion to add documentation
about deployment in production.

One special case is if your functions return a lot of data; outbound data
charges can get expensive fast, and you'll be limited in getting responses by
your network link. So you can run the coordinator code on, say, EC2 in the
same region and then the link to Lambda is super fast and you won't have any
outbound data costs.

~~~
penagwin
This is how I interpreted its usage too. We've all started an instance on
DO/AWS/GCP/etc. for some batch job where we wanted 32 cores or whatnot. This
lets you use lambdas for the scaling instead of the cores directly. How
efficient this is performance-wise, I have no clue.

------
asadlionpk
This can be great for scraping jobs!

There are IP-based rate limiters on sites (LinkedIn, Facebook, etc.), but each
lambda has a new public IP, so by using faast.js I can stay under the radar.

Plus you can essentially spawn a headless chrome (puppeteer) to do advanced
stuff.

~~~
achou
Indeed, I've put together a simple example of using puppeteer with faast.js in
this repo: [https://github.com/faastjs/examples/tree/master/aws-puppeteer-ts](https://github.com/faastjs/examples/tree/master/aws-puppeteer-ts)

------
dongxu
Very interesting project. The problem with the serverless offerings from
different public cloud vendors is that the programming models and APIs are not
uniform. I think faast.js is on the right path to creating a unified interface
for different serverless services.

~~~
bdcravens
Doesn't Serverless (the framework, not the concept) abstract this away?

[https://serverless.com/framework/docs/providers/](https://serverless.com/framework/docs/providers/)

(not familiar enough with that framework to form an opinion one way or the
other)

~~~
zaq_xsw
I'm not experienced with this stuff either, but it seems like Serverless (the
org/framework) is designed for architecting whole sites/apps, whereas faast.js
is focused on handling batch computing jobs.

------
BrandiATMuhkuh
Love what you did!

We recently were in exactly the situation where we had to do heavy processing
of ~4000 items, each running between 1-10 minutes. To speed the process up we
ran it on Lambda. That means our process went down from 10+ hours on a
single-core computer to about 15 minutes running on 4000 lambdas.

Your library would have saved us quite some work, as it would take away a lot
of the AWS config, deployment, etc.

Btw: I'm thinking of building a similar library for multi-core/webworkers for
Node.js. Currently a lot of boilerplate is required in Node.js to make a loop
run in parallel on all cores.

~~~
achou
Very cool. What kind of data was it, if you don't mind sharing?

Faast.js can be used for multi-core work: just use the "local" mode and run it
on a large box. I'm billing this as a way to test locally before running in
the cloud, but it's actually a completely viable way to run parallel processes
on one machine, with the option to run on serverless with a one-line change.
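The boilerplate being discussed can be sketched as a generic concurrency-limited parallel map. This is just the general pattern, not faast.js's actual implementation; `parallelMap` and its signature are made up for illustration:

```javascript
// Generic concurrency-limited parallel map: at most `limit` tasks in flight
// at once, results returned in input order.
async function parallelMap(items, fn, limit) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    // Single-threaded JS: `next++` is read and incremented atomically
    // between awaits, so workers never claim the same index.
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}

(async () => {
  const out = await parallelMap([1, 2, 3, 4], async n => n * n, 2);
  console.log(out); // [1, 4, 9, 16]
})();
```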

~~~
BrandiATMuhkuh
Wow, that's awesome. I'll have a look at it ASAP. We have actually just
converted our lambda code to run on a multi-core machine, plus much smarter
algorithms, to massively speed up the process.

I have not looked deeply into your library yet. But how do you deal with
de/serialising? We use [https://www.npmjs.com/package/class-transformer](https://www.npmjs.com/package/class-transformer)
to correctly de/serialise TS objects.

Also, do you create a new webworker per function call, or do you create only
as many workers as there are threads/cores on the machine and run the
functions inside those? Starting a webworker can be very expensive if the
serialised data is large.

Ps: each lambda function ran a special parsing of complex mathematics
exercises. We are an ed-tech company ;)

~~~
achou
The serialization/deserialization is just JSON for now, though I plan on
adding some configurability and perhaps changing the implementation at some
point. There is some runtime checking to make sure the arguments are correctly
serializable.
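A minimal sketch of the kind of runtime serializability check described, rejecting values that JSON would silently drop or mangle. This is illustrative only; `assertSerializable` is a hypothetical helper, not faast.js's actual implementation:

```javascript
// Reject argument values that JSON.stringify would silently lose:
// functions and undefined are dropped, Map/Set serialize as {}.
function assertSerializable(name, args) {
  JSON.stringify(args, (key, value) => {
    if (typeof value === "function" || value === undefined) {
      throw new Error(`Argument to ${name} has a non-serializable value at key "${key}"`);
    }
    if (value instanceof Map || value instanceof Set) {
      throw new Error(`Argument to ${name} contains a ${value.constructor.name}, which JSON serializes as {}`);
    }
    return value;
  });
  // Safe to round-trip: this is what the remote side would receive.
  return JSON.parse(JSON.stringify(args));
}

assertSerializable("hello", ["world", { n: 42 }]); // ok: plain data survives JSON
```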

In local mode, processes are created up to the concurrency limit you specify,
and each process is reused for subsequent calls (mimicking how Lambda reuses
containers, allowing you to use the same caching behavior you'd use on
Lambda). I'm not currently using webworkers, but that's something I could see
a new mode for easily. For larger data, I would recommend storing arguments
and return values directly in cloud storage like S3, or on local disk in local
mode.

I would be interested to learn how your experiment with faast.js goes!

------
mring33621
This is neat, but would be more useful if it could deploy cloud functions made
in language {x} and provide local js proxies for them.

~~~
achou
Good idea. Any specific example you have in mind?

~~~
linuxdude314
Python is a good place to start.

~~~
adeora
Pywren ([http://pywren.io/](http://pywren.io/)) seems to be basically this
project, but in Python

------
sourc3
This is very neat! Last year I had to essentially do this on GCP and relied on
a very similar implementation. Everyone was surprised to see JS being used for
data processing but it worked wonderfully.

One thing I want to ask about is retries: how do you handle them currently? I
ran into multiple cases where functions would fail for transient reasons.

~~~
achou
Functions need to be idempotent, so you have to assume they will be retried.
Faast.js will proactively do retries in some cases where it thinks a function
is slow, to reduce tail latency.

If a function fails for transient reasons and exceeds the retry maximum (a
config setting you can change), then it will reject the return value promise.
You can catch that and handle it with another attempt, report an error, or
just ignore it and report less accurate or less complete results.
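The catch-and-handle pattern described can be sketched like this (a generic illustration; `withRetries` is a made-up helper, not part of the faast.js API):

```javascript
// Retry an idempotent async call up to maxRetries extra attempts; if every
// attempt fails, rethrow the last error (surfaces as a rejected promise).
async function withRetries(fn, maxRetries) {
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// Usage: a flaky function that succeeds on the third attempt.
let calls = 0;
withRetries(async () => {
  calls++;
  if (calls < 3) throw new Error("transient failure");
  return "ok";
}, 5).then(result => console.log(result)); // "ok"
```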

------
heathermiller
Reminds me a bit of like 2019's version of RMI...

------
dead_mall
Looks interesting. The concept reminds me of RPyC

