Show HN: Faast.js – Serverless Batch Computing Made Simple

achou · on May 2, 2019

Hi everyone, faast.js is a library that allows you to use serverless to run batch processing jobs. It makes it super easy to run regular functions as serverless functions. This is one of my first open source projects and I'd be happy to answer any questions here.

m00dy · on May 2, 2019

Hi,

Do you think the serverless pricing model conflicts with batch operations? I think you normally pay for task duration, RAM usage, etc., and batch jobs are supposed to run long. I'm probably missing something here. Could you tell me a little bit more?

achou · on May 2, 2019

It depends on the specific use case. Some of the use cases I envision have sharp spikes in demand, and serverless can provide better service and price/performance. Part of faast.js is a cost analyzer that can tell you in real time how much your workload costs. What I found is that most people are probably using the wrong memory sizes for their lambda functions to optimize for price/performance. More on that when I write my next blog post... If you want a preview, check out this chart from the documentation: https://faastjs.org/docs/cost-estimates
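
To illustrate why memory size matters for price/performance, here is a minimal sketch of the arithmetic. It is not the faast.js cost analyzer: the price constants and the per-size durations are assumptions made up for illustration (check your provider's current price list), and `estimateCost` and `cheapest` are hypothetical helper names.

```typescript
// Assumed prices for illustration only; check your provider's current pricing.
const PRICE_PER_GB_SECOND = 0.0000166667;
const PRICE_PER_REQUEST = 0.0000002;

interface Workload {
    memoryMb: number;      // configured function memory size
    avgDurationMs: number; // measured average duration at that memory size
    invocations: number;   // number of calls in the batch
}

// Cost = (GB of memory * seconds of run time * calls) * price + per-request fee.
function estimateCost(w: Workload): number {
    const gbSeconds = (w.memoryMb / 1024) * (w.avgDurationMs / 1000) * w.invocations;
    return gbSeconds * PRICE_PER_GB_SECOND + w.invocations * PRICE_PER_REQUEST;
}

// Pick the configuration with the lowest estimated cost.
function cheapest(workloads: Workload[]): Workload {
    return workloads.reduce((best, w) => (estimateCost(w) < estimateCost(best) ? w : best));
}

// Hypothetical measurements of the same job at three memory sizes.
const candidates: Workload[] = [
    { memoryMb: 512, avgDurationMs: 4000, invocations: 1000 },
    { memoryMb: 1024, avgDurationMs: 1800, invocations: 1000 },
    { memoryMb: 2048, avgDurationMs: 1000, invocations: 1000 },
];

for (const w of candidates) {
    console.log(`${w.memoryMb} MB: $${estimateCost(w).toFixed(4)}`);
}
console.log(`cheapest: ${cheapest(candidates).memoryMb} MB`);
```

With these assumed numbers the 1024 MB configuration comes out cheapest: doubling memory from 512 MB more than halves the run time, while doubling again to 2048 MB does not.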

keeganj · on May 2, 2019

This looks great! I have an upcoming processing-intensive project I hope to test this out on soon.

I especially like the cost estimate feature; that isn't something I've seen before in such a seemingly simple tool.

pushtheenvelope · on May 2, 2019

this looks super neat!

I've been wanting to make a graphql server framework that can run on lambdas, and will perhaps look into integrating with faast.

gregmac · on May 2, 2019

From what I can tell, it's the invocation model and deployment that is unique here?

You invoke faast from your local machine (or build server, or cron job, whatever), and in turn it deploys some functions to a serverless platform and runs them, then tears them all down when complete. Eg, from the site, this code runs locally:

    import { faast } from "faastjs";
    import * as funcs from "./functions";

    (async () => {
        const m = await faast("aws", funcs);
        try {
            // m.functions.hello: string => Promise<string>
            const result = await m.functions.hello("world");
            console.log(result);
        } finally {
            await m.cleanup();
        }
    })();

You wouldn't want to run this code on serverless, as you'd be paying for compute time of just waiting for all the other tasks to complete.

It would be useful to see a discussion about how and where to host this entry code, maybe even a topic on "Running in production".

It's definitely a neat idea because if you control the event that kicks everything off anyway (eg: "create monthly invoices" or "build daily reports") you can deploy the latest version of everything, run it and clean it up in essentially a single step.

(Please correct me if I've misunderstood any of the details here!)

achou · on May 2, 2019

You're basically correct, and thanks for the suggestion to add documentation about deployment in production.

One special case is if your functions return a lot of data; outbound data charges can get expensive fast, and you'll be limited in getting responses by your network link. So you can run the coordinator code on, say, EC2 in the same region and then the link to Lambda is super fast and you won't have any outbound data costs.

penagwin · on May 2, 2019

This is how I interpreted its usage too. We've all started an instance on DO/AWS/GCP/etc. for some batch job where we wanted 32 cores or whatnot. This lets you use lambdas for the scaling instead of the cores directly. How efficient this is performance-wise I have no clue.

teej · on May 2, 2019

To serve as a data point, I effectively built an in-house version of this a few years ago built on top of AWS Lambda, all in Python. The "entry point" code or orchestration code was hosted normally on an EC2 instance. More specifically we were using Airflow, so our Airflow server would kick off a Python program that would then orchestrate a couple thousand Lambdas.

netofeverythin3 · on May 3, 2019

Very cool. Worth checking out a similar project, Durable Functions. However, those orchestrations can run in serverless and can scale to zero during the “waiting for other tasks” step.

https://docs.microsoft.com/en-us/azure/azure-functions/durab...

Disclaimer - product manager for Azure durable functions

achou · on May 3, 2019

I'd love to add Azure support but I'm not super familiar with it. Would be great to chat about it sometime.

netofeverythin3 · on May 3, 2019

Absolutely! And awesome job with this - very cool to see it being shared with the community

achou · on May 3, 2019

Contact me via DM on twitter, or on linkedin

asadm · on May 2, 2019

This can be great for scraping jobs!

There are IP-based rate limiters on sites (LinkedIn, Facebook, etc.), but each lambda has a new public IP, so by using faast.js I can stay under the radar.

Plus you can essentially spawn a headless chrome (puppeteer) to do advanced stuff.

achou · on May 2, 2019

Indeed, I've put together a simple example of using puppeteer with faast.js in this repo: https://github.com/faastjs/examples/tree/master/aws-puppetee...

dongxu · on May 2, 2019

Very interesting project. The problem with the serverless services provided by different public cloud vendors is that their programming models and APIs are not uniform. I think Faast.js is on the right path to creating a unified interface for different serverless services.

bdcravens · on May 2, 2019

Doesn't Serverless (the framework, not the concept) abstract this away?

https://serverless.com/framework/docs/providers/

(not familiar enough with that framework to form an opinion one way or the other)

zaq_xsw · on May 14, 2019

I'm not experienced with this stuff either, but it seems like Serverless (the org/framework) is designed for architecting whole sites/apps, whereas faast.js is focussed on handling batch computing jobs.

BrandiATMuhkuh · on May 2, 2019

Love what you did!

We were recently in exactly this situation: we had to do heavy processing of ~4000 items, each running between 1 and 10 minutes. To speed the process up we ran it on lambda. That means our process went down from 10h++ on a single-core computer to about 15min running it on 4000 lambdas.

Your library would have saved us quite some work, as it would take away a lot of AWS config, deploy, etc.

Btw: I'm thinking of building a similar library for multi-core/webworkers for node.js. Currently a lot of boilerplate is required on node.js to make a loop run in parallel on all cores.

achou · on May 2, 2019

Very cool. What kind of data was it, if you don't mind sharing?

Faast.js can be used with multi-core, just use the "local" mode and run it on a large box. I'm billing this as a way to test locally before running in the cloud, but it's actually a completely viable way to run parallel processes on one machine, with the option to run on serverless with a one line change.

BrandiATMuhkuh · on May 2, 2019

Wow that's awesome. I'll have a look at it ASAP. We have actually just converted our lambda code to run on a multi core machine + much wiser algorithms to massively speed up the process.

I have not looked deeply into your library yet, but how do you deal with de/serialising? We use https://www.npmjs.com/package/class-transformer to correctly de/serialise ts-objects.

Also, do you create a new webworker per function call, or do you create only as many workers as there are threads/cores on the machine and run the functions inside those? Starting a webworker can be very expensive if the serialised data is large.

PS: each lambda function ran a special parsing of complex mathematics exercises. We are an ed-tech company ;)

achou · on May 2, 2019

The serialization/deserialization is just JSON for now, though I plan on adding some configurability and perhaps changing the implementation at some point. There is some runtime checking to make sure the arguments are correctly serializable.
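
As a rough sketch of what a runtime check for JSON-safe arguments could look like (this is not the actual faast.js check; `isPlainJson` is a hypothetical helper, and it assumes the input has no cycles):

```typescript
// Returns true if `value` is built only from JSON-native pieces:
// null, booleans, finite numbers, strings, arrays, and plain objects.
// Assumption: the value contains no cyclic references.
function isPlainJson(value: unknown): boolean {
    if (value === null) return true;
    switch (typeof value) {
        case "string":
        case "boolean":
            return true;
        case "number":
            // NaN and +/-Infinity are serialized as null by JSON.stringify.
            return Number.isFinite(value);
        case "object": {
            if (Array.isArray(value)) {
                return value.every(item => isPlainJson(item));
            }
            // Class instances, Date, Map, etc. lose their type after a round trip.
            const proto = Object.getPrototypeOf(value);
            if (proto !== Object.prototype && proto !== null) return false;
            const record = value as Record<string, unknown>;
            return Object.keys(record).every(key => isPlainJson(record[key]));
        }
        default:
            // undefined, functions, symbols, and bigints are not representable in JSON.
            return false;
    }
}

console.log(isPlainJson({ id: 1, tags: ["a", "b"] })); // plain data
console.log(isPlainJson({ when: new Date() }));        // contains a Date
```

Class instances fail a check like this because JSON drops the prototype, which is the gap a library such as class-transformer fills.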

In local mode, a process is created up to the concurrency limit you specify, and each process is reused for subsequent calls (mimicking how Lambda reuses containers, allowing you to use the same caching behavior you'd use on Lambda). I'm not currently using webworkers, but that's something I could see a new mode for easily. For larger data, I would recommend storing arguments and return values directly in cloud storage like S3, or on local disk in local mode.
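
The reuse behavior described above can be sketched roughly as follows. This is not how faast.js is implemented: `PooledWorker` and `WorkerPool` are hypothetical names, and a plain object with a cache stands in for a child process.

```typescript
// Stand-in for a child process: an id plus a cache that survives between calls.
interface PooledWorker {
    id: number;
    cache: Map<string, unknown>;
}

class WorkerPool {
    private idle: PooledWorker[] = [];
    private waiting: Array<(worker: PooledWorker) => void> = [];
    private created = 0;

    constructor(private readonly concurrency: number) {}

    // Reuse an idle worker, create one if under the limit, otherwise wait.
    private acquire(): Promise<PooledWorker> {
        const worker = this.idle.pop();
        if (worker) return Promise.resolve(worker);
        if (this.created < this.concurrency) {
            return Promise.resolve({ id: this.created++, cache: new Map() });
        }
        return new Promise(resolve => this.waiting.push(resolve));
    }

    // Hand the worker to the next waiting call, or park it as idle.
    private release(worker: PooledWorker): void {
        const next = this.waiting.shift();
        if (next) next(worker);
        else this.idle.push(worker);
    }

    // Run one task on a pooled worker, returning the worker afterwards.
    async run<T>(task: (worker: PooledWorker) => Promise<T>): Promise<T> {
        const worker = await this.acquire();
        try {
            return await task(worker);
        } finally {
            this.release(worker);
        }
    }

    get workerCount(): number {
        return this.created;
    }
}
```

A call such as `pool.run(async worker => ...)` then never runs on more than `concurrency` workers at once, and anything a task stores in `worker.cache` is visible to later tasks that land on the same worker.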

I would be interested to learn how your experiment with faast.js goes!

mring33621 · on May 2, 2019

This is neat, but would be more useful if it could deploy cloud functions made in language {x} and provide local js proxies for them.

achou · on May 2, 2019

Good idea. Any specific example you have in mind?

linuxdude314 · on May 2, 2019

Python is a good place to start.

adeora · on May 2, 2019

Pywren (http://pywren.io/) seems to be basically this project, but in Python

mring33621 · on May 2, 2019

Honestly, I would want java. Probably would have to provide a mapping spec file (like IDL) to help generate the mediation code between the local proxies and the deployed functions.

sourc3 · on May 2, 2019

This is very neat! Last year I had to essentially do this on GCP and relied on a very similar implementation. Everyone was surprised to see JS being used for data processing but it worked wonderfully.

One thing I want to ask about is retries: how do you handle them currently? I ran into multiple cases where functions would fail for transient reasons.

achou · on May 2, 2019

Functions need to be idempotent, so you have to assume they will be retried. Faast.js will proactively do retries in some cases where it thinks a function is slow, to reduce tail latency.

If a function fails to execute for transient reasons and exceeds the retry maximum (a config setting you can change), then it will reject the return value promise. You can catch that and handle with another attempt, or report an error, or just ignore it and report less accurate or complete results.
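
On the caller side, handling a rejected promise might look like the sketch below. It uses no faast.js APIs; `withRetries` and `runBatch` are hypothetical helpers, and the wrapped function is assumed to be idempotent since it may run more than once.

```typescript
// Retry an async call up to maxAttempts times; rethrow the last error if all fail.
// Assumes fn is idempotent, because it may execute more than once.
async function withRetries<T>(fn: () => Promise<T>, maxAttempts: number): Promise<T> {
    if (maxAttempts < 1) throw new RangeError("maxAttempts must be at least 1");
    let lastError: unknown;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return await fn();
        } catch (err) {
            lastError = err;
        }
    }
    throw lastError;
}

// Run fn over all items in parallel. Items that still fail after retrying are
// collected instead of aborting the batch, so the caller can report partial
// results. Note: `results` is in completion order, not input order.
async function runBatch<A, R>(
    items: A[],
    fn: (item: A) => Promise<R>,
    maxAttempts = 3
): Promise<{ results: R[]; failed: A[] }> {
    const results: R[] = [];
    const failed: A[] = [];
    await Promise.all(
        items.map(async item => {
            try {
                results.push(await withRetries(() => fn(item), maxAttempts));
            } catch {
                failed.push(item);
            }
        })
    );
    return { results, failed };
}
```

Whether to retry `failed` again, report an error, or accept less complete results is then a decision for the caller.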

heathermiller · on May 2, 2019

Reminds me a bit of a 2019 version of RMI...

dead_mall · on May 3, 2019

Looks interesting. The concept reminds me of RPyC