Hi everyone, faast.js is a library that allows you to use serverless to run batch processing jobs. It makes it super easy to run regular functions as serverless functions. This is one of my first open source projects and I'd be happy to answer any questions here.
Do you think the serverless pricing model is at odds with batch operations? As I understand it, you normally pay for task duration, RAM usage, and so on, and batch jobs are supposed to run for a long time. I'm probably missing something here. Could you tell me a little more?
It depends on the specific use case. Some of the use cases I envision have sharp spikes in demand, and serverless can provide better service and price/performance. Part of faast.js is a cost analyzer that can tell you in real time how much your workload costs. What I found is that most people are probably not using the memory sizes that give their Lambda functions the best price/performance. More on that when I write my next blog post... If you want a preview, check out this chart from the documentation: https://faastjs.org/docs/cost-estimates
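To make the memory-size point concrete, here is a minimal sketch of the underlying arithmetic. It is not faast.js's cost analyzer. The prices are assumed AWS Lambda list prices (x86, per GB-second and per request) and may be out of date, and the durations are invented measurements for a hypothetical CPU-bound function.

```typescript
// Assumed AWS Lambda list prices (x86); check current pricing before relying on them.
const PRICE_PER_GB_SECOND = 0.0000166667;
const PRICE_PER_REQUEST = 0.2 / 1e6;

interface Measurement {
    memoryMB: number;      // configured memory size
    avgDurationMs: number; // hypothetical measured average duration per call
}

// Estimated cost in USD of `calls` invocations at one memory size.
function estimateCost(m: Measurement, calls: number): number {
    const gbSeconds = (m.memoryMB / 1024) * (m.avgDurationMs / 1000) * calls;
    return gbSeconds * PRICE_PER_GB_SECOND + calls * PRICE_PER_REQUEST;
}

// Invented numbers: more memory also buys more CPU on Lambda, so a CPU-bound
// function gets faster as memory grows, up to a point.
const measurements: Measurement[] = [
    { memoryMB: 128, avgDurationMs: 8000 },
    { memoryMB: 512, avgDurationMs: 1900 },
    { memoryMB: 1024, avgDurationMs: 1000 },
    { memoryMB: 2048, avgDurationMs: 900 },
];

// Picks the measurement with the lowest estimated cost for `calls` invocations.
function cheapest(ms: Measurement[], calls: number): Measurement {
    return ms.reduce((best, m) =>
        estimateCost(m, calls) < estimateCost(best, calls) ? m : best
    );
}

for (const m of measurements) {
    console.log(`${m.memoryMB} MB: $${estimateCost(m, 4000).toFixed(4)} for 4000 calls`);
}
console.log(`cheapest: ${cheapest(measurements, 4000).memoryMB} MB`);
```

With these made-up durations the smallest memory size is not the cheapest, because cost depends on the product of memory and duration rather than on memory alone.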
From what I can tell, it's the invocation model and deployment that is unique here?
You invoke faast from your local machine (or build server, or cron job, whatever), and in turn it deploys some functions to a serverless platform, runs them, then tears them all down when complete. E.g., from the site, this code runs locally:
    import { faast } from "faastjs";
    import * as funcs from "./functions";

    (async () => {
        const m = await faast("aws", funcs);
        try {
            // m.functions.hello: string => Promise<string>
            const result = await m.functions.hello("world");
            console.log(result);
        } finally {
            await m.cleanup();
        }
    })();
You wouldn't want to run this code on serverless, as you'd be paying for compute time spent just waiting for all the other tasks to complete.
It would be useful to see a discussion about how and where to host this entry code, maybe even a topic on "Running in production".
It's definitely a neat idea because if you control the event that kicks everything off anyway (eg: "create monthly invoices" or "build daily reports") you can deploy the latest version of everything, run it and clean it up in essentially a single step.
(Please correct me if I've misunderstood any of the details here!)
You're basically correct, and thanks for the suggestion to add documentation about deployment in production.
One special case is if your functions return a lot of data; outbound data charges can get expensive fast, and you'll be limited in getting responses by your network link. So you can run the coordinator code on, say, EC2 in the same region and then the link to Lambda is super fast and you won't have any outbound data costs.
This is how I interpreted its usage too. We've all started an instance on DO/AWS/GCP/etc. for some batch job where we wanted 32 cores or whatnot. This lets you use Lambdas for the scaling instead of using the cores directly. How efficient this is performance-wise, I have no clue.
To serve as a data point, I effectively built an in-house version of this a few years ago built on top of AWS Lambda, all in Python. The "entry point" code or orchestration code was hosted normally on an EC2 instance. More specifically we were using Airflow, so our Airflow server would kick off a Python program that would then orchestrate a couple thousand Lambdas.
Very cool. It's worth checking out a similar project, Durable Functions. Those orchestrations, however, can run in serverless and can scale to zero during the “waiting for other tasks” step.
There are IP-based rate limiters on sites (LinkedIn, Facebook, etc.), but each Lambda has a new public IP, so by using faast.js I can stay under the radar.
Plus you can essentially spawn a headless chrome (puppeteer) to do advanced stuff.
Very interesting project. The problem with the serverless services offered by different public cloud vendors is that their programming models and APIs are not uniform. I think faast.js is on the right path to creating a unified interface for different serverless services.
I'm not experienced with this stuff either, but it seems like Serverless (the org/framework) is designed for architecting whole sites/apps, whereas faast.js is focused on handling batch computing jobs.
We were recently in exactly this situation: we had to do heavy processing of ~4000 items, each running between 1 and 10 minutes.
To speed the process up we ran it on Lambda. That took our process from 10h++ on a single-core computer down to about 15 min running on 4000 Lambdas.
Your library would have saved us quite some work, as it would take away a lot of AWS config, deployment, etc.
Btw: I'm thinking of building a similar library for multi-core/webworkers on Node.js. Currently a lot of boilerplate is required on Node.js to make a loop run in parallel on all cores.
Very cool. What kind of data was it, if you don't mind sharing?
Faast.js can be used with multi-core, just use the "local" mode and run it on a large box. I'm billing this as a way to test locally before running in the cloud, but it's actually a completely viable way to run parallel processes on one machine, with the option to run on serverless with a one line change.
Wow, that's awesome. I'll have a look at it ASAP. We have actually just converted our Lambda code to run on a multi-core machine, with much wiser algorithms, to massively speed up the process.
Also, do you create a new webworker per function call, or do you create only as many workers as there are threads/cores on the machine and run the functions inside those? Starting a webworker can be very expensive if the serialised data is large.
PS: each Lambda function ran a special parsing of complex mathematics exercises. We are an ed-tech company ;)
The serialization/deserialization is just JSON for now, though I plan on adding some configurability and perhaps changing the implementation at some point. There is some runtime checking to make sure the arguments are correctly serializable.
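As an illustration of what a runtime serializability check can look like, here is a minimal sketch. It is not faast.js's actual check; the function name `isJsonSafe` and its rules are assumptions chosen for the example. It accepts only values that survive a `JSON.stringify`/`JSON.parse` round trip without changing shape.

```typescript
// Hypothetical helper: true if `value` round-trips through JSON without loss.
function isJsonSafe(value: unknown, ancestors: Set<object> = new Set()): boolean {
    if (value === null) return true;
    switch (typeof value) {
        case "string":
        case "boolean":
            return true;
        case "number":
            // NaN and +/-Infinity are serialized as null.
            return Number.isFinite(value as number);
        case "object": {
            const obj = value as object;
            // A cycle makes JSON.stringify throw.
            if (ancestors.has(obj)) return false;
            // Date, Map, Set, Buffer and class instances change shape in JSON.
            const proto = Object.getPrototypeOf(obj);
            const plain =
                Array.isArray(obj) || proto === Object.prototype || proto === null;
            if (!plain) return false;
            const children: unknown[] = Array.isArray(obj)
                ? obj
                : Object.keys(obj).map(k => (obj as Record<string, unknown>)[k]);
            ancestors.add(obj);
            const ok = children.every(child => isJsonSafe(child, ancestors));
            ancestors.delete(obj);
            return ok;
        }
        default:
            // undefined, functions, symbols and bigints are dropped or rejected.
            return false;
    }
}

console.log(isJsonSafe({ name: "world", sizes: [1, 2, 3] }));
console.log(isJsonSafe({ when: new Date() }));
```

A check like this fails early on the calling side, instead of letting a `Date` silently arrive as a string inside the remote function.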
In local mode, a process is created up to the concurrency limit you specify, and each process is reused for subsequent calls (mimicking how Lambda reuses containers, allowing you to use the same caching behavior you'd use on Lambda). I'm not currently using webworkers, but that's something I could see a new mode for easily. For larger data, I would recommend storing arguments and return values directly in cloud storage like S3, or on local disk in local mode.
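The reuse behavior described above can be sketched as a small pool. This is an illustration only, not faast.js's implementation: `PoolWorker` and `WorkerPool` are hypothetical names, and each worker is an in-process object standing in for a child process so the example stays self-contained.

```typescript
// Stand-in for a child process; its cache survives across calls, like a warm Lambda container.
class PoolWorker {
    readonly cache = new Map<string, unknown>();
    constructor(readonly id: number) {}
}

class WorkerPool {
    private idle: PoolWorker[] = [];
    private waiting: Array<(w: PoolWorker) => void> = [];
    private created = 0;

    constructor(private readonly concurrency: number) {}

    // Reuse an idle worker, create one if under the limit, otherwise queue.
    private acquire(): Promise<PoolWorker> {
        const reused = this.idle.pop();
        if (reused) return Promise.resolve(reused);
        if (this.created < this.concurrency) {
            return Promise.resolve(new PoolWorker(this.created++));
        }
        return new Promise(resolve => this.waiting.push(resolve));
    }

    // Hand the worker to the next queued caller, or park it as idle.
    private release(w: PoolWorker): void {
        const next = this.waiting.shift();
        if (next) next(w);
        else this.idle.push(w);
    }

    async run<T>(task: (w: PoolWorker) => Promise<T>): Promise<T> {
        const w = await this.acquire();
        try {
            return await task(w);
        } finally {
            this.release(w);
        }
    }

    get workerCount(): number {
        return this.created;
    }
}

// Example task: one-time setup is cached per worker and reused by later calls.
async function square(w: PoolWorker, n: number): Promise<number> {
    if (!w.cache.has("ready")) w.cache.set("ready", true);
    return n * n;
}

const pool = new WorkerPool(4);
const inputs = Array.from({ length: 20 }, (_, i) => i);
Promise.all(inputs.map(n => pool.run(w => square(w, n)))).then(squares => {
    console.log(`${squares.length} results from ${pool.workerCount} workers`);
});
```

Capping the number of workers and reusing them keeps startup cost proportional to the concurrency limit rather than to the number of calls.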
I would be interested to learn how your experiment with faast.js goes!
Honestly, I would want Java. You would probably have to provide a mapping spec file (like IDL) to help generate the mediation code between the local proxies and the deployed functions.
This is very neat! Last year I had to essentially do this on GCP and relied on a very similar implementation. Everyone was surprised to see JS being used for data processing but it worked wonderfully.
One thing I want to ask about is retries: how do you handle them currently? I ran into multiple cases where functions would fail for transient reasons.
Functions need to be idempotent, so you have to assume they will be retried. Faast.js will proactively do retries in some cases where it thinks a function is slow, to reduce tail latency.
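One common way to cut tail latency with proactive retries is a "backup request": if a call has not finished after some threshold, start a second attempt and take whichever succeeds first. The sketch below illustrates that general technique. It is not faast.js's retry logic, and `withSpeculativeRetry` and `slowAfterMs` are hypothetical names.

```typescript
// Runs `fn`; if it has not settled after `slowAfterMs`, starts one backup attempt.
// Resolves with the first success and rejects only when every started attempt has failed.
function withSpeculativeRetry<T>(fn: () => Promise<T>, slowAfterMs: number): Promise<T> {
    return new Promise<T>((resolve, reject) => {
        let inFlight = 0;
        const start = () => {
            inFlight++;
            fn().then(
                value => {
                    clearTimeout(timer); // no backup needed any more
                    resolve(value);      // later settlements are ignored
                },
                err => {
                    inFlight--;
                    clearTimeout(timer); // ordinary error retries are out of scope here
                    if (inFlight === 0) reject(err);
                }
            );
        };
        // Start the backup only if the first attempt is still running at the threshold.
        const timer = setTimeout(start, slowAfterMs);
        start();
    });
}

// Usage with a made-up function whose first call is unusually slow.
let attempts = 0;
const sometimesSlow = (): Promise<string> => {
    const n = ++attempts;
    const delayMs = n === 1 ? 300 : 10;
    return new Promise(resolve => setTimeout(() => resolve(`attempt ${n}`), delayMs));
};

withSpeculativeRetry(sometimesSlow, 30).then(winner => console.log(winner));
```

Both attempts may run to completion even though only one result is used, which is why the functions have to be idempotent.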
If a function fails to execute for transient reasons and exceeds the retry maximum (a config setting you can change), then it will reject the return value promise. You can catch that and handle it with another attempt, report an error, or just ignore it and report less accurate or less complete results.
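On the calling side, those options can be sketched as follows. This is a minimal illustration, not part of faast.js: `settle`, `attemptAll` and `mapWithPartialResults` are hypothetical helpers, and `call` stands for any promise-returning remote function. Failed calls get one more attempt, and whatever still fails is reported next to the partial results.

```typescript
type Outcome<T> = { ok: true; value: T } | { ok: false; error: unknown };

// Converts a promise into one that never rejects, so one failure cannot sink the batch.
function settle<T>(p: Promise<T>): Promise<Outcome<T>> {
    return p.then(
        (value): Outcome<T> => ({ ok: true, value }),
        (error): Outcome<T> => ({ ok: false, error })
    );
}

// Calls `call` once per item and separates successes from failures.
async function attemptAll<A, R>(
    items: A[],
    call: (item: A) => Promise<R>
): Promise<{ results: R[]; failed: A[] }> {
    const outcomes = await Promise.all(items.map(item => settle(call(item))));
    const results: R[] = [];
    const failed: A[] = [];
    outcomes.forEach((outcome, i) => {
        if (outcome.ok) results.push(outcome.value);
        else failed.push(items[i]);
    });
    return { results, failed };
}

// One extra attempt for failures; remaining failures are returned, not thrown.
async function mapWithPartialResults<A, R>(
    items: A[],
    call: (item: A) => Promise<R>
): Promise<{ results: R[]; failed: A[] }> {
    const first = await attemptAll(items, call);
    const second = await attemptAll(first.failed, call);
    return { results: first.results.concat(second.results), failed: second.failed };
}
```

The caller can then decide whether a non-empty `failed` list is an error or an acceptable loss of completeness.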