S3 trickery: using it as a scheduler (hackernoon.com)
32 points by efi_mk 5 months ago | 43 comments

We're taking this serverless thing too far. We're coming up with elaborate overcomplicated schemes to accomplish simple tasks just so we can say "we're not running a server for this".

Instead of doing the simple thing and running a server process that periodically checks some queue for tasks to execute, this "scheme" involves triggering a lambda function every minute, to do some IO operations on a distributed file store where the execution date metadata is embedded in the filenames to see if there is some task to execute. If there is, some more distributed IO operations are required to move the file to another part of the distributed file storage, where some listener will notice a new file was added and trigger the actual lambda function with the task-specific code...

But hey, you're not running any servers yourself...

Don’t blame Serverless. Blame a poor implementation. AWS CloudWatch supports cron rules that can include the year an event should fire, and it can trigger Lambdas directly. There was no reason to overcomplicate this.

As far as using a queue, lambda also supports triggering based on a queue.
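For example, a one-shot CloudWatch rule could be built like this (a sketch; the boto3 calls are shown commented out, and the rule/function names and ARN are made up):

```python
from datetime import datetime

def one_time_cron(dt):
    """CloudWatch cron() expression that matches exactly once.
    AWS extends standard cron with a sixth field for the year."""
    return "cron({0} {1} {2} {3} ? {4})".format(
        dt.minute, dt.hour, dt.day, dt.month, dt.year)

# run_at = datetime(2019, 12, 25, 14, 30)  # UTC
# events = boto3.client("events")
# events.put_rule(Name="one-time-job",
#                 ScheduleExpression=one_time_cron(run_at))
# events.put_targets(Rule="one-time-job", Targets=[{
#     "Id": "1",
#     "Arn": "arn:aws:lambda:us-east-1:123456789012:function:my-task",
#     "Input": '{"some": "object"}'}])
```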

Don't forget that the above includes multi-datacenter, multi-region redundancy / failover. Could be important sometimes.

Waking on items in queue would be better, though, and AWS has it built-in; the authors just did not use it for some reason.

Not saying it's the right or wrong solution, but saying that a simpler solution would be to take a server and put a job scheduler process on it is misleading. People shouldn't forget that it never ends with a single server and a single process: you have to manage it to make it production-ready and fault tolerant. Sometimes doing the extra work to make something serverless-compatible is worth the effort.

Agreed, lol. I’d say let Tech Darwinism take its toll on the weak.

Azure's service bus queue has a much better solution for this: you can schedule an item to appear in the queue at any future time. https://docs.microsoft.com/en-us/dotnet/api/microsoft.servic...

As with Amazon's new SQS Lambda trigger, you can then have a serverless function triggered by new items in the queue, giving you arbitrary scheduling of tasks.

Using their Python API it's pretty nice:

    import json
    from datetime import datetime, timedelta
    from azure.servicebus import ServiceBusService, Message

    # namespace and queue name are placeholders
    sbs = ServiceBusService('my-namespace',
                            shared_access_key_name=key_name,
                            shared_access_key_value=key_value)

    d = datetime.utcnow() + timedelta(minutes=1)
    task = {"some": "object"}

    # message stays invisible until ScheduledEnqueueTimeUtc
    # (the service expects an RFC 1123 date string)
    sbs.send_queue_message('task-queue', Message(
        json.dumps(task),
        broker_properties={'ScheduledEnqueueTimeUtc':
                           d.strftime('%a, %d %b %Y %H:%M:%S GMT')}))

You can do the same with AWS...

- set a CloudWatch scheduled rule to schedule a lambda.

- or set a rule to publish to an SNS topic, subscribe a queue to the topic, and subscribe a lambda to the queue.

CloudWatch is cron-based, which means you can't schedule a one-time event.

Agree regarding sns, but in our case sns == s3

AWS adds a year parameter to cron. You can specify the exact time.

Why use S3? S3 events aren’t guaranteed. SNS was specifically designed for this use case.

Thanks for the info, tucking that away! :)

"you can schedule an item to appear in the queue at any future time" Nice! Anything similar in aws ?

This is misleading.

The title should be "Using s3 as storage and some side thingy that runs every minute as a scheduler".

It's worth noting that there is currently a gotcha with S3 Event Notifications such that they are not guaranteed. As a result, you may end up missing out on events.

S3 also doesn't provide a linearizable consistency model or even a vague approximation of one. You can't rely on the events you try to schedule happening in the order you try to schedule them in, or even happening at all.

This seems overcomplicated compared to using a regular timed event to trigger a lambda and having it decide what to execute conditionally.

I wasn't really too worried about picking apart the approach. If order mattered, you'd be looking at other approaches anyway.

Mainly just calling out a gotcha where you might quietly miss out on scheduled events with no warning.

For example:

1. Object written successfully to bucket.

2. `s3:ObjectCreated:Put` is _never_ delivered.

The possibility of duplicate events is warned about a lot in the AWS ecosystem, and this sets up an expectation of "at-least-once" delivery.

Can you elaborate more? Is this related to S3 uptime SLA or is there a different reason?

I believe it is different to the S3 uptime SLA.

You can measure the impact, and potentially automate recovery of missed events, by:

1) Keeping track of events published.

2) Generating an S3 inventory daily.

3) Comparing events received to objects listed in the inventory.

You rarely end up with fewer events than objects, but it does occur.
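A sketch of that comparison step, assuming you have the two key lists in hand (names are made up):

```python
def missed_events(inventory_keys, event_keys):
    """Keys present in the daily S3 inventory for which no
    s3:ObjectCreated event was ever received, i.e. silently dropped."""
    return sorted(set(inventory_keys) - set(event_keys))

# From here you can replay each missed key through the same handler.
missed = missed_events(
    ["a.json", "b.json", "c.json"],  # objects listed in the inventory
    ["a.json", "c.json"],            # keys seen via event notifications
)
```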

I've personally only observed this with SNS target, but due to it being a problem with S3, I believe Lambda can fall afoul of this too.

I wouldn’t trust S3 events to lambda. Sure lambda supports retries and a dead letter queue but you can’t reprocess the data.

A much more resilient approach would be:

S3 event -> SNS Topic -> SQS Queue -> lambda.

and set up a dead letter queue for the SQS queue.

It doesn’t help with the reliability of S3 events (and I’ve never seen that happen), but it does help if there is an error running your lambda.

Move the S3 object after processing it. As long as you move it to a bucket in the same region, there aren’t any charges.

Then if you are really paranoid, you can have a timed lambda that checks the source S3 bucket periodically and manually sends SNS messages to the same topic to force processing.
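For the DLQ part of that chain, the SQS RedrivePolicy attribute can be built like this (a sketch; the actual AWS call is commented out and the ARN is made up):

```python
import json

def redrive_policy(dlq_arn, max_receives=3):
    """SQS RedrivePolicy: after a message has been received
    max_receives times without being deleted (i.e. the lambda kept
    failing), SQS moves it to the dead letter queue.
    The policy JSON carries the count as a string."""
    return json.dumps({"deadLetterTargetArn": dlq_arn,
                       "maxReceiveCount": str(max_receives)})

# sqs = boto3.client("sqs")
# sqs.set_queue_attributes(
#     QueueUrl=queue_url,
#     Attributes={"RedrivePolicy": redrive_policy(
#         "arn:aws:sqs:us-east-1:123456789012:my-dlq")})
```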

Is there an open issue or documentation for this somewhere? We rely quite heavily on lambdas triggered by S3 events and have never experienced this.

> an open issue

If only; this is AWS. You pretty much need premium support just to tell them their products are broken. Before someone says "forums", the forums are a joke.

This isn’t using S3 as a scheduler. Cloudwatch already supports cron and rate expressions, as the post alluded to. This is a hack to schedule something once.

All Cloudwatch would have to do is implement a recurrence = 1 feature but I’m guessing it's not a common enough use case.

It supports the year it should be scheduled. You could specify the exact time for it to be scheduled.

I think there is a limit on a number of scheduled events per account per region.

So when doing one-time events (by specifying a year) - the triggered lambda need to delete it.

Or have an hourly scheduled lambda delete already-fired one-time events.

Or maybe CloudWatch is so smart it can delete them by itself?

You would have to delete the rule. In theory, it’s just like deleting an item from a queue once it has been processed. Another advantage is that you could look in the console to see which overdue rules haven’t been deleted.

And if you really need more than 50 rules, it’s just a soft limit. You send a request to support and they will raise it.
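A sketch of the cleanup a triggered lambda would do (the target id, rule name, and `event["rule_name"]` payload key are hypothetical; targets have to be removed before the rule can be deleted):

```python
def cleanup_rule(events_client, rule_name, target_id="1"):
    """Delete a fired one-time rule: CloudWatch requires removing the
    rule's targets before the rule itself can be deleted."""
    events_client.remove_targets(Rule=rule_name, Ids=[target_id])
    events_client.delete_rule(Name=rule_name)

# Inside the triggered lambda, after the actual task has run:
# cleanup_rule(boto3.client("events"), event["rule_name"])
```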

CloudWatch Events already supports triggering Lambda functions on a cron schedule. You don't even need S3. This seems like a ton of work to basically invoke a function every n minutes.

If you want a recurring task then you are right, but if you want to run a one-time task at a specific time in the future, depending on internal logic, then it won't work.

Isn't this a job for Step Functions?

Step function is also a good idea

For a moment I thought this was going to be something totally silly like using lifecycle management with a feedback loop

DynamoDB can also be used as a one-time scheduler, and I'd say it also looks a lot simpler than this example.

In DynamoDB you can set a TTL on a record. When the record expires, it fires an event to your designated lambda. That's it: you write a record, wait for it to expire, and your lambda gets notified.
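A sketch of what writing such a record could look like (table and attribute names are made up; note AWS only promises TTL deletion typically within 48 hours of expiry, so this is not precise scheduling):

```python
import time

def schedule_item(task_id, payload, delay_seconds):
    """Item for a table with TTL enabled on the 'expires_at' attribute.
    When it expires, a REMOVE record shows up on the table's stream,
    which can trigger the lambda."""
    return {
        "task_id": {"S": task_id},
        "payload": {"S": payload},
        "expires_at": {"N": str(int(time.time()) + delay_seconds)},
    }

# dynamodb = boto3.client("dynamodb")
# dynamodb.put_item(TableName="scheduled-tasks",
#                   Item=schedule_item("job-1", '{"some": "object"}', 60))
```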

Correct me if I'm wrong, but S3, unlike DynamoDB, does not require a specific region, therefore it's more fault tolerant; or it requires more work on DynamoDB to achieve the same level of stability.

S3 is in regions, like Dynamo.

S3 is Turing complete ;)

And as with everything overly flexible, people will abuse it and build over-engineered solutions on top of it, when a simpler solution is already present but possibly not obvious or poorly documented.

How much more obvious could it be? It’s in the Cloudwatch cron rules that the last parameter specifies the year.

Can you attach data to the event?

The data you attach will be included in the SNS message or the data that gets sent to the lambda handler.

You can put static JSON as an event from either the AWS Console, the CLI, or CloudFormation.

Wait what? You can invoke lambda functions directly via the invoke api. No need to use the s3 api. Am I missing something?

> What happens when you want to schedule a one-time event? You are stuck.

Or you create a rule for a specific date / time. Any reason why that would not work?

I was expecting this to be about combining Lambda with S3 lifecycle management to make a poor man’s cron.

Now I kinda want to try that myself.

What is it you can achieve with this that you cannot with cron?

Multiple keywords on your resume (S3, AWS, Lambda, Serverless)

If I interviewed someone who explained that as a solution, it would definitely not help their case. The worst hire is someone who overcomplicates projects. It leads to “negative work”.
