
Show HN: s3-lambda – Lambda functions over S3 objects: each, map, reduce, filter - wellsjohnston
https://github.com/littlstar/s3-lambda
======
hoodoof
It's weird how S3 seems to be the unwanted stepchild of AWS.

So many obvious innovations just aren't turning up.

For example, strangely, AWS introduced tagging for S3 resources, but you can't
search/filter by tag, nor is the tag even returned when you list objects; you
can only get an object's tags with a separate per-object request. The word
"pointless" springs to mind.

In fact it's strange that there is NO useful filtering at all beyond the
(admittedly very useful) folder/hierarchy/prefix filtering. You can't do
wildcard searches, date filters, or tag filters.

I'm building an application right now that needs to get a list of all the jpg
files. The only way to do that is to list every single object in the bucket and
manually filter out the unwanted ones - it feels like it's 1988 again.
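
For concreteness, that workaround looks roughly like this with the AWS SDK for
JavaScript (bucket name hypothetical); the suffix match happens client-side,
after every page of keys has already been fetched:

    const AWS = require('aws-sdk')
    const s3 = new AWS.S3()

    // List *everything*, then filter by suffix ourselves - S3 only
    // supports server-side filtering by prefix.
    async function listJpgKeys(bucket) {
      const keys = []
      let ContinuationToken
      do {
        const page = await s3.listObjectsV2({ Bucket: bucket, ContinuationToken }).promise()
        for (const obj of page.Contents) {
          if (obj.Key.toLowerCase().endsWith('.jpg')) keys.push(obj.Key)
        }
        ContinuationToken = page.NextContinuationToken
      } while (ContinuationToken)
      return keys
    }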

It also seems like it would be valuable to have alternate interfaces to S3,
such as the ability to send data via FTP, SFTP, SMTP, or whatever, but there
are no such interfaces.

Hopefully Google will goad AWS into action on S3 innovation by implementing
such features.

~~~
lobster_johnson
S3's API is so rudimentary that I prefer to think of it as a non-enumerable
key/value store.

I learned this the hard way: we had an application where we made the mistake of
storing about a billion files in a nearly flat structure — one level of
nesting, probably 100m "folders" in the root. Then one day we needed to go
through it to prune stuff that was no longer in use. Unfortunately, if you
don't have a "shardable" prefix, list requests are impossible to parallelize
efficiently (because you can't subdivide the work), and our scripts took
_weeks_ to run to completion. Hard-earned experience: If you're storing large
quantities of stuff in S3, _always pick a shardable prefix_. The upload date
is a good choice. A random string will also do.
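
To illustrate (names hypothetical): with date-sharded keys, each shard's
listing becomes independent work you can fan out, instead of one serial scan
over the whole keyspace:

    const AWS = require('aws-sdk')
    const s3 = new AWS.S3()

    // List all keys under one shardable prefix (e.g. one upload date).
    async function listPrefix(bucket, prefix) {
      const keys = []
      let ContinuationToken
      do {
        const page = await s3.listObjectsV2({ Bucket: bucket, Prefix: prefix, ContinuationToken }).promise()
        keys.push(...page.Contents.map(o => o.Key))
        ContinuationToken = page.NextContinuationToken
      } while (ContinuationToken)
      return keys
    }

    // One parallel listing per shard - impossible without a shardable prefix.
    const shards = ['2017-01-13/', '2017-01-14/', '2017-01-15/']
    Promise.all(shards.map(s => listPrefix('my-bucket', s)))
      .then(lists => console.log('total keys:', [].concat(...lists).length))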

After this, my solution for any non-trivially-sized storage use case is to
store an inventory of objects separately in a performant PostgreSQL database,
and make sure all writes go through a service layer that shields the consumer
from the details of S3. This makes listing and filtering cheap SQL queries,
though it has downsides, like the possibility that things get out of sync if
you sidestep the inventory. Overall, I wish S3 itself would store its metadata
in something queryable like BigQuery.
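
A minimal sketch of that write-through layer, assuming the aws-sdk and pg
packages (table and names hypothetical):

    const AWS = require('aws-sdk')
    const { Pool } = require('pg')

    const s3 = new AWS.S3()
    const db = new Pool({ connectionString: process.env.DATABASE_URL })

    // Every write goes to S3 *and* to an inventory table, so "find all
    // .jpg files" becomes a cheap SQL query instead of a full bucket scan.
    async function putObject(bucket, key, body, tags = {}) {
      await s3.putObject({ Bucket: bucket, Key: key, Body: body }).promise()
      await db.query(
        'INSERT INTO s3_inventory (bucket, key, size, tags, uploaded_at) VALUES ($1, $2, $3, $4, now())',
        [bucket, key, Buffer.byteLength(body), JSON.stringify(tags)]
      )
    }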

Anyone know if Google Cloud Platform's S3 equivalent, Cloud Storage, improves
on these issues?

~~~
jessaustin
I wonder if "bucket notifications" are reliable enough that one could keep
such an index DB populated automatically?

~~~
nolite
Yes, just hook those up to a Lambda function and write to DynamoDB or
something.
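
Roughly like this (table name hypothetical; the event shape is S3's standard
notification payload):

    const AWS = require('aws-sdk')
    const ddb = new AWS.DynamoDB.DocumentClient()

    // AWS Lambda handler subscribed to the bucket's ObjectCreated/
    // ObjectRemoved notifications; keeps an index table in sync.
    exports.handler = async (event) => {
      for (const record of event.Records) {
        const bucket = record.s3.bucket.name
        // Keys arrive URL-encoded, with spaces as '+'.
        const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '))
        if (record.eventName.startsWith('ObjectCreated')) {
          await ddb.put({ TableName: 's3-index', Item: { bucket, key } }).promise()
        } else if (record.eventName.startsWith('ObjectRemoved')) {
          await ddb.delete({ TableName: 's3-index', Key: { bucket, key } }).promise()
        }
      }
    }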

~~~
otterley
I tried this, but if you want to query by tags, using an RDS database works
much better. DynamoDB is not well suited to this particular problem.

------
dschnurr
Might make sense to rename this to avoid confusion with AWS Lambda (I
immediately thought it was related). Otherwise, looks like an awesome library!

~~~
wellsjohnston
Ah yeah, just realizing this...what would you recommend?

~~~
cocktailpeanuts
I also came here thinking this was some sort of AWS Lambda triggered within the
context of a certain S3 file. I would say anyone who's heard of AWS Lambda
would think that way.

Maybe functional-s3?

~~~
wellsjohnston
Okay this seems like a good alternative. I just renamed the repo. Renaming it
on npm...is a bit cumbersome :|

------
simonw
First impression: this is a brilliant piece of software design.

The ability to compose a map/filter chain and execute it in parallel against
every object in an S3 bucket that matches a specific prefix - wow.

The set of problems that can be quickly and cheaply solved with this thing is
enormous. My biggest problem with lambda functions is that they are a bit of a
pain to actually write - for transforming data in S3 this looks like my ideal
abstraction.
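
Conceptually, the pattern is something like this (a hand-rolled sketch, not
s3-lambda's actual API - names are illustrative and pagination is omitted):

    const AWS = require('aws-sdk')
    const s3 = new AWS.S3()

    // Fetch every object under a prefix in parallel and run a
    // user-supplied function over each body.
    async function mapObjects(bucket, prefix, fn) {
      const { Contents } = await s3.listObjectsV2({ Bucket: bucket, Prefix: prefix }).promise()
      return Promise.all(Contents.map(async ({ Key }) => {
        const obj = await s3.getObject({ Bucket: bucket, Key }).promise()
        return fn(Key, obj.Body.toString())
      }))
    }

    // e.g. count ERROR lines across every log file under a prefix
    mapObjects('my-bucket', 'logs/2017-01-15/', (key, body) =>
      body.split('\n').filter(line => line.includes('ERROR')).length
    ).then(counts => console.log(counts.reduce((a, b) => a + b, 0)))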

~~~
koolba
... Except it's not!

The "lambda" here isn't AWS Lambda. It's a locally executed function.

Now if this scheduled a bunch of real Lambdas to execute the work for each
object in the bucket, then yes, that'd be awesome.

~~~
simonw
Bah. My first impression was totally wrong in that case. Here's hoping someone
builds a version of this that executes magically in the lambda cloud.

~~~
illumin8
Well, you could run it on a large EC2 instance (x1.32xlarge?! :O) and it would
be running the lambdas in the cloud, technically... ;-)

------
hayd
See also AWS Athena?
[https://aws.amazon.com/athena/](https://aws.amazon.com/athena/)

~~~
dajohnson89
That seems cool, but paying per query (per TB scanned) frightens me. I imagine
having to fret about how efficient my queries are...

~~~
illumin8
It's not that bad. You can compress the data on S3 in ORC or Parquet format,
and you only pay for the compressed data you read, so 1 TB can become 130 GB
after compression. Plus, these formats store summary data, so queries like
SELECT COUNT(*) don't have to do a full table scan - they can read just a few
KB of summary metadata for the result.

~~~
dajohnson89
But that's a lot of work... just to have sane costs for _reads of your data_.

~~~
illumin8
It's actually just two commands:

1. hive

2. INSERT INTO parquet_table SELECT * FROM csv_table;

------
DenisM
So... the client-side code iterates S3 objects matching a certain filter, and
then schedules a lambda for each one of those objects. Is that right? Or is
the iteration procedure itself a lambda? Also, when you chain several
operators together, where does the chaining happen?

I'd like to understand where different parts of the code are being executed.

~~~
daviding
On a quick page-down through the code, I think this is not related to AWS
Lambda; it's more 'local lambda', where the map etc. is run locally.

------
avip
This is a nice project. For real-world use cases, we have good alternatives:

1. Migrate S3 => GCS and use BigQuery, which does support UDFs

2. Sign up for Databricks (I'm not affiliated)

3. (For the brave) Poke AWS support to implement UDFs on Athena

------
_Marak_
If anyone is interested in this same kind of architecture for multi-cloud
file-system providers (no cloud lock-in), please check out this project:
[https://github.com/bigcompany/hook.io-vfs](https://github.com/bigcompany/hook.io-vfs)

Used in production, but it could use some contributors.

------
kvz
Getting an index of (millions of) files on S3 is very slow for us - like,
days. Is there anything you do to work around this? It seems that since this
is not an AWS Lambda project, the client first has to acquire an index from S3
before concurrency benefits kick in?

~~~
wellsjohnston
This doesn't have anything to do with AWS Lambda; I'm thinking about renaming
it to "functional-s3" or something similar.

To answer your question, there isn't really a workaround for this yet,
although indexing should be much quicker than "days". All the keys are listed
recursively before running the lambda expression locally. If you have a huge
number of files, this can take several minutes, maybe hours depending on the
scope.

A workaround I've been considering is using a generator function to list the
keys; that way, the lambda expression can start immediately, generating keys
as it needs them.
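
Sketched with an async generator (names hypothetical), work can start after
the first page of keys instead of after the full listing:

    const AWS = require('aws-sdk')
    const s3 = new AWS.S3()

    // Yield keys page by page as the listing arrives.
    async function* keys(bucket, prefix) {
      let ContinuationToken
      do {
        const page = await s3.listObjectsV2({ Bucket: bucket, Prefix: prefix, ContinuationToken }).promise()
        for (const { Key } of page.Contents) yield Key
        ContinuationToken = page.NextContinuationToken
      } while (ContinuationToken)
    }

    // Stand-in for the per-object lambda expression.
    const handleKey = async (key) => console.log(key)

    async function run() {
      for await (const key of keys('my-bucket', 'logs/')) {
        await handleKey(key)
      }
    }
    run()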

------
cle
Is this susceptible to any of S3's eventual consistency constraints?

~~~
wellsjohnston
I'm not aware of S3's consistency constraints. What are those?

~~~
primax
[http://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel](http://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel)

------
wcdolphin
I think having the default be destructive for mapping is a strange design
decision. It's going to bite someone one day soon.

~~~
wellsjohnston
Good point...I will make destructive behavior opt-in.

------
dhpe
Really nice to have a generic functional interface to S3. Thanks.

------
stolendog
Where can you actually use it? In which cases? Can you provide examples?

~~~
wellsjohnston
Sure. We have application logs that come in to S3 and are stored by date
prefix. I have cron jobs that run Node scripts that do various
counts/statistics.
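
One of those scripts, roughly (bucket and prefix names hypothetical):

    const AWS = require('aws-sdk')
    const s3 = new AWS.S3()

    // Count one day's log objects under a date prefix.
    async function countObjects(bucket, prefix) {
      let count = 0
      let ContinuationToken
      do {
        const page = await s3.listObjectsV2({ Bucket: bucket, Prefix: prefix, ContinuationToken }).promise()
        count += page.KeyCount
        ContinuationToken = page.NextContinuationToken
      } while (ContinuationToken)
      return count
    }

    countObjects('app-logs', '2017-01-15/').then(n => console.log(n, 'objects'))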

