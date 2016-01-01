So many obvious innovations just aren't turning up.
For example, strangely, AWS introduced tagging for S3 resources, but you can't search or filter by tag, and the tags aren't even returned when you list objects; you can only get them with a per-object request. The word "pointless" springs to mind.
In fact it's strange that there is NO useful filtering at all apart from the very useful folder/hierarchy/prefix filtering. But apart from that you can't do wildcard searches or filters or date filters or tag filters.
I'm building an application right now that needs to get a list of all the jpg files - the only way to do that is to get every single object in the bucket and manually filter out the unwanted ones - feels like it's 1988 again.
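For anyone stuck doing the same thing, here is a minimal sketch of that client-side filtering in Python. The function name `jpg_keys` is made up for illustration, and the pages are assumed to be shaped like `list_objects_v2` responses (an optional `Contents` list of dicts, each with a `Key`).

```python
from typing import Iterable, Iterator


def jpg_keys(pages: Iterable[dict]) -> Iterator[str]:
    """Yield keys ending in .jpg/.jpeg from ListObjectsV2-style pages."""
    for page in pages:
        # "Contents" is absent when a page holds no objects.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.lower().endswith((".jpg", ".jpeg")):
                yield key
```

With boto3 the pages could come from `boto3.client("s3").get_paginator("list_objects_v2").paginate(Bucket="my-bucket")` (bucket name hypothetical). Every object still gets listed; only the filtering happens locally.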
It seems like it would also be valuable for there to be alternate interfaces to S3 such as the ability to send data via ftp or SMTP or sftp or whatever, but there are no such interfaces.
Hopefully Google will goad AWS into action on S3 innovation by implementing such features.
I learned this the hard way: We had an application where we made the mistake of storing about a billion files in a nearly flat structure (one level of nesting, probably 100m "folders" in the root). Then one day we needed to go through it to prune stuff that was no longer in use. Unfortunately, if you don't have a "shardable" prefix, list requests are impossible to parallelize efficiently (because you can't subdivide the work), and our scripts took weeks to run to completion. Hard-earned experience: If you're storing large quantities of stuff in S3, always pick a shardable prefix. The upload date is a good choice. A random string will also do.
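To make "shardable" concrete, here is a rough sketch, assuming keys begin with a random hex string. `list_prefix` is a hypothetical callable wrapping whatever listing call you use (for example a paginated `list_objects_v2` with `Prefix=...`); the worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

HEX = "0123456789abcdef"


def shard_prefixes(base: str = "", width: int = 2) -> List[str]:
    """Return every hex prefix of the given width under `base` (16**width of them)."""
    prefixes = [base]
    for _ in range(width):
        prefixes = [p + c for p in prefixes for c in HEX]
    return prefixes


def list_in_parallel(
    list_prefix: Callable[[str], Iterable[str]],
    prefixes: Iterable[str],
    workers: int = 16,
) -> List[str]:
    """Call list_prefix once per shard concurrently and merge the keys."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each shard is an independent listing, so the work subdivides cleanly.
        results = pool.map(lambda p: list(list_prefix(p)), prefixes)
        return [key for keys in results for key in keys]
```

Without a prefix scheme like this there is nothing to hand to the individual workers, which is exactly the situation we were in.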
After this, my solution for any non-trivially-sized storage use case is to store an inventory of objects separately in a performant PostgreSQL database, and make sure all writes go through a service layer that shields the consumer from the details of S3. This has some benefits over a hypothetical centralized approach (but some downsides, like the possibility that things get out of sync if you sidestep the inventory). Overall, I wish S3 would store its metadata in something like BigQuery.
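Roughly what I mean by a service layer, as a sketch only: `sqlite3` stands in for PostgreSQL and a plain dict stands in for the bucket so the example runs anywhere. The class, table, and column names are all made up.

```python
import sqlite3
from datetime import datetime, timezone


class ObjectStore:
    """Service layer: every write stores the blob, then records it in the inventory."""

    def __init__(self, blobs: dict, db: sqlite3.Connection):
        self.blobs = blobs  # stand-in for the S3 bucket
        self.db = db        # stand-in for the PostgreSQL inventory
        db.execute(
            "CREATE TABLE IF NOT EXISTS inventory ("
            "key TEXT PRIMARY KEY, size INTEGER, uploaded_at TEXT)"
        )

    def put(self, key: str, data: bytes) -> None:
        self.blobs[key] = data  # 1. store the object
        # 2. commit the inventory row; until this commits, the index lags the store
        with self.db:
            self.db.execute(
                "INSERT OR REPLACE INTO inventory VALUES (?, ?, ?)",
                (key, len(data), datetime.now(timezone.utc).isoformat()),
            )

    def find(self, pattern: str) -> list:
        """Search the inventory with a SQL LIKE pattern, without touching the store."""
        rows = self.db.execute(
            "SELECT key FROM inventory WHERE key LIKE ? ORDER BY key", (pattern,)
        )
        return [row[0] for row in rows]
```

Something like `store.find("%.jpg")` then answers the "all jpg files" question from the index alone. Anything written to the bucket behind the service's back never shows up, which is the out-of-sync downside.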
Anyone know if Google Cloud Platform's S3 equivalent, Cloud Storage, improves on these issues?
Sounds a bit like something they cooked up in a hurry to avoid having to design a BigQuery-type service for querying arbitrary metadata; I bet they had some huge customer with a need to get a CSV file for a bucket, that was willing to effectively bankroll the development of this feature.
But yes. That would sidestep the issue. You'd still have to turn on the feature and wait for the CSV file to build (apparently the best granularity is daily), of course, but it would help tremendously. Wish that had existed when we had our difficulties, about a year ago.
/path/to/big-dir/«lots-of-sequential-filenames»
?
If you don't use the evenly-distributed-prefix trick, your only chance of speeding it up is knowing the file names beforehand. If they're all sequentially numbered, you might do that, of course.
The shardable prefix doesn't need to be at the top level. So you could also organize it like so, for example:
/secret/documents/2016-01-01/00000001.doc
In your use case, consider `/path/to/big-dir/AA/AABB/AABBCC` or similar?
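A small sketch of how such a layout could be generated, assuming the directory levels are taken from the leading characters of the file name. `nested_key` and its parameters are hypothetical.

```python
def nested_key(base: str, name: str, levels: int = 2, width: int = 2) -> str:
    """Build base/AA/AABB/name-style paths from the leading characters of name."""
    # Level i uses the first width * (i + 1) characters of the name.
    parts = [name[: width * (i + 1)] for i in range(levels)]
    return "/".join([base.rstrip("/"), *parts, name])
```

`nested_key("/path/to/big-dir", "AABBCC")` gives `/path/to/big-dir/AA/AABB/AABBCC`. If the names are zero-padded sequence numbers, the leading characters are nearly all the same, so you would probably want to derive the levels from a hash of the name instead.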
I'll never do that though because I'd have to use DynamoDB, which is a technology that is high on my list of "technologies that I am least enthused about".
Also, I really shouldn't have to go to all the work of creating and maintaining a metadata database and implementing a query API just because I want to do searches more powerful than "list all objects" - that's Amazon's job.
It's still not something I want to do, mainly because I'd have to touch DynamoDB but secondly because, well, why the heck doesn't AWS do it?
An S3Query module would not, I think, make things harder for S3 users.
And frankly - it would be awesome.
I use S3 a lot, and am loath to switch to a DB if I can avoid it.
Some querying and indexing features I think would be taken up by a large number of devs.
The real problem with building a metadata index outside is that you then have the synchronization validation - yuk.
The DB is only incomplete for as long as it takes to commit to the SQL layer after storing successfully.
There are, however, ways to solve this: you could fire a Lambda function whenever an object is put into your S3 bucket that simply adds a single row to a DynamoDB table with the object name, along with any additional metadata you might like to capture to assure data provenance. Then, to search, you can simply query the DynamoDB table.
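A minimal sketch of such a function in Python, assuming the standard S3 event notification shape (`Records[].s3.bucket.name`, `Records[].s3.object.key`). The table name `s3-object-index` and the attribute names are made up, and the table's key schema is assumed to accept them.

```python
from urllib.parse import unquote_plus


def rows_from_event(event: dict) -> list:
    """Turn an S3 event notification into one index row per object."""
    rows = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        rows.append({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 event notifications.
            "key": unquote_plus(s3["object"]["key"]),
            "size": s3["object"].get("size", 0),
            "event_time": record.get("eventTime", ""),
        })
    return rows


def handler(event, context, table=None):
    """Lambda entry point: write each row to a DynamoDB table."""
    if table is None:
        import boto3  # imported lazily so the parsing logic is testable offline
        table = boto3.resource("dynamodb").Table("s3-object-index")  # hypothetical name
    for row in rows_from_event(event):
        table.put_item(Item=row)
```

Deletes would need a matching handler on the remove events, otherwise the table drifts from the bucket over time.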
As always, there are many basic building blocks at AWS, but you have to connect them together (like legos) before they become useful for most applications.
DOS is smarter than S3.
I mean, the README is excellently written and makes clear what the project does, so it's not a big deal beyond the ambiguous name.
Maybe functional-s3?
The ability to compose a map/filter chain and execute it in parallel against every object in an S3 bucket that matches a specific prefix - wow.
The set of problems that can be quickly and cheaply solved with this thing is enormous. My biggest problem with lambda functions is that they are a bit of a pain to actually write - for transforming data in S3 this looks like my ideal abstraction.
The "lambda" here isn't AWS Lambda. It's a locally executed function.
Now if this scheduled a bunch of real Lambdas to execute the work for each bucket then yes that'd be awesome.
Writing this was a necessity for me, being a 1-person data team coming from a Node.js background.
1. hive
2. INSERT INTO parquet_table SELECT * FROM csv_table;
I'd like to understand where different parts of the code are being executed.
Edit: This is not related to aws lambda...sorry for the confusion
1. Migrate s3 ==> gc and use BigQuery which does support udf
2. Register to databricks (I'm not affiliated)
3. (for the brave) poke aws support to implement udf on Athena
Used in production, but it could use some contributors.
To answer your question, there isn't really a workaround for this yet, although indexing should be much quicker than "days". All the keys are listed recursively before running the lambda expression locally. If you have a huge number of files, this can take several minutes, maybe hours depending on the scope.
A workaround I've been considering is using a generator function to list the keys; that way, the lambda expression can start immediately, generating keys as it needs them.
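The generator idea might look something like this, sketched in Python rather than the project's own language. `fetch_page` is a hypothetical callable that returns one ListObjectsV2-style page for a continuation token (`None` for the first page); this is not the project's actual code.

```python
from typing import Callable, Iterator, Optional


def iter_keys(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[str]:
    """Lazily yield keys, requesting the next page only when it is needed."""
    token = None
    while True:
        page = fetch_page(token)
        for obj in page.get("Contents", []):
            yield obj["Key"]
        # The next page is fetched only after this page's keys are consumed.
        token = page.get("NextContinuationToken")
        if not token:
            return
```

The consumer can start working on the first key after a single list request, instead of waiting for the full recursive listing to finish.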
The best way to prevent eventual consistency issues in s3 is to use immutable files. Then you have consistency-now.
They don't have explicit SLAs on this, unfortunately, but I've heard it rumored that internal pagers start firing when consistency falls behind on the order of hours.
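One way to get immutable files, as an assumption on my part rather than anything prescribed above, is to derive the key from the content so that a given key is only ever written with the same bytes. A sketch, with a hypothetical `immutable_key` helper:

```python
import hashlib


def immutable_key(data: bytes, prefix: str = "objects/", ext: str = "") -> str:
    """Derive the key from the content, so a key never points at changed data."""
    digest = hashlib.sha256(data).hexdigest()
    return f"{prefix}{digest}{ext}"
```

A new version of a file gets a new key rather than overwriting the old one, so readers never see a stale copy under a key they already know. As a side effect, the hex digest is also an evenly distributed prefix.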
