
Handling thousands of image upload per second with Amazon S3 - asiddique
https://medium.com/udroppy/handling-thousands-of-image-upload-per-second-with-amazon-s3-7a1009e8ffc4
======
ignoramous
Oh well, there are a lot of things you could do to speed up 'large volume'
uploads to S3 (server side):

1. Faster crypto impls for the platform (this will likely be the biggest win
in terms of CPU usage per upload, if you're doing client-side encryption).

1a. Use Google Conscrypt as your JVM SecurityProvider, for instance.

2. Attach an S3 gateway endpoint to your VPC to avoid having to traverse the
public internet to hit S3 front-ends.

3. Force-resolve DNS per S3 upload request at request time. S3 vends
different answers depending on the load at the front-ends.

3a. The assumption is that the cost of just-in-time name resolution + creating
new connections would be offset by higher throughput from connecting to S3
front-ends with lower load on them.

4. Pre-warm your S3 bucket(s) to achieve higher throughput (the number of S3
partitions allowed per bucket is essentially infinite, but each carries a cap
of 3000 write IOPS per partition, iirc).

5. Avoid tiny files like the plague. You could try to zlib them up, but then
retrieval isn't trivial anymore and requires compute.

6. Use EC2 instances with enhanced networking that have dedicated 25GbE/40GbE
links going out to S3, if you must use EC2.

7. Use stream/batch APIs wherever you can (delete, multi-part upload); a
sketch of the multipart case follows this list.

8. Use the latest AWS SDK.
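
As a rough illustration of item 7, here's a minimal sketch of a multipart
upload using boto3's transfer layer; the bucket and file names are made up
for the example. boto3 splits the file into concurrent parts once it crosses
the configured threshold:

    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3")

    # Files above 8 MB are split into 8 MB parts and uploaded on
    # 10 concurrent threads (S3 multipart upload under the hood).
    config = TransferConfig(
        multipart_threshold=8 * 1024 * 1024,
        multipart_chunksize=8 * 1024 * 1024,
        max_concurrency=10,
    )

    # "my-uploads" and the key are hypothetical names.
    s3.upload_file("big-image.tiff", "my-uploads", "images/big-image.tiff",
                   Config=config)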

Of course, there are new tools and services in town: the S3 SDK
TransferManager API, S3 Transfer Acceleration, AWS DataSync, AWS Transfer for
SFTP, and S3 Batch Operations, which are much more hands-off and preferable
to hand-rolling your own highway.

------
Nursie
While he's right that S3 is the modern way, and it's likely to be cheaper than
doing what I'm about to say, in the File System piece he says this -

> Each time you have to spawn a new instance of your application you will also
> have to copy all the images in the new instance.

Which is a little odd. This is why we have things like SANs, enterprise
storage, etc., and the separation of data and application servers. Scaling
can be achieved by rolling out more application servers, which are linked on
the back end to a high-speed network storage system.

Or have I missed something?

It still won't be quite as good as passing off the load to someone else
entirely, but it's a filesystem option that's nowhere near as bad.

I guess this dates from the prehistoric era when people managed their own
infrastructure, though. I miss that sometimes; modern devops seems to be a
bit of a mess, with a lot of the people claiming to be devops engineers not
really having the skills they need.

------
geekuillaume
Presigned requests are great; you can also use multipart presigned URLs to
upload a large file in multiple chunks and resume the upload if it's
interrupted.

One clarification: you don't need to contact S3 to get a presigned URL; it's
a cryptographic operation most libs do on your server. This means that if
your key is invalid for any reason, the URL will still be signed, but the
error will only be raised when the client starts the upload. You need to
check for this in your code.
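
A minimal sketch of that local signing with boto3 (bucket and key names are
invented for the example); generate_presigned_url makes no network call, so a
bad credential only surfaces when the client actually PUTs to the URL:

    import boto3

    s3 = boto3.client("s3")  # credentials from the usual env/config chain

    # Purely local HMAC signing: no request to S3 happens here, so an
    # invalid key still produces a syntactically valid URL.
    url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "my-uploads", "Key": "images/cat.jpg"},
        ExpiresIn=3600,  # URL valid for one hour
    )

    # The client then uploads straight to S3, e.g.:
    #   curl -X PUT --upload-file cat.jpg "<url>"
    # Signature/credential errors only appear at this step.
    print(url)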

~~~
have_faith
> you don't need to contact S3 to get a presigned URL

Does this mean that, in the example code on AWS where they import the AWS S3
library to create a presigned URL, the library just generates it in situ
instead of firing off a request to S3 asking for the URL?

~~~
philliphaydon
You don't even need the AWS SDK to do the generation; you could roll your
own. (I had to do that a long time ago when building a Windows Store app for
Windows 8, and the SDK didn't support it.) If you pull in their SDK, it's
easier.

One of the benefits of signed URLs is that you can actually include metadata
in the URL, which will be attached as attributes on the S3 object.

For example, I had a UI which would allow a user to enter data for a photo
they uploaded. When they hit submit, it would save the data to the database
and return a signed URL. The signed URL would contain the id of the record in
the database. Then when the file was uploaded to S3, I could raise an event
into a Lambda, and that would give me the attributes pointing to the record
in the database. So I could process the image into a thumbnail, then ping a
URL with an update for the file, and never need to 'figure out' which record
in the database it belonged to. The user cannot change the id because the id
is signed as part of the request.
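
A hedged sketch of that pattern with boto3 (the bucket name and record id are
hypothetical): the metadata is baked into the signature, so the client has to
send the matching x-amz-meta-* header and can't alter its value:

    import boto3

    s3 = boto3.client("s3")
    record_id = "12345"  # hypothetical database record id

    # The Metadata dict is part of what gets signed, so tampering with
    # it client-side invalidates the signature.
    url = s3.generate_presigned_url(
        "put_object",
        Params={
            "Bucket": "my-uploads",  # hypothetical bucket
            "Key": f"photos/{record_id}.jpg",
            "Metadata": {"record-id": record_id},
        },
        ExpiresIn=900,
    )

    # The client must upload with the matching metadata header:
    #   curl -X PUT -H "x-amz-meta-record-id: 12345" \
    #        --upload-file photo.jpg "<url>"
    # A Lambda subscribed to s3:ObjectCreated can then read record-id
    # back from the object's metadata.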

------
blowski
The OP’s solution is the backend getting a pre-signed URL from S3 and giving
it to the front-end, so the file doesn’t need to touch the back-end servers -
it goes straight to S3.

This is what I do, and it works great, with one caveat: it tightly couples
you to S3 (or at least a service that can give pre-signed URLs). That makes
non-production workflows more complicated. So I’d recommend using it only if
you’re uploading massive files (e.g. > 50MB) or lots of files, where the
extra hassle is worth it.

~~~
philliphaydon
Why does it make non-production workflows difficult? You could do many things
for a non-production environment: use a different bucket for development, or
abstract the generation of the URL away so local development just uses
another URL and uploads to your own server.
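
One way to read that suggestion, sketched in Python (all names here are
invented): hide URL generation behind a small interface so non-production
environments swap in a local implementation:

    import boto3

    class S3UploadUrls:
        """Production: sign URLs against a real S3 bucket."""
        def __init__(self, bucket: str):
            self._s3 = boto3.client("s3")
            self._bucket = bucket

        def url_for(self, key: str) -> str:
            return self._s3.generate_presigned_url(
                "put_object",
                Params={"Bucket": self._bucket, "Key": key},
                ExpiresIn=3600,
            )

    class LocalUploadUrls:
        """Development: point uploads at a server you run yourself."""
        def url_for(self, key: str) -> str:
            return f"http://localhost:8000/uploads/{key}"

    # Wire up per environment; callers only ever see url_for().
    uploads = LocalUploadUrls()  # or S3UploadUrls("my-dev-bucket")
    print(uploads.url_for("images/cat.jpg"))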

------
iamleppert
It’s also a good idea to use content-addressable hashing: hash the file first
on the client and check whether the file already exists on the server. You
can create a resumable upload process by doing the same per part.
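
A minimal sketch of the existence check, assuming SHA-256 digests are used as
content-addressed keys (the bucket and key prefix are made-up names):

    import hashlib
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")

    def sha256_of(path: str) -> str:
        """Hash the file so its digest can serve as the object key."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def already_uploaded(digest: str) -> bool:
        """HEAD the content-addressed key; a 404 means upload is needed."""
        try:
            s3.head_object(Bucket="my-uploads", Key=f"blobs/{digest}")
            return True
        except ClientError as e:
            if e.response["Error"]["Code"] == "404":
                return False
            raise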

------
tqkxzugoaupvwqr
Are there other services besides AWS that support uploads with presigned URLs?

~~~
foota
I know Google Cloud Storage does:
https://cloud.google.com/storage/docs/access-control/signed-urls

I wouldn't be surprised if Azure had a similar feature.

------
ykevinator
That client offload bit is awesome. Thanks for a great writeup.

------
tsurkoprt
Why not just use www.lucidlink.com? You get all the benefits you need.

