
How to Save 90% on your S3 Bill - trjordan
http://www.appneta.com/blog/s3-list-get-bucket-default/
======
jameskilton
Yikes! That's a horrible thing for a library to do to its users. This
definitely should be changed in the library.

I noticed that no such Issue exists, so I opened one.
[https://github.com/boto/boto/issues/2078](https://github.com/boto/boto/issues/2078)

~~~
UnoriginalGuy
I kind of agree with garnaat's reply: if they just suddenly change the default
from True to False, they're going to break backwards compatibility for anyone
using the library, and worse still, in a really subtle way!

All they can really do at this stage is add a warning to the documentation and
hope that new people using the library figure out the significance.
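
One conventional migration path (a hypothetical sketch, not anything the boto
maintainers have proposed) is a sentinel default that warns callers who rely on
it, so the default can be flipped safely in a later release:

    import warnings

    _UNSET = object()  # sentinel: detects that the caller didn't choose

    # Sketched outside its class for brevity; _real_get_bucket is invented.
    def get_bucket(self, name, validate=_UNSET, headers=None):
        if validate is _UNSET:
            warnings.warn("get_bucket() may default to validate=False in a "
                          "future release; pass validate explicitly",
                          PendingDeprecationWarning)
            validate = True  # preserve today's behavior for now
        return self._real_get_bucket(name, validate, headers)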

~~~
Negitivefrags
That's why you should avoid using default arguments in library functions.

~~~
jhenkens
Great point! I have never heard that, but it makes perfect sense. Hopefully I
will remember in the future.

~~~
reconbot
I don't agree; you should just know what they are. Defaults are helpful.

~~~
tnorthcutt
Until they change, and your code breaks, and you have to go figure out why.
The point of not (silently) using defaults is that as long as you're explicit
with your arguments, you won't get surprised by changing defaults. Of course,
you might get surprised by any number of other changes, but you can at least
reduce your failure surface area a little bit :-)
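
For the case at hand, the difference is one argument (boto 2.x usage; the
bucket name is a placeholder):

    import boto

    conn = boto.connect_s3()  # credentials from env/boto config

    # Implicit default: costs a LIST-priced call today, and the behavior
    # changes silently if the library ever flips the default.
    bucket = conn.get_bucket('my-bucket')

    # Explicit: immune to a changed default, and skips the LIST call.
    bucket = conn.get_bucket('my-bucket', validate=False)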

------
saurik
I found this exact same issue a few years ago after determining I had a
reasonably-long-existing $75/day "leak" in my S3 expenses :/. (Seriously:
$75/day, a number I am not exaggerating at all.)

> 09:00:04 < saurik> dls: I am performing millions of ListBucket request to
> this one amazon bucket every day

> 09:00:25 < dls> d'oh

> 09:00:28 < saurik> adding up to almost 1.5 gigabytes of data traffic in/out
> on just those requests

> 09:00:41 < saurik> I have NO CLUE what could POSSIBLY be doing even a SINGLE
> ListBucket request on that bucket

> 09:00:49 < dls> LOL

~~~
goldenkey
Even the infamous saurik got floored. _closes jaw_

------
nathancahill
Also see MimicDB [0]. Runs a transparent cache that responds to most S3 API
calls locally using Redis to store metadata. Besides the cost savings, it's
extremely fast. Listing an entire bucket with millions of objects is close to
instantaneous.

[0] [http://mimicdb.com/](http://mimicdb.com/)
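
For anyone curious how such a cache hangs together, here is a minimal sketch of
the idea: a Redis hash of per-bucket key metadata, consulted before ever
touching S3. This is illustrative only, not MimicDB's actual API, and
expensive_s3_list is an invented stand-in for a real LIST call:

    import redis

    r = redis.Redis()  # assumes a local Redis instance

    def cached_list(bucket_name):
        """Answer LIST-style queries from Redis metadata when possible."""
        meta = r.hgetall('s3meta:%s' % bucket_name)  # {key_name: size}
        if meta:
            return list(meta)  # local and free, near-instantaneous
        keys = expensive_s3_list(bucket_name)  # real LIST, costs money
        for k in keys:
            r.hset('s3meta:%s' % bucket_name, k.name, k.size)
        return [k.name for k in keys]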

~~~
zimbatm
It also adds another piece of infrastructure that needs to be maintained and
can go down. Not necessarily the best option for everyone.

~~~
pipeep
While the caching would be more effective with one large centralized instance,
I think the intended use case is to have one cache per server. So then it's
not really extra infrastructure.

~~~
nathancahill
Actually, you can do it either way since it's backed by Redis. You can set all
servers to connect to the same Redis instance, or run them all individually.

~~~
cmircea
What if Redis goes down?

~~~
nathancahill
"What if X goes down?" is my new favorite straw man on HN.

Regardless, it's just a caching layer, and requests are passed through to the
API in that case.

------
jlas
> One utility in general that’s provided us with an easy way to slice up and
> investigate our AWS spending is the awesome Asgard.

The tool they're actually referring to is Ice (also by Netflix) [1]. Asgard is
a different Netflix tool, used to manage AWS deployments and auto-scaling.

[1] [https://github.com/Netflix/ice](https://github.com/Netflix/ice)

~~~
quarterto
Now mentioned in the article.

~~~
jessemdavis
I'm the article author, and yes, we're using or evaluating both services. I
find myself swapping the two all the time in conversation. Pretty silly typo
on my part :)

------
HeyImAlex
I _really_ hated working with boto for S3. The abstractions they chose are
really ambiguous, and it feels like the library is fighting against the
underlying API. If you're building anything on top of S3, it might be easier to
write a thin S3 REST client and use it directly instead of going through boto:
far fewer surprises, and no more digging around in the boto source trying to
figure out what such-and-such function _actually_ does.
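
For a sense of scale, a thin client really is small. A minimal sketch of a
signed GET using the era-appropriate Signature Version 2 scheme (Python 2 to
match boto 2.x; no error handling):

    import base64, hashlib, hmac, httplib
    from email.utils import formatdate

    def s3_get(access_key, secret_key, bucket, key):
        date = formatdate(usegmt=True)
        # SigV2 string-to-sign: VERB, Content-MD5, Content-Type, Date, resource
        to_sign = 'GET\n\n\n%s\n/%s/%s' % (date, bucket, key)
        sig = base64.b64encode(
            hmac.new(secret_key, to_sign, hashlib.sha1).digest())
        conn = httplib.HTTPSConnection('%s.s3.amazonaws.com' % bucket)
        conn.request('GET', '/' + key, headers={
            'Date': date,
            'Authorization': 'AWS %s:%s' % (access_key, sig),
        })
        return conn.getresponse().read()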

~~~
eiopa
I would look at tinys3:
[https://github.com/smore-inc/tinys3](https://github.com/smore-inc/tinys3). It
was motivated by exactly the same reasons.

(disclaimer: I used to work at Smore, and I'm friends with the author, but
I've been burned by Boto myself)

------
notfunk
Interestingly enough, here's the original commit that defaulted
`validate=True`:

[https://github.com/boto/boto/commit/95939debc3813468264159d5...](https://github.com/boto/boto/commit/95939debc3813468264159d5feb7a3333ae82070)

__EDIT__: Looks like the original committer is an Amazon employee.

~~~
nathancahill
Before you jump on the conspiracy train, most Boto contributors are Amazon
employees.

~~~
notfunk
Agreed, I doubt Amazon had any evil intentions with this change.

I wonder instead whether this was a product of Amazon developers using S3 (i.e.
dogfooding) and not noticing the cost side effect, since I assume they don't
get billed?

~~~
jeffbarr
We actually do get billed, but we don't have to pay. I always check my bill to
make sure that I am not using any resources that I don't need.

I also pay for my own personal EC2 instance and about 350 GB of S3 storage.
Being a genuine user and customer of AWS helps me to be a better employee.

~~~
toomuchtodo
Jeff,

You may want to check LIST request statistics over the next few weeks. Between
this thread, the new boto issue, etc., I'm curious whether you'll see a
noticeable decline in LIST requests from the attention this has brought.
Purely from a data standpoint.

------
ducci
This has been in the boto library for years, across multiple services. I
noticed it when first using SimpleDB: switching it on in a production
environment was much more expensive than we had originally calculated. I
spotted the bizarre "Domain Validate" calls after poring through logs of all
boto activity.

The boto guys have justified it in the past:
[https://groups.google.com/forum/#!topic/boto-users/1DVfbo4CD...](https://groups.google.com/forum/#!topic/boto-users/1DVfbo4CDW4)

I still don't agree with their reasoning for leaving it on by default: once you
try to do something on a non-existent domain/bucket it will throw an error
anyway, and I would argue the "extra work" is much cheaper than leaving on
these defaults, which I expect are completely redundant for most users.

------
acdha
Good news for django-storages users – this is off by default:

[https://bitbucket.org/david/django-storages/src/cb7366693ce1...](https://bitbucket.org/david/django-storages/src/cb7366693ce16d462b48535e575561f4d2efad0c/storages/backends/s3boto.py?at=default#cl-332)

~~~
djm_
To further add: the PyPI version hasn't been updated since March 2013, but the
last time the relevant lines were changed was January 2013, so you should
still be good with the PyPI version (assuming changes from then went out in
the last release).

The setting in question is: AWS_AUTO_CREATE_BUCKET.
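
So pinning it in a project is a one-liner (sketch; per the linked source,
leaving this setting off is what keeps the backend from validating/creating
the bucket):

    # settings.py
    AWS_AUTO_CREATE_BUCKET = False  # the default in recent django-storages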

------
cuu508
I discovered the 'validate' argument just recently. My concern was not cost
but latency: my server generates an HTML page with signed URLs to resources on
S3, and the 'get_bucket' call adds latency because it contacts S3. I was
wondering whether it really needs anything from S3 to generate signed URLs
when I already know the exact key names and am pretty sure the bucket exists.
It does not, and adding validate=False sped things up noticeably.
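
The pattern looks roughly like this (boto 2.x from memory; new_key() builds
the Key object locally and generate_url() only signs, so neither should touch
the network; names are placeholders):

    import boto

    conn = boto.connect_s3()
    bucket = conn.get_bucket('my-bucket', validate=False)  # no round-trip

    key = bucket.new_key('reports/2014-01.pdf')
    url = key.generate_url(expires_in=3600)  # signed URL, valid for an hour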

------
rmcpherson
Because of the lack of fast and reliable S3 clients out there, I built
[https://github.com/rlmcpherson/s3gof3r](https://github.com/rlmcpherson/s3gof3r),
which can do over 1 Gbps with ease, for both multipart uploads and parallelized
downloads. The killer feature, though, is streaming, which enables things like
`gof3r get -b <bucket> -k <key> | tar -x` to extract tarred directories or
feed any other streaming application. It also provides end-to-end MD5
integrity checking. For objects over a few GB, I haven't found anything
matching it in speed or reliability.

~~~
zargon
From when I was looking into doing something similar, I recall s3 needing to
know the content-length of upload parts up front. How do you handle that for
streaming? Do you buffer in memory up to the max part size so that you can
give the correct content-length header for the last part? I ask because my
uses would include low-memory VMs so I'm curious about the memory overhead.

~~~
rmcpherson
You don't need the content length of what you are uploading for multipart
uploads; see
[http://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadIniti...](http://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadInitiate.html).
For each part of a multipart upload that is sent, however, you do need to
include the content-length header, so that may be what you are referring to.
Parts have a minimum size, set by Amazon, of 5 MB. With
[https://github.com/rlmcpherson/s3gof3r](https://github.com/rlmcpherson/s3gof3r),
the memory overhead of a streaming upload is approximately part size *
concurrent uploads (plus a couple of buffers in the pool). So, for example, if
you configure the part size to 20 MB and set concurrent uploads to 10, that
would be about 220 MB of memory usage.

~~~
zargon
Yeah, having to buffer the stream in order to set the correct value for the
content-length header on the last part of the multipart upload is what I was
referring to. Your example using a 20 MB part size is quite feasible. Thanks!
Is it safe to use the master branch? Do you have build instructions? I haven't
used go before.

~~~
rmcpherson
We use the version in the master branch at CodeGuard to transfer many
terabytes into and out of S3 daily. While I hesitate to call it "production-
ready", as that means different things to different people, we do use it in
production with no issues. If you want to try it without having to install Go,
there are statically linked binaries for OS X and Linux amd64 linked on the
GitHub page that should run on any distro without installing any dependencies.
If you have any issues or questions, feel free to contact me at the email in
my HN profile. I hope it works for your use case! :)

------
engates
This is one of the reasons why Rackspace simplified pricing of Cloud Files
from the start. No fees for PUT, POST, LIST, HEAD, GET, DELETE...no extra fees
for Akamai CDN requests. Very simple with no hidden fees that surprise you at
the end of the month.

~~~
rmc
Those fees for "operations" are there for a good reason: otherwise, we smart
techies would hack it.

I heard a talk by someone at a mega tech company that runs its own internal
cloud for its teams and "charges" each team based on usage. One team stored
lots of files with 12,000-character filenames and zero content; since the
company only "charged" for file size, that team's bill was tiny!

------
match
If you read the source code, get_bucket calls a method called get_all_keys.
Be aware that this does NOT get all the keys in the bucket: it is passed
maxkeys=0, which means no keys are returned and a single LIST call is made.

Yes, it is still a waste of money, but make sure you understand that it's not
actually listing your entire bucket.
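
The relevant boto code is roughly this (paraphrased from the 2.x source, so
treat details as approximate):

    def get_bucket(self, bucket_name, validate=True, headers=None):
        bucket = self.bucket_class(self, bucket_name)
        if validate:
            # maxkeys=0: one ListBucket round-trip that returns no keys.
            # An existence check only, but still billed at LIST rates.
            bucket.get_all_keys(headers, maxkeys=0)
        return bucket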

------
penguindev
That's a crap default and a crap name. It should be prefetch_all_keys=False.
(edit: and some documented reason WHY you would want to do such a thing)

I ran into this recently when making my own s3 sync tool, because the commonly
used tool is completely broken (requires something called a 'config file' to
function). But I didn't pay it too much mind, because I forgot the price
discrepancy for ListBucket calls.

PS if you want to see what boto is doing do this:
logging.basicConfig(filename="boto.log", level=logging.DEBUG)

~~~
masklinn
> It should be prefetch_all_keys=False.

It does not prefetch any keys (maxkeys is set to 0); it performs a query on
the bucket to validate that the bucket exists, and blows up if it does not.
With validate=False, you can call get_bucket and get a bucket object where no
remote bucket exists.

~~~
sltkr
It sounds like an annoying limitation of the API that (apparently?) you can't
cheaply validate whether a bucket exists.

Two manual workarounds come to mind (a sketch of the second follows below):

- store a list of created buckets as keys in another bucket.

- store a dummy file in each bucket you create.

Either method lets you check the existence of the bucket with a GET request
rather than a more expensive LIST request, but both are hackish. This seems
like functionality S3 should already provide cheaply.
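
The second workaround is nearly a one-liner with boto (hedged sketch:
get_key() issues a HEAD request, which S3 bills at GET rates, and the
'.exists' marker name is invented):

    bucket = conn.get_bucket('my-bucket', validate=False)
    # HEAD the dummy object written at bucket-creation time;
    # get_key() returns a Key if it exists, None otherwise.
    bucket_exists = bucket.get_key('.exists') is not None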

~~~
masklinn
> It sounds like an annoying limitation of the API that (apparently?) you
> can't cheaply validate whether a bucket exists.

It looks like there is one now:
[http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketHEA...](http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketHEAD.html)

I'm guessing (hoping?) that didn't exist back when the feature was added to
Boto, 7 years ago:
[https://github.com/boto/boto/commit/8410c365ee0120e073bf00bd...](https://github.com/boto/boto/commit/8410c365ee0120e073bf00bdc9b913a0575d54f7)
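
With HEAD Bucket available, the check needs none of the hacks above. A sketch
against the raw endpoint (Python 2; 200 and 403 both imply the bucket exists,
404 means it does not):

    import httplib

    def bucket_exists(bucket_name):
        conn = httplib.HTTPConnection('%s.s3.amazonaws.com' % bucket_name)
        conn.request('HEAD', '/')
        status = conn.getresponse().status
        conn.close()
        # 200 = accessible, 403 = exists but forbidden, 404 = no such bucket
        return status in (200, 403)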

------
midas007
How to save 95%: don't use AWS long-term; buy dedicated hardware.
[http://blog.backblaze.com/2013/02/20/180tb-of-good-vibrations-storage-pod-3-0/](http://blog.backblaze.com/2013/02/20/180tb-of-good-vibrations-storage-pod-3-0/)

~~~
dclara
Thank you. For the long-term medium data center, it's a good choice. But other
then the hardware disk space part, there are the hosting service part which at
least provides a stable Linux instance and static IPs, and it' must be up and
running 24/7.

------
eric_khun
Idea: a tool that analyzes your code and suggests improvements to save money
on your AWS bills.

------
fours
The interesting thing is that the Boto S3 tutorial
([http://docs.pythonboto.org/en/latest/s3_tut.html](http://docs.pythonboto.org/en/latest/s3_tut.html))
uses s3.get_bucket() with impunity and never mentions you're paying 13.5x more
for your usage if you use that bucket object for just one GET and don't pass
validate=False. (Perhaps someone from AWS wrote that tutorial? :P) Probably
deserves a mention there, though.

------
shlomiatar
We ended up writing an S3 library out of our frustration with boto's slowness,
and it has a nicer, requests-inspired interface too.

[https://github.com/smore-inc/tinys3](https://github.com/smore-inc/tinys3)

------
cmer
Is there something similar to this that exists that also caches s3 objects
locally?

~~~
nathancahill
Check out MimicDB [0]

[0] [http://mimicdb.com/](http://mimicdb.com/)

~~~
cmer
It stores everything but the object itself.

~~~
nathancahill
Yes. If you're storing the objects on your server, why would you use S3?

~~~
cmer
Caching. We request the same objects quite frequently and bandwidth is killing
us.

------
gkumartvm
Or you can move from S3 to some cheap servers with a lot of bandwidth.

~~~
kennywinker
There are a lot of use cases where S3 makes sense. For a small operation (one
or two people), the admin time cost alone makes S3 pretty appealing. Also, the
cost of a colo + server + redundant drives is a big initial spend when I have
no way to know if my idea/app is going to get traction. Self-hosted may be
cheaper at larger scales, but for a couple of GB, S3 is going to win every
time.

------
tbarbugli
That's a good catch! Thanks man :)

------
valtron
You only need to worry about this if you're doing 100k+ LIST requests. For me,
turning off validate would save 0.2%.

------
pw
I wonder how much savings we're talking about. I always worry when engineers
start talking about saving money.

~~~
dkuebric
Depends on your scale: another commenter on this thread was seeing $75/day
from this issue alone.

I don't worry about it; cultivating a connection between engineering and
business needs is a good thing.

------
HeyImAlex
Are you sure this counts as a LIST request? Technically, the HTTP method of
the list-bucket API call is GET.

