

Amazon S3’s Business Model is Arbitragable, and the Future of Cloud Storage - adamsmith
http://blog.adamsmith.cc/2010/03/amazon-s3s-business-model-is-arbitragable-and-the-future-of-cloud-storage.html

======
lsc
As far as I can tell, getting good 'de-duplication' technology is more
expensive right now than just buying disks.

Also, how much would de-duplication help? How much of what's stored on
S3 is uncompressed (or is the same compressed file)?

Sure, you could sell compressed/de-duplicated S3 storage, but do you
really think you could even get 50% savings? I've done the math, and I
could turn a profit renting uncompressed drive space at
$0.05/gigabyte/month, even at my scale. Granted, not a huge profit, but
something.
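
The rough shape of that math, for the curious (round illustrative
numbers, not my actual books):

    # Back-of-the-envelope cost model. All numbers are illustrative
    # assumptions: a 1.5 TB drive at ~$100 (2010 street price),
    # amortized over 3 years, with a rough multiplier for power, colo
    # space, chassis, and mirroring overhead.
    drive_gb = 1500.0
    drive_cost = 100.0        # USD, assumed
    lifetime_months = 36
    overhead_factor = 3.0     # power/colo/chassis/mirroring, assumed

    cost = drive_cost * overhead_factor / (drive_gb * lifetime_months)
    print("cost: $%.4f/GB/month" % cost)                    # ~$0.0056
    print("margin at $0.05: %.0f%%" % (100 * (1 - cost / 0.05)))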

This is the thing, I think: there is much profit to be had buying disks
and renting them out right now, if you can charge Amazon's prices.
Amazon does have a pretty massive economy-of-scale advantage, but they
are not passing those savings down to the end users.

Costs in the outsourced infrastructure market, as far as I can tell, have
always been dominated by marketing. The difference with 'cloud infrastructure'
seems to be that large corporations are trying to change that, but they aren't
passing down much by way of savings to the end users.

~~~
adamsmith
> As far as I can tell, getting good 'de-duplication' technology is more
> expensive right now than just buying disks.

Working at the abstraction layer of disks is misleading because there are so
many other pieces that go into making a service like S3.

Instead I'm proposing an arbitrage that works one layer up -- on top of S3.

> Also, how much would de-duplication help? I mean, how much stuff is stored
> on s3 that is uncompressed? (or the same compressed file?)

It's a good question, and the answer is unclear. Certainly if you're
willing to get exotic with redundancy elimination (of which dedup is a
subset) and large-data-set compression algorithms, there are going to
be economies of scale. What's not clear is (a) how large those
economies become, or (b) how exotic your algorithms have to be to
capture them.
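
To make "redundancy elimination" concrete, the standard building block
is content-defined chunking: cut boundaries based on local content, so
shared byte ranges line up even when files differ elsewhere. A toy
sketch (the rolling hash and parameters are illustrative, nothing like
production quality):

    import hashlib

    MIN_CHUNK = 2 * 1024   # bytes; illustrative
    MASK = (1 << 13) - 1   # cut when low 13 bits are zero: ~8 KB avg

    def chunks(data):
        # Toy rolling hash: each byte's influence shifts out of the
        # 32-bit window after 32 steps, so cut points depend only on
        # local content -- an insertion upstream doesn't shift every
        # downstream boundary.
        h, start = 0, 0
        for i, b in enumerate(data):
            h = ((h << 1) + b) & 0xFFFFFFFF
            if i + 1 - start >= MIN_CHUNK and (h & MASK) == 0:
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]

    def fingerprints(data):
        # Identical chunks hash identically, so byte ranges shared
        # across customers collapse to one stored copy plus references.
        return [hashlib.sha1(c).hexdigest() for c in chunks(data)]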

> ..I could turn a profit renting uncompressed drive space at
> $0.05/gigabyte/month..

I wouldn't want to compete with Amazon S3. Data safety and uptime would keep
me up at night like a mofo. I just wouldn't want to run that company, though I
can imagine folks like MS doing it.

> Costs in the outsourced infrastructure market, as far as I can tell, have
> always been dominated by marketing.

This is a really interesting point. The landscape is obviously changing
with the current batch of cloud infrastructure services.

I do think there are pieces of cloud storage that will not be commoditized,
most notably the peace of mind around data safety and uptime. I wouldn't want
to use a small cloud storage provider. Thus: (a) they will be able to charge a
premium for a long time, and (b) there will be a smallish number of providers,
meaning marketing will be less important in the future than having a good
product.

(I'm not sure how commoditizable the APIs and other development tools
will be. MSFT historically does a great job of embrace & extend when it
comes to those parts, though their recent cloud services are too wedded
to Windows and .NET.)

(EDIT: added responses to more of your original post.)

~~~
lsc
> Instead I'm proposing an arbitrage that works one layer up -- on top
> of S3.

Is transfer from S3 to EC2 free? If so, that might work. Otherwise
you'd get eaten alive by bandwidth charges between S3 and your
compression/decompression box.

> I wouldn't want to compete with Amazon S3. Data safety and uptime
> would keep me up at night like a mofo. I just wouldn't want to run
> that company, though I can imagine folks like MS doing it.

If you are storing data compressed in your special way on S3, you have
the same problems. Using non-ECC RAM (or ECC RAM configured not to halt
on errors) and have a bad stick in one of your compression boxes? Data
is corrupted. Gone. Compression boxes go down? Now you have your
dreaded uptime issues. And of course, if there is some bug in your
compression/deduplication/whatever software, again, you have data loss.

I have personally lost much money and reputation trying to make my disk
system 'better' by adding complexity and flexibility. After that pain,
I now just try to keep the data as simple as possible (mirrors, and
stripes of mirrors). I haven't had data loss since I abandoned my SAN.
Complexity == admin error == data loss. In a reasonably designed
hardware system, data loss is pretty rare without admin error.

> I do think there are pieces of cloud storage that will not be
> commoditized, most notably the peace of mind around data safety and
> uptime. I wouldn't want to use a small cloud storage provider. Thus:
> (a) they will be able to charge a premium for a long time, and (b)
> there will be a smallish number of providers, meaning marketing will
> be less important in the future than having a good product.

If S3 costs keep going down 15% a year (and hard drive costs keep going
down 50% a year), S3 will very shortly become a 'premium' provider.
While yes, there is room for premium providers, it will be a whole lot
cheaper for anyone with significant storage needs to run their own
stuff (or to go with a smaller player).

I mean, paying premium prices makes sense sometimes, but not all the time. If
you are speaking of premium markets, really, I'm the wrong guy to ask; I'm way
over on the other end of the spectrum.

edit: added more "simple is better"

~~~
adamsmith
Very good points, all of them!!

You are absolutely right that you would have to get serious about data
integrity. ECC is such a good example.
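
At a minimum you'd checksum the plaintext outside the compression path
and verify on every read, so a flaky box fails loudly instead of
corrupting silently. A minimal sketch (function names are mine, purely
illustrative):

    import hashlib
    import zlib

    def pack(plaintext):
        # Digest computed before the data enters the compression path.
        digest = hashlib.sha256(plaintext).hexdigest()
        return zlib.compress(plaintext), digest  # store both on S3

    def unpack(blob, digest):
        plaintext = zlib.decompress(blob)
        if hashlib.sha256(plaintext).hexdigest() != digest:
            # A flipped bit anywhere in the pipeline fails loudly here.
            raise IOError("integrity check failed; refetch or restore")
        return plaintext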

------
ramanujan
This analysis is flawed.

1) Most data in the cloud is going to be proprietary. How much is GE's
internal factory data going to overlap with Starbucks' financials?

2) You would need global access at the byte level to truly dedupe systems. So
the 50 biggest companies on S3 are giving some random company read access to
all their data? And allowing them to compare it to all the data of their
competitors?

3) And to do what? To save a few percent max in storage costs on S3?
That's not going to dominate your cost structure by a long shot.

Outside of a web crawl I doubt there is that much redundancy.

The only one who can dedupe at scale and with trust would be Amazon
themselves, and they would only do it if it weren't a huge headache to keep
track of.

~~~
borism
Deduping is definitely not going to work on proprietary encrypted data.

But for web services like Tumblr, Flickr, YouTube, or Hulu, deduping
must be a killer app: there's a huge amount of overlapping media data
between them, and it makes little sense not to analyze it. I'm sure
Google is deduping YouTube content on a massive scale.

So it depends on the data and application of course.

Then again, purely byte-level deduping is not going to cut it; you need
to look into logical deduping (not storing the same image in different
resolutions, formats, etc.).
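
Something as simple as an average hash gets surprisingly far for "same
picture, different encoding". A sketch (Pillow-based; the distance
threshold is illustrative, not tuned):

    from PIL import Image   # Pillow

    def ahash(path):
        # Average hash: 8x8 grayscale thumbnail, one bit per pixel
        # (brighter than the mean or not). The same photo at different
        # resolutions/formats yields nearly the same 64-bit value.
        px = list(Image.open(path).convert("L").resize((8, 8)).getdata())
        mean = sum(px) / float(len(px))
        return sum(1 << i for i, p in enumerate(px) if p > mean)

    def probably_same(a, b, max_dist=5):
        # Compare hashes by Hamming distance.
        return bin(ahash(a) ^ ahash(b)).count("1") <= max_dist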

------
kevinpet
Amazon's pricing model is even more busted when you look at the cost of puts
for small objects. The simplest answer is for Amazon to revisit their pricing
based on their actual costs. You can use SimpleDB to index larger objects
stored in S3 and save money. I've written up a description of how I did this,
but it proved too much hassle and we've moved on to HBase.

http://kdpeterson.net/blog/2009/06/pack-multiple-small-objects-in-s3-for-cost-savings.html
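
The gist of the packing trick, re-sketched with the modern boto3 SDK
(this is not the code from the post; bucket and key names are made up):

    import boto3

    s3 = boto3.client("s3")

    def pack_and_put(bucket, pack_key, records):
        # records: dict of small-object key -> bytes. One PUT amortizes
        # the per-request charge across all of them.
        index, buf = {}, bytearray()
        for key, blob in records.items():
            index[key] = (len(buf), len(blob))
            buf += blob
        s3.put_object(Bucket=bucket, Key=pack_key, Body=bytes(buf))
        return index  # persist somewhere queryable (SimpleDB, HBase, ...)

    def get_record(bucket, pack_key, index, key):
        # Serve individual records with ranged GETs against the pack.
        off, length = index[key]
        resp = s3.get_object(Bucket=bucket, Key=pack_key,
                             Range="bytes=%d-%d" % (off, off + length - 1))
        return resp["Body"].read()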

I'm skeptical of any business model too close to Amazon's core
services. I'm thinking of things like Elastic MapReduce, which isn't
perfect and isn't optimal compared to what you could do yourself, but
isn't bad enough that I would ever choose a niche provider over EMR or
running my own cluster.

------
notmyname
Or it could be that AWS doesn't see S3 as a product per se, but rather
as an infrastructure piece that is a building block for other, more
full-featured products. My understanding of S3 (and Dynamo) is that it
started as tech used for running Amazon's internal systems. Someone
realized they could get more revenue by offering that internal
infrastructure publicly and stuck a price tag on it. Services like
de-duping, compression, etc. are more in the realm of Jungle Disk and
tarsnap: third-party resellers that become a front end to the
infrastructure provided by S3.

What S3 and other storage services sell is not space so much as
reliability and availability. No question: S3 is more expensive than a
few file servers in your back room.

------
mark_l_watson
I don't agree with the article; it's an odd business premise, since we
use S3 because it is replicated across availability zones, is very
robust, and is probably safe enough to base a business on. Why
"de-duplicate" when redundant storage is what makes S3 safe? Also, what
is the risk of trusting your business to a smaller company?

~~~
adamsmith
The idea is to deduplicate / eliminate redundancy one layer up. If you
are storing an image in S3, let's say it's replicated 10 times
internally to ensure data safety. If I store the same image, it doesn't
need to be stored another 10 times.
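
Concretely, the layer-up service is just content-addressed storage. A
minimal sketch (boto3-based; the bucket name is made up, and the
refcount would have to live in a durable store):

    import hashlib
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "dedup-layer"   # illustrative name
    refcount = {}            # in reality: a durable counter store

    def put(data):
        key = hashlib.sha256(data).hexdigest()
        if refcount.get(key, 0) == 0:
            # First copy: S3 replicates it internally, but only once
            # across all customers storing these exact bytes.
            s3.put_object(Bucket=BUCKET, Key=key, Body=data)
        refcount[key] = refcount.get(key, 0) + 1
        return key   # the handle the customer stores

    def get(key):
        return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()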

Getting customers to trust a smaller company's software is definitely a top
risk. I wouldn't start this company now, but it might be worth doing in 2015.
Watch this space!

~~~
houseabsolute
My biggest concern about the article is that it doesn't prove its
premise. S3 is only arbitrageable if everything else you need to build
the proposed service costs less (including opportunity cost) than the
savings you generate. That is far from certain: even with the
infrastructure all built, it's not clear to me that you could turn a
profit once you factor in running costs like the CPU and RAM needed to
actually make the system function, much less the support staff.

------
siculars
Data de-duplication is clearly the win, paying ever larger dividends as
your data under management grows. Why does the OP think that Amazon,
Google, Microsoft, Apple, et al. are not already doing this at scale,
which in turn allows them to provide the services that they do?

~~~
adamsmith
You're right that they are doing this at scale.

It's just that their business model -- charging per GB, instead of per GB of
new data that nobody else is storing yet -- leaves them susceptible to
arbitrage. I.e. customers can steal the economies of scale if they
collaborate.

You know how Starbucks charges $5 for a small and $6 for a large? It's
the same thing. If two people each want a small but collaborate, they
can split a large, paying $3 each instead of $5 and saving $2 apiece.
It's the same idea, except the arbitrage becomes more realistic when
it's electronic.

(I've ignored compression in this whole discussion for simplicity. If
AWS wants a business model that isn't arbitragable based on
compression, they would have to charge based on how compressible your
data is. Also, next-generation compression algorithms -- another blog
post on that later, perhaps -- could achieve compression rates with the
same economies of cross-customer scale that data deduplication has, so
really it would have to be priced based on how compressible your data
is __given all of the other data S3 is storing__.)
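
A toy illustration of that cross-customer effect, using zlib's
preset-dictionary feature (the data here is made up; a real system
would build dictionaries far more carefully):

    import zlib

    # Pretend this is a sample of data S3 already stores for others.
    corpus = b'{"user": "alice", "event": "upload", "bytes": 123456}' * 50
    new = b'{"user": "bob", "event": "upload", "bytes": 654321}'

    plain = zlib.compress(new)
    c = zlib.compressobj(zdict=corpus)           # prime with the corpus
    with_dict = c.compress(new) + c.flush()
    print(len(plain), len(with_dict))   # dictionary-primed copy is smaller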

(EDIT: made starbucks example simpler.)

~~~
Andys
The problem is it wouldn't be difficult for Amazon to start charging
based on how much deduplicated storage space you're actually
_consuming_, which would pull the rug out from under such a reseller's
feet.

~~~
notmyname
I'm wondering what the logistics of this would be. Say one customer is
storing 100GB in a cloud storage service that does internal de-duping,
and a second customer uploads exactly the same 100GB of data. What is
each charged? The full 100GB times the rate? Does each pay half? Do
prices change for the remaining customers when one "copy" of the data
is removed? Without charging each customer for their total apparent
storage use, I don't see how any customer can have any predictability
in their monthly bill.
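
To put numbers on it (one possible split-the-cost scheme, with made-up
rates): any evenly-split scheme reprices the survivors every time a
co-owner of the bytes leaves.

    RATE = 0.15      # $/GB/month, made up
    SIZE_GB = 100.0

    def bill_per_customer(n):
        # Split the cost of the single deduplicated copy evenly.
        return SIZE_GB * RATE / n

    print(bill_per_customer(2))  # $7.50 each while both store the data
    print(bill_per_customer(1))  # $15.00: one deletion doubles the bill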

~~~
siculars
This is exactly the point. Billing under such a dedupe scheme would be
a nightmare. The only economies to be had accrue to the provider:
either Amazon in this example, or some other service built on top of
Amazon (or self-hosted) that keeps the vagaries of price fluctuation
away from the customer.

Notice that the way Amazon does price arbitrage with their compute
nodes (spot instances) allows them to pull the power at any time the
price moves above what you had agreed to pay. Anybody feel good about
that happening to their data?

------
bmelton
I read all the comments posted here thus far, and the one thing I don't
see is a concern that de-duping, if not done at the filesystem level,
would be prohibitively slow.

Am I misguided in thinking that it would be? Even if you implemented it
on disk at the server level, reassembling the deduped blocks there adds
latency. I suppose I might be looking at it from the wrong perspective,
as there must be people using S3 for backups only, where cost matters
more than speed, but I honestly don't know whether that's the exception
or the rule.

~~~
houseabsolute
That depends on the block size. If you're using 40MB blocks, for
example, then seek costs are not going to dominate, and you can afford
to stitch together a couple of deduped blocks. Whether you're going to
find duplicated files large enough for blocks that size to make sense
is another question.
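
Rough numbers (assuming ~10ms average seeks and ~100MB/s sequential
reads, both ballpark figures for 2010-era disks):

    SEEK_S = 0.010       # ~10 ms average seek, assumed
    THROUGHPUT = 100e6   # ~100 MB/s sequential read, assumed

    for block in (4 * 1024, 64 * 1024, 40 * 1024 * 1024):
        xfer = block / THROUGHPUT
        frac = SEEK_S / (SEEK_S + xfer)
        print("%10d-byte block: seek is %5.1f%% of the read"
              % (block, 100 * frac))
    # 4 KB: ~99.6%; 64 KB: ~93.9%; 40 MB: ~2.3%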

