

How to: Compete with Amazon S3 without Buying Hardware - adamsmith
http://blog.adamsmith.cc/2012/04/how-to-compete-with-amazon-s3-without-buying-hardware.html

======
soult
Tahoe-LAFS[1] is a storage system that works similarly. It splits data into n
fragments, of which any k fragments are enough to restore the data. This leads
to a total stored size of only DATA_SIZE * n/k (a storage overhead factor of
n/k), while you can still lose any n-k fragments without data loss.
Additionally, all data is encrypted, signed, and optionally deduplicated.
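
As a rough illustration of the k-of-n arithmetic above, here is a minimal Python sketch. The function name `share_layout` and the 3-of-10 setting are assumptions chosen for the example, and the sketch ignores per-fragment metadata and encryption overhead.

```python
def share_layout(data_size, k, n):
    """Sizes for a k-of-n scheme: any k of the n fragments restore the data."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    share_size = data_size / k          # each fragment holds 1/k of the data
    return {
        "share_size": share_size,
        "total_stored": share_size * n, # all n fragments together
        "expansion": n / k,             # stored bytes per byte of original data
        "max_lost": n - k,              # fragments that may disappear safely
    }

# Hypothetical example: a 900 MB file stored as 3-of-10.
layout = share_layout(900, k=3, n=10)
print(layout)
```

With these illustrative numbers each fragment is 300 MB, all ten together take 3000 MB, and any seven fragments can be lost.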

One of the authors of Tahoe-LAFS started a company that ported the whole
system over to cloud storage providers.[2] It's still in alpha, but it's
definitely worth a look if you want secure, encrypted storage without relying
on a single cloud provider.

1: <https://tahoe-lafs.org/trac/tahoe-lafs>

2: <https://leastauthority.com/>

~~~
blantonl
Now _this_ is a cloud application.

------
driverdan
I've been watching the storage industry for years as a hobby-passion. Adam
pretty much hit all the major points. The storage space is still open for
disruption, but entering it is hard and high-risk.

It's not like building a website. You need serious funding for hardware. You
need people to manage the hardware. You need complex software to manage the
data and ensure security. One serious breach early on and you're done.

Competing with Amazon is especially hard since S3 is well established and
entrenched. If you use EC2 you're going to use S3.

Pricing would be a primary factor in competing. 3x redundancy is unnecessary.
I'm not sure why services still do that. Reed-Solomon or similar redundancy
algorithms can provide better protection and use less space. They have CPU
overhead, but CPUs aren't going to be the bottleneck for a storage service;
bandwidth and hard drives will be.
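
To make the space argument concrete, here is a small Python sketch comparing the two approaches at the same failure tolerance. It assumes an ideal erasure code (Reed-Solomon behaves this way) in which any `data_shares` of the stored shares are enough to rebuild the data. The function names and the choice of 10 data shares are hypothetical.

```python
def replication_overhead(failures_tolerated):
    """Whole copies needed so that losing `failures_tolerated` copies is survivable."""
    return failures_tolerated + 1

def erasure_overhead(data_shares, failures_tolerated):
    """Stored bytes per original byte with `failures_tolerated` extra parity shares."""
    return (data_shares + failures_tolerated) / data_shares

# Hypothetical comparison with 10 data shares.
for failures in (2, 4):
    print(
        f"tolerate {failures} losses: "
        f"replication {replication_overhead(failures)}x, "
        f"erasure coding {erasure_overhead(10, failures):.1f}x"
    )
```

Under these assumptions, surviving two losses costs 3x with replication but 1.2x with coding, and even four parity shares stay at 1.4x.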

 _Edit: This would be if you built from the hardware up. I don't think
offering a service like S3 on top of other storage services would work as a
business. You'd have to deal with too many vendors, too much variation in APIs
/ software / hardware, lack of control, latency issues, and much tighter
margins. IMO you'd be better off starting with bare metal. You could do
something like this for personal, smaller scale storage but growing it to
scale would be a nightmare._

~~~
lusr
I imagine that S3's response times will be much better than those of this
Frankenstore model, which may be an issue for certain applications.

Also you need to host infrastructure software that knows where your data is
sitting, how to deal with provider failures, how to efficiently route
requests, etc. which means yet-another-thing-to-configure.

Finally, if the volume of data you're storing is so expensive on S3, I have to
wonder why you have all this non-revenue generating data stored in the first
place. Processing it also seems more expensive now because the free bandwidth
you get from EC2<->S3 won't apply in the Frankenstore model.

~~~
adamsmith
All good points!

I should have mentioned this more explicitly, but I think you could take the
buying-raw-storage model and use it to do anything S3 does. E.g., you could
have three independent whole copies, or one whole copy and 1.5 copies
distributed widely.

The only thing I can think of that Amazon could do that you couldn't do, if
the raw storage providers are untrusted, is serve the data with no additional
hops, since the data would be encrypted.

------
thereallurch
"In 2006 a 320 GB hard drive cost $120. Today (Thailand floods aside) that
much money will snag you a 3 TB drive."

Floods or not, the current price isn't $120. It's 50% higher than that.

[http://camelcamelcamel.com/Western-Digital-Caviar-Green-
Desk...](http://camelcamelcamel.com/Western-Digital-Caviar-Green-
Desktop/product/B004RORMF6?active=amazon)

That link shows one of the cheapest 3 TB non-enterprise drives. It looks like
3 TB was $120 for ~2 weeks. Looking at enterprise drives, 3 TB is closer to
$300.

This article is basically advocating RAID 5 across many storage providers.

*edit: From the pictures, the article is advocating RAID 10. Nonetheless, RAID 5 would be just as feasible for additional storage.

~~~
adamsmith
> This article is basically advocating RAID 5 across many storage providers.

That is correct. And to be more precise, I'm advocating RAID 5 across storage
providers _as a service_ , so people who just want to store data don't have to
manage anything.

~~~
Terretta
>> _This article is basically advocating RAID 5 across many storage
providers._

> _That is correct._

The diagram in the article and text description is RAID 1+0 aka RAID 10.

~~~
Dylan16807
No, it's RAID 5. Well, specifically, it's RAID 6 but customizable. You can
lose _any_ k drives.

------
mleonhard
A big cost of running a redundant data storage service is data transfer.

To store two replicas of each piece of data, you must receive the data at one
replica, transmit it to the other replica, and receive it at that replica. The
data goes in at one server, then back out, and then in at the other server. To
store 1 GB of data, you must pay for 3 GB of data transfer. Data transfer is
expensive.
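
The 3 GB figure can be reproduced with a short Python sketch. It assumes the first replica forwards the data to every other replica over metered links, and that both inbound and outbound traffic are billed. The function name is hypothetical.

```python
def metered_transfer_gb(data_gb, replicas):
    """Total billed transfer when the first replica forwards to all others."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    inbound = data_gb * replicas          # every replica receives the data once
    outbound = data_gb * (replicas - 1)   # the first replica sends it to the rest
    return inbound + outbound

print(metered_transfer_gb(1, replicas=2))  # the two-replica case described above
```

With three replicas under the same assumptions, the same 1 GB would account for 5 GB of transfer.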

Amazon works around this problem by building data centers in clusters,
interconnected with low-cost connections. When you upload to S3, your data
goes over the Internet only once.

------
stevewilhelm
> Amazon S3 has high margins today. ...

> ... despite the fact that hard drive costs fall 50% per year.

Citations for both statements please.

Even if both are true, it may be the case that hard drives are not the primary
cost of running a large cloud storage service.

------
wmf
So basically <https://nimbus.io/> on rented hardware.

------
alexchamberlain
I'm not sure I completely understand. Are you recommending people shop around?
Do you want someone to develop a service to shop around for storage? Or, do
you simply want a cheaper competitor?

~~~
adamsmith
Thanks for the comment. I just edited the post to try to make it clearer.

I'm proposing that anyone could start a company, Foo Inc, which would sell
redundant storage and compete with S3. Instead of operating your own hard
drives, you rent hard drives connected to the Internet from a variety of
providers. Of course your customers would know that you were doing this, and
advanced customers could even choose their own blend of raw storage providers
to optimize for different things.

Towards the end of the post I mention briefly that instead of a startup (Foo
Inc), this ecosystem could be set up in a decentralized way (think Bitcoin
vs. central banking), though that is far less realistic.

------
JoachimSchipper
It's true that cost/GB has fallen, but cost/IOPS hasn't followed suit. If your
I/O maxes out when the disk is 10% full, you can't really do much with the
other 90%.
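
A back-of-the-envelope Python sketch of that effect follows. All numbers are hypothetical (a 3000 GB drive, 150 IOPS, a workload needing 0.5 IOPS per stored GB, a $120 price), and the model simply caps usable space at whatever the drive's IOPS can serve.

```python
def usable_capacity_gb(capacity_gb, drive_iops, iops_per_gb):
    """Space that can be filled before the drive runs out of IOPS."""
    iops_limited_gb = drive_iops / iops_per_gb
    return min(capacity_gb, iops_limited_gb)

# Hypothetical drive and workload.
capacity_gb, price = 3000, 120
usable = usable_capacity_gb(capacity_gb, drive_iops=150, iops_per_gb=0.5)

print(f"usable: {usable:.0f} GB ({usable / capacity_gb:.0%} of the drive)")
print(f"nominal cost/GB: ${price / capacity_gb:.3f}")
print(f"effective cost/GB: ${price / usable:.3f}")
```

In this made-up case only 10% of the drive is usable, so the effective cost per GB is ten times the nominal one.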

------
coalsoul
Break the file into 3 equal parts. You only need to store one part plus the
parity file. And not necessarily with the same cloud provider. Result: you pay
1/3 less for storage. So if Dropbox is so clever why aren't they doing this
for their customers? xor'ing/Reed-Solomon has been used this way since Usenet.
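
Here is a minimal Python sketch of the XOR variant of this idea, with hypothetical function names. In this sketch all three parts plus one parity block are stored (about 1.33x the original size, ideally on four different providers), and any single lost part can be rebuilt. It is plain XOR parity rather than Reed-Solomon, so it tolerates only one loss.

```python
from functools import reduce

def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def split_with_parity(data, k=3):
    """Split data into k equal parts (zero-padded) plus one XOR parity block."""
    part_len = -(-len(data) // k)                  # ceiling division
    padded = data.ljust(part_len * k, b"\0")
    parts = [padded[i * part_len:(i + 1) * part_len] for i in range(k)]
    return parts, reduce(xor_bytes, parts)

def recover_missing(parts, parity):
    """Rebuild the single part marked as None from the others and the parity."""
    missing = [i for i, p in enumerate(parts) if p is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can rebuild exactly one missing part")
    present = [p for p in parts if p is not None]
    restored = list(parts)
    restored[missing[0]] = reduce(xor_bytes, present, parity)
    return restored

data = b"store me across four providers"
parts, parity = split_with_parity(data)
damaged = [parts[0], None, parts[2]]               # pretend one provider vanished
restored = recover_missing(damaged, parity)
recovered = b"".join(restored)[:len(data)]         # drop any padding
print(recovered == data)
```

The original length has to be kept somewhere so the padding can be stripped after reassembly.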

------
chx
So... let me get this straight. You came up with an idea which requires
building a strong brand, which takes a lot of money, and which Amazon can
squelch the second there's a hint of possible success. And this made the
HackerNews frontpage. What?

~~~
oijaf888
Or building an open source tool that individual users could use to manage raw
storage in an S3 fashion. It's an interesting idea.

~~~
rbanffy
Or one that many little startups everywhere could use to compete locally with
Amazon in the SMB space.

