One of the authors of Tahoe-LAFS started a company that ported the whole system over to cloud storage providers. It's still in alpha, but it's definitely worth a look if you want secure, encrypted storage without relying on a single cloud provider.
It's not like building a website. You need serious funding for hardware. You need people to manage the hardware. You need complex software to manage the data and ensure security. One serious breach early on and you're done.
Competing with Amazon is especially hard since S3 is well established and entrenched. If you use EC2 you're going to use S3.
Pricing would be a primary factor in competing. 3x replication is unnecessary; I'm not sure why services still do it. Reed-Solomon or similar erasure-coding schemes can provide better protection while using less space. They add CPU overhead, but CPU isn't going to be the bottleneck for a storage service; bandwidth and hard drives will be.
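To put rough numbers on that: here's a small sketch comparing the raw-storage overhead of full replication against a Reed-Solomon-style (k, m) erasure code. The (10, 4) split is an illustrative choice, not a recommendation.

```python
# Bytes stored per byte of user data, under two redundancy schemes.

def replication_overhead(copies: int) -> float:
    """Full replication: store `copies` whole copies."""
    return float(copies)

def erasure_overhead(k: int, m: int) -> float:
    """Erasure coding: k data shards + m parity shards.
    Any k of the k+m shards can reconstruct the data."""
    return (k + m) / k

print(replication_overhead(3))   # 3.0 -> survives loss of 2 copies
print(erasure_overhead(10, 4))   # 1.4 -> survives loss of any 4 of 14 shards
```

So a (10, 4) code tolerates four simultaneous failures at 1.4x the storage, versus 3x for triple replication, which only tolerates two.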
Edit: This would be if you built from the hardware up. I don't think offering a service like S3 on top of other storage services would work as a business. You'd have to deal with too many vendors, too much variation in APIs / software / hardware, lack of control, latency issues, and much tighter margins. IMO you'd be better off starting with bare metal. You could do something like this for personal, smaller scale storage but growing it to scale would be a nightmare.
Most importantly, unlike the OP, who says "the big hosting guys don't have a track record of building complex systems software," and your own post, which speaks of "complex software," we run an architecture that is as simple as possible.
Our failures are always boring ones, just like our business is.
You are correct that a chain of vendors, ending in a behemoth that nobody will ever interact with and that will never take responsibility, is a bad model.
So too is a model whose risk you cannot assess. You have no idea how to model the risk of data in an S3 container. You can absolutely model the risk of data in a UFS filesystem running on FreeBSD.
ZFS deployment occurs in May 2012.
Also, you need to host infrastructure software that knows where your data is sitting, how to deal with provider failures, how to efficiently route requests, etc., which means yet-another-thing-to-configure.
Finally, if the volume of data you're storing is so expensive on S3, I have to wonder why you have all this non-revenue generating data stored in the first place. Processing it also seems more expensive now because the free bandwidth you get from EC2<->S3 won't apply in the Frankenstore model.
I should have mentioned this more explicitly, but you could take the buying-raw-storage model and use it to do anything S3 does, I think. E.g., you could have three independent whole copies, or one whole copy and 1.5 copies' worth of redundancy distributed widely.
The only thing I can think of that Amazon could do that you couldn't do, if the raw storage providers are untrusted, is serve the data with no additional hops, since the data would be encrypted.
Obviously they are targeted at backups, but you wouldn't need to change a lot to improve performance (mostly it would be in software, plus some caching boxes, I think).
Do you have any information on what/how you went about building your storage system? If not, would you be so kind as to create some text (blog, how-to's, etc.) that detailed your setup and how it performs under your workload?
After my experience there (both technical and political), I left to do big storage. We settled on the Backblaze enclosures due solely to cost (cheap is cheap); read performance is sub-optimal, but we try to compensate with heavy caching in memory and intelligent read assumptions ("what might someone request next based on past read requests") at the app level (sitting on top of Nova).
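The caching idea described above can be sketched roughly like this: keep recently read blocks in memory and prefetch the next few sequential blocks, on the guess that past reads predict future ones. The class name, policy, and block-addressed backend here are all hypothetical, not the actual system.

```python
from collections import OrderedDict

class ReadAheadCache:
    """Toy in-memory block cache with sequential read-ahead."""

    def __init__(self, backend_read, capacity=1024, readahead=4):
        self.backend_read = backend_read   # function: block_id -> bytes
        self.capacity = capacity           # max cached blocks
        self.readahead = readahead         # blocks to prefetch per miss
        self.cache = OrderedDict()         # block_id -> data, in LRU order

    def _put(self, block_id, data):
        self.cache[block_id] = data
        self.cache.move_to_end(block_id)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used

    def read(self, block_id):
        if block_id in self.cache:
            self.cache.move_to_end(block_id)
            return self.cache[block_id]
        data = self.backend_read(block_id)
        self._put(block_id, data)
        # Prefetch the next few blocks, guessing the read is sequential.
        for i in range(1, self.readahead + 1):
            nxt = block_id + i
            if nxt not in self.cache:
                self._put(nxt, self.backend_read(nxt))
        return data
```

With this policy, a sequential scan pays the slow-backend cost only on the first block of each read-ahead window; a real system would presumably add smarter prediction than "next N blocks."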
I could do a blog post, but I have to check with my partner to make sure they're cool with me spilling that much info =)
Hope this has helped a bit. If you haven't guessed yet, I love object storage.
What are your startup and blog URLs? You don't have any profile info.
Floods or not, the current price isn't $120. It's 50% higher than that.
Shows one of the cheapest 3 TB non-enterprise drives. It looks like 3 TB was $120 for ~2 weeks. Looking at enterprise drives, 3 TB is closer to $300.
This article is basically advocating RAID 5 across many storage providers.
*edit: From the pictures, the article is advocating RAID 10. Nonetheless, RAID 5 would be just as feasible for the additional redundancy.
Each drive costs $x. Storing one drive's worth of data across 3 providers, each of which keeps 3 copies, costs 9x. If drive prices halve, the multiplier is still 9x.
Existing companies store multiple copies transparently... Either way, 1 copy or 3 copies, I don't think the math is wrong. Just change 9x to 3x.
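The multiplier being argued about is just providers times copies per provider; a trivial sketch of the two cases above:

```python
def cost_multiplier(providers: int, copies_per_provider: int) -> int:
    """Total drives bought per drive's worth of user data."""
    return providers * copies_per_provider

print(cost_multiplier(3, 3))  # 9 -> three providers, each storing 3 copies
print(cost_multiplier(3, 1))  # 3 -> three providers, each storing 1 copy
```

Either way the point stands: drive prices halving scales the cost, not the multiplier.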
That is correct. And to be more precise, I'm advocating RAID 5 across storage providers as a service, so people who just want to store data don't have to manage anything.
If you'd like to talk more about this, send me an email (address in profile).
If you simply rely on "dumb disks" spread across multiple providers to provide availability, then you may be interested in looking at the nova-volumes part of openstack (it provides block storage attached to nova VMs). As part of openstack, it's an open system that is seeing rapid adoption.
However, one of the most-requested features for swift is to provide support for logical clusters that span a wide geographic area. This could potentially allow multiple providers to collaborate in providing a multi-provider storage system. However, I'd guess that the technical problems are much simpler than the business problems in setting up multi-provider clusters.
> That is correct.
The diagram and the text description in the article are RAID 1+0, aka RAID 10.
At scale, this could be managed, I think, through a combination of (a) shipping hard drives around, (b) caching, and (c) peering between storage providers. For example, shipping hard drives around would be expedient if you wanted to switch out a raw storage provider. The optimal strategy also depends on the access patterns and latency requirements.
It seems solvable, but not trivial.
To store two replicas of each piece of data, you must receive the data at one replica, transmit it to the other replica, and receive it at that replica. The data goes in at one server, then back out, and then in at the other server. To store 1 GB of data, you must pay for 3 GB of data transfer. Data transfer is expensive.
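The transfer accounting above generalizes to any replica count: one inbound transfer at the first replica, then one outbound plus one inbound per additional replica. A quick sketch (the function name is mine, not from the article):

```python
def transfer_gb(data_gb: float, replicas: int) -> float:
    """GB of network transfer to store data_gb with `replicas` copies,
    when replicas forward data to each other over the public Internet:
    1 inbound at the first replica, then (out + in) per extra replica."""
    return data_gb * (1 + 2 * (replicas - 1))

print(transfer_gb(1.0, 2))  # 3.0 -> the 3 GB figure above
print(transfer_gb(1.0, 3))  # 5.0 -> it only gets worse with more replicas
```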
Amazon works around this problem by building data centers in clusters, interconnected with low-cost connections. When you upload to S3, your data goes over the Internet only once.
> ... despite the fact that hard drive costs fall 50% per year.
Citations for both statements please.
Even if both are true, it may be the case that hard drives are not the primary cost of running a large cloud storage service.
I'm proposing that anyone could start a company, Foo Inc, who would sell redundant storage and compete with S3. Instead of operating your own hard drives, you rent hard drives connected to the Internet from a variety of providers. Of course your customer would know that you were doing this, and advanced customers could even choose their own blend of raw storage providers to optimize for different things.
Towards the end of the post I mention briefly that instead of a startup (Foo Inc), this ecosystem could be set up in a decentralized way (think Bitcoin vs. central banking), though that is far less realistic.