
I've been watching the storage industry for years as a hobby-passion. Adam pretty much hit all the major points. The storage space is still open for disruption, but it's hard and high-risk.

It's not like building a website. You need serious funding for hardware. You need people to manage the hardware. You need complex software to manage the data and ensure security. One serious breach early on and you're done.

Competing with Amazon is especially hard since S3 is well established and entrenched. If you use EC2 you're going to use S3.

Pricing would be a primary factor in competing. 3x redundancy is unnecessary, and I'm not sure why services still do it. Reed-Solomon or similar erasure-coding schemes can provide better protection and use less space. They have CPU overhead, but CPUs aren't going to be the bottleneck for a storage service; bandwidth and hard drives will be.
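
For a rough sense of the numbers (a back-of-the-envelope sketch in Python; the specific k and m values are illustrative, not anyone's production layout):

    # Storage overhead of k+m Reed-Solomon erasure coding vs. N-way replication.
    # Illustrative numbers only; real deployments tune k and m to failure domains.

    def erasure_overhead(k, m):
        """Raw bytes stored per logical byte with k data shards + m parity shards."""
        return (k + m) / k

    def replication_overhead(copies):
        return copies

    print(erasure_overhead(10, 4))   # 1.4x raw overhead, survives any 4 lost shards
    print(replication_overhead(3))   # 3.0x raw overhead, survives any 2 lost copies

So a 10+4 layout tolerates more simultaneous failures than triple replication while storing less than half as many raw bytes, at the cost of encode/decode CPU.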

Edit: This would be if you built from the hardware up. I don't think offering a service like S3 on top of other storage services would work as a business. You'd have to deal with too many vendors, too much variation in APIs / software / hardware, lack of control, latency issues, and much tighter margins. IMO you'd be better off starting with bare metal. You could do something like this for personal, smaller-scale storage, but growing it to scale would be a nightmare.




We've[1] been doing this for 11 years now, just as you describe. We built the bare metal ourselves, we own it, and the buck stops here.

Most importantly, unlike the OP, who says "the big hosting guys don't have a track record of building complex systems software", and your own post, which speaks of "complex software", we run an architecture that is as simple as possible.

Our failures are always boring ones, just like our business is.

You are correct that a chain of vendors, ending in a behemoth[2] that nobody will ever interact with and that will never take responsibility, is a bad model.

So too is a model whose risk you cannot assess. You have no idea how to model the risk of data in an S3 bucket. You can absolutely model the risk of data in a UFS filesystem running on FreeBSD[3].

[1] rsync.net

[2] Amazon

[3] ZFS deployment occurs in May, 2012


I imagine that the response times of S3 will be much better than this Frankenstore model's, which may be an issue for certain applications.

Also, you need to host infrastructure software that knows where your data is sitting, how to deal with provider failures, how to route requests efficiently, and so on, which means yet another thing to configure.

Finally, if the volume of data you're storing is so expensive on S3, I have to wonder why you have all this non-revenue-generating data stored in the first place. Processing it also seems more expensive now, because the free bandwidth you get between EC2 and S3 won't apply in the Frankenstore model.


All good points!

I should have mentioned this more explicitly, but you could take the buying-raw-storage model and use it to do anything S3 does, I think. E.g. you could have three independent whole copies, or one whole copy and 1.5 copies distributed widely.

The only thing I can think of that Amazon could do that you couldn't, if the raw storage providers are untrusted, is serve the data with no additional hops, since the data would be encrypted.
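
To make that concrete, here's a rough sketch of the client-side encryption step, using the Python cryptography package purely as an illustration; key management, chunking, and integrity metadata are all glossed over:

    # Encrypt locally before handing blobs to untrusted storage providers.
    # Sketch only: assumes the 'cryptography' package; a real system needs
    # key management, chunking, and integrity metadata on top of this.
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()        # you keep this; the providers never see it
    f = Fernet(key)

    ciphertext = f.encrypt(b"raw object data")   # this is what gets uploaded
    plaintext = f.decrypt(ciphertext)            # decrypt after fetching it back
    assert plaintext == b"raw object data"

Since the providers only ever hold ciphertext, none of them can serve readable data directly the way S3 can; every read has to come back through something that holds the key.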


It really isn't that expensive; see http://www.backblaze.com/ and specifically http://blog.backblaze.com/2011/07/20/petabytes-on-a-budget-v...

Obviously they are targeted at backups, but you wouldn't need to change a lot to improve performance (mostly it would be in software plus some caching boxes, I think).


We have ~8PB of spinning storage that we built with the Backblaze storage pods, and we use OpenStack's Swift object storage for the software layer. Works like a champ.
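
For the unfamiliar, here's roughly what talking to Swift looks like over its HTTP object API (the endpoint, token, container, and object names below are placeholders, not ours; most people go through python-swiftclient and Keystone auth rather than raw requests):

    # Minimal object PUT/GET against OpenStack Swift's HTTP API.
    # Endpoint, token, container, and object names are placeholders.
    import requests

    STORAGE_URL = "https://swift.example.com/v1/AUTH_demo"   # hypothetical
    TOKEN = "AUTH_tk_example"                                 # hypothetical

    headers = {"X-Auth-Token": TOKEN}

    # Create a container, then upload and fetch an object.
    requests.put(f"{STORAGE_URL}/backups", headers=headers)
    requests.put(f"{STORAGE_URL}/backups/pod-42/blob.bin",
                 headers=headers, data=b"object payload")

    resp = requests.get(f"{STORAGE_URL}/backups/pod-42/blob.bin", headers=headers)
    print(resp.status_code, len(resp.content))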


This sounds very interesting and is something that I'm thinking about doing as well. One "limitation" that I see with the Backblaze pods is the possibility that they don't perform well in heavy everyday use. They were designed to be mostly write-only devices, but my use case would be very read-heavy and I'm not sure how they would hold up.

Do you have any information on how you went about building your storage system? If not, would you be so kind as to create some text (blog, how-tos, etc.) that details your setup and how it performs under your workload?


Before my current startup, I worked at Fermi National Accelerator Lab on the CMS detector data-taking team for the LHC. I spent a year there getting to admin the spinning storage (~5PB) on Nexsan SATABeasts (very nice, but very expensive for ~48-96TB of disk per enclosure) and ~17PB of storage on StorageTek tape silos (also, of less consequence, ~5500 nodes that reconstructed collider data from raw data we streamed over 40Gb/s optical links from CERN).

After my experience there (both technical and political), I left to do big storage. We settled on the Backblaze enclosures due solely to cost (cheap is cheap); read performance is sub-optimal, but we try to compensate with heavy caching in memory and intelligent read assumptions ("what might someone request next based on past read requests") at the app level (sitting on top of Nova).
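
To give a flavor of what I mean by "intelligent read assumptions", here's a toy version of the idea (pure Python sketch; the chunk IDs, cache size, and the naive "fetch the next chunk" rule are made up for illustration, not our actual logic):

    # Toy read cache with naive sequential read-ahead, in the spirit of
    # "guess what gets requested next based on past reads".
    from collections import OrderedDict

    class ReadAheadCache:
        def __init__(self, backend_read, capacity=1024):
            self.backend_read = backend_read      # function: chunk_id -> bytes
            self.capacity = capacity
            self.cache = OrderedDict()            # chunk_id -> bytes, in LRU order

        def _put(self, chunk_id, data):
            self.cache[chunk_id] = data
            self.cache.move_to_end(chunk_id)
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)    # evict least recently used

        def read(self, chunk_id):
            if chunk_id in self.cache:
                self.cache.move_to_end(chunk_id)
                data = self.cache[chunk_id]
            else:
                data = self.backend_read(chunk_id)
                self._put(chunk_id, data)
            # Naive prediction: a sequential reader will want the next chunk too.
            nxt = chunk_id + 1
            if nxt not in self.cache:
                self._put(nxt, self.backend_read(nxt))
            return data

The real prediction logic is smarter than "next chunk", but the shape is the same: absorb the pods' weak random-read performance in memory before a request ever hits a disk.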

I could do a blog post, but I have to check with my partner to make sure they're cool with me spilling that much info =)

Hope this has helped a bit. If you haven't guessed yet, I love object storage.


A blog post would be really interesting, even if you couldn't disclose everything.

What are your startup and blog URLs? You don't have any profile info.



