
Ask HN: Cheap Bulk Storage? - e1ven
2 years ago, BackBlaze released specs for their storage pod, which is basically a huge tract of drives wired together.

It's not fast, but it's cheap, and it's great for bulk-archiving documents.

Now it's 2011. Is there a better way of solving this problem? S3 is crazy expensive for massive (100+TB) archiving, and I'd rather go one step above building my own system using custom frames and components.

Are there any web-services which offer bulk storage at somewhat reasonable rates? Are there off-the-shelf cases which offer 50+ HDs for a non-insane price?
======
wmf
Supermicro has a 45-bay JBOD that should be under 10 cents/GB populated:
<http://www.supermicro.com/products/chassis/4U/?chs=847>

HP has a 70-bay JBOD that works out to under 30 cents/GB:
[http://h10010.www1.hp.com/wwpc/us/en/sm/WF25a/12169-304616-3...](http://h10010.www1.hp.com/wwpc/us/en/sm/WF25a/12169-304616-3930445-3930445-3930445-3936271.html)

~~~
e1ven
It's also worth looking at their 72 disk, 2.5" version

[http://www.supermicro.com/products/chassis/4U/417/SC417E16-R...](http://www.supermicro.com/products/chassis/4U/417/SC417E16-R1400U.cfm)

Using that, and 72 1TB 2.5" drives,
([http://www.newegg.com/Product/Product.aspx?Item=N82E16822136...](http://www.newegg.com/Product/Product.aspx?Item=N82E16822136545&cm_re=1TB_2.5%22-_-22-136-545-_-Product))

(($109 * 72) / (72 * 1TB * .75) in dollars per gigabyte) = $.15/GB
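
The same arithmetic as a small Python sketch. The prices and the .75 factor come from the expression above; reading .75 as the usable fraction of raw capacity is an assumption, and the function and parameter names are made up for illustration. Chassis cost is not included.

```python
def cost_per_usable_gb(drive_price_usd, drive_capacity_gb, drive_count, usable_fraction):
    """Dollars per usable gigabyte for a shelf of identical drives (drives only)."""
    total_cost = drive_price_usd * drive_count
    usable_gb = drive_capacity_gb * drive_count * usable_fraction
    return total_cost / usable_gb


# 72 x 1 TB drives at $109 each, with the .75 factor from the comment above
print(round(cost_per_usable_gb(109, 1000, 72, 0.75), 3))  # 0.145
```

Note that the drive count cancels out, so the figure only moves with the per-drive price, the capacity, and the usable fraction.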

~~~
fragmede
It's worth noting that, at this time, most SSDs only come in a 2.5" version.
(Not for cheap bulk storage, but because that's arguably Supermicro's
intention for this chassis.)

~~~
justincormack
No, it's not intended for SSDs. There is not nearly enough IO bandwidth for SSDs
to make any sense. 2.5 inch hard drives are what you use if you want disks
with spindles, not just big storage. E.g. for virtualised database servers.

------
rarrrrrr
SpiderOak (the backup and sync company) is on the verge of launching an S3-like
service, tuned for archival class data. It's in beta now.

It's also entirely open source, so you could run it on your own hardware if
you wanted. It uses parity instead of replication for storing data (but with
arbitrary effective replication levels). So with R=3, you need a minimum of
about 10 nodes to be efficient.
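
To see why node count matters for a parity scheme, here is a rough Python sketch. It assumes an idealized k-of-n erasure code in which R is the number of node losses tolerated; that reading of "R=3" is an assumption, not a description of SpiderOak's actual implementation, and the function name is hypothetical.

```python
def parity_overhead(total_nodes, tolerated_losses):
    """Raw bytes stored per byte of user data when data is striped over
    total_nodes and any tolerated_losses of them may fail (ideal erasure code)."""
    data_nodes = total_nodes - tolerated_losses
    if data_nodes <= 0:
        raise ValueError("need more nodes than tolerated losses")
    return total_nodes / data_nodes


# Surviving 3 losses with plain replication takes 4 full copies (4.0x raw storage).
# With parity, the overhead shrinks as the stripe gets wider.
for nodes in (4, 6, 10, 20):
    print(nodes, round(parity_overhead(nodes, 3), 2))
```

Under these assumptions, a 4-node stripe is no better than replication, while around 10 nodes the overhead falls below 1.5x.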

The beta uses AMQP internally, and the upcoming release version is
Python/zeromq/gevent based.

<https://spideroak.com/diy/>

~~~
e1ven
This seems to be a great offering

The pricing is not entirely out of line compared with what I can get in-house,
and it's a fantastic solution for storing horribly bulk, low-reliability data.

$10 / 100GB comes out to $.10/GB/month. That means that you pay the total cost
of an in-house solution every month you use their service.

That's 12x the price of doing it yourself, but you don't need to buy
replacement drives, pay for electricity, or send someone to go replace drives.

It's not that much cheaper than Amazon in the base pricing, but it looks like
you can relax the number of copies for them to keep from 3 (default) to 1, and
cut the price down to 1/3?

Is this your experience? Thanks for the link.

~~~
rarrrrrr
FYI, I'm one of the founders of SpiderOak, in case that was not clear (it's in
my profile).

For the storage cost calculations: electricity, cooling, and bandwidth end up
costing more than the drives, even once you include all additional hardware to
keep them running.

Note of caution: I would recommend that shops build bulk storage hardware
themselves only if someone on their team has an intensive storage background.
There are a number of gotchas that are distinctly non-obvious; they are
capital intensive lessons to learn, and endanger data integrity. You saw what
Backblaze went through to design their storage pods: all of the self hosted
offsite backup companies do likewise. It's a core competency.

Typically it's much better to outsource or buy the more expensive business
storage products with hardware RAID, enterprise class drives, cache batteries,
and all that.

SpiderOak's DIY is designed for high reliability, but not high performance. In
price comparisons to S3, there's no option for reduced-reliability storage,
but also no charge for bandwidth in or out (up to a reasonable point).

Having said all that, if you are interested in building 100+ TB of storage
hardware in house, feel free to send me a mail; I may be able to save you some
difficulty. If you'd rather us host it, we do discounts for startups. :)

~~~
e1ven
Oh, Man, that'll teach me to look more carefully ;)

Oops.

Thanks for the awesome service. I've used SO for backups on and off, and I
appreciate the tech work you do. I particularly like that you do client-side
encryption. I just wish your UI was a bit better ;)

I must have been mixing up your service with diomede; IIRC, their techcrunch
article was about letting you choose the levels of redundancy you want to
keep.

I'll grant you that the bandwidth is certainly a price-factor, but your page
talks a LOT about how you're higher latency than S3, and not everyone needs
that speed for that price, etc, etc, and then your price comes out to be about
80% of theirs. Honestly, after reading your product positioning, I was
envisioning it coming out to like 10%.

You're absolutely right that building a huge in-house storage network is a
huge ordeal... if you're worried about making it fast and scalable. What I want
to do is to treat it as the equivalent of a huge tape drive- Just throw stuff
onto it, and hope for the best.

I've been toying around with the idea of dealing with it entirely in
application logic. Rather than RAIDing the disks, mount each one separately,
and keep a hashtable of my files in memory with their full path, including
redundant copies. Then, if a file fails, I have application logic try the
backup copy.
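
A minimal Python sketch of that idea, assuming each disk is already mounted at its own directory. The class name, the round-robin placement, and the SHA-256 check are illustrative additions, not part of the original description. The index lives only in memory, so a real version would need to persist it somewhere.

```python
import hashlib
import os
import shutil


class ReplicaIndex:
    """In-memory map from a logical file name to its copies on separately mounted disks."""

    def __init__(self, mount_points, copies=2):
        if copies > len(mount_points):
            raise ValueError("need at least one disk per copy")
        self.mount_points = list(mount_points)
        self.copies = copies
        self.index = {}  # logical name -> (sha256 hex digest, [full paths])

    def _pick_disks(self, name):
        # Hash the name to choose a starting disk, then take the next ones in order.
        start = int(hashlib.sha256(name.encode()).hexdigest(), 16) % len(self.mount_points)
        return [self.mount_points[(start + i) % len(self.mount_points)]
                for i in range(self.copies)]

    def store(self, name, source_path):
        # Reads the whole file at once; large files would want chunked hashing.
        with open(source_path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        paths = []
        for disk in self._pick_disks(name):
            dest = os.path.join(disk, name)
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            shutil.copyfile(source_path, dest)
            paths.append(dest)
        self.index[name] = (digest, paths)

    def read(self, name):
        digest, paths = self.index[name]
        for path in paths:
            try:
                with open(path, "rb") as f:
                    data = f.read()
            except OSError:
                continue  # disk or file is gone: try the next copy
            if hashlib.sha256(data).hexdigest() == digest:
                return data  # first copy that is readable and intact
        raise IOError(f"all copies of {name!r} are missing or corrupt")
```

Usage would look like `idx = ReplicaIndex(["/mnt/d0", "/mnt/d1", "/mnt/d2"])` (hypothetical mount points), then `idx.store("report.pdf", "/tmp/report.pdf")` and later `idx.read("report.pdf")`.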

Anyway, you know a LOT more about this than I do, so I'll shut up before I
embarrass myself further.

My point is just that there's a lot of people in the market who have huge
datasets currently, for various reasons. Lots of science departments,
analytics groups, etc. I'd love to be able to keep a nearline copy of work.

I don't mind if it takes a good 15-20 minutes to be able to access the first
byte, I just want to know that I can if I need to, faster than waiting for
Iron Mountain to ship me something.

Side note- I considered trying to justify it under BackBlaze's "$5, unlimited
space" offer, but I didn't think they'd go for it. ;)

------
armored
I remember being fairly skeptical of the Backblaze storage pods. What
happens when you have a power loss situation? What happens when a drive dies?
Seems like it would be pretty tough to maintain and service those custom pods
with the unnecessarily expensive non-redundant custom power supplies. I'm
going to second the Supermicro case recommendation made by wmf.

------
danohuiginn
I'm surprised there aren't any services selling raw storage space. i.e. no
RAID or other redundancy, no doubling as a static web-server, just a bunch of
disks.

So people (myself included) buy physical HDDs for local backup -- simply
because every online alternative jacks up the price by maintaining 3+ copies
and keeping everything permanently online.

~~~
fragmede
That's a curious idea. How would you access the raw storage? iSCSI? And then
you run your own RAID on top? What's the advantage to you? It's ever so
slightly cheaper?

Backblaze (no affiliation) is $5/month or $50/yr. That's pretty hard to beat.

Also, running a static web-server is fairly trivial to set up these days, what
would a service _not_ running a static web-server do/have?

Every online alternative has a vested interest in both A. saving money, and B.
not losing your data. As much as you trust a business to perform B adequately,
you implicitly trust it to perform A at least as well.

~~~
corin_
I need data backup on a personal level rather than for business reasons, and I
love what Backblaze offer... except that for my data and with my ~50kbps
upload speed, it would take about five years to upload it all to their
servers.
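
For a sense of scale, a quick Python sketch of that upload-time estimate. The data set size is not stated above, so the 1 TB figure is purely hypothetical, as is reading "kbps" as kilobits per second with the uplink fully saturated the whole time.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def upload_years(data_bytes, uplink_bits_per_second):
    """Years needed to push data_bytes through a fully used uplink."""
    seconds = data_bytes * 8 / uplink_bits_per_second
    return seconds / SECONDS_PER_YEAR


# Hypothetical 1 TB data set over a 50 kbit/s uplink
print(round(upload_years(1e12, 50_000), 1))  # 5.1
```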

------
ableal
> any web-services

You may want to take a look at <http://www.wuala.com>, although I'm not sure
it will fit at your scale.

Besides other features, they offer two not-so-common things:

1) you can trade your own space for remote storage

2) they have local encryption (at your end)

------
wavesound
Diomede might be worth a look. Their offline and nearline options are waaay
less expensive than S3. Never used them though so let us know how it works out
if you go with them.

<http://www.diomedestorage.com/>

~~~
e1ven
Their page appears to have a single unlinked image on it, and nothing else..?

~~~
davidandgoliath
[http://techcrunch.com/2009/02/27/diomede-offers-green-file-s...](http://techcrunch.com/2009/02/27/diomede-offers-green-file-storage-in-the-cloud-for-a-fraction-of-the-cost/)
seems to offer some more details. Their site leaves an awful lot to be desired.

~~~
e1ven
Ah, yes. I remember reading about them now. I'd be interested in looking over
their offerings, if they ever release any.

------
EwanToo
You ask about online bulk storage at reasonable rates, but you've not
mentioned what your acceptable rate of data loss is.

How much data are you willing to lose per day out of your 100TB?

~~~
e1ven
I'd be willing to lose 100TB at once, assuming it wasn't every day.

If there were a system where it lost even 20TB/day, I'd be interested,
depending on what the tradeoffs were! I'd expect that to be pretty seriously
cheap to offset it.

That's why I'm asking what options people know about. If you know one that is
randomly lossy but cheap, I'd be REALLY curious as to what their backend is!

~~~
EwanToo
I'd say most of the options people have proposed are in that range. There's no
backup, no simultaneous mirroring etc., which you are paying for with S3.

~~~
danohuiginn
Hardly. Hard disk failure rates are about 1% per year*. For many purposes
(basically, anything where this isn't the only copy of valuable data), that's
acceptable.

Amazon S3 costs around $1/GB/year. Let's say it has a 0% chance of data loss.
Suppose I can store something without any replication. Cost drops to
$0.50/GB/year, with a 2% chance of losing all my data. If the value of my data is
less than $25/GB, it makes financial sense to go for the unreliable option.

Much of my data is worth less than $25/GB: photos, video, old mail. Storing it
on RAID is a waste of money.

Besides, the chance of me deleting everything by human error is often much
higher than the chance of hardware failure.

* and some decent portion of those failures come with advance warning, so that
you can save at least some of the data.
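
The break-even reasoning above can be sketched in Python. The prices and the 2% figure are the ones from the comment; treating the 2% as a per-year probability of total loss, and the function names, are assumptions made for the example.

```python
def break_even_value_per_gb(reliable_cost, cheap_cost, loss_probability):
    """Data value ($/GB) at which the expected yearly loss equals the yearly savings."""
    savings = reliable_cost - cheap_cost
    return savings / loss_probability


def cheaper_option(value_per_gb, reliable_cost=1.00, cheap_cost=0.50, loss_probability=0.02):
    """Pick the option with the lower expected yearly cost per GB."""
    expected_cheap = cheap_cost + loss_probability * value_per_gb
    return "unreliable" if expected_cheap < reliable_cost else "reliable"


print(round(break_even_value_per_gb(1.00, 0.50, 0.02), 2))  # 25.0
print(cheaper_option(10))   # unreliable
print(cheaper_option(100))  # reliable
```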

------
iwwr
Are there instructions out there on how to build your own?

~~~
e1ven
[http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-h...](http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/)

------
engates
Would be interesting to install OpenStack Swift on top of Backblaze storage
and see how it works.

~~~
e1ven
I only looked at it briefly, but isn't Swift for multi-system installs, not
multi-spindle?

------
giantchamp
www.elephantbackup.com is a great choice. Live drive at a way cheaper cost.
Unlimited is $24 a year, $48 for the briefcase.

------
giantchamp
www.elephantbackup.com Unlimited for $24 a year. Can't beat it.

