

Nimbus.io: Open-source alternative to Amazon S3 - gglanzani
https://nimbus.io/

======
res0nat0r
Is being 100% open-source really the motivating factor to use this over S3? I
would think the only factors in choosing this over S3 would be price and
reliability/performance.

I couldn't care less if this is open source. If I'm going to offload my data to
a 3rd party, open source or not, and I'm worried about privacy, I'm going to
encrypt it. I honestly couldn't care less what happens on the back-end; just
commit to a data loss and reliability SLA and I'm happy.

If you can support my use case, or have reliable performance near 370k
requests/sec ([http://aws.typepad.com/aws/2011/10/amazon-s3-566-billion-
obj...](http://aws.typepad.com/aws/2011/10/amazon-s3-566-billion-
objects-370000-requestssecond-and-hiring.html)) and are cheaper than S3, then
we'll talk.

~~~
Egregore
Yes, being open source is a motivation, because you can create your own cloud
with your own hardware when you need it.

And it's additional assurance that you'll be able to deploy your system even
if they go out of business.

~~~
edanm
Is there anyone, anywhere, who considers that a plus? Who would actually
consider rolling out their own cloud infrastructure?

~~~
notmyname
Actually, many people consider it. Some are simply cautious about hosting
their data with a third party. Some are prevented from using a third party for
compliance or regulatory reasons. Also, it's generally more cost-effective for
extremely large datasets to be self-hosted rather than hosted by a third
party.

~~~
edanm
Aren't the bulk of customers who turn to the cloud rather small operations,
who are trying to "outsource" as much of their infrastructure issues as
possible? And aren't these customers much more concerned about pricing, rather
than possible future growth?

Note: I don't mean to ask this sarcastically. I'm actually asking.

~~~
notmyname
From my experience working with Rackspace Cloud Files, customer sizes are all
over the map. Some customers are very small. Some are very large. I know that
S3 has a similar variance in customer size.

From my experience talking to users (and potential users) of Openstack
(<http://openstack.org>), there again is variance. Most people are relatively
small (a few hundred GB to a few hundred TB). Some are much bigger (several
PB). The most exciting thing I heard was that CERN is evaluating Openstack
swift (<http://swift.openstack.org>) for their storage needs. A researcher
from CERN gave a keynote at the last Openstack design summit. CERN generates
25 PB / year and has a 20 year retention policy. They have vast storage needs.
In short, storage needs vary greatly.

I've seen that outsourcing infrastructure is great to a point, but the largest
users can generally get substantial cost savings by bringing their
infrastructure back in house.

------
notmyname
There is already a 100% open source version of Amazon S3: Openstack swift
(<http://swift.openstack.org>). Swift is in use by many companies, and it is
the software that runs Rackspace's Cloud Files product, storing petabytes of
data and billions of objects.

Swift's code is available on github (<http://github.com/openstack/swift>) and
devs and users are almost always available in #openstack on freenode. There is
a wealth of info available, and more to anyone who asks.

I'm all for encouraging many people to solve large-scale storage problems.
However, as others have pointed out, nimbus is claiming to be open without
providing much detail.

~~~
rarrrrrr
(SpiderOak / Nimbus.io cofounder here)

Thanks for your interest! Nimbus.io will have public git repositories,
"developed in the open" before we ever charge money to use the service. We
just haven't posted the links yet. :)

We admire OpenStack and Ceph as great examples of open source S3 alternatives.
Also, Riak+Luwak isn't protocol-level compatible with S3 but offers similar
capabilities and a truly elegant design.

Nimbus.io takes a different approach than the above options in that it focuses
on space efficiency using parity instead of replication, allowing the storage
of a little more than twice as much data using the same hardware. It's a
tradeoff of cost vs. latency. For long-term archival storage, throughput
matters greatly but latency matters less. That's why the price is $0.06/GB.
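The space math behind that claim can be sketched as follows. The 8 data + 2
parity split below is a hypothetical layout chosen for illustration; Nimbus.io
hasn't published its actual parameters:

```python
# Raw-storage overhead: 3x replication vs. a hypothetical 8+2
# erasure-coded (parity) layout. Overhead = raw bytes stored per
# byte of user data.
def overhead(data_fragments, extra_fragments):
    return (data_fragments + extra_fragments) / data_fragments

replication = overhead(1, 2)   # 3 full copies -> 3.0x raw per usable byte
parity      = overhead(8, 2)   # 8 data + 2 parity -> 1.25x raw per usable byte

# Same hardware holds replication/parity ~= 2.4x more user data,
# i.e. "a little more than twice as much".
print(replication / parity)
```

Both layouts survive the loss of any 2 fragments, so the durability story is
comparable while the raw footprint is very different.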

~~~
notmyname
Great to hear. I'll look forward to looking at your implementation and
exploring the tradeoffs you are making. I'm especially interested in how you
solve durability in the face of multiple, simultaneous hardware failures. I'm
also quite curious about how you are handling object metadata.

You are absolutely right that these things are greatly dependent on the use
case. I'm happy to see other people trying to solve these problems too.

Can you describe your API? Do you have your own? Are you reimplementing the S3
API? REST-ful? xmlrpc? How do you handle authentication and authorization?

------
asharp
So it stores data using RS encoding across multiple pieces.

I've had a quick look around the website, and the most information I could
squeeze out was from the blog.

The arch page <https://nimbus.io/architecture/> is devoid of architecture.

I don't see any mention of compression or dedup.

I don't see any mention of network-level failover/redundancy.

I don't see any mention of high-level CNC functionality/db arch (i.e. Swift's
notorious file-replicated SQLite database...).

I don't see a download source button. Is that just me?

Overall, sounds very interesting and rather promising. Who are the people
behind Nimbus.io?

~~~
MichaelApproved
_"I don't see a download source button. Is that just me?"_

Says right there on the front page: "We are currently in private beta. Please
sign-up and we will send you an invitation as soon as we are ready!"
Presumably, the download is behind the invite-wall.

~~~
cperciva
I think someone's confused about what "100% open source" means.

~~~
alpb
Exactly. I think open source is just a buzzword for them. There are still money
concerns behind this service. I was quite disappointed, since I was expecting a
"Source" link that goes to github or somewhere.

~~~
chalst
Wrt their SpiderOak backup service, from
[https://spideroak.com/faq/questions/35/why_isnt_spideroak_op...](https://spideroak.com/faq/questions/35/why_isnt_spideroak_open_source_yet_when_will_it_be/)

"Our founders and engineers have a strong open source background and we
consider a contributory relationship with the FOSS community as the normal
course of business. Thus, our plan all along has been to make our entire
client-side code base open source; however, as anyone who has worked with such
issues knows, it is often not quite that simple."

So they say they want to be good, but not quite yet. I've posted a question
about nimbus.io on that page.

~~~
asharp
Interesting. If they only open source their client-side libraries, it would be
a rather sad development.

------
mendable
Storing 50TB on Amazon S3 (US-EAST) Premium costs ~ $6,264

Storing 50TB on Amazon S3 (US-EAST) Reduced Redundancy costs ~ $4,160

Storing 50TB on Nimbus: $3,000

Is Nimbus's fault tolerance closer to the Premium S3 or the Reduced Redundancy
S3?

(for completeness, Nimbus's transfer out is $0.06 per GB vs Amazon's $0.12 per
GB).
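A quick back-of-envelope check of the Nimbus figure above, assuming decimal
units (50 TB = 50,000 GB) and working in cents to avoid float rounding:

```python
# Flat-rate Nimbus storage cost for 50 TB at $0.06/GB-month.
gb = 50 * 1000                       # 50 TB in decimal GB
storage_cents_per_gb = 6             # $0.06/GB
monthly = gb * storage_cents_per_gb / 100
print(monthly)                       # 3000.0 -> the $3,000 quoted above
```

(The S3 figures are harder to reproduce exactly because S3 uses tiered
per-GB pricing.)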

~~~
gglanzani
They say[^1] that they can tolerate destruction of any 2 nodes without data
loss. I don't know how many nodes Amazon S3 premium can tolerate.

[^1]: <https://nimbus.io/architecture/>

~~~
fredoliveira
Amazon doesn't talk about their numbers either. The only thing they do say is
that RRS (reduced redundancy storage) _'stores objects on multiple devices
across multiple facilities, providing 400 times the durability of a typical
disk drive, but does not replicate objects as many times as standard Amazon S3
storage, and thus is even more cost effective.'_

This is at the main page: <http://aws.amazon.com/s3/> (search for RRS)

~~~
zargon
Amazon says that S3 provides eleven nines (99.999999999%) durability of files.
So if you have 100 billion objects in S3, you should expect to lose on average
1 per year. Or, if you have 10,000 files, you should expect to lose 1 per 10
million years. In addition they say it can tolerate the simultaneous failure
of two datacenters. Nimbus, with 3 copies total, appears much less
redundant... but nobody knows how Amazon calculated their eleven nines claim.
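The expected-loss arithmetic implied by the eleven-nines figure can be checked
directly:

```python
# Eleven nines of annual durability means a per-object loss
# probability of 1e-11 per year.
annual_loss_rate = 1e-11

# 100 billion objects -> about 1 expected loss per year.
print(100e9 * annual_loss_rate)

# 10,000 objects -> about one loss per 10 million years.
print(1 / (10_000 * annual_loss_rate))
```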

------
JoshTriplett
The Eucalyptus project provides an Open Source alternative to EC2 and S3 as
well, with a compatible API.

------
rmc
This is a bit misleading. A big advantage of an "Open Source Software"
solution as opposed to a "Proprietary" solution is that you don't have to
worry if the original provider goes bust, doesn't release patches, or
discontinues the product, and you don't have to worry about how many licences
you have, etc. In other words, Open Source gives _you_, not the people who
made it, the power & control.

With cloud hosting, like Amazon S3, where you store all your data (or servers)
on a 3rd party's servers, there are legitimate concerns about control & access
(i.e. how much control do you have to do things, how much control do you have
to stop someone else doing things (i.e. privacy)). So an "Open Source
Alternative to S3" sounds like a good thing that would not have any of these
drawbacks.

If someone hears "Open Source S3" and thinks that they, not the hosting
company, have power & control, then they would be disappointed by Nimbus.io,
since it has all the drawbacks of S3.

------
jpwagner
Their blog post from yesterday describing nimbus...

[https://spideroak.com/blog/20111107183539-spideroaks-new-
ama...](https://spideroak.com/blog/20111107183539-spideroaks-new-
amazon-s3-alternative-is-half-the-cost-and-open-source)

------
Joakal
SpiderOak is a mix of proprietary and open source according to Wikipedia. Does
anyone know how much of SpiderOak is open source?

I wonder if they considered OpenStack that several companies including
Rackspace, NASA, et al, uses?

~~~
gglanzani
Not much of SpiderOak is open source, see <https://spideroak.com/code>

However they are considering open sourcing more and more of their code.

------
gst
"The server and client components are all free and open source software."

Can't find any link to the source. _Are_ they open source? Or is it only
planned to release the source eventually?

------
itsnotvalid
Please note that if you are going to use less than 100GB for a project (for
example, storing your own stuff), you are better off grouping your data, as it
is $6 per 100GB (not $0.06 per GB) for stored data. Transfer out is $0.06/GB,
though. So if you have 101GB (the GB is Giga _Byte_, right?) it should cost
you $12 a month.
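If storage really is billed in $6 blocks of 100GB as described above (assuming
simple round-up billing), the math looks like:

```python
import math

# Storage billed in $6 blocks of 100 GB: usage rounds up to the
# next full block. Assumes simple round-up billing.
def monthly_storage_dollars(gb_stored):
    return math.ceil(gb_stored / 100) * 6

print(monthly_storage_dollars(100))  # 6  -> one block
print(monthly_storage_dollars(101))  # 12 -> the $12 example above
```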

But if you are just dealing with personal backups, they have plans at $10 a
month for 100GB+ (with an affiliate program to earn up to 50GB of extra space,
though that part is given to free users as well), so that is pretty close to
$6 for 100GB plus $4 for 40GB of transfer out. Since they claim they are using
a very similar or the very same system for their SpiderOak service, you can
bet they are just the same thing.

Or just go with a free account and grab 2GB + whatever I get from affiliation.

On the other hand, they are saying it's a trade-off of low latency for price.
If their data about S3 is correct, it should be slower than S3: S3 has 3 full
copies to read from, versus Nimbus's 3 drives sharing parity, so effectively
just one copy to read.

------
samarudge
"Build by SpiderOak on the same proven backend storage network which powers
hundreds of thousands of backups"

This concerns me slightly; backup storage is a whole different world from
real-time data storage. Backups are write-once, read-occasionally, whereas
some people use S3 as a makeshift CDN, constantly reading data.

Parity-based replication is great for backups, but would it not have
performance implications if every request is reading from multiple
disks/servers/nodes? I'm not an expert on hardware, but I would have thought
being able to read an entire file off one disk is faster than having to put
together pieces of data from multiple disks. Anyone want to correct/inform me?

If you can offer me a serious alternative to S3 at a cheaper price, and open
source software, I can't wait to try it out. I might sound negative but I just
wanted to put across my first thoughts on having a look around the site.

~~~
SaltwaterC
This is exactly what they said on their blog post:

Long term archival data is different than everyday data. It's created in bulk,
generally ignored for weeks or months with only small additions and accesses,
and restored in bulk (and then often in a hurried panic!)

This access pattern means that a storage system for backup data ought to be
designed differently than a storage system for general data. Designed for this
purpose, reliable long term archival storage can be delivered at dramatically
lower prices.

~~~
rcthompson
Their architecture page seems to confirm this. It seems that their service is
explicitly designed to have different performance characteristics from Amazon
S3, so maybe they aren't quite a _direct_ competitor to S3, but there are
probably a lot of people using S3 for the use cases that Nimbus.IO claims to
do better on, simply because S3 was available at the time.

~~~
rarrrrrr
Yes exactly. Nimbus.io is designed for long term archival storage at more
affordable prices. We think it's a great time to be competing on price.

We may compete with S3 for low-latency service later on (latency can be made
arbitrarily low by spending enough money on caching). Initial calculations
suggest we could be almost as low-latency as S3 and still undercut its price
by a good margin.

~~~
asharp
Latency may be made low through caching, but depending on the access
distribution, the point at which additional cache becomes uneconomical may
come well before the edge of your performance envelope.

How are you calculating your latency? Also, what distribution do you assume
your file accesses will come from?

------
hasanove
I have submitted my email twice and did not get any confirmation, neither on
the page nor in my inbox. Not sure if this is by design or a bug, but it's
confusing in either case.

~~~
gglanzani
I had that too. I mailed them (info@nimbus.io), let's see if the behavior is
normal.

~~~
gglanzani
They replied to my email. The registration got through, but the confirmation
did not work (at the time?). So if you submitted your email, they'll be in
touch when it's time.

------
serverascode
I think that there should be more open source systems that do parity across
servers rather than replication, so to me this is great! AFAIK, something like
Swift suggests keeping 5 copies of everything, so if you want 1PB usable you
need 5PB raw. But with dual parity spread across servers, like in Nimbus, you
could probably get 80% usable vs raw. This would be similar to Isilon.
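The usable-vs-raw arithmetic above, with an assumed 8 data + 2 parity split
(Nimbus hasn't published its layout; any k+2 split with k=8 gives the 80%
figure):

```python
# Usable fraction = data units / total units stored on disk.
def usable(data_units, total_units):
    return data_units / total_units

print(usable(1, 5))    # 0.2 -> 1 PB usable needs 5 PB raw with 5 copies
print(usable(8, 10))   # 0.8 -> 80% usable vs raw with 8 data + 2 parity
```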

I wonder if Swift may support something similar in the future.

------
dwm
See also: ceph.newdream.net

~~~
asharp
Ceph is a very interesting project. RADOS, their distributed block store, is
now mainline I believe, and the project is coming along in leaps and bounds.

I'm unsure if anybody has a large scale RADOS based blob store though. It
would be interesting to see how it holds up.

------
soult
Since the nimbus.io page has only a little information, you can take a look at
the SpiderOak DIY archival storage page[1], which seems to be the predecessor
of the nimbus.io offering.

1: <https://spideroak.com/diy/>

------
quadhome
How is this open source?

------
tmcw
And there you have your answer for why NuoDB.com used to be NimbusDB but then
changed all of a sudden.

~~~
mbreese
I'm pretty sure that had to do with Nimbus Data, not this project. I'm pretty
sure that this project will get a cease and desist too if it becomes popular.

Counting down to a name change in 3... 2...

------
CPlatypus
It's not open source. The code is not _currently_ available. Given that the
code for SpiderOak itself has been "coming real soon now" for a year, I'm not
going to hold my breath. Even if/when that day does come, it will be "thrown
over the wall" open source rather than "developed collaboratively" open
source. At least Swift, for all of its alleged technical deficiencies (which
don't seem to prevent it being used to billions of files already), hasn't been
guilty of false advertising. Alternatively you have Walrus, tabled (from
Project Hail), Elliptics, Luwak, Gluster's UFO, and probably more. Practically
all of these have solved the harder problems of cluster management, API
implementation (including the security that nimbus.io seems awfully quiet
about), OS integration, etc. Without source, nimbus.io can't credibly claim to
have reached parity in all of these other areas, or that it would take less
for it to reach parity than for the others to add the one feature (erasure
coding instead of replication) that they crow about.

~~~
rarrrrrr
(SpiderOak / Nimbus.io cofounder here)

Note that this is just an announcement and invite site to show the pricing at
$0.06/GB. Nimbus.io will have public git repositories, "developed
collaboratively in the open" before we ever charge money to use the service.
(And this is a wholly different project than the SpiderOak backup/sync
software.)

FYI, you can see the git repos for the prototype we built of this a while back,
when we called it our storage "DIY API". <https://spideroak.com/diy/> Note
that the code and the rest of the information on that page is way out of date
since it was an early design and prototype.

I'm not sure erasure coding vs. replication is a simple change for other
distributed storage projects. It affects the whole architecture. We researched
pretty heavily before building. If it had been simple to modify any of the
alternatives, this project wouldn't exist. I'm more than happy to be proven
wrong though!

* Edited for pricing info.

~~~
CPlatypus
"I'm not sure erasure coding vs. replication is a simple change for other
distributed storage projects."

It depends on a few factors: how modular the architecture is overall, whether
the existing replication is synchronous or asynchronous, etc. I'm working on
the GlusterFS replication code right now in another window (OK, I _should_ be
but I'm typing here). I can assure you that it would be possible to replace
replication with erasure coding just by replacing that one module, without
perturbing the rest of the architecture. I've also been through the tabled
code and I think it would be possible there too. I suspect the same would be
true for Elliptics, but probably not Swift. Can't tell for Luwak; that would
require more thought than I can afford to put into it right now.

This is something we've actively considered for GlusterFS/HekaFS, and might
still do some day - though it's more likely to be on the IDA/AONT-RS side than
RS/EC. The downside is that, while these approaches do offer better storage
utilization, they also consume more bandwidth. Also, queuing effects can turn
a bandwidth issue into a latency issue. This is especially the case for read-
dominated workloads, where you just can't beat the latency of reading exactly
the bytes you need from one replica. For these reasons I don't think either
full replication or redundant-encoding schemes will ever entirely displace the
other. Each project must prioritize which to implement first, but that doesn't
mean those that have implemented replication first are precluded from offering
other options as alternatives. It's really _not_ an architectural limitation
in most cases. It's just timing.
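A toy single-parity code (XOR, a stand-in for real Reed-Solomon) makes the
tradeoff above concrete: encoding is cheap, but reconstructing a lost fragment
means reading every surviving fragment, whereas full replication can serve a
read from a single copy:

```python
from functools import reduce

def xor_frags(a, b):
    """XOR two equal-length byte fragments."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data, k):
    """Split data into k fragments plus one XOR parity fragment."""
    frag_len = -(-len(data) // k)                       # ceil division
    padded = data.ljust(frag_len * k, b"\0")
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    frags.append(reduce(xor_frags, frags))              # parity fragment
    return frags

def rebuild(frags, lost):
    """Recover one lost fragment: XOR of all survivors."""
    return reduce(xor_frags, (f for i, f in enumerate(frags) if i != lost))

frags = encode(b"hello nimbus", k=3)
assert rebuild(frags, 1) == frags[1]   # any single loss is recoverable
```

Surviving two simultaneous losses, as Nimbus claims, requires a second,
independent parity fragment (Reed-Solomon over a Galois field rather than
plain XOR), which is where the architectural entanglement being debated here
comes in.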

