Nimbus.io: Open-source alternative to Amazon S3 (nimbus.io)
271 points by gglanzani on Nov 8, 2011 | 94 comments



Is being 100% open-source really the motivating factor to use this over S3? I would think the only factors in choosing this over S3 would be price and reliability/performance.

I could care less if this is open source. If I'm going to offload my data to a 3rd party, open source or not, and I'm worried about privacy, I'm going to encrypt it. I honestly could care less what happens on the back-end; just commit to a data-loss and reliability SLA and I'm happy.

If you can support my use case, or have reliable performance near 370k requests/sec (http://aws.typepad.com/aws/2011/10/amazon-s3-566-billion-obj...) and be cheaper than S3, then we'll talk.


As others have said, being open source is important because it gives you more control over your data. If you have access to the code that is storing your data, you have the option to host it yourself or pay someone else to host it with the same client. I've written about this more on my blog (http://programmerthoughts.com/openstack/democratization-of-d...).

If you are looking for an alternative to S3, I'd ask that you look at Openstack swift (http://swift.openstack.org). It's 100% open source, proven at scale in production, and, if you are hosting it yourself, can offer lower op-ex than using S3 (of course, there is a cap-ex cost to buy your hardware).


> If you have access to the code that is storing your data, you have the option to host it yourself or pay someone else to host it with the same client.

That's certainly true in 'regular' open source economics.

But storage is somewhat unique in that the significant cost factor is not the price of alternative proprietary software but the hardware costs of the storage medium itself - and those are constant regardless of the license structure the software layer is using.

Additionally, economies of scale come into play to such an extent that the costs of hosting this kind of storage myself will be wildly more expensive than a volume player like Amazon who operates entire datacenters. (This argument goes beyond just the op-ex/cap-ex tradeoff.)


We also expect price to be a motivator, since at $0.06/GB it costs less than half of what S3 does.

But we're not competing with S3 directly as a general cloud storage solution. We're specifically focusing on the case of long term archival storage.

You can compare the two services as tradeoffs from the expression: Inexpensive, High Throughput, Low Latency (pick any two.)

S3 picks High Throughput and Low Latency.

Nimbus.io picks Inexpensive and High Throughput.

But for bulk archival and restore tasks, does 100ms of latency really matter to you? In other words, are you equally happy if your backup/restore job completes in 2 minutes vs. 2 minutes and 0.1 seconds? Do you care enough to pay more than twice as much? So that's why we're focusing on the archival market.


Yes, being open source is motivation, because you can create your own cloud with your own hardware when you need it.

And it's additional assurance that you'll be able to deploy your system even if they go out of business.


(SpiderOak / Nimbus.io cofounder here)

In addition to supporting the founders' personal ethics about software freedom, we feel an open source backend is important just for the sake of confidence.

Some people will want to purchase the minimum of 10 machines and host a Nimbus.io storage cluster themselves (and we are also making our hardware specs open source.) Other cloud storage providers may even do this. We hope a few people will consider the hosted option, paying Nimbus.io $0.06 per GB.

In any case, all of these are a win for us. We're already spending money every day to maintain a reliable storage backend for our encrypted Backup & Sync business at SpiderOak.com. Nimbus.io is an evolution from that. Community involvement here is most welcome. :)

Aside from that, it's just a design we are excited to share. Every other distributed storage system I could find uses replication instead of parity. A system based on parity sacrifices latency but can deliver higher throughput on individual requests (at about 1/3 the cost.) There are use cases even outside of archival storage where this is attractive.
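
To make the parity-vs-replication tradeoff concrete, here is a minimal Python sketch (an illustration only, not Nimbus.io's actual design; real parity systems use Reed-Solomon-style erasure codes such as the zfec library rather than simple XOR):

    # Illustrative sketch (not Nimbus.io's actual code): single-parity
    # striping vs. 3x replication. Real parity systems use Reed-Solomon
    # codes to survive multiple losses; XOR just shows the storage math.

    def xor_segments(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def stripe_with_parity(data: bytes, k: int = 10):
        """Split data into k equal segments plus one XOR parity segment."""
        seg_len = -(-len(data) // k)              # ceiling division
        padded = data.ljust(seg_len * k, b"\x00")
        segments = [padded[i * seg_len:(i + 1) * seg_len] for i in range(k)]
        parity = segments[0]
        for seg in segments[1:]:
            parity = xor_segments(parity, seg)
        return segments, parity                   # k + 1 segments stored

    data = b"x" * 10_000
    segments, parity = stripe_with_parity(data)

    # Any single lost segment is recoverable by XOR-ing the survivors:
    lost = segments[3]
    survivors = segments[:3] + segments[4:] + [parity]
    rebuilt = survivors[0]
    for seg in survivors[1:]:
        rebuilt = xor_segments(rebuilt, seg)
    assert rebuilt == lost

    # Storage overhead: 3x replication stores 300% of the data; a 10+1
    # stripe stores 110%. That is where "a little more than twice as
    # much data on the same hardware" comes from.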


I don't see how a parity based implementation can work in a meaningful way across multiple datacenters. You certainly couldn't rebuild if you lost an entire datacenter due to disaster. Replication is the only way here.

So any comparison to S3 in that regard is meaningless - Nimbus can't achieve that level of durability, correct?

Additionally, if you're just doing parity across multiple chassis in a single datacenter and lost a couple racks due to a power outage, it would seem the network would likely shit the bed trying to rebuild, potentially bringing the whole system down. Have you guys worked through nastier failure cases that architectures like S3 can avoid?


Excellent points.

Geographic redundancy with parity complements the network topology we find in many cities: a metro-area fiber ring connecting many data centers with low-cost site-to-site (not internet) bandwidth. It's even lower cost to just buy excess capacity with lower QoS.

Every archival storage provider I've talked to has a write-heavy workload; write traffic may be more than 3x read traffic. In this situation, replicating between two sites requires a site-to-site connection equal in capacity to the incoming data rate. Since site-to-site connections are full-duplex, in the parity system the bandwidth for reads and writes is provided at a similar price to what would be spent on replication bandwidth for writes alone.
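
A back-of-the-envelope sketch of that bandwidth argument, with made-up traffic figures (nothing here comes from Nimbus.io):

    # Hypothetical traffic mix for a write-heavy archival workload.
    write_gbps = 3.0   # incoming write traffic
    read_gbps = 1.0    # reads are maybe 1/3 of writes

    # Two-site replication: every written byte crosses the site-to-site
    # link once; reads are served from the local replica.
    replication_cross_site = write_gbps          # 3.0 Gbps, one direction

    # Parity striped across sites: write shares flow outbound while
    # read shares flow inbound. On a full-duplex link those directions
    # don't compete, so a link provisioned for replication's write
    # traffic can also carry the smaller read traffic on its reverse path.
    parity_outbound = write_gbps                 # ~3.0 Gbps out
    parity_inbound = read_gbps                   # ~1.0 Gbps in

    print(replication_cross_site, parity_outbound, parity_inbound)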

That said, the first iterations of Nimbus.io won't provide geo redundancy beyond the geo-redundancy that creating an offsite backup inherently provides. We expect to add on geo redundancy storage as an upgrade option at a slightly higher price (still way under S3.)

Replying to your second point: under transient conditions, like a couple of racks losing power, the system wouldn't trigger an automatic rebuild right away. It would continue to service requests with parity and hinted handoff until the machines come back online. In any case, when the system decides a full rebuild is needed, the rebuild rate is balanced against servicing new requests (similar to how a RAID controller can give tunable priority to rebuild vs. traffic.)


> I don't see how a parity based implementation can work in a meaningful way across multiple datacenters. You certainly couldn't rebuild if you lost an entire datacenter due to disaster.

Sure you can. Given a system that can tolerate loss of N shares, you need to ensure that no datacenter holds more than N shares. In practice, this means you need many smaller datacenters, not two or three; whether that is economically feasible depends on the provider.
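
A small sketch of that placement rule; the coding parameters and datacenter names are illustrative, not from any real deployment:

    # With an erasure code that tolerates the loss of any N shares, cap
    # each datacenter at N shares so losing a whole site loses no data.

    def place_shares(total_shares: int, tolerate_loss: int, datacenters):
        per_dc_cap = tolerate_loss       # max shares any one DC may hold
        if per_dc_cap * len(datacenters) < total_shares:
            raise ValueError("not enough datacenters for this placement rule")
        placement = {dc: 0 for dc in datacenters}
        for i in range(total_shares):
            dc = datacenters[i % len(datacenters)]   # round-robin keeps counts even
            placement[dc] += 1
        assert all(n <= per_dc_cap for n in placement.values())
        return placement

    # 10-of-12 coding (any 2 shares may be lost) needs at least 6 sites:
    print(place_shares(total_shares=12, tolerate_loss=2,
                       datacenters=[f"dc{i}" for i in range(6)]))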


Isn't that the whole point of S3 - you don't need to make your own cloud with your own hardware?


It's a leap that folks desiring cloud storage will want to host their own some day. I think it's a little like "I want an open-source car, so when they switch back to horses, I'm ready!"


It's less like that than it is "When we're operating at a scale that makes it more affordable to own hardware than to lease, then it'd be nice if we could."

The other unmentioned benefit is in scratching your own itch. If you want feature X, and Amazon won't give it to you, you can develop it in-house and host it yourself.


I think it's more like saying "I want a driver's license, so when I can afford a car I don't have to keep riding the bus."


The license for the server-side code - the code you would use to create your own cloud - is AGPL. Isn't this license too restrictive for business?


Why would it be? It's the same as the GPL; the only difference is that the sources for modified versions must be made available to users who interact with it over a network. I don't see what's restrictive for business.


If you go and sell somebody a hosting solution based on Nimbus, you'd need to share your source code.

What I'm not sure of, is if you build, say, a photo sharing webapp using Nimbus as the storage back-end, does your webapp become AGPL by linking? I'm fairly certain GPL would require this, but, as per the rationale for AGPL, you don't care about that when you run a webapp.

Curiously, if Nimbus adopted the exact S3 API instead of "similar to", it would not constitute linking, as it's using a standard interface.


AGPL uses the same definition of linking as the GPL, so communication over a network API to a storage backend is not linking.

http://www.quora.com/Does-the-AGPL-extend-the-idea-of-linkin...


Ah, thanks for that. I was under the impression that the bar for linking across the network was rather higher.


Is there anyone, anywhere, who considers that as a plus? Who would actually consider rolling out their own cloud infrastructure?


Actually, many people consider it. Some are simply cautious about hosting their data with a third party. Some are prevented from using a third party for compliance or regulatory reasons. Also, it's generally more cost-effective for extremely large datasets to be self-hosted rather than hosted by a third party.


Aren't the bulk of customers who turn to the cloud rather small operations, who are trying to "outsource" as much of their infrastructure issues as possible? And aren't these customers much more concerned about pricing, rather than possible future growth?

Note: I don't mean to ask this sarcastically. I'm actually asking.


From my experience working with Rackspace Cloud Files, customer sizes are all over the map. Some customers are very small. Some are very large. I know that S3 has a similar variance in customer size.

From my experience talking to users (and potential users) of Openstack (http://openstack.org), there again is variance. Most people are relatively small (a few hundred GB to a few hundred TB). Some are much bigger (several PB). The most exciting thing I heard was that CERN is evaluating Openstack swift (http://swift.openstack.org) for their storage needs. A researcher from CERN gave a keynote at the last Openstack design summit. CERN generates 25 PB/year and has a 20-year retention policy. They have vast storage needs. In short, storage needs vary greatly.

I've seen that outsourcing infrastructure is great to a point, but the largest users can generally get substantial cost savings by bringing their infrastructure back in house.


The cloud is great for scaling, but once you have a large dataset and more-or-less predictable growth, it could easily become more economical to handle it yourself. Using something like Nimbus would make such a migration easier.

On the other hand, it's not like the S3 interface is rocket science. Rewriting your app's file-storage interaction is the least of the effort in a multi-terabyte migration.


A lot of Canadian companies are unable to use S3 (or EC2) due to the Patriot Act.

Sure, they can get around this by using European buckets, but that kind of sucks for latency.

I can easily imagine setting up a company using this software on servers in Canada. So yes, it's a plus.

(It's not just Canada of course, and it's not just the Patriot Act. Gambling companies, for example, can't host in the US.)


A cloud hosting provider.


"I could care less if this is open source" David Mitchell explains why this phrase makes no sense and means exactly that opposite of what you want to say: http://www.youtube.com/watch?v=om7O0MFkmpw


The phrase is usually used ironically: "I could care less" is used to mean the opposite, "I couldn't care less".

Similarly, when my daughter says "Nice hat, Dad", she is not actually complimenting me on my choice of haberdashery, but rather, pointing out that she thinks it is not nice at all.

This message brought to you by Irony: Making Communication More Interesting Since the Dawn of Language.


For what it's worth, I would not classify this as irony (or at least it's hardly a prototypical case). I'm sure the current meaning of "I could care less" is quite thoroughly conventionalized: it's part of everyday speech, and many people do not notice any non-literal effects such as irony -- as evidenced by the prescriptivist videos which feel the need to explain to people the "true" meaning of the expression. Irony may have had a role in the etymology of the expression. All of this is very similar to a dead metaphor.


> I'm sure the current meaning of "I could care less" is quite thoroughly conventionalized: it's part of everyday speech

Only in some places - I can only remember having heard it on television from the US. I shiver in pain every time I hear it, too, so I'm pretty sure I haven't heard it in person (having lived in New Zealand and Australia).


In fairness, I wouldn't offer the pedantry of a prescriptivist who was moved to create a video as evidence that the average person doesn't get the sarcasm of "I could care less". I think it's rather more likely that the average person couldn't care less whether the phrase is literally correct, as long as the listener or reader understands its meaning.


I'm sure most people would see the original non-literal features of the phrase if they were to think about it. The point is, they don't! Not because they're dumb but because the entire expression has unit status in their vocabulary. The fact that people do not notice the original non-literalness in the phrase (and indeed understand it as intended) is evidence that it's not non-literal anymore.

I'm harping on about this because it's such a nice poster child for an entrenched (conventionalized) meaning of an entire expression as opposed to just a word, and for the lack of compositionality of meaning in language. In other words, there's more to the meaning of a sentence than just the meaning of its words. Compositionality is one of the points of debate between different schools of thinking in linguistics.


I suppose I just have a hard time getting too exercised about what is, essentially, a banal artifact of a highly idiomatic language. When I find myself getting bogged down over a particular expression, I step back and ask myself: if person A uses this expression, will person B understand what they mean? Really, that is all that matters.


You are talking about sarcasm. Not irony. And a commonly made grammar mistake isn't irony. Unless your whole post was wrapped in a big <sarcasm> tag and I've just made a fool out of myself.


The relationship between sarcasm and irony is subtle; sarcasm often makes use of irony, but is characterized by its "biting" nature. The quote from my daughter is ironic and sarcastic; the quote about "caring less" is ironic, and may or may not be sarcastic depending on the context.

Irony here refers to verbal irony, a discrepancy between the literal meaning of a phrase and its intended meaning, such as saying "What a nice day!" when it is raining.

Thus, "I could give a shit" and "I couldn't give a shit" are identical in meaning, as the former is doubtless intended ironically. Similarly for caring less.


It's a contranymic idiom. A contridiom!


No. He is talking about irony, not sarcasm. Irony is typically understood to mean saying A while being aware (or of the opinion) that !A. Viz. "Nice hat, dad", or "Real good idea" (when it's not). Sarcasm (cutting remarks) often involve irony, but not always; it's an orthogonal concept.

Also, dropping the "not" in "Could not care less" is not reasonably said to be a grammar mistake. It's usually not sarcasm, either. Whether you agree that it's irony is a different matter (I'm pretty sure it's not).


Really now? Next you'll be telling us that "then" is used ironically instead of "than", even when "than" is what people mean.

Face it, people say "I could care less" for the same reason they say "then" instead of "than". There's just something about the English language that makes most of its native speakers unable to use it.

You came up with this irony theory because you've made the same mistake yourself, and your ego wants to deflect the accompanying shame.

The word "not" is not a particularly big word, but then again, most native speakers can't use "than" either.

It's amazing what an outburst of theorizing wankery your comment sparked.


This argument is pervasive on HN. Here is the best example of it[1], and please see CodyRobbins's posts[2] on the subject, some of HN's best posts ever.

[1] - http://news.ycombinator.com/item?id=853100

[2] - http://news.ycombinator.com/item?id=854042


Wow. At first I thought you were linking to something actually interesting and thoughtful regarding the nature of security and open source software. Turns out you were being boorish, prejudicial and wrongheaded about the absolutely normal and acceptable use of language. There is no such thing as linguistic prescriptivism. What you are espousing is the linguistic intellectual equivalent of creationism. It only exists in unfortunate circles of bored, annoying laymen and their grammarian forebears, who were equally unfortunate, bored, and annoying. For 10,000 reasons why you should leave everyone alone with your silly prejudices, see Language Log or any of the dozen or so other blogs linguists run. </rant>


If saying you could care less is the most you can say about something, that's an insult. If you say that you couldn't care less, that's just a lie: you're responding to it, after all.

http://en.wikipedia.org/wiki/Damn_with_faint_praise

I love David Mitchell, but this rant from other people is a long-held irritation of mine: false pedantry :)


There is already a 100% open source version of Amazon S3: Openstack swift (http://swift.openstack.org). Swift is in use by many companies, and it is the software that runs Rackspace's Cloud Files product, storing petabytes of data and billions of objects.

Swift's code is available on github (http://github.com/openstack/swift), and devs and users are almost always available in #openstack on freenode. There is a wealth of info out there, and more for anyone who asks.

I'm all for encouraging many people to solve large-scale storage problems. However, as others have pointed out, nimbus is claiming to be open without providing much detail.


(SpiderOak / Nimbus.io cofounder here)

Thanks for your interest! Nimbus.io will have public git repositories, "developed in the open" before we ever charge money to use the service. We just haven't posted the links yet. :)

We admire OpenStack and Ceph as great examples of open source S3 alternatives. Also, Riak+Luwak isn't protocol-level compatible with S3 but offers similar capabilities and a truly elegant design.

Nimbus.io takes a different approach than the above options in that it focuses on space efficiency, using parity instead of replication, allowing the storage of a little more than twice as much data on the same hardware. It's a tradeoff of cost vs. latency: for long term archival storage, throughput matters greatly, but latency much less so. That's why the price is $0.06/GB.


Great to hear. I'll look forward to looking at your implementation and exploring the tradeoffs you are making. I'm especially interested in how you solve durability in the face of multiple, simultaneous hardware failures. I'm also quite curious about how you are handling object metadata.

You are absolutely right that these things are greatly dependent on the use case. I'm happy to see other people trying to solve these problems too.

Can you describe your API? Do you have your own? Are you reimplementing the S3 API? REST-ful? xmlrpc? How do you handle authentication and authorization?


Swift is....... woeful.

Just as an example, they store object listings in SQLite databases that are file-replicated between nodes for HA. Thus when you have too many files in one container, performance sinks like a stone (assuming the database never gets corrupted, etc...)

I'm all for people working in this space though. A monoculture is rarely good for anybody.


woeful seems a little harsh :-)

There is a current workaround for the issue you describe: use many containers. However, there are 2 ways to solve the issue for good. One (the simplest) is to have dedicated hardware for the account and container servers, and provide that hardware with plenty of IOPS. Our testing has shown sustained 400 puts/sec on a billion item container with this kind of deployment. The other solution is to change the code to automatically shard the container (transparent to the client) as it gets big. This is something we (the swift devs) are working on. I hope that it will be done in the next several months, but, of course, a complex feature like this is hard to fit to a predetermined timeline.


:(

You're going to shard a SQLite database into a series of objects to deal with "large" containers?


The idea is to limit each "shard" to some configurable number of objects, say, 1 million. As the container grows, the db can be split in two, and each of the two new pieces can grow. The original container entity keeps an index listing of what each of its "child shards" holds, i.e. the start and end markers.
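
A rough sketch of the marker-based routing described above (an illustration of the idea, not actual swift code):

    # Route object names to child shards by [start, end) name ranges.
    import bisect

    class ShardedContainer:
        def __init__(self, shard_markers):
            # markers are the lowest object name held by each shard, sorted
            self.markers = sorted(shard_markers)

        def shard_for(self, object_name: str) -> int:
            # the rightmost shard whose start marker <= object_name
            return max(0, bisect.bisect_right(self.markers, object_name) - 1)

    index = ShardedContainer(["", "m", "t"])   # 3 shards: [""..m), [m..t), [t..]
    assert index.shard_for("apple") == 0
    assert index.shard_for("mango") == 1
    assert index.shard_for("zebra") == 2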

There are tricky problems to solve, of course. How do listings work? Will shards ever be collected? What are the performance tradeoffs? How does replication handle shard conflicts?

These issues will be worked out, and it should eliminate the write bottleneck in large containers. (Note that reads are/were never affected by this issue.)

This implementation of container sharding is something that is being evaluated. It may or may not ever make it into swift itself.


But why SQLite? And why file-based?

Why don't you guys use a proper distributed database to handle container mappings/etc?


So it stores data using RS (Reed-Solomon) encoding across multiple pieces.

I've had a quick look around the website, and the most information I could sort of squeeze out was from the blog.

The arch page https://nimbus.io/architecture/ is devoid of architecture.

I don't see any mention of compression or dedup.

I don't see any mention of network-level failover/redundancy.

I don't see any mention of high-level CNC functionality/db arch (i.e. swift's notorious file-replicated SQLite database...)

I don't see a download source button. Is that just me?

Overall, sounds very interesting and rather promising. Who are the people behind Nimbus.io?


"I don't see a download source button. Is that just me?"

Says right there on the front page: "We are currently in private beta. Please sign-up and we will send you an invitation as soon as we are ready!" Presumably, the download is behind the invite-wall.


I think someone's confused about what "100% open source" means.


Exactly. I think open source is just a buzzword for them. There are still money concerns behind this service. I was quite disappointed, since I was expecting a "Source" link that goes to github or somewhere.


Wrt. their SpiderOak backup service, from https://spideroak.com/faq/questions/35/why_isnt_spideroak_op...

"Our founders and engineers have a strong open source background and we consider a contributory relationship with the FOSS community as the normal course of business. Thus, our plan all along has been to make our entire client-side code base open source; however, as anyone who has worked with such issues knows, it is often not quite that simple."

So they say they want to be good, but not quite yet. I've posted a question about nimbus.io on that page.


Interesting. If they only open source their client side libraries it would be a rather sad development.


Having source on GitHub is not the only definition of "open source." The generally accepted definition is that the source is available (you may have to ask for it), it might not be free (you might be charged for media, though that is less of a concern these days), and the recipient can modify and redistribute the source.

It might be more appropriate if the site had text like "if you are interested in the source code, please email us." As it stands, it does seem like they are selectively releasing the source (and this is an assumption, since the site doesn't say "request an invite and get the source").


I don't see how this is not open source yet. Just because they are limiting access to the actual source doesn't mean that, when you finally do get past the invite wall and see the source, you don't have all the open rights they assign to it.

It's open source behind an invite wall. Once you get it, it should come with the OSS licenses. I'm assuming.


It's open source insofar as the code, when it is released, will be released under an open licence. However, it fails the 'unofficial' criterion of a project that is developed in the open. See also: http://www.youtube.com/watch?v=0SARbwvhupQ


I think somebody is making unfounded assumptions about the mental state of another person. A common, albeit embarrassing mistake to make.

And yes, I know full well what it means :)


Says right there on the front page: "Build by SpiderOak on the same proven backend storage network which powers hundreds of thousands of backups".


The Eucalyptus project provides an Open Source alternative to EC2 and S3 as well, with a compatible API.


Storing 50TB on Amazon S3 (US-EAST) Premium costs ~ $6,264

Storing 50TB on Amazon S3 (US-EAST) Reduced Redundancy costs ~ $4,160

Storing 50TB on Nimbus: $3,000

Is Nimbus's fault tolerance closer to the Premium S3 or the Reduced Redundancy S3?

(for completeness, Nimbus's transfer out is $0.06 per GB vs Amazon's $0.12 per GB).


They say[^1] that they can tolerate destruction of any 2 nodes without data loss. I don't know how many nodes Amazon S3 premium can tolerate.

[^1]: https://nimbus.io/architecture/


Amazon doesn't talk about their numbers either. The only thing they do say is that RRS (reduced redundancy storage) 'stores objects on multiple devices across multiple facilities, providing 400 times the durability of a typical disk drive, but does not replicate objects as many times as standard Amazon S3 storage, and thus is even more cost effective.'

This is at the main page: http://aws.amazon.com/s3/ (search for RRS)


Amazon says that S3 provides eleven nines (99.999999999%) durability of files. So if you have 100 billion objects in S3, you should expect to lose on average 1 per year. Or, if you have 10,000 files, you should expect to lose 1 per 10 million years. In addition they say it can tolerate the simultaneous failure of two datacenters. Nimbus, with 3 copies total, appears much less redundant... but nobody knows how Amazon calculated their eleven nines claim.
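
The arithmetic behind those figures, assuming independent object losses per year (an assumption nobody outside Amazon can verify):

    # Rough durability math for "eleven nines".
    annual_durability = 0.99999999999
    p_loss = 1 - annual_durability        # ~1e-11 per object per year

    print(100e9 * p_loss)          # 100 billion objects -> ~1 expected loss/year
    print(1 / (10_000 * p_loss))   # 10,000 objects -> ~1 loss per 10 million years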


This is a bit misleading. A big advantage of an "Open Source Software" solution as opposed to a "Proprietary" one is that you don't have to worry if the original provider goes bust, stops releasing patches, or discontinues the product, and you don't have to worry about how many licences you have, etc. I.e., open source gives you, not the people who made it, the power & control.

With cloud hosting like Amazon S3, where you store all your data (or servers) on a 3rd party's servers, there are legitimate concerns about control & access (i.e. how much control do you have to do things, and how much control do you have to stop someone else from doing things, i.e. privacy). So an "Open Source Alternative to S3" sounds like a good thing that would not have any of these drawbacks.

If someone who hears "Open Source S3" thinks that they, not the hosting company, have power & control, then they will be disappointed by Nimbus.io, since it has all the drawbacks of S3.


Their blog post from yesterday describing nimbus...

https://spideroak.com/blog/20111107183539-spideroaks-new-ama...


SpiderOak is a mix of proprietary and open source according to Wikipedia. Does anyone know how much of SpiderOak is open source?

I wonder if they considered OpenStack, which several companies, including Rackspace and NASA, use?


Not much of SpiderOak is open source; see https://spideroak.com/code

However they are considering open sourcing more and more of their code.


"The server and client components are all free and open source software."

Can't find any link to the source. _Are_ they open source? Or is the source only planned for eventual release?


Please note that if you are not going to use a full 100GB for a project (for example, storing your own stuff), you are better off grouping projects, as storage is billed at $6 per 100GB (not $0.06 per GB). Transfer out is $0.06/GB though. So if you have 101GB (I assume GB means gigabyte, right?), it should cost you $12 a month.
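
A quick sketch of that step function (rates as stated in this comment, not from an official pricing page):

    # Step pricing: every started 100 GB block of storage costs $6/month.
    import math

    def monthly_storage_cost(gb_stored: float) -> float:
        blocks = math.ceil(gb_stored / 100)   # every started block counts
        return blocks * 6.0

    print(monthly_storage_cost(100))   # $6
    print(monthly_storage_cost(101))   # $12 -- hence the advice to group data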

But if you are just looking at personal backups, they have plans at $10 a month for 100GB+ (with an affiliate program to earn up to 50GB extra, though that part is given to free users as well), so that is pretty close to $6 for 100GB and $4 for 40GB of transfer out. Since they claim they are using a very similar or just the same system for their SpiderOak service, you can bet they are just the same thing.

Or just go with a free account and grab 2GB plus whatever you get from referrals.

On the other hand, they say it's a trade-off of low latency for price. If their data about S3 is correct, it should be slower than S3 in the sense that S3 has 3 full replicas to read from, versus data spread as parity shares, so there is effectively just one copy to assemble.


"Build by SpiderOak on the same proven backend storage network which powers hundreds of thousands of backups"

This concerns me slightly: backup storage is a whole different world from real-time data storage. Backups are write once, read occasionally; some people use S3 as a makeshift CDN, constantly reading data.

Parity-based replication is great for backups, but would it not have performance implications if every request is reading from multiple disks/servers/nodes? I'm not an expert on hardware, but I would have thought being able to read an entire file off one disk is faster than having to put together pieces of data from multiple disks. Anyone want to correct/inform me?

If you can offer me a serious alternative to S3 at a cheaper price, and open source software, I can't wait to try it out. I might sound negative but I just wanted to put across my first thoughts on having a look around the site.


Pretty much, it's always faster to read from multiple disks.

There are many reasons why. The first is that by splitting things into small blocks spread around the cluster, you get more consistent load (why is left as an exercise for the reader), and you can more easily read ahead from later blocks, etc.
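
A toy demonstration of the parallelism point; fetch_block() below is a hypothetical stand-in for a per-node disk or network read:

    # Striped reads beat a single disk because the per-block fetches run
    # in parallel rather than queuing on one spindle.
    from concurrent.futures import ThreadPoolExecutor
    import time

    def fetch_block(node: int, block: int) -> bytes:
        time.sleep(0.05)              # pretend each block takes 50 ms of I/O
        return bytes([block])

    def read_striped(num_blocks: int, num_nodes: int) -> bytes:
        with ThreadPoolExecutor(max_workers=num_nodes) as pool:
            futures = [pool.submit(fetch_block, b % num_nodes, b)
                       for b in range(num_blocks)]
            return b"".join(f.result() for f in futures)

    start = time.time()
    read_striped(num_blocks=10, num_nodes=10)   # ~0.05 s: fetched in parallel
    print(time.time() - start)
    # One disk reading the same 10 blocks sequentially would take ~0.5 s.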


This is exactly what they said on their blog post:

> Long term archival data is different than everyday data. It's created in bulk, generally ignored for weeks or months with only small additions and accesses, and restored in bulk (and then often in a hurried panic!)

> This access pattern means that a storage system for backup data ought to be designed differently than a storage system for general data. Designed for this purpose, reliable long term archival storage can be delivered at dramatically lower prices.


Their architecture page seems to confirm this. It seems that their service is explicitly designed to have different performance characteristics from Amazon S3, so maybe they aren't quite a direct competitor to S3, but there are probably a lot of people using S3 for the use cases that Nimbus.IO claims to do better on, simply because S3 was available at the time.


Yes exactly. Nimbus.io is designed for long term archival storage at more affordable prices. We think it's a great time to be competing on price.

We may compete with S3 for low-latency service later on (latency can be made arbitrarily low by spending enough money on caching.) Initial calculations suggest we could be almost as low-latency as S3 and still undercut its price by a good margin.


Latency may be reducible through caching, but depending on the access distribution, the point at which additional cache becomes uneconomical may come well before the edge of your performance envelope.

How are you calculating your latency? Also, what distribution do you assume your file accesses will come from?


Keep in mind that S3, like all other Amazon products, is priced with stupid margins. As such, providing lower prices isn't difficult.


Hmmmm. Downvotes without comments. Classy.

And this is cheap hardware, priced at qty 1.

A Backblaze-type box is ca. $12K for 135TB of storage.

Assume an interest rate of 5% and 36 months of repayments, and the server itself comes to $725/month.

It uses roughly 1kW of power and 4U of rack space, so say you fit 6 per rack on a 30A rack. You can get the rack for say $5k/month, giving a rack cost per server of $833/month.

Total cost per server is $1,558/month.

Total cost is $0.011/GB-month.

Add in parity (1 in 4, 25%): $0.014/GB-month.

This doesn't include compression or dedup, both of which drop the cost dramatically.

Compare that to, say, S3's $0.14/GB and you can see why I'd say the margins are stupid, especially at the scale they're running at.
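
Reproducing that arithmetic (all input figures as stated above, including the commenter's $725/month amortized server cost):

    # Cost model using the figures from the comment above.
    server_per_month = 725.0          # ~$12K box, 5% interest, 36 months
    rack_per_month = 5000.0 / 6       # $5k/month rack shared by 6 servers
    total_per_month = server_per_month + rack_per_month   # ~$1,558

    usable_gb = 135 * 1000            # 135 TB per box
    cost_per_gb = total_per_month / usable_gb    # ~$0.011/GB-month
    with_parity = cost_per_gb * 1.25             # 25% parity overhead -> ~$0.014

    print(round(cost_per_gb, 4), round(with_parity, 4))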


Nice; that's the same sort of math we're doing.

Note that the BackBlaze machines are optimized for very cold data since they only need to support backup and restore. We also do custom hardware at SpiderOak, but we support web/mobile access, real time sync, etc. That makes our hardware slightly more expensive because of the generally warmer data. So you're off by a few pennies, but certainly in the right zone.

For Amazon, I suspect their internal S3 cost is actually quite a bit higher than either BackBlaze or SpiderOak since their data is warmer.


I'd suspect that their data temperature is very bimodal, so they'd be able to easily split out hot data from cold.

How much warm data do you normally have per node?


I have submitted my email twice and did not get any confirmation, neither on the page nor in my inbox. Not sure if this is by design or a bug, but it's confusing in either case.


I had that too. I mailed them (info@nimbus.io), let's see if the behavior is normal.


They replied to my email. The registration got through, but the confirmation did not work (at the time?). So if you submitted your email, they'll be in touch when it's time.


I think there should be more open source systems that do parity across servers versus replication, so to me this is great! AFAIK, something like swift suggests keeping 5 copies of everything, so if you want 1PB usable you need 5PB raw. But with dual parity spread across servers, like in Nimbus, you could probably get 80% usable vs raw. This would be similar to Isilon. (See the quick arithmetic below.)
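
The usable-vs-raw arithmetic is simple enough to state directly (scheme parameters as given above):

    # usable fraction = data units / total units stored
    print(1 / 5)      # 5 full copies: 20% usable vs raw
    print(8 / 10)     # 8+2 dual parity: 80% usable vs raw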

I wonder if Swift may support something similar in the future.


See also: ceph.newdream.net


Ceph is a very interesting project. RADOS, their distributed block store, is now mainline, I believe, and the project is coming along in leaps and bounds.

I'm unsure if anybody has a large scale RADOS based blob store though. It would be interesting to see how it holds up.


Dreamhost was doing a ceph beta with testers over a year ago. I see there's a number of blog posts about ceph on their site.


Since the nimbus.io page has only a little information, you can take a look at the SpiderOak DIY archival storage page[1], which seems to be the predecessor of the nimbus.io offering.

1: https://spideroak.com/diy/


How is this open source?


And there you have your answer for why NuoDB.com used to be NimbusDB but then changed all of a sudden.


I'm pretty sure that had to do with Nimbus Data, not this project. I'm pretty sure this project will get a cease and desist too if it becomes popular.

Counting down to a name change in 3... 2...


It's not open source. The code is not currently available. Given that the code for SpiderOak itself has been "coming real soon now" for a year, I'm not going to hold my breath. Even if/when that day does come, it will be "thrown over the wall" open source rather than "developed collaboratively" open source.

At least Swift, for all of its alleged technical deficiencies (which don't seem to prevent it being used to store billions of files already), hasn't been guilty of false advertising. Alternatively you have Walrus, tabled (from Project Hail), Elliptics, Luwak, Gluster's UFO, and probably more. Practically all of these have solved the harder problems of cluster management, API implementation (including the security that nimbus.io seems awfully quiet about), OS integration, etc.

Without source, nimbus.io can't credibly claim to have reached parity in all of these other areas, or that it would take less for it to reach parity than for the others to add the one feature (erasure coding instead of replication) that they crow about.


(SpiderOak / Nimbus.io cofounder here)

Note that this is just an announcement and invite site to show the pricing at $0.06/GB. Nimbus.io will have public git repositories, "developed collaboratively in the open" before we ever charge money to use the service. (And this is a wholly different project than the SpiderOak backup/sync software.)

FYI, you can see the git repos for the prototype we built of this a while back, when we called it our storage "DIY API": https://spideroak.com/diy/ Note that the code and the rest of the information on that page is way out of date, since it was an early design and prototype.

I'm not sure erasure coding vs. replication is a simple change for other distributed storage projects. It affects the whole architecture. We researched pretty heavily before building. If it had been simple to modify any of the alternatives, this project wouldn't exist. I'm more than happy to be proven wrong, though!

* Edited for pricing info.


"I'm not sure erasure coding vs. replication is a simple change for other distributed storage projects."

It depends on a few factors: how modular the architecture is overall, whether the existing replication is synchronous or asynchronous, etc. I'm working on the GlusterFS replication code right now in another window (OK, I should be but I'm typing here). I can assure you that it would be possible to replace replication with erasure coding just by replacing that one module, without perturbing the rest of the architecture. I've also been through the tabled code and I think it would be possible there too. I suspect the same would be true for Elliptics, but probably not Swift. Can't tell for Luwak; that would require more thought than I can afford to put into it right now.

This is something we've actively considered for GlusterFS/HekaFS, and might still do some day - though it's more likely to be on the IDA/AONT-RS side than RS/EC. The downside is that, while these approaches do offer better storage utilization, they also consume more bandwidth. Also, queuing effects can turn a bandwidth issue into a latency issue. This is especially the case for read-dominated workloads, where you just can't beat the latency of reading exactly the bytes you need from one replica. For these reasons I don't think either full replication or redundant-encoding schemes will ever entirely displace the other. Each project must prioritize which to implement first, but that doesn't mean those that have implemented replication first are precluded from offering other options as alternatives. It's really not an architectural limitation in most cases. It's just timing.



