So basically, we are a hosting association that wanted to put its servers at home _and_ sleep at night.
This demanded inter-home redundancy of our data. But none of the existing solutions (MinIO, Ceph...) are designed for high inter-node latency.
Hence Garage! Your cheap and proficient object store: designed to run on a toaster, through carrier-pigeon-grade networks, while still supporting tons of workloads (static websites, backups, Nextcloud... you name it)
I've been researching a solution for this problem for some time and you all have landed on exactly how I thought this should be architected. Excellent work and I foresee much of the future infrastructure of the internet headed in this direction.
(Garage Contributor here) We reviewed many of the existing solutions and none of them had the feature set we wanted. Compared to SeaweedFS, the main difference we introduce with Garage is that our nodes are not specialized, which leads to the following benefits:
- Garage is easier to deploy and to operate: you don't have to manage independent components like the filer, the volume manager, the master, etc. It also seems that a bucket must be pinned to a volume server on SeaweedFS. In Garage, all buckets are spread over the whole cluster, so you do not have to worry about a bucket filling up one of your volume servers.
- Garage works better in the presence of crashes: I would be very interested in a deep analysis of SeaweedFS's "automatic master failover". They use Raft, I suppose either by running a healthcheck every second, which could lead to data loss on a crash, or by sending a request for each transaction, which creates a huge bottleneck in their design.
- Better scalability: because there are no special nodes, there are no bottlenecks. I suppose that with SeaweedFS, all the requests have to pass through the master. We do not have such limitations.
As a conclusion, we chose a radically different design with Garage. We plan to do a more in-depth comparison in the future, but even today I can say that, although we implement the same API, our radically different designs lead to radically different properties and trade-offs.
> independent components like the filer, the volume manager, the master, etc.
You can run volume/master/filer in a single server (single command).
> filer probably needs an external rdbms to handle the metadata
This is true. You can use an external db, or build/embed some other db inside it (think of a distributed KV store in Go that you embed to host the metadata).
> It also seems that a bucket must be pinned to a volume server on SeaweedFS.
This is not true. A bucket will use its own volumes, but it can be, and is by default, distributed over the whole cluster.
> They use Raft, I suppose either by running a healthcheck every second, which could lead to data loss on a crash, or sending a request for each transaction, which creates a huge bottleneck.
Raft is for synchronized writes. It's slow in the sense that a single write is slow, because you have to wait for an "ok" from replicas, which is a good thing (compared to async replication in, say, Cassandra/DynamoDB). Keep in mind that S3 also moved to synced replication. This is fixed by having more parallelism.
> Better scalability: because there are no special nodes, there are no bottlenecks. I suppose that with SeaweedFS, all the requests have to pass through the master. We do not have such limitations.
Going to the master is only needed for writes, to get a unique id. This can easily be fixed with a plugin that, say, generates Twitter-snowflake IDs, which are very efficient. For reads, you keep a cache in your client for the volume-to-server mapping so you can read directly from the server that has the data, or you can query a random server and it will handle everything underneath.
From researching all the other open-source distributed object storage systems that exist, I'm pretty sure SeaweedFS has very good fundamentals.
> Raft is for synchronized writes. It's slow in the sense that a single write is slow, because you have to wait for an "ok" from replicas, which is a good thing (compared to async replication in, say, Cassandra/DynamoDB). Keep in mind that S3 also moved to synced replication. This is fixed by having more parallelism.
We have synchronous writes without Raft, meaning we are both much faster and still strongly consistent (in the sense of read-after-write consistency, not linearizability). This is all thanks to CRDTs.
If you don't sync immediately, you may lose the node before it has replicated, losing the data forever. There's no fancy algorithm that can help when the machine gets destroyed before it has replicated the data. And you can't write to 2 replicas simultaneously from the client (as you would with, say, a Cassandra smart driver), since S3 doesn't support that.
So, to understand, let's take the example of a 9-node cluster with a 100ms RTT over the network. In this specific (yet a little bit artificial) situation, Garage particularly shines compared to MinIO or SeaweedFS (or any Raft-based object store) while providing the same consistency properties.
For a Raft-based object store, your gateway will receive the write request and forward it to the leader (+ 100ms, 2 messages). Then, the leader will forward this write in parallel to the 9 nodes of the cluster and wait for a majority to answer (+ 100ms, 18 messages). Then the leader will confirm the write to the whole cluster and wait for a majority again (+ 100ms, 18 messages). Finally, it will answer your gateway (already counted in the first step). In the end, our write took 300ms and generated 38 messages over the cluster.
Another critical point with Raft is that your writes do not scale: they all have to go through your leader. So from the point of view of writes, it is not very different from having a single server.
For a DynamoDB-like object store (Riak CS, Pithos, OpenStack Swift, Garage), the gateway receives the request and knows directly on which nodes it must store the writes. In Garage, we chose to store every write on 3 different nodes. So the gateway sends the write request to the 3 nodes and waits for at least 2 of them to confirm the write (+ 100ms, 6 messages). In the end, our write took 100ms, generated 6 messages over the cluster, and the number of messages per write does not depend on the number of (Raft) nodes in the cluster.
With this model, we can still always provide up-to-date values. When performing a read request, we also query the 3 nodes that must contain the data and wait for 2 of them. Because we have 3 nodes, wrote to at least 2 of them, and read from 2 of them, we will necessarily get the latest value. This algorithm is discussed in Amazon's DynamoDB paper[0].
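To make the quorum argument explicit: with N replicas per object, a write quorum of W and a read quorum of R, every read set is guaranteed to overlap the set of nodes holding the last acknowledged write as soon as W + R > N. With Garage's parameters:

```latex
N = 3, \quad W = 2, \quad R = 2
\qquad\Rightarrow\qquad
W + R = 4 > N = 3,
\qquad
\lvert \text{read set} \cap \text{write set} \rvert \;\ge\; W + R - N = 1 .
```

So at least one of the two nodes answering a read necessarily holds the last acknowledged write.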
I reasoned in a model with no bandwidth limit, no CPU limit, and no contention at all. In real systems, these limits apply, and we think that's another argument in favor of Garage :-)
> For a Raft-based object store, your gateway will receive the write request and forward it to the leader (+ 100ms, 2 messages). Then, the leader will forward this write in parallel to the 9 nodes of the cluster and wait for a majority to answer (+ 100ms, 18 messages). Then the leader will confirm the write to the whole cluster and wait for a majority again (+ 100ms, 18 messages). Finally, it will answer your gateway (already counted in the first step). Our write took 300ms and generated 38 messages over the cluster.
No. The "proxy" node (a random node that you connect to) will do:
0. split the file into chunks of ~4MB (can be changed) while streaming
for each chunk (you can write chunks in parallel):
1. get id from master (can be fixed by generating an id in the proxy node with some custom code, 0 messages with custom plugin)
2. write to 1 volume-server (which will write to another node for majority) (2 messages)
3. update the metadata layer, to keep track of chunks so you can resume/cancel/clean failed uploads (the metadata may be another raft subsystem, think yugabytedb/cockroachdb, so it needs to do its own 2 writes) (2 messages)
Finally, mark the upload as "ok" in the metadata layer and return ok to the client. (2 messages)
The chunking is more complex, you have to track more data, but in the end it's better. You spread a file over multiple servers & disks. If a server fails with erasure coding and you need to read a file, you won't have to "erasure-decode" the whole file, since you only have to do it for the missing chunks. If you have a hot file, you can spread reads over many machines/disks. You can upload very big files (terabytes), and you can "append" to a file. You can have a smart client (or colocate a proxy on your client server) for smart routing and the like.
If you're still talking about SeaweedFS, the answer seems to be that it's not a "raft-based object store", hence it's not as chatty as the parent comment described.
That proxy node is a volume server itself, and uses simple replication to mirror its volume on another server. Raft consensus is not used for the writes. Upon replication failure, the data becomes read-only [1], thus giving up partition tolerance. These are not really comparable.
How does step 1 work? My understanding is that the ID from the master tells you which volume server to write to. If you're generating it randomly, then are you saying you have queried the master server for the number of volumes upfront & then just randomly distribute it that way?
> If you're generating it randomly, then are you saying you have queried the master server for the number of volumes upfront & then just randomly distribute it that way?
You just need a unique id (which you generate locally). And you need a writable volume id, which you can get by querying the master or a master follower, caching it, or querying a volume server directly.
We ensure the CRDT is synced with at least two nodes in different geographical areas before returning an OK status to a write operation. We are using CRDTs not so much for their asynchronous replication properties (what is usually touted as eventual consistency), but more as a way to avoid conflicts between concurrent operations, so that we don't need a consensus algorithm like Raft. By combining this with the quorum system (two writes out of three need to be successful before returning ok), we ensure durability of written data without having to pay the synchronization penalty of Raft.
> We ensure the CRDT is synced with at least two nodes in different geographical areas before returning an OK status to a write operation [...] we ensure durability of written data but without having to pay the synchronization penalty of Raft.
This is, in essence, strongly-consistent replication, in the sense that you wait for a majority of writes before answering a request: so you're still paying the latency cost of a round trip to at least one other node on each write. How is this any better than a Raft cluster with the same behavior? (N/2+1 write consistency)
Raft consensus apparently needs more round-trips than that (maybe two round-trips to another node per write?), as evidenced by this benchmark we made against Minio:
Yes we do round-trips to other nodes, but we do much fewer of them to ensure the same level of consistency.
This is to be expected from a distributed system's theory perspective, as consensus (or total order) is a much harder problem to solve than what we are doing.
We haven't (yet) gone into dissecting the Raft protocol or Minio's implementation to figure out why exactly it is much slower, but the benchmark I linked above is already strong enough evidence for us.
I think it would be great if you could make a GitHub repo that just summarises the performance characteristics and round-trip types of different storage systems.
You would invite Minio/Ceph/SeaweedFS/etc. authors to make pull requests in there to get their numbers right and explanations added.
This way, you could learn a lot about how other systems work, and users would have an easier time choosing the right system for their problems.
Currently, one only gets detailed comparisons from HN discussions, which arguably aren't a great place for reference and easily get outdated.
Raft needs a lot more round trips than that: it needs to send a message about the transaction, the nodes need to confirm that they received it, then the leader needs to send back that it was committed (no response required). This is largely implementation-specific (etcd does more round trips than that IIRC), but that's the bare minimum.
As someone who wants this technology to be useful to commercial and open-source entities, GPL software is more compatible with the status quo, and I don't want to fight the battle of the license and the technology.
More-permissive-than-AGPLv3 licensing for service-based software creates a potential (or, I'd argue, likely) route towards large-platform capture and resale of the value of those services without upstream contribution in return, followed at a later date by the turning-down and closure of the software products and communities behind them.
That's the strategy that the platforms have, I think - for rational (albeit potentially immoral) economic and competitive reasons. I'd like to hear if you disbelieve that and/or have other reasons to think that permissive licensing is superior long-term.
Unless Garage has a special mitigation against this, performance usually gets much worse in large clusters as the filesystem fills up. As files are added and deleted, the system struggles to keep nodes exactly balanced, and a growing percentage of nodes will be full and unavailable for new writes.
So in a high throughput system, you may notice a soft performance degradation before actually running out of space.
If you aren't performance sensitive or don't have high write throughput, you might not notice. This is definitely something you should be forecasting and alerting on so you can acquire and add capacity or delete old data.
If you really don't like the idea of everything falling over at the same time, you could use multiple clusters or come up with a quota system (e.g. disable workload X if it's using more than Y TiB).
Yes, in which case you just make your cluster bigger or delete some data. Seems like a reasonable compromise to me.
Garage has advanced functionality to control how much data is stored on each node, and is also very flexible with respect to adding or removing nodes, making all of this very easy to do.
We don't currently have a comparison with SeaweedFS. For the record, we have been developing Garage for almost two years now, and we hadn't heard of SeaweedFS at the time we started.
To me, the two key differentiators of Garage over its competitors are as follows:
- Garage contains an evolved metadata system based on CRDTs and consistent hashing inspired by Dynamo, solidly grounded in distributed systems theory. This allows us to be very efficient, as we don't use Raft or other consensus algorithms between nodes, and we also do not rely on an external service for metadata storage (Postgres, Cassandra, whatever), meaning we don't pay an additional communication penalty.
- Garage was designed from the start to be multi-datacenter aware, again helped by insights from distributed systems theory. In practice we explicitly chose against implementing erasure coding; instead we spread three full copies of data over different zones, so that overall availability is maintained with no degradation in performance when one full zone goes down, and data locality is preserved at all locations for faster access (in the case of a system with three zones, our ideal deployment scenario). A minimal configuration sketch follows below.
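To give an idea of how little per-node machinery this requires, here is a minimal configuration sketch in the spirit of the quick-start documentation (key names and values are indicative only and may differ between versions, so please refer to the reference documentation; the zone of each node is declared separately when configuring the cluster layout):

```toml
# Minimal sketch of a per-node garage.toml (illustrative, check the docs)
metadata_dir = "/var/lib/garage/meta"
data_dir = "/var/lib/garage/data"

replication_mode = "3"       # three full copies of each object, no erasure coding

rpc_bind_addr = "[::]:3901"
rpc_secret = "<shared secret, identical on every node>"

[s3_api]
api_bind_addr = "[::]:3900"
s3_region = "garage"
```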
CRDTs are often presented as a way of building asynchronous replication systems with eventual consistency. This means that when modifying a CRDT object, you just update your local copy, and then synchronize in the background with other nodes.
However, this is not the only way of using CRDTs. At their core, CRDTs are a way to resolve conflicts between different versions of an object without coordination: when all of the different states of the object that exist in the network become known to a node, it applies a local procedure known as a merge that produces a deterministic outcome. All nodes do this, and once they have all done it, they are all in the same state. In that way, nodes do not need to coordinate beforehand when making modifications, in the sense that they do not need to run a two-phase commit protocol to ensure that operations are applied one after the other, in a specific order that is replicated identically on all nodes. (This is the problem of consensus, which is much harder to solve from a distributed systems theory perspective, as well as from an implementation perspective.)
In Garage, we have a bit of a special way of using CRDTs. To simplify a bit, each file stored in Garage is a CRDT that is replicated on three known nodes (decided deterministically). While these three nodes could synchronize in the background when an update is made, this would mean two things that we don't want: 1/ when a write is made, it would be written only on one node, so if that node crashes before it has had a chance to synchronize with the other nodes, data would be lost; 2/ reading from a node wouldn't necessarily give you the last version of the data, therefore the system would not be read-after-write consistent. To fix this, we add a simple synchronization system based on read/write quorums to our CRDT system. More precisely, when updating a CRDT, we wait for the value to be known to at least two of the three responsible nodes before returning OK, which allows us to tolerate one node failure while always ensuring durability of stored data. Furthermore, when performing a read, we ask at least two of the three nodes for their current state of the CRDT: this ensures that at least one of the two will know about the last version that was written (the intersection of the quorums being non-empty), making the system read-after-write consistent. These are basically the same principles that are applied in CRDT databases such as Riak.
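To illustrate the combination of quorums and CRDT merges described above, here is a deliberately simplified sketch (not Garage's actual code): replicas are modeled as in-memory maps, whereas in Garage they are remote nodes queried over RPC in parallel, and the `Crdt` trait and key names are made up for the example.

```rust
use std::collections::HashMap;

/// Any type that can deterministically merge concurrent versions of itself.
trait Crdt: Clone {
    fn merge(&mut self, other: &Self);
}

type Replica<T> = HashMap<String, T>;

const N: usize = 3; // replicas responsible for each key
const W: usize = 2; // write quorum
const R: usize = 2; // read quorum

/// Return OK only once at least W replicas have stored (merged) the value.
fn quorum_write<T: Crdt>(replicas: &mut [Replica<T>], key: &str, value: &T) -> bool {
    let mut acks = 0;
    for rep in replicas.iter_mut().take(N) {
        rep.entry(key.to_string())
            .and_modify(|v| v.merge(value))
            .or_insert_with(|| value.clone());
        acks += 1; // a crashed or unreachable replica simply would not ack
    }
    acks >= W
}

/// Ask the N responsible replicas, wait for R answers, and merge them:
/// since W + R > N, at least one answer contains the last acknowledged write.
fn quorum_read<T: Crdt>(replicas: &[Replica<T>], key: &str) -> Option<T> {
    let mut answers: Vec<T> = replicas
        .iter()
        .take(N)
        .filter_map(|rep| rep.get(key).cloned())
        .collect();
    if answers.len() < R {
        return None; // quorum not reached: fail rather than risk stale data
    }
    let mut result = answers.pop()?;
    for a in answers {
        result.merge(&a);
    }
    Some(result)
}

impl Crdt for u64 {
    fn merge(&mut self, other: &Self) {
        *self = (*self).max(*other); // toy CRDT: a grow-only maximum
    }
}

fn main() {
    let mut cluster: Vec<Replica<u64>> = vec![HashMap::new(); 3];
    assert!(quorum_write(&mut cluster, "k", &7));
    assert_eq!(quorum_read(&cluster, "k"), Some(7));
}
```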
Sounds interesting. How do you handle the case where you're unable to send the update to other nodes?
So an update goes to node A, but not to B and C. Meanwhile, the connection to the client may be disrupted, so the client doesn't know the fate of the update. If you're unlucky here, a subsequent read will ask B and C for data, but the newest data is actually on A. Right?
I assume there's some kind of async replication between the nodes to ensure that B and C eventually catch up, but you do have an inconsistency there.
You also say there is no async replication, but surely there must be some, since by definition there is a quorum, and updates aren't hitting all of the nodes.
I understand that CRDTs make it easier to order updates, which solves part of consistent replication, but you still need a consistent view of the data, which is something Paxos, Raft, etc. solve, but CRDTs separated across multiple nodes don't automatically give you that, unless I am missing something. You need more than one node in order to figure out what the newest version is, assuming the client needs perfect consistency.
True; we don't solve this. If there is a fault and data is stored only on node A and not nodes B and C, the view might be inconsistent until the next repair procedure is run (which happens on each read operation if an inconsistency is detected, and also regularly on the whole dataset using an anti-entropy procedure). However if that happens, the client that sends an update will not have received an OK response, so it will know that its update is in an indeterminate state. The only guarantee that Garage gives you is that if you have an OK, the update will be visible in all subsequent reads (read-after-write consistency).
I'll also admit to having difficulty understanding how all this is distinct from non-CRDT replication mechanisms. Great mission and work by the Deuxfleurs team btw. Bonne chance!
> I'll also admit to having difficulty understanding how all this is distinct from non-CRDT replication mechanisms.
This is because "CRDT" is not about a new or different approach to replication, although for some reason this has become a widely held perception. CRDT is about a new approach to the _analysis_ of replication mechanisms, using order theory. If you read the original CRDT paper(s) you'll find old-school mechanisms like Lamport Clocks.
So when someone says "we're using a CRDT" this can be translated as: "we're using an eventually consistent replication mechanism proven to converge using techniques from the CRDT paper".
The thing is, if you don't have CRDTs, the only way to replicate things over nodes such that they end up in consistent states is to have a way of ordering operations so that all nodes apply them in the same order, which is costly.
Let me give you a small example. Suppose we have a very simple key-value storage, and that two clients are writing different values at the same time on the same key. The first one will invoke write(k, v1), and the second one will invoke write(k, v2), where v1 and v2 are different values.
If all nodes receive the two write operations but have no way to know which one came first, some will receive v1 before v2 and end up with v2 as the last written value, and other nodes will receive v2 before v1, meaning they will keep v1 as the definitive value. The system is now in an inconsistent state.
There are several ways to avoid this.
The first one is Raft consensus: all write operations will go through a specific node of the system, the leader, which is responsible for putting them in order and informing everyone of the order it selected for the operations. This adds a cost of talking to the leader at each operation, as well as a cost of simply selecting which node is the leader node.
CRDTs are another way to ensure that we have a consistent result after applying the two writes: not by having a leader that puts everything in a certain order, but by embedding certain metadata in the write operation itself, which is enough to disambiguate between the two writes in a consistent fashion.
In our example, the node that does write(k, v1) will for instance generate a timestamp ts1 that corresponds to the (approximate) time at which v1 was written, and it will also generate a UUID id1. Similarly, the node that does write(k, v2) will generate ts2 and id2. Now when they send their new values to other nodes in the network, they will send along their values of ts1, id1, ts2 and id2. Nodes now know enough to always deterministically select either v1 or v2, consistently everywhere: if ts1 > ts2, or ts1 = ts2 and id1 > id2, then v1 is selected, otherwise v2 is selected (we suppose that id1 = id2 has a negligible probability of happening). In terms of message round-trips between nodes, we see that with this method, nodes simply communicate once with all other nodes in the network, which is much faster than having to pass through a leader.
Here the example uses timestamps as a way to disambiguate between v1 and v2, but CRDTs are more general and other ways of handling concurrent updates can be devised. The core property is that concurrent operations are combined a-posteriori using a deterministic rule once they reach the storage nodes, and not a-priori by putting them in a certain order.
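As a small illustration, a hypothetical last-write-wins register following exactly the (timestamp, id) rule above could look like this (illustrative sketch, not Garage's actual types):

```rust
#[derive(Clone, Debug, PartialEq)]
struct LwwRegister<T> {
    timestamp: u64, // (approximate) wall-clock time of the write
    id: u128,       // random tie-breaker, e.g. a UUID
    value: T,
}

impl<T: Clone> LwwRegister<T> {
    /// Deterministic merge: every node applying this rule to the same set of
    /// concurrent versions converges to the same surviving value.
    fn merge(&mut self, other: &Self) {
        if (other.timestamp, other.id) > (self.timestamp, self.id) {
            *self = other.clone();
        }
    }
}

fn main() {
    let mut a = LwwRegister { timestamp: 10, id: 1, value: "v1" };
    let b = LwwRegister { timestamp: 10, id: 2, value: "v2" };
    a.merge(&b);
    assert_eq!(a.value, "v2"); // equal timestamps: the id breaks the tie
}
```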
Got it, thanks. This reminded me of e-Paxos (e for Egalitarian), and after a bit of digging there are now 2 additional contenders in this space (Atlas and Gryff, both 2020), per below:
This sounds like ABD but avoiding the read round trip, as long as the register can be modeled with a CRDT. It's interesting to see this applied to the metadata portion of an object storage system. I'm sure this has implications for how rich a storage API you can offer. Can you do conditional writes? Can old versions of a register be compacted to reduce metadata space? How does garbage collecting dead blobs work?
It would be helpful to see specific goals of the project laid out, especially since this bucks the trend of being an enterprise/production grade object storage service (and that's totally ok).
If by that you mean something like compare-and-swap or test-and-set, then no, that would require consensus number > 1 [0]
> Can old versions of a register be compacted to reduce metadata space?
Yes, when using last-write-wins registers, old versions are deleted as soon as they are superseded by a newer version. Almost all metadata stored in Garage uses this mechanism.
> How does garbage collecting dead blobs work?
Nodes have a shared table of block references, that identifies for each data block the set of objects (or rather, versions of objects) that use this block. Block references have a "deleted" flag that is set to true when the version in question is deleted from the store. Once set to true, it cannot go back, but a new version can always be created that references the same block. Nodes keep a local count of the number of non-deleted block references for each block, so that when it reaches zero, they queue up the block for deletion (after a small delay). There is also a metadata GC system so that block references with the deleted flag are dropped from the table after a certain time.
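A much simplified illustration of that mechanism (not Garage's actual code; the types and field names are made up for the example):

```rust
use std::collections::{HashMap, HashSet};

#[derive(Clone, Hash, PartialEq, Eq)]
struct BlockRef {
    block_hash: [u8; 32], // which data block
    version_id: u128,     // which object version uses it
}

#[derive(Default)]
struct BlockRefTable {
    deleted: HashSet<BlockRef>,         // deleted flag: once set, never unset
    refcount: HashMap<[u8; 32], usize>, // live (non-deleted) references per block
    to_delete: Vec<[u8; 32]>,           // blocks queued for (delayed) deletion
}

impl BlockRefTable {
    fn add_ref(&mut self, r: BlockRef) {
        // A new object version can always add a fresh reference to a block,
        // even if older references to the same block were already deleted.
        *self.refcount.entry(r.block_hash).or_insert(0) += 1;
    }

    fn delete_ref(&mut self, r: BlockRef) {
        if self.deleted.insert(r.clone()) {
            let n = self.refcount.entry(r.block_hash).or_insert(0);
            *n = n.saturating_sub(1);
            if *n == 0 {
                // No object version references this block any more: queue the
                // actual data for deletion (after a small delay in Garage).
                self.to_delete.push(r.block_hash);
            }
        }
    }
}

fn main() {
    let mut table = BlockRefTable::default();
    let r = BlockRef { block_hash: [0u8; 32], version_id: 1 };
    table.add_ref(r.clone());
    table.delete_ref(r);
    assert_eq!(table.to_delete.len(), 1); // the block is now queued for deletion
}
```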
Basically, our association is trying to move away from file systems, for scalability and availability purposes.
There's a similarity between RDBMSs (say SQL) and filesystems, in the sense that they distribute like crap. The main reason is that they have overly strong consistency properties.
One can still use Garage as a file system, though, notably using WebDAV. It's a beta feature, called Bagage: https://git.deuxfleurs.fr/Deuxfleurs/bagage
I'll let my knowledgeable fellows answer, if you'd like to hear about differences between seaweedfs and Garage in more detail :)
"Storage optimizations: erasure coding or any other coding technique both increase the difficulty of placing data and synchronizing; we limit ourselves to duplication."
Y'all might want to get to the point a bit quicker in your pitch. A lot of fluff to wade through before the lede.
"Garage is a distributed storage solution, that automatically replicates your data on several servers. Garage takes into account the geographical location of servers, and ensures that copies of your data are located at different locations when possible for maximal redundancy, a unique feature in the landscape of distributed storage systems.
Garage implements the Amazon S3 protocol, a de-facto standard that makes it compatible with a large variety of existing software. For instance it can be used as a storage back-end for many self-hosted web applications such as NextCloud, Matrix, Mastodon, Peertube, and many others, replacing the local file system of a server by a distributed storage layer. Garage can also be used to synchronize your files or store your backups with utilities such as Rclone or Restic. Last but not least, Garage can be used to host static websites, such as the one you are currently reading, which is served directly by the Garage cluster we host at Deuxfleurs."
I'm not quite sure who the article is actually written for. If it's written to generate interest from investors, or for some other such semi/non technical audience then please ignore this (although the fact that I can't tell who it's written for is in itself a warning signal).
If it's written to attract potential users, however, then it's woefully light on useful content. I don't care in the slightest about your views on tech monopolies - why on earth waste such a huge amount of space talking about this? What I care about are three things: 1) what workloads does it do well at, and which ones does it do poorly at (if you don't tell me about the latter I won't believe you - no storage system is great at everything), 2) what consistency/availability model does it use, and 3) how you've tested it.
As written, it's full of fluff and vague handwavy promises (we have PhDs!) - the only technical information in the entire post is that it's written in rust. For users of your application, what programming language it's written in is about the least interesting thing possible to say about it. Even going through your docs I can't find a single proper discussion of your consistency and availability model, which is a huge red flag to me.
This article broadly explains the motivation and vision behind our project. There are many object stores, why build a new one? To push self-hosting, in face of democratic/political threats we feel are daunting.
Sorry it didn't float your boat. You're just not the article's target audience I guess! (And, to answer your question: the target audience is mildly-knowledgeable but politically concerned citizens.)
I agree this is maybe not the type of post you'd expect on HN. But, we don't want to _compete_ with other object stores. We want to expand the application domain of distributed data stores.
> I can't find a single proper discussion of your consistency and availability model
LGTM, I would be dubious too. This will come in time, please be patient (the project is less than 2 yo!). You know the Cypherpunks' saying? "We write code." Well, for now, we've been focusing on the code, and we want to continue doing so until 1.0. Some bystanders will do the paperware for sure, but it's not our utmost priority.
Thanks for the response. It's great that you're trying to build something that will allow folks to do self-hosting well, I'm fully on board with this aim. My point about the political bits though, is that even for folks that agree with you, pushing the political bits too much can signal that you're not confident enough in the technical merits of what you're doing for them to stand on their own.
Re: the paperwork vs code thing - While a full white paper on your theoretical foundations would be nice, IMO the table stakes for getting people to take your product seriously is a sentence or two on the model you're using. For example, adding something like: "We aim to prioritise consistency and partition tolerance over availability, and use the raft algorithm for leader elections. Our particular focus is making sure the storage system is really good at handling X and Y." or whatever would be a huge improvement, until you can get around to writing up the details.
The problem with "we write code" is that your users are unlikely to trust you to be writing the correct code, when there's so little evidence that you've thought deeply about how to solve the hard problems inherent in distributed storage.
For the average user it shouldn't matter what it's written in, but lately I've returned to the theory that language choice for a project signals a certain amount of technical sophistication. Particularly for things like this, which should be considered systems programming. The overhead of a slow JIT, GC, threading model, IO model, whatever, basically means the project is doomed to subpar perf forever, and the cost of developing it has been passed on to the users, who will be forced to buy 10x the hardware to compensate.
> For the average user it shouldn't matter what it's written in,
True.
> but lately I've returned to the theory that language choice for a project signals a certain amount of technical sophistication.
Sometimes it can signal the opposite, if you think a given language/framework/other is not a great tool for the job. For a relatively new project it can be a sign that there might be otherwise avoidable growing pains later.
From an open-source standpoint, what a project is written in can be very important from the point of view of what you are familiar with. If the project's progress or support vanishes (for instance if the core maintainers move on to other things, or there is a significant license change making future work on the core project incompatible with your use), would others be able to maintain it securely in a fork, and if no one else steps up, would you be able to (at least until you migrated onto something else)?
It's the layer below: a sound distributed storage backend.
You can use it to distribute your Nextcloud data, store your backups, host websites... You can even mount it in your OS' file manager and use it as an external drive!
You own the servers. This is a tool to build your own object-storage cluster. For example, you can get 3 old desktop PCs, install Linux on them, download and launch Garage on them, configure your 3 instances in a single cluster, then send data to this cluster. Your data will be spread and duplicated on the 3 machines. If one machine fails or is offline, you can still access and write data to the cluster.
Then, how does Garage achieve 'different geographical locations'? I only have my house to put my server(s). That's one of the main reasons I'm using cloud storage. Or is the point that I can arrange those servers abroad myself, independent of the software solution (S3 etc)?
Garage is designed for self-hosting by collectives of system administrators, what we could call "inter-hosting". Basically you ask your friends to put a server box at their home and achieve redundancy that way.
The content is currently stored in plaintext on the disk by Garage, so you have to encrypt the data yourself. For example, you can configure your server to encrypt at rest the partition that contains your `data_dir` and your `meta_dir`, or build/use applications that support client-side encryption, such as rclone with its crypt module[0] or Nextcloud with its end-to-end encryption module[1].
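For example, a hypothetical rclone setup layering its crypt backend on top of a Garage bucket could look roughly like this (remote names, endpoint, bucket and credentials are placeholders; `rclone config` will generate the obscured password entries for you):

```ini
[garage]
type = s3
provider = Other
endpoint = http://localhost:3900
access_key_id = <key id>
secret_access_key = <secret key>

[garage-crypt]
type = crypt
remote = garage:my-encrypted-bucket
password = <obscured password>
```

Anything written through the `garage-crypt:` remote, e.g. with `rclone mount garage-crypt: /mnt/secure`, is then encrypted on the client before it ever reaches the cluster.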
And, as a second step, we will work on a better/working IPv6 configuration for our Git repository and all of our services (we use Nomad+Docker and have not found a way to expose IPv6 in a satisfying way yet).
Looks like the TTL on the records is still affecting local resolution. I'm aware of the git protocol work-around to force IPv4 but that doesn't help when using a regular web-browser to visit the published URL for issues (first thing I evaluate for a new-to-me project):
Hello, quick question: from what i understand, since it's based on CRDTs, any server in the cluster may save something to the storage, which then gets replicated elsewhere.
That's fine when there is absolute trust between the server operators, but am i correct to assume it's not the same threat model as encrypted/signed backups pushed to some friends lending storage space for you (who could corrupt your backup but can't corrupt your working data)?
If my understanding is correct, maybe your homepage should outline the threat model more clearly (i.e. a single trusted operator for the whole cluster) and point to other solutions like TAHOE-LAFS for other use-cases.
Congratulations on doing selfhosting with friends, that's pretty cool! Do you have any idea if some hosting cooperatives (CHATONS) or ISPs (FFDN) have practical use-cases in mind? They already have many physical locations and rather good bandwidth, but i personally can't think of an interesting idea.
> ...am i correct to assume it's not the same threat model as encrypted/signed backups pushed to some friends lending storage space for you (who could corrupt your backup but can't corrupt your working data)?
You are absolutely correct concerning the trust model of Garage: any administrator of one node is an administrator of the entire cluster, and can manipulate data arbitrarily in the system.
From an ideological perspective, we are strongly attached to building tight-knit communities in which strong trust bonds can emerge, as it gives us more meaning than living in an individualized society where all exchanges between individuals are mediated by a market, or worse, by blockchain technology. This means that trusting several system administrators makes sense to us.
Note that in the case of cooperatives such as CHATONS, most users are non-technical and have to trust their sysadmin anyways; here, they just have to trust several sysadmins instead of just one. We know that several hosting cooperatives of the CHATONS network are thinking like us and are interested in setting up systems such as Garage that work under this assumption.
In the meantime, we also do perfectly recognize the possibility of a variety of attack scenarios against which we want to consider practical defenses, such as the following:
1/ An honest-but-curious system administrator or an intruder in the network that wants to read users' private data, or equivalently, a police raid where the servers are taken away for inspection by state services;
2/ A malicious administrator or an intruder that wants to manipulate the users by introducing fake data;
3/ A malicious administrator or an intruder that simply wants to wreak havoc by deleting everything.
Point 1 is the biggest risk in my eyes. Several solutions can be built to add an encryption layer over S3 for different usage scenarios. For instance, for storing personal files, Rclone can be used to add simple file encryption to an S3 bucket, and can also be mounted directly via FUSE, allowing us to access Garage as an end-to-end encrypted network drive. For backups, programs like Restic and Borg allow us to upload our backups encrypted into Garage.
Point 2 can be at least partially solved by adding signatures and verification in the encryption layer which is handled on the client (I don't know for sure if Rclone, Restic and Borg are doing this).
Point 3 is harder and probably requires adaptation on the side of Garage to be solved, for instance by adding a restriction on which nodes are allowed to propagate updates in the network, thus establishing a hierarchy between two categories of nodes: those that implement the S3 gateway and thus have full power in the network, and those that are only responsible for storing data given by the gateway but cannot originate modifications themselves (under the assumption that modifications must be accompanied by a digital signature and that only gateway nodes have the private keys to generate such signatures). Then we could separate gateway nodes according to the buckets they are allowed to write to, for better separation of concerns. However, as long as we are trying to implement the S3 protocol, I think we are stuck with some imperfect solution like this, because S3 itself does not use public-key cryptography to attest to the operations sent by users.
It is perfectly true that solutions that are explicitly designed to handle these threats (such as TAHOE-LAFS) would provide better guarantees in a system that is maybe more consistent as a whole. However, we are also trying to juggle these security constraints with deployment constraints such as keeping compatibility with standard protocols like S3 (and soon also IMAP for mailbox storage), which restricts the design choices we can make.
Just to clarify, the fact that any node can tamper with any of the data in the cluster is not strictly linked to the fact that we use CRDTs internally. It is true that CRDTs make it more difficult, but we do believe that there are solutions to this and we intend to implement at least some of them in our next project that involves mailbox storage.
Thanks for the detailed comment! I'd be happy to continue the discussion, is there an official IRC/XMPP bridged chan for me to join your discussions? Or should i go through any gateway and hope for it to work (matrix XMPP gateway is not really famous, admin-operated matterbridge is far more reliable).
I understand the appeal of CRDTs for some use-cases (in fact, for many more we could develop), but isn't implementing permissions on top quite a big undertaking? Isn't that what the Matrix project is about? How would your mitigations compare to Matrix, beyond optimizing for arbitrarily large messages (files)?
> It is perfectly true that solutions that are explicitely designed to handle these threats
Please make it more explicit on the homepage. Strong personal opinion: i like it when projects make their design tradeoffs explicit and link to alternatives involving other tradeoffs; it helps me avoid reading detailed specifications and source code to understand whether the tool fits my use case.
> our next project that involves mailbox storage
That's pretty cool! Are you aware of existing solutions in this space and failed attempts at producing new ones? leap.se/bitmask.net had some interesting takes but had to lower their goals due to limited human resources, but there's still people very interested in that question over there. Mailbox encryption as implemented by riseup/posteo (source code available, private key unlocked with passphrase at login time) is also very interesting and i wish that approach was used with other protocols as well (eg. XMPP) though it raises some interesting questions in regards to allow/denylisting.
I wish you the best of luck, and certainly hope to read more from you. Don't hesitate to post around in seemingly-unrelated venues to gather critical feedback on your design before implementation.
"or worse, by blockchain technology" -> I really would appreciate if you could elaborate on this. Blockchain technology has its pros and cons, depending on the underlying consensus algorithm, but it could certainly be put to good use in the context of a political aware project promoting decentralization and freedom ?
I'd much rather have a system that doesn't try to solve human conflict by algorithms. Blockchains do this: you write code, the code becomes the law, and then there is no recourse when things go wrong. I deeply believe that we should rather focus on developing our interpersonal communication skills and building social structures that empower us to do things together by trusting each other and taking care of the group together. Algorithms are just tools, and we don't want them interfering in our relationships; in particular, in the case of blockchains, the promise of trustlessness is the absolute opposite of this ideal.
Thanks ! I understand what you mean, this is perfectly relevant especially for small communities. I am not sure it could scale to larger communities though.
"Trust noone" cannot work. Humanity is built on trust. The question is what level of trust you place on different operators.
From a quick look StorJ and fuelcoin both appear to be classic crypto-scams where the entire tech stack is controlled by a single company... talk about "zero trust" :-)
If you're really interested about low trust tech for file storage, i recommend you check out TAHOE-LAFS. It's pretty cool tech and there's no money to be made out of it, nor is there a company dictating who can join what pool and how much you have to pay for what kind of storage. It's an actually decentralized system, unlike most blockchains.
Interesting point to note is that they have some external funding from the EU.
> The Deuxfleurs association has received a grant from NGI POINTER[0], to fund 3 people working on Garage full-time for a year : from October 2021 to September 2022.
(Deuxfleurs administrative council member here:) Starting from a little association with 200€ in a cash box last year, we were indeed amazed to actually get this funding.
I understand the founders are Terry Pratchett / Discworld fans, from the name of the association.
For non-French speakers, Deuxfleurs is the French translation of Twoflower, a character appearing in The Colour of Magic, The Light Fantastic and Interesting Times from Terry Pratchett's Discworld series.
I tried to reply to a comment that was deleted, probably due to its form. Instead of deleting my answer, I want to share the reworded criticisms and my answers.
> Why not use Riak and add an S3 API around it?
Riak was developed by a company named Basho that went bankrupt some years ago, and the software is not developed anymore. In fact, we do not need to add an S3 API around Riak KV: Basho even released "Riak Cloud Storage"[0], which does exactly this, providing an S3 API on top of the Riak KV architecture. We plan to release a comparison between Garage and Riak CS; Garage has some interesting features that Riak CS does not have! In practice, implementing an object store on top of a DynamoDB-like KV store is not that straightforward. For example, the cloud provider Exoscale went this way for the first implementation of their object store, Pithos[1], but rewrote it later, as you need special logic to handle your chunks (they did not publish Pithos v2).
> Most apps don't have S3 support
We maintain an "integration" section in our documentation listing all the compatible applications. Garage already works with Matrix, Mastodon, Peertube, Nextcloud, Restic (an alternative to Borg), Hugo and Publii (a static site generator with a GUI). These applications are only a fraction of all existing applications, but our software is targeted at the people who use and host them.
> A distributed system is not necessarily highly available
I will not fight over the wording: we come from an academic background where the term "distributed computing" has a specific meaning that may differ outside. In our field, we define models where we study systems made of processes that can crash. Depending on your algorithms and the properties you want, you can prove that your system will keep working despite some crashes. We want to build software on these academic foundations. This is also the reason we put "Standing on the shoulders of giants" on our front page and link to research papers. To put it in a nutshell, one criticism we have of other software is that it sometimes lacks theoretical/academic foundations, which leads to unexpected failures and more work for sysadmins. But on the theoretical point, Basho and Riak were exemplary and a model for us!
I also had some things to answer to that comment which I'll put back here for the record.
> I have been building distributed systems for 20 years, and they are not more reliable. They are probabilistically more likely to fail.
It depends on what kind of faults you want to protect against. In our case, we are hosting servers at home, meaning that any one of them could be disconnected at any time due to a power outage, a fiber being cut off, or any number of reasons. We are also running old hardware where individual machines are more likely to fail. We also do not run clusters composed of very large numbers of machines, meaning that the number of simultaneous failures that can be expected actually remains quite low. This means that the choices made by Garage's architecture make sense for us.
But maybe your point was about the distinction between distributed systems and high availability, in which case I agree. Several of us have studied distributed systems in the academic setting, and in our vocabulary, distributed systems almost by definition include crash tolerance, thus making systems HA. I understand that in the engineering community the vocabulary might be different, and thanks to your insight we might orient our communication more towards presenting Garage as HA, as it is one of our core, defining features.
> However, this isn't it. This is distributed S3 with CRDTs. Still too application-specific, because every app that wants to use it has to be integrated with S3. They could have just downloaded something like Riak and added an S3 API around it.
Garage is almost that, except that we didn't download Riak but made our own CRDT-based distributed storage system. It's actually not the most complex part at all, and most of the development time was spent on S3 compatibility. Rewriting the storage layer means that we have better integration between components, as everything is built in Rust and heavily relies on the type system to ensure things work well together. In the future, we plan to reuse the storage layer we built for Garage for other projects, in particular to build an e-mail storage server.
We still intend to publish the successor to Pithos at some point; it has seen a large overhaul in terms of how metadata is handled and relies on an in-house system for blob storage.
This is a fantastic project and I can't wait to try it out!
One question on the network requirements. The web page says for networking: "200 ms or less, 50 Mbps or more".
How hard are these requirements? For folks like me that can't afford a guaranteed 50Mbps internet connection, is this still usable?
There are plenty of places in the world where 50Mbps internet connectivity would be a dream. Even here in Canada there are plenty of places with a max of 10Mbps. The African continent for example will have many more.
It's not a hard requirement, you might just have a harder time handling voluminous files, as Garage nodes will in any case have to transfer data internally in the cluster (remember that files have to be sent to three nodes when they are stored in Garage, meaning 3x more bandwidth is needed). 10Mbps is already pretty good if it is stable and your ping isn't off the charts, and it might be totally workable depending on your use case.
What's with the obsession with S3 API? Why not FUSE and just use any programming language without libraries to access the files? (Obviously without some S3/FUSE layer, without HTTP, just direct access to garage with the fastest protocol possible)
Hi, this is a very cool project, and I love that it's geared towards self-hosting. I'm about to inherit some hardware, so I'm actually looking at clustered network storage options right now.
In my setup, I don't care about redundancy. I much prefer to maximize storage capacity. Can Garage be configured without replication?
If not, maybe someone else can point me in the right direction. Something like a glusterfs distributed volume. At first glance, SeaweedFS has a "no replication" option, but their docs make it seem like they're geared towards handling billions of tiny files. I have about 24TB of 1GB to 20GB files.
You probably want at least some redundancy, otherwise you're at risk of losing data as soon as a single one of your hard drives fails. However, if you're interested in maximizing capacity, you probably need erasure coding, which Garage does not provide. Gluster, Ceph and Minio might all be good candidates for your use case. Or if you have the possibility of putting all your drives in a single box, just do that and make a ZFS pool.
> You probably want at least some redundancy, otherwise you're at risk of losing data as soon as a single one of your hard drives fails.
If by "fails" you mean the network connection drops out. Then yes, that would be a huge problem. I was hoping some project had a built-in solution to this. Currently, I'm using MergerFS to effectively create 1 disk out of 3 external USB drives and it handles accidental drive disconnects with no problems (I can't gush enough over how great mergerfs is).
But, if by "fails" you mean actual hardware failure. Then, I don't really care. I keep 1 to 1 backups. A few days of downtime to restore the data isn't a big deal; this is just my home network.
> Or if you have the possibility of putting all your drives in a single box...
Unfortunately, I've maxed out the drive bays on my TS140 server. Buying new, larger drives to replace existing drives seems wasteful. Also, I've just been gifted another TS140, which is a good platform to start building another file server.
You've given me something to think about, thanks. I appreciate you taking the time to respond!
Thanks! I skipped over this one but looked at LizardFS, which was forked from Moose in 2013. The comments I found online about lizard weren't too great. I'll give moose a look. 2013 was a long time ago.
I've been looking through the available options for this use-case, with the same motivation given in this post, and was surprised to see there wasn't something which could fit this niche (until now!).
Really happy to see this being worked on by a dedicated team. Combined with a tool like https://github.com/slackhq/nebula this could help form the foundation of a fully autonomous, durable, online community.
Nice! Very nice! Congrats on the project and the naming!
If I may add a few thoughts: you may be very well placed to be a more suitable replacement for Minio in data science / AI projects. Why? Because any serious MLOps setup has 2 hard requirements: append-only storage, and a way to stream large content to a single place. The first one is very hard to get right, the second is relatively easy.
Being CRDT-based it should be _very easy_ for you to provide an append-only storage that can store partitioned, ordered logs of immutable objects (think dataframes, and think kafka). Once you have that, it's really easy to build the remaining missing pieces (UI, API, ...) for creating a _much better_ (and distributed) version of MLFlow.
Finally, the S3 protocol is an "okay" fit for file storage, but as you are probably aware, it's clearly a huge limiter. So, trash it. Provide a read-only S3 gateway for compatibility, but writes should use a different API.
Thanks for your enthusiastic feedback, fellow breton!
It's an interesting take you make about ML workloads. We haven't investigated that yet (next in line is e-mail: you see we target ubiquitous, low-tech needs that are horrendous to host). But we will definitely consider it for future works: small breton food trucks do need their ML solutions for better galettes.
Currently, we have no elegant way to achieve what you want.
When failures occur, repair is done through workers that log when they launch, when they repair chunks, and when they exit. We also have `garage status` and `garage stats`. The first command displays healthy and unhealthy nodes, and the second displays the queue length of our tables and chunks: if their values are greater than zero, we are repairing the cluster.
We are documenting failure recovery in our documentation: https://garagehq.deuxfleurs.fr/documentation/cookbook/recove...
Exposing Prometheus metrics is an ongoing work which we haven't had much time to advance yet. For now we can check on replication progress or overall cluster health by inspecting Garage's state from the command line.
- replication is way too expensive for what it's worth. As soon as you can afford it, you really should go for some kind of erasure coding. The exact moment when erasure coding becomes more efficient (storage-wise, transport-wise) and more robust (in terms of devices you can lose) depends on the size of the objects (fixed cost of metadata).
- you don't need distributed consistency for the data storage actions, just for the metadata (the map that tells you where the data is). You can start pushing data, and if the object's metadata is never created, you just pushed garbage, which can be collected.
Hey, looks pretty interesting but I have a couple of questions:
1) the license of Garage is AGPL, MinIO is also AGPL so I'm not sure what's the problem with it as a valid choice for self-hosting? You seem to not have a problem using Apache 2 licensed software (HashiCorp stack).
2) if you're a non-profit, how come MinIO is your "closest competitor"?
I think the idea is a good one. As S3 becomes a standard, it would be nice if someone could rip Amazon's face off (like Amazon has done to other open-source companies, e.g. Elastic).
Other comments from my colleagues also shed light on Garage's specific features (do check the discussion on Raft).
In a nutshell, Garage is designed for higher inter-node latency, thanks to less round-trips for reads & writes. Garage does not intend to compete with MinIO, though - rather, to expand the application domain of object stores.
Another benefit compared to MinIO is we have "flexible topologies".
Due to our design choices, you can add and remove nodes without any constraint on the number of nodes or the size of the storage, so you do not have to overprovision your cluster as recommended by MinIO[0].
Additionally (and we have planned a full blog post on this subject), adding or removing a node in the cluster does not lead to a full rebalance of the cluster.
To understand why, I must explain how it works traditionally and how we improved on existing work.
When the cluster is initialized, we split it into partitions, then assign the partitions to nodes (see Maglev[1]). Later, each piece of data is stored in the partition corresponding to its hash. When a node is added or removed, traditional approaches rerun the whole algorithm and come up with a totally different partition assignment. Instead, we try to compute a new partition distribution that minimizes assignment changes, which in the end minimizes the number of partitions moved, as the sketch below illustrates.
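To give an intuition of what minimizing assignment changes means, here is a deliberately naive sketch (not Garage's actual algorithm: it assigns one node per partition and ignores capacities and zones, whereas Garage assigns each partition to three nodes):

```rust
use std::collections::HashMap;

/// Recompute a partition -> node assignment after the node set changed,
/// keeping as many existing assignments as possible so that little data moves.
fn reassign(
    old: &HashMap<u16, String>, // previous assignment: partition -> node
    nodes: &[String],           // current set of nodes (assumed non-empty)
    n_partitions: u16,
) -> HashMap<u16, String> {
    // Maximum number of partitions a node should hold (ceiling division).
    let target = (n_partitions as usize + nodes.len() - 1) / nodes.len();
    let mut load: HashMap<&str, usize> = nodes.iter().map(|n| (n.as_str(), 0)).collect();
    let mut new = HashMap::new();

    // Pass 1: keep a partition on its previous node if that node still exists
    // and is not over the target load; this data does not have to move at all.
    for p in 0..n_partitions {
        if let Some(n) = old.get(&p) {
            if let Some(l) = load.get_mut(n.as_str()) {
                if *l < target {
                    *l += 1;
                    new.insert(p, n.clone());
                }
            }
        }
    }

    // Pass 2: give the remaining partitions to the least-loaded nodes.
    for p in 0..n_partitions {
        if !new.contains_key(&p) {
            let n = nodes.iter().min_by_key(|n| load[n.as_str()]).unwrap();
            *load.get_mut(n.as_str()).unwrap() += 1;
            new.insert(p, n.clone());
        }
    }
    new
}

fn main() {
    // 6 partitions initially spread over n1 and n2; then n3 joins the cluster.
    let nodes: Vec<String> = vec!["n1".into(), "n2".into(), "n3".into()];
    let old: HashMap<u16, String> =
        (0u16..6).map(|p| (p, nodes[(p % 2) as usize].clone())).collect();
    let new = reassign(&old, &nodes, 6);
    let moved = (0u16..6).filter(|p| new[p] != old[p]).count();
    println!("{moved} of 6 partitions moved"); // only the excess partitions move
}
```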
On the drawback side, Garage does not implement erasure coding (it is also the reason for many of MinIO's limitations) and duplicates data 3 times, which is less storage-efficient. Garage also implements fewer S3 endpoints than MinIO (for example we do not support versioning); the full list is available in our documentation[2].
> account the geographical location of servers, and ensures that copies of your data are located at different locations when possible for maximal redundancy, a unique feature in the landscape of distributed storage systems.
Azure already does this, so claiming it's unique seems untrue.
I mean, sure, but we're operating under the premise that we don't want to give out our data to a cloud provider and instead want to store data ourselves. Half of the post is dedicated to explaining that, so pretending that you can't infer that from context seems a bit unfair.
Our software is published under the AGPLv3 license and comes with no guarantee, like any other FOSS project (if you do not pay for support). We are considering our software as "public beta" quality, so we think it works well, at least for us.
On the plus side, it survived the Hacker News hug of death. Indeed, the website we linked is hosted on our own Garage cluster, made of old Lenovo ThinkCentre M83s (with Intel Pentium G3420 and 8GB of RAM), and the cluster seems fine. We also host more than 100k objects in our Matrix (a chat service) bucket.
On the minus side, this is the first time we have had so much coverage, so our software has not yet been tested by thousands of people. It is possible that in the near future, some edge cases we never triggered will be reported. This is the reason why most people wait for an application to reach a certain level of adoption before using it; in other words, they don't want to pay "the early adopter cost".
License-wise, Garage is AGPLv3, like many similar software projects such as MinIO and MongoDB. We are not lawyers, but in spirit, this should not preclude you from using it in commercial software as long as it is not modified. If you do modify it, e.g. by adding a nice administration GUI, we kindly ask that you publish your modifications as open source so that we can incorporate them back into the main product if we want.
Then you didn't look close enough. Have a look at the .drone.yml to see the entry points, we have both unit tests and integration tests that run at each build.
GitHub mirror? It's extremely easy to forget and stop following up on a project that's not on GitHub!
You might have personal reasons for your current choice, but maybe the project can have a shot at success if it's in a more social and high-traffic ecosystem.