OOC, what are the downsides of storing service discovery information like this in TXT records? As opposed to using ZK, consul, etcd or a more standard solution?
A zone transfer or grep really isn't so terrible; even for someone who doesn't know much about DNS, it's just a quick memorized dig away. HTTP or gRPC aren't really better CLI UX, and if this is your infra you can just write a CLI to display your DNS-driven information cleanly.
Storing key-value pairs in TXT records makes them easily parseable in any language if you use a simple delimiter. This isn't even an abuse of DNS; the protocol has been used to serve arbitrary data forever.
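For instance, a minimal Go sketch of the parsing involved, assuming a hypothetical record like "version=1.2 addr=10.0.0.5:8080" stored in TXT (the service name and key names are made up for illustration):

    package main

    import (
        "fmt"
        "log"
        "net"
        "strings"
    )

    func main() {
        // Look up TXT records for a hypothetical discovery name.
        records, err := net.LookupTXT("_myservice._tcp.example.com")
        if err != nil {
            log.Fatal(err)
        }
        for _, txt := range records {
            // Each record holds space-separated key=value pairs.
            kv := map[string]string{}
            for _, field := range strings.Fields(txt) {
                if parts := strings.SplitN(field, "=", 2); len(parts) == 2 {
                    kv[parts[0]] = parts[1]
                }
            }
            fmt.Println(kv)
        }
    }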
All the other modern options like Consul, etcd, and ZooKeeper require keeping a quorum of servers happy and have pretty heavy clients. By contrast, it's very hard to take down DNS, and spinning up a new server is as easy as copying a zone file over; the new server could even be running a totally different DNS implementation because zone files are so standardized. Plus, your tooling can directly parse zone files with whatever DNS library you were already using and has a trivial way to dump the data without any server at all.
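As a sketch of that "no server at all" path, here is roughly what dumping the TXT records straight out of a zone file could look like in Go with the miekg/dns library (the zone file path and origin are assumptions):

    package main

    import (
        "fmt"
        "log"
        "os"

        "github.com/miekg/dns"
    )

    func main() {
        // Parse a zone file directly; no running DNS server is needed.
        f, err := os.Open("example.com.zone") // hypothetical path
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        zp := dns.NewZoneParser(f, "example.com.", "")
        for rr, ok := zp.Next(); ok; rr, ok = zp.Next() {
            // Print only the TXT records carrying service metadata.
            if txt, isTXT := rr.(*dns.TXT); isTXT {
                fmt.Println(txt.Hdr.Name, txt.Txt)
            }
        }
        if err := zp.Err(); err != nil {
            log.Fatal(err)
        }
    }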
DNS can be replicated in arbitrary configurations and everything supports DNS caching for really high HA.
When you do want dynamic discovery and don't want to implement direct zone file generation, there's always CoreDNS which has plugins for so many datasources.
And if you don't want to host it, there are tons of DNS providers with great uptime.
At this point, I really can't think of any solution for service discovery that's better than DNS for most cases. Especially since the majority of service discovery solutions end up returning hostnames instead of IP addresses so you're already taking a dependency on DNS. Other solutions only really add benefit if you need to store tons of metadata or take advantage of things like leader election, etc
I mean, kinda same, because my lizard brain still remembers getting burned - but in theory it's not much different than what Storage Gateway in File Gateway mode or Nasuni and others have been doing for years, is it?
We started building JuiceFS in 2016 and released it as a SaaS solution in 2017. After years of improvements, we recently released the core of JuiceFS; hopefully you will find it useful.
I'm the founder of JuiceFS and would be happy to answer any questions here.
Hi. I'm pretty interested and excited about this project. Under "Credits" the project states:
>"The design of JuiceFS was inspired by Google File System, HDFS and MooseFS, thanks to their great work."
Would you consider writing up a design doc for JuiceFS? I would be interested to know more about what specific implementation ideas you used from each of those, if any, the design choices and tradeoffs made, learnings, etc. It would make a great blog post. Cheers.
The whole idea came from GFS: separate the metadata and data, load all the metadata into memory, use a single metadata server for simplicity, and use fixed-size chunks.
The POSIX and FUSE parts were learned from MooseFS, but we changed the design to use read-only chunks, merge them together, and do compaction in the background. Since most object storage only provides eventual consistency, this model works pretty well and also eases the burden on cache eviction. In order to access the object store in parallel, we divide each chunk into smaller blocks (4 MB), which are also a good unit for caching.
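To make that layout concrete, here is a rough sketch (hypothetical constants and names, not the actual JuiceFS code) of how a file offset could map to a chunk and a 4 MiB block within it, assuming 64 MiB fixed-size chunks:

    package main

    import "fmt"

    const (
        chunkSize = 64 << 20 // assumed fixed chunk size (64 MiB)
        blockSize = 4 << 20  // 4 MiB blocks, the unit of upload/download/caching
    )

    // locate maps a file offset to (chunk index, block index, offset within block).
    func locate(offset int64) (chunk, block, off int64) {
        chunk = offset / chunkSize
        inChunk := offset % chunkSize
        block = inChunk / blockSize
        off = inChunk % blockSize
        return
    }

    func main() {
        // A read at offset 70 MiB lands in chunk 1, block 1, 2 MiB into the block.
        fmt.Println(locate(70 << 20))
    }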
The Hadoop SDK (not released yet) was learned from HDFS.
One key thing in the implementation is using Redis transactions to guarantee atomicity of metadata operations; otherwise we would run into millions of random bugs.
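As a rough illustration of that pattern (not the actual JuiceFS code; the key layout and the go-redis client are assumptions), an atomic rename could WATCH the affected directory keys and apply both hash changes in a single MULTI/EXEC transaction:

    package main

    import (
        "context"
        "log"

        "github.com/go-redis/redis/v8"
    )

    // renameEntry moves a directory entry from srcDir to dstDir atomically.
    // Directory contents are stored in hypothetical hashes like "dir:<inode>".
    func renameEntry(ctx context.Context, rdb *redis.Client, srcDir, dstDir, name string) error {
        return rdb.Watch(ctx, func(tx *redis.Tx) error {
            // Read the entry while WATCHing both directory keys.
            inode, err := tx.HGet(ctx, srcDir, name).Result()
            if err != nil {
                return err
            }
            // EXEC fails if any watched key changed concurrently, so the
            // entry can never end up in both directories or in neither.
            _, err = tx.TxPipelined(ctx, func(pipe redis.Pipeliner) error {
                pipe.HDel(ctx, srcDir, name)
                pipe.HSet(ctx, dstDir, name, inode)
                return nil
            })
            return err
        }, srcDir, dstDir)
    }

    func main() {
        rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
        if err := renameEntry(context.Background(), rdb, "dir:1", "dir:2", "a.txt"); err != nil {
            log.Fatal(err)
        }
    }

In practice you would also retry when the transaction aborts because a watched key changed (go-redis reports this as redis.TxFailedErr).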
Our team is JuiceFS's launch customer. We've been using its enterprise offering since late 2016 and have accumulated several hundred TBs of data on JuiceFS along the way; it has worked well.
I'm from the JuiceFS team. We have been running JuiceFS as a fully managed service on every major public cloud for 4 years. We decided to open source it to encourage more exploration and usage.
We were doing this at Avere in 2015. The system built a POSIX filesystem out of S3 objects on the backend (including metadata) and then served it over NFS or SMB from a cluster of cache nodes. Keeping metadata and data in separate data stores with different consistency models is a disaster waiting to happen - ask anyone who has run Lustre. Having fast caches with SSDs was the key to getting any kind of decent performance out of S3. The fun part was mounting a filesystem and running df and seeing an exabyte of capacity. They were acquired by Microsoft in 2018 and integrated into Azure.
I saw Avere recommended on GCP before but never found the details, thanks for sharing that. It seems that Avere is close to ObjectiveFS, which also uses S3 for both data and metadata.
My guess is that Avere requires a cluster of nodes as the fast layer for writes and synchronization. In JuiceFS, Redis is used for synchronization and for persisting metadata. Local SSDs can be used for read caching of data, not for writing.
JuiceFS has been in production for 3+ years, and we have not found much that is difficult with this design, since it's borrowed from GFS, HDFS and MooseFS, which have been proven in production for more than 10 years. I'd like to hear what challenges you were facing.
The nice thing about storing everything including metadata in S3 is that you have a consistent copy of your filesystem. How do you back up a filesystem that is split across Redis and S3 for example? I was able to blow away the cluster running it, spin up a new one, point it at the bucket and add the encryption key, and it would bring up a working filesystem. Just like plugging in a hard drive.
JuiceFS has a single-node version that works like this: it stores the metadata in SQLite and uploads to S3 in the background. It's useful in some cases but not in general; you may lose recent metadata if that single node breaks.
The difficulty with using S3 for metadata is that there is no way to persist metadata in under 1 ms. For example, creating a symlink would take more than 20 ms, or you might lose it.
With an external persistent database, we get ACID for metadata operations. Also, the metadata is the source of truth: whenever the object store is out of sync (losing or leaking an object), the whole file system is still consistent, rather than having part of a file corrupted. That should be safer than putting everything into S3.
The database is the key part. Redis is our first choice; we will add support for other databases in the future.
>"Keeping metadata and data in separate data stores with different consistency models is a disaster waiting to happen - ask anyone who has run Lustre."
Can you elaborate - is the issue corruption or performance? I've never used Lustre.
Lustre is all about performance, at the cost of being split across multiple distributed systems. So there are opportunities to have different components get out of sync - the metadata servers are usually the weak link. And the more moving parts the harder it is to debug when something goes wrong.
How POSIX-compatible is it exactly? There's a lot of niche features that tend to break on not fully compliant network filesystems. Do unlinked files remain accessible (the dreaded ESTALE on some NFS implementations)? mmap? atomic rename? atomic append? range locks? what's the consistency model?
Some of those things don't appear to be covered by pjdfstest.
1. Unlinked files remain accessible when they were unlinked from the same machine.
2. mmap is supported.
3. Atomic rename is supported.
4. Is there atomic append in POSIX?
5. Range locks are supported.
6. The consistency model is open-after-close, which means once a file is closed, you can open it and read the latest data.
That, and furthermore POSIX requires that a read which can be proven to occur after a write returns the new data (I'd think that would be quite difficult to implement efficiently if multiple clients are allowed). I couldn't find that mentioned in either the JuiceFS documentation or the pjdfstest suite.
Am I the only one who always feels uneasy when using NFS-like filesystems? In my experience way too much software has been built without any kind of fault tolerance regarding file system access, and no matter how good your network filesystem is, it still can cause havoc and every sort of data loss.
I've seen so many disasters related to software basically assuming a file can't just vanish into thin air, something that can very much happen when your FS is running on top of an arbitrary network connection. Hiding away such a fundamental detail in order to provide a file-like API tends to instill every sort of bad ideas into people (NFS via Wi-Fi? Why not!).
My instinct is with yours, but in some very simple use cases, it can make sense.
For example, you've got a legacy PHP web application you have to maintain, and it's got spaghetti code all over the place, but all uploaded user content is stored / served from a single directory. You can probably use an S3-backed file system to replace that directory.
Obviously if you try to do, say, an "ls -la" on that directory and there are a lot of files in there, you may be waiting a while, since that translates to a lot of API calls. Similarly if you have something like a virus scan running on the box, it's likely to be tripped up.
But if you know the only thing using that directory is doing simple CRUD operations, it might be easier than trying to retrofit the application itself to talk to S3. Especially now that S3 has strong consistency.
I learned not to use Dropbox for things like Git repos. I'd have a fresh copy on one computer, and then have an older copy on a different computer. Sometimes I'd accidentally overwrite the newer code w/ autosaving on the older code. Now I'm skeptical that I'll be able to keep everything properly in sync on any sort of network-ish file system that is wrestling for control with my local file system.
Oh, I ran into the same problems, so in the end I decided to treat the Dropbox repo as a remote - create a bare repository in your Dropbox, clone it to a local folder, push/pull as you would any other repo. It's not optimal, but it's an easy way to get a private hosted repo.
This is neat! I am quite a fan of all the Go-based file systems that are springing up. Question: what are the main compare-and-contrast points between JuiceFS and SeaweedFS?
I think the major problem is latency. Try sshfs (fs over ssh) and see what I mean. Don't get me wrong, I use and like sshfs for a quick data transfer, but it's just not good enough to run your application with.
For a stable POSIX filesystem in production latency is key. Often times in a datacenter 10GE is recommended for network storage solutions, not because of bandwidth (which is also important), but for the 10x reduced latency of a 10GE NIC. Most applications simply expect response times of microseconds or a few milliseconds at most from a POSIX Filesystem, they simply cannot run (without modifying the codebase) on something much slower.
But if you had to rewrite your application anyways, then you might want to use plain S3 without the FS, that is easier to operate in the long run.
> Often times in a datacenter 10GE is recommended for network storage solutions, not because of bandwidth (which is also important), but for the 10x reduced latency of a 10GE NIC.
Totally true. In those cases you also need your filer _not_ to compress on the fly, and not deduplicate objects... unless your hardware can do it for you.
... I remember horrible performance on VMs with images stored on an oversized NetApp filer, because someone enabled deduplication and compression instead of using 1 of the 4 spare drives (back when it was HDD).
A low latency metadata server is good, but you still have the latency to the S3 server, e.g. Amazon S3 (from your README.md).
Now you could say, that I could host my own S3 e.g. MinIO in the same DC, but then I could also simply deploy Ceph, which is battle tested for years up to the petabyte range and with iSCSI, S3 and FS interfaces.
So I think this project might give the wrong impression that you can simply combine a Redis with Amazon S3 and then have a good FS solution available, which is unfortunately not the case.
TTFB in S3 is 20-30ms around the 50th percentile, and it can go much higher at p99 [1]. In any case, rotational latency for HDD drives is an order of magnitude lower (typically 2-5ms for a seek operation).
S3 is great for higher throughput workloads where TTFB is amortized across larger downloads (this is why it's very common to use S3 as a "data lake" where larger columnar files are stored, usually at the order of hundreds of MiB).
I think it's an interesting project but perhaps explaining the use cases where this solution is beneficial would go a long way here.
JuiceFS was initially designed for big data workloads in the cloud, and we have tens of paying customers using it for that.
For the NAS use case, latency will be higher on a cache miss, but the overall IOPS can be higher. With an overwhelmed HDD, latency can also go up to hundreds of milliseconds. We have put a lot of effort into improving caching (in kernel and on disk) and prefetching. The overall performance is comparable to HDD over NFS.
Redis can be persisted with RDB and AOF, and can also be replicated to another machine. In the cloud you don't need to worry about that; hosted Redis is ready to use.
There is an ongoing effort [1] to improve persistence and availability in general, which is expected to be GA in 2021.
Given that a lost write to Redis would translate to corruption or missing data at the filesystem level, the only "safe" way to run this ATM is with Redis' extremely inefficient "always fsync" setting for its AOF log [1].
Keep in mind that ElastiCache does not support it (in general, it doesn't really support running Redis in a durable way).
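For reference, that durability-first setting is just standard Redis configuration, nothing JuiceFS-specific; roughly:

    # redis.conf
    appendonly yes        # enable the AOF log
    appendfsync always    # fsync after every write: safest, slowest

The default appendfsync everysec can lose roughly up to a second of acknowledged writes on a crash, which in this setup would mean silently losing metadata.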
Presumably this would be useful if you have apps that currently expect a POSIX FS, since not all apps (maybe even a majority) run on the cloud. I imagine it's a drop-in replacement for NFS.
Cloud storage access control and data lifecycle control are much more advanced, which is something you would probably have to give up with this, e.g. IAM restrictions per bucket/object, lifecycle policies, etc.
If you’re writing new apps, I don’t see why you would want to add another abstraction layer rather than access cloud storage directly except for very specific use cases.
> If you’re writing new apps, I don’t see why you would want to add another abstraction layer
I can see the usefulness in basing your app on FS and other POSIXly primitives (as opposed to the "cloud-native" storage du jour) if you want your app to continue to be usable on the largest class of machines and scenarios including local deployment under traditional Unix site autonomy assumptions. The general purpose being portability, need for on-premise deployment, (very) long-term viability, developer experience, accountability, integration with legacy software and permission infrastructure, use of existing upload/download or VCS software, straightforward file or metadata exchange, forensic or academic transparency, and avoidance of lock-in.
How do you deal with failures? What happens if the Redis availability zone disappears, for example? Am I manually responsible for recovery and backups in that case, or do you use Redis as a cache that can be recovered from S3?
Redis is used to persist metadata; it cannot be recovered from S3. Hosted Redis solutions should already provide failover for that, for example, AWS ElastiCache has multi-AZ failover [1].
I've been using s3ql for a while, but this looks great. After testing it for a few hours, the performance seems solid. I'm glad it has a data cache feature like s3ql.
How does this compare with seaweedfs? That also has fuse and can store metadata and small data in redis and provides its own s3 api, so there's one less moving part.
Would not an NFS solution have this kind of caching and durability built-in? Without doing actual “Jepsen tests” (they are almost a generic term at this point due to their name) how would this improve my life versus buying an NFS vendor solution, or rolling my own?
NFS doesn’t scale to effectively infinity with an underlying object store. This is to give you a ton of storage without using a traditional volume target with your app that, for whatever reason, requires a posix filesystem.
I’m sure someone from AWS can’t comment, but I imagine this is how AWS’ EFS service is built (NFS wire protocol to clients, but using S3 and metadata caching under the hood). Blobs or blocks doesn’t matter much, just how fast the abstraction is.
well, NFS may not scale to infinity, but easily beats this thing I guess... And for scaling to infinity: how about benchmarking vs. GPFS, BeeGFS or Gluster?
Interesting, I'd guess list operations are significantly faster because of the cache, or even just a hash map like Redis? That can be a real slowdown in S3 with large quantities of files.
How much does it cost to use on S3? For example, how many GetObject and other misc non-free API calls does it use? And does it store data using intelligent tiering?
For small files (less than 4 MB), the number of Get/Put/Delete requests is the same as using S3 directly. For larger files, each object in S3 is about 4 MiB; for example, a 100 MiB file maps to roughly 25 objects. Most S3 clients do something similar, GETting/PUTting small parts in parallel to speed things up, so overall JuiceFS should issue a similar number of GET/PUT requests as using S3 directly.
Second, all the List and Head requests go to Redis, and those are free, so you may save on API costs.
Third, frequently read data will be cached on your local disks, so you will also save some cost on GET requests.
JuiceFS accesses objects in S3 using a unique ID to generate each key, so lifecycle rules will not change the way JuiceFS accesses them, except that the Glacier or Deep Archive storage classes, which are not instantly accessible, will hurt the user experience for JuiceFS.
Thanks. Sorry I wasn't being very clear, I meant: what happens if an object is permanently deleted from a bucket by an out-of-band process, invisible to JuiceFS (e.g. by a user operating on the bucket directly or by a misconfigured lifecycle rule that deletes old objects after X days/months/years)?
Does JuiceFS's metadata server handle this loss of synchronisation gracefully?
Understood; the annoying part of NFS is that some operations cannot be interrupted when the network is not healthy, and we have no way to kill them.
We have put a lot of effort into making operations interruptible when either the metadata store or the object store is slow or down. There are a few cases where `close` can't be interrupted, but we can still abort the FUSE connection or kill the JuiceFS process, and then all the operations are canceled.
In general, JuiceFS is built on FUSE, so we have more control over it.
So operations-wise, applications would still require workarounds to be resilient to failure in production. From what I recall, all network filesystems have this problem, so I avoid them all.
The essential problem is trying to do something with an interface that it wasn't designed for. It's the same problem network protocols have. They can't communicate metadata about each layer across the layers, so your HTTP protocol has no idea that it's actually being tunneled in another HTTP protocol and that that protocol's TCP connection just got a PSH,FIN,RST packet. Your file i/o app also has no idea that you just lost a quorum, or that one section of the network just crapped out (and even if it did know, what would it do?)
So... a file system on top of a virtualized file system hosted on someone else's computer across the land, with "Outstanding Performance: The latency can be as low as a few milliseconds" ... Milliseconds disk access is outstanding?
I mean, amazing, and, maybe you know, use a file system.
Remember that a network filesystem can't really be faster than the network latency. If you are not in the same rack as your filesystem server, you can expect a few milliseconds of network latency. So a network filesystem that has a scalable design, is POSIX compatible, and is in the same latency ballpark as the network sounds quite nice actually.
Before you want to use it in your project, make sure to have a look at the code - their codebase is pretty much comment-free, with very few tests. Other than the marketing term "POSIX file system", there is no proof of that claim.
I am also not sure how it is a "distributed" file system given its storage is entirely done by S3. Should I call my backup program that backs up my data to S3 every night a "distributed system"? When running on top of Redis, it explicitly mentions that Redis Cluster is not supported. I haven't used Redis for many years, did I miss something here? A "distributed" file system built on top of a single instance of Redis? Sounds not very "distributed" to me.
POSIX is so complicated that we would not trust tests we wrote ourselves to prove compatibility.
So we use several third-party test suites, including fsx, pjdfstest, fsracer and flock. We also use third-party tools for benchmarking: fio, mdtest and others.
The architecture of JuiceFS is very close to GFS, which has used a single-node master for many years, even now. Since Redis is only responsible for metadata, a single node can serve hundreds of millions of files and tens of thousands of IOPS, which should be enough for many use cases.
The term `distributed` means that JuiceFS is not a `local` file system that can only be used by a single machine. JuiceFS should qualify as a distributed system even though the core part is the client, which can be used by many machines at the same time.
Based on some information found on the internet, Colossus is built on top of GFS and uses BigTable (or Spanner) as the meta store, and BigTable itself still uses GFS.
The recent GFS may have multiple masters [1], but they are separate namespaces, similar to HDFS federation.
Since JuiceFS is written in Go, a Go SDK will come first; then we can support other languages using cgo. We already have a Java SDK internally, and Python will be next.
By the way, an S3 gateway is on our roadmap: you could spin up an S3 gateway for JuiceFS and talk to it using an existing S3 SDK. This only makes sense when you have other applications outside of Lambda using JuiceFS.
Do you mean using JuiceFS as the place to store the RDB or AOF? Yes, for backup. fsync() is much slower than on an SSD (20 ms vs 100 us), so it may hurt performance.
Well, maybe it's not a big deal. I guess programs using this POSIX filesystem might not be derivatives and wouldn't need to also be released under the AGPL.