I mean, your inventory is either a grep or a zone transfer this way, for one.
Key-value pairs stored in TXT records are easily parseable in any language if you use a simple delimiter. This isn't even an abuse of DNS; the protocol has been used to serve arbitrary data forever.
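For example, a minimal sketch in Go (the record name and the `key=value;` delimiter are just assumptions for illustration, not anything standardized):

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

func main() {
	// Look up TXT records for a hypothetical service entry.
	records, err := net.LookupTXT("db1.services.example.com")
	if err != nil {
		panic(err)
	}
	for _, txt := range records {
		// Assume each record holds "key=value" pairs separated by ";".
		for _, pair := range strings.Split(txt, ";") {
			kv := strings.SplitN(pair, "=", 2)
			if len(kv) == 2 {
				fmt.Printf("%s => %s\n", kv[0], kv[1])
			}
		}
	}
}
```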
All the other modern services like Consul, etcd, zk, etc. require keeping a quorum of servers happy and have pretty heavy clients. By contrast, it's really hard to take down DNS, and spinning up a new server is as easy as copying the zone file over; the new server could even be running a totally different DNS implementation, because zone files are so standardized. Plus, your tooling can parse zone files directly with whatever DNS library you were already using, giving you a trivial way to dump the data without any server at all.
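If it helps, here's roughly what that direct parsing looks like in Go (this sketch uses the third-party github.com/miekg/dns library, and the zone contents are made up):

```go
package main

import (
	"fmt"
	"strings"

	"github.com/miekg/dns"
)

const zone = `
$ORIGIN services.example.com.
db1  300 IN A   10.0.0.11
db1  300 IN TXT "role=primary;env=prod"
`

func main() {
	// Walk every record in the zone file without talking to any DNS server.
	zp := dns.NewZoneParser(strings.NewReader(zone), "services.example.com.", "")
	for rr, ok := zp.Next(); ok; rr, ok = zp.Next() {
		fmt.Println(rr)
	}
	if err := zp.Err(); err != nil {
		panic(err)
	}
}
```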
DNS can be replicated in arbitrary configurations and everything supports DNS caching for really high HA.
When you do want dynamic discovery and don't want to implement direct zone file generation, there's always CoreDNS which has plugins for so many datasources.
And if you don't want to host it, there are tons of DNS providers with great uptime.
At this point, I really can't think of any solution for service discovery that's better than DNS for most cases. Especially since the majority of service discovery solutions end up returning hostnames instead of IP addresses, so you're already taking a dependency on DNS. Other solutions only really add benefit if you need to store tons of metadata or take advantage of things like leader election, etc.
I'm the founder of JuiceFS; I'd be happy to answer any questions here.
>"The design of JuiceFS was inspired by Google File System, HDFS and MooseFS, thanks to their great work."
Would you consider writing up a design doc for JuiceFS? I would be interested to know more about which specific implementation ideas you took from each of those (if any), the design choices and tradeoffs made, learnings, etc. It would make a great blog post. Cheers.
The POSIX and FUSE parts were learned from MooseFS, but changed to use read-only chunks, merge them together, and do compaction in the background. Since most object storage only provides eventual consistency, this model works pretty well and also eases the burden on cache eviction. In order to access the object store in parallel, we divide each chunk into smaller blocks (4MB), which is also a good unit for caching.
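To make that concrete, here's a tiny sketch of the offset math; the 4MB block size is from the description above, while the 64MB chunk size is only an assumed value for illustration:

```go
package main

import "fmt"

const (
	blockSize = 4 << 20  // 4 MiB blocks: the unit of parallel object access and caching
	chunkSize = 64 << 20 // hypothetical chunk size, just for this example
)

// locate maps a file offset to (chunk index, block index within the chunk, offset within the block).
func locate(off int64) (chunk, block, blockOff int64) {
	chunk = off / chunkSize
	inChunk := off % chunkSize
	return chunk, inChunk / blockSize, inChunk % blockSize
}

func main() {
	c, b, o := locate(70 << 20) // a read at file offset 70 MiB
	fmt.Printf("chunk %d, block %d, offset %d\n", c, b, o)
}
```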
The Hadoop SDK (not released yet) was learned from HDFS.
One key thing in the implementation is using Redis transactions to guarantee atomicity of metadata operations; otherwise we would run into millions of random bugs.
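For illustration only (this is not JuiceFS's actual schema; it's a sketch using the go-redis client with made-up key names), the optimistic WATCH/MULTI/EXEC pattern looks roughly like this:

```go
package main

import (
	"context"

	"github.com/go-redis/redis/v8"
)

// renameEntry moves a directory entry from one parent hash to another atomically.
// WATCH aborts the transaction if another client touches either key first, so the
// HDel/HSet pair is applied together or not at all. Real code would retry on
// redis.TxFailedErr.
func renameEntry(ctx context.Context, rdb *redis.Client, srcDir, dstDir, name string) error {
	return rdb.Watch(ctx, func(tx *redis.Tx) error {
		inode, err := tx.HGet(ctx, srcDir, name).Result()
		if err != nil {
			return err
		}
		_, err = tx.TxPipelined(ctx, func(pipe redis.Pipeliner) error {
			pipe.HDel(ctx, srcDir, name)
			pipe.HSet(ctx, dstDir, name, inode)
			return nil
		})
		return err
	}, srcDir, dstDir)
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	_ = renameEntry(context.Background(), rdb, "dir:1", "dir:2", "report.txt")
}
```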
If you run it on your own, please pay attention to the persistence options and the HA solution; there are plenty of articles on these.
2. Is there any access control, and if there is, who enforces it?
Yes, please use `rediss://host:port/`
> 2. Is there any access control, and if there is, who enforces it?
You may specify port 0 to disable the non-TLS port completely. To enable only TLS on the default Redis port, use:
--port 0 --tls-port 6379
Good luck with the project.
My guess is that Avere requires a cluster of nodes as the fast layer for writes and synchronization. In JuiceFS, Redis is used for synchronization and for persisting metadata. Local SSDs can be used for read caching of data, not for writing.
JuiceFS has been in production for 3+ years, and we haven't found much that was difficult with this design, since it's borrowed from GFS, HDFS and MooseFS, which have been proven in production for more than 10 years. I'd like to hear what challenges you were facing.
The difficulty of using S3 for metadata is that there is no way to persist metadata in under 1 ms. For example, creating a symlink would take more than 20 ms, or you may lose it.
With an external persistent database, we get ACID guarantees for metadata operations. Also, the metadata is the source of truth: whenever the object store is out of sync (losing or leaking an object), the whole file system is still consistent rather than having part of a file corrupted. That should be safer than putting everything into S3.
The database is the key part; Redis is our first choice, and we will add support for other databases in the future.
Can you elaborate - is the issue corruption or performance? I've never used Lustre.
I wrote a blog post (in Chinese) in 2018 explaining our MySQL backup practice around JuiceFS.
They provided an open source version recently.
Some of those things don't appear to be covered by pjdfstest.
1. An unlinked file remains accessible when it's unlinked from the same machine (see the sketch after this list).
2. mmap is supported.
3. Atomic rename is supported.
4. Is there atomic append in POSIX?
5. Range locks are supported.
6. The consistency model is open-after-close, which means that once a file is closed, you can open it and read the latest data.
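For item 1, this is the classic POSIX idiom being preserved (a generic sketch, not JuiceFS-specific code):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	f, err := os.CreateTemp("", "scratch-*")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Unlink the file while it is still open: the name is gone,
	// but the open descriptor keeps the data readable.
	if err := os.Remove(f.Name()); err != nil {
		panic(err)
	}

	if _, err := f.WriteString("still here\n"); err != nil {
		panic(err)
	}
	if _, err := f.Seek(0, 0); err != nil {
		panic(err)
	}
	buf := make([]byte, 32)
	n, _ := f.Read(buf)
	fmt.Print(string(buf[:n])) // prints "still here"
}
```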
I've seen so many disasters caused by software basically assuming a file can't just vanish into thin air, something that can very much happen when your FS is running on top of an arbitrary network connection. Hiding away such a fundamental detail in order to provide a file-like API tends to instill all sorts of bad ideas in people (NFS via Wi-Fi? Why not!).
For example, you've got a legacy PHP web application you have to maintain, and it's got spaghetti code all over the place, but all uploaded user content is stored / served from a single directory. You can probably use an S3-backed file system to replace that directory.
Obviously if you try to do, say, an "ls -la" on that directory and there are a lot of files in there, you may be waiting a while, since that translates to a lot of API calls. Similarly if you have something like a virus scan running on the box, it's likely to be tripped up.
But if you know the only thing using that directory is doing simple CRUD operations, it might be easier than trying to retrofit the application itself to talk to S3. Especially now that S3 has strong consistency.
Here is a compendium for those interested:
The core of SeaweedFS is managing many small blobs; you can use SeaweedFS together with JuiceFS to get a full-featured POSIX file system.
For a stable POSIX filesystem in production, latency is key. Oftentimes in a datacenter, 10GE is recommended for network storage solutions, not because of the bandwidth (which is also important) but because of the roughly 10x lower latency of a 10GE NIC. Most applications simply expect response times of microseconds, or a few milliseconds at most, from a POSIX filesystem; they simply cannot run (without modifying the codebase) on anything much slower.
But if you had to rewrite your application anyway, then you might as well use plain S3 without the FS; that is easier to operate in the long run.
Totally true. In those cases you also need your filer _not_ to compress on the fly, and not deduplicate objects... unless your hardware can do it for you.
... I remember horrible performance on VMs with images stored on an oversized NetApp filer, because someone enabled deduplication and compression instead of using 1 of the 4 spare drives (back when it was HDD).
Later on, we will try Redis client-side caching, which could also reduce the latency of some metadata operations down to a few microseconds.
Now you could say that I could host my own S3 (e.g. MinIO) in the same DC, but then I could also simply deploy Ceph, which has been battle-tested for years up to the petabyte range and offers iSCSI, S3 and FS interfaces.
So I think this project might give the wrong impression that you can simply combine a Redis with Amazon S3 and then have a good FS solution available, which is unfortunately not the case.
Ceph is great if you can master the complexity under the hood; MinIO + Redis + JuiceFS could be the easier answer for beginners.
S3 is great for higher throughput workloads where TTFB is amortized across larger downloads (this is why it's very common to use S3 as a "data lake" where larger columnar files are stored, usually at the order of hundreds of MiB).
I think it's an interesting project but perhaps explaining the use cases where this solution is beneficial would go a long way here.
JuiceFS was initially designed for big data workloads in the cloud, and we have tens of paying customers using it for this use case.
For the NAS use case, the latency is worse when the cache misses, but the overall IOPS could be higher. With an overwhelmed HDD, the latency could also go up to hundreds of milliseconds. We have put a lot of effort into improving the caching (in the kernel and on disk) and prefetching. The overall performance is comparable to HDD over NFS.
Otherwise I don't understand how the metadata can be persistent after a reboot, as AFAIK Redis cannot dump and reload its state.
There is an ongoing effort to improve persistence and availability in general, which is expected to be GA in 2021.
Keep in mind that ElastiCache does not support it (in general, it doesn't really support running Redis in a durable way).
Cloud storage access control and data lifecycle control are much more advanced, which is something you would probably have to give up with this. E.g. IAM restrictions per bucket/object, lifecycle policies, etc.
If you’re writing new apps, I don’t see why you would want to add another abstraction layer rather than access cloud storage directly except for very specific use cases.
I can see the usefulness in basing your app on FS and other POSIXly primitives (as opposed to the "cloud-native" storage du jour) if you want your app to continue to be usable on the largest class of machines and scenarios including local deployment under traditional Unix site autonomy assumptions. The general purpose being portability, need for on-premise deployment, (very) long-term viability, developer experience, accountability, integration with legacy software and permission infrastructure, use of existing upload/download or VCS software, straightforward file or metadata exchange, forensic or academic transparency, and avoidance of lock-in.
We will provide tools to dump the metadata as JSON, then you could recover your files using that.
I’m sure someone from AWS can’t comment, but I imagine this is how AWS’ EFS service is built (NFS wire protocol to clients, but using S3 and metadata caching under the hood). Blobs or blocks doesn’t matter much, just how fast the abstraction is.
They require a cluster of machines and replicate the data across them, using either expensive EBS or local disks (with only a few larger instance types to pick from).
Maintaining them well is another burden. The cool idea of JuiceFS is to shift the maintenance to hosted Redis and S3.
I have seen people asking for SQL database support in s3ql, to make s3ql accessible from multiple machines; right now, JuiceFS could be a choice.
We have put a lot of effort into making operations interruptible when either the metadata store or the object store is slow or down. In the few cases where `close` can't be interrupted, we can still abort the FUSE connection or kill the JuiceFS process, and then all the operations are canceled.
In general, JuiceFS is built on FUSE, so we have more control over it.
The essential problem is trying to do something with an interface that it wasn't designed for. It's the same problem network protocols have. They can't communicate metadata about each layer across the layers, so your HTTP protocol has no idea that it's actually being tunneled in another HTTP protocol and that that protocol's TCP connection just got a PSH,FIN,RST packet. Your file i/o app also has no idea that you just lost a quorum, or that one section of the network just crapped out (and even if it did know, what would it do?)
Second, all the List and Head requests go to Redis, where they are free, so you may save on API costs.
Third, frequently read data will be cached on your local disks, so you will also save some cost on GET/PUT requests.
Does JuiceFS's metadata server handle this loss of synchronisation gracefully?
I mean, amazing, and, maybe you know, use a file system.
We could provide an SDK to access JuiceFS from Lambda, similar to the S3 SDK, for when you need to use JuiceFS outside of Lambda.
Same for Fargate: we cannot mount JuiceFS in Fargate because it lacks FUSE permission; people are asking for it.
By the way, an S3 gateway is on our roadmap: you could spin up an S3 gateway for JuiceFS and talk to it using the existing S3 SDK. This only makes sense when you have other applications outside of Lambda using JuiceFS.
That would help dogfood the SDK, and allow it to be used across all languages and environments.
Most file storage projects picked the GPL, for example Ceph, GlusterFS and MooseFS, so we followed them.
I am also not sure how it is a "distributed" file system given its storage is entirely done by S3. Should I call my backup program that backs up my data to S3 every night a "distributed system"? When running on top of Redis, it explicitly mentions that Redis Cluster is not supported. I haven't used Redis for many years, did I miss something here? A "distributed" file system built on top of a single instance of Redis? Sounds not very "distributed" to me.
So we use several third-party test suites, including fsx, pjdfstest, fsracer and flock. We also use third-party benchmark tools: fio, mdtest and others.
The term `distributed` means that JuiceFS is not a `local` file system that can only be used by a single machine. JuiceFS should qualify as a distributed system even though the core part is the client, which can be used by many machines at the same time.
GFS has since evolved into Colossus which doesn't have this architecture limitation.
The recent GFS may have multiple masters, but they serve separate namespaces, similar to HDFS federation.