I do appreciate that Google is now officially supporting gcsfuse because it genuinely is a great project. However, their Kubernetes CSI driver seems to have in large part copied code from the one I and a co-maintainer have been working on for years:

- https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver
- https://github.com/ofek/csi-gcs

Here is the initial commit: https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/c...

Notice, for example, not just the code but also the associated files. The Dockerfile blatantly copies the one from my repo, down to the dual license I chose because I was very into Rust at the time. Or take a look at the deployment examples, which use Kustomize; I like Kustomize, but it is very uncommon and most Kubernetes projects provide Helm charts instead.

They were most certainly aware of the project: Google reached out to discuss potential collaboration but then never responded to my reply: https://imgur.com/a/KDuf9mj
Your repository seems to have both an Apache and MIT license. What license are you distributing your code under?
Edit: I see you said it’s dual licensed. From the look of it both allow Google or any other company to copy and reuse code, so what are you upset about?
I don't mean to be rude but yeah, this is exactly what AGPL was intended to combat. It's a lesson learned for these developers, and Google did nothing wrong or even unethical imo.
A lot of people treat licensing emotionally (e.g. WTFPL, or picking licenses that feel good or that we saw in another project). Business people, however, are very logical and will unfortunately exploit this.
The irony is that Google probably would not have done this if the codebase just omitted a license entirely. When I worked there, they wouldn't allow OSS with no license.
> The irony is that Google probably would not have done this if the codebase just omitted a license entirely. When I worked there, they wouldn't allow OSS with no license.
This is because a license is the only way to legally use code. Code being publicly accessible doesn't mean it's free-as-in-freedom to use.
> The irony is that Google probably would not have done this if the codebase just omitted a license entirely.
Yes they would. Google's code appears not to include attribution to the OP. So either Google authored the code or violated the license. One would hope that it's the former.
OP should submit a PR to correct this. IANAL, but I'm pretty sure they're supposed to retain the original copyright notice, "Copyright 2020 Ofek Lev".
I have to agree. I use non-copyleft licenses because I don't expect people using the code to give me anything in return. Removing the header and attribution is wrong, but not collaborating? As was said in other comments, if that's what you want, dictate it in your license, because there are plenty of -- and dare I say most -- people using the MIT license who don't care.
This is why free software and open source aren't the same. Free software is about this kind of fairness, among others. The simplicity of open source does have its downsides.
Unless there is a legal definition of “bad”, it's well within the rights you've granted them.
You can’t have a license that says one thing and then rely on implicit community norms to expect something to happen. (For one, you’re assuming the person is even aware of the community norms)
I am a contributor who works on the Google Cloud Storage FUSE CSI Driver project. The project is partially inspired by your CSI implementation. Thank you so much for the contribution to the Kubernetes community. However, I would like to clarify a few things regarding your post.
The Cloud Storage FUSE CSI Driver project does not have “in large part copied code” from your implementation. The initial commit you referred to in the post was based on a fork of another open source project: https://github.com/kubernetes-sigs/gcp-filestore-csi-driver. If you compare the Google Cloud Storage FUSE CSI Driver repo with the Google Cloud Filestore CSI Driver repo, you will notice the obvious similarities, in terms of the code structure, the Dockerfile, the usage of Kustomize, and the way the CSI is implemented. Moreover, the design of the Google Cloud Storage FUSE CSI Driver included a proxy server, and then evolved to a sidecar container mode, which are all significantly different from your implementation.
As for the Dockerfile annotations you pointed out in the initial commit, I did follow the pattern in your repo because I thought it was the standard way to declare the copyright. However, it didn't take me too long to realize that the Dockerfile annotations are not required, so I removed them.
Thank you again for your contribution to the open source community. I have included your project link on the README page. I take copyright very seriously, so please feel free to directly create issues or PRs on the Cloud Storage FUSE CSI Driver GitHub project page if I missed any other copyright information.
If the GP is right, Google is violating the terms of the license. A quick search of the code reveals that Google's code doesn't include copyright headers with attribution to the GP. This could be stolen code.
Yes, copying the code without following up to actually collaborate, or even forking to show attribution, is I think bad practice for such a large organization, or for any entity for that matter.
If their behavior annoys you, why not DMCA takedown the repo in retaliation for them stripping the copyright header? It'd be within your rights since they violated the license, and it would send a message they can't ignore.
Otherwise, if you're upset that they used code you released under a permissive license, that's a personal problem. You signaled to the world that you'd relax your copyright under a permissive license and now you don't like the results? This is like gifting something and being mad about how the gift was used.
As others have pointed out, you don't seem to grasp why licenses exist in the first place. No org or individual wants to play guessing games and risk liability. It's entirely on you to signal how you want your work to be used.
Are you accusing Google of having "in large part copied code" based on an old commit that isn't even used in this official launch? Do you have any evidence from their recent commits? At least I don't see how the two current repos are anywhere alike, except that you both implement the same interface. Also, they did reach out to you and you just didn't respond, so why are you complaining now?
It makes me sad that no one here cares about whether your accusation is true. I'd expect you to provide more convincing evidence, but it looks like the accusation isn't even true. It's not fair to those contributors, man. I hope you can apologize.
I'm not sure you thoroughly read what I wrote, but I did respond to them. This is not a false accusation as you are claiming; you can check the contents of the repo in its current state.
Per the licenses they can copy, but they must maintain attribution, which has not been done.
I’ve experimented with using gcsfuse and its AWS equivalent, s3fs-fuse in production. At best, they are suited to niche applications; at worst, they are merely nice toys. The issue is that every file system operation is fundamentally an HTTP request, so the latency is several orders of magnitude higher than the equivalent disk operation.
For certain applications that consistently read limited subsets of the filesystem, this can be mitigated somewhat by the disk cache, but for applications that would thrash the cache, cloud buckets are simply not a good storage backend if you desire disk-like access.
What I would really like to see is a two-tier cache system: most recently accessed files are cached to RAM, with less recently accessed files spilling over to a disk-backed cache. That would open up a world of additional applications whose useful cache size exceeds practical RAM amounts.
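For what it's worth, the shape of that two-tier idea is simple enough to sketch. Below is a minimal illustration in Go, assuming a fixed-size RAM tier that evicts least-recently-used entries to a spill directory on disk; the names, sizes, and flat key scheme are invented for the example, not anything gcsfuse provides.

```go
package twotier

import (
	"container/list"
	"os"
	"path/filepath"
)

type entry struct {
	key  string
	data []byte
}

// Cache keeps at most maxRAM entries in memory; the least recently used
// entry is written to diskDir when the RAM tier overflows.
type Cache struct {
	maxRAM  int
	order   *list.List // front = most recently used
	ram     map[string]*list.Element
	diskDir string
}

func New(maxRAM int, diskDir string) *Cache {
	return &Cache{
		maxRAM:  maxRAM,
		order:   list.New(),
		ram:     make(map[string]*list.Element),
		diskDir: diskDir,
	}
}

// Put stores data under key, spilling the oldest RAM entry to disk if needed.
// Keys are assumed to be flat names with no path separators.
func (c *Cache) Put(key string, data []byte) error {
	if el, ok := c.ram[key]; ok {
		el.Value.(*entry).data = data
		c.order.MoveToFront(el)
		return nil
	}
	c.ram[key] = c.order.PushFront(&entry{key: key, data: data})
	if c.order.Len() <= c.maxRAM {
		return nil
	}
	oldest := c.order.Remove(c.order.Back()).(*entry)
	delete(c.ram, oldest.key)
	return os.WriteFile(filepath.Join(c.diskDir, oldest.key), oldest.data, 0o644)
}

// Get checks the RAM tier first, then the disk tier (promoting the entry
// back into RAM on a disk hit). A miss on both tiers is where a real
// implementation would fetch from the bucket.
func (c *Cache) Get(key string) ([]byte, error) {
	if el, ok := c.ram[key]; ok {
		c.order.MoveToFront(el)
		return el.Value.(*entry).data, nil
	}
	data, err := os.ReadFile(filepath.Join(c.diskDir, key))
	if err != nil {
		return nil, err
	}
	return data, c.Put(key, data)
}
```

A real implementation would also bound the disk tier, handle keys containing slashes, and deal with invalidation when the underlying objects change.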
Sure you're not going to use this as a consumer in place of a local disk, nor are you going to use this as part of your web app.
But there are lots of situations in reporting, batch/cron jobs, data processing, and general file administration where it's incredibly easier to use the file system interface than to use an HTTP API via a cloud storage library. Which FUSE is a godsend for. The latency doesn't matter in these cases for one-off things or scripts that already take seconds/minutes/hours anyways.
So no this isn't niche or a toy. It's a fantastic production tool for a lot of different common uses. It's not for everything but nothing is. Use the right tool for the job.
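To make the convenience argument concrete, here is a rough sketch of the same read done through a Cloud Storage FUSE mount versus the Go client library; the mount point, bucket, and object names are made up for the example.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log"
	"os"

	"cloud.google.com/go/storage"
)

// viaMount reads through the FUSE mount: any existing tool or script that
// understands plain files works unchanged.
func viaMount() ([]byte, error) {
	return os.ReadFile("/mnt/reports/2023/april.csv")
}

// viaAPI does the equivalent with the Cloud Storage client library, which
// means wiring up a client, a reader, and error handling in every script.
func viaAPI(ctx context.Context) ([]byte, error) {
	client, err := storage.NewClient(ctx)
	if err != nil {
		return nil, err
	}
	defer client.Close()

	r, err := client.Bucket("reports").Object("2023/april.csv").NewReader(ctx)
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	data, err := viaMount()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("read %d bytes via the mount\n", len(data))
}
```

For one-off jobs the extra latency per read is usually irrelevant next to how much glue code disappears.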
In the old days, we had a system called NFS (Network File System) where, yes, you may decide to use only remote disks. There were several advantages apart from lowering the cost of disks, mainly that you could centrally manage boot images for a fleet of machines. Then we got the web and everyone seemed to assume you could do the same thing over the internet.
I agree with you, I would prefer a local disk to one with 100+ msec of latency and local storage prices are at the point where the right answer is probably "just add local storage."
But I watch with some sympathy the small army of sys-admins (something like 15-20 people) responsible for managing the 3000+ Macs our company uses and remember the 2 person staff which supported the 1500+ diskless workstations from my years at a sadly defunct mini-super-computer manufacturer. It was quite nice... you could go to any machine and log in and your desktop would follow you. I'm told doing the same thing with MSFT requires 10-20 people just to manage the AD hardware (though as a unix-fan, I hang out with other unix-fans who are notoriously rude to MSFT, so maybe it's only 5-10 people needed to manage the AD instance.)
Applications for which filesystem-like access is important (i.e. requiring lots of POSIX file I/O system calls, e.g. read(2)/write(2)/lseek(2)) but latency is unimportant seem pretty niche to me. If you don't need any of the POSIX syscalls, it's not that much more difficult to work with bucket URLs vs. file paths — the general format is the same, i.e. slash-delimited file/directory hierarchies.
Not everything is a webserver. There's a lot of software out there that wouldn't expect files to exist anywhere else besides on disk, and it's not worth fetching them all from cloud storage before you begin working on the data. It's easier just to GCSFuse a bucket to a VM and let the user do what they will. Works great for ad-hoc analysis of poorly or unstructured data.
And for your use case, the latency is not a concern? I suppose that would be true if you were mostly dealing with really big files and only cared about reading large contiguous chunks of them, but I would consider this a fairly niche application.
In my use case, taking ~1 second each time to `ls` a directory, `stat` a file, or `lseek` within a file was simply unacceptable. This was on a cloud VM, so the latency would be at its absolute minimum.
The problem is that such systems have a habit of growing in scope until they reach a point where you really do need the more optimal access patterns of using the real HTTP APIs, and the inefficiencies of emulating the full filesystem API will gradually start to bite you. Maybe you’re lucky enough that that won’t happen, but it’s important to understand it for the quick hack job it is, IMO.
I agree. For example if you want to use Google's ASR (Automated Speech Recognition), if your file is longer than 1 minute in duration, you first need to upload it to a bucket, which is a lot of added complexity compared to a regular HTTP POST.
Just copying the file to a mounted bucket would make this a lot easier.
Then again, how does one get the metadata of the uploaded file?
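For the metadata question: the mount itself only surfaces basics like size and mtime, so reading the object's metadata back would mean going through the API. A hedged sketch with the Go client library, with placeholder bucket and object names:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Fetch the attributes of the object that was copied into the mounted
	// bucket (names are placeholders for the example).
	attrs, err := client.Bucket("speech-uploads").Object("recording.flac").Attrs(ctx)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(attrs.Size, attrs.ContentType, attrs.Updated)
	fmt.Println(attrs.Metadata) // custom key/value metadata, if any was set
}
```

Note that, per the limitations quoted elsewhere in the thread, custom metadata won't be populated for files uploaded through the mount.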
Calling any software system "niche" is kind of hilarious, as if anything that isn't Postgres is a massive failure. It's not supposed to be a high-performance cache of data.
My company uses GCSFuse for ad-hoc analysis/visualization of large but poorly structured output from our lifesciences jobs and it works just fine for that.
Yep. I once inherited a system where the previous team had used GCSFuse to back the `/etc/letsencrypt` directory on a cluster of nginx webservers. It "worked" and may have been a reasonable approach at the time they built it, avoiding setting up a single "master" to handle HTTP-01 challenges (and it was before GCP's HTTPS LB could handle more than a handful of domains/certificates). The problem was that as the number of domains/certificates it handled increased, nginx startup or config reload time got slower and slower as it insists on stat-ing and reading every single file in that directory in the process. It got high enough that it started running into request throttling on the storage bucket. It's no fun when `nginx -s reload` takes two minutes and sometimes fails completely.
I mean... literally every VM running nginx or apache that I've ever seen has had the SSL certs just sitting on the filesystem in /etc/ssl or /etc/letsencrypt or similar... All of letsencrypt's documentation points people in that direction.
My understanding is that everything is encrypted by default in GCP. Though you need to manually configure encryption keys if you want to prevent Google ever having access to your data.
>What I would really like to see is a two-tier cache system
Is there any sort of Linux HSM (Hierarchical Storage Manager)? I haven't seen any and have been a bit surprised nothing has really developed there. An HSM could manage putting hot data in RAM or SSDs, colder or larger data on spinning rust, and deep-frozen data in a tape silo or cloud storage...
Some NAS devices and RAID cards support two-tier caching or data migration using SSDs, where hot or highly random data (usually identified by smaller write sizes) goes to the SSDs and can then migrate to the spinning discs.
I've done a "poor man's" version of this using LVM, where I can "pvmove" blocks of a logical volume between spinning discs and SSDs, which is pretty slick, but it's a very crude tool.
Not a general kernel facility that I know of. I use nfscache every day though; my Steam data directory lives on NFS, and I set up nfscache with a 100GB LRU storage. This way I can avoid the "backup/restore" dance and have all my games installed, at the cost of waiting up to a few minutes to warm the cache for a new game.
I once evaluated using s3fuse for managing about 36 million images. The old storage model was filesystem-based, so it was supposed to make for a smooth transition to the cloud.
AWS Premium Support wisely advised me against it, not just because of latency but also because the abstraction makes /far/ more API calls than a native solution would.
After a bit of testing to confirm, I switched to using native API calls. That code was easy to write and the performance was great. I've been wary of cloud FUSE adapters ever since.
> What I would really like to see is a two-tier cache system: most recently accessed files are cached to RAM, with less recently accessed files spilling over to a disk-backed cache. That would open up a world of additional applications whose useful cache size exceeds practical RAM amounts
This is really hard to get right unless the origin cloud storage is immutable; otherwise you're in for a world of cache invalidation and consistency pain.
I've gradually come round to the other opinion: there should be devices that sit on the PCIe/NVMe bus and provide a blob storage API rather than a block one, and there should be an operating system blob API that is similar to but not identical to the filesystem one.
Same experience. I remember opening a .docx in Word and watching it hang or stutter at different operations. I think you'd need very reliable and low-latency networking for this to be anything but a painful-to-use toy.
I'd be curious to see how it works running on EC2, especially with an S3 endpoint in the VPC. Although I still think you'd be better suited by using S3 as an object store, given the option to build it right.
Catfs is not super production-ready (there are some small changes you need to make to inode handling), but you can do this. We have it on top of goofys. They both need a few changes to work under load, but what we do is quite standard:
1. Goofys for S3 FUSE
2. Catfs for local disk caching
3. Linux caches in memory
4. Mmap file means processes share it
5. One device then exports this over the network to other machines, each of which has an application-layer disk cache.
6. Machines are linked via 10 GigE (we use SFP+).
Overall the goofys and catfs guy (kahing) wrote very performant software. Big fan.
>>> every file system operation is fundamentally an HTTP request, so the latency is several orders of magnitude higher than the equivalent disk operation
gcsfuse latency is ok as it embodies "infinite sync & persistence" ;)
My personal conspiracy theory: most "cloud services" are just... bad.
VMs and disk space I understand completely, having machines on-prem is too much of a hassle and the price isn't that bad. But for stuff like this, managed services, databases especially, you're just getting scammed.
As the author of rclone I thought I'd have a quick look through the docs to see what this is about.
From reading the docs, it looks very similar to `rclone mount` with `--vfs-cache-mode off` (the default). The limitations are almost identical.
* Metadata: Cloud Storage FUSE does not transfer object metadata when uploading files to Cloud Storage, with the exception of mtime and symlink targets. This means that you cannot set object metadata when you upload files using Cloud Storage FUSE. If you need to preserve object metadata, consider uploading files using gsutil, the JSON API, or the Google Cloud console.
* Concurrency: Cloud Storage FUSE does not provide concurrency control for multiple writes to the same file. When multiple writes try to replace a file, the last write wins and all previous writes are lost. There is no merging, version control, or user notification of the subsequent overwrite.
* Linking: Cloud Storage FUSE does not support hard links.
* File locking and file patching: Cloud Storage FUSE does not support file locking or file patching. As such, you should not store version control system repositories in Cloud Storage FUSE mount points, as version control systems rely on file locking and patching. Additionally, you should not use Cloud Storage FUSE as a filer replacement.
* Semantics: Semantics in Cloud Storage FUSE are different from semantics in a traditional file system. For example, metadata like last access time are not supported, and some metadata operations like directory renaming are not atomic. For a list of differences between Cloud Storage FUSE semantics and traditional file system semantics, see Semantics in the Cloud Storage FUSE GitHub documentation.
* Overwriting in the middle of a file: Cloud Storage FUSE does not support overwriting in the middle of a file. Only sequential writes are supported.
* Access: Authorization for files is governed by Cloud Storage permissions. POSIX-style access control does not work.
However rclone has `--vfs-cache-mode writes` which caches file writes to disk first to allow overwriting in the middle of a file and `--vfs-cache-mode full` to cache all objects on a LRU basis. They both make the file system a whole lot more POSIX compatible and most applications will run using `--vfs-cache-mode writes` unlike `--vfs-cache-mode off`.
And of course rclone supports s3/azureblob/b2/r2/sftp/webdav/etc/etc also...
I don't think it is possible to adapt something with cloud storage semantics to a file system without caching to disk, unless you are willing to leave behind the 1:1 mapping of files seen in the mount to objects in the cloud storage.
Please, listen to me: use this only in extremely limited cases where performance, stability, and cost efficiency are not paramount. An object store is not a file system no matter how hard you bludgeon it.
Looking at the change descriptions, it looks like underlying changes were made to get to this point, like now using the Go client library. I would expect a more stable product and better performance; it looks like the performance benchmarks located under docs have been updated as well. Happy to finally see Google standing behind this, and the official CSI driver is really cool to see.
Not that I know of. We have some virtual filesystems for specific things, but in general Drive is for shared docs, videos (recorded meetings/presentations), and things like that.
We don't use drive to store other files. Actually, we don't really "store files" since almost everything we need is remote.
Through the web frontend. I'm not aware of any special FUSE clients, nor is that particularly appealing. All files I store in it are for web-based applications (primarily gsuite). We have alternative Colossus-based file share mounts which we can use for "native" files. I personally use git and/or rsync to share files between my various corp devices (laptop, cloud VM, desktop) in addition to those other options.
I wonder the same, but I also wonder what the actual use case is for the Drive app on Linux. For me, Drive is mostly for syncing office docs (namely MS Office docs), PDFs, and images among teams. That type of work doesn't lend itself well to a Linux environment anyway. And for programming-heavy sync tasks, a user will more likely use a remote Git repo for code and GCS for data. Does Google even use MS Office internally?
I write my articles in Markdown and would want to switch to a terminal based Linux (Pi Zero, low power, e-paper, distraction free) to do this instead of a GUI.
(I currently use Goland and Scrivener to write articles and books)
That'd be weird, considering they have their own suite of office tools. Kind of like if Microsoft were using Google Cloud rather than Azure internally.
Offering Linux on Azure makes sense, but do you think the services they themselves run in production run on Linux? If so, I'd be wary if I were a Windows Server customer.
Unfortunately it's common to have a policy in place disallowing 3rd-party app api access to drive storage. This prevents apps like rclone from working, but the drive client works because it isn't 3rd-party.
This has been a thing for a while; I remember using it (or something like it) several years ago. While it's great for random files you might want to place in the G-Cloud, what I really wanted was to access my google docs content from the Linux command line. And you can do that, it's just that they're in non-obvious, non-documented, frequently changing formats that will only ever be usable with Google Docs.
But if you're using the google cloud like you might use Box.Net or DropBox, it seems fine for light usage.
Object storage is a higher-level abstraction than block storage. FUSE and similar tech can do the job for basic requirements like read-only access by legacy applications, but they rarely work well for other scenarios.
A more complex layer like https://objectivefs.com/ (based on the S3 API) would be more useful, although I would've expected the cloud providers to scale their own block-store/SANs backed with object-stores by now.
JuiceFS adds a DBMS or key-value store for metadata, making the filesystem much faster (POSIX semantics, and small overwrites don't have to replace a full object in the GCS/S3 backend).
Almost certainly a better solution if you want to turn your object storage into a mountable filesystem, with the (big) caveat that you can't access the files directly in the bucket (they are not stored transparently).
JuiceFS is mostly POSIX compatible, but there are important caveats such as no extended ACL, copying files changes their mtime (impacts backup tools), it offers "close-to-open" consistency (dangerous for log appenders), etc.
One challenge with writes in the middle of a file is that they change the file hash. Cloud services typically expose the object hash, so changing any bit of a 1TB file would require a costly read of the whole object to compute the new hash.
You could split the file into smaller chunks and reassemble them at the application layer. That way you limit the cost of changing any byte to the chunk size.
That could also support inserting or removing a byte. You'd have a new chunk of DEFAULT_CHUNK_SIZE+1 (or -1). Split and merge chunks when they get too large or too small.
Of course at some point if you are using a file metaphor you want a real file system.
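A minimal sketch of that chunking idea, assuming fixed-size chunk objects and an invented naming scheme; the in-memory map stands in for the bucket, and real code would also track per-chunk lengths and keep an index for inserts and removals.

```go
package chunks

import "fmt"

const chunkSize = 8 << 20 // 8 MiB per chunk object (arbitrary for the sketch)

// locate maps a byte offset within the logical file to the name of the
// chunk object that holds it and the offset inside that chunk.
func locate(file string, offset int64) (object string, within int64) {
	return fmt.Sprintf("%s/chunk-%06d", file, offset/chunkSize), offset % chunkSize
}

// overwrite rewrites only the chunk objects touched by the write, leaving
// every other chunk (and therefore its hash) untouched. The map stands in
// for the object store.
func overwrite(store map[string][]byte, file string, offset int64, data []byte) {
	for len(data) > 0 {
		obj, within := locate(file, offset)
		chunk := store[obj]
		if int64(len(chunk)) < chunkSize {
			// Pad to full size for simplicity; a real layer would track
			// the actual length of the last chunk instead.
			chunk = append(chunk, make([]byte, chunkSize-int64(len(chunk)))...)
		}
		n := copy(chunk[within:], data)
		store[obj] = chunk
		data = data[n:]
		offset += int64(n)
	}
}
```

Inserting or deleting bytes then becomes a matter of splitting or merging chunks and updating the index, which is roughly the point at which you've reinvented a file system, as the parent notes.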
Doesn't this mean that most programs you might want to use with the FUSE API won't actually work? They'll do fine for a while, until they try to seek, and then they'll get an error?
Or is there a large group of programs that only ever write sequentially?
I'd think non-appending writes are quite rare in practice, other than databases. Even when the application is logically overwriting data, in other kinds of programs it's almost always implemented as writing to a new file + an atomic rename, not in-place modification.
Most programs either write a full file every time and replace the old file with a single move, or append to an old file. Writing in the middle could happen in a program writing to some kind of archive or disk image. There is probably a whole group of programs that do this that I'm not familiar with, but I'm pretty sure of my first sentence.
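A minimal sketch of that write-then-rename pattern in Go, just to make the behavior concrete; on a local POSIX filesystem the final rename is atomic, while on an object-store mount it typically degrades into a copy plus a delete.

```go
package atomicwrite

import (
	"os"
	"path/filepath"
)

// WriteFileAtomic writes data to a temporary file in the same directory and
// then renames it over the target path, so readers only ever see the old
// contents or the complete new contents.
func WriteFileAtomic(path string, data []byte, perm os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // harmless no-op once the rename succeeds

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Chmod(perm); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}
```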
I'm not completely confident (I tried looking in the source and it wasn't immediately obvious) but I think emacs does small in-place edits when you're working with very large files.
Well yeah, but there are a lot of things FUSE makes easier: no need to implement a client library, no need to write some custom wrapper or rsync job to sync files to the bucket or from the bucket to the local system, etc. It won't work for every app, but for the ones it does support it saves a ton of extra work and maintenance.
"Cloud Storage FUSE is available free of charge, but the storage, metadata, and network I/O it generates to and from Cloud Storage are charged like any other Cloud Storage interface. In other words, all data transfer and operations performed by Cloud Storage FUSE map to Cloud Storage transfers and operations, and are charged accordingly."
You will be doing storage operations silently and in an unoptimized fashion, more so if the underlying FUSE filesystem is implemented naively.
For example, Cloud Storage never moves or renames your objects; it copies the object and deletes the original instead. This can end up costing quite a lot if your data is in anything other than the standard storage class, because of minimum storage duration charges.
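To make that concrete, here is a hedged sketch of what a "rename" maps to on the object store, using the Go client library (bucket and object names are placeholders); every such rename is billed as a copy plus a delete.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/storage"
)

// rename emulates a move the way FUSE adapters have to: server-side copy to
// the new name, then delete the original. Two billable operations, and the
// early delete can trigger minimum-storage-duration charges on non-standard
// storage classes.
func rename(ctx context.Context, client *storage.Client, bucket, oldName, newName string) error {
	src := client.Bucket(bucket).Object(oldName)
	dst := client.Bucket(bucket).Object(newName)
	if _, err := dst.CopierFrom(src).Run(ctx); err != nil {
		return err
	}
	return src.Delete(ctx)
}

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	if err := rename(ctx, client, "my-bucket", "logs/old.bin", "logs/new.bin"); err != nil {
		log.Fatal(err)
	}
}
```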