AWS Announces Open Source Mountpoint for Amazon S3 (infoq.com)
138 points by rbanffy on March 26, 2023 | 78 comments



This does sound nice, but the article barely mentions a pretty major limitation from the release announcement[0]:

> it doesn’t support writes in this first release, and in the future will only support sequential writes to new objects.

[0]: https://aws.amazon.com/blogs/storage/the-inside-story-on-mou...


It wouldn't be an AWS announcement without a whopper of a caveat; it's like the Amazon equivalent of Google canceling things.


AWS really does have its own quirks as a company.

- The tools feel fragmented, not aligned and often don't compose well

- Always sharp corners (Billing, ops, you name it)

- Products are in various states of usability / maturity; sometimes a product is launched while barely usable.

I've never worked for Amazon, but if what is said about Amazon is true, then the culture is reflected in their products: with teams siloed and owning their own services, it does make sense that their services have those quirks.


That comparison seems unfair... AWS has a habit of launching early and iterating, and they almost never shut stuff down. In this case it's a filesystem implementation of an HTTP API, so the capabilities reflect the main API. Deciding to build and launch reads before writes isn't really a bad decision; it helps those who have the read use case try this out and give feedback immediately.


> In this case it's a filesystem implementation of an HTTP API

When you put it like that, it makes me wonder why they don't support WebDAV.


I think there’s a bunch of gateways available already. The example use case for this seems to be mounting a 10 petabyte “hard drive” with CSVs to a reporting machine.
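To make that concrete, here's a minimal sketch of the reporting use case (the bucket name, mount path, and CSV key are all hypothetical), assuming the bucket has already been mounted with the mount-s3 client:

    # Hypothetical mount, done once beforehand (bucket and path made up):
    #   mount-s3 reporting-bucket /mnt/reports
    import csv

    # Files under the mount read like local files; under the hood each
    # read is translated into S3 GET requests by the FUSE client.
    with open("/mnt/reports/daily/2023-03-26.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    print(len(rows), "rows read straight from the bucket")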


To be fair it sounds like this isn’t for that use case.


>only support sequential writes to new objects.

"Doesn't supports writes in the first release" is a letdown, but the latter part above seems expected. I suppose you could abstract what would have to happen to fake arbitrary seek() and overwriting portions of a file, appending, etc, but it would encourage things that wouldn't work well.


Partial writes would need to be buffered and flush()ed. That's not unlike other file systems, where you don't know data is persisted until you flush - or am I missing something?


S3 doesn't support partially updating an object; you would have to replace the entire object. While they could do some magic under the hood to make it seem like it works, it's better to just reject these operations and let the application developer decide what's best.
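To spell that out, here's a minimal boto3 sketch (bucket and key names are hypothetical) of what a "partial update" actually costs on S3:

    # There is no partial-update API: editing a few bytes means downloading
    # and re-uploading the whole object. (boto3; names are hypothetical.)
    import boto3

    s3 = boto3.client("s3")
    bucket, key = "example-bucket", "data/blob.bin"

    # 1. Fetch the entire object.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    # 2. Patch a few bytes in memory.
    patch = b"new bytes"
    patched = body[:128] + patch + body[128 + len(patch):]

    # 3. Replace the entire object; S3 has no way to write only the changed range.
    s3.put_object(Bucket=bucket, Key=key, Body=patched)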


Maybe I'm missing some context here, but can't you generate X presigned URLs with your chunk size? If you provide the chunk ordering, assembly is done automatically on the S3 side, though I thought you could manually implement the chunk order assignment as well.
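That sounds like the multipart upload API. A rough boto3 sketch (bucket and key names are hypothetical) of how the part ordering works, either uploading parts directly or handing out presigned URLs per part:

    # Multipart upload: part numbers define the ordering and S3 assembles the
    # parts on completion. (boto3; bucket/key names are hypothetical.)
    import boto3

    s3 = boto3.client("s3")
    bucket, key = "example-bucket", "big/object.bin"

    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]

    # All parts except the last must be at least 5 MiB.
    chunks = [b"a" * 5 * 1024 * 1024, b"b" * 5 * 1024 * 1024, b"tail"]
    parts = []
    for number, chunk in enumerate(chunks, start=1):
        resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                              PartNumber=number, Body=chunk)
        parts.append({"PartNumber": number, "ETag": resp["ETag"]})

    # Or hand each uploader a presigned URL for its own part number:
    #   s3.generate_presigned_url("upload_part", Params={"Bucket": bucket,
    #       "Key": key, "UploadId": upload_id, "PartNumber": number})

    s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id,
                                 MultipartUpload={"Parts": parts})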


I would suspect the team making this tool is not exactly the same team maintaining S3...

So adding "server side operations" would require work from that other team, which certainly could be done if they decided it was needed.

As a first MVP release, this project seems neat for read use cases, and feedback from the community can help ensure read operations are solid while they also work on the write side, which they admitted is lacking in this initial release but want to improve.


That you can't update an S3 object. There is no equivalent to open() on an existing file, seek() to some position, and modify a few bytes.


That S3 storage is almost immediately consistent on a planetary scale is pretty impressive already. Supporting in-place updates would make that a lot more complicated.

OTOH, you can always engineer around such limitations by storing smaller objects you can completely ingest and store in a single operation.


For that use case, they have EFS. If you are reading from one or more S3 blobs and gradually building another blob, you can keep that blob on EFS and, when finished, dump it back to S3.
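Roughly, that pattern could look like this (boto3; all bucket names and paths are hypothetical):

    # Stage the growing blob on EFS, which behaves like a normal POSIX file
    # system, then upload the finished object to S3 in one sequential write.
    import boto3

    s3 = boto3.client("s3")
    staging_path = "/mnt/efs/build/output.bin"  # hypothetical EFS mount

    with open(staging_path, "ab") as out:
        for key in ("inputs/part-1", "inputs/part-2"):
            chunk = s3.get_object(Bucket="source-bucket", Key=key)["Body"].read()
            out.write(chunk)  # incremental appends are fine on EFS

    # One shot back to S3 once the blob is complete.
    s3.upload_file(staging_path, "dest-bucket", "results/output.bin")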


My understanding is that goofys and s3fs support these types of operations?

Still, a win for reliability / performance of whatever AWS will be supporting.


> it doesn’t support writes in this first release, and in the future will only support sequential writes to new objects.

To me it just sounds like Azure Files with fewer features. You can just mount a share on any machine supporting NFS or SMB, or just use the REST interface. [0]

[0] https://learn.microsoft.com/en-us/rest/api/storageservices/f...


Fair point; it definitely should be mentioned by any serious article.

I have use cases where forbidding writes is a feature, and if it means more stability and performance it’s a trade-off that’s worth it to me.


Related:

https://news.ycombinator.com/item?id=35155944 - March 14, 2023 (97 comments)


> The open-source client does not emulate operations like directory renames that would require many S3 API calls or POSIX file system features that are not supported in S3 APIs.

At this point I'd probably be fine with using a different API than the filesystem, perhaps one inside whatever backend language I was using.

Feels more fused than mounted
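For comparison, skipping the filesystem layer and going straight to the SDK isn't much code either. A minimal sketch (boto3; bucket and prefix are hypothetical):

    # Read objects through the S3 API directly instead of a FUSE mount.
    import boto3

    s3 = boto3.client("s3")
    bucket, prefix = "example-bucket", "reports/2023/"

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            print(obj["Key"], len(body), "bytes")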


Indeed, if you're getting that far away from historical file system primitives then you may as well use an API model that makes sense for the storage backend. Or just contribute directly to rclone and FUSE.


You can open an s3 object as a 'file object' in python and it is at least seekable.

Other than that, I think you're best off treating S3 as a huge KV store and not a filesystem at all.
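One common way to get that "file object", as a sketch (boto3 plus io.BytesIO; bucket and key are hypothetical): the object is pulled into memory first, and seeks then run against that in-memory copy.

    # Wrap the downloaded object in BytesIO to get a seekable 'file object'.
    import io
    import boto3

    s3 = boto3.client("s3")
    raw = s3.get_object(Bucket="example-bucket", Key="data/blob.bin")["Body"].read()
    buf = io.BytesIO(raw)  # the whole object now lives in memory

    buf.seek(1024)         # arbitrary seeks are cheap against the local copy
    print(buf.read(16))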


Seek only works because the whole object gets downloaded into memory first.


Either that, or the Unix principle of many small files being better than a single big one (without the need to worry about inode exhaustion in S3).


S3 works best with a bunch of medium and similar sized files. It’s really not like a file system.


Reminds me of tape-based storage like the Sinclair QL or Exatron's "stringy floppy". Or Mitsumi's QuickDisk, a format I saw on the Sharp MZ series which is not a disk, but a spiral track that looks a lot like a loop of tape running past RW heads.


Right. Big and slow kv.


I experimented with similar ideas 11 years ago with the OpenStack counterpart of S3, and basically object storage is not a real filesystem. You can get away with an FTP/SFTP interface (for example), but that's it. Everybody wants a cheap object storage, but not the fact that what you deal with is "objects".

I wrote a NBD server to proxy to OpenStack Object Storage: https://github.com/reidrac/swift-nbd-server -- and it was fun, but not that useful.


> Everybody wants a cheap object storage [...]

Are you sure? I would guess most people want a networked, cheap file system. Not an object-storage.


Normal file-systems come with too much baggage - permissions portability between Windows and Linux being the main bugbear for my own use-cases.


And AWS Rights Management is the pinnacle of this approach.


Is this related to the S3 Rust Mountpoint project?

Discussed 12 days ago:

Mountpoint – file client for S3 written in Rust, from AWS

https://news.ycombinator.com/item?id=35155944 (97 comments)


The article links to an announcement which links to the GitHub repo linked in that post, so yes.


How is this different than these other solutions?

https://github.com/kahing/goofys

https://github.com/s3fs-fuse/s3fs-fuse


It’s an officially supported solution. Other than that they hint at correctness. I would assume it’s going to be faster because why else bother.


> I would assume it’s going to be faster because why else bother

aws s3 sync (official) is orders of magnitude slower than rclone (oss)


FYI, Mountpoint is written in Rust, but uses a new framework called CRT (Common Runtime) that's written in C. The perf is a multiple of that of the AWS CLI, since the CLI is in Python. You can actually try using CRT in Python by setting some flag - I forgot what. It doesn't support all the APIs yet though.


Cool. I was mostly just pointing out the assumption that the official software is somehow the best option is proven incorrect time and time again with AWS.


It's Amazon Web Services, not Amazon Web Clients


Isn't CLI v2 written in Go, and also pretty slow?


AFAIK it's still a Python package: https://github.com/aws/aws-cli/tree/2.11.6


Well the sdk is written in Python so the rust option has a bit of an advantage there


I would hope it's more stable.

I like s3fs, sshfs, and similar solutions, but they get wonky in corner cases, like gaps in connectivity.

What I'd really like is a kernel-level file system abstraction which:

(1) Has capabilities, and unsupported operations fail. If I try to move a log file over to a medium like S3, it works. If I try to write a log file directly to S3, incrementally adding data, it fails. I don't want compatibility kludges (e.g. copying a file down, appending a line, and copying it back).

(2) Provides common abstractions (my file manager works on S3 as much as it does locally)

(3) Supports all media.

The current implementation broke down with NFS and hot-swappable media like USB sticks. Having removable media stuck in an impossible-to-unmount state shouldn't happen. Nor should assorted failures due to connectivity.


rclone has the ability to mount S3 and many more cloud storage providers as a virtual drive


I’ve been burnt trying to use FSx as a mount-S3-as-a-read-only-filesystem. Flaky mounts, complicated workarounds for maintenance windows - if this lives up to the promise I can delete a whole module (and the cognitive overhead of reasoning about its state and behaviour) that we’ve written to manage a local disk that mirrors an S3 bucket.


Oh, I am considering using FSx for Lustre with S3 export/import for our ML workloads. Isn't it stable enough?


To be frank, I think S3 developments like this might be happening because S3 compatibility is spreading both on the "server" side (Backblaze B2, MinIO, SeaweedFS, etc.) and on the client side, so Amazon is probably not too happy about the competition. It's probably a good idea to stay away even from open source by Amazon, given that there are more independent alternatives that work well and are used in production by big companies.


> curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Oof. Can we please, please, please stop releasing software with "| sh"?


I love Rust, but I have to agree installing rustup like that feels like a crime against my machine. For some time I would only install rustup inside an air-tight virtual machine, since I can't really be bothered to read the sh script every single time I'm going to download it.

It's 2023, we have package managers, we have packages, we have containers, why are we still shipping software like this? Even worse is the fact that you can generally find Rust in your package manager of choice, but it will often be outdated and you won't be able to choose your versions the same way you would with rustup.

I don't know, I'm not sure I have a solution, but I just wish maintainers would put a little more effort into trying to support package managers as the de-facto way to set up a toolchain such as Rust. Even if you want to keep a meta-package manager such as nvm or rustup, at least let me download _them_ from my distro's repositories instead of running a random sh script from the internet.


How come you trust the package of Rust from Rust but you don't trust the `sh` install script from Rust?


This specifically isn't an issue of trusting them with my system; it's that a shell script can shit all over a system without a good way to undo it, even if it was well-intentioned.

Package managers are modern technology, they exist because they can track what files were placed where, and can remove them cleanly when given an uninstall command.


Maybe the problem is the rust community seems to live on the bleeding edge and always needs to have a ‘nightly’ build to do anything interesting?

And distros usually don’t bump versions between releases because they just don’t. Run Debian Sid or Fedora Rawhide or <whatever> if that’s what you’re after.


> rust community seems to live on the bleeding edge and always needs to have a ‘nightly’ build to do anything interesting

You have an example of a project that needs nightly to build? I’m sure some exist, but nearly all libraries and projects live on the latest stable release.


That indeed used to be very common in the early days of Rust; however, I believe stable Rust is what most libraries and projects use nowadays. It's been some time since I saw anything that required Rust beyond 1.60 (released April last year) as a minimum version.


It's straightforward to set up an apt-get repository that updates nightly and have users use that. For something as widespread as Rust I expect better than curl | sh scripts.


Probably doesn’t matter for Rust but once you start messing with random system packages that other packages depend on it becomes less than straightforward really quick.

One simple version bump can affect hundreds of packages and, if you’re not careful, bork an entire install… ask me how I know that one.

—edit—

Also should say that I’m fully in the package manager camp. If I want to install something that’s not in the repos I almost always find or make a package and build it locally because I don’t want random orphaned files strewn around my system folders.

Unless it’s just some command line program then I usually just use it from the source directory and don’t even bother having it in my path.


Well I guess that solves the issue for one OS/distro.

It’s probably not _why_ they did it, but I certainly like that regardless of OS, installation is the same and it’s reliable.


It is a major part of why we did it. Giving everyone a nice flow for getting up and going matters for adoption.


That’s awesome.

For what it’s worth, I had a C# dev who’d never dealt with Rust at all before get themselves set up, then compile and run the Rust app by themselves in about 10 minutes flat (our internet is slow), and I definitely think the ease of the setup flow contributed to that hugely, so massive thanks for making it so nice.


Many Rust users are on platforms that apt does not support.


Why do you trust scripts from your package manager more than one from the official upstream Rust project? The Rust project also has a good security track record.


I trust them more because I choose who maintains the repos I use. I trust them more because I already have to trust them: they provide almost every single piece of software I run on my machine. In this case I'm on Fedora, which has a good track record for security, stability, and only allowing free software.

It's not to say that I don't trust the Rust project, nowadays I kinda have to, but curl | sh installation is messy and separated from the rest of the system. If they just packaged rustup into an rpm and set up their own repos I could point dnf to, it would make system maintenance so much easier.

I just want them to use the tools that already exist instead of reinventing the wheel with an esoteric 700-line sh script. This applies to Rust and any other kind of tooling that follows the same installation workflow. I understand the reasoning behind it, but I believe the other options should also be considered and supported.


What's wrong with it as long as it comes from an https URL with low potential for typo-squatting (a short .sh domain would be even more natural, but maybe controversial and a bit expensive for just this)? You don't have to pipe it into sh; you can also redirect it, i.e. "> install.sh", then examine it before running. What's the alternative? A dozen incompatible centralized lang-specific or distro-specific package managers? Might work for devs but not end users, might not work with all licenses, is extra work for distros + macOS, so chances are the OS you're using or would otherwise be using isn't covered, ...


Uninstallation issues.

I'd rather have a package that tracks what files it installs and removes them cleanly.


`rustup self uninstall` will cleanly uninstall everything.


https://rust-lang.github.io/rustup/installation/other.html

The whole point of this installation method is ease of use. You can use it for convenience if you prefer it. I don't see the issue with providing more options not less.


Try examining the above script. It's a lot of work. It's not just a bunch of wget's and cp's, there are a bunch of subroutines and conditionals. Too much to look at.

Also as another user pointed out, uninstallation is a problem.


> Try examining the above script. It's a lot of work. It's not just a bunch of wget's and cp's, there are a bunch of subroutines and conditionals. Too much to look at.

This would be a reasonable counter-argument if most people could honestly claim to have inspected the source of >1% of the things they'd installed from apt-get/yum/etc.

"But I trust the maintainers of those repositories to verify correctness" - yes, and people trust the Rust maintainers, too.

I'll grant you that uninstallation is usually a good argument (though not in this particular case).


Sure, once we have a universal way to accomplish the task that's easier and safer.


Ansible provides a module to install OS packages. It works with apt, yum, dnf, pacman, etc. This is just one of the implementation wrappers around a universal way to accomplish the task. This same mechanism also allows uninstallations and relies on the OS to manage dependencies, which it does. This one true way necessitates the "OS package manager" being the way for an OS to manage packages. This exists, and tools exist to eliminate the need to memorize how to do processes like update, remove, reinstall, rollback, etc. It also centralizes into one config and workflow your very limited number of trusted parties, what their GPG keys are, and how to validate, rescind, and securely handle and use those keys. It handles digest verification, sha sum validation, etc. One way to do this exists by as many names as there are distro families, but it's one way nevertheless. This way, with these keys, stored under the security postures built and refined over decades, exists. The people complaining that this one way should be used turn out to have a valid point, and your suggestion is important, so it sounds like you're in agreement: let's use the OS package manager when installing OS packages. This way should be foremost, recommended, and most prominent. It's easier and safer. Do you disagree?


Can we please, please, please have someone write the authoritative article on why this is _actually_ a bad idea, so that it can be linked next time this conversation comes up, rather than merely intimating that it's bad?

...Yes, I'm intentionally invoking Cunningham's Law in the hopes that it exists and someone will link it here. Just to do a little due diligence, I searched for some answers, and found:

* https://stackoverflow.com/a/29389868/1040915 - "Because you are giving root access to whatever script you are executing. It can do a wide variety of nasty things." - apt-get and yum require sudo too.

* https://stackoverflow.com/a/34016579/1040915 - "if the script were provided to you over HTTP instead HTTPS..." sure, no arguments there! So don't do that then! ... "if the connection closes mid-stream, there may be executed partial commands, which were not intended to" ok, an actually reasonable answer! (though see the next link)

* https://www.arp242.net/curl-to-sh.html - a pretty comprehensive article in support of `curl | sh`, which points out that it's of equivalent security to `git clone && cd dir && ./make`, and only very slightly less than using package managers (which provide checksums). It also points out you can avoid the "interrupted connection" issue by running within a function

* https://medium.com/@esotericmeans/the-truth-about-curl-and-i... - reiterates that interrupted connection issues are minor, and repeats some server-side exploits that could potentially happen (if you don't trust the server, don't _ever_ install anything from it, via any means!)

* https://news.ycombinator.com/item?id=11532599 - a discussion of an article (which, ironically enough, itself has an expired certificate) with, I'll admit, more comments than I was willing to read in-depth - but one comment https://news.ycombinator.com/item?id=11533515 points out that `curl | sh` is fully portable whereas package managers are not.

Don't get me wrong - given the choice, I'd rather have the audit log, built-in checksum, uninstallability, and other features of a package manager any day. I'm not arguing that `curl | sh` is _better_ than package managers. But I have always been a little baffled why it's portrayed as _so_ much worse (when installing from a trusted source, over HTTPS) as to be anathema and repugnant.


There’s nothing wrong with curl | sh other than people realising “damn, this software could be malicious!” Yes, it could be. Doesn’t mean an HTTPS link maintained by a project with a good security track record is actually serving malicious links though.


My biggest concern is typo-squatting.

> connection closes mid-stream

I get around this by using wget to download the script, and THEN use sh.


Releasing? No. But you're already free to consume it via some other means, such as your OS/distro's package manager of choice. It's not really on tool authors to inform you how to use your (and every other choice of) package manager, never mind ensure it's packaged for all of them.


Various distros have rustup packaged, so use that instead where available.


would like to see benchmarks against comparable solutions


I use juicefs and it’s the way to go.


How do you handle metadata backups?



