SeaweedFS is built on top of a blob storage based on Facebook's Haystack paper.
The features are not fully developed yet, but what makes it different is a new way of programming for the cloud era.
When you need some storage, just fallocate some space to write to, and a file_id is returned. Use the file_id like a pointer to a memory block.
There will be more features built on top of it. The file system and object store are just a couple of them. More help is needed on this.
The allocated storage is append-only. For updates, just allocate another blob; the deleted blobs are garbage collected later. So it is not really mmap.
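For a concrete feel of that pointer-like workflow, here is a minimal sketch against a default local setup (master on localhost:9333, one volume server; hosts and ports are assumptions), using the documented /dir/assign and volume server upload endpoints:

    import requests

    # Ask the master for a new file_id and a volume server to write to.
    assign = requests.get("http://localhost:9333/dir/assign").json()
    fid, volume_url = assign["fid"], assign["url"]

    # Write the blob; the file_id now acts like a pointer to this block of storage.
    requests.post(f"http://{volume_url}/{fid}", files={"file": b"hello blob"})

    # Read it back later by "dereferencing" the file_id.
    data = requests.get(f"http://{volume_url}/{fid}").content

    # An "update" is just another allocation; the superseded blob is
    # garbage collected later.
    assign2 = requests.get("http://localhost:9333/dir/assign").json()
    requests.post(f"http://{assign2['url']}/{assign2['fid']}",
                  files={"file": b"hello again"})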
> Also what is the difference between a file, an object, a blob, a filesystem and an object store?
The answer would be too long to fit here. Maybe chatgpt can help. :)
I, too, am interested in your views on the last 2 questions, since your views, not chatGPT's, are what informed the design. Part of learning from others' designs [0] is understanding what the designers think about their own design, and how they came about it.
Would you mind elaborating on them? HN gives a lot of space, and I'm confident you can find a way to summarize without running out, or sounding dismissive (which is what the response kind of sounds like now).
The blob storage is what SeaweedFS is built on. All blob access takes O(1) network and disk operations.
Files and S3 are higher layers above the blob storage. They require metadata to manage the blobs, and other metadata for directories, S3 access, etc.
These metadata usually sit together with the disks containing the files. But in highly scalable systems, the metadata has dedicated stores, e.g., Google's Colossus, Facebook's Tectonic, etc. The SeaweedFS file system layer is built as a web application managing the metadata of blobs.
Actually the SeaweedFS file system implementation is just one way to manage the metadata. There are other possible variations, depending on requirements.
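To illustrate the split between the two layers, here is a rough sketch (illustrative only, not the actual SeaweedFS schema or wire format): the metadata layer maps a full path to a list of chunks, and each chunk points into the blob layer by file_id.

    # Illustrative shape of a file entry in the metadata layer.
    # Field names and file_id values are made up for the example.
    entry = {
        "path": "/buckets/photos/cat.jpg",
        "chunks": [
            {"file_id": "3,01637037d6", "offset": 0, "size": 1048576},
            {"file_id": "4,02a73bd41c", "offset": 1048576, "size": 524288},
        ],
    }

    # Reading the file means looking up this entry in the filer store,
    # then fetching each chunk from the blob layer, each at O(1) cost.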
There are a couple of slides on the SeaweedFS github README page. You may get more details there.
Thank you, that was very informative. I appreciate your succinct, information dense writing style, and appreciate it in the documentation, too, after reviewing that.
> what makes it different is a new way of programming for the cloud era.
but you aren't even explaining how anything is different from what a normal file system can do, let alone what makes it a "new way of programming for the cloud era".
Sorry, everybody has a different background of knowledge. It is hard to understand where the question comes from.
They were straightforward questions. The paper you linked talks about blobs as a term for appending to files. Mostly it seems to be about wrapping and replicating XFS.
Is that why you are avoiding talking about specifics? Are you wrapping XFS?
I'm a little confused why people are being so weird with the OP; asking what the difference between a blob and a file is isn't something specific to seaweedfs lol. Blobs, files, and other terms are used to describe different layers of data allocation in almost every modern object storage solution.
Blobs are what lie under files: you can have a file split into multiple blobs spread across different drives, or different servers etc, and then put it back together into a file when requested. That's how I understand it at a basic level.
I think they are being weird. According to the facebook pdf they linked I think that would fall under chunks, but either way, why would someone advertise a filesystem for 'blob' storage when users don't interact with that? According to the paper 'blobs' are sent to append to files, but that isn't really 'blob storage'; it's just a different name for an operation that's been done since the 70s - appending to a networked file. No one would say 'this filesystem can store all your file appends', and no one is storing discrete serialized data without a name, and once you do that, you have a file.
They also seem like they are being vague and patronizing to avoid admitting that their product is not a unique filesystem, but just something to distribute XFS.
> Why does a user need that? Filesystems already break up files into blocks / sectors. Why wouldn't a user just deal with files and let the filesystem handle it?
A blob has its own storage, which can be replicated to other hosts in case the current host is not available. It can scale up independently of the file metadata.
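As a rough sketch of what that buys you (assuming a default local master on :9333 and the documented /dir/lookup endpoint; the file_id and replica hosts are made up): a client can ask the master where a blob's volume is replicated, and fall back to another replica if one host is down.

    import requests

    FID = "3,01637037d6"            # hypothetical file_id; the volume id is before the comma
    volume_id = FID.split(",")[0]

    # Ask the master for all locations (replicas) of this volume.
    lookup = requests.get("http://localhost:9333/dir/lookup",
                          params={"volumeId": volume_id}).json()

    # Try each replica in turn until one answers.
    data = None
    for loc in lookup["locations"]:
        try:
            data = requests.get(f"http://{loc['url']}/{FID}", timeout=2).content
            break
        except requests.RequestException:
            continue  # this replica is down, try the next one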
Why does a user need that? Filesystems already break up files into blocks / sectors. Why wouldn't a user just deal with files and let the filesystem handle it?
I really don't understand why you aren't eager to explain the differences and what problems are being solved.
First, the feature set you have built is very impressive.
I think SeaweedFS would really benefit from more documentation on what exactly it does.
People who want to deploy production systems need that, and it would also help potential contributors.
Some examples:
* It says "optimised for small files", but it is not super clear from the whitepaper and other documentation what that means. It mostly talks about about how small the per-file overhad is, but that's not enough. For example, on Ceph I can also store 500M files without problem, but then later discover that some operations that happen only infrequently, such as recovery or scrubs, are O(files) and thus have O(files) many seeks, which can mean 2 months of seeks for a recovery of 500M files to finish. ("Recovery" here means when a replica fails and the data is copied to another replica.)
* More on small files: Assuming small files are packed somehow to solve the seek problem, what happens if I delete some files in the middle of the pack? Do I get fragmentation (space wasted by holes)? If yes, is there a defragmentation routine?
* One page https://github.com/seaweedfs/seaweedfs/wiki/Replication#writ... says "volumes are append only", which suggests that there will be fragmentation. But here I need to piece together info from different unrelated pages in order to answer a core question about how SeaweedFS works.
* https://github.com/seaweedfs/seaweedfs/wiki/FAQ#why-files-ar... suggests that "vacuum" is the defragmentation process. It says it triggers automatically when deleted-space overhead reaches 30%. But what performance implications does a vacuum have, can it take long and block some data access? This would be the immediate next question any operator would have.
* Scrubs and integrity: It is common for redundant-storage systems (md-RAID, ZFS, Ceph) to detect and recover from bitrot via checksums and cross-replica comparisons. This requires automatic regular inspections of the stored data ("scrubs"). For SeaweedFS, I can find no docs about it, only some Github issues (https://github.com/seaweedfs/seaweedfs/issues?q=scrub) that suggest that there is some script that runs every 17 minutes. But looking at that script, I can't find which command is doing the "repair" action. Note that just having checksums is not enough for preventing bitrot: It helps detect it, but does not guarantee that the target number of replicas is brought back up (as it may take years until you read some data again). For that, regular scrubs are needed.
* Filers: For a production store of a highly-available POSIX FUSE mount I need to choose a suitable Filer backend. There's a useful page about these on https://github.com/seaweedfs/seaweedfs/wiki/Filer-Stores. But they are many, and information is limited to ~8 words per backend. To know how a backend will perform, I need to know both the backend well, and also how SeaweedFS will use it. I will also be subject to the workflows of that backend, e.g. running and upgrading a large HA Postgres is unfortunately not easy. As another example, Postgres itself also does not scale beyond a single machine, unless one uses something like Citus, and I have no info on whether SeaweedFS will work with that.
* The word "Upgrades" seems generally un-mentioned in Wiki and README. How are forward and backward compatibility handled? Can I just switch SeaweedFS versions forward and backward and expect everything will automatically work? For Ceph there are usually detailed instructions on how one should upgrade a large cluster and its clients.
In general the way this should be approached is: Pretend to know nothing about SeaweedFS, and imagine what a user that wants to use it in production wants to know, and what their followup questions would be.
Some parts of that are partially answered in the presentations, but it is difficult to piece together how the software currently works from presentations of different ages (maybe they are already outdated?), and the presentations are also quite light on info (usually only 1 slide per topic). I think the Github Wiki is a good way to do it, but it, too, is too light on information and I'm not sure it has everything that's in the presentations.
I understand the README already says "more tools and documentation", I just want to highlight how important the "what does it do and how does it behave" part of documentation is for software like this.
SeaweedFS author here. Thanks for your candid answer. You do not need to use multiple SeaweedFS components. Just download the binary and run "weed server -s3".
There are many other components, but you do not really need to use them. This default mode should be good enough for most cases. I have seen many times that people try to optimize too early, often unnecessarily, and sometimes in the wrong way.
I would like to know what kind of setup you are running. It should beat most other options if the use case needs lots of small files, e.g. millions or billions of files. If it is just a small use case, e.g. a few personal files, it would be overkill.
Another aspect is how to increase capacity for existing clusters. It is simplest with SeaweedFS: just start one more volume server, and it will linearly increase the throughput.
Yeah sorry my answer was more than insufficient to be honest. I wrote it _in bed_ and was embarrassed the next day because it was of really low quality. Thought of expanding it later. So yeah I screwed the pooch here and I'm sorry, I will try to do better now by expanding on my answer.
First of all this is all from memory and I didn't try seaweedfs again for this.
So first things first. I evaluated seaweedfs for HPC cluster usage in 2020 (oh my, this is some time ago), but my test setup was VMs. I tried it with many small and larger files and it didn't scale at all (at least when I tested it) for parallel loads. The response time was acceptable, but the throughput was very low. When I tried it, "weed server" spun up everything more or less fine, but had problems binding correctly so that a distributed setup would work. Based on the wiki documentation I configured a master server, a filer and a few volume servers (iirc).
My main gripes at that time were as follows:
* the syntax of the different clients was inconsistent
* the throughput was rather low
* the components didn't work well together in a certain configuration and I had to specify many things manually
* the wiki was lacking
I tried filer (fuse), s3 and hadoop. s3 wasn't compatible enough to work with everything I tried with it, so I spun up a MinIO instance as a gateway to test the whole thing.
When working over a longer period I had some hangs as well.
That's sadly everything I remember about it, but I made a presentation; if you are interested I can look for it and give you the benchmarks I tried and the limitations I found (although they will all be HORRIBLY out of date). When I tested it there were 2 versions with different size limitations iirc. I just now looked over your github releases and can't find these.
Sorry again if I misrepresented seaweedfs here with my outdated tests. I looked at the github wiki and it looks much better than when I last played with it. I will give it a spin again soon, and if I find my old experience of it to be not representative, maybe write something about it and post it here.
---
MinIO was, when I tried it, mainly an S3 server and gateway. It had a simple web framework that allowed you to upload and share files. One of our use cases that we thought we could use MinIO for was as a bucket browser/web interface. It was easy to set up as a server as well. Like I said, I didn't track it after testing it for about a month. Today it boasts about its performance and AI/ML use cases. Here is their pricing model https://min.io/pricing and you can see how they add value to their product.
---
Ceph is, like I said, the most complex product of the three, with the most components that need to be set up (even though it's quite easy now). Performance is being optimized in their Crimson project https://next.redhat.com/2021/01/18/crimson-evolving-ceph-for... (this is a WIP and not enabled by default). It's not the most straightforward to tune, since many small things can lead to big performance gains and losses (for instance the erasure code k and m you choose), but I found that the defaults got more sane with time.
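For readers unfamiliar with the k/m trade-off mentioned above, a small back-of-the-envelope sketch (plain erasure-coding arithmetic, not Ceph-specific advice): k data chunks plus m coding chunks tolerate m failures at a storage overhead of (k+m)/k.

    # Back-of-the-envelope comparison of erasure code profiles (illustrative only).
    profiles = [(2, 1), (4, 2), (8, 3)]    # (k data chunks, m coding chunks)

    for k, m in profiles:
        overhead = (k + m) / k             # raw bytes stored per logical byte
        print(f"k={k} m={m}: tolerates {m} failures, "
              f"{overhead:.2f}x storage overhead, "
              f"each object spread over {k + m} chunks")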
Thanks for the detailed clarification! I am too deep into the SeaweedFS low level details and am all ears on how to make it simpler to use. SeaweedFS has weekly releases and is constantly evolving.
Depending on your case, you may need to add more filers. UCSD has a setup that uses about 10 filers to achieve 1.5 billion iops. https://twitter.com/SeaweedFS/status/1549890262633107456 There are many AI/ML users switching from MinIO or Ceph to SeaweedFS, especially with lots of images/text/audio files to process.
I found the MinIO benchmark results are really, well, "marketing". MinIO is basically just an S3 API layer on top of the local disks. Any object is mapped to at least 2 files on disk, one for metadata and one for the object itself.
Besides storage cost, S3 API access cost can also be high if data is frequently accessed. And latency is unpredictable.
You can use SeaweedFS Remote Object Store Gateway to cache S3 (or any S3 API compatible vendors) to local servers, and access them at local network speed, and asynchronously sync back to S3.
One similar use case used Cassandra as SeaweedFS filer store, and created thousands of files per second in a temp folder, and moved the files to a final folder. It caused a lot of tombstones for the updates in Cassandra.
Later, they changed to use Redis for the temp folder, and keep Cassandra for other folders. Everything has been very smooth since then.