The exciting thing about GFS wasn't even the technology. It was the discovery you made as a new engineer starting at Google (in the 2000s) that from your workstation (and from any job you ran on Borg) you had access to this enormous amount of data stored on GFS across different data centers.
This is the thing that people keep missing. The technology was only the enabler. What made the difference was to have thousands of developers with wide access to the biggest data lake in the world.
Almost nobody gets that when trying to make data lakes, or whatever they are called now, in the corporate world. They focus on vendors and technologies and whatnot, and then ignore that what makes data valuable is access to it by people who don't even know yet what they might do with it.
Absolutely! IMHO, new engineers start with "What technology should we use to do this?", while experienced engineers start with "Assume we've already done this successfully with some (i.e. any) technology, what will the impact be?"
And yes, if you have a hierarchical eng org then there's different job responsibilities at different levels. But at some point, new engineers end up making their own calls, and they usually don't pay enough attention to the forest.
Some technically complex things just aren't worth doing, and some simple things are incredibly valuable to do.
E.g. Porting the fundamentals of an MMU + append optimizations into a distributed storage system: not theoretically groundbreaking. But the impact for its users was!
The ability to join datasets at Google through BQ and 'Flink'-like pipelines was really unmatched. No matter what dataset you wanted, it was accessible and joinable using the same data lake, query tools, and ACL system. Sure, teams differed in how they queried it (tool-wise), but GFS made it all readily accessible for joining.
Most large orgs miss this when they make their data lake, instead building a collection of siloed data lakes.
The standard technique, at least when I was there, is to encrypt with a per-user public key whose decryption key is only present when your application has the user's login session. Basically, your application can't decrypt the data unless the user is making a request to your app. (There are a lot of potential attacks here, like modifying the code to exfiltrate data when the user is present. Hopefully those are easier to detect than just putting the data in a database unencrypted, though.)
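To make that concrete, here's a minimal sketch of the pattern (envelope encryption gated on a live session). This is my reconstruction, not Google's actual implementation; the class and function names are all invented:

    # Sketch: per-user encryption whose key is only released while the
    # user has a live session. All names here are hypothetical.
    from cryptography.fernet import Fernet

    class UserKeyService:
        def __init__(self):
            self._keys = {}         # user_id -> that user's data key
            self._sessions = set()  # live (user_id, session_token) pairs

        def register(self, user_id):
            self._keys[user_id] = Fernet.generate_key()

        def login(self, user_id, token):
            self._sessions.add((user_id, token))

        def logout(self, user_id, token):
            self._sessions.discard((user_id, token))

        def key_for(self, user_id, token):
            # The key is only handed out mid-request, with a live session.
            if (user_id, token) not in self._sessions:
                raise PermissionError("no live session; cannot decrypt")
            return self._keys[user_id]

    def store(ks, user_id, token, plaintext):
        return Fernet(ks.key_for(user_id, token)).encrypt(plaintext)

    def load(ks, user_id, token, ciphertext):
        return Fernet(ks.key_for(user_id, token)).decrypt(ciphertext)

The exfiltration attack mentioned above maps to an application that calls key_for while the user happens to be logged in and squirrels the result away, which is why you still need code review and auditing on top.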
I worked on Google Fiber. We sent all the system logs from our WiFi routers (etc.) back to us, mostly to get to the bottom of the most obscure WiFi bugs. This could have been a privacy disaster, as Linux logs all sorts of PII like the MAC addresses of devices that are connected to the router. My team modified the Linux kernel to not do this, and added filtering before upload to XX anything that looked like a MAC address. That way, when some PM came up with a horrifyingly bad idea like "hey, let's make a social network based on who connected to your WiFi", that data didn't exist. (This predated widespread implementation of MAC address randomization. We took care with your PII, but nobody else did, so phone vendors simply stopped giving access points PII. Probably the right solution, honestly.)
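As a rough illustration of the upload-side filtering only (my reconstruction, not the actual Fiber code):

    import re

    # Anything shaped like a colon- or hyphen-separated MAC address.
    MAC_RE = re.compile(r"\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\b")

    def redact(line: str) -> str:
        """Replace MAC-ish tokens before the log line leaves the device."""
        return MAC_RE.sub("XX:XX:XX:XX:XX:XX", line)

    print(redact("wlan0: STA 3c:5a:b4:12:9f:01 IEEE 802.11: associated"))
    # -> wlan0: STA XX:XX:XX:XX:XX:XX IEEE 802.11: associated

The nice property is exactly what's described above: the PII never exists server-side, so no later product decision can resurrect it.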
This wasn't just us doing things like this; it was institutionalized. There was a privacy team that had to review and sign off on this sort of thing. If you wanted to run something in production, or retain any sort of data, someone who actually cared about user privacy would be taking a look. HN kind of undervalues how much people actually cared; there is some trade-off between usability and security, nobody is going to use a product that can't be used, and no corporation can really stand up to the government the way they want. I thought it was a decent balance, though.
Obviously, some things can't be redacted like we did, and you should be careful about who you share those things with. Location history comes to mind as the most obviously abused feature. I thought it was super cool to have "you visited this cafe 2 years ago" on Google Maps, but it comes at the cost of fishy warrants for "anyone within 1 mile of this minor property crime". I'd rather not be on those lists, so I don't use it. (But be careful, your cell phone provider surely has this information, even if the Vice article is about Google. They just don't use it for anything useful; it's ONLY a dragnet.)
Defense in depth. First, the common RPC identity and authorization system provided RPC identity to the GFS API calls, and there was a common user/group system. GFS had ACLs and so despite what was said up-thread most people could _not_ access the data in the Gmail GFS cells. The team membership vs runtime user setup also ensured that Gmail SREs using their own credentials could not directly access the GFS cells (but once you have root on the storage boxes, that somewhat disappears). Second, on top of that you have encryption, using an encryption key service which would spot anomalies in things asking for the decryption keys to decode the stored data.
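Roughly, the layering looks like this. A toy sketch of the structure described above, with invented names and toy XOR "encryption", not the real implementation:

    class AccessDenied(Exception):
        pass

    class KeyService:
        def __init__(self, keys):
            self.keys = keys  # path -> key byte
        def get_key(self, path, requester):
            # A real key service would run anomaly detection here
            # (odd requester, odd rate, odd time) before releasing keys.
            return self.keys[path]

    def read_file(path, rpc_identity, readers, key_service, storage):
        if rpc_identity is None:               # layer 1: authenticated RPC identity
            raise AccessDenied("unauthenticated RPC")
        if rpc_identity not in readers[path]:  # layer 2: user/group ACLs
            raise AccessDenied(f"{rpc_identity} may not read {path}")
        key = key_service.get_key(path, rpc_identity)  # layer 3: encryption at rest
        return bytes(b ^ key for b in storage[path])   # toy "decrypt"

Each layer fails independently: even with an ACL bug, the ciphertext is useless without the key service agreeing to hand over the key.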
The insider risk stuff always was really cool to me and, IMO, represents ways that the big tech companies do way more than everybody else in this space. BCID can be a huge pain in the ass but being able to say "hey, it actually would be pretty tricky for a single disgruntled employee to execute code to steal user data" is quite powerful.
That sort of stuff was not really stored in GFS directly; it was stored in storage systems built on top of GFS.
That has ~always been the case. So while GFS might have had, say, open access to the map data and tiles, or the crawled web pages, user data isn't accessed that way (and this was standardized and enforced very strongly, even 16 years ago when I started).
Regular users of GFS (developers) couldn't trivially access that data, but up to a point it was certainly possible for expert users to trawl that data (if you were caught, you'd be fired). Eventually, better protections were applied.
I worked with a guy who had built part of search and later made the Gmail "lawyer search" feature (legal discovery). He said that any interest in looking at people's emails went away after having to spend hours going over the various emails involved in court cases (typically done in a room with several lawyers) to make sure the ranking algorithms could surface emails demonstrating illegal behavior.
I worked on Colossus while I was at G. It is an incredible piece of technology, and the current versions are a lot better than GFS. It makes me a little sad that no papers get written about it. Scale-out ZFS and Ceph from the outside world may be the closest competitors, but they don't hold a candle to Colossus in terms of scalability and reliability.
Google publishes a lot of papers in AI, networking, and kernel-type work (isolation, scheduling, etc.). This seems to attract a lot of talent and helps build relationships with academia.
For whatever reason though, the storage and analytics groups don't publish nearly as much as they could.
Bigtable folks wrote a follow-up paper to the original, but it was turned down by the conference's reviewers for not being novel enough or something like that, so the team just went "ok, whatever..." and moved on. Years later, there were a couple of talks by Carter and Phaneendhar at the HBase conference that might have been hosted at Google NY.
Does it use custom hardware or does it work on commodity PCs? If I recall correctly, Spanner used GPS or atomic clocks to have precise time on every machine, and that enables it to perform much better.
It uses commodity hardware. Also, Spanner doesn't really need the atomic clocks, they just help performance. CockroachDB, which is similar, runs fine on commodity hardware.
I seem to have misunderstood Spanner by and large. I thought its Achilles' heel was the atomic clock, since it is this clock which makes consensus at scale possible and thereby transactions. Cockroach and Yugabyte both work on similar principles, but they use hybrid logical clocks. The hybrid logical clock is a function of NTP and a monotonic local clock, so I guess this works OK for Raft consensus.
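For reference, the hybrid logical clock itself is small. A toy version of the algorithm (after the Kulkarni et al. HLC paper; this is my sketch, not Cockroach's or Yugabyte's code):

    import time

    class HybridLogicalClock:
        def __init__(self):
            self.l = 0  # highest physical time observed so far (ms)
            self.c = 0  # logical counter within that physical tick

        def _pt(self):
            return int(time.time() * 1000)  # NTP-disciplined wall clock

        def tick(self):
            """Local or send event."""
            pt = self._pt()
            if pt > self.l:
                self.l, self.c = pt, 0
            else:
                self.c += 1
            return (self.l, self.c)

        def recv(self, ml, mc):
            """Merge the timestamp (ml, mc) carried by an incoming message."""
            pt = self._pt()
            if pt > self.l and pt > ml:
                self.l, self.c = pt, 0
            elif ml > self.l:
                self.l, self.c = ml, mc + 1
            elif self.l > ml:
                self.c += 1
            else:
                self.c = max(self.c, mc) + 1
            return (self.l, self.c)

The point is that causality is tracked by the counter even when the physical clocks disagree, which is why modest NTP skew is tolerable.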
From Jepsen: "yugabyte seems mostly resilient to clock skews". This is a very safe but at the same time vague statement. What does "mostly" mean? Good for 99% of practical use cases?
I think it means "we can't prove that it works, but we have tested it and we haven't found a case where it doesn't." There are conceivable cases where the hybrid logical clock won't work, but most of these are cases where other systems will fail first (eg TLS/SSL).
Edit: To address the first part of your comment, the atomic clocks narrow the bounds on how bad your NTP clock can be. Spanner essentially waits out the bounds on clock variance on each write transaction to get consistency. There still would be a measurable bound on your clock precision without the atomic clock, but it would be a lot larger.
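A toy version of that commit-wait logic, just to show the mechanics (the epsilon value is an assumption on my part; the Spanner paper reports single-digit-millisecond uncertainty with the atomic clock/GPS setup):

    import time

    EPSILON_S = 0.007  # assumed worst-case clock uncertainty; NTP-only
                       # deployments would need a much bigger value

    def commit(apply_write):
        """TrueTime-style commit-wait: pick a timestamp at the top of the
        local uncertainty interval, then wait until every clock in the
        fleet must be past it before acknowledging."""
        ts = time.time() + EPSILON_S   # latest possible "now"
        apply_write(ts)
        time.sleep(2 * EPSILON_S)      # wait out the uncertainty window
        return ts                      # ts is now in the past everywhere

So the write latency floor is proportional to clock uncertainty, which is exactly why shrinking epsilon with atomic clocks buys performance rather than correctness.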
Ceph is incredibly reliable. I run a ~ 250 PB Ceph install with the equivalent of 1.5 medium-paid full time engineers[0], and I sleep very well at night.
Failure after hardware failure, it just keeps running and running. We currently fail at a rack level, but no reason we can't do that at a building level with enough hardware.
0: We have more people on our team, but we also deal with other things that take a lot more of our time.
After a couple decades of breaking every cluster filesystem I've used at scale, Colossus was the first one that managed to handle my loads effortlessly. It required far less attention from SREs when abusive users sent it a lot of load. It also made using Reed-Solomon codes easy.
The truly problematic Colossus clusters usually fell into one of two buckets: a) those with shrinking machine counts (but always increasing traffic) because the hardware was ancient and repairs/upgrades were not viable, and b) those where most of the quota was held by a single project, with massive jobs. One cluster at IDI was mostly Image Search's and it wasn't fun at all.
its not a "natural" choice for a developer, unless you've worked in a large scale unix network that's been setup properly.
We've been told for years that file system primitives are bad, unscalable, and nasty. Part of this is because NFS can be a dick, but a lot is because large-scale file systems are generally fast, scalable, coherent, secure, and reliable (pick two).
A lot of things can be managed with S3, but it has a lot of flaws.
Unless you've worked at a place where you've had access to something like GPFS(or whatever IBM calls it now), where you can put hooks into everything, and programme how you want to mirror/cache/copy namespaces, it all seems impossible.
Take FB for instance. They've gone balls deep with an S3 clone, which is both slow and difficult to use. Every operation takes about 3 seconds. (Pulling a 10k file? Pulling a 100-meg file? List operation? mkdir? All 3 seconds.) It's an interface on top of the underlying blob store, so it looks like it stores its metadata in a SQL DB (i.e. like Lustre with metadata nodes and object storage nodes).
That sucks for machine learning or any kind of "normal" file operations; it means lots of heavy engineering to get the kind of performance that would be solved by an ephemeral clustered file system.
The other big thing to note is that clustered filesystems are a massive, massive pain in the arse to manage, especially if you choose the wrong one. (Looks knowingly in the direction of Ceph and Gluster.) You need to know what tradeoffs you are choosing before you install. Lustre is/was effectively a RAID-0 over the network, Ceph is just plain inefficient, Gluster is just silly, and GPFS is held back by IBM being IBM.
For most people, clustered filesystems are the wrong choice.
> Unless you've worked at a place where you've had access to something like GPFS(or whatever IBM calls it now), where you can put hooks into everything, and programme how you want to mirror/cache/copy namespaces, it all seems impossible.
Do I understand correctly that this would be one possible interpretation of that sentence: if you've only used UNIX, and not something IBM made decades ago, then it seems impossible?
If you want to take it in isolation, and assume that no one uses GPFS, then yeah?
The general gist is that unless you've worked in a place that has set up a "proper" network, that is, roaming home directories, mapped POSIX storage, kerberised user accounts, and some sort of multi-machine graph-based job dispatch system (i.e. Airflow, Grid Engine, large-scale k8s with a batch plugin, AWS Batch, Pixar's Tractor), then you will not have seen a need for a large clustered filesystem with a POSIX-like interface.
Moreover, given the "filesystems are hard and bad, use object storage instead" noise that we've had since we joined the cloud era, why would you _try_ to use a clustered filesystem? Especially as most of the "free" ones are shite.
1. The abstraction is not what most developers want. Colossus gives you append-only files, halfway between a file store and an object store (see the sketch after this list). This is not an abstraction that most developers are comfortable with. It also has a relatively thick client library, so if you were planning to have a "microservice" it is a lot to import. Both of these factors help to simplify the reliability story.
2. Open source users are generally used to picking the abstraction they want, then finding the thing that does it. In contrast, Colossus/GFS use is mandated at G.
3. Colossus's bugs were worked out due to scale, which means wide adoption. Ceph and Lustre, in contrast, tend to have relatively small installations (measured in TiB or single-digit PiB) and their maintainers have seen many fewer byzantine failures by consequence.
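To make point 1 concrete, the client abstraction is roughly this shape (names invented; this is a sketch of the semantics, not the real Colossus API):

    class AppendOnlyFile:
        def __init__(self):
            self._data = b""

        def append(self, record: bytes) -> int:
            """Append a record; returns the offset it landed at."""
            offset = len(self._data)
            self._data += record
            return offset

        def read(self, offset: int, length: int) -> bytes:
            """Random reads at committed offsets are fine."""
            return self._data[offset:offset + length]

        # Deliberately absent: write(offset, data), truncate(), mmap().
        # In-place overwrite simply isn't part of the abstraction.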
I have played with a SaaS offering of a similar service, but it's a huge project and I don't want Google to sue me. OSS would likely be completely off the table except as an "open core" type of concept - there is just too much engineering effort needed to get to the first version.
> Colossus gives you append-only files, halfway between a file store and an object store. This is not an abstraction that most developers are comfortable with.
Most of the files I use are append-only (or in fact write-once).
The only files that are random-access are database files (but I prefer using the filesystem as an approximation to a database; not sure to what extent Colossus can serve that use case; e.g. does it provide transactions?)
Do you have other examples of files that are not append-only or write-once?
BTrees for databases are the most important file type that you lose. Colossus/GFS is why Google adopted LSM trees (e.g. in LevelDB) instead of BTrees.
Another example is documents (think .doc files), although these can also be log-structured or checkpointed (I think the .docx is almost 100% log-structured).
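A toy LSM sketch, loosely LevelDB-shaped (my illustration, not LevelDB code), showing why the structure fits append-only storage: every write is an append to a log or a write-once sorted run, never an in-place page update as in a BTree:

    class TinyLSM:
        def __init__(self, flush_at=4):
            self.log = []        # write-ahead log: append-only
            self.memtable = {}   # in-memory buffer
            self.sstables = []   # immutable sorted runs, written once
            self.flush_at = flush_at

        def put(self, key, value):
            self.log.append((key, value))   # 1) append to the log
            self.memtable[key] = value      # 2) update the memtable
            if len(self.memtable) >= self.flush_at:
                self.sstables.append(sorted(self.memtable.items()))
                self.memtable = {}          # 3) flush a new immutable run

        def get(self, key):
            if key in self.memtable:
                return self.memtable[key]
            for run in reversed(self.sstables):  # newest run wins
                for k, v in run:
                    if k == key:
                        return v
            return None

(Real implementations add compaction, bloom filters, and binary search within runs, but the write path really is appends all the way down.)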
I know what you meant but it might confuse people to say that it is append-only. Clients may read at any committed offset, otherwise it would be quite useless.
- SaaS solution, not just open source (improvements to S3, for example)
- A global hierarchy of files rather than an assortment of different buckets like S3
- Per-subdirectory permissions and chargeback so anyone can create project-specific subdirectories and have their organization billed for storage costs
- Filesystem drivers allowing every production and development computer and coding environment to access remote files in the same way and with the same ease as local files. The difference between accessing remote files and local files in Google code is literally just the file prefix. In all other ways, remote files are _better_ than local files (they have all the same features, plus more), so there's very little reason to directly use local disk most of the time. (A sketch of the prefix-dispatch idea follows below.)
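Something like this, sketched with an invented /remote/ prefix and open() shim (the real prefixes and client library are Google-internal):

    import builtins

    REMOTE_PREFIX = "/remote/"
    _remote_store = {}  # stand-in for the distributed filesystem

    class RemoteFile:
        def __init__(self, path, mode):
            self.path, self.mode = path, mode
            self.buf = _remote_store.get(path, b"") if "r" in mode else b""
        def write(self, data):
            self.buf += data
        def read(self):
            return self.buf
        def __enter__(self):
            return self
        def __exit__(self, *exc):
            if "w" in self.mode or "a" in self.mode:
                _remote_store[self.path] = self.buf

    def open_file(path, mode="rb"):
        """Dispatch purely on the path prefix; callers never care."""
        if path.startswith(REMOTE_PREFIX):
            return RemoteFile(path, mode)
        return builtins.open(path, mode)

    with open_file("/remote/teams/foo/data.bin", "wb") as f:
        f.write(b"hello")  # lands in the cluster, not on local disk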
Most people do not run their data centers like Google. The data centers used Clos networks to provide massive bandwidth between very many very similar machines, and so spreading storage across all of them made a lot of sense. Most algorithms were very parallel/distributed in nature, so sharding data non-locally made a lot of sense vs. running databases or caches on single machines, which favors fast localized storage (15K RPM RAIDs, then SSD and NVMe, sometimes in SAN chassis to aggregate more IOPS/bandwidth over fast dedicated storage networks). As others pointed out, it couldn't use the traditional file system interface and so doesn't work with commodity software either. Rewriting all the open source software to use an open version of Colossus would be a huge task, and without tooling the usefulness of such a thing is very small.
Most companies don't need to process hundreds of terabytes per second to and from disk and so they can centralize storage behind a few network interfaces. Centralized storage is conceptually simpler and more straightforward to manage.
In short, very few folks have the architecture to run something like Colossus, or the need for it, so it doesn't attract open-source replication. Gluster, Ceph, etc. are the closest.
First: Ceph is the closest I've seen. And it is pretty damned good.
Two reasons:
1. Colossus is extremely dependent on other Google Tech to work and be cool. Implementing 'just colossus' without the rest of the tech stack would require developing quite a diverse set of other tools first.
2. Part of what makes it cool is that everybody is mandated to use it, and hardware, kernel and so forth improvements are made to make it better.
Open source is unlikely, because such a thing is expensive to develop and there is no business model to support that (commercially or as a story for VCs).
At Quobyte we're building a scalable fault-tolerant file system (POSIX, so it has a different architecture and constraints), but it's closed source.
And that's part of what others are saying. Some class of developers want POSIX, but POSIX is _terrible_ to horizontally scale.
I've tried a couple times at different companies to get developers to stop assuming POSIX and switch to object storage as their persistent byte abstraction layer.
Trying to scale POSIX is a dead-end. Every time I try and see people build it, it's a big pile of suck. For example, CephFS. Yea, it works, but it still sucks compared to object storage. There's just no magic in the universe that will get you around the CAP issues with trying to make POSIX work over a network.
Only the namespace is hard to scale horizontally, because of rename; the rest of POSIX does not have inherent scalability issues. For the vast majority of applications, namespace scalability is not an issue (only a few single applications store billions of files/objects), so in most cases it's mostly a system management topic.
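To see why rename is the odd one out, picture a namespace sharded by directory: create/append touch one shard, while rename can touch two and therefore needs cross-shard atomicity. A toy sketch (invented names; naive lock ordering standing in for a real distributed commit):

    from threading import Lock

    class Shard:
        def __init__(self):
            self.entries = {}  # path -> inode id
            self.lock = Lock()

    SHARDS = [Shard(), Shard()]

    def shard_for(path):
        return SHARDS[hash(path.rsplit("/", 1)[0]) % len(SHARDS)]

    def create(path, inode):
        """Single-shard operation: scales out trivially."""
        s = shard_for(path)
        with s.lock:
            s.entries[path] = inode

    def rename(src, dst):
        """May span two shards, so it needs atomic coordination."""
        involved = sorted({shard_for(src), shard_for(dst)}, key=id)
        for s in involved:          # fixed lock order avoids deadlock
            s.lock.acquire()
        try:
            shard_for(dst).entries[dst] = shard_for(src).entries.pop(src)
        finally:
            for s in reversed(involved):
                s.lock.release()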
If you go full object storage semantics (no in-place updates, no appends, no rename, ...) you're pushing the problem to the application layer. Experience at Google was that that's not a good place to have it, because application developers are usually not good at solving distributed systems problems. This lesson is part of the Bigtable (no transactions) to Megastore (application-layer transactions on top of Bigtable) to Spanner (built-in transactions) journey.
So yes, of course there are inherent hard trade-offs, but dropping strong semantics from the storage layer by using an object store instead of a file system is usually not a good idea.
HDFS was closely modelled after GFS. So the question would maybe be whether progress was made here: we have OrangeFS, SwiftFS, CephFS, CassandraFS. Then for HPC there is Lustre. Most focus, though, is on object storage. We use Ceph and Lustre and I am quite happy. I wonder if there really is a concrete demand FOSS cannot serve. Also, most of us are not hyperscalers...
Inside that company they have a forcing function that no outside developer would enjoy. They can, and did, just dictate that POSIX i/o does not exist and your machine has no disks. In my opinion this dictatorship is responsible for a large fraction of software quality at Google. If you tell people the only thing that they can rely on is appending to files in a distributed filesystem, they're going to be able to avoid a whole class of mistakes (notably: writing something to local disk and expecting it to be there afterward).
> Also, there don't seem to be public research papers about it.
The question of atomic clocks aside (whose answer I don't know), Ex-Googlers have created open-source alternatives to Google-internal software in a lot of other cases, too, so I don't think the lack of public research papers could be the reason.
The two are very different technologies. In AFS, the fundamental unit of sharding is the volume; each volume is controlled by a single volser, and the "master" (i.e. vldb) is responsible for resolving volume IDs to the set of volsers that hold a copy (at most one of which is read/write).
In GFS, on the other hand, individual files are sharded between servers on a chunk level (where 1 block==64MiB IIRC). To access a file's contents, you would contact the master, ask it for the chunkservers for a file, and it would tell you which set of chunkservers were responsible for each chunk in the file; you'd then contact the chunkserver to access the data.
The impact of this is that the latency of accessing a file was much higher on GFS; you needed to go through the master (which tended to be a bottleneck) to get the metadata, and then establish a new connection with authorization to each chunkserver to access the data. However, because each file was stored by multiple machines and the data location was independent of the position in the hierarchy, there was very little risk of chunkservers limiting the transfer speed.
With AFS, on the other hand, the master is much less of a bottleneck as it's only used when mounting volumes. Further, once a volume is mounted, accessing files and data within those files could be done in a single round trip. However, if you have a volume with high write traffic, that will end up bottlenecking on a single server's IO speed.
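In toy form, the GFS read path described above looks something like this (my sketch from the paper, not real code; error handling, leases, and location caching omitted):

    CHUNK = 64 * 2**20  # 64 MiB chunks

    class Chunkserver:
        def __init__(self, blocks):            # chunk handle -> bytes
            self.blocks = blocks
        def read_chunk(self, handle, off, n):
            return self.blocks[handle][off:off + n]

    class Master:
        def __init__(self, chunk_map):
            # file -> list of (chunk_handle, [replica addresses])
            self.chunk_map = chunk_map
        def lookup(self, filename):
            return self.chunk_map[filename]

    def read(master, chunkservers, filename, offset, length):
        chunks = master.lookup(filename)  # the one master round trip
        out = b""
        while length > 0:
            handle, replicas = chunks[offset // CHUNK]
            cs = chunkservers[replicas[0]]     # any replica will do
            n = min(length, CHUNK - offset % CHUNK)
            out += cs.read_chunk(handle, offset % CHUNK, n)
            offset, length = offset + n, length - n
        return out

(Clients also cached chunk locations, which took a lot of pressure off the master in practice.)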
When I went to college from 1993-97 our university used AFS for all the Unix workstations. We had a mix of Solaris, SunOS, HP-UX, DEC Ultrix, SGI IRIX, and then some Linux boxes.
When you logged in at any workstation from any vendor in any computer lab your home dir would be mounted and everything was encrypted. We had ACLs where we could give permission by username and create our own ACL groups and give permission to a project directory to just our project/lab partners. It was so much easier than the chmod uid/gid stuff.
In theory we could access other universities under /afs/mit or whatever site they were but I don't think we ever had that active.
I spent a couple of years at Morgan Stanley and when we cut over from SunOS to Solaris they included a rollout of AFS worldwide. It was pretty successful. It was pretty mind blowing to be able to login to a Solaris workstation while visiting the NYC office and have your [local-to-London] $HOME and CDE environment just kinda sync automatically (albeit slowly back then). The administration of AFS is ... interesting, but considering this was the mid-to-late 90s, it really was "ahead of its time". FWIW, Morgan Stanley back then had an incredible UNIX Engineering department.
I really thought AFS, DCE, and SSI (single system image) were going to take over. It was kind of depressing when I got a job and everything was NIS/NFS and we manually logged into a machine and started a job.
Then we got clustering software like LSF to distribute jobs, but I thought the dynamic process migration using checkpoint/restart kind of stuff would take over; it seems to be more trouble than it is worth.
At work we would always run stuff off a powerful 4 to 16 CPU Sun and display to our cheaper SPARCstation 5 or Linux desktop via X11.
I will say it is great now to have a virtual desktop session Exceed/NoMachine/X2go and be able to go home and connect to it and have all my windows in the same place just like when I left. I've got X2go set up at home so even when I travel I can get a desktop at home if I need it.
It is absurdly scalable in contrast to other filesystems, and extremely reliable. Also, you get per-file control over replication level (file-level durability/availability), plus comprehensive permissions management.
The permissions management (user/group but with cloud roles) is what I miss most. NFS style approaches assume that I will give every user an unprivileged user on their local machine and deny root access so they can't just create a user with a different ID. S3 style ACLs per-object are just very confusing and no one is going to set those. S3 access policies per role are obviously not a solution because that requires centralizing lists of what the permissions are. I just want to be able to assign a chmod and a chown on a file for a cloud user and group... And don't forget about how fantastic the group tooling was.
I also miss being able to write files with /ttl=72h/ anywhere in the path and knowing they will be cleaned up automatically without me doing any more work. In contrast S3 just rots. Need to find an intern to write an SQS pipeline to schedule file deletions :)
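The /ttl=72h/ path convention is as described above; the sweeper below is just my reconstruction of what the cleanup side might look like, run against a local tree:

    import os, re, time

    TTL_RE = re.compile(r"/ttl=(\d+)([hd])/")
    UNIT_S = {"h": 3600, "d": 86400}

    def ttl_seconds(path):
        m = TTL_RE.search(path)
        return int(m.group(1)) * UNIT_S[m.group(2)] if m else None

    def sweep(root):
        """Delete any file whose ttl= path component has expired."""
        now = time.time()
        for dirpath, _, files in os.walk(root):
            for f in files:
                p = os.path.join(dirpath, f)
                ttl = ttl_seconds(p)
                if ttl is not None and now - os.path.getmtime(p) > ttl:
                    os.remove(p)

    # sweep("/data/scratch")  # e.g. /data/scratch/ttl=72h/tmp.bin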
The scope is entirely different. You can't grant privileges to create lifecycle rules on a subtree of a bucket. You set the rules on the entire bucket.
> NFS style approaches assume that I will give every user an unprivileged user on their local machine
With Kerberos it kinda works, each client is bound to a UID instead of blindly trusting the client, but very few people nowadays deploy this ancient tech (and it's fragile).
If you work at G, you should be able to see how D does permissions by looking at its ACL checking code (start with one of the files in the D server code that defines an op, and you can get to the permissions code pretty quickly). There's a lot of code in Colossus/D that is special because it has to be able to come up in a nearly empty cluster.
For me, largely ancillary reasons. One is its extreme degree of observability. At any point in time you can instantly answer complex, detailed questions like how many disk heads had to move to complete one given Gmail request. Most of Google's observability stack is attributable to the need to observe Colossus and D.
The original whitepaper is a great overview of how the system works, but the devil is in the details. I still have a lot of questions about the implementation, for instance:
Where did the master get information about chunkservers (Chubby? configuration files?)? Did a chunkserver know which master it belonged to? Etc.
The master is configured with the endpoints for all the chunkservers. A separate control/config system supported this, including things like automatic replacement of dead nodes.
Awesome! Just checked the whitepaper: there was a cluster with 300 machines, so all machines had to be described in a configuration file that was used by the master, right?
Mostly. Binaries of that era took everything on the command line. The config system would generate that from files and keep the running binaries up to date via ssh.
GFS had some serious issues, one of the biggest being the master node's RAM: since the metadata was stored in RAM, there was a limit on filesystem size. Sometimes you had to bribe a hwops person with whiskey to upgrade the RAM over Christmas.
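The back-of-envelope for that limit, using the figure from the GFS paper (less than 64 bytes of metadata per 64 MiB chunk; the master RAM size here is an assumption):

    ram_bytes      = 64 * 2**30   # suppose the master has 64 GiB of RAM
    meta_per_chunk = 64           # bytes of metadata per chunk (paper figure)
    chunk_bytes    = 64 * 2**20   # 64 MiB chunks

    max_chunks  = ram_bytes // meta_per_chunk   # ~1.07 billion chunks
    max_storage = max_chunks * chunk_bytes      # ~64 PiB ceiling
    print(f"{max_chunks:,} chunks -> {max_storage / 2**50:.0f} PiB")

So a single master's RAM caps a whole cell's capacity, which is part of what Colossus later fixed by scaling the metadata out instead of keeping it on one machine.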
The GFS paper 1. showed that software can overcome faults in the hardware and 2. formed the storage basis for Google MapReduce.
From there, Apache Hadoop, the open source clone by the Nutch team, took off, which in turn led to Zaharia's Spark (now Apache Spark), which provided the RDD abstraction to remove the batch-only limitations of Hadoop, thus enabling distributed iterative and graph methods at scale.
Can someone articulate the differences from ZFS? Apart from the fact that Google's file system seems to work better with large files and allows simultaneous writing (or rather, appending) without blinking an eye. But that'd be useful for maybe only very large tech companies.
GFS/Colossus is designed to offer a similar level of features (well, in terms of performance and namespace, not snapshots) but over many machines.
ZFS doesn't have to worry too much about latency, or about another machine writing to the block that it wants to update. As soon as you break free from one machine, you have to think about how you deal with:
- where data lives (is the data stored separately from the metadata?)
- how a client signals who has a write lock
- permissions (i.e. user authentication; NFS basically just uses user IDs, which can be faked, unless you are running Kerberos/NFSv4)
- data affinity (pulling data from another data center is expensive; do we sync, cache, or clone? how do we resolve write clashes?)
All of those questions have an impact on speed, scalability, reliability and durability. You need to choose your designs carefully to get a filesystem that does what you want it to
One of the most important differences is that Colossus and GFS are user-space filesystems for the client. You're just making RPCs to a service; there is no kernel filesystem. Upgrading the client and the server is just a software upgrade, no kernel changes. And the operations were intentionally restricted: for example, instead of the debacle that is locking in NFS, locking is part of a different (and even more fundamental) tech at Google (Chubby, which is NOT a filesystem, or so they say).