The key thing to understand about GFS and its relatives is how essential they are for distributed computation.
It is natural for us to look at a big file and think, "I need to do something complicated, so I'll do a distributed calculation, then get a new file." This makes the file you start with, and the file you wind up with, both bottlenecks to the calculation.
Instead you really want to have data distributed across machines. Now do a distributed map-reduce calculation, and get another distributed dataset. Then create a pipeline of such calculations, leading to a distributed calculation with no bottlenecks. Anywhere.
(Of course this is a low-level view of how it works. We may want to build that pipeline automatically off of, say, a SQL statement. To get a distributed query against a datastore.)
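As a toy illustration of that pipeline idea (my own sketch, not GFS's or MapReduce's actual API): each stage below consumes partitioned input and emits partitioned output, so chaining stages never funnels everything through a single file.

    # Toy, single-process sketch of chained map-reduce stages over
    # partitioned data; names and structure are illustrative only.
    from collections import defaultdict

    def map_reduce(partitions, map_fn, reduce_fn):
        buckets = defaultdict(list)           # "shuffle": key -> destination partition
        for part in partitions:
            for record in part:
                for key, value in map_fn(record):
                    buckets[hash(key) % len(partitions)].append((key, value))
        out = []
        for i in range(len(partitions)):      # each bucket reduces independently
            grouped = defaultdict(list)
            for key, value in buckets[i]:
                grouped[key].append(value)
            out.append([reduce_fn(k, vs) for k, vs in grouped.items()])
        return out                            # output is again partitioned

    docs = [["a b a"], ["b c"]]               # two input partitions
    word_counts = map_reduce(docs,
                             lambda line: [(w, 1) for w in line.split()],
                             lambda w, ones: (w, sum(ones)))
    counts_of_counts = map_reduce(word_counts,
                                  lambda wc: [(wc[1], 1)],
                                  lambda c, ones: (c, sum(ones)))
    print(counts_of_counts)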
Absolutely essential for this is the ability to have a distributed, replicated, filesystem. This is what GFS was. When you try to scale a system, you focus on your current bottlenecks, and accept that there will be later ones. GFS did not distribute metadata. It did not play well with later systems created for managing distributed machines. But it was an important step towards a distributed world. And the fact that Google was so far ahead of everyone else on this was a giant piece of their secret sauce.
True story. I went to Google at about the same time that eBay shut down their first datacenter. So I was reading an article from eBay about how much work it took to shut down a datacenter without disrupting operations, and how proud they were of succeeding. And then I saw the best practices that Google was using so that users wouldn't notice if a random datacenter went offline without warning. Which was a capability that Google regularly tested. Because if you're afraid to test your emergency systems, they probably don't work.
What eBay was proud of doing with a lot of work, was entirely taken for granted and automated at Google.
I'm biased because I did some work on Colossus for a few years, but it's my opinion that Google's main advantage in its ability to build complex distributed systems is Colossus (and GFS). The Colossus layer is the point where a lot of systems actually take care of the CAP theorem, and because Colossus does it so well they almost don't have to worry about it at all.
Ceph, Hadoop, scale-out ZFS, and others that came afterward got the idea of distributing a filesystem but sort of missed the point that GFS/Colossus's great advantage was that it's so damn good at being a distributed system, and has been tried and tested through a lot of different kinds of failure modes.
Sadly, the best public material is the technical marketing for GCP; see the following, for example (a few other materials can be found in this comments section):
> you really want to have data distributed across machines
It's funny how poorly Hadoop, HDFS, and HBase learned this lesson. HBase tries to keep a replica of its data on the local machine, under the incorrect belief that a local RegionServer (approx: chunk server) will be somehow better than a building full of them. This is why HBase performance is hideous. But it is all supposedly inspired by or modeled after GFS and Bigtable?
Hadoop and family fell squarely into the category of things where I always felt I was missing something important: it was being discussed everywhere, but I could never match it up with a problem I needed to solve, even though we were dealing with other “big data” systems.
It wasn't an interesting idea, it was a footgun. There's nothing naturally advantageous about the machine your process happens to be running on. In the era of very large machines there may be substantial contention for the block device that by coincidence is in the same chassis as your CPU.
It's a footgun if your cluster is providing data/compute to other systems, absolutely; the case you mentioned is a risk worth extra network traffic to avoid.
But theoretically, if we're trying to beat a benchmark or something, the "move compute to data" idea has something to it. Nobody transfers the LLM weights off of card memory, let alone to other machines for compute.
It was useful at the time though. Both network and disk were substantially slower back then, and servers had far fewer cores. So which server stored the data could make your processing faster, and you’d have less risk of IO contention within the same chassis.
Fast forward a few years and compute servers can have hundreds of cores and the network is significantly faster.
I think it was a “good enough” design for the time and those assumptions. And batch processing. Now? I think there are different assumptions to be made.
(And I think there is a good argument that by the time Hadoop/HDFS had stabilized the assumptions had already shifted)
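For what it's worth, the locality heuristic being debated here boils down to something like the sketch below (hypothetical names, not any real scheduler's API); the whole argument is about whether the "local" branch is still worth taking once the network catches up with the disk.

    # Hypothetical sketch of "move compute to data" scheduling:
    # prefer a free worker that already holds a replica of the block,
    # otherwise accept a remote read.
    def schedule(block, replicas_by_block, free_workers):
        local = [w for w in free_workers if w in replicas_by_block.get(block, set())]
        if local:
            return local[0], "local read"       # no network hop, but may contend for the local disk
        return free_workers[0], "remote read"   # fine once the network outpaces the disk

    replicas = {"blk_0001": {"node-a", "node-c"}}
    print(schedule("blk_0001", replicas, ["node-b", "node-c"]))  # ('node-c', 'local read')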
It got lost a lot longer than 10 years ago. Sun published the NFS 3 spec (NeFS -- Network Extensible File System) 34 years ago in 1990, which was in turn based on John Warnock's earlier ideas of using PostScript as a "linguistic motherboard" with "slots" for different kinds of "cards" like graphics and file systems (drawing from his experience at Adobe and Xerox PARC):
PostScript is a linguistic "mother board", which has "slots"
for several "cards". The first card we (Adobe) built was a
graphics card. We're considering other cards. In particular,
we've thought about other network services, such as a file
server card.
Network Extensible File System Protocol Specification (2/12/90)
Comments to: sun!nfs3 nfs3@SUN.COM
Sun Microsystems, Inc. 2550 Garcia Ave. Mountain View, CA 94043
1.0 Introduction
The Network Extensible File System protocol (NeFS) provides transparent remote access to shared file systems over networks. The NeFS protocol is designed to be machine, operating system, network architecture, and transport protocol independent. This document is the draft specification for the protocol. It will remain in draft form during a period of public review. Italicized comments in the document are intended to present the rationale behind elements of the design and to raise questions where there are doubts. Comments and suggestions on this draft specification are most welcome.
[...]
Although it has features in common with NFS, NeFS is a radical departure from NFS. The NFS protocol is built according to a Remote Procedure Call model (RPC) where filesystem operations are mapped across the network as remote procedure calls. The NeFS protocol abandons this model in favor of an interpretive model in which the filesystem operations become operators in an interpreted language. Clients send their requests to the server as programs to be interpreted. Execution of the request by the server’s interpreter results in the filesystem operations being invoked and results returned to the client. Using the interpretive model, filesystem operations can be defined more simply. Clients can build arbitrarily complex requests from these simple operations.
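To paraphrase that distinction in modern terms (a toy stand-in: the real NeFS used a PostScript interpreter, not Python), the client ships a small program instead of making one RPC per filesystem operation, and the server evaluates it in a single round trip:

    # Toy interpreter illustrating the NeFS-style "request as program" model.
    def server_eval(program, fs):
        stack = []
        for op, *args in program:
            if op == "lookup":              # push the file object for a path
                stack.append(fs[args[0]])
            elif op == "read":              # replace the file object with its data
                stack.append(stack.pop()["data"])
            elif op == "return":            # send the top of the stack back to the client
                return stack.pop()

    fs = {"/etc/motd": {"data": b"hello"}}
    request = [("lookup", "/etc/motd"), ("read",), ("return",)]
    print(server_eval(request, fs))         # one round trip instead of one RPC per operation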
DonHopkins on March 1, 2020, on: Sun's NeWS was a mistake, as are all toolkit-in-se...
Owen Densmore recounted John Warnock's idea that PostScript was actually a "linguistic motherboard".
(This was part of a discussion with Owen about NeFS, which was a proposal for the next version of NFS to run a PostScript interpreter in the kernel. More about that here:)
Date: Tue, 20 Feb 90 15:20:52 PST
From: owen@Sun.COM (Owen Densmore)
To: don@cs.UMD.EDU
Subject: Re: NeFS
> They changed the meaning of some of the standard PostScript operators,
> like read and write, which I don't think was a good idea, for several
> reasons... They should have used different names, or at least made ..
Agreed. And I DO see reasons for the old operators. They could
be optimized as a local cache for NeFS to use in its own calcs.
> Basically, NeFS is a *particular* application of an abstraction of
> NeWS. The abstract idea is that of having a server with a dynamically
> extensible interpreter as an interface to whatever library or resource
> you want to make available over the network (let's call it a generic
> Ne* server).
Very true. This has been particularly difficult for me to get across
to others here at Sun. I recently wrote it up for Steve MacKay and
include it at the end of the message.
> It's not clear to me if NeFS supports multiple light weight PostScript
> processes like NeWS.
I asked Brent about this, and he agreed that it's an issue. Brent
has been talking to a guy here who's interested in re-writing the
NeWS interpreter to be much easier to program and debug. I'd love
to see them come up with a NeWS Core that could be used as a generic
NetWare core.
I think you should send your comments off to nfs3 & see what happens!
I agree with most of your points.
Owen
Here's the memo I consed up for MacKay:
> A: Rumor has it that GFS has been replaced by something called Colossus, with the same overall goals, but improvements in master performance and fault-tolerance. In addition, many applications within Google have switched to more database-like storage systems such as BigTable and Spanner. However, much of the GFS design lives on in HDFS, the storage system for the Hadoop open-source MapReduce.
At Google the project to migrate GFS to Colossus was called Moonshot. It was endorsed by Eric Schmidt and became a top-down mandate. This goes into details of the migration: https://sre.google/static/pdf/CaseStudiesInfrastructureChang... (scroll to Chapter 1)
Interestingly, Colossus was initially designed to be the next generation BigTable storage. It turned out that other teams had a similar problem, so it has become a more general file system.
Colossus retains a lot of the ideas of GFS, so I wouldn't say that this paper is completely outdated. I would recommend the Tectonic paper from Facebook for a more modern distributed filesystem example:
It's understandable that hyperscalers roll their own infrastructure tools, and natural that small shops reach for 3rd-party tools ... it's telling, when speaking with each, how justified each feels ... yet in the boundary zone between those extremes it gets interesting ... mid-sized shops need to be aware of the benefits of rolling their own, and at a minimum shouldn't frown on job candidates coming from a big shop who instinctively lean towards build versus just using off-the-shelf libs.
Does anybody know if there is any usage/implementation of a distributed hash table (with consistent hashing) for reliable file storage in a datacenter instead of internet p2p? Would that make sense, and what would be the drawbacks?
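For reference, the core of a consistent-hash placement scheme looks roughly like this (an illustrative sketch with made-up names, not any particular product):

    # Minimal consistent-hash ring: each file maps to the first node clockwise
    # of its hash, so adding or removing a node only moves a small fraction
    # of the files.
    import bisect, hashlib

    def h(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, nodes, vnodes=100):
            # virtual nodes smooth out load imbalance across physical nodes
            self.points = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))

        def owner(self, filename):
            keys = [p for p, _ in self.points]
            idx = bisect.bisect(keys, h(filename)) % len(self.points)
            return self.points[idx][1]

    ring = Ring(["node-a", "node-b", "node-c"])
    print(ring.owner("chunk_00042"))

The usual tradeoff, as I understand it, is that pure hashing makes capacity-aware rebalancing, listing, and placement constraints harder than with an explicit metadata service like the GFS master or the Colossus curators discussed elsewhere in this thread.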
GFS was terrible. It may have been great in the early days of Google when they couldn't build an index, but I recall having to bribe hwops people (with single malt) to upgrade RAM on the GFS masters, and a few times my team (Ads) threatened to run our own filesystems that didn't fail so badly.
Single malt was the currency of choice when bribing and/or placating SREs and hwops folks. For example, if a SWE botched a rollout that caused multiple SREs to get paged at 3am, a bottle of single malt donated to the SRE bar was considered a way of apologizing.
Among other things, GFS ran outside of Borg, so you couldn't just donate quota to let it run with more resources, as you found out. Conversely, given that Colossus ran on Borg, part of the migration involved an elaborate system that minted CPU/RAM quota for the storage system users whenever the CFS cell was grown. Then there were all the issues on the low-end machines (hello, Nocona) that had not enough resources to run the storage stack AND user jobs. Stuff wouldn't schedule (and the cluster suddenly appeared to be missing some disk), but at least that was a lot more explicit than the performance issues you witnessed on GFS.
Instead of using five masters like GFS, which typically ran on five machines dedicated to the purpose, Colossus uses a variable number of so-called curators to handle all the metadata work (think of directory entries and inodes in Unix, separate from the data blocks). They're just regular Borg tasks, running alongside those from other services like YouTube, Gmail, Reader, etc. The more storage added to the cluster, the more traffic it gets and the busier the curators become. For very small clusters five or fewer curators might suffice, but when you hit hundreds of petabytes, you need many, many more of them. Borg requires identities, including the system account that runs the Colossus service, to have quota, so there needs to be a way to grant it an amount of CPU cores and memory somewhat proportional to the storage cluster size. (Fun things happen when the cell shrinks instead of growing.) Monitoring the entire stack also requires resources of its own, proportional to cluster size, so there were additional grants at higher priority just for those jobs.
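To make the proportionality point concrete with made-up numbers (nothing here reflects Google's actual policy, units, or ratios), the quota-minting idea amounts to something like:

    # Illustrative only: mint metadata-service quota roughly in proportion
    # to how much storage the cell holds.
    def storage_quota_grant(cluster_petabytes, cores_per_pb=2.0, ram_gb_per_pb=16.0,
                            min_curators=5, curators_per_pb=0.1):
        return {
            "curator_tasks": max(min_curators, int(cluster_petabytes * curators_per_pb)),
            "cpu_cores": cluster_petabytes * cores_per_pb,
            "ram_gb": cluster_petabytes * ram_gb_per_pb,
        }

    print(storage_quota_grant(5))     # small cell: the minimum handful of curators
    print(storage_quota_grant(300))   # hundreds of PB: many more curators, much more quota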
I've always wondered how Google does this internally.
For all the cloud companies, "quota" is literally dollars. Bandwidth, memory, cpu, these are things they sell. So they are measured in money. The existing corporate budgetary process already manages dollars (indeed, that's its main function). If you want more of one of these things, you'll be displacing some customer who stood ready to pay for them (i.e. on the spot market), so you need to replace the revenue they were providing. Very straightforward.
What does Google measure all these things in? Quotons? Where do quotons come from? When Google becomes more profitable, does the Central Bank of Quoton print more quotons each month than they had previously? Does Google have a Chief Quoton Officer?
Totally curious about this. Google is the only company operating on this scale that doesn't sell raw bandwidth/cpu/ram as their product, nor do they buy these things from underlying cloud vendors (like e.g. Cloudflare does). This makes a lot of things simpler internally and probably avoids a lot of internal conflicts, but it does mean that they have to have a parallel internal accounting system for this stuff.
Is there an official exchange rate between the two accounting systems? Are different teams allowed to sell each other quotons, in exchange for budget dollars? Has the IRS ever audited Google's Quoton books?
Okay that last question was not serious. But the rest were.
This is what I was wondering too. I suppose it's useful because it separates internal demand from external prices, so internal teams can be compared with a common currency, and planning can happen on that basis, and then totally different teams can care about how to purchase compute/storage/etc.
But how do they decide when to expand a datacenter or build a new one?
The demand is priced in quotons but the costs are in dollars.
Somebody somewhere at Google must be maintaining a quoton/dollar exchange rate. It's weird that it's hidden from the rest of the company.
> so internal teams can be compared with a common currency
What's wrong with comparing them using dollars (or euros or whatever) as the currency? That's what Google's revenues are measured in terms of. If the price of electricity doubles (like it has in the EU) a team with euro-constant revenues could swing from being highly profitable to a massive loss... and nobody would notice because quotons conceal the cost of electricity.
What I think the GP means is previously, GFS ran on their own machines and didn't need to participate in the Google-wide quota system for CPU/RAM.
However Colossus _does_ run on the general Google compute infrastructure, and so needed a way to get CPU/RAM quota where none existed (because it was all used by existing, non-storage users).