* Unit files. When you create a file and write it, it's not visible to other opens until you close it. If you open a file with O_CREAT|O_WRONLY|O_TRUNC, you create a new file, which replaces the old one on close. In the event of a program or system crash, or exiting via "abort" without closing first, the old file remains. So there's always one completely written file. Creating a unit file is an atomic operation. Most files are unit files. (This isn't original with me; it comes from a distributed UNIX variant developed at UCLA in the 1980s.)
Replacing an existing file currently requires elaborate renaming gyrations which vary from OS to OS and file system to file system. At least for Linux, this should Just Work in the normal case.
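For reference, the "elaborate gyrations" on Linux today look roughly like this (a sketch; error handling and the directory-fsync details vary by filesystem):

```python
import os
import tempfile

def replace_file(path, data):
    """Atomically replace `path` with `data` via the classic temp-file dance."""
    d = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=d)   # temp file on the same filesystem
    try:
        os.write(fd, data)
        os.fsync(fd)                    # data must hit disk before the rename
    finally:
        os.close(fd)
    os.rename(tmp, path)                # atomic on POSIX within one filesystem
    # For durability of the name change itself, fsync the directory too:
    dfd = os.open(d, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

Readers either see the old file or the new one, never a half-written one; this is exactly the behavior a "unit file" would give you for free.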
* Log files. If you open a file with O_APPEND, you can only write at the end. The file system should guarantee that, even after a crash, you get the file as written out to some previous write. If you call "fsync", the recovery guarantee should include everything written up to the "fsync" point. No seeking backwards and overwriting on a log file.
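On today's systems that contract is approximated with O_APPEND plus fsync; a minimal sketch:

```python
import os

def append_record(path, record):
    """Append-only logging: O_APPEND makes the seek-to-end and the write a
    single atomic step, so concurrent appenders don't clobber each other."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, record)
        os.fsync(fd)  # the recovery guarantee should cover everything up to here
    finally:
        os.close(fd)
```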
* Temporary files. You can do all the file operations, and the file disappears on a reboot.
* Managed files. These are for databases and such. They have some extra API functions for async reads, writes, and commits. Async I/O should have two callbacks: "the data has been taken and you can now reuse the buffer," and "this write is definitely committed and will survive a system crash." That's what databases really need, and try to fake with "fsync".
Only a few programs will use this, but those are important programs.
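A sketch of what that two-callback interface might look like, emulated today with the blocking fsync that databases actually use (the function and callback names here are made up for illustration):

```python
import os

def managed_write(fd, buf, on_buffer_reusable, on_committed):
    """Hypothetical two-callback write, faked with blocking POSIX calls."""
    os.write(fd, buf)        # once write() returns, the kernel owns a copy...
    on_buffer_reusable(buf)  # ...so the caller may reuse the buffer
    os.fsync(fd)             # blocks until the data should survive a crash
    on_committed()           # the durability callback databases really want
```

A real async version would fire these from completion events rather than blocking in sequence, but the two distinct milestones are the point.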
Classic Mac OS 9 had the PBExchangeFiles call, which did this perfectly. Before the call:
dirEntryA -> fileContentsA
dirEntryB -> fileContentsB
After the call:
dirEntryA -> fileContentsB
dirEntryB -> fileContentsA
So when saving a new document you wrote it to a new hidden temp file, and when everything was written, you called PBExchangeFiles to swap the contents of the old and new files. After this you deleted your temp file which now contained the old document content.
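Linux eventually grew an equivalent: renameat2() with RENAME_EXCHANGE atomically swaps two paths (kernel 3.15+). A sketch via ctypes, since the glibc wrapper only appeared in 2.28:

```python
import ctypes
import os

libc = ctypes.CDLL(None, use_errno=True)
AT_FDCWD = -100       # from <fcntl.h>
RENAME_EXCHANGE = 2   # from <linux/fs.h>

def exchange_files(a, b):
    """Atomically swap the inodes behind paths a and b (PBExchangeFiles-style)."""
    ret = libc.renameat2(AT_FDCWD, os.fsencode(a),
                         AT_FDCWD, os.fsencode(b), RENAME_EXCHANGE)
    if ret != 0:
        e = ctypes.get_errno()
        raise OSError(e, os.strerror(e))
```

After the swap you can unlink the temp file, which now holds the old document content, exactly as in the Mac OS 9 flow.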
> The existing file is superseded; that is, a new file with the same name as the old one is created. If possible, the implementation should not destroy the old file until the new stream is closed.
I imagine they lifted this idea from some prior Lisps, which suggests it's a pretty old concept.
So tags from A will end up on B. If B is a temp file, then the tags will be lost when the temp file is deleted.
It would be nice to be able to have a process tree own a temporary file, such that when the last process in the tree exits (not necessarily the process which created the file), the file is automatically deleted, rather than having to wait for the next reboot.
Yes, there are workarounds (library extensions, signals, /proc/fd, etc.). But you'd think a basic function like "delete this file when this process exits" wouldn't be too hard.
Consider that filesystem data cannot be synced to underlying storage in such a way that it can be reliably read later, and backups are by definition asynchronous, providing only eventual consistency. This means the best meaningful guarantee an application can get after a machine crash is being able to read data written up to some point in the past, not data from every successful fsync before the crash. Usually even that guarantee is hard to achieve, since bad blocks happen and the redundancy is not there. Given all that, it's OK to relax filesystem API behavior to match the actual physical constraints. For example, fsync is only meaningful as an ordering operation; there's no need to actually flush anything to disk immediately on fsync or any other operation.
Next are multi-process and multi-threaded scenarios. Should O_APPEND only work correctly from a single thread? Should each write be atomic, and up to what size? We certainly can't have gigabytes in a single atomic append. Or should there be some synchronization mechanism that blocks other writers? The same questions apply to temporary files and unit files.
And what to do about bad blocks? Should there be redundancy within a single disk? Should the block device underneath be log-structured storage with block remapping and scrubbing, providing a reliable storage layer to the outside at the cost of space? Maybe for desktop machines it should, but not for servers, or at least not all servers; they need a different API.
I'm not even touching performance considerations here, which depend a lot on performance-friendly APIs.
We have the temporary files you're asking for. https://lwn.net/Articles/619146/
With O_TMPFILE, you can also write new data, and then automatically replace a file on disk.
How? Using "linkat"? That's not automatic.
> If newpath already exists, it will be atomically replaced, so that there is no point at which another process attempting to access newpath will find it missing. However, there will probably be a window in which both oldpath and newpath refer to the file being renamed.
I wonder if this would ideally also provide a rotation API. Rotating log files in the presence of multiple writers is messy, and maybe shouldn't be reimplemented at the application layer every time.
> Temporary files
Windows is not always a model of filesystem elegance, but it has a "delete on close" flag in its equivalent of open, which makes the file go away on the last close (the handle can still be duplicated or inherited, so you get some reference counting through that).
Actually, I think you can do something similar on Unix by unlinking right after the open, but keeping an fd open for the lifetime of the temp file.
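That Unix idiom, which is also what tempfile.TemporaryFile does under the hood on POSIX:

```python
import os

def anonymous_temp(dirpath):
    """Open a file, then unlink it; the inode lives until the last fd closes."""
    path = os.path.join(dirpath, "scratch.tmp")
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
    os.unlink(path)   # the name is gone immediately...
    return fd         # ...but reads and writes through fd still work
```

The limitation versus a process-tree-owned temp file is that the fd must be inherited or passed around; the file's lifetime follows the descriptors, not the process tree.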
That feature was in UNIVAC EXEC8, now OS-2200. It was called F-cycles. Over half a century old, and still working.
I think there is a more universal truism here - that "complex and subtle" are sources of pain, problems and headaches.
I want to write "cool" and "magical" code as much as the next person, but that's the stuff I look at later and go WTF, because I am no longer in the same state of mind. Clear, simple, straightforward, plain-as-day code is better than anything else.
And if you do have to do something "magical," something "odd" or hard to understand, then for the love of god please leave notes explaining what, why, and how you did what you did. And if you are replacing a "clean" and "readable" version, leave it there commented out. Sure, it is "in the repo," but almost no one ever looks, and you are just making it harder for me to figure out the original intent of whatever was there.
As the complexity of the projects I work on has increased (especially in the time since I became a part of a professional dev team), I've come to realize that most of that tricky, complicated-looking code I had read years earlier was actually kind of the easy way out. When you are just trying to get stuff done, the complexity of your code mirrors the complexity of the project. There's no effort put in (and often no time to do so) to create a smoother interface to shelter the code from the complexity of the task at hand.
It's much more interesting to me now to be faced with a complex task and to figure out how to make the code simple and clear. Elegant, well-designed libraries/APIs belie the challenge of writing them. The code looks so simple that it feels immediately obvious as you're reading it. I've come to realize that how simple a code base seems -- so simple anyone could instantly understand it -- is often inversely related to the difficulty of creating it.
Yes, necessary complexity does exist, and sometimes there's no way around doing something fairly nasty in your code. But as an ideal to strive for, I find 'simple' to be fascinating.
That’s probably the main reason why so many C++ codebases are unnecessarily complicated. People look at what standard library developers did, and emulate it.
The complexity of the standard library is actually justified: containers need to scale from 1 to 1E9 items and more, algorithms need to work with whatever broken classes users throw at them, the whole standard library needs to be extremely generic.
None of these requirements apply to most application code, yet I have seen many C++ projects where people designed their code the same way the standard library is designed, without thinking about the reasons. The result is template-heavy code which takes a long time to compile and is hard to modify or debug.
This is also true of "it's in the repo": that doesn't get hit by "find and replace" either.
I don't think it should be done for EVERY case, rather for the cases where clarity is replaced with something that is ambiguous but meets another need. Performance hacks are notorious for being "ugly" and "magical," and having a readable companion piece would make sense.
To clarify, I don't think it replaces "prose," but it should exist alongside it, and be mixed into it.
It's very much not! The thing "in the repo" is of exactly the same vintage as the thing in the next commit. The thing "in the comments" is preserved in amber next to a growing thing that no longer matches it. Inevitably to understand the thing "in the comments" you have to go back to commits of the same age.
Compare that with a GPU. It took decades, but (at least for nvidia GPUs) they have finally reached the point where you can be at least reasonably certain that high-performance is the default. Sure, you can ruin your performance by doing some stupid things, but it's at least harder to do those stupid things. The defaults are high-performance.
It's exactly the opposite for Cloud TPUs. By default, a TPU only uses one core. You have to use their TPUEstimator API, which is a byzantine mess. Over the past two weeks, I've spent roughly 3 solid working days of effort solely trying to read and understand (a) what is the TPUEstimator actually doing? and (b) why is it supposedly so much faster?
I have some half-hearted justifications -- the answer seems to be "you have to colocate your gradients with the device; you have to scope your tensorflow graph to a specific device; and each TPU core is a separate device." But there are unanswered questions. For example, you're supposed to pass your tensorflow computation to tpu.rewrite(). Yet I've never done that. The defaults seem to just work. So does that mean it does the rewriting for me automatically? Is it being emulated in software, and I'm ruining my performance? Tensorflow, why don't you just crash instead of being so damn slow on TPUs? That would at least let me aim my optimizations!
If only I had low level access to the actual TPU operations, I could just write a compiler that specifically emits instructions to give me the performance I need. But this tensorflow graph abstraction makes everything "easy" yet exponentially more complicated.
Anyway. Yes. More of your mindset, please. Simplicity is such a lovely metric seldom optimized for.
(Anyone who's curious can see a dramatic 11x difference in performance on GPUs vs TPUs in Colab: https://twitter.com/theshawwn/status/1196593451174891520 ... this was very surprising, since Google's marketing would have you believe that TPUs are the bee's knees. Yet by default a TPUv2 is 11x slower than a K80 GPU for some basic matrix multiplications.)
We're trying to use TPUs to fine tune GPT-2 1.5B. The model takes up 5.8GB memory, which is well over half of a TPUv2 core (8GB). It always OOMs when I try to do a training step, due to the gradient calculations requiring memory. It even OOMs on a TPUv3, which has 16GB per core. I've tried using bfloat16 (which ought to cut memory usage in half) and using Ada optimizer (which should be no more expensive than plain old SGD). Yet if I colocate the gradients to the same core as the model, I always OOM. (Colocation just means "don't use any memory except the memory physically on this one core.") With colocation off, I don't OOM, and I do see some speed gains using all 8 cores. But it's no more than a factor of 2x, and in fact closer to like 1.15x (i.e. it's roughly equivalent to just using larger batch sizes on a single core). And I don't understand why I'd OOM in the TPUv3 case; even with float32, the model is only using 5.8GB out of 16GB. Are gradient computations really taking up more than 10GB for the optimizer? (That leads to https://github.com/cybertronai/gradient-checkpointing and such, but I haven't tried it yet.)
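Back-of-the-envelope arithmetic suggests the OOM is plausible, assuming an Adam-style optimizer that keeps two float32 state tensors per parameter (an assumption; the exact state depends on which "Ada" variant and dtype are actually in play):

```python
params = 1.5e9                  # GPT-2 1.5B parameters
weights = params * 4            # float32 weights: 6.0 GB (~the observed 5.8 GB)
grads = params * 4              # one gradient per parameter
opt_state = 2 * params * 4      # Adam-style: first + second moment per parameter
total_gb = (weights + grads + opt_state) / 2**30
print(round(total_gb, 1))       # prints 22.4 -- over 16 GB before any activations
```

Under those assumptions the optimizer state alone is ~12 GB, so "more than 10GB for the optimizer" is not surprising even on a TPUv3 core.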
If I try the same experiment with a much smaller model (117M, or about 13x smaller), I can successfully colocate the gradients onto the same core as the model. And when I use all eight cores, I'm able to get 1225 tokens/sec (roughly 1 example per second, since 1 example = 1024 tokens for GPT-2), vs the standard case of around 400 tokens/sec when using only one core. But that's still "only" a 3x speedup.
So when I see "50-100x increases," alarm bells start going off. Either people really are getting 100x speedups, or I am somehow missing something fundamental.
People have even started asking me for answers regarding the TPU case, and I'm forced to be like "Yeah! I expected TPUs to be so much faster too. Everyone says they're getting 100x speed gains. Yet we're 11x slower than the GPU case, and here's a notebook showing an 11x slowdown."
I'm suspecting that memory bandwidth is the bottleneck here for large models. This paper even pretty much says "GPUs are more flexible and faster when memory bandwidth is an issue": https://twitter.com/mosicr/status/1196749286815481856
The closest I've come to finding an actual example of a speedup to aim for is this: https://github.com/imcaspar/gpt2-ml
They used a TPUv3-512 pod to train a GPT-2 1.5B model to 99k steps in 50 hours. If you work out the math, that's about 1 example per second. We're getting about 0.08 examples per second on a single TPUv2 core. So yes, it's a big speedup (12.5x), but certainly not 50-100x. Yet the pod has 64x as many cores as my TPUv2; why isn't it 64x faster? And we're only using 1 core; why not 512x faster?
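The arithmetic behind those numbers, for anyone checking (the ~1 example/sec figure comes from the step count and wall clock; the pod's batch size isn't published):

```python
steps, hours = 99_000, 50
steps_per_sec = steps / (hours * 3600)   # pace of the TPUv3-512 pod run
pod_examples_per_sec = 1.0               # the rough figure worked out from the repo
single_core_examples_per_sec = 0.08      # observed on one TPUv2 core
speedup = pod_examples_per_sec / single_core_examples_per_sec
print(round(steps_per_sec, 2), round(speedup, 1))  # prints: 0.55 12.5
```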
I have also tried this on TPUv3-8, and we're getting about the same examples/sec, further increasing the plausibility of the theory that memory bandwidth is the bottleneck.
A perfect C++ program that outperforms a simple-looking Python program doing the same thing isn't worse because it's harder to read or write, or better because it's faster; it's just different.
If you need to optimize for readability, so be it. If you need to optimize for performance, so be it.
If you write code executed trillions of times in a trillion places, where power use and hardware expense depend on the performance of the software, I think it's okay for it to be as complex as it needs to be to be optimal, at the expense of readability and other things. It should be well annotated, sure. That's always a reasonable thing to expect, except in rare, extreme cases where you don't even have the time to type out the comments.
It's important to weigh the relative importance of simplicity vs speed vs memory frugality vs portability vs maintenance difficulty (dependencies and tool choice factor in here) and so on to determine the best approach.
Dogma about how best to code in a general sense is only helpful if you don't want to think or talk about what's the best fit.
I tend to think there is always a better way, say creating a module that contains both implementations as alternatives, but if this is the worst wart on the codebase it is probably a good codebase.
Just leaving giant blocks of commented-out, trivial-to-recreate code as the system evolves is annoying. But I do think the rule of not leaving old code in as comments is one where the bar to breaking it shouldn't be too terribly high. Personally I've gone overboard following it in the past, and all it did was hurt productivity. A few comment blocks aren't the end of the world.
Rebasing and other means of moving patches (mailing patches around, adding Signed-off-by lines, etc.) can cause the original commit ID to become invalid, and eventually unretrievable.
To some people, C is simple. To other people, C is full of "complex and subtle" sources of pain.
To some people, a VM with a GC and a JIT is "complex and subtle". To other people, using a HLL with those features enables them to write programs which are much simpler and clearer.
These positions also change over time. A JIT is a fairly standard technique in 2019, but in 1973 it was still very much in the research phase. Even the idea of writing a kernel in such a HLL as C was once revolutionary.
I have an SDET friend who calls these types of universal truisms "apple pie", as in, you hold a meeting and say "Apple pie is good, right?" and everybody nods and says "Yes!" and then goes back to work and absolutely nothing was learned or decided.
Most programmers' abstraction of a computer system is synchronous and consistent. If you make the simple synchronous cases do the wrong thing, like write() returning success on a full disk when the write did not happen, you are going to break people's code.
This also explains why, when you choose a database, a relational DB with strong consistency and linearizability should be your default. Going for eventual consistency in the database layer and expecting the application logic to deal with it will lead to grief in many, many cases.
As a developer I find myself in a different scenario. I'm usually trying to find out what exactly the 100% guaranteed way to do something is. Instead, I find incomplete documentation and different people with different opinions on what the guarantees are, and most people writing bad code that they assume will usually work.
Just modifying a file in an atomic way requires a complicated dance of multiple files and multiple syncs and a rarely tested cleanup routine the next time the file is opened. No one does this.
I don't know what the solution is.
POSIX, and by extension the classic 1960s-1980s era UNIX way of doing things, just needs to die a long-overdue death.
This stuff was designed at a time when every CPU instruction mattered, everything was optimised to death for frugality, and commands were abbreviated from "copy" to "cp" because ermahgerd, two bytes is a huge saving! That mentality got us Y2K. It was an era where latency was not the bottleneck; CPU cycles and memory bytes were.
A lot of stuff in filesystems is just plain stupid. For example, why do applications install their files. one. at. a. time? Like... what the fuck? How does it make any sense for an application to be partially installed? Who actually codes their application with 500 modules and dynamic libraries to be able to handle the scenario where one of them is inaccessible due to an ACL or a mismatched version because of an overwrite by something or someone else? NOBODY, that's who. Meanwhile, I can make a cup of tea while Adobe Lightroom launches on an SSD drive because it is 99% OS API overhead and 1% usermode action.
This is why Docker is popular. Not because Docker is good, but because OS APIs are retarded.
Every application install should be a union fs. This union fs should be entirely user-mode, so that if an application has 10,000 files, it doesn't take 10,000 round-trips to the OS kernel with the Intel mitigations, context switches, and cache flushes that all brings with it.
Copying a file shouldn't require a user-mode buffer to feed the data through, forcing it to come down WAN links just to go back up the same WAN link again on the way out.
Overwriting a file shouldn't require more than a single API call, because it's nearly 2020, and we should have long since realised that kernel transitions are expensive, so we should optimise to minimise the number of round-trips. open(), write(), write(), write(), flush(), sync(), close(), poke(), prod(), jesusfuckingchrist(). Just take a buffer bigger than 4KB, or better yet, standardise an API to take a stream from user mode.
Just take a buffer and a filename, and atomically replace. Done. Bang. No lost data, no torn writes, just DO IT. How hard can this be? Is it impossible to do this? Are we forever stuck with POSIX, which was created in 1988, before most of its modern users were born?
POSIX might not be the right lower level API, and maybe it can be replaced with something that achieves those details better. But "done bang just do it" is simply not how computers work. Software is how you achieve that, and generally by building simpler abstractions on top of more brittle ones.
"How hard can this be?" is the right question. It's very damn hard.
Now, Vulkan and Metal offer the detailed control for library developers, and everyone else uses some higher-level wrapper.
Does it make sense to split the file system in a similar way? I guess the main challenge is avoiding too many competing wrappers.
I disagree with this; you've already over-complicated it by assuming some kind of Unix-like scatter-everything-all-over-the-file-tree-for-no-good-reason application installation. Just do what MacOS, RiscOS, DOS, etc. did and have applications be a single file (or folder). Then install is just a copy: you can place it on any media you want, carry it around with you, keep multiple versions, etc. It even keeps the abstraction of the application actually existing where it appears to exist.
The problem is not with files. Kernel transitions are expensive, but the problem is not that. It's metadata sync (file inode + directory inode) and so on.
Installs are slow on Windows because of the braindeadness of vendors (Cygwin is fast to install). Package installs are slow on Linux because apt calls out to dpkg, which first reads the package db, a plain-text file with newlines as separators that has to be parsed. Yum is slow because by default it syncs repos on install, etc.
Considering the number of times I type that command, I wouldn't be as quick to throw out the savings here.
But if most people think that this doesn’t apply outside of filesystems, I'm willing to change the title :-)
Linus's point here is that the file system API is already complicated to the point that few use or implement it correctly. Further complicating the API will likely create more problems than it fixes.
> Linus: Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
His point seems to be more about accepting reality and allowing the practice (which in this case is theoretically "badly written" code) to "just work".
As the title stands now:
> Linus: People should aim to make "badly written" code "just work"
One might incorrectly assume Linus is suggesting that defensive programming should be practiced heavily -- but that does not appear to be what he is saying here.
Isn’t that the opposite of what Torvalds is saying? He seems to be arguing for simplicity. APIs that do a bunch of magic for you are the opposite of simple and tend to be mountains of subtle bugs and unexpected behavior.
You're mixing up simplicity of API with simplicity of implementation. More often than not, you can have one but not both.
Modern Linux or Windows does a huge amount of magic when you call a kernel API like open (POSIX) / CreateFile (Windows), yet the API is simple and easy.
You can expose all implementation details; your code will be simple, but hard to build upon. Speaking of data storage, once upon a time I programmed Nintendo consoles. Their file system API was probably very simple for Nintendo to implement, but using it wasn't fun: the SDK documentation specified delays, specified how to deal with corrupt flash memory, etc.
Or you can go the other way: you'll have to do a lot of work handling all the edge cases, and your code will be very complex, but this way you might make a system that's actually useful. Again on data storage, SQL servers have tons of internal complexity, even SQLite does, but the API, SQL, is high-level and easy to use even by non-programmers.
These strike me more as "Linux doesn't believe in actual testing" rather than bugs inherent to being a filesystem.
The reason "badly written" works for open source is that if code is useful, there will be someone in the future who will refactor it. In a proprietary setting that only happens when the fate of the company itself (or a large chunk of the business) is at stake. Otherwise stagnation is king.
In practice, this never happens, because the code will cease to be considered useful first.
That's why this idea is popular, and your comment is controversial. You're taking away a crutch that many of our peers cling desperately to as a way to justify their shortcuts and poor decision making.
There are definitely some serious pitfalls when it comes to Unix file I/O; just look at how long we lived with Postgres and its broken assumptions about fsync behavior on Linux.
You mean its assumption that the API wasn't lying about the integrity of the data it claimed to be writing? Very broken indeed.
I cut my teeth on Perl and actually believed the books when they said you should always use taint mode when touching data that came e.g. over the network. I wrote all my web code under -T (-wT actually but you get the idea).
Then one day I went to drop in some full text search via a then popular library (Plucene). And what do you know, Plucene would not work under taint mode, because it was not developed under taint mode. The maintainer would not accept my simple patch essentially because he did not understand that you could not untaint without a regex somewhere (i.e. he did not know how taint mode worked at a basic level). So I maintained a patched version of that lib privately. Only to later hit the same issue with another popular library.
So I had to stop using taint mode. If it’s just an option — even one aggressively marketed in O’Reilly books back when people actually mostly read O’Reilly books to learn various systems — it’s not going to win much adoption.
I'd think "everybody" knows, certainly I would expect Linus to know, about Postel's law and the subtle ways it ends up causing problems. Whenever you make things easy or difficult, you shape the evolution of how people do those things. There's no simple universal answer to "do we make things easy or difficult". Or "do we blame the user or the toolmaker?"
I don't really understand the psychology of going around arguing one side of an insoluble problem, observing that others are totally convinced of the other side, and occasionally flipping sides, but never acknowledging the meta-problem of integrating both or deciding when to apply each.
Do you say "you're holding it wrong", or do you adjust to what people seem to be like?
I mean, it's insoluble if treated as a single binary decision and not contextual.
Say you have 16GB of RAM. You disable overcommit for your database, and you run it, and it allocates 14GB. But Chrome has overcommit enabled (why are you running Chrome and a database on the same box? Dunno, never got an answer from that engineer...), and it happily allocates 23GB. Later, Chrome uses up all the available memory, and the database is sad. But it's not Chrome's fault, it didn't know it wasn't allowed to use it all up! So then you redesign overcommit so there's a "no-overcommit pool" with priority, and an "overcommit pool". You end up with only 2GB of overcommit pool for the kernel + userland + Chrome, so Chrome just gets killed early and often. Might as well have just disabled overcommit entirely, if random apps are going to die anyway.
(Incidentally, cgroups allow setting soft and hard memory limits, so you can impose particular memory limits on arbitrary applications to have more determinism)
I think that's why overcommit exists. It's generally not easy to re-design all userland applications to deal with difficult memory management problems in a complex system, but it is easy to just lie to them all so they can continue to do stupid things and not crash. If you have a system with only "100% correctly written" software, just disable overcommit (echo 2 > /proc/sys/vm/overcommit_memory).
That's not a bad idea; that's the way any application that cares about not losing data should behave. OOM is just one of the many reasons an application might suddenly die.