Filesystem devs should aim to make “badly written” app code “just work” (2009) (lwn.net)
192 points by pcr910303 | 108 comments





I'd argue that UNIX-type file systems should offer several types of files:

* Unit files. When you create a file and write it, it's not visible for other opens until you close it. If you open a file with O_CREAT|O_WRONLY|O_TRUNC, you create a new file, which replaces the old one on close. In the event of a program or system crash, or exiting via "abort" without closing first, the old file remains. So there's always one completely written file. Creating a unit file is an atomic operation. Most files are unit files. (This isn't original with me; it comes from a distributed UNIX variant developed at UCLA in the 1980s.)

Replacing an existing file currently requires elaborate renaming gyrations, which vary from OS to OS and file system to file system (sketched in the code after this list). At least for Linux, this should Just Work in the normal case.

* Log files. If you open a file with O_APPEND, you can only write at the end. The file system should guarantee that, even after a crash, you get the file as written out to some previous write. If you call "fsync", the recovery guarantee should include everything written up to the "fsync" point. No seeking backwards and overwriting on a log file.

* Temporary files. You can do all the file operations, and the file disappears on a reboot.

* Managed files. These are for databases and such. They have some extra API functions, for async reads and writes and commits. Async I/O should have two callbacks: "the data has been taken and you can now reuse the buffer", and "this write is definitely committed and will survive a system crash". That's what databases really need, and try to fake with "fsync". Only a few programs will use this, but those are important programs.
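
To make the "gyrations" above concrete, here's roughly what atomically replacing a file looks like on Linux today (a sketch using a hypothetical replace_file helper; error handling is abbreviated and the exact fsync requirements still vary by filesystem):

    /* Sketch: the current "replace a file atomically" dance on Linux.
       Assumes tmp and final live in the same directory; error handling abbreviated. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int replace_file(const char *dir, const char *tmp, const char *final,
                     const void *buf, size_t len)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;
        if (write(fd, buf, len) != (ssize_t)len) { close(fd); unlink(tmp); return -1; }
        if (fsync(fd) < 0) { close(fd); unlink(tmp); return -1; }  /* data on disk */
        if (close(fd) < 0) { unlink(tmp); return -1; }
        if (rename(tmp, final) < 0) { unlink(tmp); return -1; }    /* atomic name swap */
        int dfd = open(dir, O_RDONLY | O_DIRECTORY);               /* persist the rename itself */
        if (dfd >= 0) { fsync(dfd); close(dfd); }
        return 0;
    }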


> Unit files. ... you create a new file, which replaces the old one on close.

Classic Mac OS 9 had the PBExchangeFiles call, which did this perfectly. Before the call:

    dirEntryA -> fileContentsA     
    dirEntryB -> fileContentsB     
After the call:

    dirEntryA -> fileContentsB     
    dirEntryB -> fileContentsA     
This meant that the user kept all meta info for files, e.g. tags, window position, custom icons, etc.

So when saving a new document you wrote it to a new hidden temp file, and when everything was written, you called PBExchangeFiles to swap the contents of the old and new files. After this you deleted your temp file which now contained the old document content.

http://mirror.informatimago.com/next/developer.apple.com/doc...


The Common Lisp standard also strongly suggests this behavior for the :if-exists :supersede argument to #'open:

"The existing file is superseded; that is, a new file with the same name as the old one is created. If possible, the implementation should not destroy the old file until the new stream is closed."

http://clhs.lisp.se/Body/f_open.htm

I imagine they lifted this idea from some prior Lisps, which suggests it's a pretty old concept.


Like renameat2 on Linux with RENAME_EXCHANGE?

No, renameat2 moves the meta info (e.g. xattrs), as it’s an atomic “mv a tmp; mv b a; mv tmp b”.

So tags from a will end up on b. If b is a temp file, then the tags will be lost after the temp file is deleted.
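
For reference, the RENAME_EXCHANGE call being discussed looks roughly like this (a sketch; renameat2 needs Linux 3.15+, a filesystem that supports the flag, and glibc 2.28+ for the wrapper -- older setups go through syscall(SYS_renameat2, ...) instead):

    /* Sketch: atomically swap two paths. Note the inodes are exchanged,
       so xattrs and other metadata travel with the contents, not with the
       names -- unlike the classic PBExchangeFiles behavior described above. */
    #define _GNU_SOURCE
    #include <stdio.h>   /* renameat2() and RENAME_EXCHANGE with glibc >= 2.28 */
    #include <fcntl.h>   /* AT_FDCWD */

    int swap_paths(const char *a, const char *b)
    {
        return renameat2(AT_FDCWD, a, AT_FDCWD, b, RENAME_EXCHANGE);
    }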


> * Temporary files. You can do all the file operations, and the file disappears on a reboot.

It would be nice to be able to have a process tree own a temporary file, such that when the last process in the tree exits (not necessarily the process which created the file), the file is automatically deleted, rather than having to wait for the next reboot.


I would prefer it to not be the default; it's sometimes useful to keep temporary files in the event of a process crash, especially on a server that would then immediately restart said process. I had such a case very recently, and was thankful for the existing behavior of tmp files.

Rather than delete the file straight away, put it in a “trash can” or “recycle bin”. A background process deletes files from “trash can” at a later date. It could normally give them a grace period (e.g. 7 days, configurable) but the files could be deleted early if storage space is running low. The same feature could be used to provide undelete for non-temporary files too.

Can't you create a file, unlink it, and then fork as much as necessary? I think the OS will maintain a reference count for that inode and will delete it when all processes close that handle (explicitly or implicitly on exit).

Sadly not, because then the file doesn't have a filename, and there are a lot of things that only work with file names, not descriptors. Even basic stuff like std::fstream needs a filename.

Yes, there are workarounds (library extensions, signals, /proc/fd, etc.), but you'd think a basic function like "delete this file when this process exits" wouldn't be too hard.


This is mostly about defining behavior for the bad cases - program abort and system restart. Those are classically undefined behavior, which causes problems.

It's not that easy to come up with a decent filesystem API.

Consider that filesystem data cannot be synced to underlying storage in such a way that it can be reliably read later, and backups by definition are asynchronous and provide eventual consistency. This means that the best meaningful guarantee an application can get after a machine crash is being able to read data written up to some point in the past, but not after every successful fsync before the crash. Usually even that guarantee is hard to achieve, as bad blocks happen and redundancy is not there. Given all that, it's OK to relax filesystem API behavior to match the actual physical constraints. For example, fsync is only meaningful as an ordering operation; there's no need to actually flush anything to disk immediately on fsync or any other operation.

Next are multi-process and multi-threaded scenarios. Should O_APPEND only work correctly from a single thread? Should each write be atomic, and up to what size (we certainly can't have gigabytes in an atomic append)? Or should there be some synchronization mechanism that blocks others? The same questions apply to temporary files and unit files.

And what to do about bad blocks? Should there be redundancy within a single disk? Should the block device underneath be log-structured storage with block remapping and scrubbing, providing a reliable storage layer to the outside by sacrificing space? Maybe for desktop machines it should, but not for servers, or at least not all servers; they need a different API.

I'm not even touching performance considerations here, which depend a lot on performance-friendly APIs.


Nagle, I believe I last read your comment on this at https://news.ycombinator.com/item?id=13964053; progress has been made.

We have the temporary files you're asking for. https://lwn.net/Articles/619146/

With O_TMPFILE, you can also write new data, and then automatically replace a file on disk.


> With O_TMPFILE, you can also write new data, and then automatically replace a file on disk.

How? Using "linkat"? That's not automatic.
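
For context, the linkat route looks roughly like this (a sketch; note that linkat refuses to replace an existing name, so overwriting still needs a temporary name plus rename -- i.e. it is indeed not automatic):

    /* Sketch: create an unnamed file with O_TMPFILE, write it, then give it
       a name. linkat() fails with EEXIST if newname already exists, so this
       cannot by itself replace a file. Needs Linux 3.11+ and fs support. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int write_then_link(const char *dir, const char *newname,
                        const void *buf, size_t len)
    {
        int fd = open(dir, O_TMPFILE | O_WRONLY, 0644);
        if (fd < 0) return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) < 0) { close(fd); return -1; }

        char proc[64];
        snprintf(proc, sizeof proc, "/proc/self/fd/%d", fd);
        int rc = linkat(AT_FDCWD, proc, AT_FDCWD, newname, AT_SYMLINK_FOLLOW);
        close(fd);
        return rc;
    }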


renameat2 is meant to be atomic. From the man page:

  If newpath already exists, it will be atomically replaced, so that
  there is no point at which another process attempting to access
  newpath will find it missing.  However, there will probably be a
  window in which both oldpath and newpath refer to the file being
  renamed.

Yes, that's one of the many workarounds for not having unit files. There are different workarounds in the Windows world, and the Linux workaround does not work on some file system types.

> Log files.

I wonder if this would ideally also provide a rotation API. Rotating log files in the presence of multiple writers is messy, and maybe shouldn't be reimplemented at the application layer every time.

> Temporary files

Windows is not always a model of filesystem elegance, but it has a "delete on close" flag in their equivalent to open, which makes it go away on the last close (handle can still be duplicated or inherited, so you get some reference counting through that).

Actually I think you can do similar to this on Unix by unlinking after the open, but keeping an fd open for the lifetime of the temp file.
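
A sketch of that Unix pattern, for reference (the name disappears immediately; the data lives until the last descriptor to it is closed, including descriptors inherited across fork or passed to other processes):

    /* Sketch: an "anonymous" temp file via open + immediate unlink. The inode
       is reclaimed automatically once the last fd referring to it is closed. */
    #include <stdlib.h>
    #include <unistd.h>

    int make_scratch_fd(void)
    {
        char path[] = "/tmp/scratchXXXXXX";
        int fd = mkstemp(path);   /* create with a unique name, mode 0600 */
        if (fd < 0) return -1;
        unlink(path);             /* drop the name right away */
        return fd;                /* the fd (and any dup of it) still works */
    }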


> I wonder if this would ideally also provide a rotation API. Rotating log files in the presence of multiple writers is messy, and maybe shouldn't be reimplemented at the application layer every time.

That feature was in UNIVAC EXEC8, now OS-2200. It was called F-cycles.[1] Over half a century old, and still working.

[1] https://public.support.unisys.com/2200/docs/CP18.0/PDF/78307...


Fun fact for those who don't know: Windows already has a database API. I've never used it, but it's called the Extensible Storage Engine (JetCreateDatabase, etc.).

Is this the same storage engine that used to provide the back end for Visual Source Safe and Exchange? Which has a 2GB limit per storage unit and corrupts itself irreparably if you hit the limit?

I don't know!

Windows also now bundles SQLite [1]. So Windows has more than one “database API”.

[1] C:\windows\system32\winsqlite3.dll


How about sandboxed files? I.e., files that can be assigned to a process, with a full ACL per process/application binary. Maybe even checksum the binary before loading it into RAM and granting access.

"Anybody who wants more complex and subtle filesystem interfaces is just crazy. Not only will they never get used, they'll definitely not be stable."

I think there is a more universal truism here - that "complex and subtle" are sources of pain, problems and headaches.

I want to write "cool" and "magical" code as much as the next person, but that's the stuff I look at later and go WTF, because I am no longer in the same state of mind. Clear, simple, straightforward, plain as day is better than anything else.

And if you do have to do something "magical", something "odd" or hard to understand, for the love of god please leave notes explaining what, why and how you did what you did. And if you are replacing a "clean" and "readable" version, leave it there commented out. Sure, it is "in the repo", but almost no one ever looks, and you are just making it harder for me to figure out the original intent of whatever was there.


The more time I spend as a dev, the more I realize that writing clear, simple, straight forward code is actually the greater challenge. When I started progressing beyond learning the basics, the sort of projects I was building were quite simple, so writing the simple code for them came to feel boring. I would read complex codebases and see all the fascinating tricks they employed and wished I was writing code like that. It felt like, "Those are the smart programmers. I should emulate them."

As the complexity of the projects I work on has increased (especially in the time since I became a part of a professional dev team), I've come to realize that most of that tricky, complicated-looking code I had read years earlier was actually kind of the easy way out. When you are just trying to get stuff done, the complexity of your code mirrors the complexity of the project. There's no effort put in (and often no time to do so) to create a smoother interface to shelter the code from the complexity of the task at hand.

It's much more interesting to me now to be faced with a complex task and to figure out how to make the code simple and clear. Elegant, well-designed libraries/APIs belie the challenge in writing them. The code looks so simple that it feels immediately obvious as you're reading it. I've come to realize that how dead simple a code base reads -- so simple anyone could instantly understand it -- is often inversely related to how easy it was to create.

Yes, necessary complexity does exist, and sometimes there's no way around doing something fairly nasty in your code. But as an ideal to strive for, I find 'simple' to be fascinating.


> It felt like, "Those are the smart programmers. I should emulate them."

That’s probably the main reason why so many C++ codebases are unnecessarily complicated. People look at what standard library developers did, and emulate it.

The complexity of the standard library is actually justified: containers need to scale from 1 to 1E9 items and more, algorithms need to work with whatever broken classes users throw at them, the whole standard library needs to be extremely generic.

None of these requirements apply to the code of most applications, yet I have seen many C++ projects where people designed their code the same way the standard library is designed, without thinking about the reasons. The result is template-heavy code which takes a long time to compile and is hard to modify or debug.


Problem is when you have a shitty language like Python that does almost no optimisations so you're forced to write "clever" code if you want it to run reasonably fast.

If you must use Python, then pawn off the important work to C. Half the reason this language exists is easy FFI, and that's how all the popular libraries get enough performance to be usable for non-toy applications.

I agree completely: don't do magic, keep it clean and simple. I disagree with leaving the thing you replaced commented out. It's not just that you can find it "in the repo"; it's that things drift over time. Prose comments have that reputation, but code in comments that isn't being maintained drifts even further. No one even has any intention of maintaining commented code, except if it happens to hit find-and-replace, and even then good luck.

> code in comments that isn't being maintained drifts even further.

This is also true of "it's in the repo", and that doesn't get hit by "find and replace".

I don't think it should be done for EVERY case, rather the cases where clarity is replaced with something that is ambiguous but meets another need. Performance hacks are notorious for being "ugly" and "magical" and having a readable companion piece would make sense.

To clarify, I don't think it replaces "prose" but should exist alongside it, and be mixed into it.


> This is also true of "it's in the repo" and that doesn't get hit by "find and replace".

It's very much not! The thing "in the repo" is of exactly the same vintage as the thing in the next commit. The thing "in the comments" is preserved in amber next to a growing thing that no longer matches it. Inevitably to understand the thing "in the comments" you have to go back to commits of the same age.


To add to my earlier response: if I needed the performance-critical clever solution and felt it needed a human-readable code sidekick, I would include both as real running code alongside each other, both running the same set of tests, and both required to be run by whatever CI/automated tests are used by the project. Comments are for humans to read, not machines.

I like this approach, it would cover 80-90% of the cases where it would provide value!

I am currently trying to wring maximum performance from Cloud TPUs. This comment really resonated with me, because Cloud TPUs are complex, subtle, overly complicated, and have twenty opaque ways that you can ruin your performance.

Compare that with a GPU. It took decades, but (at least for nvidia GPUs) they have finally reached the point where you can be at least reasonably certain that high-performance is the default. Sure, you can ruin your performance by doing some stupid things, but it's at least harder to do those stupid things. The defaults are high-performance.

It's exactly the opposite for Cloud TPUs. By default, it only uses one core. You have to use their TPUEstimator API, which is a byzantine mess. Over the past two weeks, I've spent roughly 3 solid working days of effort solely trying to read and understand (a) what is the TPUEstimator actually doing? (b) why is it supposedly so much faster?

I have some half-hearted justifications -- the answer seems to be "you have to colocate your gradients with the device; you have to scope your tensorflow graph to a specific device; and each TPU core is a separate device." But there are unanswered questions. For example, you're supposed to pass your tensorflow computation to tpu.rewrite(). Yet I've never done that. The defaults seem to just work. So does that mean it does the rewriting for me automatically? Is it being emulated in software, and I'm ruining my performance? Tensorflow, why don't you just crash instead of being so damn slow on TPUs? That would at least let me aim my optimizations!

If only I had low level access to the actual TPU operations, I could just write a compiler that specifically emits instructions to give me the performance I need. But this tensorflow graph abstraction makes everything "easy" yet exponentially more complicated.

Anyway. Yes. More of your mindset, please. Simplicity is such a lovely metric seldom optimized for.

(Anyone who's curious can see a dramatic 11x difference in performance on GPUs vs TPUs in Colab: https://twitter.com/theshawwn/status/1196593451174891520 ... this was very surprising, since Google's marketing would have you believe that TPUs are the bee's knees. Yet by default a TPUv2 is 11x slower than a K80 GPU for some basic matrix multiplications.)


Anecdotally, I have seen 50-100x (seriously) increases when moving from a V100 to TPUv3. To be fair, some of that is batch size increase. This was using the models in https://github.com/tensorflow/tpu/tree/master/models/officia.... On the other hand, a lot of those models are broken in some way and need fixing before running, caveat emptor.

Can you be more specific? I have seen many such claims, and every time I try to reproduce the results, there always seems to be some catch.

We're trying to use TPUs to fine tune GPT-2 1.5B. The model takes up 5.8GB memory, which is well over half of a TPUv2 core (8GB). It always OOMs when I try to do a training step, due to the gradient calculations requiring memory. It even OOMs on a TPUv3, which has 16GB per core. I've tried using bfloat16 (which ought to cut memory usage in half) and using Ada optimizer (which should be no more expensive than plain old SGD). Yet if I colocate the gradients to the same core as the model, I always OOM. (Colocation just means "don't use any memory except the memory physically on this one core.") With colocation off, I don't OOM, and I do see some speed gains using all 8 cores. But it's no more than a factor of 2x, and in fact closer to like 1.15x (i.e. it's roughly equivalent to just using larger batch sizes on a single core). And I don't understand why I'd OOM in the TPUv3 case; even with float32, the model is only using 5.8GB out of 16GB. Are gradient computations really taking up more than 10GB for the optimizer? (That leads to https://github.com/cybertronai/gradient-checkpointing and such, but I haven't tried it yet.)

If I try the same experiment with a much smaller model (117M, or about 13x smaller), I can successfully colocate the gradients onto the same core as the model. And when I use all eight cores, I'm able to get 1225 tokens/sec (roughly 1 example per second, since 1 example = 1024 tokens for GPT-2), vs the standard case of around 400 tokens/sec when using only one core. But that's still "only" a 3x speedup.

So when I see "50-100x increases," alarm bells start going off. I'm missing something fundamental here. Either you are getting 100x speedups, or I am somehow missing something fundamental.

People have even started asking me for answers regarding the TPU case, and I'm forced to be like "Yeah! I expected TPUs to be so much faster too. Everyone says they're getting 100x speed gains. Yet we're 11x slower than the GPU case, and here's a notebook showing a 11x slowdown."

https://github.com/shawwn/gpt-2/issues/5

I'm suspecting that memory bandwidth is the bottleneck here for large models. This paper even pretty much says "GPUs are more flexible and faster when memory bandwidth is an issue": https://twitter.com/mosicr/status/1196749286815481856

The closest I've come to finding an actual example of a speedup to aim for is this: https://github.com/imcaspar/gpt2-ml

They used a TPUv3-512 pod to train a GPT-2 1.5B model to 99k steps in 50 hours. If you work out the math, that's about 1 example per second. We're getting about 0.08 examples per second on a single TPUv2 core. So yes, it's a big speedup (12.5x) but certainly not 50-100x. Yet it has 64x the cores as my TPUv2; why isn't it 64x faster? And we're only using 1 core; why not 512x faster?

I have also tried this on TPUv3-8, and we're getting about the same examples/sec, further increasing the plausibility of the theory that memory bandwidth is the bottleneck.


I don't think you can judge code just by how readable it is, and you can't frown on what seems cool and magical all the time. Sometimes "clean and readable" means that complexity is pushed outwards onto dependencies or you've simply created an inelegant solution that doesn't consider the whole problem domain.

A perfect C++ program that outperforms a simple-looking Python program which does the same thing isn't worse because it's harder to read or write, or better because it's faster; it's just different.

If you need to optimize for readability, so be it. If you need to optimize for performance, so be it.

If you write code executed trillions of times in a trillion places, where power use and the expense of hardware is dependent on the performance of the software, I think it's okay for it to be as complex as it needs to be to be very optimal, at the expense of readability and other things. It should be well annotated, sure. That's always a reasonable thing to expect except in rare, extreme cases where you don't even have the time to type out the comments.

It's important to weigh the relative importance of simplicity vs speed vs memory frugality vs portability vs maintenance difficulty (dependencies and tool choice factor in here) and so on to determine the best approach.

Dogma about how best to code in a general sense is only helpful if you don't want to think or talk about what's the best fit.


The clean and readable version can be kept, but not in comments. Why not keep it as a reference implementation, and test the new thing against it?

You really like leaving in commented code? I used to prefer that too but then at work we had a policy of taking out unused code. I actually think it's cleaner so I've been doing it in my own code and rarely regret deleting something. But when code (my own or someone else's) is more chaotic I do find it to be useful to leave bits and pieces lying around.

I think this is about the rare cases where you have multiple competing implementations that become better or worse depending on small changes elsewhere or when a simple initial implementation can serve as documentation for a carefully optimized subsequent implementation.

I tend to think there is always a better way, say creating a module that contains both implementations as alternatives, but if this is the worst wart on the codebase it is probably a good codebase.

Just leaving giant blocks of commented out, trivial to recreate code as the system evolves is annoying. But I do think that the rule of not leaving old code in as comments is one where the bar to breaking it shouldn't be too terribly high. Personally I've gone overboard following it in the past, and all it did was hurt productivity. A few comments blocks aren't the end of the world.


I would go for a comment like "a simpler naive implementation is in commit A0348C. It ran approx. 2.5x slower in 2019."

I would do similar but include the commit timestamp and/or words from the subject to help locate the code if the commit id can't be retrieved.

Rebasing and other means of moving patches (email etc, adding signed-off-by) can cause the original commit id to become invalid, and eventually unretrievable.


The problem is that everybody agrees on that "universal truism" but nobody can agree on what it means. "Clear, simple, straight forward, plain as day" is up to the reader. They're also a moving target.

To some people, C is simple. To other people, C is full of "complex and subtle" sources of pain.

To some people, a VM with a GC and a JIT is "complex and subtle". To other people, using a HLL with those features enables them to write programs which are much simpler and clearer.

These positions also change over time. A JIT is a fairly standard technique in 2019, but in 1973 it was still very much in the research phase. Even the idea of writing a kernel in such a HLL as C was once revolutionary.

I have an SDET friend who calls these types of universal truisms "apple pie", as in, you hold a meeting and say "Apple pie is good, right?" and everybody nods and says "Yes!" and then goes back to work and absolutely nothing was learned or decided.


I like your description because it (accidentally?) likens the state of mind of wanting to write "cool" and "magical" code with being high on drugs. In my career I've made that connection many times. This team is not interested in some engineer having an "awesome trip" through arcane abstractions, we're interested in shipping stuff that works.

And that’s why reading Golang is much more straightforward than Rust which is way more verbose/flexible.

> The undeniable FACT that people don't tend to check errors from close() should, for example, mean that delayed allocation must still track disk full conditions, for example. If your filesystem returns ENOSPC at close() rather than at write(), you just lost error coverage for disk full cases from 90% of all apps. It's that simple.

Most programmers' abstraction of a computer system is synchronous and consistent. If you make the simple synchronous cases do the wrong thing, like write returning success on disk full when the write did not actually happen, you are going to break people’s code.
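
For illustration, here is roughly what "checking errors properly" means under delayed allocation (a sketch with a hypothetical careful_write helper; the point of the quote is that almost nobody writes the close() check, which is why ENOSPC should be reported at write() time):

    /* Sketch: with delayed allocation, a disk-full error can surface at
       write(), at fsync(), or even at close(). Most real code only checks
       the first of these. */
    #include <unistd.h>

    int careful_write(int fd, const void *buf, size_t len)
    {
        if (write(fd, buf, len) != (ssize_t)len)
            return -1;               /* the check everyone writes        */
        if (fsync(fd) < 0)
            return -1;               /* where ENOSPC or EIO may show up  */
        if (close(fd) < 0)
            return -1;               /* ...or even here                  */
        return 0;
    }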

This also explains why, when you choose a database, a relational DB with strong consistency and linearizability should be your default. Going for eventual consistency in the database layer and expecting the application logic to deal with it will lead to grief in many, many cases.


This could be summarised as: don't assume the responsibility of doing something correctly if you don't have to and whoever would otherwise have to do it is more competent than you.

I think he's got a point.

As a developer I find myself in a different scenario. I'm usually trying to find out what exactly the 100% guaranteed way to do something is. Instead, I find incomplete documentation and different people with different opinions on what the guarantees are, and most people writing bad code that they assume will usually work.

Just modifying a file in an atomic way requires a complicated dance of multiple files and multiple syncs and a rarely tested cleanup routine the next time the file is opened. No one does this.

I don't know what the solution is.


I do, but it's not a popular opinion.

POSIX, and by extension the classic 1960s-1980s era UNIX way of doing things, just needs to die a long overdue death.

This stuff was designed at a time when every CPU instruction mattered, everything was optimised to death for frugality, and commands were abbreviated from "copy" to "cp" because ermahgerd two bytes is a huge saving! That mentality got us Y2K. This was an era where latency was not the bottleneck; CPU cycles and memory bytes were.

A lot of stuff in filesystems is just plain stupid. For example, why do applications install their files. one. at. a. time? Like... what the fuck? How does it make any sense for an application to be partially installed? Who actually codes their application with 500 modules and dynamic libraries to be able to handle the scenario where one of them is inaccessible due to an ACL or a mismatched version because of an overwrite by something or someone else? NOBODY, that's who. Meanwhile, I can make a cup of tea while Adobe Lightroom launches on an SSD drive because it is 99% OS API overhead and 1% usermode action.

This is why Docker is popular. Not because Docker is good, but because OS APIs are retarded.

Every application install should be a union fs. This union fs should be entirely user-mode, so that if an application has 10,000 files, it doesn't take 10,000 round-trips to the OS kernel with the Intel mitigations, context switches, and cache flushes that all brings with it.

Copying a file shouldn't require a user-mode buffer to feed the data through, forcing it to come down WAN links just to go back up the same WAN link again on the way out.

Overwriting a file shouldn't require more than a single API call, because it's nearly 2020, and we should have long since realised that kernel transitions are expensive, so we should optimise to minimise the number of round-trips. open(), write(), write(), write(), flush(), sync(), close(), poke(), prod(), jesusfuckingchrist(). Just take a buffer bigger than 4KB, or better yet, standardise an API to take a stream from user mode.

Just take a buffer and a filename, and atomically replace. Done. Bang. No lost data, no torn writes, just DO IT. How hard can this be? Is it impossible to do this? Are we forever stuck with POSIX, which was created in 1988, before most of its modern users were born?


Speaking as someone whose job is primarily building high-level APIs, this is unachievable. Those high-level APIs cover the work the vast majority of people want or need to do, but some tasks require greater control over the details those APIs handle, and every task still requires those details to be right underneath.

POSIX might not be the right lower level API, and maybe it can be replaced with something that achieves those details better. But "done bang just do it" is simply not how computers work. Software is how you achieve that, and generally by building simpler abstractions on top of more brittle ones.

"How hard can this be?" is the right question. It's very damn hard.


Your comment reminds me of the situation of game engines and 3D graphics APIs a few years ago before Vulkan and Metal were released. Too high-level for developers who want control (and understand how the hardware actually works), but too low-level for developers who want to minimize complexity.

Now, Vulkan and Metal offer the detailed control for library developers, and everyone else uses some higher-level wrapper.

Does it make sense to split the file system in a similar way? I guess the main challenge is avoiding too many competing wrappers.


The problem with Vulkan is even graphics driver developers struggle to use it correctly.

> Every application install should be a union fs. This union fs should be entirely user-mode, so that if an application has 10,000 files, it doesn't take 10,000 round-trips to the OS kernel with the Intel mitigations, context switches, and cache flushes that all brings with it.

I disagree with this, as you've already over-complicated it by assuming some kind of unix-like scatter-everything-all-over-the-file-tree-for-no-good-reason application installation. Just do what MacOS, RiscOS, DOS, etc. did and have applications be a single file (or folder). Then install is just a copy; you can place it on any media you want, carry it around with you, keep multiple versions, etc. It even keeps the abstraction of the application actually existing where it appears to exist.


This already exists. O_DIRECT exists, and now there's io_uring too (plus libaio, which uses io_submit, the old async API).

The problem is not with files. Kernel transitions are expensive, but the problem is not that. It's metadata sync (file inode + directory inode) and so on.

Install is slow on Windows because of the braindeadness of vendors (Cygwin is fast to install). Package stuff is slow on Linux because apt calls out to dpkg, which first reads the package db, which is a plain-text file with newlines as separators and has to be parsed. Yum is slow because by default it syncs repos on install, etc.


Windows filesystem performance is also much poorer than Linux for lots of small files.

I give it 2-3 more years of people saying "this is impossible" before Lennart does it and makes everyone mad

Nobody said it was impossible

> commands were abbreviated from "copy" to "cp" because ermahgerd two bytes is a huge saving

Considering the number of times I type that command, I wouldn't be as quick to throw out the savings here.


With respect that wasn’t solely a design mentality. Resource limitations were a factor as well.

Isn't all of what you are describing already a product built on top of the APIs you are describing?

As a developer, I find that no one seems to know what the best solution is. But if they're not the one implementing it, then suddenly everyone's an expert and has an opinion.

The thread title omits a crucial word: "Filesystem" (as in "filesystem people", not just "people"). The point he is making is that filesystems are supposed to be utterly reliable; applications should not have to take extreme precautions to avoid having the filesystem lose their data. And the fact that practically nobody actually takes any precautions, let alone extreme ones, is strong evidence that programmers do in fact expect filesystems to be that reliable.

“File system people should make badly written application code just work.”

That makes much more sense. And it turns out to be the opposite of what I thought when I read the title. I had initially thought - just write bad code whose only merit is that it works.

I personally thought that this opinion is applicable not only to file systems but to all ‘base’ systems that can be built upon. Hence the omission of the word ‘filesystem’.

But if most people think that this doesn’t apply outside of FS, I’m willing to change the title :-)


It applies broadly to people building APIs, but your title reads as though Linus is giving general coding advice, saying not to be fussy about code quality but focus on correctness. At least one comment here is replying to that notion.

More to the point he's saying crappy code that uses the filesystem should just work if possible. Don't make it worse than it is. And don't expect anyone's going to fix their code to use your new and improved API. Because they won't.

The title should be changed, he's saying literally the opposite of what the title implies.

Kyle wants to add a new API called barrier() which will improve consistency and remove the need for fsync.

Linus's point here is that the file system API is already complicated to the point that few use or implement it correctly. Further complicating the API will likely create more problems than it fixes.


Suggestion: I think the title should be:

> Linus: Theory and practice sometimes clash. And when that happens, theory loses. Every single time.

His point seems to be more about accepting reality and allowing the practice (which in this case is theoretically "badly written" code) to "just work".

As the title stands now:

> Linus: People should aim to make "badly written" code "just work"

One might incorrectly assume Linus is suggesting that defensive programming should be practiced heavily -- but that does not appear to be what he is saying here.


Or perhaps “Filesystems should aim…”. Which is what he actually seemed to mean.

Yeah, that seems a bit more precise!

APIs you provide to consumers should aim to make their badly written code just work. That's what Linus said.

Yeah. So many API writers aim to force clients to do all the heavy lifting. The whole point of a good API is that it reduces heavy lifting. Anyone can write pass-through APIs that don’t do anything.

> The whole point of a good API is that it reduces heavy lifting.

Isn’t that the opposite of what Torvalds is saying? He seems to be arguing for simplicity. APIs that do a bunch of magic for you are the opposite of simple and tend to be mountains of subtle bugs and unexpected behavior.


> APIs that do a bunch of magic for you are the opposite of simple

You're mixing up simplicity of API with simplicity of implementation. More often than not, you can have one but not both.

Modern Linux or Windows does a huge amount of magic when you call a kernel API like open (POSIX) / CreateFile (Windows), yet the API is simple and easy.

You can expose all implementation details; your code will be simple, but hard to build upon. Speaking of data storage, once upon a time I programmed Nintendo consoles. Their file system API was probably very simple for Nintendo to implement, but using it wasn't fun: the SDK documentation specified delays, specified how to deal with corrupt flash memory, etc.

Or you can go the other way: you'll have to do a lot of work handling all the edge cases, and your code will be very complex, but this way you might make a system that's actually useful. Speaking of data storage, SQL servers have tons of internal complexity, even SQLite does, but the API, SQL, is high level and easy to use even by non-programmers.


I think that lies in the art of developing the API in the first place. It should give you enough primitives to get the job done, balancing the responsibilities it advertises against the cognitive load on the developer to use it correctly.

Disagree in this case, it's about exposing a simple abstraction, which may mean a simple implementation, or may mean a complex one, depending on the impedance mismatch with what's going on under the hood.

To put this more generally: a properly designed API makes it easy and natural to do things right, and difficult (but not impossible) to do things wrong.

I wonder if ZFS had these issues in the same timeframe?

These strike me more as "Linux doesn't believe in actual testing" rather than inherent bugs because it's a filesystem.


Is there a point though, when badly written code becomes so hard to maintain and improve, that people would just avoid doing that altogether?

The reason "badly written" works for Open Source is that if the code is useful, there will be someone in the future who will refactor it. In a proprietary setting that only happens when the fate of the company itself (or a large chunk of the business) is at stake. Otherwise stagnation is king.


> there will be someone in the future who will refactor it

maybe abstractly

in practice, this never happens because the code will cease to be considered useful first


It's easy to pretend that "badly written" code is intentional or desirable, because then when we write it, we can excuse ourselves by saying, "BUT IT WORKS!"

That's why this idea is popular, and your comment is controversial. You're taking away a crutch that many of our peers cling desperately to as a way to justify their shortcuts and poor decision making.


Slightly OT: is there any new emerging tech for Linux file systems that aren’t ext4/xfs/zfs? I was surprised the other day by those being the options for / on centos.

Btrfs was included as experimental in CentOS 6 and 7, but removed for CentOS 8.

bcachefs looks pretty cool but who knows how long it will take to get upstream.

Missing the W. Richard Stevens books right about now...

(2009)

I feel like Linus is basically just arguing in favor of the principle of least surprise.

There are definitely some serious pitfalls when it comes to Unix file I/O; just look at how long we lived with Postgres and its broken assumptions surrounding fsync behavior on Linux.


> how long we lived with Postgres and its broken assumptions surrounding fsync behavior on Linux.

You mean its assumption that the API wasn't lying about the integrity of the data it claimed to be writing? Very broken indeed.


I think the message is that we should make the basic stuff "just work" before we start adding the complexity of more bells and whistles.

More features don't have to be more complex or "subtle" (I wish).

At the end of the day, the client/users/etc just want it to work. They don't care about the quality of the code.

Told you: Perl is built by the gods.

Perl has subtle complexity just like everything else. But it also has taint mode, which depending on the way you squint is either a great example of "security that just works", or a great example of "you must use X method to get good security".

I vote for the second option :-) IMO taint mode is a great example of what Linus is talking about. Too subtle.

I cut my teeth on Perl and actually believed the books when they said you should always use taint mode when touching data that came e.g. over the network. I wrote all my web code under -T (-wT actually but you get the idea).

Then one day I went to drop in some full text search via a then popular library (Plucene). And what do you know, Plucene would not work under taint mode, because it was not developed under taint mode. The maintainer would not accept my simple patch essentially because he did not understand that you could not untaint without a regex somewhere (i.e. he did not know how taint mode worked at a basic level). So I maintained a patched version of that lib privately. Only to later hit the same issue with another popular library.

So I had to stop using taint mode. If it’s just an option — even one aggressively marketed in O’Reilly books back when people actually mostly read O’Reilly books to learn various systems — it’s not going to win much adoption.


My dudes: it's Saturday night, and this was just a subtle ribbing on a programming language. Save your downvotes.

Quite a bizarre read considering Linux's decision to e.g. overcommit memory makes 100% correctly written code break nondeterministically...

It seems to me that a basic requirement for not being an idiot is that you recognize hard problems as being hard, and I see a lot of supposedly smart people declaring hard problems are easy because they're just ignoring tradeoffs or aspects of an approach that undermine it.

I'd think "everybody" knows, certainly I would expect Linus to know, about Postel's law and the subtle ways it ends up causing problems. Whenever you make things easy or difficult, you shape the evolution of how people do those things. There's no simple universal answer to "do we make things easy or difficult". Or "do we blame the user or the toolmaker?"

I don't really understand the psychology of going around arguing one side of an insoluble problem, observing that others are totally convinced of the other side, and occasionally flipping sides, but never acknowledging the meta-problem of integrating both or deciding when to apply each.


What do you think the "insoluble" problem is here?

Deciding whether and how to influence the way people use a tool or product.

Do you say "you're holding it wrong", or do you adjust to what people seem to be like?

I mean, it's insoluble if treated as a single binary decision and not contextual.


The difference is that people do tend to check return values of write() (in cases where it actually matters), but not check the return value of malloc(). Since Linux must work with the code which exists, and since most programs tend to malloc() memory but never use it, the overcommitting practice begins to make sense. A twisted and ugly sense, but that is the world we live in.
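
To illustrate (a sketch; exact behavior depends on the vm.overcommit_memory setting and the machine): under the default heuristic, allocations that are never touched cost almost nothing, so a process can be granted far more memory than exists, and the failure only materializes later, when the pages are actually faulted in and the OOM killer steps in.

    /* Sketch: untouched allocations under Linux overcommit. Each malloc()
       typically succeeds because no pages are committed until touched;
       faulting them in later is what can trigger the OOM killer. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t chunk = (size_t)1 << 30;     /* 1 GiB per allocation */
        size_t granted = 0;
        for (int i = 0; i < 256; i++) {     /* ask for up to 256 GiB */
            if (malloc(chunk) == NULL)
                break;
            granted += chunk;               /* never written to, so "free" */
        }
        printf("granted but untouched: %zu GiB\n", granted >> 30);
        return 0;
    }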

That's at best a reason to allow over-commitment to be enabled or disabled on a per-program basis, not a reason to force it on the whole system so that it becomes outright impossible to write correct code.

I think this is why it's not per-program:

Say you have 16GB of RAM. You disable overcommit for your database, and you run it, and it allocates 14GB. But Chrome has overcommit enabled (why are you running Chrome and a database on the same box? Dunno, never got an answer from that engineer...), and it happily allocates 23GB. Later, Chrome uses up all the available memory, and the database is sad. But it's not Chrome's fault, it didn't know it wasn't allowed to use it all up! So then you redesign overcommit so there's a "no-overcommit pool" with priority, and an "overcommit pool". You end up with only 2GB of overcommit pool for the kernel + userland + Chrome, so Chrome just gets killed early and often. Might as well have just disabled overcommit entirely, if random apps are going to die anyway.

(Incidentally, cgroups allow setting soft and hard memory limits, so you can impose particular memory limits on arbitrary applications to have more determinism)


I think the difference between "don't implement workarounds" and "use overcommit" is that the kernel can try to be clever, but userland should not have to be clever. The kernel is supposed to just make things work for userland.

I think that's why overcommit exists. It's generally not easy to re-design all userland applications to deal with difficult memory management problems in a complex system, but it is easy to just lie to them all so they can continue to do stupid things and not crash. If you have a system with only "100% correctly written" software, just disable overcommit (echo 2 > /proc/sys/vm/overcommit_memory).


The problem is that userland has to be really clever because of overcommit... that's what the whole thread is about! The reason people want complex things like fsync and barriers is that the OOM killer has normalized the bad idea that applications should behave well when suddenly SIGKILL'd, and this is way harder than looking at the return value from malloc. When a program I write has elevated permissions, which fortunately is often, I will always write -1000 into my oom_score_adj, but not everyone has this luxury.
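
For reference, the oom_score_adj trick mentioned here is just a write to a procfs file (a sketch; -1000 exempts the process from the OOM killer, and lowering the score needs CAP_SYS_RESOURCE or root):

    /* Sketch: opt this process out of the OOM killer by writing -1000 to
       /proc/self/oom_score_adj (privileged; values range -1000..1000). */
    #include <stdio.h>

    int exempt_from_oom_killer(void)
    {
        FILE *f = fopen("/proc/self/oom_score_adj", "w");
        if (f == NULL) return -1;
        int ok = fprintf(f, "-1000\n") > 0;
        ok = (fclose(f) == 0) && ok;
        return ok ? 0 : -1;
    }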

You don't need fsync if you only care about your user process being killed. It protects only against power loss/kernel crash; the filesystem is not running as a user process, so it can't be killed.

> bad idea that applications should behave well when suddenly SIGKILL'd,

That's not a bad idea; that's the way any application that cares about not losing data should behave. OOM is just one of the many reasons an application might suddenly die.



