In the mid-nineties I worked at a research institute. There was a large shared Novell drive which was always on the verge of being full. Almost every day we were asked to clean up our files as much as possible. There were no disk quotas for some reason.
One day I was working with a colleague, and when the fileserver filled up he went to a project folder and removed a file called balloon.txt, which immediately freed up a few percent of the disk space.
It turned out that a number of people, as soon as the disk had some free space, created large files in order to reserve that free space for themselves. About half the capacity of the fileserver was taken up by balloon.txt files.
A perfect example of the tragedy of the commons. If individuals don't create these balloon files they can't count on the file server having space when they need it, yet by everyone creating balloon files the collective action deprives the shared resource of its main function.
This is similar to how some government agencies retain their budgets.
At the end of the budget period they've only spent 80% of their allocated budget, so they throw out a bunch of perfectly good equipment/furniture/etc. and order new stuff so that their budget doesn't get cut the following year, rather than accepting that maybe they were over-budgeted to begin with.
Rinse, repeat, thus continuing the cycle of wasting X% of the budget every year.
I think the problem is that you do not need 100% of your budget every year, but getting it back when you do need it is much harder than keeping it in the first place.
Yep! The problem happens when you divide the safety buffer up in the first place. Safety buffers demand to be shared, when one part does not use all of its safety margin you want to transfer that to another system.
Another surprising place where this happens is project scheduling. We budget time for each individual step of a project based on our guess of a 90% or 95% success rate. Then our "old-timers' experience" kicks in and we double or triple the time for all the steps together. Then our boss adds 50% before giving the estimate to their boss, which sounds gratuitous but it is to protect you: their boss looks at how grotesquely long the estimate is and barks out a cut of 20%, so the overall effect of those two is (3/2) × (4/5) = 1.2, and your boss still netted you a 20% buffer while making the skip-level feel very productive and important.
Say that going from 50% confidence to 95% confidence gives you 30% more time as safety buffer, you only double the estimate, and the work you missed in your initial assessment is, generously, a third of the project rather than half. Then the project properly measured actually takes 1.5× the time, while together you have budgeted 1.3 × 2 × 1.2 = 3.12× the time. The total project deadline is more than half composed of safety buffer. And we still consistently overrun!
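If you want to see the compounding written out, here's a throwaway snippet (the factors are just the illustrative numbers from above, nothing measured):

```python
# Illustrative numbers from the comment above, not real data.
confidence_buffer = 1.3   # going from 50% to 95% confidence
old_timer_factor = 2.0    # "double the estimate"
boss_factor = 1.5         # boss adds 50%
skip_level_cut = 0.8      # skip-level barks out a 20% cut

budgeted = confidence_buffer * old_timer_factor * boss_factor * skip_level_cut
actual = 1.5              # the missed work turns out to be ~a third of the project

print(f"budgeted {budgeted:.2f}x, actually needed {actual:.2f}x")
print(f"share of the deadline that is pure buffer: {1 - actual / budgeted:.0%}")
# budgeted 3.12x, actually needed 1.50x -> ~52% of the deadline is buffer
```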
But if Alice needs to work on some step after Bob, and Bob finishes early, when does Alice start on it? Usually not when Bob finishes. Alice has been told that Bob has until X deadline to complete, and has scheduled herself with other tasks until X. Bob says "I got done early!" and Alice says "that's great, I'm still working on other things but I will pick my tasks up right on time." Bob's safety buffer gets wasted. This does not always cause any impact to the deadline, but it does for the important steps.
Of course, if you are a web developer you already know this intuitively because you work on servers, and you don't run your servers (Alice, for example) at 100% load, because if you do then you can't respond to new requests (Bob's completion event) with low latency. It's worth asking: in an efficient workplace, how much of the time are you not working, so that you have the excess capacity to operate efficiently?
Have you ever thought of just accepting that you can't predict how long a project will take to complete?
It's a revelation. You get to have some hard conversations with other managers. But in the end everyone finds it easier to deal with "it'll be ready when it's ready" rather than endless missed deadlines and overruns.
In my experience people demand fantasies, and will fight tooth and nail any encroachment on them by the reality that things never happen as planned. Although when I say this, I am thinking of estimates of a year and above.
When people who are experts on the topic evaluate the work needed over a period of say 3 months, even in something as notoriously hard to plan as video game production, it can hold. This entails being willing to adjust scope and resources though, when planning, in order to ensure the objectives are likely to be met.
This has never worked in the 15+ years I've been working. The people who did try it went out the door very soon once management realised some other person could tell them a date and they could plan their business around that date, even though that date got missed half the time anyhow.
Yeah, it's tough to get the point across. Worth it if you can, though, for everyone: no-one enjoys rescheduling everything because the deadline was missed again.
I've never understood this about budgeting. So you allocate. A budget. These are funds YOU ARE PLANNING TO SPEND! So, OK, you DON'T spend them this year. Why the fuck don't you get to SAVE THAT MONEY!? No, instead you are punished for not spending it all, and you cannot create a realistic budget for next year - why the hell not!?
Sorry, but this frustrates the hell out of me! What am I missing here? What arcane bit of finance lore leads us down this path? Am I just hopelessly naive? Is saving money such a bad thing!? I just don't get it...
Your frustration may come from a good place, believing gov and big orgs want to be efficient. They don't and they don't have to. They are super efficient when it comes to taxes though. At least in my country, every department is trash except the revenue service which is so forward thinking and effective, they put private business to shame.
But I'd say it's not a feature of governments only; any sufficiently big organisation which centralises power ends up being like this. That's why large corporates need nimble startups to innovate: startups either innovate or die.
The government is just the largest example of this phenomenon - and they don't have any analogue for startups, they're just doomed to grow larger and larger over centuries until they collapse.
Look at the USA, once the perfect minarchist experiment and now the largest employer in the world.
Not here in the US. The IRS has been crippled by previous Republican administrations to the point of uselessness, as another method of giving free unlimited tax breaks to the millionaires and billionaires that bought them into office.
But god help the poor people who file simple tax returns that can be easily audited.
If you don't reduce your budget next year, what does "save that money" mean to a department in a company? If this year I have 105% of last year's budget (because they always go up), what am I supposed to do with the 20% surplus from last year? Most companies wouldn't even have a place in your cost center to track a surplus, it's such a foreign concept.
Zero based budgeting is one answer to the moral hazards of either over or under-estimating your budget on purpose. If each year you start with a blank spreadsheet and then add (with justification) expenses for the year, it avoids some of the pitfalls. Not a panacea however.
Just brainstorming, but maybe the irony is that your scenario somehow has even worse incentives? For example, building up that rollover number would gamify thriftiness greatly exceeding the typical "oops didn't spend it all," and think about the consequences of when you DIY something you're not the best at instead of hiring the pros.
1) separation of duty: you might not be the best department to invest surplus
2) cost effectiveness: if you're operating with a deficit, as is generally the case with governments these days, this money is not free, so it could effectively be cheaper to give it back and re-borrow it when you actually need it
> 1) separation of duty: you might not be the best department to invest surplus
But with this reasoning there is no surplus, because departments will spend their money at all cost.
> 2) cost effectiveness: if you're operating with a deficit, as is generally the case with governments these days, this money is not free, so it could effectively be cheaper to give it back and re-borrow it when you actually need it
That's totally fine: when GP said “Save the money” they didn't mean “in their own bank account”. It just means the top management owes them this money when they need it later.
Anecdote: I'm currently working on a project started in an emergency earlier this month, which must be done before the end of the month (because it's the end of the accounting year at this company) for this exact reason. And this project is overpriced by a factor close to two, because this money really had to be spent!
Top management doesn’t “owe” them any money when they need it later. Say you budget $100 for dinner tonight and you go out and it costs $75. Do you owe the restaurant $25? While certainly some people might roll the $25 into the next day’s meals, some people might allocate that $25 to another cost center like buying a new car.
Budgets are meant to estimate costs and manage cash flow. From a greedy team perspective it’s best (and self interested) to try to game the system as much as possible so you get the largest share of the pie. From the organizational perspective it’s best to reallocate capital efficiently, especially if a team consistently over budgets.
> Say you budget $100 for dinner tonight and you go out and it costs $75. Do you owe the restaurant $25?
No, but if you accurately forecast that dinner will cost $100 on average, and this time it only happened to cost $75, you should put most of the savings aside for the other times when it will cost $125 and not reallocate it to be spent on something else.
Consistent over-budgeting is still an issue which would need to be addressed, of course, but a system where any annual cost underrun is treated as over-budgeting and punished by reallocating that part of the budget to other groups ignores the inevitable presence of risk in the budget forecast.
We’re arguing about different things, it appears. This thread started with someone saying that a team coming in under budget is “owed” that money in the future by management. I said this isn’t so and that it’s a self-centered and myopic viewpoint. You are talking about punishment and reallocation, presumably by reducing the budget the next cycle. I’m not in favor of that unless it’s clear that the team is consistently over-budgeting.
For example, if a team says they need $100 a year and comes in at $90 then I don’t think next year’s budget should be $110 while some people in this thread think it should be. That makes no sense. Neither do I think the budget should be cut to $90. Unless something has changed, the budget should stay the same.
Your point about average cost just means that you’re budgeting on the wrong timeframe. If you estimate your average dinner is $100 but you’re spending $75 most of the time except for one huge dinner every month then you should be budgeting $75 for dinner and then budget separately for one large dinner a month. Similarly, if a team says they need $10MM a year but half of that is them trying to amortize a $25MM cost over 5 years then they are budgeting incorrectly. Their budget should be $5MM with a $25MM side fund contributed to on a risk adjusted basis.
The worst case scenario is the team budgeting $10MM when they only need $5MM and losing control of their budget so that when the real charge comes due they’re fucked because they’ve been spending $10MM for the past 5 years without realizing the fixed charge is coming or, worse, realizing the fixed charge is coming but just ignoring it so they can buy new office furniture and exhaust their budget this year selfishly.
> For example, if a team says they need $100 a year and comes in at $90 then I don’t think next year’s budget should be $110 while some people in this thread think it should be.
IMHO it depends on why the expenses were less than the budget. If it's a matter of probability or essential uncertainty then the savings should be set aside for other occasions where luck isn't as favorable. If the department realized cost savings by improving business practices then most or all of the savings should stay with the department to be invested in future improvements (a one-time carry-over into the next budget period) and/or distributed as reward to those responsible for the improvements, as an incentive to continue making such improvements. If costs were lower because the department didn't accomplish everything they set out to do then that might be a justification for reallocating their budget, and/or implementing more drastic changes to get them back on track.
> Your point about average cost just means that you’re budgeting on the wrong timeframe.
The timeframe for the budget would generally be predetermined (e.g. one fiscal year) and not set by the department itself.
> If you estimate your average dinner is $100 but you’re spending $75 most of the time except for one huge dinner every month then you should be budgeting $75 for dinner and then budget separately for one large dinner a month.
Sure, but I was referring to probabilistic variation due to uncertainty in the forecast, not a predictable mix of large and small expenses. And the "dinners" in this analogy would be once per budget period (i.e. annual for most organizations), not frequent enough to average out.
I think we agree in general and are just quibbling about the details of how to budget correctly (timeframes, line items, etc.). Most of the issues that come up with these stories of people getting their budgets slashed if they don't spend enough or having to buy a bunch of bullshit at the end of the year are just a result of poor budgeting at some point which has been allowed to continue.
It is about opportunity costs. The budget you did not spend could have been spent elsewhere in the meantime and since it didn't get invested elsewhere, it's not a savings, that's a net loss, because of course anything less than 100% utilisation of 100% of "resources", 100% of the time is a loss .. Or some such.
Years ago I worked as a research assistant for a university. One day, my boss (a professor) pulled me aside for an impromptu meeting. "I have $5000 left in a research grant I need to spend this week or else it's gone forever – do you have any ideas of what I should spend it on?"
Unfortunately I couldn't think of much. I suggested maybe we buy some more computers with it but I'm sure he'd already thought of that himself. I don't know what he ended up doing, but I'm sure he'd have decided to buy something with it rather than just losing it entirely.
This is usually handled much more elegantly by senior academic staff.
You contact a department whose services you use a lot, then you arrange to pre-pay for services. Ideally you negotiate a discount.
Then you use the service and state which grant to draw from.
This way you have grants paying for things that are completely unrelated to their intent, you have one nightmare of a billing system which no one understands, and you get to use every cent.
This was a fairly recent occurrence in my research group. It’s often quite tedious because you don’t want to waste the money and it’s never clear if there’s going to be a period where we’re short on cash at some point in the future. Most of it’s spent on boring but expensive things to be used down the line. Would be far better if funding wasn’t quite so cyclical!
Could you order conference tickets or something similar that allows free cancellations in the future? In my previous job, we did this to carry over training budgets.
If these kinds of actions turn a non-government entity into something inefficient, then the entity won't survive for long and will go out of business (or at least that's the hope of a competitive free-market economy).
> then the entity won't survive for long and will go out of business (or at least that's the hope of a competitive free-market economy)
Only if the inefficiency is large enough to overcome other forces.
Or to put it another way, picture if every single team at Google did this to the tune of 100k a year, per team, and assume that among 135,000 employees there are 13,500 teams.
That's 1.35 billion dollars. Well under 1% of their revenue.
No way is a competitor going to appear that is identical to Google in every way except they have better budget management. Google has too many moats around their business, they can be really inefficient in many many ways and still dominate in multiple markets.
Even Adam Smith didn't believe that; he writes that it only works that way in a controlled environment. That’s why European countries usually rank higher in market freedom than the US: we don't have companies getting so cancerously big that they have very real effects on law-making (how lobbying is legal is still beyond me).
The USA got rich because of unbridled capitalism. Then richness trickled down through generations of companies while regulations caught up and their government became a behemoth not dissimilar to the ones living in EU and that the USA was running from.
Nowadays the USA jurisdiction is comparable to the EU one, but they still have more $$$.
Of course. I agree with what I think you're getting at, specifically that not nearly enough richness trickled down. However, the poorest Americans today are still significantly better off than they were in most any previous decade or century because a bit of that wealth did trickle down.
The amount of inefficiency in the form of red tape, confusing processes, and custom half-baked tools that crash half the time is just mind-boggling. I've spent more than a week now on opening the firewall for one IP/port on one host just to test my prototype in the dev environment (local machine or Docker are not an option due to lack of admin rights), and it's still in the change-approval stage.
If we weren't this giant too-big-to-fail bank we'd be out of business by now.
Hope is a really interesting way to frame something that has consistently failed to prove true after centuries of theory and decades of targeted policy changes.
It's possible that once a company reaches a certain size, it's inevitable. Corporations internally have the same top-down centralized organizational structure as a typical government. Market forces can't eliminate that kind of inefficiency if it invariably affects all large enterprises, and the economies of scale enjoyed by such companies outweigh the perverse incentives of sub-organizations.
What strikes me as unique to government is the tendency for sufficiently powerful appendages to secure enough resources to start wagging the dog (e.g. the military industry in the US), although now that I think about it, it seems possible that the same would happen within companies.
Also don’t forget how unevenly applied market forces are: if McDonald’s started charging 10% more for a hamburger they’d lose sales to Burger King a LOT faster than, say, Comcast or Oracle because the products are basically the same and most customers can switch almost effortlessly whereas you have to be especially mad to trench fiber out to your house or migrate every database in a large enterprise.
Any business with a natural monopoly, high migration costs, etc. can support a surprising amount of inefficiency even if most of their customers find the experience unsatisfying.
But large companies tend to have MBA types scurrying around rooting this stuff out as it pops up, or shortly thereafter. Government has no such immune system to fix these problems on the go. It just gets sicker and sicker until the taxpayers vote for something drastic or revolt.
You see this in nonprofit entities too. They get big, drift away from their mission and waste a lot of money until someone gets tasked with cleaning house or a more mission-driven organization comes along and replaces them.
I talked about that topic with my principal when I was in school.
He told me that the school had to spend the money to prevent those automatic budget cuts. His reasoning was that it's nearly impossible to get a higher budget when some big expense has to be made.
And suddenly needing a higher budget, after for example 3 years of low expenses, doesn't make a good impression on higher-level administrators.
This happens in private industry too. I can set my watch by the fiscal calendar of certain groups in public companies having to spend their budgets by the end of their year so it doesn't get cut the next year.
It's a decentralized implementation of a quota system.
By slowly releasing supply you prevent anyone having to self-regulate (which requires unreasonable deprivation, OR global knowledge) and everyone bases their decisions off of the only global signal, free space.
Tragedy of the Commons is the libertarian "private property is essential" interpretation. It's a cynical take, assuming that human selfishness is the deepest of truths and that there is no use fighting it, that the best solution is to organize society around it.
The conventional game-theory take is that this is a prisoner's dilemma, and everyone creating balloon.txt files is defecting. They are making the most rational choice under the rules of the game (no communication, thus no reliable cooperation). It's not globally optimal, but it is locally for each of them. This take also suffers from the same assumption: that rationality is centered on self-interest only.
If we are to evolve as a species, then we need to get beyond such limited thinking. We need to transcend our base natures. That is the whole point of culture: to transcend as a group what our genes otherwise program us as individuals to do.
Tragedy of the Commons effects are well-established to exist both in economics and outside it (in ecology, for instance). You seem to be attempting to shoehorn some misguided political take into the situation, even though Tragedy of the Commons is a decent characterization of this particular social pathology.
Their point was that tragedy of the commons need not be a given anywhere we see tragedies or commons. Last paragraph is lofty but I think the whole idea is we have the cognitive ability to deliberately prove its not a natural law.
Understanding sociology as ecology at human scale is core to libertarianism.
Though did you mean, "Understanding sociology as Darwinian ecology at human scale is core to libertarianism."? Because the notion that ecology is characterized only by "the law of the jungle" is also strongly debated. Even "the selfish gene" is arguably simplistic reductionism. Individuals aren't the only actors; there are higher order emergent entities, e.g. species and ecosystems, that also evolve to perpetuate themselves and flourish, much like our own bodies are cooperative and interdependent systems of cells (with native and foreign DNA, the latter existing primarily in our GI tract) that originally evolved as single-celled "selfish" organisms.
And as you point out, "we have the cognitive ability" that nature lacks. We can do at least as well.
As to "lofty", I agree. But let's consider other things that were once considered lofty if not insanity:
- in ancient Greece, that democracy should be extended beyond the aristocracy
- in Medieval Europe, that democracy should exist at all, that the divine right of kings should be seen as a scam
- in the 19th century United States, that democracy should include women and blacks
- in the 1970's United States, that lesbians, gays, bisexuals, transexuals and queers should be treated with the same dignity as straights, should be able to marry, adopt children and serve in the military. And that we stop using "he/him" by default as you just did because that is an artifact of patriarchy as well as outmoded thinking about even binary gender.
- in India today, that when a woman is raped, she should be protected by law and the male rapist should be punished, not the other way around. The same proposition if proposed in America or Europe not all that long ago.
This is self conflicting. You take "human selfishness is the deepest of truths" as a mere assumption, then you say "we need to transcend our base natures".
Human selfishness IS nature. It is not just about humans either, all evolution is guided by environment (resource availability).
For anything else you need ALL people to NOT be selfish, only some being altruistic does not cut it. Your only other option is to punish selfishness, but then you will ban progress.
If most people don't create the balloon.txt file, BUT, there is no punishment for creating one, then if I believe I have a good idea and that I DESERVE more resources to pursue it, I'll create a nice big balloon.txt file. Your only option is to punish me for doing so. I would not want to live in a world where people are punished for trying to gather resources to make things that most other people won't. Some people have bright ideas, and they need resources to pursue them. Most people don't have many ideas and they don't want to do anything. If you prevent the means of passionate people to gather big resources to do big things, and want to live in a zero entropy world where everything is equal (made sure through the use of force / punishment, which will eventually be corrupt, because by definition punishers can't be equals to others) and nothing moves because of it, keep dreaming. It is not even scary because that literally cannot happen.
The way to resolve this particular tragedy of the commons, like most other such cases, is to privatize the commons: make people pay for the disk space they use. If you want a nice big balloon.txt file to reserve space for the future, fine, but you're paying for the space you reserved. How you use it is up to you. In return, the administrators get both the money and incentive they need to buy more storage capacity, ensuring that running out of available space will be less of a concern.
> we need to get beyond such limited thinking. We need to transcend our base natures
Refusing to accept the human nature as-is and always requiring some sort of "evolved new man" is one of the characteristics of the communist/socialist ideology.
Also a handy excuse when the system inevitably fails: it wasn't the system, it was the selfish people who did not implement it correctly.
Ahhh the old "socialism/communism inevitably fails" meme.
Let's assume one could even call those failures communism/socialism.[1] How long have we experimented with and developed socialism/communism? 100 years.
How long have we been trying to get democracy right? 2,500 years. With many starts, fits and failures, devolving into dictatorships many, many times. The self-proclaimed "greatest democracy in history" is guilty of genocide and slavery. Even today how much it is a democracy as opposed to an oligarchy/kleptocracy/plutocracy is questionable.
How about capitalism? 500-800 years. And in that time it has exploited, enslaved and murdered people, pillaged entire nations and continents[2], raped the environment, and poisoned every culture that has adopted it with the notion that "selfishness is a virtue".[3]
The only reason capitalism hasn't collapsed (yet) is because capitalists are smart enough to not do pure capitalism, knowing that it would lead quickly to revolution, and because the environment's revolt is just getting started.
---
[1] “The west called [the Soviet Union] Socialism in order to defame Socialism by associating it with this miserable tyranny; the Soviet Union called it Socialism to benefit from the moral appeal that true Socialism had among large parts of the general world population.” ~ Chomsky
[2] The United States: "look how many people died in the Soviet Union's industrialization program!"
Socialists: "how did the United States industrialize again?
The United States: "look, you need to do a BIT of genocide and slavery to kick things off…" ~ Existential Comics
[3] One of the most beneficial things about immersing yourself in deep study of American history is that you get to a point where this country can no longer effectively lie to you about why it is the way it is. It disabuses you of the notion that the inequality we see is an accident. ~ Clint Smith
Capitalism is not exploiting anyone. Capitalism is purely about organising the economy around voluntary transactions.
Exploiting, enslaving and murdering is purely what socialist countries do - and they can get away with all of this, just because they can socialise the cost of all their evil deeds and force people into paying them money.
The only reason capitalism hasn't collapsed is that it's the only way to have a profitable economy. The crooks that you call government recognise that they can steal only so much from the economy before a country collapses.
I'd also argue that we've experimented with elements of socialism and elements of capitalism for the entire existence of civilisation.
Communism can't work unless you have either perfect individuals or a tyrannical state which forces resource distribution. In the real world, you end up with socialism. Because people are not perfect, the government which will redistribute resources won't do a perfect job in the best case and will just be completely corrupted in the worst case.
And still, communism was attempted. When did we ever attempt to have an entire capitalist society without a government to ruin it?
So other people acknowledge the problem, provide a solution, and your response is to say "that is selfish libertarian propaganda, the real solution is some magical evolution"?
It’s not just libertarians. I’m sure that even communists accept the premise of human self interest, but instead of private property their solution is for one all powerful government to own everything
That would be great if everyone were truly on a level playing field.
You could make that so in this shared computing scenario, but our broader world is systemically rigged in favor of some people and against others. Capitalism depends on the un-levelness of the playing field for cheap labor.
i.e. while it can be useful if prices are attached to commodities (with caveats around externalities etc), it is not a good thing that prices are attached to humans, making some people's being and work less valued than others.
I worked at a large company during a migration from Lotus to Outlook. We were told we'd get our current Lotus email storage + 100MiB as our new email quota limit under Outlook.
I made a bunch of 100MiB files of `/dev/random` noise (so they don't compress, compressed size was part of the quota) and emailed them to myself before the migration, to get a few GiB of quota buffer.
My co-workers were constantly having to delete old emails in Outlook to stay under quota, but not me. I'd just delete one of my jumbo attachment emails, as needed. ;)
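For anyone curious, the trick is only a few lines; a rough sketch (sizes and filenames invented, with `os.urandom` standing in for `/dev/random`):

```python
import os

CHUNK = 1024 * 1024  # 1 MiB per write

def make_balloon(path: str, mib: int) -> None:
    """Write `mib` MiB of incompressible noise, so a compressed-size quota
    counts it at full weight."""
    with open(path, "wb") as f:
        for _ in range(mib):
            f.write(os.urandom(CHUNK))

for i in range(10):
    make_balloon(f"balloon_{i:02d}.bin", 100)  # ten 100 MiB attachments-to-be
```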
Email quotas aren’t just a cost thing. They force deletion of files/communications that aren’t relevant anymore. The last thing the legal department wants is some executive’s laptop with 10 years of undeleted email making its way into discovery.
Unfortunately, those goals are rarely communicated and accepted by the people they're imposed on.
My first full-time job had an unexplained email expiry policy. After being frustrated several times at losing some explanation of how or why something was done, I started forwarding all my emails to Gmail. In retrospect, that's probably a worse result for whoever imposed the expiration.
Fortunately, these days people are better about consolidating knowledge on wikis or some kind of shared docs instead of only email.
It’s a hush-hush kind of thing. If you advertise that it’s to avoid discovery, you are openly admitting to liability should someone find out while trying to pull your execs' email during discovery.
The excuse of resource contention provides plausible deniability
Yeah, this is really common. Normally there'll be one unrecorded/easily deleted means of communication, and people use that for discussing things that potentially could expose the company to legal liability.
But nobody ever talks about it (except in said un-recorded meetings. That reminds me, I should explain this to our junior today, so that he knows for the future).
I just spent 20 minutes trying to find an article by Byrne Hobart vaguely in this area [0], but for personal messaging. The idea being that if you control the storage (or deletion) you can avoid casual or speculative regulatory interest in your chat logs.
(apologies it's on medium, I couldn't find it anywhere else)
Lotus to Exchange migrations all likely happened in the pre-Sarbanes-Oxley era, before that and other retention regulations imposed email retention requirements
iirc at the time the only industries that required retention were health, legal and government
With SOX (PCI, FDIC, et al) retention laws we had another explosion of work rolling out all the compliance features of Exchange
Those were crazy times getting everybody either migrated with email or onto corporate email - there's a similar explosion of work right now with migration to M365
I knew a place where Exchange was configured to delete all mails after 6 months. Soon after I discovered that people started to form circles in which they would forward older mails from internal mailing lists to each other to retain them longer than that.
A previous company I worked for had a one month retention window in the email server. People just ended up storing email in their local machine's Outlook folder so they can refer to old emails.
Or, for the more technical folk with access to a Linux server: set up postfix/dovecot, connect Outlook to it, and arrange for archived emails to go to the IMAP server.
The IT people get smart about looking for OST or PST files, but let's see them catch that :-)
Then configure a new mail account in Outlook and connect to the IMAP server. It's optional, but convenient for replies, to configure the account to send via postfix if you have an internal SMTP server to connect to.
I gave up on email folders years ago, so at the end of the month would just create two new folders in the archive account (YYYYMM and YYYYMM_Sent) and drag all the mail from the Exchange account into the IMAP folders. Et voila! You now have your own local email archive.
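If you'd rather not do the monthly drag by hand, the same shuffle can be scripted over IMAP; a rough sketch (hostnames, credentials and folder names are placeholders, and a real Exchange setup may need OAuth rather than a plain password):

```python
import imaplib
import datetime

SRC = ("mail.example.corp", "me", "secret")      # corporate Exchange side (placeholder)
DST = ("archive.example.home", "me", "secret")   # personal postfix/dovecot box (placeholder)

month = datetime.date.today().strftime("%Y%m")   # e.g. "202401", like the YYYYMM folders above

src = imaplib.IMAP4_SSL(SRC[0]); src.login(SRC[1], SRC[2])
dst = imaplib.IMAP4_SSL(DST[0]); dst.login(DST[1], DST[2])

src.select("INBOX", readonly=True)               # don't touch the originals here
dst.create(month)                                # returns 'NO' if it already exists; harmless

typ, nums = src.search(None, "ALL")
for num in nums[0].split():
    typ, data = src.fetch(num, "(RFC822)")       # full raw message
    dst.append(month, None, None, data[0][1])    # archive server stamps the append time

src.logout(); dst.logout()
```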
I imagine it looks better at discovery time to say 'oh sorry we lost these emails because we ran out of disk space' rather than 'we deleted them because we didn't want you to read them'.
No, companies need to be able to point to an official retention policy that says in writing that emails older than x months or years get deleted. Most do (including my employer), and it's because of legal discovery. But it feels like we're lobotomizing ourselves, as often the reason some odd thing was done was based on a long-deleted email discussion.
Sounds like the retention policy is also solving the wrong problem. If for legal reasons you want to destroy any potential evidence, maybe it's a good idea to stop doing illegal actions.
I remember Matt Levine talking about how regulators would often find emails along the line of "Let's sell this crap to those idiots" and use that as leverage to force settlements rather than showing actual violation of regulations.
The reason being that it's hard to show intent to defraud, and much easier to threaten bad press.
Thanks to patents, everyone in technology is doing "illegal actions" all the time, since you can't do anything without infringing hundreds of patents. And if you can find an email somewhere indicating that someone knows that a competitor has feature X, or knows about the existence of a patent, voila, evidence of knowing infringement! Triple damages under US law.
That's why I did it. I'd always be trying to find an email from the prior year, that held a fix I needed to use again, but it had been deleted to stay in quota. Old email can be helpful.
I am sure "legal" might want it but is it not better for society in general if they where discoverable.
A bit like when investigating police/government misconduct and a lot of files turn out to have been destroyed - but of course our data gets kept forever
Same thing happens with floating licenses, if they are too scarce, people open the program first thing in the morning ‘just in case’ and keep a license reserved all day.
The real game starts when people run infinite while/for loops that try to check out one as soon as it's available. Or run useless operations within the licensed software just so that the license doesn't expire and return to the pool. I'm guilty of both, sadly. In an academic environment, additional resources aren't going to fall from the sky.
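The checkout loop itself is trivial; a hedged sketch, where `try_checkout` is a hypothetical wrapper around whatever your license manager exposes (not a real API):

```python
import time
from typing import Callable

def grab_license(try_checkout: Callable[[], bool], poll_seconds: int = 30) -> None:
    """Poll until a floating seat frees up, then hold it."""
    while not try_checkout():      # try_checkout() -> True once a seat is acquired
        time.sleep(poll_seconds)
    print("license acquired - now keep it busy so it doesn't return to the pool")
```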
Ouch. I can't stop thinking now about how much cost gets imposed on the economy by habits like this established in higher education. I built my original business with very few, formally qualified people, who included a large proportion of the most experienced and professionally qualified individuals, several with multinational boardroom careers in F500s. We didn't have the culture to tolerate games like holding up a floating licence (which a lot of critical software used), and we weren't the generation raised with computers, not by a fair distance. But hearing this, it both makes perfect sense that it might be prevalent and simultaneously thoroughly unnerves me about how strongly I might react on encountering the same if my present venture gets going.
At the opposite end, I heard a story of actually full storage from the beginning of the century, when I worked at a "large medical and research institution in the Midwest". They had expensive SMB shares (NetApp?) that kept getting full all the time. So they did the sane thing in the era of Napster: they started deleting MP3 files, with or without prior warning. Pretty soon, they got an angry call that music could not be played in the operating room. Oops. Surgeons, as you can guess, were treated like royalty and didn't appreciate seeing their routines disrupted.
I use some scripts that monitor disk space, and monitor disk usage by "subsystem" (logs, mail, services, etc.) using Nagios. And as DevOps Borat says, "Disk not full unless Nagios say 'Disk is full'" :-) Although long before it is full it starts warning me.
It doesn't go off very much, but it did when I had a bunch of attacks on my web server that started core dumping and filled up the disk reasonably quickly.
Back in the day we actually put different things in different partitions so that we could partition failures but that seems out of favor with a lot of the distros these days.
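The core of that kind of check is tiny, if anyone wants it without a full Nagios setup; a rough sketch with made-up thresholds, using Nagios-style exit codes (0 OK, 1 warning, 2 critical):

```python
import shutil
import sys

PATH, WARN, CRIT = "/", 80.0, 90.0          # percent-used thresholds (arbitrary)

usage = shutil.disk_usage(PATH)
pct = usage.used / usage.total * 100

if pct >= CRIT:
    print(f"CRITICAL - {PATH} is {pct:.1f}% full")
    sys.exit(2)
elif pct >= WARN:
    print(f"WARNING - {PATH} is {pct:.1f}% full")
    sys.exit(1)
print(f"OK - {PATH} is {pct:.1f}% full")
sys.exit(0)
```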
This is a surprisingly common hoarding behavior among humans using scarce resources. In technology you see it everywhere, virtualization infrastructure, disk storage, etc.
This is actually kind of clever. How the tribal knowledge for how to "reserve space" was developed and disseminated would be pretty interesting to study.
At school we had an 800MB quota for each class (around 90 people). Usually everyone discovered the space problem in the first year, when trying to get everything done for their first project. When you cannot compile code or generate a PDF because there's no space left, the witch hunt starts: there are always some people with left-over files from .tex to .pdf conversions.
To help, some students had put in place a crawler that gathered statistics about who was using the space, for all classes. And usually, once bitten, you made your own space-requisition script which would take any byte left, as soon as it became available, until it hit some reasonable size.
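A space-requisition script of that sort is only a handful of lines; a rough reconstruction (the stash path, sizes and polling interval are guesses, not what anyone actually ran):

```python
import os
import shutil
import time

STASH = os.path.expanduser("~/.requisition")   # the hoarded file
TARGET = 50 * 1024 * 1024                      # stop once we hold ~50 MB
CHUNK = 1024 * 1024                            # grab 1 MB at a time
HEADROOM = 5 * 1024 * 1024                     # leave a little so we don't wedge everyone

held = os.path.getsize(STASH) if os.path.exists(STASH) else 0
while held < TARGET:
    free = shutil.disk_usage(os.path.dirname(STASH)).free
    if free > HEADROOM + CHUNK:
        with open(STASH, "ab") as f:
            f.write(b"\0" * CHUNK)
        held += CHUNK
    else:
        time.sleep(60)                         # wait for someone else to free space
```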
That's dire, ~8MB per person. It's an interesting problem though. When the resource is not scarce, allocating 800MB per class is the correct way to do things: someone who needs 9, 12 or 30MB would be able to get their allocation. But as soon as resource contention happens, the students with the biggest allocations would need to relinquish a lot of data. 800MB is nothing over a modern connection nowadays, but playing this game with petabytes would be a nightmare.
I remember in Android one year the focus was on slimming down memory usage. Of course we found an app, that shall not be named, that allocated a chunk of memory on startup just in case it was going to be needed later.
It's basically reserving part of the disk for very important things only, which scares off less important uses. Like making the commons seem more polluted than it actually is to get some action taken.
If those files weren't there, the space would probably fill up, but now without any emergency relief valves.
It would be better if these files were a smaller fraction of space and had more oversight... but that's just a quota system. This is something halfway in between real quotas and full-on tragedy of the commons.
I am far from an expert on game theory, but it seems that the cause of the tragedy of the commons is that people can use the shared resource for free. If there was a price to be paid, and the price was dynamically adjusted depending on conditions, then the overuse could be avoided.
Similarly for file storage and "reserving" it by creating huge but useless files. If everyone was charged a fee per gigabyte per day, then people would be less likely to create those placeholder files. You probably have to be careful about how you measure, otherwise you'll get automated processes that delete the placeholder files at 11:59pm and create them at 12:01am.
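As a toy sketch of that metering idea (the rate and the sampling scheme are made up for illustration), the trick is to measure at random moments so the midnight delete/recreate dance stops paying off:

```python
import random

RATE_PER_GB_DAY = 0.02                       # hypothetical internal chargeback rate

def plan_sample_times(samples: int = 24):
    """Pick random seconds-of-day to measure at, so nobody can game a
    known measurement time by deleting placeholder files just before it."""
    return sorted(random.randrange(86_400) for _ in range(samples))

def daily_charge(readings_gb) -> float:
    """Average the sampled usage readings (in GB) and price the result."""
    return sum(readings_gb) / len(readings_gb) * RATE_PER_GB_DAY

# e.g. daily_charge([120.0, 118.5, 121.2]) charges for ~119.9 GB-days of use
```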
I was more on a sociological/existential plane, but I'll take that information too. I wish I'd read this kind of economics book rather than supply/demand or finance.
I always leave some unallocated space in LVM on my machines. However, in a cloud environment it's probably easier, or the only option, to delete that 8 GB file.
For everyone saying "This isn't a real solution!" I'd like to explain why I think you're wrong.
1) It's not intended to be a Real Solution(tm). It's intended to buy the admin some time to solve the Real Issue.
2) Having a failsafe on standby such as this will save an admin's butt when it's 2am and PagerDuty won't shut up, and you're just awake enough to apply a temp fix and work on it in the morning.
3) Because "FIX IT NOW OR ELSE" is a thing. Okay, sure. Null the file and then fill it with 7GB. Problem solved, for now. Everybody is happy and now I can work on the Real Problem: Bob won't stop hoarding spam.
Motorboat fuel tanks have a reserve as well. It's just a raised area that splits the bottom of the tank into 2 separate concave areas. One of the concave areas contains the end of the fuel line, and the other doesn't. When you run out of gas, you tip the tank up to dump the remaining gas from the other basin into the main one, and then you restart the engine (or keep it from stopping at all if you're quick enough on the draw) and head for the docks.
Old SCUBA tanks didn't have gauges, they had a reserve tank with enough air to get you to the surface. You'd realize you were running low (which I'm sure was terrifying) then hit the switch and slowly surface (you don't want to surface quickly when diving).
Yeah, my dad had a tank like that. I dove with it exactly once - never again, yikes. It was coated inside and out so, despite being a steel tank, it was in excellent shape.
The bikes I've had that have had reserve tanks have also been old enough to raise the disconcerting follow-on question, which is: "is the reserve gas also full of sludgey crap that's settled in the tank and hasn't been disturbed really in a year, and am i about to run that through my poor carbs?"
My friend had a truck with a reserve tank, but it was the same size as the main tank, so he would just flip the switch at every fill up to make sure they both got used.
Had this in a 70s F150. A "Main - Aux" switch on the dash, right above the 8-track player. I used to let the main tank sputter out on fumes and then triumphantly shout "Rerouting auxiliary power to engine!" while sliding the switch. Letting them empty out alternately would have been a lot smarter.
My father drove a '95 F-150 for years that had the dual tanks. Shortly after high school I got in an accident that ended up totaling my vehicle, and for a couple months I was using his truck (he runs an auto repair shop from a garage behind the house, so he almost always had something available to drive). I ended up using it to go out on a date with someone I had met at work.
I noticed on the way to pick them up that the truck was running on empty in the main tank but I checked and the aux tank was full. Then I remembered the first time my dad let the tank run down and start sputtering down the road and decided to keep going on the empty tank.
I made it to pick them up and started heading down the highway (where we were, it was a good 3-4 miles to the nearest gas station), and then the truck finally started to sputter. I proceeded to play along, pretending to panic for a good 20 seconds, and then I turned and saw the look on their face and couldn't help but start laughing. I switched to the aux tank, and when the truck started running again I turned and the look I was getting indicated I was being mentally murdered. Then they punched the crap outta my arm and started laughing and calling me not-so-nice things.
Ended up being an awesome night out with someone I'd end up being friends with for a long time. It's weird how this kind of random conversation in an unrelated internet post can drag you way back down memory lane.
This is typically used for agricultural/off-road fuel which is not priced with road taxes and as a result much cheaper. Off road fuel is dyed red in the US. If you get caught running dyed diesel on road you will be fined. Thus the switch on the dash, when you leave the highway to drive on your farm you flip over to dyed fuel to save $$.
Oh, fascinating! My first vehicle was the family's 3/4-ton Diesel '84 Chevy Pickup from the farm, and I'd forgotten it had an Aux fuel tank! This makes a lot of sense.
the two-tube design of the tank on my 1975 honda CB meant that there was about an inch and a half of tank that sat below the primary fuel port. Tank crud (steel tank, theoretically passivated, 40 years old) settles faster than I ran through a tank of gas, so the bottom layer had sediment in it fairly regularly.
I kept spare inline fuel filters in a tool roll just in case after a while.
always fun when you're barreling down the highway and the engine starts to lean out, prompting you to hurriedly locate and switch the petcock over before the engine stalls completely.
suppose then that you go fill up and forget to set the petcock back to normal. 8ball says: "I see a long walk in your future."
out of years of riding it's only happened to me a couple times.
one time i was eastbound on the bay bridge when my bike started to sputter. i'd just reassembled the tank and had left the screw-style reserve fuel valve open, so there was no reserve fuel to be had. a very kind lady put her blinkers on behind me and followed as i coasted the last few hundred yards toward yerba buena island.
i pushed my bike up the ramp and looked in the tank to assess. it's a dirtbike, so the tank has two distinct "lobes" to accommodate the top tube of the frame. I had a few ounces in the tank but they were not in the lobe with the fuel pickup, so i dumped the bike on its side to get the fuel to slosh over to where i wanted it.
i got back on the highway and, going quite slowly and gently, managed to get to the gas station at west oakland bart, the engine leaning out and sputtering right as i rolled into their lot.
I think that driving on those last few ounces of fuel is a completely different feeling.
Normally you take for granted that the engine works for hours at a time.
When you've come to a stop and found those last few ounces of fuel, it's such a relief that the engine can run again. You know it won't run for very long, but every minute that it continues running saves you many minutes of walking or pushing. You appreciate every minute that the engine produces that amazing amount of power (compared to your own power when you're pushing a 300+ pound bike).
Typically there aren't two separate tanks - In one tank there are two tubes at different heights. As the fuel level falls below the height of the "main" tube the engine sputters, then turning the petcock engages the lower down "reserve" tube which is still below the fuel level. It's more of a warning than a true reserve, and most bikes with an actual fuel gauge don't have a reserve.
On bikes like that, there's a reserve-reserve trick sometimes. Sometimes, the tank is an inverted U shape so when the pickup runs dry there's still a little more fuel on the other side of the U. If the bike is light enough, you can stop the bike and lean it way over to pour that last bit over to the pickup side. Might get you another couple miles.
Fuel-injected motorcycles don’t have reserve (at least, none that I’ve seen). Instead they have low-fuel lights or full fuel gauges. I’m guessing it’s because the fuel pumps are in the tank and the fuel injection system needs high pressure.
Fuel injectors require filtered gas because even small particles can clog them, and said filter is more likely to be clogged or even compromised by sucking up the last drops of fuel (and scale and debris) in the tank, so the low-fuel warning is required.
Carb jets can get clogged, too, but are wider since they're not under as much pressure. Also, since they're a wear item they're a lot easier to clean and/or replace.
I think grandparent commenter had it right: it's because the pump is in the tank. There's just no good way to have an external petcock determine where a tank-internal pump gets its fuel from.
Many new bikes come with a lot of rider aids for safety (ABS, TCS) as well as all kinds of electronics (fuel maps), so this is changing. But of course manual transmissions won't go away until bikes are electric.
I am one of those who likes things old school. My bike still has a carburetor, has no fuel light or tachometer, and I have certainly had some practice reaching down to turn the fuel petcock to reserve while sputtering on the highway. If they didn't intend for me to do that, why did they put it on the left side? :)
Goldwings also have a reverse gear.
Even more remarkable: I used to have an Aprilia scooter that had a remote release button (on the key fob) for the under-seat storage area. I think I used it once just to see if it works.
Some newer bikes, like mine, don't have a reserve petcock. They have a low fuel light. No forgetting about the petcock and an obvious warning light instead of sputtering.
Some older bikes, like my '99 Ducati Monster, don't have a petcock. It has a low fuel light that first failed in around 2002, and for which the part that fails (the in-tank float switch) stopped being available in about 2015 or so. No petcock _or_ warning light. (And that trip where the speedo cable failed, so I couldn't even use the trip meter to estimate fuel requirements, was a fun one...)
Can you find someone who can adapt a float switch from a different bike? It seems like a very useful thing to have, even if it's not the original factory part.
I've just gotten used to it. I'm fairly reliable about always resetting the trip meter when I fill it up (and always fill it to full). I know it'll get 200km easy, maybe only 180 if I'm having _way_ too much fun. That's always about time I want to stop and stretch my legs anyway. It doesn't bother me enough to "solve the problem".
Most motorcycles with a manual petcock are very manual in nature. Often this is to minimize the number of moving parts that could die on you if you take it into rural areas. An automatic petcock adds more complexity that could cause a malfunction.
It is a shame that motorcycles have moved away from this model. My last bike had a manual petcock with a reserve setting. It was problematic because I’d forget to turn it from off to on, take off on what’s left in the carburetor bowl, and the engine would start sputtering just down the road. But I also never got stranded.
New bike has a vacuum-actuated fuel valve, no reserve. It does have a fuel gauge but since the tank is not a nice simple rectangle and the angle makes a difference the gauge is basically untrustworthy. So I go by the mileage and hope I don’t get it wrong. How hard would it be for them to add a reserve setting so it could just be between On and Reserve so I could just flip between them as needed?
In the Honda CBF125 group on Facebook, a fellow Indian shared a photo of his bike. A British guy asked what the switch was; he'd never seen one before. Same bike, same country of origin, but only certain markets get the switch and the recessed panel.
It is extremely thick plastic. I wouldn't be surprised if it dislodged from the frame before it burst. In any event, in any collision violent enough to rupture the tank, the rider will have already been thrown a hundred feet away (and be dead...)
> 1) It's not intended to be a Real Solution(tm). It's intended to buy the admin some time to solve the Real Issue.
If you don't have monitoring, will you even be aware that your disk is filling up?
If you do have monitoring, why are you artificially filling up your disk so that it will be at 100% more quickly instead of just setting your monitoring up to alert you when it's at $whateverItWasSetToMinusEightGB?
One argument in favor of it is that the 8GB file may cause a runaway process to crash, which stops it from continuing to chew up space and leaves you able to recover.
A second argument is it's not opened by any process. One problem I've had fixing disk full errors was figuring out which process still had a file open.
(For any POSIX noobs: the space occupied by a file is controlled by its inode. Deleting a file "unlinks" the inode from the directory, but an open filehandle counts as a link to that inode. Until all links to the inode are deleted, the OS won't release the space occupied by the file. Particularly with log files, you need to kill any processes that have it open to actually reclaim the disk space.)
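A quick way to see that for yourself (sizes and paths are arbitrary; if /tmp is tmpfs on your box the space is RAM rather than disk, but the effect is the same):

```python
import os
import shutil

path = "/tmp/biglog.demo"
with open(path, "wb") as f:                 # pretend this is a runaway log
    for _ in range(256):
        f.write(b"\0" * (1024 * 1024))      # 256 MiB of "log"

held_open = open(path, "rb")                # a process still holds it open
os.unlink(path)                             # directory entry gone...
print(shutil.disk_usage("/tmp").free)       # ...but the blocks are NOT back yet

held_open.close()                           # last reference to the inode released
print(shutil.disk_usage("/tmp").free)       # now the space is reclaimed
```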
Because even if you have monitoring, some unforseen issue rapidly eating disk space at 3:00 am may not give you the time to solve it without downtime or degraded performance unless you can immediately remove the bottleneck while you troubleshoot.
Then why not automate the removal of the 8 GB spacer file when the disk gets full? Or in other words, just sound your alarms when there is 8 GB of free disk space.
Because if it is a broken process then it will fill up the disk again before you wake up and look at it.
I think the idea is that once you are at the system you can try to find out the cause without removing the file, or worst case remove the file and act fast (you may be on a short timer at this point). So for example, if you find out that process X broke and is writing a ton of logs, you can disable that process, remove the file, and then most of your system is operational while you properly fix the root cause, or at the very least decide how to handle the data that filled up the disk in the first place. (You can't always just delete it without thought.)
I think a more refined approach would be disk quotas that ensured that root (or a debugging user) always had a buffer to do the repairs. This file just serves as a system wide disk quota (but you need to remove it to take advantage of that reserved space).
Besides runaway log files that aren't being properly rotated, human error can cause it too. I managed to completely eat up the disk space of one of our staging servers a few weeks ago trying to tar up a directory so I could work on it locally. Didn't realize the directory was 31GB and we only had 25GB of space. By the time the notification for 80% usage was triggered (no more than 2 minutes after we hit 80%), the entire disk was full. Luckily it was just a staging server and no real harm was done, but such a mistake could have just as easily been made on a production server. In this case, the obvious solution is to just delete the file you were creating but if you're running a more complicated process that is generating logs and many files, it may not be so easy and this 8GB empty file might be useful after you cancel the process.
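A sketch of the sanity check that would have caught that before the disk filled (paths and the headroom factor are placeholders):

```python
import os
import shutil

def dir_size(path: str) -> int:
    """Total size of regular files under `path`, roughly what the tarball will need."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total

src, dest_fs = "/srv/app/data", "/srv"       # placeholders
needed = dir_size(src)
free = shutil.disk_usage(dest_fs).free
if needed > free * 0.9:                      # keep some headroom
    raise SystemExit(f"refusing: need ~{needed >> 30} GiB, only {free >> 30} GiB free")
```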
Right, but again, what good does the spacer file do if you're not aware that you're running low on disk space? That is: if your monitoring isn't working, how do you know that you need to quickly make room?
And if your monitoring is working correctly, the spacer file really serves no purpose other than lowering the available disk space.
1. When your DBMS is no longer responding to queries, your boss and your customers replace your monitoring system (unlimited free phone calls 24/7 included ;). Case in point: HN is often a better place to check than Google Cloud status page, for example.
2. Maybe you didn't get it, but "nullmailer not forwarding cron email due to mailgun problems" was a bit too specific to be an example I just made up, wasn't it? Again, the premise "if your monitoring is working correctly" is not a good one to base your reasoning upon. Especially if you have 1 VM (VPS) and not a whole k8s cluster with a devops team with rotational on-call assignments.
The reason was, I thought, discussed in the article.
When you actually fill up your disk, many Linux commands will simply fail to run, meaning getting out of that state is extremely difficult. Deleting the file means you have room to move files around / run emacs / whatever, to fix the problem.
Yes, yes, but they will notify you after your service is down (because that's when they notice), in part thanks to a spacer file that eats up available disk space without being of any use. A monitoring service would notify you before your service is down and users grab pitchforks and start looking for torches.
I understand the benefit to be able to quickly delete some file to be able to run some command that would need space, though I find that highly theoretical. If it's your shell that requires space to start, you won't be able to run the command to remove the spacer, and once you're in the shell, I've never found it hard to clean up space; path autocompletion is the only noticeable victim usually. And at this point, the services are down anyhow, and you likely don't want to restart them before figuring out what the problem was, so I don't see the point of quickly being able to make some room.
It feels like "having two flat tires at the same it is highly unlikely, so I always drive with a flat tire just to make sure I don't get an unforeseen flat tire". It's cute, but I'd look for a new job if anyone in the company suggested that unironically.
This is an additional safety net. It's like doing backups. Of course you should replace your hard drive before the other drive breaks down, but you want to have a backup in case your server burns down.
because sometimes you run things that don't really need monitoring.
I run a bunch of websites for pet projects, friends' clubs, etc. They don't need monitoring, and even if they go down for a couple of hours (or days) it doesn't really matter.
I do monitor them, but mostly as an excuse to test various software that I don't get to use during my day job (pretty sure that a bunch of static sites and low-use forums don't need an Elastic cluster for log storage :) )
And sometimes you simply don't have the time to deal with this right now. So you do a quick hack, and fix it properly later.
I agree with this assessment. Of course it's not a solution. It's delaying the inevitable. But depending on the rate of "filling up the disk for unknown reason" it will buy you time.
So when you're running out of space, you immediately delete the junk file. Suddenly there's "No Problem" and you've reset the symptom back to hopefully well before it was an issue. Now you can run whatever you need to, do reports, do traces etc. Even add more storage if necessary.
More importantly, as soon as you delete that junk file now you have space for logs. You have space and time for investigation.
> 1) It's not intended to be a Real Solution(tm). It's intended to buy the admin some time to solve the Real Issue.
It doesn't do that though. If you don't have monitoring/alerting that can either a) give you sufficient notice that you're trending out of disk space, b) take action on its own (e.g. defensively shutting down the machine), or c) both of the above, then having your server disks fill up is bad whether you have a ballast file or not.
If your database server goes to 100%, you can't trust your database anymore whether you could ssh in and delete an 8GB file or not.
I find that either a server needs more space, or has files that can be deleted. For the former you just increase the disk space, since most things are VMs these days and increasing space is easy. For the latter you can usually delete enough files to get the service back up before you start the proper cleanup.
If you really need some reserve space (physical server), I'd much rather keep it in a VG (or a ZFS dataset / Btrfs subvolume). Will you remember the file exists at 2am? What about the other admins on your team?
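The LVM variant is simply leaving unallocated extents in the volume group and growing the filesystem only when you actually need the space; a sketch, with invented VG/LV names:

    lvextend -L +8G /dev/vg0/var    # grow the logical volume out of the spare extents
    resize2fs /dev/vg0/var          # grow the ext4 filesystem to match (or use lvextend -r)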
As someone who has been woken up at 2am for this exact issue, emphatically yes. I would much rather be back in bed than trying to Google the command to find large files on disk.
"proper monitoring" is extremely broad. And, I would say, almost unreachable goal.
You have it mail you when it goes over 80% disk usage (and what if you are on holiday)? Does it mail all colleagues? Who picks it up (I thought Bob picked it up, but Bob thought Anne picked it up. So no one did)? Does it come and wake you in person when it reaches 92%?
Will it catch that async job that fails (but never should) in an endless loop and keeps creating 20MB JSON files as fast as the disk allows?
Is it an alerting that finds anomalies in trends? Will it be fast enough for you to come online before that job has filled the disk?
I've been doing a lot of hosting management and such. And there is one constant: all unforeseen issues are unforeseen.
Slack warning/ticket at 75%, page at 85% (to oncall obviously). Don’t let user workload crap into your root partition. I’ve been doing this for over 10 years and managed many thousands of nodes and literally don’t recall full disk problem unless it was in staging somewhere where monitoring was deliberately disabled.
Your requirements for "proper monitoring" are not everyone's requirements.
On a current gig, we host at heroku. Our monitoring is all about 95th percentile response-times, secondary services, backlogs, slow-queries and whatnots. For another job, "disk space filling up" is important. Again another job will need to monitor email-delivery-rates and so on and so forth.
Keep in mind that sysadmins are essentially babysitting software that they do not develop. The hacks that we come up with are to work around the responsible party and help us get a good night's sleep instead of a 2am wakeup call. I try to cut you guys some slack; usually this proliferates when management decides they are willing to accept some inefficiency in favor of getting new features out the door. I get it, really.
My org is in the middle of a SRE introduction and for some reason I'm getting a lot of pushback on the topic of 'error budgets' and what to do with alerts when they are exceeded. Can't imagine why.
How does using the proposed solution prevent a 2am wake-up call? Your monitoring/alerting does; this just makes it easier to recover already-broken software. And btw I've been carrying pagers for more than a decade so I'm well aware of all the organizational dynamics here. The best way to prevent this is to have devs carry a pager too (Amazon's "you built it, you run it") - and magically your nighttime oncall is much more pleasant ;-)
THANK YOU. How are so many people in this thread content with saying "monitoring isn't perfect, this solution is ingenious"? Ofc nothing is perfect and even when you do everything right things can still go wrong, but if you don't have a ROBUST monitoring/alert system in place then you're not even doing the bare minimum. They're acting like it's rocket science to set thresholds and have meaningful alerts and checks in place. Not to mention if you wait until the disk is full you risk issues like block corruption among others, and your 8GB of space doesn't do anything. It's why people in this industry are on call, it's why they have monitoring on their monitoring systems. The bare minimum.
Yeah it's crazy. If someone does this on their homelab server it's probably fine, but if they run it in production I really want to know, because I'm not buying jack from them.
Of course! But do you put all your trust in your monitoring, 100%? You've never had monitoring fail for any reason at all? You've never had a server fill up before you can respond to the alert?
This 8gb file idea isn't to replace monitoring. It's to offer a quick stopgap solution so you can do things in a hurry and give yourself a little extra "out" when things go awry. Because believe me, they WILL go awry. And if you're not prepared for that eventuality, then I don't know what else to say.
> But do you put all your trust in your monitoring, 100%?
Yes. If I didn't feel that I can trust it, I would get another solution.
> You've never had a server fill up before you can respond to the alert?
I have. With the hack proposed in this article, it would have filled up even sooner, by exactly the amount of time it takes the problem to write 8GB of data.
> Because believe me, they WILL go awry.
In my experience: not in any way that this would help. If your disk fills up, it's either slow (and your monitoring alerts you days or at least hours before it's a problem) or it's really, really fast. In the latter case, it's much faster than you can jump on your computer, ssh into the machine and delete your spacer file.
Invest in better monitoring, that's much, much, much, much better than adding spacer files to fill up your disk or changing the wall clock to give you more time.
Ah I see where you are coming from. You see the spacer as a way to prevent a problem that should be prevented by better monitoring. But that's not what it is for. It's for quickly providing a stopgap so that you have time to solve the root cause without enduring more downtime.
If you've had a disk go full on you, what's the first thing you do? For me, I log in and start looking for a log file to truncate to buy me a few megs of space, at least. This spacer file is just a guaranteed way to find the space you need without having to hunt for it.
Also it doesn't HAVE to be 8GB. On most systems I think a 500mb file would be every bit as effective.
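For reference, creating the spacer is a one-liner; on ext4/xfs, fallocate reserves real blocks instantly (size and path are arbitrary):

    fallocate -l 500M /spacer.img
    # dd is the portable alternative and actually writes the bytes out:
    # dd if=/dev/zero of=/spacer.img bs=1M count=500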
> he had put aside those two megabytes of memory early in the development cycle. He knew from experience that it was always impossible to cut content down to memory budgets, and that many projects had come close to failing because of it. So now, as a regular practice, he always put aside a nice block of memory to free up when it's really needed.
In my work it is very common to make the memory map a little smaller than it has to be. If you can't ship an initial version in a reduced footprint you will have no hope of shipping future bugfixes.
Many years ago I spent a couple of weeks fixing a firmware bug. The firmware was only a few dozen bytes shy of the EEPROM. I just #ifdef'd out a bunch of features to focus on debugging what was broken, but to get the fix released I had to manually optimize several other parts of the code to get everything to fit in the 2MB or whatever it was.
Would've been nice if someone had reserved some space ahead of time. Maybe they did, but nobody was around who remembered that codebase.
My favorite part of that story is how the initial question about overflow should make it obvious that what they're doing doesn't work, but nobody noticed.
I'd read in 'Apollo: Race To The Moon', Murray and Cox, that the booster engineers had done something similar with their weight budget, something the spacecraft engineers wound up needing. Contingency funds of all sorts are a great thing.
Back in the late eighties a colleague of mine was making a game for the Atari ST, and he purposely put some time-wasting code in the game loop so that he had to work against a smaller budget. That gave him some contingency for later on, when he needed some extra cycles.
If true, I hate that story. Think of the better art assets that were needlessly left behind. How is it that said block of memory had never been identified by any profiling?
> Think of the better art assets that were needlessly left behind.
Consider how long it takes to edit or recreate art assets to reduce their size. Depending on the asset, you might be basically starting over from scratch. Rewriting code to reduce its size is likely to be an even worse option, introducing new bugs and possibly running slower to boot. At least smaller, simpler art assets are likely to render faster.
This is also the kind of problem that's more likely to occur later in the schedule, when time is even more scarce. Between these two factors (lack of time and amount of effort required to get art assets which are both decent looking and smaller), I think in practice you're actually more likely to get better quality art assets by having an artificially reduced memory budget from the outset.
1. Deal with possibly multiple issues, possibly involving multiple people, with all the politics that entails, resulting in a lot of stress for everyone involved, since any one issue could render it a complete failure.
2. Have extra space you can decide to optimise if you want. You could even have politics and arguments over what to optimise, but if nothing happens it all still works, so there is a lot less stress.
Better PMs do this today by having buffer features they can cut when needed. It'll handle the not-enough-memory issue as well as a meddlesome VP who thinks you're over-subscribed and wants you to cut to meet your dates.
Also, don't forget you're hearing decades-later retellings of someone else's story. I don't doubt that they trickled this extra space out as changing requirements mandated it, but that they kept from doing so until the team had actually reached a certain level of product-maturity and reclaimed all of their own waste first.
Remember that the PM's goal is to ship. Them blocking some assets but actually shipping is a success. Better 95% of the product than 0%.
There's a difference between "The server is not responding right now. We're losing customers.", and "Low resources during product development". Actually the latter may be a case of enforcing premature optimization.
So no, it's not the same idea.
I think we are thinking of a different baseline. You are thinking along the lines of "this should run, we can reduce server costs later", I would suggest (if I may) "the app needs to run on any Android device with 2GB RAM". And then you develop a game to run on a 1.5GB RAM phone, expecting that it will eventually fit into 2GB RAM budget.
A lot of tips in this thread are about how to better alert when you get low on disk space, how to recover, etc. but I'd like to highlight the statement: "The disk filled up, and that's one thing you don't want on a Linux server—or a Mac for that matter. When the disk is full nothing good happens."
As developers, we need to be better at handling edge cases like out of disk space, out of memory, pegged bandwidth and pegged CPU. We typically see the bug in our triage queue and think in our minds "Oh! out of disk space: Edge case. P3. Punt it to the backlog forever." This is how we get in this place where every tool in the toolbox simply stops working when there's zero disk space.
Especially on today's mobile devices, running out of disk space is common. I know people who install apps, use them, then uninstall them when they're done, in order to save space, because their filesystem is choked with thousands of pictures and videos. It's not an edge case anymore, and should not be treated as such.
A lot of measures are preventative, and kind of have to be.
Consider the hypothetical scenario of being totally out of memory. I mean completely: not a single page free, all buffers and caches flushed, everything else taken up by data that cannot be evicted. As a result, you cannot spawn a process. You cannot do any filesystem operations that would end up in allocations. You can't even get new page tables.
Hence things like Linux's OOM killer, which judiciously kills processes--not necessarily the ones you would like killed in such a situation. And again, a lot of preventative measures to not let it come that far.
Our Turing Machines still want infinite tapes, in a way.
I had this on my Ubuntu server... The NFS mount died for some reason and the downloading app wrote it all to the local filesystem, filling my SSD to the brim within minutes. By the time I ssh'd in, the NFS had remounted, so it took ages to figure out where all that disk space was actually being used, since all dir scan tools would traverse into the NFS mount again.
It felt like everything was falling apart. As soon as I deleted something another app filled it up in minutes. Even Bash Tab completion breaks... There really should be a 98% disk usage threshold in Linux so that you can at least use all system tools to try and fix it.
Early Symbian apps are an excellent example of how to write apps so that they don't crash when storage or memory becomes full. They just show an error dialog and the user can still use the system to free storage or memory. Modern phone apps either crash or the entire phone crashes in similar situations.
It doesn't help that the base model of many phones had ridiculously undersized storage for so many years.
"I have an unlimited data plan, I'll just store everything in the cloud." only to discover later that unlimited has an asterisk by it and a footnote that says "LOL it's still limited".
> As developers, we need to be better at handling edge cases like out of disk space, out of memory, pegged bandwidth and pegged CPU
In what situation though? Let's consider disk space. This certainly does not apply to all developers or all programs. Making your program understand the fact that the system has no space left does not seem like something that would be very productive in the vast majority of cases. Like running out of memory, it is not something the program can recover from all by itself unless it knows it created temporary files somewhere that it could go and delete. If that scenario does in fact apply to your program, then it's not even an edge case: the program should be deleting temporary files if it doesn't need them anymore. If the P3 was created to add support for that exact function, then I agree that it should be acted upon. A P3 is fine as long as it's reached. If you don't reach your P3s ever, then there are different issues that need addressing. I'd even say for something littering users' disks it should be higher than a P3, but the point is it's a specific case where it makes sense to handle that error. In every other case, your best bet is a _generic_ exception handler for write operations that will catch any failure and inform the user (e.g. "[Errno 28] No space left on device"), but that's something that should already be a habit.
There are cases when you want to try to avoid running out of disk space because your program might know that it needs to consume a lot of it (e.g. installers) so it will be checked preemptively. Even then you probably do want to try to handle running out of disk space (e.g. in the unfortunate event that something else consumed the rest of your disk _after_ you preemptively calculated how much was required) so you can attempt a rollback and inform the user to try again.
Other than that, when else is that _specific_ error more important than knowing that the data just couldn't be written in general? Let's say you have a camera app that tries to save an image. Surely you'd have a generic exception handler for not being able to save the image, rather than a specific handler for "out of space", which seems oddly specific considering there are literally hundreds of specific errnos you could be encountering that would prohibit you from writing. I'm sure the user doesn't want to see something like "Looks like you're out of disk space. Do you want to try save this image in lower quality instead?"
So my point in all of this is I agree that we should _consider_ the impact of disk space but it doesn't need to be prioritized by developers unless it's actually important like in the first few examples I gave.
It's important that you can recover from this condition.
For example, I'm working on an NVR project. It has a SQLite database that should be placed on your SSD-based root filesystem and puts video frames on spinning disks. It's essentially a specialized DBMS. You should never touch its data except though its interface.
If you misconfigure it, it will fill the spinning disks and stall. No surprise there. The logical thing for the admin to do is stop it, go into the config tool, reduce the retention, and restart. (Eventually I'd like to be able to reconfigure a running system but for now this is fine.)
But...in an earlier version, this wouldn't work. It updates a small metadata file in each video dir on startup to help catch accidents like starting with an older version of the db than the dir or vice versa. It used to do this by writing a new metadata file and then renaming into place. This procedure would fail and you couldn't delete anything. Ugh.
I fixed it through a(nother) variation of preallocation. Now the metadata files are a fixed 512 bytes. I just overwrite them directly, assuming the filesystem/block/hardware layers offer atomic writes of this size. I'm not sure this assumption is entirely true (you really can't find an authoritative list of filesystem guarantees, unfortunately), but it's more true than assuming disks never fill.
It might also not start if your root filesystem is full because it expects to be able to run SQLite transactions, which might grow the database or WAL. I'm not as concerned about this. The SQLite db is normally relatively small and you should have other options for freeing space on the root filesystem. Certainly you could keep a delete-me file around as the author does.
The author seems to forget that ext-based filesystems keep 5% of disk space available for root at all times by default, known as "reserved blocks".[0] That means if a non-root user uses all of the available space, it wasn't really all of the space -- root still has access to 5% free space within that partition. That's exactly the same as the useless 8GB file but in an officially-supported manner. If you run out of disk space, you actually have 5% left for root. So log in as root and fix the issue. Simple.
Also:
> On Linux servers it can be incredibly difficult for any process to succeed if the disk is full. Copy commands and even deletions can fail or take forever as memory tries to swap to a full disk and there's very little you can do to free up large chunks of space.
Why would memory swap to disk when the disk is full? I feel like the author is conflating potential memory pressure issues with disk issues.
How many serious production-grade servers even use swap, which usually just causes everything to grind to a halt if memory becomes full?
Sure, but one of the most common causes of disks being filled on Linux is either the kernel or some process running as root filling the disk with endless repeating log entries.
It's a good point that the process that is flooding the disk could be running as root and therefore bypasses the reserved blocks restriction.
It's also worth noting that the kernel is not a process and it does not write directly to files in the conventional sense. syslog-ng or its equivalents (which do run as root) will pick up messages from the kernel's circular ring buffer at /dev/kmsg and write them to a text file such as /var/log/kern.log, so it's possible that the kernel or one of its modules are verbose enough that syslog-ng causes that text file to get big. However, these log files can be limited to a certain size and/or logrotated[0] daily which will also remove log files older than a configured amount of time. In other words, there are better ways to manage servers than creating 8GB files as a bandaid.
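As a sketch, a size-based logrotate rule for a noisy log looks something like this (the file name and limits are just examples):

    /var/log/kern.log {
        size 100M      # rotate as soon as the file exceeds 100 MB
        rotate 7       # keep at most 7 old copies
        compress
        missingok
        notifempty
    }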
Log files are awful, log rotation is even worse. Logs are a continuous, persistent stream of messages that sometimes gets expunged when too old. Log files try to flatten and serialize them, but the coordination and management of them is a hassle. An FS like Reiser where each log entry is a file arranged in directory trees by minute, hour, day, month, and year, or a sqlite database would make more sense than spewing blindly into a giant text file and praying that it works when another process comes along and moves things around underneath a running program.
This is why on *nix systems, each directory of the vfs that is intended to accumulate files OR have different mount options is broken-off into its own volume. / /home /usr /usr/local /var /var/log /var/lib/docker /var/lib/postgresql and so on.
Running `tune2fs -m 0 /dev/sdaX` as root instantly makes the reserved blocks available. You could even just lower it by 1% which would be more than sufficient in the meantime: `tune2fs -m 4 /dev/sdaX`. When you've fixed the issue and freed up space, you can increase it to 5% again.
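And to see what is currently reserved before touching anything (device name is a placeholder):

    tune2fs -l /dev/sdaX | grep -i 'reserved block count'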
sshd runs as root, so yes, it will be possible to login as any user -- the sshd daemon will be able to function. :-)
The only files that would be appended to on user login (by sshd, which is running as root) would be /var/log/{utmp,wtmp,btmp} to record the login (in practice I've only seen wtmp). After that you have sshd logging (e.g. /var/log/{messages,syslog,auth.log}) which is picked up by syslog-ng or its equivalents, which also run as root.
Regarding not being able to login via SSH as root by default: the sshd_config default is actually `PermitRootLogin prohibit-password` which means you can login if you use public-key cryptography, it just won't allow you to login with the root password (even if it's correct.) It's good practice these days to use public keys for SSH anyway, so I wouldn't say this is much of a setback.
You can login as root at the console on many providers. Even where not allowed (eg. AWS) you can mount the volume on another instance and clean it up from there.
> The minimum amount of space guaranteed to a dataset and its descendants. When the amount of space used is below this value, the dataset is treated as if it were taking up the amount of space specified by its reservation. Reservations are accounted for in the parent datasets' space used, and count against the parent datasets' quotas and reservations.
ZFS, like ext[34], also has a reserved space allocation for the entire pool to allow you to still do certain operations when it's reporting "0" free space (...like deleting files or snapshots, to free space).
Unlike ext[34], that reservation is not available for root's general use, but it's there.
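If you want a root-reserve-like escape hatch on ZFS, one approach is a dataset that exists only to hold a reservation; the pool and dataset names here are invented:

    zfs create -o mountpoint=none -o reservation=8G tank/ballast
    # in an emergency, hand the reserved space back to the rest of the pool:
    zfs set reservation=none tank/ballast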
I know about this, but I do think it's not a bad idea doing what he does because the reserved block count is for root and most server processes still run as root. And it's usually them that are causing the disk to fill. Though I suppose this also makes the problem itself more prominent in the first place. I guess if you run into this a lot, stricter monitoring would be a better solution.
The way I found out about it originally was because I was using external storage drives and I was never able to fit as much as I expected :D
Luckily you can easily change this without reformatting.
What servers usually run as root? Some may start as root, but usually drop privileges for the actual server processes quickly, eg. apache, nginx, sshd.
Nothing that actually does the "serving" or accesses data should be running as root.
It used to be common, before "the cloud", to have many apparently unnecessary partitions in a server install. One for /, one for /var, one for /home, one for swap at the low sector numbers...
The idea is that /var filling up would not make the system unrecoverable.
Is that true for all logfiles? I still have plenty of daemons (by default) writing directly to some file in /var/log eg EXIM, Apache, and the like. Also plenty of system stuff still write to files in that directory. And yes this is a machine that uses systemd.
Most vendors (Debian/Ubuntu, RHEL/clones, etc.) add a hook into rsyslog to be a partner with the systemd logger and write out text files next to the journal - they realize that a lot of people dislike dealing with journalctl (I'm one of them) and provide an alternate hook already installed and working for you behind the scenes.
This is for daemons using syslog methodology, not direct writers like apache/nginx/mysql/etc; think more like cron, systemd, chrony, NetworkManager, and so forth. The vendors are not all aligned on what goes where (example: on RHEL, pacemaker/crm write to their own logs but on openSUSE they're sent to syslog) - the actual results differ slightly from vendor to vendor.
DIY distros like Arch do not implement the rsyslog backend by default, you have to set it up yourself following the wiki - only journalctl is there by default.
But those daemons don’t usually have their own log writer processes running as root, do they? Instead, either the log file is accessible by the user the daemon is running as, or the daemon opens the log file as root before dropping privileges for the rest of its operation.
My rule of thumb to avoid these issues is that application/server data gets its own dedicated volume that contains nothing else: logs get their own volume, and root its own. It's an especially bad idea for an application to put its data and logs in the same directory where its binaries reside.
That way, even if your log volume or root somehow fills up before monitoring had a chance to react, your service is unaffected. You can even catch issues pre-emptively by keeping log volumes small so that weird behaviour is likely to trigger an alert before anything goes truly wrong.
On cloud instances, it's silly to put anything on the instance root volume (on AWS, I keep them at the default 8 GB; it's never been a problem) when you can just attach an arbitrary number of additional disks. Container systems would use persistent volumes, and with physical servers, you use LVM or equivalent. This solves most disk allocation issues and makes operations easy when you need more space.
I suspect the blog author did not understand this (based on the content) - as a Linode user myself, I just had a look at one of my VMs and they install with the regular 5% reserved space (ext4/Debian).
Funny, because I have always run tune2fs -m1 or tune2fs -m0, since the reserved space was never supposed to scale linearly with hard drive capacities and is not useful to userspace in any way. I've never had any issues and have been doing it for decades in commercial applications. In some cases, where you probably shouldn't be using ext3/4 anyway, we are talking about reclaiming TBs of reserved space.
It's important to note that mkfs doesn't care if you are formatting the root partition or a data volume partition, it will still reserve space for the kernel.
If you try to use a file system to 99% full --- and it doesn't matter whether it is a 10GB file system or a 10TB file system, you will see significant performance penalties as the file system gets badly fragmented. So that's why having a fixed percentage even for massively big disks still makes sense.
Disk space is cheap enough that even 5% of a 14TB disk is really not that much money --- and if you see bad performance, and then have to pay $$$ to switch to an SSD, maybe it would have been much cheaper to use a HDD with a larger free space reserve....
> If you try to use a file system to 99% full --- and it doesn't matter whether it is a 10GB file system or a 10TB file system, you will see significant performance penalties as the file system gets badly fragmented.
Not true, I've checked. I have plenty of Linux ext3 servers running for many years that routinely drop down to 1% free space for extended periods before being cleaned-up, which still have essentially zero fragmentation. You can create plenty of 10MB files on a multi-terabyte volume that has under 1% free space, as that's still tens of gigabytes to work with.
Obviously at some point you'll hit a severe problem and it's best to avoid taking a chance, but a fixed percentage really isn't the best measurement to tell you where that horizon will be.
It's true, but it's more true for some file systems than others. When you write a file larger than the contiguous available space after its starting point, a file system must break the file into "extents" (chunks). The less space available, the smaller the extents tend to be, and the more fragmentation you will impose for continued writes. It's just math.
Different file systems have wildly different strategies and data structures behind this process, however. Some drop to their knees over 92-93%. Some can write to the last byte with reasonable efficiency—but it'll never be as fast as when it was empty. Copy-on-write systems like ZFS tend to do poorly under near-full conditions.
> When you write a file larger than the contiguous available space after its starting point, a file system must break the file
And why would having a 10TB file system force you to write 100GByte files, whereas a 10GB file system would write 100MByte files instead?
Because that's the topic you're responding to... GP said it doesn't matter whether it is a 10GB file system or a 10TB file system, it "gets badly fragmented" when you get to "99% full".
Like most things, it depends on your workload. If all of the files are written all at once when they created (e.g., no slow append workloads) and the files are all the same size, then that's a very "friendly" workload that will be much less likely to suffer fragmentation. But if you are creating and deleting files that have a large range of sizes, and some files grow gradually over time, then free space fragmentation will tend to occur much more quickly.
Has anyone succeeded in looking up the motivation behind the reserved space idea in ext filesystems, e.g. a commit message from the time this feature was introduced? I've tried but failed miserably, got lost in many different git repositories.
One fun fact I learned though - one of the reasons is "for quota file". Since ext stores quota information in an ordinary file, there could be an issue where user fills up a disk but the information about it isn't written in quota file due to lack of space.
> On Linux servers it can be incredibly difficult for any process to succeed if the disk is full. Copy commands and even deletions can fail or take forever as memory tries to swap to a full disk and there's very little you can do to free up large chunks of space.
This reasoning doesn't make sense. On Linux, swap is preallocated. This is true regardless of whether you're using a swap partition or a swap file. See man swapon(8):
> The swap file implementation in the kernel expects to be able to write to the file directly, without the assistance of the filesystem. This is a problem on files with holes or on copy-on-write files on filesystems like Btrfs.
> Commands like cp(1) or truncate(1) create files with holes. These files will be rejected by swapon.
I just verified on Linux 5.8.0-48-generic (Ubuntu 20.10) / ext4 that trying to swapon a sparse file fails with "skipping - it appears to have holes".
Now, swap is horribly slow, particularly on spinning rust rather than SSD. I run my systems without any swap for that reason. But swapping shouldn't fail on a full filesystem, unless you're trying to create & swapon a new swapfile after the filesystem is filled.
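For completeness, the hole-free way to make a swapfile that swapon will accept (path and size are arbitrary):

    dd if=/dev/zero of=/swapfile bs=1M count=2048   # real blocks, no holes
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile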
Not sure about their reasoning, but if you don't have root ssh enabled, sudo can break if there is no free disk space. I do something similar where I write a 500MB file to /tmp and chmod 777 it so anyone can free it up without needing sudo.
I've experienced far more full disks than I'd want to admit, on many different hardware and software configurations, and I've never seen sudo break. Is this something you've experienced recently?
I definitely agree with your advice and will go double check all my servers if /filler is 777 (not in /tmp since it's sometimes mounted tmpfs), but if sudo does break in that situation, that sounds like a pretty severe and most likely fixable bug.
Yeah, I have seen it recently on CentOS 7. I think it's due to having sudo logging enabled. It won't let you run sudo unless the logging is working (at least I think that's the case, but I haven't spent the time to investigate too thoroughly).
I've never had sudo break on my full disks. However, that doesn't mean recovery is easy...
Working in a terminal to find out what on earth has just filled up your disk is a real pain when your shell complains about failing to write to your $HISTFILE and such. And, of course, the problem always shows up on that one server that doesn't have ncdu installed...
I'm sure sudo can theoretically break with 0 free disk space, but that's not the usual mode of failure in my experience. At most sudo need to touch a dotfile or two, so deleting _any_ temporary file or old log archive will do for it to recover.
The balloon file is not a bad idea. I think I will apply it on my own servers just for good measure, although 8GiB is a bit much for my tastes.
You recall incorrectly; swap is not needed. It's not just me who runs without it; Google production machines did for many years.
"The behavior is often worse without swap" is more vague / subjective. I prefer that a process die cleanly than everything slow to a crawl semi-permanently. I've previously written about swap causing the latter: https://news.ycombinator.com/item?id=13715917 To some extent the bad behavior can happen even without swap because non-locked file-backed pages get paged in slowly also, but it seems significantly worse with swap.
zram is a decent idea though. I use it on my memory-limited Raspberry Pi machines.
Well, I'm not inventing it. This phoronix article [1] leads to a LKML post [2]:
> Yeah that's a known problem, made worse SSD's in fact, as they are able
> to keep refaulting the last remaining file pages fast enough, so there
> is still apparent progress in reclaim and OOM doesn't kick in.
And so on. That's part of the reason there is a working effort towards creating a userspace OOM daemon. I'm grateful for that effort, since I'd rather have some apps crash than have my system unusable. However, there is an issue with the Linux kernel thrashing.
I couldn't find mention of it in that thread, but it's mentioned multiple times in passing (as well as in the phoronix comments) that the system behaves worse without swap. It's also noted as one of the reproduction steps.
I agree that it's vague and subjective, but I'm pretty sure I've seen some serious discussion of the topic. There are a few sources that point in that direction [3-4], but I haven't seen anything conclusive.
I'm surprised that the swap implementation (or distros' swap configuration) is so bad on Linux. I have a fresh Ubuntu 20.10 desktop installation and a few times it has come to a crawl where you can't do anything, can't escape to a tty or ssh; reboot is basically the only option.
I recently learned searching the internets that it was the swapfile configuration. My 16GB RAM machine got a 2GB swapfile from the default installation so I doubled it to 4GB, but today it almost got full again, luckily I was just about to close a big memory hog and that saved me. I have now increased the swapfile to 6GB.
If Ubuntu is giving default installations 2GB of swapfile on 16GB of RAM, shouldn't this happen to lots of users quite frequently? How many users are technical enough to understand this and fix it? How can this be a good solution?
I can't recall that I ever got in trouble on Windows because of the swapfile.
That manually triggers OOM kill, making your system responsive again at the risk of killing something it shouldn't. At some point, I was probably using it five times a day. It got better when I started using swap (easier to recover when both full). I really recommend zram.
I had zero problems without swap files for a long time, thinking that 16GB was a lot of RAM and that I could manage. It wasn't and I couldn't.
Now I run two swap partitions the size of physical memory on all machines. Turns out that after some weeks a good third of allocated RAM lives in swap, proving its usefulness... and it's quick to turn the aux swap partition into one for LVM snapshots, and I suppose it could act as part of the set to emulate 'balloon files' too.
tl;dr: bad defaults. Go with the size of physical RAM and get suspend/hibernate/hybrid working and tested too.
Use a second RAM-sized partition for snapshots/LVM and emergency space/LVM|ZFS.
That explains it. I noticed that performance degraded if my machine was on for a while; it was probably the swapfile then too, especially when it only had 2GB to work with.
Yes, I probably need to set my swapfile to 16GB, I'm also investigating the swappiness setting, default 60 for desktop.
>> It's not just me who runs without it; Google production machines did for many years.
But then you need other mechanisms monitoring "out of memory" situations. For example, Kubernetes also requires nodes to have swap disabled. But running workloads should have resource requests/limits defined, and then the kubelet config specifies minimums on free disk and RAM. If there is not enough RAM, workloads are evicted to another worker node with more free memory automatically.
I agree with your comment if you remove the words "but then" and "other". It's a good idea for orchestration software to detect/avoid unhealthy nodes and enforce the resource limits it uses for bin-packing. It's also a good idea for monitoring systems (by which I mean ones which do alerting and visualization) to track memory usage and pressure stalls. [1] I wouldn't say swap is a substitute for these in any way. A swapping machine is an unhealthy machine.
Depends if you want things to gracefully degrade because you know you don't have enough RAM, or if you'd rather things just straight up die. E.g. for the things I work on with my laptop: if whatever I'm doing isn't going to work with 128 GB of RAM (80% of which was meant to be cached data, not actually used), then it's because something went horribly wrong and needs to be halted, not because I needed some swap, which would just hide that things have gone horribly wrong for a minute and then die anyway. Now, if I were doing the same things on a machine with 8 GB or 16 GB of RAM, then yeah, I'd want to gracefully handle running out of physical memory, because things are probably working correctly, it's just a heavier load, and it can be better to swap pages to disk than to drop them from a small cache completely.
Personally, I prefer things that use a huge amount of memory to die instead of having the machine become unresponsive, especially if I have latency-sensitive things at the same time, like a conference call.
Now, my main issue is that disabling swap doesn't reliably get me that. With swap, OOM eventually ends up being detected and corrected by the kernel; on some machines it doesn't happen without swap, and often not with swap either. A sweet spot seems to be a current-to-last CPU gen with 8 GB of memory.
There's probably a lot of factors at play: what makes a task "hung"? How memory pressure is measured. Stuff probably changed with time, SSDs are now pretty quick, yet storage is a lot slower than RAM compared to what it used to be.
Anyway, I use zram and have upgraded my memory, and I'll also increase my swap partition size a bit in light of all of this. I'll probably have a look at userspace OOM daemons too.
One of the essential services crashed and then the whole unit rebooted automatically. I don't remember the details now since it was a year ago, but that's approximately what happened. Again, that's not a regular x86 fat server with Linux, but an embedded ARM device running barebones Debian plus some custom software, so it may behave differently.
> The disk filled up, and that's one thing you don't want on a Linux server—or a Mac for that matter. When the disk is full nothing good happens.
I had this happen a few times on a Mac, and every time I was shocked that if the disk gets full you cannot even delete a file and the only option is to do a full system reboot. I was also unable to save any open file, even to an external disk, and suffered minor data loss every time because of that.
What is the proper way of dealing with such issue on macOS? (or other systems, if they behave the same way)
> I had this happen a few times on a Mac, and every time I was shocked that if the disk gets full you cannot even delete a file and the only option is to do a full system reboot. I was also unable to save any open file, even to an external disk, and suffered minor data loss every time because of that.
This just happened to me. I got the best error message I've ever seen. Something akin to "Can not remove file because the disk is full." This wasn't from the Finder, this was command line rm.
On the Mac it's also exacerbated by the fact that swap will use the system drive and can fill up the disk, and can not be stopped. If you have some rogue process consuming RAM, among other things, your disk will suffer until it is full. And, as mentioned, macOS does not behave well with a full disk.
And, even if you've remedied the swap issue (i.e. killed the process), there's no way I know to recover the swap files created without restarting.
Just seems like the design is trouble waiting to happen, and it has happened to me.
When this last happened, somehow it managed to corrupt my external Time Machine volume.
I've been living with this for the past few years. The only remedy is to do a full system reboot. Sometimes I reboot a few times a night.
One way to buy yourself some time is to disable the sleep file. I'm not sure what it's called -- it's a file that MacOS uses to let the computer hibernate when there's no power. It's a few GB, which (like the blog post stated) is a nontrivial amount of freeable space.
> I'm not sure what it's called -- it's a file that MacOS uses to let the computer hibernate when there's no power. It's a few GB, which (like the blog post stated) is a nontrivial amount of freeable space.
Should be /var/vm/sleepimage and the same size as your RAM.
I don't know about Mac, but this is why many Linux distros recommend putting /home on a separate partition. If it fills up, it won't lock up the whole system.
Fun story with this. Ubuntu now has an experimental root-on-zfs feature. I installed it and started playing with some docker containers, trying to compile a certain version of pytorch. Suddenly, my computer crashed. Apparently, my root partition filled because docker installed everything on the same partition as my OS, crashing everything immediately.
I even hacked my MacOS to disable the message. Computers shouldn't nag their owners repeatedly, even if it's in their best interest, unless the computer is about to catch fire.
Wouldn't it be better to take the system's advice and clear off some space rather than playing Russian roulette?
I've had the displeasure of pointing a large video export to the wrong drive, and other misdeeds that allowed a drive to fill up. It's not pleasant. A simple reboot sometimes frees up the swap space to allow for more spring cleaning, but I typically just resort to booting into recovery mode and searching the web for the proper terminal command to decrypt/mount the root volume for spring cleaning.
I ran into this on OpenWRT back in the day. It had a similar filesystem behavior where you could not delete from a full FS. The solution was to truncate a file that was at least 1 block big, thus freeing up a few kilobytes. Then you can rm a large file, and then you can resume normal cleanup.
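In shell terms the trick is just (the paths are examples):

    : > /var/log/big.log     # truncate in place: frees blocks without needing new ones
    rm /path/to/large-file   # with a little space back, normal deletion works again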
One thing that many Linux/Unix users do not know is that all commonly used filesystems have a "reserved" amount of space to which only "root" can write. The typical format (mkfs) default is to leave 5% of the disk reserved. The reserved space can be modified (by root) any time, and it can be specified as a block count or a percentage.
As long as your application does not have root privileges, it will hit the wall when the free+reserved space runs out. Instead of the clumsy "spacer.img" solution, one could simply (temporarily) reduce the reserved space to quickly recover from a disk full condition.
This reminds me of Perl’s esoteric $^M variable. You assign it some giant string, and in an out-of-memory condition, the value is cleared to free up some emergency space for graceful shutdown.
“To discourage casual use of this advanced feature, there is no English long name for this variable.”
But the language-build flag to enable it has a great name: -DPERL_EMERGENCY_SBRK, obviously inspired by emergency brake.
I happen to run a couple of small servers myself and here's a better version of this approach. Create a cron job that will run a simple self-testing script once every few hours. My self-test does this:
1. Checks that all domains can be accessed via HTTP and HTTPS. If not, DNS might have died.
2. Checks that a few known CMS-generated pages contain some phrases they should contain. If not, SQL might have died.
3. Checks that the HTTPS certificate has enough runway left. If not, certbot might have died.
4. Sends a basic email message from my domain to a gmail account. Receives it via IMAP and sends a reply. Then, verifies the reply. This catches a whole bunch of mail-related issues.
5. Checks the free RAM and disk space. Updates an internal "dashboard" page and sends me an email if they are off.
It only took a couple of hours to hack this together and I must say, I've been getting a much better night's sleep ever since.
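A rough sketch of a couple of those checks, assuming curl and openssl are available (domain and thresholds are placeholders):

    #!/bin/sh
    # 1. is the site reachable over HTTPS?
    curl -fsS -o /dev/null https://example.com || echo "example.com: HTTP check failed"

    # 3. does the certificate have at least 14 days of runway left?
    echo | openssl s_client -connect example.com:443 -servername example.com 2>/dev/null \
      | openssl x509 -noout -checkend $((14 * 24 * 3600)) \
      || echo "example.com: certificate expires within 14 days"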
And I would even love to maintain and support it if there was a culture of paid software on Linux. But because the status quo is that everything should be free and "if you publish it, it's your duty to support it", we're stuck reinventing our own wheels.
Lots of comments assailing this approach as a poor replacement for monitoring miss the point. Of course monitoring and proactive repair are preferable - but those are systems that can also fail!
This is a low cost way to make failure of your first line of defense less painful to recover, and seems like a Good Idea for those managing bare-metal non-cattle systems.
This reminds me of an old gamedev story that I have no idea how to find. The project was getting near to shipping, they had cut all the space they could cut, but they still needed another megabyte of space. After a week of this, the senior dev told the narrator to meet him in his office, and he closed the door. He opened one of the project files and deleted a 1 MB static array. "At the beginning of development I always reserve space for just this occasion," he said. Shortly afterwards he emerged from his office, announced that he had been able to find some extra space, and was lauded as a hero.
When I worked at SevOne, we had 10x500 MB files on each disk that were called ballast files. They served the same purpose, but there were a couple nice tools built in to make sure they got repopulated when disk space was under control, plus alerting you whenever one got "blown." IIRC it could also blow ballast automatically in later versions, but I don't remember it being turned on by default.
This is why the invention of LVM was such a good idea even for simpler systems (where some people claimed it was useless overhead). In my old sysadmin days I never allocated a full disk. The "menace" of an almost full filesystem was usually enough to incentivize cleanups but, when necessity came, the volume could be easily expanded.
I think that in the past I saw that when creating a file with e.g. ...
dd if=/dev/zero of=deleteme.file bs=1M count=8192
...the "free space" shown by "df" slowly decreased while the file was being created, but then once the operation completed that "free space" magically went back to its original value => the big existing file (full of "0"s) was basically not using any storage.
Is this what you mean?
I just tried to replicate this behaviour but, dammit, I cannot demonstrate that right now as the behaviour so far was the expected one (free storage decreasing when creating the file and sticking to that even after the completion of the operation).
I strongly believe that that's what I saw in the past (when I was preallocating image files to then be used by KVM VMs), but now I'm wondering if I'm imagining things... :P
EDIT: this happened when using ext4 and/or xfs (don't remember) without using any compression.
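One way to check whether a file like that is actually taking up space is to compare its apparent size with the blocks allocated on disk; dd from /dev/zero writes real blocks, so the effect you remember sounds more like a sparse file (e.g. created with truncate) or a compressing filesystem:

    ls -lh deleteme.file   # apparent size
    du -h deleteme.file    # blocks actually allocated on disk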
You have to beware if you're on a filesystem (such as ZFS) that has compression enabled. A file of all zeros compresses quite well, and may not get you the space you need when you remove it.
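On a compressing filesystem the ballast needs incompressible contents to actually pin down the space; something like this (path and size arbitrary):

    dd if=/dev/urandom of=/ballast.img bs=1M count=512   # random data doesn't compress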
That sets the root reserve on a disk. It's space that only root can use, but you can also change it on the fly. So if you run out of userland space you can make it smaller, and if a process running as root fills your disk, well, you probably did something real bad anyway. :)
How is this better than sounding alarms when free disk space drops below 8GB? If you’re going to ignore the alarms, then you’re going to have the same problem after you remove your spacer file and the disk fills up again!
Okay, so now you have a disk full, only become aware of it when it's full and your database throws errors. You have an easy way to fix it, just delete the spacer file. But what good does that do? You're still in the mess where your database is really unhappy.
On the other hand, if your monitoring was set up well, you got a notification and had time to react to it before it was at 100%.
Granted, if you have a process that just wrote a file at maximum speed, that time window is tiny, but that's not usually what happens in my experience. What happens is that something starts logging more and it slowly builds up while you're happy that your server is running so well that you don't need to pay attention. And then the alert comes and tells you that there's less than 10% space available, and you have plenty of time to investigate and avert the crisis.
>You have an easy way to fix it, just delete the spacer file. But what good does that do?
You solve the issue right then and there.
Step 1. realize there is a space issue and get to terminal
Step 2. free space so any fix has room to work
Step 3. Solve by doing <??? specifics ???>
Let me try to explain what the other person is saying.
If you have an 8gb spacer file, at some point the disk will get full and cause some errors, so you will have to log in, remove the spacer file and then deal with the problem.
If you have an alarm for 8GB remaining, you will receive the alarm before any application ever notices the disk is full. You will have basically the same amount of time to solve the problem for good, but if you're able to solve it before those 8GB also run out, you won't have to deal with any "app crashed/misbehaved because it reached a point where no space was available" issues.
Your described scenario is no different than when you would have set your first warning at 16GB disk space, except that you won't have to scramble to delete the roadblock halfway between here and the crash zone.
So if your alarm sounded when there was 8 GB of free disk space (instead of 0 GB), then you could still respond in the same amount of time and you would still have an additional 8 GB worth of padding while you determined the root cause. The only difference is that you wouldn't need to actually go in and delete the spacer file (and potentially have downtime in the time it takes you to delete the spacer file).
Another way to think of this is that you have the 8 GB spacer file, but when the disk fills up the spacer file is automatically deleted and your alarm goes off. Which is literally the same as having your alarm go off when free disk space reaches 8 GB.
It isn't though. Whatever rogue process is generating the garbage so quickly has likely thrown and died (potentially leaving other, useful processes, able to continue work). Not 100%, of course, but there's a solid chance that the garbage will stop being generated.
Also, forcing manual intervention has a psychological effect. An alarm that goes off at 8 GB remaining? Eh, I'll get to it at some point. A "disk is full, error, error, everything is broken"? I will deal with it -right now-, especially since I know a fix. Do that, with an alarm at 16 GB (so I still get the early alert in the event I'm that good a citizen and actually prioritize getting to the bottom of it even though it's caused no issue yet), and I'm in a better position still.
The spacer solution guarantees you will always have one instance of "disk got full", even if the fill rate is slow. With an alarm there is a chance you end up solving the problem with zero instances of "disk got full" if the fill rate is not that big. That is the main difference.
I don't understand. You will still have an alarm when the disk fills up, and you will need to respond and delete the spacer file. Your response time latency will be the same, right?
I believe you are correct. The other replies are really not addressing your critique at all. Both solutions require some form of "alerting", so why not just do it the proper way? Worried about how long an alert will take to respond to? Well, alert when 16GB remain, and you've just bought yourself more time!
Both solutions assume that you will have some way of knowing when the disk is full. Whether the "alarm" is an automated health monitoring system, or an angry customer calling your cell phone, there's no point in discussing how to solve problems without assuming that you have some way of knowing that a problem exists.
This would work if you have sufficient time between alarm and failure. If some issue or process uses up all of your available disk space in a short time span, you won't have that luxury. Hopefully, the author is using alerts on top of having this failsafe.
What?!? How does that work? Does he just draw up a blueprint and write "solid gold block goes here" and then some contractor says "yes that gold block will be $NNNNN" and includes it in the budget??
It's likely not literal. He likely quotes for price +50k or something like that, so that people will start thinking about reducing the price before they run out of budget.
My understanding is this is why one should partition a drive. If you have a data partition, a swap partition, and an OS partition, you can get around issues where a server’s lack of disk space hoses the whole system.
100% agree. I think at the bare minimum every system should have two partitions: `/` and `/var`
/var is usually where the most data gets added. Logs, database files, caches, and whatever other junk your app spits out. 99% of the time that's what causes the out of space issues. By keeping that separate from root, you've saved your system from being completely hosed when it fills up (which it will).
Obviously there are other places that should get their own mounts, like /home and /usr, but before you know it you've got an OpenBSD install on your hands with 15 partitions :)
/var gets its own disk on my machine! Yay academic machine learning docker containers. What's a few 30GB docker images between friends? So yeah, /var gets to live with training data on an SSD which gets cold-swapped onto spinny platters as needed. /home is another, on the "main SSD".
What do you mean, "lose mindshare"? Docker is one of the absolute best things to happen to ML. My work is largely algo integration - I take academics' ML code and bundle it for API consumption. Before this, I was a bench chemist, so I kind of have a thing for reproducing experiments.
The academic ML scene has a reproducibility crisis that makes other science reproducibility crises look like Phys 100 labs. These things depend on someone's conda env with versions pinned to nightly builds, inscrutable code written by a postdoc that has since left, datasets which have been mutated since the paper was published, on and on.
Docker gives me a fighting chance to actually get reproducible results without going stark raving mad.
Should these researchers have better software process to avoid this situation in the first place? Heck yes. But these are people doing things like `os.system("rm " + filepath)` and git committing entire models, they are just really green and most don't know any better yet and academia isn't really known for its mandatory Practical Software Engineering classes for CS majors.
You had a flash of self-awareness. Docker containers take massive amounts of disk space, and you need arcane knowledge just to use them. I'd rather focus on doing ML, not learning docker. In other words, "I'm getting too old for this." (33 is up there.)
It bugged me in the game industry when some old programmer said "shaders are a young person's game," implying that they wouldn't even look into how they worked. Now that I am that older programmer, or getting there, I see what they mean. Docker arcana is a young person’s game.
If Docker solved the reproducibility crisis, you might have a point. But it doesn't. Most of the crisis is the fact that (a) datasets are trapped behind institutions that won't make them available, (b) the models themselves aren't made available (OpenAI), and (c) the code itself isn't available (also OpenAI).
Those three things are the main problem. Forcing everyone to use a 30GB docker container just to do basic ML isn't going to do anything but waste time and turn newbies away from ML.
You're a fine debater though. It was an enjoyable read; have an upvote.
A place I used to work achieved something similar with lvm thin provisioning and split out something like /, /home, /var, /var/log and maybe a couple of others. I think they also had something clever with lvm snapshots to roll back bad updates (snapshot system, upgrade, verify), so even if an update went rogue and deleted some important, unrelated files it could be undone.
Do people not create partitions any more? I thought this was sysadmin 101 for, like, forever. Databases, web servers, etc. should never ever fill up the entire disk. Separate partitions for boot, swap, /, /home, /var, and /tmp are the minimum common sense partitions.
If you happen to use ext as your default filesystem, check the output of tune2fs; it's possible your distro has conveniently defaulted some 2-5% of disk space as "reserved" for just such an occasion. As the root user, in a pinch, you can set that to 0% and immediately relieve filesystem pressure, buying you a little bit more time to troubleshoot whatever the real problem is that filled the disk in the first place.
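For reference, checking and relaxing that reserve looks something like this (the device name here is just an example):
tune2fs -l /dev/sda1 | grep -i "reserved block count"
tune2fs -m 0 /dev/sda1
Remember to put the percentage back (tune2fs -m 5 /dev/sda1) once the real problem is fixed.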
This points to a much more serious problem. This is 2021 and the technology is from the 90s, with a really poor user experience design. Your car warns you when you're low on fuel, but your server doesn't if you're low on critical resources.
Everyone has this kind of alerting set up, but that's not the point. The beauty of this solution is that it's dead simple and will never fail. Alerting can fail or be ignored.
It's the same as old VW beetles which had a reserve gas tank. When you ran out of gas you opened a valve and you could limp to a gas station. Less likely to fail versus a 1950's era gauge that is telling you you're low. Also impossible to ignore it.
> It's the same as old VW beetles which had a reserve gas tank. When you ran out of gas you opened a valve and you could limp to a gas station
In scuba diving there used to be "J-valves". When you had 50 bar left in the tank they would cut out. Then you would pull a rod to re-enable your air and return to the surface. Unsurprisingly they are no longer popular.
Same was true of most motorcycles until rather recently, though with motorcycles it was rare that there was a fuel gauge at all. A sputtering engine was how you knew it was low. And I believe that like with motorcycles, the "reserve tank" in an old Beetle is really the same tank - there are two hoses located in the tank at different heights.
> The beauty of this solution is that it's dead simple and will never fail. Alerting can fail or be ignored.
It's not that straightforward IMO. Would this file be deleted before the space is filled? If so, there is alerting in place, and it assumes there's a way to delete files before space fills up. If this file is deleted after space fills up, how is this different from not having the file, other than making finding files to delete easier? Then what happens after that? If you delete the file and realize there's nothing else to delete, you'd have to solve the problem the same way if you didn't use this method.
ramdisk? It wouldn't be the first time I'd extract a .deb to tmpfs to resolve a temporary issue.
Don't think I've ever encountered a critical issue where "add more swap" would be a serious disaster recovery solution. I've certainly seen situations where swap was nearing 100% full, and although I would have minutes of wall-clock time to formulate a strategy, those minutes have never allowed me to input more than a handful of characters or so.
The 'beauty' artificially chokes your HDD and produces the same problems that you are trying to avoid. Not a sane way to proactively manage your disk usage.
Alerting is also a hack really. In 2021 the operating SYSTEM should work as a system, managing its resources and making intelligent decisions. Ideally the OS should dynamically reserve as much of its resources as needed on its own.
That's a quality problem. Your car can absolutely drop from "alarm" to "empty" in 30 seconds if there's a leak in the tank. We just don't build fuel tanks that spontaneously develop leaks, partly because the manufacturer can be held liable.
Assuming we're talking about VMs (2021 etc.), for an SME is there any downside to giving 2TB of space to your disks and letting dynamic allocation do the work?
Perhaps consolidate/defrag once a year. Even monitoring total usage more often than that is probably not worth the effort - just buy ample cheap storage.
Also, there was a tradition to split drives into OS, DB, DB Logs. That was mostly a rust performance thing and these days is probably just voluntary management overhead.
If you are using less space than the underlying datastore, there's no benefit to dynamic allocation, you may as well give the servers larger fixed disks. If you are thinking that one server might need more than the fixed size for a sudden growth, then you need to be monitoring to deal with that because that will run out of your space. If you are overprovisioning the datastore, you have the same problem at a level lower, and need to be monitoring that and alerting for that instead (as well).
> "just buy ample cheap storage"; "That was mostly a rust performance thing and these days is probably just"
In the UK a 6TB enterprise rust disk is £150 and a 2TB enterprise SSD is £300, so it's 6x the price to SSD everything, and it takes 3x more drive bays, so add more for that. And you can never "just" buy more storage than you will ever need - apart from the obvious "when you bought it, you thought you were buying enough, because if you thought you needed more you would have bought more", which amounts to saying "just know the future better", it can't happen because Parkinson's Law ("work expands so as to fill the time available for its completion") applies to storage: the more there is available, the more things appear to fill it up.
Room for a test restore of the backups in that space. Room for a clone of the database to do some testing. Room for a trial of a new product. Room for a copy of all the installers and packagers for convenience. Room for a massive central logging server there. What do you mean it's full?
One VM using excessively more disk space than it's supposed to can potentially cause data corruption in all the other VMs on that system. For just spinning VMs up and down for testing, you probably won't run into that issue, but on a production system, it could potentially cause some massive downtime
Virtual machine disk space (e.g. Xen, Linode, AWS EC2, or similar) does not work this way. Each VM gets a dedicated amount of disk space allocated to it, they don't all share a pool of free space.
Yes they do with the "dynamic allocation" the parent comment mentions; a VMware datastore has 1TB total, you put VMs in with dynamically expanding disks, they are sharing the same 1TB of free space and will fill it if they all want their max space at the same time and you've overprovisioned their max space.
And if you haven't overprovisioned their max space, you may as well not be using dynamic allocation and use fixed size disks.
Even then, snapshots will grow forever and fill the space, and then you hope you have a "spacer.img" file you can delete from the datastore, because you can't remove snapshots when the disk is full and you're stuck. It's the same problem, at a lower level.
I see, a VMware feature, thanks for clarifying. I suppose it's a nice idea in theory, but you'd have to be crazy to use that in production, or for any workload that you care about. It would just be a ticking time bomb.
Hyper-V can do that too, and so can you under Linux. It's called thinly-allocated disks, sparse files, or the dm-thin device mapper target. Professional SANs also allow you to overallocate the total size of the iSCSI volumes offered.
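With stock lvm2, for example, a thin pool and an overcommitted thin volume look roughly like this (VG, pool and LV names are made up):
lvcreate -L 100G -T vg0/pool0
lvcreate -V 250G -T vg0/pool0 -n vm1
The same caveat applies: once the pool itself fills, every thin volume on it is in trouble, so the pool needs its own monitoring.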
Yes, I've seen that time bomb go off on multiple occasions. Never on my watch though.
Linux servers aren't like mass consumer products. It's assumed users know what they're doing and can build and configure what they need on top of it.
> This is 2021 and the technology is from the 90s
I don't see how this is a valid point. Is integrated circuit technology outdated because it was developed in the 60s?
There's no reason not to have multiple fail-safes. Receiving the alert on a device at 3am would still mean he could free up 8gb immediately and have breathing room to solve the problem. And remember this is for a single admin. Asking such a person to be on call 24-7 all year, vacations, holidays, weekends... Having a quick way to get breathing room can significantly reduce the stress & cognitive load of worrying about such things in your off-time.
No. Careful partitioning is the solution to this problem. Monitor the growth of your partitions and make sure nothing on rootfs or other sensitive partitions grow significantly.
I don't think this has to be an either/or scenario. Having some bloat you can get rid of quickly is a nice backup in case your monitoring fails for whatever reason.
They also hurt far more than dynamic subvolume allocations, due to their static nature. You still can't repartition an active disk without downtime under Linux AFAIK; it requires a reboot or unmounting all other partitions on that disk, even for partitions that didn't change.
I'll take lvm/zfs/btrfs subvolumes over static partitions any day.
To extend space in any filesystem in the root volume group on AIX you need space in /tmp. Years ago, while working for some major bank, I proposed to create such a dummy file in /tmp exactly for the reason of extending filesystems. It saved us several times :)
Back in my early university days the disks always seemed to be full at inconvenient times on the shared Unix systems we used. Some students resorted to "reserving" disk space when available. Which of course made the overall situation even worse.
All my servers have an alarm when disk usage goes above 70%. It sends an email every hour once the disk usage goes above 70%. Never had a server go down because of a disk space issue after adopting this practice.
Also one of the main reasons server disks go full is generally log files. Always remember to "logrotate" your log files and you will not have this issue that much.
Yes one more thing, for all user uploaded files use external storage like NFS or S3.
> for all user uploaded files use external storage like NFS or S3
We send our log files to S3 too. I mean, we write them locally (EC2) and then push them to S3 every minute.
Then we have a tool that will let us search the log files in S3 and it will parse these rotated log files and join together the relevant pieces depending on what we're looking for (or all of it for a specific time period if we don't know what we're looking for).
This is great because if the server goes down and we can't access it, or the instance is gone, we can still see log files from shortly before the problem occurred. We also use bugsnag, etc. for real time logging and tracking where possible.
This goes into the same vein I was going to point out.
Most uncontrolled space usage comes from logs, users doing user things, or something like build servers just eating temporary and caching storage for lunch. Databases also tend to have uncontrolled space usage, but that tends to be wanted.
So, if you push /var/log to its own 20-30GB partition, a mad logger cannot fill up /. It can kill logging, but no logging is better than fighting with a full /. Similar things with /home - let users fill up their home dirs and pout and scream about it... but / is still fine. And you can use their input to provide more storage, if they have useful workflows.
Something like databases - where their primary use case is to grow - need monitoring though to add storage as necessary.
Icinga is a common solution for monitoring FS and other use metrics. I imagine his setup, if custom rolled, is a shell script checking df and sending an email when the usage is at or above 70
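If custom rolled, something like this as an hourly cron job would match that description (mount point, threshold and address are just placeholders):
usage=$(df -P / | awk 'NR==2 {print int($5)}')
if [ "$usage" -ge 70 ]; then
  echo "Disk usage on / is at ${usage}%" | mail -s "disk alert on $(hostname)" admin@example.com
fi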
This really goes to show, there is more than one way to skin a cat. Yeah the guy could probably overhaul his entire approach to system administration, but also...this works. Well-placed hacks are maybe my favorite thing.
This won't work with ZFS, as it may be impossible to delete a file on ZFS when disk is full. The equivalent in ZFS is to create an empty dataset with reserved space.
A way to prevent this is to create a dataset and reserve n amount of space, typically 10-20%, and set it read-only, before the pool gets full. Then when the pool fills up, you can reduce the reservation to be able to clean up files.
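In ZFS terms that's roughly the following (the pool name is just an example):
zfs create -o reservation=10G -o readonly=on tank/spacer
# later, when the pool fills up and deletes start failing:
zfs set reservation=none tank/spacer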
It's interesting to me that linux doesn't natively reserve a little space to allow basic commands like directory listing and file deletion to function even with a full disk.
Because really the biggest problem when I've had a partition get full, is I sometimes can't even delete the offending log file.
Depends entirely on the design of the file system. In copy-on-write file systems, it's a necessity: you need to at least allocate a new metadata block that doesn't record the existence of some file... and that's assuming you don't have snapshots keeping it allocated anyway.
You can run into real trouble on btrfs if you fill it; it has no reserve space to protect against this scenario. ZFS at least reserves a fraction of the total space so that deletes are allowed to work even when the pool reaches 100% capacity.
No real good way to send messages on HN, but I recalled a convo we had a while back about phone switches and thought you might enjoy this YouTube channel - https://www.youtube.com/watch?v=ngRb6mBB9HY
This is an old trick for when you need to deploy to media with a fixed size - floppy/CD-ROM/etc. Make a file that is 5-10% the size of your media and don't remove unless you're running out of space in crunch time.
An alternative approach here... make sure (all) your filesystems are on top of LVM. This reduces the steps needed to grow your free space. Whether you have a 8gb empty file laying around, or an 8gb block device to attach...LVM will happily take them both as pv's, add them to your vg's, and finally expand your lv's.
If you are using LVM on all of your filesystems, it seems like a bad idea to use a file residing on LVM block device as another PV. And actually I'd be surprised if this was even allowed. Though maybe it is difficult to detect.
You'd effectively send all block changes through LVM twice (once through the file, then through the underlying block device(s))
LVM is just fancy orchestration for the device-mapper subsystem with some headers for setup information.
For block operations it's no different from manual setup of loop-mounted volumes, that also need to travel a couple of layers to hit the backing device.
Though there is an important caveat - LVM is more abstracted, making it easier to mistakenly map a drive onto itself, which may create a spectacular failure (haven't tried).
Yes LVM can help here. Another approach would be when you create the logical volume to intentionally under allocate. Perhaps only use 80-90% of the physical volume.
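Roughly, with made-up VG and LV names: create the LV at, say, 90% of the VG, then grow it and the filesystem later when you actually need the space:
lvcreate -l 90%VG -n root vg0
# later, when space runs short:
lvextend -l +100%FREE /dev/vg0/root
resize2fs /dev/vg0/root   # ext4 can be grown online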
If you are on an ext filesystem, reducing the reserved percentage on the full filesystem can save the day. It's more or less this same trick built into the filesystem.
IIRC 5% is reserved when the filesystem is created, and if it gets full you can run:
tune2fs -m 4 /dev/whatever
which will instantly make 1% of the disk available.
Of course, this should be used sparingly and restored when finished.
Can you do this on smartwatches? I know someone who went to the full extreme of a one-hour push, but they did it by setting their system time to the next time zone over.
A great idea, but it still leaves the possibility of performance issues before an admin is able to address it. Something like two 4GB blocks might work better: if you get within, say, 200MB of the storage limit you remove the first one and trigger an email/text/whatever to the admin, so they can address it before it goes further. It's an early warning and an automated solution. Then, if the situation continues, the second 4GB block is also automatically removed, with another message sent to the admin. Nothing fails silently.
This is why I insist on data and root partitions on all the machines I administer. Go ahead and kill the data partition, at least the root partition will keep the system up and running.
For ext* filesystems, you can use tune2fs to change the reserved block percentage to accomplish this in what might - depending on your preferences - be a more graceful way.
Basically it lets you knock 8 GB or more (although it's a percentage instead, 5% by default) off of the disk space available to non-root users.
When it hits 100% and things start breaking, that reserve can be used by root to do compression safely, move things around, and so on. Alternatively the reserve percentage can be changed with a single command (by root), to allow non-root processes more space while the admin contemplates what to do next.
One nice aspect of using the reserve instead of a file is that it prevents runs of "du" from including the file in their results. Another is that it's pretty much impossible to accidentally remove the reserve (or for some other admin to find it and decide it's superfluous).
This is less effective at sites that have a lot of services running as root, in which case only your approach is fully effective. I want to say "But who does that nowadays...", but it happens.
tune2fs apparently also supports allowing members of a certain unix group or user to have access instead of solely root.
The core command for all this is:
tune2fs -m <reserved-percent> <device>
One other thing you might want to worry about: inode exhaustion. tune2fs has an inode reserve % as well - and trying to emulate this by creating a few hundred thousand files instead would be... inelegant.
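A quick way to check whether inodes rather than blocks are the problem (the device is just an example):
df -i /
tune2fs -l /dev/sda1 | grep -i inode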
The real question... Why does Linux or at least the common filesystems get stuck so easily running out of disk space? Surely normal commands like `rm` should still function.
As recently as 2016 I experienced major problems using `rm` with an intentionally-filled btrfs (and current Linux kernel at the time), and per my notes, it was even mounted as `-o nodatacow`:
# rm -f /mnt/data/zero.*.fill
rm: cannot remove '/mnt/data/zero.1.fill': No space left on device
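The usual escape hatches on btrfs in that situation, for what it's worth, are to reclaim completely empty chunks (or temporarily add a small extra device) so the delete has somewhere to write metadata, e.g.:
btrfs filesystem usage /mnt/data
btrfs balance start -dusage=0 /mnt/data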
They sometimes don't. The article even acknowledges this:
> Copy commands and even deletions can fail
I've had that happen too many times, so I don't know why I would fill up my disk with a hacky spacer file, which surely can also fail to be deleted when the disk is already full.
Once upon a time, I wanted to cache large and expensive-to-pull files on many thousands of servers. Problem is, the disk space on these servers was at a premium and meant to be sold for customer use. The servers did have scratch space on small disks, but that was used by the OS.
So I wrote an on-disk cache system that would monitor disk usage, and start to evict and shrink its disk space usage. It would take up to N gigabytes of disk (configurable) for the purpose of caching, and maintain an M gigabytes free-disk-space buffer.
Say you had a 100 GiB total space on a partition, with 8 GiB used for cache with a 2 GiB headroom. As legitimate/regular (customer) space usage increased and reached 91 GiB, the cache would see 9 GiB available, and removing the 2 GiB buffer, would start to evict items to resize to 7 GiB, and so on until it had evicted everything.
When this system deployed, it started to trigger low-disk-space alerts earlier than before. At first that seemed like a problem, but the outcome is that we were now getting low-disk-space alerts with more advance warning, and the cache bought some time as it kept resizing down to free up space. It kind of, in a way, served the same purpose as described in this blog post.
Overall this cache was pretty neat and still is, I bet. There's probably ways to do similar things with fancy filesystems (or obscure features) but this was a quick thing to deploy across all servers without having to change any system setting or change the filesystem.
I sometimes wish I had done this as open source, because it would be convenient to use locally on my laptop, or on many servers.
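To give a flavour of the idea, here is a very rough shell sketch, with a made-up cache path and headroom and a naive evict-by-access-time policy:
CACHE=/var/cache/pullcache
MIN_FREE_KB=$((2 * 1024 * 1024))   # keep 2 GiB free, in the 1K units df reports
while [ "$(df -Pk "$CACHE" | awk 'NR==2 {print $4}')" -lt "$MIN_FREE_KB" ]; do
  oldest=$(ls -tu "$CACHE" | tail -n 1)   # least recently accessed entry
  [ -n "$oldest" ] || break
  rm -f "$CACHE/$oldest"
done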
Box Drive and OneDrive apps for desktop do this now, but with cached files from the "cloud".
It looks to you like everything is there, but in reality it downloads as you click on things, and empties cache if the free space begins to drop below a set amount.
hope you're not running -o compress=lz4 , because you are going to be in for a big surprise when you try to pull this emergency lever! you may be shocked to see you don't actually get much space back!
i do wonder how many FS would actually allocate the 8GB if you, for example, opened a file, seeked to 8GB mark, and wrote a character. many file systems support "sparse files"[1]. for example on btrfs, i can run 'dd if=/dev/zero of=example.sparse count=1 seek=2000000' to make a "1GB" file that has just one byte in it. btrfs will only allocate a very small amount in this case, some meta-data to record an "extent", and a page of data.
i was expecting this article to be about a rude-and-crude overprovisioning method[2], but couldn't guess how it was going to work. SSDs notably perform much much better when they have some empty space to make shuffling data around easier. leaving a couple GB for the drive to do whatever can be a colossal performance improvement, versus a full drive, where every operation has to scrounge around to find some free space. i wasn't sure how the author was going to make an empty file that could have this effect. but that's not what was going on here.
> hope you're not running -o compress=lz4 , because you are going to be in for a big surprise when you try to pull this emergency lever! you may be shocked to see you don't actually get much space back!
This is true. If you are replicating this, copy from /dev/urandom rather than filling the file with zeros (or leaving it sparse).
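For example, to create an 8 GB spacer that compression can't shrink away (the path is whatever suits you):
dd if=/dev/urandom of=/spacer.img bs=1M count=8192 status=progress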
Before reading this, I had presumed that sparse files did not overcommit drive space, but apparently, they do. I don't use them regularly and certainly not to "reserve disk space" but I was surprised that you can make sparse files way larger than available free space on the drive. I had assumed they were simply not initialized, but the FS still required <x> amount of free space in case a block is accessed.
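You can see the difference on most Linux filesystems with something like:
truncate -s 8G sparse.img      # sparse: 8G apparent size, (almost) no blocks allocated
fallocate -l 8G reserved.img   # asks the filesystem to actually allocate the space
du -h --apparent-size sparse.img reserved.img
du -h sparse.img reserved.img  # on-disk usage: near zero vs ~8G
So a spacer only works as a reserve if the blocks really are allocated (and, as noted above, not trivially compressible).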
Oh man, reminds me of a Game Dev war story I read years ago. This purportedly happened in those console days with very limited memory capabilities.
In some game studio, as a project neared its release, the team was still struggling with memory issues. No matter what they did, they were about 2MB over budget. The artists had reduced their polygon counts drastically, the programmers had checked every possible leak and optimized algorithms and buffers as best they could, but the 2MB overage just kept haunting them.
That's when the VP of Engineering stepped in. Calling the TL of the project into a closed-doors optimization code review, they had the source code on a large screen and the TL talked the VP through everything the team had done so far to stay within the memory budget.
As the TL finished the walkthrough, the VP opens some mother-of-all files and deletes a cryptic variable declaration to the effect of:
int toLiveBuffer[2000000];
The VP then explains that he hid this declaration in their codebase after a project that had to optimize drastically late into the development cycle. But first he wanted to make sure that the team had done their homework.
And poof. They emerge from the closed-doors meeting jubilant and victorious. The game is ready for prime time!
A fun problem on a Mac is that if you're using APFS for your filesystem, if it fills up, you can't delete any files. It's caught me out a handful of times, and each time, the only way to recover is to reboot, and thankfully I've had more free disk space after a reboot.
I'm not going to try to understand the logic as to why APFS requires free space in order to delete files (via any method, including dd)
Probably because it's a log-structured file system, and those _really_ don't like running low on free space.
They work by appending to the log then compacting sometime later, not modifying things in-place. As such, you always need a reasonable supply of free blocks so this can occur.
In theory, this is a good idea, but it doesn't protect you in all cases. I have had instances on a few of my application servers where an event dumped GBs' worth of log data to the log files in a matter of a couple of minutes and filled up the drive (thanks, fast SSDs!). If I had employed the strategy in the article, it would have only bought me a couple more minutes' worth of time, if that!
The filesystems where headroom matters are var, tmp and sometimes root. I like this strategy with logfiles because nethack.log.gz.30 was approximately as important as empty space.
Keeping another 8gb on root and tmp seems extreme.
> On Linux servers it can be incredibly difficult for any process to succeed if the disk is full. Copy commands and even deletions can fail or take forever as memory tries to swap to a full disk
I don't understand this. Swap is either a swap partition, or a specific swap file, all of which allocated in advance, so the fullness of the storage should have no bearing.
I keep my databases on a separate filesystem from root, var, or anything system critical for this reason. Even with the 8GB space waster, if you aren't on top of your disk usage you'd have down time when you fill up the filesystem containing the DB. I might be missing something here, but this does not seem like a good solution to this problem.
I have an empty leader on my hard drive so that I can recover if I accidentally nuke the front of it with dd while making a live USB. So it's not a bad idea, and it's been super effective in that it hasn't been tested so far, and hopefully it never will be.
> The disk filled up, and that's one thing you don't want on a Linux server—or a Mac for that matter. When the disk is full nothing good happens.
I found a bug with time machine where it wouldn't delete local copies properly and filled my hard drive until I couldn't do anything. The OS slowly stopped working. At first I couldn't copy or save anything, then deleting files made more files. It was so bad that the `rm` command eventually wouldn't work from recovery or the local OS. I could do nothing. I had to format.
It happened again and I learned to manually delete the time machine local snapshots, but it was crazy how hard it was to recover once it took all my storage. That bug is fixed now.
With virtual servers this should not be necessary, as it is easy enough to add some disk space. After all, this should not be a common issue in production environments, but more like a once in a decade problem.
With physical servers it might be a different story and might be a good idea. I tend to size filesystems to the requirements I have and enlarge them when required (it gives you a periodic reminder to think about what waste you have accumulated). That way, I can still add space even if the filesystem has been filled up. However, if you do it just to have some space when you need it, it probably is overkill and to have an empty buffer file is a lot easier to handle.
So far, my first stop to temporarily get more disk space was to reduce the size of the swapfile which on a lot of servers seems to be allotted >1x the requirement.
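If anyone wants to replicate that, shrinking a swap file is basically recreating it smaller (this assumes a plain /swapfile and enough free RAM to absorb what is currently swapped out):
swapoff /swapfile
rm /swapfile
dd if=/dev/zero of=/swapfile bs=1M count=4096
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile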
Will be switching to this hack! Perfect illustration of the KISS principle (Keep it simple, stupid).
This reminds me of a similar story in a classic Gamasutra article[1] (the section is "The Programming Antihero", and I'd recommend the other pages of the article for a few good chuckles). Apocryphal or not, it makes for a good story.
> I can see how sometimes, when you're up against the wall, having a bit of memory tucked away for a rainy day can really make a difference. Funny how time and experience changes everything.
As hacks go, it's a good one. I also like it because you don't have to be root to implement it and you don't have to reconfigure your file system params in ways that might or might not be great for other reasons.
This reminded me of an embedded Java project that I worked on 20 years ago. The VM had only 10MB of RAM and properly dealing with out-of-memory exceptions was a must. The most effective strategy was to preallocate something like a 200K array. Then on any memory exception the code released that array and set a global flag. The flag was queried throughout the code to aggressively minimize memory usage until it dropped back to a tolerable level.
The preallocated buffer was essential. Without it typical result was recursive out-of-memory that eventually deadlocked/crashed the VM with no recovery.
I have a dual-boot laptop with windows and linux, and use the ntfs partition to share data between them
Recently, I extracted a large archive from Linux onto the NTFS partition, and the partition filled up.
Then Windows did not start anymore
Linux would only mount the partition as read-only, because it was marked dirty after the failed start. Finally I found a tool to reset the mark, and delete the files.
Now Windows starts again, but my user account is broken. It always says "Your Start menu isn't working. We'll try to fix it the next time you sign in.", then I sign out, and it is still broken
This sounds like you should, instead, use the "Filespace Reserved for Root" functionality of your filesystem, which exists specifically for this contingency. The default for ext3 is 5%.
Just because HN likes to bash Windows: I'll tell you that Windows runs pretty much normally if the disk is full. I've had that happen many times and intentionally did this for tests as well.
Even disconnecting the disk technically doesn't break the OS. Because of the "Windows To Go" feature, the OS can detect this and pause.
(Note: Windows To Go is officially removed from current versions but the code that freezes is still there. However, whether that works with your hardware is basically a gamble... so yeah, don't try this at home/work.)
Dumb idea. Read the man page for tunefs. The file system has something called minfree which does the same thing. However, this does not interfere with wear leveling. Dummy data does.
Not commenting on whether OP's idea is sound or not, however tunefs implies the now less and less used ext4 (many distros are switching to XFS or btrfs :-/). On another note, that limit applies to non-privileged processes only. Some crap running as root will just fill up the disk too.
Tomato potato. If you use LVM or anything like it to reserve space then in your failure situation you have to extend the lv, partition, and fs before the space becomes available. More work than just rm'ing the file.
I think the ideal is tuning the reserved blocks in your filesystem.
> On Linux servers it can be incredibly difficult for any process to succeed if the disk is full.
You won't feel too clever if you come to grow your LVM volume into the free space and it won't work because there's no free space on the filesystem! :)
(I don't actually know if this would fail or not - but the point is "rm spacer.img" is pretty much guaranteed not to fail).
I've used LVM for this purpose plenty of times. lvm2 at least has not prevented me from extending a full disk. lvm + reserved blocks + a small spacer file are all decent options, even better when used together.
Good for you. That's not how my memory works. If I don't use a command regularly, I don't trust myself to remember it correctly. Even if I did though, that's a multi-step process, compared to the single command needed to remove a file.
Would be better to leave 8GB unpartitioned and then expand the partition. An 8GB file on an SSD is removing 8GB worth of blocks from being able to participate in wear leveling.
Reminds me of the cron task I set up once, long time ago, on a bare metal server. It would kill and relaunch a web service every 4 hours.
The service in question didn't require high availability (it was a mailing list processing/interface thing, if I remember correctly) but it had some memory leak which would eventually devour all the memory in the server, in about 2 days.
This hack served its purpose well, until the service was eventually replaced by something else.
What I don't understand about this approach is why you think it actually does anything for you. What you do instead of this is to set up an alert to monitor disk space at the right threshold for you, and then have a contingency plan for how to add more space to your environment.
It seems like you have sort of done that, but in this case you are actually allowing your system to get into a bad state before you react.
Perhaps it’s better to be proactive instead of reactive.
This is true, but alerts and monitoring software can fail in many ways, due to network issues, process crashes... Etc.
Due to the simplicity of this solution, there's not much that can go wrong.... unless you can't actually access the server anymore to delete the file...
If the monitoring fails to alert you, you don't get the alert that you're running out of disk space and you can't access the server to delete the spacer file. All you've done in that case is lower your disk space by 8gb and make your server fill up more quickly.
OK, but any human-written text file can serve as a ransomware canary because it is straightforward for code to distinguish between plaintext written by a human and encrypted text.
This reminded me of that joke about two guys who meet in the middle of the Savanna. One is carrying a phone booth and the other one an anvil. So, the one with the anvil asks:
- Why do you carry a phone booth around?
- Oh, you see, it's for the lions. If I see a lion, I drop the booth, step inside and I'm safe. What's with the anvil?
- It's for the lions too. If I see a lion, I drop the anvil and I can run way faster!
I've been doing this for years too. I also learned that instead of one big file, having several smaller files is useful so you can release space in "chunks". If you need everything at once you can wildcard the delete.
Such flexibility has been invaluable over the years. Thankfully with block storage and modern operating systems/file systems growing volumes can be significantly easier for most servers.
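Something along these lines, with sizes and paths being whatever suits you:
for i in 1 2 3 4; do dd if=/dev/urandom of=/spacer-$i.img bs=1M count=2048; done
rm /spacer-1.img     # release one 2 GB chunk
rm /spacer-*.img     # or everything at once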
I maintain a small fleet of CI machines (mostly Macs) and run into this issue as well from time to time. The free space idea is nice, but I ran into the problem that under very critical disk space I can't even ssh in or delete a file, because there is simply not enough space to execute the simple command. A reboot to get rid of some temp files helps me in these situations to get some control back.
On the subject of "inverted" thinking like this, I recently added a test to a test suite that is intended to fail some day. The test will eventually fail when a bug (for which we developed a workaround and added the aforementioned test to confirm the fix) is fixed in one of our open source dependencies. When the test fails, we'll know to remove the workaround (and the test)!
In the days of minicomputers, Data General's first 16-bit operating system, RDOS, required that the main file be "contiguous". Not only that, there was some model of disk they sold where the OS file had to be close to the edge for speed in loading. Prudent sysadmins would create empty contiguous files in the favored space against the next upgrade.
This is an old technique. For example, some game developers back in the early days used to put dummy files in the game data space, and code the entire game with less space so that if later more space was needed, it was just a matter of deleting the dummy files. In that context, it kinda forces you to be smarter about your game assets and code.
Proof that the future is here, but just unevenly distributed -- we have technology for dynamic disk expansion, but implementation & integration just isn't present/slick enough to make it available to even tech-inclined hosting consumers just yet.
Guess this is another place of differentiation that some of these platforms could offer.
Would that work? The fs may actually allocate a new file before deleting the existing allocation, so the risk of it not working is still there, I would think.
It might vary by kernel, filesystem, or shell, but in my experience and confirmed with a quick test: shell redirection does not create a new file/inode.
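A quick way to convince yourself, using a throwaway file:
stat -c %i /spacer.img   # note the inode number
: > /spacer.img          # truncate in place
stat -c %i /spacer.img   # same inode, size now 0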
The full-disk problem on Linux machines has been partially solved for many decades: we have had separate partitions for /home, /tmp, /var, and /usr, each on its own partition. This reduces the problem, if not removing it completely. There is a small disadvantage: you lose some fungibility of disk space.
This is not the right solution. It's like setting your clock 5 minutes ahead, to trick yourself into thinking it's 9:00 am, when it's really 8:55 am. It doesn't work.
The better solution is simple monitoring. Alert when limit is passed. Increase limit to 16gb disk space remaining if paranoid.
We write a data intensive desktop app, and when you are close to disk full, we reduce functionality so you can’t make the problem worse, or lose work because of the disk full situation. The thing is that we know that more than half of that user’s data is ours, so our data is often the cause.
Showing off that I'm not a sysadmin, but wouldn't a monitoring daemon work? Once disk usage grows past a certain uncomfortable threshold you get an email/notification to see what's up. I mean, you obviously are monitoring other server vitals anyway, right?
For the cases mentioned below where space fills up quickly due to a bug, maybe yes. Outside that, there's the problem that you can ignore the emails (or be sick, asleep, etc). Worse if they go to a team and everyone is busy and assumes someone else will deal with it. Bad if you aren't in charge and tell people in charge and they nod and don't decide anything - they prefer to run at 70/80/90/95% used indefinitely instead of signing a cheque.
When the drive fills and everything breaks, you /have/ to respond, and it becomes socially OK to make it your highest priority and drop everything. An email saying "only a few weeks until maybe it runs out of space" is much harder to prioritise and get anyone to care about. With this system, the time when it fills and breaks has some flex, so you don't go down with the server and can save your own bacon. It's as much a fix for the organization as anything else.
I see this most in smaller company aging systems where they had ample storage and drives for a few years ago when they were new, now they're crammed full with the growth of software, libraries, updates, data, new services being deployed on them, increased demands, and nobody wants to commit to more storage for an older system towards the end of its warranties, but they definitely don't want to commit to the much larger cost of a replacement and all the project work, and running at 90% full costs nothing and involves no decisions. 91%. 92%.
Monitoring is a good idea, regardless. However, there are cases where a bug or some other issue can cause disk usage to ramp too quickly for someone to respond to an alert.
Even a swap file shouldn't matter, since it's still not sparse. The one exception is if you're on a system that dynamically adds and removes swap files - I believe darwin does that, and I think it might be possible to do on Linux(?) but I've not actually seen it done.
Not quite the same situation as described in the article, but it is still possible for the kernel to swap memory in and out of disk even without a swap file/partition. Memory used for storing executable binaries is allowed to be moved out of memory, as a copy of it lives on disk. This means you can still encounter memory thrashing (and thus system unresponsiveness) under low memory situations.
> Memory used for storing executable binaries is allowed to be moved out of memory, as a copy of it lives on disk.
On Linux, this is not necessarily the case, as you can change the file on disk while the executable is running. I don't know if Linux just keeps executable code in memory all the time, or if it is smart enough to detect whether a copy of executable pages still lives on disk.
I am reminded of a tweet that suggested adding a sleep() call to your application that makes some part of it needlessly slow, so that you can give users a reason to upgrade when there's a security fix (it's 1 second faster now)!
In most VMware clusters that use resource pools extensively I've always maintained a small emergency CPU reservation on a pool that would never use it, just in case I had to free up some compute without warning.
Really good idea. After looking at the linked article about dd, I guess this wouldn't work as well if one was using a file system with compression. In that case maybe /dev/urandom would be better?
I used this technique on dev and IST servers precisely 11 years back. Getting storage would be a day's task which would stall current activity, so this helped. 1.5GB across 5 files.
mkfs has an option to reserve a %age or # of blocks/inodes for root of a file system. It's the file system equivalent of empty files.
Usually when free space is exhausted, it's for non-root users. You get that same "time to fix stuff by deleting the file" by using tunefs to change that root reserved space to zero.
Plus have /var/log on a separate file system and make sure that your log rotations are based on size as well as time.
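For example, a logrotate stanza that rotates daily but also whenever a log passes 100MB might look like this (the path is just an example):
/var/log/myapp/*.log {
    daily
    maxsize 100M
    rotate 14
    compress
    missingok
    notifempty
}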
So that if by the time you get the email the issue is at 97%, you can immediately give yourself enough breathing room to figure things out without downtime or significantly degraded performance.
This is like carrying around a pound of beef because you refuse to look up the address of a McDonald's 7 minutes away.
Set up quotas or implement some damn monitoring -- if you're not monitoring something as simple and critical as disk usage, what else are you not monitoring?
Monitoring doesn't prevent random things from spiking, and something like this makes it easier to recover.
Quotas are tricky to set up when things are sharing disk space, and that could easily give you a false positive where a service unnecessarily runs out of space.
Not all environments require a stringent SLA. I have some servers that don't have a stringent SLA and aren't worth being woken up at night over if their disk is filling up fast.
It allows me to remove that big file; then I'm able to run sudo (I don't allow root ssh, and sudo won't work with a full disk), then I can clear up space on the system, bring it up again, and update logrotate or do whatever to prevent that case from happening again.
That sounds a lot more complicated (and time consuming) than just having monitoring in place, realizing the disk is filling up and fixing it before it leads to downtime.
Monitoring is in place and usually it is caught in time. Downtime is acceptable in this environment, I don't think its worth being woken up in the middle of the night when it can just be resolved in the morning.
On the chat team at Twitch in the early days after the Twitch Plays Pokemon crisis [1], we started artificially doubling all chat traffic through our systems, then dropping the doubles just before they would be sent to users. [2]
Not only did it give us a "big red button" to press during emergencies like OP, but it revealed important logical scaling issues before they became real problems.
[1]: tldr; 1 million people playing a single game of Pokemon Red by using chat to send button presses
Please don't take HN on generic ideological tangents. They lead to threads that are repetitive and therefore uninteresting, and often turn nasty as well.
This is the opposite of private property ownership. Property ownership comes at a cost. That cost increases as supply dries up.
In this case there is no cost to the individual, only to the group.
Sort of. Typically there is a cost associated with getting rights over a property, either by manufacturing or, in the case of land, by purchase or through the efforts of settling. However, once you have ownership of a property, it's usually relatively cheap to continue owning it (except, I suppose, if you consider the risk of a communist revolution or something).
This is essentially how the British monarchy earns their generous sums of money. They stole land from the Brits during the Norman conquest about 1000 years ago, and now they rent it back to them for a handsome profit (though the whole thing is rather complicated now and they only get a portion of the money).
I'm a Brit who was discouraged from the process of becoming an American by this alone (when much younger and on a terrific upwards trajectory which crazy events eventually skewered), and I would be hurt particularly by property tax now, and really from the moment my income ceased growing.
I can't remember if Florida has no property tax, which might then explain why so many of my family retired there, but everywhere I wanted to live did have it.
If that's all that's keeping you back, you should check if the city/state you wish to live in has what's usually called a homestead exemption. Many places will reduce property tax if you meet a few criteria. The criteria varies, but it usually revolves around being on a fixed income and living on the property.
Florida has property tax, but they don't have an estate tax, which probably influences people to retire there...
For information: your property tax (council tax) for the most expensive house in Central London can be less than that of an average 4-bed house in a mid-ranked US state.
During the 2-month lockdown a year ago, I would purchase 4 frozen pizzas at a time when I had a chance to buy them, because I was so upset that I could not buy one when I wanted a single one during the 2 previous weeks, because of hoarders who had been faster than me.
People think of toilet paper, but it is not just that. Pasta, rice, flour, yeast, plenty of useful things went missing due to hoarders during lockdown. Thankfully, I don't eat meat, because there was no meat at the supermarkets. Fish was also hard to find in the frozen shelves of the supermarkets. I am not a big fish eater either, so it was fine for me as well, but you get the idea: if you don't hoard a tiny bit, you get increasingly frustrated because hoarders will hoard, and you will have to wait for weeks before you get the chance to have what you want.
The meat situation was kind of interesting actually. At the beginning of the lockdown I remember going shopping for lots of shelf-stable foods. Very perishable stuff like meat or fresh veggies were out or hard to find (and yes freezing meat works fine I know). However, lots of stores had tons of shelf stable boxed and canned goods and for meat jerky and other dried meats which are simple to boil and turn into simple soups in case of emergencies -- and yet nobody was really buying them at that time.
I think I "prepped" for the worst by buying a 10 lb bag of flour, 40 lb of rice, 5 lbs of oatmeal, a sack of potatoes and a few bags of beef jerky and trail mix. Even if the worst didn't come to pass I figured we'd eventually eat it all anyways and it wouldn't really be hoarded, but in pinch we could ration it and it would last a few months and give us nutritious meals. It's basically what ships crews used to survive on during the age of sail as they spent months at sea. Not a ton of variety but it will keep you alive.
I bought a bunch of rice, and all that happened was little rice bugs started living in it, so I had to throw it all away. But if there were a rice shortage, I would probably have eaten it.
If you freeze it for 24 hours it will kill them off. Can be a good idea to do that when it comes into the house anyway as they might be in there already.
Easy enough to separate them out of the rice after freezing too.
For the same reason (kill bugs), the same thing is recommended when you buy flour. Put it in the freezer for 24+ hours, then take it out and store at room temperature.
I hear this derogatory term "hoarder" being thrown around a lot. When the pandemic first hit I bought a ton of food. I still ate and bought the same amount over time.. Nothing went to waste. I just wanted to make FEWER but BIGGER trips to the supermarket so I had reduced chance of getting infected.
In the same sentence you complain about "hoarders" being faster than you whilst justifying (reasonably well imho) why you yourself bought more.
People did what they thought they needed to and will always do so. The world was always still a bit of a jungle out there (any nation is three meals away from anarchy). I'm sure those with immune compromised or elderly family members were more excessive on "hoarding". We need to get out of this blame mentality and realize our system needs improving so it can respond to these challenges.
I know how to cook, but when you suddenly have to cook 2x as often because you can't eat out anymore, you are going to cook a lot more of everything, including pasta. Suddenly the supply chain was out of whack - it was built for people that eat out a lot.
Out of curiosity: did people really experience empty grocery shelves or are phrases such as "rapidly disappeared" meant more as hyperbole?
I live in a Chicago suburb, and while variety did decrease (I still can't get Coke Zero Cherry) the basics were always in stock throughout the pandemic time.
Why? Because the channel mix changed, and distribution and packaging are channel-specific.
For example, toilet paper is shipped to commercial customers in cardboard boxes, while retail customers buy it in plastic-wrapped, branded, and SKU'd blocks of 4, 8, 12, etc. When everyone suddenly stopped going to offices and restaurants, demand plummeted in the commercial channel and soared in the retail channel. It took time for factories and distributors to adjust to that. The same thing happened to a bunch of food staples too.
Since a lot of manufacturing is regional, different areas of the nation and world experienced different impacts.
Toilet paper is interesting because it takes up lots of space and can't be "compressed", so most stores don't stock much of it. That's why it was the first thing to disappear.