This is one of the books I recommend to coworkers who are interested in operating systems - it teaches a surprising amount by telling you what an OS will do for you, and therefore why you need it, instead of telling you how it works inside.
It also remains very pleasant to read in spite of its very large size (I read the whole book cover to cover). Obviously you can also read the classics (the Minix book, Tanenbaum, Bach, and probably more modern references), but this one somehow gives the operating system a purpose, which I find absent in the others I've read.
If you don't mind elaborating, what made you switch stance? Many people never change their minds even when faced with overwhelming evidence, and given your prior level of support, I'm quite curious about the actual process.
Hard to tell in retrospect. I think the thick layer of distrust against Palestinians (built up by debunked lie after debunked lie from Hamas etc. over the years) was finally breached by the sheer asymmetry of power that Israeli forces have gained over Palestinian civilians.
Just forget that the two parties are Jews and Arabs and instead make them Swahili and Kazakh; then put one group in as "agency-less" a position as the Palestinians are in, and give the other group the leverage in power that the Israelis have, plus the grievances. Even if you can understand those grievances - there is just no way these things aren't going to happen.
Plus: Gaza has reached a level of destruction that is just... well, basically as if they had nuked the place (like I initially favoured). At some point the humane thing would be to call it a win and leave. And that point has probably passed a long time ago.
Plus, I have read about the background of some of Netanyahu's cabinet members, and they essentially tick all the boxes of what I find problematic about the aforementioned power asymmetry:
Prior aggressive behaviour against Palestinian civilians in the settlement areas, with the victims having no proper way of legal recourse - like ganging up on random Arabs there and beating them up. I know there is backlash for this from within Israeli society, but man, things are bad if a literal street thug gets a place in the cabinet because he behaved that way.
'In other words: AI is making it possible to detect severe security vulnerabilities at highly accelerated speeds.'
Isn't it rather: we now have a new family of security-flaw detectors, which find other issues on top of the ones already found by conventional methods (human review or regular static analyzers)?
If they supersede all the existing ones, then it's quite major, and quite a few vendors will disappear…
I once had a bitflip pattern causing lowercase ASCII to turn into uppercase ASCII in a case-insensitive system. Everything was fine until it tried to uppercase numbers, and then things went wrong.
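A minimal sketch of that failure mode (my reconstruction, not the original system): ASCII upper- and lowercase letters differ only in bit 5 (0x20), so a bit 5 stuck at zero looks exactly like uppercasing - until it hits bytes that aren't letters:

```python
# ASCII 'a'..'z' are 0x61..0x7A and 'A'..'Z' are 0x41..0x5A: the only
# difference is bit 5 (0x20). A bit 5 stuck at zero therefore mimics
# uppercasing for letters but silently corrupts everything else.
def clear_bit5(data: bytes) -> bytes:
    return bytes(b & ~0x20 for b in data)

print(clear_bit5(b"hello"))  # b'HELLO'  - looks like working uppercasing
print(clear_bit5(b"HELLO"))  # b'HELLO'  - uppercase input is unaffected
print(clear_bit5(b"42"))     # b'\x14\x12' - digits become control bytes
```

Digits 0x30..0x39 land on control bytes 0x10..0x19, which is exactly the "fine until it touched numbers" behaviour.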
The first time I had to deal with faulty RAM (more than 20 years ago), the bug would never trigger unless I used pretty much the whole DIMM and filled it with meaningful data - in my case, linking large executables or untarring large source archives.
I've had plenty of servers with faulty ECC DIMMs that didn't trigger anything, and would only show faults under actual memory testing. I had a hard time convincing some of our admins the first time ('no ECC faults, you can't be right'), but I won the bet.
Edit: there's a very old paper by Google on these topics. My issues were probably 6-7 years ago.
That shouldn't make sense. It's not like the ECC info is stored in additional bits separate from the data; it's built in with the data, so you can't "ignore" it. Hmm, off to read the paper.
The ECC information is stored in separate DRAM devices on the DIMM, which is responsible for some of the increased cost of ECC DIMMs at a given size. The extra memory for ECC is typically not included in the marketed size, so a 32 GB DIMM with and without ECC will have differing numbers of total DRAM devices.
I think you responded to the wrong person, unless you think I was implying that the extra bits needed for ECC don't need extra space at all? I wasn't suggesting that - just that they aren't like a checksum that is stored elsewhere or something that can be ignored: the whole 72 bits are needed to decode the 64 bits of data, and the 64 data bits cannot be read independently.
If we're talking about standard server RDIMMs with ECC (or the prosumer stuff), the CPU-visible ECC (excluding DDR5's on-die ECC) is typically implemented as a sideband value you could ignore if you disabled the correction logic.
I suppose what winds up where is up to the memory controller, but for DDR5, in each beat of a BL16 transaction you're usually getting 32 bits of data and 8 bits of ECC (per sub-channel). Those ECC bits are usually called check bits, CB[7:0], and they accompany the data bits DQ[31:0].
If you're talking about transactions for LPDDR, things are a bit different there, as the ECC has to be transmitted in-band with your data.
We are talking about errors happening in user-space applications with ECC operating normally, and what the application ultimately sees.
My point is that when writing an app you wouldn't be able to "not use" ECC accidentally or easily if it's there - it's just seamless. I'm not talking about special test modes or accessing stuff differently on purpose.
Interesting that DDR5 is different from DDR4: 8 bits per 32 is double the 8 per 64, so the extra overhead must have been warranted.
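To make the 64+8 shape from this thread concrete, here is a toy (72,64) SECDED code - a textbook extended Hamming construction, not the code any particular memory controller actually uses. It shows why single-bit errors are correctable, double-bit errors are detectable, and why the 72 bits form one codeword:

```python
# Toy (72,64) SECDED: 64 data bits + 7 Hamming check bits + 1 overall
# parity bit. Illustrative only; real controllers use other layouts.
PARITY_POSITIONS = (1, 2, 4, 8, 16, 32, 64)

def encode(data: int) -> list:
    """Spread 64 data bits over positions 1..71 (skipping powers of
    two), fill check bits, and put overall parity at position 0."""
    bits = [0] * 72
    d = 0
    for pos in range(1, 72):
        if pos & (pos - 1):                  # not a power of two
            bits[pos] = (data >> d) & 1
            d += 1
    for p in PARITY_POSITIONS:               # Hamming check bits
        bits[p] = sum(bits[i] for i in range(1, 72) if i & p) & 1
    bits[0] = sum(bits) & 1                  # overall parity (the "D" in SECDED)
    return bits

def decode(bits: list):
    """Return (data, status): corrects 1-bit errors, flags 2-bit ones."""
    # each failing check contributes its position; the sum is the
    # index of a single flipped bit (powers of two are disjoint)
    syndrome = sum(p for p in PARITY_POSITIONS
                   if sum(bits[i] for i in range(1, 72) if i & p) & 1)
    overall = sum(bits) & 1
    if syndrome and overall:                 # single-bit error: fix it
        bits[syndrome] ^= 1
        status = "corrected"
    elif syndrome:                           # parity even: two flips
        status = "uncorrectable 2-bit error"
    elif overall:                            # only the parity bit flipped
        status = "corrected"
    else:
        status = "ok"
    data, d = 0, 0
    for pos in range(1, 72):
        if pos & (pos - 1):
            data |= bits[pos] << d
            d += 1
    return data, status
```

A controller could also just return the data positions without running `decode` at all, which is the "sideband you could ignore if you disabled the correction logic" view from above.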
I'm sorry, but I, just like your admins, don't believe this. It's theoretically possible to have "undetectable" errors, but it's very unlikely, and you'd then see a much higher incidence of detected unrecoverable errors and of repaired errors than this. I just don't buy the argument of "invisible errors".
EDIT: I took a look at the paper you linked, and it basically says the same thing I did. The probability of these cases becomes vanishingly small, and while ECC would indeed not reduce it to _zero_, it would greatly, greatly reduce it.
Ok, I am sure there is _some_ amount of unrepairable errors.
But the initial discussion was about whether ECC RAM makes the problem go away, and your point was that it doesn't. The vast, vast majority of errors, according to my understanding and to the paper you pointed to, are repairable: about 1 out of 400 or so errors is non-repairable. That's a huge improvement! With ECC RAM, the failures Firefox sees here would drop from 10% to 0.025%. That is highly significant!
Even more: you would now be informed of 2-bit errors! You would _know_ what is wrong.
You could have 3(!)-bit errors, and those you might not see, but they'd be several orders of magnitude rarer still.
So yes, it would not 100% go away, but 99.9% would. That's... making it go away, in my book.
And last but not least, this paper mentions uncorrectable errors. It says nothing of undetectable ECC errors - and you said _undetectable_ errors. I'm sure they happen, but I would be surprised if they had any meaningful incidence, even at terabytes of data. It's probably on the order of 0.000625 of errors you can get (but if you want I can do more solid math).
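The 10% → 0.025% arithmetic, spelled out as a quick sanity check (the 1-in-400 figure is the one quoted in this thread, not something re-derived from the paper):

```python
# If roughly 1 in 400 memory errors survives ECC as uncorrectable,
# a 10% observed failure rate shrinks by that same factor.
observed_failure_rate = 0.10      # failures seen without ECC
uncorrectable_fraction = 1 / 400  # quoted non-repairable share

with_ecc = observed_failure_rate * uncorrectable_fraction
print(f"{with_ecc:.4%}")  # 0.0250%
```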
I think we diverge on ‘making it go away in my book’.
When you're the one having to debug all these bizarre things (there were real money numbers involved, so these things mattered) over millions of jobs every day, rare low-probability events don't disappear - they just happen, and take time to diagnose and fix.
So in my book, ECC improves the situation, but I still had to deal with bad DIMMs, and ECC wasn't enough. We used not to see these issues because we already had too many software bugs, but as we became increasingly reliable, hardware issues slowly became a problem - just like compiler bugs or other parts of the chain usually considered reliable.
I fully agree that there are lots of other cases where this doesn't matter and ECC is good enough.
Oh, I get this point. If you have a sufficiently large amount of data, and you monitor the errors, and your software gets better and better, even low-probability cases will happen and will stand out.
But this is sort of the march of nines.
My knee-jerk reaction to blaming ECC is "naaah", mostly because it's such a convenient scapegoat. It happens, I'm sure, but it would not be the first explanation I reach for. I once heard someone blame a bug on "cosmic rays" - a bug that happened multiple times. You can imagine how irked I was at the dang cosmic rays hitting the same data with such consistency!
Anyways, I'm sorry if my tone sounded abrasive; I, too, have appreciated the discussion.
No, you were not abrasive at all - I've learned to assume good faith in forum conversations.
In retrospect, I should have started by giving the context ('march of nines' is a good description), which would have made everything a lot clearer for everyone.
You're thinking in terms of independent errors. I would think that assumption often doesn't hold, so 3 errors right next to each other are comparatively likely to happen (far more likely than 3 independent errors). This would explain such 'strange' occurrences with ECC memory.
Yes, there are scheduling issues, NUMA problems, etc. caused by the cluster-in-a-box form factor.
We had a massive performance issue a few years ago that we fixed by mapping our processes to the NUMA zone topology. The default design of our software would otherwise effectively route all memory accesses to the same NUMA zone, and performance went down the drain.
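A hypothetical sketch of that kind of fix (the topology and names below are made up for illustration; on Linux the real node-to-CPU mapping can be read from /sys/devices/system/node/node*/cpulist): pin each worker process to the cores of one NUMA node, so that first-touch allocation keeps its memory node-local.

```python
import os

# Assumed topology for illustration: 2 NUMA nodes, 4 cores each.
NODE_CPUS = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}

def node_for(worker_id: int) -> int:
    """Spread workers round-robin across NUMA nodes."""
    return worker_id % len(NODE_CPUS)

def pin_worker(worker_id: int) -> None:
    """Restrict the calling process to its node's cores (Linux-only).
    With the kernel's default first-touch policy, memory the worker
    allocates and touches afterwards stays on the same node."""
    os.sched_setaffinity(0, NODE_CPUS[node_for(worker_id)])
```

Under that scheme, worker 0 runs on node 0's cores, worker 1 on node 1's, and so on; the shell equivalent would be numactl-style CPU and memory binding.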
Modern AMD processors are basically a bunch of smaller processors (chiplets) glued together with an interconnect, so yes, single-chip nodes can have many NUMA zones.
Wrong level of abstraction. NUMA is an additional layer. If the program (script, whatever) was written with a monolithic CPU in mind, then the big-picture logic won't account for the new details. The kernel can't magically add information it doesn't have (although it does try its best).
Given current trends, I think we're eventually going to be forced to adopt new programming paradigms. At some point it will probably make sense to treat on-die HBM distinctly from local RAM, and that's on top of the increasing number of NUMA nodes.
The kernel tries to guess as well as it can, though - many years ago I hit a fun bug in the kernel scheduler that was triggered by NUMA process migration, i.e. the kernel moving processes to the core closest to their RAM. In some cases the migrated processes never got scheduled again and were stuck forever.
Disabling NUMA migration removed the problem. I figured out the issue thanks to the excellent 'A Decade of Wasted Cores' paper, which essentially said that on 'big' machines like ours funky things could happen scheduling-wise, so I started looking at scheduling settings.
The main NUMA-pinning performance issue I was describing was different, though, and as you said, it came from us needing to change the way the code was written to account for the distance to the RAM sticks. Modern servers will usually let you choose anywhere from fully managed (hope and pray, single zone) to many zones, and then, depending on what you've chosen to expose, use it in your code. As always, benchmark, benchmark.
Guessing this is especially hard to automate with peripherals involved. I once had a workload slow down severely because it was running on the NUMA node that didn't share memory with the NIC.
Isn't high-grade SSD storage pretty much a memory layer as well these days, given that the difference is no longer several orders of magnitude in access time and throughput but only one or two (compared to the last layer of memory)?
Optane was supposed to fill the gap but Intel never found a market for this.
Flash is still extremely slow compared to RAM, including modern flash - especially in a world where RAM is already very slow and your CPU already keeps waiting for it.
That being said, you should consider RAM/flash/spinning disk to all be part of one storage hierarchy with different constants and tradeoffs (volatile or not, big or small, fast or slow, etc.), and knowing these tradeoffs will help you design simpler and better systems.
Often the Linux scheduling improvements come a year or two after the chip. Also, Linux makes moment-by-moment scheduling and allocation decisions that are unaware of the big picture of workload requirements.
My kids make fun of me because I know the shopkeepers around me by first name, along with the details of their businesses, and because shopping takes forever since I talk to everyone, customers included.
I just love it; it's easy, and I get a lot in return - from perks to incredible encounters. At work it's been very helpful.
I developed that skill while traveling alone for a year, and it boils down to practice and to reading whether the person you're talking to is OK with the conversation or not.
And now, because I know them, I go there to buy my stuff but also to spend five minutes chatting, and that makes grocery shopping a real joy. And because I go there and chat, they do nice things - give me a couple of tomatoes, or "you've got to try this cake", or the wine shop where I automatically get a 15% discount, or the butcher who lets me in when they're already closed because they know I've come over specially.
And some of those people have become real friends, like go and have dinner together friends. We have very different lives but we get on because we get on. I think everyone benefits from reaching out of their bubble a bit.
If I’m feeling a bit glum I’ll go out to buy bread or something because I know just seeing the people I see regularly will lift me up.
It's interesting, because while having that skill is helpful, I think part of the issue a lot of people have is an overtuned sense for it - they worry they're being judged for wasting their counterpart's time.
It's good to have, but don't let not having it (yet) stop you!
Cold approaches worked better before social media and smartphones. Now your awkward encounters can live forever online and cause humiliation for years to come, or some stranger looking for clout may step in. This has become so common now because everyone wants to be a hero.