Windows 7 patch for Meltdown enabled arbitrary reads and writes in kernel memory (frizk.net)
545 points by romac on Mar 27, 2018 | 215 comments

This is what happens when devs are presented with a very complicated problem, an extremely short deadline and an enormous amount of pressure.

>extremely short deadline

they had 6 months.

> they had 6 months.

You say that like 6 months is automatically a lot of time. "She had 6 months to give birth". Yeah, only it takes 9, so 6 is short.

Consider the scope and depth of the issue and the fact that they probably couldn't involve too many people on this effort.

Did she try having the pregnancy in parallel?

Doesn't sound like she was trying at all.

Twins - double the bandwidth but the latency stays the same.

0.22 vs 0.11 bpm is actually a big improvement despite the latency.

"Never underestimate the bandwidth of a station wagon full of babies hurtling down the highway"?

Truly a quote for ages

It isn't if you need the baby in 6 months.

just get one from any outsourcing firm then


She should have used Rust for a fearless pregnancy.

Other operating system maintainers had only days/weeks in Jan 2018.

We're all throwing darts in the dark here with regards to the resource they gave the problem and its difficulty. I think the real takeaway is just that it could be something other than just stupidity through and through.

Unless you can be sure their response solved their problem without introducing others, it's not evidence of sufficient time.

So I agree with the principle that a lot of time is a lot of time.

Inversely, though, I'd argue that Meltdown is a relatively small problem! It's strictly around memory usage, cache, and calling patterns. There's not a lot of systems at play, though there's the hard "figure out which order of instructions gets the state machine in a dangerous state" problem. There's a lot less coordination involved than, say, a system call bug that would subtly return the wrong answer half the time and you know that programs sometimes rely on this and others crash because of it.

Some things are hard, other things are hard but at least they're basically math, and math has a bit more determinism involved. Imagine if UX design or debugging strategies could always be broken down into state machines!

Great analogy. As an ex-manager always said: "3 women don't deliver in 3 months".

[Edit: I see walrus01 already got that]

This is Scott Adams's (Dilbert) hilarious take on it: http://www.dilbert.com/strip/2007-09-03

corollary adage: nine women cannot gestate and give birth to a baby in one month.

See Brooks' little-known sequel: The Mythical Woman-Month

But 9 women can give birth to one baby per month on average

Not really for a period longer than 9 months. Of course, you can have even 9 women deliver 9 babies in one month, but none in the following 10-12 months.

Retort: That's where you're wrong! If we hooked up nine mothers to one single fetus, we could get the job done in 9 months.[0] The same way, if we hooked up our dev teams to a lead that could delegate the work properly, we could pump out a Meltdown patch in around a month and a half.


> Retort: That's where you're wrong! If we hooked up nine mothers to one single fetus, we could get the job done in 9 months.[0] The same way, if we hooked up our dev teams to a lead that could delegate the work properly, we could pump out a Meltdown patch in around a month and a half.

> http://www.pnas.org/content/early/2012/08/28/1205282109?sid=....

Except this whole train of thought falls apart once you consider the difficulty of "hooking up 9 mothers to a single fetus". In the same way, you downplay the difficulty of coordinating multiple teams on a solution built around breaking research. Show me a working solution of the former and I'll accept the corollary.

you're looking at the problem all wrong, maltalex. Just hire a developer who is already 3 months pregnant.

The Linux kernel developers came up with a decent solution, how can it be that they can do this and the Microsoft developers cannot?

Is Windows written the exact same way as Linux? Never underestimate the amount of technical debt that can be holding a team down.

Ding-ding-ding, this is the non-bs answer.

Is that an excuse though?

It appears not, if such a bad bug can get through all the way to release.

"Linux" had a bug in which you could log into a system by pressing backspace 28 times a few years ago. And by Linux, I meant GRUB[1], and in turn, (many) Linux systems.

We're comparing Linux and Windows, an operating system that contains 3.5 million files[2] (of course, not just the kernel in this case). That isn't really fair. Code is as perfect as humans can make it, and it certainly does not help that there's so much to take into account.

[1] http://hmarco.org/bugs/CVE-2015-8370-Grub2-authentication-by...

[2] https://arstechnica.com/gadgets/2018/03/building-windows-4-m...

This GRUB bug you are talking about is not a kernel problem though. On a side note, I'm going to read up on the links you provided, as I want to see if encrypted root partitions could also be compromised; I suspect not.

That's not quite on par with this Windows bug, but I take your point.

The Linux kernel developers hate their solution, and they only used it because they can't think of a better one. It causes enormous increases in complexity and kills performance in many cases.

They revived previous work on this as part of the KAISER work in November 2017, and still had major bugs with it in February 2018 (ie, 4 months later). That's pretty similar to the 6 month timeline mentioned here.



Linux kernel developer here. I don’t know how MS’s Meltdown solution differs from Linux’s, let alone whether I should hate it.

MS (I think) uses IBRS to help with Spectre, and IBRS is not so great. Retpolines have a more fun name at the very least :)

I think he means Linux kernel developers hate (their own) solution for Meltdown.

The essential element of the solution for Meltdown is the same in every x86-64 OS: unmapping the kernel when in usermode. This is widely hated because it makes kernel entries and exits much slower, and blows away the TLB if your hardware doesn't have PCID support.

Yes, this.

It sucks, but what else can one do?

The Linux developers had a head start in the form of the KAISER (later KPTI) patch set, development of which had AFAIK started before Meltdown was discovered and reported in private to Intel.

>AFAIK started before Meltdown was discovered


According to https://googleprojectzero.blogspot.com.br/2018/01/reading-pr... Spectre was initially reported to Intel on 2017-06-01, and Meltdown a bit later.

After a quick web search, I found https://patchwork.kernel.org/patch/9712001/ which records the initial submission of the KAISER patch set at 2017-05-04. The repository at https://github.com/IAIK/KAISER has an older version of the patch set dated 2017-02-24, indicating that work on it had started even earlier.

Finally, the timeline at https://plus.google.com/+jwildeboer/posts/jj6a9JUaovP mentions a presentation from the authors of the patch set at 33C3 in late 2016. Note that this page puts the submission of the KAISER patch set at 2017-06-24, but I believe that to be wrong; searching the web for "[RFC] x86_64: KAISER - do not map kernel in user mode" finds several mail archives with that message, and they all agree that the date was in May, not June.

That is, even if Microsoft had been immediately warned by Intel (or by Google), the Linux kernel developers would still have had a few extra months of head start, by basing their work on the KAISER patch set. Was it luck, or a side effect of the Linux kernel being used for academic research?

They said discovered AND reported, not just discovered. It is entirely possible someone discovered it much earlier and didn't report, but we won't ever know if evidence is never found.

KAISER was developed to mitigate another, less severe vulnerability.

From the meltdown paper:

> We show that the KAISER defense mechanism for KASLR [8] has the important (but inadvertent) side effect of impeding Meltdown. We stress that KAISER must be deployed immediately to prevent large-scale exploitation of this severe information leakage.

Because Linux is a completely different OS with completely different code and a completely different set of problems.

That's entirely weak.

I think the WINE people are probably looking for your help, for some reason they still seem to think there's a few differences between the two. They'll be happy to know they've been wasting their time.



We ban accounts that attack other users like this. Please stop.


This is not appropriate discourse here.

Another relevant factor I just thought of: the Windows kernel has more constraints, due to binary-only drivers which have to keep working. The Linux kernel could fix any incompatible driver at the same time, since they're all in the same git tree (out-of-tree drivers are not expected to be compatible with newer kernels).

I agree. The philosophy is different. Linux is focused on having the right thing working, at the cost of compatibility (sometimes). Windows is (or at least was) focused on extreme compatibility, and the actual features of the operating system seem to be slapped onto the features of the previous version of the OS.

This seemed to work well for the Windows audience in the past, and also for the Linux audience, since they have different uses and audiences.

People seem to have segregated into those users that just want stuff working and those that want a powerful operating system that allows them to do whatever they want.

At least that was the case until Windows 10 came along...

> only it takes 9

What if we outsourced the QA to India?

that's a really bad analogy... Note: I think that those devs should be up for death row /s

A single person with a fixed 9 month biological timeframe is the analogy you chose to compare 6 months of a billion dollar company's software development time by potentially hundreds of developers (for better or worse) but importantly for an extremely critical class of bugs and therefore that's "just how it is"?

Come on, software is hard but when you fix a vulnerability and expose a far far worse one, and had months to plan, execute, and test it, then you are most certainly justified in being criticized.

It's not like we're saying the code is shoddy and needs work, which is entirely excusable in a short timeframe. It's that they've left users far worse off in the end than where they started.

If MS had allocated 1000 devs to fix this issue quickly, the result would have been an utter disaster.

Of course 1000+ developers working on one single solution waterfall style in a short timeframe is a terrible idea. That's not how software works... and we all know that. You know that.

Jumping on the next worst thing does not excuse them either. Nor is taking another analogy to the other extreme helpful at all in this discussion.

A solid pool of talent with complete flexibility resource-wise and a strong critical-level mandate is nothing like a single person with a fixed biological timeframe, with relatively limited resources, no matter which way you'd like to spin it.

If they had allocated 1k devs into n teams to develop different approaches and to review and test each other's code and approach? No, the result would've been a better patch and probably not that piece of hot garbage.

I suggest reading The Mythical Man Month sometime, you're not accounting for the complexity of running such an "n-teams" scheme.

I have read it and yes, I have. Assuming they had that many developers qualified to work on the problem, they'd also likely already have been employed on other projects. Therefore the management infrastructure would already be in place. The state would need to change, but yeah, the government would already be there and qualified.

My own personal dogma is that your CI/CD system hasn't achieved its goal until everyone on the team can spool up a given build of the code and try to reproduce an error for themselves without interrupting anyone else to do it.

The person who discovers the bug may not come up with the best repro case. The person best equipped at fixing the bug may not be best person to track it. Being able to spool up new people on a problem for cheap keeps the whole experience lower stress and generally improves your consistency with regards to success.

If the cost of someone trying a crazy theory is linear in man-hours and O(1) or even O(log n) for wall clock hours you're going to look like a bunch of professionals instead of a bunch of children with pointy sticks.

From what I understand, Microsoft has never gotten there. They got too big to fail a long time ago. And certainly wouldn't have for Windows 7.

Not only that but the teams would be working independently by design. 9 people can't make a human in one month, but 9 people can make 9 children in 9 months. You can then choose amongst them. So, yeah, I have no idea why you're bringing in mythical man month stuff here.

Anything sufficiently complex can be broken down into simpler pieces. This includes most developer generalists.

Anything sufficiently complex can be broken down into simpler pieces plus the glue holding those pieces together.

In human organizations, that glue itself gets incredibly complex and expensive, as number of pieces grow.

I disagree that glue is expensive and complex. When you build a plywood tower in school to see whose holds up the most to compression, you don't douse your entire structure in glue. You get points off, because it adds so much weight!

People are the plywood: fragile, finicky, and useless if left to their own devices. Management is the middle school kid who needs to take the wood he's been given and make something that will hold up to all the weight that'll be put on top of it. In order to do this, he's been given a hot glue gun and enough glue to mummify the entire thing if he so chooses. Most of the kids will rush bullheadedly (or should I say uncaringly) into gluing the sticks together into something that "looks like it should work." They use too much glue, the structure isn't optimized for load handling, and when the day of truth comes, it crumbles when the bucket that's supposed to hold the weight destroys it!

What is glue? Whatever management wants it to be. It can be a team leader or a hastily configured IRC channel. In my experience (this includes organizing, delegating, and making sure that 40 devs-et-al get what's needed done), if you choose your sticks right, taking the time to make sure they're not hiding any structural faults, you can make the job 65% easier. If you lament that choosing sticks is difficult, I reply with "it's just practice."

The main issue I've seen, has been the all too common "there are no good managers." Especially in technology. The remedies for this? There's no bandaid. Each manager has to realize his personal shortcomings and fix them. But, to throw up his hands and say "the more people working on a project, the slower it'll get done," is a nice way to say "I can't handle all these people, but I'll excuse that away by saying it's inevitable. It's even industry 'common sense!'"

All but a very small number of those teams would have spent quite a while reading manuals, reading code, and learning how the kernel entry code and pagetable handling code worked. Then they'd come up with something, but there would be a severe shortage of reviewers.

Not to mention that the whole problem would most likely leak once that many people knew about it.

There are probably not 1000 good VM engineers out there in the world.

Yet it didn't take 1000 of them to fix it on other OSes. Why are we even debating 1000 devs anyway? That is hardly the point and throwing more and more bodies at a programming problem is hardly ever a solution, nor one I proposed in my original comment.

It's ultimately a matter of talent, resources, and proper management. Which is hardly an insurmountable problem for a major tech company with decades of experience solving world-is-ending bugs.

I put it down to lack of openness to a wider review than just Microsoft engineers.


The nine-month analogy is widely known in software development.

Software is all about capturing as many income streams with as few people as possible.

While also generating the most number of jobs possible.

What kind of jobs? Architects developing biotecture, or janitors cleaning up vomit and firefighters putting out dog shit that's burning?

Are you sure that that particular team inside Microsoft had full six months?

Intel had 6 months.

I'm pretty sure they had around six months give or take a day or so of oh shit in Intel. Of course, Intel may have actually simply broken the glass on a dusty old plan of action "In the event of ..."

Disaster plans are funny things.

I have had the misfortune of having to pull them out twice in my career - in both cases they offered little in the way of guidance for the particular situation that came up.

The set of unknown unknowns that are typically missed makes most of them useless in all but the most narrow of cases, because many companies write them and then forget them. Especially if they are as large as Intel.

Very true, although I'm glad to say I have not had to break out one of my own yet for real. My first experience of a full on DR test was pretty humbling - NetWare servers backed up by the Unix troops via Legato. It turned out that the backups were good but restored at a pathetically slow speed (no reflection on the Unix systems but I suspect the Novell TSAs were a bit shag at the time). We updated "time to restore" estimations and moved on, after adding one or two other results of lessons learned.

Do test your plans (this is not aimed at you personally zer00eyz - you probably know better than most).

There are a lot of unknowns but the basic model of a real DR plan is pretty sound these days, if you can afford it or wing it in some way. An example:

Another site, a suitable distance away. On that site there is enough infra to run the basics - wifi, a few ethernet ports, telephony etc. There should also be enough hypervisor and storage for that. Some backups are delivered there as well as on site. Hypervisor replicas are created from the backups (or directly) depending on RPO requirements and bandwidth available. The only thing that should be able to routinely access the backup files is the backup system (certainly not "Domain Admins" or other such nonsense). Ensure that what is written is verified.

Now test it 8)

.... regularly

Ok now I have to share a story...

The company in question had a rather large on site server room (raised floor, fire suppression) and a massive generator to deal with any power issues as well as redundant connectivity. This room was literally the backup in case their "real" data center went off line.

The problem is that the room was "convenient" so there were plenty of things that lived ONLY there (mistake one) -

When the substation for the office went, and the generator started, everything looked fine. The problem was that no one had ever run the generator for that long... after a few hours it simply crapped out (overheated, problem two).

A quick trip to Home Depot got them generators and extension cords that let them get the few boxes that were critical back up - however one box decided to not only fault, but to take its data with it.

This is when I got a rather frantic call "did I still have the code from the project I did?" - they offered to cut me a check for $2000 if I would go home right then and simply LOOK for it.

Lucky for them I had it - and the continuity portion of the DR plan got revisited.

In hindsight, after I said I had the code, I probably could have asked them to put another zero on the end of the check and they would have done it just to be a functioning business come 6am.

I didn't even have to show some leg to get you to recant the dit.

Thank you - I'm happy to listen to (nearly) everything.

"I probably could have asked them to put another zero" - ahem that's not the IT Consultant's Way exactly. We have far more polite ways of extracting loot. We are not lawyers and should have morals.

Reminds me of that The Expanse quote:

"I have a file with 900 pages of analysis and contingency plans for war with Mars, including fourteen different scenarios about what to do if they develop an unexpected new technology. My file for what to do if an advanced alien species comes calling is three pages long, and it begins with 'Step 1: Find God'."

Microsoft had 2-3 months, tops.

Source, proof of assertion?

As far as I know he's right. The news was given first to Amazon and Microsoft sometime in August. Consider one month for testing and preparing for release, that gives three months to build a solution for all supported operating systems. Two months to do it for the most recent version and one month for backporting to the older ones sounds about right. Maybe a few weeks more, but that's it.

Intel's own press releases. I'm not gonna go digging into old links just because some random dude on the Internet can't use Google.

If the problem is complex enough, six months may be a short deadline.

Especially since it comes as a surprise, and there’s already an existing train on its way to the next station with its own timetable.

that surprise could include "management says we don't have to do anything :/" <5.99 months pass>, management: "we have to patch this and it needs to be done yesterday."

That would be extreme, but I can entirely imagine it taking many weeks for the true importance of this problem to correctly propagate across all management levels.

I wonder how long the actual devs fixing it had? From what I hear from friends who work there, Microsoft is a sprawling bureaucracy with many layers of management, where decisions are far from quick. I'd imagine that after Intel/whoever let Microsoft know about the exploits, it went through many levels of prioritization, negotiation about which team would work on it, not being brought into sprints because of other features already being worked on, etc. Most likely there were people with minimal knowledge of the relevant tech making all these prioritization decisions.

Wouldn't shock me at all if there was very little actual dev work done for the first few months, and then it was all super rushed at the end. Quite possibly the devs with the required knowledge didn't even know this was in the pipeline for months. That's par for the course at every decently large company I've worked at (i.e. 100+ devs), and at a beast like Microsoft I imagine it'd be way worse.

I remember that Microsoft was able to deliver critical fixes practically overnight. This assumed that once you see the problem the fix is pretty straightforward.

Unfortunately Spectre and Meltdown aren't straightforward and go to the very heart of how the OS works. It's not at all easy to fix this when you have an enormous amount of software working on top of it, depending on every little quirk your solution provides.

Yea, it is probably the biggest change to the Windows kernel in a security update.

This is what happens when you don't have a QA department.

This is something that you find through code review, not testing. Apart from regression testing, but that presupposes that you encountered the issue before.

Are you implying that Microsoft doesn't do QA?

If they do, whatever issues they're occupied with finding would call for an exorcism.

These kinds of things should be part of an automated test suite. Specifically, the kind of tests that were written years ago.

Honestly, Microsoft is really big into automated testing. I'm surprised this slipped through.

I don't think there are any OS kernels that practice test-driven development - most of them don't even have code coverage working. It's also very hard to test for a problem you haven't thought of yet.

A really simple test you can compile with cygwin - if it doesn't crash, the bug is present:

  #include <stdio.h>
  int main(void)
  {
          /* Fixed address of the PML4 via the self-referencing entry (Windows 7 x64). */
          volatile unsigned long *ptr = (volatile unsigned long *)0xFFFFF6FB7DBED000;
          printf("%lx\n", *ptr);
          return 0;
  }

I seem to be unable to find a patch that will make it so that this doesn't run. Windows Update says that I have all required patches. I first tried KB4088875. That didn't cause this program to fail. Then I tried "2018-03 Preview of Monthly Quality Rollup for Windows 7 for x64-based Systems (KB4088881)", which was only a recommended update. That didn't help either.

Same for me. I tested a Windows 7 x64 system which has all security patches, but caf's "really simple test" above still runs, which seems to indicate that the bug still exists. Same as you, I applied KB4088881, which was the only pending update, but it made no difference.

Also, I tried the command from the original article:

pcileech.exe dump -out memorydump.raw -device totalmeltdown -v -force

This creates a 5GB file which does look like a raw memory dump. I'm not sure how to interpret this; I don't know what the behavior should be with or without the bug.

On the off chance that anyone stumbles across this in the future, KB4100480 fixes this.


CVSS 3.0 base score of 7.8.

I finally found the fix. It's KB4100480. It makes the little test crash as it should.

So there is no patch that fixes it?

Same here.

I'm worried.

I tested this on a Win7 x64 system with the 2018-01 (KB4056897) and 2018-02 (KB4074587) patches. It segfaulted. Hmmm.

Ahh, I was using a 32-bit gcc. 64-bit gcc shows it :)

  $ x86_64-w64-mingw32-gcc meltdown.c -o meltdown.exe
  $ ./meltdown.exe

But can you find a March patch that makes the correct 64 bit version segfault? I can't :-(


Grammar police warning: The comma in “if it doesn’t crash, the bug is present” actually makes the intention more difficult to understand.

The comma placement "if clause1, clause2" is extremely common. In the above sentence, there is no other place it can go, other than nowhere at all.

"if, it doesn't crash ..." nope

"if it, doesn't ..." nope

"if it doesn't, crash ... " nope

"if it doesn't crash, the " yep!

"if it doesn't crash the, bug ..." nope

"if it doesn't crash the bug, is ..." nope

"if it doesn't crash the bug is, present" nope.

When it is present, it does help to separate the if and then, particularly in the absence of the word "then".

Without the comma, the prefix "if it doesn't crash the bug" can be scanned as a viable clause, only to find that the suffix becomes a fragment.

You brute-forced comma placement. I tip, to you, my hat.

Thank you for this and everyone who has downvoted an incorrect and misleading statement

"If x, y" is a shorthand for "If x then y" in spoken language.

That's the grammatically correct place to put the comma. Fairly sure you're actually grammatically required to have a comma there.


Well, let's take a breath and be grateful that even MS can mess something like this up.

One dev or 1000, who cares, whatever they chose did not work particularly well. Are "they" to blame? Yes. Are the engineers to blame? Probably not. Is management the culprit? We don't know.

What's left? Next time your customers bug you about some random downtime caused by an overworked datacenter intern, don't feel stressed. Take the time to remember that even if you would've had billions of dollars, years of experience and thousands of employees, you could've messed up, just like MS did :)

"Take the time to remember that even if you would've [sic] had billions of dollars, years of experience and thousands of employees, you could've messed up, just like MS did :)"

When you are the direct, contracted, IT support for a company then you do have responsibilities. You might be considered responsible for timely delivery of patches - a fair argument in court I think. Mitigations might involve helpdesk logs as well as contracts.

well_done: Your tone comes across as BOFH. I'm possibly a simple PHB who owns an electric cattle prod that is wired up to the mains (three phase) but I prefer to get sign off for a project via work committed and not threat done.

>Your tone comes across as BOFH

I didn't read it like that, and I'm usually the first to read things negatively. I read it as motivational: don't feel bad about your own mistakes, even the big guys with tons of money and a lot of really smart people mess up sometimes. So cut yourself some slack and just do the best you can.

"Next time your customers bug you about some random downtime caused by an overworked datacenter intern, don't feel stressed. Take the time to remember that even if you would've had billions of dollars, years of experience and thousands of employees, you could've messed up, just like MS did :)"

I missed the :) which might sound a bit naff now but was probably intended to deflect comments like mine. Hit taken. However I did invoke BOFH which is (I hope) normally seen as an indication that a comment is not to be taken too seriously.

EDIT: BOFH => Negative - nope, not here.

You're correct, I read "BOFH" as a negative. I apologize since you did not mean it in that way. I never thought that BOFH would ever be considered the good guy in the story :)

> an electric cattle prod that is wired up to the mains (three phase)

"Have you seen the boss's new toy?" "Yeah. Coincidentally, I'll be working remotely moving forward. Good luck!"

Coming across as BOFH was not my intention at all, maybe I should refrain from using emojis to convey meaning ;)

I just thought that if you're working in a high pressure environment (and this applies to virtually every coding shop I've ever known) and get trouble from all sides all the time, it feels reassuring to know that in fact you are not the problem, and neither is your employer; stuff like this just happens, even to the best.

Yeesh. I didn’t know that Windows still uses the self-referential page table trick. This makes me very nervous, especially since they seem to keep it mapped in the user page tables. This seems likely to poke a big hole in ASLR if nothing else. It’s a huge target for write-what-where exploits.

They changed it in Windows 10 (RS1 IIRC)[1].

[1] http://www.alex-ionescu.com/?p=323

The tl;dr is that they're still using the self-referential page table trick, however the PTE_BASE is now randomised at runtime with dynamic fixups.

If an attacker can’t find it by probing the smallish number of choices using one of many MMU layout fixes, I’d be quite surprised.

Does this bug also imply they're not using SMAP, since otherwise any kernel access to the page tables should have tripped it?

Replying to myself, I guess not, because the kernel doesn't modify the user pagetables using the user pagetables' own self-referencing entry. Rather it has the kernel %cr3 loaded and modifies them using a mapping of the user pagetables in those kernel pagetables.

..which, on further reflection, raises the question: if those page tables aren't loaded when in kernel mode, why do they even need a self-referential entry at all?

> This seems likely to poke a big hole in ASLR if nothing else. It’s a huge target for write-what-where exploits.

How? Is it mapped to a static location?

It is in Windows 7.

Good to know, thanks.

What's wrong with using self-referential page table? I thought everyone uses that?

According to the article, the self-mapping PTE is randomized in the latest Windows 10.

Security is hard. I don't think I'd have the stomach to work on kernel or encryption code.

The worst I can do here in userland is crash or delete data. And that's pretty bad already

I dunno. I think a program that produces subtly wrong results is the worst. It reminds me of a project I once did. It involved producing reports from tens of thousands of records including summing some of the fields. I was constrained to work on Windows so I put the data into an SQL server database and used MS Access to produce very nice looking reports. I reviewed the reports and everything "looked OK" so I handed them to the users for approval. The users were accountants. They added up the partial sums and pointed out that the results were only approximately correct. It turns out that MS Access is not so good at arithmetic. I restructured the reports to perform the arithmetic in the SQL queries and just use MS Access to format it for a pretty page. I also checked the arithmetic before handing the next revision over.

Better testing (as in more than a superficial glance) would have caught this before review but there always exists the possibility that subtle bugs can sneak past even well thought tests.

Just my own experience and opinion.
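
The comment doesn't say why the sums were off. One common cause of "only approximately correct" totals is binary floating point, so the sketch below assumes that was the problem (it may not have been) and shows the kind of check that would have caught it. The amounts are made up.

```python
from decimal import Decimal

# Hypothetical ledger: ten payments of 0.10 each, stored as text.
amounts = ["0.10"] * 10

# Summing as binary floats accumulates rounding error.
float_total = sum(float(a) for a in amounts)

# Summing as decimals is exact for values like these.
exact_total = sum(Decimal(a) for a in amounts)

print(float_total == 1.0)               # False
print(exact_total == Decimal("1.00"))   # True
```

Comparing the report total against an independently computed exact total, as the accountants did by hand, is cheap to automate.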

Agreed. I was doing a derailment investigation a number of years ago which involved digging through event recorder logfiles. The event recorder is such that it writes an entry every second but only updates GPS every 20 seconds, so it writes GPS coordinates against every 20th entry.

What I discovered was that the event recorders on certain locomotives updated GPS at the 20th second rather than the 0th second. This meant that the GPS entry next to each line was in fact offset by 20 seconds - i.e. the entry for 02:11:40 was in fact what was sampled at 02:11:20. I think they must've held the GPS coordinate in memory somewhere but updated it AFTER writing that second's entry, so they wrote 02:11:20 whilst holding 02:11:00's GPS, then updated it, but then wrote that update at 02:11:40, etc. This was a fault with the design of the event recorders, not just one loco, as it occurred on each of that type that I looked at.

This confused me so much because it looked right - it was in the right general location, it was updating, etc. - but for a solid few days I did a bunch of analysis thinking the train was in a different location to where it really was. I eventually picked up on it when subtle things kept not adding up and verified it by watching another loco come to a stop but then see the GPS keep moving for a bit afterwards until it settled.

I agree with you, subtly wrong results are the worst.
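
A toy model of the write-then-update ordering described above, to show how it yields a clean 20 second lag. This is a guess at the mechanism, as the comment itself says, not the recorder's actual firmware; `true_position` and `record` are invented for the illustration.

```python
def true_position(t: int) -> int:
    """Hypothetical ground truth: the train moves one unit per second."""
    return t


def record(duration: int, period: int = 20, update_first: bool = True):
    """Return (timestamp, logged_position) for each entry that carries a GPS fix."""
    held_fix = true_position(0)
    rows = []
    for t in range(period, duration + 1, period):
        if update_first:
            held_fix = true_position(t)   # refresh, then write
        rows.append((t, held_fix))
        if not update_first:
            held_fix = true_position(t)   # write, then refresh: the stale fix was logged
    return rows


good = record(60, update_first=True)
buggy = record(60, update_first=False)
print(good)   # [(20, 20), (40, 40), (60, 60)]
print(buggy)  # [(20, 0), (40, 20), (60, 40)]
```

Every buggy row is a perfectly valid position, just one update period old, which is why it looks right at a glance.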

I think there is a spectrum of styles of people who work on this stuff. At one end, you just write some code and see if it works. At the other end, you read specs very carefully, think about what you need to do, and do it. No matter which approach you take, you still get blindsided every now and then.

Yep. The most fun is when the spec has built-in security bugs. Especially when it only manifests in the interaction with a different spec.

you likely write a lot more code than security/kernel-hardening professionals though. they just likely spend more of their time reviewing and researching than coding

I guess upgrade to Windows 10 is the message Microsoft is trying to get out here?

Laying off all those QA and testing people will have some downside - MS seems to be letting older versions take the hit.

Perhaps. However, Microsoft has published very clear guidance on how long previous versions of Windows would receive support for, specifically including the period for security updates. Doing what you're describing is effectively reneging on that deal, and that sends a very different kind of message.

Sure, I have a hard time believing MS would purposefully screw their Win 7/8 enterprise customers - but the issue at hand suggests it's at the very least a byproduct of their new strategy of focus on Win 10 and the decision to do with less QA/Testers by involving more end users to participate in testing. Thus Win 7 users end up with slower, less tested patches and no hardware support backports. To be fair only the less tested patches sound terrible.

That or switch to unix.

Bad coding on one product isn't exactly the most convincing strategy to get people to try a different product.

What sort of QA would be doing memory access tests? Most I've worked with can't even do automation scripts.

That’s like saying most developers are WordPress install monkeys. It may be locally true but it’s not globally so. Good QA people – and I’m sure Microsoft and similar major players have plenty of them – are just as skilled as the developers but working at different goals. In addition to security, they’ll be working on scalability, detailed correctness tests, fuzz testing and other automation techniques, etc.

If they can be replaced with a script, your employer has a management failure and is wasting a lot of money on short-term savings.

Has anyone actually confirmed this bug? The author seems to be an expert in low-level DMA security, but it would be nice to see independent confirmation. Reading through the comments so far, it doesn't seem so. The closest anyone comes is this: https://news.ycombinator.com/item?id=16693599

I was hoping to see someone who said:

(1) I tested a Windows 7 X64 machine without the Meltdown patch (pre-December 2017) and couldn't read arbitrary memory.

(2) Next I tested with Microsoft's Meltdown patch KBnnnnnnn (Jan or Feb 2018) and could read arbitrary memory. The system is insecure.

(3) I then tested with Microsoft patch KBnnnnnnn (March 2018) and can no longer read arbitrary memory. They fixed it.

I finally found the fix. It's KB4100480. It makes the little test crash as it should. Phew!

I did use a modified version of my short test program to actually test modifying the page tables to read a chosen physical address, which worked just fine. The bug is real.

And KBnnnnnn that fixes it is which KB?

My take on this is a little bit dumb, but, once upon a time, many moons ago, I thought I understood the CPU I acted upon, I could peek and poke and look up what was where. I mostly wanted faster, but what I got was more complex.

Is there a way to just get faster without the complexity? What would a new CPU architecture and OS look like if we started again? Is there room for open hardware to save us all?

Superscalar processing was unfortunately a major step forward in terms of performance.

The answer in https://stackoverflow.com/questions/8389648/how-do-i-achieve... is interesting; I'm particularly fascinated by the temperature warning (the answer author's CPU got to 76C in testing).

My CPU's sitting at 30C right now. It maybe climbs to 48C if Chrome's being stupid, and 50+C if I'm doing something mildly taxing. I've never made it go beyond 60C IIRC.

So, modern CPUs are so efficient that they're simply just never hitting their maximum throughput. I think that's pretty incredible.

The sad thing about CPUs that don't use modern (superscalar, multi-stage, microarched, etc) design is that they just can't keep up.

And people's OCD about speed and (more frequently) parallelization nowadays drives what they'll buy. Something had better have a killer feature if it isn't fast or highly parallel.

So it's possible, but a huge headache. Whatever you built would likely be highly purpose-specific.


The article says that this vulnerability was patched in March 2018, so at least there's that.

Is the March 2018 update even out still? I thought they pulled it because it reconfigured all the NICs in the system, losing static IP configuration in the process...?

I'm still working on a bloody huge list of customer updatathons for Meltdown and Speccy. Now I have to go back around a load of them that I have already patched and find the Win 7s and 2008r2s and update those before I continue.

Oh well, it gives me something to do of an evening 8)

Why are you manually patching workstations? WSUS allows central management (inc. zone deployment), but even in Windows 7's default state it should apply these updates without intervention.

2008R2 I can see doing it by hand, but Windows 7 clients is odd. Particularly as it seems to be taking you two months to apply urgent patches.

Some of my customer VMs are Windows 7 - Veeam proxies for example. I also take backups quite seriously. Yes this is all a bit manual in some cases.

(Nearly) All of them are on the end of an IPSEC VPN that I can get at from home via the office web proxy and another VPN or via magic. Some of them have 192.168.0/24 or - those are on the end of OpenVPN. I wrote this: https://doc.pfsense.org/index.php/OpenVPN_NAT_subnets_with_s... . You have no idea what networking is about until you've had to do that sort of nonsense a few times 8)

I don't have the luxury of one WSUS to manage, we have loads of the bloody things. Some customers have pretty skilled local IT depts, some have somewhat vocal users that accuse you of resetting their passwords after spending hours doing way more than they have paid for and would not understand what you are on about in the first place. I love them all equally as any parent would ...

When I get bored of watching Windows Update I run apt update && apt upgrade && reboot on a few machines and keep a weather eye on the monitoring system. When I get really bored, I run up yaourt on my laptop or my office PC. When I've got a newly installed Win system or two to patch, I fire up a few emerge -Uvh --deep --newuse --keep-going @world sessions (I'm not really joking here) or run up genkernel.

Yes, there is the default state designed by .... bbzzzzrrt .... soz, lost it, and then there is reality. Could I also remind you that there is rather more to patching than WSUS:

* Firmware - Dell, HPE and Co have had to do rather a lot of work here and had to start again in Jan when Intel dropped the ball

* Hypervisors - I generally see VMware - that's a lot of patching and don't forget that some of them were buggered, so need excluding.

* VM vHardware versions - yep, all those little lovelies have their own hardware types to worry about

* "My fooking factory runs 24x7 - what are you going to do about it?" .... "Yes but you didn't go for the full cluster version *sigh* I'll see what I can do" ...

You think I'm odd! No mate, my little company are well aware of automation and use it where we can but we are pragmatic and have to deal with a lot of reality.

We could of course bind our customers to our iron will and enforce our policy and stuff. They would not work on weekends or other odd hours. They would not insist on doing things their way and they absolutely would pay us on time - they generally do 8)

Wow, that's crazy. About as bad as it gets for local privesc!

So why does this only affect Windows 7?

Parts of the memory management code were rewritten for windows 10 to do fun things like randomize the location of the page tables.

This fairly significant change wasn't backported to Windows 7.

Then when they went to backport the Meltdown fix to Windows 7 they set a 'this page is user accessible' bit in the page tables by accident.

Probably because after Windows 7 they started a kernel rewrite, known as MinWin, to extricate the Win32 Userland tendrils that had crept into the Kernel since the NT days.


The article doesn't seem to indicate that MinWin started after Windows 7. If there has been a fundamental kernel change effort after Win7 (I'm not aware of one), maybe it has a different name?

You're right, for some reason my brain said Vista came after Windows 7. Given that 7 came after Vista my theory makes no sense.

Any indication of whether this was actually exploited? I really don't want to do a full key-rotation routine.

Also, does Windows 7 map the entire address space into kernel memory? That is, would this have enabled direct memory access to other processes?

My understanding of the article is that the page table itself was writable. So an attacking process could map in the entire memory of the computer and read everything, regardless of what was in the kernel's version of the page table.

The attacking process could also put whatever code it wanted into the kernel, and so give itself full access to everything on disk as well.

This sounds contradictory:

"Only Windows 7 x64 systems patched with the 2018-01 or 2018-02 patches are vulnerable. If your system isn't patched since December 2017 or if it's patched with the 2018-03 patches or later it will be secure."

"I discovered this vulnerability just after it had been patched in the 2018-03 Patch Tuesday. I have not been able to correlate the vulnerability to known CVEs or other known issues."

The fun thing is that even MS admitted it breaks non-PAE kernels and pre-SSE2 processors (in the most recent one). I have been fighting a similar bug in one of the Jet 4.0 patches for a while now.

Sounds like Microsoft found and patched it independently already. Afterwards someone else (blog author) found it too, maybe by reversing the patch from patch Tuesday.

Predictable. Fixing old bugs introduces new bugs.

I wonder for how long Windows as a piece of software can continue to grow. I looked at the list of services, and it's crazy. So much exotic functionality, and so much of it I don't ever need. Then the file system: there are even hidden folders managed by Windows itself that just grow and take up space. All that adds to complexity, and increases the probability of bugs. I wish there was a version of the OS that just shed all that unnecessary functionality and returned to basics. Something like a minimalist Linux distro, but able to run all games and Office.

Even before that it's already unfair to compare a closed-source product to an open-source system. Bugs are much easier to find in an open-source system. It doesn't even by itself mean that there are more of them. If you look at the big picture, it's not like Windows is known for its security.

>Even before that it's already unfair to compare a closed-source product to an open-source system. Bugs are much easier to find in an open-source system

Ironically, wouldn't that make it even more unfair for Windows? Shouldn't all the 'millions of eyeballs' looking at the linux code be making it more secure?

>If you look at the big picture, it's not like Windows is known for its security.

True, but security bugs are easier to reason about, than feelings.

We can't measure the number of security bugs, we're measuring how many get fixed. Fewer eyeballs on Windows would imply fewer discoveries, and fewer bugfixes as a result.

>We can't measure the number of security bugs, we're measuring how many get fixed.

The number of bugs found should be trending towards zero since millions of people have the opportunity to improve the source code and prevent the bugs from being introduced in the first place. There are of course other advantages to having the source be open, but if there is no security advantage to open source, that's going to put a dent in some of its marketing.

>Fewer eyeballs on Windows would imply fewer discoveries, and fewer bugfixes as a result.

Why would fewer people be looking at Windows compared to Linux? Security Researchers don't really discriminate. Or did you mean just the MS developers? Hmmm, I don't know how many windows bugs were found through external sources vs internal. Perhaps someone has already done that analysis..

> Why would fewer people be looking at Windows compared to Linux?

As I wrote in the original post, this is because Linux is open-source. There are few people looking at Windows, simply because there is no source to look at, and as a result there are 10 times less people in the world who potentially even can look at it and check for bugs. That's why. With Linux you need basic systems programming skill and ability to code simple exploits. With Windows you either need to be working there (and be assigned to this task) - or reverse-engineer, which is a much rarer and complicated skill.

Yes, in theory all that is correct.

> The number of bugs found should be trending towards zero

Only if no new features are ever introduced.

> Shouldn't all the 'millions of eyeballs' looking at the linux code be making it more secure?

Yes, this is exactly what happens, from my experience.

-> more people look at code

-> they find (and fix) more bugs

-> the system is more secure, because all bugs are found and fixed, instead of being kept inside the code and being sold on hacker forums and agency surveillance projects.

You also know that Linux is not just one codebase from 20 years ago, it constantly changes and adds new features? Of course there will be new bugs (like any other recent OS).

>-> they find (and fix) more bugs

Where is the evidence that this happens? Do you have data (Open vs closed) showing more security bugs were found through developers, versus external sources?

>-> the system is more secure, because all bugs are found and fixed, instead of being kept inside the code and being sold on hacker forums and agency surveillance projects.

Why would a hacker fix a Linux bug for free, but choose to sell a Windows bug? That doesn't make sense to me.

Might be worth reading this article :


It is really pointless using the count of CVEs as a measure of how vulnerable a product is.

AFAIK, every single form of aggregation that reduces variance, biases your data set.

>It is really pointless using the count of CVEs as a measure of how vulnerable a product is.

I read the article, and that is certainly the opinion of the author here.

Security is a large field. You can reduce it to number of bugs. You can reduce it to the development process used to create the product. You can reduce it to methods of defending against future vulnerabilities. You can reduce it to methods of tackling bugs. You can reduce it in along any axis. I don't think using CVEs as a measure is pointless. I find them to be useful.

"Hmm, but it appears that windows has fewer security bugs than Linux. Is there any data showing otherwise?"

Yes, I think you are being a bit of a noddy comparing a kernel with an entire OS. That said, all software has bugs. Blimey, how on earth can you compare the paltry 3000000000000 odd source files of Windows tucked up in GIT with the gazzilions of source files that comprises a modern Linux based system (let alone the BSDs etc).

I will simply mention here that when I update an LTS Ubuntu or Debian box I run "apt update && apt upgrade && reboot" (or use a GUI if I'm bored) and it takes a few seconds to minutes to update the entire system. Everything. That includes Java, Flash, Office suites, graphics drivers, USB drivers, printer drivers, CAD suites, database servers, web servers, PHP, Python, Perl, Rust, Go, ... need I go on. Everything. The same happens when I use pacman or yaourt, or emerge, or yum, or rpm or whatever.

I'm personally CREST accredited, so have a fair idea about security and prefer to spend my time doing stuff and not waiting for updates to install (if I can even find them) - you?

FWIW WinXP is officially quoted as 45 million lines of code (https://www.facebook.com/windows/posts/155741344475532), everyone's decided Win10 is 5-10 (some say 15-20) million more.

I've been meaning to SLOCcount Linux sometime, actually!

Having said that, I don't think it'll be 45M LOC. The kernel is 20M LOC (https://www.linuxcounter.net/statistics/kernel). Chrome is 18M (excluding blank lines/comments) (https://www.openhub.net/p/chrome/analyses/latest/languages_s...). LibreOffice is 9M LOC (https://www.openhub.net/p/libreoffice).

And then I found out that KDE is 60M LOC!! (https://www.openhub.net/p/kde)

GNOME is 9M (https://www.openhub.net/p/gnome).

But I'm guessing those two stats are comparing just the base desktop environment in GNOME's case with all the productivity apps (including KWrite et al) and system libraries (including QtWebKit et al). This must be kept in mind.

TL;DR, an incredibly basic system with just a word processor and web browser, and maybe a minimal windowmanager on top, would be 47M. Adding KDE in makes it 107M - but you're almost never going to use all of it (whereas with Chrome and LibreOffice some large proportion of that 18M and 9M is loaded into RAM and potentially targetable).

Mate, the sheer amount of LoC in any modern system is nearly uncountable. I have been a serious Gentoo aficionado for many years. My lap has been burnt for hours simply compiling Firefox or LO. They are both massive and they are only two apps.

If you want to SLOC Linux then download it https://www.kernel.org/ and help yourself. Why not look here as well https://www.freebsd.org and others - those are my mates, and good ones.

Sorry, did you have a point? I honestly have no idea what you are saying.

Sorry, I did gild the lily somewhat, this was my essential point: "Yes, I think you are being a bit of a noddy comparing a kernel with an entire OS."

You have no idea of the complexity of things until you delve into Windows Side-by-Side...

99 little bugs in the code, 99 little bugs.

Take one down, patch it around...

127 little bugs in the code.

I guess we know now that from the 3000000 files or what it was they boasted about, not a lot of them are some sort of unit test..

How can you test against an unknown bug?

It wasn't an unknown bug. This was a fix for a known bug. The point is that there was no test in place for memory isolation between users or processes.

FYI, security testers call this type of testing negative testing, which is different from functional testing that is done to test an app works "properly". However, for an OS this test is not negative testing but functional testing if the OS is designed to enforce user and process isolation.

Testing to make sure you can't read from somewhere you are not supposed to read from seems like a pretty obvious test for an OS.

There are lots of places you're not supposed to read from. Does any operating system have 100% test coverage of addresses that aren't supposed to be readable?

IDK, there's a quite narrow whitelisted known range of addresses that should be readable by your process; you could (and should) certainly have a simple automated test that simply tries to read everything with the expectation that it should succeed only in known cases.

This one apparently meant that something intentionally mapped for kernel access only was now accessible from userland. That is exactly the tedious, boring issue someone manually testing a bunch of applications will never find, but unit tests are designed to.
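
A rough sketch of what such a whitelist test could look like, run against a snapshot of page-table mappings instead of live memory. Everything here is hypothetical (the `Mapping` type, the snapshot, the allowed range); only the two flag bit positions are the real x86-64 ones. It is not how any OS vendor actually tests this.

```python
from dataclasses import dataclass

PTE_PRESENT = 1 << 0  # x86-64 "present" flag
PTE_USER = 1 << 2     # x86-64 "user/supervisor" flag: set means user mode may access


@dataclass(frozen=True)
class Mapping:
    start: int  # first virtual address covered
    end: int    # one past the last virtual address covered
    flags: int


def user_visible_violations(mappings, allowed_ranges):
    """Return present, user-accessible mappings that lie outside every allowed range."""
    bad = []
    for m in mappings:
        if not (m.flags & PTE_PRESENT and m.flags & PTE_USER):
            continue  # not reachable from user mode, nothing to check
        if not any(lo <= m.start and m.end <= hi for lo, hi in allowed_ranges):
            bad.append(m)
    return bad


# Hypothetical whitelist: the lower (user) half of the address space.
ALLOWED = [(0x0, 0x0000_8000_0000_0000)]

# Hypothetical snapshot: one ordinary user page, one kernel-only page-table page.
snapshot = [
    Mapping(0x0000_0000_0040_0000, 0x0000_0000_0040_1000, PTE_PRESENT | PTE_USER),
    Mapping(0xFFFF_F6FB_7DBE_D000, 0xFFFF_F6FB_7DBE_E000, PTE_PRESENT),
]
print(user_visible_violations(snapshot, ALLOWED))  # []

# Same snapshot with the user bit set by mistake on the kernel mapping.
broken = [snapshot[0], Mapping(snapshot[1].start, snapshot[1].end, PTE_PRESENT | PTE_USER)]
print(len(user_visible_violations(broken, ALLOWED)))  # 1
```

The check is dull and mechanical, which is the point being made above: a one-bit mistake like this is easy for a script to flag and nearly impossible to notice by using applications.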

> How can you test against an unknown bug?

The point of testing is to make the unknown bugs, known.

The point of testing is actually to ensure certain classes of known bugs aren't present. You can't test for bugs of which you have no knowledge.

In this case, however, an OS should have some sort of test in place that process and user isolation is being enforced. That is certainly a known situation that can be tested. It is also a selling point of the OS.

Indeed, this bug is a yowzer.

Yes, how cute they put everything into a thingie designed by Linus, for Linux and it scales way beyond its original use case. There's probably a lesson there somewhere.

Git is great, but no, actually it didn't scale and they had to develop GVFS for it to work. (https://blogs.msdn.microsoft.com/devops/2017/02/03/announcin...)

"Today, we’re introducing GVFS (Git Virtual File System), which virtualizes the file system beneath your repo and makes it appear as though all the files in your repo are present, but in reality only downloads a file the first time it is opened."

No, git works fine - that looks like an Engineer's bodge to account for an inadequacy elsewhere.

Huh? It's to avoid having to download the entire massive repo, which probably takes too much time to be a practical part of the developers' work flow.

I'm just about to add this to our Risk Register. I'm thinking of 0.01 x 100 (our scoring system is 1-3 x 1-3.) That means I think it is very unlikely but seriously (I will probably offend someone if I let loose here) nasty.

So you'd give this 1 on a scale from 1 to 3? Presumably 3 being the most serious.

That doesn't make any sense. If this bug is present, this is instant total control of a PC. That's as bad as it gets.

Sorry, I was being silly. Our RR is not unusual in having a scoring system. We normally score from 0-9. There are two parts: "How likely is this thing going to happen" and "what will it do". Both parts are given a score of 1-3 and then the scores are multiplied together to get the Risk score (we use slightly different terminology but the idea is there). An item gets zero when fixed or considered not a risk but is deemed worth documenting.

So, on our register you get a series of items with weights from 0-9 that is a self ordering list of things to fix. It is pretty simple and you could add more dimensions if you like.
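
The scheme described above fits in a few lines. This is a minimal sketch with made-up entries and field names, not anyone's actual register.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int          # 1 (unlikely) to 3 (likely)
    impact: int              # 1 (minor) to 3 (severe)
    resolved: bool = False   # fixed, or judged not a risk but worth documenting

    @property
    def score(self) -> int:
        # Resolved items score 0 but stay on the register.
        return 0 if self.resolved else self.likelihood * self.impact


def prioritise(risks):
    """Highest score first, which gives the self-ordering to-do list."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Hypothetical entries.
register = [
    Risk("Unpatched Win 7 kernel bug", likelihood=1, impact=3),
    Risk("Backup tapes not tested", likelihood=2, impact=3),
    Risk("Old printer firmware", likelihood=2, impact=1, resolved=True),
]
for r in prioritise(register):
    print(r.score, r.name)
```

More dimensions can be added by multiplying in another 1-3 factor, at the cost of a wider score range.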

I'm just a simple Board member of my company - MD in my case. We try to create to do lists that have a reasonable chance of being fixed with a reasonably simple ordering of importance.

Unless OP edited his post, he appears to be scoring it 100 (i.e. it's off the scale).

I misread grandparent post. The scale is 1-3 x 1-3, in other words, from 1 to 9.

0.01 * 100 is 1.

I was being silly and making quite a few assumptions about readers - which is also daft.

This is a really awful snag but it has already been patched, provided you apply the updates.

Ah, I see. I was reading it as two axes, not a product.

Yes a product although the term "two axes" works as well. It's a pretty common way of quantifying "risk" into something that you can tabulate and form a todo list. You list your risks and give each one a score from 0-9 that is made up of "chance of happening" x "business impact or importance or whatever". You could score either as zero as well which will obviously cause the total score to be zero.

I'll be changing our Risk Reg soon to become a Risks and Opportunities Register after a discussion in our last ISO 9001 audit. Not sure how the scoring scheme will work for that yet.

This may look like a bit of a silly pseudo formal exercise but it really does help with decision making. There's nothing wrong with bending the scores either, if you are open about it. It is simply a way of prioritising a list of things to do in the end.

I have to be a PHB sometimes as well as a sysadmin 8)

It's super likely. There's probably already malware in the wild.
