Why Aren't Operating Systems Getting Faster as Fast as Hardware? (1990) [pdf] (stanford.edu)
55 points by vezzy-fnord on July 19, 2015 | 39 comments



It's clearly a combination of bad OS architecture, bad software architecture, and misalignment of hardware + software interests. We've seen improvements on the last two in different ways, while OS improvements are hacks piled on hacks piled on the original mess. There were attempts to improve performance with purpose-built designs: BeOS demos made the desktop scream with performance; QNX smoked other microkernels such as Mach; IX did this for dataplanes; mainframes' Channel I/O gives them insane throughput and utilization (90+%); Intel's i432, IBM's System/38, the Burroughs B5500, Cavium's Octeon network SoCs, and the recent Azul Systems Vega all accelerated critical functions with purpose-built hardware.

So, there's all kinds of ways to improve things at each layer. Many were done and are being done with vast improvements over the competition of the time period. Why do mainstream OS's not improve as mainstream hardware improves? Bad design, bad implementation, and strong desire for backward compatibility that makes fixes hard. Good news is there's a niche market and academic R&D that are always creating clean-slate stuff that does it better.

Admittedly, though, replacing a full OS isn't likely to happen as users reject anything without feature X, app Y, or benchmark Z. The thing is there's a lot of those things in an OS to the point that barrier of entry to building one is insanely high. Even Solaris 10, despite having a hell of a start, cost over $200 million to design, implement, and test. Best shortcut to that I've seen are the Rump kernels. That combined with the clean slate work means we might see a better thing happen incrementally over time.


> Admittedly, though, replacing a full OS isn't likely to happen as users reject anything without feature X, app Y, or benchmark Z. The thing is there's a lot of those things in an OS to the point that barrier of entry to building one is insanely high.

> Best shortcut to that I've seen are the Rump kernels.

Well, not really. Rump kernels are a godsend shortcut (similar to DDE, but much more advanced) for device drivers, but Blub-loving users wanting their apps means you still need to roll a POSIX userland of some sort (the way GNU Hurd does it over Mach is quite interesting: it supports things like processes with multiple uids, with all POSIX calls being RPCs that delegate to object servers), or, since we're talking about a mainstream improvement here, Darwin/XNU and maybe even Windows NT. No escape.


The L4 kernels have mostly just been porting Linux, etc., to user mode to run the untrusted apps with calls to the underlying API. That might be the best that can be done outside of some kind of source-to-source translation with heuristics.


What were some of the main reasons that made QNX and BeOS fast? For BeOS, was it that the OS took advantage of asynchronous execution?


animats explained QNX's design and results very well on a HN thread. Google isn't giving me crap even when I type his exact words in. Only does that on HN for some reason. I pastebin'd my copy here: http://pastebin.com/2K119Wtj

As far as BeOS goes, I found this Word document where some people wrote up a summary of the design decisions that boosted its performance and reliability, and of how easy it was to develop on vs. e.g. POSIX.

https://users.cs.jmu.edu/abzugcx/Public/Student-Produced-Ter...

Hope those help.


Interesting, thanks, I'll definitely check them out.


Because programmers

Systems Software Research is Irrelevant (aka utah2000 or utah2k)

    By Rob Pike (Lucent Bell Laboratories, Murray Hill)
“This talk is a polemic that distills the pessimistic side of my feelings about systems research these days. I won’t talk much about the optimistic side, since lots of others can do that for me; everyone’s excited about the computer industry. I may therefore present a picture somewhat darker than reality. However, I think the situation is genuinely bad and requires action.”

http://doc.cat-v.org/bell_labs/utah2000/


I think the paper should be titled: Why isn't hardware as fast as advertised?

If memory bandwidth forms the bottleneck, then a faster processor is of little use.


I recently learned that while reading this article about Cray computers.

http://www.techrepublic.com/blog/classics-rock/the-80s-super...

Old Crays' sustained bandwidth is still higher than recent Core iX CPUs', making them 'faster' for actual workloads, even though their peak numbers are lower. CPU GFLOPS are a lot less useful if there's no data to flop onto. System design vs. marketing.


Not true. If you program like it's the '80s you don't get nearly the same speedups, but some structures like linked lists are basically obsolete, or at least extremely exotic, for performance work. Even so, Intel's out-of-order execution, branch prediction, and caching are very strong.

If you do trivial operations across arrays of linear memory, current Intel processors are orders of magnitude faster than with naive serial programming. Lots of C code that looks fine can be sped up 12x by rearranging memory access, and 50x using SIMD.
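
To make that concrete, here's a toy sketch (the 12x/50x figures above are generalizations, not measurements of this snippet): summing a matrix column-by-column strides through memory and thrashes the cache, while the row-by-row version touches memory sequentially, so prefetch works and the compiler can vectorize the inner loop (the float reduction needs something like -O3 -ffast-math to reassociate).

    #include <stddef.h>

    /* Strided access: jumps n doubles at a time, so for large n nearly
       every load is a cache miss. */
    double sum_colwise(const double *a, size_t n) {
        double s = 0.0;
        for (size_t col = 0; col < n; col++)
            for (size_t row = 0; row < n; row++)
                s += a[row * n + col];
        return s;
    }

    /* Sequential access: same arithmetic, but memory is touched in
       order, so hardware prefetch helps and the inner loop can be
       auto-vectorized with SIMD. */
    double sum_rowwise(const double *a, size_t n) {
        double s = 0.0;
        for (size_t row = 0; row < n; row++)
            for (size_t col = 0; col < n; col++)
                s += a[row * n + col];
        return s;
    }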


Sorry, I forgot to mention which comment to read: http://tek.io/1fwoxZH

I was referring to this part:

> the real-world MFLOPS of the C90, working on data sets too large for a typical PC's small cache, works out to roughly 8.6 GFLOPS, while the Intel Core i7 2600 will achieve only about 1 GFLOPS sustained on problems out of cache.


It isn't a fair comparison though, since a supercomputer with that throughput isn't working on incoherent serial workloads. An i7 Sandy Bridge can do at the very least 6 GFLOPS per core at 3 GHz in my experience.


Sure, it's a specific case; the point was to demonstrate the impact of bandwidth and not to forget it amid today's very high xFLOPS figures.


It doesn't demonstrate anything, it is comparing two completely different things. Bandwidth is not a performance limiter in either of these cases.


> Lots of C code that looks fine can be sped up by 12x by rearranging memory access, and by 50x using SIMD.

Any recommended papers on this topic?


Computer Systems: A Programmer's Perspective (3rd ed)

http://www.amazon.com/Computer-Systems-Programmers-Perspecti...



The article lacks any data? I've found some numbers here: https://books.google.de/books?id=BY_1BwAAQBAJ&lpg=PA2&ots=Y8...

That's still difficult to achieve with a desktop.


Yep the article lacks that data. The comments section is where it's at.


Thanks, I completely forgot to make that explicit.


The comment that I think you're referencing in that article is talking about a Cray X1, which was only two years old when the comparison was done.


Because companies offering better hardware with lower MHz and GHz lost huge market share. Companies, especially hardware vendors, should always give the market what it wants if they want to stay in business. I faced an uphill battle over a decade ago trying to argue why Alpha's RISC flexibility and PALcode led to huge performance & safety increases when utilized properly. I demonstrated custom benchmarks. They went with the Intel stuff like most of the market and I suffer through maintaining such messes to this day.


Meh. For > 15 years I have been architecting I/O-intensive server products (web caches, WAN optimization, filesystems) and, unless it is crashing, we mostly don't worry about the operating system. Our performance issues are of our own doing, and are dealt with in user space, as with nearly everything we do.

I thought this paper was kind of silly when it came out, and still think it is.

Also, while I'm piling on Ousterhout, Tcl has got to be the most overrated winner of the ACM system software award.


This is a Usenix paper from 1990. The past is a different country...

Tcl plus the Welch book is pretty good. You get purely event driven programs. I've replaced three or four Big Piles O' Java (that didn't work) with Tcl scripts and in short amounts of time. Nothing particularly wrong with Java, but these Piles still existed...

The Welch book is the critical component...

Its nearest neighbor is Python, which is also just fine except that package management then becomes a real chore. It's also easy to forget that Scotty (the SNMP extensions to Tcl) was the real winner back in the first dotcom era.

It's probably pretty lousy for web caches, WAN optimization, and filesystems. It's very good at constructing workstation-based programs to run against embedded systems.


> Also, while I'm piling on Ousterhout, Tcl has got to be the most overrated winner of the ACM system software award.

Sprite OS and the log-structured file system were his more interesting achievements, anyway.


I'm running DOS 6.22 on an SSD on an i7 (it can only use one core, but 3-4 GHz is pretty fast).

It runs pretty fast IMO. But that's not really the issue. OSes do 1000x more than they ever did in the past, and if you go boot up some mid-'80s or '90s machine, you'll see how much faster OSes actually are.

I play around on my old Pentium 133 from time to time; I always forget how long it took to load programs (at the time, around 1996, it seemed blazing fast).


It's funny: I was in a recent discussion in a BBS group on Facebook, and the question came up regarding running older DOS programs. IMHO emulation is plenty fast, but some consistently balked at the idea because it was too slow... I remember running some of those programs for only a single user on 386/486-class hardware, and it's leaps and bounds faster today with a dozen users and a bunch of other stuff running on a server.

Now, for the past 5-6 years computers really haven't gotten much faster (CPU- or memory-wise)... they've gotten much lower power, and with SSDs becoming more commonplace, that helps a lot. It will be interesting to see how the next few years shape up.

It would be nice to see a ground-up OS effort... I think ChromeOS is a decent attempt, but I think it could be better with a slightly cleaner baseline.


Because Linus Torvalds actually believes that userspace processes have "infinite stack".

I swear I'm not making this up.


No, you're not making this up. You just didn't understand a word you read, or how it was meant:

The original Torvalds quote in context:

>"When I started the whole design, started doing programming in user space, which I had not done for 15 years, it was like, wow, this is so easy. I don't need to worry about all these things, I have infinite stack, malloc just works. But in the kernel space, you have to worry about locking, you have to worry about security, you have to worry about the hardware.


I often use "Jonathan Swift" as a pseudonym; more recently I have adopted "Durante degli Alighieri".



Classic Mac OS applications had 32 kB of stack. Desk Accessories had 8. Much of my effort in fixing Working Software's QuickLetter went towards reducing stack consumption. My predecessor had apparently not read the fine manual.

We spent a lot of money to purchase MacCLint, but it overflowed the stack by megabytes and so corrupted the heap. The author wasn't receptive to my advice to remove all its recursion; he asserted that recursion is necessary for compiler-like programs.

Recursive algorithms do not require recursive implementations. The procedure for converting recursion to iteration is documented by Robert Sedgewick in "Algorithms".

Really you do have a stack, but it's not the runtime stack.
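
A minimal sketch of that transformation (a hypothetical tree walk, nothing to do with MacCLint's actual code): the recursion's pending work moves into an explicitly heap-allocated stack, so depth costs heap bytes instead of runtime-stack frames.

    #include <stdlib.h>

    typedef struct Node { struct Node *left, *right; } Node;

    /* Recursive form: each level of the tree costs a full stack frame. */
    void visit_recursive(Node *n, void (*f)(Node *)) {
        if (!n) return;
        f(n);
        visit_recursive(n->left, f);
        visit_recursive(n->right, f);
    }

    /* Iterative form: pending nodes live in a heap array that grows as
       needed, so deep trees no longer threaten the 32 kB runtime stack. */
    void visit_iterative(Node *root, void (*f)(Node *)) {
        size_t cap = 64, top = 0;
        Node **stack = malloc(cap * sizeof *stack);
        if (stack == NULL || root == NULL) { free(stack); return; }
        stack[top++] = root;
        while (top > 0) {
            Node *n = stack[--top];
            f(n);
            if (top + 2 > cap) {                    /* make room for two children */
                Node **grown = realloc(stack, 2 * cap * sizeof *stack);
                if (grown == NULL) { free(stack); return; }
                stack = grown;
                cap *= 2;
            }
            if (n->right) stack[top++] = n->right;  /* right pushed first...      */
            if (n->left)  stack[top++] = n->left;   /* ...so left is visited next */
        }
        free(stack);
    }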


It comes 'a bit' late, but couldn't you increase stack size by calling SetApplLimit before calling MaxApplZone?

And of course, initially, classic Mac OS didn't have 32 kB of stack on original hardware. It didn't even have 32 kB available to applications.
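
If I remember the Toolbox pattern right (treat this as a sketch from memory of Inside Macintosh, not verified code), it went something like this in the application's startup sequence:

    /* Sketch from memory of the classic Memory Manager calls (needs the
       period <Memory.h> header): lower the application heap limit so the
       space above it is left to the stack, then grow the heap to that
       limit. */
    #define kExtraStackBytes (32L * 1024L)

    SetApplLimit((Ptr)((long)GetApplLimit() - kExtraStackBytes));
    MaxApplZone();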


I started at WSI about seven months before System 7 was introduced. At the time the documented limit was 32 kB. Yes, applications could call SetApplLimit, but Desk Accessories could not; QuickLetter's stack was on top of the application's stack. We had no control over that.

I expect the System always ensured that DAs could have 8 kB of stack on top of whatever the application had; that is, app+DA defaulted to 40 kB.

MacCLint was such a skanky program I don't think it would have done it a whole lot of good to increase the stack.


Aha! I didn't get your problems were with a desk accessory.


In a compiler recursion is typically used for walking Abstract Syntax Trees, and the recursion depth is usually similar to the scope nesting depth. I find it hard to believe that this could add up to megabytes for normal code.


A common problem in QuickLetter before I fixed it was that it had very large local (stack) variables - big arrays and structs. My first fix was to malloc() anything larger than a couple dozen bytes.
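
Schematically (a made-up example, not QuickLetter's actual code):

    #include <stdlib.h>

    /* Before: one frame eats 8 KB of the 32 kB stack. */
    void format_page_stack(void) {
        char buf[8192];
        buf[0] = '\0';
        /* ... build the page in buf ... */
    }

    /* After: the buffer lives in the heap and the frame costs a pointer. */
    void format_page_heap(void) {
        char *buf = malloc(8192);
        if (buf == NULL) return;
        buf[0] = '\0';
        /* ... build the page in buf ... */
        free(buf);
    }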

MacCLint was a piece of work; I would be unsurprised were it to have local variables that were megabytes in size.

That is, I don't know the depth of recursion, but I do know that it blew out the 32 kB limit by megabytes.

MacCLint is also the reason why I always use explicit returns in C-like languages. It would crash - which on Mac OS System 7 would drop my box into MacsBug - if I just fell off the end of a void function.

Dave Johnson, the owner of Working Software, had already advised me to use explicit returns because he once used a compiler that generated incorrect machine code if he didn't do so.

Even without buggy tools explicit returns are handy for setting breakpoints. Some debuggers enable you to break just before return even if you fall off the end but not all do.
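
A trivial sketch of what I mean (hypothetical function, obviously):

    /* The explicit return is a statement the debugger can break on right
       before the function exits, instead of "falling off the end". */
    void flush_output(void) {
        /* ... do the work ... */
        return;   /* breakpoint-friendly exit point; also kept MacCLint happy */
    }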


> MacCLint is also the reason why I always use explicit returns in C-like languages. It would crash - which on Mac OS System 7 would drop my box into MacsBug - if I just fell off the end of a void function.

Isn't falling off a void function the same as returning from it?


Yes.



