How much your computer can do in a second (computers-are-fast.github.io)
602 points by srirangr 30 days ago | 234 comments



Alternatively, this could be titled "do you know how much your computer could do in a second but isn't because of bad design choices, overengineered bloated systems, and dogmatic adherence to the 'premature optimisation' myth?"

Computers are fast, but not if all that speed is wasted.

A recent related article: https://news.ycombinator.com/item?id=13940014


It's wasted only if it's not traded for something else. But it is.

A lot of systems would simply not exist if we had waited for people to do them properly, because there is a limited pool of very skilled experts and the demand for IT far exceeds our ability to supply it. Plus, writing good code takes a lot of time and resources, but our society now changes so fast that it may very well be rewritten next year.

Hence, we are trading computing power to compensate for our limited human resources.

E.g.: wondering why you see Electron apps everywhere now? Because until now making a beautiful, powerful and modern app with a portable GUI was something only a few people would be able to do. Now any web dev can do it, and they are able to solve problems that weren't solved before because nobody was available to do it. At the price of performance, memory usage, and the use of the ugliest programming language we ever made popular.

It's the same deal as before. When C arrived, WordPerfect died because they stuck to assembly. For them, C was wasting resources.

When Java arrived, expert systems moved to it because it was more productive than writing them in C. We didn't need them to be fast, but the companies wanted their tools produced quickly.

Then when the Web arrived, people were laughing about the quality of PHP (initially a Perl hack!). So much bad code, so many security failures, just a terrible, terrible stack of programming debt. But it created the web we have today because it was easy to use, and suddenly a lot of people could just write their own web forum. Hell, I learned my job with EasyPhp.exe way before becoming a Python expert.

Nothing here is wasted. It's just implicitly and involuntarily invested.


> Because until now making a beautiful, powerful and modern app with a portable GUI was something only a few people would be able to do.

Except it's none of these things: Electron apps suffer the same presentation-before-content problems as most of the web, they inherit all of the state bugs of web apps, and they're painfully slow despite usually just being menus of nested lists and text boxes. It's passing a burden from developers to users, and given the huge asymmetry that typically exists between those two groups (at least 3 orders of magnitude, and often 6 or more), that's an enormous inefficiency.

I'm all for selecting tools and platforms to better optimize developer productivity, but it's not clear to me the web platform has actually accomplished that for all but a few types of applications, and pushing it to the desktop is resulting in cumbersome, unresponsive, needlessly fragmented interfaces.

Put simply: Amazon Music Desktop has untold development investment and a complete visual redesign every 6 months, but WinAmp circa 2001 is somehow a thoroughly better way of searching and playing music.


Let's be clear: I dislike slow apps, I think the size of current behemoth web pages is a monstrosity, and every time I start an Electron app (minus the excellent vscode), I scream in my head.

Yet.

Most Electron apps I've tried have a result/effort ratio far better than any other solution for the dev.


This cannot be overstated. It really can't. The programmer in me will always be drawn to small, lightweight code that gets the job done while using minimal resources. But that programmer is an algorithmic designer, which doesn't often need to deal with the realities of UI, graphics card drivers, or cross-platform compatibility.

Electron is popular precisely because it enables developers of nearly any skill level to solve real world problems without investing the additional time to become a systems engineer. And the truth of the matter is that there are many more real world problems that need solving. Computers are cheap, humans are expensive.

When performance matters, by all means, hire the expensive programmers and code it from scratch. Do it "right." But doing it at all is often the only thing that needs doing, and for those problems, Electron is often, begrudgingly, the best choice.


> which doesn't often need to deal with the realities of UI, graphics card drivers, or cross-platform compatibility.

What makes you think desktop programmers have to deal with graphics drivers? How is Electron any better than desktop toolsets at dealing with the realities of UI work?


For the graphics question, it really depends on how you define a desktop programmer. Most video game developers need to work with graphics at some point, and a great deal of professional creative software uses GPU acceleration. Whether you deal with the drivers directly or not, you're going through them, and end up bitten by weird bugs and platform inconsistencies in testing. (Why are my drop shadows pink? Oh, this is a known issue with Intel HD 4000 on OSX... that sort of thing.) To be fair, Electron doesn't solve this problem, and neither does any toolkit really; it just shifts the responsibility to the framework rather than the application developer.

As for electron being better than existing toolkits as far as UI development? I don't think that's a fair assessment in either direction. It's a right tool for the job thing.

Electron's real strength is that it's web-based, so it's familiar to existing web developers. In that respect, it's a bit easier to use if you already know web technologies. But, it's not terribly good at native integration (Electron apps tend to feel like web apps because they are), and resource consumption is abysmal because it's going through the browser layer, and we all know Chromium isn't light on RAM usage. Other toolkits are far better at handling native platform integration and can be much more lightweight.

Again here, "better" is a really tough thing to define, as every team and project is going to have different priorities.


Honestly, I can accept the resource consumption in the long run. But the use of JS everywhere is the really hard part for me.


WebAssembly and languages compiled to JS make it immaterial.


It has the potential to do so, but I'm waiting to see it for real before calling it. Right now it's just not the case.


I'm not sure how. WebAssembly is more limited than JavaScript (e.g. no access to DOM), and also more limited than native code (e.g. much more limited threading story).


I wonder how long after the WebAssembly DOM API becomes stable the first Rust-based "server and client written in one language" frameworks start popping up.


I thought they were working on adding DOM stuff too?


They are. And they are also working on having full multi-threading (thanks to SharedArrayBuffer).


> Most Electron apps I've tried have a result/effort ratio far better than any other solution for the dev.

This is more about the experience of the developer. Anyone familiar with GTK or Qt (possibly others) should be able to bang out a UI pretty quickly; in many cases it will be much simpler than HTML+JS+CSS. With Visual Studio you could slap together a Win32 app pretty quickly.

But for some reason we decided we had to coddle developers who only know javascript.


Cross-platform was mentioned upstream.

Microsoft finally seems to be learning now, but I think it will still be a few years before people start creating cross-platform desktop software in Visual Studio, if ever.


I think their intention is to make VS cross-platform but not to provide a cross-platform UI stack themselves. Even if they made any of their UI stacks cross-platform, it wouldn't be particularly compelling compared to something like Qt (or GTK if you don't care about cross-platform).


Why do we care so much about the dev? There are way more people who use a typical program than there are people who develop it, and these people often use the program more frequently than the developers make changes to it.

For example, if a feature is used every day for a year by 10,000 people, speeding it up by 200ms is worth over a month of developer time (200ms for 10,000 people over a year is 203 hours of time wasted, which is about 25 8-hour days, or five 40-hour weeks).
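Spelling that arithmetic out in a few lines of Python, in case anyone wants to check the numbers (nothing here beyond the figures already quoted above):

    # 200ms saved, 10,000 people, once a day for a year
    seconds_saved = 0.200 * 10_000 * 365
    hours = seconds_saved / 3600          # ~202.8 hours
    print(hours, hours / 8, hours / 40)   # ~203 h, ~25 eight-hour days, ~5 forty-hour weeks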


This equation starts with 0 users and grows pretty slowly at first. It only flips once you have a critical mass of users, and by then it's "too late" to preventatively fix the architecture to bias to performance.

There are also lots and lots of products that take a "npm+glue" approach - they ship an initial version that leverages everything possible, and then run aground trying to make it to version 2, because the architecture is so reliant on the dependencies that it can neither be maintained nor support any additional features. This is a little bit different from the "Electron problem" in that it identifies a showstopper limitation to using developer-side easings, rather than a long-term paper cut issue.


> This equation starts with 0 users and grows pretty slowly at first. It only flips once you have a critical mass of users, and by then it's "too late" to preventatively fix the architecture to bias to performance.

Sure, but everyone's always planning for the future when they choose databases, backend architecture, etc. Why not plan for the future in user experience, too?


Keeping in mind the things I've developed are niche and not large; a dozen users at one time is about as far as I've got...

With databases, I use SQLite for quick prototyping and single user apps, and Postgres for the multi-user ones. If I needed to scale, I'd optimise my queries better and consult an expert.

With backend architecture, I'm most experienced with Python, Django & Twisted. I know these are used in web-scale apps, so if I needed to scale, I'd once again optimise best I could and consult an expert.

Now, UI is a different kind of scaling problem, because ideally it's going to look and interact the same for 10 users as it does for a million. You're scaling based on the features you add as opposed to the number of users. How do you scale that?

I'd say it's just as is happening now; you rewrite the UI as you scale, because if your revenue grows with users, you can afford to. On mobile devices: web page -> mobile app with mainly HTML views -> full native UI. On desktop: background server + webpage (think Plex) -> Electron/web-tech-based app -> full native app.

The problem ends up being when your revenue is independent of the number of users and you don't have the resources to do the rewrites. Or, as is likely the case with apps such as Atom, a big part of your proposition is the availability of plugins from third parties.


> For example, if a feature is used every day for a year by 10,000 people, speeding it up by 200ms is worth over a month of developer time (200ms for 10,000 people over a year is 203 hours of time wasted, which is about 25 8-hour days, or five 40-hour weeks).

Is your company ready to pay $10-20k to its software supplier to save 200ms per person? Rationally it may be a good idea, but I sincerely doubt anybody would pay for that.


You can't reduce it to those simple variables, though. What if the speedup is enough to remove multiple AWS servers, saving potentially thousands of dollars per year moving forward? What if those milliseconds are the ones that keep more users on your site, like in Amazon's famous search bar story?


Who is funding the developer? Why should they spend $5k (plus the opportunity cost of not using that dev's time on more fruitful pursuits) on 2 weeks of micro-optimization so that their 10,000 users will each experience a speedup so small they'll never even notice (and certainly never pay extra for)? How do you justify that expense?


200ms delay is easily noticeable. Even 50ms is noticeable. And latency is cumulative, and your user's hardware already adds some, so you never know exactly how much latency budget you have before the whole system feels unpleasant.

I think 25ms is a reasonable goal, and even 10ms wouldn't be a waste of effort. All interactive software should be written like a fast-paced action game.


I'm making a more abstract argument that doesn't have anything to do with money. The time people spend using a piece of software shouldn't be worth less than the time developers spend writing it.


Okay, but under those conditions your argument can only be applied to a more abstract reality that isn't the one we actually live our lives in.


The time developers spend writing a piece of software is worth what people are willing to pay for it. It has everything to do with money - if the person using the software isn't willing to spend more for 200ms less in startup latency, it probably won't happen.


> How do you justify that expense?

It makes it harder for your competitors to offer a better product.


Pretty obviously, not all time is equal in the eyes of the company that owns the software. That month of dev time costs on the order of $10k-20k all in, whereas the users' time is free from their perspective.


Only if those 10,000 people would do something else in that 200ms; and if they multitask, it does not matter. Except in some HFT or other specific use cases, I don't think a 200ms fix will matter even for 100,000 users.


Just like GUI apps in general, then: Mac OS and Windows have many bugs, are painfully slow compared to no GUI, and take longer to write.


> until now making a beautiful, powerful and modern app with a portable GUI was something only a few people would be able to do. Now any web dev can do it

Anybody could do it before, including web developers, using, say, Qt with C++ or Python.

Or are you suggesting that web developers are not capable of learning?


No, I'm suggesting the result/effort ratio between a Qt app and an Electron app is strongly in favor of Electron for web devs.

Actually I'm not suggesting. I'm witnessing.


I mostly do back end code. But when I look at Electron and compare it to something like Qt Quick, Qt Quick looks much more pleasant and easy to use to me.


Electron targets front end devs and people with knowledge of JS + HTML + CSS.


So, your claim is that using electron has got nothing to do with it being easier or simpler. It's just running off inertia?


HTML + CSS is right now the easiest combo to create a really custom UI. Try to do something very custom with Qt or Wx that is portable and you'll notice it's way, way more work. Combine that with the fact that web devs already know HTML + CSS and the related tooling, and you've got the answer.

Electron apps are slow, eat a lot of resources, weigh many megabytes, and don't have a native UI integrated with the OS. You really need a SUPER good reason to use it. And the reason is that it's familiar, easier, and more productive.


Is it? Looks pretty easy to me: http://doc.qt.io/qt-5/qtquick-canvas-example.html

You get a full scene graph to play with. What exactly do you think is missing? http://doc.qt.io/qt-5/qtquick-visualcanvas-scenegraph.html


the fact that you have to use a scene graph?


What would you use in HTML for a completely custom component? As far as I understand, you'd have to build your own scene graph on top of a canvas.

I mean, if you want standard or semi-standard components, Qt has far more out of the box components than HTML does, but the initial complaint was about flexibility for doing something 'very custom'.


Qt Quick is a really adorable way to put programming logic together with an improved UI.


> E.g.: wondering why you see Electron apps everywhere now? Because until now making a beautiful, powerful and modern app with a portable GUI was something only a few people would be able to do.

You should really take a look at Python and Qt. Beautiful, easily maintained and well-performing cross-platform apps have been a possibility with this pair for the better part of a decade. Anki is my favorite example, but there are plenty of others.
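To give a sense of the effort involved, here is roughly what a trivial cross-platform Qt window looks like in Python - a minimal sketch assuming PyQt5 is installed (PySide is nearly identical), and not code from Anki or any real app:

    import sys
    from PyQt5.QtWidgets import (QApplication, QLabel, QPushButton,
                                 QVBoxLayout, QWidget)

    app = QApplication(sys.argv)

    # One window, one label, one button; the same code runs on
    # Windows, macOS and Linux.
    window = QWidget()
    window.setWindowTitle("Cross-platform Qt app")
    layout = QVBoxLayout(window)
    label = QLabel("Hello from Qt")
    button = QPushButton("Click me")
    button.clicked.connect(lambda: label.setText("Clicked"))
    layout.addWidget(label)
    layout.addWidget(button)

    window.show()
    sys.exit(app.exec_())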


I'm a Python expert, I know. You are missing the point: Electron apps target front end devs. There are more of them, a lot more, than any other type of dev, and they don't want to learn a new tech. Hell, they even ported JS, an awful language, to the server to avoid using something else.

Plus, saying Qt is easy to use to them is like saying Java is productive to a Python dev. You are not on the same scale.


Front end web devs.

There are still some of us that do actual native frontends.


>A lot of systems would simply not exist if we had waited for people to do them properly

And we'd be better off. We'd be better off without Atom and VS Code.


So by making a complex task much much easier, we have traded efficiency for larger numbers of wasteful systems.

The Crux of industrialization?

The future of mankind? Make task easy. Get more people to do task.


> E.g.: wondering why you see Electron apps everywhere now? Because until now making a beautiful, powerful and modern app with a portable GUI was something only a few people would be able to do.

Take portable out and it gets so much easier.


take out beautiful and it gets so much easier too.

I guess that gives us:

* portable

* beautiful

* fast

(choose 2)


That was so acid. ;-)

But besides my pun, "fast" code is a moving target relative to the end result (see how performance considerations about Java running in a VM became moot because of hardware improvements, notably on the CPU; see how today multithreading is making a comeback to rescue CPU-bound applications from stalled improvements).

Fast is also subjective, it's cultural in that it's a trained perception. I'm annoyed if Excel takes more than a couple seconds to launch, but most people are used to 10s at least, because that's our daily experience. Likewise, I remember a time when having pixelated thumbnails until images loaded completely over dozens of seconds was the best browsing experience; nowadays on fiber/LTE we expect a page to appear rather instantly.

I guess what subconsciously annoys the hell out of "those who like as fast as possible applications" with Electron is that we know it will yet again make fast a more distant concern overall, at least until hardware makes the Electron performance penalty moot, as it did for the Java VM.

I'm pretty convinced 'fast' is indeed a clear third priority for most consumers though, seeing how it doesn't seem to bother most people in UX research (less so than many other concerns), or how Apple rides ever more financial success while releasing ever slower hardware and software (compared to the whole market, and sometimes astonishingly even to their own previous models). Facebook is bloated as hell too these days, but it's as popular as ever. Spotify is sluggish too, even Twitter. People just don't seem to care; the looks and simplicity of use seem more determinant.


"Everything is for the best in the best of all possible worlds." --Candide, Voltaire


The market is wasteful. Lots of things are subpar because competition requires ad hoc solutions pushed to market, which then become standard, and so on and so forth. You can blame the game or not. At least if there were some acknowledgement of that process and a little cleanup time to spread good ideas and good fixes ...

Personally I always feel weird booting up old boxes (say old = Pentium 2) and realizing how much the whole platform could do. It also felt quite sufficient in terms of UX and presentation.

That's odd because everybody (I assume) feels how quickly our newest gear gets slow, and how a new generation improves speed. But when I boot up a very old system I don't feel slowed down much. It's just of a lesser resolution (hi-res images, sound, videos), which for my needs doesn't really matter.


An eye-opening experience is running an IDE like Visual Studio 6 or a really old version of Photoshop on a modern machine. It starts instantly, compile times are a fraction of what we're used to. The interface is totally responsive. It's remarkable how much better software feels when everything is instant.


It's remarkable how much better software feels when the software was developed on a slow machine.

Unfortunately devs invest in the fastest equipment, so they don't experience how their code runs for the average end user who doesn't upgrade every couple of years.

It can be useful to test your code in a VM that is deliberately slowed down, so you can get a feel for the user experience on a slower machine. Then you'll know what parts need to be optimized.


> Unfortunately devs invest in the fastest equipment, so they don't experience how their code runs for the average end user who doesn't upgrade every couple of years.

As someone who doesn't spend much on computer upgrades, I notice this everywhere. Especially when it comes to web development, where everything is optimized for chrome on mac. Don't have a laptop made in the last year? Enjoy a janky-scrolling slow-loading hard-to-read mess.


Also, the web devs have no idea how slow the connections in even the western world can be. At least I think so, if they had to experience a 4 Mbit/s connection more often they wouldn't push for Angular and such.


4Mbit/s is slow nowadays?


Unfortunately devs aren't given any time to optimise it to run on slow machines. It is not a priority.


You don't necessarily need to specifically optimize for it: just do your development on a slower machine. If you're using a compiled language, you can offload compilation with something like distcc for productivity's sake, but all your testing should be on the slow machine.

It'll be pretty clear if some change you've made is going to impact people on older hardware, because you'll notice the moment you make the change. And if it's unbearably slow for you, it will be for a good chunk of your users as well.


Developer tools like Visual Studio are somewhat ironic in that the developers who create inefficient software are essentially seeing this reflected indirectly in their tools also becoming inefficient. A perverse version of "do unto others as you would have them do unto you"?


I used to think that of the Java world. It wasn't above some critical mass / threshold of simplicity, and everybody that liked Java lacked that simplicity, and every solution they tried to add to Java just added more cruft (XML, J2EE beans, IDEs).


Similarly, I run a very barebones Openbox Arch Linux install on my ancient 2Ghz Turion 64 laptop and I recently booted Windows XP on it and I was blown away by how fluid & responsive the entire system was compared to my much newer Linux-based OS.


Also, surprisingly, XP could handle 64MB decently (even for a nazi like me)


My memory is very blurred, but I thought 32MB (maybe less) was still pretty standard when XP came out.


>An eye-opening experience is running an IDE like Visual Studio 6 or a really old version of Photoshop on a modern machine. It starts instantly, compile times are a fraction of what we're used to. The interface is totally responsive.

If you gave that to a Product Manager candidate (as part of an interview) and pretended it's your current product (assume they didn't know it already), asked them to play around with it for ten minutes and as their task come up with some thoughts regarding what they would add or improve, their eyes would pop open and they would say, "WOWWW!! THIS LOOKS LIKE IT ISN'T DOING ANYTHING -- THERE IS A LOT OF ROOM FOR IMPROVEMENT HERE."

And in just fifteen minutes, they would gleefully tell you exactly how they would break it. (By adding crap to make it slow.)

(Note: this could be run as a real experiment, but the results above are my personal prediction - maybe it wouldn't happen as I predict.)

I really do think product managers are the people to blame here. To them, fast = broken or not doing anything. It's not done (to them) until it's (to us) broken.


A lot of places I've worked had government incentives to do this too. Adding features is "R&D" and attracts less payroll tax than fixing bugs or optimizing does. Other companies structure themselves this way: a feature is something you can charge to another department, an optimization isn't.


I've worked at one place in particular that fell in love with what I call FDD, Finance Driven Development. There's few things more demoralizing than watching systems you helped design fall apart because bug fixes aren't capitalizable.


Or start with the current version, then demo the ancient version, and claim you've "streamlined and tuned" the program?


What I was getting at is that the suggestions to add crap would come from the product manager candidate themselves - thereby "proving" (if the results match my prediction) that the requirement for slow unresponsive interfaces come directly from the PM's, who would add it if it weren't already slow.

Like, I want the experiment to test if a product isn't already slow, would a product manager actively make it slow? (I think, "yes".)


When software is too slow it hurts sales. When software lacks features, that hurts sales too. Faster doesn't make for compelling marketing (anybody can claim their software is fast), but you can get people to upgrade for new features. So features are added with every release, and performance improvements are made only up to the point where it's "good enough" again. You don't need a product manager conspiracy to explain why software is bloated and slow.


Okay, so let's update the experiment protocol.

So the hypothesis is that if it's too fast it's not considered done by a product manager. So as an updated test protocol, let us repeat with two groups of Product Managers, and this time they must evaluate the software, which they have never used before. Some candidates get the older version on the metal, and others get the older version slowed down 10x for example by running in a VM on the test machine, or simply being shown to them running on older hardware but otherwise an identical setup.

My hypothesis is that the exact same software running slowly would be objectively judged as better by Product Managers than the same software running instantly. That instant execution is considered an explicit negative point - completely independent of features. So that if the old software running slowly gets a 7.5/10 rating from the Product Managers asked to evaluate it, it would get only a 5/10 rating if it runs instantly.

This controls for the features which you mention. So I've suggested a test protocol, of course I can't verify it just at this moment but at least I am being scientific in my suggestion that Product Managers simply rate fast software worse. If you have an alternative protocol to test this, one that is easier to run, I'd like to hear it.


If slowness is the goal and if product managers insist on slowness, then you'd expect a lot of sleep() calls in commercial software (which has product managers) and no sleep() calls in open source software. This doesn't happen.

What we see is that commercial software has better documentation, gets more long term maintenance and support, and supports more cumbersome enterprise features. Exactly what you'd expect when you have salespeople and product managers who talk to customers.

Your hypothesis is falsifiable, which is a necessary but not a sufficient condition for it to be scientific. For it to be scientific you would also have to provide evidence that your hypothesis is more plausible than the alternatives, before you do any experiments. After all, it's easy to come up with thousands of silly hypotheses, and by happenstance some of them will get a positive correlation when tested. So every hypothesis has to undergo serious scrutiny.


Actually, some software like facial recognition or OCR, does have sleep calls inserted because people don't believe it can be done as fast as it can be done. Thus, whenever I deposit checks at the ATM, I have to stand there and watch as some entirely unnecessary animation pretends to scan my check for ten seconds before I can have my card back.


Yes, the same applies to save buttons. If saving takes less than 200ms people are not confident the save really happened. There are also UI animations that deliberately slow things down to prevent the user from getting disoriented. This doesn't prove that product managers want software to be slow though.


Agreed. I think PMs have their own motivations, but ruining things for others is not explicitly part of their grand plan.


I don't think it's fair of you to switch to the language "ruining things for others" - of course they don't want to explicitly "ruin things".

If you had said, "I don't think making sure that things don't happen instantly is explicitly part of their grand plan" then I would dispute that statement. (Yes, making sure things don't happen instantly is explicitly/implicitly part of their grand plan.)

They want users to have an "experience" not just click a button or press a key and instantly see the result. It is very explicitly part of their plan, and if the algorithms don't slow things down enough, they actually request animations or loading screens that do so.

Some of it may be implicit, rather than explicit. For example, a front-end javascript framework might actually produce pixel for pixel the exact same result as a simple HTML file generated more quickly on the back-end might do, but could still be a preferable "experience".

All this could be tested explicitly. For example two groups of PM's could evaluate web sites, where one included Working... screens where a javascript framework slowly renders something client-side, and the other could have pixel-for-pixel the same thing served from the server, using less bandwidth than it took to even transfer the javascript files over, so that the result is served much, much faster.

I hereby formally predict that some PM's will prefer the slower version for the exact and only reason that it is a slower user experience, and that this could be verified with a double-blind and carefully controlled experiment. I've put my claim out there, and it is easily falsifiable as well as has explanatory power in terms of the software we actually see.

Very simply: some PM's prefer a slower user experience. (I claim.)


I think that's the time to scan both sides of the check (which is not done with a photo scanner, but with a sheet fed scanner, so it's not instantaneous), upload the image (potentially over a very slow network link), add it to the OCR queue on the server, do the OCR, save the image, and return the results back to the ATM machine.

I highly doubt the ATM maker put in sleep calls to intentionally slow it down -- no one is going to be surprised if it takes 1 second instead of 10 seconds to scan a check.

The bank could optimize for time, but it's not their time, it's their customer's time and they are already saving so much money over a teller transaction, they likely don't care about saving customers a few seconds.


This strikes me as some kind of terrible vicious cycle: tradeoffs against performance (towards build speed) were made > users got used to slower software > users unconsciously prefer/select for slower software.

When you do get to use it, performant software is a really great experience, so how do we train users to appreciate this?


Sometimes companies do decide to make speed a priority, and earn literally hundreds of billions of dollars off it. One of Google's key features over Yahoo+Altavista was that it loaded instantly; they cut out all the crap from the homepage so you could get yourself to a result in under a second.

Ditto BitTorrent clients like uTorrent and Transmission; the first generation of BitTorrent clients were written in Python or Java, and felt like it. When users started complaining about how they would lock up your machine, somebody actually bothered to write one in highly-optimized native Win32 C, with no frameworks.

You have to understand what the customer is optimizing for though. Someone standing at an ATM is optimizing for confidence, not speed; they already took the time to drive to the ATM, so depositing their check in 1 second vs. 10 seconds doesn't matter when the whole trip takes them 10 minutes. Someone who's responsible for depositing 1000s of checks for a major business, using an office scanner, cares a lot. But I really doubt that industrial-strength check cashing apps put a sleep() statement when scanning the checks.


>If slowness is the goal and if product managers insist on slowness, then you'd expect a lot of sleep() calls in commercial software (which has product managers) and no sleep() calls in open source software. This doesn't happen.

Why do you say this doesn't happen, when it does, in fact, happen? An example of which would be this:

http://osxdaily.com/2015/01/06/make-the-window-resizing-anim...

This is a very explicit "sleep". It is far from the only one. Are you saying someone didn't think the slower animation was "better" than the faster animation - that the default time wasn't picked by a product manager?

>What we see is that commercial software has better documentation, gets more long term maintenance and support, and supports more cumbersome enterprise features. Exactly what you'd expect when you have salespeople and product managers who talk to customers.

We also see fancy loading graphics and I consider it extremely plausible that product managers consider fast/instant operation as room to add something to slow it down. I am not sure why you're debating this.

Regarding your open-source claim, this comment - https://news.ycombinator.com/item?id=13940014#13940458 - shows that fancy progress bar animations are considered more important, even in a commandline utility, and even under Linux. In other words, the fact that these progress bar animations slowed core functionality by 50% was -- in point of fact -- not an issue. Someone spent the time to create them, test them, commit them, and these were accepted: someone considered it objectively (or subjectively) better.

>Your hypothesis is falsifiable, which is a necessary but not a sufficient condition for it to be scientific. For it to be scientific you would also have to provide evidence that your hypothesis is more plausible than the alternatives, before you do any experiments.

You don't think surveying the actual decisions made in cases (such as the two examples just quoted, the window animation speed and the progress bar speed) where there was no functional difference except slowness is a plausible reason for supposing that the people in charge of these choices chose the slower version? I think it's quite plausible that, for whatever reason, slow software is preferred explicitly and implicitly.

>After all, it's easy to come up with thousands of silly hypotheses, and by happenstance some of them will get a positive correlation when tested. So every hypothesis has to undergo serious scrutiny.

I didn't spend much time looking through it in the original leak (on Wikileaks) but Bruce Schneier quoted on his blog Do's and Don'ts by an intelligence agency:

https://www.schneier.com/blog/archives/2017/03/the_cias_deve...

The parts I'd like to draw attention to:

- DO NOT perform operations that will cause the target computer to be unresponsive to the user (e.g. CPU spikes, screen flashes, screen "freezing", etc).

- DO make all reasonable efforts to minimize binary file size for all binaries that will be uploaded to a remote target (without the use of packers or compression). Ideal binary file sizes should be under 150KB for a fully featured tool. Rationale: Shortens overall "time on air" not only to get the tool on target, but to time to execute functionality and clean-up.

- DO NOT perform Disk I/O operations that will cause the system to become unresponsive to the user or alerting to a System Administrator.

Now I would just like to note that the other requirements are actually SUPER advanced - the binaries are expected to include complete encryption and decryption, as well as generating completely standard and compliant HTTP traffic that it is wrapped in, under another layer of functionality. There's a ton these binaries do.

You will very rarely see requirements similar to what I just quoted in most open source or commercial software, though I realize that when binaries involve active interaction with the user, they will necessarily be a bit more bloated/slower/etc.

But most software does not seem to care about being unresponsive, having huge binary sizes, causing CPU spikes, and generally being slow. I don't want you to misunderstand. I am quoting the above requirements for a piece of spyware to show you requirements that are emphatically NOT in any Product Manager's vocabulary. No Product Manager ever seems to tell their programmers to make a tiny, streamlined product that does not cause any unexpected delays or runs like molasses instead of instantly.

So I would say that on the whole my hypothesis that slower software is considered objectively better by Product Managers has high plausibility. The fact is, you are already suggesting that if the slower software is evaluated as better than the instant software, it will be "by happenstance."

Maybe so - but the fact that product managers choose slowness over speed every day, everywhere, speaks otherwise. Unfortunately I am not in a position to show this scientifically, and also social science experiments are particularly difficult to produce or replicate, with close to half being unreproducible:

http://www.nature.com/news/over-half-of-psychology-studies-f...

"Over half of psychology studies fail reproducibility test".

So if you are a product manager, I would just like you to compare the requirements you've ever given out, to the ones I've just quoted (which I consider great requirements.) Of course, I use some minimal programs that meet the description. But as a rule these are programs produced by a single programmer, with no PM in sight.

It was nice to talk to you but you seem to view the software landscape differently from how I do. I think it's easy to see that our perspectives are very different.


On further consideration, though it's common to talk about an upgrade treadmill or "what Andy giveth, Bill taketh away", software actually does get better over time, and we shouldn't look back so fondly on older versions. To pick just one example, software in the 90s didn't have inline spell checking as you type, a feature that came in handy for me just a few minutes ago while editing my other comment. Still, the question remains: why doesn't everything feel instant on our current systems, and what do we do about it?


> To pick just one example, software in the 90s didn't have inline spell checking

To pick on your example - it is interesting that you have to go back 20 years to find a feature beneficial for the everyman, and that you are still wrong: just about every word processor starting from MS Word 95 came with inline spell checkers. (And indeed I prefer the snappiness and identical feature set of my copy of Office 2003 to the ribbon-laden gigabyte-guzzling monstrosities of Office 201x, just as I prefer a 15-year-old Winamp to the latest iTunes player.)


> To pick just one example, software in the 90s didn't have inline spell checking as you type

What? Which 90s exactly did you live through? Because I seem to remember MS Word 6.0 having not just spell checking as you type, but also the dreaded autocorrection feature. That was in 1993 according to Wikipedia.


So is there any way to stop this treadmill, short of overthrowing capitalism itself?

Edit: If that seems like a non-sequitur, I said it because the parent comment started by saying that the market is wasteful.


I don't think capitalism is to blame, as this is endemic even in free software. The way to stop this is to stop repeating dogmatic myths about optimisation, consider efficiency/optimisation as an integrated part of the design process and not something to be applied afterwards, and focus on "user time is expensive" instead of "developer time is expensive" --- and sometimes, those users may also be developers.


> user time is expensive

that's not always true, especially from the point of view of the dev.


Yep. With forced auto-updates, software companies have implemented "software gavage," and the devs are just getting paid to push code down users' throats. It's much harder to entice users with something they might actually want.


There is certainly a case to be made for overthrowing capitalism, but I don't see why an anti-capitalist revolution would automatically result in a society with higher quality software.

Open source software suffers from much the same problems as commercial software. There is much about writing good software we haven't figured out how to do yet as an industry, and in the meantime we have to work with tools that are totally inadequate.


Why do you think our tools are inadequate? What would adequate tools be able to do that our current ones can't?


Way too much to cover in depth. In short, it's all terrible. We're mostly using tools from the 70s, and we're still to this day plagued by use-after-free and out-of-bounds errors, nullpointer exceptions, awful debuggers, race conditions, slow compilation times, bad primitives for multithreaded work, just to name a few. We've known for decades that we can do better and ways to do better have been studied in depth by academia.

That's just the basics. Then there are the futuristic things: why can't we move running processes from one physical machine to another? Where are the debuggers that allow you to go back and forward in time? Why can't we visualize the data structures in our program and trivially change the data layout to see how that affects performance?

When I look at the demo of the Xerox Star interface (1982) and compare that to where we are today, it's pretty clear to me we messed up big time somewhere along the road: https://www.youtube.com/watch?v=Cn4vC80Pv6Q


For people who have an exceptionally high tolerance for learning what people actually care about and why the shitty technological solutions won in the marketplace, this is probably an opportunity.

When I entered the workforce in 2005 after learning all about bygone languages (Common Lisp, Smalltalk, Dylan, Haskell, Erlang, Ocaml, etc.) in college, it was really, really painful going back to Java. Java had gotten so many fundamental language features wrong when better alternatives were well-known in research literature. Monitors instead of CSP, null types instead of Maybe, inner classes instead of closures, manifest typing instead of type inference, no first-class functions, mutability everywhere, conflation of modules & classes, not even a semblance of pattern-matching. But I could use it to write software that performed well, that could be deployed with tooling that everybody used, that had a wide base of available libraries, that had IDEs and other tooling available, and that other programmers could maintain.

It's 12 years later, and all of my recent projects have been in some combination of Swift, Rust, Kotlin, or ES6+React. And most of these languages have: promises & message-passing instead of explicit locks; Optional types or at least some form of null-safety; lexical closures; type inference; first-class functions; immutable-by-default; module systems that don't require that you shove everything into a class; and pattern-matching. But they also did the hard work of figuring out an interoperability story with some large existing body of code; of writing decent package managers and easy build systems; of making the language reasonably performant for most everyday tasks; and of (in Swift & Kotlin's case) providing full support within an IDE. All the language features that were common in research languages of 2005 but missing in industrial programming of the time are now very usable in the industrial programming of now.

I wouldn't be surprised to see a similar renaissance in tooling over the next decade, with exotic features like time-traveling debuggers, fully-inspectable systems, and migratable processes becoming standard in the next generation of IDEs.


> We're mostly using tools from the 70s

We're using the shittier tools from the 70s, because the better tools died off - which is sad, because we still didn't manage to recreate some of the features those tools offered. The success of UNIX and C seems to be caused by the same phenomenon that causes Electron to succeed and bloat to proliferate - what wins aren't good solutions, but those which get popular quickly.


> aren't good solutions, but those which get popular quickly.

But by the judgement of a majority, those tools are good! That's why they got popular. Any personal sense of aesthetics from you (or anyone) doesn't matter at all here, and most people who call tools that failed 'good' are only judging them by personal aesthetics.


I'm not talking aesthetics here; I'm talking about shitty engineering. Often legitimized by nonsense phrases like "worse is better" these days.


Within capitalism, products are made to maximise exchange value rather than use value. This is because it is more profitable to optimise for products to be exchanged rather than used. In fact, it can be detrimental to maximise use value, because a product that can be used for longer will mean that fewer people will have motivation to purchase another one any time soon.

I like the idea of overthrowing capitalism itself to resolve this.


This will sound pathetic, and it's partially because I don't have much of a life, and spend way too much time with electronics, but one of my biggest regrets is---

Upgrading.

I've always fallen for the increased security, and blah, blah, blah, but in the end a device I once used becomes slow, programs become useless, or my plethora of USB devices don't have the right updated firmware.

I have a Mac, and a MacBook Pro I'm thinking about wiping clean, and installing the original operating systems. It's just a pain, and yes--I know security is important.

I did finally learn on the iPad, but I'm still not completely sure I'm doing the right thing. I must have hit "tomorrow" for 9.3.5 three hundred times. I just don't want to make this device slower. I paid way too much years ago, and I hate putting things in my electronic graveyard - my closet.


I can't upvote your most recent comment because you've attracted too many haters too quickly and the comment is in purgatory unjustly. In fact, I think you've attracted some regular haters, though I can't prove it. I suspect someone doesn't like your anecdotal style, but it is just as valid as an unsupported generalisation.


To be fair, that article is discussing a small bug, not over engineering or dogma. The size of the deal people made over it was more wasteful than the CPU time this (now fixed) bug cost.

And FWIW, of all the problems that matter to me and my teams, I find premature optimization to be far, far more wasteful of money and human energy than wasted CPU cycles. There are definitely times to worry about performance, and I fancy myself a deeply performance oriented person, and yet I am and will continue to be proud of my dogmatic adherence to avoiding premature optimization, because in my experience it's not a myth at all. I have watched literally years and millions of dollars wasted multiple times by teams that decided to prematurely engineer high performance solutions to problems they thought they had but didn't actually. So I'm way more scared of premature optimization than I am of wasting cycles.


By definition an optimization being premature means it's not necessary in the present. For these situations it's like taking out a loan of technical debt. In the future you may need 10x the resources to fix it, but in many environments that's acceptable due to company growth, or, like you mentioned, spending money on compute rather than dev time.

That said, I've been amazed what the top engineers in the field can design and implement getting things right the first time without additional effort. Simple, minimalist, high performance implementations that don't sacrifice much. There is a thin line between the beauty of those architectures and a grab bag of pattern-matched, complex, and unnecessary optimizations.


> By definition an optimization being premature means it's not necessary in the present.

No, it is premature when it is not necessary for the deployment (at some future point in time).

When I was developing for mobile phones (long before the iPhone), the decision of what to optimize, and when, required some educated guesses about where the phone market would be in 6 months, when the application was market-ready. (And: in the end, we did optimize a lot of code that needed no optimization 6 months later, but those optimizations made development work much, much more bearable.)


> an optimization being premature means it's not necessary in the present.

Yeah, spot on. What's really important is answering the question of what's necessary, that's really the key. Sometimes you do know that optimization will be necessary in the future even when it's not needed right now. If that's the case, the earlier you attack it, the better. But all of us (me included) tend to spend time on things we believe are necessary that turn out not to be.

> For these situations it's like taking out a loan of technical debt. In the future you may need 10x the resources to fix it, but in many environments that's acceptable

Definitely. There is the gamble that if you don't know for sure and you wait it might cost a lot more later. But I would go a lot further than that. The problem is that if you take the other route and optimize early on the wrong thing, it makes optimizing the right thing 100x more expensive in the future, not just 10x. It can be more expensive to optimize the wrong thing than it is to optimize the right thing late. This is because optimizations often require a different way to slice the code, to feed the limiting system with a different batch size, or to invert your high-level looping structure and swap the inner loop and outer loop.

If you've ever done image processing in Python, I always think of numpy vs PIL as a good example of slicing your batches a different way -- if you do it the easy way in PIL, you can write cool stuff in an hour, but your code might take 60 seconds to run on an image. PIL loops over pixels, and you put complex ops inside the loop. Numpy is the opposite: you put the operations on the outside, and the loops on the inside (to allow the loops to be run as native compiled code). Numpy's harder to learn, the code is inside-out from PIL, and it takes longer to write & debug, but something that takes PIL 60 seconds will take less than 1 second using numpy.
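To make that concrete, here's a rough sketch of the two styles side by side - assuming Pillow and numpy are installed; the brighten-by-30 op and "input.png" are just placeholders for whatever per-pixel work you actually do:

    import numpy as np
    from PIL import Image

    def brighten_pil(img, amount=30):
        # The easy PIL way: loop over every pixel in Python and do the
        # operation inside the loop. Readable, but slow on large images.
        out = img.copy()
        px = out.load()
        w, h = out.size
        for y in range(h):
            for x in range(w):
                r, g, b = px[x, y]
                px[x, y] = (min(r + amount, 255),
                            min(g + amount, 255),
                            min(b + amount, 255))
        return out

    def brighten_numpy(img, amount=30):
        # The inside-out numpy way: express the operation on whole arrays
        # and let the loops run as compiled code inside numpy.
        arr = np.asarray(img, dtype=np.uint16)
        arr = np.clip(arr + amount, 0, 255).astype(np.uint8)
        return Image.fromarray(arr)

    img = Image.open("input.png").convert("RGB")
    brighten_numpy(img).save("out.png")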

Most of my serious optimization efforts in C++ have ended up feeling very similar to the PIL vs numpy issue, and what happens is that it takes a ton of effort to optimize the right way because I have to invert my whole program's structure. When I've done that only to discover I didn't optimize the right thing, it's a huge mess to un-invert and re-invert the right way.

> I've been amazed what the top engineers in the field can design and implement getting things right the first time without additional effort. Simple, minimalist, high performance implementations that don't sacrifice much.

Same here, some people seem to hit the right balance with less effort. Sometimes it's experience and good guesses, but often (from what I've seen) it's because if you watch the best engineers work, they spend more time communicating up front to make sure they're doing what's really needed, and they spend more time checking their assumptions as they go, writing tests, profiling, and talking with their peers and their management about what they're doing. They're able to recognize quicker when they're spinning their wheels on problems they want to solve, and not problems that the team or product really truly needs solved.


It's always about making the right judgement call between optimization and complexity, and sometimes I'm still surprised by what needs optimizing and what doesn't.

An anecdote from a project I'm working on. I'm writing some LED control software for light shows and using GPIO pins on various embedded Linux boards (CHiP, Raspberry Pi, Orange Pi) to generate the SPI-like serial signal I need.

My first draft, for Raspberry Pi 3, is a wonder of compact bit twiddling, doing all the necessary operations to write to GPIO memory in one compact function.

For the second platform, the Orange Pi Zero, I found a Python library for GPIO access that someone had already written, saw that the gpio operations were broken out into separate files and pulled those files into my project. My first thought: this will never be fast enough, and the O-Pi Zero is slower than the R-Pi 3. Luckily, I decided to try it first.

Now the gory details of writing to memory are hidden in the library and the logic in my code to generate clock pulses and write bits is a lot more clear. I had to tune the code differently - the delays in my code to generate the correct clock pulse width are shorter now that there is function call overhead, but that was easy to sort out with a bit of trial-and-error with a logic analyzer.
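For the curious, the library-based version of the bit-banging looks roughly like this - a sketch only, using RPi.GPIO as a stand-in for the library I actually pulled in, with made-up pin numbers and a delay that would need tuning against the logic analyzer as described:

    import time
    import RPi.GPIO as GPIO

    CLOCK_PIN = 23          # hypothetical BCM pin numbers
    DATA_PIN = 24
    HALF_PERIOD = 0.00001   # tune to get the required clock pulse width

    GPIO.setmode(GPIO.BCM)
    GPIO.setup(CLOCK_PIN, GPIO.OUT, initial=GPIO.LOW)
    GPIO.setup(DATA_PIN, GPIO.OUT, initial=GPIO.LOW)

    def send_byte(value):
        # Shift out one byte, most significant bit first: set the data
        # line, then pulse the clock once per bit.
        for i in range(7, -1, -1):
            GPIO.output(DATA_PIN, (value >> i) & 1)
            GPIO.output(CLOCK_PIN, GPIO.HIGH)
            time.sleep(HALF_PERIOD)
            GPIO.output(CLOCK_PIN, GPIO.LOW)
            time.sleep(HALF_PERIOD)

    try:
        for byte in (0xFF, 0x10, 0x20, 0x30):   # e.g. one LED frame
            send_byte(byte)
    finally:
        GPIO.cleanup()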


> I'm still surprised by what needs optimizing and what doesn't.

Yes! Exactly. I'm almost always surprised. And it's really important to remember what benefits you get from the unoptimized code too.

Once I wrote a controller input system in a game engine, the code that monitors and reacts to all the buttons on an xbox controller, for example. I spent a whole bunch of time making sure that when I hit X, the reaction code would trigger with almost no overhead at all. It did fancy stuff like watch for combos and sequences, and game designers could author the control schemes, and I spent a lot of time making sure that actions cost no CPU time.

Then one day I profiled it carefully and found out that the bottleneck was something I'd forgotten to check for -- the case when no buttons were being used. The case that is happening during 99.5% of frames.

The loop to check if something needed to be done had a pretty bad memory access pattern that was missing cache most of the time. I thought it would be close to a no-op, but it wasn't. I had to refactor the whole thing and flip it inside out, and I realized I'd wasted a bunch of dev time fixing the code that reacts to one button at a time. It doesn't matter how slow a single action is, because it'll never show up in the profile. It does matter if I waste a few hundred thousand cycles traipsing through memory missing cache with every instruction.


Something else I've learned the hard way - sometimes it's better to cut and paste the same short code snippet in multiple places than to create an abstraction to put the code in one place. I admit this is probably an indication there's a problem in the overall design, but sometimes those design choices have already been made and the goal is to make things work with what I have.


> Something else I've learned the hard way - sometimes it's better to cut and paste the same short code snippet in multiple places than to create an abstraction to put the code in one place.

It's a pretty short term optimization though.

Sure, it'll get you working, but working on a code base with no duplication means that once you find a bug and fix it, you don't need to hunt for duplicates to fix. It also means that if you're debugging something, you know exactly which code path was taken instead of "oh I thought it using X but it's using slightly different Y."

Whether that's a good trade off or not really depends on how long this code will be maintained. If it's front end glue code that will be tossed out when the CEO decides in nine months that the website should use X technology instead, then yeah, that's a fair trade off. Much less so if you know this code will be there 5-10 years down the line.


>I find premature optimization to be far, far more wasteful of money and human energy than wasted CPU cycles.

I would argue this always depends upon what you're doing. For example, in webdev, the culture of prototyping fast and breaking things fast is very prevalent, and makes sense when you're just trying to figure out if what you want done is even possible. However, there are many times where you /do/ have to consider the performance impact, because if you don't, what you're making isn't going to be nice to use.

By dogmatic, I think they meant individuals who will actually argue that '100ms' loop times are 'tiny' and shouldn't be a problem. They don't get the concept that it's just one small part of your code, and the rest of it combined can make an entire process take up to 5-10 seconds to do something that should happen in a blink or two. Learning about optimization, and knowing when to use it, is part of learning programming. The problem is when you don't know it, and fumble around trying to optimize when you've got much bigger problems further ahead.

I ran into this problem a lot with image filtering and working with canvases to build an image editor. I talked with some communities about the Uint32Array implementation in Chrome and other browsers and how some of the ES6 goodies, like 'let', are actually bad practice when you start to get into image filtering, because just iterating over all the pixels in an image by using "for(let i=0;i< pixel.length;i++){}" is 3x slower than using var, causing just a single full loop to be 80-120ms on a good computer.
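(To make that concrete, here's roughly the kind of measurement I mean -- just a sketch, the gap varies a lot by browser and engine version, and the 0xff000000 write is only there so the loop body isn't empty:)

    // ~2M "pixels", like a 1920x1080 image
    const pixels = new Uint32Array(1920 * 1080);

    let t0 = performance.now();
    for (var i = 0; i < pixels.length; i++) { pixels[i] |= 0xff000000; }
    console.log('var loop:', performance.now() - t0, 'ms');

    t0 = performance.now();
    for (let j = 0; j < pixels.length; j++) { pixels[j] |= 0xff000000; }
    console.log('let loop:', performance.now() - t0, 'ms');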

TypeScript compiles down to var, though, so it's less of a problem there, but I met a lot of resistance to that statement in general, even though it's true, because there's a lot of dogmatism in programming. At the end of the day I'd rather learn and teach good practices and not tell people they're pre-optimizing unless they really don't know much about what they're working with.


> ES6 goodies, like 'let', are actually bad practice...

BTW, this is a really good example of why people like to avoid premature optimization.

Performance oriented code frequently has to break all the rules, it often goes against all "best practices" for code that isn't performance critical. If you do it too early, you will wreck your codebase's ability to deal with 1- non performance critical code and 2- other performance issues than the one you solved.

If you make it a rule to avoid using 'let' everywhere, then you lose the safety mechanism of block scope variables, something that a lot of people agree is a good idea.

It's very easy when you find out that let is 3x slower to stop using let, but then if you do that you're prematurely optimizing everything. And if you don't do it, everything you write is a little slower than it could be, and that will nag at you.

Functional programming is another good example of what many many people consider a good idea for code safety, and yet it is frequently abysmal for performance. Just to use another JavaScript example... using map() has historically been about 10x slower than using a for loop.

The single smartest engineer I know, who's now managing a huge team at NVIDIA, said to me that engineering for features is a solved problem, but engineering for performance is an unsolved problem. We know how to do class hierarchies and architecture and pub-sub systems to make a 10M LOC application manageable. We don't know how to write high performance code on that scale.


>If you make it a rule to avoid using 'let' everywhere, then you lose the safety mechanism of block scope variables, something that a lot of people agree is a good idea.

Which is why I said it's bad practice in specific cases, currently. 2.5 million iterations is a use case, not a general rule.

Programming is complicated, and as such you shouldn't assume a given use case will never apply to you. Rather, you can say that microseconds of performance difference aren't a good reason not to use let, unless you're not using something that compiles down to var and you're going into the millions of iterations.

But, the point I was trying to make is that even when you talk about performance problems, devs jump the gun to 'pre-optimization', partially because they themselves don't know how well code performs, and partially because of misapplied dogmatism.


>using map() has historically been about 10x slower than using a for loop.

Wow, really? Any idea why? Seems odd when map() imposes fewer guarantees and can be implemented, in the worst case, with a for() loop...


I don't really know why, and I have heard there are large discrepancies from browser to browser. It is a function call per iteration, or even a closure, so I'm guessing it may often be difficult to recognize and optimize out the function call, or maybe some browsers haven't gotten around to it yet, I don't know.

This also depends on what you're comparing. Map always allocates a new array. In my Chrome 57 right now, map() is 4x slower than new array + a for loop of push().

But it's 100x faster to re-use an existing array... (a for loop of array element assignment)
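(For anyone following along, these are the three shapes being compared -- a sketch, and the exact ratios will differ per engine and version:)

    const src = new Array(1e6).fill(1);

    const a = src.map(x => x * 2);                    // map: allocates a new array every call

    const b = [];                                     // new array + push
    for (let i = 0; i < src.length; i++) b.push(src[i] * 2);

    const c = new Array(src.length);                  // pre-sized array, assign in place
    for (let i = 0; i < src.length; i++) c[i] = src[i] * 2;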


> some of the ES6 goodies, like 'let', are actually bad practice... 3x slower than using var

This is good to know! I suspected, but never verified. I assume this will get fixed over time, since 'let' is brand new.

But JavaScript in general isn't well suited to image filtering, right? It's certainly worth using either CSS filtering or WebGL if at all possible. Either of those will be much faster. Depending on what you're doing, some complex filtering operations will even be faster if you upload to a back-end, process, and download the result. :P Is WebGL something you avoided because it won't work for you, or something you avoided because it's more costly development-wise?

Hey overall, I agree with everything you say, and I'm still comfortable saying, dogmatically, never optimize prematurely. What that means to me is don't do engineering for problems you don't know you have. If you're not sure, then you don't know. If you're doing filtering, and napkin math proves in advance that you can't beat 5fps and your requirements call for 15 fps, then you do know.

In my mind this even fits with startups, because a startup's mission is to find the product and market, to figure out what's possible and then execute on it.

So ultimately I agree 100% it always depends on what you're doing, my additional caveat is always know what you're doing. :)


>either CSS filtering or WebGL

CSS only has basic filters, and WebGL is only possible for per-pixel manipulation, such as brightness or contrast. This is because WebGL is just shaders, and half of shaders are per-pixel, along with some other fun things you can use with multiple buffers, but you can't actually get /back/ the processed data even if you got all the pixels into a single loop. With WebGL 2, this may be different due to some new buffer types, but I don't think it's made for it. Any time you do an algorithm that is dependent upon surrounding pixels, such as dithering, you're stuck with JS.

https://tools.czaux.com/space-pixels/

Here's an example of in-browser dithering to a fixed palette set that works pretty well, using RGBQuant modified with alpha support. The overall canvas uses FabricJS for manipulation.


> but you can't actually get /back/ the processed data even if you got all the pixels into a single loop

Oh if you need the pixels back, you can certainly get them using glReadPixels(). It's a bit expensive in WebGL land, though it is much less expensive than looping over pixels using JS. And often there are ways to get around needing the pixels back, depending on what you're doing.
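(For reference, that's a single call for the whole framebuffer -- a sketch assuming `gl` is your WebGL context and width/height match what you rendered:)

    const pixels = new Uint8Array(width * height * 4);                      // RGBA, one byte per channel
    gl.readPixels(0, 0, width, height, gl.RGBA, gl.UNSIGNED_BYTE, pixels);  // one call, not one per pixel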

You can also do loads of non-per-pixel tricks, including filters that need the surround, using multiple passes and clever shaders; you are not as limited with WebGL 1 as you think.

https://5013.es/toys/dithering/

https://www.shadertoy.com/view/MslGR8

This is what I'm talking about, you have to slice your program differently if you optimize it. Your high level structure and outermost loops will look different if you use WebGL instead of JS. And yes WebGL 2 is even better.

But, this is all much trickier than writing your straightforward loops in JS. That's the price you pay for performance, your architecture will be kinda crazy if you do it in WebGL, and it will not be easy to go back if you discover you optimized for the wrong thing.


Random-order dithering can be done in WebGL, but the results are of poor quality. I am talking about error-diffusion dithering, where you are reading and writing pixels, and then re-reading those modified pixels, which change the rest of the pixels. In WebGL, you can read the pixels of the image, but you cannot modify and read. You are returning the actual pixel you have modified, which is what glReadPixels() returns. You could use WebGL, but you'd have to use glReadPixels() for every pixel modified so that the next pixel knows the new data.


Yes, ordered dither is easier in a shader than error diffusion, that's true, but error diffusion is definitely possible on a GPU. If you need it. Do you really need it? Why not still use the GPU to accelerate whatever parts you can? Even if you dither on the CPU, doing your color filters on the GPU instead of in JS could make the difference between interactive and not.

I don't know what you're doing exactly, but when you do need dithering, it's usually the last thing you need to do before output, and it's usually less important that the dither is fast than most other operations you need in an image editor.

BTW, random order is lower quality, and error diffusion is better, only for very low res color palettes. If your result is more than 256 colors, it's irrelevant, and for high quality dithering, e.g., 16 bits per channel colors to 8 bits per channel for print, random ordered dithering is superior.

https://community.arm.com/graphics/b/blog/posts/when-paralle...

http://www.cs.cmu.edu/~imisra/projects/dithering/HybridDithe...

https://www.shadertoy.com/view/4dt3W7

That last one is ugly, but it's a great proof of concept here. You could get error diffusion in a shader using 2 passes. In the first pass you render to texture, and in the second pass you can sample the texture however you want. Note how in this example the error diffusion propagates backwards from what you would normally do in JS or C++, because it's a shader looping over destination pixels, not code that loops over source pixels.

> In WebGL, you can read the pixels of the image, but you cannot modify and read.

I don't know exactly what you mean here. You can modify & read pixels using render to texture, or using multiple passes.

Render to texture will be faster, if you can do it. Multiple round-trips from CPU to GPU and back will be slower, so you want to limit the number of trips, but it's easy to do.


>Even if you dither on the CPU, doing your color filters on the GPU instead of in JS could make the difference between interactive and not.

Dithering requires quantization. If you mean things like brightness and contrast, yes, WebGL is better for that. But quantization with error diffusion dithering is still based upon previously modified pixels, so you cannot just send a bunch of pixel info to WebGL, you have to do each pixel separately. Meaning, pushing through 2.5 million separate inputs sequentially, and having to use glReadPixels() 2.5 million times.

>BTW, random order is lower quality, and error diffusion is better, only for for very low res color palettes.

Not in any of the images I have used. Ordered dithering looks fake, because it appears like a texture over the image. Dithering with error diffusion kernels, like Floyd-Steinberg, creates a less obvious texturization that still preserves the underlying image.

>https://www.shadertoy.com/view/4dt3W7

From what I'm reading, for every pixel it is processing a 250-iteration loop that gets the pixels in the row and builds errors based upon those. However, it is not modifying any pixels as it goes; this is a parallel operation, with each pixel independent of the others. Good error diffusion with a kernel requires previous pixels to be modified, and errors to be built upon them, which is why it is a sequential operation. I don't see any destination-dependent pixel manipulation.

>Multiple round-trips from CPU to GPU and back will be slower, so you want to limit the number of trips, but it's easy to do.

If it's possible to get data back from Uniform Buffer Objects in WebGL 2, it may be possible to send up to 1000 or so uniform pixel values, but that is the general cap. Some GPUs have less. Unless I am missing some magic buffer you can use to write out many pixels to, and in that case I would be very interested in testing that, but shaders are designed to output a single pixel per instance.

>https://community.arm.com/graphics/b/blog/posts/when-paralle...

From what I'm reading, this is using OpenCL, and parallelizing in rows, but these rows have height. A single thread processes one row and sequentially goes through the pixels like normal error diffusion. Just that this breaks different parts of the image up. I may have to try this out in JS and see if perhaps it can thread better that way. ty.


> so you cannot just send a bunch of pixel info to WebGL, you have to do each pixel separately.

Why do you think that? You certainly can get previously modified pixels, you can send millions of pixels to WebGL with a single call (as a texture). Nobody calls glReadPixels millions of times, that's a bad idea. :) You might want to investigate multipass rendering techniques. Small kernel convolutions, for example, are standard and simple operations in WebGL. People use blur & edge filters all the time, and those depend on neighborhood computation.

Regarding the 1-d error diffusion shader on shadertoy, it is using a gather instead of a scatter. It's a pull instead of a push, he flipped the operation inside-out. It is still computing error diffusion correctly (but for only a single pixel row). And it's running at 60fps.

This is the whole point I tried to make above multiple times wrt performance: this code looks unrecognizable compared to the straightforward CPU serial way to implement error diffusion.

The reason this example is correct (in 1D) is because it recomputes the error propagation for every destination pixel; it's wasting almost all of the computation it's doing because in this case it's not sharing the error computation. But that doesn't mean it can't -- this is just a proof of concept on ShaderToy, not the limit of what you can do. You can't do multipass with ShaderToy, and multipass is how you share pixel results from one iteration to the next.

> Unless I am missing some magic buffer you can use to write out many pixels to

It sounds like you're missing render to texture and multipass techniques, the ways to use textures as compute I/O. To do multipass in WebGL and share the results of computation from one pass to the next, you create an offscreen framebuffer for your results, and you render directly to that framebuffer. You can then use the result as an input texture for the next pass (via glCopyTexImage2D) or you can read back the buffer (via glReadPixels) and then you repeat. Using glCopyTexImage2D is much faster because you never leave the GPU.
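(Roughly, the skeleton looks like this -- just a sketch, assuming a full-screen quad and a shader program that samples texture unit 0 are already set up; the helper names are made up:)

    // ping-pong: render each pass into a texture, feed that texture to the next pass
    function makeTarget(gl, w, h) {
      const tex = gl.createTexture();
      gl.bindTexture(gl.TEXTURE_2D, tex);
      gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, w, h, 0, gl.RGBA, gl.UNSIGNED_BYTE, null);
      gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
      gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
      gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_S, gl.CLAMP_TO_EDGE);
      gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_T, gl.CLAMP_TO_EDGE);
      const fbo = gl.createFramebuffer();
      gl.bindFramebuffer(gl.FRAMEBUFFER, fbo);
      gl.framebufferTexture2D(gl.FRAMEBUFFER, gl.COLOR_ATTACHMENT0, gl.TEXTURE_2D, tex, 0);
      return { tex: tex, fbo: fbo };
    }

    function runPasses(gl, sourceTex, w, h, passCount) {
      const targets = [makeTarget(gl, w, h), makeTarget(gl, w, h)];
      let input = sourceTex;
      gl.viewport(0, 0, w, h);
      for (let i = 0; i < passCount; i++) {
        const dst = targets[i % 2];
        gl.bindFramebuffer(gl.FRAMEBUFFER, dst.fbo);   // write this pass into a texture
        gl.bindTexture(gl.TEXTURE_2D, input);          // previous result is this pass's input
        gl.drawArrays(gl.TRIANGLES, 0, 6);             // full-screen quad: shader runs once per pixel
        input = dst.tex;
      }
      gl.bindFramebuffer(gl.FRAMEBUFFER, null);        // the final result can now be drawn to the canvas
      return input;
    }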

I think you could do error diffusion by rendering the error to a texture, and using a multipass technique that only needs as many iterations as the maximum distance any error could travel. In the worst case it'd probably be the greater of your image width or image height, e.g. 512 passes for a 512x512 image, but in practice I think you'd be done much sooner. That's assuming there isn't something hierarchical and more clever that could do it in log(width) passes, which I suspect there is.


>It sounds like you're missing render to texture and multipass techniques, the ways to use textures as compute I/O. To do multipass in WebGL and share the results of computation from one pass to the next, you create an offscreen framebuffer for your results, and you render directly to that framebuffer. You can then use the result as an input texture for the next pass (via glCopyTexImage2D) or you can read back the buffer (via glReadPixels) and then you repeat. Using glCopyTexImage2D is much faster because you never leave the GPU.

How are you writing many pixels to a framebuffer in a single instance? frag_color returns one pixel. Even if that OpenCL implementation works in WebGL, you'd still be at only 8 threads, so only 8 pixels that can be done at the same time, then you have to pass in the output to process again. 259200 glCopyTexImage2D calls for a 1920x1080 image since it's 2073600 pixels.

I'll definitely look into this more, since that OpenCL implementation might hold some answers on how the error distance is separated to allow for 8 rows at a time.


> frag_color returns one pixel

Whoa, hang on. Hey I only mean this to be helpful not insulting, but it sounds to me like you may have some misconceptions about the way WebGL works. I know how easily that can be taken the wrong way, especially in text, so again I apologize in advance and I don't mean that to be rude at all. It would be best to back up and understand WebGL.

If you're doing image processing in WebGL, then to write many pixels to a framebuffer all at once, you draw a single polygon that covers the entire viewport. Your shader is applied in parallel to all pixels drawn. That is how ShaderToy works, it renders a single quad to the viewport and applies whatever shader you give it, the GPU runs that shader on all pixels rendered.

There are never hundreds of thousands of buffer read calls, you only need a handful. For a blur, you only have to do a buffer read once, and your shader samples the 3x3 neighbor pixels.

You don't need OpenCL, that's a level of complication you don't need. I may have given the wrong impression with that link.

Check out this image processing tutorial using basic WebGL 1, and pay attention to how it works:

https://webglfundamentals.org/webgl/lessons/webgl-image-proc...

Here is the demo from that article that uses the techniques I've been talking about. All of the filters in this demo are doing neighborhood computations. And note you can apply multiple filters. There is no texture copy here, this tutorial renders directly to a texture in one pass, and then samples that texture in the next pass and so on. The iterations or feedback that you're looking for happen by combining render-to-texture with drawing the viewport polygon multiple times.

https://webglfundamentals.org/webgl/webgl-2d-image-processin...


>If you're doing image processing in WebGL, then to write many pixels to a framebuffer all at once, you draw a single polygon that covers the entire viewport. Your shader is applied in parallel to all pixels drawn. That is how ShaderToy works, it renders a single quad to the viewport and applies whatever shader you give it, the GPU runs that shader on all pixels rendered.

>Whoa, hang on. Hey I only mean this to be helpful not insulting, but it sounds to me like you may have some misconceptions about the way WebGL works.

You draw a simple quad, and the shader then processes all the pixels in parallel, returning their frag_color as the output color. It can read textures and other source information, but it does not have access to what is currently being processed for other pixels, because it's parallel. You have to wait until it has rendered that, and then pass it in again. I am unsure what I am misunderstanding.

>https://webglfundamentals.org/webgl/lessons/webgl-image-proc...

This was a cool insight into using framebuffers for multiple passes, thank you.

>Here is the demo from that article that uses the techniques I've been talking about. All of the filters in this demo are doing neighborhood computations. And note you can apply multiple filters. There is no texture copy here, this tutorial renders directly to a texture in one pass, and then samples that texture in the next pass and so on. The iterations or feedback that you're looking for happen by combining render-to-texture with drawing the viewport polygon multiple times.

From what I'm seeing, this is not just a couple of passes, it's a pass per filter. None of the filters individually relies upon multiple passes; their output is calculated purely from the image that was filtered before it, meaning the filter itself doesn't need to write pixels and then read them again. You can see that in the for loop that calls setFramebuffer() and drawKernel(). I understand that you can apply a shader and put its output back in, but I still fail to see how you're avoiding doing that at the very least hundreds of thousands of times. Error diffusion dithering is classically sequential from top to bottom, or bottom to top depending upon serpentine or not; I don't think you can just process a ton of pixels at the same time and still get anything that looks like a sequentially done Floyd-Steinberg.


Okay, this is good, you're almost there. BTW, I'm doing a bad job of explaining, and I'm sorry. I realize I'm complicating a few things and conflating a few things, so the best advice I can give is to actually go through that tutorial on image processing and write the code and understand the whole pipeline.

You're right; this demo is 1 pass per filter type. None of them require multipass, but the entire demo is multipass. There could be a filter that needed more than 1 pass, but in this case there isn't.

1 pass means: render all pixels in the viewport, and run the shader on all pixels rendered. You've got that part. The trick is you get to access a texture, and the texture is the result of the previous pass. Furthermore, inside the shader, you can address and access any pixels from the previous pass, and you can access multiple pixels from the previous pass to render one destination pixel.

I think the millions of reads you're looking for are the texture() calls happening inside the shader. The [render / render-to-texture / copy pixels / copy texture image] calls process all pixels in the viewport in a single call. The shader applies to all pixels, but a shader only gets to touch any given destination pixel once per render. But the shader can read as many source pixels as it wants.

Because the shaders aren't limited on their reads, but they are limited on their writes, you have to re-organize your algorithm accordingly. You keep reiterating that error diffusion is spreading out from top to bottom, and I keep re-iterating that it has to work differently in WebGL, we've been talking past each other a little bit here.

You're right; you can't spread things out (scatter) using WebGL during a single pass, you can't do the classic Floyd-Steinberg implementation the same way you do on the CPU. So it's important to understand that there is another way to do it, and it doesn't look like what you're used to. It doesn't spread things out by pushing error around inside the loop. It spreads things out by letting each destination pixel pull the error from its neighbors before computing its own error, rather than pushing its own error to its neighbors after computing. This is known as a gather, as opposed to scatter. It is mathematically the same thing, but the order of operations is turned around.
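(A 1D toy version in plain JS to show the flip -- hypothetical names, threshold dithering to 0/1 with all error pushed to the right neighbor. Both functions produce identical output; the second is the shader-style "pull":)

    // classic scatter: quantize left to right, push the error onto the next pixel
    function ditherScatter(src) {            // src: grayscale values in [0,1]
      const work = Float32Array.from(src);
      const out = new Uint8Array(src.length);
      for (let i = 0; i < work.length; i++) {
        const q = work[i] < 0.5 ? 0 : 1;
        out[i] = q;
        if (i + 1 < work.length) work[i + 1] += work[i] - q;   // push error right
      }
      return out;
    }

    // gather: each destination pixel replays the error of everything to its left
    // (wasteful, but every pixel is independent -- exactly what a shader wants)
    function ditherGather(src) {
      const out = new Uint8Array(src.length);
      for (let i = 0; i < src.length; i++) {
        let err = 0;
        for (let j = 0; j < i; j++) {                          // pull error from the left
          const v = src[j] + err;
          err = v - (v < 0.5 ? 0 : 1);
        }
        const v = src[i] + err;
        out[i] = v < 0.5 ? 0 : 1;
      }
      return out;
    }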

Here's a diffusion demo, it's reaction diffusion, not error diffusion, but ultimately exactly the same diffusion process. Each pass diffuses the previous pass by 1 step.

http://timhutton.github.io/webgl-reaction-diffusion/


>You're right; you can't spread things out (scatter) using WebGL during a single pass, you can't do the classic Floyd-Steinberg implementation the same way you do on the CPU. So it's important to understand that there is another way to do it, and it doesn't look like what you're used to. It doesn't spread things out by pushing error around inside the loop. It spreads things out by letting each destination pixel pull the error from its neighbors before computing its own error, rather than pushing its own error to its neighbors after computing. This is known as a gather, as opposed to scatter. It is mathematically the same thing, but the order of operations is turned around.

That makes significantly more sense now. I thought you were saying I could do normal floyd-steinberg sequentially somehow.

I am unsure on the mathematical implementation, but I will keep looking at this. I would think there would be a paper on this method somewhere.

I really appreciate the time you've taken to respond. Not many WebGL people out there who can actually point out how it all works and what's possible, and I definitely learned something.


> because just iterating over all the pixels in an image by using "for(let i=0;i< pixel.length;i++){}" is 3x slower than using var, causing just a single full loop to be 80-120ms on a good computer.

That is an implementation detail. Of specific chrome versions.

That does not make let itself bad practice. In principle it should be significantly easier for a compiler to optimize because of its more limited scope. So in the long run, or possibly on other browsers, it is likely best practice.

Putting transients before long-term behavior leads to urban programming myths and legacy baggage.

In fact in some jsperf benchmarks on chromium let happens to be faster for me.


>Putting transients before long-term behavior leads to urban programming myths and legacy baggage.

Not when the problems still exist. When they stop existing, sure, but the statement generally goes along with how some parts of ES6 are not well optimized yet. Over time, they should be optimized, but I cannot design something for an event that hasn't happened yet and cannot be tested.


You are doing it wrong.

let is better than var (in every way), so you should never ever use var. Of course you can design for the future. Designing for the future is using let and transpiling to a dialect that is used by most browsers today.


>let is better than var (in every way), so you should never ever use var.

Not in the case I presented. let has benefits, but that does not mean it's 'better in every way'. Again, dogmatism. Granted, the case I give is many, many iterations, but it's still a problem if you're doing image filtering.


> Computers are fast, but not if all that speed is wasted.

While computers are fast, the performance improvements have been spread very unevenly. CPUs have improved the most, with memory lagging significantly behind, and persistent storage even more distant. On the other hand, networks have become much faster than before. I like this list of Latency Numbers Every Programmer Should Know: https://gist.github.com/jboner/2841832


Is that still true? I would think that the rise of SSDs means persistent storage has improved much more than CPUs, at least in the last decade or so.


Not really, it's just another example of uneven performance improvements. Peak CPU performance has increased massively, but achieving it requires the use of vector instructions (and multiple cores). It is true that the speed of persistent storage has increased faster than sequential CPU performance (but it's still quite far off).


Or it can blink a cursor https://news.ycombinator.com/item?id=13940014

I remember an early Atom extension which faded the cursor in and out. It used 100% of one core to accomplish this.


You're nearly right, but premature optimization is no myth; it is a massive source of both bugs and wasted effort.

The backlash (e.g. your statement?) against its misapplication seems to be causing as many problems today as the misapplication itself, at least in my experience.

It is true that "premature optimization" is used as an (invalid) excuse for lack of design or poor design, but if it wasn't there, another excuse would be found. That's not what Knuth was talking about.



That made me remember when I was learning to program (Clipper, anyone?). How fast Clipper (compiled) was compared to dBase III Plus.

And made me wonder: for small applications, like the traditional one computer that controls my sales, nothing online... which language would I use? Seems that there are many options for big systems, but WebServer + Java + (...) seems to be too complex (and slow) for small needs.


Well, most of the examples are in interpreted Python, which would be one of those bad design choices.


Do you know how much you could do in a month but can't because of bad design choices, overengineered type systems, and dogmatic adherence to the 'every application needs to be written in the language with the best benchmark numbers' myth?


False dichotomy. The real solution is a language with a strong, static type system (and thus plenty of room for ahead-of-time optimization) that's also concise and productive, such as Kotlin or Swift.


Pretty cool, but a number of the questions are totally unknowable.

For instance, the question about web requests to Google. Depending on your internet connection you've got more than an order of magnitude difference in the outcome.

In the question about SSD performance the only hint we have is that the computer has "an SSD", but a modern PCIe SSD like in the new Macbook pro is over 10 times faster than the SSDs we got just 5 years ago.

The question about JSON/Msgpack parsing is just about the implementation. Is the python msgpack library a pure python library or is the work of the entire unpackb() call done in C?

The bcrypt question depends entirely on the number of rounds. The default happens to be 12. Had the default been 4 the answer would have been 1000 hashes a second instead of 3. Is the python md5 library written in C? If so, the program is indistinguishable from piping data to md5sum from bash. Otherwise it's going to be at least an order of magnitude slower.

So I liked these exercises, but I liked the C questions best because there you can look at the code and figure out how much work the CPU/Disk is doing. Questions that can be reduced to "what language is this python library written in" aren't as insightful.


You very well captured the intention of this (or at least what I think is the intention). I mentioned above that in highly abstracted systems it is not at all unusual to spend more time on inefficiencies than on doing real work.

You noticed here that it is also hard to tell in a high abstraction layer which operations might be fast and which may be slow.

Often this is even found in seemingly simple places, e.g. traversing a List<> in Java can have vastly different performance characteristics (LinkedList vs. ArrayList).


This is exactly my main complaint about this.

Still, it was an intriguing idea, even if a lot of the answers were based on things not described. I did learn a few things... I don't do much parsing so I didn't realize how slow JSON was compared to other packaged formats.


It's the required escaping of quotation marks!! If you have non-English users and didn't turn off escaping of non-ASCII, then you're in for a real surprise :) Length-prefixed strings would solve this.

Also, memory allocation penalties can hit you hard, depending on your environment, since it's hard to predict what you'll need to allocate when encoding/decoding JSON.


Yes, modern computers are fast. How fast?

The speed of light is about 300,000 km/s. That translates to roughly 1 ns per foot (yeah, I mix up my units... I'm Canadian...)

THUS, a computer with a clock speed of 2 GHz will be able to execute, on a single core/thread, about 4 (four!) single-clock instructions between the moment photons leave your screen and the moment they arrive at your eye 2 feet (roughly) later.
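(Spelling out the same napkin math:)

    speed of light  ~ 1 foot per nanosecond
    2 feet          ~ 2 ns screen-to-eye
    2 GHz           ~ 0.5 ns per clock cycle
    2 ns / 0.5 ns   = 4 cycles ~ 4 single-clock instructions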

_That_ should give you an idea of how fast modern computers really are.

And I _still_ wait quite a bit when starting up Microsoft Word.


The obvious conclusion is that by doubling your distance to the screen, you double your computers speed! ;-)

(in truth, I really like this image, though. On a similar note, I think I recently saw a number for the total length of wiring in a modern CPU (a Ryzen review, perhaps) - and it was a rather staggering number, given the fractal layout of modern chips - 200 meters perhaps? In a square little more than 2 cm to a side, if that?)


Another fun fact I recently learned is that about half of the area of the die on a modern chip is taken up by wires for the clock alone, in large part to ensure the length of wire to each component is the same.

http://m.cacm.acm.org/magazines/2017/1/211094-exponential-la...


Holy crap, you're right. 2 instructions per light-foot! That is a very cool comparison!

Makes me think maybe light is slower than I thought. And that Word must have a lot of code running. :P


Most modern CPUs can execute more than one instruction per clock cycle on a single core or hardware thread. They do a lot more than 4 instructions in the time it takes light to travel 2 feet.


If you want to get technical, modern CPUs can dispatch (begin the execution) of 4 to 6 instructions every clock cycle. However, many instructions take more than one cycle to complete. For instance, integer multiply takes 3-4 cycles on an Intel Sandy Bridge chip. FMUL takes 5 cycles to complete.

That being said, there are several instructions, like mov, integer additions and compare, bitwise operations, which can absolutely complete in a single cycle.

See this document for more details: http://www.agner.org/optimize/instruction_tables.pdf


Also: While SIMD does not execute additional instructions, it executes more operations [that's the whole point; it reduces contention and bandwidth needs in the frontend / instruction decoding and scheduling]. A Haswell core peaks at 32 SP FLOP/cycle, for example.


This (appx 1ns / foot) was Grace Hopper's visualization many years ago. I remember being taught it in 1990 and intrigued at how it made something abstract so easily memorable.


The thing is that while CPUs are fast, RAM, never mind an HDD, is not.


This is the coolest fact, how incredibly fast they can react.


If, like me, you spend most of your time in high-level, garbage collected "scripting" languages, it's really worth spending a little time writing a few simple C applications from scratch. It is astonishing how fast a computer is without the overhead most modern languages bring in.

That overhead adds tons of value, certainly. I still use higher level languages most of the time. But it's useful to have a sense of how fast you could make some computation go if you really needed to.


>It is astonishing how fast a computer is without the overhead most modern languages bring in.

No kidding! I once implemented Knuth's Dancing Links/Algorithm X[1] in Python, as part of a Sudoku solver. At one step in the algorithm you need to choose from a group of "things to do next"; the C version I was using as a reference just chose "the next item" in the group (it was stored as a linked list). Changing it to the more efficient method of scanning all the items and choosing the one which would more fully constrain the search space made no obvious difference in the time it took to solve a 9x9 Sudoku -- practically instant.

In the Python version, the efficient strategy also solved 9x9s practically instantly; the simpler strategy ended up taking several minutes! I forget exactly how many unnecessary steps the simpler version ended up taking -- hundreds of thousands, maybe -- but the C code just got out of the way and let the processor crunch through them too quickly to care about.

[1] https://www.ocf.berkeley.edu/~jchu/publicportal/sudoku/00110...


Worth checking out Common Lisp. It's as high-level language as you can get, and yet good compilers (like SBCL, or like commercial ones from Franz and LispWorks) can compile it to tight assembly with performance very close to that of C++ (you need to disable some runtime checks for that though, but you can do that on a per-function level, so it's much less of a problem than one thinks).


Haskell is also really fast. On most Stack Overflow questions asking "why is this Haskell program slow", there's often a detailed answer with an implementation that's about as fast as -- or even faster than -- C:

http://stackoverflow.com/questions/42771348/why-is-haskell-s...

http://stackoverflow.com/questions/6964392/speed-comparison-...

http://stackoverflow.com/questions/29875886/haskell-why-does...


None of those are very good examples. They spend a lot of time optimizing already dead-slow algorithms, and they end up with an optimized dead-slow haskell version that is faster than the non-optimized dead-slow C version. Not very impressive.

I thought I would implement some naive versions:

For the second link, I just whipped up this naive version in Chez Scheme, and it runs in 0.5s, which should really be the baseline: https://pastebin.com/JXGLA4TR Incidentally, this seems to be on par with the fastest Haskell version posted that does not use precomputed primes for factorisation. It could be made a lot faster by using only primes, but I couldn't be bothered.

And I doubt you will find a Haskell version that gets the longest Collatz sequence faster than this: https://pastebin.com/FAdiHA3X (0.01s using gcc -O2). Yet again, a simple algorithm, but this time with bolted-on memoization.

Edit: The first link contains the most naive and slow fibonacci function. It might be mathematically elegant, but it is dirt slow.

    (define (fib n)
      (let loop ([n n] [a 0] [b 1])
        (if (zero? n)
            b
            (loop (- n 1) b (+ a b)))))
That one takes 0.1s calculating the 100,000th Fibonacci number, and there are even faster ways.


But you still don't get any control over memory layout of your data and how allocations are performed and this could easily cost you an order of magnitude in performance.


True, at least without using any vendor-specific extensions (that may or may not help here; I haven't investigated). My point was that in CL, you get close-to-C++ performance levels by default, you can get even closer by optimizing specific areas in your code, and all of that while still writing in a very high-level language. Of course you won't beat memory-optimized C++ code in performance, but you're already close enough that you rarely need to.


Yep, I've recently written an RFC2822 (and friends) message parsing library in C that can parse 4GB of email messages per second (incl. full body decoding and normalizing everything to UTF-8) on my old Ivy Bridge i5. Part of the trick is never tying the number of malloc() calls to the number of messages or size of input, not using crappy OOP-like interfaces, and avoiding string comparisons, which is all possible. My library has no malloc(), no strcmp() and no OOP. :)

It's about 4 times faster than gmime (also written in C - but kind of unfair comparison, because gmime is not doing the body decoding and conversion to UTF-8 by default) and 20 times faster than Python (in a single thread). Now if only I could have a SSD that could read my mail archive that fast. :)


I would love to read through some of that. Would that be possible ?


I worked through project euler using both C and chez scheme. Once you spend a lot of time optimizing an algorithm in chez, you can sometimes squeeze out a 5x speed increase just by porting it to C.

And chez is _fast_.


  >> It is astonishing how fast a computer is without the overhead most modern languages bring in.
The other interesting thing with high level languages (using Python as an example) is that a small number of features that are not often used keep the language from running significantly faster (see PyPy for example).

Seems like there should be room for a language with a slightly reduced set of Python features (Py--) which runs much faster.


Be careful what conclusions you attempt to draw from examples when you aren't sure what exactly is happening. These examples are actually very wrong and misleading.

Take, for example, the first code snippet about how many loops you can run in 1 second. The OP fails to realize that since the loop isn't producing anything which actually gets used, the compiler is free to optimize it out. You can see that that's exactly what it does here: https://godbolt.org/g/NWa5yZ All it does is call strtol and then exit. It isn't even running a loop.


We automatically generated all the results in this quiz from the programs on the site. None of the loops were optimized out, we ran basically a binary search to figure out the maximum number of iterations you could run in a second.

Results and compiler optimizations will of course vary across computers, but they were correct on my laptop on the day that we ran them (in Sept. 2015).

If you want to reproduce this on your computer and see how the results are different, you can clone https://github.com/kamalmarhubi/one-second and run `run_benchmarks.py`


The -O2 flag wasn't added to the benchmark script until Sept 20th in commit 4a31931, I suspect you ran at least sum.c without that flag since any modern gcc compiler should have optimized away that entire loop.

I downloaded and ran the benchmarks and indeed the search for the 1 second runtime for sum.c never completed -- it never found an answer because the sum binary runs in constant time regardless of the iteration count. In fact, atoi() in sum.c quickly overflowed once the iteration count exceeded 2^31 (since atoi() returns a signed int). I removed the -O2 flag and then I got the same 550429840 iter count as you on my laptop.

The benchmark script isn't really doing a binary search, it's doing a linear search since it's increasing the iteration count by 10% with each loop.


No, look again. It's constant time. Yes, "optimizations will vary" - but any C compiler with reasonable settings should just ignore the loop. Otherwise you are running so unoptimized that there is no point in talking about the runtime of the program anyway. You should change it to return the sum.


> You can see that thats exactly what it does here: https://godbolt.org/g/NWa5yZ

You are wrong as well, that is what it is doing if you compile it with -O2 on that exact compiler. Try compiling with -O0 instead and it will include those instructions. Also the number is reasonable, modern cpu cores can execute around a billion loop iterations per second.


The page mentions that all C programs were compiled with gcc -O2.


The page claims the C program can do 550 million loop iterations in one second. That's exactly what you would expect from an average laptop. So the loop clearly isn't getting optimized out in the executable used by the authors. If you compile with -O0 you can confirm this for yourself.

You can also see that in later examples printf() statements are added to stymie the optimizer.


Seems clear the question assumes you're generating code that actually increments the integer (i.e. repeating a single add instruction).

An anecdote defending the question on a non-technical level: when I was a kid around 4 or 5 my dad had me race a computer counting to 1000. It was one of the more memorable parts of my young childhood!


Smart computer can do dumb things very very very fast.


More impressively, sum.c could likely go an order of magnitude or so faster, when optimized.

> Friends who do high performance networking say it's possible to get network roundtrips of 250ns (!!!),

Well stuff like Infiniband is less network, and more similar to a bus (e.g. RDMA, atomic ops like fetch-and-add or CAS).

> write_to_memory.py

Is also interesting because this is dominated by inefficiencies in the API and implementation and not actually limited by the memory subsystem.

> msgpack_parse.py

Again, a large chunk goes into inefficiencies, not so much the actual work. This is a common pattern in highly abstracted software. msgpack-c mostly works at >200 MB/s or so (obviously a lot faster if you have lots of RAWs or STRs and little structure). Funnily enough, if you link against it and traverse stuff, then a lot of time is spent doing traversals, and not the actual unpacking (in some analysis I've seen a ~1/3 - 2/3 split). So the cost of abstraction also bites here.

If you toy around with ZeroMQ you can see that you'll be able to send around 3 million msg/s between threads (PUSH/PULL) from C or C++, around 300k using pyzmq (this factor 10 is sometimes called "interpreter tax"), but only around 7000 or so if you try to send Python objects using send_pyobj (which uses Pickle). That's a factor 430.


How would you optimize sum.c to be faster?


Well the obvious answer would be, eliminate the loop, which is what any compiler optimizer will do ;)

But let's assume that the operation is not quite so trivial like here, then this structure would be a prime example where each loop operation is independent from each other, so you can sum a vector in parallel (=SIMD) and then sum the vector once in the end.

Also, since we're obviously not using any compiler optimizations, you can unroll the loop, which reduces per-iteration loop overhead (the conditional jump) and in such a simple case as here is bound to give a nice boost.


When you say "eliminate the loop", do you mean loop unrolling? They are compiling with -O2, I'm not sure if that does loop unrolling. If it does, how much unrolling does it do, exactly? I can't imagine it would construct a block of code with a billion add instructions.


Eliminating the loop means the compiler looks at the code and says "the value computed in this loop is never actually used, so might as well just get rid of that code altogether".


First, it's unused, so it'll just kill it. If it wasn't, most compilers can compute sum reductions, etc, pretty well.

Anything that is easily expressed as a set of simple recurrences that can be solved, it'll solve.


s += NUMBER

or

s = NUMBER


For a data-parallel problem like that you could vectorize it using the SIMD unit. It never ceases to amaze me how much silicon real estate just sits idle because no one really bothered to look behind the curtain and tailor the code for a specific architecture. For reference, a recent Intel chip will have a 256-bit wide vector unit (512 for server-class chips) that can be treated as a vector of 8/16/32/64-bit units to do parallel math. At typical 16-bit ops you get a 16x speedup for practically nothing. Plus it's considerably more if you have predictable but non-standard data access patterns.

Now <rant> IMO what the parent commenter was trying to say is that this punchline about premature optimization is often used as a catch-all justification for all the convenience/expediency-related decisions taken. It also works well to create future work; unfortunately that work never gets scheduled, as I have seen management time frames shrink too. Almost everyone is in quick & dirty 'working' version mode to book a success and move on. With this culture, if you try to push resolving these 'technical debt' type issues you are just perceived as a downer. </rant>


I've heard that modern processors are designed with the expectation (or even a requirement) that no more than 10% of the chip will be active at once. In addition, there are power density constraints (even those 10% of the chip that can be active at once must not be concentrated in the same spot).


Nope, but processors, especially larger ones, have pretty advanced power and clock management which allows them to deal with power envelope and thermal envelope constraints. E.g. the two clocking and core voltage ladders in Xeon E5/E7 for AVX vs. non-AVX code. (Though this particular example is a bit of a sledge-hammer approach).


You can't really do loop unrolling and constant folding as some suggest because the number of iterations is determined at run time.

But assuming his CPU takes one cycle for an addition and one cycle for a conditional jump, his CPU only needs to run at 1GHz to achieve his result.

Given that branch prediction should be nearly perfect for such a simple and short loop, modern x86 CPUs doing at least 4 integer additions per cycle, and typical CPU speeds in the range of 2-4 GHz, a properly optimized version should be nearly an order of magnitude faster.

So either more aggressive compiler flags and maybe SIMD intrinsics, or hand written assembly (easy here, not so easy in the real world)


> You can't really do loop unrolling and constant folding as some suggest because the number of iterations is determined at run time.

Huh?


That question is a bit unspecific :D

>NUMBER = atoi(argv[1]);

argv is populated at runtime based on what the OS passed you.

But you are right if you meant that my statement is too strong: you can't do arbitrary loop unrolling. You can jump to a loop that's unrolled once if the number is even, etc.

There are certainly also optimizers that will just output the input without any loop, but that's beyond anything I would expect of a current production compiler.


Well, you could do arbitrary loop unrolling - with Duff's Device:

https://en.wikipedia.org/wiki/Duffs_device


If you are going from 0 -> NUMBER and adding one to a variable every iteration, what will the variable's final result be? NUMBER. The compiler can see this and will just do the following:

    s = i = atoi(argv[1]);
Now the loop is gone.

Edit: This will still happen even if you're doing other stuff in there. GCC will attempt to inline and optimize like this. If it is impossible then you will see a loop with most of the stuff factored out of it.


GCC with -O3 might try to unroll this loop into a single constant. O(0) is pretty fast.


That's very likely. Clang 8.0 collapses the loop with any optimizer setting other than -O0.

But if you had to do it manually, loop unrolling and SIMD instructions (although not part of standard C) would be good bets and can probably get you an order of magnitude.


Clang 8.0?


Apple Clang version number is approximately the same as the Xcode version number, and not identical to the Clang release it's based on.


So what version of upstream Clang is Xcode 8.0 based on?


GCC, as expected, with -O2 or higher:

        sub     rsp, 8
        mov     rdi, QWORD PTR [rsi+8]
        mov     edx, 10
        xor     esi, esi
        call    strtol              <-- atoi
        xor     eax, eax            <-- zero return value
        add     rsp, 8
        ret                         <-- return from main
Interestingly enough it does not elide the entire function body.


How do you get objdump -S to print out non AT&T syntax? My output looked like this:

    000000000400400 <main>:
      400400:	48 83 ec 08          	sub    $0x8,%rsp
      400404:	48 8b 7e 08          	mov    0x8(%rsi),%rdi
      400408:	ba 0a 00 00 00       	mov    $0xa,%edx
      40040d:	31 f6                	xor    %esi,%esi
      40040f:	e8 dc ff ff ff       	callq  4003f0 <strtol@plt>
      400414:	31 c0                	xor    %eax,%eax
      400416:	48 83 c4 08          	add    $0x8,%rsp
      40041a:	c3                   	retq   
      40041b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)

Which I'm not a fan of. I've always wanted "add (store), (get)" but could never get that.


Not on a development machine right now, so I just went to godbolt ;)

https://godbolt.org/


Use the '-M intel' option.


-M intel


ITYM "O(1)".

"O(0)" is pretty much meaningless.


I mean O(0) because no operation occurred. The only operation that will occur is parsing the data, which is not the problem specified. When looking at the loop, and just the loop, O(0) is the correct answer for this.


What an excellent teaching pattern - you're far more likely to remember what you learned if you first stop to think and record your own guess, and this is excellent UI and UX for doing that routinely and inline.


This page from the New York Times is another great example of teaching through interaction: https://www.nytimes.com/interactive/2017/01/15/us/politics/y...


This is awesome. The real lesson here is, when you make a thing, compare its performance to these kinds of expected numbers and if you're not within the same order of magnitude speedwise, you've probably screwed up somewhere.

My favorite writeups are the ones that gloat about achieving hundreds of pages served per second per server. That's terrible, and nobody today even understands that.


Don't some of these examples run in O(1) time because the value in the loop isn't used? E.g in the first example 0 is returned instead of the sum.

Obviously we are talking about real world c compilers with real world optimizations so presumably we'd have to also consider whether the loop is executed at all?


That's nothing. Here's code that does 77 GFLOPS on a single Broadwell x86 core. Yes, that's 77 billion operations per second.

http://pastebin.com/hPayhGXP


Interesting that that code looks more like Assembly than C to me.


This reminds me of "Latency Numbers Every Programmer Should Know"

https://gist.github.com/jboner/2841832

Edit: Just realized halfway through that there's already a link to this from their page!


The `bcrypt` question seems out-of-place. It has a configurable cost parameter, so almost any of the answers is correct.


Hard to believe there are 124 comments here and nobody has brought up Grace Hopper's talk[0][1] yet. With good humour she gives an example of what various devices' latencies are, and a simple tool to comprehend the cost and orders of magnitude.

  [0] short - https://www.youtube.com/watch?v=JEpsKnWZrJ8
  [1] long - https://www.youtube.com/watch?v=ZR0ujwlvbkQ


One second on what?

A Core i7? A raspberry Pi? A weird octo-core dual-speed ODROID? An old i915-based Celeron? My cell phone? An arduino?

"Your computer" has meant all the above to me, just in the last few weeks. The author's disinclination to describe the kind of hardware this code is running on -- other than "a new laptop" -- strikes me as kind of odd.


I'm curious to see the data collected on guesses. Some were quite difficult to guess, like hashes per second with bcrypt not knowing the cost factor, but I guess we can assume some sane default.

I would have really liked to see all these numbers in C, and other languages for that matter. Perhaps add a dropdown box to select the language from a handful of options?


This reminds me of this email from LuaJIT's list:

Computers are fast, or, a moment of appreciation for LuaJIT https://groups.google.com/forum/#!msg/snabb-devel/otVxZOj9dL...


Brilliant! I'd like to see those numbers summarized somewhere though, a bit like the latency numbers every programmer should know: https://gist.github.com/jboner/2841832 (visual: https://i.imgur.com/k0t1e.png)


Computers are fast unless your algorithm is quadratic or worse, then there's no computer to help you.


I've come across this "article" before; I feel like I remember it under a different title, like "language speed differences" or something. Or maybe that's another article by the same author/site/format.


Why isn't the first Python loop (that does nothing but pass) optimised away completely?


The first example (the sum in C) is certainly optimized away. I'd be careful drawing too many conclusions about these examples because the code is dodgy (doesn't use sums etc), the optimizations are probably incorrect (reasonable compiler flags were added after the results were generated).

You could think of them as thought experiments e.g. "how many ADD's can we make on a single thread on an average PC?" rather than "What would the runtime of this C program be on an average PC?". Since the results were generated without optimizations, at least for the C programs, there is not much point in talking about runtime.


CPython only does very few optimizations on the byte code. Most optimizations are in the interpreter itself.


Because the Python interpreter doesn't do that optimization.

(PyPy might, I don't know.)


Compilers often don't optimise loops away because they feel that the user put them there for a reason -- after all they are so obvious that the user could have removed them herself.

One use for such a loop is a delay -- not so common nowadays, but it used to be a mainstay of DOS based games etc.

I bet that if GCC started aggressively optimising out empty loops, it would interact with some subtlety of concurrency to break a spinlock or three in various kernels.


Compilers can only optimize away loops if they know that the loop has no side effects. Compilers like GCC have to have a list of standard functions that they know are pure.

This can trip you up sometimes. For example, if you do something like use memset to zero out sensitive data when you're done using it, the compiler can say "the result of memset is never used, and the reference to that memory is lost, so there's no need to actually make the call" and you're left with secrets in memory.


Forgive me if I don't know what I'm talking about here, but could volatile be used here to force the compiler to perform the memset?


You're correct. (https://barrgroup.com/Embedded-Systems/How-To/C-Volatile-Key...)

But again, you're telling the compiler that you know what you're doing, and to not perform its normal optimization. I'm not well-versed in compiler design, but I think that's the tradeoff you have to make between trusting the user's design choices, or not.


The grep example should search for one character. Grep can skip bytes so that longer search strings are faster to search for. On my machine, I get from 22%-35% more time taken if I changed "grep blah" to "grep b".


Or "how fast can one of my 8 CPU cores run a for loop?" To put that in perspective: all 8 cores together give me about 40gflops. I have 2 GPUs that each give me more than 5000gflops.


Anyone care to rewrite these in C#? I am really surprised how fast these Python scripts are, and I would like to see a comparison with equivalent tasks in C# to see where it stands.


Was disappointed to find that nearly all the examples were Python and shell script. I'm not interested in knowing random trivia about how slow various interpreters are.


The last question seems really misleading. Most modern CPUs have a cache size of 8MB (max), yet the answer is >300MB?


Or running Windows Vista you can right click and display a menu in one second plus about 29 other seconds.


Well, my computer won't display an image apparently inserted with JavaScript, although it could if I wanted to grant execute privileges on it to computers-are-fast.github.io

Does anyone have a link to the image(s)?


> GitHub Pages is temporarily down for maintenance.

Ironic?


NVMe SSD can be up to 10X faster than SATA.


or "computers are fast, so we might just slow things down by using python for numerical calculations"


This could make a pretty good hiring test. Not expecting perfect answers, but a rough correlation with the results, and some good explanations.


Edit: I'm an idiot


Where's the typo?

A millisecond is one thousandth of a second. If you can run something 68 million times in a second, you can run it 68 thousand times in a millisecond. And that's what the text says.


68M per second == 68K per millisecond



