Hacker News new | past | comments | ask | show | jobs | submit login

Apparently, every computer science problem has easy to understand solution that’s easy to implement but awfully slow in practice.

This is one of them.

Rotating images that way isn’t CPU bound problem. It’s RAM bandwidth and RAM latency bound problem. The optimal data structure for that is not recursive, it’s 2D array of small square blocks of pixels. If pixels are bits, I’d bet on either 8x8 or 16x16 blocks. If they are RGB, one good format for those images, optimized for rotations, is BC7 https://www.opengl.org/registry/specs/ARB/texture_compressio... the decompressor being implemented right in the graphics hardware.

> If we were writing an image manipulation application, we’d provide much snappier behaviour using coloured quadtrees to represent images on screen.

That statement was true 30 years ago, no longer a case. Branch misprediction is expensive, memory latency is huge, RAM is a block device now (the block size typically being 16 bytes for dual-channel DDR). In addition, some displays now show 8+ megapixels, and most show 16M colors.

For an image manipulation application, that recursive thing ain’t good at all.

>> we’d provide much snappier behaviour using coloured quadtrees

> That statement was true 30 years ago, no longer a case.

The colored quadtrees are an optimization over the plain quadtrees earlier in the article, reducing the number of comparisons for some images. That will be more efficient 300 years from now just as much as it was 30 years ago. In context it is perfectly obvious that "much snappier" does not refer to the totally different techniques used by actual graphics software.

It's a toy implementation used as an example, if you get hung up on the example, you miss the whole point.

>It's a toy implementation used as an example, if you get hung up on the example, you miss the whole point.

When to use such techniques, and whether they make sense in the real world, performance wise, should have been part of the "whole point" all along.

By that argument, any blog post that does not contain a comprehensive software development education can be criticized.

You should ask whether these techniques are useful in writing correct code and if they save developer effort, rather than picking apart irrelevant aspects of the toy example.

I don't think modern graphics code represents individual pixels as strings either, would you dismiss the post for that reason?

I don't think the point of the article was to present an efficient and novel solution for image rotation. It was just a nicely visual example of a recursive data structure.

> The optimal data structure for that is not recursive, it’s 2D array of small square blocks of pixels. If pixels are bits, I’d bet on either 8x8 or 16x16 blocks.

If the image is very large and contains large contiguous regions of solid colors, then a quadtree that doesn't subdivide homogeneous regions can be a much faster approach than a flat pile-o-pixels.

Either way, this article isn't about rotating images.

Well, hierarchical representation of those images is indeed a good idea for some data. Saves both RAM space and bandwidth, and CPU time.

Just not the quad-tree way described in the article.

There are ways to implement hierarchical structures that work well on modern hardware. One is hashmaps keeping dense 2 bits per pixel blocks (again, 16-64 bytes per block) with the values saying filled/empty/mixed. Another one is B-trees or B+ trees.

How would you improve on the quad-tree representation used to implement hashlife? http://www.drdobbs.com/jvm/an-algorithm-for-compressing-spac...

I wouldn’t.

I am not convinced hashlife is still the best of them. The invention is from early 80-s. Many old heuristics about caching versus calculation is no longer true today. Here’s my article on the related topic, lookup tables: https://github.com/Const-me/LookupTables

If I wanted to implement efficient life simulator, I’d probably used the fact that with AVX2, it’s possible to count those neighbors for 8x8 block of pixels in a single __m256i register (with 8 neighbors max, 4 bits / value is more than enough). Might be faster than L3 cache access, and will definitely be much faster than main memory access.

That’s why I’d put those pixels in 8x8 blocks, single uint64_t value per block.

Putting those blocks in a hash table takes care about the infinite field. std::unordered_map will work, but if you care about performance prefer CAtlMap from Microsoft, dense_hash_map from Google, or something similar.

I believe most modern implementations, including hybrid implementations of HashLife, take advantage of register and cache sizes. HashLife as I might implement it in JavaScript, is really a very naïve approach. I think Bill Gosper even had versions in the 1970s that did things with registers and assembly, not just doing everything in lists.

But the big issue with HashLife is not whether you can write a faster implementation, it's whether the particular pattern(s) you are exploring are amenable to HashLife. It wins for highly repetitive patterns, but is much slower for patterns with a very large amount of entropy in time and space, like methuselahs.

HashLife can be fabulous for exploring purpose-built machines that cycle though regular states.

Looking at golly-2.8-src. At the first sight, it does not use SIMD. Only uses 32-bit registers in 32-bit builds, 64-bit registers in 64-bit builds.

All x86 CPUs manufactured after 2003 have at least 16 128-bit SIMD registers. Using them for calculations is the only way to get anywhere near advertised CPU performance.

For languages other than C and C++, like JavaScript, where there’s no SIMD instructions, calculations are generally slower, but the hash tables are provided by the platform and hence implemented in highly-optimized C code, HashLife should be perfectly fine.

For HashLife, I think the chief problems are around an efficient cache, and especially efficient canonicalization. The algorithm is so fast for universes that suit its domain that I think most other considerations drop by the wayside.

However, besides the fact that not all problems have enough redundancy in time+space to suit it, it also suffers from the fact that it is most efficient when computing massive jumps in generations. So it’s as not as performant for creating animations as it is for doing things like finding the maximum size of a pattern, calculating its growth rate, and so forth.

For JavaScript you might do quite well with a 32-way bit-slice implementation, like http://fanf.livejournal.com/93032.html That is, if you can't do it on the GPU with WebGL :-)

>If the image is very large and contains large contiguous regions of solid colors

In other words, if the image is not very image-y? (but more vector-y).

It's an entirely plausible scenario. It will rarely be the case with photographs, but graphical images often feature large solid color blocks. It's large reason why the PNG file format exists.

You might still want the precision and detail at the boundaries of the shapes that a raster representation would give you. Think big blotchy cow spots or something along those lines.

   For every complex problem there is an answer that is clear, simple, and wrong.

A slow solution isn't wrong. Just inefficient.

We're talking about several orders of magnitude here.

Is a web page that loads in fifty seconds 'correct' or is it wrong? Is a video format that can only decode 2 frames per second correct, or wrong?

Often, timeliness is an unstated part of the requirements, in which case an academically pleasing solution that isn't practical doesn't fulfill the requirements.

I think you're missing the point. A quadtree is a compact way to store a partition of two-dimensional space, much like you'd store a binary heap in an array.

The implementation of a recursive data structure or algorithm can usually take advantage of tail recursion or trampolining. The practical implementation of a theoretical algorithm may look extremely different, much like you wouldn't literally translate abstract pseudocode to the language of your choice.

>Is a web page that loads in fifty seconds 'correct' or is it wrong? Is a video format that can only decode 2 frames per second correct, or wrong?

If your condition of satisfaction is "I am presenting information to someone who will read it", then "loads in fifty seconds" is wrong, yes.

> Is a web page that loads in fifty seconds 'correct' or is it wrong?

Neither: it's probably using JavaScript to perform an AJAX request to get some data in a bespoke markup format, then using JavaScript to parse that markup format and build DOM entries, all because the developer thinks HTML is 'so 90s, dude!'

> Is a web page that loads in fifty seconds 'correct' or is it wrong?

Of course it is correct - this is how the internet was/is with a 14k or 56k dialup modem, I don't blame you for not remembering/not experiencing it. I remember watching JPEGs render progressively from a blurry mess one at a time, or sometimes not at all. It took way more than 50 seconds to load and render all elements on a typical website.

Depends on the business constraints. Wrong is not just "returns invalid results".

There's also the notion of wrong or right solution for a particular problem (which the phrase above already assumes). E.g. for a game where things have to run in 60hz, a slow algorithm would be the wrong solution to use.

The reverse is also true: a solution can return wrong results in the absolute sense (e.g. false positives) and still be correct for our business case. This is the case with things like Bloom filters for example.

> Apparently, every computer science problem has easy to understand solution that’s easy to implement but awfully slow in practice.

I have been doing some data structures/algorithms review as I haven't really used much of that knowledge from my CS degree and, frankly, wasn't that great of a student when I was learning it. I've obviously focused mostly on understanding theoretical performance (Big-O) as I've been implementing and playing with things, but I'd like to know about good resources for things to consider about performance on real systems. I hear about things like cache misses, memory layouts and branch mispredictions, but don't really know how to apply that knowledge or, maybe more importantly, measure the effects of optimizing for some of those things over others. I'd appreciate any advice anyone has to offer.

That Big-O thing is mostly theoretical. Most useful application is job interviews.

It might stop you from doing some things, like bubble-sorting 1GB datasets, or inserting many elements to the start of a large array.

However, in other cases it’s just fine to bubble-sort (for very small collections), or insert many elements to the start of the array (when you’re reading much more often so the profit from memory locality while reading outweigh the losses from relocating those elements while inserting).

Modern hardware is extremely complex. No amount of knowledge will help you to 100% predict the performance effects of your changes. Two advices.

(1) Learn to profile your code. Both externally, using a sampling profiler that will tell you where’re you spending the CPU time, and internally using e.g. std::chrono::high_resolution_clock that will tell you the performance effect of your changes.

(2) Don’t forget about rollback feature of your version control, and don’t hesitate using that as necessary. I routinely discover some of my changes I viewed as performance optimizations actually decreased performance. The reasons include branch misprediction, compiler optimization issues, cores competition for L3 cache, RAM bandwidth, thermal issues…

For a dose of real-world big-O I recommend following http://accidentallyquadratic.tumblr.com

I don’t think engineers working at Ericsson and Apple (the companies listed in the copyright section of that source file) forgot about big-O. What I think happened, when they wrote that, their assumption was people won’t be using those to pass 16MB blobs. And for small messages, O(n^2) works just fine.

I can understand those people. HTTP is better than Web Sockets for large chunks of data. With HTTP, servers and proxies may cache, clients may resume downloads, clients may check for changes with if-modified-since header, and so on.

The problem here isn’t with big-O, it’s with changing world.

If in the future someone will send gigabytes in that event, the current version will break despite O(n), because RAM is finite. Relatively easy to fix, you just need to save data to a hard drive instead or RAM.

Does it mean it’s a good idea to do so now? Don’t think so, it would be over-engineering and premature optimization, and the current version is currently fine. Just like the original O(n^2) code was fine for small messages.

At face value, I don't actually see where you contradict the article. The recursive nature of this solution is akin to recursive matrix multiplication. Which can be a way to setup memory fetches to stay optimal. (Could... I have no data to say it is.)

Any chance you know of any benchmarks exploring this?

Recursive matrix multiplication is in general a bad idea (Strassen, for example, loses a little bit of stability and is not cache-friendly)

To be clear, I can see why coding it purely recursively would be a bad idea. Same for using linked structures, if that was the concern with the quad tree.

However, it seems that you could use the general recursive definition to find some optimizations that one could do. For example, define a recursive structure to break down a problem until it is small enough that a more traditional method works. (So, don't drop down to the size of a pixel, but do drop down till you have things at cache line size?)

Perusing the wikipedia for Strassen, I see that it scoffs at a ~10% benefit for large matrices. Curious how that holds in light of "deep learning" environments. Especially since 1k x 1k matrices are probably not that uncommon nowdays. Seems like every little bit would help.

Edit to add: THinking of the context of the original post. Yeah... probably not needed for anything in a "snappy" user application. I was thinking more for very large data sets.

The mention of "around 10% or less" is based on a 2005 paper, and even in the past ten years there's been a lot of change in computer architecture (the machines used in the paper were mostly ultrasparcs, and they write "Note that for the (pentium 3 system), we could find no problem size for which Strassen’s is faster than ATLAS’s").

See also : https://www.quora.com/What-algorithm-does-BLAS-use-for-matri... https://github.com/flame/how-to-optimize-gemm/wiki#notice-on... http://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf

That is an interesting note. I remember reading that the lower bound on multiplications was lower due to swapping a lot of them for additions. Knuth indicated that this did not matter in practice, since few people are multiplying large matrices.

My interest here is really just if that is still a good assumption. With the multiplications that happen in "deep" fields, I'm curious if there are now reasons to look to these algorithms.

We might be talking at cross-purposes, my angle is that of dense linear algebra, single or double precision, on NUMA architectures with vector operations, and all respect to Knuth but it requires a different model for the machine.

If that's changed, like for example if you're computing over a different ring where arithmetic ops are expense as compared to cache misses and there's no stability issues, then yes the issue is back open and Strassen is put back on the table.

Knuth seems to be in agreement with you, from my understanding. My comments are only of a general interest and curiosity, not of knowledge. (If that makes sense.)

That is, I certainly was not trying to second guess what you are saying. Just noting my curiosity on whether or not this has, or will, change. Knuth's view that I'm quoting is from Vol 2, which is pretty old at this point. I'd be a little surprised if it was outdated. But not shocked.

Very large matrices is >50kx50k (maybe even bigger). Strassen is then just a tiny bit better when implementing both cache and allocation aware.(Tested in Fortran)

At least it's at a non practical size.

It is always interesting to see just how valuable a "tiny bit" can be, though.

Granted, I confess most is almost always navel gazing. Increases in the "best" sorting/map algorithm have probably not lead to any noticeable improvements in modern codebases. Usually, those benefit more from being the low hanging fruit, not most computationally complex. (That is, I'm sure in these large matrix problems, there are other low hanging fruit that will yield better improvements.)

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact