This is one of them.
Rotating images that way isn’t a CPU-bound problem. It’s a RAM-bandwidth and RAM-latency bound problem. The optimal data structure for that is not recursive; it’s a 2D array of small square blocks of pixels. If pixels are bits, I’d bet on either 8x8 or 16x16 blocks. If they are RGB, one good format for those images, optimized for rotations, is BC7 https://www.opengl.org/registry/specs/ARB/texture_compressio... the decompressor being implemented right in the graphics hardware.
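To make the block idea concrete, here's a minimal sketch of a cache-friendly 90-degree rotation that walks the image tile by tile instead of striding across whole rows; the 16-pixel tile size and `int` pixels are my assumptions for illustration, not anything from the comment above:

```cpp
#include <cstddef>
#include <vector>

// Rotate a w x h image (row-major) 90 degrees clockwise into an h x w output.
// Processing small square tiles keeps both source reads and destination
// writes within a few cache lines at a time.
std::vector<int> rotate90(const std::vector<int>& src, size_t w, size_t h) {
    const size_t T = 16; // assumed tile size; tune to cache line / pixel size
    std::vector<int> dst(w * h);
    for (size_t by = 0; by < h; by += T)
        for (size_t bx = 0; bx < w; bx += T)
            for (size_t y = by; y < by + T && y < h; ++y)
                for (size_t x = bx; x < bx + T && x < w; ++x)
                    // (x, y) in the source maps to (h - 1 - y, x) in the output
                    dst[x * h + (h - 1 - y)] = src[y * w + x];
    return dst;
}
```

The naive loop touches the destination in column order, causing a cache miss per pixel for large images; tiling bounds the working set regardless of image size.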
> If we were writing an image manipulation application, we’d provide much snappier behaviour using coloured quadtrees to represent images on screen.
That statement was true 30 years ago; it’s no longer the case. Branch misprediction is expensive, memory latency is huge, and RAM is effectively a block device now (the block size typically being 16 bytes for dual-channel DDR). In addition, some displays now show 8+ megapixels, and most show 16M colors.
For an image manipulation application, that recursive thing ain’t good at all.
> That statement was true 30 years ago; it’s no longer the case.
The colored quadtrees are an optimization over the plain quadtrees earlier in the article, reducing the number of comparisons for some images. It will be just as much of an optimization 300 years from now as it was 30 years ago. In context it is perfectly obvious that "much snappier" does not refer to the totally different techniques used by actual graphics software.
It's a toy implementation used as an example, if you get hung up on the example, you miss the whole point.
When to use such techniques, and whether they make sense in the real world performance-wise, should have been part of the "whole point" all along.
You should ask whether these techniques are useful in writing correct code and if they save developer effort, rather than picking apart irrelevant aspects of the toy example.
I don't think modern graphics code represents individual pixels as strings either, would you dismiss the post for that reason?
If the image is very large and contains large contiguous regions of solid colors, then a quadtree that doesn't subdivide homogeneous regions can be a much faster approach than a flat pile-o-pixels.
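As a minimal illustration of that point (the node layout here is my own assumption, not code from the article): a quadtree that refuses to subdivide homogeneous regions represents an arbitrarily large solid-color area as a single leaf, so work scales with the number of distinct regions rather than the number of pixels.

```cpp
#include <array>
#include <cstdint>
#include <cstddef>
#include <memory>

// A quadtree node: either one solid color for the whole region,
// or four children (NW, NE, SW, SE) covering its quadrants.
struct QuadNode {
    bool uniform = true;   // true: the whole region is a single color
    uint32_t color = 0;    // valid when uniform
    std::array<std::unique_ptr<QuadNode>, 4> kids; // used when mixed
};

// Count the leaves covering the region -- a stand-in for "work done".
// A gigapixel of solid white is a single leaf.
size_t leafCount(const QuadNode& n) {
    if (n.uniform) return 1;
    size_t total = 0;
    for (const auto& k : n.kids)
        if (k) total += leafCount(*k);
    return total;
}
```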
Either way, this article isn't about rotating images.
Just not the quad-tree way described in the article.
There are ways to implement hierarchical structures that work well on modern hardware. One is a hashmap holding dense 2-bits-per-pixel blocks (again, 16-64 bytes per block), with the values saying filled/empty/mixed. Another is B-trees or B+ trees.
I am not convinced HashLife is still the best of them.
The invention dates from the early '80s, and many old heuristics about caching versus calculation are no longer true today.
Here’s my article on the related topic, lookup tables: https://github.com/Const-me/LookupTables
If I wanted to implement an efficient Life simulator, I’d probably use the fact that with AVX2, it’s possible to count those neighbors for an 8x8 block of pixels in a single __m256i register (with 8 neighbors max, 4 bits / value is more than enough). That might be faster than an L3 cache access, and will definitely be much faster than a main memory access.
That’s why I’d put those pixels in 8x8 blocks, a single uint64_t value per block.
Putting those blocks in a hash table takes care of the infinite field. std::unordered_map will work, but if you care about performance, prefer CAtlMap from Microsoft, dense_hash_map from Google, or something similar.
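A sketch of that layout with plain std::unordered_map (the key encoding and floor-division helpers are my assumptions; a real implementation would swap in a faster map as noted above):

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>

// Block coordinates on the infinite plane; each block holds 8x8 cells
// packed into one uint64_t, bit (y*8 + x) within the block.
struct BlockKey {
    int32_t bx, by;
    bool operator==(const BlockKey& o) const { return bx == o.bx && by == o.by; }
};
struct BlockKeyHash {
    size_t operator()(const BlockKey& k) const {
        return std::hash<uint64_t>()((uint64_t(uint32_t(k.bx)) << 32) | uint32_t(k.by));
    }
};

using Field = std::unordered_map<BlockKey, uint64_t, BlockKeyHash>;

// Floor division/modulo so negative coordinates map correctly --
// this is what makes the field genuinely unbounded in all directions.
static int32_t fdiv8(int32_t v) { return v >= 0 ? v / 8 : (v - 7) / 8; }
static int32_t fmod8(int32_t v) { return v - fdiv8(v) * 8; }

void setCell(Field& f, int32_t x, int32_t y) {
    f[{fdiv8(x), fdiv8(y)}] |= uint64_t(1) << (fmod8(y) * 8 + fmod8(x));
}
bool getCell(const Field& f, int32_t x, int32_t y) {
    auto it = f.find({fdiv8(x), fdiv8(y)});
    return it != f.end() && ((it->second >> (fmod8(y) * 8 + fmod8(x))) & 1);
}
```

Blocks that are entirely empty can simply be erased from the map, so memory usage tracks the live area of the pattern rather than its bounding box.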
But the big issue with HashLife is not whether you can write a faster implementation, it's whether the particular pattern(s) you are exploring are amenable to HashLife. It wins for highly repetitive patterns, but is much slower for patterns with a very large amount of entropy in time and space, like methuselahs.
HashLife can be fabulous for exploring purpose-built machines that cycle though regular states.
All x86 CPUs manufactured after 2003 have at least 16 128-bit SIMD registers. Using them for calculations is the only way to get anywhere near advertised CPU performance.
However, besides the fact that not all problems have enough redundancy in time+space to suit it, it also suffers from the fact that it is most efficient when computing massive jumps in generations. So it’s not as performant for creating animations as it is for doing things like finding the maximum size of a pattern, calculating its growth rate, and so forth.
In other words, if the image is not very image-y? (but more vector-y).
For every complex problem there is an answer that is clear, simple, and wrong.
Is a web page that loads in fifty seconds 'correct' or is it wrong? Is a video format that can only decode 2 frames per second correct, or wrong?
Often, timeliness is an unstated part of the requirements, in which case an academically pleasing solution that isn't practical doesn't fulfill the requirements.
The implementation of a recursive data structure or algorithm can usually take advantage of tail recursion or trampolining. The practical implementation of a theoretical algorithm may look extremely different, much like you wouldn't literally translate abstract pseudocode to the language of your choice.
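To show one shape that transformation can take (factorial is just a stand-in example of mine, not something from the thread): a trampoline turns each tail call into a plain state update inside a loop, so the stack stays flat no matter how deep the "recursion" goes.

```cpp
#include <cstdint>

// Explicit state for a tail-recursive factorial:
// fact(n) = factLoop(n, 1), factLoop(n, acc) = factLoop(n - 1, acc * n).
struct FactState {
    uint64_t n, acc;
};

bool factDone(const FactState& s) { return s.n <= 1; }

// One "bounce": the work of a single tail call.
FactState factStep(FactState s) { return {s.n - 1, s.acc * s.n}; }

// The trampoline itself: iterate over states instead of recursing.
uint64_t factorial(uint64_t n) {
    FactState s{n, 1};
    while (!factDone(s))
        s = factStep(s);
    return s.acc;
}
```

The recursive definition and the loop compute the same thing; only the use of the machine stack differs, which is exactly the "extremely different practical implementation" point above.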
If your condition of satisfaction is "I am presenting information to someone who will read it", then "loads in fifty seconds" is wrong, yes.
Of course it is correct - this is how the internet was/is with a 14k or 56k dialup modem, I don't blame you for not remembering/not experiencing it. I remember watching JPEGs render progressively from a blurry mess one at a time, or sometimes not at all. It took way more than 50 seconds to load and render all elements on a typical website.
There's also the notion of a wrong or right solution for a particular problem (which the phrase above already assumes). E.g. for a game where things have to run at 60 Hz, a slow algorithm would be the wrong solution to use.
The reverse is also true: a solution can return wrong results in the absolute sense (e.g. false positives) and still be correct for our business case. This is the case with things like Bloom filters for example.
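A toy Bloom filter makes that concrete (the 1024-bit size and 3 hash functions are arbitrary choices of mine): `mayContain` can return true for an item that was never added, yet it never returns false for one that was, and for many business cases that one-sided guarantee is all "correct" needs to mean.

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

class Bloom {
    std::bitset<1024> bits; // assumed filter size
    // Derive 3 hash positions by salting std::hash; a real filter
    // would use independent hash functions.
    size_t pos(const std::string& s, size_t seed) const {
        return std::hash<std::string>()(s + char('A' + seed)) % bits.size();
    }
public:
    void add(const std::string& s) {
        for (size_t i = 0; i < 3; ++i) bits.set(pos(s, i));
    }
    // May report false positives; never false negatives.
    bool mayContain(const std::string& s) const {
        for (size_t i = 0; i < 3; ++i)
            if (!bits.test(pos(s, i))) return false;
        return true;
    }
};
```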
I have been doing some data structures/algorithms review as I haven't really used much of that knowledge from my CS degree and, frankly, wasn't that great of a student when I was learning it. I've obviously focused mostly on understanding theoretical performance (Big-O) as I've been implementing and playing with things, but I'd like to know about good resources for things to consider about performance on real systems. I hear about things like cache misses, memory layouts and branch mispredictions, but don't really know how to apply that knowledge or, maybe more importantly, measure the effects of optimizing for some of those things over others. I'd appreciate any advice anyone has to offer.
It might stop you from doing some things, like bubble-sorting 1GB datasets, or inserting many elements to the start of a large array.
However, in other cases it’s just fine to bubble-sort (for very small collections), or insert many elements at the start of the array (when you’re reading much more often, so the profit from memory locality while reading outweighs the losses from relocating those elements while inserting).
Modern hardware is extremely complex. No amount of knowledge will help you 100% predict the performance effects of your changes. Two pieces of advice:
(1) Learn to profile your code. Both externally, using a sampling profiler that will tell you where you’re spending the CPU time, and internally, using e.g. std::chrono::high_resolution_clock to measure the performance effect of your changes.
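For the internal kind, a minimal timing helper along these lines is enough to compare two versions of a hot function (the helper name and shape are my own; a serious micro-benchmark would also repeat runs and guard against the compiler optimizing the work away):

```cpp
#include <chrono>
#include <utility>

// Time one call of f in microseconds using high_resolution_clock.
template <class F>
double timeMicros(F&& f) {
    auto t0 = std::chrono::high_resolution_clock::now();
    std::forward<F>(f)();
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}
```

Usage is just `double us = timeMicros([&]{ candidateImplementation(data); });` before and after a change; the comparison matters more than the absolute number.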
(2) Don’t forget about the rollback feature of your version control, and don’t hesitate to use it as necessary. I routinely discover that some of the changes I viewed as performance optimizations actually decreased performance. The reasons include branch misprediction, compiler optimization issues, cores competing for the L3 cache, RAM bandwidth, thermal issues…
I can understand those people. HTTP is better than Web Sockets for large chunks of data. With HTTP, servers and proxies may cache, clients may resume downloads, clients may check for changes with the If-Modified-Since header, and so on.
The problem here isn’t with big-O, it’s with a changing world.
If someone sends gigabytes in that event in the future, the current version will break despite being O(n), because RAM is finite. That’s relatively easy to fix: you just need to save the data to a hard drive instead of RAM.
Does that mean it’s a good idea to do so now? I don’t think so; it would be over-engineering and premature optimization, and the current version is fine for now. Just like the original O(n^2) code was fine for small messages.
Any chance you know of any benchmarks exploring this?
However, it seems that you could use the general recursive definition to find some optimizations that one could do. For example, define a recursive structure to break down a problem until it is small enough that a more traditional method works. (So, don't drop down to the size of a pixel, but do drop down till you have things at cache line size?)
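That is essentially the cache-oblivious recipe. A sketch of it for matrix transposition (the 32-element base case is an assumed cutoff; the recursion needs no cache-size knowledge at all, which is the appealing part):

```cpp
#include <cstddef>
#include <vector>

// Recursively transpose the rows x cols sub-block of an n x n row-major
// matrix a into b, splitting the longer dimension until the tile is small
// enough to handle with plain loops.
void transposeRec(const std::vector<double>& a, std::vector<double>& b,
                  size_t n, size_t r0, size_t c0, size_t rows, size_t cols) {
    const size_t BASE = 32; // assumed base-case tile size
    if (rows <= BASE && cols <= BASE) {
        for (size_t r = r0; r < r0 + rows; ++r)
            for (size_t c = c0; c < c0 + cols; ++c)
                b[c * n + r] = a[r * n + c];
    } else if (rows >= cols) {
        transposeRec(a, b, n, r0, c0, rows / 2, cols);
        transposeRec(a, b, n, r0 + rows / 2, c0, rows - rows / 2, cols);
    } else {
        transposeRec(a, b, n, r0, c0, rows, cols / 2);
        transposeRec(a, b, n, r0, c0 + cols / 2, rows, cols - cols / 2);
    }
}
```

Because the subdivision halves the problem until it fits in *some* level of the cache hierarchy, the same code behaves reasonably across machines with different cache sizes.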
Perusing the Wikipedia article on Strassen, I see that it scoffs at a ~10% benefit for large matrices. Curious how that holds in light of "deep learning" environments, especially since 1k x 1k matrices are probably not that uncommon nowadays. Seems like every little bit would help.
Edit to add: Thinking of the context of the original post, yeah... probably not needed for anything in a "snappy" user application. I was thinking more of very large data sets.
See also : https://www.quora.com/What-algorithm-does-BLAS-use-for-matri... https://github.com/flame/how-to-optimize-gemm/wiki#notice-on... http://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf
My interest here is really just if that is still a good assumption. With the multiplications that happen in "deep" fields, I'm curious if there are now reasons to look to these algorithms.
If that's changed, for example if you're computing over a different ring where arithmetic ops are expensive compared to cache misses and there are no stability issues, then yes, the issue is back open and Strassen is put back on the table.
That is, I certainly was not trying to second guess what you are saying. Just noting my curiosity on whether or not this has, or will, change. Knuth's view that I'm quoting is from Vol 2, which is pretty old at this point. I'd be a little surprised if it was outdated. But not shocked.
At least it's at a non-practical size.
Granted, I confess most of this is almost always navel gazing. Advances in the "best" sorting/map algorithms have probably not led to any noticeable improvements in modern codebases. Usually, those benefit more from picking the low-hanging fruit than from the most computationally complex algorithms. (That is, I'm sure in these large matrix problems, there are other low-hanging fruit that will yield better improvements.)