14,000x Speedup (2015)

dang · on Oct 13, 2020

All: some comments are finding fault with the article and the author in ways that aren't in the intended spirit of HN. When someone made a mistake or didn't find the best answer, it's great to explain why and supply correct information, but please avoid mixing in putdowns, and if your comment is just a putdown, please don't post.

Otherwise we end up with basically this: Person 1: "I am very smart." Person 2:" Rubbish. It is I who am very smart." That is not interesting, regardless of who is smarter. Worse, it all-too-easily turns a community like HN into an unfriendly place.

When in doubt, simply remember that the intended spirit is curious conversation, and then your comments will naturally contain interesting details and insights without the other stuff.

https://news.ycombinator.com/newsguidelines.html

mynegation · on Oct 13, 2020

Algorithms are important but are especially powerful in combination of knowing computer architecture and programming language intricacies.

Many years ago I was asked to look at the program written in C++ that calculated Kendall-tau correlation matrix for a large amount of data. Basically Kendall Tau is a robust replacement for Pearson correlation and it had to be calculated for 0.5M^2 elements and calculation of each element needed calculation of Kendall Tau on two series of numerical data of length N. Both M and N were on an order of few thousand and naive implementation has the complexity O(N^2 x M^2). The program ran for 6+ hours.

After looking at C++ I realized that a lot of time is spent just copying data needlessly (a result of not knowing the difference between passing a vector and reference to it) and needless memory allocations and deallocations. I did a rewrite from scratch, keeping the tests, but not copying anything and allocating a minimum amount of memory only once, employing the cache locality, spreading the calculation with OpenMP and finally - realizing that Kendall Tau can be done in O(M^2 x Nlog(N)) I found that out on my own, just by squinting and realizing that calculation of Kendall tau for two series is essentially sorting, but later found out that e.g. scipy implementation uses the same approach.

The resulting program ran for 4 minutes on a 2-core and 4 threads.

8ytecoder · on Oct 13, 2020

This is at the very least a clever and involved optimization. Let me tell you the story of my similar 10h to 10min fix. We had this cronjob that was supposed to run every hour, detect all changes to the customers and related table and sync with the marketing saas tool. It was written in Rails and took 10 mins at first. As we grew the time taken by this job also grew linearly. To a point where it took 10 hours and we could only sync the data twice a day. As it turns out the job was loading each user and then scanning for changes. We switched it into a somewhat complex sql query and MySQL ran it in a few mins. Add the additional processing and the whole job now took 10 mins.

tetha · on Oct 13, 2020

This is when being the guy knowing a few database tricks is fun.

We had one reporting query that ran for 13 hours or so, in a normal web request. Not very successful, but it tried its best. Running this through explain while thinking of daloks revealed some dependent subqueries. Those are the devil in mysql because they are evaluated per row in the outer query. That takes a lot of time very quickly. Eventually we wrangjangled the dependency into a terrifying group by such that the dependent query could be precomputed and the outer query just joins with the result. Boom, runtime down to 2 minutes.

In another case everyone panicked ... until one of my dudes added an index - which took about a day - and slashed runtimes from 18 hour to 12 minutes. Learning to optimize queries on your RDBMS of choice is very powerful.

karmakaze · on Oct 13, 2020

Saw one of these recently in a query that was ported from PostgreSQL to MySQL. Same thing join-with table-expression to fix it from 6+ hours to seconds. MySQL will sometimes consider a subquery to be 'dependent' even when there are actually no dependent terms and could have been evaluated once.

hrishi · on Oct 14, 2020

+1 for wrangjangling.

fwip · on Oct 13, 2020

My first big win came as an intern, when I reduced the big O complexity of an algorithm written by a (very smart) statistician for our paper to O(n^2) instead of O(n^4). This was just driven by realizing that a couple data lookups in an inner loop were themselves O(n^2).

If I recall correctly, that raised the "feasible" value of K from about 4 to about 14 on the hardware at the time, which was more than enough for our purposes. I wanted to keep optimizing, but my boss correctly reminded me that shipping the paper was more important.

mesarvagya · on Oct 13, 2020

Sounds like classic n+1 problem

stouset · on Oct 13, 2020

Or more likely, the classic “fetch an entire table from the database and work on it locally” problem that seems endemic in code written by people who don’t understand the capabilities of SQL.

Works great in dev with a local sqlite instance and ten rows. Not so great in prod.

hinkley · on Oct 13, 2020

I was asked once in an interview what the hardest part of programming was, and I said it wasn't Order of Complexities, it was the C constant within it. Interviewer was not subtle in thinking that was stupid.

My rationale was - and is - that C is not sexy. It's hard to motivate people to look into it, and it's often hard work because a sort algorithm or a lookup table tend to be compartmentalized, while C issues are diffuse - they're literally everywhere, and they don't always show up on a flame graph. On one project I got a 2x improvement in overall performance by fixing issues that weren't visible in the perf output, and would have been invisible on a flame graph (this was before flame graphs).

C is particularly a bigger issue once you consider Amdahl's law. A 20% slowdown in a linear part of the code can show up as a second-for-second slowdown in response times, while a search algorithm that's only a small part of an overall process, or parallelized, you can make it 10 times faster and might only see a 2% improvement in response times. If you avoid allocating more servers, for a new feature, nobody celebrates your achievement. You have to do it wrong the first time for anyone to notice you.

kadoban · on Oct 13, 2020

I'd argue that none of that is the hardest part of programming, or at least what I'd focus on in terms of what comes up the most and what effects users and developers and suits the most.

Judging by those, the most important hard parts are around planning and estimating work, and correctly understanding when those make no sense, and in validating code for correctness, especially over time.

If we ignore importance and just go with what's the hardest single task: naming things.

hinkley · on Oct 13, 2020

First, there is no one answer here. When we specialize we take on some 'hard parts' and delegate other hard parts to other people. Civilization is predicated on you and I having different definitions of 'difficult'.

But I think they're of a piece. All of these things are "meatspace" problems. We are constantly fighting the temptation to try to intellectualize the whole process of making software because the center of the problem is trying to interact with a machine. Some don't fight that temptation at all.

You see this problem when people fight automation because they expect everyone to execute a manual process perfectly every time, even when the consequences of failure are very high. They insist that everybody white-knuckle everything and think less of you if you don't agree. They victim-blame. When they're the founding or oldest surviving members of a project, you're gonna have a bad time. People who could fix the problem move on to other jobs.

Naming things is really the same class of problem. A bad name calls to mind assumptions that are not relevant, and you are in constant danger of making category errors every time you think about that part of the code. A bad name becomes a sort of attractive nuisance.

The C problem is a matter of motivating people to do the right thing. Because it's difficult to spot, it's an invisible sin. It rarely comes up in code reviews, so nobody will call you out on it, nobody will make you fix it, finding it after the fact is daunting, and the only people who will care (eg sales) are so often the source of other problems that pleasing them is almost a character flaw. If performance is a feature, then constant time regressions are one of our worst games of whack-a-mole (behind security and internationalization, maybe ahead of sane logging).

[ETA:] I'm just a weirdo who likes to tilt at some kinds of problems I find hard, instead of dumping them on someone else.

karmakaze · on Oct 13, 2020

The real problem with naming things is that the wrong 'things' have been isolated, if there was a clear understanding of what was being isolated and why the naming part would be easy(er).

SAI_Peregrinus · on Oct 16, 2020

There are two hard problems in computer science: naming things, cache invalidation, and off-by one errors.

amenod · on Oct 13, 2020

> On one project I got a 2x improvement in overall performance by fixing issues that weren't visible in the perf output, and would have been invisible on a flame graph (this was before flame graphs).

I assume you mean that these improvements would not be visible on a flame chart because they were not in perf output? (assuming one was built from the other?)

Flame graph by definition shows where the time is spent and is just a (great, imho) visualization of this data. Some people seem to have a beef with it, but their issues are invariably tied to the way the original data is gathered.

hinkley · on Oct 14, 2020

This got longer than I intended, but that just illustrates that the rabbit hole is deeper than many people are willing to go.

The data can be wrong in a number of dimensions. In general, code that gets called the most sometimes get underreported. Look at invocation counts. Ask questions.

One, they can clear the CPU cache, causing sibling functions to be overreported (one of many sources of how changing a slow function may bring less of an improvement than expected). This shows up strongest when I‘ve gotten a 10% runtime improvement from fixing a function reported as 5% of total time, or a 10x reduction in time from removing half the calls to a remote service.

Two, duplicate or idiomatic code will be scattered across different parts of the code stack, reducing the apparent magnitude of a pattern of logic below the threshold of attention of most people. Code that is in four places may represent 5% of total time and not get looked at, even if it’s an easy fix. The nickels and dimes add up quickly.

Three, functions that operate near the limits of the clock resolution will be counted properly but the cumulative timing rounded down, again putting it below the noise floor for most people. I’ve made a lot of hay re-sorting the results by invocation count and benchmarking changes to functions that never hit the top 20 list.

And last but not least, functions that create but don’t destroy their own memory allocations end up shedding timing that gets picked up by the recipients, homogenizing the timing graph. In particular in GCed languages, a function that is called in a loop that exhausts most but not all of the free heap will invariably end up stalling out in the next phase of a large calculation.

With all four of these, there are several failures of imagination I commonly saw. A flat timing graph is taken as evidence that it is time to stop optimizing, even if the target improvement has not been met. When the “tent poles” are even, getting people to care how tall they are is challenging. Few people will make six changes to the code to achieve a 10% speed up. The Rust compiler may be the first time I’ve witnessed anyone else brag about a 1% performance gain, other than me. They either don’t do it, or they apologize for achieving so little. 10% is 10%, and I don’t care where you found it, if your code quality is high enough. And perhaps most importantly, speed improvements are multiplicative, not cumulative. Use a time budget the way game devs do. If an interaction takes 10x as long as your target, every method that takes 0.5% of current run time represents 5% of your target. This is, in particular, how you spot duplicated slow code. You ignore them at our peril.

Particularly in the era of QA cycles, using a “zone defense” (optimizing one module at a time, instead of going after tall tent poles) netted a higher rate of return per hour of labor and a much larger cumulative improvement, on its own merits and by increasing the budget allotted to perf. It seems concentrating the changes in one area at a time decreases optimization fatigue. On two projects I kept that initiative alive for more than two years, and I was the one that gave up because I had cycled around the entire system and was finding few things to correct (you only learn new tricks so fast, and they go asymptotic eventually). Everyone came to expect every release to be a bit faster than the previous, and I got feedback that this narrative increased sales, and manager buy-in. People will take a chance on you when they like where you’re going, even if you haven’t gotten there yet. Narrative matters, if they trust you, and usually you have to earn it.

amenod · on Oct 15, 2020

Yes, I agree with you - I just don't see that as a failure of flame charts.

You still need to understand how the data was gathered and what it really means. It is also true that flame charts are more useful on projects that are not yet heavily optimized; they clearly show the parts that are very slow compared to others. And when you don't see anything else that can be optimized, it doesn't mean that everything is optimal - the limit is imagination (and knowledge), not the problem space.

arethuza · on Oct 13, 2020

Out of interest, did they say what they thought the hardest part of programming is?

mynegation · on Oct 13, 2020

Just one hardest part? Could be naming things and off by one errors.

joshuak · on Oct 13, 2020

... and cache invalidation

nullsense · on Oct 13, 2020

...and off by one errors

z3t4 · on Oct 13, 2020

What is the C constant?

vonmoltke · on Oct 13, 2020

If I understand hinkley's point correctly, it's the constant multiplier (and, in some cases, lower-order terms) that conventional complexity analysis handwaves away as irrelevant because it only looks at the limit as input size goes to infinity.

naruvimama · on Oct 13, 2020

I guess the op meant the C as in

Total time = nX + C

No matter how much you optimize X (n here is for instance the number of items), if C is significant then it matters. It is usually harder to optimize the constant C in the OP's experience.

brilee · on Oct 13, 2020

I realized the same thing when trying to compute F1-optimized thresholds; scanning through all possible threshold scores is O(n^2) but computing f1 score inline while scanning through the sorted array is O(n).

https://www.moderndescartes.com/essays/auc_intuition/ is my writeup on why the sorting approach is the most economical way to approach this sort of problem

Phemist · on Oct 13, 2020

Interesting! Thanks for the write-up. In the field of Biometrics we often use the library BOB to compute scores like AUC's and EER's, or other measures over stuff like DET curves.

For some practical work I was doing, BOB (and SK learn) proved too slow for my liking. I am currently unable to provide much more detail, but I used a similar insight to yours (mine was the amortization of sorting costs to have a better average time complexity) in a library I built for calculating empirical, bootstrapped, confidence intervals for EER's.

https://github.com/feerci/feerci

For massive amounts of decision scores, like millions, this method can give you confidence intervals on EER's where no other library currently in use can. Unfortunately, I have not yet found a theoretical basis for this method, and suspect it might actually break down under some trivial smaller cases. Also, DET curves and their derived measures remain a tough cookie to crack.

Considering AUC's and EER's are somewhat tightly related, I'm thinking this method could also be applied to AUC's.

lmm · on Oct 13, 2020

So did you measure how much of the speed improvement came from your careful memory copying tuning and how much came from the algorithmic change?

mynegation · on Oct 13, 2020

At the time, yes, but I do not have those numbers anymore. Trying my best to remember - I would estimate that the speedup of 360 / 4 = 90x is approximately: 4x multithreading (it was pretty linear speedup), 7x the algorithm and 3x the memory inefficiencies.

EDIT: fixed my math, thank you andruby

andruby · on Oct 13, 2020

Sorry to correct your math here, but going from 6 hours (6*60=360 minutes) to 4 minutes is a 90x improvement, not 900x.

ath92 · on Oct 13, 2020

Interestingly, a speedup of 2 orders of magnitude seems to be about what you'd expect if you only apply the algorithmic optimization (going from N^2 to Nlog(N) with N being in the range of a few thousand, like OP said)

mynegation · on Oct 13, 2020

Like I mentioned - my memory is pretty hazy about the specific speedups.

Yes, N^2 to NlogN feels like it should be a bigger factor, but remember that constants in front of this matter. You are replacing 2-level nested but very straightforward loops that rip through cached memory at blazing speed without any disruption to CPU pipelines with a single loop that involves quite a bit of condition checking and array element swaps (cache is busted, pipelines are busted - conditions are very hard to predict). You gain some, you lose some.

TYMorningCoffee · on Oct 13, 2020

Yeah. The commenter only gave the numbers when both the constant factor and the improved algorithm were applied.

It would be interesting to see them applied separately. The commenters claim that both are important has not been shown by his example yet.

However the results are impressive. Great job with the solution.

mumblemumble · on Oct 13, 2020

makes me think of https://ericlippert.com/2020/03/27/new-grad-vs-senior-dev/

monksy · on Oct 13, 2020

That was a great write up explaining about not seeing the forest for the trees. However, it is still infuriating that we're getting these leetcode questions in interviews to see how much you can hack and optimize a toy problem. (As if you had lots of time to think about it and work through it)

mynegation · on Oct 13, 2020

Ha! Great story, thank you! Clearly shows that computational complexity by itself is not very useful without knowing the assumptions. It is one thing to see substring search and immediately pattern match it to something from CS course like Boyer-Moore or Knuth-Morris-Pratt, but these algorithms require pre-processing that makes sense only if you know you are going to search for the same pattern repeatedly. Pure library function cannot make that assumption, and REPNE instruction is just an icing on a cake.

ghusbands · on Oct 13, 2020

> pre-processing that makes sense only if you know you are going to search for the same pattern repeatedly

The pre-processing is typically worth it if you're searching through a large string, even just once.

antisthenes · on Oct 13, 2020

That was a great read for someone who's used VBA to solve business cases, including InStr.

Anecdata, but most of my strings were indeed short (searching for manufacturer SKUs inside arrays of other SKUs).

neves · on Oct 13, 2020

This is a usual problem with C++ and why I hate it. There's a lot going on under the hoods, and you must be really knowledgeable of the language to prevent stupid things. Following some idioms you can really avoid it, but it is useless since your coworkers will fall into the language traps.

vidarh · on Oct 13, 2020

Knowing if you're copying or passing references is essential to understanding most languages that supports both.

It's scary how many people are unaware of it, though.

Not nearly as dramatic as the example above, but I once cut the time spent on page generation for a commercial CMS by 30% my first day in a new job by realising they did excessive new allocation of strings instead of concatenation (which will amortise the allocations by allocating more space than needed in most C++ implementations).

I find it quite disturbing, because, while I prefer to use Ruby these days, C++ documentation is really explicit about guarantees provided by both the language and the standard library with respect to things like memory behaviour and algorithmic complexity.

There's if anything less going on under the hood with C++ that you're not explicitly told about than in most modern languages.

godelski · on Oct 13, 2020

One thing I tell people is that beginning a good C/C++ programmer made me a much better python programmer and makes it substantially easier to pick up other languages. It's not so specific to the language being C as much as it is learning about memory and the what you learn by constantly shooting yourself in the foot because you make dumb mistakes (I still think mistakes are the best way to learn).

neves · on Oct 13, 2020

I agree, but a simple automatically defined copy constructor can spread havoc in a code base.

qayxc · on Oct 13, 2020

I'd argue that not knowing the language you're working with is a problem that's not exclusive to C++.

There's so much more going on under the hood of more high level languages, that writing faster code is straight up black magic [1].

[1] https://stackoverflow.com/questions/28723658/python-why-is-i...

throwaway894345 · on Oct 13, 2020

Yes, you must always know your language, but mastering (for example) Go is a lot easier than getting a cursory understanding of C++ (i.e., enough to write correct code for non-toy applications). You can get a cursory understanding of Go in a few hours and you can master it in 1-2 months. It would be months and years in C++, respectively. C++ is just a much larger, more complex language, so there is a lot more to know. It's not that C++ is inherently bad--some of that complexity allows C++ to be faster and more powerful than Go; however, we can't pretend that the two are equivalent with respect to their ease of learning or number of footguns.

qayxc · on Oct 13, 2020

I agree in general, but knowing the difference between pass-by reference and pass-by value is C++ 101.

Mastering C++ isn't even required to be productive using it, since there's no need to use all the language features simply because they exist. This also avoids the majority of footguns.

throwaway894345 · on Oct 13, 2020

> I agree in general, but knowing the difference between pass-by reference and pass-by value is C++ 101.

I agree. But the scientific computing community includes a lot of people who have no exposure to computing at all, so that vectors are copied when passed by value (as opposed to a fat pointer a la Go slices) is a performance foot gun for these new users. For that matter, a lot of people coming to C++ from any higher level language will likely be tripped on on this for a while.

> Mastering C++ isn't even required to be productive using it, since there's no need to use all the language features simply because they exist.

Right, I was distinguishing between "mastering" and "knowing enough to be productive". I'm asserting that you can master Go in far less time than is required to be merely productive in C++ (granted, I'm assuming "productive" means "you're not regularly introducing memory leaks, segfaults, etc" i.e., "you understand memory management best practices in C++"). Specifically, I think it takes 1-2 months for a newbie programmer to become a master of Go and ~6 months to become merely "productive" in C++ (bear in mind that even figuring out how to build non-trivial C++ programs is itself a herculean feat).

gumby · on Oct 13, 2020

For a systems programming language the part of C++ that most people need to know isn’t huge.

There’s a lot in c++ that only exists for back compatibility and you don’t need to use it. There’s also a lot of infrastructure so library writers can write very efficient generic code. Most people are not writing libraries and don’t need that stuff.

sdenton4 · on Oct 13, 2020

You got all the backwards compatibility stuff, though, if you're code base is older than today. Or if you've got co workers who learned on c++11 instead of c++19...

Also, templates, and the indecipherable errors they throw. And cpp memory handling errors are one of the largest sources of security holes.

I'm convinced that apologising for cpp is basically Stockholm syndrome. These are things we absolutely would not accept in a new language, but we must, because it's what we've got. Rust can't win fast enough, I say...

ncmncm · on Oct 14, 2020

It would be great if a modern language could match C++, but none is on the horizon.

Rust is powerful enough to do a lot of what C++ is used for, but not the hardest things. There are libraries you write routinely in C++ that you can't code in, or call from, Rust, and never will. The gap will widen with time. Rust is focused on low-level safety at the expense of library power, but you get safety in C++ by coding using more powerful and safe libraries, so C++ comes out ahead in the end.

Rust is a way better language than Go or Java. It would be excellent if Rust could displace them, or even just one of them.

If its adoption rate were to increase by maybe two or three orders of magnitude, it could survive. But the Rust core team seem not interested in doing what would be needed to get that much adoption, and hype alone won't get it there.

Rust's place will be secure when more complaints are written about it than paeans.

throwaway894345 · on Oct 13, 2020

I would even say that simply knowing which things are essential and which are "legacy things for backwards compatibility that most people don't need to know about" is nontrivial, and I'm not even sure that there is substantial consensus among the C++ community about exactly what is in the "legacy/compatibility things" bucket (no doubt there's wide spread agreement that some features are in this bucket, but I imagine there are a lot of features that are controversial).

User23 · on Oct 13, 2020

C++ has a famously complex standard though, to the point that, last I checked, there are no fully standards conforming C++ compilers. I know one member of the standards body and he is an excellent software engineer, a subject matter expert, and just wicked smart too, and he still confesses he doesn't really understand C++. There's a reason why Google doesn't actually have C++ as a supported language, but rather a curated subset thereof that's governed by their house style guide.

qayxc · on Oct 13, 2020

I concur, but using and knowing a reasonable subset is enough to be productive.

Rust has its own problems and pitfalls as well, and I doubt that people who don't understand pass-by value vs pass-by reference and its implications will be happy with fighting the compiler over ownership issues.

Implementing a doubly linked list comes to mind - trivial in C++ - a brain twister in Rust and apologists will hide behind statements like "they're almost always the wrong choice anyway" or they're "just used pedagogically" [1].

A cynic could interpret that as Stockholm Syndrome as well.

[1] https://codereview.stackexchange.com/questions/204183/doubly...

mynegation · on Oct 13, 2020

C++, while complicated, is actually pretty transparent about it. The difference between pass-by-value and pass-by-reference is one sigil, but at least you see it there and you have at least some idea that it is there for a reason.

Compare it with lazy evaluation languages like Haskell or declarative languages like SQL where oftentimes you have to run query planner explanation to troubleshoot the performance problems.

milesvp · on Oct 13, 2020

+1 for sql. I once was looking into a slow query and had to know that ordering text field in mysql requires a filesort regardless of how much data was involved. which means touching a disk. Was easy enough to make a schema change to a varchar small enough not to also trigger a filesort. But learning that required several hours of scouring documentation and making sure the schema change wasn’t modifying the requirements significantly.

Also what’s fun is that the data in the tables can change the explain plan. I got to see this myself when dealing with dev vs prod data.

necovek · on Oct 13, 2020

Knowing the basics of CS would surely help in set the expectation for such "SQL" (or rather, RDBMS) behaviour: the problem is the problem of sorting a very large array of strings, which is unsurprisingly a hard one.

The very basic experience with RDBMS would point you at a better solution: indexes. Not sure about MySQL or standard SQL, but with Postgres you can keep an ordered index that will be used in an ORDER BY query. It's a common trick employed in programming in general: trade the update/insert cost for query cost (indexes take time to update, but querying is then fast).

Tomis02 · on Oct 14, 2020

> Also what’s fun is that the data in the tables can change the explain plan.

1. This change tends to be in your favour, not against you.

2. Data is important, not the code. Of course the plan would change if your data changes. That is the point of the query planner, to find the most efficient way of retrieving the data, based of course on the contents of the database.

I agree that you need a good knowledge of your database in order to be effective. Just like with C++, for example. Unfortunately we're not at the point where we can achieve optimal results without understanding the fundamentals, regardless of the programming paradigm.

naruvimama · on Oct 13, 2020

The vast majority of SQL is written or managed by business users. It is a good idea to let the engine optimize the code.

And it is a dsl the planner is expected to be very good. Perhaps the next version will always fix it.

nullsense · on Oct 13, 2020

It can't automagically not do stupid things if you feed it a query that will make it do stupid things. You still have to know certain things about how queries will try to fetch data.

dahfizz · on Oct 14, 2020

I really have to disagree.

There is a lot going on under the hood regardless. That's how computers work. Your Python script has to deal with all the same memory allocation and cache locality issues.

C/C++ just gives you the opportunity to see what is happening under the hood and tune it.

bufferoverflow · on Oct 13, 2020

What he described has nothing to do with C++, but with the inefficient algorithm that was chosen.

formerly_proven · on Oct 13, 2020

There's very little going on under the hood with C++, if it even has a hood.

andi999 · on Oct 13, 2020

Vtable and vtable pointer is under the hood.

not2b · on Oct 13, 2020

Yes, but any C++ programmer who expects to be hired will have a good understanding of vtables and vtable pointers. They are a lot less tricky to reason about than, say, thunks in Haskell.

tome · on Oct 13, 2020

What makes you so sure about that?

ncmncm · on Oct 14, 2020

Those are trivial. And have little effect on performance. And are hardly used in good programs.

In practice they mainly displace function pointers.

andi999 · on Oct 14, 2020

They might destroy your cache lines though.

sdenton4 · on Oct 13, 2020

Meanwhile, there's a perception that machine learning is hardware-bound for speed increases, which just isn't true.

I work on a pretty heavy ML system, which a year ago took about a month to 'fully' train. Today we're down to 25 hours training time using the same hardware, based mostly on improvements to the model. The improvements are coming from a combination of signal processing tricks and tricksy model changes. Inference is similarly much faster, too...

EmbarrassedFuel · on Oct 13, 2020

Sounds interesting, would you be able to share anything about the signal processing tricks?

roter · on Oct 13, 2020

> In climate science we do a lot of downscaling. We take temperature and precipitation readings from a coarse scale Global Climate Model grid and map them to a fine scale local grid. Let’s say the global grid is 50x25 and the local grid is 1000x500. For each grid cell in the local grid, we want to know to which grid cell in the global grid it corresponds.

Climate scientist here. Colour me a little confused. We have a GCM which is (typically) on a structured grid, i.e. with latitudes & longitudes on a pixel-like grid. Finding the GCM (coarse) grid cell for a given latitude/longitude of the downscaled (fine) grid is an integer operation, i.e. invert lat = lat0 + i x delta_lat and lon = lon0 + j x delta_lon. Must be missing something here. Even if there is a strange projection, you can map it such that the deltas are uniform.

Ultimatt · on Oct 13, 2020

You're not wrong. This entire article is a classic example of Computer Science failure. It's someone who knows program optimisation in the abstract and nothing about the actual target problem from a domain knowledge perspective. They went about optimising bad code as in quality of thought level, rather than actually getting to grips with the true requirement first. Though perhaps a computer science win, but a software engineer fail.

dang · on Oct 13, 2020

Please don't post supercilious dismissals of other people or their work, even if they made a mistake and/or were supercilious in their own right. It just contributes to making this a nasty place.

The differences between this comment and the GP comment are significant. The GP included specific information, where this comment is just a putdown, and the GP allowed for the possibility that there is missing information, i.e. that the "author is an idiot" interpretation is not the only possible one.

https://news.ycombinator.com/newsguidelines.html

andi999 · on Oct 13, 2020

I usually agree, but in this case maybe the author needs to hear exactly that. At least the author might the want to stop boasting about these things (and save his future career etc).

dang · on Oct 13, 2020

I appreciate the sentiment, but that is probably better reserved for personal interactions than internet pile-ons. There are lots of factors we don't know. Also, if we're to be honest with ourselves, internet pile-ons don't arise out of compassionate concern for the other. That's a justification that gets tacked on.

It's moot anyhow, because the main reason not to have threads go that way is that it does bad things to the community. Each time it happens, we deepen the pathways towards making HN nastier and more toxic, and since those are the default paths already, we need to consciously cultivate the opposite.

kolbe · on Oct 13, 2020

This was written in 2015. What's worse logic: his code or HN thinking that shit talking him five years later in a forum we don't even know if he sees is constructive?

dang · on Oct 13, 2020

Obviously I agree, but please don't add to the problem by dumping more nastiness into the thread.

Note this guideline also: "Please don't sneer, including at the rest of the community."

It's worth remembering that HN is a statistical cloud of posts, not a person, and so can't think anything.

https://news.ycombinator.com/newsguidelines.html

aaronblohowiak · on Oct 13, 2020

you are a singular individual, dang. thank you for what you do for HN

mlyle · on Oct 13, 2020

In a lot of ways, the original implementation makes a lot more sense than this "optimized" solution. Something like the original brute force, compare all the points' distances variant is how I'd be inclined to write one of my unit tests for the grid alignment routines. Trying a few dozen points on a few dozen grid variants is a way to make sure I've not completely screwed up math-- whether off by one, rounding problems, etc, while applying sanity checks on the overall grid code base.

This binary search "optimized" abomination is... more difficult to read and reason about than either than the integer math or "brute-force" solutions, and slower to boot. All it has going for it is that it's not the absolute slowest choice.

It may not be integer math, because we've not been told if the grid is regular... but it certainly is still easier than this.

projectdelphai · on Oct 14, 2020

I mean maybe but if you told me you could take my code that runs in less than a second and make it more readable but it would take 30 minutes to run instead . . . hell no I'm not taking you up on your offer. I'll just comment and document the existing solution better and call it a day.

mlyle · on Oct 15, 2020

You're missing the point.

The solution here is faster than the absolute naive solution of finding all distances, but it is both slower and less readable than integer math grid conversion..

I could see why you would make the choice of the naive, slow solution. I could see why you would choose the fast, integer math grid conversion one. But I don't see why you would choose something slower than the optimum solution that is also much, much less readable.

That is, the naive solution and the integer math solution are Pareto-optimal choices... and this abomination ain't.

necovek · on Oct 14, 2020

I am not a climate specialist, but I was confused about the same thing: I would fully expect for there to be a bijective mapping between a global grid and a local grid. I mean, lat/lon is a 2d space.

My other comment was going to be that using binary search is not that much computer science. I mean, it's elementary programming knowledge, you don't need a degree to know that.

Marazan · on Oct 13, 2020

Yeah, I genuinely don't get the problem statement. As you say as stated the solution to the problem as stated is a single line piece of basic math.

I can o lump resume the problem is badly described to miss some subtly.

I hope.

danielvf · on Oct 13, 2020

I've upvoted this. Even the "optimized" solution here is so horrible that I almost thought the article was satire when I first read it. Maybe it is.

Even if we ignore the fact that it's grid to grid conversion, (which should just let you math the answer, without any searching) the fact that all the data is sorted means you can do the equivalent of a merge sort and only have to look at a two point for each step.

And how on earth can you take .117 seconds to line up a half a megapixel worth of grid points, even with their "fast" algorithm. They are using the R language which should be fast, right? Are they doing something expensive with it? Or perhaps there is an insane amount of data elsewhere making these lookups super expensive?

dang · on Oct 13, 2020

Please post your improved and correct perspectives without name-calling. Doing otherwise makes the community worse, even if you're right.

https://news.ycombinator.com/newsguidelines.html

danielvf · on Oct 13, 2020

My apologies. While I don't see the name-calling towards an individual, my previous post did mention possible bad faith, and I would not been glad to have my post written about my code. I also slightly misunderstood what the author's code was doing. I can see how it is not conducive to helpful discussion.

The author optimized their code in a way that was fast enough, and fast enough to massively change the entire business process around their code. It's very commendable. Further optimizations may have not been worth it.

dang · on Oct 13, 2020

Appreciated!

forgotpwd16 · on Oct 13, 2020

R in general and especially loops are slow; slower than even Python. Making a function and applying it on a vector is a simple way to boost performance. Also recursion seems[1] to be faster than iteration.

Seeing the code in article, it can be that a part in that speedup is because R is used and not only of a better algorithm.

[1]: https://predictivehacks.com/a-comparison-between-iteration-r...

sillysaurusx · on Oct 13, 2020

Gamedev! One of the best parts of being in gamedev was that the problem space forces you to work this stuff out. People won’t wait 30 minutes for a frame to render.

A few ways to solve this pop to mind. As Aeolun mentions in a parallel comment, you can just divide coordinates if your grid is regular. But it often isn’t.

In that case, you can attack it in a few ways. One time Feynman was giving a lecture and said “This triangle looks a little cockeyed. Obtuse! That’s the word! I know about the triangles but not their names.” And I’m not Feynman, but I did forget the name of this: you build a tree of squares. First one is your whole grid; split it into four smaller squares, those are the children; repeat till your resolution is sufficient. Then you can attach whatever you want to the child nodes (in this case, grid squares). Since it bisects space into halves, you can do a search in O(log n).

But what if you want to go faster? What if it’s absolutely crucial for speed, like a physics engine, to be able to ask “which objects occupy this region of space?”

My favorite algorithm is absolutely delightful: a radix sort on floats. http://www.codercorner.com/RadixSortRevisited.htm

You end up getting O(k*N) time, just like radix sort normally gives you. But the real magic is that it works with bounding boxes. There’s an article from the same author which I can no longer find, detailing it. But the idea is, you sort all your objects in one spatial dimension first, say along the X axis, and then you don’t need to do anything to search quickly. No trees. Suppose you want to ask “does this region of the X-axis have any objects?” You just... do something, ha. I forgot the next step, so I don’t want to say a mistaken answer and tarnish poor Pierre’s idea, which is quite nice. But he’s been working on this problem for basically decades, so here’s one of his N papers on the topic (of spatial partitioning, not applying radix sort to solve it): http://www.codercorner.com/ZeroByteBVH.pdf

I could work it out if I sat down with it for a bit, but, I leave it to one of you younger devs can take up the torch. It’s a fun puzzle.

dharmon · on Oct 13, 2020

The second algorithm you are thinking of is probably what is known as "sort-and-sweep". Basically each object takes up an interval along each axis, so a necessary condition for collisions between two objects is that the intersection of the intervals along each axis are non-empty. (which means if there is no overlap of intervals along any axis, you can guarantee they are not colliding).

You compute the bounds along each axis in linear time. The sorting of these bounds can be done in linear time, then you take a pass over each axis looking at the marked intervals, which is also linear.

jansan · on Oct 13, 2020

I agree, that sounds a lot like "sort-and-sweep" or "sweep-and-prune" (I prefer the first name). This is an awesome algorithm and really simple. I once reimplemented an algorithm with sort and sweep to improve performance of a certain test case that took 45 seconds to compute. With sort-and-sweep it only needed about 100ms.

sp332 · on Oct 13, 2020

That's a quadtree, or an octree in 3D.

sdbrown · on Oct 13, 2020

A k-D tree?

contravariant · on Oct 13, 2020

That's a binary tree where each node divides the space along a single axis. A quad-, oct-, 2^n-tree divides along all axes for every node.

geophile · on Oct 13, 2020

No. Quad and octtrees divide the space. K-d trees divide the data.

snovv_crash · on Oct 13, 2020

kD-tree doesn't necessarily split at the midpoint like a quad or oct-tree do.

Cthulhu_ · on Oct 13, 2020

I've learned a few simple tricks about things like random numbers, rounding, simple physics (acceleration, inertia, gravity), etc just from following some tutorials in the pico-8 fantasy console.

And some people have taken that to extremes, building 3d rendering engines and tech demos in what is a very constrained environment.

djmips · on Oct 13, 2020

And I would point out that some of the best Game Devs who make frighteningly fast code don't have CS degrees and are sometimes fixing up after people that do.

wang_li · on Oct 13, 2020

Binary Space Partitioning tree.

CyberDildonics · on Oct 13, 2020

If you sort bounding boxes in only one dimension by their min values you still have to brute force test their max values and their other dimensions unless you make a secondary space partition or acceleration structure.

jansan · on Oct 13, 2020

No, it works because you keep an array of "active" elements during the sweep. Let's say you sweep from left to right over the elements sorted by their left bounds. For each element you encounter you check if it collides with any of the active elements. While doing so, you check if the active element's right bounds is left of the current sweep line. If so, it cannot collide with any of the remaining elements, so you can remove it from the active elements. This way the number of active elements usually stays small and you have to check each element only against few other elements.

CyberDildonics · on Oct 13, 2020

What you described is what I said. A spatial acceleration structure is meant to skip as many comparisons as possible. What you are talking about is trading less sorting and more comparisons during the queries.

jansan · on Oct 13, 2020

Sorry, I misunderstood the "brute force" part of your comment. You are right, for the active elements the standard implementation uses a brute force on the active elements. I had to implement sort-and-sweep to handle one specific performance problem that took 45 seconds to compute. With the simple sort-and-sweep this went down to 100ms. I also tried sorting the active elements along the secondary axis, but at least in our case this did not yield in any further performance improvements. This probably depends a lot on how many elements there are in the active list.

blisterpeanuts · on Oct 13, 2020

Years ago, I was a contract programmer at a large university, working on Perl reporting scripts for the library system.

Once a month, they reloaded the patron database; it took about 24 hours. If there were a network interruption or power glitch, it would have to be rolled back and run again the next day. Pretty bad.

It wasn't my task, but I was curious why this program took so long to run, so I started picking it apart. It was written by a PL/I programmer -- his first Perl project. No comments. He was no longer there. No one understood the script.

I started changing things, among other things I moved the Oracle-specific DBD calls out into a separate module, because someone on the internet said Perl ran more efficiently that way. And poof! just like that, the program then ran in 8 minutes, down from 24 hours.

I looked like a hero, but I very openly said I wasn't quite sure what I did. Just got curious and moved things around, and apparently discovered some bug in the DBD layer by pure accident.

Sometimes curiosity and willingness to experiment is all it takes.

mlang23 · on Oct 13, 2020

Most code is bad code. And if it is not algorithmically bad, there are many other errors lurking in dark corners.

I am a HPC cluster admin. Many years ago, we had a (for us back then) rather large project. Several million hours of CPU time. During a support case, I happened to stumble across the source code for the project. And I was pretty surprised. It was a few hundred lines of Pascal, compiled with fpc. I knew about the language being used and was told other compilers dont make a lot of difference "because the code only uses 32-bit values". Hmm, suspicious, but you can't dictate how people solve their problems. At least not easily. But the small LOC count, and a upcoming weekend without a lot to do got me thinking. So I sat down and more or less mechanically translated the code to C++. Making use of templates to move a value from runtime to compile time. And bang, I got a speedup of 5x. So a few hours of time spent by someone NOT involved in the project at all helped to save around 5 mio. hours CPU time (around 1.5x my yearly salary).

Of course, this is a pretty extreme example and not really representative for the scientific computing community. Still, with my experience watching people run big jobs I am estimating that we waste around 50% of our computing time (world wide) on insufficiently thought-out implementations and lack of knowledge about the actual architecture the code is running on. Sometimes, I am happy if a user knows the difference between OpenMP and OpenMPI :-)

gridlockd · on Oct 13, 2020

> ...not really representative for the scientific computing community.

I disagree.

Source: dude, trust me.

Tomminn · on Oct 14, 2020

Agree, it definitely is representative.

There is the odd project that is so computationally intense that it needs to be written near optimally, but most scientific computing is a quick-and-dirty calculation run inefficiently on a giant cluster so you can get that figure for your paper and move on.

chrismorgan · on Oct 13, 2020

Point of document implementation: please don’t use <script type="math/tex">…</script> for equations, or else the equations are simply missing in environments where MathJax doesn’t run (e.g. JS disabled, text mode browser, or JS fails to load—all up, it’s more common than you realise). Use a form of markup that will fail visible rather than fail invisible. TeX mathematics notation is far better than a void.

You could make this abuse of <script> a bit more tolerable with something like the CSS `script[type^="math/tex"]:not([id]) { display: inline }`, but that will still be hidden to text-mode browsers or browsers where the CSS fails to load. It’s much better to just not use <script> in this way.

thangalin · on Oct 13, 2020

> please don’t use <script type="math/tex">…</script> for equations

KeenWrite[0], my desktop text editor, has an Export HTML using TeX/SVG feature that encodes math glyphs as vector paths, embedding the SVG into the resulting HTML file.

On the subject of optimizations, KeenWrite can render 1,000 simple TeX formulas to SVG format in about 500 milliseconds (on my desktop machine), sufficient for real-time previews. Two libraries needed changes to accomplish this feat.

First, JFreeSVG and Apache Batik both suffer from the same performance problem: they use Java's NumberFormat to convert floating point numbers to strings. It's slow because of internationalization rules, which aren't necessary for a machine-readable format. The fix entailed replacing NumberFormat with the Ryū algorithm[1].

Second, JMathTeX[2] was parsing TeX by throwing exceptions for flow control. In Java, throwing an exception forces the virtual machine to fill out a stack trace. Normally this wouldn't result in abysmal performance, but the flow control was being executed for each character in every macro name encountered. Pre-parsing macro names without any flow control resolved the issue.

Further optimizations were made, but those two were the lion's share. Combined, the total time went from 1 TeX formula to SVG format in ~5 seconds to 1,000 in ~.5 seconds.

[0]: https://github.com/DaveJarvis/keenwrite

[1]: https://github.com/DaveJarvis/JMathTeX

[2]: http://jmathtex.sourceforge.net

thewebcount · on Oct 13, 2020

I would go farther and say don't use special math mark-up for writing simple things like variable names and the 4 basic math operations. The only thing in the entire post that might need it is the log2 written twice near the end of the article.

CogentHedgehog · on Oct 13, 2020

That's a good speedup but why are they not using locality-sensitive hashing? Or doing some basic math to convert X and Y to binned indices for a lookup table -- you can easily precompute values for a few million cells and hold them in memory. If the lookup table is small enough it would even fit in a single CPU cache, and lookup would be nearly instant. Both of these approaches provide O(1) lookup for grid cells rather than O(log n).

I used a similar optimization when I was younger to enable interactively plotting very large (at the time) 2D point datasets. We stripped out points that were so close together that the difference would not be visible. The X,Y coordinates were each binned to an integer pixel location and both were packed into a single long integer. Then we stored a hashset of longs so we knew what pixels had points occupied. When adding a point to render, we did O(1) hashtable lookups to see if the location already had a point in it. The result was an extreme speedup because we could avoid rendering all points which were not visible, and it scaled linearly with the number of points.

Aeolun · on Oct 13, 2020

I’m not sure why we can’t map a position in a big grid to a position in a small grid by doing:

X = Math.floor(bigX / bigXMax * smallXMax)

Is there some magic here I’m not seeing?

mattmanser · on Oct 13, 2020

Or, if the grids never change, just store the data of the parent grid with the child grid.

We've not got the whole picture though, but you don't need CS training to know that a loop within a loop within a loop within a loop is a bad idea.

rocqua · on Oct 13, 2020

Grid doesn't need to be regular

orlp · on Oct 13, 2020

What the person you're responding to did was a very simple form of locality-sensitive hashing.

And even for non-regular grids, as long as they're somewhat regular (like timezones), you can narrow it down to just a handful candidates that a linear scan will then solve instantly.

etimberg · on Oct 13, 2020

As an example, see this image of the UTM zones and look at the zones labelled in red. https://upload.wikimedia.org/wikipedia/commons/b/b7/Universa...

alanbernstein · on Oct 14, 2020

That drives me nuts. I don't see an explanation in the article for why there are just six irregular cells in the whole grid. It does mention that it's "in part to accommodate the southern half of the Kingdom of Norway", despite the system having been developed by the US or maybe Germany. Why Norway? Why only Norway?

thamer · on Oct 14, 2020

Recently I got a speedup that was literally 6-7 orders of magnitude by changing a few characters in a regex, which made it match its input in linear time instead of O(k^N) due to a crazy backtrack.

The speedup I measured could have been even greater if longer strings were being matched, but in practice the longest string I tested it with took over 14 hours to match before the change, and ~2 milliseconds after the change. This alone is a ~25 million times speedup.

It was such a fun problem to solve, and by far the largest speedup I had ever seen. It was also the first time I encountered this issue outside of compilation class in college, but the way this regex was constructed made the issue pretty obvious.

vecter · on Oct 14, 2020

What was the actual change?

thamer · on Oct 14, 2020

It was a long regex, made of multiple parts with two OR'd together that looked very similar except for the last few characters. These two parts were trying to match a structured format, think of something like a domain name for example; it's not just [a-z0-9.-]+ since you can't have the - or . at the start or at the end, and maybe the end part has a limited length (say you want to match only short TLDs). I can't really say what these strings were but they had these sorts of restrictions and a range that was a bit larger than what domains can use (like capital letters).

So they had something like this for this structured part:

    [a-zA-Z]+[a-zA-Z0-9-.]*[a-zA-Z0-9]?(\.[a-zA-Z]{2,3})?

Actually kind of like that, but with multiple ()? and (){N} wrapping the different layers.

Let's call the regex above STRUCTURED. Now how do you match either one of these structured strings followed by either `.foo` or `.bar`? They had it as:

    STRUCTURED\.foo|STRUCTURED\.bar

(again, where STRUCTURED is replaced with the whole long regex from above). Meaning that the regex engine, as it consumes the first characters that potentially match the structured string, can't tell whether it's in the left or right branch of this OR until it reaches either the `.foo` or `.bar` suffix.

The change I made was simply to match it with:

    STRUCTURED(\.foo|\.bar)

In this case the engine can consume each character of the input and make its way through STRUCTURED without having to maintain k branches per character (given as k=2 in the example above but it was much more than that).

I hope the general idea still comes through despite the simplification and required obfuscation.

cestith · on Oct 13, 2020

Hacking through this and thinking himself so superior because he understands CS just makes the article feel pompous and self-aggrandizing. It's highlighting a personal win with notes about the shorter runtime being a team win.

CS is a wonderfully useful, deep and broad field. There's nothing so special about some simple Big O complexity that he couldn't have taught every developer in the department this rather than bragging and upstaging the person who wrote a correct but slow implementation. Patting all of CS on the proverbial back for this smells of gatekeeping.

lanecwagner · on Oct 13, 2020

Trying to get Bootcamp grads to learn CS is like pulling teeth. The worst thing I hear as an employer is, "I dropped out of my CS degree because the last couple years of classes weren't applicable to daily programming".

I've started Qvault to try to address the problem... We'll see where it goes.

https://app.qvault.io

moron4hire · on Oct 13, 2020

My senior year of CS classes may not have been applicable to "daily programming", but are certainly applicable to "monthly programming".

There are just jobs I never would have gotten or kept if I wasn't able to solve complex problems, on my own, the few times they come up.

This is something that is so hard to get across to boot camp grads. They complain about how hard it is to get a job, how hard it is to progress at the jobs they do get, how hard it is to write basic programs on their own, how hard it is to even figure out what is important to study. But then, if I suggest that constantly focusing on shallow, broad study of JavaScript-framework-du-jure might not be sustainable and that Computer Science might actually be relevant to software development, suddenly I'm the asshole for calling into question the teaching methods of the boot camp system.

For me, it's also a quality of life issue. I just can't imagine being satisfied with a career where I'm just a glue coder. I want to make new things. If I were working in the automotive industry, I wouldn't want to be on the assembly line, I'd want to be in the engineering department. And most of the people I talk to want to make new things, too. They just don't seem to want to hear that "building something from the ground up" requires "understanding fundamentals".

noisy_boy · on Oct 13, 2020

I remember many years ago I came upon a giant shell script doing some data processing (data not even that large, maybe few GBs) that took close to 25 mins (mainly due to unnecessary I/O due to tons of temp file creation) - written by "developers". I re-wrote it in Perl and it completed in less than a minute (and it wasn't even optimized because I was a non-CS newbie who was just learning Perl). I even setup a cronjob to compare the output from both and email to the concerned folks. It was never adopted because I was new and Perl was considered "too-risky".

ignoramous · on Oct 13, 2020

> It was never adopted because I was new and Perl was considered "too-risky".

What's not to love about bureaucracy?

dumbfounder · on Oct 13, 2020

Computer Science for the... loss?

I did not study computer science in any substantial way, and I hardly consider myself a computer scientist, but I do have a nose for algorithms despite not really thinking about it in a structured way. I think lot of people get wrapped up in the notation without realizing they can approach the problem in a completely different way. Case in point, a professor at a University in the US created a parts-of-speech tagger and released the source in the early 2000's. I wanted to leverage this tagger to create a topic oriented search engine, but it was WAY too slow. It took about 10 seconds to tag a document (back in 2003 on decent hardware), and since I needed to tag hundreds of millions of documents, and despite having about 15 servers at my disposal, this was not going to work. So I dug into that algorithm and realized it was trying to test all of it's conditions all of the time to figure out which rule would apply. I redesigned it to index the rules and only apply them if it was possible for that rule to have some effect. Or something like that, the details are now fuzzy. The speedup was roughly 1000x, and made the tagger usable at scale.

Plot twist: I took a computer science class (as a Mech E major) taught by that same professor years earlier and failed it. That class was Data Structures.

Al-Khwarizmi · on Oct 13, 2020

The tagger's speed was probably enough for what the professor intended at the time. They probably knew how to optimize it, but didn't because they had no reason to.

I say this as a professor myself, because we do that all the time in my team: we create some software system (also in NLP, by the way), we do experiments to show that it improves whatever metric we are interested in or that it teaches us something interesting, we publish a paper with it, make the code available and then move on to the next paper. That's what we are typically incentivized to do. The time spent into optimizing a system that we are not going to take to production anyway is time not spent doing more research. Which doesn't mean that we don't know how to optimize.

dumbfounder · on Oct 13, 2020

I hear what you are saying and this story is more tongue-in-cheek than trying to put down computer science or professors.

But, fyi, this tagger was not just a professor's demonstration, it was kind of ground-breaking and served as a foundation for other taggers. The professor went on to have a pretty awesome career at a few different big tech companies, far surpassing my own success. And yes, I agree, I am sure he could have made it faster himself had he dedicated the time to it.

hombre_fatal · on Oct 13, 2020

It's surprising this would need to be said on, presumably, a forum of engineers. Or perhaps the temptation to laugh at someone is greater than the boring admission that OP is using the program on inputs much larger than ever intended.

jayd16 · on Oct 13, 2020

Not to mention the possibility that much of that code was written by an volunteer underclassman.

cestith · on Oct 13, 2020

It sounds like you do know how to optimize. Your important metric is just different. You're optimizing for your time rather than the computer's because that's by far the more valuable resource in your set of constraints.

time0ut · on Oct 13, 2020

Another consideration is that the unoptimized version of the algorithm may be easier to explain and study. So he might also be optimizing for clarity.

hansvm · on Oct 13, 2020

Mostly unrelated: When I write heavily optimized code I prefer to write the stupidest, simplest thing that could possibly work first, even if I know it's too slow for the intended purpose. I'll leave the unoptimized version in the code base.

- It serves as a form of documentation for what the optimized stuff is supposed to do. I find this most beneficial when the primitives being used in the optimized code don't map well to the overarching flow of ideas (like how _mm256_maddubs_epi16 is just a vectorized 8-bit unsigned multiply if some preconditions on the inputs hold). The unoptimized code will follow the broader brush strokes of the fast implementation.

- More importantly, you can drop it into your test suite as an oracle to check that the optimized code actually behaves like it's supposed to on a wide variety of test cases. The ability to test any (small enough) input opens a door in terms of robustness.

cestith · on Oct 13, 2020

Absolutely. Clarity of the concept is important. That's sometimes at odds with the best performance.

jerf · on Oct 13, 2020

There is nothing a computer science course teaches, or indeed, any university course really, that can't be learned by someone who didn't go through that course. Undergrad computer science is indeed one of the easiest things to self-teach because of the ready availability of the "lab materials", i.e., computers are cheap and abundant now.

But a structured program will take you through these things, ensure that you understand them, be available to answer questions in a manner other than "bang your head on the problem for weeks" or "just give up", and provide a structure progression through to more advanced topics.

I think it's important to keep both sides of this in mind; the universities do not have a monopoly on the knowledge and arguably it's easier than ever to pick up yourself, but that doesn't mean the structured experience is useless either. I've worked with some very good self-taught people, but they all have also tended to have gaping holes in their education relative to a good computer science education. (I agree with those people that they are much better than some people who end up with bad computer science educations. Unfortunately, if one optimizes one's computer science education for "the easiest 4.0s", one can end up with very similar gaping holes in one's formal education!)

commandlinefan · on Oct 13, 2020

> that doesn't mean the structured experience is useless either

It also often seems missed that somebody who went through a formal degree in lieu of gaining real-world experience can also gain real-world experience at least as easily as somebody who got the equivalent of a formal degree through real-world experience.

lmilcin · on Oct 13, 2020

So one of the applications I have worked with was an embedded credit card terminal app which needed a transactional database. Since I could not find a product that would fit all requirements I decided to write one.

Now, you can imagine smart algorithms, trees, hashes...

Nothing of that sort. The database was written as append only transactional log. To retrieve data, entire file was scanned for initial record and all its updates. No indexes. Some algorithms were plainly O(n^3).

Yeah, I got into many heated discussions with "computer scientists" in the team. Yet for all their might they were not able to improve upon the program because they forget that algorithmic complexity is only one factor of performance.

Because I used extremely simple algorithms (like linear search) the operations were very simple and progressed quite fast on a CPU that was good at prefetching from the storage.

The total amount of memory available was just 0,5MB meaning that the "super inefficient" scans still completed within perception threshold of the user.

While O(n^3) was used for some operations, the application state machine which limited number of actual number of steps. Most transactions follow same basic pattern of execution and so I don't care what happens when somebody does 500 updates to the transaction when I can figure out there will ever be 5 state changes at most.

There were other consideration for the program. For example, it had to use very little instructions (we were very short on space) and it had to work in constant memory (to be able to statically calculate stack space necessary, for example).

wkz · on Oct 13, 2020

A clear case of Rob Pike’s 3rd rule of programming:

https://users.ece.utexas.edu/~adnan/pike.html

WoodenChair · on Oct 13, 2020

It's a great anecdote that shows why we shouldn't ONLY worry about the CS side, but it doesn't mean the system wouldn't have been even more robust/faster/more scalable using better data structures/algorithms. Just because doing the simplest darn thing gets the job done, doesn't mean it's the best way to do it.

Obviously you chose this solution because it made the most sense given your constraints (time/money/knowledge/microcontroller limitations). But that doesn't mean it was an optimal solution nor that with slightly more time/money/knowledge a significantly better solution couldn't have been found.

zimpenfish · on Oct 13, 2020

> I got into many heated discussions with "computer scientists" in the team.

Had a week long argument once about an (relatively) inefficient MySQL query because "it's not going to work for 1000 things" when a) we had barely 10 things currently running and b) the customer was >this far< from cancelling because stuff wasn't working (which this query would help with.) It was a frustrating time. I think there's a lot of developers who "know the price of everything and the value of nothing".

leeter · on Oct 13, 2020

People forget that Big O is only part of the story, they also need to consider Little o and average runtime. Just because something is n^2 or worse asymptotically doesn't mean the average runtime will be that bad. There are many cases where the average runtime is closer to Little o almost all the time.

zimpenfish · on Oct 13, 2020

> they also need to consider Little o and average runtime.

Frequency of run is also important - a batch job that takes 5 hours that only runs once a month? Is it worth optimising that?

ric2b · on Oct 14, 2020

Average is useful but is Little o relevant outside of very low latency things like games? If the input is small it will run fast for any solution anyway.

lmilcin · on Oct 14, 2020

Why would efficiency be relevant only for low latency things?

pc86 · on Oct 13, 2020

I'm biased because I studied Poli Sci, I'd like to think I'm a pretty successful software architect and senior developer, and for a long time I had a big chip on my shoulder because I didn't study Comp Sci (and barely graduated college at all if we're being honest).

But your story shows the gulf that exists between Computer Science and software development. You needed to actually use a piece of software constrained by business requirements (not spinning up 250 servers), and when it didn't do what you needed it to, you changed it. The fact that you failed a data structures class didn't prevent you from refactoring and improving the code. The fact that that professor wrote code that performs poorly on good hardware doesn't negate the fact that there was probably some novel CS implementation in there somewhere.

I know in the past I've gotten caught up on CS profs having terrible websites or programmers not understanding the intricacies of bubble sorting but I think it helps to keep in mind that they are two different disciplines that inform each other, and the gap is seeming to get wider over time.

amw-zero · on Oct 13, 2020

I don't understand - how is this a loss for CS? You used algorithm analysis to identify an inefficiency an improved that inefficiency. That is what Computer Science is. CS is not its notation - it is the act of doing what you just described. The notation is just supposed to help.

I don't understand the reduction of all of math and CS to complaining about people who overindulge their notations. I've been reading this recently, and it just affirms how important and ubiquitous mathematical thinking is everywhere in life: https://www.gatesnotes.com/Books/How-Not-to-be-Wrong.

domnomnom · on Oct 13, 2020

It's a dynamic where each party can feel the need to cover one's own insecurities resulting in (unintentionally) invaliding the other party. One side aims to justify a sunk cost and the other to prove themselves and it becomes a chain reaction whereby everyone becomes engaged in ego defense/vuln attack until someone goes "wow you are really intelligent" and breaks the cycle.

m12k · on Oct 13, 2020

Computer Science is just academic/theoretical programming. While having a relevant education usually beats not having one, real-world experience usually beats theoretical knowledge when it comes to achieving real-world gains. E.g. if I want something computed as quickly as possible I'm probably better off asking an average game engine programmer than an average CS professor, since they'll optimize for cache efficiency and branch prediction rather than just big O notation.

qsort · on Oct 13, 2020

As someone with CS background but no experience in academia, this is true but only up to a point. Optimizations like cache efficiency and branch predictions are important and often trump big O considerations, but are mostly engine-level issues. If you are normalizing for the average programmer, the #1 reason why their program is slow is that they're using the wrong algorithm (often implicitly with e.g. library functions or database queries). If anything I'd like more people to be aware of how to reason in big O terms rather than less.

umvi · on Oct 13, 2020

People obsessed with big O are super annoying. To them O(1) trumps O(n) even if it's a tiny little set of data where clearly the latter is actually faster (stopwatch time).

ls65536 · on Oct 13, 2020

The way in which I like to often think about these things in practice is with "hidden" constant factors. For example, you can think of the O(1) algorithm as really taking O(1 * K1) time, and the O(n) algorithm taking O(n * K2) time to complete. For different algorithms, K1 and K2 will almost certainly be distinct, and may even differ by significant amounts. Of course, if K1 is less than K2, then the value of "n" is irrelevant, and the O(1) algorithm always completes faster. But if K1 is greater than K2, as is the case quite often, then this really depends on the nature of what "n" is in practice, and whether or not it is larger than the value of K1/K2. This of course still ignores any other, often important, considerations beyond just run time such as memory consumption (which may also affect run time indirectly), but I find it's a good starting point when trying to reason about when O(n) can run faster than O(1) in various real-world scenarios.

It's a good reminder to always try to understand your likely workloads as well as possible (know your "n"), and to get good measurements (what are K1 and K2 values) before prematurely optimizing these kinds of things.

arethuza · on Oct 13, 2020

Isn't that already built into the definition of O?

"f(x) ∈ O(g(x)) as there exists c > 0 and x0 such that f(x) ≤ cg(x) whenever x ≥ x0"

Saying f(x) is O(g(x)) doesn't say anything about how those two functions compare below x0?

https://en.wikipedia.org/wiki/Big_O_notation

Edit: Not trying to be snarky - just trying to check whether my 30+ year old memory of education on such matters is even vaguely correct... :-)

Edit2: Added "memory of" - I'm pretty sure what I was taught was correct, just my memory of it that is all too fallible.

cyphar · on Oct 13, 2020

You are correct, the point being made was that a lot of people aren't aware of that and just take it as gospel that O(1) algorithms must be faster than O(n) ones -- but, as you said, big-O notation only says that is true after n goes above some threshold n0 which may be (and usually is) very large.

After all, the most efficient (in terms of big-O) known algorithm for multiplying two numbers uses a 1729-dimensional Fourier transform and only becomes more efficient once the numbers involved are significantly larger than the number of atoms in the universe[1].

[1]: https://en.wikipedia.org/wiki/Galactic_algorithm

pmiller2 · on Oct 13, 2020

Interesting (coincidental?) appearance of 1729 here in a numerical algorithm.

arethuza · on Oct 13, 2020

That "significantly" is quite the understatement!

colejohnson66 · on Oct 13, 2020

It’s good to understand what the “Big O” for your algorithm is, but, yes, people who obsess over it are annoying. If I know I’m processing 100 items very rarely,[a] does it matter if my quick and dirty sorting (no pun intended) algorithm is bubble or quick sort? They both complete in a fraction of a second, and the user (generally) isn’t going to notice a difference between a single frame update delta or two.

[a] Key words are 100 items very rarely: if you’re sorting many times a second (for whatever reason) or (relatively) large datasets, using a quicker sorting algorithm than bubble or insertion would make sense.

kortex · on Oct 13, 2020

I had basically this exact discussion with a coworker. I said something offhand that I rarely consider big O, beyond avoiding obviously inefficient patterns; they replied they basically are always considering big O. Personally, that strikes me as premature optimization. My priorities are generally:

1. Solve the problem.

2. Ensure correctness of that solution

3. The code is easy to reason about

4. The app is robust

5. "good enough" performance

6. Performance tuning

Which is of course the "correct answer", and of course in practice these all blend together I will emphasize any number of those at a given time (who doesn't love spending an occasional afternoon just chasing pointless benchmarks?). But I never come at it from a "big O mindset" even when that's what I'm doing in practice, I just call it "common sense" and heuristics: don't put slow things in tight loops, memoize when you can, and leverage concurrency.

In my line of work (CV), moving a slow blocking operation from serial to async will get you easy, bigger wins 90% of the time, vs seeking out something in polytime and refactoring to nlogn.

jjuhl · on Oct 13, 2020

I once worked on an application where we (in one common situation) had to sort an array that was always <= 5 elements and in 90+% of cases was already completely (or mostly) sorted. I got a lot of heat from co-workers when I used bubble sort for that task, since "bubble sort is terribly inefficient" (and it is for large random datasets). But, given the actual data in question, no-one could actually make any other sort perform faster.

Know your data.

jdmichal · on Oct 13, 2020

My rule is that the only sort I will ever write by hand is a bubble sort. It's basically impossible to write incorrectly. If and when that breaks performance, then I will bring in an external sorting library and figure out what works the best for the data.

It's the equivalent philosophy to always buying the cheapest tool you can the first time around. When you break that tool, then you go out and buy the expensive one.

_2xvp · on Oct 13, 2020

The cheapest tool you can find is surely the sort in your programming language's standard library. Writing your own sort seems crazy to me.

jdmichal · on Oct 13, 2020

What can I say. Sometimes you're not working with the blessed standard data structures for which those algorithms are typically written.

mekoka · on Oct 13, 2020

They're not just annoying, they possibly don't get the point of big O. The n in O(n) means that at the time of algorithm design the size of the data is unknown and considered unbound. If you already know its upper limit then it becomes O(c) (where c is a known constant). If c is low enough (a discretionary decision) it can possibly be considered O(1). Think of how a theoretical hash map traverses one of its buckets in linear time to find the key's value (if the bucket is implemented as a linked lists), we still consider the overall lookup to be O(1) simply because the upper bound of the bucket is known.

newen · on Oct 13, 2020

I guess that's why they call this site Hacker News and not Computer Science News.

PaulHoule · on Oct 13, 2020

One time I had a famous computer science professor mail me a C program (maybe 200 lines of code) that would compile OK but would crash when I would try to run it.

I set a breakpoint at the beginning of the main() method and it crashed before it even got to main().

That had me scratching my head for a minute, then I realized that the program was static initializing a 2^32 byte array. The prof had a 64-bit computer (probably DEC Alpha) which you could do that on, but it didn't work on my 32-bit x86 machine.

It turned out that the program never used the array that it allocated so deleting one line from the code got it to run.

Most CS profs never have to really finish the systems they work on so they lack many of the skills and attitudes of the professional programmer. Some of them are artisans with code, but mostly they making a living writing papers and teaching classes and the code is tertiary.

cabaalis · on Oct 13, 2020

There's a happy medium between getting a job done and applying theorums/bigO/fancy data structures all the time. Probably what you came across and subsequently optimized was an implementation of the former.

Fun fact: Just getting the job done works out well most of the time, technical debt is introduced but as in your example it served its purpose.

arcticbull · on Oct 13, 2020

Data structures is a weird one, it's basically a memorization class. It's the kind of thing you just have to do, a bunch of times, then it clicks. That was my experience, anyways. It's largely only relevant when interviewing, and I always cram for that like I did in college, by implementing a whole ton of data structures the week before.

qayxc · on Oct 13, 2020

> It's largely only relevant when interviewing

I find the opposite to be true, but it depends entirely on what problems you work with on a daily basis.

Data structures and algorithms are a toolbox and the more you know, the more options you have at your disposal to tackle a specific problem.

If you work in an area that doesn't benefit much from this specific set of tools, then yes, you only it during interviews; interviews done by people who don't have the faintest clue about what they're hiring you for (which is a big problem itself).

arcticbull · on Oct 13, 2020

> I find the opposite to be true, but it depends entirely on what problems you work with on a daily basis.

I suppose that's true, my perspective is as a client engineer where for the most part, someone's built the data structures for you. Yes you need to know their performance characteristics and when to use them, but otherwise if you're building them yourself you're doing something wrong. I'm sure that's not true in all roles.

> If you work in an area that doesn't benefit much from this specific set of tools, then yes, you only it during interviews; interviews done by people who don't have the faintest clue about what they're hiring you for (which is a big problem itself).

I'm actually really passionate about recruiting and improving the recruiting process anywhere I work, I'd love you to elaborate on this.

qayxc · on Oct 13, 2020

> I'm actually really passionate about recruiting and improving the recruiting process anywhere I work, I'd love you to elaborate on this.

Unfortunately this is a really complex topic and I'm afraid I'm not nearly knowledgeable or experienced enough to give any meaningful advice in a concise way.

Say a candidate is to be hired for developing and maintaining a line of business app. In this case it's important for them to know the programming environment, frameworks and tools, as well as working with large programs or legacy code if applicable.

Start by asking a few "no brainers" sprinkled into a general conversation about their previous work experience and increase the "difficulty level" from there.

There will come a point at which the only honest answer will be "I don't know" or "I'd have to look that up", which is the perfect opportunity to just hand them a tablet (or laptop) and let them do so. This is an excellent way to learn how they look-up technical information (which sites do they use, how long does it take them, do they use meaningful keywords).

Another effective way of judging someone's skill is asking them to point out problems with a small piece of code or have them discuss required modifications to meet some constraint (e.g. sort a dataset by one of a user-selectable set of criteria, but you can only read each item once and the dataset doesn't fit into memory - a common problem with archival data stored on tape).

These are just some pointers, but they all have two things in common: they require the interviewer to know what the candidate is going to work on and they need more preparation work than just a checklist that reads "knows how to implement an AVL-tree in Python" or "can tell the arguments of all 6 overloads of std::transform_reduce() in C++17".

ricardo81 · on Oct 13, 2020

Brill Tagger, per chance? I had tinkered with its code at some point, don't think I realised a 1000x speed up, though.

AshamedCaptain · on Oct 13, 2020

I used to know a professor who switched from the hardware department to the software department. His reason? With hardware I can reduce runtime by 10 or 100. With software I can reduce runtime by log(x).

michaelmior · on Oct 13, 2020

It's especially powerful when you have someone with a deep knowledge of both hardware and software who can maximally exploit the performance of the hardware taking into account CPU caches, vectorization, etc. that most application programmers don't think about on a daily basis. Daniel Lemire[0] is a great example of someone who does this well.

https://lemire.me/blog/

orlp · on Oct 13, 2020

If your grid is well-behaved enough you can go a lot faster still using locality sensitive hashing.

innagadadavida · on Oct 13, 2020

> global grid is 50x25 and the local grid is 1000x500. For each grid cell in the local grid, we want to know to which grid cell in the global grid it corresponds

Isn’t this just a simple quantization? (Lat/20,Lon/20) should do it?

salmon · on Oct 13, 2020

I find optimization porn like this to be so satisfying.

Frost1x · on Oct 13, 2020

The only issue I have with optimization porn is that it often handwaves away the time and thought that went into finding clever optimizations. In some cases it's clear how a problem can be better optimized. In many cases, it's not so clear. It takes someone experienced in development, an understanding of software and computational complexity, and the insight to see optimization opportunities or have used them before.

I also live in scientific computing world where the author lives and many other scientists don't appreciate the complexity of these optimizations the author hand waves away. It's easy to after the fact say, "well obviously look at this.." and show after the fact runtime performance improvements, it's another story to explain how you arrived at this, the time thinking about the problem sitting at dinner and before bed, while exercising or having coffee etc.

My biggest gripe with developers in this regard is how clever people like to act. They have a challenging problem and often spend significant amounts of time, then pretend like it happened over lunch. This does happen but from my experience, it doesn't happen regularly, not as many like to portray. You often iterate through several other optimization approaches aor ideas that fail or are discarded.

It's similar to when you see obnoxious compressed mathematical notation in a presentation where the author jumps through something they spent years on like it's trivial. If it was trivial, it wouldn't have taken you years and although a solution presented and checked to a problem is much clearer now than before, it's by no means obvious. This gives other professionals irrational expectations of resource allotment (in terms of your paid time) to solve these problems and makes your life more difficult than it needs to be.

Imagine if after Einstein spent time developing special relativity he skipped it all, hand waved, and said, "well clearly e=mc^2..." no--absolutely not clearly, but that's amazing insight.

CyberDildonics · on Oct 13, 2020

Step 1: start with something grossly, ridiculously inefficient.

Step 2: Brag about your speed up

hnaccy · on Oct 13, 2020

I was lucky enough to come upon a similar speedup in some code that was used a fair bit at first job; multiple order of magnitude speed up by applying some data structure knowledge.

I feel like base level algorithm/ds knowledge is getting higher so these will disappear, even if you don't know the exact solution your intuition will make you think "there's no way this needs to be so slow" and then you do some research/thinking.

rladd · on Oct 14, 2020

We can easily reduce this to effectively time=0 which is an infinite speedup.

Since the membership of any of the local cells in the global cells is static, we only have to compute this once. Then we add a globalCell parameter to each local cell to specify which global cell it's contained in.

If this uses up too much memory, we could go to a slower but still much, much faster than the blog post's method of something like:

- Create 2d array where the 1st dimension is a sorted list of local row values, one per global row value, that represent the maximum X value within any global cell. Then the the 2nd dimension are local column values, one per global column value that similarly correspond to the maximum Y value of any local cell per global cell. (mapping the coordinates to an array)

- Find the entry that the current local row value fits into (for example the entry with a value higher than the local row value, which is higher than the next list entry).

- This selects a second dimension array for that global row that has similar properties: it's only necessary to then find which value the local row fits into to determine the global cell (which would be the array's value at that index).

thenewwazoo · on Oct 13, 2020

I read this and chewed on it for a while, and I think it points to a deeper problem with how we teach computer programming. While, yes, computer scientists need to know about binary search vs linear search and how to iterate through collections, I propose that most programmers do not. Lots of people do programming that aren't, and shouldn't aspire to be, computer scientists (because they have other, more important, shit to do).

Languages and libraries should support broad deemphasis on counting and iterating and so on. If a user wants to find a thing, they should be taught to use a method on the collection, haystack.find(needle). If they want to iterate, they should be taught to use iterators, haystack.map(measure). And so on. We, as instructors, teach people for needle in haystack and that solves the problem and they move on so they can accomplish their real goals. All of the code in this article could have been written in one iterator chain in a language that didn't promote for l in locals as a first option, and with instruction that saved such implementation details for later.

m4r35n357 · on Oct 13, 2020

You could get arbitrarily large improvements by starting with even worse code ;)

andi999 · on Oct 13, 2020

Exactly. I like the enthusiasm of the author; but the self-congratulatory attitude doesnt go down very well if you should actually be ashamed of the first version. I mean it is only a few inches from:"you know in the old days we had to flip through the telephone directory from front to end to find a name. But you know what: it is actually a sorted list, so we applied a binary search algorithm and are 14000x faster. Tl;dr CS for the win"

theamk · on Oct 13, 2020

You'd be surprised how often "flip through the telephone directory" thing comes out in the real projects.

Sometimes it is one of those "we did a quick hack and then forgot about it as data sizes grew", sometimes one just forgot about the complexity, and occasionally there are people who don't understand the problem at all.

andi999 · on Oct 13, 2020

I agree, and I also think this should be the first solution since it is obviously correct so you can test your faster algo later against this (at least in the post the invariants of the dataset are not spelled out exactly, and we know premature optimization..). But I do not think you should expect a trophy when later fixing it (when profiling shows it is a problem).

jp0d · on Oct 13, 2020

Nice one. I'm one of those non-CS background person in IT. Although, I'm not a dev, I do write a lot of code in SQL and Python for data analysis. People like me can benefit a lot from understanding algorithms. Especially because we often deal with huge amounts of data. Cheers.

PaulHoule · on Oct 13, 2020

Then there is the time that I wrote a program that would have taken 100 years to run and found a (perfectly useful) approximation I could calculate in 20 minutes, for a 2.6 million (x) speedup. Even counting 16 hours of programmer time that is a 55,000 (x) speedup.

Jtsummers · on Oct 13, 2020

My most significant optimization was showing that the thing they wanted computed was not possible. Not just "will keep running and finish soon after the universe expires", but actually impossible to compute (in the general case, which was their case, if they restricted the inputs it was feasible). Apparently they had no mathematicians evaluating the requirements. The guy who brought the problem to me had been working on it off and on for months. So I saved him from having to think about it for the rest of his time maintaining that system.

PaulHoule · on Oct 13, 2020

Yep, that's the heart of Computer Science.

fzimmermann · on Oct 13, 2020

I see some issues with the problem description and the solution: The author assumes the problem can be split in x and y independently. Firstly, this has not to be true for all projections and grids. Secondly, the solution minimizes Manhattan distance in lat/long., not actual distance (you can construct points where the minima are different). This will not matter if you can approximate both grids as regular (equidistant and rectangular grids). But in that case, you could easily calculate for each point in one grid the corrosponding point in the other grid by scaling and rounding...

lkbm · on Oct 13, 2020

I thought this was going to be "How to Speed up a Python Program 114,000 times"[0]. It's an hour-long talk, but quite enjoyable.

In the example in the talk, it was less about knowing cs algorithms and more about knowing the details of how modern computer architecture works, but the lesson is the same: sometimes the background knowledge of a cs degree is useful.

He also has the same point about optimizing our code rather than just adding more and more machines.

[0] https://www.youtube.com/watch?v=e08kOj2kISU

acaloiar · on Oct 13, 2020

The optimizations here are uncannily similar to the ones I performed for an algorithm we developed for application in bioinformatics. It began with a statistics post-doc who knew some R. It was very brute-force.

https://arxiv.org/pdf/1509.00549.pdf (Results on page 28)

I suspect we would have seen similar speedups with larger vectors.

If you look at the tweaks to the algorithm, they're relatively minor, but with very significant gains. Sorry about the code syntax; the audience for the paper were statisticians.

ncmncm · on Oct 14, 2020

I don't know whose law says something like, "The amount of optimization left to do is proportional to the improvement you got on the last round".

I.e., 14000x means there's probably another factor of 10 or 100 in there. But 0.1s might be fast enough.

I get 2x on std::sort pretty easily. I am hoping to get a general speedup like that for half the algorithms in the C++ Standard Library into C++23. Maybe the other half a little later.

Caches mean nobody knows how fast anything should be, so lots of things seem fast until you see one faster, then the old one suddenly looks slow.

ekzy · on Oct 13, 2020

> But when it comes to developing software, I get the distinct impression that people think, “Hey, how hard could this be?! We just write down a few instructions about what we want the computer to do, hit the execute button and the, ‘Blamo!’, we get our answer!”

It depends on who you hang with :) I am 99% sure that 99.9% of people don't think like that. From my experience, people see computer programmers as nerdy mystical wizards and tend to say that they haven't got the slightest idea on how someone would even begin to make software.

sgillen · on Oct 14, 2020

In my experience with scientists/mathematicians a lot of them look down on software as an easy and unimportant detail of their work.

Certainly not all of course.

gbrits · on Oct 13, 2020

A Geohash (https://en.wikipedia.org/wiki/Geohash), a (decoding of a) recursive spacing-filling curve, would be much faster still: you literally slash a couple of characters of the tail of the hash to end up with your bigger cell, etc. It does mandate the way in which the grid/globe is subdivided though, which may not be suitable for the stated case.

netmonk · on Oct 13, 2020

This reminds me an old shell script used to housecleaning, which was performing first "for" loop based on an sql select to perform another "for" loop based on another sql select and itering on returned items from the first request.

Obviously the author of this script didnt know about sql joint. Reduced the work into one "for" loop and an custom sql join query as input, script running in less than 5 minutes compared to average 5hours before.

foota · on Oct 13, 2020

It sound like this performed well enough, but it seems like a more efficient solution would be to crawl the grid from left to right and at each point only search the neighbors of the previous closest grid from the larger grid, assuming that a local grid cell will never be have a height or width greater than a global cell. This would be linear in the number of local grid cells.

tumblewit · on Oct 14, 2020

In machine learning for trading class, a python for loop code was turned into numpy array where the code went from something like over 4 minutes to a few milliseconds. Its astounding how much you can improve on code just by a vastly better algorithm.

Note: I know numpy is C and faster than python but that was not the point. The point was using vectors vs bruteforce.

shihab · on Oct 13, 2020

I fantasize about coming across optimization opportunities like this in real life.

topkai22 · on Oct 13, 2020

As a team lead, my project sees them about once a quarter. This is on a project that fundamentally does form based data collection on a limited set of users.

The most recent one was where a dev used an array scan to find a piece of data in a record array. We changed that to a hash table look up and saw a 100-1000x increase in performance on our larger datasets.

Not quite as dramatic, but still pretty good.

If you aren’t producing single person solutions and your product doesn’t have any performance issues, you can either count yourself lucky to be working with great engineers.

bovermyer · on Oct 13, 2020

Related: if I had a bit of open source code that I wrote that I was convinced was poorly optimized, how would I go about getting someone to take a look at it and help me improve it?

ultra_nick · on Oct 13, 2020

That sounds a lot like the "Find the point nearest to this one on a 2s plane" interview question. Neat.

read_if_gay_ · on Oct 13, 2020

> Furthermore, there is literally no way to tell whether your program will ever actually terminate without actually executing it.

This is technically not correct, is it? Or at least phrased a bit poorly. Maybe replace "your" with "any given"?