
A supercomputer often won’t make your code run faster - signa11
https://lemire.me/blog/2017/12/11/no-a-supercomputer-wont-make-your-code-run-faster/
======
meesterdude
This echoes some recent experience I had working with complex simulations.

I'm building out backtesting for an automated cryptocurrency trading platform
(in Rails) and have never had to work with so much data/math.

In general, I generate various calculations for each interval/pair, then
"signals" on top of that, and then check whether a given interval is a match -
three passes in all to execute a backtesting strategy.
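The three passes might look something like this in miniature (the moving-average calculation, signal rule, and data shapes here are my own illustration, not the author's actual strategy):

```ruby
# Toy candle data standing in for one interval/pair's price history.
CANDLES = [
  { close: 100.0 }, { close: 101.5 }, { close: 99.8 },
  { close: 102.2 }, { close: 103.0 }
]

# Pass 1: derive a calculation per interval (here, a 3-period moving average).
def moving_averages(candles, window)
  candles.each_cons(window).map { |w| w.sum { |c| c[:close] } / window.to_f }
end

# Pass 2: derive signals on top of the calculations
# (price closing above its moving average).
def signals(candles, mas, window)
  mas.each_with_index.map do |ma, i|
    candles[i + window - 1][:close] > ma ? :buy : :hold
  end
end

# Pass 3: check whether the interval matches the strategy's condition.
def match?(sigs)
  sigs.include?(:buy)
end

mas  = moving_averages(CANDLES, 3)
sigs = signals(CANDLES, mas, 3)
match?(sigs)
```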

First, I did it all in Ruby. It took days to generate all the calculations for
just one pair/interval, let alone other intervals or pairs, even with the work
split across multiple VMs on DigitalOcean.

So I moved as much of the calculation as I could into nested CTEs in Postgres,
and got the calculations and signal generation down to 10 seconds - a huge
speed boost!
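The general shape of that approach is a query where each CTE layer builds on the previous one, so the whole pipeline runs inside Postgres in a single round trip. The table, column, and signal names below are assumptions for illustration, not the author's actual schema:

```ruby
# Hypothetical nested-CTE pipeline: calculations, then signals on top,
# all evaluated inside Postgres rather than row-by-row in Ruby.
BACKTEST_SQL = <<~SQL
  WITH calcs AS (
    SELECT pair, interval_start, close,
           AVG(close) OVER (PARTITION BY pair ORDER BY interval_start
                            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS ma3
    FROM candles
  ),
  signals AS (
    SELECT pair, interval_start,
           CASE WHEN close > ma3 THEN 'buy' ELSE 'hold' END AS signal
    FROM calcs
  )
  SELECT pair, interval_start, signal
  FROM signals
  WHERE signal = 'buy'
SQL

# In a Rails app this could be executed with something like:
#   ActiveRecord::Base.connection.select_all(BACKTEST_SQL)
```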

But then I wanted to analyze multiple intervals/pairs at the same time against
a given strategy, which took a long while to run - not days, but maybe half an
hour. The improvement I made there was a trade-off: limit the calculations to
a range instead of running them all (if I wanted it to run fast), and cache
interval/pair/strategies that had already run - so adding new ones (such as
another pair, or set of intervals) would only re-run the "new" options.
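The caching idea boils down to keying results by (interval, pair, strategy), so only unseen combinations trigger the expensive simulation. A minimal sketch, where `run_backtest` is a stand-in for the real simulation and all names are hypothetical:

```ruby
require "digest"

# Cache keyed by (interval, pair, strategy): adding one new pair only
# computes the new combination; everything already seen is a cache hit.
class BacktestCache
  attr_reader :runs

  def initialize
    @store = {}
    @runs  = 0 # counts how many real simulations were executed
  end

  def fetch(interval, pair, strategy)
    key = Digest::SHA256.hexdigest([interval, pair, strategy].join("|"))
    @store[key] ||= begin
      @runs += 1
      run_backtest(interval, pair, strategy)
    end
  end

  private

  # Placeholder for the expensive simulation.
  def run_backtest(interval, pair, strategy)
    { interval: interval, pair: pair, strategy: strategy, profit: 0.0 }
  end
end

cache = BacktestCache.new
cache.fetch("1h", "BTC/USD", "ma_cross")
cache.fetch("1h", "BTC/USD", "ma_cross") # cache hit, no re-run
cache.fetch("1h", "ETH/USD", "ma_cross") # only the new pair runs
```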

So that worked - until I wanted to analyze multiple strategies against
multiple intervals/pairs. This is where things start getting into potential
supercomputer scale, because there are something like 2 billion+ possible
strategy combinations that could be tried - probably more once you factor in
tweaking the underlying indicator settings - and then that times however many
interval/pair combinations one selects. But since I don't have a
supercomputer, I had to get creative/selective.

I did make some improvements here. I cached all the calculations and signals
separately from the results, so a strategy variant would not need to re-run
calculations needlessly. I also limited the simulations to answering simpler
questions - instead of asking which indicator combination works best overall,
I ask which indicator, when paired with a pre-specified one, works best. Once
you have that, you can re-run with those two indicators to further test an
idea. It's not the same thing, and it's potentially an approach that
disregards strategies where a combination of individually poor indicators wins
when used together - but that's the sort of sacrifice one must make when you
don't have a supercomputer lying around.
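The trade-off can be sketched numerically: fixing one indicator shrinks the search from every pair to a handful of partner tests, at the cost of possibly missing the global best. The indicator names and scoring function below are made up so the example is runnable, not the author's real backtest:

```ruby
INDICATORS = %i[rsi macd ema bollinger stochastic]

# Stand-in for a full backtest of a two-indicator strategy; deterministic
# fake score (sum of name lengths) so the example runs without market data.
def score(indicator_pair)
  indicator_pair.sum { |i| i.to_s.length }
end

# Exhaustive: every unordered pair -> C(5, 2) = 10 simulations.
exhaustive = INDICATORS.combination(2).max_by { |pair| score(pair) }

# Selective: fix :rsi and test only its 4 possible partners.
fixed     = :rsi
partner   = (INDICATORS - [fixed]).max_by { |ind| score([fixed, ind]) }
selective = [fixed, partner]

# With real combinatorics (billions of strategies), the selective search is
# what makes the problem tractable - but note it can miss the pair that only
# wins in combination, exactly the sacrifice described above.
```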

TLDR: SQL is good at crunching numbers, so move what you can there; reading a
cache is usually faster than regenerating it, so use one where possible; and
be selective with your modeling parameters to reduce the runtime and
iterations needed for a result.

