I find the efforts of modern academics to do ML research on relatively underpowered hardware by being more clever about it reminiscent of Soviet researchers who, lacking anything like the access to computation of their American counterparts, were forced to be much more thorough and clever in their analysis of problems in the hope of making them tractable.
Optimizing for cache management and branch prediction is very difficult. And most programming is done at a level of abstraction that isn't amenable to staying portable (i.e. staying nimble) after optimization.
Plug: Staying algorithmically nimble after optimization is a problem we've "solved for" at our startup (monument.ai)
At least if you're taking inspiration from biological systems, it clearly is part of the equation, a really important one even.
12c/24t, 64 GB RAM, with a high-end video card.
tl;dr: upgrade the GPU, downgrade the CPU and RAM to keep similar pricing.
The mobo cut is also a pretty useful savings, though it will be an obstacle to multiple-GPU setups.
Also, the parent comment misleadingly suggests that a 3900X costs less than $300. That seems like an error in pcpartpicker, since clicking through reveals a true price of $400+.
That bad boy costs $0.96/hr on the current spot prices (https://aws.amazon.com/ec2/spot/pricing/)
The closest in performance to that GPU would be the V100 found in P3-instances.
I'm embarrassed that I feel so compelled to post this--on a Friday night at that--I apologize.
Don't be. Finding gems like this is why some of us read the comments.
There are certainly some workloads where it makes sense to own your own storage and rent computation, but you can't assume that by default for a "powerful AI" workload.
The closest you can get on AWS (more like System #3 in the paper with 4x GPUs) would be something like a p3.8xlarge instance, which will cost you $12.24/hr (on demand) or $3.65 to $5 (spot price, region dependent).
A single-GPU instance (p3.2xlarge; only 16 vCores, though) will cost you $3.06/hr on-demand or $0.90 to $1.20 (spot).
My assumption would be that either the GPU or the CPU is the bottleneck, most likely the GPU. Why not spend the money on more GPU and fewer cores?
Some experiment results:
Well, they presumably tested the same CPU with 4 GPUs (2080 Ti I think) - maybe they wanted to compare.
"Using a single machine equipped with a 36-core CPU and one GPU, the researchers were able to process roughly 140,000 frames per second while training on Atari videogames and Doom, or double the next best approach. On the 3D training environment DeepMind Lab, they clocked 40,000 frames per second—about 15 percent better than second place."
So, not a massive this-is-now-doable speedup.
A couple of things that stood out from the GitHub page:
Currently we only support homogenous multi-agent envs (same observation/action space for all agents, same episode duration).
For simplicity we actually treat all environments as multi-agent environments with 1 agent.
My speculation is that this is why they gained such a dramatic performance improvement (but I might be very wrong).
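To make my speculation concrete, here is a rough sketch of the kind of batching I have in mind (names, shapes and the env wrapper are mine, not from the repo): instead of calling the policy once per environment, you stack observations from many single-agent envs and do one batched forward pass.

    # Hypothetical sketch of the batching I'm speculating about; nothing here
    # is taken from the actual repo. `envs` are gym-style single-agent envs,
    # each wrapped so it caches its latest observation in last_obs.
    import numpy as np
    import torch
    import torch.nn as nn

    class TinyPolicy(nn.Module):
        def __init__(self, obs_dim, n_actions):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                     nn.Linear(64, n_actions))

        def forward(self, obs):
            return self.net(obs)

    def step_all(envs, policy):
        # One batched forward pass for N envs instead of N separate calls;
        # my guess is this is where most of the throughput win comes from.
        obs = np.stack([env.last_obs for env in envs])                  # [N, obs_dim]
        with torch.no_grad():
            logits = policy(torch.as_tensor(obs, dtype=torch.float32))  # [N, n_actions]
        actions = torch.distributions.Categorical(logits=logits).sample()
        for env, a in zip(envs, actions.tolist()):
            env.last_obs, _, _, _ = env.step(a)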
But for most state-of-the-art models (think GPT with billions of parameters) that is far from being the case.
Not saying it wouldn't be a worthwhile goal to improve the algorithms so that it becomes possible. At least on an 8x V100 machine, for Christ's sake. Because that's all I got.
Well, that's still one powerful supercomputer and allows you to pretrain BERT from scratch in just 33 hours.
I mean that's $100,000 in hardware you have at your disposal right there, which is still orders of magnitude beyond 8k-level workstation hardware...
It speaks to the sad state of affairs that is SOTA in ML/AI: only well-funded private institutions (like OpenAI) or multinational tech giants can really afford to achieve it.
That monopolises the technology, and papers like this help democratise it again.
Since most contemporary methods only make sense if lots of training data is available in the first place, many companies interested in trying ML do have plenty of manually labelled data available to them.
Their issue often is that they don't want to (or can't for regulatory reasons) send their data into the public cloud for processing. Any major speed-up is welcome in these scenarios.
If you can get all your data into RAM on a single computer, you can have a huge speedup, even over a cluster that has in aggregate more resources.
Frank McSherry has some more about this, though not directly about ML training.
One correction: it's not Doom, but ViZDoom, a simplified version designed for DRL.
For example, say t_inf = 10 microseconds, t_env = 1 microsecond and we are training on a 30 core machine. When k=600, we batch 300 inference steps in 10 mics, and complete 300 simulation steps on the other half of the rollout workers in 10 mics. Both cohorts of rollout workers finish at the same time, achieving optimal performance.
Am I thinking about this correctly? Also, this equation assumes that we can batch up k/2 inference jobs on the GPU without increasing t_inf correct?
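In case it helps, this is the toy arithmetic I have in mind, written out (my own reading of the overlap condition, not the authors' code; the function name is made up):

    # Toy check of double-buffered sampling as I understand it (not the authors' code).
    def cohort_times(k, n_cores, t_inf_us, t_env_us):
        half = k // 2
        sim_time = (half / n_cores) * t_env_us  # half the envs, spread across all cores
        inf_time = t_inf_us                     # assumes batching k/2 obs doesn't inflate t_inf
        return sim_time, inf_time

    # Numbers from the example above: t_inf = 10us, t_env = 1us, 30 cores, k = 600.
    sim, inf = cohort_times(600, 30, 10, 1)
    print(sim, inf)  # 10.0 10 -> both cohorts finish together, so neither CPU nor GPU idles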
The article makes the point that your technique will give an advantage to academic teams that don't have the resources of big corporations. To me it seems that your technique optimises the use of available resources, but the amount of available resources remains the deciding advantage. That is to say, both large corporate teams and smaller academic teams can improve their use of resources using your proposed approach, but large corporate teams have more of those resources than the smaller academic teams. So the large corporate teams will still come out ahead and the smaller academic teams will still be left "in the dust", as the article puts it. What do you think?
For the record, I'm a solo developer building a browser game that will eventually go online. I need to make intelligent bots to keep players busy until the game has a lot of online players.
I had a look at reinforcement learning, but I am not sure people are really using it for this use case.
Running an already trained reinforcement learning agent is relatively cheap (unless your model is massive).
I suspect the reason people aren't using it yet is because it's a) really difficult to get right in training, even basic convergence is not guaranteed without careful tuning b) really difficult to guarantee reasonable behavior outside of the scenarios you're able to reach in QA.
edit: Link to lecture series https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLqYmG7hTra...
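To put some numbers behind "relatively cheap": at runtime the bot only needs one small forward pass per tick. A minimal sketch, assuming a small exported PyTorch policy (the file name and the state encoding are placeholders):

    # Sketch of serving a (hypothetical) trained policy inside a game loop.
    import torch

    policy = torch.jit.load("bot_policy.pt")  # assumes you exported a TorchScript model
    policy.eval()

    def bot_action(game_state_vector):
        # One small forward pass per tick; for a network with a few thousand
        # parameters this is typically well under a millisecond on CPU.
        with torch.no_grad():
            obs = torch.as_tensor(game_state_vector, dtype=torch.float32).unsqueeze(0)
            logits = policy(obs)
            return int(logits.argmax(dim=-1))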
The answer is: don't touch ML, start with an FSM (finite state machine).
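For what it's worth, a bare-bones FSM bot is just a state variable plus a handful of transition rules per tick; the states below are made up for illustration:

    # Minimal finite state machine for a game bot (states and transitions are illustrative).
    class BotFSM:
        def __init__(self):
            self.state = "patrol"

        def update(self, sees_player, low_health):
            # A few if/else transition checks per tick is all the "AI" you need to start with.
            if self.state == "patrol" and sees_player:
                self.state = "attack"
            elif self.state == "attack" and low_health:
                self.state = "flee"
            elif self.state == "flee" and not sees_player:
                self.state = "patrol"
            return self.state

    bot = BotFSM()
    print(bot.update(sees_player=True, low_health=False))  # attack
    print(bot.update(sees_player=True, low_health=True))   # flee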
But still 140,000 images per second! That's nuts. Even 70,000 is nuts.
I can only play 1000 2048 games in one second. Damn! I am slow
If you want to see major improvements in the academic and home side of AI then NVIDIA and AMD need to bring more memory to consumer hardware. But there isn't much incentive because gamers don't need nearly the memory that researchers do.
Brain plans to use a "growing ray" (originally a "shrinking ray") to grow Pinky into super-size while dressed up as Gollyzilla, while Brain would turn himself gigantic and stop him, using the name Brainodo, in exchange for world domination. However, the real Gollyzilla emerges from the ocean and starts to rampage through the city, making Brain think that the dinosaur is Pinky. The episode ends with the ray going out of control and making everything on Earth grow, including the Earth itself, to the point that Pinky, the Brain, and even Gollyzilla are mouse-sized by comparison again.
Being able to train deep RL models on commodity hardware is only an advantage if there isn't anyone who can train on more powerful hardware (or if somehow training on more powerful hardware fails to improve performance with respect to your model). Otherwise, you're still just a little mouse and they have all the resources.
like, if you want to show relative improvement of some new variation of an RL algorithm, this could be a good way to do it. or if you have a new environment that you want to solve for yourself. right now if you try to train anything in a moderately interesting environment on a PC, it takes just a little too long to get results -- makes the whole research process pretty painful.
GH: One big challenge the community faces is that if you want to get a paper published in machine learning now it's got to have a table in it, with all these different data sets across the top, and all these different methods along the side, and your method has to look like the best one. If it doesn’t look like that, it’s hard to get published. I don't think that's encouraging people to think about radically new ideas.
Now if you send in a paper that has a radically new idea, there's no chance in hell it will get accepted, because it's going to get some junior reviewer who doesn't understand it. Or it’s going to get a senior reviewer who's trying to review too many papers and doesn't understand it first time round and assumes it must be nonsense. Anything that makes the brain hurt is not going to get accepted. And I think that's really bad.
What we should be going for, particularly in the basic science conferences, is radically new ideas. Because we know a radically new idea in the long run is going to be much more influential than a tiny improvement. That's I think the main downside of the fact that we've got this inversion now, where you've got a few senior guys and a gazillion young guys.
In other words, yes, unfortunately, everything is the SOTA rat race. At least anything that is meant for publication, which is the majority of research output.
at the same time, if you go to this year's ICML papers and ctrl-F "policy", there are several RL papers that come up with a new variant on policy gradient and validate it using only relatively small computing resources on simpler environments without any claim of being state of the art. probably many would directly benefit from this well-optimized policy gradient code.
It's funny, but older machine learning papers (most of what was published throughout the '70s, '80s and '90s) were a lot less focused on beating the leaderboard and much more on the discovery and understanding of general machine learning principles. As an example that I just happened to be reading recently, Pedro Domingos and others wrote a series of papers discussing Occam's Razor and why it is basically inappropriate in the form in which it is often used in machine learning (or rather, data mining and knowledge discovery, since that was back in the '90s). It seems there was a lively discussion about that, back then.
Ah, the paper:
Not innovation, exactly, but not the SOTA rat race, either.
Then again there are the general problems that are relevant to everyone, like natural language understanding, question answering, image recognition and so on, high-level tasks for which solutions have broad applicability. Rather tautologically, such tasks are always relevant to anyone who can perform them well.
In any case, if that were not so, there wouldn't be any motivation for academic teams to find ways to train with smaller computing resources, as the article reports.
In various benchmarks the geometric mean is usually used to aggregate scores across the different tasks, so that a single task where a method does unusually well (or badly) doesn't dominate the overall score.
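Concretely, something like the snippet below; because the geometric mean multiplies (rather than sums) the per-task scores, one task with an outsized score can't swamp the aggregate:

    # Geometric mean of per-task scores (e.g. normalized per-game returns).
    import math

    def geometric_mean(scores):
        # Equivalent to (s1 * s2 * ... * sn) ** (1/n), computed in log space for stability.
        return math.exp(sum(math.log(s) for s in scores) / len(scores))

    print(geometric_mean([1.0, 1.0, 100.0]))  # ~4.64, vs. an arithmetic mean of 34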
If your network is the same as IMPALA, why do you show its results with different hyperparameters? Were some of them necessary for the optimization (e.g. reduced batch size)?
I suspect the cloud has a decade, maybe less, of hype to grift on.
Huge data sets on a personal computer and opt-in data sharing with business and healthcare, etc will be the new norm.
Further out, software as we know it will cease to exist as entirely custom chips per application are the norm. IN TIME.
New hardware wars to capture consumer attention incoming.
Personally I prefer to own my own compute, my own storage and my own hardware. Cloud isn't any cheaper in the aggregate long-term, but it does spread risk and costs across a longer term.
Considering everything I'd rather own the risk and up-front costs just for my own privacy and self-determination.
But I would rather not have proprietary GPU drivers.