
Is training DL models in the cloud too expensive? - biosopher
Does anyone here have experience training models with Google's Cloud ML? We're currently training a model based off Food-2000 that takes about 5 days using a single K80 on a local machine. I'd like to estimate the cost of doing this faster on Google Cloud ML.

My estimates use the pricing located here: https://cloud.google.com/ml-engine/pricing#machine_types_for_custom_cluster_configurations

Cost = (ML training units * cost per unit / 60) * job duration in minutes

The "ML training units" for a standard_gpu is 3 and for a complex_model_m_gpu is 12. I'm assuming a standard_gpu is equivalent to a single GPU on the K80 (which has two GPUs), so my assumption is that a complex_model_m_gpu is 4x more expensive because it's equivalent to 2 x K80s.

The "cost per unit" in the US is $0.49 per hour. And since I'd be training with the equivalent of 2 x K80s in the cloud, my training should be closer to 2.5 days, which is 60 hours.

Cost = 12 * $0.49 * 60 = $353. Given that a K80 costs $4,000 on Amazon, it would take about 22.7 training runs to match the price of 2 x K80s. But we ran multiple experiments to fine-tune our model to get to this point, so we likely went way past that many training runs in total. Maybe running in the cloud is too expensive?
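
As a sanity check, here is the same arithmetic as a short Python sketch; the 12 training units, the $0.49/hour unit price, and the 60-hour duration are my assumptions from above, not verified Cloud ML figures:

    # Rough cost estimate for a Cloud ML Engine training job, using the
    # formula above. Prices and durations are this post's assumptions,
    # not verified figures from the pricing page.
    def training_cost(ml_training_units, price_per_unit_hour, duration_minutes):
        """Cost = (ML training units * cost per unit / 60) * job duration in minutes."""
        return (ml_training_units * price_per_unit_hour / 60) * duration_minutes

    # complex_model_m_gpu = 12 ML training units, $0.49/hour per unit (US),
    # and an assumed 2.5-day (60-hour) run on the equivalent of 2 x K80s.
    cost_per_run = training_cost(12, 0.49, 60 * 60)
    print("Cost per training run: $%.2f" % cost_per_run)                         # ~$352.80

    # Break-even against buying 2 x K80s at ~$4,000 each.
    hardware_cost = 2 * 4000
    print("Runs to match hardware cost: %.1f" % (hardware_cost / cost_per_run))  # ~22.7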
======
brudgers
Here the value proposition is trading money for time, and it depends on which
one is constraining the business. Reducing the training from 5 to 2.5 days might make
a difference to the business. Or the time to market may be determined by the
speed at which the marketing department can contract for a new campaign and
that might be on the scale of months rather than a few days. It might also be
that going from 5 days to 2.5 days is not enough of a speedup to go from
one iteration a week to two iterations a week because of setup and overhead, so
business velocity remains the same.

------
eggie5
My only comment would be that your TF model will not automatically employ the
second GPU. You need to architect your model, using TF routines, to use more
than one GPU.
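
For what it's worth, here is a minimal sketch of the explicit multi-GPU placement this implies, in TF 1.x graph style; the toy model, the input split, and the gradient averaging are illustrative stand-ins, not anything from the original post:

    import tensorflow as tf  # TF 1.x graph-style API

    def build_loss(x, y):
        # Toy stand-in for the real model.
        logits = tf.layers.dense(x, 10)
        return tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))

    x = tf.placeholder(tf.float32, [128, 784])
    y = tf.placeholder(tf.int32, [128])
    optimizer = tf.train.GradientDescentOptimizer(0.01)

    # One "tower" per GPU: each gets half the batch and shares the variables.
    tower_grads = []
    for i, (x_shard, y_shard) in enumerate(zip(tf.split(x, 2), tf.split(y, 2))):
        with tf.device('/gpu:%d' % i), tf.variable_scope('model', reuse=(i > 0)):
            loss = build_loss(x_shard, y_shard)
            tower_grads.append(optimizer.compute_gradients(loss))

    # Average the per-GPU gradients and apply them once.
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        grads = [g for g, _ in grads_and_vars if g is not None]
        averaged.append((tf.reduce_mean(tf.stack(grads), axis=0), grads_and_vars[0][1]))
    train_op = optimizer.apply_gradients(averaged)

In a session you would feed the full batch to x and y and run train_op; each GPU processes half of it. Without this kind of explicit placement, everything tends to land on /gpu:0.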

