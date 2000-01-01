|
Is training DL models in the cloud too expensive?
Does anyone here have experience training models with Google's Cloud ML? We're currently training a model based on Food-2000 that takes about 5 days using a single K80 on a local machine. I'd like to estimate what it would cost to do this faster using Google Cloud ML.
My estimates use the pricing located here: https://cloud.google.com/ml-engine/pricing#machine_types_for_custom_cluster_configurations
Cost = (ML training units * cost per unit / 60) * job duration in minutes
The "ML training units" for a standard_gpu is 3 and for a complex_model_m_gpu is 12. I'm assuming a standard_gpu is equivalent to a single GPU on the K80 (which has two GPUs). So my assumption is that a complex_model_m_gpu costs 4x as much because it's equivalent to 4 GPUs, i.e. 2 x K80s.
The "cost per unit" in the US is $0.59 per hour. And since I'd be training with 2 x K80s in the cloud instead of one, my training time should be closer to 2.5 days, which is 60 hours.
Cost = 12 * $0.59 * 60 = $425. Given that a K80 costs $4,000 on Amazon, it would take 18.8 training runs to match the $8,000 price of 2 x K80s. But we ran multiple experiments to fine-tune our model to this point, so we have likely gone well past 18.8 training runs in total. Maybe running in the cloud is too expensive?
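For anyone who wants to check or tweak the numbers, here is a minimal sketch of the arithmetic above in Python. The function names are made up for illustration, and the inputs are just the figures from this post (12 training units, $0.59 per unit per hour as used in the cost calculation, 60 hours, $4,000 per K80). It also assumes that training time scales linearly with the number of GPUs, which may not hold in practice.

```python
def training_cost(units, price_per_unit_hour, duration_hours):
    """Cost = (units * price per unit / 60) * job duration in minutes."""
    duration_minutes = duration_hours * 60
    return (units * price_per_unit_hour / 60) * duration_minutes


def break_even_runs(hardware_cost, cost_per_run):
    """Number of cloud runs that cost as much as buying the hardware."""
    return hardware_cost / cost_per_run


# Figures taken from the post; all of them are estimates.
cost = training_cost(units=12, price_per_unit_hour=0.59, duration_hours=60)
runs = break_even_runs(hardware_cost=2 * 4000, cost_per_run=cost)

print(f"Cost per run: ${cost:.0f}")      # about $425
print(f"Break-even runs: {runs:.1f}")    # about 18.8
```

Note that this only compares the purchase price of the cards against cloud billing. It ignores power, the rest of the machine, and idle time, so it is a rough lower bound on the local cost rather than a full comparison.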
