Thanks for the write-up, Max! I want to clarify something though: how do you handle and account for preemption? As we document online, we've oscillated between 5 and 15% preemption rates (on average, varying from zone to zone and day to day), and those rates are also going to be higher for the largest instances (like highcpu-64). If you need to train longer than our 24-hour limit, or you're getting preempted too often, that's a real drawback. (Note: I'm all for using preemptible for development and/or all batch-ey things, but only if you're ready for the trade-off.)
While we don't support preemptible with GPUs yet, it's mostly because the team wanted to see some usage history. We didn't launch Preemptible until about 18 months after GCE itself went GA, and even then it involved a lot of handwringing over cannibalization and economics. We've looked at it on and off, but the first priority for the team is to get K80s to General Availability.
Again, Disclosure: I work on Google Cloud (and love when people love preemptible).
I do most of my experiments with Jupyter Notebooks and Keras on top of TensorFlow. Keras has a ModelCheckpoint callback (https://keras.io/callbacks/#modelcheckpoint) which saves the model to disk after each epoch; it's super easy to implement (1 LOC) and a good idea even if I weren't training on a preemptible instance. In the event of an unexpected preemption, I can just retransform the data (easy with a Jupyter-organized workflow), load the last-saved model (1 LOC), and resume training.
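For reference, a minimal sketch of that checkpoint-and-resume workflow; the model builder, training arrays, and filename here are just placeholders, not anything from the original post:

```python
from keras.callbacks import ModelCheckpoint
from keras.models import load_model

# Save the full model to the same file after every epoch (the default behavior).
checkpoint = ModelCheckpoint('model-latest.h5', save_best_only=False)

model = build_model()  # placeholder: whatever architecture you're training
model.fit(x_train, y_train, epochs=100, callbacks=[checkpoint])

# After a surprise preemption: re-run the data prep cells, then pick up where it left off.
model = load_model('model-latest.h5')
model.fit(x_train, y_train, epochs=remaining_epochs, callbacks=[checkpoint])
```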
The drawback there is if the epochs are long, which risks losing more progress than you'd like to a single preemption.
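If epochs are that long, one workaround (a rough sketch, not something from the article; the filename and interval are arbitrary) is a small custom callback that also saves every N batches, at the cost of the equal-exposure guarantee discussed below:

```python
from keras.callbacks import Callback

class PeriodicBatchCheckpoint(Callback):
    """Also save the model every `every_n` batches, so a mid-epoch preemption costs less."""
    def __init__(self, filepath, every_n=500):
        super().__init__()
        self.filepath = filepath
        self.every_n = every_n

    def on_batch_end(self, batch, logs=None):
        if batch > 0 and batch % self.every_n == 0:
            self.model.save(self.filepath)  # overwrite the same file each time

# model.fit(x_train, y_train, callbacks=[PeriodicBatchCheckpoint('model-midepoch.h5')])
```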
One reason to use epoch checkpointing is that it ensures all samples of the training data have been seen the same number of times. If your data is large and diverse, with heavy enough augmentation it might not matter very much.
Disclosure: I work on Google Cloud.
Meanwhile, the 10x price/performance difference is the main point. Really eager to see the TPUs rolled out broadly; please do price them to take market share from NVIDIA.
My lord. HPC resources are incredibly affordable. Hetzner and some of the other dedicated server companies in Europe/Canada have some amazing deals (we've used OVH in the past with great success, and right now we use Paperspace for CPU-intensive stuff we want to share expensive licensing on, like Visual3D).
We've been looking at setting up a small cluster of servers at work (budget of about $500), and I was still going to go with Versaweb. After seeing Hetzner, I'm going to reassess, and likely move everything there.
I'm paying €150 for what it seems I could pay €100 for. There was something that made me decide against Hetzner a few years ago, but I'll research and see if their TOS are now different.
EDIT: My numbers are wrong; I'm going to pay less for 4x the RAM (256 GB).
Their pricing structures are slightly different, Versaweb gives me a bit more flexibility when configuring, a wider IP subnet bundled (instead of 1 usable IP), and a few other things which I'm investigating.
I also have to consider laws and network latency as these are in different regions.
In the end, I am paying $180 for a Haswell Xeon with lots of disk space and IO. I could pay the same amount for more RAM on the same CPU, albeit with slightly less space.
If I keep the same setup at a fraction of the cost, I could end up getting the 1080 GPU in the same datacenter. It feels like the same or a similar market for my needs...
They're on different continents, which is a pretty fundamental difference.
It could go a lot lower. Hetzner's profit margin on renting a server like this for 99€/month is formidable.
Relative to the market, that price is very, very good.
If I were to buy such a system it would be over €2000, and that's not including cooling or a case for it either. Granted, I live in Sweden, so taxes are a bit on the high side.
Regardless, it will be at least 20 months before they make a dime (assuming they can rent it 100% of the time). And in that time it will consume rack space along with electricity, bandwidth (2 Gbit/s and 50 TB per month), and a dedicated IP.
And after all that time the machine isn't that hot anymore, yet it still draws just as much electricity.
For my model/data, Hetzner runs 1 training epoch in 1 hr vs 1.75 hr for Google. I'm moving the rest of my work over tomorrow. When Google has TPUs available, I'll look at it again.
Thanks for the tip!
They should do some marketing on the disastrous effects it can have on training.
I've talked with several second-tier cloud providers, and the GTX 1080 Ti is what their large-deployment customers use. At the NVIDIA conference they were all promoting the P100 (NVIDIA insisted), but all admitted that nobody had asked them to deploy P100s at scale.
The Hetzner box is about €0.15 an hour. That means more GPUs per developer.
Do you have any evidence for this?
While you're here: the other reason we switched to Hetzner is reliability. Sure, we can continue training from the last checkpoint, but we still lost half a day on average to the many surprise reboots. We suspect that you've overbooked the GPUs and someone has to lose when too many connect.
Although I agree it is somewhat confusing in terms of performance.
A lot of times I couldn't find a comment/story I know I saved, because I've upvoted pages upon pages of stuff since.
I was trying to calculate the total cost, since their list price excludes VAT. It turns out they just booked the server for me and started sending invoices. Of course they allow you to cancel within 14 days, but I was dealing with a personal issue and didn't check my email for almost a month. It turned messy.
If Hetzner support are listening: please improve the process and, if possible, take credit card/payment details upfront so the person is aware that you are spinning up the server for them.
10x more seems like a lot, but it really depends on how you use it; it's no secret that cloud is more expensive.
Disclaimer: Paperspace team.
 - https://cloud.google.com/tpu/
My question specifically being: if I configure a k8s cluster to have all my slaves as preemptible nodes, would GCP automatically add new nodes as my old nodes are deleted (from what I understand, preemptible nodes are assigned to you for a max of 24 hrs)?
Considering the pricing of preemptible nodes + the discounts that GCP assigns to you for sustained use, it makes cloud insanely cheap for an early stage startup.
Go for it as long as you understand the downside. It's possible that all instances get preempted at once (especially at the 24hr mark), that there isn't capacity to spin up new preemptible nodes in the selected zone once the old instance is deleted, etc. New VMs also take time to boot and join the cluster.
If you are just doing dev/test stuff, I'd recommend using a namespace in your production cluster or spinning up and down test clusters on demand (which can be preemptible).
If you have long running tasks (like a database) or are serving production traffic, using 100% preemptible nodes is not a good idea.
Preemptible can be great for burst traffic and batch jobs, or you can mix preemptible and standard nodes to get the right balance of stability and cost.
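To make that mix concrete, here's a purely illustrative sketch (if I remember right, GKE labels preemptible nodes with cloud.google.com/gke-preemptible, so batch work can be pinned to them with a nodeSelector while stateful services simply omit it; names and image are placeholders):

```yaml
# Illustrative only: a batch training Job steered onto preemptible nodes.
apiVersion: batch/v1
kind: Job
metadata:
  name: training-batch
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-preemptible: "true"
      containers:
      - name: trainer
        image: gcr.io/my-project/trainer:latest   # placeholder image
      restartPolicy: OnFailure                    # re-run the pod if its node is preempted
```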
Not me or OP, but same team :)
Would those types of mitigations work similarly with Google's preemptible VMs?
There are a few interesting projects out there that do the kind of automation you're speaking of, like these:
Spreading multiple smaller machines over a multi-zone k8s deployment might help mitigate, but it will never solve all the issues.
Then you scale up to the cloud to do hyperparameter search.
There is a notable CPU-specific TensorFlow behavior: if you install from pip (as the official instructions and tutorials recommend) and begin training a model in TensorFlow, you'll see these warnings in the console:
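(The warnings referred to are the cpu_feature_guard messages; the exact wording and the list of instruction sets vary with the TensorFlow build and your CPU, but they look roughly like this:)

```
The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
```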
FWIW I get the console warnings with the tensorflow-gpu installation from pip, and I verified that it was actually using the GPU.
What range of GPU performance do you see? As in, if the card does 10 TFLOPS peak, does TensorFlow manage to reach that peak, or is it at 5% or 20% or some other percent of peak typically?
And are there expectations for Google's new-generation TPU? What range of peak performance do people expect to get?
Our benchmarks for processing 1,000,000 images with ResNet-50:
- 8x Tesla K80: 43 min 3 s
- 8x Nvidia 1080: 17 min 32 s (€0.09/minute)
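(If the €0.09/minute is for the whole 8-GPU box, that run works out to roughly €1.60 for the million images: 17.5 minutes x 0.09 €/minute ≈ €1.58.)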
We can provide you resources for free for research.
Disclosure: I'm founder of LeaderGPU.
Would be interesting to see these benchmarks on Haswell/Broadwell vs Skylake.
This is with a small(ish) network of perhaps a few hundred nodes... should I see a speedup for this case, or are GPUs only relevant for large CNNs, etc.?
In practice, there's a multitude of reasons why CPUs are more efficient (or at least faster) for smaller networks.
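A quick way to see this for yourself; a rough sketch assuming TensorFlow 2.x eager mode and a visible GPU, with arbitrary sizes. For small matrices the per-op launch and host-device transfer overhead dominates, so the GPU can come out slower:

```python
import time
import tensorflow as tf

def avg_matmul_time(device, n, reps=100):
    """Average seconds per n x n matmul on the given device."""
    with tf.device(device):
        a = tf.random.normal((n, n))
        b = tf.random.normal((n, n))
        _ = tf.matmul(a, b).numpy()  # warm-up; excludes one-time allocation/kernel costs
        start = time.time()
        for _ in range(reps):
            _ = tf.matmul(a, b).numpy()  # .numpy() pulls the result back, forcing a sync
        return (time.time() - start) / reps

for n in (64, 4096):  # tiny problem vs. one big enough to keep a GPU busy
    print(n, "CPU:", avg_matmul_time("/CPU:0", n), "GPU:", avg_matmul_time("/GPU:0", n))
```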
In my experience (not tf related, I mainly work on my own library now: https://github.com/chewxy/gorgonia) even with a cgo penalty, deep networks do improve with GPU training. Never dabbled much in CNNs (convolutions tend to do my head in) so can't say much.
The library doesn't handle NUMA hardware?
You'd be surprised what a difference it makes. It was one of the reasons I liked Gentoo: emerge would always build from source for your target CPU's flags, instead of using the package manager's "one size fits all" build. Those 5-10% gains really compound when you add them up across all your dependencies.
Kudos to tutorials and guides that instruct how to build from source.
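For TensorFlow specifically, the from-source route of that era looked roughly like the following; flags move around between versions, so treat this as a sketch and check the current build docs:

```sh
./configure   # answer the prompts (CUDA support, compute capability, etc.)
bazel build -c opt --copt=-march=native //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow-*.whl
```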
The same is every bit as true today for your containers, assuming you have a homogeneous target to run them on (yes, I know, containers are supposed to be supremely portable, but private ones can be purpose-built).
Can you tell me more about this? I wanted to switch to Ryzen architecture with my video transcoding project that handles large volume, but because we lean heavily on x264/ffmpeg, it didn't seem like a good idea given the AVX issues, keeping me on i7-based architecture. (Previous comments of mine will show the history of this particular thread.)
Would love to hear it here or via my throwaway: firstname.lastname@example.org. Thank you so much.
Obviously depends on your workload, but on my highly parallel "standard" workloads, my experience is that you can get at most 15% more with hyperthreading on (e.g. 4 cores/8 threads) compared to off (4 cores/4 threads), whereas on the cache intensive loads, I get 20-30% LESS with hyperthreading on.
Depends on the provider. Azure, for instance, has hyperthreading disabled on most of their configurations. They're starting to offer new configurations with hyperthreading though.