Interesting article, thanks for sharing. We actually just went in the exact opposite direction because of the larger-scale issues we were having with Softlayer. Do you feel like you lost any resiliency by making the switch to physical servers (more virtual instances on one physical server, servers in the same rack, etc.)?

No, I really do not think going to EC2 would improve resiliency in any way compared to Softlayer. SL lets you control which VLANs your box ends up on. VLANs can be treated as racks (since they do not allocate more than one VLAN per rack). On top of that, you have multiple DCs in one region (e.g. DAL01, DAL05, DAL07, etc.) and many different regions (DAL, SEA, WAS, AMS, etc.).

I'd be very interested to hear what problems you were having with them and at what scale. If this is a private topic, we could discuss it over email or some other medium if you like. You can contact me by any of the means listed here: http://kovyrin.net/contact/

We were about 75% virtual with SL and 25% bare metal. One of our issues with the virtual instances came up when we started dedicating them to a set VLAN: multiple times, some type of resource for the pod that VLAN was in would be maxed out (usually storage), so we couldn't create a new instance. The solution we were given was to let the system pick a VLAN, but by doing that we lost control of the placement and added some complexity to our architecture.

Aside from that, it was mainly nit-picky stuff, but still things that were annoying (networking issues between DCs, networking issues between pods, internal mirrored apt-get repos going out of sync, the API being kind of blah, etc.).

We use Docker, so having a few bare metal machines with tons of containers on them wasn't a great HA setup (for us at least), even running in two data centers. The fairly quick setup time, though, was a nice selling point.

When we went to AWS, things just kind of worked. The API was easier to use and the GUI portal was way nicer and more stable. So far we have not had any odd issues with our instances, but we also typically run them at about 50% capacity, so that might be why. It is also still early, so maybe things will come up in 6+ months that send us back to SL :)

