I really enjoyed that write-up. The focus on good UI/UX in platform teams definitely resonates with me. I also found it interesting that the article places an emphasis on platform teams providing value by being better "glue"; I hadn't really thought of glue as a component where value could be added. I really like the pattern of tools like "Devhose": developing a tool specifically to be the glue that integrates well with other tools.
I wonder if AMD will try to change that in the next couple of years. AFAIK the reason for this is some clever licensing clauses by NVIDIA that enforce market segmentation. If there were more competition on the high end, there would be downward pricing pressure.
It's also annoying because SR-IOV would be a wonderful thing for consumer GPUs, but it would make it too easy for cloud providers to use consumer GPUs.
Right now, you can run a VM with qemu and pass through the GPU to the guest OS, getting pretty close to native performance. With SR-IOV, every VM could have the same GPU attached, and you could manage performance with the hypervisor. This would let you toggle between VMs instantly, getting full performance on each one (assuming the others are idle).
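For concreteness, here's a minimal sketch of what that single-VM passthrough setup looks like, just a thin Python wrapper around a QEMU/KVM invocation. The PCI address, resource sizes and disk image are made-up placeholders, and it assumes the GPU is already bound to vfio-pci on the host:

```python
# Minimal sketch of today's single-VM passthrough: launch a QEMU/KVM guest with
# one GPU handed over wholesale via VFIO. The PCI address and sizes below are
# hypothetical placeholders.
import subprocess

GPU_PCI_ADDR = "0000:01:00.0"  # hypothetical address of the passed-through GPU

qemu_cmd = [
    "qemu-system-x86_64",
    "-enable-kvm",                                # KVM acceleration for near-native speed
    "-machine", "q35",
    "-cpu", "host",                               # expose the host CPU model to the guest
    "-smp", "8",
    "-m", "16G",
    "-device", f"vfio-pci,host={GPU_PCI_ADDR}",   # hand the whole GPU to this one guest
    "-drive", "file=guest.qcow2,if=virtio",
]

# With SR-IOV, each VM would instead attach its own virtual function of the same
# physical GPU, so several guests could share it concurrently.
subprocess.run(qemu_cmd, check=True)
```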
AMD and NVIDIA do make SR-IOV cards, but they're extremely expensive, intended for data centers, and don't have display output. If it ever hits consumer cards, Linux will be the hypervisor of choice for pretty much everyone, because there will be minimal performance penalty for using VMs.
AWS takes care of most of them. Teams generally use cfn / terraform to define that kind of infra (e.g. I need a DB with these properties) and the definitions get applied as part of the standard deployment pipelines. We contributed support for cfn in Spinnaker to enable this.
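Roughly, "applied as part of the pipeline" boils down to something like this sketch; the stack name, template and parameters are hypothetical, and the real setup goes through Spinnaker's cfn support rather than raw boto3:

```python
# Minimal sketch of applying a team's cfn template from a pipeline step.
# Stack name, template path and parameters are hypothetical examples.
import boto3

cfn = boto3.client("cloudformation", region_name="eu-west-1")

with open("rds-postgres.yaml") as f:          # team-owned template: "I need a DB with these properties"
    template_body = f.read()

cfn.create_stack(
    StackName="team-foo-orders-db",
    TemplateBody=template_body,
    Parameters=[
        {"ParameterKey": "InstanceClass", "ParameterValue": "db.t3.medium"},
        {"ParameterKey": "AllocatedStorage", "ParameterValue": "100"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],    # needed if the template creates IAM resources
)

# Block until the stack (and the RDS instance it defines) is ready.
cfn.get_waiter("stack_create_complete").wait(StackName="team-foo-orders-db")
```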
A team in a different area from mine does offer Kafka as a service, but this pattern is an exception, and the org is actively moving away from it in most other cases. For example, a while back a team took care of offering "Cassandras as a service", managed and operated for N product teams. They don't anymore, for the reasons I explained in the article:
- AWS catches up (e.g. they recently announced Cass as a service)
- $commercial company has a Good Enough alternative (e.g. several teams are dropping Cass in favour of Dynamo)
- It's operationally very expensive to maintain a team that does $storage as a service.
- The cost/benefit ratio for doing our own managed storage only adds up when the available managed solutions are too immature, or lack too many features that you need. The team offering managed Cassandras actually did this and moved to offering managed Kafka clusters ~1y before AWS released the first version of theirs.
So Terraform spins up a new RDS instance; what about all the management after that? Are teams left to their own devices regarding data/schema migrations, or does your team also provide a golden path there?
They are on their own. The team doing managed Kafka does offer some tools for data / schema migrations, but there is nothing for other storage systems.
Generally they actually don't raise this as a pain point; teams tend to be quite self-sufficient in this regard and rely on things like https://flywaydb.org/ to deal with it.
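A typical self-service setup with Flyway is just a directory of versioned SQL files plus a pipeline step along these lines (connection details and paths below are made up):

```python
# Sketch of the kind of migration step teams run themselves, by shelling out to
# the Flyway CLI. The JDBC URL, credentials and directory layout
# (db/migration/V1__create_orders.sql, ...) are hypothetical placeholders.
import subprocess

DB_URL = "jdbc:postgresql://orders-db.example.internal:5432/orders"  # placeholder

subprocess.run(
    [
        "flyway",
        f"-url={DB_URL}",
        "-user=orders_svc",
        "-password=REDACTED",                  # in practice injected by the pipeline
        "-locations=filesystem:db/migration",  # versioned SQL files, e.g. V1__create_orders.sql
        "migrate",                             # apply any migrations not yet recorded
    ],
    check=True,
)
```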
From our PoV, at this point this type of feature would be in the high cost / low impact quadrant. Not because it doesn't make sense; on the contrary. It's just that it falls at a later stage of maturity than we're at organisationally. As I mentioned, Adevinta is an extreme case of sprawl. To invest productively in "a fully managed relational db with data/schema migrations" we'd need a certain degree of consensus on which db we should support. We don't have that luxury. Even more: we'd also need some form of consensus on "this is how you deploy and release software", which there isn't either (do you do proper CD? is there a manual judgement? are there deployments to a PRE/Staging env? ...). This heterogeneity greatly limits our impact (sprawl reduces the potential surface of impact for any decision we make), so we focus on areas where we can avoid this problem (e.g. everyone buys cloud, containers...). But also, as I said above, data/schema migrations is actually not a pain point teams complain about frequently.
We provision database access through HashiCorp Vault. It's excellent: short-lived credentials issued against Kubernetes service accounts (lots of glue here!).
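The flow looks roughly like this minimal sketch using the hvac client; the Vault address, auth role and database role names are placeholders, not our real config:

```python
# Sketch of the "glue": a pod authenticates to Vault with its Kubernetes service
# account token and asks the database secrets engine for short-lived credentials.
# Role and mount names are hypothetical.
import hvac

VAULT_ADDR = "https://vault.example.internal:8200"   # placeholder
K8S_JWT_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"

client = hvac.Client(url=VAULT_ADDR)

# Log in via the Kubernetes auth method using the pod's service account token.
with open(K8S_JWT_PATH) as f:
    client.auth.kubernetes.login(role="orders-service", jwt=f.read())

# Request dynamic, short-lived credentials for the team's database role.
creds = client.secrets.database.generate_credentials(name="orders-readwrite")
username = creds["data"]["username"]
password = creds["data"]["password"]
# These expire after the role's TTL; the app or wrapper script simply re-requests.
```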
After the RDS instance is created, though, we need to manually create credentials so that Vault gains access to manage it; automating this is our next mission.
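The part to automate is essentially registering the new instance with Vault's database secrets engine, which could be scripted along these lines (a hypothetical sketch; connection details, credentials and SQL statements are placeholders):

```python
# Hypothetical sketch of the step we'd like to automate: after the RDS instance
# is created, register it with Vault's database secrets engine so Vault can mint
# and rotate credentials for it. All names and statements are placeholders.
import hvac

client = hvac.Client(url="https://vault.example.internal:8200", token="ADMIN_TOKEN")

# Tell Vault how to reach the new MySQL/RDS instance with a user it can manage.
client.secrets.database.configure(
    name="orders-db",
    plugin_name="mysql-database-plugin",
    connection_url="{{username}}:{{password}}@tcp(orders-db.example.internal:3306)/",
    username="vault_admin",
    password="INITIAL_PASSWORD",          # created by hand today; the part to automate
    allowed_roles=["orders-readwrite"],
)

# Define the role teams request credentials against (see the previous sketch).
client.secrets.database.create_role(
    name="orders-readwrite",
    db_name="orders-db",
    creation_statements=[
        "CREATE USER '{{name}}'@'%' IDENTIFIED BY '{{password}}';",
        "GRANT SELECT, INSERT, UPDATE, DELETE ON orders.* TO '{{name}}'@'%';",
    ],
    default_ttl="1h",
    max_ttl="24h",
)
```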
With credentials in place, teams need to maintain schema creation and migrations themselves. We provide wrapper scripts that use the Vault credentials to open a mysql shell or run Percona's pt-online-schema-change. Some teams create pre-deploy jobs or init containers so that their service can run migrations automatically.
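One of those wrapper scripts can be as small as this sketch, reusing the Vault login from above; the host, role and database names are made up:

```python
# Sketch of a wrapper script: fetch short-lived credentials from Vault and drop
# the user into a mysql shell against the team's database. Names are hypothetical.
import subprocess
import hvac

client = hvac.Client(url="https://vault.example.internal:8200")
with open("/var/run/secrets/kubernetes.io/serviceaccount/token") as f:
    client.auth.kubernetes.login(role="orders-service", jwt=f.read())

creds = client.secrets.database.generate_credentials(name="orders-readwrite")["data"]

# The same pattern works for pt-online-schema-change or a migration job in an
# init container: inject the ephemeral credentials, run the tool, exit.
subprocess.run(
    [
        "mysql",
        "-h", "orders-db.example.internal",
        "-u", creds["username"],
        f"-p{creds['password']}",   # mysql expects the password glued to -p
        "orders",
    ],
    check=True,
)
```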
Resonates very well with me, working at a scale-up in a "platform team".
Business-wise we're set to out-engineer the competition. The biggest challenges are definitely getting engineers on board with training and transitioning into new "cool" technology. Should we help highly skilled, advanced teams run fast, or focus on getting everyone on board the cloud-native train?
Nice write-up Galo! As an ex-Deliveriarian I really enjoyed working this way, and the timing is perfect as I’m trying to set up a similar but way smaller PaaS. I’ll be borrowing this text :D
As a one-person platform team (bare-metal cluster + HPC resources), this article, while super impressive, gave me a kind of anxiety attack. I can barely put the fires out, and geez, look how much better it could be. There's hope for some at least.
Being a one-person team is tough; I wouldn't beat yourself up over that. It's usually impossible to get heads-down on a project while firefighting 90% of the time.
> put the software-defined networking, software-defined storage, kubernetes, docker, KVM, libvirt, Windows AD, FreeIPA, thousands of Python conda environments, annoying users, boss users, NFS shares, NFS servers which can't do v4.1, CIFS, iSCSI LUNs, ancient iDRAC, circuit breaker replacements for air conditioners, data migrations due to lack of quotas, quotas and ACLs over NFS on mixed domains, DNS A records, a shot motherboard in high-density nodes, ........... & magically there is time in the day
Every company and team operates under different circumstances that might not be in their control. It's a company effort, not only an engineering one, to get into a position where using public cloud or buying off-the-shelf products is possible.
Having skilled legal and sales + business colleagues is key, but a luxury in these situations.
That's a good question. I've tried to pick up what I can from online communities like Spice, r/sysadmin, and the Linux & CentOS forums, but the most important thing I've done is actually work on some of the Tier-0 systems (in Europe) where everything tends to be done really well, and then try to imitate that (on a shoestring budget, of course).