
Talk write-up: How to build a PaaS for 1500 engineers - srvaroa
https://srvaroa.github.io/paas/infrastructure/platform/kubernetes/cloud/2020/01/02/talk-how-to-build-a-paas-for-1500-engineers.html
======
seslattery
I really enjoyed that write-up. The focus on good UI/UX in platform teams is
definitely something that resonates with me. I also liked the article's
emphasis on platform teams providing value by having better "glue"; I hadn't
really thought of glue as a component where value could be added. I really
like the pattern of tools like "Devhose": developing a tool specifically to
be the glue that integrates well with other tools.

------
streetcat1
I just want to note that an in-house platform can also be a big cost saver
vs public clouds.

For example, you should be able to be 10x cheaper on GPU compute if you use
consumer GPUs in-house.

From a conversation I had with an academic IT person: he managed to save
$2M yearly.

~~~
mushufasa
I wonder if AMD will try to change that in the next couple of years. AFAIK
the reason for this is some clever licensing clauses by NVIDIA that enforce
market segmentation. If there were more competition on the high end, there
would be downward pricing pressure.

~~~
anon9001
It's also annoying because SR-IOV would be a wonderful thing for consumer
GPUs, but it would make it too easy for cloud providers to use consumer GPUs.

Right now, you can run a VM with qemu and pass through the GPU to the guest
OS, getting pretty close to native performance. With SR-IOV, every VM could
have the same GPU attached, and you could manage performance with the
hypervisor. This would let you toggle between VMs instantly, getting full
performance on each one (assuming the others are idle).
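
Roughly, the passthrough setup looks like this. A minimal sketch, wrapped in
Python; the PCI address, memory size, and disk image are placeholder
assumptions, and the card must already be bound to the vfio-pci driver with
IOMMU enabled:

    import subprocess

    GPU_PCI_ADDR = "01:00.0"  # hypothetical host PCI address of the GPU

    subprocess.run(
        [
            "qemu-system-x86_64",
            "-enable-kvm",       # KVM gives near-native CPU performance
            "-machine", "q35",
            "-cpu", "host",      # expose the host CPU model to the guest
            "-m", "8G",
            # hand the whole physical GPU to this one guest
            "-device", f"vfio-pci,host={GPU_PCI_ADDR}",
            "-drive", "file=guest.qcow2,format=qcow2",  # placeholder image
        ],
        check=True,
    )

Only one running VM can own the card this way; SR-IOV is what would lift
that restriction.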

AMD and NVIDIA do make SR-IOV cards, but they're extremely expensive, intended
for data centers, and don't have display output. If it ever hits consumer
cards, Linux will be the hypervisor of choice for pretty much everyone,
because there will be minimal performance penalty for using VMs.

~~~
latchkey
For reference:

[https://community.amd.com/community/radeon-instinct-accelera...](https://community.amd.com/community/radeon-instinct-accelerators/blog/2019/11/01/what-is-sr-iov-why-it-s-the-gold-standard-for-gpu-sharing)

[https://www.amd.com/en/graphics/workstation-virtual-graphics](https://www.amd.com/en/graphics/workstation-virtual-graphics)

[https://www.reddit.com/r/Amd/comments/aemr9x/sriov_now_is_th...](https://www.reddit.com/r/Amd/comments/aemr9x/sriov_now_is_the_time_to_push_for_this_feature/)

------
aeyes
Stateless is the easy part; where are the databases?

~~~
srvaroa
AWS takes care of most of them. All teams generally use cfn / terraform to
define that kind of infra (e.g. I need a DB with these properties) and they
get applied as part of the standard deployment pipelines. We contributed
support for cfn in Spinnaker to enable this.
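
As a rough illustration, the "I need a DB with these properties" part boils
down to something like the sketch below. This is hypothetical (stack name,
properties and the secret reference are placeholders, and in reality the
pipeline applies the template rather than a script), but it shows the shape:

    import json
    import boto3

    # Hypothetical template: "I need a DB with these properties".
    template = {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "TeamDatabase": {
                "Type": "AWS::RDS::DBInstance",
                "Properties": {
                    "Engine": "postgres",
                    "DBInstanceClass": "db.t3.medium",
                    "AllocatedStorage": "100",
                    "MasterUsername": "app",
                    # resolved at deploy time from Secrets Manager
                    "MasterUserPassword":
                        "{{resolve:secretsmanager:team/db/password}}",
                },
            }
        },
    }

    # The deployment pipeline does the equivalent of this API call when
    # the stack gets applied.
    cfn = boto3.client("cloudformation")
    cfn.create_stack(
        StackName="team-foo-database",  # illustrative name
        TemplateBody=json.dumps(template),
    )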

A team in a different area to mine does offer Kafka as a service, but this
pattern is an exception, and the org is actively moving away from it in most
other cases. For example, a while back a team took care of offering
"Cassandra as a service", managed and operated for N product teams. They
don't anymore, for reasons I explained in the article:

- AWS catches up (e.g. they recently announced Cassandra as a service)
- $commercial company has a Good Enough alternative (e.g. several teams are
dropping Cassandra in favour of Dynamo)
- It's operationally _very_ expensive to maintain a team that does $storage
as a service.
- The cost/benefit ratio for doing our own managed storage only adds up when
the managed solution is too immature, or lacks too many features that you
need.

The team offering managed Cassandra actually did this and moved to offering
managed Kafka clusters ~1y before AWS released the first version.

Does that make sense?

~~~
claytonjy
So Terraform spins up a new RDS instance; what about all the management after
that? Are teams left to their own devices regarding data/schema migrations, or
does your team also provide a golden path there?

~~~
srvaroa
They are on their own. The team doing managed Kafka does offer some tools
for data / schema migrations, but there is nothing for other storage
systems.

Generally they don't actually raise this as a pain point; teams tend to be
quite self-sufficient in this regard and rely on things like
[https://flywaydb.org/](https://flywaydb.org/) to deal with it.
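
The pattern there is simple: versioned SQL files that Flyway applies in
order, tracking which versions have already run. A minimal, hypothetical
sketch invoking the Flyway CLI (the database URL, credentials and paths are
placeholders):

    import subprocess

    # Flyway convention: versioned migrations named V<n>__<description>.sql
    # in a migrations directory, e.g.
    #   sql/V1__create_users_table.sql
    #   sql/V2__add_email_column.sql
    # Flyway records applied versions and runs only the new ones.
    subprocess.run(
        [
            "flyway",
            "-url=jdbc:postgresql://localhost:5432/app",  # placeholder URL
            "-user=app",                                  # placeholder creds
            "-password=secret",
            "-locations=filesystem:sql",
            "migrate",
        ],
        check=True,
    )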

From our PoV, at this point this type of feature would be in the high cost /
low impact quadrant. Not because it doesn't make sense; on the contrary. It's
just that it falls at a later stage of maturity than we're at
organisationally. As I mentioned, Adevinta is an extreme case of sprawl. To
invest productively in "a fully managed relational db with data/schema
migrations" we'd need a certain degree of consensus on which db we should
support. We don't have that luxury. Even more: we'd also need some form of
consensus on "this is how you deploy and release software", which there isn't
either (do you do proper CD? is there a manual judgement step? are there
deployments to a PRE/Staging env? ...). This heterogeneity greatly limits our
impact (sprawl reduces the potential surface of impact for any decision we
make), so we focus on areas where we can avoid this problem (e.g. everyone
buys cloud, containers...). But also, as I said above, data/schema migrations
are actually not a pain point teams complain about frequently.

------
Pirate-of-SV
Spot on!

Resonates very well with me, working at a scale-up in a "platform team".

Business-wise, we're set to out-engineer the competition. The biggest
challenges are definitely getting engineers on board with training and
transitioning into new "cool" technology. Should we help highly skilled,
advanced teams run fast, or focus on getting everyone on board the cloud
native train?

------
brujoand
Nice write-up Galo! As an ex-Deliveriarian I really enjoyed working this way,
and the timing is perfect as I’m trying to set up a similar but way smaller
PaaS. I’ll be borrowing this text :D

------
marmaduke
As a one-person platform team (bare-metal cluster + HPC resources), this
article, while super impressive, gave me a kind of anxiety attack. I can
barely put the fires out and geez, look how much better it could be. There's
hope for some at least.

~~~
Aaronstotle
Being a one-person team is tough; I wouldn't beat yourself up over that.
Usually impossible to get heads down in a project while firefighting 90% of
the time.

~~~
ckdarby
> Usually impossible to get heads down in a project while firefighting 90% of
> the time.

Put the phone away & magically there is time in the day. People have a lot
more time than they're aware of but normally fill it by numbing out.

~~~
marmaduke
> put the software defined networking, software defined storage, kubernetes,
> docker, KVM, libvirt, Windows AD, FreeIPA, thousands of Python conda
> environments, annoying users, boss users, NFS shares, NFS servers which
> can't do v4.1, CIFS, iSCSI LUNs, ancient iDRAC, circuit breaker replacements
> for air conditioners, data migrations due to lack of quotas, quotas and ACLs
> over NFS on mixed domains, DNS A records, shot motherboard in high-density
> nodes, ........... & magically there is time in the day

fixed that for you

------
jeffnappi
Built-in tracking for the Accelerate metrics is awesome! Kudos.

