Talk write-up: How to build a PaaS for 1500 engineers (srvaroa.github.io)
203 points by srvaroa on Jan 9, 2020 | 23 comments

I really enjoyed that write-up. The focus on good UI/UX in platform teams definitely resonates with me. I also found the emphasis on platform teams providing value through better "glue" interesting; I hadn't really thought of glue as a component where value could be added. I really like the pattern of tools like "Devhose", and of developing a tool specifically to be the glue that integrates well with other tools.

I just want to note that an in-house platform can also be a big cost saver vs. public clouds.

For example, you should be able to be 10x cheaper on GPU compute if you use consumer GPUs in-house.

From a conversation I had with an academic IT person, he managed to save $2M a year.

I wonder if AMD will try to change that in the next couple years. AFAIK the reason for this is some clever licensing clauses by NVIDIA that effect market segmentation. If there were more competition on the high end, there would be downward pricing pressure.

It's also annoying because SR-IOV would be a wonderful thing for consumer GPUs, but it would make it too easy for cloud providers to use consumer GPUs.

Right now, you can run a VM with qemu and pass through the GPU to the guest OS, getting pretty close to native performance. With SR-IOV, every VM could have the same GPU attached, and you could manage performance with the hypervisor. This would let you toggle between VMs instantly, getting full performance on each one (assuming the others are idle).

AMD and nVidia do make SR-IOV cards, but they're extremely expensive, intended for data centers, and don't have display output. If it ever hits consumer cards, Linux will be the hypervisor of choice for pretty much everyone, because there will be minimal performance penalty for using VMs.
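
A rough sketch of the qemu passthrough setup described above, for readers who haven't tried it. It only builds the command line. It assumes the host has its IOMMU enabled and the GPU is already bound to the `vfio-pci` driver; the PCI address `01:00.0`, the image name `guest.qcow2`, and the helper name `qemu_passthrough_argv` are hypothetical placeholders, not anything from the thread.

```python
from typing import List


def qemu_passthrough_argv(pci_address: str, disk_image: str, memory: str = "8G") -> List[str]:
    """Build a qemu command line that hands one host PCI device to the guest.

    Assumption: the device at pci_address is already bound to vfio-pci on the host.
    """
    return [
        "qemu-system-x86_64",
        "-enable-kvm",                                   # use KVM hardware virtualization
        "-cpu", "host",                                  # expose the host CPU model to the guest
        "-m", memory,                                    # guest RAM
        "-drive", f"file={disk_image},format=qcow2,if=virtio",
        "-device", f"vfio-pci,host={pci_address}",       # pass the GPU through to the guest
    ]


if __name__ == "__main__":
    # Print the command instead of running it; running needs root and a prepared host.
    print(" ".join(qemu_passthrough_argv("01:00.0", "guest.qcow2")))
```

With plain passthrough like this, the GPU belongs to one guest at a time, which is the limitation SR-IOV would remove.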

I agree.

Another option would be custom chips for inference or training, if we could get something like a TPU in-house.

Stateless is the easy part, where are the databases?

AWS takes care of most of them. All teams generally use cfn / terraform to define that kind of infra (e.g. I need a DB with these properties) and it gets applied as part of the standard deployment pipelines. We contributed support for cfn in Spinnaker to enable this.

A team in a different area to mine does offer Kafka as a service, but this pattern is an exception, and the org is actively moving away from it in most other cases. For example, a while back a team took care of offering "Cassandra as a service", managed and operated for N product teams. They don't anymore, for reasons I explained in the article:
- AWS catches up (e.g. they recently announced Cass as a service).
- $commercial company has a Good Enough alternative (e.g. several teams are dropping Cass in favour of Dynamo).
- It's operationally very expensive to maintain a team that does $storage as a service.
- The cost/benefit ratio for doing our own managed storage only adds up when the managed solution is too immature, or lacks too many features that you need.

The team offering managed Cassandra actually did this and moved to offering managed Kafka clusters ~1y earlier than AWS released the first version.

Does that make sense?

So Terraform spins up a new RDS instance; what about all the management after that? Are teams left to their own devices regarding data/schema migrations, or does your team also provide a golden path there?

They are on their own. The team doing managed Kafka does offer some tools for data / schema migrations, but there is nothing for other storages.

Generally they don't actually raise this as a pain point. Teams tend to be quite self-sufficient in this regard and rely on things like https://flywaydb.org/ to deal with it.

From our PoV, at this point this type of feature would be in the high cost / low impact quadrant. Not because it doesn't make sense; on the contrary. It's just that it falls at a later stage of maturity than we're at organisationally. As I mentioned, Adevinta is an extreme case of sprawl. To invest productively in "a fully managed relational db with data/schema migrations" we'd need a certain degree of consensus on which db we should support. We don't have that luxury. Even more: we'd also need some form of consensus on "this is how you deploy and release software", which there isn't either (do you do proper CD? is there a manual judgement? are there deployments to a PRE/Staging env? ..). This heterogeneity greatly limits our impact (sprawl reduces the potential surface of impact for any decision we make), so we focus on areas where we can avoid this problem (e.g. everyone buys cloud, containers...). But also, as I said above, data/schema migrations are actually not a pain point teams complain about frequently.

We provision database access through Hashicorp Vault. It's excellent: short-lived credentials provided via Kubernetes service accounts (lots of glue here!).

After the RDS instance is created, though, we still need to manually create credentials so that Vault gains access to control it. We aim to automate this soon.

With credentials in place, teams need to maintain schema creation and migrations themselves. We provide wrapper scripts to run the mysql shell or Percona's pt-online-schema-change with Vault credentials. Some teams create pre-deploy jobs or init-containers so that their service can run migrations automatically.
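
For illustration only, here is a minimal sketch of what such a wrapper could look like. It is not the commenter's actual script. It assumes Vault's database secrets engine is mounted at the default `database/` path and that a Vault token has already been obtained (for example through the Kubernetes auth method). The role name, host, and all function names are hypothetical.

```python
import json
import urllib.request
from typing import Dict, List, Tuple


def fetch_db_creds(vault_addr: str, token: str, role: str) -> Dict[str, str]:
    """Ask Vault's database secrets engine for short-lived credentials.

    Assumption: the engine is mounted at the default "database/" path.
    """
    req = urllib.request.Request(
        f"{vault_addr}/v1/database/creds/{role}",
        headers={"X-Vault-Token": token},
    )
    with urllib.request.urlopen(req) as resp:
        return parse_db_creds(json.load(resp))


def parse_db_creds(body: dict) -> Dict[str, str]:
    """Pull the generated username and password out of Vault's JSON response."""
    data = body["data"]
    return {"username": data["username"], "password": data["password"]}


def mysql_command(host: str, creds: Dict[str, str]) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, extra_env) for launching the mysql shell with those credentials.

    The password goes in MYSQL_PWD so it stays out of argv. MySQL documents this
    variable as deprecated and insecure, so a real wrapper might prefer an option file.
    """
    argv = ["mysql", "--host", host, "--user", creds["username"]]
    env = {"MYSQL_PWD": creds["password"]}
    return argv, env
```

A caller would then run something like `subprocess.run(argv, env={**os.environ, **env})`. The same pattern could wrap pt-online-schema-change or a migration step in an init-container, since the credentials expire on their own once the Vault lease ends.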

Spot on!

Resonates very well with me, working at a scale-up in a "platform team".

Business-wise, we're set to out-engineer the competition. The biggest challenges are definitely getting engineers on board with training and transitioning to new "cool" technology. Should we help highly skilled, advanced teams run fast, or focus on getting everyone on board the cloud native train?

Nice write-up Galo! As an ex-Deliveriarian I really enjoyed working this way, and the timing is perfect as I’m trying to set up a similar but way smaller PaaS. I’ll be borrowing this text :D

As a one-person platform team (bare metal cluster + HPC resources), this article, while super impressive, gave me a kind of anxiety attack. I can barely put the fires out and geez, look how much better it could be. There's hope for some at least.

Being a one person team is tough, I wouldn't beat yourself up over that. Usually impossible to get heads down in a project while firefighting 90% of the time.

> Usually impossible to get heads down in a project while firefighting 90% of the time.

Put the phone away & magically there is time in the day. People have a lot more time than they're aware of but normally fill it by numbing out.

> put the software defined networking, software defined storage, kubernetes, docker, KVM, libvirt, Windows AD, FreeIPA, thousands of Python conda environments, annoying users, boss users, NFS shares, NFS servers which can't do v4.1, CIFS, iSCSI LUNs, ancient iDRAC, circuit breaker replacements for air conditioners, data migrations due to lack of quotas, quotas and ACLs over NFS on mixed domains, DNS A records, shot motherboard in high-density nodes, ........... & magically there is time in the day

fixed that for you

Don't be anxious!

Every company and team operates under different circumstances that might not be in their control. It's a company effort, not only an engineering one, to get into a position where using public cloud or buying off-the-shelf products is possible.

Having both skilled legal and sales + business colleagues is key but a luxury in these situations.

Thanks for the kind words! Would love to use AWS or similar, but we're running a clinical trial with human data, so it's all on-premise stuff :)

I will keep an eye out for opportunities though..

I kinda work in hpc too!

Is there some place where we can exchange info and best practices?

That's a good question. I've tried to pick up what I can from online communities like Spice, r/sysadmin, Linux & CentOS forums, but the most important thing I've done is actually work on some of the Tier-0 systems (in Europe), where everything tends to be done really well, and then try to imitate that (on a shoestring budget of course).

Built-in tracking for the Accelerate metrics is awesome! Kudos
