You Don’t Need All That Complex/Expensive/Distracting Infrastructure (usejournal.com)
186 points by ingve 14 days ago | 121 comments



> ‘Engineers get sidetracked by things that make engineers excited, not that solve real problems for users’ — we’ve heard it all before. There’s no shocking revelation here (you are on Medium.com after all …). But there seems to be a pervasive sense that you can’t launch a product without a K8s cluster or two, load balanced across a couple of regions, oh and if you have to deploy anything manually how can you possibly expect to turn a profit?

Not everyone needs K8s, not everyone needs multi-region. But as far as manual deployment goes... it's all fun and games until someone loses an eye. Often what you find if you have a process that can't be automated as-is is that you have a process that's got problems. Maybe those problems are bugs. Maybe those problems are "only one person knows how to run this, and if he wants to take a vacation or move on to something else, we're hosed." Automation is a good thing at any scale.


> Automation is a good thing at any scale.

Once dealt with a company that spent close to 2 years developing 2 different generations of a robotics system to glue together two plastic pieces. One of the pieces changed slightly close to the end, making both robots useless. The pieces ended up being glued together manually for next to no labor cost because a person could actually do the gluing quite fast.

Premature automation wrecks budgets and production lines and can kill entire companies.


Those robots were heroes. Not only did they not take our jobs, they made more of them.

There's a lot of difference between a gluing robot and the tasks that Devops advocates to automate.

Spending a few hours on setting up an automated build on a SaaS CI server will start bringing benefits immediately and pay off very quickly. Not only quantitative (next deployments will take less human time) but also qualitative (less chance of errors/bugs/mistakes, easier to do them more frequently etc).

Obligatory XKCD: https://xkcd.com/1205/


The other obligatory XKCD: https://xkcd.com/1319/

Where I work they used to have an ad hoc and manual deployment strategy that could take months and sometimes resulted in an errant dev taking down production. Switching to cloud infrastructure with CI/CD pipeline seems to have made things much better, at least for the products that have switched over.

Or you can have a manual deployment strategy and gradually automate it.

Why? That seems like re-inventing the wheel. What makes your deployment process so much different than average that you can’t just write a check to heroku/aws and call it good? What value exists in inventing yet another bespoke deployment system?

The deployment process might be vanilla but depending on the complexity of the software, automated validation gates before/after deployment stages might need to be quite custom to achieve required test coverage.

Sure but any good off the shelf pipeline will have hooks for that...

Yes, that's what they sort of had before. It was pretty hacky, brittle, and slow.

Automation is just leverage. Sometimes you need the leverage of a trowel, other times you need a six-ton digger. And if you use the trowel, you can employ fewer people and skip approvals and certifications.

It does make sense to automate - after you have the design. But the type of automation also changes with scale. Ideally, at a small scale, your automation can be something like: A script to pull the database for its regular backup. A script to automate uploading new PHP scripts and site assets. A script to reinstall the dependencies on a fresh image. Some version control and another backup layer over all of this. And then a huge test suite running the gamut: user workflows, failure conditions, site attacks, data integrity, backups.
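As a rough illustration of how small the "script to pull the database for its regular backup" can be, here is a minimal sketch. It assumes the site uses SQLite, which the comment does not say; the function name, file naming scheme, and `keep` count are all hypothetical. For another database the same shape would apply, with that database's own dump tool in place of the snapshot step.

```python
import sqlite3
import time
from pathlib import Path


def backup_database(db_path: str, backup_dir: str, keep: int = 7) -> Path:
    """Copy a SQLite database to a timestamped file and prune old copies.

    Assumption: the site stores its data in a single SQLite file.
    """
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # Timestamped names sort chronologically, which makes pruning trivial.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = target_dir / f"backup-{stamp}.sqlite3"

    # Connection.backup() copies a consistent snapshot of the source database.
    source = sqlite3.connect(db_path)
    dest = sqlite3.connect(str(target))
    try:
        source.backup(dest)
    finally:
        dest.close()
        source.close()

    # Keep only the newest `keep` backups.
    backups = sorted(target_dir.glob("backup-*.sqlite3"))
    for old in backups[:-keep]:
        old.unlink()
    return target
```

Run from cron (or any scheduler) this covers the first item in the list; the upload and dependency-reinstall scripts tend to be about the same size.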

It's the followthrough into automating the QA processes that makes the product great, not the underlying stack(which in most cases is going to have to be treated like a placeholder in the event of real scale). And it mostly isn't substantial engineering challenges at any level: they're "tick off the boxes" exercises, they just punch above their weight in terms of delivered value.

It's the tendency to add more things to configure and more systemic non-linearity("in this mode it does X, in that mode it does Y, but when Z is enabled both modes do Q") that creates a serious IT headache, and that in turn kills your progress as an independent. Oftentimes you have to give up on using the new stuff, the theoretically good stuff, because the additional layers of automation mean that you end up doing original R&D for an apparently ordinary problem, and you can't actually get a build off the ground. Or you fight through numerous issues and get it working but with no established "best practice" to follow, you only half-understand how to configure it properly, leading to technical debt and future unknowns in the risk profile.


Your last paragraph sums it up well, in my opinion. Too often the answer chosen is to "add a new configuration setting", which means one more thing for ops to keep track of and one more potential cause of failure for devs to investigate.

Cleaning up dead/deprecated code and working to consolidate code paths provides opportunity to clean up obsolete configuration settings, remove special cases, and improve the reasonability and maintainability of the software into the future.


> Automation is a good thing at any scale.

Well, financially, if the project is generating revenue worth one person's salary, then it doesn't make sense to do things that would require more personnel to improve maintainability, since you can't afford it. You just have to live with the risk.


I rarely see automation as increasing the amount of labor required so much as reducing the overall amount of labor but timeshifting it from "later" to "now." And sometimes you can't make that tradeoff, admittedly.

I largely agree... but to offer more nuance, you need to know what the right amount of automation versus toil is for your project. It's worth reading the Google SRE [0] books, especially the workbook, if you're thinking about running and operating a service.

Complexity is the enemy in software engineering as much as it is in operations engineering. Sustainable reliability is about doing as little as you can get away with to achieve the reliability results that make your customers happy. That does mean following the advice of TFA and avoiding going straight to k8s for a small indie site that could be deployed with a simple Ansible/Puppet script. However, it also means defining what your targets are and having indicators so that you're not relying on your gut instincts either: aim for 60% toil, 40% automation at first. Whatever the right balance is, review it regularly as you scale your project up.

You can do a lot with a couple of beefy VPSs and a solid database these days.

[0] https://landing.google.com/sre/books/


"Simplicity is a necessary precondition for reliability." - Edsger W. Dijkstra (handwritten annotation to EWD498)

If one is looking to triangulate here's that same tenet in systems theory known as Gall's Law [0].

"A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system."

[0] Systemantics by John Gall https://books.google.com/books/about/Systemantics.html?id=ql...



Ok, maybe you can help settle this for me. I know that phrasing appears in EWD1175, but years ago I came across this[1] which claims it was originally a handwritten annotation to EWD498. So that's how I've always cited it (I prefer Dijkstra's phrasing to Hoare's.)

Some searching today found a wikiquote page[2] which led me here[3] also claiming an EWD498 handwritten annotation as the source.

Do you know definitively whether the annotation appears in EWD498?

(I thought at one time I actually found an image with the handwritten annotation, but I can't find it at the moment... Maybe I dreamed that...)

1. http://www.cs.virginia.edu/~evans/cs655/readings/ewd498.html

2. https://en.wikiquote.org/wiki/Talk:Edsger_W._Dijkstra

3. http://web.archive.org/web/200011201643/http://www.cbi.umn.e...


My rule has always been: "write down the entire manual process". Follow the steps and do a deployment and time it. Do you have that much time? Are you likely to make an error? If not, just do it manually. If so, break out the easiest parts and automate that. Then reassess.

Automation should follow the same rule as other things in a business. Only automate when it becomes painful not to. Or if it is trivially simple to.
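The "time it, then decide" rule above can be turned into a back-of-the-envelope check, in the spirit of the xkcd chart linked earlier in the thread. This is only a sketch: the function names and the 12-month horizon are assumptions, and it deliberately ignores the "are you likely to make an error?" half of the rule, which is often the stronger argument.

```python
def hours_saved(minutes_per_run: float, runs_per_month: float, months: float) -> float:
    """Total manual time, in hours, spent on the task over the horizon."""
    return minutes_per_run * runs_per_month * months / 60.0


def worth_automating(
    minutes_per_run: float,
    runs_per_month: float,
    hours_to_automate: float,
    months: float = 12.0,  # assumption: judge the payoff over one year
) -> bool:
    """True if scripting the task pays for itself within the horizon."""
    return hours_saved(minutes_per_run, runs_per_month, months) > hours_to_automate


# A 15-minute deployment done 8 times a month costs 24 hours a year,
# so 6 hours of scripting pays off. Done once a month it costs 3 hours.
print(worth_automating(15, 8, hours_to_automate=6))  # True
print(worth_automating(15, 1, hours_to_automate=6))  # False
```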


Reminds me of an article I saw sometime back on HN: "Manual work is a bug" (https://queue.acm.org/detail.cfm?id=3197520). Just like you, the author's advice is to go iteratively, with the first step being documentation. As documentation improves you act more as a cpu running instructions, at which points automation becomes clearer, and so on. Not everything can or should be automated, but we definitely could do more.

> Complexity is the enemy in software engineering

For decades, I've been saying this a bit differently: "Software engineering is primarily an exercise in complexity management".


"Controlling complexity is the essence of computer programming." --- Brian Kernighan, 1976

Funny and true story:

2 years ago we needed to move in from the cloud to an internal network. Because of time pressure we got a PC as our first server, where we installed a set of tools for our embedded developers: Gitlab, Jenkins, an LDAP backend, Nagios, Rocket.chat, Crowd, file sharing, nginx, Volumerize backups, nfs sharing, build artifact storage, and a private Docker registry. We even run a build slave instance on it.

This was supposed to be an intermediate solution until we moved to the real infra. In the new infra they have all of the things like several stages of load balancers, tons of firewalls between the servers, the slaves are physically in a different network, everything is insanely complex and takes at least 20 times more time to set up than you'd expect. This is why we still run on that old PC (we added 7 more as build slaves), and ca. 200 people have used it daily for 2 years now, which seems pretty weird but it just works.

Lately we saw that the network card started hanging and we needed to do a hard reboot, which is not nice. The guys who are close to that PC (we use it remotely) had a USB network card lying around and we asked them to connect it, because the old one in the tower had most probably reached its end of life after being used so much during the last 2 years.

We're still onboarding more and more people and it's not clear when we will be able to move to the new very complex infra.


How do you run all the apps on a single server? Docker?

Also, who is responsible for updating all those apps? Is that a full time job for somebody?


Yes we run every app with their official docker image so updating it is quite easy, basically pulling the newest image and restarting the docker container.

When it's just everything on one server there is surprisingly little to do, so for that we have a rotating sysadmin where one of the team members is responsible for the servers for one sprint and then the next person takes over. For some time we had a dedicated sysadmin but it was never enough work for them so the rotating sysadmin does it on the side right now.

In the new infra this is changing significantly and there it already is much more work and we will grow with at least two more teams to handle it full time.


These sweeping generalizations in either direction are wrong and counterproductive. Some people need the complex infrastructure, some people don't.

What you should focus on is finding out what your use cases are, and then building the simplest thing that meets them. For some folks, high availability and zonal resiliency is an absolute must. For other folks, like Peter, it might not be. These context-less platitudes are pretty useless outside of the context in which they're made.


I think the idea behind this post was more to keep this kind of thing from happening: https://xkcd.com/1319/

Well, your users are gonna have a fun time when linode has to take down your single vpc for maintenance, or an AZ goes down (does linode even have AZs? There's some evidence that even GCP doesn't have truly redundant AZs, so I doubt Linode could.)

Point being: it's a balance. I tell any startup who will listen: App Engine or Heroku. That'll get you REALLY far and strike a good balance between autoscaling, simplicity, and redundancy.

If you're still using a single server, you're doing something horribly wrong. This random guy's strategy isn't something to be proud of. He just doesn't know that there are better options out there. Simplicity isn't a VM you have to maintain, update, and secure. It's a fully managed PaaS.


I'm not sure when exactly this shift happened, but people (myself included) fetishize high availability so much nowadays. What would really happen if your startup-of-the-year with less than 1k active users was down for five minutes? Or an hour? Do you really think people visiting a site that's down think "I'm dropping my account"? Or do they think "Oh. I'll check back later."?

NomadList and RemoteOK (from the article) have both had downtime before. They're both much more profitable than whatever startup of the day has decided they desperately need complicated infrastructure to run their CRUD app.

I've fallen for this trap also. I've written an article about deploying a Next.js app to Elastic Beanstalk, when I probably could've just stood up a server and SSH'd in to deploy. I've used Firebase when I could've just stood up a simple REST API and PostgreSQL.

There's nothing horribly wrong with using a single server, he knows there are other (not better) options out there.


Availability is a spectrum. There's a lot of room between "a single VM on a third-tier infrastructure provider" and "google.com".

Hell, even "two VMs on a third-tier infrastructure provider" gets you more than 2x better availability, because there are discrete events which impact a single VM and don't impact two.
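To make the "more than 2x" claim concrete, here is a small worked calculation. It rests on a strong assumption that the comment does not make: that the two VMs fail independently and either one alone can serve traffic. A provider-wide outage breaks that assumption, so treat the result as an upper bound. The 99% figure for a single VM is purely illustrative.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` independent copies where any one suffices."""
    return 1.0 - (1.0 - single) ** replicas


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


one_vm = 0.99  # illustrative assumption, not a measured figure
two_vms = combined_availability(one_vm, 2)

print(downtime_hours_per_year(one_vm))   # about 87.6 hours
print(downtime_hours_per_year(two_vms))  # about 0.88 hours
```

Under the independence assumption the second VM cuts expected downtime by roughly 100x rather than 2x. Real deployments land somewhere in between, because some failures are shared.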

Startups should be taking the easiest route possible to Happy Customers. That's the goal. There are two parts to that: "Happy" and "Customers". You need Customers. That's product-market fit. You also need them to be Happy. Availability is a part of that.

No, a startup does not need a multi-regional strategy behind redundant global ELBs with edge caching. I never said that. You know what's pretty damn easy though? HEROKU. You spin up multiple dynos and you get multi-AZ redundancy, NO EXTRA WORK. App Engine is the same way.

Startups, everywhere, for the love of god, stop spinning up servers. If you can SSH into it, that's a smell. If you HAVE to ssh into it, you're wasting resources. There are so many different "serverless" options out there. Focus on the product. The infrastructure can wait. It's not going anywhere.


Yeah, my users really don't care. I have an explicit service availability policy that says any support at all between the local hours of 11pm-7am is on a best effort basis, and support outside of normal business hours is limited. Our services are normally available 24x7, but more than half of them even have pager notifications turned off between midnight and 6am, so that my staff can sleep. As long as things are back and functional by 9am or so, everyone is happy.

It's just a different world outside of the startup, "My total possible user-base is all 7 billion people in the world!" world. This is HN, so startups are the correct, default assumption, but I do see a lot of this worldview leaking into the non-startup worlds as well.


The shift started back in the late 2000s when the explosion of NoSQL and AWS made it easier than before for people to pretend they have problems they don't.

Before NoSQL you had to do expensive/difficult stuff like manual database sharding and buying expensive dedicated hardware. Now you can instantly spin up a few instances.

Considering how much easier it is to put the infrastructure in place IF you ever need it, it's odd to me how focused on it people seem to be.


That makes a lot of sense, and if that's the case, I'd chalk it up to advertising for convincing people they _need_ all of this type of infrastructure.

> If you're still using a single server, you're doing something horribly wrong

This is so far from the truth. If you built your (fairly new) app, especially a new app, such that it requires multiple servers, then you're overthinking infrastructure at a time when you don't need it. Having multiple servers is great, but having a lean app that can run on a single server requires a different skillset. You don't need HA, multi-AZ AWS deployments behind a load balancer with route 53 DNS load balancing powered by auto scaling fleets with ci/cd jenkins and slack chat ops when you only have a small number of users.


You can pick one sentence out of my comment and tear it apart, or actually read and understand the whole thing. Your choice.

Managing servers is something no startup should be doing. Period. If you can SSH into it, you're wasting time on infrastructure that could be spent finding product-market fit. The number of servers isn't the relevant point; it's the fact that he's using ANY servers.


I disagree. I regularly do a sanity check for my business, and it always turns out that two physical $40/month servers make a lot more sense for me than any PaaS offering, or even just AWS EC2.

Guess what, managing a small amount of servers (two) isn't that hard and doesn't take a lot of time, and $40/month gets you really fast CPUs and lots of RAM these days.

As always, generalized absolute statements make no sense: everything should be analyzed in context, and based on specific metrics (in my case: money).


If it doesn't make sense for you, then you're not a startup.

That sounds ipso facto, like I'm saying "PaaS is the best option for startups, because if it didn't make sense you're not a startup." but I mean it legitimately.

Startups are VERY DIFFERENT from normal businesses, in almost every conceivable way. Great startup engineers do not necessarily make great "normal business" engineers. Great startup infrastructure looks nothing like great normal infrastructure. The same solutions do not work for both.

Startups, generally, have "large" amounts of money and "small" amounts of humans. So when you say "based on my specific metrics (money) PaaS doesn't make sense"... uh, yeah, duh. You're not a startup. Being a small business does not make you a startup. Writing code does not make you a startup.

Startups burn money like hell in the pursuit of a 100x product-market fit. "Burn Rate" is thrown around a lot as a metric. Do you think the term "burn" was chosen by accident? Startups don't "invest" money. They don't spend it. They BURN it, with the hope that the kindling will explode and they can worry about the damage they caused later.


Do you know what your business would cost in PaaS monthly, in comparison to the $80 + labor it currently costs?

Yes, significantly more for comparable performance. And then PaaS is not a silver bullet, either: every service has SLOs and downtime.

I get that part. If only all startups could afford managed PaaS services. I might be biased given I've spun up and managed my own infrastructure without much difficulty, and I can see how some startups would prefer spending the dough on PaaS instead of say hiring or putting it towards marketing.

I just had an issue with that blanket statement that it's not kosher to be using a server still. It's totally fine for a startup to be managing their own infrastructure if they've done the cost benefit analysis on it.


Heroku costs $20/month/dyno. You can probably compare the performance of that to a t3.nano, which is ~$4/month.

An engineer costs, let's say, $7000/month.

Now, all you have to ask is: how much time will the engineer be investing in the infra every month? Let's say you SSH in and update the system packages. 10 minutes? That's $6. Hope nothing goes wrong during the upgrade; remember, you're on the clock. Every minute you spend with that shell open is wasting money. Need to add SSH access for the new guy? Oof, we're gonna have to exchange some SSH keys, that might take 20 minutes. $12. Just got an email from AWS about a new Ubuntu AMI that fixes a security vulnerability... let's plan an upgrade. 1 hour. $30. We want to be more resilient against AWS maintenance downtime on instances, so let's bake an AMI and create an ASG. That's not too hard. Maybe a couple hours.

Oh... we just paid the difference for Heroku for an entire year. AND our engineers were able to focus on the product.
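The break-even point in this argument is easy to compute from the figures given in the comment. The sketch below uses the comment's prices and salary; the 160 working hours per month is an added assumption, and all names are illustrative.

```python
HEROKU_DYNO_PER_MONTH = 20.0  # from the comment
T3_NANO_PER_MONTH = 4.0       # from the comment
ENGINEER_PER_MONTH = 7000.0   # from the comment
WORK_HOURS_PER_MONTH = 160.0  # assumption: 40 hours/week for 4 weeks


def hourly_rate() -> float:
    """Engineer cost per working hour."""
    return ENGINEER_PER_MONTH / WORK_HOURS_PER_MONTH


def yearly_price_gap() -> float:
    """Extra yearly cost of one Heroku dyno over one t3.nano."""
    return (HEROKU_DYNO_PER_MONTH - T3_NANO_PER_MONTH) * 12


def break_even_hours_per_year() -> float:
    """Hours of engineer time per year that cost as much as the price gap."""
    return yearly_price_gap() / hourly_rate()


print(hourly_rate())                # 43.75
print(yearly_price_gap())           # 192.0
print(break_even_hours_per_year())  # about 4.4
```

Under these assumptions, roughly four and a half hours of server work per year cancels the price difference for a single dyno. The tasks listed above add up to about three and a half hours if "a couple hours" means two, so they cover most of the gap. The comment's own per-task dollar figures imply a somewhat lower hourly rate than the one used here.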


Startups can’t afford not to use PaaS. Every second of time they piss away managing their home brew platform is a second they aren’t spending time making their product something people actually want.

Managing infrastructure is a solved problem for any startup. Simply spin something up on a heroku-like platform and be done. Anything else is wasting precious time and money.


That obviously depends on how much effort/how expensive "managing servers" via ssh vs. managing cloud infrastructure is. Developing, installing, configuring, and monitoring/troubleshooting apps isn't for free on cloud either (like, not at all).

I'm 100% on board with "start with Heroku or App Engine" and then you can just layer in other services as you need. Where I see teams get caught up is that they compare the price of a single VPS with a handful of Heroku Dynos and think it's much too expensive.

As soon as you add in any amount of sysadmin time, security work, etc. it becomes absurdly better.


Yup. The cost is so irrelevant, it's a little laughable how many other people keep bringing it up. The relevant metrics are People and Risk. If you host your own VM; ok, maybe it works great for you and you don't have to touch it. But the future risk of something bad happening is High. And when that does happen, if an engineer has to take 1 day to fix it, you'll have literally paid the difference for a PaaS for the rest of the year.

But, again, cost doesn't matter. What really matters is that engineer had to WASTE a day working on it. That's a day that could have been spent advancing the product toward market fit. That's a day when that engineer wasn't coordinating with the rest of the team. That sets back her work by a day, which might block another engineer by a day, and it ripples down and down. That's not something startups want to happen. Because Engineers are expensive, yes, but more importantly, they're hard to find. You can't just throw money at hiring talent like you can at Heroku (or App Engine, or Lambda, or whatever works best for your business).


These hands-off solutions might be expensive in operational/dev time as well. You might end up spending days coding against the arbitrary limitations of managed services or debugging cryptic errors in the distributed application. My experience with Lambda as an application platform is pretty bad. An EC2 instance gives you plenty of technical flexibility, which a startup needs as much as velocity to find market fit.

> Simplicity isn't a VM you have to maintain, update, and secure. It's a fully managed PaaS.

I disagree with this if you actually intend it in the blanket form you present it as. A fully managed PaaS can be the simplest solution overall, but it can also be overly complex.

It all depends on what you're trying to do and what your needs actually are.


I know Heroku is pretty painless except when trying to do weird things (using a wildcard certificate from Let's Encrypt comes to mind, though it might be OK now), where it can become weirdly complicated as we have to think about stuff that was hidden before (how are the requests routed? how is the certificate handled? how to renew it? etc.)

How does App Engine compare? From a bird's-eye view, it seems a lot of configuration is still needed to communicate within Google's ecosystem, and as Google also pushes k8s, I wonder how many resources they are putting into App Engine to have it run smoothly vs k8s, where they are the main driving force and have a natural advantage over the competitors.


The implications of downtime must be considered in a business context — otherwise this kind of discussion makes little sense. You need to set service level objectives (SLOs) for your business.

In some businesses, your SLOs will force you to implement high-availability solutions. In other businesses, your users are not paying for high availability, and you are doing nothing wrong by using a single server, assuming you can recover from a failure in a predictable amount of time, meet your SLOs and not lose data in the process.

I agree with the article author: engineers tend to go way overboard, and often nobody asks the really important question: who will pay for those tight SLOs?
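It can help to translate an SLO into the downtime it actually permits before deciding what infrastructure it forces on you. A minimal sketch, assuming an availability SLO measured over a rolling 30-day window (the window length and function names are assumptions, not something the comment specifies):

```python
def downtime_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO allows over the window."""
    if not 0.0 <= slo <= 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def within_budget(observed_downtime_minutes: float, slo: float, window_days: int = 30) -> bool:
    """True if the observed downtime still fits inside the SLO's budget."""
    return observed_downtime_minutes <= downtime_budget_minutes(slo, window_days)


for target in (0.99, 0.999, 0.9999):
    print(target, round(downtime_budget_minutes(target), 2))
# 0.99   -> 432.0 minutes (about 7 hours)
# 0.999  -> 43.2 minutes
# 0.9999 -> 4.32 minutes
```

A single server with a tested restore procedure can plausibly live inside the first budget; the last one leaves almost no room for a manual recovery, which is where the high-availability spend starts to be justified.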


An old Unix admin mentor of mine always said to "do the simplest thing that works properly". I've always tried to do this and it's great advice.

That advice is gold. I follow it too. Currently in beta with a C++ application that has its own web server, video encode/decode, and NSA quality FR - all in a 10MB executable with a 200MB run-time footprint. This single application removes the requirement for separate video encode/decode, video serving, REST API serving, and all DB needs. And it runs exponentially faster than what it replaces, while being happy on a single box, anything from an Intel Compute Stick to a you-name-it heavy-iron server. I like to say "the infrastructure is inside, nothing else is needed."

FWIW, I run Minikube to manage a single ML churning machine in my garage, because I only need to learn 'docker build Dockerfile', 'docker push <image>', 'docker run <image>', 'kubectl create -f <foo.json>' and 'kubectl delete -f <foo.json>' to manage the workloads on a regular basis. For this minimal brain-space investment I get a bunch of features like resource management, a workload queue, dashboards and being able to test/debug an image on my dev machine for free. Then I scale this knowledge up to managed clusters in the cloud with as few machines as I can get away with, using minimal brain-space to learn additional systems.

While I agree with the general sentiment of using the bare minimum to get the job done, the gratuitous complexity problem is usually caused more by the people using the tools than by the tools themselves.


What are you using in production? Still minikube?

GKE.

[flagged]


We've banned this account, and if you keep doing this, will ban your main account as well.

If you'd please review https://news.ycombinator.com/newsguidelines.html and take the spirit of this site to heart, we'd be grateful. These other links might be helpful for that too:

https://news.ycombinator.com/newswelcome.html

https://news.ycombinator.com/hackernews.html

http://www.paulgraham.com/trolls.html

http://www.paulgraham.com/hackernews.html


[flagged]


> And your experience is irrelevant here...

That seems unnecessarily harsh.

But I do agree with your other points. Kubernetes introduces a lot of moving parts, even if the minikube interface is relatively straightforward. Over the years, experience has taught me that black boxes don't stay opaque for long - something breaks and you ultimately have to learn the internals, usually in an emergency break/fix scenario. At FAANG, they have teams whose specialism is running those sorts of orchestration systems.

I trust systems I understand, and magic scares me.


You do run Linux I suppose. Try to compile the Linux kernel from scratch and see for yourself the bewildering complexity hidden in there. Hey, there is a hypervirtualization module hidden in your kernel, with tens and tens of knobs to tweak if you are so inclined. And yet, 95% of that complexity is most likely unused by your workloads and whatever is left very very rarely surfaces during an emergency break scenario. When it does, it is indeed painful. I still remember pulling hair because of https://en.wikipedia.org/wiki/Nagle%27s_algorithm more than a decade ago.

Conceptually Kubernetes is a trivial system, a Plan9 of orchestration systems. Everything is an executor watching a key/value store and triggering on various state configurations. What trips people up is the network / DNS layer, which is also implemented as a bunch of executors watching the key/value store. When it breaks, you're helpless as a novice, plus debugging networks is painful for anyone. If Kubernetes had a '--network-driver=none' that just used the host network as is, akin to Minikube & '--vm-driver=none', we'd never have had this 'complexity' argument thrown around.

Fortunately, once you understand the architecture, debugging the network is straightforward: poke around the system network/dns executor logs and the culprit will soon reveal itself.
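The "executor watching a key/value store" idea can be shown with a toy reconcile loop. To be clear, this is not the Kubernetes API or any real controller: plain dicts stand in for the key/value store, the workload names are made up, and "starting" a replica is just arithmetic. It only illustrates the compare-desired-to-actual pattern being described.

```python
from typing import Dict, List, Tuple

Action = Tuple[str, str, int]  # (verb, workload name, replica count)


def reconcile(desired: Dict[str, int], actual: Dict[str, int]) -> List[Action]:
    """Compare desired and observed replica counts; return the actions needed."""
    actions: List[Action] = []
    for name in sorted(set(desired) | set(actual)):
        want = desired.get(name, 0)
        have = actual.get(name, 0)
        if want > have:
            actions.append(("start", name, want - have))
        elif have > want:
            actions.append(("stop", name, have - want))
    return actions


def apply_actions(actions: List[Action], actual: Dict[str, int]) -> None:
    """Mutate the observed state; a stand-in for really starting or stopping things."""
    for verb, name, count in actions:
        delta = count if verb == "start" else -count
        actual[name] = actual.get(name, 0) + delta
        if actual[name] == 0:
            del actual[name]


desired = {"web": 2, "worker": 1}  # hypothetical workloads
actual = {"web": 1, "cron": 1}

plan = reconcile(desired, actual)
print(plan)  # [('stop', 'cron', 1), ('start', 'web', 1), ('start', 'worker', 1)]
apply_actions(plan, actual)
print(reconcile(desired, actual))  # [] once the states converge
```

Each executor in the description above is, conceptually, one such loop run repeatedly against its own slice of the store. When something breaks, the debugging question becomes which loop's plan never converges.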


> Try to compile the Linux kernel from scratch and see for yourself

When I started using Linux around 20 years ago, with Red Hat 6, manually compiling the kernel was the only way to enable all sorts of hardware support that's enabled by default today. And it was a great learning experience.

Though I do get your point about complexity, and I agree that the kernel is (arguably necessarily) very complex. Unfortunately there are few mainstream, well-supported and broadly compatible alternatives. I miss the days when computers were simple and a single person could understand everything that goes on inside, all the way down to the hardware.

With kubernetes and other container orchestration platforms, I feel that there are viable, less complex alternatives for many use cases. Simpler platforms mean fewer points of failure, and more chance that a small team of generalists would easily be able to fix any issues and get back to producing value for the business.

There are definitely cases where kubernetes et al are warranted, but in order to provide any semblance of robustness would require a dedicated and well versed resource within the business to look after the system. And most businesses (I'd guess > 99%) don't have the scalability or reproducibility requirements to necessitate it. The extra layer of abstraction just isn't worth it for most teams.


> I’ve seen the idea that every minute spent on infrastructure is a minute less spent shipping features

This is why working in infra is so awful, people follow this principle for years and then invest a million a year in an infra team whose hands they tie.

Resilience is a feature, though I agree at early stage folks often need less than they think, a single auto-scaling group and load balancer in AWS or something similar isn't much heavier than a single linode VPS, except that it has substantially improved resilience.


> This is why working in infra is so awful, people follow this principle for years and then invest a million a year in an infra team whose hands they tie.

Yep, having to wait for development resource to build a scalable system because the legacy is working just fine.... and then the legacy falls over and burns because it exceeded capacity like you warned it would a year before it did, and the business guys suddenly want you to fix it yesterday because now every account manager is missing their end-of-month reports...

A significant downside of Scrum as a methodology is that it assumes product owners listen to the engineering team as well as the sales people yelling at them 24/7 for latest feature X before making prioritisation decisions.


Make it work, make it fast, make it scale.

Sticking to that order is critical.


The other thing these posts confuse me about is that a single auto-scaling group and load balancer isn't much more difficult to set up or maintain, and if you're already in the low traffic world, it's barely more expensive, all for the vastly improved ability to respond to things breaking that you didn't expect.

Wasn't this on HN last week?

Remember Soylent boasting about their elaborate compute infrastructure, for a business that made a few sales per minute? I once pointed out that they could handle their sales volume on a HostGator Hatchling account with an off the shelf shopping cart program, for a few dollars a month. But then they wouldn't be a "tech" company.

Soylent is apparently still around, competing with SlimFast and Ensure.


In one of my past lives, I managed an IT group where we processed ~50 point of sale transactions per second and did it on two smallish application servers and a single "large small" Oracle database server. Our entire infrastructure only had about 10 servers... (including things like email, file services, redundancy, etc. and this was almost 15 years ago.) A few years later I was brought into another retail company that only did 500 orders a day... with damn near 150 servers. My jaw dropped on that one...
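
To put the two shops side by side, here is a rough sketch of the per-server load. It assumes the ~50 transactions per second were sustained around the clock (almost certainly an overestimate, since retail traffic is bursty) and counts all ~10 servers even though some ran email and file services. The function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_server_daily_load(transactions_per_day: float, servers: int) -> float:
    """Average number of transactions each server handles per day."""
    return transactions_per_day / servers


# Shop 1: ~50 transactions/second on ~10 servers (assumed sustained all day).
first_shop = per_server_daily_load(50 * SECONDS_PER_DAY, 10)

# Shop 2: 500 orders/day on ~150 servers.
second_shop = per_server_daily_load(500, 150)

print(f"{first_shop:,.0f} vs {second_shop:.1f} transactions per server per day")
```

Even if the first figure is off by an order of magnitude, the gap between the two setups is several orders of magnitude.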

Yup was on HN last week.

> Obviously if you’re FAANG-level or some established site where that 0.1% downtime translates into vast quantities of cash disappearing from your books, this stuff is all great

As the CTO of a small startup 0.1% downtime translates into vast quantities of lost trust and irreversible damage to our brand. Netflix goes offline for a bit they have some angry customers, and maybe lose some cash, but they'll still be chugging along. However, if we're down for a while our customers, who are already taking a risk trusting a new company, may disappear forever, and as a smaller/newer company the overall brand damage is far worse.

While I largely agree with the premise that companies overdo it on infrastructure, I strongly disagree that my uptime is less important than that of FAANG companies.
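
For a sense of scale, a back-of-the-envelope sketch of what 0.1% downtime means in wall-clock time. It assumes a 30-day month and a 365-day year; the helper name is illustrative only.

```python
def downtime_minutes(unavailable_fraction: float, period_hours: float) -> float:
    """Minutes of downtime implied by an unavailability fraction over a period."""
    return unavailable_fraction * period_hours * 60


per_month = downtime_minutes(0.001, 30 * 24)
per_year = downtime_minutes(0.001, 365 * 24)

print(f"0.1% downtime: {per_month:.1f} min/month, {per_year / 60:.2f} h/year")
```

That works out to roughly 43 minutes a month, or a bit under nine hours a year. Whether that budget is acceptable is exactly the disagreement in this subthread.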


Most big companies don't really care. And that's because they don't have a single person with the power to say 'stop'.

In a long chain of managers the manager above will only hear "there was an issue and it's fixed". The upper manager doesn't really have the time and the will to dig into the details.


No, it doesn't.

Whether you're talking about a new Twitter or Reddit, which used to go down all the time (like, constantly), or some business startup, a downtime of 30 mins or so no-one really cares about, in my experience. At worst you'll get a phone call or two, a few emails, but if you handle them compassionately you'll be fine.

You can run a startup serving thousands or tens of thousands of customers on a single server, with no micro services, a simple server-side MVC setup and never have a $600 a year dedicated server even break 15% CPU.

I once took down a server by flooding the email server with error emails, which generated more error emails, which ran out of disk space, etc., etc. Took an hour to get the site up again.

Barely a blip in revenue. Client finally coughed up for proper email hosting rather than running their own server on the same box as their site, as I'd been advising them for 4 years.


> You can run a startup serving thousands or tens of thousands of customers on a single server, with no micro services, a simple server-side MVC setup and never have a $600 a year dedicated server even break 15% CPU.

This is a gross over-generalization. There are many kinds of problems in computing that can be approached by a startup AND which require large amounts of compute. The soup du jour is machine learning problems.

As for downtime, OP has a perfectly valid point. If you're building a B2B application, the customer has already taken a risk on you. If you go down in the middle of a busy workday, even for 30 minutes, you can be damned sure that someone at your client is getting some heat for taking that risk rather than going with $BIG_CO.


Not really. Most people hate $BIG_CO; they're invested in your trendy brand, your lovely UI vs the Lotus Notes style of $BIG_CO.

The early adopters are going to give you slack.

As for the tiny number of startups solving ML business problems, compared to the thousands of web apps launched daily on product hunt, if your USP is computing power and special tech, then obviously this advice does not apply.

Edit: You should really disclose you work on AWS when discussing this sort of stuff


> Edit: You should really disclose you work on AWS when discussing this sort of stuff

Yikes. Is this irrelevant attempt to disarm them why we have these obnoxious "disclaimers" all over HN?

It's an absolutely meaningless gesture.


I don't work at AWS any longer, not that it's relevant.

> Whether you're talking about a new Twitter or Reddit

You're imagining the inconsequential downtime of Twitter or Reddit instead of a small B2B company trying to hook its first customers.


> Your goal, when launching a product, is to build a product that solves a problem for your users. Not build the fanciest deployment pipelines, or multi-zone, multi-region, multi-cloud Nuclear Winter proof high availability setup.

I agree, but context is key: if you're bootstrapping a start up, you don't need these things. You need to prove your product, then you scale.

But automation != scale. Having a process that streamlines your delivery, regardless of scale, can be helpful. I've screwed up enough single-box deployments to learn that lesson.

Stepping back: our industry is pretty horrible about creating tools that can start small and scale up. I like where CockroachDB is going for that reason (just as an example). It would be great to start with a single database, and have a clear path to scale it horizontally across multiple nodes and data centers.

Kubes might get there... I'm not sure how focused they are on making small things work well, though... any examples of that?


I do not want to be overly harsh, but this is just lazy clickbait. Nothing new is said here, just a hipster developer making sweeping generalizations about companies he does not work for, and does not know their requirements.

Honestly, I do not even agree with his premise that you should start at bare metal with manual deployments. Getting some basic automation set up is STUPID EASY between Travis CI, Jenkins, Google App Engine, etc. I feel that toiling to deploy your services is a massive waste of time.

Obviously, the reality of this lands somewhere in the middle. I feel like Kubernetes is the whipping boy for "over-complicated infrastructure" undeservedly. Hosting it yourself I am sure is a bear, but there are a LOT of great hosted solutions available.

Google Hosted Kubernetes makes my job easier.

I write a few Deployments, Services, and Ingress controllers, set up keel.sh to update my deployments based on docker image uploads, and BOOM: Awesome, absurdly automated infrastructure.

Log aggregation? Comes out of the box with Stackdriver logging.

Monitoring and alerts? Comes out of the box with Stackdriver monitoring.

My developers can edit the Kubernetes resources through the Google Cloud GUI.

Deploying to an environment is as simple as pushing a docker image with the correct tag and letting Keel take care of the rest.

We have looked at alternatives, including bare metal, MULTIPLE TIMES, but in the end, we keep on deciding that Kubernetes is doing a lot for us and we do not want to stop using it.


... you just need heroku.

I remember chatting with someone intimately familiar with k8s and docker. We were talking about an app I was working on which deployed to heroku. He asked how many dynos and I told him (it was under 10) and he said: "yup, you'll not need k8s".

10 dynos and a big database can serve an awful lot of users.


it really can. as a contract sys admin, the amount of money i am paid when they could have just used heroku is obscene. it's funny how badly folks can even muck that up though, and then they feel they need me. ::shrugs:: what am i going to do, give the money back? LOL. But yeah, heroku.

I'm in a situation similar to the one you describe. It's an interesting world us contract IT folks live in. I'm honest with people even if it could cost me the job, but they like having someone around in case something goes south. The beauty of this job is the sheer amount of undisturbed coding time I have to continue to automate things. One of the better jobs I've had.

10 of the large instances costs $5000 a month. Not insignificant. Smaller ones in my experience don’t run larger Rails apps very well.

$5000 a month could be a single employee's wage. Just pretend you have a new hire named "Heroku" who is generating immense value for you. If you are a one or two man shop, sure...but if you find yourself spending inordinate amounts of man hours toying with infrastructure, I think you'd need to ask what you are really saving.

Yeah, I've been frustrated so many times over my career when people get fixated on the sticker price of some SaaS/PaaS and instead insist on wasting ridiculous hours building something or gaffa-taping some open-source solution together that then has to be maintained.

I've been in meetings where the combined cost of the time taken to discuss whether we should use a thing was more than the cost of the thing. Utterly infuriating.

Even when I worked for an agency there seemed to be an automatic discounting of the cost of people's time vs spending actual cash (and cashflow wasn't an issue).

I appreciate that there are often good reasons to DIY, but when there are not I will always favour something off-the-shelf unless it is significantly more expensive.


I once worked at a company where the CEO would change the CRM every eight months or so, based on what kind of deals he could negotiate. Can you imagine how much time was wasted migrating, remapping and relearning the new systems??!

That's definitely less than half of what you'd pay to a devops/SRE type who has experience with K8s and all the other cloud hotness.

If the alternative is one engineer spending 50% of their time dealing with infrastructure, $5000 is still a good deal.
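
The break-even point behind that claim can be sketched as follows. The $5000 bill and the 50% figure come from the comments above; the function name is hypothetical, and "cost" means the fully loaded monthly cost of the engineer, which you would have to plug in yourself.

```python
def break_even_monthly_cost(paas_bill: float, infra_time_fraction: float) -> float:
    """Engineer monthly cost above which the PaaS bill is the cheaper option."""
    if not 0 < infra_time_fraction <= 1:
        raise ValueError("infra_time_fraction must be in (0, 1]")
    return paas_bill / infra_time_fraction


# $5000/month of Heroku vs. one engineer spending half their time on infra.
threshold = break_even_monthly_cost(5000, 0.5)
print(f"PaaS wins if the engineer costs more than ${threshold:,.0f}/month "
      f"(${threshold * 12:,.0f}/year)")
```

This ignores everything except salary-equivalent time (no on-call burden, no opportunity cost), so if anything it understates the case for paying the bill.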

A good rule of thumb is: "plan for scale, but don't implement until needed". True, you don't want to engineer yourself into a dead end by designing a system that can't be parallelized, but I've also seen so many startup founders (including myself) design for massive scale that won't arrive for years, introducing complexities that slow down development in the exact phase where one needs to be the most agile.

This is good advice. It's all about risk appetite. We should be identifying and communicating both the potential for over-optimization and risk of not being scalable. You can only hope to hit the sweet spot if you have clarity on both being potential outcomes.

Well. It's all about a golden mean. As usual.

The mindset shouldn't be about building an ideal infrastructure, but about having a reliable infrastructure instead. It surely isn't required to have every brand-new cool thing that was advertised on HN within the last 2 weeks. But a fully automated pipeline for code delivery and configuration as code is essential. It doesn't require that much time (especially if you reuse one of the thousands of examples on github), but it will save you later. Even for a single node in linode.

Even though it won't help you build new features and attract new users, just think about it as a necessary action to keep your existing users. Nobody wants to use a thing that isn't available because a developer messed up a deployment command, didn't notice it and went home.


8 days ago also, 81 comments https://news.ycombinator.com/item?id=19299393

After the last time, I watched a video of the Nomadlist guy, Pieter Levels, talking about it, and he said

>I've woken up so many times at 4:00 a.m. to just check if my website down and I have to do all this stuff and then I'm awake for three hours because the server crashed. https://www.youtube.com/watch?v=6reLWfFNer0&feature=youtu.be...

So maybe just PHP on Linode has some drawbacks


Enterprise systems must employ 100's of people therefore we must have complex stacks with multiple vendors providing services.

Clearly you don't understand the rationales here.

They employ 100s of people because during budget allocation they often get a lot of money to spend in a short period of time. Multi-year projects are hard to keep people interested in.

The reason for multiple vendors is because if you have just one they screw you. You actually need the internal competition.

And most of the complexity in tech stacks actually comes as a result of the move to agile. You don't have a large upfront requirements gathering and architecture process across all of the teams anymore. Instead each team is given just enough for that sprint, told to design it themselves, and then people are surprised when there are 20 different implementation styles.


/s ??

my current situation :)

I wish it was sarcastic


TFA makes a more-or-less valid point, albeit one that's been made quite a few times before. My problem with this is that the headline elides a lot of the nuance involved in these discussions. For example, once you read TFA, you realize that the author is talking specifically about very early stage projects, where traffic and expectations are minimal. In that case, yes, it's probably correct that you don't need a lot of complex infrastructure.

But as even the author allows, at some point you do need this stuff. The real truth is closer to "You don’t need all that complex/expensive/distracting infrastructure... until you do."

The other thing I might posit, although I haven't sat down and worked out a complete argument for, is that sometimes even in a very early stage a bit of more complex infrastructure (automation in particular) can be very helpful... specifically when it serves to allow you to run more experiments per unit of time / effort.


For the tweet in TFA

>That single @linode VPS takes 50,000,000+ requests per month for http://nomadlist.com , http://remoteok.io

So it's not that early stage. I think remoteok.io is currently the world's #1 remote working site. ("Remote OK is the #1 remote jobs board in the world trusted by millions of remote workers" it says)
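
For reference, a quick sketch converting the tweet's monthly figure into an average request rate. It assumes a 30-day month and perfectly even traffic, which real sites never have, so peaks will be some multiple of this.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assuming a 30-day month


def average_requests_per_second(requests_per_month: float) -> float:
    """Mean request rate, ignoring daily and weekly peaks."""
    return requests_per_month / SECONDS_PER_MONTH


avg = average_requests_per_second(50_000_000)
print(f"{avg:.1f} requests/second on average")
```

That is roughly 19 requests per second on average: real traffic, but comfortably within what a single box can serve.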


I like the simplicity advocated by Pieter Levels.

The fact that he uses PHP with PHP-FPM solves the problem of deploying a new version (starting the new version, switching new connections to the new version, draining existing connections, stopping old version).

But when using a single host machine, you still have the issue of updating the kernel and the OS, and this is sometimes better done in an "immutable" way, by setting up a new VM and switching traffic to it. This is where things become a bit more complex. You have to use things like AWS EBS or Google Cloud Persistent Disk, detach them from the old VM and reattach them to the new VM. You also need to use floating IP or a load balancer.

In other words, baby-sitting the machine and its OS, either done manually or automatically, is a real pain.

Maybe the real simplicity lies in using something like Google App Engine, Heroku, Clever Cloud or Scalingo.
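
The four deploy steps listed above (start the new version, switch new connections, drain existing ones, stop the old version) can be sketched with a toy in-memory model. This is not how PHP-FPM or any real load balancer implements it; `Backend`, `Router`, `deploy` and `try_stop` are hypothetical names used only to show the ordering.

```python
class Backend:
    """One running version of the app, tracking its open connections."""

    def __init__(self, version: str):
        self.version = version
        self.active_connections = 0
        self.running = True


class Router:
    """Sends every new connection to whichever backend is current."""

    def __init__(self, backend: Backend):
        self.current = backend

    def open_connection(self) -> Backend:
        self.current.active_connections += 1
        return self.current

    @staticmethod
    def close_connection(backend: Backend) -> None:
        backend.active_connections -= 1


def deploy(router: Router, new_version: str) -> Backend:
    """Steps 1 and 2: start the new version and route new connections to it."""
    old = router.current
    router.current = Backend(new_version)
    return old  # caller drains and stops this one


def try_stop(backend: Backend) -> bool:
    """Steps 3 and 4: stop the old version only once it has fully drained."""
    if backend.active_connections == 0:
        backend.running = False
    return not backend.running


router = Router(Backend("v1"))
in_flight = router.open_connection()       # a request still running on v1
old = deploy(router, "v2")
new_conn = router.open_connection()        # new traffic lands on v2
stopped_early = try_stop(old)              # False: v1 still has a connection
router.close_connection(in_flight)
stopped_after_drain = try_stop(old)        # True: v1 drained, safe to stop
```

The point of the ordering is that no in-flight request is cut off: the old version keeps serving what it already accepted and simply stops receiving new work.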


I used to think that resume driven development was nothing but a curse but I've softened my stance on it a little.

Realistically developers are compensated by a range of different things - free gourmet meals might be one - another even more important one being career development.

If you let developers who want to use kubernetes use kubernetes even if it's not strictly necessary it might be a net positive for the company.

Hell, even if kubernetes is a slight negative compared to the simpler equivalent it could be a net positive because it gave somebody who wanted it a career boost from experience with a "hot" technology which made them happy.

Now, only if RDS is going to cause a massive headache would I be seriously against it - provided we walk into this situation with open eyes.


There's also something to be said about not having to scramble to configure things once you outgrow that simple VM setup.

Obviously, you shouldn't prematurely optimize, but you also shouldn't wait until you are getting multiple hours of downtime every day while you are scrambling to get a better solution in place.

And I think many times we underestimate the server power we will need as we grow.

After a few hundred users I used to extrapolate that I would never really need more than a large VM and database instance. But then we added more and more functions and features and more and more computer power is needed per user now.

Then we started running into issues where certain procedures would take too long so they need to be queued, well then you have the overhead of a queuing system and on and on.
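
A minimal sketch of that kind of queuing, using only the Python standard library: slow work is put on a queue and a single background worker runs it. The names are illustrative, and a production setup would more likely use a dedicated job queue, which is the overhead the comment is talking about.

```python
import queue
import threading


def run_in_background(tasks):
    """Push tasks onto a queue and let one worker thread process them in order."""
    jobs = queue.Queue()
    results = []

    def worker():
        while True:
            task = jobs.get()
            try:
                if task is None:      # sentinel: no more work
                    return
                results.append(task())
            finally:
                jobs.task_done()

    thread = threading.Thread(target=worker)
    thread.start()
    for task in tasks:
        jobs.put(task)                # a request handler would return right here
    jobs.put(None)
    jobs.join()                       # for the demo only: wait until all is done
    thread.join()
    return results


def slow_report():
    return sum(range(1_000_000))


results = run_in_background([slow_report, lambda: "email sent"])
```

In a real web app the handler would return as soon as the job is enqueued; the `join` calls are only there so the example finishes deterministically.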

It's kind of the same thing that happens with companies as you hire employees. You think, oh, we have 5 developers now. We will never need more than 15 devs and some customer service and sales. But then you need someone to handle HR, and maybe a bookkeeper, and then people to manage the non developers, and then someone to manage the project and it just kind of grows exponentially.


>resume driven development

I haven't heard this term before. I love it


The elephant in the room here is that this is talking about projects made by a single person. The infrastructure mentioned is only in part about delivering reliable service; it's also about enabling deployments and concurrent work. If you go from one to two people, you're going to be thankful you set up a continuous delivery pipeline and the usual things needed to ship code consistently, like a working CI build and test suite. Even with one person, making it easy and reliable to deploy changes without some crazy manual steps is a good idea.

(I agree with the whole "don't run k8 et al" for your side project, though obv.)


For hobby web apps that fit within the limitations of the free tier, I found deploying to App Engine to be simplest. It's hard to beat free, and this lets you keep the app running for many years without maintenance.

It may be an impression of mine, but it seems that such articles proposing a return to simplicity are appearing more often here. I've just commented on another one questioning today's richer front-ends.

This article resonated with me as well.

We are currently running a fairly busy web app on a single Linux cloud-based VM instance, occasionally spinning up new instances during higher loads, and a deployment pipeline based on small python and shell scripts. Maybe rudimentary by some current standards, but it's been working sufficiently well.


While I agree that some people have much more complicated infrastructure than they need, and much more complicated infrastructure management tooling than they need... refraining from automating it and doing it all manually sounds to me like saying "Why are you writing tests? Your users don't care about tests, they don't care about how the code gets written, they just care about the product, you're wasting your time writing tests." Yeah, but, um.

Good luck getting concrete interfaces locked down to the point where you can write meaningful tests - all the while managing the rest of your workload - when the business requirements change every week or so.

You also don't need to write tests early on. Especially if you are still validating your product. For the first year or so of our service, we just manually tested the important features after every deploy.

That sounds miserable to me, but it may be that we are working on different platforms where the cost/benefit of tests differ.

Exactly. This is analogous to "the dose makes the poison" and "everything in moderation".

If it's just a message you want to share, such as ‘delete all docker containers and images’, why hire a team of specialists and pay for only the best in complete server infrastructure, when I can just get it for free from GitHub Pages? https://seanwasere.com/delete-all-docker-containers-and-imag...

Could not agree more!!! Most of my side projects, some big, some small, have a vps, a self-managed mysql db and a bash script called deploy.sh. I've seen whole sprints devoted to fancy infrastructure... with the payoff usually not much... except for one guy ending up owning all the magical pieces and know-how...

Or in my case, I feel the need to learn these technologies so I implement them to some extent, maybe that way I'm more employable. Then again, I love tinkering with new tools and my side projects don't make me any money yet.

> The answer? Simple … a single Linode VPS.

Fully second that. I started some of my best projects on the small Intel Atom Linux server at my residence. And we all know the garage story of Facebook (and similar)


"I didn't need it so you don't either. Look at how easy it is to claim everyone else is over complicating things because I didn't have a need for it!"

Is essentially the TL;DR of that article. Being dismissive while not really substantially expounding on the faults of other processes is just bland contrarianism. Bald assertions need only be met with bald assertions.

Allow me to write a retort:

You do need all of that infrastructure and you should spend even more time making sure your pipelines are a well oiled machine to reduce deployment fears and establish confidence in your infrastructure. Make it nice to use. That new tool that came out? It was written for a reason, and is probably tackling a problem you weren't even aware you'd be facing. So stick it in and call it a day with the knowledge that you've avoided spending time on a problem someone else has already solved.

Your turn, cynical blog person.


This ^^

I've seen / worked with a number of startups that used the author's advice and just had a simple setup "that worked". Until it didn't.

Here's how it typically plays out: Our "dev" set up "the server" for us at <AWS/GCP/DO/Linode/etc> and everything worked fine until the provider restarted the server, or an OS upgrade happened, or we fired the dev we found on Upwork and he shut down the server. Now X doesn't work. We don't know how to reproduce what he did.

Now you are left with trying to go through someone else's bash history and decipher what steps they used to build the server. Did they forget to tell a service to autorun? Who knows.

I agree with OP that it's possible to over-engineer a fancy CI / CD pipeline and matching infrastructure for a founder that just has an idea and zero users when you should be getting product/market fit, esp. when you just have one developer working on the system. However, the opposite is also true. It's possible to under-engineer the infrastructure where developer productivity is dramatically slowed when you're spending a significant amount of your dev's time doing deploys and blocking other work from happening at the same time. This can happen fast when you hire dev #2 and #3. This isn't even getting into the perils around security and scaling when you play fast and loose with the infrastructure side of the house.


I've seen three articles just like this in the past week. I don't know why people dignify them by posting them here. It seems like being deliberately contrarian as a substitute for actual nuanced reasoning based on experience is a good way to get exposure.


Maybe I'm weird, but I like servers that I can touch. Maybe if you're only a couple-person company, it makes sense to use the cloud. But just paying for some space at a data-center and having your own servers seems ideal for any start-up that is medium-sized or larger.

My last job was at a financial start-up worth about 200M at the time. And our server setup was dead simple: 8 servers and a load balancer. No containers, no bullshit, just servers running JVMs.


Coding Horror has a discussion on using colocation for Discourse. Seems kind of OK, really: https://blog.codinghorror.com/the-cloud-is-just-someone-else...

I don't see any real disadvantage as long as you have IaC.

It takes a lot of experience and discipline to apply the right amount of leverage to the problem

Sometimes you do. Sometimes you don't.


