On my ops team, we've gotten some flak for not building a robust enough request queue for some tasks. But it's only been down several hours in the past year. The server almost never needs maintenance. None of the workload is real-time. App restarts are acceptable if memory leaks occur.
If we did everything by the enterprise book, we'd still be 70% of the way to a deployed product, instead of 15 months into its completion.
Sure. Except if this Ops team is like most Ops teams, they'd prefer to do it the (cue dark, brooding orchestral dum-dum-dum musical interlude) 'enterprise' way rather than be woken up by PagerDuty in the middle of the night because a new release push led to the CPU spiking.
Sorry, but I've almost never seen situations where people have skimped on HA and someone else didn't get excoriated during an outage.
Maybe other people's experiences are different, but as a DevOps guy, even in a relatively new environment, my priority is stability. Not faith-based-computing based on someone's positive seat-of-their-pants past experience.
Reliable infrastructure at the cost of velocity isn't just good practice, it's self-preservation if you're ops.
Our little request queue, though prod-necessary, only needs 98-99% uptime. Same with much of our automation. Once you ask the simple question "Does this cause an interruption to our core business, or to security?" and you can say "no", HA requirements go down, down, down.
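For a sense of scale, an uptime target converts directly into a yearly downtime budget. A rough sketch of that arithmetic (the function name is made up for illustration, and a plain 365-day year is assumed):

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years (assumption)

def downtime_budget_hours(uptime_percent: float) -> float:
    """Hours per year a service may be down and still meet the uptime target."""
    return HOURS_PER_YEAR * (100.0 - uptime_percent) / 100.0

# Compare a few targets side by side
for target in (98.0, 99.0, 99.9):
    hours = downtime_budget_hours(target)
    print(f"{target}% uptime allows about {hours:.1f} h/year of downtime")
```

By that arithmetic a 99% target leaves roughly 87.6 hours a year and a 98% target roughly 175 hours, so a queue that was down for several hours in a year sits well inside either budget.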
I fully agree that so long as SLAs are negotiated proportionally to infrastructure investment in money and time, then it's all good.
I've personally seen Ops teams get tricked by being told things weren't important, when in fact they became production emergencies with yelling when they invariably went down.
£70K of new hardware!
I stuck it on an existing server and nobody noticed though I might have got into trouble with the Change Prevention Board for not using the right form or something...
Edit: He did have a carefully worked out explanation of where the money had to be spent.... which I ignored after hearing "£70K".
Then dump the static file on S3, and let the interest from the £40K in your bank account pay for it.
- Generate static files using Pelican
- Dump the files to S3
- Set up Route53 to point a domain at the bucket
- Set up CloudFront (Amazon's CDN) to sit in front of the S3 bucket, since I found pure S3 to be too slow
Works incredibly well.
I put my photo/video blog on S3 reasoning that it'll handle spikes better. At one point it reached the front page of HN and cost me $450. Afterward I looked at the logs and found that my single $20 linode would have handled it fine...
But if you're talking about small static websites, AWS gives you an easy interface, Cloudfront, Route53, etc. It's a very easy way to do some very complex things.
You might also look into billing alarms. I've got some simple background jobs that I run on lambda, pulling some files to S3 every hour. It costs me a buck or two a month. But if it ever looks like it might charge me $20, it will alarm and email me.
The "or something" and "which I ignored" aren't making you look like the reasonable actor in these interactions
I doubt it was a mission-critical single static html page.
edit: £70K plus the cost of installation/hardware support/etc/etc.
- "ignored ... a carefully worked out explanation"
- doesn't (or doesn't care enough to) understand the company's established processes ("...or something")
Just as we can all point to examples of process absurdity, we can all point to examples of
- failures because process didn't exist
- companies with good process
£70K to host one static webpage is not a good process or "carefully worked out explanation" any way you cut it.
Haha, ours is called "Change Management", but they would more correctly be called "Change Prevention", or "Sanity Obstruction".
Even when I worked in electronic trading and we were dealing with 10 million events/second on combined feeds, we kept our architecture simple. Why? Experienced engineers realized that the more moving parts there are, the more things can go wrong (and the harder it is to pinpoint what did).
Just time your queries, get a P95, and stream data to an analytics database if you don't want your prod to be exposed to added latency. No need to create some fancy distributed consistent system with caching layers and enormous test harnesses if the analytics workload might change next week.
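"Time your queries, get a P95" can be as small as the sketch below. Everything in it is a stand-in: an in-memory SQLite database plays the part of the real one, the `orders` table is invented, and `p95` uses the simple nearest-rank definition rather than any particular monitoring tool's method.

```python
import sqlite3
import time

def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]

def time_query(conn, sql, runs=50):
    """Run the query `runs` times; return per-run latencies in milliseconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies

# Stand-in database with a hypothetical table (assumption, not from the thread)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders (amount) VALUES (?)",
    [(i * 1.5,) for i in range(10000)],
)

latencies = time_query(conn, "SELECT COUNT(*), AVG(amount) FROM orders")
print(f"p95 latency: {p95(latencies):.3f} ms over {len(latencies)} runs")
```

If the P95 is uncomfortable for prod, that is the signal to point the same query at the streamed analytics copy instead.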
I'm a huge fan of just creating a separate analytics database you stream prod data to and letting those with SQL knowledge play around there. Surprisingly, they rarely break anything. And if they do, it's not going to take down everything else with it
"Good enough, done in a week" is a perfectly viable option that is rarely offered by timesheet-padding developers.
I see it all the time from timesheet-padding developers who also are in a position that they'll benefit from the higher long-term support costs that that approach generates.
It's just a matter of whether they want to pad timesheets in the short term or the long term.
Or are you referring to some sort of government bureaucracy dealing with loosely coupled contractors bidding on jobs?
Back on earth I can tell you full time devs at big 4 tech companies padding estimates would be a very difficult thing to pull off, for a variety of realities.
Same situation for many other developers.
I was responding to (as should be clear, since I quoted it) an upthread generalization about what “timesheet-padding developers” would or would not do. It's expressly not a statement about developers generally.
> Back on earth I can tell you full time devs at big 4 tech companies padding estimates would be a very difficult thing
And pointless, since they aren't paid hourly. Timesheet-padding is only incentivized when developers are paid (or work for a company that is paid) based on timesheets, that is, hourly.
But, anyway, I wasn't talking about padding estimates, that was the poster I was responding to. My observation was that developers with incentive to pad timesheets often will do the opposite of padding estimates, going for quick and dirty development solutions which create more maintenance and support hours (of course, that only occurs as a result of structural incentives where the contractors building the system or feature also expect to be supporting it, but that's often the case.)
manager: Why did you allow this thing to happen.
me: It was an edge case that I pointed out 7 months ago. I've put it on the backlog 3 times, and you've removed it each time citing that it's an edge case that would never happen, and other newer features needed to be implemented.
manager: That's an engineer's response. Just fix it!
me: welp, shrug
In certain cases I've also overridden the "ignore that for now" in favor of at least some minimal fix because part of my job is to give you what you don't know you need to ask for, not just exactly what you ask for. And there's the value add because compared to people who don't consider that part of their job, my stuff / my team's stuff works better out the door.
Another similar question that I've noticed others don't always ask is something like "how often would this be used again, is it just a temporary one-off?" which, of course, has an answer that's subject to change, but at least guides the initial design. And when it changes it gets revisited.
This is a two-edged sword, though. I agree it's best practice from a technical point of view, but enabling wilful ignorance on the part of someone above you in the chain will eventually lead to a showdown when you really do need a week to fix it and it really is critical, and they tell you not to, and they'll stick to their guns because (as far as they know) they've been right every time.
Yeah, pushing that too much requires either (a) sufficiently-padded early estimates so it's not a slippage and a nasty surprise (a good practice anyway, but hard), or (b) a sufficient level of don't-give-a-fuck, or (c) really good judgment around "when this is necessary to give them what they need vs what they want" and when not to e.g. truly critical and immediately messagable stuff.
A healthy dose of don't-give-a-fuck/willing-to-say-no is often a good thing anyway, though, if you're able to do it politically/smoothly enough :)
Translation: I didn't actually require any answer other than "I am sorry, sir/ma'am, it will never happen again." Facts are irrelevant. I may not always be right, but I am never wrong, code monkey.
I worked at a hedge fund for 9 years in back office IT developing firm-critical software. We traded pretty much anything you can trade, so we had a lot of trading desks requesting changes all the time. So, while making a change for one desk, at least one other is bitching about why their change is taking so long. Yeah, the change might only take 5 minutes, but it might be 4 weeks before I get to it.
We traded for all but about 2 hours of the week. That means you had a narrow deployment window. Even once the change has been made and tested, you've got to wait for the deployment window. One weekend of each month was off-limits to system changes due to options expiration. So, even the smallest, simplest change might take at least a week or two to get into production. Quite a bit longer for big or breaking changes. Toss in regulatory and compliance issues and you've got a lot of paperwork and sign-offs to do a deployment. You've got to track those people down, explain the changes to the managing director and the risk with making the change.
Emergencies were fun. Either getting called or having to call a manager at 2am to get approval... I once had an emergency at 11pm. I got a call from my director about 15 minutes after I'd popped sleeping pills (I was having insomnia at the time). I went through every source of caffeine I had in my apartment to get through that; I dropped off the conference call at 4am. I recall hearing the call went on for several hours afterwards.
TL;DR It's not always over-engineering, it's more often misunderstood or invisible (to the business side) pressures on the engineer's time.
Funny, this story reminded me of a recent event where my team was working on some regular expressions for a language processor. Not overly complex stuff, but not simple by any definition (they had been working on the set for 2 days). A sideline manager from another department, who knew enough to be dangerous, decided he was going to run to his office and whip up some regex after reading the docs. So we threw it at the test cases. After tons of failures he got the picture that edge cases count, as he got a good visualization of how multiple edge cases increase the odds of failure by orders of magnitude.
Which itself could be a result of management often viewing engineering as a cost center with no perceivable ROI. This can be reflected in pay not being commensurate with company performance and not stacking up to other employees (sales, management, etc.) and gives the engineers no reason to pay attention to ROI in their decisions. I would bet that in companies where engineering is very much involved with cost, ROI, and paid based on how those numbers turn out, the engineers would focus less on edge cases and know when to draw the line.
What's hard is finding the right balance. The proverbial "SQL in 5 minutes" moment from the blog post may not fully appreciate the developer's job, but it can serve as a reality check to self-question if you're over-baking it.
If you're willing to come in on the weekend to fix your query, be my guest. But I like being able to spend my weekend relaxing.
They often blame others while finding very sneaky and specific ways to, despite all odds, rise to a relatively insulated level without any modicum of technical or managerial prowess
Damned if you do damned if you don't.
If you don't focus on edge cases then later it's your fault for being sloppy.
If you focus on edge cases you are "over engineering".
Most of the time "over engineering" just means "I wish this was cheaper".
As for personal experience, well over two decades in the business, if we're keeping score.
So, you would have readers believe that developers are egoless automatons who do everything by the book? If true, they'd be the only faultless profession on the planet.
15 years with software dev experience and every single dev team at midsize and larger companies has always asked for far more time than necessary. Same with almost all enterprise consultants.
Or did you not give them the time they asked for and declare yourself correct after squeezing in a deadline? And if so how do you know what corners were cut or what technical debt was accrued to make a deadline the team did not sign off on?
Whether every single person is noble is irrelevant. People are making broad generalizations here against the reputation of developers with what I see to be weak and anecdotal evidence.
If someone has made serious inquiry into the subject and controlled for complexities like I suggested above then by all means let the chips fall where they may.
This is human nature, people will look out for themselves. Many, if not most, are looking to do the least amount of work in return for the largest reward, and it's very easy at bigger companies to pad hours, compounded by bureaucracy and poor management. This is seen in many other professions amongst all types of people, nothing specific to software.
Could I do the work myself in a week? Yes. Can I assign it to a team of diverse junior to mid-level developers and expect them to do it in a week as well? Realistically, no. They have to investigate, acquire knowledge, work together, be guided on the correct design (or get to it themselves), and then finally do the work and test it to sufficient quality.
Even more so the case if it's tech we have not worked with before.
This is my concern. It absolutely doesn't give you that knowledge. Accurately estimating development of non-trivial software is still an open problem. The best solution we have to date is to accept this fact and use it in our process and planning to our advantage when possible.
If the work you're talking about is rote and mostly a matter of repeating a pattern rather than innovating and engineering, sure I can see that being more predictable. Unfortunately quite a bit of software development does not fall into that bucket.
I don't blame him for that part - it doesn't take much complexity before any manager is incapable of understanding every detail.
The scary part is he doesn't seem to understand the fact that this implies the most successful teams will utilize some degree of trust and efficient communication to deal with this reality.
Ah! Classic duct tape programming!
Something like this is fine as long as the requirements don't change.
But we all know that at some point the requirements will change. And then you have to tell your customer that the "little new requirement" can't be accomplished with a cheap little change but requires instead a complete rewrite that will cost even more than the previous system did.
The only question is whether you are an honest company that told the customer beforehand that this will most likely happen in the future, or you planned for that inevitable outcome in the first place...
But as you said, knowing how and when to negotiate between said extremes is exactly where the "art" lies.
So has AWS, so you can kind of feel good.
PS> Joking, but not really.
Biggest lie ever. That's pretty much production-quality code here.
Asked to get some kind of analytics query to someone, and they need it fast and want it in some kind of visualization tool.
1. You open Zeppelin, take a bunch of database tables and whip out a query that is basically what they want and export it to an Elasticsearch + kibana instance.
Now come the edge cases:
Oh I forgot, it needs to be on the internet
2. Need to set up a public IP, DNS, Nginx server and a series of rules to make it read-only (and it's still dangerous mind you)
Why isn't this password protected?
3. Add an nginx basic-auth with a single password
It needs to be available to admins and sales managers only.
4. Set up ngx_http_auth_request_module to hit our authentication server (the cookies should be present). OHH, the DNS name doesn't match the cookies. Set up a /subpath on the existing application.
It needs to work as an embedded view in the company's mobile app.
5. That uses tokens, not cookies, so the auth-request module no longer works for this. Need to come up with an SSO solution with a cookie and a place to store the cookie in a database. That requires a REST service on the existing app server, which will require a redeploy.
I just added a product to our system and it's not in Kibana
6. Need to modify the spark code to use spark streaming
The Spark server restarted and my new products aren't showing up!
7. Need to set up a service on the system to auto-start the spark job.
It is feasibly an actual product feature at this point, but only (1) was asked for when they really wanted 1-7. I would argue that (1) would only take 25% of the time 2-7 take to do. Not every product change is like that, obviously, but sometimes people think all changes are just so simple. Often it's the history and unstated features that make a huge difference.
Also this for fun:
One of my complaints about working in an environment where I am the only person who can program is that nobody else understands this. I've taken the time to explain this point to people, and they seem to grasp it fairly well, but once they see something that somewhat resembles the final product they get really impatient.
The last 25% of functionality takes far more time than the first 75%, especially once you consider handling edge cases, etc.
Learned that one in my first ever project. We mocked up the entire application in Visual Basic (just windows and buttons, no actual functionality implemented, we just wanted to know if it'd work for them) and then the client got really upset when we couldn't come round and install it the next day. Classic case of failing to manage expectations.
I haven't found a way around this other than careful communication, which works with most but not all clients.
Like this: https://media.balsamiq.com/img/examples/wiki-sketch.png
My wife does this kind of interface mockup presentation as part of her job, whenever she does a client presentation she starts with "so this is a screenshot of the inside of my head..."
The first 80% takes 80% of the time. The last 20% takes the remaining 80%.
(Especially for internal tool) Is it possible that what they've seen is, actually, good enough?
Or at least, good enough to solve 90% of the problem while you fill in the missing piece that covers the rest.
Then we have a few people who completely ignore any limitations I put on tools and use them in ways in which they explicitly do not work.
Case-in-point: I created a process that automated a particular task. Said task bears a resemblance to another task, but there are a lot of edge cases in the other task. The manager who runs the second task got upset that I had automated one task but not the other, and complained in a senior staff meeting about this.
I mentioned that the second process was much more complex, and a lot could go wrong if she tried using the new process for it.
Anyways, I took a sick day yesterday, and she had sent an email to me CCing my boss, complaining about how the new process was broken. She had tried changing the settings to work with her process and broken both processes.
She is fully convinced I am just keeping tools from her because I don't like her. The reason she doesn't have any tools is because I have to build the most robust tools possible or she'll break everything, and I can automate 10 other processes with that sort of time.
I'd say that my boss generally makes the right decision when it comes to what is best for the organization, and I'm not in the habit of rewarding people for being abusive.
Your engineer knows that if she writes your "5 minute" query without careful analysis, peer review and documentation and the query ever produces a questionable result --- whether it was anticipated by your requirements or not --- it's your engineer's ass; you'll throw your engineer under the bus _instantly_.
Your engineer knows that if she writes your "5 minute" query and it produces any actual value you'll be back the next day with a "5 minute" enhancement. Anything you ask for that might matter the next day has to be built to be maintained by others because if she happens to take the day off when you show up and demand a revision to your "5 minute" wonder query and there is nothing for the other engineers to go on (revision controlled work, documentation, etc.) then that's her ass; she knows you won't stand up for her.
Your engineer didn't just fall out of the boat and is in no hurry to obligate herself to take responsibility for your adhoc miracle queries and the questions that will emerge when you go waving the output under everyone's nose, and she knows that's exactly what you'll do with it. Your little query is your view of the world and that view is highly unlikely to survive the first bit of scrutiny that's applied by anyone other than yourself, much less the second.
Asked a programmer our side (.net) who said it'd take a couple of hours to write a simple webserver, package it up into an .msi and give it us. I got annoyed, did it in Golang in about 10 lines of code.
I then realised I'd compiled a Linux executable on Linux, remembered it did cross-compiling, 10 seconds later I had a Windows .exe. All for a simple webserver that printed "cock", not useful but it proved the tunnels worked.
Sometimes we overcomplicate the simplest shit.
EDIT: As it wasn't clear (my fault) - we were trying to get to port 443 at their end - they were on Windows, we mainly Linux but the guys I asked were .net programmers..
Indeed. Why not use nc and telnet on port 443 to test? Linux already has nc and Windows already has telnet.
You could of course use nc as the client too, but it would be an additional install on Windows. 
And nc wouldn't help if the firewall was stateful and actually expected to see HTTPS traffic on port 443, so I do see their logic of putting an actual webserver there.
But without sufficient detail, it does seem at first glance to be overkill ;-)
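For the client side of that nc/telnet test, a plain TCP connect from Python does much the same job. A minimal sketch; the function name is made up, and the host you pass in would be whatever sits at the far end of the tunnel:

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike
        return False

# Example call (hypothetical host name):
#   port_is_open("far-end.example", 443)
```

Like nc, this only shows that a TCP handshake completes. It says nothing about whether a firewall that inspects traffic would let real HTTPS through, which is the caveat raised above.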
Bear in mind the server was at the Windows end, so no good with telnet over that side...
I wouldn't know how to do it in Golang because I don't use it. I would default to my most comfortable language, which is almost guaranteed to not be the most efficient method to do anything.
python -m SimpleHTTPServer 443
Remember - the server end is Windows, and the guys at that end didn't know what to do if it wasn't an .msi or .exe.
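Side note on the one-liner: `SimpleHTTPServer` is the Python 2 module name; on Python 3 the same thing is `python -m http.server 443`. For a fixed-response server closer to the Go one described upthread, something like the sketch below would do. The handler name, the response text and the default port 8443 are all made up here (binding to 443 usually needs admin rights), and it speaks plain HTTP, not TLS.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    """Answer every GET with the same short plain-text body."""

    def do_GET(self):
        body = b"tunnel works\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def make_server(port: int = 8443) -> HTTPServer:
    """Bind on all interfaces; ports below 1024 usually need elevated rights."""
    return HTTPServer(("", port), HelloHandler)

# To run it until interrupted:
#   make_server().serve_forever()
```

It still leaves the packaging problem from the thread untouched: the far end would need a Python install rather than a single .exe.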
I have cleaned up too many messes because of negligence from people who do not understand how hard programming actually is. That's what most people, especially us developers at times, fail to recognize. That 5 minutes someone took to write an "easy" query against the CI and deployment server almost brought it down (true story, luckily I was monitoring it while looking at another issue).
The ability to write code is taken for granted, because anyone can learn it. Some programming is easy, and some is extremely difficult and the real trick is knowing which. What scares me most about the code being written are the one off queries, etc. The ones that will "only be used once" or "only for low transaction instances". That's never true, someone will always have it lying around for that time when "we just really needed to make that update".
An old boss of mine used to say "the perfect is the enemy of the good." This is true, there are a number of times you need to get something up and running and worry about fixing it along the way. There are other times when that "good little app" got used in the wrong way and cost us hours of downtime because of a mistake due to rushing. Now the perfect solution doesn't look like it was such a bad choice after all. I can wait a day or so for a query that I could write in 5 minutes. In the long term, waiting a few extra hours isn't going to impact anything that much. I'd rather the developer be thorough than explain why no one went home that night because we had to clean up a mess.
>Why does my engineer say it will take a month?
If you don't know the answer to that, you're a bad manager. Either you hired bad engineers or you have no idea how your dev process works.
This post is about engineering a future-proof solution rather than fixing bugs.