This is not a new problem: organizations will always choose guaranteed profits over possible loss of business unless the loss of business is catastrophic. I just wish that in this case, instead of trying to make it seem like a big deal by writing an entire multipage excuse, a company for once would be honest and say 'The risk percentage did not fall in our favor this time, but we're not going to do anything about it because it didn't really impact our profits.'
I suspect this becomes a problem in the context of hiring devops people, because whereas you can make the argument that writing game engines and working on game logic is more fun and justifies working for less, it's hard to argue that a devops job at Epic running game servers and websites is any more exciting than running servers and websites anywhere else.
This puts Epic in the situation of having to pay market rate to attract devops people but below market rate to attract developers, which fucks up their pay scaling completely. What ends up happening is they just don't adjust their pay scale at all, which means they're hiring cheap devops people.
I was just curious, since they approached me, and I had fun with the experience of saying I didn't play their games and had no idea about them.
The recruiter had no idea of local wages.
Now seeing this I'm sure I dodged a bullet.
I'm not sure that love lasts forever though. I'm childhood friends with a lot of people who went into games and left by their 30s because they couldn't justify the pay difference. That said, maybe the games industry doesn't need these experienced people.
When people say the video game industry doesn't pay well, that doesn't apply to the likes of EA, Activision, Epic, Unity, etc.
The main point of my comment is that there's a natural downward pressure on the salary of their software engineers due to being in the games industry, which makes it difficult to find devops people (which, for the purpose of this conversation, are distinct from software engineers) at a similar rate, and makes it difficult to justify paying devops people at the market rate.
I do agree with OP that (some) game developers undervalue IT. Oculus had a similar issue, and the pay rate was equal to FAANG (because it is FAANG!), so it came from culture, not pay.
Yeah, I know FAANGs and investment banks can be impressive on the bonus front too.
But the prevalence of this just seems disconnected from what is considered normal or bragworthy in the rest of the private sector and the world.
And it continues because as much as people complain, they are holding these game publishers and developers to the lowest possible standard.
That's an interesting anecdote, but it's quite easy to find examples of companies with well-respected, well-paid engineering teams that still have an occasional certificate expire. Microsoft, Spotify, Facebook, and Apple have all had embarrassing outages due to certificates expiring.
That's why things like AWS Certificate Manager + ELB are useful: certificates are mostly auto-renewed.
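To make that concrete, here's a minimal boto3 sketch (the domain and region are placeholders) of requesting a DNS-validated ACM certificate; once the validation CNAME is in place, ACM keeps renewing the cert automatically for as long as it's attached to something like an ELB listener:

    # Sketch, assuming boto3 and AWS credentials are configured.
    import boto3

    acm = boto3.client("acm", region_name="us-east-1")

    response = acm.request_certificate(
        DomainName="example.com",            # placeholder domain
        ValidationMethod="DNS",              # DNS validation enables auto-renewal
        SubjectAlternativeNames=["www.example.com"],
    )
    print("Certificate ARN:", response["CertificateArn"])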
It is a chore that has bitten most of the places where I have worked.
I've seen the opposite: organizations that spent so much on the department that everyone was getting promoted to manager and hiring someone underneath themselves to manage things. Responsibilities being shuffled around as the department is constantly reorganized, until no one really understands who's responsible for what any more, but there are enough low-level employees to blame when things go wrong.
I've seen enough variations of organizational dysfunction that I no longer pretend to be able to guess what's going on behind the scenes.
Be reasonable: you know nothing about how Epic Games treats their IT staff or whether or not the team is adequately resourced. I wouldn't say certificate expiry is something that happens particularly often, but I have seen it happen, and it's been simply an oversight rather than an indication of some serious systemic issue.
So yes, I think this is one of many signs that they're not paying enough attention to extensions, not a totally isolated "accidents happen" event. Were I an extension author, I'd see that event as reason to be more concerned.
You're misinformed. Many extensions work, they are progressively being re-enabled over time, and on the nightly version they are all available, although whether they actually work depends on the state of the underlying APIs. The reason for the whitelist model is that when they swapped to the new mobile browser engine, the underpinnings of many of the extension APIs had to be reimplemented, and they are not all online or bug-free yet.
And no, I don't consider a small custom list to be "support". It's a high-value list and a solid sign that they're not wholly abandoned, and I do expect it to come eventually, but it's very much not the same as general availability. General availability did exist before.
Edit: I broadly agree with their breaking of NPAPI stuff, WebExtensions (as a concept, not necessarily the specifics we have now) has a LOT of very real benefits, and does not inherently prevent equal or better capabilities. But it too is still a loss in control, as it stands today.
Since I left, I understand that they've fully automated, and mandated, all certificate generation and rotation, but there have still been cert expiration events, albeit rare.
Cert expiration events happen. They're zero indication of a company's engineering intelligence, capability, or maturity. It's a thing that just works until it doesn't, with zero warning.
You can't verify anything internal unless you're internal or it has already failed publicly, so you of course have to draw on patterns seen elsewhere. Critical-process failures in one area correlate heavily with failures in others.
Plus, Epic has not exactly shown themselves to be producing consistent quality in anything related to their store, or many internet-connected properties. If they were, this might be more attributable to "accidents happen, it's impossible to prevent them all". It could still be an abnormality, but they're edging further towards "... maybe not though" territory.
Edit: let's add a concrete "kinda example, kinda counter-example". Google is a tech company that is pretty good at consistently renewing its many certificates. They recently failed to do so for Google Voice: https://www.bleepingcomputer.com/news/google/recent-google-v...
I think there's a reasonable argument to be made that this reinforces claims that Google Voice is low priority / at higher risk of future issues due to lack of care, i.e. systemic issues, compared to other Google properties. I have no proof, but that doesn't mean it's automatically unreasonable.
Don't get me wrong: I'm not saying there aren't problems at Epic Games (most companies have them). What I'm saying is, we're just speculating: how is that helpful? Either to them or to this discussion?
We're either casting vague and hand-wavy aspersions or citing more specific examples where we actually have no idea whether they have any relevance to Epic Games.
It's just noise because, as you've pointed out, we're not internal.
It was an illustration of a thought process that seems to make sense to me.
> It's just noise because, as you've pointed out, we're not internal.
Yes, it is noisier than direct info from inside, but you may still learn something.
“We made a bad bet on certs not being that important, it backfired” doesn’t sound good but it’s the truth.
The same thing happened when Delta got wiped out by a power outage. “We made a bad bet on geo redundancy not being important, it backfired” wasn’t good enough for them either, so they pontificated just like Epic did here.
It’s obvious that Epic doesn’t take certificates very seriously here. This is cert management 101. No need to read into it much further.
That said: Glassdoor is a terrible metric and has been widely criticised as a source of information because bad reviews can be removed for payment. Though “officially” they don’t accept payment to delete reviews, it’s part of one of their packages to clean up a company’s image.
It has also been gamed by employers, but that is obviously a problem for all review sites of this kind.
Kinda like “I’ll get my nephew to make my website”
At first, it wasn't clear whose responsibility it was: back in the operations days, expiry emails would go to someone's specific address, or to a mailing group where most of the employees who were on it had left, while new employees weren't added to the list because they didn't know about it.
After it happened once or twice, metrics were set up to track expiring certificates (they were mostly all migrated to AWS Cert Manager, I believe), while a few key ones couldn't be.
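For illustration, a rough sketch of the kind of expiry tracking that implies (assumes boto3; the 30-day threshold and the print are stand-ins for a real metrics/alerting pipeline):

    from datetime import datetime, timedelta, timezone

    import boto3

    acm = boto3.client("acm")
    threshold = datetime.now(timezone.utc) + timedelta(days=30)

    # Walk every ACM certificate and flag the ones expiring soon.
    for page in acm.get_paginator("list_certificates").paginate():
        for summary in page["CertificateSummaryList"]:
            cert = acm.describe_certificate(CertificateArn=summary["CertificateArn"])
            not_after = cert["Certificate"].get("NotAfter")
            if not_after and not_after < threshold:
                print(f"EXPIRING SOON: {summary['DomainName']} at {not_after}")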
As a bit of background, we also follow the Google-esque model of not having a phone number for customer support and requiring customers to submit a ticket. We do have outgoing calls but no incoming phone number.
I say that because those key certificates would generate an email that said something like "Press this button and we'll call you to confirm you want to renew" so as you can imagine, my first thought was "Well, how the fuck is shit gonna work?"
I think in the end we just ended up calling the certificate provider to say we don't have a phone number and then we managed to get them migrated to DNS-based validation after some time.
This too wasn't a case of being underpaid but rather a lack of knowledge. It's the sort of task that some particular person did for a long time but then left, so none of us newer folks even knew where these things were provisioned from. Additionally, you don't feel like you have the authority to, e.g., call up some multinational provider and be like "Hi, we own this thing but umm, I have no idea how to go about renewing it." It feels like being a teenager calling up about a first job haha.
It's just one of the casualties of "high growth" businesses mixed with humans being bad at seeing cause and effect when the gap between the two is super wide. Cause being people leaving, and effect being "I forgot to ask how to do X or Y".
I guess I would clarify that we were following a devops model but had transitioned from a classic dev/ops split, so it's quite literally a generational thing where you conceptually don't know how to go about, e.g., renewing a certificate over the phone, because you've entered the industry in the era of DNS validation via Let's Encrypt (and because there literally are no phones anymore in the business).
Start from the idea that you're going to issue certificates valid for 24 hours, and think how different your environment would need to look.
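If it helps make the thought experiment concrete, the signing itself is trivially easy; everything hard about 24-hour certs is the distribution and rotation machinery around it. A sketch with the third-party cryptography package (internal.example is a placeholder):

    import datetime

    from cryptography import x509
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import ec
    from cryptography.x509.oid import NameOID

    key = ec.generate_private_key(ec.SECP256R1())
    name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "internal.example")])
    now = datetime.datetime.utcnow()

    cert = (
        x509.CertificateBuilder()
        .subject_name(name)
        .issuer_name(name)                   # self-signed, for the sketch
        .public_key(key.public_key())
        .serial_number(x509.random_serial_number())
        .not_valid_before(now)
        .not_valid_after(now + datetime.timedelta(hours=24))  # the whole point
        .sign(key, hashes.SHA256())
    )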
I renew once a month, and if things should break, I have a two month window to fix the issues.
Before that, I would receive a Comodo SSL certificate once a year via email, and by then I had always forgotten what I had to do with it. What an unnecessary pain.
(Ideally, you'd remember and never set the alert off, but it's still great to have that extra layer.)
Anyway, the best approach is to shorten the certificate's validity. The way Let's Encrypt recommends is perfect: run the renewal often and require several consecutive failures before anything breaks.
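Something like this sketch, run daily from cron (the state path and alert hook are made up; certbot stands in for whatever ACME client you run):

    import json
    import pathlib
    import subprocess

    STATE = pathlib.Path("/var/lib/renewer/failures.json")  # hypothetical path
    MAX_FAILURES = 3  # alert only after several consecutive failed runs

    def consecutive_failures() -> int:
        try:
            return json.loads(STATE.read_text())["count"]
        except (FileNotFoundError, KeyError, ValueError):
            return 0

    def record(count: int) -> None:
        STATE.parent.mkdir(parents=True, exist_ok=True)
        STATE.write_text(json.dumps({"count": count}))

    # A single failure is harmless: a 90-day cert renewed this often still
    # has weeks of slack, so only page a human on repeated failures.
    if subprocess.run(["certbot", "renew", "--quiet"]).returncode == 0:
        record(0)
    else:
        failures = consecutive_failures() + 1
        record(failures)
        if failures >= MAX_FAILURES:
            print(f"ALERT: renewal failed {failures} times in a row")  # hook real alerting here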
I'm not advocating against it, just exposing the whole story.
Pretending it's "totally static" is exactly the problem. There are only two kinds of things in the software world - things that can stay the same until your next release, and things that need automation. "Almost completely static" is how your post mortem ends up on the front page of HN.
A consideration of the full story also needs to include the risks associated with long-lived certificates. If you lose control of the private key associated with one, what do you do? Are you actually operating a CRL? Are any of your HTTPS clients actually checking the CRL? What would you do if a severe compromise were discovered that affected the signature algorithm you're using?
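As a small illustration of how opt-in revocation checking is in practice, CRL checking in Python's ssl module has to be explicitly enabled, and most clients never do (file paths below are placeholders):

    import ssl

    ctx = ssl.create_default_context()
    # load_verify_locations accepts CRLs as well as CA certs (PEM/DER).
    ctx.load_verify_locations(cafile="/etc/pki/ca-bundle.pem")
    ctx.load_verify_locations(cafile="/etc/pki/revocations.crl.pem")
    # Without this flag the loaded CRL is ignored entirely.
    ctx.verify_flags |= ssl.VERIFY_CRL_CHECK_LEAF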
This grants you 30 days to fix any problems and get the system back up.
One possible solution might be having the client introduce an artificial delay of, say, 10 seconds when it encounters an expired cert, or add an additional second of delay for every day it is expired. This degrades the connection but does not immediately break anything.
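As a rough userspace sketch of that idea (a real version would live inside the TLS library; here verification is disabled so we can inspect expiry ourselves, which you would never do in production):

    import datetime
    import socket
    import ssl
    import time

    from cryptography import x509  # third-party: pip install cryptography

    def connect_with_degradation(host: str, port: int = 443) -> ssl.SSLSocket:
        # Disable automatic validation so an expired cert doesn't hard-fail;
        # sketch only -- a real client would still verify chain and hostname.
        ctx = ssl.create_default_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE

        sock = ctx.wrap_socket(socket.create_connection((host, port)),
                               server_hostname=host)
        der = sock.getpeercert(binary_form=True)
        not_after = x509.load_der_x509_certificate(der).not_valid_after

        days_expired = (datetime.datetime.utcnow() - not_after).days
        if days_expired > 0:
            time.sleep(days_expired)  # one extra second per day past expiry
        return sock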
Plus you'd need to be way in the guts of the TLS implementation to achieve this; if you're already there, start generating noise a week ahead of the expiration instead.
Or better, none of the above and automate.
We wanted a database server to fail hard. Running slowly just caused cascading failures.
Of course, in this case you're effectively talking about the entire cluster crashing hard, but that's still easier to cope with than every system responding at a snail's pace.
The goal of a business is not to have perfect engineering practices. It is to fulfill customer requests. When there is an outage in the middle of the night, I'd argue that a degraded system buys time to address the issue.
Regardless of the mechanism, having a sudden, complete breakage is not ideal for a business.
If you plan to implement something like this, then do it right and have the service catch the exception and notify an administrator.
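A minimal sketch of that (notify_admin is a stand-in for a real paging integration):

    import socket
    import ssl

    def notify_admin(message: str) -> None:
        print(f"PAGE: {message}")  # stand-in: wire up email/PagerDuty/etc.

    def connect(host: str, port: int = 443) -> ssl.SSLSocket:
        ctx = ssl.create_default_context()
        try:
            return ctx.wrap_socket(socket.create_connection((host, port)),
                                   server_hostname=host)
        except ssl.SSLCertVerificationError as exc:
            notify_admin(f"certificate problem for {host}: {exc}")
            raise  # or fall back to cached/degraded behavior instead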
For this, our customers need an EV certificate. Most of our customers are small and don't have their own IT. It's a mess: most don't understand what it is, don't understand the difference between the two or three certificate files they get, and a lot can't even figure out how to extract the files (inside a password-protected PDF, of all things). Password? What password? ...
And then of course the certificates expire. Just like that. Poof. And the person who ordered them last time has moved on to a new job, and so we're back to square one.
We spend so... much... time... on hand-holding this for our customers. It didn't take us long to figure out we needed to remind them about certificate expiry, but the rest is just such a PITA.
Technically it's a pretty nice solution, but boy it is not made for normal people.
> EV certificate
> or three certificate files
So an EV certificate for machine-to-machine communication, where self-managed PKI would be better since a single CA could “know the customer”, with the private key possibly sent in a password-protected PDF?
Did I misread that? Technically it sounds terrible.
There's one other CA they (the gov't auth provider) accept for this, though they claim others will follow.
I can imagine exceptions, such as when code requires a publicly-signed cert, but I suspect I'm missing something obvious here.
I always bristle at this use of ‘learnings,’ especially in cases where ‘lessons’ would suffice. However, it turns out this usage goes back to Middle English and is also in Shakespeare’s Cymbeline:
Puts to him all the learnings that his time
Could make him the receiver of, which he took
As we do air, fast as ’twas ministered,
Looking at what Epic is doing, I would encrypt customer data and everything that involves money. IMO only communication between data centers, with external payment providers, and with users must be encrypted and require valid certificates.
I mean, why not?
(Granted, my certs actually failed earlier this week since my automation had broken.)
You do need a domain though.