So I've been mulling this stupid thought for a while (and, disclaimer, it's extremely useful for these outage stories to make it to the front page to help out everyone who is getting paged with P1s).
But, does it really matter?
I read people reacting strongly to these outages, suggesting that due diligence wasn't done before using a 3rd party for this or that, or that a system engineered to reach anything less than 100% uptime is professional negligence.
However, off the top of my head we've had AWS outages, Gmail outages, Azure outages, DNS outages, GitHub outages, whatever else. All these hugely profitable companies are messing this stuff up constantly. Why are any of us going to do any better and why does a few hours of downtime ultimately matter?
I think it's partly living somewhere where a volcano on the next island over can shut down connections to the outside world for almost a week. Life doesn't have an SLA; systems should aim for reasonable uptime, but at the end of the day the systems come back online at some point and we all move on. Just catch up on emails or something. I dislike the culture of demanding hyper-perfection, and the idea that we should be prepared to work unhealthy shift patterns to avoid a moment of downtime in UTC-11 or something.
My view, increasingly, is that these outages are healthy since they force us to confront the fallibility of the systems we build and accept that chaos wins out in the end, even if just for a few hours.
Yes and no, some things are actually time sensitive.
For example, I'm building a note-taking / knowledge base platform, and we were having some reliability issues last year when our platform and devops process was still a bit nascent. We had a user that was (predictably) using our platform to take notes / study for an exam, which was open book. On the day of her exam our servers went down and she was justifiably anxious that things wouldn't be back before it was time for her exam to start. Luckily I was able to stabilize everything before then and her exam went great in the end, but it might not have happened that way.
Of course most on HN would probably point out that this is obviously why your personal notes should always be hosted / backed up locally, but I of course took this as a personal mission to improve our reliability so that our users never had to deal with this again. And since then I'm proud to say we've maintained 99.99% uptime[1]. So yes, there are definitely many situations where we can and should take a more laid back approach, but sometimes there are deadlines outside of your control and having a critical piece of software go offline exactly when you need it can be a terrible experience.
> Of course most on HN would probably point out that this is obviously why your personal notes should always be hosted / backed up locally
And they would be right. Having your notes pushed up to the cloud is great and I use a feature like that all the time (specifically with iCloud and either the Notes app or beorg), but the most recent version of these documents should always be available offline.
Is your application unavailable without a network connection? What if you go somewhere without reception?
Yep, for the moment it is unavailable without a connection. Luckily most people are connected all the time these days, so it hasn't actually been a sticking point for any of our users so far. But yes, we agree that having offline is also super important, so we're building that out as well.
We wanted to build a platform that had collaboration in mind from the beginning though, which is why we opted to go for online-only initially – kicked the tough engineering problem of eventual consistency (when collaborating) down the road a bit so that we could work on features that were actually unique to our system (it's just two of us at the moment).
This is a great line of thought, I'd encourage everyone to take it. There's a huge amount of crap people get up to that is mostly about performative debt balancing - people feel that they're owed something just because <fill in the blank>, when it really didn't matter. Just another gross aspect of a culture overly reliant on litigation for conflict management.
But the question is meaningless without qualifying: for whom?
Because I can absolutely imagine situations where an Auth0 outage could be extremely damaging, expensive, or both. Same for a lot of other services.
> Life doesn't have an SLA
Nope. Which is a part of the reason why people spend money on them for certain specific things. It is just another form of insurance against risk.
For a lot of stuff I agree, but the problem is that (some of) these platforms advertise themselves as being built so that this should not happen. Less cynical engineers will then build critical solutions that depend on these platforms and assume the risk of downtime can be, and has been, successfully mitigated. Sometimes the tools to manage/communicate/fix the service downtime are even dependent on the service being up.
The lesson is more that everything fails all of the time and the more interconnected and dependent we make things the more they fail. That is not something that can be solved with another SaaS as multiple downtimes, hacks, leaks and shutdowns have shown time and time again.
The point that often these services advertise on basis of resiliency is a fair one and I agree with what I think I'm reading into your conclusion which, if correctly understood, is that by increasing the number of dependencies in our systems we're exposing ourself to a compounding amount of downtime. And I'd assume we'd agree that generally we should architect towards fewer points of failure?
My reaction was more against the performative "haha, foolish n00b developers didn't build their system to use both Lambdas and Google Cloud and then failover to a data center on the North Pole like me, the superior genius that I am" that oftentimes appears in threads about downtime.
We could all do with a bit more "there but for the grace of god" attitude during these incidents while still learning lessons from them.
> And I'd assume we'd agree that generally we should architect towards fewer points of failure?
Yes, and to me that generally means having fewer points in total. We can make stuff pretty resilient, but it's very hard and requires huge resources, so it's usually easier and simpler to just not have as many points at all instead of trying to add "more resilient" points in the form of SaaS.
In this case, a lot of apps are useless if the auth is down, and the auth is useless if the app is down so moving auth to something more resilient (if we assume this was an isolated incident and auth0 is generally good) only adds a point of failure and does not gain anything in terms of uptime. Especially since in more traditional setups the auth is usually hosted on the same server, on the same database and within the same framework as the app itself.
The problem is that the "small guy" is held to a high standard that the "big guy" isn't held to. If AWS shits itself for a day nothing will happen, if your small SaaS goes down for an hour you'll lose customers and people will yell at you.
And more importantly, if YOU try to use something "not big" and it goes down, it's on YOU - but if you're using Azure and it goes down, it's "what happens".
I think you're underestimating the scope of the impact and just how vital software is in the modern world. It's not just that people can't login to a system, it's that they simply can't get their work done, and some of that work is really very time sensitive and important. Auth0 is depended on by hundreds of thousands of companies. Tens of millions of people will have been impacted by this outage today.
I think it's actually because I'm beginning to realise how much I used to believe in the importance of software, and how maybe I no longer do.
For context, I used to live in the UK, which is probably, outside of South East Asia, one of the most "online" societies (and miles ahead of the US in terms of things like online payment processing). I never carried cash, ordered everything online, etc.
I moved to Barbados towards the end of last year and, let's just say, there's a lot of low-hanging fruit for software systems here. It takes about 4 months to get post from the UK, and you can't really get anything from Amazon. There's a single cash machine that takes my card, and sometimes it's out of money or broken. You can't open a bank account without getting a letter from your bank in the UK, with the aforementioned 4-month delay. Online banking doesn't exist. There was maybe 1 Deliveroo-type service, which was actually a front for credit card scamming, and maybe 1 other food delivery app.
In a sense it has been so much more pleasant than life in the UK, and not just because of the cheap beer and sunshine. If I have a problem I know my neighbours well enough to speak to them. I know the people in the bar, and I know who could help me out if I ran out of money or needed food to tide me over.
This is all a bit 'trope of the noble savage', as if life were better before all that technology or something. I don't believe that's the case; however, I also believe that over-reliance on, and belief in, always-up systems reduces societal resilience. Certain things have to work: you have to be able to phone the ambulance and have it come (or alternatively know someone who could drive you to the hospital in a pinch), and food has to get shipped in at some point, since a diet of cane sugar alone won't be sufficient. For that supply chain, technology etc. is important. But there are many other types of software regarded as "vital" that I don't think are, and the criteria for what counts as vital are actually a lot stricter than it can feel. There's a lot more room for delay than we'd maybe feel when caught up in the tech bubble.
I appreciate this view, but I'm in academia, and with COVID-19 we are teaching remotely, doing exams remotely, etc. If the systems are down, that can have a real disruptive effect: students not being able to submit homework/exams, us not being able to deliver lectures. And that potentially applies to the whole university (thousands of people).
To extend the OP's line of thinking: does it really matter? Exams can be rescheduled, extenuating circumstances taken into account. As someone who has fallen quite suddenly ill during examination periods due to chronic illness, I never appreciated the dogmatic approach taken when administering tests. I'm a human being, things happen, systems go down...
What I meant is that when whole systems go down (i.e. Canvas, Blackboard, Office 365 or similar), as opposed to the internet for one person, the amount of stress and extra work inflicted on thousands of people can (I think) be quite large. Sure, nobody died, it's nothing like that; it's just that people get upset about it because it is something outside their control and affects many people.
> Exams can be rescheduled, extenuating circumstances taken into account. As someone who has fallen quite suddenly ill during examination periods due to chronic illness, I never appreciated the dogmatic approach taken when administering tests. I'm a human being, things happen, systems go down...
The problem is that any "leeway" will be taken up by cheaters. And the cheaters far outnumber the people like you who genuinely need some slack.
When I taught, I tried not to be dogmatic. But people have to understand that when a prof gives leeway, he's putting his ass on the line ... he doesn't have authority to do that and he could get burned if someone gets riled up about it.
So, if your prof cuts you slack that you needed, keep it to yourself and STFU.
Quote from student: "Thanks for having the best class."
Response from me: "Best?! You're getting clobbered in my class."
Quote from student: "Yeah, I'm not doing that well, but the bullshitters who always manage to butter up the Professor and skate through are actually failing for the first time ever. Everybody knows where they stand in your class. And, they know that if they put in the work they get the grade and if they don't, well, they get hammered."
Response from me: "Thanks, I guess?"
I considered it a compliment only because my father, who taught high school for almost 4 decades, said: "You're teaching a class. The students have to think you know your material, and they have to think you are fair. Nothing more. If they like you and/or respect you, so be it ... but those are non-goals. Your goal is to teach them the material, not be their friend."
I understand your point. But, and forgive the vagueness and wooliness of my thoughts around this subject, does this not highlight the "software has made everything shit"-ness of academia? Wouldn't a little less software or a little more downtime be good here?
Instead of being able to make a judgement call or respond appropriately to changing circumstances; instead of being relied upon for your ability to judge the needs of your students accurately, you risk being flagged up for not sticking to protocol in matters of ~student~ consumer interaction.
If a cheater slips through, does it matter, that much, if the cheating is getting an extra few days of time to complete an assignment?
Aren't universities meant to be about expanding knowledge, places of learning? Aren't we making a mockery of the whole idea of tertiary education by getting so caught up in catching 'consumers' gaming the system, and in the risk of debasing 'consumer currency points' (exam scores), in order to justify the busywork of admin departments? Software, and software-enabled culture, is incredibly powerful, but it also removes human factors and discretion and has made many things worse.
Not regarding this specific incident, but to reply to this:
> However, off the top of my head we've had AWS outages, Gmail outages, Azure outages, DNS outages, GitHub outages, whatever else. All these hugely profitable companies are messing this stuff up constantly. Why are any of us going to do any better and why does a few hours of downtime ultimately matter?
I've been mulling this for a while too, and I think I might have some responses that address your thought somewhat:
- Amazon/Google/Microsoft/etc. services have huge blast radii. If you build your own system independently, then of course you probably wouldn't achieve as high an SLA, but from the standpoint of users, they (usually) still have alternative/independent services they can use in the meantime. That decoupling can drastically reduce the negative impact on users, even if the individual uptimes are far worse than the global one.
- Sometimes it turns out problems were preventable, and only occurred because someone deliberately decided to bypass some procedures. These are always irritating regardless of the fact that nobody can reach 100% uptime. And I think sometimes people get annoyed because they feel there's a non-negligible chance this was the cause, rather than (say) a volcano.
- People really hate it when the big guys go down, too.
I think, though I might be way off base, your comment surfaces something that drives a lot of the (to me) over-the-top reaction to these outages. And that's the way outages at AWS/Azure/GCP/Cloudflare reveal how the big 3/4 have actually eaten the "old internet", and how obvious the downtime makes that.
Like this isn't a space for hobbyists or people just doing things in a decentralized manner anymore. The joke from the British TV sitcom 'The IT Crowd', where (the bigwigs are sold the lie that) the internet is a blinking black box in the company offices, is actually true. Something goes wrong with some obscure autoscaling code and, actually, the little black box did break the entire internet.
I'm the kind of person who hates AWS and wants to live in the woods eating squirrels, but I can't really begrudge them downtime.
It doesn't matter though. In the end. What happens if that person doesn't do that very unspecific thing every 5 minutes on their PDA? Can they not complete their job still? Does the parcel not get delivered unless it is logged in the system the second it is delivered? Maybe so, maybe the driver steals it, taking advantage of the chaos of the system. Do they not go higher up the chain? Does the delivery company not have insurance? It can go on endlessly, but in the end, it doesn't even matter.
I happened to work on designing critical infrastructure for emergency services. We always had failure in the plan, which is why part of our deliverable was a protocol for paper logging of the calls (ambulance, police, military...) and the subsequent follow-up of the case. It worked amazingly when the system did go down, in part because it was roleplayed, in part because the system went down at a rather convenient time. The data was then added to the digital logs, and all was well in the world, including for the people saved by, and I kid you not, pen and paper... and other humans, gasp.
Yes, it matters, since we can't do it later (it's only when a 3rd party is down that we end up doing it later).
They can't complete their job and no, it can't be done later since the opportunity to execute it is time-sensitive. It's one of the things we optimize for.
In a country like France, there's a discussion specification for it and it would get a lot of hassle.
We aren't delivering packages....
E.g. one of the reasons it matters is that it would lose us clients' business and be taken into account within 4 years (city tender...).
It's not because "people don't die" that it doesn't matter. A lot of jobs, cities and companies are dependent on what we do.
And jobs matter. So I think your statement is fundamentally flawed.
Out of interest, obviously you can't give too much away, what would happen if the users didn't/couldn't do that? The only situation that comes to mind is delivery drivers needing to get next destinations/mark deliveries completed but I'm maybe missing others.
I'm just hoping the people building the ambulance dispatch networks aren't using Azure :laughing:.
> I'm just hoping the people building the ambulance dispatch networks aren't using Azure :laughing:.
Hi, just happened to see your reply after I posted mine, and wanted to maybe give just a little bit of insight. Now, this might not be the case where you are from, but in my experience, ultimately, if all systems go down, there are protocols put in place for radio communication.
We always built tools taking into account existing protocols, so they can map 1:1 (you can imagine, you can't exclude any mission protocol because the product owner thinks the screen looks better without it) but also allow for the change of protocols. For all these services, it was the military structures that truly had the functionality core, which was mapped to what they could do without any technology in case of an emergency. Which is a damn lot.
So, I feel like I'm going a bit far here, but rest assured, the people building the ambulance dispatch networks probably build them on top of systems that work with the power off. So whether Azure goes down or not, it doesn't really matter.
Even if that were true for a single system in isolation, it breaks apart quickly as the number of services you're dependent on increases. Then that relatively rare 1% downtime starts to compound until, every day, 'something' is broken.
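To put a rough number on it: availability compounds multiplicatively when a request path needs every dependency to be up. A quick back-of-the-envelope sketch (the 99.9% figure and the independence assumption are both illustrative):

    # Back-of-the-envelope: availability of a request path that needs N
    # independent dependencies, each up 99.9% of the time. Illustrative only.
    per_service = 0.999
    hours_per_year = 365 * 24

    for n in (1, 5, 10, 25):
        combined = per_service ** n
        downtime_hours = (1 - combined) * hours_per_year
        print(f"{n:>2} deps: {combined:.4%} available, ~{downtime_hours:.0f} h/year down")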
> Why are any of us going to do any better and why does a few hours of downtime ultimately matter?
The answer is surprisingly simple.
Most outages are the unintended result of someone doing something. When you are doing things yourself, you schedule the “doing something” for times when an outage would matter least.
If you are the kind of place where there is no such time, you mitigate. Backup systems, designing for resiliency, hiring someone else, etc.
I agree with you. Sometimes things break; such is life. What I don't fully understand is when people choose to outsource a critical part of their infrastructure and then complain when it happens to be down for a bit. It was a trade-off that was made.
I think an important consideration here is that a huge amount of time, money, and resources is spent on making sure the computers stay powered and cooled in all manner of situations. We contract redundant diesel delivery for generators, we buy and install gigantic diesel generator systems which are used for just minutes per year, huge automatic grid transfer switches, redundant fiber optic loops, dynamic routing protocols, N+1 this and double-redundant that. It's tremendously expensive in terms of money, human time, and physical/natural resources.
The point is that we are always striving to plan for failures, and engineering them out. When there is a real life actual outage, it means, necessarily, based on the huge amount of time and money and resources invested in planning around disaster/failure resilience, that the plan has a bug or an error.
Somebody had a responsibility (be it planning, engineering, or otherwise) that was not appropriately fulfilled.
Sure, they'll find it, and update their plan, and be able to respond better in the future - but the fundamental idea is that millions (billions?) have been spent in advance to prevent this from happening. That's not nothing.
I can definitely get on-board with this. When AWS or Azure has some outage they pull me into calls and ask me what to do. These vendors are so large it's like asking me for my advice on the weather. Everything is screwed, man. Just hunker down and go read a book or something.
This was a fantastic post, it covers a lot of the things I've been thinking about but in a comprehensible and readable way. I see it has been submitted here before but not gained much traction, do you mind if I submit it again?
I agree with this sentiment. Though there is of course a bit of a problem when you're dealing with people who don't.
I'd also highlight that when the big players go down, people 'know' it's not your fault; when a small 3rd party provider goes down, taking part of your service with it, it's 'because you didn't do due diligence' or were trying to save a buck. Similar in a way to the old adage 'no one got fired for buying IBM'.
> why does a few hours of downtime ultimately matter
I think people know this implicitly, but it's good to think about it explicitly. Whether downtime matters, and how much of it is acceptable, should be a question every system has decided on, because ultimately uptime costs money, and many who are complaining about this outage are likely not paying anywhere near what it would cost to truly deliver 5+ nines or Space Shuttle-level code quality.
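For a sense of scale, each extra nine cuts the allowed downtime by an order of magnitude, which is roughly where the cost explosion comes from. A quick sketch of the arithmetic:

    # Allowed downtime per year for a few availability targets ("nines").
    minutes_per_year = 365 * 24 * 60

    for target in (0.99, 0.999, 0.9999, 0.99999):
        allowed = (1 - target) * minutes_per_year
        print(f"{target:.3%} uptime allows ~{allowed:,.0f} minutes of downtime per year")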
That's a lovely viewpoint to be able to take about one's own priorities, but one that's hard to sell to the person at the entity that is ultimately paying all your bills.
Yes, people should relax a bit, but those incidents you cite did cost those companies customers. That's okay for Amazon. But a small B2B service provider can't as easily absorb the loss.
We build these massively distributed, micro-concerned, mega-scaled systems, and at every step we recognize everything and anything can go wrong at any given moment, mulling over these problems on a daily basis.
And then it /does/ and all of us lose our shit haha.
All the sharding and YAML dark-arts in the world won't save us when the SSL cert renewal fails because the card has expired and the renewal reminder went into someone's spam.
This is a really interesting point that I hadn't considered before.
It's similar to ubiquitous next day delivery conditioning people to find anything longer unacceptable, when cheap next day is quite new and not even the norm yet.
No real point to this post: I get horribly anxious when my food delivery takes just a little bit longer than the estimated time, which is already in the 40-minute range, so pretty low. Then after I eat, I think about how spoiled I am by society, and how crazy it is that, from the moment the impulse leaves my brain, it takes less than an hour for me to get whatever food I want...
Ah, a comment where I can put on my SRE (Site Reliability Engineering) hat :)
You're completely right that 100% availability is unreasonable and, oftentimes, not even required, despite what a customer or site operator may believe.
Just a quick aside: availability (can an end user reach your thing) is often confused with uptime (is your thing up). If I operate a load balancer that your service sits behind and my load balancer dies, your service is up, but not available for those on the other side of said load balancer.
With that in mind, Hacker News could be theoretically up 100% of the time but if I go through a tunnel while scrolling Hacker News on my mobile phone, from my perspective, it is no longer 100% available, it is 100% - (period I was without signal) available, from my personal perspective as a user.
The point here is that a whole host of unreliable things happen in everyday life, from your router playing up to sharks biting the undersea cables.
With that in mind, you then want to go and figure out a reasonable level of service to provide to your end users (ask for their input!) that reflects reality.
It's worth noting too that Google (I don't love 'em, but they pioneered the field) will actually intentionally disrupt services if they're "too available", so as to keep those downstream on their toes. It's not actually good for anyone if you have 100% availability, in that downstream consumers start making too many assumptions; also, it's just good practice, I suppose.
In short, an SLO is just an SLA without the legal part: a guarantee of a certain level of service, often made internally from one team to another.
Ideally these objectives reflect the level of service your customers (internal or external) expect from your service.
> Chubby [Bur06] is Google’s lock service for loosely coupled distributed systems. In the global case, we distribute Chubby instances such that each replica is in a different geographical region.
> Over time, we found that the failures of the global instance of Chubby consistently generated service outages, many of which were visible to end users. As it turns out, true global Chubby outages are so infrequent that service owners began to add dependencies to Chubby assuming that it would never go down. Its high reliability provided a false sense of security because the services could not function appropriately when Chubby was unavailable, however rarely that occurred.
> The solution to this Chubby scenario is interesting: SRE makes sure that global Chubby meets, but does not significantly exceed, its service level objective. In any given quarter, if a true failure has not dropped availability below the target, a controlled outage will be synthesized by intentionally taking down the system.
> In this way, we are able to flush out unreasonable dependencies on Chubby shortly after they are added. Doing so forces service owners to reckon with the reality of distributed systems sooner rather than later.
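In practice, teams often turn an SLO into an error budget: the amount of unavailability you are allowed to spend per window before you stop shipping risky changes (or, as in the Chubby case, deliberately spend what's left). A minimal sketch with made-up numbers:

    # Minimal error-budget sketch: a 99.9% SLO over a 30-day window, with some
    # observed downtime subtracted. All numbers are made up for illustration.
    slo = 0.999
    window_minutes = 30 * 24 * 60            # 30-day rolling window
    observed_downtime_minutes = 12           # e.g. taken from your monitoring

    budget_minutes = (1 - slo) * window_minutes      # ~43.2 minutes
    remaining = budget_minutes - observed_downtime_minutes

    print(f"budget: {budget_minutes:.1f} min, remaining: {remaining:.1f} min")
    if remaining < 0:
        print("Budget exhausted: freeze risky launches, spend time on reliability.")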
Last time was due to several factors, but initially because of silently losing some indexes during a migration. I'm very curious what happened this time -- we'll definitely do a followup episode if they publish a postmortem.
Really interesting project, I couldn't find your podcast using pocketcasts.com, so I added it through their form here: https://www.pocketcasts.com/submit/
They mention a few errors in your feed:
Problem 1: Your podcast doesn't seem to have an author
Solution: Get some credit for your work by adding the following tag to your feed: <itunes:author>Author’s Name Goes Here</itunes:author>
Problem 2: Your podcast doesn't seem to have a description
Solution: Add a podcast description using one of the following tags:
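The usual candidates for the description are the standard <description> element or <itunes:summary>. For what it's worth, you can sanity-check the feed yourself before resubmitting; a minimal sketch using only the Python standard library (the feed URL is a placeholder, and which tags Pocket Casts actually checks is my assumption):

    # Minimal sketch: check a podcast RSS feed for an author and a description.
    # FEED_URL is a placeholder; the tags checked here are the common ones,
    # not necessarily the exact ones Pocket Casts validates.
    import urllib.request
    import xml.etree.ElementTree as ET

    ITUNES = "http://www.itunes.com/dtds/podcast-1.0.dtd"  # iTunes podcast namespace
    FEED_URL = "https://example.com/feed.xml"

    with urllib.request.urlopen(FEED_URL) as resp:
        channel = ET.parse(resp).getroot().find("channel")

    author = channel.find(f"{{{ITUNES}}}author")
    description = channel.find("description")
    if description is None:
        description = channel.find(f"{{{ITUNES}}}summary")

    print("author:     ", "ok" if author is not None and author.text else "MISSING")
    print("description:", "ok" if description is not None and description.text else "MISSING")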
Ooh, this is a neat podcast niche that I'll probably enjoy! If you're taking suggestions of public postmortems to talk about, I recommend GitHub's 2018 outage [1,2] caused by a network partition.
I mean, it has the same benefit as other SaaS: you get to avoid building something and can spend that dev time on building something that solves a unique problem, AND you have the benefit of knowing that you get to focus 100% on your app or site's problems and features, and that you have the entirety of Auth0 focusing on keeping your authentication working. I can promise Auth0 is better at building scalable, secure, and resilient authentication solutions than most dev teams, and I've been on a team that built out an enterprise-grade IDAM solution handling 1000s of logins/hour and 100k requests/hour.
If it's data security or something else that's your concern, you can host the data in your own database with their enterprise package.
General disclaimer: I'm a paying Auth0 customer but just use it for authentication, and it saved me a hundred hours of work for a pretty reasonable price.
I've never really worked with a language that didn't have myriad options for open source, configurable, plug and play authentication. I can't imagine spending 100 hours doing authentication.
I guess it depends on your use case; I do not really find it reasonably priced but then again, I need neither the scalability nor all the features it offers. Gotrue or supertokens are fine for what we do.
Auth is both simple and hard to get right. It's virtually the same everywhere. One group of people getting it right is better than every company trying to figure it out for themselves. It's exactly the right thing to farm out to a 3rd party.
Only on HN will you be told "you're an idiot if you outsource your auth" and "you're an idiot if you roll your own auth" by the same group of people.
Most people are using something like laravel or rails that has auth scaffolding built in for you, so both outsourcing and rolling your own are obscure dirt road paths that are rarely suggested.
There's a difference though between farming out to a SaaS product like Auth0 and rolling your own. You should absolutely not try and write your own Oauth2 server unless you really, really know what you're doing. But there are a lot of options for self-hosted auth services that are rock solid and battle tested.
Depends on what you mean by maintain. If you use one of the well-supported open source solutions like Keycloak then it is very actively maintained with regular releases, bug fixes, new features (U2F support etc). But of course you need to run your own infrastructure (database, application servers, load balancer, maybe separate infinispan cluster if you want to go wild). If you don't have the operational capacity to do that then maybe a SaaS solution is right for you.
It depends on your needs. What if you provide a SSO solution in your product, your customer is using Okta (or any other IdP) and that IdP goes down? There's nothing really you can do then unless you have other means of authentication.
To me it's the exact opposite - it seems like a prime candidate to be a third-party service.
It’s something easy to get wrong, and has a long tail of work which is extremely generic (supporting all the different social logins, two factor authentication, password reset emails, email verification, sms phone number verification, rate limiting, etc...)
Really? Because of all the things I don’t want [myself or my colleagues] to write, a secure authentication management system that connects with multiple providers is up there.
Really the only case where it makes sense to farm something like this out is to Google (if Google and the US military aren't in your threat model) because Google's G Suite login system (which can be used as an IdP) is, as far as I can tell, the exact same one they use for @google.com.
Incentives are perfectly aligned there, and if anyone can keep a system running and secure (to everyone except the US military which can compel them), it's them.
Out of curiosity, could you take a look at an alternative to Auth0 like acmelogin? I worked on the design of its dashboard. Are there any features missing from it that we should add?
Auth0's pricing has always seemed really strange - 7000 active users for free but only 1000 on the lowest paid tier ($23/month). This means if you don't care about the extra features, once you exceed 7k you need to jump up to the $228/month plan.
My first Auth0 experience was a couple weeks ago when I had a quick crack at testing it out to see if it would be a suitable candidate to migrate a bunch of WordPress sites (currently all with their own separate, individual user accounts) onto.
I didn't spend a lot of time on it but initially figured it would be easy because they had what seemed to be a well-written and comprehensive blog post[1] on the topic, as well as a native plugin.
But I found a few small discrepancies with the blog post and the current state of the plugin (perhaps not too surprising; the blog post is 2 years old now and no doubt the plugin has gone through several updates).
I found the Auth0 control panel overwhelming at a glance and didn't want to spend the time to figure it all out - basically laziness won here, but I feel like they missed an opportunity to get a customer by not making this much lower effort.
I moved on to something else (had much better luck with OneLogin out of the box!), but then got six separate emails over the next couple weeks from a sales rep asking if I had any questions.
I'm sure it's a neat piece of kit in the right hands or with a little more elbow grease but I was a bit disappointed with how much effort it was to get up and running for [what I thought was] a pretty basic use case.
Is it worthwhile to do authentication via SaaS instead of a local library?
For the password use case, it seems nice that you don't have to store client secrets (e.g. hashed, salted passwords) on your own infra. However, now instead of authentication happening between your own servers and the user's browser, there is an additional hop to the SaaS, and you need to learn about JWT etc. At my previous company, moving a Django monolith to do authentication via Auth0 was a multi-month project and a multi-thousand-line increase in code/complexity. And we weren't storing passwords to begin with, because we were using one-time login email links.
Maybe SaaS platforms are worth it for social login? I haven't tried that, but I am not convinced that Auth0 or someone else can help me connect with Facebook/Twitter/Google better than a library can.
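To make the "now you need to learn about JWT" part concrete: verification on your backend usually boils down to fetching the provider's public keys and validating the token's signature and claims. A minimal sketch using PyJWT (the issuer domain and audience are placeholders, not a real tenant):

    # Minimal sketch: verifying an RS256 JWT issued by a hosted auth provider.
    # ISSUER and AUDIENCE are placeholders; the JWKS path is the conventional one.
    import jwt                      # PyJWT
    from jwt import PyJWKClient

    ISSUER = "https://example-tenant.example-auth.com/"
    AUDIENCE = "https://api.example.com"

    jwks_client = PyJWKClient(ISSUER + ".well-known/jwks.json")

    def verify(token: str) -> dict:
        # Look up the public signing key matching the token's "kid" header.
        signing_key = jwks_client.get_signing_key_from_jwt(token)
        # Validates signature, expiry, issuer and audience in one call.
        return jwt.decode(
            token,
            signing_key.key,
            algorithms=["RS256"],
            audience=AUDIENCE,
            issuer=ISSUER,
        )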
100% - for OnlineOrNot (https://onlineornot.com) I only use passwordless auth (enter your email, get a magic link emailed) and Google via OAuth for this reason.
Screw losing sleep over whether you're storing credentials correctly.
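For anyone curious what that looks like under the hood: the core of a magic-link flow is just a signed, expiring token. This isn't necessarily how OnlineOrNot does it, just a minimal sketch using itsdangerous with arbitrary names and expiry:

    # Minimal sketch of magic-link login: sign the email address into an
    # expiring token, email a link containing it, and verify it on click.
    from itsdangerous import URLSafeTimedSerializer, SignatureExpired, BadSignature

    serializer = URLSafeTimedSerializer("replace-with-a-real-secret-key")

    def make_login_link(email: str) -> str:
        token = serializer.dumps(email)
        return f"https://example.com/auth/magic?token={token}"  # emailed to the user

    def verify_login_token(token: str, max_age_seconds: int = 900):
        try:
            # Returns the original email if the signature is valid and not expired.
            return serializer.loads(token, max_age=max_age_seconds)
        except (SignatureExpired, BadSignature):
            return None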
What happens when the emails fail (like spam folder)?
I remember a thread here on HN about a number of projects that dumped email link sending as a login method for various reasons and complications. Have you faced any challenges as well? If not, what's your secret sauce? A better email provider? Would love to know.
Use a properly maintained library to salt and hash your passwords and the credentials will be the absolute least of your worries if your database is breached.
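Concretely, with a maintained library the whole thing is a couple of lines. A minimal sketch with bcrypt (any equivalent such as argon2 works the same way):

    # Minimal sketch: bcrypt generates and embeds the salt for you, and the
    # work factor makes brute-forcing leaked hashes expensive.
    import bcrypt

    def hash_password(password: str) -> bytes:
        return bcrypt.hashpw(password.encode("utf-8"), bcrypt.gensalt())

    def check_password(password: str, stored_hash: bytes) -> bool:
        return bcrypt.checkpw(password.encode("utf-8"), stored_hash)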
Generally it’s not the auth itself that is the problem but RBAC, multi-factor auth, integrations, etc.
We’ve looked at Auth0 and Okta because we wanted to see if we could save some dev time devising RBAC and supporting a lot of different auth integrations. We ended up doing it in-house since the quote was unacceptable (essentially a mid-level dev salary per year).
Out of interest, what are peoples experience like with self hosted identity management options? I've been evaluating keycloak recently, and it seems pretty good.
Hey! Correct me if I'm wrong, but it seems using Azure's (or any third party's) client credentials flow is the better (or, say, easier) option, as it can be used for managing multiple microservices.
However, I came across this specific need of implementing both the authorization server and the resource server in the same application. For that I was planning to implement the authorization server using Spring, but I came to know that Spring has stopped active development of its OAuth project, so I'm planning to use Keycloak for my application instead. I'm also planning to store the client id & client secret in a MySQL database.
In the authorization server I have to generate an access token, send it back to the client, and verify it when an API call is made with that token.
If you don't mind, do you have any links or specific resources for the development you did? I would love to see your project as well. Thanks.
I looked at Azure a while back; funny thing is, like this incident, Azure had an outage. I found Keycloak pretty simple: you run it, you get a web front end, configure the bits, and connect your app. I don't really have any resources at the moment, but I am going to do a GitHub repo of example projects for connecting it to .NET stuff.
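For reference, the client credentials flow mentioned upthread is just a form POST to the realm's token endpoint in Keycloak. A minimal sketch (base URL, realm and credentials are placeholders; older Keycloak versions prefix the path with /auth):

    # Minimal sketch: OAuth2 client_credentials grant against a Keycloak realm.
    # BASE, REALM and the credentials are placeholders.
    import requests

    BASE = "https://keycloak.example.com"
    REALM = "myrealm"
    TOKEN_URL = f"{BASE}/realms/{REALM}/protocol/openid-connect/token"

    resp = requests.post(
        TOKEN_URL,
        data={
            "grant_type": "client_credentials",
            "client_id": "my-service",
            "client_secret": "my-secret",   # e.g. loaded from the database mentioned above
        },
        timeout=10,
    )
    resp.raise_for_status()
    access_token = resp.json()["access_token"]

    # The resource server then verifies this token (signature, expiry, audience)
    # on each API call, typically against the realm's JWKS endpoint.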
Keycloak is pretty good in the average case, but when you get to esoteric use-cases like multi-thousand group/role setups it breaks down, performance-wise. Stuff like that isn’t common practice though.
The Auth0 team is probably distracted by their Okta onboarding. When I was onboarding at Okta after they bought the startup I was working at, I had to support both systems to bring myself up to speed fast -- and that caused some outages from double on call.