Understand the services you depend on. Track the number of requests you're making to them, how long they're taking, and how many are failing. Reason through your system and look at the data when you have issues, rather than grasping at straws.
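As a rough illustration (none of this is from the article; trackedFetch and the stats shape are invented names), a tiny wrapper like this is enough to give you call count, failure count, and total latency per dependency:

    // Minimal sketch: wrap outbound calls so you can see count, latency, and
    // failure rate per dependency before guessing at causes.
    type DepStats = { calls: number; failures: number; totalMs: number };
    const stats = new Map<string, DepStats>();

    async function trackedFetch(dep: string, url: string, init?: RequestInit): Promise<Response> {
      const s = stats.get(dep) ?? { calls: 0, failures: 0, totalMs: 0 };
      stats.set(dep, s);
      const start = Date.now();
      try {
        const res = await fetch(url, init);
        if (!res.ok) s.failures++;
        return res;
      } catch (err) {
        s.failures++;
        throw err;
      } finally {
        s.calls++;
        s.totalMs += Date.now() - start;
      }
    }

    // e.g. dump the numbers once a minute: console.table([...stats.entries()])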
Obviously they were under a lot of pressure and it's easy to stand here and criticise, but...
...if my site is slowing down with load or usage, I'm not sure how you make the jump to "I should update my UI libraries!". Angular 4 isn't getting any slower, so best case is you've got some unknown performance bottleneck in your UI that is somehow causing 30s page load times, and which just happens to be fixed in Angular 6, and you don't accidentally add any new issues when you upgrade.
Conversely, it feels like if you're struggling with "slow load times" on a SPA, the first thing you'd do is open the network tab and see what requests are being made, to what, how often, and how long they're taking.
Grasping at straws does seem to be the right metaphor. (Or maybe the old chestnut about the drunk dropping his car keys in a dark parking lot, then looking for them under a streetlight, since it's too dark to find them in the parking lot?)
I'm happy for the team and it sounds like things are going great for them, but wow, that was an almost fatal bit of blindness. On the plus side, I bet everyone involved will check for inefficient database calls first next time. :)
It's entirely possible that they could have spent hours or days debugging their issue only to find it had already been fixed.
We have some software that was returning different results from different environments, and we couldn't figure out the problem. There was a lot of panic in the room, from upgrading and downgrading Maven dependencies, building things inside and outside of Jenkins, and all sorts of random things.
We kept telling the project leadership that we were poking at the wrong part (intuitively), but they kept pushing. I've had to explain how Maven works, how building on Jenkins doesn't differ from building from our IDEs, etc.
It was only when we asked for isolation from the (human) elements that we had the freedom to properly debug.
In the end, an unstable sort was the cause of the issue. We were taking the last element from an array, but not sorting the array first.
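For anyone curious, the shape of that bug looks roughly like this (a hedged reconstruction; the real code isn't in this comment):

    // Hypothetical reconstruction: "take the last element" only means "take the
    // latest/largest" if the array has a deterministic order, which it didn't,
    // so different environments produced different results.
    const valuesBuggy = [3, 1, 2];
    const lastBuggy = valuesBuggy[valuesBuggy.length - 1]; // 2 - depends on arrival order

    // Fix: sort first, then take the last element.
    const valuesFixed = [...valuesBuggy].sort((a, b) => a - b);
    const lastFixed = valuesFixed[valuesFixed.length - 1]; // 3 - deterministic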
All of the stuff we did from last Thursday to Tuesday evening didn't help us.
So, I agree, you need good humans who are good at responding well when things break.
That wouldn't make any sense at all.
Every developer would have their own guesses that they would need to explore and validate, sometimes there's some grouping around where the focus is, but there's usually one guy that's exploring a totally different area to find that bug.
Modern stacks have a huge opacity problem: everyone wants to be magic, and everyone fails. Abstractions make reasoning harder. What tools and techniques would you suggest for doing this?
I'd probably run the application in some sort of sandbox and measure the outbound request load vs inbound request load, something a containerized deployment should be giving the end user (developer) as an affordance for application maintenance and visibility. Differential analysis and graphing built directly into the execution substrate.
edit: I would have also assumed they would get this for free from GCP's billing breakdown, but I'm not familiar with it. My first intuition when facing unexpected billing would be to figure out what the major contributor to the bill is (in this case, massive reads from FireStore), not update my frontend packages.
Google Cloud has Trace built in, which could have shown them execution times, and it is dead simple to drop into most frameworks.
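For a Node backend it's roughly a one-liner with the @google-cloud/trace-agent package (a sketch from memory of the docs; verify the current setup and the samplingRate option before relying on it):

    // The agent has to be started before anything else is loaded, so this goes
    // at the very top of the entry file.
    const traceAgent = require('@google-cloud/trace-agent').start({
      samplingRate: 5, // assumption: sample up to ~5 traces per second
    });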
The real story here is that they didn’t have engineering leadership on the team who knew how to properly diagnose issues, put tooling in place before launch, and understand how their system is architected.
Kudos to the engineers for solving this issue under pressure.
^ Best response, hands down.
They didn’t upgrade packages to solve the mystery billing. They upgraded packages before they checked what was going on with the database. When they saw the high billing, it pointed them to the problem, and they fixed it.
There was some questionable judgement shown by not checking db requests first, sure, but in no way did someone think “our google billing is high, we better upgrade angular”.
Also, lots of people are impatient and/or intellectually lazy. We have piled up a ton of abstraction layers, yes, but they aren't hard to pry apart. But people want immediate results without doing necessary cognitive work - understanding-guided exploration.
It usually isn't hard to identify which component of your product is misbehaving. Before getting into the complex magic of containers and sandboxes and such, I'd start with the easiest things: looking at the Network tab, at your server's resource use, reading the logs, adding some log statements measuring times in suspected areas, and actually profiling the backend code (e.g. with a statistical profiler). This should quickly help you identify where the problem is manifesting itself. Then the search for the cause begins.
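The "log statements measuring times" part is about as low-tech as it sounds; something like this (names are placeholders) already tells you whether the 30 seconds is in the backend call or in rendering:

    // Placeholder stand-ins for the calls you suspect are slow.
    const fetchPayments = async (userId: string) => Array.from({ length: 10 }, (_, i) => i);
    const render = (items: number[]) => console.log(`rendered ${items.length} items`);

    async function loadDashboard(userId: string) {
      console.time('fetchPayments');
      const payments = await fetchPayments(userId);
      console.timeEnd('fetchPayments'); // prints e.g. "fetchPayments: 28734.12ms"

      console.time('render');
      render(payments);
      console.timeEnd('render');
    }

    loadDashboard('user-1');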
Sorry, the guy that wrote/implemented abstraction layers 2 and 3 left 2 years ago and didn't document anything.
We've been understaffed for a year and we've been told not to hire any more staff until the new financial year.
I've got a "technical debt" item on the backlog but business drives the priorities and it'll never get done.
Only matters when the fault is located or manifests itself precisely in that layer, and that's always a risk. Consider all the third-party dependencies you use. They usually have only their APIs documented. Fixing a fault usually requires knowledge of the internal implementation.
> We've been understaffed for a year and we've been told not to hire any more staff until the new financial year.
> I've got a "technical debt" item on the backlog but business drives the priorities and it'll never get done.
Yeah, I get that. I've seen that. Thing is, you can only play lottery so long with your main product - and also, if your workplace runs an assembly line so tight that you can't spend hours thinking per ticket (excluding the most trivial ones), then something is seriously broken on yet another level.
Ultimately, I guess what I'm saying is that the main problem here is cultural - possibly both on developer and management side. The actual technical tasks aren't usually that challenging.
To debug something complex, I would use Chrome DevTools, which can measure all kinds of metrics, and the "Audits > Lighthouse" function automates the process and ranks a webapp in several key categories.
This case would appear to be related to network requests, so that issue should be fairly obvious in Lighthouse.
This is a good thought-process even when debugging issues during development. I've seen many developers attempt to "fix" issues by trying to figure out what dance/keystroke makes things work.
Whenever you encounter an issue of any kind, anywhere, understand the issue before attempting to resolve it. It may require you to dig deep into things you don't currently understand, but your career is currently telling you that you need to understand it.
Yeah. That jumped out at me as well. They spent an inordinate amount of effort to solve a non-problem. It's great to stay on evergreen with versions, but probably not a good thing to do so while you're desperately trying to debug a problem.
I suspect this was a hopeful but lazy attempt - in the spirit of "Maybe if we just do this, it will somehow fix the underlying problem". It's a lazy approach to solving problems. Debugging performance bottlenecks is hard and devs generally hate doing it. Upgrading version dependencies is a known factor and developers are comfortable with that.
Sure, there are times when it’s going to work out for you but you should at least have narrowed down your issues before you go down that path.
I can just spin up PHP 5, 7.0, or 7.1 without delay in case anything goes wrong.
Usually you want to understand the problem before solving it. In this case, they wasted a bunch of time doing a bunch of things (upgrading all the dependencies, and refactoring the app) in the hope that something (ANYTHING) they're doing hopefully fixes a problem they don't understand. Smart move?
That's one perspective. But come on! I really don't understand the attitude that, when presented with a problem, the first approach is to spend a few days blindly refactoring code and upgrading all the underlying frameworks. Seriously?
The problem is also obvious if you just stop and think about it for a second:
- They are using Firebase. For the purposes of diagnosing our issue we can assume the backend will scale well (for trivial queries) and the pipe between server and client should be wide. Firebase could be the problem, but odds are Firebase didn't go down on you just as your go-live went ahead.
- Because they are using Firebase, their app is completely client-side.
You can go through the potential areas of concern:
1) UI has trouble rendering. That should largely be independent of the number of users. If this was only a UI issue you'd expect some users to have problems (maybe ones that created a large amount of artifacts) but not all users. Presumably before going live, the app worked well with their test datasets.
2) Some combination of UI or Network or Data model. They noticed their web-app got slower as the number of users grew. So the question is: why would an individual user session slow down as the total number of users grows? It must be that a single-user view is somehow dependent on the total number of users in the system. WHY?!? We know Firebase is fast, but any fast system can choke if you have a bad data model. So it could be a slow query. Or it could be too large of a response being sent down (again, why would a large response be sent down?). Maybe it was a huge json object and the UI locked up. Or something like this.
It really shouldn't have taken long to at least target potential areas to explore. HELL, you should be able to see the issue immediately if you open up the network tab. You'll see which requests are taking forever, or leading to large amounts of data being transferred, or both.
It really isn't about 'armchair developers'. I've been in situations where things are falling apart and you need to figure shit out. Our product is on-prem and used in hospitals and is connected to multitudes of other systems controlled by other vendors. When you're trying to diagnose issues, you have to have a rational approach based on some reasonable hypothesis.
I've literally had people try to make conclusions on comparisons of test runs with completely different parameters, different data sets, different resources, different versions of code, absolutely everything varying.
My head just explodes... I want to scream that this isn't how this works, it's not how any of this works.
However we go about it, the first priority is to give ourselves some space to properly analyse the issue and find the real solution without the rest of the business worrying loudly about things being broken.
The tooling to make it easy isn't there yet.
A testament to how low the barrier of entry has gotten. It's both a good and bad thing at the same time.
It does, however, mean having to be ever more vigilant about at least your first layer of dependencies in that ecosystem, if you do want to be professional.
The higher the barrier of entry is to a language, the more likely it is that when you're pulling in dependencies, the code isn't amateurish.
Incidentally this is probably why JS juniors think to upgrade dependencies when they encounter unknown situations... a lot of problems in JS do come from your dependencies.
And the result is unsurprisingly poorer quality software. So why is it a good thing?
I don't think the problem of quality should be addressed by arbitrarily axing people from the field. Some sort of a standardization / accreditation seems like a better approach.
And axing is not arbitrary, it's generally done based on experience and know-how. Not everyone can or should become a software engineer.
If you do go and track down the problem in your dependency and file a bug, one of two things is likely to happen: they close it and say it's fixed in the latest version, or they refuse to accept your bug because it's filed against an old version.
Skipping the track it down part and just jumping into upgrading can be a time saver. It works fairly well if you fit into the 'common' part of the user base with frequent updates. (Incidentally dependencies with frequent updates are kind of a pain)
This is the part the parent's cow-orkers didn't perform. There's nothing wrong in updating a dependency to include the fix for the problem you're experiencing. But the people in question were apparently too lazy/clueless to even track down the problem, opting for randomly upgrading stuff instead.
There's a difference between upgrading your dependencies because you traced a problem that you know is fixed in the newer version and upgrading your dependencies because you hope it fixes a problem you don't understand.
There is the possibility you are told you made a mistake in thinking it’s a bug with the library
The current trend and curse of DRY and NIH is to solve stuff by adding dependencies and gluing them together. Rookies expect that some software has solved the problem at hand, without thinking about it. Even worse, they apply this even to rather simple things. The OP's problem - counting items inefficiently - is absurdly common. IMHO this is the heart of the problem: the new generation is highly uneducated in how to handle data.
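To make the counting problem concrete (a hedged sketch with firebase-admin; the collection and field names are my assumptions, not the article's code): the expensive version re-reads every document on every page view, the cheap one reads a single precomputed document.

    import * as admin from 'firebase-admin';

    // Anti-pattern: recompute the total by scanning the whole collection.
    async function totalByScanningEverything(db: admin.firestore.Firestore): Promise<number> {
      const snap = await db.collection('payments').get(); // N document reads per page view
      return snap.docs.reduce((sum, doc) => sum + (doc.data().amount ?? 0), 0);
    }

    // Cheap alternative: keep a running total and read one document.
    async function totalFromCounter(db: admin.firestore.Firestore): Promise<number> {
      const doc = await db.collection('stats').doc('totals').get(); // 1 read per page view
      return doc.data()?.total ?? 0;
    }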
I used to be in the Java ecosystem, the C# ecosystem, the PHP ecosystem... and I could have made the statement "the most professional cluelessness I’ve ever encountered was in the X ecosystem."
I think it's just an industry thing.
Having spent much of the past year writing Rust and interacting with that community, I'm inclined to disagree.
PHP, Ruby on Rails, and jQuery are other technologies that had a low barrier to entry and received the attention of the "unwashed masses".
This being 2018, JS has very low barrier to adoption (Have a web browser? You have a JS runtime.) and nature runs its course.
Even before, a significant number of computer science graduates couldn't do software engineering after graduation.
That barrier to entry is designed to protect society from poor quality software and actual software engineers from having to suffer through picking up the broken pieces after those people that were helped to jump the barrier.
Most barriers to entry are not designed to protect anyone, they're designed to preserve power. To protect people from bad products, you need regulation, accreditation, etc.
This is the first step. Then one needs to find a company with a good engineering culture, apply the theory they learned and gather experience.
Ideally one should find a qualified engineer as mentor.
Self-study and being aware of developments in the profession are the last piece of the puzzle.
Yes, some people won't be able to do some of these things and as a result they won't be good software engineers. They could still be successful programmers; the two aren't necessarily related.
Apologies if I suggested otherwise, but of course programming doesn't need barriers to entry. Just like PCs and the internet don't.
As soon as I got to the part where they just upgraded a bunch of libraries.. I rolled my eyes, I was expecting a serious look at something, perhaps even a bug in Firebase or something in-depth. But nope, what we got was "Ooops I didn't think about the number of API/DB calls we were making because we don't think that way, we just assume everything is the fault of the libraries we use."
That kind of attitude is why I cannot wait to abandon JS all together..
Yeah I'm guilty of this one. Sometimes you know the problem is somewhere in a particular area of code, but that code is all over the place. Pulling it apart and refactoring it can be a good way of understanding all its dependencies. If the refactoring doesn't help, just don't check it in..
However it sounds like people are talking about refactoring an app solely for the purpose of hoping that the refactor shakes out whatever bugs. That sounds like the debugging equivalent of “8 hours of coding saved me 30 minutes of planning”
Us old/wise/thoughtful folk have denigrated the tools that young/foolish/impetuous kids use since we were they.
We need both: yes, these young people made some mistakes, but I'm in awe at what they achieved. They built, triaged and fixed a massively successful campaign in the time I would have taken scoping out the requirements. Oh, and glad-handed Google into paying the tab... impressive!
I refuse to use a service like this unless it gives me the ability to automatically cap costs and alert me when thresholds are met.
All it takes is a rogue line of code in an endless loop or something, and you are bankrupt.
Their site seems pretty basic. I'm struggling to understand why they couldn't just run it with something like Postgres for less than $100 a month on AWS?
Google Cloud user here. A warning: If you ever happen to get, say, frontpage on reddit or techcrunch or other big boost to publicity, your site could be down until the next billing cycle (i.e. 24 hours) and you will have no way to fix it.
This bit me hard one day with appengine and lost us a ton of converting traffic, even though we tried to get the limit increased within ten minutes of the spike (and well before our limit was hit).
Even if the front door of the system didn't help you, we definitely should have been able to get you to a good state much quicker. My profile has Twitter and my DMs are open (I can give you my email there too).
Doing my kids' dinner, so responses might be slightly delayed.
It's a problem with most cloud providers, but Google seems to be notorious for it.
"Company ignoring you? Send out a tweet, that'll work!"
And it blows my mind that it actually does. It's very sad.
You just really have to put some serious thought into what your daily limits should be, and add some reasonable alerting to detect surges. The tools are there and they're not terribly hard to use. It just tends to be an afterthought for most developers because this doesn't look like a customer feature.
Without these sorts of automated scaled services, the traditional behavior is "your app goes down". This is a big improvement!
This is hard for some things, but your startup failing because you didn't want to do it is much harder in the end.
It would be very difficult to build products in a reasonable amount of time if everything has to be coded defensively. I can build my app quicker, and remain sane, if I assume that the DB will always be available, and just fail if the DB isn't there. Same for things like S3 (which I think had only 1 large scale failure in recent history), Redis, etc...
There are APIs which can be unavailable and you need to work around those. For me, these are mostly third party services that I don't control. But then again, I'm not building the next Netflix. I don't have enough engineers to build an app that works with a chaos monkey!
Not the best approach for all applications, but has been good enough for most projects I've worked on. Just my 2 cents.
It's not an unreasonable request that, for services which advertise the ability to scale up and down on demand, the billing and billing limits should be able to respond similarly.
How so? With a pay-as-you-go system, firing off warnings and giving a projection of their future costs (which is hard when startups tend to have spikey traffic) is about as good as you can do.
Edit: I should add that the common solution to controlling your billing in situations like this is having some overflow path built into your beta app ("Sorry, we're not taking new users at the moment" or the like).
In essence, setting and updating amount, rate, and velocity (speed of rate change) caps on the fly.
Then I can set whatever tight limit I want to, and not worry about burning through too much cash because of some simple coding, or config error.
A bug almost cost us several tens of thousands in BigQuery costs when a dev accidentally repeated a big query every 5 seconds in an automated script, and while we did have budget warnings, it still cost us a fair bit of money. Even after this, I found it tricky to set/catch budgets for single services. I think I had to use Stackdriver to be able to get any kind of warning.
It was in "blinking lights and sirens" territory fast!
An account that goes from $0 spend to $30k in 72 hours should really trigger some kind of flag - even internally within google. What if they didn't have any kind of grant and weren't able to pay?
It's been a little over a year since I used Firebase in production, so maybe this has changed, but the funny thing is Firebase DB doesn't infinitely scale, despite them advertising that it does.
The Firebase DB caps out at 100k active connections and according to them (at the time) it's a technological limit on their part, so they cannot go higher even if they wanted to.
When we brought this up, they told us they were technically unlimited because you could shard your data into different DBs if you needed more connections, which is like saying all restaurants are all you can eat because you can keep buying more food.
That's the thing that terrifies me. If I'm using S3/Cloud Storage etc., I'm getting charged for each GB of outbound traffic, and I have to assume that the bandwidth available to serve my files is almost infinite.
There are some ways to limit writing using the rules engine but you’ll still get charged for failed writes. :)
Only real way to rate limit firebase that I know of is to put some sort of proxy service in between, but then you lose a lot of the advantage of using FB in the first place.
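A minimal sketch of what such a proxy could look like (an in-memory per-IP limiter in Express; in practice you'd back the counters with something shared like Redis, and the numbers here are arbitrary):

    import express from 'express';

    const app = express();
    const hits = new Map<string, { count: number; windowStart: number }>();
    const WINDOW_MS = 60_000;   // 1-minute window
    const MAX_PER_WINDOW = 100; // arbitrary per-client budget

    app.use((req, res, next) => {
      const key = req.ip ?? 'unknown';
      const now = Date.now();
      const entry = hits.get(key);
      if (!entry || now - entry.windowStart > WINDOW_MS) {
        hits.set(key, { count: 1, windowStart: now });
        return next();
      }
      if (++entry.count > MAX_PER_WINDOW) {
        return res.status(429).send('Too many requests');
      }
      next();
    });

    // ...behind this middleware, forward the request on to Firebase / your backend.
    app.listen(3000);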
That’s why I prefer to just rent whole machines on AWS. If I accidentally have some code stuck in an infinite loop making some O(1) call, don’t charge me $10,000 for that when it costs you nothing. If it’s actually consuming electricity and significant resources, I’ll know quickly because my service will go down, as it should, not scale infinitely until my company is bankrupt.
It's endless bill monitoring and budget approval.
I'll stick to a flat rate DO droplet.
So, if the cost cap is defined as $n per day, once you deplete it you'll be down for the rest of the day (or until you take some manual action to increase the cap, if the cloud provider supports it).
This problem is a function of the granularity. Imagine a system that let you say:
"I want to spend max $n per second with a extra burst of $m per day/week"
You adjust your "$n" to match the throughput you'd have if you had a fixed-size "pay what you provision" system, and reserve $m for lucky events like landing on HN.
The amount of planning you have to do is similar to the traditional resource allocation, but with the benefit of paying less than provisioned if you're not getting all that traffic.
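In code, that proposal is basically a token bucket over dollars instead of requests; something like this (all names and numbers are made up for illustration):

    // Spend accrues at $n per second up to a burst reserve of $m.
    class SpendBucket {
      private tokens: number;
      private lastRefill = Date.now();

      constructor(private dollarsPerSecond: number, private burstDollars: number) {
        this.tokens = burstDollars;
      }

      spendOk(costDollars: number): boolean {
        const now = Date.now();
        this.tokens = Math.min(
          this.burstDollars,
          this.tokens + ((now - this.lastRefill) / 1000) * this.dollarsPerSecond,
        );
        this.lastRefill = now;
        if (this.tokens < costDollars) return false; // over budget: shed load instead of billing
        this.tokens -= costDollars;
        return true;
      }
    }

    // e.g. allow $0.01/second of steady spend with a $50 burst for an HN spike:
    const budget = new SpendBucket(0.01, 50);
    if (!budget.spendOk(0.002)) {
      // serve a cached page / "we're busy" response instead of hitting the database
    }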
For more capacity based services, sure, but it's more likely to be down than just slow. When systems are run at their limit they rarely operate the way they did with a little less traffic.
When you're an individual, the potential for a $10k bill is much scarier than your hobby project going down.
When you're a small org/startup, the potential for a $75k bill is probably still scarier than your site going down.
Caps are tough though. I can certainly understand a use case that would want a hard circuit breaker that just kills everything it can once it hits a certain threshold. Sort of; you presumably don't want everything on S3, for example, to be deleted.
On the other hand, moving up the scale of serious businesses, I can imagine it would be hard to specify circuit breakers (rather than just alerts) and you get into issues of terminating services that affect all sorts of other services across the entire account.
We had daily hard caps and budget alerts, but it's still an area we can do better.
(Disclaimer: Product Manager for Cloud Firestore)
Perhaps cloud providers should have some sort of hard circuit breaker option (though it won't help for some things like storage) but it's probably not a priority as not a lot of businesses--their primary customers--would be OK with effectively hitting the power button for their entire cloud account if they exceed some dollar amount that someone or other configured a couple of years ago.
It's not Firebase's fault you ran bad code and have a huge bill. They have bills to pay also. Think they can tell their vendors, "Sorry, we can't pay you this week. A customer ran up a 30k bill and they can't pay it, so we can't pay you right now. But lol, bad code, right?"
I don't blame Firebase at all though -- great product.
Among many, I think this article is probably the most succinct indictment of ADHD-ridden "modern" web programming/ecosystem practices I've read.
It's so sad to me that while the name dropping and churn for frameworks and languages continues, frenzied and unabated -- basic (pun sort of intended) analysis and problem-solving techniques go out the proverbial window.
Why learn to think critically when you can just 'npm update', fix 37 broken dependencies, and write a blog post about it? Right?
This is more a problem about startups using inexperienced developers than anything related to what they're building or which tech they're using.
Seeing someone use Firebase to save payments, then recompute a total from a collection, and as a consequence have their system explode with less than one session per second, means everybody on the team drank the "let's use this shiny NoSQL Google tech, it's so cool" Kool-Aid.
Even one conversation with any senior dev with some kind of backend development experience would have raised questions about expected load, types of queries, data model, etc., and concluded that storing payments was probably the least suitable scenario for using a tech like Firebase.
Bottom-up learning starts at the metal, at the very fundamentals of computation, and builds upward.
1. Developing a fix without understanding root cause (try-something development)
2. Sufficient testing, including load testing, prior to initial deployment
3. Better change control after initial deployment
4. Sufficient testing for changes after initial deployment
5. Rollback ability (Why wasn't that an option?)
6. Crisis management (What was the plan if they didn't miraculously find the bad line of code? When would they pull the plug on the site? Was there a contingency plan?)
7. Perfect being the enemy of good enough
Looks like they were bailed out of the cost but what if that didn't happen?
In the companies I've worked for, these guys would be written up and likely put on a performance improvement plan, if not flatly fired.
I don't understand how you can build a complex application like that without doing basic performance checks like, are we hitting the file system or database too often, our the image assets correctly sized, etc.
I'm not a software engineer however.
A money clock on the table isn't fun, and if you replace the app with a landing page and a newsletter form, it's completely acceptable to visitors.
This is the nightmare I envisioned with cloud services: a client gets hit really hard, and I have to pass the bill on to them.
This reminds me of variable-rate mortgages.
With dedicated hardware, you may end up with performance issues, but never a ghastly business-ending bill. How does anyone justify this risk? I really don't understand the cloud at all for such high cost resources with literally unlimited/unpredictable pricing.
Can someone explain this risk/reward scenario here?
I'm more concerned about the risk mitigation strategies (capping) I'm seeing advocated.
If your server's being pegged, you've only got a few customers missing out while it's pegged, or maybe even everyone getting service, just sub-optimally. You can ride out the wave and everything goes back to normal.
Putting caps in place is like pulling the plug out of the server after the CPU has been at 100% for 5 minutes and not plugging it in until the next billing cycle.
If anything, that shows a lack of proper hiring decision on you and your team's part.
I do, however, agree that their practices are horrible (just look at their console, they're console.logging random things, running the dev mode of Firebase, and fetching some USD conversion call 10x on load with no caching) and they're lucky Google bailed them out at the last minute.
Hey, friend! I had no control over hiring for that gig.
When I do use queries, it's always in places where the results have a well-defined limit (usually limit = 1), e.g. finding the most recent X or the highest X.
With the above two, you get all the greatness of Firestore, but with a well-defined (low) cost that you can calculate ahead of time.
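In other words (a sketch with firebase-admin; collection and field names are just illustrative), the "most recent X" style of query costs one read regardless of collection size:

    import * as admin from 'firebase-admin';

    async function mostRecentPayment(db: admin.firestore.Firestore) {
      const snap = await db
        .collection('payments')
        .orderBy('createdAt', 'desc')
        .limit(1) // well-defined, low cost: at most one document read
        .get();
      return snap.empty ? null : snap.docs[0].data();
    }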
Definitely more we can improve here for control, and we're open to feedback.
I believe it would be nice to soon have a way to query on a document's create, update, and write times. Right now I manage the create time inside my document with Date.now(), but when I was running a bunch of promises to create documents, the createTime was in some cases the same across documents, so my pagination failed.
Other things would be nice too, like compound queries inside subcollections, and a way to export the whole database for backup.
Also a flag to tell Firestore to return the document in the same response when I do an update on it, in one round trip (DynamoDB has it). I know I can reach this goal with a transaction, but I believe it would be simpler than a transaction.
And a way to update an array without a transaction.
Multiple rules/filters need to exist to trigger SMS/Email alerts, or a pre-defined action, upon certain conditions.
This startup mindset is not always good.
The article refers to some mysterious "engineering team". It would appear very little actual engineering took place at that company.
I've seen and fixed such bugs as described in the article, and before you start trying to upgrade anything a look in the log followed by a git bisect session is the first step.
My rails apps have great logs, I get to see what views and partials are rendered, what queries are sent to the database and more important how often all that happens. If the log excerpt for a single request doesn't fit on my screen I know I have to do something.
You should know your application's profile, you wrote it.
How many resources does your app need? That's something our developers believe is the "operations team"'s responsibility. Well, now that you took the 'devops' role you can no longer keep ignoring this. Your new infrastructure provider will be more than happy to keep adding resources; one can only hope the pockets are deep enough.
With attention to the profile, this would have been caught at development time, or at least at testing time.
Oh that will do it.
Doesn't sound like a future great company to me, especially when their lesson from this was that Google will bail them out and "It is very important that tech teams debug every request to servers before release," rather than hiring less cavalier employees and putting in better process.
The same reason we slow down for car crashes, morbid curiosity. I don't think there is anything "sudden" about it though, we even have sites like thedailywtf dedicated to this level of idiocy.
hackernoon: author patting themselves on the back while everybody else is laughing at their incompetence
For those that didn't read the article, it had a happy ending:
> GOOGLE UNDERSTOOD AND POWER US UP!
> After we fixed this code mistake, and stopped the billing, we reached out to Google to let them know the case and to see if we could apply for the next grant they have for startups. We told them that we spent the full 25k grant we had just a few days ago and see the chance to apply for the 100k grant on Google Cloud Services. We contacted the team of Google Developers Latam, to tell them what had just happened. They allowed us to apply for the next grant, which google approved, and after some meetings with them, they let us pay our bill with the grant.
> Now we could not be more grateful to Google, not only for having an awesome “Backend As A Service” like Firebase, but also for letting us have 2 million sessions, 60 supports per minute and billions of requests without letting our site go down. Besides they understood errors like ours can happen when a startup is growing and some expensive mistakes can jeopardize the future great companies.
There's no guarantee that if I made the same snafu next week Google would necessarily be willing to help, but I can absolutely guarantee you that a VM sitting on a Dell PowerEdge I've got lying around would never suddenly obligate me to a $35,000 bill, no matter how bad my code.
Ideally, I guess I'd want to see rather than a hard cap, some sort of smart alert that would go "holy crud, this is an unusual spike in the rate of requests" when the delta changed unusually, rather than waiting until I say, hit a high static cost bar or a hard cap that kills the site.
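Something like this, roughly (a sketch of a "rate-of-change" alert rather than a hard cap; thresholds and the notify() function are made-up placeholders):

    // Compare the current request rate to an exponential moving average and
    // warn on an unusual jump, instead of waiting for a static cost bar.
    const notify = (msg: string) => console.warn(msg); // stand-in for SMS/email/pager

    function makeSpikeDetector(factor = 3, alpha = 0.1) {
      let avg: number | null = null;
      return (requestsPerMinute: number) => {
        if (avg !== null && requestsPerMinute > avg * factor) {
          notify(`Traffic spike: ${requestsPerMinute}/min vs ~${avg.toFixed(0)}/min average`);
        }
        avg = avg === null ? requestsPerMinute : alpha * requestsPerMinute + (1 - alpha) * avg;
      };
    }

    // Feed it one sample per minute, e.g. from your request counter:
    const check = makeSpikeDetector();
    check(120);
    check(130);
    check(4500); // triggers the warning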
Budgets and quotas are really tricky as Dan pointed out elsewhere in this thread. App Engine has had default daily budgets (that you can change) forever, but then you run into people saying “What the hell, why did you take down my site?!”.
In this case, they even intentionally pressed forward once they saw their bill was going up. If this had been say a static VM running MySQL with a “SELECT *” for every page view, the site would likely have just been effectively down. For some customers, that’s the wrong choice, even in the face of a crazy performance bug.
That said, we (all) demonstrably need to do better at defaults as well as education (the monitoring exists!).
For smaller organizations, something being down during extreme load is a recoverable problem, but owing the cloud provider all of their money may not be. (Note that even in the case here where Google got them the grant to cover this bill, this is still probably 35K in grant money that could've gotten them further or been used better elsewhere.)
Also, I bet they did some manual testing. They didn't catch it because this latency can only be seen by an account with a lot of followers.
I agree that their first solution to upgrade is a bad idea...
You should understand what caused the bug before trying to fix it.
I highly encourage you to monitor the load/pay/request graphs on a daily basis. Even better if you hang a screen in the office that displays these. The graphs are already provided by Firebase. That way you can catch this type of anomaly on day one. Also, Firebase supports "Progressively roll out new features" https://firebase.google.com/use-cases/#new-features
Makes for an interesting counterpoint to the currently popular "Google is evil" narrative. The truth is probably much more mundane: Google is an awful lot of people trying to work together to achieve a bunch of shared goals and doing an imperfect job of it. This isn't just rose-tinted: I'm quite sure they have their fair share of bad actors, and they certainly make decisions we don't all like (e.g., retiring products), but I don't think it's because the company is fundamentally evil.
We could have a debate on whether Philip Morris is evil. I am sure most of their employees are pretty decent people.
Great logic. Philip Morris is evil because they sell products that cause cancer and death. Therefore Equifax, who extorts money from people to protect them against Equifax polluting their credit rating, and who leaks their data into the wild is not evil, because Equifax doesn't cause cancer nor death!
Horrible architecture decisions like this can be very costly.
At that qps Redis has 2 microseconds per request.
I agree it caches well but your proposed architecture is definitely not production quality.
On my machine, with 5 concurrent requests at 100 items each, it can do ~6,500,000 items per second; at 300 items, 11,000,000 per second, and it roughly caps out there. Even with 1 concurrent connection, at 600 items, you get 6M per second.
And wastes a bunch of various resources making separate requests to a currency conversion service for each amount, as others have noted. And requests /null and /undefined. This might be the most irresponsible development I’ve ever seen.
You want it perfect and you cannot afford to take the site down, but you're willing to take a "huge risk" based on (wrong) guesses, with clearly not enough time for proper QA. I sincerely suggest you slow down and reflect on priorities and risk assessment; there's a reason that Firebase code slipped through. By the way, I'm happy you avoided the worst-case scenario. Good luck with your project.
That runs counter to the great Silicon Valley ethos of moving fast and breaking things
I worked at a startup in the Mission for a few months and I remember seeing a quadratic query that ran for every customer that was logged into our application. The CEO and team lead wondered why our app worked great in Dev (with only 5 users) but was terrible on premise (250 users). When I tried to explain the issue the two devs before me didn't really understand what I was talking about. It was a quick refactor and caching solution that fixed the problem, but it was clear that the development was still new.
though they already had this sum precalculated
SELECT * FROM payments where paymentID = 1
SELECT * FROM payments where paymentID = 2
SELECT * FROM payments where paymentID = 3
...
SELECT * FROM payments where paymentID = 14986
In SQL terms, what happened would be more like doing a full table scan on each query instead of using an index (and not even that, because pre-computing a total isn't really like an index).
This kind of "account balance" problem is typical of the problems where transactions are really useful. But it is also historically the kind of problem where NoSQL techs do a poor job (they're built more with "eventual consistency" in mind than atomic or transactional behavior).
By not checking Google Billing after you launch your website. At the very least you should have a billing alert.
I'd rather have default alerts already configured that I can change, rather than none.
Same for the framework change. An outdated framework might be a second or so slower, but a >30-second load time was never going to be fixed by updating. This is just bad problem-solving skills. When your app is taking 30 seconds to load, you don't just guess at what might make it faster -- you open your JS console and your log files, and you figure out where that time is being spent. Two minutes in the Performance tab of Chrome's developer tools, and you would have figured out the issue was on the back-end rather than the front-end.
Denormalize your data (or, optimize for read).
Denormalizing would have required having a relation between collections. Here there was just one (so there was nothing to denormalize).
Optimizing for reads (or simply thinking about read performance) doesn't require denormalizing. It can also be a matter of creating an index, or precomputing values in a cache, as in the OP's case.
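As a sketch of the "precompute in a cache" option for this case (firebase-admin, with invented collection/field names): bump a single totals document in the same transaction that records the payment, so reads stay at one document per page view.

    import * as admin from 'firebase-admin';

    async function recordPayment(db: admin.firestore.Firestore, amount: number) {
      const paymentRef = db.collection('payments').doc();
      const totalsRef = db.collection('stats').doc('totals');

      await db.runTransaction(async (tx) => {
        // All reads happen before writes inside a Firestore transaction.
        const totals = await tx.get(totalsRef);
        const current = totals.data()?.total ?? 0;
        tx.set(paymentRef, { amount, createdAt: admin.firestore.FieldValue.serverTimestamp() });
        tx.set(totalsRef, { total: current + amount }, { merge: true });
      });
    }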
* didn't have good testing
* did no load testing before
* had no code reviews done
* no design reviews done
* had little to no (useful) application logging
* had no change control mechanisms defined or followed (upgrading a framework in production in a matter of minutes or hours as a way to wing it and pray for it to work out?)
* had few or no automated tests
* didn't have a detailed post mortem or root cause analysis to see what they could do to prevent it (the ending looked quite amateurish, by pointing to only one thing as a potential lesson)
* wasted a lot of money that could've helped in the future (by instead using it to pay for an error)
If I were in such a team, I would've honestly stated in such an article how deeply ashamed I am that we missed all these things, how "cowboy coding" and heroics must never be glorified, how we got very, very lucky in someone else waiving off charges (this is not a luxury that most startups or one person endeavors would have), and ended with asking for advice on what could be done to improve things (since it's obvious there were many more gaps than just how a few lines of code were written).
To the team that wrote this code and this article — get some software development methodology adopted (any, actually) and some people who can help you follow any of those. Also read the rest of the comments here. You got very, very lucky in this instance. It may not be the same case again, and you may see your "life's work" get killed because you didn't really learn.
Maybe that ~$600 million isn't in USD?
The graph shows a spike to around $5,000 per day ($5 mil por día). The entire dashboard is in USD, presented in a Spanish locale. That is also why the dollar sign is suffixed, the months are not capitalized, and why May has a dot after it, because it is abbreviated there (mayo).
Every programmer should understand locales even if they do not speak the language.
But OK, I didn't get that "COP" is Colombian Peso :(
And there's another image that shows total collected as "USD $244,875". The ratio is 2450, which is close enough to the exchange rate.
When you said “the image” I thought you were looking at the right one, and I thought it odd you were off several orders of magnitude from what I assumed to be your misunderstanding. That explains that. I had to go back and find your figure.
I gotta say that using "$" for both USD and COP is confusing. So you must say "USD $x" and "COP $x". Then why bother with the "$"?
Furthermore, the US dollar itself stems from the Spanish dollar:
"The U.S. dollar was directly based on the Spanish Milled Dollar when, in the Coinage Act of 1792, the first Mint Act, its value was fixed [..] as being "of the value of a Spanish milled dollar as the same is now current
But it's still potentially confusing, when stuff gets translated with no context for currency values. Especially, I imagine, if you don't know either Spanish or English.
The Dutch, ever innovative in trade, can lay claim as a bigger influence on the word “dollar” and the currency form itself, however, and colonial Americans traded regularly in Dutch daalders (we still pronounce it that way, unlike doh-LAHR/doh-LAHR-ehss for the Spanish varieties). Daalders themselves were descendants of Bohemian thalers, as were Spanish dollars. We just borrowed the neighboring dollars when the time came, probably due to our foreign policy environment at the time, trade with Florida, and so on.
I mean we have kilograms, meters and seconds. And they're the same for every country.
But "$" (dollars and other currency units) means different things, depending on the context. Similarly for ounces, pounds, feet, gallons, etc. So you're left with constructions like "US $" or "USD" or "USD $" vs "Can $" or "CAD" or "CAD $". Just as with "avoirdupois ounce" vs "troy ounce", "US gallon" vs "imperial gallon", and so on.
So anyway, I always write "foo USD", "foo EUR", "foo mBTC" and so on. To avoid ambiguity.
There’s a bit of americentrism down the confusing line of thought, for what it’s worth.
That's not true. It's from Latin, mille.
That aside, I find it a terrible way of writing things. But then again, Americans hate SI, so I guess have fun. :-)