
How we spent $30k in Firebase in less than 72 hours - slyall
https://hackernoon.com/how-we-spent-30k-usd-in-firebase-in-less-than-72-hours-307490bd24d
======
sciurus
When you have an unexplained performance problem, your response shouldn't be
to "upgrade every single framework and plugin" that you use. The 36 hours that
they spent doing this cost them $21,600 on GCP and didn't solve their
users' problem.

Understand the services you depend on. Track the number of requests you're
making to them, how long they're taking, and how many are failing. Reason
through your system and look at the data when you have issues, rather than
grasping at straws.
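
A minimal sketch of what that tracking can look like on the client, assuming
an initialized Firebase app (the wrapper and the counter names here are
invented for illustration, not any particular library's API):

    // Count, time, and record failures for every Firestore read we issue.
    const stats = { requests: 0, failures: 0, totalMs: 0 };

    async function trackedGet(docOrQuery) {
      const start = Date.now();
      stats.requests += 1;
      try {
        return await docOrQuery.get();
      } catch (err) {
        stats.failures += 1;
        throw err;
      } finally {
        stats.totalMs += Date.now() - start;
      }
    }

    // Usage: trackedGet(db.collection('payments').doc(id));
    // Ship `stats` to your monitoring system on an interval.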

~~~
thatjsguy
I mean this in the nicest possible way: I see this all the time with the
JavaScript set and I am absolutely not the least bit surprised. I used to work
on a team that was TypeScript top to bottom, with people who didn’t really
even understand how to debug (they were mostly bootcamp juniors). Whenever
something would break, if restarting it didn't work, you know what they'd try?
Yup, upgrading random dependencies. Refactoring was also pretty popular,
although it usually just ended up making things more complicated.

It’s silly to suggest that JavaScript itself is somehow responsible for this.
It’s obviously just a tool. But I have to say, the most professional
cluelessness I’ve ever encountered was in the JS ecosystem.

~~~
toast0
So I completely agree, but let me play devil's advocate.

If you do go and track down the problem in your dependency and file a bug, one
of two things is likely to happen: they close it and say it's fixed in the
latest version or they refuse to accept your bug because it's filed against an
old version.

Skipping the track it down part and just jumping into upgrading can be a time
saver. It works fairly well if you fit into the 'common' part of the user base
with frequent updates. (Incidentally dependencies with frequent updates are
kind of a pain)

~~~
TeMPOraL
> _If you do go and track down the problem in your dependency_

This is the part the parent's co-workers didn't perform. There's nothing wrong
with updating a dependency to include the fix for the problem you're
experiencing. But the people in question were apparently too lazy/clueless to
even track down the problem, opting for randomly upgrading stuff instead.

------
t0mbstone
This is why infinitely scaling pay-as-you-go cloud services terrify me.

I refuse to use a service like this unless it gives me the ability to
automatically cap costs and alert me when thresholds are met.

All it takes is a rogue line of code in an endless loop or something, and you
are bankrupt.

Their site seems pretty basic. I'm struggling to understand why they couldn't
just run it with something like Postgres for less than $100 a month on AWS?

~~~
itcmcgrath
Product Manager for Cloud Firestore here. It's worth noting we do have the
ability to set hard daily caps, as well as budgets that can have alerts tied
to them. It's also something we're looking at ways to improve.

~~~
alasdair_
>Product Manager for Cloud Firestore here. It's worth noting we do have the
ability to set hard daily caps, as well as budgets that can have alerts tied
to them. It's also something we're looking at ways to improve.

Google Cloud user here. A warning: If you ever happen to get, say, frontpage
on reddit or techcrunch or other big boost to publicity, your site could be
down until the next billing cycle (i.e. 24 hours) and you will have no way to
fix it.

This bit me hard one day with appengine and lost us a ton of converting
traffic, even though we tried to get the limit increased within ten minutes of
the spike (and well before our limit was hit).

~~~
iamgopal
So basically we want to have our cake and eat it too.

~~~
paranoidrobot
This seems like an overly cynical/snarky response.

It's not an unreasonable request that, for services which advertise the
ability to scale up and down on demand, the billing and billing limits should
be able to respond similarly.

~~~
badlucklottery
>that the billing and billing limits should also be able to respond similarly

How so? With a pay-as-you-go system, firing off warnings and giving a
projection of their future costs (which is hard when startups tend to have
spiky traffic) is about as good as you can do.

Edit: I should add that the common solution to controlling your billing in
situations like this is having some overflow path built into your beta app
("Sorry, we're not taking new users at the moment" or the like).

~~~
Dylan16807
The problem is that the user gets those alerts, tries to change the limit they
set, and it doesn't work. That is definitely not as good as can be done.

------
r3vrse
I spend a fair amount of time on HN.

Among the many I've read, I think this article is probably the most succinct
indictment of ADHD-ridden "modern" web programming/ecosystem practices.

It's so sad to me that while the name dropping and churn for frameworks and
languages continues, frenzied and unabated -- basic (pun sort of intended)
analysis and problem-solving techniques go out the proverbial window.

Why learn to _think critically_ when you can just 'npm update', fix 37 broken
dependencies, and write a blog post about it? Right?

~~~
onion2k
Critical thinking is a learned skill. For most developers it comes with
experience, and many developers at startups haven't learned it yet. Thinking
well under pressure is even harder.

This is more a problem about startups using inexperienced developers than
anything related to what they're building or which tech they're using.

~~~
bsaul
It comes with experience, but also with more senior devs explaining it to you.

Seeing someone use Firebase to save payments, then recompute a total from the
collection, and as a consequence have their system explode at less than one
session per second, means everybody on the team drank the « let’s use this
shiny NoSQL Google tech, it’s so cool » Kool-Aid.

Even one conversation with a senior dev with some kind of backend development
experience would have raised questions about expected load, types of queries,
data model, etc., and concluded that storing payments was probably the least
interesting scenario for a tech like Firebase.

------
ryandrake
Definitely looks like several "teachable moments" here. They learned the hard
way about:

1. Developing a fix without understanding root cause (try-something
development)

2. Sufficient testing, including load testing, _prior_ to initial deployment

3. Better change control after initial deployment

4. Sufficient testing for changes after initial deployment

5. Rollback ability (Why wasn't that an option?)

6. Crisis management (What was the plan if they didn't miraculously find the
bad line of code? When would they pull the plug on the site? Was there a
contingency plan?)

7. Perfect being the enemy of good enough

Looks like they were bailed out of the cost, but what if that hadn't happened?

~~~
Aloha
Try-something is very useful as a troubleshooting tool, when you need to
change the state of the issue enough to collect further troubleshooting
information.

~~~
flukus
I think "try-anything" would be a better characterization in this case. They
had smoke pouring out of the engine and tried changing the tire.

~~~
Aloha
No disagreement there.

I don't understand how you can build a complex application like that without
doing basic performance checks like: are we hitting the file system or
database too often, are the image assets correctly sized, etc.

I'm not a software engineer however.

------
RobertRoberts
Anyone else feel like they'd rather have their dedicated server slow down
instead of rack up a $30k debt?

This is the nightmare I envisioned with cloud services: a client gets hit
really hard, and I have to pass the bill on to them.

This reminds me of variable-rate mortgages.

With dedicated hardware, you may end up with performance issues, but never a
ghastly business-ending bill. How does anyone justify this risk? I really
don't understand the cloud at all for such high-cost resources with literally
unlimited/unpredictable pricing.

Can someone explain this risk/reward scenario here?

~~~
flukus
> How does anyone justify this risk?

I'm more concerned about the risk mitigation strategies (capping) I'm seeing
advocated.

If your server is being pegged, you've only got a few customers missing out
while it's pegged, maybe even everyone getting service but sub-optimally. You
can ride out the wave and everything goes back to normal.

Putting caps in place is like pulling the plug out of the server after the CPU
has been at 100% for 5 minutes and not plugging it in until the next billing
cycle.

------
whoisjuan
I find it ridiculous that their first solution was to go and upgrade to
another Angular version, especially a non-beta version upgrade of a framework
that is used on thousands of super-high-traffic websites with no problems. How
clueless can you be?

~~~
thatjsguy
> How clueless can you be?

I mean, if you’re junior and you’ve just learned JavaScript, it’s not
difficult. I’ve met a lot of monied people who seem to think a junior dev with
a few weeks of JavaScript training is equivalent to a senior engineer with a
degree. It never works out, at least not for the smart people.

~~~
sushid
Why are you so against junior devs? I've personally never met a junior dev who
just starts going ahead and upgrading various dependencies, as you suggest.

If anything, that shows a lack of proper hiring decisions on your and your
team's part.

I do, however, agree that their practices are horrible (just look at their
console, they're console.logging random things, running the dev mode of
Firebase, and fetching some USD conversion call 10x on load with no caching)
and they're lucky Google bailed them out at the last minute.

~~~
thatjsguy
> If anything, that shows a lack of proper hiring decisions on your and your
> team's part.

Hey, friend! I had no control over hiring for that gig.

------
shaunpersad
I can't say enough good things about Firebase and GCP in general, but I'm
always cautious when using Firestore in particular. I usually avoid unbounded
queries altogether, and treat it primarily as a key/value store to get by id.

When I do use queries, it's always in places where the results have a well-
defined limit (usually limit = 1), e.g. finding the most recent X or the
highest X.

With the above two, you get all the greatness of Firestore, but with a well-
defined (low) cost that you can calculate ahead of time.
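
For anyone unfamiliar with Firestore, here's a sketch of those two access
patterns with the Firebase JS SDK (collection and field names are made up):

    const db = firebase.firestore();

    async function examples() {
      // 1. Key/value style: fetch exactly one document by id. Cost: 1 read.
      const vaki = await db.collection('vakis').doc('some-id').get();

      // 2. Bounded query: the most recent payment. Cost: at most 1 read.
      const latest = await db.collection('payments')
        .orderBy('createdAt', 'desc')
        .limit(1)
        .get();
    }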

~~~
itcmcgrath
We've also been rolling out updates to rules to enable you to enforce these
types of things. There are performance implications to limit queries for the
real-time update system at scale, but for most use cases this shouldn't be a
problem.

Definitely more we can improve here for control, and we're open to feedback.

~~~
fefb
Hey, thanks for listening.

In the short term, I believe it would be nice to be able to query on a
document's create, update, and write times. Right now I manage the create time
inside my document with Date.now(), but when I was running a bunch of promises
to create documents, the createTime was in some cases the same across the
documents I was manipulating, so my pagination failed.

Other things, like compound queries inside subcollections, would also be nice.
And a way to export the whole database for backup.

A flag telling Firestore to return the document in the same response when I
update it, in one round trip (DynamoDB has this). I know I can reach this goal
with a transaction, but I believe it would be simpler than a transaction.

A way to update an array without a transaction.

Thanks

------
thisismydevname
The saddest thing here is that they seem proud of this "mistake" and how they
solved it.

This startup mindset is not always good.

------
Apaec
It depresses me how the industry promotes and even supports extreme technical
incompetence. Maybe it's a consequence of "everyone should learn to code"
campaigns and bootcamps.

~~~
johnmarcus
Cheer up buddy, you were this incompetent at one time too, but now look at
you, a full grown narcissistic asshole senior developer. You have come a long
way and these kids will too!

------
lifthrasiir
I increasingly feel that modern pay-as-you-go services are so opaque to
consumers that it takes an individual employee's empathy (highly subjective)
or publicity like this (highly subjective _and_ you have to be lucky as well)
to fix any significant problem. For every post like this with a "happy" ending
that crawls onto the HN front page, there are probably hundreds or maybe
thousands of "unhappy" endings out there.

------
xtrapolate
Simple stress tests should've revealed this. Basic profiling should've
revealed this. This article makes it appear that they went live without ever
really testing the infrastructure under any kind of load whatsoever.

The article refers to some mysterious "engineering team". It would appear very
little actual engineering took place at that company.

------
TomK32
Some have mentioned it here already, but I'd like to emphasize how important
application logs are, and how much trouble you can prevent by reading and
understanding them.

I've seen and fixed bugs like the one described in the article, and before you
start trying to upgrade anything, a look in the log followed by a git bisect
session is the first step.

My Rails apps have great logs: I get to see what views and partials are
rendered, what queries are sent to the database and, more importantly, how
often all that happens. If the log excerpt for a single request doesn't fit on
my screen, I know I have to do something.
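
For anyone who hasn't used it, a typical bisect session looks something like
this (abc1234 stands in for a commit known to behave):

    git bisect start
    git bisect bad              # current HEAD exhibits the bug
    git bisect good abc1234     # last known-good commit
    # git checks out the midpoint; run the app or a test, then mark it:
    git bisect good             # or: git bisect bad
    # repeat until git names the first bad commit, then:
    git bisect reset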

------
madmulita
"It is very important that tech teams debug every request to servers before
release..." No.

You should know your application's profile, you wrote it.

How many resources does you app need? That's something our developers believe
is the "operations team"'s responsability. Well, now that you took the
'devops' role you can no longer keep ignoring this. Your new infrastructure
provider will be more than happy to keep adding resources, one can only hope
the pockets are deep enough.

With attention to the profile this would have been caught at developing time,
maybe testing time.

------
duxup
>with every visitor to our site, we needed to call every document of payments
in order to see the number of supports of a Vaki, or the total collected. On
every page of our app!

Oh that will do it.
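
In Firestore terms, the quoted behaviour boils down to something like this
(a sketch with invented names, not the authors' actual code):

    // Every page view fetches the ENTIRE payments subcollection just to
    // display two numbers. With 16,000 payments, that's 16,000 billed
    // document reads per page view.
    async function renderTotals(db, vakiId) {
      const snap = await db.collection('vakis').doc(vakiId)
        .collection('payments').get();
      let total = 0;
      snap.forEach(doc => { total += doc.data().amount; });
      return { supporters: snap.size, total };
    }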

------
pishpash
"Besides they understood errors like ours can happen when a startup is growing
and some expensive mistakes can jeopardize the future great companies."

Doesn't sound like a future great company to me, especially when their lesson
from this was Google will bail them out and "It is very important that tech
teams debug every request to servers before release." rather than hiring less
cavalier employees and putting in better process.

------
ericpauley
I've been noticing a steady rise of posts from hackernoon by amateur
developers who think they'll be the next great tech blogger. I'm not saying I
could do any better, but why are these posts suddenly getting so much
attention?

~~~
flukus
> but why are these posts suddenly getting so much attention?

The same reason we slow down for car crashes, morbid curiosity. I don't think
there is anything "sudden" about it though, we even have sites like
thedailywtf dedicated to this level of idiocy.

~~~
gruez
thedailywtf: making fun of other developers' incompetence

hackernoon: author patting themselves on the back while everybody else is
laughing at their incompetence

------
boulos
Disclosure: I work on Google Cloud (but not Firestore or Firebase).

For those that didn't read the article, it had a happy ending:

> GOOGLE UNDERSTOOD AND POWER US UP!

> After we fixed this code mistake, and stopped the billing, we reached out to
> Google to let them know the case and to see if we could apply for the next
> grant they have for startups. We told them that we spent the full 25k grant
> we had just a few days ago and see the chance to apply for the 100k grant on
> Google Cloud Services. We contacted the team of Google Developers Latam, to
> tell them what had just happened. They allowed us to apply for the next
> grant, which google approved, and after some meetings with them, they let us
> pay our bill with the grant.

> Now we could not be more grateful to Google, not only for having an awesome
> “Backend As A Service” like Firebase, but also for letting us have 2 million
> sessions, 60 supports per minute and billions of requests without letting
> our site go down. Besides they understood errors like ours can happen when a
> startup is growing and some expensive mistakes can jeopardize the future
> great companies.

~~~
ocdtrekkie
Perhaps somewhat optimistically, I assume most of the commenters read the
whole article, but I don't think the happy ending abates the concern. It's
great the Google Cloud team was able to bail them out afterwards, but the fact
that they were able to rack up a $35,000 cost on a code mistake still
highlights one of the major flaws with pay-as-you-go cloud computing.

There's no guarantee that if I made the same snafu next week Google would
necessarily be willing to help, but I can absolutely guarantee you that a VM
sitting on a Dell PowerEdge I've got lying around would never obligate me to a
$35,000 bill, no matter how bad my code.

Ideally, rather than a hard cap, I guess I'd want to see some sort of smart
alert that would go "holy crud, this is an unusual spike in the rate of
requests" when the delta changed unusually, rather than waiting until I, say,
hit a high static cost bar or a hard cap that kills the site.

~~~
boulos
Right, I don’t deny that (and didn’t mean to imply this isn’t a problem). At
the time of my comment it did seem like most people stopped before the end :).

Budgets and quotas are really tricky as Dan pointed out elsewhere in this
thread. App Engine has had default daily budgets (that you can change)
forever, but then you run into people saying “What the hell, why did you take
down my site?!”.

In this case, they even _intentionally_ pressed forward once they saw their
bill was going up. If this had been say a static VM running MySQL with a
“SELECT *” for every page view, the site would likely have just been
effectively down. For some customers, that’s the wrong choice, even in the
face of a crazy performance bug.

That said, we (all) demonstrably need to do better at defaults as well as
education (the monitoring exists!).

~~~
gcbirzan
That, or, as explained in the SRE book, the extra load should've been shed.
The binary choice is not always the right one. Though, granted, in this case,
it wouldn't have mattered...

~~~
ocdtrekkie
Yeah, I think "the web server ran out of money at 3 PM today" isn't the right
choice, whereas "a lot of people had a hard time getting to the site when it
got HN'd around 3 PM, but it was back to normal by 5 PM" is a better result.

For smaller organizations, something being down during extreme load is a
recoverable problem, but owing the cloud provider all of their money may not
be. (Note that even in the case here where Google got them the grant to cover
this bill, this is still probably 35K in grant money that could've gotten them
further or been used better elsewhere.)

------
betadreamer
A lot of people here talk about lack of load testing and other "do it the
right way" advice, but remember this is a startup. In my opinion, a solid
testing foundation would be overkill, and the time is better spent on
implementing more features.

Also, I bet they did some manual testing. They didn't catch it because this
latency could only be seen on an account with a lot of followers.

I agree that their first solution, upgrading, was a bad idea... You should
understand what caused the bug before trying to fix it.

I highly encourage you to monitor the load/pay/request graphs on a daily
basis. Even better if you hang a screen in the office that displays them. The
graphs are already provided by Firebase. That way you can catch these types of
anomalies on day one. Also, Firebase supports "Progressively roll out new
features": [https://firebase.google.com/use-cases/#new-features](https://firebase.google.com/use-cases/#new-features)

~~~
whoisjuan
Lack of testing and poor code quality are completely different things though.

------
carlsborg
The blog authors should state up front that Google covered the costs with an
additional startup grant, and that this was entirely due to a quadratically
expensive query.

------
bartread
_" We contacted the team of Google Developers Latam, to tell them what had
just happened. They allowed us to apply for the next grant, which google
approved, and after some meetings with them, they let us pay our bill with the
grant."_

Makes for an interesting counterpoint to the currently popular "Google is
evil" narrative. The truth is probably much more mundane: Google is an awful
lot of people trying to work together to achieve a bunch of shared goals and
doing an imperfect job of it. This isn't just rose-tinted: I'm quite sure they
have their fair share of bad actors, and they certainly make decisions we
don't all like (e.g., retiring products), but I don't think it's because the
company is fundamentally evil.

~~~
rafael859
I agree with your argument, but I don't think that people are calling Google
evil because they retired some products. See [0], which was resolved (I think,
though I can't find a source for that) due to the fact that Google is
comprised of a lot of different people, and [1], which is a fairly recent
announcement that has made some people uncomfortable.

[0]
[https://news.ycombinator.com/item?id=17202179](https://news.ycombinator.com/item?id=17202179)

[1]
[https://news.ycombinator.com/item?id=17660872](https://news.ycombinator.com/item?id=17660872)

~~~
iamforreal
For me, I find what they're doing with android pretty evil (disallowing
manufacturers from also offering phones with alternate distros/os')

------
franzwong
But did you see the slowdown in the requests to Firebase before going off to
upgrade frameworks?

------
shady-lady
Words fail me as to how anybody could put something live without understanding
how the application functioned as a whole. Unbelievable.

------
bigbluedots
This is another good reason to properly test before going live. If their site
wasn't cloud hosted it would have just fallen over, most likely. Which means a
failed crowdfunding effort. Maybe take off the cowboy boots, guys?

------
ConcernedCoder
"This means that every session to our site read the same number of documents
as we have of number of payments. #UnaVacaPorDeLaCalle received more than
16,000 supporters, so: 2 million sessions x 16,000 documents = more than 40
Billion requests to Firestore on less than 48 hours."

TLDR; Horrible architecture decisions like this can be very costly.

~~~
noncoml
I wouldn’t call it architecture decision. It’s the equivalent of a bad SQL
query.

~~~
avargas
A bad SQL query wouldn't do that. Look at their site: from the US, it calls
cloud functions for a COP-to-USD conversion in each place it renders a
currency (200+ requests just going to their homepage). I think it was built
poorly, but that's just my opinion.

~~~
whoisjuan
I agree with you. There are so many ways to make this infinitely more
efficient. For starters, why are they re-calculating and re-rendering the
value every time they get a new donation? They could also store those values
in colder, cached storage and only read from Firestore to update them. Don't
use a freaking database that charges you for every single read and write to
deal with mundane client-side renderings.

------
riquito
> we were using Angular V.4 and we decided to upgrade everything to V.6. It
> was a huge risk and we wanted this campaign to be perfect, so we did it! In
> just a day and a half, our team had the first release ready in the new
> version. After some tests, it looked like the refactor helped the app’s
> speed, but it was not as fast as we wanted it. Our goal was to load in 3
> seconds and it wasn’t working as we expected.

You want it perfect and you cannot afford to take down the site, but you're
willing to take a "huge risk" based on (wrong) guesses, with clearly not
enough time for proper QA. I sincerely suggest you slow down and reflect on
priorities and risk assessment; there's a reason that Firebase code slipped
through. By the way, I'm happy you avoided the worst-case scenario. Good luck
with your project.

~~~
TAForObvReasons
> I sincerely suggest you to slow down

That runs counter to the great Silicon Valley ethos of moving fast and
breaking things

~~~
dboreham
The moving fast breaking things guy came from the east coast.

------
joeblau
The thing I'm wondering about is the seniority of the dev team, and what the
code commit and code review process is. It seems like with a senior dev
reviewing commits, someone would have caught that redundant work was being
done.

I worked at a startup in the Mission for a few months and I remember seeing a
quadratic query that ran for every customer that was logged into our
application. The CEO and team lead wondered why our app worked great in Dev
(with only 5 users) but was terrible on premises (250 users). When I tried to
explain the issue, the two devs before me didn't really understand what I was
talking about. It was a quick refactor and caching solution that fixed the
problem, but it was clear that development there was still immature.

------
manishsharan
The issue with GCP and AWS providing limitless scalability is that bad code
gets a free ride, and we all occasionally write bad code. If the OP had tested
this on a database server with 1 CPU and 2GB of RAM, they would have caught it
quickly in early dev testing.

------
melan13
The company's success is good, but the SELECT ALL query is an even bigger
shame...

~~~
sitkack
It shouldn't be _possible_ to run select all in production.

------
paradite
So it's a classic N+1 query?

~~~
alexpi
It's more like (in the good old RDBMS world) 'SELECT * FROM payments' and
calculate the sum on the client, vs 'SELECT SUM(amount) FROM payments WHERE
some_id = ?'.

Though they already had this sum precalculated.

~~~
jasonkester
That would be more forgivable. This is:

    
    
    SELECT * FROM payments WHERE paymentID = 1
    SELECT * FROM payments WHERE paymentID = 2
    SELECT * FROM payments WHERE paymentID = 3
    ...
    SELECT * FROM payments WHERE paymentID = 14986
    

Each of those in its own API request over the wire, then sum them on the
client.

------
shusson
| How we spent $30k in Firebase in less than 72 hours

By not checking Google Billing after you launch your website. At the very
least you should have a billing alert.
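
Budget alerts can also be scripted. With a recent gcloud it looks roughly like
this (the ids are placeholders, and flag spellings vary between gcloud
versions):

    gcloud billing budgets create \
        --billing-account=XXXXXX-XXXXXX-XXXXXX \
        --display-name="monthly-cap" \
        --budget-amount=100USD \
        --threshold-rule=percent=0.5 \
        --threshold-rule=percent=0.9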

~~~
snowwolf
At the very least Google (and really all cloud services) could have reasonable
default billing alerts built in. And automatic spike detection.

I'd rather have default alerts already configured that I can change, rather
than none.

------
wazoox
And that's why people write applications to understand and track your public
cloud usage, like
[https://github.com/trackit/trackit](https://github.com/trackit/trackit) ...
Most people simply can't follow what's happening in their cloud apps.

------
Something1234
So what is their product? I can't read Spanish, and I don't use Chrome for
translation.

~~~
whoisjuan
A GoFundMe kind of platform, that allows groups of people to put funds towards
a common goal (like a party or whatever)...in this case they used it to donate
to a politician who had a huge debt after losing an election.

------
doglover7869
Wow, a complete shitshow which could've been avoided by consulting with one
experienced developer good at debugging. This kind of negligence must be
happening more often than we think, so there are lots of opportunities out
there.

------
andybak
Gosh. One for
[https://accidentallyquadratic.tumblr.com/](https://accidentallyquadratic.tumblr.com/)
with the added spice of being on usage billing...

------
oxguy3
This reads less like a "mistake", and more like just not giving any thought to
what your code is doing as you write it. A mistake would be if the programmer
meant to implement X but instead implemented Y; this sounds more like the
programmer just set out to implement Y without even considering other
possibilities.

Same for the framework change. An outdated framework might be a second or so
slower, but a >30-second load time was never going to be fixed by updating.
This is just bad problem-solving skills. When your app is taking 30 seconds to
load, you don't just guess at what might make it faster -- you open your JS
console and your log files, and you figure out where that time is being spent.
Two minutes in the Performance tab of Chrome's developer tools, and you would
have figured out the issue was on the back-end rather than the front-end.

~~~
Dylan16807
They calculated a number the wrong way. That's not a hard mistake to make. And
changing frameworks is silly but not really part of the problem.

------
wiradikusuma
If there's one thing Google App Engine has taught me, over the course of 5+
years using it:

Denormalize your data (or, optimize for read).

~~~
bsaul
Just a precision in case a junior dev read this comment : denormalizing has
nothing to do with the problem experienced by the OP. They just queried the
whole collection of paiements instead of precomputing a total everytime a new
paiement arrived.

Denormalizing would have required having a relation between collections. Here
there was just one (so there was nothing to denormalize).

Optimizing for reads ( or simply think about read performance) doesn’t require
denormalizing. It can also be a matter of creating an index, or precomputing
values in a cache, like in OP case.
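
A sketch of that precomputed-total approach in Firestore, done inside a
transaction so concurrent payments don't clobber each other (the names are
invented for illustration):

    async function addPayment(db, vakiId, amount) {
      const vakiRef = db.collection('vakis').doc(vakiId);
      const paymentRef = vakiRef.collection('payments').doc();
      await db.runTransaction(async tx => {
        const vaki = await tx.get(vakiRef); // reads must precede writes
        const data = vaki.data() || {};
        tx.set(paymentRef, { amount, createdAt: Date.now() });
        tx.update(vakiRef, {
          total: (data.total || 0) + amount,
          supporters: (data.supporters || 0) + 1,
        });
      });
    }

    // A page view now reads one vaki document instead of every payment.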

------
danielhlockard
My only issue - Where were the metrics around firebase request count? That
could have been a simple dashboard.

------
newscracker
It seems like they:

* didn't have good testing

* did no load testing beforehand

* had no code reviews done

* had no design reviews done

* had few or no (useful) application logs

* had no change control mechanisms defined or followed (upgrading a framework in production in a matter of minutes or hours as a way to wing it and pray for it to work out?)

* had no or very few automated tests

* didn't have a detailed post mortem or root cause analysis to see what they could do to prevent it (the ending looked quite amateurish, by pointing to only one thing as a potential lesson)

* wasted a lot of money that could've helped in the future (by instead using it to pay for an error)

If I were on such a team, I would've honestly stated in such an article how
deeply ashamed I am that we missed all these things, how "cowboy coding" and
heroics must never be glorified, how we got very, very lucky in someone else
waiving the charges (this is not a luxury that most startups or one-person
endeavors would have), and ended by asking for advice on what could be done to
improve things (since it's obvious there were many more gaps than just how a
few lines of code were written).

To the team that wrote this code and this article: adopt some software
development methodology (any, actually) and find some people who can help you
follow it. Also read the rest of the comments here. You got
very, very lucky in this instance. It may not be the same case again, and you
may see your "life's work" get killed because you didn't really learn.

~~~
Oras
If they had half of the above they wouldn't use Firebase. Don't get me wrong,
I think it is a great product and I would use it for an MVP, but the point of
using Firebase is to get something scalable easily and quickly by focusing on
the frontend only. You can tell from their screenshots that they were only
watching GA metrics and didn't check the Firebase console.

------
catchmeifyoucan
Go serverless boys :)

~~~
fefb
It won't resolve the problem

~~~
catchmeifyoucan
AWS Lambda costs about 20 cents per million requests. A billion requests would
have come to a couple hundred dollars. Factor in a few additional read units
for DynamoDB, and I still don't see it totaling $30,000. A serverless
architecture should scale as well.

~~~
fefb
They are using serverless architecture. Firestore is a "Daas". The latency for
loading the page wasn't the real problem, it was the side-effect. The real
problem resulted in 40bilion request, they were requesting all the collection
of payments in each session. It reaches 16K documents downloaded /session, so
they were paying 16K reads documents for session. Firestore gives you 50K
reads/day for free, and ask $0.06 for 100K reads. So It was a really bad
logical approaching from them. If they did in a best way , like update a
document that holds the two accumulators : total money accumulated and total
payments created, and read just this document in each session request their
bill to show this information for 2M session should be less than $2, because
you are reading 1 document for session instead of 16K documents. And pay
attention that their system wasn't down, probably the 30s latency was a side-
effect of downloading 16K document of each payment information in the client
side.
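
Running those numbers as a back-of-envelope check (rates as quoted above; the
article's "more than 40 billion" figure presumably includes other traffic):

    const sessions = 2e6;             // ~2 million sessions
    const pricePer100kReads = 0.06;   // USD, Firestore read pricing

    // Naive: read the whole 16K-document payments collection per session.
    const naiveReads = sessions * 16000;                      // 3.2e10 reads
    const naiveCost = (naiveReads / 1e5) * pricePer100kReads; // ~$19,200

    // Accumulator document: one read per session.
    const counterReads = sessions;                                // 2e6 reads
    const counterCost = (counterReads / 1e5) * pricePer100kReads; // ~$1.20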

------
mirimir
But wait, the image shows ~$600 million. So $30K is small.

Maybe that ~$600 million isn't in USD?

~~~
rosser
The first sentence in The Fine Article mentions Colombia, whose Peso uses "$"
as its symbol, and currently has an exchange rate of ~2900:1 USD.

~~~
mirimir
Thanks. And I did miss the "COP" in front of the $ amount :(

