I worked at Firebase for many years, and the concerns with security rules have always plagued the product. We tried a lot of approaches (self-expiring default rules, more education, etc.), but at the end of the day we still see a lot of insecure databases.
I think the reasons for this are complex.
First, security rules as implemented by Firebase are still a novel concept. A new dev joining a team who adds data into an existing location probably won’t go back and update the rules to reflect that the privacy requirements of that data have changed.
Second, without the security through obscurity created by random in-house backend implementations, scanning en masse becomes easier.
Finally, security rules are just hard. Especially for Realtime Database, they are hard to write and don’t scale well. This matters less than you’d think here, though: automated scanning just looks for open data, so anything beyond “read write true” (as we called it) would have prevented this.
Technically there is nothing wrong with the Firebase approach, but because it is one of the only backends that uses this model (one built around stored data and security rules), it opens itself up to misunderstanding, improper use, and issues like this.
To be honest I've always found the model of a frontend being able to write data into a database highly suspect, even with security rules.
Unlike a backend, where the rules for validation and security are visible and part of the specification, Firebase's security rules are something one can easily forget: they live in a separate process and have to be re-evaluated as part of every new feature developed.
Yeah, I've never understood how this concept can work for most applications. In everything I build I always need to do something with the input before writing it to the database. Security rules alone are not enough.
What kind of apps are people building where you don't need backend logic?
I think I missed where writing to the database precludes backend logic. Databases have triggers and integrity rules, but beyond that, why can't logic execute after data is written to a database?
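For what it's worth, here is a minimal sketch of that post-write flow using a Cloud Functions Firestore trigger; the collection and field names ("comments", "text", "hidden") and the moderation check are made up for illustration:

```typescript
// Post-write logic: a trigger that runs after the client's write has landed.
import * as functions from "firebase-functions";

// Stand-in for whatever moderation check you actually run
// (word lists, an external API, a perceptual-hash service, etc.).
async function violatesPolicy(text: string): Promise<boolean> {
  return /badword/i.test(text);
}

export const moderateComment = functions.firestore
  .document("comments/{commentId}")
  .onCreate(async (snap) => {
    const text = (snap.data()?.text as string) ?? "";
    if (await violatesPolicy(text)) {
      // The record existed (and was potentially readable) until this point,
      // which is exactly the window debated in the replies below.
      await snap.ref.update({ hidden: true });
    }
  });
```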
Because once it is written to the database, it can be output somewhere before you execute your logic, e.g. explicit language, child porn, etc. You generally want to check for that BEFORE you write the data.
You're saying it's impossible to have public write access to a table without also providing public read access?
"it can be output somewhere before you execute your logic" is a design choice that is orthogonal from whether you execute your logic before or after input into the database.
First of all, most database records couldn't fit child porn, unless it was somehow encoded across thousands of records, in which case you couldn't realize it was child porn until after you've stored 99% of it.
Sure though, by putting "child porn" in a sentence, you can make anything seem bad. Tell me this, would you rather your application middleware was in the "copying child porn" business? ;-)
Actually, the more I think about it, the crazier this seems. You're going to store all the "child porn" you receive in RAM until you've validated that it is child porn?
I don’t get your tone or why you seem shocked that binary data can be stored in a database. Postgres and MySQL both have binary column types that can hold gigabytes.
Second, you generally need to hold the entire image in RAM to create the perceptual hash needed to check that the image is/isn’t child porn.
> I don’t get your tone or why you seem shocked that binary data can be stored in a database. Postgres and MySQL both have binary column types that can hold gigabytes.
My tone is shocked, because what you're describing seems totally removed from any system I've seen, and I've implemented a ton of systems. For performance reasons, you want to stream large uploads to storage (web servers, like nginx, are typically configured to do this even before the request is sent to any application logic). You invariably want to store UGC data that conforms to your schema, even if you're going to reject it for content. There's a whole process for contesting, reviewing and reversing decisions that requires the data be in persistent storage.
I think you misunderstood what I said. Yes, Postgres, MySQL and a variety of other databases have binary column types that can hold gigabytes. What I wouldn't agree with is that most database records can hold gigabytes, binary or otherwise. Heck, most database records aren't populated from UGC sources, let alone UGC sources where child porn is a risk.
But okay, let's assume, for argument's sake, that most database records are happily accepting 4TB large objects, and you're accepting up to 4TB uploads (where Postgres' large objects max out). Do all your web & application servers have 4TB of memory? What if you're processing more than one request at once, do you have N*4TB of memory?
At least all the systems I've implemented that receive data from users enforce limits on request sizes, and with the exception of file uploads, which are typically directly streamed to the filesystem before processing, those limits tend to be quite small, often less than a kilobyte. Maybe someone could write some really terse child porn prose and compress it down to fit in that space, but pretty much any image would have to be spread across many records. By design, almost any child porn received would be put in persistent storage before being identified as such.
> Second, you generally need to hold the entire image in RAM to create the perceptual hash needed to check that the image is/isn’t child porn.
This is one of many reasons that you generally want to stream file uploads to storage before performing analysis. Otherwise you're incredibly vulnerable to a DoS attack on your active memory resources. Even without a DoS attack, you're harming performance by unnecessarily evicting pages that could be used for caching/buffering for bytes that won't be served at least until you've finished receiving all the file's data.
[Note: Many media encodings tend to store neighbouring pixels together, so you can, conceptually, compute a perceptual hash progressively, without loading the entire file into active memory, which is often desirable, particularly with video content.]
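To make the streaming point concrete, here is a minimal Node sketch; the paths and port are made up, and a real server would also enforce size limits and authentication:

```typescript
// Stream an upload straight to disk so the full payload never sits in RAM;
// analysis happens later from a queue worker reading the files.
import { createServer } from "node:http";
import { createWriteStream, mkdirSync } from "node:fs";
import { pipeline } from "node:stream/promises";
import { randomUUID } from "node:crypto";

mkdirSync("/tmp/uploads", { recursive: true });

const server = createServer(async (req, res) => {
  if (req.method !== "POST") {
    res.writeHead(405).end();
    return;
  }
  const tmpPath = `/tmp/uploads/${randomUUID()}`;
  try {
    // pipeline() respects backpressure: only small chunks are in memory at once.
    await pipeline(req, createWriteStream(tmpPath));
    // Hand off to a worker queue for hashing/moderation after the fact.
    res.writeHead(202, { "content-type": "application/json" });
    res.end(JSON.stringify({ queued: tmpPath }));
  } catch {
    res.writeHead(500).end();
  }
});

server.listen(8080);
```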
Thought about it some more... this whole scenario makes sense in only the narrowest of contexts. Very few applications directly serve UGC to the public, and a lot of applications are B2B. You're authenticated, and there's a link to your employer (or to you, if you're self-employed). Uploaded data isn't made visible to the public. Services are often limited to a legal jurisdiction. If you want to upload your unencrypted child porn to a record in Google's Firebase database, go ahead. The feds could use some easy cases.
There's little point in not writing it to disk; the question of holding it in RAM vs writing a file to disk is moot. You've got to handle it, and the best way of handling that kind of thing at scale is to write it to temporary disk and then have a queue process work over the files doing the analysis.
No serious authority is going to hang you for having illegal UGC in storage while you process it. Heck, you can even allow stuff to go straight to publicly accessible if you have robust mechanisms for matching and reporting. The authorities won't take a hard line against a platform which is open to the public as long as it has the right mitigations in place. And they won't immediately blame you unless you act as a safe haven.
A sensible architectural pattern for binary UGC upload data would plan to put it in object storage and then deal with it from there.
I have never in my life written a "child porn validator" that restricts files uploaded by users to "non child porn". This sounds nontrivial and futile (every bad file can also be stored as a zip file with a password). It sounds like an example of a "think of the children" fallacy.
I also find the Firebase model weird (though I haven't used it yet), but not for the child porn reasons.
Writing directly to Firebase is rarely done past the MVP stage. Normally it's the reading which is done directly from the client. Generally writes are bounced through Cloud Functions or a traditional server of some form. Some also "fan out" data, where a user has a private area to write to (say, a list of tweets), and these then get "fanned out" to followers' timelines via an async backend process which does any verification/cleansing as needed.
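A hedged sketch of that fan-out as a Realtime Database trigger in Cloud Functions; the paths, field names, and the 280-character check are all assumptions:

```typescript
// Fan-out: the user writes a tweet into their own private area; this trigger
// validates it and copies it into each follower's timeline.
import * as functions from "firebase-functions";
import * as admin from "firebase-admin";

admin.initializeApp();

export const fanOutTweet = functions.database
  .ref("/users/{uid}/tweets/{tweetId}")
  .onCreate(async (snap, context) => {
    const tweet = snap.val();
    // Server-side verification / cleansing the client can't bypass.
    if (typeof tweet?.text !== "string" || tweet.text.length > 280) {
      return snap.ref.remove();
    }

    const followers = await admin
      .database()
      .ref(`/followers/${context.params.uid}`)
      .once("value");

    const updates: Record<string, unknown> = {};
    followers.forEach((f) => {
      updates[`/timelines/${f.key}/${context.params.tweetId}`] = tweet;
    });
    // One multi-location update applies the whole fan-out atomically.
    return admin.database().ref().update(updates);
  });
```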
Context: I have a near-100% naive perspective. Mobile dev who's built out something approximating Perplexity on Supabase. I have to use edge functions for e.g. CORS, but by and large, logic is all in the app.
Probably because the client is in Flutter, and thus multiplatform & web in one, I see manipulating the input on both the client and the server as code duplication and error-prone.
I think if I was writing separate native apps, I'd push everything through edge functions, approximating your point: better to have that sensitive logic of what exactly is committed to the DB in one place.
Our experience has been very different. Our Firebase security rules are locked down tight, so any new properties or collections need to be added explicitly for a new feature to work — it can't be "forgotten". Doing so requires editing the security rules file, which immediately invites strict scrutiny of the changed rules during code review.
This is much better than trying to figure out which bits are security-critical in a potentially large request handler server-side. It also lets you do a full audit much more easily if needed.
Are you suggesting that it's essentially too easy for a dev to just set and forget? That's a pretty interesting viewpoint. Not sure how any BaaS could solve that human factor.
Say you add a super_secret_internal_notes field. If you're writing a traditional backend, some human would need to explicitly add that to a list of publicly available fields somewhere (well, hopefully). For systems like Firebase, it's far too easy to have this field be created by frontend code that's just treating this as another piece of data in a nested part of a payload. But this can also happen on any system, if you have any JSON blob whose implicit schema can be added to by frontend development alone.
IMO implicit schema updates on any system should be consolidated and lifted to an easily emailed report - a security manager/CSO/CTO should be able to see all the super_secret_internal_notes as they're added across the org, and be able to immediately rectify security policies (perhaps even in a staging environment).
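As a sketch of such a report (the allowlist, collection name, and the idea of diffing live documents against it are illustrative assumptions, not an existing Firebase feature):

```typescript
// Diff the fields actually present in a collection against an allowlisted
// schema and surface any surprises (like super_secret_internal_notes).
import * as admin from "firebase-admin";

admin.initializeApp();

const KNOWN_FIELDS: Record<string, Set<string>> = {
  users: new Set(["name", "email", "createdAt"]),
};

async function reportImplicitSchemaAdditions(collection: string) {
  const known = KNOWN_FIELDS[collection] ?? new Set<string>();
  const unexpected = new Set<string>();

  const docs = await admin.firestore().collection(collection).get();
  docs.forEach((doc) => {
    for (const field of Object.keys(doc.data())) {
      if (!known.has(field)) unexpected.add(field);
    }
  });

  if (unexpected.size > 0) {
    // In practice you'd email this to a security owner; logging stands in.
    console.warn(`[schema-report] ${collection}:`, [...unexpected]);
  }
}

reportImplicitSchemaAdditions("users").catch(console.error);
```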
(Also, while tongue in cheek, the way that the intro to a part of Firebase's training materials https://www.youtube.com/watch?v=eMa0hsHqfHU implicitly centers security as part of the launch process, not something ongoing, is indicative of how pervasive the issue is - and not at all something that's restricted to Firebase!)
Generally agreed on improved audit logs of some form helping.
Re training materials, this is one of the mitigations we launched to attempt to pull security to front of mind. I do not really think this is a Firebase problem, I think average developers (or average business leaders) just don't, in general, think much about security. As a result, Firebase materials have a triple burden - they need to get you to think about security, they need to get you to disrupt the most "productive" flow to write rules, and they need to get you to consistently revisit your rules throughout development. This is a lot to get into someone's head.
For all the awesomeness of Firebase's databases, they're both ripe footgun territory (Realtime Database specifically). Our original goal was to make the easiest database to get up and running with, which I think we did, but that initial ease comes with costs down the road which may or may not be worth it; that's a decision for the consumer.
You could do away with the model of the frontend writing to the DB and ask customers to implement a small backend with a serverless component like AWS Lambda or Google Cloud Functions.
Barring that, perhaps Firestore could introduce the concept of a "lightweight database function hook" akin to Cloudflare workers that runs in the lifecycle of a DB request, thus formalizing the security requirements specific to the business requirement and causing the development organization to allocate resources to its upkeep.
So while a security rule usually gets tested very lightly, you'd see far more testing in a code component like the one I'm suggesting.
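To make the proposal concrete, here is a purely hypothetical sketch; none of these names exist in any Firebase SDK today:

```typescript
// Purely hypothetical: what a "lightweight database function hook" might look
// like if Firestore offered one, running in the lifecycle of the write.
type WriteRequest = {
  auth: { uid: string } | null;
  path: string;
  data: Record<string, unknown>;
};

type HookResult =
  | { allow: true; data: Record<string, unknown> }
  | { allow: false; reason: string };

// Would run server-side, before the write commits.
export function onBeforeWrite(req: WriteRequest): HookResult {
  if (!req.auth) return { allow: false, reason: "unauthenticated" };

  const text = req.data["text"];
  if (typeof text !== "string" || text.length > 1000) {
    return { allow: false, reason: "invalid text" };
  }
  // Normalization the client can't skip: the business rule lives here,
  // in reviewable, testable code rather than in a rules DSL.
  return { allow: true, data: { ...req.data, text: text.trim() } };
}
```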
> Barring that, perhaps Firestore could introduce the concept of a "lightweight database function hook" akin to Cloudflare workers that runs in the lifecycle of a DB request, thus formalizing the security requirements specific to the business requirement and causing the development organization to allocate resources to its upkeep.
I think it's more that there's more surface area to forget when you have humans handling so many concerns, and security is not likely the part that changes the most, so it's a likely candidate for being "pushed out of the buffer" (of the human).
In a more typical model, backend devs focus more on security, while not needing to know the frontend, and vice versa.
The concept behind Firebase DBs is flawed IMO. I never got the point of directly accessing a DB from the frontend, or of allowing that even with security rules; it just seems like it would cause problems.
We tried to contact Google via support, to try to help or to get them to help disclose the issues to the websites. We got no response other than a reply telling us that they would create a feature request on our behalf if we wanted, instead of helping us. Which is fair, as I think we'd have to escalate pretty far up in Firebase to get the attention of someone who could alert project owners.
One of the things we fought for, for years after the acquisition, was to maintain a qualified staff of full-time, highly paid support people who were capable of identifying and escalating issues like this with common sense.
This is a battle we slowly lost. It started with all of support being the original team, then went to 3-4 full-time staff plus some contractors, then to entirely contractors (as far as I’m aware).
This was a big sticking point for me. I told them I did not believe we should outsource support, but they did not believe we should have support for developer products at all, so I lost to that “compromise.” After that I volunteered to do the training of the support teams myself, which involved traveling to Manila, Japan and Mexico regularly. This did help, but like support as a whole, it was a losing battle and quality has declined over time.
Your experience is definitely expected and perhaps even by design. Sadly this is true across Google, if you want help you’d best know a Googler.
I suspect it is going to end up being Google's downfall, or at least, be part of it.
They simply don't know humans. Their repeated failures at building social networks are evidence enough. They always try to keep the human out of the loop, which, to be fair, worked for them in the early days, as their search engine was better than those that relied on human-made directories. But now it is becoming ridiculous. It is a company of bots, for bots. And when they need humans for some reason, they take away most of the value those humans could add with rigid frameworks, basically treating them like bots. They pay hundreds of thousands not to people who are competent and trustworthy enough to provide the best service, but to people who write bots to provide mediocre service.
I believe that at some point, a startup that understands humans will eat them up, bit by bit, by feeding on dissatisfied customers who don't want to deal with stupid bots.
"The bigger they are, the harder they fall" is a saying for a reason. There is no such thing as "too big to fail"; otherwise the East India Trading Company would still be in operation.
Sometimes. IBM was still considered big when Buffett invested in them early in the 2010s. And it took almost a decade's worth of bad performance for him to finally exit. It might be slowly sliding into irrelevance, but its stock hasn't completely tanked -- during or after that period.
Looking at https://firebase.google.com/docs/rules/basics, would it be practical to have a "simple security mode" where you can only select from preset security rule templates? (like "Content-owner only" access or "Attribute-based and Role-based" access from the article) Do most apps need really custom rules or they tend to follow similar patterns that would be covered by templates?
A big problem with writing security rules is that almost any mistake is going to be a security problem so you really don't want to touch it if you don't have to. It's also really obvious when the security rules are locked down too much because your app won't function, but really non-obvious when the security rules are too open unless you probe for too much access.
Related idea: force the dev to write test case examples for each security rule where the security rule will deny access.
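That idea is already expressible with the @firebase/rules-unit-testing package (it does require the local emulator to be running, which is its own friction). A sketch, with the project id, rules file path and document paths assumed:

```typescript
// Assert that the rules DENY what they should, not just allow what they must.
import { readFileSync } from "node:fs";
import {
  initializeTestEnvironment,
  assertFails,
  assertSucceeds,
} from "@firebase/rules-unit-testing";

async function run() {
  const env = await initializeTestEnvironment({
    projectId: "demo-rules-test",
    firestore: { rules: readFileSync("firestore.rules", "utf8") },
  });

  // An unauthenticated client must NOT be able to read another user's doc.
  const anon = env.unauthenticatedContext().firestore();
  await assertFails(anon.collection("users").doc("alice").get());

  // The owner herself must still get through (catches over-locked rules).
  const alice = env.authenticatedContext("alice").firestore();
  await assertSucceeds(alice.collection("users").doc("alice").get());

  await env.cleanup();
}

run().catch(console.error);
```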
One simple trick helped us a lot: we have a rules transpiler (fireplan) that adds a default "$other": {".read": false, ".write": false} rule to _every_ property. This makes it so that any new fields must be added explicitly, making it all but impossible to unknowingly "inherit" an existing rule for new values. (If you do need a more permissive schema in some places you can override this, of course.)
Our use of Firebase dates back 10+ years so maybe the modern rules tools also do this, I don't know.
What would really help us, though, would be:
1. Built-in support for renaming fields / restructuring data in the face of a range of client versions over which we have little control. As it is, it's really hard to make any non-backwards-compatible changes to the schema.
2. Some way to write lightweight tests for the rules that avoids bringing up a database (emulated or otherwise).
3. Better debugging information when rules fail in production. IMHO every failure should be logged along with _all_ the values accessed by the rule, otherwise it's very hard to debug transient failures caused by changing data.
I've been an advocate for Firebase and Firestore for a while — but I agree with all of the points above.
It's a conceptual model that is not sufficiently explained. How we talk about it on our own projects is that each collection should have a conceptual security profile, i.e. is it public, user data, public-but-auth-only, admin-only, etc., and then use the security rule functions to enforce these categories — instead of writing a bespoke set of conditions for each collection.
Thinking about security per-collection instead of per-field mitigates mixing security intent on a single document. If the collection is public, it should not contain any fields that are not public, etc. Firestore triggers can help replicate data as needed from sensitive contexts to public contexts (but never back.)
The problem with this approach is that we need to document the intent of the rules outside of the rules themselves, which makes it easy to incorrectly apply the rules. In the past, writing tests was also a pain — but that has improved a lot.
It's not that difficult to build the scanner into the Firebase dashboard. Ask the developer to provide their website address, do a basic scan to find the common vulnerability cases, and warn them.
Firebase does that, the problem is "warning them" isn't as simple as it sounds. Developers ignore automated emails and they rarely if ever open the dashboard. Figuring out how to contact the developers using the platform (and get them to care) has been an issue with every developer tool I've worked on.
It also makes portability a pain. Switching from an app with Firebase calls littered through the frontend and data consistency issues to something like Postgres is a lengthy process.
Firebase attracts teams that don’t have the experience to stand up a traditional database - which at this point is a much lower bar thanks to tools like RDS. That is a giant strobing red light of a warning for what security expectations should be for the average setup. No matter what genius features the Firebase team may create, this was always going to be a support and education battle that Google wasn’t going to fully commit to.
At Steelhead we use RLS (row-level security) to secure a multi-tenant Postgres DB. The coolest check we do: create a new tenant and run a db dump with RLS enabled, then ensure the dump is empty. That validates all security policies in one fell swoop.
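A sketch of what that check might look like driven from a test script; the connection string, role naming, and the dump-parsing heuristic are all assumptions:

```typescript
// Dump a fresh tenant's data as that tenant's role with RLS enforced and
// assert nothing comes out.
import { execFileSync } from "node:child_process";

function assertTenantDumpIsEmpty(tenantRole: string) {
  // --enable-row-security makes pg_dump respect RLS policies; dumping as the
  // brand-new tenant's role should therefore yield zero rows.
  const dump = execFileSync(
    "pg_dump",
    [
      "--data-only",
      "--enable-row-security",
      `--role=${tenantRole}`,
      "postgres://localhost/app",
    ],
    { encoding: "utf8" }
  );

  // Data rows in a plain-format dump sit between "COPY ... FROM stdin;" and
  // the terminating "\.", so a clean dump has no COPY block with content.
  if (/^COPY .* FROM stdin;\n(?!\\\.)/m.test(dump)) {
    throw new Error(`RLS leak: dump for ${tenantRole} is not empty`);
  }
}

assertTenantDumpIsEmpty("tenant_fresh_001");
```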
The security rules are where I fell out of love with Firebase. Not that there is anything wrong with the security itself, but up until the point of having to write those security rules, the product experience felt magical: so easy to use, only one app to maintain, pretty much.
But with the Firebase security rules, I pretty much had half of a server implemented to get the rules working properly, especially for more complex lookups. And for those rules, the tooling simply wasn't as good as using TypeScript or the like.
I haven't used Firebase in years though, so I don't know if it has gotten easier.
Firebase needs something like RLS (row-level security). It needs to be really easy to write authorization rules in the database, in SQL (or similar), if you're going to have apps that directly access the database instead of accessing it via a proxy that implements authorization rules.
I don't see the comment arguing for that at all, and I don't think the analogy to crop monocultures being more vulnerable to pests really holds.
There are good reasons we dismiss "security through obscurity" as invalid, and just because "structural diversity" makes automated scanning harder doesn't mean it can't be done. See Shodan.
The idea as I (who is not GP) see it is not that diversity makes scanning harder, it’s that it makes the blast radius smaller. Notably, though, that means we have to be talking about diversity of implementations, not just deployments—numerous deployments of just a few pieces of software can be problematic in their own ways, and of course there have been bugs with huge consequences in Apache, MSRPC, or—dare I say it—sendmail since the very earliest days.
I view the issue as more of a poor UX choice than anything else. Firebase's interface consists entirely of user-friendly sliders and toggles EXCEPT for the security rules, which is just a flimsy config file. I can understand why newer devs might avoid editing the rules as much as possible and set the bare minimum required to make warnings go away, regardless of whether they're actually secure or not.
There should be a more graphical and user-friendly way to set security rules, and devs should be REQUIRED to recheck and confirm them before any other changes can be applied.
We decided to make a shared blog because we will likely have other projects we do together, so all of us posting about the same topic on our personal blogs would be counterproductive.
Many businesses don’t have full time developers. They contract out to agencies who build the website for them. The agencies have a rotating cast of developers and after the initial encounter with their good devs they try to rotate the least experienced developers into handling the contract (unless the company complains, which many don’t).
The vulnerability emails probably got dismissed as spam, or forwarded on and ignored, or they’re caught in some PM’s queue of things to schedule meetings about with the client so they can bill as much as possible to fix it.
> Some days I think one ought to be licensed to touch a computer.
There are plenty of examples of fields where professional licensing is mandatory but you can still find large numbers of incompetent licensed people anyway. Medical doctors have massive education and licensing requirements, but there is no shortage of quack doctors and licensed alternative medicine practitioners anyway.
Sadly, this is true, and there are probably many more. We did our best: sent customized emails to each of them, telling them what was affected, how to fix it, and how to get in contact.
It seems reasonable to assume that the exposed information has already fallen into the wrong hands. Might as well post the list at this point (or at some point, at least) so that any users of those sites can become aware, no?
Shouldn't encrypting all database records be the only sane, safe and legal solution, with the decryption key sent to law enforcement local to the website owner when site owners aren't responsive?
Not saying you should do that given the current state of the laws.
This is the inevitable outcome of picking cheap-fast from the cheap-fast-good PM triangle. Unfortunately for some customers/users, their concerns were left out of the conversation and their PII is the cost.
I’d be wary of any company listed here that made that decision and hasn’t changed leadership, as it has been proven time and time again that many companies simply don’t care about customers enough to protect them. History repeats itself.
I have a very basic Firebase question: are most of the apps described in this post implemented entirely as statically hosted client-side JavaScript with no custom server-side code at all - the backend is 100% a hosted-by-Google Firebase configuration?
If so, I hadn't realized how common that architecture had become for sites with millions of users.
Yeah. Either entirely client-side or passing through a server naively. This is the inevitable result of having an "allow by default" security model in an API. Unfortunately, insecure defaults are a common theme with libraries targeted at JavaScript developers. GraphQL is another area I would expect to see these kinds of issues.
Somehow my assumption is that it will only get worse from here, with AI agents looking for exploits etc. much more efficiently than bots? A weird future awaits.
It’s not enough; make sure to use a unique email for each service you sign up for. This limits the damage in case of an incident and protects your privacy, as no one can perform OSINT on you to cross reference other services. Additionally, I’ve found that sometimes you can detect a site breach before the owners do when you receive a malicious email sent to that unique address.
It isn’t a hassle: you don’t create a separate email account, just an alias.
> Do you know or recommend a service for this thats easy and fast to use?
Honestly I don’t like to promote any commercial services, but there are a few out there that automate it: SimpleLogin (I believe you can host it yourself too), AnonAddy, or Fastmail, which has an integration with 1Password so your password manager generates a random password and an email alias automatically for you. There are more, and it’s better to research it yourself to find the best solution for you. Again, my post sounds like shilling for these products, but it’s a good start.
>Do you know or recommend a service for this thats easy and fast to use?
I use a Google account with a catch-all domain: {Website/Store} @ short domain, all going to the same mailbox. I reply from name @ a different domain, though.
The benefit is that when I give it over the phone it's easy to say "StoreName"@ rather than spelling my name or using another, longer email.
DuckDuckGo and Mozilla have similar products. And if your email provider allows the "plus trick", as Gmail does, you should at the very least use that.
> Turns out that a Python program with ~500 threads will start to chew up memory over time.
Anyone have more info about this issue? I've got a scraper myself in Python with a few hundred threads which seems to eat a lot of memory. Any workarounds or is the only solution to rewrite in another language?
You can do it in Python, but you have to dig into how Python does reference counting and how that interacts with threads.
Personally I prefer using processes rather than threads, with a worker pool and a message bus rather than shared memory. That solution has its own drawbacks (and a bit more overhead), but you don't need to worry so much about memory issues. Processes also seem a better match for crawlers, since the number of processes will be fairly constant and the work the processes do is fairly independent.
I’d be interested to know how you’re coming to the conclusion that the number of affected users is likely higher. From the looks of it, I’d suspect that at least some of the sites you mention (gambling, lead carrot) are littered with fake account data.
When manually reviewing a lot of these sites, we found the scanner was not identifying PII in non-English fields, since the automated scanner checks the variable name for known data types (e.g. phone), and that only works for English sites.
That customer support reply looked like an automated AI response...
But I’m not surprised at the scale; years ago the same thing happened with the AWS cloud XY service, and you would find the token literally in plaintext in millions of smartphone apps.
Definitely. They even let you export the password hashes (which you should do carefully). You can then import them into any identity provider that supports modified scrypt[0]. Your users will continue to be able to log in without a password reset.
We only scanned for Firestore, which is a NoSQL database; conversion tools may still be possible. A good Firebase alternative would be https://supabase.com, but please set up RLS; it's IMO much easier than Firebase.
Wouldn't it be easy to migrate a NoSQL database to Postgres without any adaptation?
Postgres is an "object database" of sorts, so you could use array, JSON or JSONB fields wherever necessary, and you wouldn't need to introduce any foreign key relations or the like.
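As a sketch of that kind of no-adaptation migration (table/collection names and the connection string are assumptions):

```typescript
// Move a Firestore collection into Postgres as-is, one JSONB column per
// document, so no relational schema design is needed up front.
import * as admin from "firebase-admin";
import { Client } from "pg";

admin.initializeApp();

async function migrate(collection: string) {
  const pgClient = new Client({ connectionString: "postgres://localhost/app" });
  await pgClient.connect();
  // Collection name is trusted input here; don't interpolate user data.
  await pgClient.query(
    `CREATE TABLE IF NOT EXISTS ${collection} (id text PRIMARY KEY, doc jsonb NOT NULL)`
  );

  const snapshot = await admin.firestore().collection(collection).get();
  for (const doc of snapshot.docs) {
    // The whole document lands in one JSONB value; query it with -> / ->>.
    await pgClient.query(
      `INSERT INTO ${collection} (id, doc) VALUES ($1, $2)
       ON CONFLICT (id) DO UPDATE SET doc = EXCLUDED.doc`,
      [doc.id, JSON.stringify(doc.data())]
    );
  }
  await pgClient.end();
}

migrate("users").catch(console.error);
```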
Because writing the boilerplate that hooks up the DB to an ORM Model to a ViewModel to a Router and then back via a Controller to the Model to the DB again sucks? A lot of the time it's equivalent to writing manual getters and setters, except it's many many lines of code over many files... No wonder people are trying to cut corners!
Laughed out loud at the "Customer support tried to flirt with me when attempting to report the issue": "I want to be your gf, you very smart". Lol, the screenshot looks like Kik Messenger?
We believe the gambling ring is based in Indonesia, where using Line is uncommon, but they seem to be using it here for all of their customer support across all sites.
TBH, that's still better than most of my interactions with customer support when I tried to report bugs. In most cases, they just don't have a script for this and have no idea what to do about it, even if I manage to convince them it's the app that's wrong, not me.
It has an "angry vein" on its head, it's threatening (someone) with the bat.
I have no idea what they're trying to communicate lol. In my experience Line stickers are often used the same way "huh" is, when you don't know wtf else to say but you'd like to terminate the conversation somehow.
The biggest threat on the web is Google.
They lower the bar so low that people who have no business collecting user information are collecting user information, then they host it insecurely for you with no liability to the end user whatsoever.
Not only that but they provide the same crappy services to schools and scummy gambling websites alike.
It frustrates me watching people who believe they are professionals flock to these services. Honestly, if you can't roll your own, you probably shouldn't let someone roll this for you. But Google won't say no, and none of you cloud devs can help yourselves. So we have this race to the bottom on cost and first-to-market, and all the products are least-common-denominator shit that gets built in 6 hours by copy-pasting as many GH repositories together as possible on rented infra.