Anonymizing data to convince the public their information is safe instead of gathering accurate data is exactly the kind of inefficiency that must be expelled from government.
I worked the census out of a sense of democratic duty. Overwhelmingly, the people I interviewed thought it was completely useless, that the "government already knew", or that I was secretly the IRS. There were multiple instances of door slamming, hostile threats, and people who thought that I was somehow the representative of the president of the United States. A guy ran me down in the road to tell me I "ain't no fbi agent" after I left a census flyer on his door. The Census Bureau was almost trying to get us killed at this point, since we were not allowed to carry anything for self defense. They wanted us to go to known meth labs, quote: "anecdotal comments are not justification."
Multiple others on our team had a gun waved at them when they went to do reinterviews (checking others' work) because the original worker got that information from a neighbor to avoid the threat. But the census system doesn't allow you to do that without attempting the actual location a number of times (annoying them).
There is one solution to all of this, and that is to return to the constitutional requirements. The single number of how many people live there. Nothing about money or jobs or anything else. But I'm afraid the census has already destroyed that opportunity, the 2010 "long forms" are now forever associated with the census.
There wasn’t a 2010 long form. The 2010 census had about 10 questions: sex, race, age, ethnicity, as well as your relationship to the primary homeowner and whether the home was owned or rented. Nothing about money or jobs.
There used to be a long form in 2000 and before, but this was spun off around 2005 to become a separate annual sample-based survey called the American Community Survey.
Yes, 2000, my mistake. A relative of mine did this in 2000 and before. It was quite invasive, requiring information about the number and types of rooms in households and plenty of other information people were not comfortable giving.
The solution to this is to not ask, and instead aggregate this data from open sources like the local tax assessor, or closed paid sources like the MLS or similar real estate transaction corpus.
This accomplishes whatever the objective is without putting census takers in danger due to unruly/belligerent interviewees.
This sounds very much like a society in slow collapse.
There is this book that I have difficulty getting through just because it depresses me so much. Your story reminded me of some of the issues discussed in this book.
I'll put it out there. Highly recommended if you want a sobering take: America: The Farewell Tour by Chris Hedges.
If you have already embraced the meme that “America is slowly collapsing” then your confirmation bias can make almost anything look like evidence of collapse.
Personally, I don’t see anything like that at play here.
It seems to me that if you went around asking invasive questions like this at any time in the country’s history you could have expected some subset of the population to react in the way that OP described.
In fact, given that the census conducts door-to-door interviews when people have not already voluntarily submitted the census form, it is to be expected that a good percentage of people interviewed just don’t want to answer the questions, and probably feel hostile towards the idea of cooperating with the census. That doesn’t mean that the society as a whole is hostile towards the census. There’s an overwhelming selection bias at play here.
I don't think I have embraced it, in fact I try to find reasons to consider the opposite just to keep me feeling in a positive mood. There is nothing else you can do when you have no control over the outcome.
The reason I brought up the book is because the stories told by the OP bear a very close resemblance to stories told in the book.
>It seems to me that if you went around asking invasive questions like this at any time in the country’s history you could have expected some subset of the population to react in the way that OP described.
Well yes, statistically this is more likely as the population increases. I also feel that people's circumstances will contribute to their behavior. If they have felt betrayed by a government that has failed them in the past, or they internally believe that the system does not work for them, then these responses are to be expected.
>In fact, given that the census conducts door-to-door interviews when people have not already voluntarily submitted the census form, it is to be expected that a good percentage of people interviewed just don’t want to answer the questions, and probably feel hostile towards the idea of cooperating with the census. That doesn’t mean that the society as a whole is hostile towards the census. There’s an overwhelming selection bias at play here.
Fair enough. I don't have enough data to prove/disprove this conclusively.
I'm curious when and where that was. I was an enumerator for the 2000 Census in Los Angeles County and can't remember a single instance of hostility or even non-compliance once I explained what I was doing. The households I had to visit mostly didn't fill out the original form because they didn't speak English.
> Anonymizing data to convince the public their information is safe instead of gathering accurate data is exactly the kind of inefficiency that must be expelled from government.
Your proposal is also anonymized, by default.
I think you mean:
Anonymizing [extra] data to convince the public their information is safe instead of gathering [only the data they need] is exactly the kind of [overreach] that must be expelled from government.
>The question I kept coming back to in conversations with computer scientists and the Census Bureau was: what would a real-world privacy harm look like?
>McSherry suggested a scenario involving banks or insurance companies using reconstructed data to discriminate on the basis of racial or ethnic categories.
...
>But discrimination on the basis of race or ethnicity is illegal,
That seems glaringly uncritical. It seems awfully close to "Well that couldn't happen because it's illegal!".
>But there’s a simple solution to this that even skeptics like Ruggles support: get rid of the smallest blocks.
But they're publishing block-by-block data. You, the user of the data, can exclude the smallest blocks yourself, and the added noise for large blocks is relatively small, so you can ignore it. So this proposed solution achieves almost nothing.
I think we should come back to basics here: the census bureau has 1 job. Generate the data used for the federal government to allocate congressional seats to the states, and allocate spending. It's not really its job to provide data to academics for research. So the priority of "Do your job" vs "Do this other thing that's not really your job" weighs massively in favour of "Do your job". So be grateful for what they release, and realise that something that has the tiniest impact on them being able to do their job completely outweighs your need for them to do something that's not their job.
As a researcher who has used (and continues to use) the publicly available census data, I agree with this.
IMO it makes more sense to have higher quality micro level data products that are only available in statistical data centers, https://www.census.gov/about/adrm/fsrdc.html, as opposed to these aggregate tables that even before all this were of variable quality.
This comes with a cost that not everyone can have easy access to the data anymore, but I think that will result in higher quality research in the end (even if less quantity overall).
Apportionment should be done in some standard way as well by states/fed instead of the gerrymandering mess we have now, but unfortunately that is not likely to happen anytime soon.
> In a [court filing](https://storage.courtlistener.com/recap/gov.uscourts.almd.75... ) last April, Abowd wrote that “our simulated attack showed that a conservative attack scenario using just 6 billion of the over 150 billion statistics released in 2010 would allow an attacker to accurately re-identify at least 52 million 2010 Census respondents (17% of the population) and the attacker would have a high degree of confidence in their results with minimal additional verification or field work.”
What would "re-identification" expose?
For example, say Alice is re-identified: what does the attacker then know about Alice?
If Alice's household Census data is re-identified, the attacker may learn that she's in a single-sex relationship. Sure, her neighbors may know, but now some organizations that maybe don't have her best interests at heart are able to easily and automatically locate her and everyone like her, so as to be able to harass them or worse.
Or the attacker may learn that Alice is relatively young, has a kid, and doesn't seem to be living with anyone. Sounds like a bastard child, which is a sin. The neighbors are accepting, but now anyone else can find out.
Or, maybe Alice is part of a racial or religious minority group that tends to vote Democratic. Now attackers can make a list from afar of individual houses to visit on election day to more efficiently intimidate folks in that group from voting.
Or, maybe with a bit of cross-referencing with commercial data, it sure is looking like Alice is not a citizen. This time, the neighbors don't even know, but now the folks who are trying hard to have everyone like her rounded up and expelled ASAP do.
Or, maybe a special-interest pedophile group that has a thing for certain age kids in a certain racial group can now make a list of which houses are worth staking out.
And, if any of these things happen even on a small scale, and word gets out that it was due to Census data being insufficiently protected, it will mean that in the next census, all these groups of folks will think twice before participating, with the result that they will lose representation at the federal level as well as federal funding for their communities.
If this all sounds far-fetched, what do you make of the fact that of the 16 State Attorneys General who sued the Census Bureau over Differential Privacy, 14 were Republicans? With additional briefs from Republican lawmakers from 4 states? (By the way, they lost so resoundingly, and the written rulings by the three-judge panel delivered such a thumping to the attorneys involved, it seems that they slunk away and didn't dare to tarnish themselves any further by taking the next step and appealing to the Supreme Court, where the embarrassment would have been yet worse.)
While selectively accusing Republican governments of being criminal racists, you are ignoring the opposition to DP from Mexican American Legal Defense and Educational Fund ("Voting Rights Act and its enforcement could be adversely affected by differential privacy"), and Native American Rights Fund
Yes, minority groups certainly have valid concerns, which of course can be addressed.
But what is your alternate explanation of how this particular group of Attorneys General happen to have been the ones to sue? In their ruling, the (Republican) judges clearly stated that they saw no other cogent motivation. I'm not aware of any other proposed explanation from any quarters. So, seriously, what do you think it might have been instead?
Correction for posterity: "In their ruling and in the trial transcripts..."
For instance, in the transcript, check out the part where (Trump-appointee) Judge Newsom wonders "And you've got the Pennsylvania Republicans on your side and the attorneys general -- overwhelmingly Republican attorneys general on your side. I'm just trying to figure out kind of what to make of any of that." What do you think of the (non-)reply? If you could give your own reply to cast things in the best possible light, what would you say?
By the way, the transcripts also show that Judge Newsom (and the others) fully grasped that the choice isn't between "Beautiful, Pure, Unperturbed Data vs. Noisy Data" but rather between "Data that has been fuzzed via a Swapping technique that the Census Bureau had been using for decades but is now known to be unsuitable going forward, vs. Data that's fuzzed using Differential Privacy."
Would you like to talk about an extended warranty for your car? Or end of life care for your 78yo family member? Or super awesome student loans for your 16yo family members?
That is just low hanging fruit.
Imagine cross referencing other data sources to get more complete images of everyone in 52m responder households.
> Would like to talk about an extended warranty for your car?
… the people making these calls don't give a rat's ass about census data, accurate or otherwise. It's fraud, all of it. I get these all. the. time.; my current car is 25 years old, I guarantee you its warranty is not "about" to expire. I got these before I ever owned a car. They're not going to sell you an extended warranty, they're going to take the money and run. Accurate targeting doesn't matter.
Unless the census data has a column "is_sucker", but even then, I doubt it.
You could maybe alter that argument to be like "well but reputable ads then" … but, no, I doubt that too. Even with the tech we have, ads shown to eyeballs still make incredibly questionable "there is no way this has positive ROI" choices, like running the same ad twice in every commercial break. YT clearly has ~1 ad right now, that stupid game one whose name I cannot remember.
The car warranty amuses/annoys me for the same reason, I drive two cars over 25, well past 200K and both from defunct manufacturers, so I really doubt any warranty would touch them. Same thing with the "student loan forgiveness hotline"-- I haven't had any for years. I used to have a bit of fun with the "Microsoft tech support" calls, since we didn't have a Windows computer in the house.
Recently I've had a few calls that just said "I'm actually calling to inform him now," so I think the callers don't really care what the content of the call is as long as they can get you to call them back. And maybe if you're gullible enough to call them back, you're gullible enough to give them money.
A census is only useful if the participants can fill out the paperwork honestly. I actually don't understand the people who don't see this being a problem, nor why they would believe layered privacy protections aren't useful. It would seem irresponsible to release accurate or detailed census data to the general public.
To use an example that I'm aware of: in the USA (but likely exists in other places), there are groups of individuals which seek out and target interracial couples. These couples are then harassed and violence is not uncommon.
Approaches such as Differential Privacy exist to address this specific privacy weakness. The act of fuzzying the data this way is an accepted method for the data to only be useful for its intended purpose.
As you and others have noted it's trivial to build up various use cases, from the commercially annoying to the dangerous: the assumption that individuals and companies are never going to try to exploit this data or break the law is perhaps dangerously naive.
I'm not a statistician so I can't possibly get into a discussion about what is worthwhile information for collection. I am aware that a lot of planning goes into such questions to avoid potential abuse (even by the government), the goal being to ask the minimum needed to provide governance while also providing a historically meaningful snapshot.
I don't think there is a strong argument against having a census, it's a more privacy preserving approach to planning than other forms of data collection such as mandatory registers or combining existing government databases - medical, births/deaths/marriages, automotive, postal, taxation, education and so on (in this example the resulting database has too much information about the population, governments are mindful to keep these databases separate on purpose.)
Back to your point though: Arguably a lot of needed and seemingly benign information can still be problematic. As one commenter noted, merely indicating children's presence can be a problem, while such information is clearly beneficial for governance.
Ultimately protections should be in place regardless of the kinds of information that is sought, since one can't foresee all potential forms of abuse such data collection can bring, nor how such information can be merged with additional data sources to reveal more accurate profiles (as is the contemporary issue of online privacy.)
Consider Alice's census response as a row in a table of responses. You can reconstruct every attribute/column for which a sufficiently large number of summary statistics are released.
Names are not a part of the statistics the Census releases, so you won't be able to reconstruct the name. Make a fingerprint out of some of the reconstructed attributes and run a database join against another dataset with names. You've now enriched your data with any remaining attributes that were not a part of your fingerprint.
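To make the join step concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the records, the `FINGERPRINT` columns, and the `link` helper are hypothetical, and a real attack would work with far larger tables.

```python
from collections import defaultdict

# Hypothetical reconstructed census rows (no names, as noted above).
reconstructed = [
    {"block": "B1", "age": 34, "sex": "F", "race": "white"},
    {"block": "B1", "age": 61, "sex": "M", "race": "black"},
    {"block": "B2", "age": 34, "sex": "F", "race": "asian"},
]

# Hypothetical external dataset that does carry names (e.g. a marketing list).
named = [
    {"name": "Alice Example", "block": "B1", "age": 34, "sex": "F"},
    {"name": "Bob Example", "block": "B1", "age": 61, "sex": "M"},
    {"name": "Carol Example", "block": "B2", "age": 34, "sex": "F"},
    {"name": "Dana Example", "block": "B2", "age": 34, "sex": "F"},
]

# Attributes present in both datasets, used as the join key.
FINGERPRINT = ("block", "age", "sex")


def link(reconstructed_rows, named_rows):
    """Join on the fingerprint and keep only unambiguous (unique) matches."""
    by_key = defaultdict(list)
    for person in named_rows:
        by_key[tuple(person[k] for k in FINGERPRINT)].append(person)

    linked = []
    for row in reconstructed_rows:
        matches = by_key.get(tuple(row[k] for k in FINGERPRINT), [])
        if len(matches) == 1:
            # The named record is enriched with the attributes that were
            # not part of the fingerprint (here: race).
            linked.append({**matches[0], **row})
    return linked


linked = link(reconstructed, named)
for person in linked:
    print(person["name"], person["race"])
```

Only fingerprints that match exactly one named record count as a re-identification here; the third reconstructed row matches two people and is left alone.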
I'm grateful no one has attacked and shared the 2010 decennial census, or ACS, which has considerably more questions. If it seems far-fetched... well, you only need one person to do it, there's an existence proof, and the attack is basically just a convex optimization problem.
Extra:
I've written some reconstruction attacks like this myself. One approach is to find the least squares solution, i.e. the x that minimizes |Ax - b|. Let b be a vector of Census statistics. A is a query matrix. An attacker has both. Solve for x, which is a column in the dataset. If b is long enough, then the system is fully determined, and you can solve for x exactly. In practice, b can be much shorter than x and the system will still reconstruct x with high accuracy. Repeat for each column. There are more efficient approaches, like SAT solvers, for large systems.
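A toy version of the least squares idea, as a sketch only. The secret column, the query matrix, and the `solve_least_squares` helper are all hypothetical. It solves the normal equations in plain Python rather than calling a numerical library, and it covers only the fully determined case with exact, noise-free statistics.

```python
import random


def solve_least_squares(A, b):
    """Minimise |Ax - b|^2 by solving the normal equations (A^T A) x = A^T b."""
    n = len(A[0])
    # Augmented matrix [A^T A | A^T b].
    M = [
        [sum(row[i] * row[j] for row in A) for j in range(n)]
        + [sum(row[i] * bi for row, bi in zip(A, b))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


# Hypothetical secret column x: 1 if a person has some attribute, else 0.
secret = [1, 0, 0, 1, 1, 0, 1, 0]
n = len(secret)

rng = random.Random(0)
# Query matrix A: each row marks which people a published count covers.
# The nested "first k people" queries guarantee full column rank in this toy
# setup; the random subset queries stand in for other published tables.
A = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
A += [[rng.randint(0, 1) for _ in range(n)] for _ in range(n)]

# b: the published statistics, i.e. exact counts for each query.
b = [sum(a * x for a, x in zip(row, secret)) for row in A]

# The attacker only sees A and b, and rounds the solution back to 0/1.
recovered = [round(v) for v in solve_least_squares(A, b)]
print(recovered == secret)
```

With noisy or fewer statistics the solve no longer lands exactly on the secret, which is where the rounding step and the constraint that x is 0/1 start to matter.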
Happy to talk more about the general approaches used to privatize census data if anyone is curious. I don't work on the Census mechanisms, but I do privacy research and am familiar with DP hierarchical histogram mechanisms.
That's interesting, but what I reacted to was this paragraph:
>According to census data, this block—and hence this house—had 14 residents in 2020: one Hispanic person, seven white people, three biracial people (white and black), two multi-racial people (white, black, and American Indian), and one person of “some other race.” There were supposedly eight adults and six children living in the house.
Why is this important? I love how they trail off at the end there. So if you don't qualify for bi-racial you're "some other race". What if your grandparents were biracial? Does that make you some other race? Where does this end?
Don't mistake this for me trying to make a point about "woke". I'm a Swede who truly finds it fascinating how the United States emphasizes people's "race" instead of their heritage (nationality).
The importance of collecting race data is because of a history of systematic discrimination based on race. It's not for example Black or Hispanic groups that decided that those categories are important, it was the people creating Jim Crow Laws, Redlining, Bank discrimination, unequal resource distribution to schools, racial discrimination of public pools, racially based differences in policing strategies etc. The reason to ask these questions is to make it possible to identify such discrimination.
Sweden is a great example of why this is important. There it is much harder to identify this kind of discrimination due to poor data. See for example the treatment of Roma or Afro-Swedes.
The US census allows you to specify your race from a list of options or choose other. A person with biracial parents can just choose whatever they feel best identifies them. There is no "qualifies".
Nationality does not work in the US. In Sweden (which, ironically given your question doesn't seem to publish country level ethnicity statistics), most Swedes are 12th generation or older. That pretty much predates the US.
Meanwhile, other issues on ethnicity are that approximately 12% of Americans can only list "ancestors kidnapped from somewhere in Africa"
No, your assumption that “some other race” means you “don’t qualify for bi-racial” is wrong. “Some other race” means you don’t identify with any of the racial categories on the Census, and in practice that means you’re Hispanic/Latino or Middle Eastern - those groups are officially grouped into white, but respondents often don’t identify that way. If your grandparents were biracial you’d be biracial.
> I'm a Swede who truly finds it fascinating how the United States emphasize people's "race" instead of their heritage
The US is founded on essentially shipping criminals and undesirables overseas so they can go murder some people that aren't in the public eye. If there was a focus on heritage, there would be basically no Americans. And that would be unacceptable, because the culture is deeply racist in a way that is so fundamental it is not possible to separate it today.
The amount of privacy is configurable, and researchers who would like to tune it (in either direction) can make a plea so the public can decide the amount of privacy loss they're willing to give up in order to aid research.
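A rough sketch of what "configurable" means in practice, using the textbook Laplace mechanism. This is not the Bureau's production mechanism; `noisy_count`, `mean_abs_error`, and the count of 14 are hypothetical. The parameter `epsilon` is the privacy-loss budget: smaller values mean more noise and more privacy.

```python
import random


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return true_count plus Laplace noise with scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with that scale.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_count + noise


def mean_abs_error(epsilon, true_count=14, trials=2000, seed=0):
    """Average absolute error of the noisy count at a given epsilon."""
    rng = random.Random(seed)
    total = sum(
        abs(noisy_count(true_count, epsilon, rng) - true_count)
        for _ in range(trials)
    )
    return total / trials


for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps}: mean abs error ~ {mean_abs_error(eps):.2f}")
```

The trade-off being argued over is exactly this dial: a block-level count of 14 is barely usable at a small epsilon and nearly exact at a large one.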
Each culture is different and has different tolerances for privacy and amount of intrusiveness.
My solution to this was remarkably simple. When the census person came by, I told them I lived in my house and shut the door. There is no obligation to give any other information, at all.
I don't believe that house residents are being changed in order to stop identification attacks. That simply doesn't have the ring of truth, and in no other earlier press release or discussion about the Census did I ever see that mentioned.
This Census action actually doesn't stop that attack from materializing. It is known and possible to purchase DMV records that give information about residents. People already do this all the time. I'll guarantee you that DMV records are overall more correct than Census data.
It's just the intentional recording of falsified data.
? Talk to someone who actually worked on the census. These are fudge numbers used for differential privacy for low-count categories in a census block. The counts at district and state levels are and must be accurate; individual blocks are fudged.
Why? That doesn't make the data accurate. They are still bad data, and DMV records already allow this supposed attack far more effectively and directly. This munging of data doesn't do what it is purported to do. It is security theatre and just a pretense at security.
DMV records are not publicly published! Census records are, after a mere 72 years. The protection isn't from your government, it's from everyone else in the world.
"The U.S. census is a direct count of every resident. Required by the Constitution, it has taken place every decade since 1790."
"The data it collects is used to determine political representation in Congress and to direct more than $1.5 trillion in federal funding annually."
Serious questions...
1.) What data are they anonymizing? Especially by "randomizing numbers by taking numbers in one place to another place", doesn't this defeat the purpose of a census?
2.) Are they asking your political affiliation? I never got asked this, rather just: "how many people live here" - this is all I was asked.
3.) What other questions are they asking residents? Is it different in each state? What are they doing with this data? Is it legal?
4.) Isn't this then the number used to assign how many "congressional seats" go to which state?
> 1.) What data are they anonymizing? Especially by "randomizing numbers by taking numbers in one place to another place", doesn't this defeat the purpose of a census?
You can mix up data at the individual level so that the aggregate statistics that matter are still accurate. That way the individual is protected, but the data still has meaning.
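One simplified way to picture this is record swapping: exchange the block labels of two households of the same size. The sketch below is an assumption-laden toy, not the Bureau's actual procedure; the household data, the `rate` parameter, and the helper names are hypothetical.

```python
import random
from collections import Counter


def swap_households(households, rate, rng):
    """Swap block labels between same-size households in different blocks."""
    swapped = [dict(h) for h in households]  # leave the input untouched
    by_size = {}
    for i, h in enumerate(swapped):
        by_size.setdefault(h["size"], []).append(i)
    for idxs in by_size.values():
        rng.shuffle(idxs)
        # Pair up households of equal size and swap some of the pairs.
        for a, b in zip(idxs[::2], idxs[1::2]):
            differs = swapped[a]["block"] != swapped[b]["block"]
            if differs and rng.random() < rate:
                swapped[a]["block"], swapped[b]["block"] = (
                    swapped[b]["block"],
                    swapped[a]["block"],
                )
    return swapped


def population_by(households, key):
    """Total people per value of `key` (e.g. per block or per race)."""
    totals = Counter()
    for h in households:
        totals[h[key]] += h["size"]
    return totals


# Hypothetical households in one county.
households = [
    {"block": "B1", "size": 2, "race": "white"},
    {"block": "B1", "size": 4, "race": "black"},
    {"block": "B2", "size": 2, "race": "asian"},
    {"block": "B2", "size": 4, "race": "white"},
    {"block": "B3", "size": 2, "race": "black"},
    {"block": "B3", "size": 1, "race": "white"},
]

swapped = swap_households(households, rate=1.0, rng=random.Random(0))

# Block populations and county-wide race totals are unchanged,
# even though individual households may now sit in a different block.
print(population_by(swapped, "block") == population_by(households, "block"))
print(population_by(swapped, "race") == population_by(households, "race"))
```

The block-level race breakdown is what gets perturbed, which is why the totals that matter for apportionment can stay exact while small-area detail becomes unreliable.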
I was being lazy, you're right. I don't recall ever filling one of those forms out however. I do recall a Census worker coming door to door and I suppose filing this form on my behalf.
The census should be done by overhead satellite images using statistical approaches. It doesn’t matter precisely how many people there are, because that is changing every day.
The government has already proven itself to be untrustworthy, so I'm not sure what benefit people get out of giving them personal information about their age, income, etc. The IRS, DMV, Social Security, etc. already have all the information they need. Saying it is important is like saying it is important to give out your information to Google to make it easier to cater search results and give you targeted advertisements.
Where to build specific medical facilities is based on age and population. A community of 25 year old heterosexuals is going to have kids, where a community of 85 year olds is going to have geriatric issues. It also impacts the needs for transportation etc.
Income adjusts where you have to subsidize local hospitals, where to place welfare offices etc.
As to why you need to collect this info again, homeless people don’t report income to the IRS etc.
> As to why you need to collect this info again, homeless people don’t report income to the IRS etc.
Many homeless people also do not have mail, so they never get contacted or participate in the census. If they are in a shelter, then that shelter already has their information.
Do you have any evidence that that would be feasible? Because it sounds impossible. Houses in rural areas hidden under trees would go uncounted. High density urban areas would give no indication of how many inhabited floors there are, or how densely inhabited they are.
What problem do you have with how the census is currently done? It seems remarkably efficient.
What I find more troubling is that I know for a fact of several people who did not respond to the surveys and were then never again contacted/followed up with as was indicated they would be.
That was in one specific area. Along with other knowledge and experience from working with the Census Bureau years earlier, well before they were in the news for being extremely behind the curve even before the pandemic, it makes me extremely concerned that the data is not at all accurate.
From my perspective it is not at all unreasonable that the 2020 Census data may have a lot more “noise” in order to cover up inaccuracy.
I suspect that with some data analysis it should be relatively easy to at least get an indication of something being off between the 2010 and 2020 census along with the community surveys.
I have been around these things in the FedGov and what I know does not at all give me a lot of confidence about this “noise for privacy” explanation. Does anyone know if they announced this move to increased “privacy” a long time ago or if it’s just a relatively recent thing? I am not aware of such an announcement. All of a sudden the government is concerned with privacy???
> Does anyone know if they announced this move to increased “privacy” a long time ago or if it’s just a relatively recent thing? I am not aware of such an announcement.
I don't really keep an eye on census news myself but taking a look I did find this page from 2018 talking about the latest set of changes to go into effect for the (then upcoming) 2020 census https://web.archive.org/web/20181107155115/https://www.censu...
> Does anyone know if they announced this move to increased “privacy” a long time ago or if it’s just a relatively recent thing?
By long-standing law, the census data has to be kept anonymous for 72+ years. The census employs long term employees to help protect that privacy. This is one result.