
You can’t buy Congress’s web history – that's not how any of this works - gojomo
http://www.theverge.com/2017/3/29/15115382/buy-congress-web-history-gop-fake-internet-privacy
======
AdamSC1
This article fails to take into account that time and time again we've seen
that 'anonymized aggregate data' is never truly anonymous.

In the AOL anonymized data leak there were plenty of individuals identified:

[http://www.nytimes.com/2006/08/09/technology/09aol.html](http://www.nytimes.com/2006/08/09/technology/09aol.html)

MIT researchers also showed that four anonymous purchases are enough metadata
to identify 90% of individuals:

[https://www.technologyreview.com/s/536501/data-sets-not-
so-a...](https://www.technologyreview.com/s/536501/data-sets-not-so-
anonymous/)

And a personal favorite of mine, which researchers from Stanford and
Princeton are presenting at the World Wide Web Conference this April:
"Researchers found that they could identify the person behind an 'anonymized'
data set 70% of the time just by comparing their browsing data to [often
public] social media activities"

[https://www.techdirt.com/articles/20170123/08125136548/one-m...](https://www.techdirt.com/articles/20170123/08125136548/one-
more-time-with-feeling-anonymized-user-data-not-really-anonymous.shtml)

It would not be hard to buy a zipcode worth of data and compare it to known
facts about a person until you de-anonymized it.

~~~
jt2190
While I agree with the sentiment that companies have a poor track record of
protecting personal data, I'd like to clarify that a "zip code of data" is not
anonymized intentionally, which is what makes picking out individuals so easy.

There is a technique of intentionally anonymizing data [1] that I learned
about because Apple was talking it up in relation to storing health
information. I'm only a layman, but my understanding is that it makes it much,
much harder to do an analysis like you describe.

[1] The term is Differential Privacy. Here's a tutorial video (1 hour, 34
mins): [https://youtu.be/ekIL65D0R3o](https://youtu.be/ekIL65D0R3o)

~~~
AdamSC1
The zip code was intentionally not anonymized because it was seen as
acceptable to release, and that is part of the issue of anonymized data.

US zip codes serve around 7,500 people per code on average (US Census 2010),
which is different from Commonwealth postal codes like in Canada, where they
serve an average of 19 households (25 - 75 people).

But, zip data can be interchanged with plenty of other unique identifiers on
the web. Maybe it is the browser language setting, or the version of Java, etc.

Think of it like an Excel spreadsheet: if column A can only hold "1" or "2",
then in a list of 100 people at least 50 will share the same data footprint.
If you keep adding columns from B onward with the same logic, eventually
you'll have pretty unique strings.
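
The column analogy can be sketched in a few lines. This is only an illustration with made-up data: 100 hypothetical people, each column a random two-valued attribute. The population size, column count, seed, and function names are all assumptions chosen for the example, not figures from any real data set.

```python
import random
from collections import Counter


def count_unique_rows(rows, num_columns):
    """Count rows whose first `num_columns` values are shared with nobody else."""
    footprints = Counter(tuple(row[:num_columns]) for row in rows)
    return sum(1 for count in footprints.values() if count == 1)


def make_population(num_people=100, num_columns=12, seed=0):
    """Build hypothetical people, each described by random two-valued attributes."""
    rng = random.Random(seed)
    return [[rng.choice((1, 2)) for _ in range(num_columns)]
            for _ in range(num_people)]


if __name__ == "__main__":
    people = make_population()
    for k in range(1, 13):
        # Each extra column can only split groups further, never merge them.
        print(f"{k:2d} columns -> {count_unique_rows(people, k)} unique people")
```

Real attributes are not independent coin flips, so real data would separate more slowly than this toy population, but the direction is the same: every added column can only make groups smaller.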

Things like search history or web history are even worse. Ever done a
search for a pizza place near your address, or Google map directions from your
home to another location? That identifies you pretty easily. So does
connecting to your work's website, and the school your kid attends. Web
browsing data is nearly impossible to anonymize by its nature unless it was
compiled to something like "XX% of users in Zip XXXXXX visited website.com"

As for differential privacy, it is a nice emerging theory, but there are
challenges with it: right now there is a significant trade-off in terms of
data accuracy when applying it. It is primarily effective at casting doubt on
whether variable "A" about user "B" in a data set is true or not. But if you
don't have a specific target or specific metric, then enough of the data is
true that it could still in theory be deanonymized. And since most of the
anonymity is based on incorrect variables in a data set, all it would take to
reverse engineer it is a large enough data set and a few known variables.
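
One of the simplest mechanisms in the differential privacy family is randomized response, and it shows both the "casting doubt" effect and the accuracy cost. What follows is a minimal sketch, assuming a single yes/no attribute and fair coin flips. The function names and the 30% true rate are invented for illustration, and nothing here is meant to describe how Apple or anyone else actually implements it.

```python
import random


def randomized_response(truth, rng):
    """Report the truth on heads; on tails, report a second coin flip instead."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(reports):
    """Invert the noise: P(report yes) = 0.5 * p + 0.25, so p = 2 * observed - 0.5."""
    observed = sum(reports) / len(reports)
    return 2 * observed - 0.5


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical population in which 30% truly have the sensitive attribute.
    truths = [i < 3000 for i in range(10000)]
    reports = [randomized_response(t, rng) for t in truths]
    print(f"estimated rate: {estimate_true_rate(reports):.3f}")
```

Any single report is deniable, because it may be a coin flip rather than the truth. The population rate can still be estimated, but with more error than the raw data would give, which is the accuracy trade-off.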

I hope people like Apple continue to champion the advancement of differential
privacy though - it is a major step in the right direction. But, being able to
buy browsing history, even in aggregate does not protect individuals.

~~~
jt2190
>The zip code was intentionally not anonymized because it was seen as
acceptable to release, and that is part of the issue of anonymized data.

Forgive me for asking, but you seem to have two definitions of "anonymized":

\- anonymized
\- not anonymized, but claimed to be anonymized

I think this argument (which I agree with) would be more forceful if we could
stop calling non-anonymized data "anonymized". "Depersonalized", perhaps?

~~~
AdamSC1
So there is the version of "anonymized" that most companies currently use
which is as you say 'depersonalized' \- they remove your personal information
and think it is safe, but data is still tied together in unique records. (i.e.
your name is replaced with an ID number)

Then there is actually 'anonymized' data, which would be the release of data in
which you cannot in any way identify a user. An example that comes to mind for
me is the census releasing aggregate stats such as "14% of Americans speak
language X."

If the census instead had records of each American line by line, listing which
language they speak and other associated factors about them then this data is
likely to paint a unique picture of the individual even if their personal
information like name and address were removed.
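
To make the difference between the two kinds of release concrete, here is a rough sketch of the aggregate version. It assumes each record is a plain dictionary and adds a small-cell cutoff (`min_count`) so that values held by only a handful of people are not published. The cutoff value, the field name, and the function name are assumptions for the example; this is not a description of how the census actually processes its data.

```python
from collections import Counter


def aggregate_shares(records, field, min_count=5):
    """Return the share of records per value, dropping values held by few people."""
    counts = Counter(record[field] for record in records)
    total = len(records)
    return {value: count / total
            for value, count in counts.items()
            if count >= min_count}


if __name__ == "__main__":
    # Hypothetical line-by-line records; only the aggregate would be released.
    records = ([{"language": "English"}] * 84
               + [{"language": "Spanish"}] * 14
               + [{"language": "Rare dialect"}] * 2)
    for language, share in aggregate_shares(records, "language").items():
        print(f"{share:.0%} speak {language}")
```

The output contains no rows at all, only shares, which is what separates it from a "depersonalized" table where each person still has their own line.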

I think most data is very hard, if not impossible, to truly anonymize. Even if
the search history that gets sold wasn't broken out into history per
tuple/record, then you could still identify at least a few trends in it.

Does that make more sense? But yes, I agree that these companies are more
attempting to 'de-personalize' data for the sake of research, but, that is far
from anonymous and naming it as such is misleading to the public.

------
DannyBee
"and they every major network has robust safeguards in place to prevent you
from working back to a single person’s web history."

hahaha. That's quite a talking point. What would these "robust safeguards" be?
The history of pretty much every study on this, ever, is that it's pretty easy
to deaggregate and deanonymize data.

But look, I'm sure Verizon, AT&T, etc., those great bastions of "doing it
right", have done this right too!

20 bucks says if someone buys large amounts of the aggregated data, they can
extract significant information that can be pinpointed to individual
congresspeople.

~~~
derefr
The safeguards might just be that they won't sell you "large amounts of" the
aggregated data unless you're clearly a profit-driven multinational
corporation rather than an ideologically-driven individual. KYC and all that.

~~~
ckastner
What about a profit-driven multinational corporation with political ties?

Say it is indeed possible to pinpoint individual congresspeople, as the
grandparent suggested. If some of these congresspeople have a browsing history
they'd rather not have made public, identifying them could leave them open to
blackmail.

~~~
tedunangst
At some point the threat also just takes the form of "here's a list of random
sites we totally made up and will tell people you visited". How well does
blackmail like this even work? Even if somebody has every naughty site Trump
has visited, where will you publish it and why are his supporters going to
read it and believe you? Why will they not assume you made it all up?

~~~
Daviey
And Trump would need to either ignore the claim made by the media, which could
be seen as an admission of guilt...

Deny the claim, and then be later impeached for dishonesty.

Admit the claim, and then well, admit it.

IF some of the content is serious enough, it could surely be a foundation of a
congressional hearing or perhaps even impeachment process? How can he respond
to the questions there?

In any scenario... I can't see how he could "win" this? (Unless he is really
innocent of anything questionable of course)

~~~
krapp
He's not going to be impeached for dishonesty, the government is entirely run
by his fellow Republicans.

~~~
dhimes
He's not a Republican. He hijacked the party and beat the living shit out of
every single one of them, and they fear him.

~~~
krapp
He is a Republican. He's the leader of the Republican party by way of winning
the Republican nomination and the Presidency. The rest of the party may not
like that, but the fact that Republican voters chose Trump over a more
mainstream candidate demonstrates that mainstream Republican ideology is no
longer relevant to the future of the party.

Which is why Trump probably won't be impeached, or even seriously censured -
doing so would only cause an insurgency by Trump's supporters and it would
weaken Republicans in the face of the Democrats. It would be suicidal.

~~~
dhimes
I think we agree.

------
drawkbox
People are missing the point on the bill, it isn't just horrible for privacy.
Really this is a jab at net neutrality, the open selling of data by ISPs is a
power grab away from the FCC that helped to make net neutrality a thing by
labeling broadband/ISPs as common carriers.

Republicans (this was a party line vote) say it is unfair that Google and
Facebook have your personal information and use it for ads but why can't ISPs
have that and also sell it? One big major reason is people sign up to Google
and Facebook for the purpose of sharing and agree to their ads in exchange for
a service. Google built the most powerful search engine and Facebook built the
social graph. Google/Facebook built value and they only use your info to
target ads to you, they don't sell it because others would do the same. They
sell ads and people use them because they have info on you, not necessarily to
sell off to others.

If you ask me it is unfair for republicans to legally allow ISPs to do the
same because we expect privacy from ISPs in ways we do not from Google and
Facebook. You can choose not to use Google or Facebook but you cannot choose
your ISP/broadband provider. In my opinion this is like letting someone view
your mail, read it and then sell information about you.

It is also an unfair competitive advantage for ISPs above all, because they can
place ads on any website if they want or track you across all sites, unlike
Google/Facebook, which are huge but only see a portion of what you do.
ISPs built no value product like a search engine or social graph for this
purpose, they should do that if they want access like Google and Facebook. It
seems almost like the GOP are harming innovative companies and
rewarding/catching up non-innovators. I bet broadband companies/ISPs won't
even use the profits to improve broadband and rollout gigabit service for
real. It is a rewarding of lazy semi-monopolies over innovative companies and
products.

Republicans also control the FTC not the FCC so they want all control to fall
to the FTC instead. It is both a power grab and a bending over of all their
constituents.

Most of all, it is also another step in dismantling net neutrality, as the FCC
protected that by categorizing broadband/ISPs as common carriers and
they want to sap the FCC's power in that regard.

~~~
diggan
> you cannot choose your ISP/broadband provider

Wait, what? Never visited the US but I assumed you had the choice of
providers, just as most of the rest of the world where internet is available.

I see the reason for this to allow ISPs to have infrastructure to collect data
setup, so when some organization needs* access to it, it's already there and
easily accessible.

* for national security or whatever

~~~
gremlinsinc
I'm in Utah... maybe it's just different here, but we have Utopia, which owns
the conduit and leases it out to different ISPs who own the networking
equipment. I'm with X-mission, whose CEO is a big privacy advocate and
supporter of the EFF. But there's Google Fiber in places, there's X-mission and
about 5-6 other Utopia-leasing ISPs locally you can choose from - and that's
just for fiber; you also have Comcast, Qwest (DSL), and CenturyLink, and a few
others.

Even in rural Cedar City down south you have about 5 ISP choices, though the
highest speed looks to be TDS at 300 Mbps.

~~~
UncleMeat
Bay Area here. It's in my _lease_ that I need to use Comcast for any wired
internet connection.

~~~
orthecreedence
I'm so incredibly surprised this hasn't been sued out of existence for anti-
competitive behavior. Not that people in the Bay Area ever have more than one
choice to begin with for internet service (AT&T and Comcast seem to split the
area up amongst themselves).

------
downandout
This controversy seems overblown to me. It would have been nice if these rules
had gone into effect as scheduled, but let's remember that they _never
actually did_. So nothing has been lost - ISPs have gained no new abilities,
despite how the extremely biased media coverage I have seen of this paints the
situation. They have been able to sell aggregated targeting data forever, and
will be able to continue doing so.

This article is actually somewhat encouraging, in that it displays the limits
of the laws as they exist today and will apparently remain. You cannot go buy
the browsing histories of specific individuals - enemies, employees, etc. Many
of the ridiculous "sky is falling" leaps of fact and logic that have been
portrayed in the media to get views (ironically so that they can make money
from displaying targeted advertising on articles decrying it) have been
disproved by this article.

~~~
tzs
You make it sound like the FCC privacy rules were bringing something totally
new to the table. That's not really correct. They were more bringing back
something that had recently been lost.

Until mid 2016 ISP privacy abuses could be dealt with through the FTC,
basically the same as privacy abuses at most other internet businesses are
dealt with.

In mid 2016 a court ruled that the FTC did not have authority to deal with
these issues for common carriers, which ISPs had been since being reclassified
as part of the 2015 net neutrality rules. The result was a significant
loosening of privacy regulation for ISPs.

The FCC rules would have undone that loosening.

~~~
orthecreedence
Thanks for explaining this. I didn't know the full chain of events nor the
FTC's involvement and how it affected the 2016 FCC rules. This makes it much
more clear.

------
avmich
Of course you can buy non-identifying information perfectly legally. But then,
guess what the definition of non-identifying information is? Think again. If
an ISP thinks that's what it is selling you, nothing stops you from looking at
it closely yourself - and maybe, just maybe, you'll uncover something the ISP
missed. After all, scrubbing data to be perfectly non-identifying could be, you
know, expensive. And it doesn't exactly fit the ISPs' business model. So
sharing even Congress-related history could be an interesting thing.

~~~
thinkling
Buy aggregate browsing history for users in DC and nearby area codes. Look for
browsing of .gov websites that are effectively government intranet tools.
Correlate with access to RNC sites, Infowars, FoxNews, etc. While I'm sure you
would both miss people and have some false positives, you'll likely also
identify many members of congress and their staff.

~~~
hackinthebochs
Google searches have the query in the URL... probably most "important" people
google themselves occasionally.

~~~
tedunangst
Why are important people searching google without https?

~~~
hexadec0079
HTTPS will not protect the URL of the request, just the contents though.

~~~
temprature
This is wrong. The domain name is revealed through SNI but the path of the URL
is sent as part of the encrypted HTTP request.
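
A small sketch of that split, for anyone who wants to see it spelled out. It assumes a plain HTTPS connection where the hostname goes out in the clear via SNI (no encrypted ClientHello), and it ignores other leaks such as DNS lookups, IP addresses, and traffic timing. The function name and dictionary keys are made up for the example.

```python
from urllib.parse import urlsplit


def split_by_visibility(url):
    """Separate what an on-path observer can see from what TLS encrypts."""
    parts = urlsplit(url)
    hidden = parts.path
    if parts.query:
        hidden += "?" + parts.query
    return {
        "visible_hostname": parts.hostname,   # exposed by SNI (and usually DNS)
        "encrypted_path_and_query": hidden,   # inside the encrypted HTTP request
    }


if __name__ == "__main__":
    print(split_by_visibility("https://www.google.com/search?q=my+own+name"))
```

So for a Google search over HTTPS, an ISP would see that you talked to www.google.com, but not the `/search?q=...` part.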

------
lph
Prohibiting the sale of individually identifiable information is a very weak
protection. It's alarming how quickly multiple dimensions of "anonymized" data
can be used to zero in on any target person, and it won't take long for amoral
opportunists to start selling de-anonymized data to the highest bidder. See
also k-anonymity.
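
For readers who haven't met k-anonymity: a data set is k-anonymous over a set of quasi-identifier columns if every combination of values in those columns is shared by at least k records. A minimal sketch of measuring k, using hypothetical records and column names invented for the example:

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(record[q] for q in quasi_identifiers)
                     for record in records)
    return min(groups.values())


if __name__ == "__main__":
    # Hypothetical "anonymized" rows: no names, but several dimensions remain.
    records = [
        {"zip": "20001", "gender": "F", "birth_year": 1970},
        {"zip": "20001", "gender": "F", "birth_year": 1970},
        {"zip": "20001", "gender": "M", "birth_year": 1970},
        {"zip": "20001", "gender": "M", "birth_year": 1982},
    ]
    for columns in (["zip"], ["zip", "gender"], ["zip", "gender", "birth_year"]):
        print(columns, "-> k =", k_anonymity(records, columns))
```

When k drops to 1, at least one record is unique on those columns, and anyone who knows those facts about a target can pick that record out.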

~~~
gorbachev
Combine the data with publicly available information about politicians and
other public figures, feed it to a machine learning algorithm and I bet you
could get very high accuracy results very, very quickly.

------
gyger
One should check out the talk from 33c3, where a group of German journalists
bought this type of data and deanonymized some users.

[https://media.ccc.de/v/33c3-8034-build_your_own_nsa](https://media.ccc.de/v/33c3-8034-build_your_own_nsa)

~~~
rndgermandude
tl;dw Even worse, they found stuff like links to a police detective's
application for a search warrant in the "anonymized" data.

Oh, and of course they found and deanonymized some German politicians and
their staff.

So while it might not be possible to deanonymize and "expose" all of Congress,
if you really tried and got some money to actually buy the data, you will be
able to at least deanonymize some of your targets, especially if the data
originates from a man-in-the-middle like an ISP and not just some random ad
network/tracker.

------
alistproducer2
So I was a call-in on NPR
yesterday([http://www.wbur.org/onpoint/2017/03/29/internet-privacy-
cong...](http://www.wbur.org/onpoint/2017/03/29/internet-privacy-cong...))
that discussed the ISP privacy issue. I brought up the crowdfund and all of
the guests immediately jumped to the defense of the data seeking companies
saying how these companies were too reputable to "allow" such a thing to
happen. I'm finding all of this love for these previously unknown companies
very strange.

~~~
thirdsun
Your link is incomplete. I think you meant to link to
[http://www.wbur.org/onpoint/2017/03/29/internet-privacy-
cong...](http://www.wbur.org/onpoint/2017/03/29/internet-privacy-congress)

------
ohthehugemanate
I don't see this as a particular obstacle. Yes, I can't buy Paul Ryan's data
individually.

But I can buy anonymized browsing and demographic data for the downtown DC zip
codes, and de-anonymization is not particularly hard when you have so much
information to work with. Zip code, gender, and birth date together are enough
to uniquely identify about 87% of people in the US.
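
The kind of join I have in mind is roughly the following. It is a sketch under stated assumptions: the "released" rows carry only an opaque ID plus zip code, gender, and birth date, and the "public" rows stand in for something like a voter roll that has the same three fields plus a name. All names, field names, and records below are invented.

```python
def link_records(released, public, keys=("zip", "gender", "birth_date")):
    """Attach a name to each released record whose key matches exactly one public record."""
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    matches = {}
    for record in released:
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:          # only unambiguous matches count
            matches[record["id"]] = names[0]
    return matches


if __name__ == "__main__":
    public = [
        {"name": "Alice Example", "zip": "20001", "gender": "F", "birth_date": "1970-01-29"},
        {"name": "Bob Sample", "zip": "20001", "gender": "M", "birth_date": "1982-06-03"},
        {"name": "Bill Sample", "zip": "20001", "gender": "M", "birth_date": "1982-06-03"},
    ]
    released = [
        {"id": "u1", "zip": "20001", "gender": "F", "birth_date": "1970-01-29"},
        {"id": "u2", "zip": "20001", "gender": "M", "birth_date": "1982-06-03"},
    ]
    print(link_records(released, public))
```

Only `u1` links to a single name here; `u2` stays ambiguous because two public records share its key.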

------
maaaats
> _You can’t buy Congress’s web history_

They probably don't understand or know that, though, so why not give them a
scare?

~~~
backpropaganda
Right. Only us smart HNers living in SV know anything about tech.

~~~
maaaats
That's a stupid straw man.

------
mirimir
> In the meantime, the two biggest campaigns have collectively raised nearly
> $140,000 for the purchase of web histories that will never go up for sale.
> It’s anyone’s guess where the money will end up.

Well, NSA staff did LOVEINT. Maybe some ISP staff will want to profit.

Edit: But I wonder how it'd be possible without self-pwning.

------
elchief
Didn't Netflix pull their 2nd competition because people figured out how to
de-anonymize all the anonymized data? Can you not figure out who someone is by
repetitive filtering?

------
mumpy
> that's not how any of this works (theverge.com)

While the publicity pieces may be poorly stated, (in particular one should
never use the word "individual" in any context) the desired teachable moment
is based on buying data or access to a segment of at least 500 individuals
based on factors like travel habits, income and demographics. (In some
proposals that segment could be as high as 5-10K as I see mention of
congressional aides and the like.)

That desire could be quite easy to achieve and quite hard to block without
upsetting the intended market for ISP data, since being able to buy effects on
segments down to such granularity is exactly how the market works.

------
notacoward
If the data can be used to target ads at individuals, then it clearly isn't
anonymous in any meaningful sense of the word.

------
cameldrv
For those that say this is impossible, take a look at these slides from
Dstillery, a major demand side advertising company:
[https://www.matroid.com/scaledml/2017/claudia.pdf](https://www.matroid.com/scaledml/2017/claudia.pdf)
Pay particular attention to slide #23, and also the following few slides,
which shows the architecture of their data processing stack. They are
deanonymizing most of what people do on the web, and they undoubtedly have a
file on most all members of Congress. Part of the input to this model is the
"anonymized" location data that mobile phone companies sell. If you already
know where a person lives and works though, or simply have a few retail
transactions with timestamps, you can deanonymize these location traces.

------
rdlecler1
>you want to get really clever, the Wiretap Act also makes it illegal to
divulge the contents of electronic communications without the parties’
consent, which arguably includes browsing history.

So wouldn't it be illegal for sites to sell visitor information (email lists)?

------
makomk
Also, people were talking about buying Congress's _search history_ , which of
course ISPs don't even have - only Google does and their ability to sell it
was not in any way affected by this rule that covered only ISPs.

~~~
bigbugbag
Those people are confused, at least for searchinternethistory.com the plan is
to buy internet history and make it public and searchable.

Then again, I'm pretty sure Google does not know what people search for on
Bing or Yahoo, and it seems to me that what you search on Google appears in
the URL so your ISP would know.

~~~
couchand
> what you search on google appears in the URL so your ISP would know.

Well no, Google searches go over HTTPS, so everything but the host is
encrypted.

~~~
bigbugbag
Thank you for fixing my shameful mistake, don't know what I was thinking
there. My bad!

------
bigbugbag
It was my understanding that there would be some deaggregating and
deanonymizing to be done on the data.

Hasn't this been shown repeatedly to be not only feasible but also quite
easy? It seems to me that The Verge missed the point here.

------
pcarolan
Lotta speculation in the news for something that could be answered by looking
at the data. Since this isn't a new program, has anyone seen it? (please no
speculation)

------
ryan_j_naughton
This raises an interesting question: who is the ISP for congress? My guess is
the government itself, but anyone have anything more specific on that?

~~~
ryan_j_naughton
Using the history of Congress editing Wikipedia pages I found the answer --
they are their own ISP:

[http://whois.arin.net/rest/net/NET-143-228-0-0-1/pft](http://whois.arin.net/rest/net/NET-143-228-0-0-1/pft)

[http://whois.arin.net/rest/net/NET-143-231-0-0-1/pft](http://whois.arin.net/rest/net/NET-143-231-0-0-1/pft)

[https://en.wikipedia.org/wiki/Wikipedia:Congressional_staffe...](https://en.wikipedia.org/wiki/Wikipedia:Congressional_staffer_edits#United_States_House_of_Representatives)

[https://www.whoismyisp.org/ip/143.228.255.255](https://www.whoismyisp.org/ip/143.228.255.255)

~~~
ryan_j_naughton
This led me to an idea: why not buy online ads targeted by IP address at the
IP ranges of Congress? It would be extremely targeted since it would only be
seen by computers on the Hill.
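
Checking whether a visitor falls in those ranges is simple with the standard library. A rough sketch: I'm assuming the two ARIN records linked above can be treated as /16 blocks (143.228.0.0/16 and 143.231.0.0/16), which is my reading of the record names rather than something I've verified, and the function and constant names are made up.

```python
import ipaddress

# Assumption: the two ARIN records linked above are treated as /16 blocks.
HILL_NETWORKS = [
    ipaddress.ip_network("143.228.0.0/16"),
    ipaddress.ip_network("143.231.0.0/16"),
]


def is_hill_address(ip):
    """Return True if `ip` falls inside one of the assumed congressional ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in HILL_NETWORKS)


if __name__ == "__main__":
    for ip in ("143.228.255.255", "143.231.0.1", "8.8.8.8"):
        print(ip, is_hill_address(ip))
```

Whether an ad platform would actually let you target by raw IP range like this is a separate question.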

