
Google Data Collection research - pjf
https://digitalcontentnext.org/blog/2018/08/21/google-data-collection-research/
======
gooftop
In some ways not surprised, but it's still a LOT of data (in sheer volume).
Does Google even keep the data and analyze it, or is much of it thrown away as
digital exhaust? Anyone know?

------
arnjerobben
PDF: [https://digitalcontentnext.org/wp-
content/uploads/2018/08/DC...](https://digitalcontentnext.org/wp-
content/uploads/2018/08/DCN-Google-Data-Collection-Paper.pdf)

------
close04
People will look back fondly on the times when they thought Microsoft's telemetry
was the worst privacy issue to hit them...

~~~
fauigerzigerk
I think you are mistaken.

Take a look at the privacy settings in your Microsoft account, specifically
the activity history. They collect very much the same stuff that Google does,
only Google lets you disable the data collection and Microsoft doesn't.

~~~
close04
I am reasonably sure this is not the case, especially since every piece of
evidence, article, and comparative study points at Google as being by far the
worst offender on the data collection and privacy topics. If you have an
Android phone you'd better give up on the idea that you can really "disable"
the data collection.

Windows on the PC gives me clear options immediately after installation to
disable all that and what's left is the "basic" telemetry, which is very light
data collection according to today's standards. At least there's nothing
personal there. Also I can very well use the PC without a MS account (so far).

On the mobile side... Is MS still into that? Myself along with probably 5
other people do have a Windows 10 mobile phone but again, the privacy wizard
apparently did a great job to disable most of the collection of data. And when
I go to Settings > Privacy > Location and I disable it _it's disabled_.

But yeah, this is the kind of ignorant BS that got all of us here in the first
place. As long as there's a critical mass of ignorance on the market the likes
of Google know they'll always have someone to sell to so they shove this down
everyone's throat. MS makes money by selling you the products. Google gives
you everything for free. Not raising any flags?

~~~
fauigerzigerk
Please go to [https://account.microsoft.com/privacy/activity-
history](https://account.microsoft.com/privacy/activity-history) and see for
yourself. You can delete your activity history (including your search history)
but you cannot disable the data collection as you can in your Google account.

 _> MS makes money by selling you the products. Google gives you everything
for free. Not raising any flags?_

Microsoft makes money in all sorts of ways, among them free and ad funded
services like Bing, Skype or LinkedIn.

But yes it does raise flags. That's why - as a paying Office 365 (for Mac)
subscriber - I was expecting to have more control over data collection than as
a user of ad funded services, not less.

~~~
close04
I think you really want to miss the point: The problem isn't that Google is
collecting performance data and other such stuff. They collect _very private
data_ that can be directly tied to you - _fauigerzigerk_ personally. And they
make it very hard or impossible for you to really disable it.

First, MS doesn't exist on the mobile side. This might sound irrelevant until
you realize that a phone has the potential to be a _much_ more powerful data
pump.

Second, most Windows installations aren't tied to a MS account. As such no
data can be linked to it and to you in person.

Third, you see this screen (a few settings are off-screen)? [1] It actually
disables most data collection, like location. It doesn't just hide it from the
dashboard and have you disable it "again" in 3 more places like Google was
shown to do these days. Just by disabling Cortana you get rid of most data
collection that can be tied to you.

This is the first thing I do on any Windows machine. Between this and the fact
that I don't have to log in with a MS account means that my MS activity
dashboard only includes my explicit browser logins and this:

 _We don’t have any data associated with this Microsoft account at the
moment._

Or this:

 _There’s nothing to see here yet. To add some interests in this category,
open Cortana’s Notebook on your device._

[1]
[https://www.bleepstatic.com/images/news/companies/m/microsof...](https://www.bleepstatic.com/images/news/companies/m/microsoft/windows-10/privacy-
settings-preview/privacy-screen.jpg)

~~~
fauigerzigerk
_> I think you really want to miss the point: The problem isn't that Google is
collecting performance data and other such stuff. They collect very private
data that can be directly tied to you - fauigerzigerk personally_

And here's the point you are in fact missing or rather denying: Microsoft does
collect tons of very personal data as well, not just performance data. You are
acknowledging as much when you say that disabling it is the first thing you do
on any Windows machine.

Most people will never change the defaults, and Microsoft is pushing extremely
hard to make all users of their software log in to a Microsoft account. A lot
of Office 365 functionality makes no sense without being logged in. So it is
important what privacy settings there are, what the defaults are and what you
can or cannot disable.

But I think the reason why we are talking past each other is that I don't use
Windows. I don't have the settings in your screenshot. I was comparing the
settings in my online Microsoft account to the settings in my Google account.

If your main point is that Google is far more dependent on collecting as much
personal data as they possibly can than Microsoft, then I fully agree with
that. But that makes it all the more baffling why Microsoft behaves so much
like Google when it comes to data collection.

------
pjf
The site seems slow/down now. A cached copy is available, obviously, on Google
:)

[https://webcache.googleusercontent.com/search?q=cache:m-ItC3...](https://webcache.googleusercontent.com/search?q=cache:m-ItC3KnszwJ:https://digitalcontentnext.org/blog/2018/08/21/google-
data-collection-research/)

~~~
exikyut
I asked myself, why on earth do websites have to be slow when they're under
heavy load?

To be fair, the technical reason is that the modern web (circa last 3-5 years
or so?) has mostly switched over to demand scaling, and this is why hug-of-
death situations are comparatively rare now.

If you have to be super-economical, and even setting an upper spending limit
is uncomfortable, set your site up to ping you when/if it's falling apart so
you can pull the trigger on ordering more cores yourself. Maybe put a super
easy "order more cores" button in the ping, make 2:30AM disasters less
terrible.
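
A minimal sketch of the "ping you when it's falling apart" idea, assuming you already have response times and success/failure flags coming from somewhere (access logs, a health check, whatever). `LoadWatcher` and its thresholds are made up for illustration; how the alert is actually delivered, and what the "order more cores" button does, is left out.

```python
from collections import deque
from statistics import mean


class LoadWatcher:
    """Track recent requests and report when the site looks overloaded."""

    def __init__(self, window=20, slow_seconds=2.0, max_error_rate=0.2):
        # Only the most recent `window` samples are kept.
        self.samples = deque(maxlen=window)  # (seconds, ok) pairs
        self.slow_seconds = slow_seconds
        self.max_error_rate = max_error_rate

    def record(self, seconds, ok=True):
        """Store one request's duration and whether it succeeded."""
        self.samples.append((seconds, ok))

    def should_alert(self):
        """True once the window is full and looks slow or error-prone."""
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough data yet to judge
        avg = mean(seconds for seconds, _ in self.samples)
        failures = sum(1 for _, ok in self.samples if not ok)
        error_rate = failures / len(self.samples)
        return avg > self.slow_seconds or error_rate > self.max_error_rate
```

Call `record()` for every request and check `should_alert()` on a timer; the window keeps a single slow request from waking anyone at 2:30AM.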

Of course, technical reasons aside, my initial unimpressedness (or, I must
admit, indignation - ha) was directed at whoever this group's ISP is. Clearly
the ISP is not providing enough [demand] bandwidth.

[Edit: After reading some comments below I think I should insert a
preface/prefix here and say that it's entirely possible the website itself is
misconfigured. I think I got sufficiently indignant that I didn't factor this
in, and this was definitely an oversight. I don't know the situation, and it's
unfair to squarely blame the ISP. Original message continues unedited...]

So, FWIW, this group is using mediatemple. (Found by a simple google of the IP
and careful untangling of the results.) ...Well that's a name I've not heard
in years. And it made me wonder - do MT have an answer to demand scaling? I'm
not able to discern either way from a quick browse of their website (including
the "managed wordpress hosting" section). If they _don't_ (and they seem to
be stuck in the VPS days), ouch. If they _do_ have demand scaling, then I
think it's unprofessional to offer configurations that don't handle situations
like this.

In any case, a "504 Gateway timeout" with a tiny "nginx" underneath is going
to make people go "hmm, who's this person's ISP" and knock a few points off
whatever the answer to that question is. I wouldn't want that happening to me
or anyone I build websites for.

[Edit, in the same vein as the edit above: the conclusion above is plausible,
but incomplete. Again, it's necessary to consider all angles.]

~~~
r3bl
I'm always wondering the same. As a pretty noob sysadmin, my dead cheap[0]
second-hand dedicated server was able to handle being on the HN homepage for
11ish hours without breaking a sweat. No CDN was involved whatsoever, all
content was going directly to my single point of failure.

Yet, I have to resort to looking at cached versions of the submissions from
companies that have way more resources than I could ever imagine to have.

[0] Price of the cheapest Media Temple VPS offering.

~~~
Rjevski
Looking at this page's source it seems like Wordpress (or rather Shitpress) is
involved, so this downtime is not surprising at all considering how many
resources that pile of garbage uses.

But yeah, a properly optimized site can handle being on the HN front-page on a
single server with no issues.

~~~
SquareWheel
There's plenty of caching plugins that make Wordpress run just fine at higher
traffic. This site likely just doesn't have one installed.
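
For anyone wondering what such a caching layer boils down to, here is a minimal sketch. It is not how any particular Wordpress plugin is implemented; `PageCache` and the `render` callback are hypothetical names, and the 60 second lifetime is an arbitrary assumption.

```python
import time


class PageCache:
    """Serve rendered pages from memory until they are older than the TTL."""

    def __init__(self, render, ttl_seconds=60.0, clock=time.monotonic):
        self.render = render      # expensive function: path -> HTML string
        self.ttl = ttl_seconds
        self.clock = clock        # injectable so it can be tested
        self.entries = {}         # path -> (stored_at, html)

    def get(self, path):
        now = self.clock()
        hit = self.entries.get(path)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]         # fresh copy: skip rendering entirely
        html = self.render(path)  # missing or stale: rebuild and store
        self.entries[path] = (now, html)
        return html
```

With something like this in front of the CMS, a front-page spike hitting one URL costs one render per minute instead of one per visitor.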

~~~
Rjevski
Would you buy a new car if it needed duct tape just so it can stay in one
piece while driving on the highway? So why would you think it's acceptable for
a CMS to require third-party "cache" plugins just to be able to handle
traffic?

~~~
exikyut
Computers and cars are both solutions to problems, but the complexity of
computers is yet to be routinely contained without targeted effort. This is
especially true of the Web. Wordpress is the single most installed CMS (and
kitchen sink, at this point), it's a massive attack target, and it's installed
by people who don't really know what they're doing.

Between being a kitchen sink (and having to deal with the internal/external
API/architectural [compatibility] baggage that comes with any long-lived
implementation) and having to be the CMS equivalent of Windows/macOS, it's
a... _difficult_ ask for a WP installation with (say) 30 poorly-constructed
addons to run smoothly. Possible, yes, absolutely, but it'll need tuning.

Caching is simply the simplest bandaid that can be tacked on. If the caching
is tuned properly, it'll fix everything else.

Yes, this will mean the underlying configuration/state will be the digital
equivalent of a giant wound you just want to clean out and fix up, but, err...
it works.

:/

------
econ4all
It's interesting how the media cartel front thinks it's safe to drop all
pretense and just post their opposition research like that.

[http://digitalcontentnext.org/membership/members](http://digitalcontentnext.org/membership/members)

[http://digitalcontentnext.org/about/overview/](http://digitalcontentnext.org/about/overview/)

At the very least it reflects poorly on their members' coverage of FB/google.

~~~
zepto
Who cares? What matters is whether it is true or not.

Are you just trying to smear the information with an ad hominem, or do you
have reason to believe it is inaccurate?

~~~
econ4all
It matters because the manner in which any "findings" are presented has a big
role in shaping public opinion and in this case the opinion shapers are
purposefully distorting findings to attack an adversary.

See the WSJ's false report that google lets developers read user emails, it's
technically true but that only happens after the user allows it via an
explicit permission screen.

Or the AP's claim that google keeps track of user location even after location
history is turned off, which again is technically true but that is because
location history is just a location feature and not the master switch for all
location features.

In both cases other media outlets repeated the claims without further
commentary or research and this cartel might be the reason.

~~~
blub
Ok, since you claim that "the opinion shapers are purposefully distorting findings
to attack an adversary", name one false claim from the PDF document's executive
summary. Should be quite easy, right?

You're letting Google off the hook really easily regarding the location history
PR disaster. Google uses dark UX patterns to trick their customers into
sharing more data, and this is one more instance of that. The fact that
"technically" location history is not a master switch is just a flimsy excuse.

~~~
joshuamorton
Can I name some tautological claims instead? The last three bullet points are
all basically conjecture. "Google could do this" or "Google has the capacity
to do that".

The article uses dark patterns to make you think they're making strong claims
when they're just saying things that might be the case, or that the
researchers don't have the ability to rule out. And it doesn't help that this
article goes to great lengths to word things in more nefarious ways than the
underlying research paper.

------
yuhong
My favorite is trying to trace the problems back to Larry/Sergey, which is why
I wrote the essay. Worth mentioning that one of the hardest parts of writing
it was tracing the history of things like Google Analytics.

~~~
yuhong
I forgot to mention that I had an Ask HN on this:
[https://news.ycombinator.com/item?id=17447280](https://news.ycombinator.com/item?id=17447280)

