
Fedora, UUIDs, and user tracking - tlburke
https://lwn.net/SubscriberLink/776327/e0cf49b9b9976c5a/
======
AdmiralAsshat
Good on them for taking the time to think about how to do this properly. I use
Fedora, I'm happy to let them _know_ that I use Fedora, and even to make the
check-in somewhat regular so that they can know if I _stop_ using Fedora.

Provided that they figure out a way that absolutely nothing can be done with
the information other than to say, "This non-identifiable machine reports that
it uses Fedora," I'd be okay with that.

------
z3t4
It's a slippery slope, first you want to know how many use your software, then
you want to know what features they use, then you want to know other apps
installed, then you want to know what web sites they visit and what they
search for, and so forth.

~~~
Barrin92
I can understand the slippery slope threat in a proprietary system, but Fedora
is an open system. Should data collection at some point go beyond whatever it
is the devs announce, someone is going to pick it up and everyone is going to
drop the distro like a hot potato. So I'm not sure this is a big issue.

~~~
Karunamon
The lack of serious consequences to Canonical regarding their spyware-like,
opt-out search integrations in Ubuntu with Amazon way back when (and then
doubling down with legal threats against fixubuntu.com) gives me doubt as to
this being the case.

That goes double given Fedora's position as what boils down to RHEL upstream.
Too much corporate support for any backlash to make a dent, even if it could
get critical mass.

~~~
reacweb
IMHO, there were serious consequences. 5 years ago I would not have considered
using fedora on my desktop because ubuntu was trustworthy and I was
considering fedora as useless fragmentation. Now, I see redhat as a knight on
my side to improve linux adoption (I hope that IBM will not ruin this) and I
see fedora as more stable and up to date than ubuntu.

~~~
Jedd
Can I ask what was / is it about Debian that made you discount it, and embrace
RH / Fedora instead?

~~~
Grimm665
Not the OP, but I made a similar transition. I got fed up with using APT to
manage packages while Yum/DNF seem much more complete and elegant. I seem to
get far fewer package conflicts with Yum, and the error messages when
something does go wrong are more digestible, though this may be just due to my
use case/package selection. I also dislike the use of the dash shell by
default. To me, it just further muddies the water between compatibility of sh,
bash, and dash. I'd rather just have bash and be done with it.

Fedora's packages are also more up to date while being as or more stable than
Ubuntu. Debian is still probably king of stability, but when compared to
CentOS, I prefer CentOS' default package selection and configuration (postfix
vs exim, sudo installed by default, ssh installed and enabled by default),
especially since they embrace systemd while Debian seems to use it grudgingly
while also keeping around old methods of configuration that don't quite fit
with systemd (network configuration being the big one here, don't even get me
started on Ubuntu's adoption of friggin' Netplan)

~~~
JohnFen
> especially since they embrace systemd

I will never forgive Red Hat for SystemD, and even more than that, I will
never forgive Debian for adopting it. But that's off-topic.

~~~
Grimm665
To each their own, of course. I happen to like it a lot, but I'm still
surprised Debian chose to use it as well.

------
paavoova
It's interesting to see how this discussion steers towards privacy. Some
distributions, like Ubuntu for example, are far less conservative. Besides the
NTP tracking mentioned in the article, there's the Amazon fiasco of the past, and
Ubuntu 18.04 installs quite a bit of telemetry [1], including tracking packages.
It ships with a "dynamic" MOTD that runs a script periodically which downloads
updates from Canonical. While this may be useful for server administrators who
wish to be notified of products and updates, it has at one point shown ads for
an HBO show [2].

Annoyingly, while installing the Xubuntu flavor, there appeared to be no
option to opt out nor was there even mention of any such telemetry in the live
installer interface. I had to track it down and disable it manually post
installation - something the average user is not going to bother with and what
Canonical is surely betting on. I appreciate how Poettering brings up trust
and "red flags", knowing full well the lower the transparency, the larger the
reactionary incentive for users to opt out or disable such telemetry.
Canonical could perhaps take note.

[1]
[https://askubuntu.com/questions/1027532/how-to-opt-out-of-system-information-reports/1030168#1030168](https://askubuntu.com/questions/1027532/how-to-opt-out-of-system-information-reports/1030168#1030168)

[2]
[https://bugs.launchpad.net/ubuntu/+source/base-files/+bug/1701068](https://bugs.launchpad.net/ubuntu/+source/base-files/+bug/1701068)

------
jammygit
"Essentially, users would need to trust that the project isn't doing the
tracking because it says it isn't."

The cynic in me is recalling that red hat just got bought by IBM and IBM is in
the news for tracking people in a weather app in a sneaky way.

I don't know any better though, maybe fedora is quite independent of red
hat/IBM and its 100% legit to trust their promises. I'm not sure how it works
tbh

Edit: added quote from article

~~~
TomasSedovic
(disclaimer: I work at Red Hat, not on any OS/distro)

This is a nitpick, but Red Hat did not get bought by IBM. What happened was
IBM announcing the intention to buy Red Hat.

It's maybe a subtle but possibly important distinction. Red Hat is still its
own independent entity until the deal goes through (which means IIRC passing
the board's approval, SEC and likely other stuff). This is expected to happen
in late 2019 I believe, but it might still fall through.

This doesn't absolutely dispel any possibility of IBM's influence, but it
should be very low/zero until the merger actually goes through. But I also
don't know how all this works.

~~~
jammygit
Thanks for clarifying for me, I had that mixed up

~~~
TomasSedovic
No worries! Pretty much everyone I talked to who was aware of the deal (online
and off) thought the same.

These things are pretty complex.

------
javajosh
This post describes a bad way to track users, but the real utility of this
post is in the email that describes a way to count Fedora users without
tracking them:

[https://lwn.net/ml/fedora-devel/20190108152239.GA24118@gardel-login/](https://lwn.net/ml/fedora-devel/20190108152239.GA24118@gardel-login/)

~~~
yjftsjthsd-h
That actually sounds more reasonable, although it does run a tiny risk of being
trivial to mess with if a malicious client wanted to skew numbers. But I don't
think it's possible to defend against that without being horribly invasive
from a privacy perspective.

I must say, it feels odd to support a Poettering proposal, but this actually
does look like a good solution.

~~~
stordoff
> although it does run a tiny risk of being trivial to mess with if a malicious
> client wanted to skew numbers

Is that not also the case with the UUID solution? Generating the UUIDs in
virtual machines, or just replacing the UUIDs in the requests, doesn't seem
out of the question

------
Crontab
I don't mind Fedora wanting to get counts of things, provided it is
exclusively an opt-in feature. Debian's 'popcon' is an example of doing it
right.

------
sliken
Why not just track downloads from the mirrors? If you post a new version of
a package for fedora 29, just track how many downloads of that specific file are
made. Write some scripts for log processing and require official mirrors to
submit the logs to give you the package counts.

That way user info never makes it past the mirror (which has their IP anyways)
and you don't need anything complex like UUIDs, playing tricks with NTP, or
calling home.

This would give a reasonably accurate number. Use bash for measuring linux
installs (pretty rare to have linux installed without bash). Then more desktop
apps like firefox, eog, and xpdf to measure desktop use. If interested in
server side track mongodb, apache, mysql, and similar.
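
A minimal sketch of the log-processing script this would take, assuming the mirrors produce access logs in the common Apache/nginx format and that package files are named `<name>-<version>...rpm`. The function name, the sample paths, and the choice to count only successful (200) responses are illustrative assumptions, not anything Fedora actually runs:

```python
import re
from collections import Counter

# Pulls the requested path and the status code out of a Common Log Format
# line, e.g. ... "GET /some/dir/bash-1.2-3.fc29.x86_64.rpm HTTP/1.1" 200 ...
REQUEST_RE = re.compile(r'"GET (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')


def count_package_downloads(log_lines, package_names):
    """Count successful downloads per package name (hypothetical helper)."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match or match.group("status") != "200":
            continue  # skip non-GET lines, errors, redirects
        filename = match.group("path").rsplit("/", 1)[-1]
        for name in package_names:
            # Assumed naming rule: "<name>-<digit>..." so that "bash" does
            # not also match "bash-completion".
            if re.match(re.escape(name) + r"-\d", filename):
                counts[name] += 1
    return counts
```

Note that only the package name and a count leave the mirror; the client IP in each log line is never part of the result.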

This would also help fedora decide which applications they should pay more
attention to.

~~~
jewelry
But this would be either over-counting if some CI scripts download the version
every once in a while, or under-counting if some organization puts the image on
their own privately maintained mirror, which is quite common.

~~~
noobiemcfoob
Telemetry of any type usually fails to measure precisely the thing you want,
but something adjacent that correlates strongly. What you mention are clear
problems with inferring usage from downloads, but if you can infer the
percentage of downloads that correlates to a machine running Fedora, you don't
need much more precision.

------
JohnFen
Any distro that phones home with a unique identifier is a distro I won't touch
with a ten foot pole. I don't care what they claim they will or won't use that
identifier for.

~~~
kbenson
Maxims that act on the symptom rather than the problem rarely help in the end,
as the problem just evolves to support its needs through other means.

For example, sending a unique identifier is not the problem. Tracking people
through a unique identifier is. So, depending on your goals, you can design a
unique identifier system that does not allow tracking (or at least makes the
tracking period so small as to be unuseful for purposes other than designed)
as outlined in the article through changing the identifier on the client side
weekly.

If all you want to do is get a good estimate of how many users use what types
of configurations of your software (major and minor version), a UUID that
rotates weekly on the client side is perfectly acceptable to use for those
statistics to a fair degree of accuracy.
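
As a rough sketch of what "rotates weekly on the client side" could look like (the class name and the exact one-week period are my assumptions, and this is not the code from the proposal):

```python
import time
import uuid

ROTATION_SECONDS = 7 * 24 * 60 * 60  # assumed rotation period: one week


class RotatingId:
    """Hypothetical client-side identifier that is thrown away weekly."""

    def __init__(self, now=None):
        self._issued_at = time.time() if now is None else now
        self._value = uuid.uuid4()

    def current(self, now=None):
        now = time.time() if now is None else now
        if now - self._issued_at >= ROTATION_SECONDS:
            # The new value is random, so nothing links it to the old one.
            self._value = uuid.uuid4()
            self._issued_at = now
        return self._value
```

The point is that the rotation happens on the client, so the server never sees anything that connects this week's identifier to last week's.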

On the other end of the spectrum, people long ago started reducing their
trackable footprint online, and the online tracking ecosystem just evolved to
finding people through other, trickier methods, such as browser
fingerprinting.

~~~
JohnFen
You're right in general, of course. But here's the reason for my hardline
stance on that: history shows that trusting promises or assertions made about
things like unique identifiers is unwise, and so I have to take a strong
defensive stance.

> you can design a unique identifier system that does not allow tracking

You can (sortof), but we run against that trust issue again. If I'm giving a
unique identifier to someone, I have no way of knowing if their assertions
about its use are accurate. Even if they are, there's no guarantee that won't
change in the future.

> If all you want to do is get a good estimate of how many users use what
> types of configurations of your software (major and minor version)

You're talking about the perspective of the publisher. I'm talking about my
perspective as a user. A company's "need" to collect metrics is their problem,
not mine. If their solution results in more information disclosure than I'm
comfortable with (and a unique identifier absolutely is), then I will avoid
their software or block communications to their home base.

~~~
kbenson
> A company's "need" to collect metrics is their problem, not mine.

When it's couched in how to deliver software updates, it becomes your problem
as well. That's a transaction, and they want to charge more for it now. You
can decide it's too costly, as you indicate here, but it's not like they're
giving nothing in return.

I think it's important to note the goals of those involved. In this case, it's
the people that put together a free product for us to use and also supply free
timely software updates looking for more information on who is using what so
they can do a better job at delivering that free stuff to us.

And in this case, it's not adding tracking where it doesn't exist, it's making
it better for the specific cases that are useful to them and that impact users
the least (an accounting of software configurations). They already track
through IP address, but that's inaccurate to a much larger degree for the
information they want (but somewhat less so for the personal information you
likely want to protect). Adding an additional system that allows better
tracking of the useful information without increasing the personally
identifying features of IP based tracking (which still exists) is laudable, in
my eyes.

~~~
JohnFen
> When it's couched in how to deliver software updates, it becomes your
> problem as well.

I honestly don't see how. If/when I'm ready to take an update, I can come get
it myself. If they want to charge me (or charge me more) for it, then they can
do so at that time. No tracking needed except for that associated with
payment.

> Adding an additional system that allows better tracking of the useful
> information without increasing the personally identifying features of IP
> based tracking (which still exists) is laudable, in my eyes.

Not as laudable as not engaging in tracking in the first place. However, I
don't see how this doesn't increase personally identifying features. On the
contrary, it's adding one: a unique identifier.

~~~
kbenson
> If they want to charge me (or charge me more) for it, then they can do so at
> that time. No tracking needed except for that associated with payment.

That's what's proposed? An identifier sent along with the request to see the
current list of updates available?

> I don't see how this doesn't increase personally identifying features. On
> the contrary, it's adding one: a unique identifier.

An identifier that changes every week or so. At that point it is useless for
identifying an individual, but can still be used statistically to determine
how many systems are running what versions of Fedora, even behind NAT
gateways. The only difference from before is now instead of "there's one IP
with more than average check-ins, or check-ins from two or more different
configurations", it's "there's one IP with X number of unique identifiers that
randomize weekly seen over the last 28 days, so we can approximate X/4
different systems behind that IP".
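
To make the X/4 arithmetic concrete, here is a small sketch. It assumes every system checks in at least once per week and rotates exactly every 7 days; the function name and sample addresses are made up for illustration:

```python
from collections import defaultdict

WINDOW_DAYS = 28    # observation window from the example above
ROTATION_DAYS = 7   # assumed rotation period of the identifier


def estimate_systems_per_ip(checkins):
    """checkins: iterable of (ip, identifier) pairs seen within the window."""
    ids_by_ip = defaultdict(set)
    for ip, identifier in checkins:
        ids_by_ip[ip].add(identifier)  # repeated check-ins count once
    # Each system shows up under WINDOW_DAYS / ROTATION_DAYS = 4 identifiers.
    rotations = WINDOW_DAYS / ROTATION_DAYS
    return {ip: len(ids) / rotations for ip, ids in ids_by_ip.items()}
```

So 12 distinct identifiers behind one NAT address over 28 days would suggest roughly 3 systems, without any single identifier living long enough to follow one of them.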

~~~
JohnFen
> The only difference from before is [...]

Yes, I understand, but your explanation isn't reassuring to me. It's
confirming that I actually do understand the mechanism and its ramifications.

Red Hat can do whatever it likes (although my take on it is that they're not
likely to do this unique identifier thing). I'm not saying otherwise -- that's
their right, after all.

All I am saying is that software that does this sort of thing is unacceptable
to me and I will avoid it to the best of my ability. As is my right.

------
akerro
>unique user ID (UUID) for each installed system that would be sent with DNF
mirror-list requests. It explicitly calls out privacy concerns: "We don't want
to track; just count."

If Fedora server is compromised they can serve different packages to different
users.

~~~
derefr
Given that package servers serve packages over HTTP, you can already do this,
identifying the user you want to serve different packages by their IP.

However, the packages need to be _signed by Fedora_ for the package manager to
accept them, so this has been considered a pretty weak excuse for an "attack"
for a while now. "Getting access to code-signing keys allows you to attack the
people consuming signed binaries"—wow, you don't say!

------
rhn_mk1
Looking at the wiki page [0], I can see the benefits of the move:

> Better metrics overall

> Public stats page updated automatically

> Better knowledge of relative use of different variants

> Insight into Fedora's use in short-lived test systems and temporary
> containers vs. longer-term installations

but nothing evaluating how and whether the proposed solutions will achieve
those things.

With no method being perfect, I'm surprised that no one is calling for a
quantitative evaluation of various ID collection schemes, and that there is
no defined "good enough" value, other than

> We need better data than that.

I'm not a Fedora maintainer, and I'm not maintaining any other software of
such popularity, so I have to ask: why? I assume it's to allocate work better.
At which point do the downsides outweigh that benefit?

[0]
[https://fedoraproject.org/wiki/Changes/DNF_Better_Counting](https://fedoraproject.org/wiki/Changes/DNF_Better_Counting)

------
z3t4
If it's totally anonymous there's nothing stopping someone trolling the
statistics.

~~~
tflink
Disclaimer: I work for Red Hat on Fedora

True but we're already in that boat with the way that we gather statistics
from mirror hits. I have a hard time seeing how a method like the one proposed
would be any more vulnerable to tampering.

EDIT: spelling

~~~
whatshisface
The idea is that it isn't less vulnerable to tampering, but you pay a privacy
and public image cost.

------
v_lisivka
This change proposal can be tracked here:
[https://fedoraproject.org/wiki/Changes/DNF_Better_Counting](https://fedoraproject.org/wiki/Changes/DNF_Better_Counting)

In short:

Add a new "countme" variable. This variable will:

- Start as a "true" value,
- Reset to a "false" value the first time the client successfully makes a
  request to Fedora mirror servers, and
- Be reset to a "true" value after seven days.

This way, rather than filtering by unique IP addresses, we can count only the
"true" requests, so we count each machine once — but no more than once.

------
Beldin
I'm not sure what they want to count. It definitely isn't users, as they
ignore multiple users per system. It seems to be something like "currently
active and online machines". But then you should not ignore machines that will
not be updated. Maybe they mean "machines that follow the weekly update
schedule this week"?

That seems to be what Poettering's approach counts.

~~~
tflink
Disclaimer: I work for Red Hat on Fedora. Take that for what you will

As far as I know, the desire is to get better numbers on how much the parts of
Fedora are being used. There is always more work to do than there are folks to
do all of it; having better numbers on how much different bits are being used
helps us make better decisions on what to focus on.

Granted, I'm not Matt but I've heard him talk about similar things and have
run into the issue myself - "Is anyone even using this? Is it worth putting
this level of effort into this particular thing?"

EDIT: Phrasing of the last sentence

~~~
JohnFen
But Fedora should remain wary of an over-reliance on telemetry. It's very,
very easy to draw the wrong conclusions about things, leading to decisions
that reduce the quality of the product.

As an example, there are very likely to be packages that aren't often needed,
but are absolutely critical when they are.

------
nixpulvis
Just count the number of bug reports. That seems like a more useful metric
anyway. If the users aren't complaining who cares.

(about 75% serious)

~~~
viraptor
It's not a consistent metric. You'll get both spikes around new releases and
changes that reflect the automated reporting/ease of reporting changes.

~~~
moosingin3space
Also, since Fedora is primarily an integration project, many users report bugs
upstream.

------
anonunt
what's funny is i just started using fedora (and i have actually been really
enjoying it).. but to help me remember it's not apt or rpm or even yum i have
been thinking to myself Do Not Follow - for no reason at all other than i
first learned about it after installing a new machine and configuring firefox
etc. :)

------
dane-pgp
> Lennart Poettering ... did suggest using an application-specific machine ID,
> like those calculated by sd_id128_get_machine_app_specific().

Yes, I'm sure he did.
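
For anyone unfamiliar with the idea being quoted: an application-specific ID is a keyed hash of the machine ID and an application ID, so each application sees a stable value that cannot be turned back into the machine ID. A sketch of the general idea only, not a reimplementation of the systemd function; the helper name and the truncation to 16 bytes are my assumptions:

```python
import hashlib
import hmac
import uuid


def app_specific_id(machine_id: uuid.UUID, app_id: uuid.UUID) -> uuid.UUID:
    """Derive a per-application ID from a machine ID (illustrative only)."""
    # Keyed hash: the machine ID is the key, the application ID the message.
    digest = hmac.new(machine_id.bytes, app_id.bytes, hashlib.sha256).digest()
    # Keep 16 bytes and stamp UUID version/variant bits so it looks like a UUID.
    return uuid.UUID(bytes=digest[:16], version=4)
```

Two applications on the same machine get unrelated IDs, and the same application always gets the same one, which is exactly why it would still work as a stable per-machine identifier for whoever receives it.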

------
jjgreen
Please don't use "UUID" for that, it's taken (and useful).

~~~
nixpulvis
Well, they could just as easily use a "real" UUID [1] variant, and all the
concerns of this topic would still remain the same.

[1]:
[https://en.wikipedia.org/wiki/Universally_unique_identifier](https://en.wikipedia.org/wiki/Universally_unique_identifier)

------
Tharkun
Wonderful. I guess I'll now have to find a way to regenerate this UUID or to
spoof it every time Fedora tries to phone home.

If you want to count users, ask for permission during firstboot. If that's too
much to ask, then I'll be in the market for a new OS. Maybe I'll finally go
back to my first love: FreeBSD.

~~~
MBCook
Read the whole article. They seem to have decided against that and for a
simple ‘countme’ flag on update requests to mirrors. Possibly by only a random
subset of machines.

No tracking, just simple numeric data fit for purpose.

