
Pkg.jl telemetry should be opt-in - boromi
https://discourse.julialang.org/t/pkg-jl-telemetry-should-be-opt-in/
======
mixologic
I feel like many developers fail to understand the difference between the
ethos of Free/Libre/Open source _software_ , and the realities of running a
networked _service_.

Services are not free (as in beer) - they always take time, money, and labor
to provide. A PkgServer.jl is exactly the kind of thing that has to be
sustained _somehow_.

It's not possible to use a networked service without exchanging some
information with that service, which may or may not be useful for the service
providers to collect, so that they can provide a better service (Read: make it
cost less)

The idea that one should be entitled to use a service, for free, and at the
same time ask that the service does not collect any data, or make it opt-in by
default, is akin to demanding free beer that people can optionally pay for.

Caveat: My bias is from being a service provider for a packaging endpoint, a
security updates endpoint, and a community CI service. Any telemetry data we
can get our hands on to help us make informed decisions about what to support,
and what to drop support for, is absolutely invaluable.

~~~
rnhmjoj
As a user of free software I have the opposite bias. If software is built
around a community, which should be the case of free software, I think it's
better if decisions are taken by asking the users, be it forum discussions,
polls or whatever.

As you mentioned, reducing the costs usually amounts to taking something out:
this can piss off users, particularly if they were not informed and no
discussion took place. An example that comes to mind was the decision by
Mozilla to stop supporting the ALSA driver in Firefox based on the telemetry
showing little usage.

This example also shows that often data are biased and making a decision solely
based on data is not ideal: ALSA is (was) the default choice on most GNU/Linux
and BSD distributions, where Firefox is usually built and distributed by the
maintainers with telemetry disabled.

> is akin to demanding free beer that people can optionally pay for.

This is similar to how donations work: with donations you can't force people
to give you money but you can be very insistent and it can leave users with a
bad taste in their mouth. Also, if you are implying the service providers are
entitled to collect all the data they can, I think this must have limitations.
Running code on the client side from which only the provider can (directly)
benefit should require permission, because you grant users access to the server
and so should the users grant you access to their machine.

~~~
mixologic
Gathering telemetry data _is_ asking the users, and happens to be the most
cost effective way to do so. Forum discussions do not scale and only satisfy
the needs of the loudocracy, and polls are going to suffer from a similar
participation bias.

Disabling telemetry by default ended up harming the users of those
distributions by making their usage of ALSA invisible. Again, that choice by
the distro maintainers isn't something that Mozilla had any control over.
Voluntarily withholding telemetry data is similar to abstaining from an
election. You don't get to complain about who got elected if you didn't vote,
likewise you don't get to complain about lack of support for your use cases if
you don't allow upstream projects to know that you have them.

~~~
storedbox
> Gathering telemetry data is asking the users

No, it _absolutely_ is not.

------
throwawaw
This is an extraordinarily level-headed and well-reasoned version of the "how
much telemetry" conversation, from both "sides". The Julia community comes off
looking really good here.

~~~
pwdisswordfish2

> Moreover, at present, we have no idea how many people use each solver (and on
> which platform!). Knowing how many people installed which solver would allow
> us to prioritize support from our finite developer time.

Why not just let users vote on that? The support is for the users, no? Instead
the developers want to minimise the amount of time they spend on maintenance
based on the number of users who could potentially complain. The reason for
this is (as we are about to be told) so they can spend more time working on
platforms where they believe commercial solver developers could provide
for-profit support services "(or $$)".

> This would also allow us to lobby the commercial solver developers to provide
> official support (or $$). To quote one company "We'll want to provide
> official support at some point, but it looks like the scales haven't tilted
> quite yet." It'd be nice to know whether 100, 1000, 10000, or 100000 people
> per month use their software; that might change their mind.

The truth comes out. Collecting data via "frictionless" telemetry allows
someone else, e.g., commercial solver developers, to make money. Nothing wrong
with that if we let users know about these intentions, however when developers
try to operate under the guise of "free", "non-profit", "open source", etc.
while, truthfully, they have commercial motives, then it seems to me they are
doing everything they can to avoid tipping users off that this aims to be a
commercially-oriented project. Instead of just being transparent about their
motives and letting users decide, they want to sneak something by (most)
users. The issue raised here is not the collecting of statistics (nothing wrong
with that), it is the less transparent, opt-out nature of it: telemetry.
Deceptiveness, stealth. The message coming from this discussion is "Don't tip
(majority of) users off that we are collecting data." And why is that? Because
the developers know this is something most users do not want.

> Finally, if it is opt-in, the vast majority of users will not opt-in. This
> leaves us no better off than we were before. Opt-out is a good compromise.

The discussion should have ended right here. If providing usage statistics is
something that the Julia developers _already know_ the vast majority of users
do not want to do, then sneaking it by them via opt-out telemetry is wrong,
and it tells us much about the people behind Julia. If users do not want it,
and you know that, then why the heck are you doing it anyway? Anyone reading
this will know why, but most users will probably never read what we are
reading here.

The rest of this discussion devolves into "Everyone else is doing it". The
lone dissenter finally gives in to peer pressure.

I remember when using download statistics was enough. Developers still
maintained software. No "trade-offs" were needed.

~~~
3JPLW
This is quite the caricature of what's happening in that thread.

> The truth comes out.

The truth was never hidden. It's all laid out quite plainly here:
[https://julialang.org/legal/data/](https://julialang.org/legal/data/).

> If providing usage statistics is something that the Julia developers already
> know the vast majority of users do not want to do

There are three reasons someone might not opt-in: they don't want to, they
don't know about it, or they simply don't care. To ignore the latter two is
simply disingenuous.

> The rest of this discussion devolves into "Everyone else is doing it". The
> lone dissenter finally gives in to peer pressure.

That's certainly not my read of the 200+ post thread.

~~~
pwdisswordfish2
You are certainly entitled to your developer perspective.

What are the reasons users do not read the /legal/data page on the Julia
website? What are the reasons users do not read 200+ posts from developers
debating the use of opt-out telemetry? To ignore such reasons, assuming they
exist, would also be disingenuous.

If you put the choice clearly before users and they knowingly, affirmatively
choose to submit usage data, then no 200+ post thread is necessary. Instead,
the choice is not left to users. It is made by developers, and the fact of the
use of opt-out telemetry is found on a webpage that developers know users do
not read.

------
parsimo2010
There are a few concerns I have as an occasional Julia user. When I update my
packages is this going to be a silent change, or can we get a notification and
a Y/N option to opt out when updating? How visible and easy is it to change
this setting after updating if I change my mind?

I don’t have a specific concern about the Julia team using my data, but I have
general concerns about companies collecting telemetry. Can’t they get a rough
estimate of active users by counting unique IP addresses over the past X
months which doesn’t require opting people in to telemetry?

Edit: I think I read the link incorrectly. This person is arguing that users
should have to actively opt-in, not that they are opted in automatically. They
are arguing for a change that would increase privacy, and I need to opt-out in
my current installation. I didn’t know I was sending telemetry right now.

~~~
nwvg_7257
You are not sending telemetry right now. That is a feature which will be
activated in the upcoming 1.5 release. It will display a notification.

~~~
fiddlerwoaroof
If you make an HTTP request, you are sending “telemetry” information in the
form of endpoints, headers and IP information. The server may not track this
information, but it’s exploitable.

~~~
parsimo2010
My issue is that I’ve come to terms with the fact that the IP address of every
connection can be tracked server side- I can use a VPN to get a little
anonymity but can’t stop a server from logging connections and downloads. But
telemetry adds data on top of that, and it seems like a lot of software wants
to track me. I’d feel it was okay if I was required to register an account and
log in before downloading/updating packages, that’s a noticeable action that
lets my brain process the idea that I’m able to be tracked. But sending
“anonymous” metadata with almost no action on my part rubs me the wrong way.
Lots of devs try to optimize things so they are low friction for users, but I
think the Julia user base is a little different than normal software and
wouldn’t mind a little friction if it meant they had better control of their
privacy.

~~~
fiddlerwoaroof
As far as I can tell, this isn’t adding anything to IP sharing: the package
manager just attaches a persistent UUID to every request. In fact, it is more
private than IPs because it can’t be tied to an ISP or geographical region.
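
A minimal sketch of what "attaching a persistent UUID to every request" could look like on the client side. This is not Pkg.jl's actual implementation: the storage path, the function names, and the `X-Client-UUID` header name are all assumptions made up for illustration.

```python
import uuid
from pathlib import Path


def load_or_create_client_id(path: Path) -> str:
    """Return the UUID stored at `path`, creating a random one on first use."""
    if path.exists():
        return path.read_text().strip()
    # uuid4 is random: it is not derived from the IP, hostname, or MAC address.
    client_id = str(uuid.uuid4())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(client_id)
    return client_id


def request_headers(path: Path) -> dict:
    """Headers a client might attach to each package request."""
    # "X-Client-UUID" is a hypothetical header name, not a real protocol field.
    return {"X-Client-UUID": load_or_create_client_id(path)}
```

The value itself carries no network or location information, but because it is stored and reused, every request from the same installation is linkable, and the server still sees the source IP of the connection alongside it.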

~~~
ninjin
I was active in the previous HN thread [1], this one, and the Discourse one,
and this position has popped up several times; it perplexes me.
Attaching a persistent UUID on top of a protocol that carries your IP cannot
be _more_ private, as you are giving away additional information that would
otherwise have to be inferred statistically from the IP alone. Now, we can argue other
benefits of the UUID, but simply calling it a day by ignoring the fact that
you are already giving away your IP is just baffling to me. Am I being thick
here? What am I missing?

[1]:
[https://news.ycombinator.com/item?id=23706271](https://news.ycombinator.com/item?id=23706271)

~~~
detaro
> _What am I missing?_

I'm guessing there's an unspoken assumption that given a UUID the server-side
would not log IPs. It then comes down to trust that they'd stick to that.

~~~
ninjin
Thank you, that could be it. Then again, there would at least have to be a
separate log somewhere on the same box with IPs to counter abuse. I think
creating and using a UUID without explicit opt-in is still a red line for me,
but I do concede that I could very well be too paranoid for the good of myself
and the community as a whole.

I should probably get back into the Discourse thread to see if I can
contribute constructively, but the amount of back and forth between mostly “My
freedom!” and “Tū quoque!” [1] in the thread over the weekend – apart from me
being far too busy to take the time to summarise it all – has kept me away,
although it looks way better over the last few hours. With the little free
time I have I would rather work on my Julia code. '^^

[1]:
[https://en.wikipedia.org/wiki/Tu_quoque](https://en.wikipedia.org/wiki/Tu_quoque)

~~~
fiddlerwoaroof
Yeah, it sounds like they’re designing a way for package authors to get usage
stats: imo, this extra piece of data doesn’t really help the server owners de-
anonymize because it’s less identifying than the data the server is already
collecting as an http server (especially if it’s in an unlogged part of the
request like a header or a post body). But, even if it is a privacy risk
relative to the server owners, it’s preferable that data derived from this
uuid be shared with package authors, rather than IP-based data, because it’s
based on a less-identifying datasource, which means that even if someone were
to breach the database, they’d have less ability to de-anonymize people.

Also, I find this whole discussion to be somewhat irrelevant when talking
about a service serving up arbitrary code to be executed on your machine: if
you don’t trust the server owners, you really shouldn’t be executing the code
they serve up.

------
Tarrosion
The back-and-forth in that thread is a great discussion. One thing I hadn't
realized is that many other popular languages are already doing something
similar. See this post for a bit more detail:
[https://discourse.julialang.org/t/pkg-jl-telemetry-should-be...](https://discourse.julialang.org/t/pkg-jl-telemetry-should-be-opt-in/42209/17)

------
KenoFischer
Hi HN, please note that this is an active discussion thread in the Julia
community. You are all more than welcome to chime in, but we do try to keep
discussions as productive as possible, so if you do decide to comment, I'd ask
that

1) You familiarize yourself with the actual proposal and the improvements that
are currently underway and

2) Be kind

A number of people have put in an enormous amount of effort to try and get
this right - please remember that they are indeed people.

~~~
papaf
Is the telemetry available to users? I gladly opted into Syncthing telemetry
after seeing this page:
[https://data.syncthing.net/](https://data.syncthing.net/)

When the data is available to the community, just like the source code, it's a
much easier sell.

~~~
KenoFischer
The plan is to make aggregate usage data available publicly and potentially
share more detailed usage data with individual package authors. The exact
format is TBD since it'll depend on the quality of the data that we get (this
is not active yet, except on the preview build). The raw logs will be
accessible to core developers with a reasonable need to access (e.g. they're
working on the infrastructure or running the analytics), but will not be
public.

~~~
j88439h84
How about deleting the IP addresses within 48 hours like 1.1.1.1 and 8.8.8.8
do?

[https://developers.google.com/speed/public-dns/privacy](https://developers.google.com/speed/public-dns/privacy)

~~~
staticfloat
We do have a limited retention policy for the package server logs we keep
(which include client IP addresses). It's not publicly stated anywhere right
now, but one reason why we need to keep IP addresses is for abuse mitigation.
We have been hit in the past by users that do things like download large
(100MB+) files from our package cache servers multiple times a second for days
on end. This is a particularly easy case to catch (since it easily pops to the
top of any analysis you'd care to run, across any timespan) but there are more
subtle forms that require a longer time window of analysis (e.g. users that
download once per hour, all month) that would be lost in the noise without the
ability to see what's going on.
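
To illustrate why the length of the analysis window matters, here is a small sketch that counts requests per IP over a time range. It is not the package servers' actual tooling; the log format, the function name, and the thresholds are assumptions, and the addresses come from the reserved documentation ranges.

```python
from collections import Counter
from datetime import datetime, timedelta


def heavy_clients(log, start, end, max_requests):
    """Return {ip: count} for IPs with more than max_requests hits in [start, end)."""
    counts = Counter(ip for ts, ip in log if start <= ts < end)
    return {ip: n for ip, n in counts.items() if n > max_requests}


t0 = datetime(2020, 7, 1)
# One client downloads once per hour for 30 days (hypothetical data).
log = [(t0 + timedelta(hours=h), "192.0.2.10") for h in range(24 * 30)]
# Another client downloads roughly once a week.
log += [(t0 + timedelta(days=d), "198.51.100.7") for d in range(0, 30, 7)]

# Over a single hour both clients look the same: one request each.
last_hour = heavy_clients(log, t0, t0 + timedelta(hours=1), max_requests=10)
# Over the whole month the hourly downloader stands out.
whole_month = heavy_clients(log, t0, t0 + timedelta(days=30), max_requests=100)
```

With only a short retention window the hourly client is indistinguishable from a normal user; it only surfaces once the counts are accumulated over the full month.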

This comment is not meant to serve as an official policy, just pointing out
one of the reasons why we can't delete IP addresses like 1.1.1.1 and 8.8.8.8
do: the abuse vectors for a server that serves the community large
resources are very different from those of a DNS server.

Most of the "abuse" we see is not malicious in nature, but is instead users
that have some kind of very poorly-configured autoinstaller on a cluster. In
the case of a catastrophic issue like the one mentioned above, we null-routed
the IP address, reached out to the abuse contact for that IP, and worked with
the user to architect a better system. Everyone is happy now, and we can
continue to provide a high quality service for the community without breaking
the bank.

~~~
edw
How about hashing IPs? You could still see if someone were on your abuse list
if abusers.contains(hashfn(req.addr)).

~~~
KenoFischer
Doesn't help for two reasons: 1) If the hash has enough bits to be useful for
blocking, it's trivial to reverse. 2) Even if it did make the IPs anonymous, we
want to be able to email the NOC at whoever is sending the abusive traffic, so
they can go investigate.
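
The first point can be sketched as follows, assuming an unsalted SHA-256 over the dotted-quad string (the hashing scheme and function names are assumptions for illustration, not anything the package servers are known to use). The example searches only a /16 to keep it fast; the full IPv4 space has 2^32 (about 4.3 billion) addresses, which is still small enough to enumerate exhaustively.

```python
import hashlib
import ipaddress


def hash_ip(ip: str) -> str:
    """Unsalted hash of an IPv4 address, as a naive anonymization attempt."""
    return hashlib.sha256(ip.encode("ascii")).hexdigest()


def reverse_hash(target: str, network: str):
    """Brute-force the address in `network` whose hash equals `target`."""
    for addr in ipaddress.ip_network(network):
        if hash_ip(str(addr)) == target:
            return str(addr)
    return None


# The "anonymized" value a log might store (documentation-range address).
stored = hash_ip("198.51.100.23")
# Anyone holding the log can recover the address by trying every candidate.
recovered = reverse_hash(stored, "198.51.0.0/16")
```

Because the set of possible inputs is so small, the hash works as a lookup key rather than as a one-way function, which is why it does not make the logged addresses anonymous.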

~~~
j88439h84
> we want to be able to email the NOC at whoever is sending the abusive
> traffic, so they can go investigate

If you block their traffic with HTTP 429 Too Many Requests, they can email you
instead.

~~~
KenoFischer
We prefer not to break researchers' workflow because the group next door
misconfigured their server. Happens all the time. We only sinkhole IPs if the
traffic is malicious or on track to exceed our budget.

------
dnautics
It's a fascinating discussion! I don't use Julia much anymore due to a job
change, but I hope all language package teams get to read the back and forth.

------
marmada
If you download the software, it seems reasonable for it to get basic
information.

Users need to consider the developer perspective. Julia is a product with
millions of hours sunk into it. It needs to sustain itself, since it's open
source.

I doubt the telemetry is being used for profit anyway, but anything we can do
to help Julia is good. "Donations" aren't sustainable and can't fund a large
software project.

Also it's not hidden, so I fail to see the issue. If the information is in a
legal document and the __source code__, then you know exactly what's going
on. There's no shady business.

------
kanonieer
Telemetry deservedly has a terrible reputation due to its usage in proprietary
software. In open source software, it's not a deal breaker for me as I have
means to get rid of it.

But given the landscape of privacy issues, I wouldn't vote for an opt-out
telemetry in any of the OS projects I'm involved with.

------
CyberDildonics
I skimmed the link but still have the same question - is there really a
justification for having any telemetry turned on by default? I think most
people wouldn't want any network traffic unless they instigated it, let alone
unique identifiers and package information.

~~~
KenoFischer
Note that this is about metadata for package requests, so you're downloading
something from a server already. The question is what information is in that
request.

------
systemvoltage
Why telemetry at all? I don't expect a programming language to have telemetry
as a default feature.

I want to hammer this rule into everyone regardless of the domain you're
working in when it comes to privacy:

- _Explicitly ask the user. Respect their privacy. Explain why you would like
to collect data, maybe show past examples of what you've done with the data,
and don't deploy dark patterns or default behavior._

It is not that hard. No backlash. No problem at all if you ask the user. Sure,
that would lead to less than optimal telemetry for the collecting party but
there should not be any way around this. Want more data? Incentivize users,
maybe give them a free subscription for helping out with the beta testing. Give
them a discount. Treat data just like a commodity that costs money to obtain
responsibly. Right now, everyone is a data-cartel trying to hoard as much as
possible.

Why is this so hard to understand? This is the opposite of "level-headed". I
usually allow PyCharm to collect telemetry, I allow Apple to use Siri requests
for improving it. It is because they do this as respectfully as possible
without deceiving the user.

~~~
umvi
Have any zealous "opt in" folks ever been in a position where they need to
somehow obtain statistical information about their user base (to raise
funding, to make business decisions, to know what features are most being
used, etc)? Opt in is like hard mode and practically worthless, <1% of your
user base will opt in.

~~~
ninjin
Yes, as an academic and a co-creator of the de-facto annotation tool in my
field [1] I certainly have (although having Google Analytics for the website
is something I regret…). Now, we have the “luxury” of citation counts as a
proxy for academic usage, but I know next to nothing about how our tool is/has
been used in industry apart from what pops up on the mailing list. To be fair
though, I suspect we could have had a bigger impact and maybe kept the project
more “alive” with more effort on our part, and if we had decided to raise
funds, additional user metrics could have helped, but I am not shedding tears
over this.

[1]: [http://brat.nlplab.org/](http://brat.nlplab.org/)

Now, I absolutely sympathise with any developers in this situation. But I
think the underlying issue is that we lack a good way to give consent and are
stuck with awful solutions like opt-out and ridiculous pop-ups. Is there not
good work on this out there or are we forever going to have to endure sub-
optimal solutions?

~~~
edarchimbaud
Hi, I'm Kili's CTO. We have a free version of our annotation tool:
[https://kili-technology.com/](https://kili-technology.com/). It's the most
versatile tool on the market (text, image, video, voice), with native python
integration, the ability to use ML to speed up annotation, and great support.
Let me know if we can help. Edouard

~~~
ninjin
Wow… Just wow… The Internet these days is just a cesspool at times due to this
kind of behavior and I wish I could downvote this into oblivion. Your attempt
at marketing disgusts me and know that I now have an awful impression of Kili
as a company. Your need to drive customers to your company does not justify
this kind of behavior, no matter the quality of your product.

~~~
memexy
What are you talking about?

~~~
ninjin
The comment by edarchimbaud that I replied to? Where he blatantly inserted
himself into a discussion about telemetry, package development, funding based
on concrete metrics, etc. Just to namedrop his company and tool?

~~~
memexy
I didn't see anything blatant. What was blatant about it?

~~~
ninjin
Imagine the following exchange:

A: “Recently I have been thinking about personal responsibility.”

B: “Why so?”

A: “I believe there is a strong correlation between a sense of personal
responsibility and success later in life.”

C: “That is interesting, I think I read a study about this once. Here is a
link!”

<the discussion goes on for some time>

B: “Myself, I learned about personal responsibility – in particular financial
– when I as a child ran my own little business. What I did was to deliver
apples and later fruit for a small fee to the neighbourhood on my bike when I
was about twelve. It did not make me rich of course, just enough to buy a
video game in the end. But I do think it gave me solid experience in life.
Later on in high school I started designing local webpages.”

D: “Hi! I am D, I am head of research at Foobar Corp and we have a new apple
breed: [http://foo.bar/baz](http://foo.bar/baz) It is the best apple on the
market: crisp, juicy, and perfect for pies! Let me know if we can help!”

A, B, and C: “Eh?!”

Now, you are perfectly in your right to disagree. But I think D is being a
dick here and inserting themselves blatantly solely to attract attention to
their product and adding nothing to the discussion or the community as a whole
– possibly because D has signed up for some god awful “business intelligence”
tool that just scans various websites for mentions of “apples” so that they
can insert a generic, re-usable message.

Regardless, I will not monitor this conversation further as I feel we are at
this point deviating far far from the topic of this “dead” thread. If you
still feel the need to discuss this matter, feel free to dig up my e-mail on
my personal website. Trust me, I am fairly easy to locate with my username and
a keyword or two from my profile – or just look at the about page using the
link to the tool I mentioned earlier in this conversation.

------
bencollier49
Wow, if this is done without prompting the user, then it's illegal in the EU
and UK. IP addresses are considered PII.

~~~
staticfloat
The GDPR explicitly allows for the processing of personal information without
consent in the event that such processing is required for ensuring network
security and availability, see [1], [2] and [3] for more reading on this. Note
that I am not a lawyer, and you should consult a lawyer (as we did) to ensure
that all policies fall within GDPR laws.

That is precisely what the logged IP addresses are used for (an example: nginx
access logs), and is one of the reasons why we would much rather use a random
number generated by the client machine than an IP address; because the bits
themselves have no meaning, unlike IP addresses.

As mentioned in the linked thread, NumFocus has worked with a legal team that
specializes in this type of law; this plan is all in compliance with the GDPR.

[1] [https://gdpr-info.eu/recitals/no-49/](https://gdpr-info.eu/recitals/no-49/)
(The actual GDPR text regarding security concerns)

[2] [https://blogs.akamai.com/2018/08/dispelling-the-myths-surrou...](https://blogs.akamai.com/2018/08/dispelling-the-myths-surrounding-security-technology-and-gdpr.html)
(Akamai legal team confirming that this interpretation of logging IP addresses
for security purposes is valid)

[3] [https://law.stackexchange.com/a/28609](https://law.stackexchange.com/a/28609)
(Stack exchange post pointing out that even more exceptions exist beyond just
security)

~~~
bencollier49
From the first paragraph of TFA:

'The goal is to answer the question “How many Julia users are there?”'

This is a commercial concern, nothing to do with security, and to my
understanding at least, is not a valid reason for collecting PII. There
doesn't appear to be a security argument for collecting this data without
consent.

------
seemslegit
This is an inherently bad-faith practice that should be punishable for open-
source and commercial vendors alike.

