
Microsoft’s purchase of GitHub leaves some scientists uneasy - bcaulfield
https://www.nature.com/articles/d41586-018-05426-0
======
lotia
Seeing this published on Nature's website is quite thick with irony. They have
profited greatly from locking up access to academic papers in their most
prestigious journals.

The article includes the following tweet:

“Open Science is not compatible with one corporation owning the platform used
to collaborate on code. I hope that expert coders in #openscience have a
viable alternative to #github,” tweeted Tom Johnstone, a cognitive
neuroscientist at the University of Reading, UK.

Github's current position is to be default open with a paid option for
private. Nature's default position was to lock up journal articles and sell
them for extortionate prices.

It's surprising to see them post an article noting that scientists feel MS
will change that for the worse. Microsoft has been a good participant in the
open source community of late, especially under Mr. Nadella's leadership.

~~~
icelancer
_Nature_ is also not an Open Access or Open Science journal at all, and
furthermore, does not require authors to publish all of their data for
replication purposes.

It's incredibly hypocritical. If people cared about Open Science, they'd
publish in PeerJ or PLOS One, but they care a lot more about saying they have
pub credits in _Nature_.

~~~
amelius
> If people cared about Open Science, they'd publish in PeerJ or PLOS One, but
> they care a lot more about saying they have pub credits in Nature.

True, but perhaps the thinking goes: we'll publish another paper in an open
journal, so we can have our cake and eat it too.

~~~
n4r9
When I was an academic, the thinking was: if I manage to publish in Nature
then I might actually be able to get a postdoc when my current one / PhD runs
out. Otherwise I'll go open.

------
ToFab123
"Open Science is not compatible with one corporation owning the platform used
to collaborate on code"

That was already the case before Microsoft purchased GitHub. So why did they
start to use it in the first place, if they had this concern?

~~~
org3432
Now the whole data set in GitHub, the metadata, being able to analyze all the
projects may not be as open since it can conflict with other Microsoft
products, such as LinkedIn if you're looking for candidates, or possibly AI
training data.

~~~
bobbytherobot
Microsoft products already conflict with Microsoft products.

------
jdormit
> “Open Science is not compatible with one corporation owning the platform
> used to collaborate on code. I hope that expert coders in #openscience have
> a viable alternative to #github,” tweeted Tom Johnstone, a cognitive
> neuroscientist at the University of Reading, UK.

This shows an astounding lack of understanding, both of Github and of git
itself. Github was _always_ owned and operated by "one corporation" - now it
is owned by a different one. And "the platform" isn't Github - it is git
itself! If this neuroscientist is so concerned about the acquisition, he can
just set up a git server for himself and his colleagues and start citing that
URL instead of Github.

~~~
SiempreViernes
The platform is pretty clearly github, a binary on a disk isn't a "platform".
And any scientist is far too busy with the interesting stuff to set up a git
server, otherwise they would be a sysadmin.

~~~
tedajax
Umm, support staff exist largely for this exact reason. In college I worked in
IT support for the CS professors and would set up whatever services a lab
needed.

------
Fomite
"During the 2014–16 Ebola outbreak in West Africa, for example, researchers
used the platform to share and cross-check daily patient counts." Pretty sure
this is referring to the lab I used to work in.

Honestly, I'm not particularly worried about the buyout - pretty sure the
chances of them just scrapping the whole service are slim, and "GitHub
disappears" seems equally as likely burning through VC cash as it does owned
by Microsoft.

~~~
gaius
_the chances of them just scrapping the whole service are slim, and "GitHub
disappears" seems equally as likely burning through VC cash as it does owned
by Microsoft._

Right, the people who should really be worried are VSTS users. And GitLab
users funnily enough. They are VC backed and will have to exit too, and
everyone’s looking at them now. The least worst thing that could happen is
being bought by Atlassian and folded into Bitbucket.

And you’re completely right, it is actually pretty weird that they were never
worried before when it was burning VC cash. Funded by grants, I’m not sure
scientists fully understand how this works.

~~~
earenndil
IIRC they've said they intend to keep VSTS and github as separate products.

~~~
gaius
_they intend to keep VSTS and github as separate products_

Still, consider Codeplex, which isn’t around anymore.

I see the two merging over time, say in 5 years, and I expect the migration to
be fairly painless, so I am not too worried personally, but I can see how some
people might be.

~~~
radicalbyte
Unlikely, they're targeting different markets. There is a place for both of
them in Microsoft's portfolio.

~~~
bonesss
MS-GitHub Enterprise would seem to address a lot of where TFS otherwise lands.
Once TFS incorporates GitHub workflows and infrastructure I think anything off
of that core is gonna struggle mightily.

I mean... as it stands TFS struggles with mindshare, features, and engagement.
While the name and marketing will persist, I think the underlying technical
reality will be the continued cannibalisation of that offering by the superior
model of git and its cross-language appeal.

------
janwh
The level of hypocrisy is beyond measure. If "Open Science" was a real thing
already, nobody would care about Nature, Elsevier, et al. If there has ever
been a danger to open access to scientific research, it's publishing
conglomerates that provide little to no value — the exact opposite of what
GitHub has been doing the past few years.

Before complaining about others, clean up your own act, Nature.

------
geoalchimista
To be honest, I don't think GitHub is to be blamed for how scientists use it.
AFAIK there are data repositories more suitable for the purpose of sharing
data sets and metadata, and out of the hands of big corporations (but may be
subject to big governments), e.g., Zenodo (funded by CERN) and DataONE (funded
by NSF). Zenodo can even generate a DOI for your data set, which GitHub does
not do.

That being said, I think this Nature report adds nothing new to the narrative.
Sure, open science is great, but who's gonna pay for it? Neither GitHub nor
Microsoft is a charity. Since most of the research mentioned in this article
is funded by governments, the researchers should go to the data repositories
that are designed to store data derived from publicly funded research. GitHub just
wasn't designed for that purpose.

~~~
randomsearch
“Why don’t scientists use modern tools like GitHub, instead of these crap
bespoke ones? They shouldn’t waste their time duplicating effort.”

“Why are scientists using inappropriate tools? These are for commercial use
only!”

In practice, GitHub works fine for collaborative code projects and sharing
data over the next few years. If you’re looking to store large amounts of data
over long periods, you’re right.

If GitHub don’t want to support science (they’re just a business) then they
should stop giving free products to academia eh.

~~~
geoalchimista
Don't make a straw man argument. I said none of the above. I'm NOT against
open science. And I am glad that GitHub supports open science. In fact, GitHub
does offer _free_ private repositories to researchers who have an .edu email
address.

~~~
SiempreViernes
I see it mostly as a summary of the "common wisdom"; it has been a mantra for
years that you should "just put it on github", but now that Microsoft has
reminded us all of how cynical the world truly is, suddenly the scientists
that paid attention before are to blame?

------
prepend
The beauty of all this is that people can simply run their #openscience in
Github and gitlab and bitbucket and (hopefully) their own gitea instance
running on a $5/month digital ocean instance.

It’s funny that this is a complaint about a company controlling a centralized
resource because they worry. But any centralized resource has this risk. Did
they think Github would lose money forever and subsidise infrastructure in
perpetuity? I think it was a situation of users being oblivious to risk and
then murmuring because the externalized risk they ignored is now more apparent
in a smaller risk.

So maybe these jarring events are good, because it helps us realize that many
users are stupid and don’t assess data risk well. Luckily for us Github was
really principled and Microsoft is a good fit from a PR/biz perspective. So the
risk didn’t hit.

It's worth having these discussions now, so scientists learn to push to
multiple public repos (or universities run their own instance for a few
thousand a year), rather than rely -falsely- on one source.
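
A minimal sketch of "push to multiple public repos", assuming git is installed and every remote already exists and accepts your credentials. The function names and the remote URLs in the usage note are hypothetical placeholders, not anything from the article. Note that `git push --mirror` pushes every ref and also deletes refs on the remote that no longer exist locally, so only point it at remotes meant to be pure copies.

```python
import subprocess


def mirror_push_commands(remotes):
    """Build one `git push --mirror <url>` command per remote URL."""
    return [["git", "push", "--mirror", url] for url in remotes]


def push_to_all(repo_dir, remotes, dry_run=True):
    """Push the repository in `repo_dir` to every remote.

    With dry_run=True the commands are only printed, nothing is executed.
    Returns {url: exit code}, or {url: None} in dry-run mode.
    """
    results = {}
    for cmd in mirror_push_commands(remotes):
        url = cmd[-1]
        if dry_run:
            print(" ".join(cmd))
            results[url] = None
        else:
            results[url] = subprocess.run(cmd, cwd=repo_dir).returncode
    return results
```

Usage would look something like `push_to_all(".", ["git@github.com:mylab/project.git", "git@gitlab.com:mylab/project.git"], dry_run=False)`, run from a cron job or after each release.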

------
ploggingdev
Slightly OT. I was wondering what happens to Github's policy on employees'
side projects post acquisition. Links:
[https://news.ycombinator.com/item?id=13921433](https://news.ycombinator.com/item?id=13921433)
and
[https://news.ycombinator.com/item?id=13142327](https://news.ycombinator.com/item?id=13142327)

------
SCdF
I know a lot of people don't like n-gate and webshit weekly, but I think their
slice on this was particularly relevant:

> Microsoft Is Said to Have Agreed to Acquire GitHub

> A near-monopoly closed-source software company, fed up with trying to seem
> like a good corporate citizen by releasing source code of their worst
> programs, is acquired by Microsoft.

When you combine that with the concerns from the article:

> They fear that the site will become less open

(it's not currently "open" in any significant way now, bar that there are free
public accounts. They don't even have a public f-ing bug tracker for site
issues)

> Open Science is not compatible with one corporation owning the platform used
> to collaborate on code

(It was already owned by one corporation, the Github corporation. Now it's
owned by a different one. It's not like GH was some kind of open source
alliance distributed project or anything)

------
saagarjha
This article seems to have a very focused interest on data in GitHub
repositories, as opposed to source code. I get that the article is aimed at
scientists, but I don’t see the problem here: if Microsoft takes down your
dataset just move it somewhere else. You’re not tied down with pull requests
or comments like code repositories are.

~~~
Fomite
The problem is the stability of the pointer. From a paper of mine:

"The code and simulation results are available online at
[https://github.com/mylabname/](https://github.com/mylabname/) project."

That's in _print_. In another paper, I expressly cite a GitHub repository as
the source of the data used in the analysis. Pointing to data in papers is the
way most of the scientists I know use GitHub - because it's relatively stable,
not tied to an institutional account, and relatively pain free.

~~~
pdkl95
The problem is not any particular hosting service such as GitHub, it's that
the pointer in your paper is relying on a single point of failure.

There was a bit of drama several years ago when Megaupload was seized and shut
down; various small/free projects lost access to the only copy of some of
their files. Like your paper, important documents had evolved in forums, which
linked to the file hosting service for files that could not be uploaded to the
forum. A few projects were the canonical documentation for something that the
original author had abandoned; the first result in Google couldn't be updated,
creating the same pointer problem as your reference in a paper.

At the time, a lot of people talked about finding a "replacement file hosting
service" in the same way people currently talk about finding a replacement for
GitHub. Moving to a different service is still a single point of failure.
Instead, when you want to preserve access to data in the long term, you need
to assume any single service might fail and build in redundancy.

Instead of saying, "[things] are available online at [URL]", you should
include in the paper something like:

    
    
        The code and simulation results are
        available as an archive named:
               foo_project-2018-06-16.zip
        The file has the following checksums:
            MD5:  1271ed5ef305aadabc605b1609e24c52
            SHA1: ab69db8315af7de6e673a6ddf128d415157a7c3f
            (...more...)
        The file was originally hosted at:
            $GITHUB_URL
            $GITLAB_URL
            ${OTHER_HOSTING_SERVICE_URLS[@]}
            $INTERNET_ARCHIVE_URL
            $AUTHOR_UNIVERSITY_URL
            $COLLAB_UNIVERSITY_URL
    

With tools like git (or rsync, _etc._), making multiple copies of a project is
very easy. Redundancy protects against some risks, but including checksums
(and any other relevant metadata) makes _content addressable searching_
possible. Even if all of the URLs in your paper eventually become defunct,
someone reading the paper in the future may be able to find your data by
searching for the file's hash.

The hosting service isn't the (primary) problem; the paper needs to include a
pointer that is more robust than a single reference to a single path on a
single server.
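
A rough sketch of how the checksum part could be generated, using only Python's standard `hashlib`. The function names, the choice of algorithms, and the output layout are illustrative assumptions; the archive name and URLs passed in would be your own.

```python
import hashlib
from pathlib import Path


def file_checksums(path, algorithms=("md5", "sha1", "sha256"), chunk_size=1 << 20):
    """Return {algorithm: hex digest} for the file at `path`.

    The file is read in chunks so large archives do not need to fit in memory.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def citation_block(path, urls):
    """Format a plain-text availability note: file name, checksums, mirror URLs."""
    lines = ["The code and simulation results are available as an archive named:",
             "    " + Path(path).name,
             "The file has the following checksums:"]
    for name, digest in file_checksums(path).items():
        lines.append("    {}: {}".format(name.upper(), digest))
    lines.append("The file was originally hosted at:")
    lines.extend("    " + url for url in urls)
    return "\n".join(lines)
```

MD5 and SHA1 are kept here only because they appear in the template above; for anything where tampering matters, the SHA256 line is the one to rely on.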

~~~
Bedon292
Unfortunately most papers (that I have been involved with at least) have very
specific size requirements and limits. Spending that much space on sharing the
files is not something that can happen. Anything more than "[things] are
available online at [URL]" is just going to take up too much space. (Although,
from now on I might try to include two URLs)

Besides, if / when GitHub ever shuts down, you know people will end up going
through and cloning every repo on the system and archiving them at some
similar url. So everyone can just visit it at notgithub.com/user/repo instead.

Now, I agree that something like that would be awesome. I don't think it's that
easy to do. It's just not practical. GitHub / Lab are easy: you just toggle a
switch to make a repo public after the paper has been published, but other
services may not be as easy. Especially trying to set up university urls, and
keeping them private until after publication. For tech savvy people it might
be fine, but not for all.

Maybe it would be better if universities either hosted their own git repos, or
even banded together and hosted something as an educational service? Rather
than a for-profit entity being where it is stored.

The contact info for any author involved with the paper is going to be right
there at the beginning of the paper. So IF something were to ever happen to
the one link, contact is not going to be hard. Yes, there are risks, but it
still gets the job done for the time being.

------
johnchristopher
> "[..] researchers used the platform to share and cross-check daily patient
> counts."

No concerns about private datasets in the open on Github?

I get they like the workflow but they traded ethics for convenience there.

Medical ethics.

~~~
jarfil
If it's anonymized data, I see no problem. All scientific datasets should be
in the open.

~~~
johnchristopher
Indeed, the first four repositories I found for a google search with `ebola
outbreak github` terms only have statistical data.

------
masonicb00m
The cool thing about git is it doesn’t matter what happens to the platform.
You can take your repo and push it somewhere else if you don’t like how
Microsoft runs the show. Given how Microsoft took Github’s idea with Atom and
ran with it to make VSCode, I wouldn’t be surprised if Github improves under
Microsoft’s rule.

~~~
jarfil
The problem lies in the comments and issue tracking, which are not stored in
git. Sure, you can clone the repo, but you can only take the wiki with you,
the rest would be lost.

~~~
Fomite
To my mind, the actual problem is neither one of those things, but what happens
if Github is wrapped up into some new project. Many of the citations for Github aren't
for software, but as a stable, open store for data.

Which means if that URL changes, that link - in a physical paper - goes stale.

~~~
Cyphase
Even if it does get merged or renamed, I don't see why Microsoft wouldn't
forward the URLs.

Link rot is a real issue in general though, I don't mean to minimize that.

~~~
Fomite
I agree - which is why I haven't moved my lab's accounts.

------
mariodiana
From the article: "Others are hopeful that Microsoft’s stewardship will make
the platform even more valuable."

Perhaps I missed it, but nowhere in the article were any of these hopes
articulated. I may lack imagination; so, can someone tell me how Microsoft can
make GitHub _more_ valuable? I don't mean how they can use GitHub to make more
money for themselves, by selling "enterprise-level" support, or by making
_their_ development tools better. I'm asking how GitHub is going to be made
more valuable to people who currently are not using Microsoft products.

Any ideas? What exactly is there to be "hopeful" about?

~~~
manigandham
Github is the least functional source-code hosting service. Everything from
search to issues to wiki is sub-par compared to Gitlab and others. It's
massively unprofitable and its main advantage is the large mindshare and
community that keeps it around.

Microsoft with its rich resources can easily improve the platform
functionality which will benefit anyone using Github, along with likely making
several features free (like private repos).

------
chris_wot
Forget about GitHub, _Nature_ makes scientists uneasy.

~~~
SiempreViernes
Just the machine learning scientists, most every other community are more or
less fine with them.

------
scruffyherder
Good thing GitHub wasn't just one company ready to sell out at a moment's
notice.

Why the hell are people centralizing everything on a single place? The whole
point of the internet is open protocols, open end points that can host
anything.

Host your own, build your own communities!

------
subatomic
So will Nature be comfortable with having scientists cross publishing on
GitHub in the name of "Open Science"?

------
tanu057
To survive, GitHub must provide a way to encrypt private repositories so
that they can make sure that files cannot be read by github employees. Would
that be possible?

~~~
oaiey
Microsoft is the number two cloud vendor. When they violate the privacy of
their paying customers, they are dead.

And no, telemetry is about privacy but not the same as reading the data/code
itself.

~~~
lucb1e
> Microsoft is the number two cloud vendor. When they violate the privacy of
> their paying customers, they are dead.

Perhaps it's different for paying customers, but I do have evidence of them
scanning my OneDrive about ten years ago and them reading Skype chats about a
year ago. I bet if someone uploads the OS X source code to OneDrive, or
perhaps a private Github repository, the good parts of its design would
somehow make it into Windows. Or perhaps something else that they can use, I'm
pretty sure they'll check that out before making an acquisition offer.

~~~
oaiey
I think we should be more careful to distinguish between accessing data and
"big data" usage like stats, ads, ... (and scanning for illegal stuff). The latter
is "normal" business while the other is a real breach of trust.

~~~
lucb1e
I'm pretty sure they don't ban Skype accounts based on algorithms alone, seems
risky.

------
notimetorelax
I don’t really understand why it matters that these are scientists. It seems
that the scientist title is added only to give this article more weight.

~~~
beojan
Because this is Nature. It's a science magazine.

------
paulific
Are people worried that they're trying to eradicate free software at the
source?

------
a-dub
tbh i think this article was pretty poorly conceived and written. was this an
intern summer project?

the complicated topic of open science, open data and reproducibility goes way
beyond whether venture capitalists or a large public company are footing the
bill for a fashionable place to host code publicly for free.

------
wpdev_63
ughh... just roll with your own git server.

