Microsoft’s purchase of GitHub leaves some scientists uneasy (nature.com)
197 points by bcaulfield on June 16, 2018 | 142 comments



Seeing this published on Nature's website is quite thick with irony. They have profited greatly from locking up access to academic papers in their most prestigious journals.

The article includes the following tweet:

“Open Science is not compatible with one corporation owning the platform used to collaborate on code. I hope that expert coders in #openscience have a viable alternative to #github,” tweeted Tom Johnstone, a cognitive neuroscientist at the University of Reading, UK.

GitHub's current model is open by default, with a paid option for private repositories. Nature's default position was to lock up journal articles and sell them at extortionate prices.

It's surprising to see Nature post an article noting that scientists feel MS will change that for the worse. Microsoft has been a good participant in the open source community of late, especially under Mr. Nadella's leadership.


Nature is also not an Open Access or Open Science journal at all, and furthermore, does not require authors to publish all of their data for replication purposes.

It's incredibly hypocritical. If people cared about Open Science, they'd publish in PeerJ or PLOS One, but they care a lot more about saying they have pub credits in Nature.


Nature expressly allows preprints. You can put your Nature paper on arXiv at any point, before or after review. Authors are perfectly able to open their papers themselves. Authors also retain all copyrights.

https://www.nature.com/authors/policies/license.html

I’m no fan of big publishing, but Nature’s policy is OK in this respect, and better than, e.g., IEEE’s.


> If people cared about Open Science, they'd publish in PeerJ or PLOS One, but they care a lot more about saying they have pub credits in Nature.

True, but perhaps the thinking goes: we'll publish another paper in an open journal, so we can have our cake and eat it too.


When I was an academic, the thinking was: if I manage to publish in Nature then I might actually be able to get a postdoc when my current one / PhD runs out. Otherwise I'll go open.


See, I hope that "expert" coders understand that 'git push' works even on non-MS controlled endpoints, and know how to set up a damned wiki if need be...

GitHub is stupid valuable for MS, and will make their dev tools offerings much stronger (TFS is a dog, bringing GitHub Enterprise into MS Enterprise support & sales cycle is gonna make a loooot of scratch). But the value of the GitHub community persists only so long as the community is present. MS can't start destroying Open Science, because Open Science will be open any- and everywhere.

Frankly, disciplines should be creating their own community hubs and portals based on Git and the publishing standards of GitHub: leveraging GitHub's infrastructure and standards while capturing and engaging with the relevant users. That way they can always migrate and maintain the community.

Besides, even on a cynical reading, MS is more likely to do a bad job of community management while trying to upsell related services than to poop on things for the sake of pooping on them. If it becomes offensive you'll have a nice window to find a superior collaboration hub.


You can already use GitHub in Visual Studio. MS didn't have to buy GitHub to integrate with it.

Pretty much the best that can happen is that GitHub merely gets worse as MS bolts on functionality tied to their own tools. The worst is that they really fuck it up like they did with Skype.


There is not much to be gained by defending Microsoft, but from my view, my appreciation for their tooling is highly correlated with my own technical maturity. I spent about a decade primarily on a Mac, ran several Linux servers and developed from my Mac environment. But in the last few years I have switched to a Surface Book and a Windows development environment. It does not totally suck, especially now that I’ve got an Ubuntu logo in my taskbar. Actually, it’s awesome. Now I’m running our production web app on Python with gevent on Windows Server, using IIS with a reverse proxy. Dead simple to set up. With Windows Server 2019 we can even use the Subsystem for Linux in production. It’s just awesome, while Apple only gives a damn if I’m building iOS crap. Did I ever think I’d be doing it like this? Hell no. But it works great. To me, people who bash anything and everything Microsoft in their tooling come across as lacking technical maturity.


With a few exceptions, the problem with MS has never been their technical chops.

True, they've moved from a malignant stance to a more benign one recently, but their disregard for privacy and partnership with government should still make folks pause to consider. And pause we do.


Yes, I certainly didn't mean that the tool was bad. I use VS and have recently switched to the GitHub extension inside the editor, from a combo of GitHub Desktop (really easy to use, but also incredibly lacking in features, including the new version) and TortoiseGit (pretty good but annoyingly reliant on right-clicking in Explorer).

I generally much prefer GUIs over command lines as I'm lazy and have a bad memory, so anything off the beaten path involves loads of unnecessary googling to remind me of the right invocation to the CLI gods.

I'd avoided using it for so long because of bad memories of using TFS, but I must admit it's pretty good for general git use, with only a few minor UI gripes. I must also admit I haven't had to do a merge yet, which is usually the big test, because I'm flying solo on my newest project.


Or they remember the ’90s and early 2000s. MS has made a mess of many a fine piece of software in their efforts to ‘integrate’ it with their other offerings.


[flagged]


Please don't turn HN into Reddit.


You're going to have to go in depth about what makes the Microsoft stack so beautiful. From the outside it looks like it has grown without a benevolent dictator to keep it in line, and there are way too many wrapper layers.



I don't really care. I can ask a copypasta to define themselves.


Bravo on deftly treading the fine line between fanboi-zealotry and sarcasm: I couldn't tell which it was until the very end...I hope.



I really don't get the hate MS gets for Skype.

I used it before the acquisition and it was usually ok, and I've used it since and it's usually ok.

I don't like how I can't find some menu items because they keep moving, but this happened a lot before, too.

Is it just that they didn't progress the platform? What - specifically - have they done wrong with it?


If you and I were in a group meeting, you would get an alert stating _you_ would not be able to use the good features on this call because _I_ needed to update my Skype client.

Of course, the only client available to me was the exact same version they bought years before, as they never added features, just bug fixes. You would have been used as an unwitting tool as you helpfully mansplained how to overcome my version deficiency.


That seems reasonably understandable from the technical point of view. I'd prefer to know that than the alternative of not knowing why I couldn't use some feature.

If that is the limit of the problems people have with that acquisition then I really fail to understand what the problem is.


It is not reasonable to trick your users into harassing someone who uses a platform for which there is no newer version because MS abandoned it. A more honest technical note would own up to their choice to stop adding features.

I have long since deleted it, too many semi-competent users telling me that all you have to do is go to this site and click on this link ...


Well in the general case it’s hardly tricking them, right? For most platforms that is exactly what the user needs to do.

But is that really the worst thing about it? Or have you heard other things?

I’m a desktop Linux user, but even I don’t try using Skype on Linux.


There are multiple points of integration between Visual Studio and github, yes.

What there is not is comprehensive end-to-end support for GitHub-based workflows (issues, projects, hooks, wiki branches, releases) as first-class citizens across the entire MS platform. GitHub Enterprise, sufficiently tailored for MS shops, presents some huge opportunities for scaling dev teams and integrating with other languages and tool chains. I believe there are also significant opportunities in academics: something like VS Code + Jupyter notebooks + deep GitHub integration, aligned with K-to-PhD and classroom support, strikes right at a core MS market and would provide significant capture of key users and key metadata.

These are not hurdles of a technically challenging nature, just product design, and will improve their offerings markedly. I see them as givens for their next product updates.

The best that can happen here is a new breed of pre-shared, DVCS-based, open source projects and learning tools. MS can't do a lot to stop that, as the tools are already available from GitLab or Atlassian or some other monied interest, but MS can do a lot to enable it.


Seeing this published on Nature's website is quite thick with irony.

1. Pointing fingers at someone else as The Bad Guy is a common tactic for diverting attention away from one's own behavior.

2. If Nature is not positioning itself as open source, then they aren't really doing anything ironic. There is nothing inconsistent with not being open source yourself while calling out a threat to open source stuff.


Does the publishing platform invalidate the message?

As for corporations changing their ways, they do it all the time to accommodate their changing view of profit maximisation.


It's not just Nature magazine. GitHub was a corporation before. Maybe not as big as Microsoft, but a corporation nonetheless. We're just moving from one company owning the repository to another one. Big deal.


In addition to Microsoft’s evolving attitude toward open source, I want to add that I’m surprised at that tweet, because Microsoft Research is extraordinarily open and has contributed a lot of freely accessible basic research to society, now and in past decades.


Great point contrasting Nature with GH, it was the first thing that came to my mind as well.


"Open Science is not compatible with one corporation owning the platform used to collaborate on code"

That was already the case before Microsoft purchased GitHub. So why did they start to use it in the first place, if they had this concern?


Now the whole GitHub data set (the metadata, the ability to analyze all the projects) may not be as open, since it can conflict with other Microsoft products, such as LinkedIn if you're looking for candidates, or possibly be used as AI training data.


Microsoft products already conflict with Microsoft products.


It was a bad idea in the first place, and one of the dangers was that this one failure point could end up controlled by a company like Microsoft.

You can say that it was a bad idea for people to store all of their things in a warehouse with no sprinklers or smoke detectors, but after it's been done, it's pretty silly to scold people for not continuing to store things in a warehouse that is currently on fire.


> for not continuing to store things in a warehouse that is currently on fire.

That metaphor is rather silly. Microsoft's purchase of github didn't set the metaphorical warehouse on fire. At most, it only raised attention to the fire hazard that was present since the warehouse was created.

What next? People will complain about the privacy risks presented by services such as Dropbox only after some major corporation buys it?


I think it's blindingly naive to think that, on any meaningful timeline, a company like GitHub wouldn't be mining their userbase for trends and capitalising on them by selling them to third parties... There's nothing shady about using corporate information and user metadata to provide value to others, but you have to imagine that MS owning that data set is highly similar to MS buying it from GitHub.

Up-to-the-minute analysis of a meaningful percentage of the development ecosphere is highly valuable. Selling reports, analysis, or monitoring is a natural expansion of GitHub's business model.

Maybe having this dataset go into the same warehouse as LinkedIn strikes some as scary, but I think one actor who is less reliant on direct revenue will have less incentive to push boundaries and spread that info to as many other actors as possible... So whatever privacy issue people feel has arisen here, it's likely just refined itself a little to be much narrower but only a little deeper.

Fundamentally, if you don't trust MS with your goodies you shouldn't trust GitHub with your data at all. GitHub Enterprise might be a better solution ;)


Except that selling data to third parties has never been Microsoft’s MO. It’s always been about getting fat checks from the world’s largest companies’ IT departments.


The whole github purchase is premised on the idea that Microsoft have changed their ways in the last few years.

So it isn't safe to assume they aren't changing to a more Facebook/Google-like data-selling business model.


Fair point.

I'll counter that: if we use history as an indication of future actions, then selling off personal data has not been Microsoft's course of evil action. Suppressing competitors, sure; selling personal data, no.

Having dipped my toe briefly into the world of collecting and selling personal data, it is a rat-race way to try to make large sums of money. It is an absolute grind. I personally have a hard time seeing why Apple or Microsoft would decide to move away from making windfalls of money through direct purchases. Many startups fall into selling personal data because they can't crack the nut of getting people to pay them.


Exactly :)

My point was that EvilCorps using the metadata from GitHub was an unavoidable eventuality IMO [and honestly, it would probably be a Good Thing for developers, but whatevs...]. As it stands, that dataset is now going to be owned by a single EvilCorp, one with a distinct mercantile incentive to exploit, rather than resell, that data.

If the concern is corporate exploitation, MS ownership is more likely to reduce the number of exploitative partners to 1. Ironically, a net win for those sensitive to metadata reuse.


> That was already the case before Microsoft purchased GitHub. So why did they start to use it in the first place, if they had this concern?

For the same reason many people use GitHub (or facebook): network effect.


Yes, and we will have to see how they curate that network effect while drawing corporate value from it.


Maybe what he meant was one large corporation.

It's usually the large players that start taking down accounts at will and have bogus takedown rules, because they are the largest and don't care about any one account.


> “Open Science is not compatible with one corporation owning the platform used to collaborate on code. I hope that expert coders in #openscience have a viable alternative to #github,” tweeted Tom Johnstone, a cognitive neuroscientist at the University of Reading, UK.

This shows an astounding lack of understanding, both of Github and of git itself. Github was always owned and operated by "one corporation" - now it is owned by a different one. And "the platform" isn't Github - it is git itself! If this neuroscientist is so concerned about the acquisition, he can just set up a git server for himself and his colleagues and start citing that URL instead of Github.
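
For the unconvinced, "set up a git server" amounts to little more than a bare repository reachable over SSH. A minimal sketch, with a made-up hostname and path:

    # on a machine the lab controls (hostname and path are hypothetical)
    ssh lab.example.edu "git init --bare /srv/git/project.git"

    # colleagues then point their clones at it and push/pull as usual
    git remote add lab ssh://lab.example.edu/srv/git/project.git
    git push lab master

Citations could then point at whatever URL the lab controls rather than github.com.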


The platform is pretty clearly github, a binary on a disk isn't a "platform". And any scientist is far too busy with the interesting stuff to set up a git server, otherwise they would be a sysadmin.


Umm, support staff exist largely for this exact reason. In college I worked in IT support for the CS professors and would set up whatever services a lab needed.


To follow up on this, the design of git is inherently open and free. The very act of checking out a repo is mere seconds away from pushing it to any other server or platform/service provider.

Thus, by the great design of git, there is no way to make it a "walled garden" like your email/calendar/Facebook. Should Microsoft ever show any inkling of restricting content, moving thousands of repos to another provider would be trivial (i.e. I could do it in a day).
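
To make "trivial" concrete, here's a minimal sketch of such a move (the destination URL is just a placeholder):

    # grab everything: all branches, tags and refs
    git clone --mirror https://github.com/someuser/project.git
    cd project.git

    # point at the new host and push the complete mirror
    git remote set-url origin https://gitlab.com/someuser/project.git
    git push --mirror origin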


"During the 2014–16 Ebola outbreak in West Africa, for example, researchers used the platform to share and cross-check daily patient counts." Pretty sure this is referring to the lab I used to work in.

Honestly, I'm not particularly worried about the buyout - pretty sure the chances of them just scrapping the whole service are slim, and "GitHub disappears" seems about as likely while burning through VC cash as it does under Microsoft ownership.


the chances of them just scrapping the whole service are slim, and "GitHub disappears" seems equally as likely burning through VC cash as it does owned by Microsoft.

Right, the people who should really be worried are VSTS users. And GitLab users, funnily enough. They are VC backed and will have to exit too, and everyone’s looking at them now. The least bad thing that could happen is being bought by Atlassian and folded into Bitbucket.

And you’re completely right, it is actually pretty weird that they were never worried before, when it was burning VC cash. Being funded by grants themselves, I’m not sure scientists fully understand how this works.


All we need is for GitLab to implement a federated protocol for dealing with issues, pull requests etc., and then there's no need to worry. See https://github.com/git-federation/gitpub, which is being worked on.


Not everything has to run by capitalist rules, least of all open source (which Git is). We live in a society and subsets of that society can set up their own rules, and are.

It's more a matter of the mechanics of it all, and that part is indeed interesting. However, venture capital is not the final arbiter, any more than Microsoft is.

FWIW, I have not bailed on Github and have no specific plans to. What I'll do, is back up all my repositories to physical read-only media just in case of ridiculous catastrophic scenarios: under normal circumstances, even if Github's plug got suddenly pulled as a big gotcha to open source, my local repositories would not be deleted. If I have local git repositories, I can put them elsewhere.


IIRC they've said they intend to keep VSTS and github as separate products.


they intend to keep VSTS and github as separate products

Still, consider CodePlex, which isn’t around anymore.

I see the two merging over time, say in 5 years, and I expect the migration to be fairly painless, so I am not too worried personally, but I can see how some people might be.


Isn’t the existence of GitHub the reason CodePlex is no more? GitHub was such a good alternative that it made no sense for Google and Microsoft to maintain their own services geared towards open source projects.

Building a viable GitHub competitor would not have fit Microsoft’s strategy, which back then was focused on Windows.


Unlikely, they're targeting different markets. There is a place for both of them in Microsoft's portfolio.


MS-GitHub Enterprise would seem to address a lot of where TFS otherwise lands. Once TFS incorporates GitHub workflows and infrastructure I think anything off of that core is gonna struggle mightily.

I mean... as it stands TFS struggles with mindshare, features, and engagement. While the name and marketing will persist, I think the underlying technical reality will be the continued cannibalisation of that offering by the superior model of git and its cross-language appeal.


VSTS does a lot of stuff GitHub doesn't, and has had GitHub as a source control option for a while.


GitHub or just git? I thought just the latter.


Actually, both. GitHub is a nice option if you want to use VSTS for CI.

Note that Windows itself is in a Git repo on VSTS (the os repo in the Microsoft tenant).


For the current regime, maybe. But don't underestimate the next regime's ability to flip tables.


I mean yeah, but that's always true. But until my next grant wants to pay for someone to deal with self-hosting, it's a risk I'll take.


That is the main worry, I would say. It's less likely that GitHub flips and dies than that Microsoft becomes anti-open-source again (they are still anti-FOSS/FLOSS, I would say; not talking about the engineers working there, of course).


Of course, this ignores the fact that by advancing their employer's goals, they're, to some extent, endorsing its viewpoints.


The level of hypocrisy is beyond measure. If "Open Science" was a real thing already, nobody would care about Nature, Elsevier, et al. If there has ever been a danger to open access to scientific research, it's publishing conglomerates that provide little to no value — the exact opposite of what GitHub has been doing the past few years.

Before complaining about others, clean up your own act, Nature.


To be honest, I don't think GitHub is to be blamed for how scientists use it. AFAIK there are data repositories more suitable for the purpose of sharing data sets and metadata, and out of the hands of big corporations (though perhaps subject to big governments), e.g. Zenodo (funded by CERN) and DataONE (funded by the NSF). Zenodo can even generate a DOI for your data set, which GitHub does not do.

That being said, I think this Nature report adds nothing new to the narrative. Sure, open science is great, but who's gonna pay for it? Neither GitHub nor Microsoft is a charity. Since most of the research mentioned in this article is funded by governments, it should go to the data repositories that are designed to store data derived from publicly funded research. GitHub just wasn't designed for that purpose.


Microsoft Research is one of the largest supporters of academic computer science research in the world, and is a non-trivial funding source for many non-CS academic fields as well.


“Why don’t scientists use modern tools like GitHub, instead of these crap bespoke ones? They shouldn’t waste their time duplicating effort.”

“Why are scientists using inappropriate tools? These are for commercial use only!”

In practice, GitHub works fine for collaborative code projects and sharing data over the next few years. If you’re looking to store large amounts of data over long periods, you’re right.

If GitHub don’t want to support science (they’re just a business) then they should stop giving free products to academia, eh?


Don't make a straw man argument. I said none of the above. I'm NOT against open science. And I am glad that GitHub supports open science. In fact, GitHub does offer free private repositories to researchers who have an .edu email address.


I see it mostly as a summary of the "common wisdom"; it has been a mantra for years that you should "just put it on GitHub", but now that Microsoft has reminded us all of how cynical the world truly is, suddenly the scientists that paid attention before are to blame?


Not cool to insert fake quotes as if you are quoting the person you’re responding to.


I think it's a pretty normal shorthand, especially when readers can see for themselves that you aren't quoting them.


The beauty of all this is that people can simply run their #openscience on GitHub and GitLab and Bitbucket and (hopefully) their own Gitea instance running on a $5/month DigitalOcean droplet.

It’s funny that this is a complaint about one company controlling a centralized resource, because any centralized resource has this risk. Did they think GitHub would lose money forever and subsidise infrastructure in perpetuity? I think it was a situation of users being oblivious to risk and then murmuring because the externalized risk they ignored is now more apparent (even if smaller).

So maybe these jarring events are good, because they help us realize that many users are stupid and don’t assess data risk well. Luckily for us GitHub was really principled and Microsoft is a good fit from a PR/biz perspective. So the risk didn’t hit.

Having these discussions now means scientists learn to push to multiple public repos (or universities run their own instances for a few thousand a year), rather than relying, falsely, on one source.
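
For the self-hosted option, a rough sketch of the Gitea route (assuming Docker on that $5 droplet; the image name and ports are Gitea's documented defaults as far as I know, and the hostname is made up):

    # run Gitea, keeping its data in a named volume
    docker run -d --name gitea \
        -p 3000:3000 -p 2222:22 \
        -v gitea-data:/data \
        gitea/gitea:latest

    # then push an existing repo to it alongside GitHub
    git remote add selfhosted http://my-droplet.example.com:3000/me/project.git
    git push selfhosted --all
    git push selfhosted --tags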


Slightly OT. I was wondering what happens to GitHub's policy on employees' side projects post-acquisition. Links: https://news.ycombinator.com/item?id=13921433 and https://news.ycombinator.com/item?id=13142327


I know a lot of people don't like n-gate and webshit weekly, but I think their slice on this was particularly relevant:

> Microsoft Is Said to Have Agreed to Acquire GitHub

> A near-monopoly closed-source software company, fed up with trying to seem like a good corporate citizen by releasing source code of their worst programs, is acquired by Microsoft.

When you combine that with the concerns from the article:

> They fear that the site will become less open

(it's not currently "open" in any significant way now, bar that there are free public accounts. They don't even have a public f-ing bug tracker for site issues)

> Open Science is not compatible with one corporation owning the platform used to collaborate on code

(It was already owned by one corporation, the Github corporation. Now it's owned by a different one. It's not like GH was some kind of open source alliance distributed project or anything)


This article seems to have a very focused interest on data in GitHub repositories, as opposed to source code. I get that the article is aimed at scientists, but I don’t see the problem here: if Microsoft takes down your dataset just move it somewhere else. You’re not tied down with pull requests or comments like code repositories are.


The problem is the stability of the pointer. From a paper of mine:

"The code and simulation results are available online at https://github.com/mylabname/project."

That's in print. In another paper, I expressly cite a GitHub repository as the source of the data used in the analysis. Pointing to data in papers is the way most of the scientists I know use GitHub - because it's relatively stable, not tied to an institutional account, and relatively pain free.


Well, it still has a higher chance of staying alive than other common alternatives. From my experience, any URL put in a CS paper has a very high risk of being dead just a few years after publication. It seems that once the research grant has run out and the researchers have moved on, no one takes on the responsibility of ensuring that the web page is kept alive. Domains and hosting plans are not renewed; universities reorganize their departments and don't map the old URLs to the new ones; or policies change, and department-hosted personal web pages for researchers are closed. The solution has to be some kind of independent library service for researchers, like arXiv for data/source code, which is guaranteed to be kept alive.


[disclaimer, work for Digital Science]

There are various services for this. Figshare and Zenodo are two major ones where you get DOIs.

Figshare is archived with CLOCKSS, so if it goes down all public content can be maintained and DOIs redirected. I think Zenodo may be the same.


In practice, the pointers are the authors.

In Computer Science, it’s a mess that works fine in practice. I go to GitHub - if it’s not there, then I google the authors or the name of the project. And I don’t know what I do if I don’t find any of them or their projects, because I’ve never encountered that scenario.

I’ve never seen a paper more than say 10 years old and thought “god I wish I had their code/data” because CS moves at a rapid speed. Important stuff is preserved because other people build on it. For example, it becomes the basis of an open source project.

Whilst from a rigorous and idealistic point of view, we want central long term data storage, in practice right now the real problem is that _many scientists do not make their data and code available in any form at all_ rather than worrying about pointers decaying.


I have: papers on alias analysis for compilers. I've had a lot of difficulty filling in the gaps in the old-but-important papers I've read.


The problem is not any particular hosting service such as GitHub; it's that the pointer in your paper is relying on a single point of failure.

There was a bit of drama several years ago when Megaupload was seized and shut down; various small/free projects lost access to the only copy of some of their files. Like your paper, important documents had evolved in forums, which linked to the file hosting service for files that could not be uploaded to the forum. A few projects were the canonical documentation for something that the original author had abandoned; the first result in Google couldn't be updated, creating the same pointer problem as the reference in your paper.

At the time, a lot of people talked about finding a "replacement file hosting service" in the same way people currently talk about finding a replacement for GitHub. Moving to a different service is still a single point of failure. Instead, when you want to preserve access to data in the long term, you need to assume any single service might fail and build in redundancy.

Instead of saying, "[things] are available online at [URL]", you should include in the paper something like:

    The code and simulation results are
    available as an archive named:
           foo_project-2018-06-16.zip
    The file has the following checksums:
        MD5:  1271ed5ef305aadabc605b1609e24c52
        SHA1: ab69db8315af7de6e673a6ddf128d415157a7c3f
        (...more...)
    The file was originally hosted at:
        $GITHUB_URL
        $GITLAB_URL
        ${OTHER_HOSTING_SERVICE_URLS[@]}
        $INTERNET_ARCHIVE_URL
        $AUTHOR_UNIVERSITY_URL
        $COLLAB_UNIVERSITY_URL
With tools like git (or rsync, etc), making multiple copies of a project is very easy. Redundancy protects against some risks, but including checksums (and any other relevant metadata) makes content addressable searching possible. Even if all of the URLs in your paper eventually become defunct, someone reading the paper in the future may be able to find your data by searching for the file's hash.

The hosting service isn't the (primary) problem; the paper needs to include a pointer that is more robust than a single reference to a single path on a single server.
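
A sketch of how cheap that redundancy is in practice (the filename, remote names and URLs are placeholders):

    # checksums to quote in the paper
    sha256sum foo_project-2018-06-16.zip
    md5sum foo_project-2018-06-16.zip

    # mirror the repository to several independent hosts
    git remote add gh  https://github.com/mylab/foo_project.git
    git remote add gl  https://gitlab.com/mylab/foo_project.git
    git remote add uni ssh://git.example.edu/srv/git/foo_project.git
    for r in gh gl uni; do git push --mirror "$r"; done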


Unfortunately most papers (that I have been involved with at least) have very specific size requirements and limits. Spending that much space on sharing the files is not something that can happen. Anything more than "[things] are available online at [URL]" is just going to take up too much space. (Although, from now on I might try to include two URLs)

Besides, if / when GitHub ever shuts down, you know people will end up going through and cloning every repo on the system and archiving them at some similar url. So everyone can just visit it at notgithub.com/user/repo instead.

Now, I agree that something like that would be awesome. I just don't think it's that easy to do; it's just not practical. GitHub / GitLab are easy: you just toggle a switch to make a repo public after the paper has been published, but other services may not be as easy, especially trying to set up university URLs and keeping them private until after publication. For tech-savvy people it might be fine, but not for all.

Maybe it would be better if universities either hosted their own git repos, or even banded together and hosted something as an educational service, rather than a for-profit entity being where it is stored?

The contact info for any author involved with the paper is going to be right there at the beginning of the paper. So IF something were to ever happen to the one link, contact is not going to be hard. Yes, there are risks, but it still gets the job done for the time being.


This is what DOI handles are for[1].

That system sucks (it's always going down, or the handle goes to some university's crappy DSpace system which doesn't work properly or something).

I'll remain grateful whenever I see a Github URL.

[1] https://en.wikipedia.org/wiki/Handle_System


Nice idea, but more robust checksums would be better (especially for archives containing code).


Sounds like a labor-intensive version of ipfs


[disclaimer work for related company - Digital Science]

It's really worth getting those things into a system that will give you a DOI and participate in an archive so that content should never be lost. Figshare is one, Zenodo another (I think, but can't confirm on a phone, that the data is archived).

A DOI, if versioned, can ensure people see the actual version you used. It can be pointed somewhere else if the provider goes down, and should be archived to ensure it's not lost.

URLs are just not stable over time.


I've been consistently unimpressed with the offerings in that space.


Curious what's missing; IIRC you can set Figshare to automatically pull from GitHub on a release. That seems like a pretty low bar for free long-term archiving that fits into common workflows with DOIs.


I can't quite put my finger on why I don't care for figshare - it just never seems to quite grip.


What is wrong with an institutional account? We used university FTP servers for many years to host software created by research projects at that institution.

I work on an international standard; we are moving our repository, in a couple of stages, from being hosted at SourceForge to being hosted directly at iso.org. Reference data in this standard will also be hosted there.

Maybe the research funding agencies should start providing their own hosting with the base URI derived from the project name.


Because in the past four years I've been at three different institutions, and those accounts go away when I leave.


I presume that the departments or project groups you worked in are still there.


Some of the labs are - some of them have changed their names, or the PIs have moved on. But two of those papers were with my group, which has moved.


These very regularly change.


> That's in print.

In olden times there'd be an address you could write to for a copy of the code on tape. It's an old problem without an obvious solution.


It should not be complicated to move the repository somewhere where a search engine query with "mylabname/project" would find it.

As a side note, it would be nice to see scientists stop using git for dataset management. There are alternatives that are considerably better adapted to managing data.


The thing with git (and GitHub by extension) is that it is "good enough" for many use cases, and relatively painless to get up and running.


This is a bit unrelated, but I think it would be cool to have a decentralized way of making persistent URLs.


The one that seems to be gaining the most traction is IPFS[1]. /ipfs/ URLs are content-addressed (like git objects). As long as someone has a copy of the data somewhere on the IPFS network it will continue to be available, and if you manage to get a copy of the content from somewhere else you can try hashing it and make sure it matches. (And if it does you can `ipfs add` it and it will become available again, even if nobody who had an original copy was still connected to the network.)

[1]: https://ipfs.io/
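
A minimal sketch with the ipfs command-line tool (the filename is reused from the example above and the hash shown is a placeholder, not a real CID):

    # add the archive; ipfs prints its content hash (CID)
    ipfs add foo_project-2018-06-16.zip
    # -> added QmPlaceholderHash foo_project-2018-06-16.zip

    # anyone holding a copy can re-add it and serve it under the same hash
    ipfs add foo_project-2018-06-16.zip

    # and anyone can fetch it by hash from whichever peer has it
    ipfs cat QmPlaceholderHash > foo_project-2018-06-16.zip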


I should have added "human readable short persistent url". Like foo.com/xy/github where xy is an identifier as short as possible to avoid collision.

And with IPFS you are dependent on IPFS's own domain. To remove the domain dependency it would have to be something not based on the current DNS system.


I'm not sure how well "short" and "decentralized" go together. The problem is that there's no good way to prevent someone from creating as many short identifiers as possible and taking them up forever. Distributed, maybe, if you have a central issuing authority who publishes a complete list that others can mirror and archive. I think this is what a DOI is supposed to be.

IPFS.io itself is an IPFS gateway, but you can resolve /ipfs/ paths without it if you run your own IPFS daemon; in fact this seems to be the end goal of the IPFS foundation, with ipfs.io being a stopgap measure since most people aren't running an IPFS daemon yet. The easiest way to get this is to install the IPFS Companion browser plugin, which takes over resolution of ipfs.io and replaces it with a local gateway on your computer.


> And with ipfs you are dependent on the ipfs own domain

No, you might have gotten something wrong here. We (I work at Protocol Labs; I'm part of the development team working on IPFS) are hosting a public gateway you _can_ use but in no way have to use. You can use your own local gateway, another gateway via IP, or access other peers' gateways if they're accessible. Nothing is tied to the "ipfs.io" domain.


Yes, I understand how IPFS works. But I mean, right now we don't have the ability to print a URL that is not dependent on a domain name, ipfs.io or any other. That's why DNS would require an alternative for truly persistent and decentralized identifiers that are human readable.


Well, of course you can't have a URL without a domain name; part of the spec for URLs is that they have a host part!

But, IPFS does not depend on the domain name system to keep your data accessible. You can install IPFS Companion to remove the dependency on the DNS system and ipfs.io gateway, replacing it with your own local gateway (so even if the IPFS Foundation goes away they will still work for anyone who has this set up). Or, you can look at the URL to find the hash and perform the lookup manually. There are also some discussions about making a new URI scheme, though I don't believe they've settled on anything yet.

As I've said in my other comment, I don't think that decentralized and human-readable are mutually compatible; if there is no central authority then how can it be decided when two people want the same name? If by first-come-first-serve, what prevents someone from squatting on all the good names? Namecoin for example solves this by requiring periodic renewal, but then you have the same problem that DNS has of link rot as names expire.


Yeah, having an IPFS like system handled at the browser level (in the address bar) would solve the name resolving bit without DNS.

We could imagine something like this, the full URL is:

QmaG4FuMqEBnQNn3C8XJ5bpW8kLs7zq2ZXgHptJHbKDDVx/github/example.jpg

but could be written like so

Qx/github/example.jpg

And if there are duplicates, all duplicates are displayed in a list, with metadata (like creator, ssl info to help verify the creator...) and you select the one you want.


> "[..] researchers used the platform to share and cross-check daily patient counts."

No concerns about private datasets in the open on Github ?

I get they like the workflow but they traded ethics for convenience there.

Medical ethics.


These were publicly reported data sets of case counts. You can find them on various Ministry of Health websites.

I know because I worked in the lab that maintained one of them. The biggest issue was that one country reported their data in non-machine readable form on Facebook. Maintaining that was extremely time intensive.


If it's anonymized data, I see no problem. All scientific datasets should be in the open.


Indeed, the first four repositories I found with a Google search for `ebola outbreak github` only contain statistical data.


The cool thing about git is it doesn’t matter what happens to the platform. You can take your repo and push it somewhere else if you don’t like how Microsoft runs the show. Given how Microsoft took Github’s idea with Atom and ran with it to make VSCode, I wouldn’t be surprised if Github improves under Microsoft’s rule.


The problem lies in the comments and issue tracking, which are not stored in git. Sure, you can clone the repo, but you can only take the wiki with you; the rest would be lost.


So what I'm hearing is the solution to this migration problem is to also store the entire issue tracker as a git repo alongside the real repo and wiki repo? Git repos for all the things! (Seriously, I think this solution would fit right in along with the way Wikis are handled)


How hard is it for someone to use the GitHub API and port all of that over to a new system?


GitLab already has an automated migrator you can use.


Github has APIs to export all of this. Some services have importers to rebuild the issues but at the very least the data won't be lost.
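
For instance, a rough sketch of pulling issues and comments out via the REST API (the repo name is a placeholder; real use needs authentication and pagination):

    # fetch open and closed issues for a repo as JSON
    curl -s "https://api.github.com/repos/someuser/someproject/issues?state=all&per_page=100" \
        > issues.json

    # issue comments live under a similar endpoint
    curl -s "https://api.github.com/repos/someuser/someproject/issues/comments?per_page=100" \
        > comments.json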


To my mind, the actual problem is neither one of those things, but whether GitHub gets wrapped up into some new project. Many of the citations for GitHub aren't for software, but for it as a stable, open store for data.

Which means if that URL changes, that link - in a physical paper - goes stale.


If you need stable links then buy yourself a domain or use a URL shortener that you can update. Might be wise to also provide content hashes of the data you're referencing in your paper.


Even if it does get merged or renamed, I don't see why Microsoft wouldn't forward the URLs.

Link rot is a real issue in general though, I don't mean to minimize that.


I agree - which is why I haven't moved my lab's accounts.


Surely it ought to be possible to store all that stuff in git as well.


Won't GDPR legislation force GitHub to make that exportable?


It's already exportable through the API, that's not the problem.


So many repos don't use the wiki. Just create a docs/ folder and store .md files there.


From the article: "Others are hopeful that Microsoft’s stewardship will make the platform even more valuable."

Perhaps I missed it, but nowhere in the article were any of these hopes articulated. I may lack imagination, so can someone tell me how Microsoft can make GitHub more valuable? I don't mean how they can use GitHub to make more money for themselves, by selling "enterprise-level" support, or by making their development tools better. I'm asking how GitHub is going to be made more valuable to people who currently are not using Microsoft products.

Any ideas? What exactly is there to be "hopeful" about?


GitHub is the least functional source-code hosting service. Everything from search to issues to the wiki is sub-par compared to GitLab and others. It's massively unprofitable, and its main advantage is the large mindshare and community that keeps it around.

Microsoft, with its rich resources, can easily improve the platform's functionality, which will benefit anyone using GitHub, along with likely making several features free (like private repos).


Free private repos come to mind.


Forget about GitHub, Nature makes scientists uneasy.


Just the machine learning scientists; most every other community is more or less fine with them.


Good thing GitHub wasn't just one company ready to sell out at a moment's notice.

Why the hell are people centralizing everything in a single place? The whole point of the internet is open protocols and open endpoints that can host anything.

Host your own, build your own communities!


So will Nature be comfortable with having scientists cross publishing on GitHub in the name of "Open Science"?


To survive, GitHub must provide a way to encrypt private repositories so that they can make sure that files cannot be read by GitHub employees. Would that be possible?


Microsoft is the number two cloud vendor. When they violate the privacy of their paying customers, they are dead.

And no, telemetry is a privacy issue, but it is not the same as reading the data/code itself.


> Microsoft is the number two cloud vendor. When they violate the privacy of their paying customers, they are dead.

Perhaps it's different for paying customers, but I do have evidence of them scanning my OneDrive about ten years ago and of them reading Skype chats about a year ago. I bet if someone uploaded the OS X source code to OneDrive, or perhaps a private GitHub repository, the good parts of its design would somehow make it into Windows. Or perhaps something else that they can use; I'm pretty sure they'll check that out before making an acquisition offer.


Perhaps it's beside the point, but core OS X is open source as Darwin and XNU, and I don't see much value in them. The NT hybrid kernel is extremely well designed, and there is an ecosystem built around it. It is extremely unlikely anyone at Microsoft will find anything online that aids in Office 98 compatibility...


I think we should be more careful to distinguish between accessing data and "big data" usage like stats, ads, ... (and scanning for illegal stuff). The latter is "normal" business while the former is a real breach of trust.


I'm pretty sure they don't ban Skype accounts based on algorithms alone, seems risky.


> evidence of them scanning my OneDrive

Microsoft publicly states that they scan OneDrives for forbidden content, like pornography (yes, all pornography is forbidden on OneDrive, not just child pornography).

> I bet if someone uploads the OS X source code to OneDrive, or perhaps a private Github repository, the good parts of its design would somehow make it into Windows

The OS X kernel is open source, and that's the most valuable part of the OS X source code. It's not like Microsoft can import the OS X window styles. https://github.com/apple/darwin-xnu


>The OS X kernel is open source, and that's the most valuable part of the OS X source code

Uh, no.


I don’t really understand why it matters that these are scientists. It seems that the scientist title is added only to give this article more weight.


Because this is Nature. It's a science magazine.


Totally. Came here to say exactly this: the title of the article is needlessly and confusingly specific. It’s like saying that scientists are concerned about the declining US dollar value, or some such. There seems to be nothing unique about scientists who use GitHub as opposed to the rest of GitHub users, and their concerns are the same.


Are people worried that they're trying to eradicate free software at the source?


TBH I think this article was pretty poorly conceived and written. Was this an intern's summer project?

The complicated topic of open science, open data and reproducibility goes way beyond whether venture capitalists or a large public company are footing the bill for a fashionable place to host code publicly for free.


ughh... just roll with your own git server.



