Not to say I disagree with the frustration... but it's also not something new. It's been this way for decades. I'd much rather hear about who is doing work in this space and what they're working on. Here are the ones I know of:
1. The Center for Open Science (https://cos.io) is one such org trying to fix this with the Open Science Framework.
2. GitHub also recognizes the need for citable code and gives special discounts to research groups; in fact, Mozilla is one of the groups they work with.
Two smaller related startups are:
3. Datazar (https://datazar.com) - A way to freely distribute scientific data.
4. Liquid (https://getliquid.io) - A scientific data management platform. Somewhat like "Excel for scientific data as a Service".
Also, a related HN thread from some years ago: "We need a GitHub of Science" https://news.ycombinator.com/item?id=2425823
Code is a joke: most data is processed using "pipelines", which in reality means some irreproducible mess. People don't generally do research trying to understand how cells or tissues work; they generally write papers about "stories" they found. Only a small minority are trying to do serious modeling using serious math.
You're not wrong, and it's not limited to bioinformatics; Reinhart and Rogoff's findings were reversed when an additional 5 rows were included in the spreadsheet they used to calculate the correlation between GDP growth and debt ratios. And of course they insist that, despite the actual outcome being twice as strong and in the opposite direction, they still support their original position.
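If it sounds implausible that a handful of rows could flip a result, here's a toy illustration with made-up numbers (nothing to do with the actual Reinhart-Rogoff dataset) of how a spreadsheet range that stops a few rows early changes an average:

    # Made-up numbers, purely to illustrate the mechanism:
    growth = [-0.1, 0.2, 0.3, -0.2, 0.1, 1.9, 2.2, 2.4, 1.8, 2.1]

    truncated = growth[:5]                  # the range accidentally stopped early
    print(sum(truncated) / len(truncated))  # ~0.06 -- "growth stalls"
    print(sum(growth) / len(growth))        # ~1.07 -- the full data says otherwise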
I wonder if one can get a CS PhD by producing enough retractions. Of course, it won't win you many friends in the academy, and would probably lead to less source code being made available. But given the Perl code I've seen published whose termination condition is a divide-by-zero exception, one can argue that peer review in the information age has to include code review.
Judging academically, the original paper and the refuting paper are a healthy debate, but the dynamics of society and politics abuse them to attack a whole school of thought at large (the Austrian school: less bailout, less intervention by government, less control over everything) in favor of the Keynesian school (more bailout, more government spending, more public debt, especially in recession and crisis).
Anyway, it remains a controversy: theoretically one can do what one wants, but once it involves policy and real-life matters, it is hard to argue for which method is right and which is wrong in the presence of so many (ready-to-be-angry) interest groups.
I agree that from a purely academic point of view this is nothing big to worry about, but this paper played a completely outsized role. And the authors stood by and let things run their course, without any attempt to rein in or moderate the debate.
Speaking as a layman, that seems like a very strong claim for what one hopes is a hard science supposedly applying the very best practices of the scientific method (i.e. falsifiable theories vs. anecdotal stories).
Is this meant to be hyperbole to get your point across, or something that is generally known throughout the bio[med|tech] industry? As a sister comment pointed out, the latter scenario would be quite alarming.
EDIT: I'm aware of growing sentiment within the scientific community to reconsider using p-values, something John P.A. Ioannidis and his body of work helped raise awareness of. Was this the "story"-like theme you were referring to in cell and tissue papers?
However, as data analysis techniques have become more powerful and data easier to produce - thanks to advances in the scientific methodologies we use to access information about the matter we study - datasets are getting bigger but also easier to manage.
We're at an interesting point where many of the useful mutant screens have been done to saturation, at least for the most common species. Expanding our genomic resources to screen more species is certainly helping. The way forward is just the same as it ever was; good mutant phenotypes with strong genetics.
I don't think 'serious math' is at all useful for most biological investigation. Modelling is only a means for generating hypotheses, and in my experience it's a terribly weak tool. A poor substitute for proper genetics.
Struck a nerve
Medicine certainly has its own issues, but they're working on it:
I mean really, the Churchill quote about democracy applies here as well: the current way of doing research is the worst possible way of doing it, except for all the other ways we've tried before.
VUSes that actually could be clinically significant are hoarded; conversely, some companies claim to know that certain VUSes are not, based on their own proprietary data that no one else sees or double-checks. Meanwhile the actual data comes from patients, who are left in the dark.
That's just cruel. Comparatively, at least CS gets math formulations
What I have found is that if you need the code for a specific data analysis, send an email to the authors of the paper. More often than not, you will be surprised. They do share code with fellow researchers.
So - if you really do want the code, write the email. Be forewarned though - the quality is often bad from a software engineering perspective and suits a very specific purpose. It will also come in a language of the author's choice (Tcl/Tk for ns2 scripts, MATLAB scripts, Octave, etc.)
Maybe that just means I don't write very interesting papers.
The purpose of the code written is usually very simple: to produce the results of the paper, not to provide a tool other people can use out of the box. Even when such a tool is nominally provided (for example, when a statistics paper is accompanied by an R package), there are good reasons to be very careful with it: for example, the paper may include assumptions on the valid range of inputs, and using the package without actually reading the paper first would lead to absurd results -- which is something that has happened. The way to use academic research results is to (1) read and understand the paper, (2) reproduce the code -- ideally from scratch, so that your results are (hopefully) unaffected by the authors' bugs, (3) verify on a test problem, and (4) apply it to your data. Using an out-of-the-box routine skips steps 1-3, which are the whole point of reproducibility.
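To make the valid-range point concrete, here's a toy sketch (a hypothetical routine, not any real package) of how skipping the paper bites you:

    # Toy sketch: a routine whose derivation (in the hypothetical paper) is
    # only valid for small inputs, say |x| < 0.5, where sin(x) ~ x holds.
    import math

    def fast_sin(x):
        # Small-angle approximation; the "paper" assumes |x| < 0.5.
        return x

    print(fast_sin(0.1), math.sin(0.1))  # 0.1 vs 0.0998... -- close enough
    print(fast_sin(3.0), math.sin(3.0))  # 3.0 vs 0.1411... -- absurd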
That's not how I read his comments at all.
What he seems to be asking for is the ability to take the code you used to produce the pretty graphs and tables in your paper and re-run it, maybe tweak it himself and use it on a slightly different dataset. He wants to be able to see that your results extend to more than just the toy synthetic dataset you made up, and also be able to verify that some bug in your verification code didn't make the results seem more successful. Finally, he wants to be able to compare apples-to-apples by knowing all the details of your procedure that you didn't bother putting into the paper.
(Impersonal you, of course)
You're assuming such code exists. If the graphs were produced by hand (e.g., typing things into MATLAB to create the plot and then saving the figure), then there is no code to hand off. Now the code request has risen to "redo all that work".
That is not too much to ask, but the academic system is full of perverse incentives. Good, robust work loses to good-looking, quick-and-dirty work all the time.
We need funding bodies to require publication of all these details, and we need to structurally combat publish-or-perish. Hire people based on their three best results, not on h-factor or other statistical measures.
This rests on a common false assumption that programmers make: they think it's easier to write bug-free code when starting from scratch. The reality is that it's almost always easier to start with something that's nearly working and find and fix the bugs.
What really happens when you do a clean room reproduction is that you end up with two buggy programs that have non-overlapping sets of bugs, and you spend most of the effort trying to figure out why they don't match up. It's a dumb way to write software when it can be otherwise avoided.
Worse, with the attitude that the OP has, do you really think they will take extra time to verify the entire code base or look for bugs?
Often the best way to find errors in these sorts of analyses is to do a clean room implementation then figure out why the two programs don't produce the same results.
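In practice that comparison can even be automated - differential testing, more or less. A minimal sketch with stand-in functions (the real ones would be the published analysis and your clean-room version):

    import random

    def original_impl(xs):   # stand-in for the published analysis code
        return sum(xs) / len(xs)

    def cleanroom_impl(xs):  # stand-in for the from-scratch reimplementation
        total = 0.0
        for x in xs:
            total += x
        return total / len(xs)

    # Throw lots of random inputs at both and chase any disagreement.
    for _ in range(1000):
        xs = [random.gauss(0, 1) for _ in range(random.randint(1, 50))]
        a, b = original_impl(xs), cleanroom_impl(xs)
        if abs(a - b) > 1e-9:
            print("disagreement on", xs, "->", a, "vs", b)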
Reimplementation matters hugely (in ML at least). But that doesn't mean having the original implementation available isn't a huge advantage; obviously it is.
That is 100% not what he's asking. I don't know how you could even interpret that as what he's asking.
He wants to be able to take your research and run it over an updating dataset to verify that the conclusions of said research actually still apply to that data.
Well it's not, although CS should be particularly amenable to reproducibility.
> "He is asking that researchers produce code that he can put into production"
No, he asked for code [full stop]. "Because CS researchers don't publish code or data. They publish LaTeX-templated Word docs as paywalled PDFs."
> "The way to use academic research results is to (1) read and understand the paper, (2) reproduce the code -- ideally, from scratch, so that his results are (hopefully) unaffected by authors' bugs, (3) verify on a test problem, and (4) apply to his data."
He wants to re-run the author's analysis with new data. He's not looking to recreate the research from scratch or publish a new paper. Saying that this is the only valid use of the results is awfully shortsighted. It misses the point that the research has uses beyond use by other researchers.
Imagine if rather than open source software, we published the results of our new modules and told potential collaborators to build it from scratch to verify the implementation first. You'd learn a lot about building that piece of software, but you've missed an enormous opportunity along the way.
That sort of thing has been done for years in the scientific computing world. The end result is that you are making decisions based on code that may have worked once (definition of 'academic code': it works on the three problems in the author's dissertation. There's production code, beta code, alpha code, proofs of concept, and academic code.) but that you have no reason to trust on other inputs and no reason to believe correct.
Case in point: I had lunch a while back with someone whose job was to run and report the results of a launch vehicle simulation. Ya know, rocket science. It required at least one dedicated human since getting the parameters close to right was an exercise in intuition. Apparently someone at Georgia Tech wanted to compare results so they got a copy. The results, however, turned out to be completely different because they had inadvertently given the researchers a newer version of the code.
Provided someone's willing to pay for it, that could be a fun engineering job: implementing research papers as production code.
That's the bare minimum. If you don't know how to make code agnostic to file paths or dependencies, that's too bad, but fortunately a field practitioner picking your code up will know how to work around those issues. At least they're not starting from scratch on trying to rewrite your code.
And they shoot themselves in the foot constantly by not prioritizing non-crappy software.
They think it's normal that a new student needs months of close support before they can start doing anything interesting. They think it's normal to spend weeks trying to get bitrotted code running again, or to just give up and throw it out, losing months or years of hard-won incremental improvements.
I'm not even talking about supporting reproducibility by outsiders -- they can't even achieve reproducibility within their own labs, because they don't follow practices that are baseline in industry (version control, configuration management, standardized environments, etc).
True that. Four years ago, when I was writing my thesis in computational physics, I attended a research group meetup. One session gathered all the students and had them showcase their research. When there was still some time left at the end, I asked the audience who was using version control systems for the programs they write, and only 5% or so raised their hands. I then immediately ran them through a Git tutorial, and people were amazed by what is possible.
File paths and manual steps can be worked around; nobody said it should build and run painlessly forever after publication, on any future system, with no tweaking, and with the researcher indefinitely maintaining the program. If, however, nothing remotely close to a working state can be published, that's not a great sign.
As an extreme version of the idea, imagine if the actual paper itself (TeX) and all the data within it are also built as part of the repository; any graphs in the paper are rendered from data in the repo, any numbers are data accesses, etc. This probably wouldn't be helpful to researchers, but it would promote scientific reproducibility and aid everyone building on a researcher's work. Tremendous work goes into authoring the papers themselves, sometimes with methods or tricks that are private; laying it all out publicly would greatly help students of science.
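A minimal sketch of what one such build step might look like (hypothetical file names; matplotlib assumed as a dependency), run before the TeX is compiled:

    # Render a paper figure from data checked into the repo.
    import csv
    import matplotlib
    matplotlib.use("Agg")  # headless: no display in a build environment
    import matplotlib.pyplot as plt

    xs, ys = [], []
    with open("data/results.csv") as f:  # hypothetical data file in the repo
        for row in csv.DictReader(f):
            xs.append(float(row["n"]))
            ys.append(float(row["runtime_s"]))

    plt.plot(xs, ys, marker="o")
    plt.xlabel("input size n")
    plt.ylabel("runtime (s)")
    plt.savefig("figures/fig1.pdf")  # referenced by \includegraphics in the TeX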
Going even further: to avoid cherry picking of positive results, review boards expect experimental criteria to be published (at least privately to them) in advance, for research that involves capital-E experiments. Perhaps this includes analysis code at least in prototype form; like test driven development, the acceptance criteria are written first. When the paper is ready for review, the reviewers can compare the initial prototype analysis logic to the final form. Perhaps the board also expects all data and trials collected during experiments to be made available in the repository, whether positive or not. All collected data should be in the platform, in the most raw form it was originally recorded, as well as all steps of summary and analysis.
I wonder if a process and platform like this could contribute to the integrity and quality and reproducibility of scientific research. People funding research ought to ask for it, especially public funded research, and the whole repo is made open eventually if not initially.
Perhaps as part of the platform's value prop to researchers (on whom it is imposing probably more burdens than benefit, for sake of public benefit), the hosting is free and funded by a foundation, or steeply discounted. (OK, it won't pay for LHC scale data sets, but otherwise ...) So using it to host your data, code, and paper is free, at least up to a point. I would be interested to contribute time and resources toward building or supporting a platform like this.
In many ways, it already is: good research requires meticulous log keeping in order to reproduce results, and equal effort must be spent on maintaining references to other literature, or you risk missing a citation in a published paper.
My impression of the field was that there was a severe mismatch of skill sets. The set of people with the scientific background to carry out proper experiments, and the funding to do so, is very disjoint from the set of people who understand the field. That made a lot of the papers feel "off". Almost like reading text generated by a machine: individual sentences make perfect sense, but the whole doesn't seem to go in a relevant direction.
As someone who's done a fair bit of practical software engineering, seeing academics study software engineers feels like seeing a WW2 veteran trying to understand how youngsters use snapchat. It feels very awkward for the youngster, just as it does for the software engineer. Which I imagine is one reason why Mike is pissed off.
There is some irony that businesses are much more scientific in this particular subfield than academia, because business incentives require the results to be reproducible and meaningful, over a longer period of time.
I think openness has been a big contributor to the recent explosion in popularity and success of machine learning. When talking to academics about this, machine learning would be a great field to hold up as an example.
Since this is closely related to my current research, yes, ML research is kind of crappy at this right now, and can scarcely even be considered to be trying to actually explain why certain methods work. Every ML paper or thesis I read nowadays just seems to discard any notion of doing good theory in favor of beefing up their empirical evaluation section and throwing deep convnets at everything.
I'd drone on more, but that would be telling you what's in my research, and it's not done yet!
Although I can see global linearity being unlikely in most cases; why is local linearity unlikely?
I'm not saying the work is zero now, but maybe we can get there. If a researcher is developing on a platform where their repository is expressed as a container-like image, then they should be able to publish it for anyone to run exactly as-is. The container repo includes the data, the operating system, and any languages and libraries, with an init system that optionally builds the results.
I know researchers who used Subversion when it was on the rise, but then abandoned version control altogether when Git became the generally preferred option.
Of course verification on submission is also a great idea, but we can make it the next step.
Mozilla is a bit of a weird entity, half non-profit, half very much corporate, yet all code is free.
In fact, if they are, they're using a pretty ineffective path to get paid.
Merely publishing it at all would be helpful. That way other people can pick it up and modify it to work for them. That's still better than them flying blind and having to completely rewrite your code from scratch.
It's really infuriating to see researchers creating this false dichotomy between publishing production-quality software and publishing any code at all.
The folks who are asking for this are NOT asking for: (1) production-quality code, (2) portable code, (3) an open-source project based on the work (a one-way code dump is fine), (4) well-commented, high-quality code, or (5) any support using the code beyond what would normally be afforded fellow researchers in reproducing one's results.
We (me and anyone else who is asking) understand that there may be some circumstances in which not all data can be published due to privacy, or in which code depends on proprietary dependencies that can't be shared. That's fine. Document it if you can as a disclaimer, and people can work on getting access to those private resources if they need to.
Anyone funding research should expect this; and publicly funded research should be required to disclose all technical work product that's materially involved in published results.
We're building a way for CS researchers to run, share and compare experiments including their whole research environment.
We just applied to Apply HN and would appreciate your feedback:
Somewhat tangential, but do CS academics actually write papers in Word? During my grad school days I did not encounter a single paper 'typeset' in Word. Writing was usually done with LaTeX and a makefile in a git repo.
It is, however, very good for tracking changes across versions. Many academics are not familiar with git, diff, and so on, and it's nice to easily see historical edits in the document. For simple documents like abstracts, it's much easier to send a Word document than it is to send a tex file and assume that everyone in the consortium is going to be able to compile it (especially if you work with industry).
> although I think this is now possible in 365
I've had some good experiences with Authorea, a paid online collaborative scientific writing tool. I think that one runs on top of Pandoc+git too, but it doesn't have full Word import support yet, AFAIK.
Edit: 0 mentions for Authorea in this entire discussion, I guess these guys need to do a bit more advertising :)
Given that, it seems reasonable that you should make the modicum of effort to publish your code.
Also, don't construct a false dichotomy. Publishing your code (at all) does not mean you have to make it production-quality or provide support for it.
No. If you publish something that's incomplete or doesn't have all the right dependencies listed, etc, it's not really of any use. Writing up compiling instructions plus dependencies plus how to run it plus input files etc takes time and by the time you've got it to the state that someone else can run it, now it's "production-quality".
> Given that, it seems reasonable that you should make the modicum of effort to publish your code.
There's currently no incentive/requirement to publish code, it uses time and does not increment your publication counter. Find the incentive and you'll start seeing published code.
Again, you're creating a false dichotomy, and it's fundamentally wrong.
Even if you only published half your code—to the point where it doesn't even compile—that's still helpful.
In the status quo, I have to write your entire code from scratch.
If you published what you have, I would merely have to debug the issues and figure out what the dependencies are.
I don't understand why it's so hard to understand that some code is better than no code.
So published papers are held to a high standard -- filtered through editors and peer review -- while publishing code can be half-assed at best? I still disagree; if it's worth doing, it's worth doing right.
We're pushing a generation of kids without rich uncles to bail them out (myself being one of them; I start university in fall) towards massive, crippling debt. The places they're going aren't hurting for money. Is it that unreasonable to ask for open access?
You could pressure universities / departments to make reproducibility a requirement for graduation, but I don't see why they would follow along, because this is not going to help the university get more funding or publish more papers (prestige).
Now, if organizations and companies were willing to put money behind software reproducibility (maybe some sort of fellowships or million-dollar research grants to labs), then the incentives would be aligned.
Every publication involving data and code or analysis should publish them to a degree that makes validation possible in at least as detailed a way as portrayed in "Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff" (http://www.peri.umass.edu/236/hash/31e2ff374b6377b2ddec04dea...)
 Holehouse, A. S. & Naegle, K. M. Reproducible Analysis of Post-Translational Modifications in Proteomes-Application to Human Mutations. PLoS One 10, e0144692 (2015). (http://journals.plos.org/plosone/article?id=10.1371/journal....)
This doesn't address open access, but it does make sure the software is usable by a non-author to reproduce the results of the paper.
(tl;dr of my linked blog post: Apply a slightly lighter weight form of some industry engineering practices in CS research coding. I think it's feasible. It doesn't solve all of the problems, because as discussed elsewhere in this thread, some of them are incentive-related and I'm not going to claim to have answers to everything. :)
(a) Convince more research groups to do their research on GitHub by default -- ideally, in open repositories. They get good hosted SCM, the world gets a better chance of seeing their code.
(b) Create more incentives, like the USENIX Community Award, for research that puts out its code. I'd say that in the systems community, a pretty decent chunk of the papers at SOSP, OSDI, and NSDI have code releases (of varying degrees of usability) accompanying them, though that's not a scientific count.
Mozilla could throw $1k to help create community-award-style incentives in the conferences they're interested in. Win-win. You get engaged with the community, you create some incentive for people to do the right thing, and you can use it as an on-ramp to deeper engagement with the winning authors (i.e., you can try to bring them in for internships. :)
The reason academic research works is that it takes risks on potential failures, since it's only donated or grant money anyway. But academic institutions fetishize academic papers, which is the problem from the article.
We need to legitimize research outside of the academic institution. Pay people to do research on their own time, if they perform it in an open-source, reproducible fashion. Incentivize it based on the reproducibility factor, but avoid attaching a profit motive.
Look at what Xerox PARC and Bell Labs were able to do before the penny-pinching bean counters took over.
Otherwise, it just sounds like "we want all this risky work done, and we don't want to pay for it."
Then, in the future, others throw away the product (or observe it a bit) and make the next nice things (products for industry adoption, or further research) based mostly on the papers.
Until we get to a point where building/installing/administrating is not hours of bullshit, research (and free software) will suffer.
If you are interested, just take a look at the complexity of licensing/ownership of code written by a PhD student at a research university in the United States.
If you look at most of my open source code, I use AWS AMIs to share both data and OS + code; however, I can do that only for side projects. The main thesis projects are typically very high value, and the consequences of sharing them are far more complex to understand.
> The main thesis projects are typically very high value, and the consequences of sharing them are far more complex to understand.
Commercial value, the university is just more stringent on the licensing/ownership restrictions, or something else?
1. Commercial value.
2. Future grant applications (a competing group that doesn't share code will have a better chance of winning the grant).
3. Future of other students and collaborators in the group. If two PhD students write a paper, the junior student might wish to write extension papers without getting scooped.
And many more. Yet if a paper is important enough, independent researchers will often attempt replication; this now routinely happens in machine learning and vision due to the huge amount of interest. Also, in several cases replication is fundamentally impossible - consider, e.g., a machine learning paper that uses proprietary data from a hospital attached to the university.
If it's paid for by public dollars, then the code and data belong in the public domain eventually. I understand there are exceptions like hospital data affected by patient confidentiality - that's fine. However the code released by that researcher should be capable of reproducing their results with that data set plugged in (such as by someone else who has access to it).
As a taxpayer, my concern for publicly funded research is maximizing benefit to the public good. I understand your point about follow-on research, and I'm not saying that I'd expect the code and data to be made available immediately with publication, but that deserves to be the case some reasonable time afterward (like a year). I understand that researchers' incentives are not necessarily aligned toward making it public; I am saying that people who fund research (including taxpayers through the political process) should require and expect it. Keeping it private indefinitely is a degree of self-centeredness that does not strike an appropriate balance between benefit to the researcher and to the public in my opinion.
Further, funding arrangements themselves are very complex: a professor typically procures funding from the university, NSF, NIH, private companies, donors, etc. In such cases, if the NSF adopted a hard-line approach that any research touching its dollars must release code under, say, the GPL, it would make collaboration impossible. Finally, all requirements aside, one can always release intentionally poorly written code in the form of MATLAB .m and compiled MEX files. I have observed several such cases, where the code can demonstrate a concept but is intentionally crippled.
Finally, graduate students graduate, and they are paid for doing research, which means publishing and presenting papers at peer-reviewed conferences and journals. If what funding agencies really seek is ready-made software, they ought to fund and pay for it at the same level as software developers (as many companies do).
I didn't make the argument that the public owns it or has a right to ownership, though I suppose that some people might and so I can see why you would touch on that point.
I would describe my view as like this: public funding is subject to the political process, and voting by taxpayers (directly or indirectly through voting of politicians or their appointees). As a taxpayer, I prefer to make public domain publication a requirement of publicly funded research, and I think every taxpayer should too. I consider the goal of public funding of science to be the benefit of public good, and believe that public good will best be served by public domain publication of all data, code, and analysis methods. (Whew, there's a lot of "pub" and "public" in there!)
One might reach my position by working backwards from, "Why do we as taxpayers agree to fund science with government money?" It's certainly not to give researchers prestige or jobs! (Those may be necessary parts of achieving public good, but they're not the goal which is the public good, and if they're in tension with public good then the public good probably needs to win.)
I don't seek ready made software; not at all. I only seek adequate disclosure of data and analysis methods sufficient for others to easily verify it and build on it. See for example the attempt at replication in http://www.peri.umass.edu/236/hash/31e2ff374b6377b2ddec04dea...
> In such cases if NSF adopts a hard line approach that any research touching its dollars ought to release code under say GPL, it would make it impossible to collaborate.
I will need to think more about this issue. I might be willing to accept the downside as a taxpayer. I'm not sure I understand it well enough what the friction would be to collaboration at the moment. If you're referring to the GPL specifically, then yes I agree that's probably the wrong license - public domain would be more appropriate.
I would be OK if this was simply an electronic log of the data as well as all machine commands that have been run on it - something recorded automatically by the operating environment. I am truly not looking for "working production code". But that sequence of commands should be reproducible if someone "replays" them; a verifiable digital journal. Publishing an article that's difficult to reproduce feels like producing the least possible public good while still getting credit. Publishing an article that's fully and automatically reproducible, because it contains references to all of the data and code that yield the results as an executable virtual machine with source code, provides the maximum public good, and that's what I want science funded with public money (and ultimately all science) to work toward. (I realize that this is just like, my opinion man :)
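Something like this toy sketch is all I mean (hypothetical code, not any existing tool): record each command plus a hash of its outputs, so anyone can replay and verify the journal later:

    import hashlib, json, pathlib, subprocess

    JOURNAL = pathlib.Path("journal.jsonl")
    RESULTS = pathlib.Path("results")  # hypothetical output directory

    def run_logged(cmd):
        # Run an analysis command, then append it and hashes of the outputs.
        subprocess.run(cmd, shell=True, check=True)
        entry = {"cmd": cmd,
                 "outputs": {p.name: hashlib.sha256(p.read_bytes()).hexdigest()
                             for p in sorted(RESULTS.glob("*"))}}
        with JOURNAL.open("a") as f:
            f.write(json.dumps(entry) + "\n")

    def replay():
        # Re-run every recorded command and verify the outputs still match.
        for line in JOURNAL.read_text().splitlines():
            entry = json.loads(line)
            subprocess.run(entry["cmd"], shell=True, check=True)
            for name, digest in entry["outputs"].items():
                data = (RESULTS / name).read_bytes()
                assert hashlib.sha256(data).hexdigest() == digest, name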
Regarding the economics study you linked to, I am very familiar with it, having seen the interview with the graduate student on The Colbert Report. For non-CS fields the quality of code is so bad anyway that this is much more difficult. Further, several researchers rely on proprietary tools, which only makes the task harder.
In my opinion the correct way is not to have the NSF impose rules, but rather to have the venues that accept papers (conferences and journals) insist on the software being provided. However, this is easier said than done, since it's a competitive two-sided market.
Regarding actual licensing issues, I can assure you that GPL is second favorite license of choice favored by University IP departments, the first one being "All rights reserved with modification explicitly forbidden, except for reproduction of experiments."
And the answer to that is that the incentives are set up such that reproducibility is a waste of time. As a CS researcher, I want to be idealistic. But the field is competitive and I'm not sure how much idealism I can afford.
There's considerable effort to bring artifact evaluation into the academic mainstream (I'm actually helping out at OOPSLA this year), and I think this is a good way forward.
Thank you for yours, and others', work making artifact evaluation a priority. Concurrently, people (including myself) are trying to do something about that learning curve. Hopefully both efforts will massively succeed within a decade :).
Btw, I'm excited about NixOS. It sounds like you're involved. Thank you. I haven't used it, but I'm hoping to find the time soon.
I think the root of the problem is that the goals of the researchers are not aligned with the goals of Science. This isn't a criticism of the researchers, but of the "game" they are forced to play.
For example, the goal of Science is to move the ball (knowledge) down the field for the benefit of mankind. We don't reward researchers for doing that, at least not very well. We reward researchers for writing papers, full stop - not for making their research easy to reproduce or build on.
* Provide great scientific and matrix-manipulation libraries within the browser. WebAssembly isn't going to solve this. Why would academia rewrite everything?
* Provide an incentive to use the web for everything. A great one would be an easy-to-use and easy-to-debug toolset, an easy set of methods to get data in and out, and an editing environment that can be set up with one click. The closest is the iPython Notebook. And it takes work to get there.
Sharing should be the default, and easy. If it isn't, we have no right to complain.
Why not? This sounds like exactly the solution. It's no burden on the researcher - they don't have to alter their research methods to fit a new system, they just dump the code, and leave it up to other users to reimplement.
Building web tools to allow research sounds like precisely the wrong solution, at least in the short to medium term. Research funding doesn't go far enough as it is, you're not going to get researchers changing their processes entirely for no gain. What if they need custom hardware, or access to tools and libraries that haven't been implemented?
Sharing _is_ the default, and it's easy. That's why GitHub has exploded.
It all sounds so easy until you look at the actual constraints. Professors are usually smart and experienced and they have thought about this stuff a lot. If it was as easy as you thought, it wouldn't be a problem.
I publish all code as a matter of lab policy. I chose where to set up my lab partly so that I was able to do this. Not everyone has this luxury or makes this a priority.
If researchers have the legal right to publish their work, I can't see any reason why GitHub wouldn't be exactly the place they'd share it, rather than some custom online research system as proposed by the parent.
That said, I don't have any experience in CS research, it's not my field, so I may be wrong about that, do tell if so.
But again, the IP rules at my university allow this at my sole discretion, which is unusual.
Political influence from the top (the non-scientific top - mostly the funding sources) is the only thing that can improve this problem. The scientists don't really have any realistic way to do it themselves - at least not one that doesn't require sacrificing their scientific and personal goals by going significantly against the current incentive system, just to slightly improve openness in their fields.
The same goes about datasets - quite a few of the more interesting datasets from industry to do science on are available only under very harsh conditions. You can do very interesting and useful analysis that cannot be reproduced by anyone else, because they are unlikely to ever get access to that particular set of data.
An idealistic grad student working on a greenfield project could start and keep that project as open and reproducible as would be best - but most of them work on the basis of existing projects with lots of legacy, and there it's different.
Releasing code is a widespread practice in the programming languages and software engineering communities, and one that is getting stronger (see http://artifact-eval.org).
If you are a CS researcher, please fill out this survey of open source and data in computer science:
You won't need to put so much weight on describing some of your practices; you can just show how your solution is better.
Publishing in pseudocode made sense some 30 years ago.
Today, if someone is publishing in pseudocode, I personally think it's because they can't write a Hello World in C.
Oh and also let's stop the "research comes only from universities" idea
I recall the professors at our uni treating the undergrads with contempt, as if teaching them were a waste of their precious, lucrative research time.
The idea isn't to ask researchers to formalize what they make more than before, but to include fully reproducible details in the publication. A spreadsheet is totally fine because you can see how it works, reproduce the result, and tweak the inputs/methods to build on it.
Every publication involving data and technical analysis should publish them to a degree that makes validation possible in at least as detailed a way as portrayed in "Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff" (http://www.peri.umass.edu/236/hash/31e2ff374b6377b2ddec04dea...)
It seems like CS is a subset of math.
Machine Learning has a lot of theoretical results produced by academia, but the more practical techniques (decision trees, SVMs, neural networks, etc) also all came from academia. The engineers scaled the algorithms to run on bigger datasets, but the initial work is still driving those systems.
Graphics research has seen comparable contributions between industrial research labs and academia. Sure, a higher proportion of papers from academia don't end up practical, but the number of papers that are very practical makes that quite irrelevant. It's to be expected since you can't predict what works in advance, and industrial research labs just don't bother publishing negative results, not that they don't get any.
Many programming languages and compiler techniques came from papers.
I could go on but you see the point - it depends on the field.
These are just the areas I work in; it's very hard to make sweeping statements about CS as a whole field, since it is very diverse from sub-community to sub-community.