Let me see your papers, let me see your source (cflewis.posterous.com)
49 points by Lewisham on July 24, 2010 | 31 comments

People should be clear if they want to do something academic or commercial. If you want your code to be "secret sauce", then don't publish in academic journals, write a memo to investors or customers. If you want to do science, publish and be open and testable. Scientific publication serves to communicate and spread ideas, not to gain "reputation points" you can then exchange for credibility in the marketplace.

On a less polemic note: To address the reproducibility of computational experiments, Andrew Davison presented his Sumatra project at Euroscipy 2010 (http://www.euroscipy.org/talk/1960). I thought I'd throw this in here as the OP's gripes are a problem in all of modern science, not just computer science. It's also a reminder that making the code available is but one element of the problem of reproducibility/falsifiability in modern science (though probably the biggest one).

I agree with the feeling, but a lot of the time, scientific software is just not ready for real usage. My software at least is completely useless most of the time (you have to open it in the interpreter and type the right incantations, the data has to be massaged in a non-trivial way, filesystem paths are hardcoded everywhere, etc). It takes some effort to turn a works-for-me research tool into something releasable, and this effort might even hinder future research, since the fluidity of the hacky code is sometimes necessary for further tweaking. Also, there are usually dozens of dependencies, some of them patched to work with that software, with different incompatible licenses, etc.

I think, however, that most journals should at least ask for a shell script that downloads the code and data, runs the experiments and regenerates the graphs and tables as seen in the paper. This is not always practical (a lot of papers deal with over a terabyte of data, for example), but it is so more often than not. At least for my papers this is a nearly-attained goal.
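To make the idea concrete, here is a minimal sketch of such a one-command driver, written in Python rather than shell purely for readability. Everything in it is an assumption for illustration: the function names (`fetch_data`, `run_experiment`, `write_table`, `reproduce`), the file names, and the numbers are hypothetical placeholders. A real script would download the actual code and data in the first step instead of writing a stub file.

```python
import csv
import statistics
import tempfile
from pathlib import Path


def fetch_data(workdir):
    """Stand-in for the download step; a real script would fetch the paper's data here."""
    raw = workdir / "raw.csv"
    # Hypothetical placeholder measurements, not results from any paper.
    rows = [("baseline", 0.70), ("baseline", 0.74),
            ("proposed", 0.80), ("proposed", 0.84)]
    with raw.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["system", "accuracy"])
        writer.writerows(rows)
    return raw


def run_experiment(raw):
    """Stand-in for the experiment: mean accuracy per system."""
    scores = {}
    with raw.open(newline="") as f:
        for row in csv.DictReader(f):
            scores.setdefault(row["system"], []).append(float(row["accuracy"]))
    return {name: statistics.mean(vals) for name, vals in scores.items()}


def write_table(results, out):
    """Regenerate the results table as plain text and save it next to the data."""
    lines = [f"{'system':<10} {'accuracy':>8}"]
    for name in sorted(results):
        lines.append(f"{name:<10} {results[name]:>8.2f}")
    table = "\n".join(lines)
    out.write_text(table + "\n")
    return table


def reproduce(workdir):
    """Run every step in order, from raw data to the final table."""
    raw = fetch_data(workdir)
    results = run_experiment(raw)
    return write_table(results, workdir / "table1.txt")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(reproduce(Path(tmp)))
```

The point is only that each stage is a function the next stage calls, so nothing depends on what somebody typed into an interpreter by hand.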

(Disclaimer: Author of the OP)

I absolutely understand your feelings on hacky code. Every academic produces hacky code, there are precious few who don't. I myself, when I started, did not want to release my code for the same reason.

However, once I began to realize that we were all on the same boat of HMS Hacked Together, that feeling began to dissipate. My advisor calls it "research code", and it's fine, because as academics, we're all used to it!

That's why I usually just ask for source. I assume the build won't execute on my Mac, and that's OK. I'm not really interested in running the tool, but explicitly finding out how you solved the problem.

(Disclaimer: I work in your lab. ;-))

I've asked people for code a few times, but my experience after getting it is actually that I don't really ask for it anymore, because I've never found it to help me. What I really want in most of the cases is a clear enough English writeup, perhaps with pseudocode, so that I can understand how they solved their problems, and ideally reimplement it myself. At least, that's the case if it's at a scale where that's feasible to reimplement; if they built something absolutely gigantic then it might be another story, but then their megabytes of messy research code I can't grok aren't very useful to me either, and I have no real choice but to wait for the cleaned-up release.

In short, I think "can this be reimplemented by a third party from the published literature?" is a better test for reproducibility than .tar.gzs are. And there's certainly a ways to go on that front, not least because in areas where 6-to-8-page conference papers are the norm, even well-meaning authors can't include enough details, and most don't get around to writing the detail-laden tech report version. But I guess I find code mostly useless for that purpose; it might as well be an asm dump for all the good I usually get out of it.

I agree with your TLDR;, but as you say, we're in a culture of 6-to-8 pages. I actually quite like the 8 page limit for most papers, it forces authors to a brevity of expression that aids focus, but you're right that details are the first thing jettisoned.

If I'm going to propose a probably impossible sea change, taking the baby step of saying "just show me what you've already done" instead of "now write another 12-20 page set of documentation" is the more likely of the impossible two :) In a perfect world, we'd have both!

This is also very true.

And I think a part of the reason why this is worse with research code is that the meaty part of the code tends to be (at least in ML/NLP) a few equations from the paper, implemented in a hacky and convoluted way, and unless you're a world-class expert on keeping track of indexes and one-Greek-letter variable names, there's very little to get from the code that a well-written paper doesn't already give you. I make an exception for tuning parameters, constants, tweaks, etc, but these shouldn't matter much anyway.

I was just typing my comment while you posted yours. I absolutely agree with this. Especially the "fluidity of the hacky code" - the environment that makes cool research happen is often the opposite of clean software engineering.

One thing that'd make it particularly difficult for me is that most of my early-stage experimentation is done in image-based environments with REPLs, not by editing code (Lisp, R; the Smalltalk people are also big on it). So I don't even have code to send! Well, some parts usually are in code, but it won't run unless you load it into my image and do the right thing with it. I have working images / image dumps, session transcripts, notes about what I did (some of which may be stuff I did by hand for the first proof-of-concept stage), etc. I find it a much more fluid way to work than code in a text editor, personally.

That is a huge part of my problem, as well. Sometimes when showing the code to an advisor/colleague I have to bring up an interpreter and type over 10 different commands before something starts to work (and don't even get me started on the fact that half of my bookkeeping, logging, and plotting is done by made-at-the-time emacs macros).

It's been a few years since this was relevant for me, but my primary concern would be that in order to publish actual code I would either have to spend months cleaning it up, writing documentation etc., or spend the next year fielding support calls. Or both.

I agree with another poster here that having somebody else repeating the experiment with their own implementation is a better test for validity - if a second paper just copies the source code from the first and makes a few tweaks, mistakes could easily carry over.

But having the data available would be great.

Reminds me of my own experience doing research in mobile robotics. It basically followed a cycle described pretty well by the comic here: http://www.willowgarage.com/sites/default/files/blog/201004/...

From what I can recall of the six years I spent in academic research in the 90s the only thing that counted was publications - any effort into anything that wouldn't end in a publication was strongly discouraged.

Open-sourcing is not enough. Honest researchers should also publish all data sets for peers to validate results. When I was in grad school, it was a disappointing fact that very few academics in the machine learning field did this.

Exactly. Some data in my field (NLP) is available, however usually at high fees.

Coming back to the topic of source code. I think there are three additional reasons source code is often not published:

1. Some scientists (I am trying to avoid making overgeneralizations) are bad programmers. Sometimes just enough hacks are stacked to produce results, but the result is not something to be particularly proud of.

2. Rinse and repeat. It's often possible to get more out of a discovery by applying it to multiple datasets. If the source is published, others could be eating your lunch.

3. There is a contract that prevents publishing source code.

My PhD project is financed by the Dutch national science foundation. Fortunately, since software developed in my project adds to existing work under the LGPL (creating derivative works), my work is under the LGPL too. Copyleft can help you if (3) applies.

I try to follow the same strategy as the author: make software public on Github once a paper is accepted.

There are some well known public data sets used for this purpose, such as those in the UCI Machine Learning repository. Unfortunately, not everyone is using them. And even if they do, it is often impossible to reproduce the results as pre-processing of the data is not described well enough in the paper, or because the authors add random components (such as costs) to the data without describing the distributions properly.
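For the random-component part of that problem, one possible remedy is to draw the values from a seeded generator and save a small description of the distribution alongside the data. The following is a minimal sketch under stated assumptions: the helper name `add_random_costs`, the choice of a uniform distribution, and the manifest fields are all hypothetical, not taken from any particular paper.

```python
import json
import random


def add_random_costs(n_rows, seed=0, low=1.0, high=10.0):
    """Draw one cost per row from Uniform(low, high) and describe how it was done."""
    # A private generator, so other code cannot disturb the random stream.
    rng = random.Random(seed)
    costs = [rng.uniform(low, high) for _ in range(n_rows)]
    # Everything a reader needs to regenerate the same costs (hypothetical fields).
    manifest = {
        "distribution": "uniform",
        "low": low,
        "high": high,
        "seed": seed,
        "generator": "python random.Random",
    }
    return costs, manifest


if __name__ == "__main__":
    costs, manifest = add_random_costs(n_rows=5, seed=42)
    print(json.dumps(manifest, indent=2))
    print(costs)
```

Calling the function twice with the same seed returns the same list of costs, and the manifest can be published as a JSON file next to the paper so that the distribution is described exactly instead of in a passing sentence.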

Publishing scripts for the complete workflow starting with the raw data and printing the table with the results in the end would be the best. But I've seen academics working in a way that is completely orthogonal to this - copying & pasting data to Excel or Matlab (or even re-typing them) and doing the analysis by hand in the GUI... I don't have any doubts they would be able to learn how to write the script, but I'm very sure they would put up heavy resistance to doing so.

So true. I did some research on video retrieval a few years ago, where I wanted to make a user interface to see how we can benefit from all the cool segmenting and clustering techniques other people were researching. It turned out it was nearly impossible to get any data to build upon, I wasted hours and hours doing everything by hand. This severely impacts the ability to build upon the work of others.

But a lot of data cannot be opened for various reasons (privacy being a huge one) - see the Netflix Prize 2 cancellation in the ML field.

There is also the issue of preventing competitors (other researchers here) from getting a free ride on your work - getting the data and preparing it is a huge part of the researcher's work in some fields.

Academics in some fields build up entire careers worth of work that all builds upon their "special sauce" super-secret dataset or software, which they've built up over many years, and while there might be enough information in principle in their papers to reproduce the data or results, it would be a thesis level project to actually do so.

There's a story (unfortunately I forget where, perhaps someone else can jog my memory?) about a string theory PhD student who, after getting annoyed wasting months on a set of hundreds of straightforward but tedious and time consuming calculations, decided to just do it all once and for all and spend the time to put together a table of results for all of them. Of course, this helped his future work immensely.

When he went to his PhD advisor to ask what he thought about publishing that table, the guy looked at him like he was crazy. Again, I don't remember the quote, but it was along the lines of "What you have there will give you a 1000% speed advantage pushing out papers in this field compared to your peers - you'd have to be crazy to share that sort of competitive advantage with everyone else when you could keep it to yourself, this is your golden ticket!"

"There is also the issue of preventing competitors (other researchers here) from getting a free ride on your work - getting the data and preparing it is a huge part of the researcher's work in some fields."

Which brings to light very clearly the source of the problem - the ideal of academia is to advance the overall state of knowledge as fast as possible, but once you start using the word "competitor" in a serious way that actually has bearing on whether you publish something useful or not, that ideal has been perverted.

Obviously it's the "publish or perish" mindset that causes this, and I absolutely understand why people would be tempted to see their supposed colleagues as competitors instead of collaborators (in the general sense, when they're not actively collaborating on a paper); it's one of the main reasons I decided not to go into academia. In fact, I saw too much political bullshit spewing about even in the harder fields like physics and math. I have no idea how to solve any of this, but it's a serious breakdown in the system, and I suspect it (publish-or-perish, not just information hiding) hinders the long term progression of academic knowledge in some of these fields by a large amount, not least because it rewards herd-like behavior and punishes exploration. That's another can of worms for another day, though...

This "getting a free ride on your work" is also called "standing on the shoulders of giants".

(Disclaimer: I'm the author of the OP).

Yes, but the person you are replying to is right to note the competitive aspects of research. A lot of people might say "well, this project is on-going, and I don't want people scooping/stealing it from me." It's a sad thing, but most research labs are in a competitive relationship with other ones, and a citation is less useful in those oh-so-important tenure reviews than a publication. I wish it was more the "standing on the shoulders of giants"!

That's partly why I didn't call on academics themselves to release code, but for some sort of authoritative institution instead, to level the field for everyone. That should remove the competitive aspects (I understand "that should" is a very naïve outlook ;) )

Source code publication along with the paper is a great idea in theory, but there are several problems with it in practice.

The reason CS researchers do not usually publish their code has nothing to do with dishonesty - nobody is trying to hide their code because it does not really work, or anything like that. It's not even that people are worried about being scooped, though that sometimes happens.

The main problem is that any time spent on cleaning up the code, packaging examples, writing instructions, answering bug complaints, etc. is time not spent on things that matter in academia - doing research, presenting it, and teaching students.

It might help if some conferences required source code submissions - but people might just submit to different conferences instead. The only real solution would be if funding agencies like NSF required that any projects funded through them have to release source code. This makes sense from a taxpayer's point of view, and would make the extra work acceptable (since everyone would have to do it).

(Disclaimer: I am author of the OP)

The NSF is a great point.

One of the things that bugs me about this, that I didn't go into in the post for brevity's sake, is that a lot of research is funded by some government institution under the banner of public interest. If you are paid to create something, and then you lock it away for whatever reason, that's not in the public interest. Worse still, more money has to be spent for someone else to re-implement the exact same thing if they liked it!

I hadn't thought this out to the logical conclusion of having the funding body also ask for the code to be released, but I think it's a great idea.

Wouldn't things like Bayh-Dole have to be repealed before that could even be anywhere near remotely possible? At this university source-code transfers must be rubber-stamped by the university IPO office. For publishing we only have to tell them what's going on if we think it may be valuable. But actually letting university-owned things slip out is a university IPO matter. Unless you're a student, in which case code you write directly related to your degree is yours, the university owns the code. The work of postdocs, professors, associates, etc. is owned by the employer. The way most research grants work, you are an employee of the university, not of the NSF/NIH.

I honestly wouldn't know about the majority of these issues (and, secretly, I suspect most universities don't either).

Aside from legal issues, it seems to me the business proposition is the same as the open-source business proposition: you know the most about the system you've created, so you're in the best position to consult on it. If you want a startup, I think guys like Cloudera show that even if you give away what was traditionally thought of as the family jewels, you can still very effectively monetize. That's what the university should leverage.

Anyway, for most projects, you've already given the game away in the paper (or at least, should have done): the expensive thing was the idea. Reimplementation is cheap.

I'm surprised that as a researcher you have not had notification or training about Bayh-Dole (someone mentioned they are in your lab so I assume you're funded). Universities have to report that they have notified researchers about the act. Here, all grad students have to sign an acknowledgement form indicating that fact during orientation. Flouting the act has serious implications for the university's federal funding. It's government research privatization indoctrination 101 since the Reagan revolution.

It's funny because the public (like you) demands access to the research because they paid for it, but the politicians view you as an economic investment and demand that you monetize and produce returns (e.g. tax revenue, employment from startups). The public doesn't write the rules.

I think there are other motivations involved here. Much of what's called open source these days has commercial aspirations. Often a lot of source code is developed and then published when it's quite far along, after the sheer volume of it provides barriers to entry. It's not really open source in the sense of leveraging the advantages that collaborative development in a community brings.

In academic settings competitive aspects of research likely produce the same sorts of issues, and in compsci there are likely commercial interests involved too. Come up with a clever new NLP technique for semantic searching and the VCs come out of the woodwork.

However I agree with the author and commenters that both code and data ought to be available to all, mainly because that's the only way to make progress. Research is hard, very hard, and building on the half-baked ideas, good and bad, is the only way progress is made.

I see some of the comments here talk about the code being useless in terms of using it the same way the author of the paper has. Personally, for a lot of the papers I have read in NLP based stuff, I'd feel more comfortable just having some source to look through to give a better idea of how it was done. Some papers are quite abstract about their method and aren't really helpful past giving you basic ideas without some hard code to let you fully understand what they have done.

Contrasting it with maths, it would seem stupid for a maths paper to describe a new maths proof without putting it forward in mathematical notation in the paper.

I'm not sure I agree. In most NLP papers I've read the real meat of the paper is not in the code, but in a couple of equations that, with a lot of mostly-mechanical wrapping around, implementing, debugging, and plugging into well-understood parts should make sense and be a contribution in and of themselves.

And for these other things, it might be good to see code a couple of times, but mostly at first, and to get up to speed in an area.

I think a problem is that, if you force all papers to include source code, due to the fact that science builds upon itself, you'd find that the average paper length for an area should climb a bit (going back down when a paradigm shift happens, because then the tricks stop working, but climbing all over again), and most of it would just be repeats of what's already there. Comparing to maths, it's like asking every paper proving a theorem to prove their lemmas, even very basic ones: sure, it'd help a novice understand what's going on, but it'll hinder progress more often than it'd help it. There's a place for introductory writing and a place for stand-on-the-shoulders-of-giants writing, and papers are mostly of the latter sort.

For example, a couple of decades ago every paper that used a naive Bayes classifier would derive the equations, describe feature selection, weighting, etc; today, most just say "I use a naive Bayes classifier for this, that, and that" and move on. Likewise for SVMs---you don't want to see the full code for most papers that use that, since it's a mess of kernel caches and dual variables that mean nothing whatsoever to the problem at hand (but the algorithm won't work without it).

I guess it depends on the level you're at. It comes back to an article a while ago saying that to truly understand all the new research coming out you have to be actively contributing research of your own and interacting with others in your field.

You're right about the classifier not needing a mention now. My problem starting out on my honours thesis is that I would read something like that then have to go research the thing they have just mentioned in passing because their core audience knows all about it already. So I guess it has an aspect of knowing the best starting point for what you want to research as well.

There is a movement towards reproducibility and open code/data in academic publications. A friend of mine, Victoria Stodden, has dedicated her career to this. See http://www.stanford.edu/~vcs/Talks.html for her talks on the subject and http://blog.stodden.net for some of her writings.

I 100% agree, but think there is no chance of this happening. The incentives simply do not line up, and some sort of all-or-nothing transition is needed.

It's not (exactly) my community, but SIGMOD has been trying to do this since 2008 with the "repeatability/workability committee." See this:


There's also an interesting FAQ about the repeatability requirements specifically here:


... as well as various follow-ups from that work, if you search on Google Scholar.

The issues as I understand them are:

1. Research code is often of poor quality, usually thousands of lines of unchecked output by a single graduate student. There are probably bugs, some of which may change the results. In the absence of code, results are assumed to be correct. It takes substantially longer to write good code, and hurts researcher "output" to do so. As a result, good quality code is actively discouraged.

2. Releasing code and/or data usually makes it substantially easier for others to duplicate or catch up to your research program, which is seen as a disadvantage if you are still in an area. It may lead to more citations, but that probably isn't enough for the effort. Other researchers also have an incentive to try to find bugs in the code/data, which they may overstate as they try to get their own work accepted.

3. The code and/or data itself may be copyrighted or have unclear distribution terms, for example, if you are doing experiments on a web crawl.

4. Actual production quality code that does something useful can be used as the basis of a startup or other venture, especially if the researcher is the only one who has and understands it. Furthermore, research groups can make money licensing their code to outside companies if they do not release it openly.

5. Many (most?) industrial papers involve code or data that cannot be released. Ultimately, in highly competitive conferences, it is hard to balance "unverifiable" papers written by industry with academic papers. A blanket ban on papers without code or data would remove a huge number of industry contributions, but an optional requirement for code or data mostly continues the status quo. Many of the most interesting recent papers (e.g., MapReduce) might not have been published with a code/data requirement.
