
Open-sourcing is not enough. Honest researchers should also publish all data sets for peers to validate results. When I was in grad school, it was a disappointing fact that very few academics in the machine learning field did this.



Exactly. Some data in my field (NLP) is available, but usually only at high fees.

Coming back to the topic of source code. I think there are three additional reasons source code is often not published:

1. Some scientists (I am trying to avoid overgeneralizing) are bad programmers. Sometimes just enough hacks are stacked on top of each other to produce results, but the code is not something to be particularly proud of.

2. Rinse and repeat. It's often possible to get more out of a discovery by applying it to multiple datasets. If the source is published, others could be eating your lunch.

3. A contract forbids publishing the source code.

My PhD project is financed by the Dutch national science foundation. Fortunately, since the software developed in my project builds on existing work under the LGPL (creating derivative works), my work is under the LGPL too. Copyleft can help you if (3) applies.

I try to follow the same strategy as the author: make software public on GitHub once a paper is accepted.


There are some well known public data sets used for this purpose, such as those in the UCI Machine Learning repository. Unfortunately, not everyone is using them. And even when they are used, it is often impossible to reproduce the results, either because the pre-processing of the data is not described well enough in the paper, or because the authors add random components (such as costs) to the data without describing the distributions properly.

Publishing scripts for the complete workflow, starting from the raw data and ending with the printed results table, would be best. But I've seen academics working in a way that is completely orthogonal to this - copying & pasting data into Excel or Matlab (or even re-typing them) and doing the analysis by hand in the GUI... I don't doubt they could learn to write such a script, but I'm quite sure they would put up heavy resistance to doing so.
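For what it's worth, here's the kind of minimal end-to-end script I have in mind (a sketch in Python; the file name raw_measurements.csv, the "value" column, and the cleaning/statistics choices are all hypothetical placeholders). The only point is that every step from raw file to printed table is recorded in code rather than done by hand:

    import csv
    import statistics

    RAW_DATA = "raw_measurements.csv"   # hypothetical raw input file

    def load_raw(path):
        # Read the raw CSV exactly as it came from the instrument / annotators.
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def preprocess(rows):
        # Every cleaning decision lives here instead of in an Excel session.
        values = []
        for row in rows:
            if row["value"].strip() == "":   # documented choice: drop missing values
                continue
            values.append(float(row["value"]))
        return values

    def analyze(values):
        # Whatever statistics the paper reports, computed the same way on every run.
        return {"n": len(values),
                "mean": statistics.mean(values),
                "stdev": statistics.stdev(values)}

    def print_table(results):
        # The table that ends up in the paper, printed rather than assembled by hand.
        print("metric      value")
        for name, value in results.items():
            print(f"{name:<10}{value:>10.4f}" if isinstance(value, float)
                  else f"{name:<10}{value:>10}")

    if __name__ == "__main__":
        print_table(analyze(preprocess(load_raw(RAW_DATA))))

Running the script regenerates the table from scratch, which is exactly what a reviewer or a colleague trying to build on the work would want.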


So true. I did some research on video retrieval a few years ago, where I wanted to build a user interface to see how we could benefit from all the cool segmenting and clustering techniques other people were researching. It turned out to be nearly impossible to get any data to build upon, so I wasted hours and hours doing everything by hand. This severely impacts the ability to build upon the work of others.


But a lot of data cannot be opened up for various reasons (privacy being a huge one) - see the cancellation of the Netflix Prize 2 in the ML field.

There is also the issue of preventing competitors (other researchers, in this case) from getting a free ride on your work - getting and preparing data is a huge part of a researcher's work in some fields.


Academics in some fields build entire careers' worth of work on their "special sauce" super-secret dataset or software, accumulated over many years, and while their papers might in principle contain enough information to reproduce the data or results, actually doing so would be a thesis-level project.

There's a story (unfortunately I forget where, perhaps someone else can jog my memory?) about a string theory PhD student who, after getting annoyed at wasting months on a set of hundreds of straightforward but tedious and time-consuming calculations, decided to do it all once and for all and spend the time to put together a table of results for all of them. Of course, this helped his future work immensely.

When he went to his PhD advisor to ask what he thought about publishing that table, the guy looked at him like he was crazy. Again, I don't remember the quote, but it was along the lines of "What you have there will give you a 1000% speed advantage pushing out papers in this field compared to your peers - you'd have to be crazy to share that sort of competitive advantage with everyone else when you could keep it to yourself, this is your golden ticket!"

> There is also the issue of preventing competitors (other researchers, in this case) from getting a free ride on your work - getting and preparing data is a huge part of a researcher's work in some fields.

Which brings to light very clearly the source of the problem - the ideal of academia is to advance the overall state of knowledge as fast as possible, but once you start using the word "competitor" in a serious way that actually has bearing on whether you publish something useful or not, that ideal has been perverted.

Obviously it's the "publish or perish" mindset that causes this, and I absolutely understand why people would be tempted to see their supposed colleagues as competitors instead of collaborators (in the general sense, when they're not actively collaborating on a paper). It's one of the main reasons I decided not to go into academia, in fact: I saw too much political bullshit being spewed about even in the harder fields like physics and math. I have no idea how to solve any of this, but it's a serious breakdown in the system, and I suspect it (publish-or-perish, not just information hiding) hinders the long-term progression of academic knowledge in some of these fields by a large amount, not least because it rewards herd-like behavior and punishes exploration. That's another can of worms for another day, though...


This "getting a free ride on your work" is also called "standing on the shoulders of giants".


(Disclaimer: I'm the author of the OP).

Yes, but the person you are replying to is right to note the competitive aspects of research. A lot of people might say "well, this project is on-going, and I don't want people scooping/stealing it from me." It's a sad thing, but most research labs are in a competitive relationship with other ones, and a citation is less useful in those oh-so-important tenure reviews than a publication. I wish it were more "standing on the shoulders of giants"!

That's partly why I didn't call on academics themselves to release code, but on some sort of authoritative institution instead, to level the playing field for everyone. That should remove the competitive aspect (I understand "that should" is a very naïve outlook ;) )



