

We need a GitHub of Science - marciovm123
http://marciovm.com/i-want-a-github-of-science

======
mechanical_fish
I'm a former biophysics postdoc myself. Now I work for an open-source software
company.

This post strikes me as charmingly naive. You have to love this guy. And yet
any essay that discusses the incentive structure of science but doesn't use
the word "grant" until the last sentence is beating around the bush. Follow
the money, my friends.

The publications are a side issue. To the extent that your count of top-tier
publications matters when trying to get an academic job, it's because it's
correlated with your ability to bring in money. (Money comes from peer review
too, and what your peers want to read about is also what they want to fund.)
What the hiring committees really want is grants. Grant money pays for labs
and salaries. It pays for grad students and postdocs. And grant money
literally buys prestige: Big projects come from big grants, and big grants
require strong track records and a bunch of preliminary data, which in turn
comes from smaller grants, or from the shared equipment that your neighbor
bought with _her_ grants.

The fact that there aren't that many top-tier peer-reviewed journals is a side
effect of the limited number of top scientists, and the number of scientists
is limited by available resources, not by lack of knowledge or connections or
education. I could literally pick up the phone and reach a dozen Ivy-educated
postdocs who would be full-time scientists if they could afford it.

Why can you find so much great software on Github? There are lots of reasons,
but a fundamental one is: Moore's Law. Computer hardware has become so dirt
cheap that you can be a programmer in your spare time. You can literally be a
twelve-year-old kid with a $200 cast-off computer and yet do top-notch
software work. If computers cost millions of dollars each, like they did in
1963, we wouldn't have Github. We'd have the drawer of a desk on the ninth
floor of Tech Square. (After all, in the old days half the AI researchers in
the world lived within a few miles of that drawer, and the others were just a
phone call away.) That's how most advanced science works today: There's no
need for more publishing infrastructure for scientific technique, because the
available methods of getting the word out -- top journals, second-tier
journals, email, the phone, bumping into people in the hallway at conferences
-- scale well enough to meet the limited demand. Because just having the
recipe for your very own scanning multiphoton microscope doesn't do you much
good: You need a $150,000 laser, and a $200,000 microscope, and tens of
thousands of dollars in lenses and filters and dyes, and a couple of trained
optics experts to maintain the thing, and that's before you even have
something to photograph.

I wish there were a magical way to turn everyone's suburban basement into a
cancer research lab, the way Github has turned everyone's couch into a
potential CS research lab, but there's no magic bullet. A few technologies,
like DNA sequencing, are sufficiently generic, useful, and automatable to be
amenable to Moore's-Law-based solutions, so we probably will soon be able to
(e.g.) drop leaves into the hopper of a $1000 box and get a readout of the
tree's genetics. But something like cancer research is never going to be
cheap. To study cancer you must first have a creature that has cancer. Mice
are as cheap as those get, and mice _are not cheap_, especially if you know
what the word _mycoplasma_ means.

~~~
dougabug
Do you need exclusive access to your very own scanning multiphoton microscope?
Seems like laboratory virtualization would go hand in hand with open source
science.

~~~
mechanical_fish
You can share them. Many labs do. Indeed, most of these things are probably
shared in one way or another.

And you're correct: The trick to encouraging open source science is not to
focus on the social networking tech -- that will be ready when you need it --
but to first attack the problem of doing quality lab work on the cheap. That's
where the bottleneck is.

The biggest problem with shared facilities is the tragedy of the commons. In
engineering -- or machining or woodworking or cooking, for that matter -- you
quickly learn the importance of having your own tools. It only takes seconds
to ruin a good tool. It only takes seconds to contaminate your cell culture,
or your neighbor's cell culture, or an entire room full of your department's
laboratory mice.

And mailing your samples off to a distant "virtual" lab is fine if you're
studying disposable samples, or inorganic samples, or samples that have been
permanently fixed and preserved on a glass slide. But living cells ship poorly
even when you're allowed to ship them at all, and animals ship even _more_
poorly than that. So often you've got to live _next door_ to the equipment
you're trying to share, and that's still expensive.

~~~
dougabug
The cost of maintaining your own tools is quite high, as is the expertise
required to maintain them in working order. It seems to me that shared
equipment gets ruined through inexperience or lack of skill, because
extremely sensitive equipment is being handled by students and junior
scientists with little engineering background. The same argument could be
made for the value of having your own servers. The scientific equivalent of a
datacenter would necessarily entail staffing by highly qualified, experienced
personnel. Such expertise would undoubtedly be costly, but it would be
amortized over a large amount of equipment.

~~~
ylem
Staffed by who? There are actually some user facilities for materials
synthesis--the problem is that if someone is making someone else's material
instead of working on something that they're interested in, then I don't think
that you're going to get the same level of productivity out. Also, materials
synthesis takes time and means running through lots of dead ends. When I was
a grad student, a postdoc from a collaboration would drive in to our lab with
powders she had made, try to grow a crystal for a week (sleeping for maybe an
hour or two a night on a floor in our office), and then go back to her home
institution and come back in a few weeks. That just doesn't work: you don't
have the responsiveness to be able to figure out what dead ends you're
wandering down... Imagine if you were writing code and could only run it once
every few weeks. Now, imagine that it was thousands of lines of nontrivial
code and you were trying to debug it that way...

~~~
dougabug
Staffed by scientists and engineers, I imagine. You can get some pretty
sophisticated parts built by foundries. At one point, making a microchip was
prohibitively expensive. Nowadays, you create chips basically by coding them
in high-level synthesis languages. Spinning a chip does take weeks.
Obviously, when a process takes longer to carry out, with a high iteration
cost, careful methodology is called for. Simulation and error-checking software
becomes valuable. Heck, once upon a time computers themselves were massively
expensive, among the most expensive machines built. Computing wasn't born
cheap. Mass production and decades of technological advances made them so.
Machines for combinatorial science may someday be more effective than graduate
students.

------
guygurari
I'm a Ph.D. student working in theoretical high-energy physics. In this field
we don't rely on peer-reviewed journals. Instead, when a researcher wants to
"publish" a paper she uploads it to the arXiv, and the paper appears on the
site within a day or two. The arXiv is open in the sense that almost anyone
can publish there [1]. Researchers in the field catch up on new research by
scanning the arXiv daily for interesting papers. No one I know reads peer-
reviewed journals. I know that many papers are also published in journals, but
I believe this is a formality that has more to do with obtaining grants and
such than with actual communication within the community. As far as I know
there's no reason to publish in a journal before you become a professor.

The result is similar to the GitHub situation in many ways. Because there are
no barriers to publishing, everyone makes up their own mind about which papers
are interesting. If your work is relevant, others will build on it and cite
you. They will discuss it in their group meeting, and so on. A scientist's
reputation is then directly related to the quality of their work, as judged by
the community, with no artificial barriers. This means that a self-respecting
scientist would not publish a sub-par paper even though it's technically
possible to do so, because that would hurt her reputation.

So it seems to me that the situation in high-energy physics is close to ideal,
with respect to ease of publishing and the social aspect of reputation. Having
said that, there are certainly aspects of GitHub that I would love to see
adopted.

For instance, when several researchers are writing a paper, generally no
version control system is employed. Instead, at any point in time the draft is
"locked" by one of the collaborators, and only that person can change it.
Beyond the obvious inefficiency of this method, note that it is also difficult
to track what changes were made in each lock cycle. I use diff for this
purpose, but in my experience many scientists in the field aren't aware of
such tools. So something that could really help is a simple way to collaborate
on papers, just a basic source control system. Also, it must be possible to
work on the paper in private within the collaboration, and only publish the
end result.
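
As a rough illustration of the kind of comparison diff gives you between two lock cycles, here is a minimal sketch using Python's standard difflib module. It assumes the drafts are plain-text (e.g. LaTeX) sources; the function name draft_diff and the file names are hypothetical, chosen only for this example.

```python
import difflib


def draft_diff(old_text: str, new_text: str,
               old_name: str = "draft_v1.tex",
               new_name: str = "draft_v2.tex") -> str:
    """Return a unified diff between two versions of a paper draft.

    The file names are hypothetical labels used only in the diff header.
    """
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(old_lines, new_lines,
                             fromfile=old_name, tofile=new_name)
    )


# Two made-up versions of the same paragraph, before and after a lock cycle.
old = "We study a toy model.\nThe coupling is weak.\n"
new = "We study a toy model.\nThe coupling is strong.\n"

result = draft_diff(old, new)
print(result)
```

Lines removed in the newer draft are prefixed with "-" and added lines with "+", which is the same information a proper source control system would record for every change automatically.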

[1] The few barriers that exist are in place to keep out the crackpots, who
reduce the signal-to-noise ratio and in that sense resemble spammers.

~~~
chancho
"In this field we don't rely on peer-reviewed journals...papers are also
published in journals, but I believe this is a formality that has more to do
with obtaining grants and such than with actual communication within the
community."

So peer reviewed journals are only important for grants, but your community
doesn't rely on peer reviewed journals. Do you not rely on grants? Who funds
your work? Do you all work for free, in your spare time?

~~~
guygurari
I guess I wasn't clear enough. What I mean is that the communication within
the community, and the reputation of a researcher among her peers, does not
rely on publications in reviewed journals. These are the things that can be
compared with open-source development.

If researchers also need to publish their work in journals, write grant
proposals etc., how is it relevant to the idea of applying the GitHub model to
science? Of course raising money is part of the job for a professor, but
thanks to the arXiv it's decoupled from the actual research work. It's at a
point where I, as a Ph.D. student, have no reason to consider publishing in
reviewed journals. This is in contrast to my friends in optics or condensed
matter, for whom a publication in Nature or Science practically guarantees a
good postdoc position.

------
Groxx
> _GitHub's success is not just about openness, but also a prestige economy
> that rewards valuable content producers with credit and attention_

I don't think I can agree with that. GitHub's success, IMO, seems to be based
almost _entirely_ on its openness. It has turned contributing to open source
projects into a drop-dead easy task, and those projects would never be found
or contributed to if they weren't open. And they _keep making it easier_. I've
fixed a number of things from machines which don't have Git installed, simply
because GitHub has an on-site editor.

Imagine if GitHub were behind a paywall. Do you think it would still be the
success it is today? And, I may be weird, but I very rarely look at the names
associated with commit histories. The code should speak for itself.

The rest of it sounds about right: scientific publishing as a whole is
massively backwards compared to GitHub, if you're looking at it from an "Open"
perspective. But I think that a lot of that is that the _researchers_ tend to
be insular compared to the _implementers_ (businesses guarding their IP aside
- they're not really GitHub's target audience anyway). GitHub isn't used
exclusively for comp-sci researchers to post their findings with code, it's
more for people _doing_ things with ideas others have contributed to.

There are experiments on GitHub, absolutely. I have a few myself. But the main
thing that GitHub has done is to make _final products_ easy to find, modify,
and contribute to. I have significant doubts that it would fit a research
workflow smoothly, without becoming something else entirely.

~~~
Ruudjah
John Resig started a project to generate resumes from the contributions at
github. I regularly see posts on HN, reddit, and similar sites from users
saying: "I built this and that, check it out!".

Personally, one of my motivations (among many) is indeed prestige. Now I
can show off my nice work.

Sure, openness started it all: Linus shared his VCS, which in turn sparked
Github, which in turn prompted thousands of developers to share their code.
But openness really isn't the sole driver for people to share their code any
more; more incentives, of which prestige is an important one, drive the
popularity of github.

~~~
Groxx
> _"I built this and that, check it out!"_

But people do that with other open source sites as well. Does GitHub provide
this feature better than SourceForge or others? You still need to go to the
user's page, they don't advertise it anywhere else. Do people go to GitHub to
see information about person X, or project Y? And for the posts on social
sites, are they more often about _the creator_ or _what they created_?

~~~
Ruudjah
> And for the posts on social sites, are they more often about the creator or
> what they created?

You hit the nail right on the head. Github is about sharing code. The code is
the creation. Github's unique selling point is the web interface to show the
code: It's the best designed interface in the industry to show code in a
browser.

> Does GitHub provide this feature better than SourceForge or others?

Various software is available to show code in the browser, but none works as
well or is as polished as Github's. The creation (code) therefore can be
best shared on Github. As such, others willing to check out the creation are,
from the moment they arrive at github, mostly busy doing exactly that:
checking out the creation (code). Every time I visit repos that render the
code in HTML anywhere other than github, I get agitated by the annoying
interface. Example: some interfaces require you to click on a file, after
which the postback returns a page containing a list of revisions, and new
buttons to 'view' a revision. This causes me to wait for a postback, and
click, twice per file I want to view. Github instead instantly shows the last
revision.

All these small tweaks account for much better usability on Github compared to
other sites.

I guess scientists need the same approach that Github takes with
programmers. For programmers, it's about code, and for scientists, it's about
data and the conclusions derived from it. Instead of showing code, show a
paper with the possibility to drill down on data. The data being shared can
then be treated the same as code being in a source repo, so the usual git
stuff (branching, merging, pulling, pushing) can be applied on the paper+data.

------
melling
Perhaps not at the academic level, at least not initially, but drawing more
people into science by making it easier to ask questions and get answers
couldn't hurt. <http://area51.stackexchange.com/categories/7/science>

Someday there might be 1,000,000 well-defined science/math questions, along
with great answers.

~~~
ignifero
There are science sections in quora as well, but people tend to post too many
general/popsci/naive questions.

------
sunir
My goal for <http://bibdex.com> is to be this. I based the software on a wiki
(original name was Bibwiki). The idea was to build lit reviews on topics that
you could reuse and share with colleagues privately or publicly with the
world.

I realized after starting that scientific communication is more complex, or at
least it tries to be for various reasons. I could use help learning what
people want from such a system.

I am keen on feedback or insights to drive my development. Please, if you are
interested, you can reach me at sunir at bibdex com.

------
mariuskempe
I agree ([http://www.quora.com/What-online-tools-do-scientists-wish-
ex...](http://www.quora.com/What-online-tools-do-scientists-wish-existed-to-
facilitate-their-work/answer/Marius-Kempe)). Why don't we just start using
GitHub itself to do this and go from there? The pain points will suggest ways
that a real science-focused github could improve on GitHub itself.

~~~
Ruudjah
The problem is currently the GUI. There are no good GUIs to work with git.
Windows and Mac OS X have some GUI tools scratching the surface of what's
possible with git, but none come close in opening up the full possibilities of
git. Linux has a few very alpha, specific (like showing branches) GUI tools.

If we want non-programmers to use git, we need GUIs to instantly visualize
the state, commands and other possibilities. No non-programmer is going to
learn git using a CLI.

~~~
mcrider
Perhaps if it caught on enough. There are many people that had to learn LaTeX
who were not programmers, which is IMHO not a trivial feat.

Not that I'm disagreeing with you, but making git point-and-clickable doesn't
strike me as being very simple.

------
countersignaler
<http://science.io> was featured on HN recently. Not github, but at least a
place to discuss and sift research.

------
emilepetrone
I tried to start a science network a few years ago, knowble.net, and I know
this problem well. The main roadblock we faced was the "publish or perish"
mentality. Luckily this mindset seems to be shifting & the idea of a 'GitHub
of Science' is very powerful. Much more than a Science LinkedIn, which is what
Knowble was.

The main unanswered questions for this idea are 1) Funding & 2) Maintenance.
Knowble was a for-profit venture, but should have been a non-profit
organization. So where can you/someone get the funding to build & maintain the
site?

If you need a python hacker to help out - my email is emile.petrone (at)
gmail.com

~~~
nolite
Thought of starting a science network myself... mind if I email you? Want to
know more about what your roadblocks were.

------
mbreese
One issue I see is what branch of science are we talking about? Physics
largely seems to have this figured out via arXiv.org, but funding for
molecular / biology / medical research is heavily dependent upon publication
record. I'm not sure about Computer Science. But my point is that when people
say "Science" needs X or Y, they aren't all speaking the same language.

These comments are enough evidence of this. Some have already mentioned
arXiv.org, and others Science.io which seems to be specifically targeted at
CS. When you add medical research, the needs of these branches are _vastly_
different.

------
erikpukinskis
Good ideas, but I disagree that you need a Bill Gates to make it happen.

The way this will happen is a grad student hacker who is avoiding working on
her thesis will start coding it, and then create a kickstarter asking for
support to spend the summer working on it. If she's a credible engineer,
she'll get the support easily, and every subsequent kickstarter grant will
also be fulfilled and it'll get built.

If you build it (right) they will come.

~~~
apl
You may not _want_ to believe it, but this is not an engineering challenge.

------
thisrod
The programs on Github were written by amateurs. Professionals can do better -
compare Python, PHP and Gnuplot to Mozilla, Scheme, Haskell, Plan 9 and
Mathematica. But evidently people can keep their day jobs and still write good
programs.

Science is different. The amateurs are called cranks, and a small community of
professionals does the good stuff. (There are exceptions, but few.) The basic
issue is who will pay their living expenses, and buy the million dollar
machines that they work on.

These days, almost all research money is spent by governments. They spend most
of it rewarding people for publishing in prestigious journals. Scientists will
keep packaging their research that way until someone starts buying it in a
different package.

~~~
SoftwareMaven
Most of the interesting projects on Github were written by professionals and
are far better quality than the average "professional" day-job application
because it is a developer's passion and not constrained by business
development.

I do not believe open source solves all problems, but you dismiss the
incredible value and quality of so much that I find it difficult to take the
rest of the comment seriously.

It sounds more like "People with credentials (whether scientists or
professional programmers) are the only people who can produce quality work. I
have credentials. I'm part of the elite who can do quality work."

------
figital
We also need a GitHub of government / legislation.

------
serichsen
The most pressing problem of our modern information society is the abundance
of crap.

The service peer-review provides is the filtering of crap, so that not
everyone has to do that by himself. This makes science possible, as not
everyone can be a master of all trades.

Publication without review is called "journalism".

As a side note, I believe that Elsevier has acquired an extreme market
dominance in the scientific publishing sector and is abusing it in alarming
ways.

------
juretriglav
Yes we do. Not just a replica of it with different content though, but a
separate product tailored to the needs (and wishes) of science, sharing only
some of the core ideas of GitHub. Sometimes I wonder if I should welcome the
surfacing of ideas that have a large overlap with my own, or be anxious
knowing that my lead has possibly been somewhat reduced.

------
gsiener
I just met the brains behind Opani (<http://opani.com>) last night and they
are a huge step in this direction.

~~~
etree
Opani is actually a huge step in the right direction. I think Marcio's post is
fantastic. I would in fact add an additional idea to it.

In addition to the open prestige inherent in GitHub, there is also the fact
that one's work is vetted by a community. It becomes very difficult if not
impossible to publish crap and claim that it is quality. In science this is
not the case. The peer review system is supposed to protect us against that.
However, my understanding is that a surprising percentage of research in top
ten journals can't be reproduced either because key details about
implementation are missing or because it is actually not reproducible.

A GitHub for science could also meaningfully move the ball forward in making
science reproducible as it should be easy to wrap one's scripts in a
specification of an "environment" that can be readily set up, deployed, and
run. A lot of work would be required to develop corollaries for non-
computational scientific domains, but it would be a hugely valuable effort as
discussed in the general reproducible research community
(<http://reproducibleresearch.net/>).

------
VladRussian
arXiv.org?

------
diamondhead
I think we need a github of any kind of information.

------
ignifero
Do you think something like <http://pubcentral.net> could be useful in that
direction?

------
wicknicks
The biggest problem with CS academicians has been their misinterpretation of
computers. It's a very different field from traditional sciences like
Physics, Chemistry, etc. In traditional sciences, we study the world,
understand it and express those ideas formally. With computers, it's up to
one's imagination what one can do with them. We just get so lost in the depths
of formalism that we forget that hacking and exploration are what can break
boundaries and enable people to make computers do what they could not.

Funnily, academia harbours the most brilliant minds of CS, and barely produces
usable software. It's people who identify problems, and provide software/ideas,
who actually get things moving. Github/Blogosphere etc. allow such solutions to
emerge more efficiently by allowing a lot of people to look at such solutions.
In academia, a publication is taken as an end point for problem solving. There
are no incentives to build real software or real systems.

If computer science wants to make a difference, it must move away from its
publish or perish culture.

~~~
xyzzyz
Computer Science is not about building real software or real systems, as
opposed to Software Engineering. Your comment strikes me as missing the whole
point of CS. Maybe it is so because of the misleading English name -- as someone
said, Computer Science is not about computers, just like Astronomy is not
about telescopes -- but what kind of person would call Astronomy a Telescope
Science?

