
The Big Data Brain Drain: Why Science is in Trouble - plessthanpt05
http://jakevdp.github.io/blog/2013/10/26/big-data-brain-drain
======
hharrison
This is so true. I'm in a Ph.D. program and everyone around me is wasting so
much time by reinventing the wheel every time they need to code something. So
I spend my time making libraries to help them out, but then I get scolded
because that's time that's not going directly toward getting publications. And
few people use my code because they don't trust software as up to the
scientific standard unless (a) they spent thousands of dollars on it, a la
MATLAB, or (b) they wrote it themselves and, e.g., take a mean by manually
iterating over an array, "just to make sure" the mean is calculated correctly.
Ugh. It doesn't matter how many tests I can point them to. I can't wait to get
out of here and work somewhere where coding is appreciated, where I can
actually get paid, and where I have some choice as to which state I live in.

~~~
michaelt
How would one take a mean of n elements without visiting all n elements? Won't
the memory bandwidth and big-O complexity always be the same? Genuinely
curious.

~~~
CamperBob2
The language used in MATLAB and Octave is designed for vector processing to an
extent most developers haven't seen before. MATLAB doesn't mean "Math
Laboratory", it means "Matrix Laboratory." Operations on row and column
vectors are first-class language elements. You almost never have to manually
iterate over an array to compute its statistics -- you'd just say M = mean(A
[,dim]) where A is a standalone vector or a column vector of a matrix. In that
example, M itself is a vector, if A was a matrix.
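
A rough illustration of the point, sketched in Python with only the standard library rather than in MATLAB (the function names `column_means_loop` and `column_means_builtin` are invented for this example). Both versions still visit every element, so the big-O cost is the same; the difference is whether you hand-write the loop or delegate to a routine that is already tested:

```python
from statistics import fmean


def column_means_loop(matrix):
    """Column means via explicit element-by-element iteration."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    means = []
    for j in range(n_cols):
        total = 0.0
        for i in range(n_rows):
            total += matrix[i][j]
        means.append(total / n_rows)
    return means


def column_means_builtin(matrix):
    """Same result, delegating each column to the library's mean."""
    return [fmean(column) for column in zip(*matrix)]


A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]

print(column_means_loop(A))     # [2.5, 3.5, 4.5]
print(column_means_builtin(A))  # [2.5, 3.5, 4.5]
```

In MATLAB or Octave the second form collapses further to the single call `mean(A, 1)`.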

MATLAB syntax is ugly but the underlying principles are pretty cool. Well-
written code scales automatically on newer hardware, or at least it has the
potential to. That's not true in languages where higher-order vectors are
built from discrete scalars.

~~~
joe_the_user
The good stuff of Matlab must be balanced by its perverse, pathological and
obscene qualities.

The most vile aspect of Matlab is the faith every researcher has that
producing something in Matlab is enough when the reality is code coming from
Matlab will never escape, will never be as useful as napkin-style pseudocode
for the creation of any larger system.

------
brianberns
I'm a software developer working with big data, but I think the premise of
this article ("the ability to effectively process data is superseding other
more classical modes of research") is simply false.

The example problem domain ("automated language translation") is actually a
stellar counter-example to the claim. Has anyone actually tried to use Google
Translate for anything sophisticated? It's still truly horrible, by human
standards. The field needs more research and deeper conceptual understanding,
not less.

There may be some problems that can be solved by throwing
software/hardware/data at them, but I don't think this is a good paradigm for
the big unsolved problems in general.

~~~
btilly
_The example problem domain ("automated language translation") is actually a
stellar counter-example to the claim. Has anyone actually tried to use Google
Translate for anything sophisticated? It's still truly horrible, by human
standards. The field needs more research and deeper conceptual understanding,
not less._

Have you compared Google Translate with the previous attempts to do automated
translation based on conceptual understanding?

There is a reason why Google Translate would be claimed to be a success.

~~~
joe_the_user
Hmm,

Being the best of a bad lot isn't enough.

Perhaps it is only a human reflex to believe that some contemplation is needed
to solve problems that have resisted mounds of data being thrown on them. But
being not-coincidentally human, I happen to find it plausible.

------
analog31
It's a nice article, but it overlooks one fact: There needs to be a brain
drain out of academia, because academia can't absorb more than a fraction of
its own production of talent.

~~~
reeses
A more rigorous stack-rank-and-yank may be appropriate in this situation. :-)

A big part of the problem is that current researchers did have to manage their
programming, IT, etc. during undergrad, grad, and post-doc periods. They did a
lot of hacking with C, Perl, and Maple/MATLAB/etc.

As with many managers who were promoted because they were stellar engineers,
those skills fade but they continue to think that they are qualified to judge
the difficulty of the work. The fact that they have "done it"[1] before leads
to a logically incorrect assumption that it's not that difficult.

After all, they learned those things on their own while thinking about the
tough stuff.

[1] Except for, of course, in a predictable, repeatable, safe, auditable, etc.
way with fifty other users asking for high priority changes.

~~~
grishas
This is on the mark.

I know many academics who spent a lot of time throughout their school career
writing code to get their research done.

They looked at this as a necessary evil, and as such learned the bare
minimum to get by. They are smart people and were able to make the code help
them solve their research problem. But they are busy thinking about their
research so usually their algorithms are fairly simple + straightforward (lots
of nested loops and n^2 sort of things).

The main problem, in my experience, is that many of the research problems are
actually fairly simple (algorithmically) and most research departments have
access to fairly powerful computing facilities. Coupled, this means you can
brute force a lot of solutions-- there is no real push for understanding of
the algorithmic complexity.

As well, most academics are on a much MUCH longer timeline than your average
business or startup. Did your algorithm break 1 month into processing your
simulation? Fine, fix it and run it for another month. Or just take what you
had and publish it anyways.

Just as academics look down upon the technical side of things, we are just as
much to blame for idolizing the academics. Science (even in math and
engineering) is a lot more 'sloppy' than we like to imagine. There are oodles
of papers out there that are just downright incorrect-- and not on purpose!

(* My creds: I have participated in academia as a student, researcher and
software developer )

~~~
kskz
My experience in academic computer science has been the complete opposite.

In industry, what I've seen is that often engineers are scrambling to please
managers or customers, with work divided among multiple people, so the code is
usually poorly written and undocumented.

In academics, publications are of primary importance, so everything is
documented. The longer timescale means there's more time to refine code that's
designed for a single, focused problem. The limited scope of the programs used
means code quality isn't an issue most of the time.

Also, in theoretical computer science at least, the focus is entirely on
rigorous proofs and finding optimal algorithms. While in industry, it's more
"get practical things done quickly so we can sell it".

~~~
sixbrx
> While in industry, it's more "get practical things done quickly so we can
> sell it".

That's a pretty short sighted example of industry - I'm sure those examples
are out there, but I don't think they're common (or the companies long lived).

Most places I've worked know that they'll have to _maintain_ that code well
into the future.

Not really so in academia (publish and forget) - which is why it's rare to see
even basic measures taken for modularity and abstraction, e.g. the creation of
types to represent entities in the problem domain. I think I've seen that done
in Matlab, _once_.

------
joe_the_user
Honestly,

The situation is simpler.

Academia has become abjectly miserable and abusive in its practices. It no
longer offers good but low-paid jobs for smart non-conformists, it just offers
its special brand of misery based on some long-past promise of this.

Given this, only the mediocre stay (and compounding it, anyone who stays has
no reason to be better than mediocre). And that is a huge, huge loss to the
whole project of the development of human knowledge, something that has a long
history in Western society.

~~~
eli_gottlieb
With all respect towards your implied bitter experiences that led to your
leaving academia, it's not the mediocre who stay (though I've met mediocre
researchers). Mostly, it's the people who have a _very good chance_ at a
permanent job, which is a combination of the following factors:

* The incredibly workaholic

* The incredibly specialized into a well-funded field

* The incredibly skilled

* The incredibly good at the scientific method

* The incredibly savvy at the academic game

* The lucky bastards

* The even more incredible scientific polymaths

I'd say that any combination of two of the above factors will suffice to keep
someone on the academic track for a while longer. Note how rare those factors
actually are.

~~~
joe_the_user
_With all respect towards your implied bitter experiences that led to your
leaving academia..._

Hey, parent here. Love your "incredible" assumptiveness about my experiences.
"Respect" to you too, baby, dude.

Actually I've never been an academic myself, not even slightly. OK, I have had
friends and relatives who've been failed as well as very successful
academics so I have some idea of the culture but be that as it may.

Your post seems like a way to showcase the versatility of the adjective
"incredible", usefulness with new age cretinisms and your "implied success"
there, well, except I don't see any.

~~~
ChristianMarks
The assumption seems to be that intellectuals still have a home within the
university. The enthymeme in the air is that those unfortunates who were
unable to find suitable employment in the academy do not deserve to live the
life of the mind. But the university is no longer an institution “where
teachers and students can pursue unconstrained the life of the mind.” This
activity cannot be the exclusive domain of a tiny elite.

See [http://mg.co.za/article/2013-11-01-universities-head-for-
ext...](http://mg.co.za/article/2013-11-01-universities-head-for-extinction/)

------
ChristianMarks
Technical work, including indispensable scientific software development, tends
to be considered of low academic value in academia. This is an ingrained
attitude. I very recently left after having heard "oh, you're the technical
guy" once too often from other academics.

Here's an example. The Globus Online grid ftp service web page intended for
users adopts an overtly apologetic tone [1]. Users of this service are
promised freedom from "low-value IT considerations and
processes"--considerations and processes that the Globus Online team has
humbly sought to undertake on their behalf. I have to laugh at the claim that
there is "No need to involve your IT admin—all you need is Globus Online." The
message is that information technology is of low academic value--unless you
happen to have been one of the authors of publications that came out of the
Globus Online project. If not, your career is sidelined.

Software development, system administration, network administration and
desktop support have become somewhat specialized in the past 30 years, but in
the minds of some principal investigators and academic administrators, these
very different activities are conflated. An expert in numerical methods,
computational fluid dynamics and dynamic downscaling methods for climate
assessment models is a seasoned web developer with a portfolio, fluent in
jQuery, underscore, backbone, responsive websites with bootstrap, CSS3, HTML5,
PostgreSQL, PostGIS, the Google maps API, Cartodb visualizations, as well as
an Android developer conversant with the SPen library for the Galaxy Note
10.1. It's as much effort to stay current technically as it is to keep up in
the scientific literature.

There are faint signs of improvement. On January 14th, the NSF revised the
biosketch format by changing the _Publications_ section to _Products_ [2].
"This change makes clear that products may include, but are not limited to,
publications, data sets, software, patents, and copyrights." The previous
biosketch format was awkward for software developers, inventors and producers
of data sets.

Recently, a number of prominent computer scientists, and scientific software
developers affiliated with the Climate Code Foundation [3], published a
Science Code Manifesto [4]. The manifesto includes the recommendation that
"software contributions must be included in systems of scientific assessment,
credit, and recognition." Software developers in the digital humanities may
wish to add their names to the list of signatories.

Whether these developments reflect a broader understanding that software
developers ought to enjoy greater recognition and opportunity for advancement
in academia than they do currently remains to be seen. Greater career
advancement opportunity for software developers, inventors and data set
producers working in academia might do something to address the Ph.D.
overproduction problem.

But these developments were too little and too late for me. I left.

[1]
[https://www.globusonline.org/forusers/](https://www.globusonline.org/forusers/)

[2]
[http://www.nsf.gov/pubs/2013/nsf13004/nsf13004.jsp](http://www.nsf.gov/pubs/2013/nsf13004/nsf13004.jsp)

[3] [http://climatecode.org/](http://climatecode.org/)

[4] [http://sciencecodemanifesto.org](http://sciencecodemanifesto.org)

~~~
x0x0
yup. science gets the software they deserve: industry not only (ime) pays
somewhere between 2-3x as well, but companies like engineers and view it as
prestigious instead of labs where you are viewed as the help. Even if the
results coming out of the lab are completely dependent on very sophisticated
computer programs to produce.

------
denzil_correa
HN really amazes me sometimes. I posted the exact same article with the exact
same title 12 days ago [0] and didn't receive much traction. In fact, the only
comment on the submission read

    
    
        this should have landed on the front page...
    

Anyways, I wonder if there is any explanation of this phenomenon.

[0]
[https://news.ycombinator.com/item?id=6623501](https://news.ycombinator.com/item?id=6623501)

~~~
gwern
It's just random variation. Depending on the time of day, you need maybe 4 or
5 upvotes to make the front page and escape the /newest ghetto; but there's
not much traffic on /newest so it's very easy for articles to not escape one
time even if they escape the other time.

~~~
denzil_correa
It is fine really - I wonder if there was more than random to it. Anyways, now
I have attracted downvotes for my initial comment. Kind of surprised at that
too! :)

------
wrongc0ntinent
Nicely written. A couple of things, somewhat independent of the economic
issue: the "big data - dumb analysis" trend is bound to change, it's a matter
of time till volume is not an advantage anymore. And the fact that "academia"
is by no means a uniform discipline, some fields can only move ahead with the
gathering of more data, while other fields need new testable hypotheses more.

------
throwawayBio
Compensation is a huge issue. Unlike pure biologists, computational scientists
have plenty of job opportunities outside of academia and pharma.

As a staff programmer at a prestigious institution, I was making about $50k. I
left for a large company and a few short years on I'm at around $200k.

------
rubidium
"I have some serious doubts about whether the project will be able to attract
a sufficient pool of applicants for these positions."

Really?? Certainly, academia has its inefficiencies, but if there is an area
of academia that has an undersupply of PhD's, please correct me!

~~~
mbreese
There are very few people in academia that can actually write code or run
computational analyses with skill. Most are just trying to answer their very
specific question, and will take all kinds of shortcuts. And God help you if
you want to track back exactly what commands (or _gasp_ versions of programs)
they used.

That's where the undersupply is. I've seen some very low quality code from
PhDs, even CS PhDs. One thing that is missing from a lot of academic curricula
is software engineering techniques. Hell, I've had to fight in order to get
people to use git.

I'm more familiar with medical/biology research, where you have people that
understand the domain, but not necessarily how to properly code. And if they,
as a PI, need to manage a programmer (tech) / computational researcher, it can
be difficult.

~~~
analog31
Ah, but this isn't just a problem with coding. A mechanical or electrical
engineer would have been horrified by my vacuum system or electronic
circuitry. ;-) When I was in grad school, the mantra was: It only has to work
long enough and well enough to produce a result. My experiment was declared a
success when I got the MTBF up from a few seconds to a few minutes.

~~~
varelse
IMO the life sciences are full of a very special kind of stupid wherein
computational science is performed by people who could barely pass Calculus I
let alone Linear Algebra or Statistics. And this would be harmless except that
a lot of what they publish turns out to either be training set-based
prediction of the training set or horrifying misapplication of null hypothesis
significance testing.
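
To make "prediction of the training set" concrete, here is a minimal sketch in Python (standard library only; the toy 1-nearest-neighbour classifier and the names `nearest_label` and `accuracy` are assumptions made up for this illustration, not anyone's actual pipeline). The labels are random, so there is nothing to learn, yet scoring the model on its own training data looks perfect:

```python
import random


def nearest_label(train, x):
    """1-nearest-neighbour: return the label of the closest training point."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def accuracy(train, data):
    """Fraction of points in `data` whose label the model gets right."""
    hits = sum(nearest_label(train, x) == y for x, y in data)
    return hits / len(data)


def make_data(rng, n):
    """Features and labels are drawn independently: pure noise."""
    return [(rng.random(), rng.randint(0, 1)) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the run is repeatable
train = make_data(rng, 200)
held_out = make_data(rng, 200)

print("accuracy on training set:", accuracy(train, train))
print("accuracy on held-out set:", accuracy(train, held_out))
```

The training-set score comes out perfect because every point is its own nearest neighbour, while the held-out score should hover around chance.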

And all that would merely constitute black box ignorance except that often if
you try to help them make their work more reproducible or point out that even
32-bit floating point roundoff error can be initially indistinguishable from a
bug that makes the results useless, a lot of them become agitated and tune out
the possibility of any of this being relevant to their work.
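
A small sketch of the roundoff point, in Python with only the standard library (the helpers `to_float32`, `sum_float32` and `sum_float64` are hypothetical names for this example; 32-bit arithmetic is emulated by rounding every intermediate result through `struct`). Adding 0.1 a hundred thousand times should give 10000, and the 32-bit running total drifts far enough that it could easily be mistaken for a logic bug:

```python
import math
import struct


def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_float32(values):
    """Naive left-to-right sum with every intermediate rounded to 32 bits."""
    acc = 0.0
    for v in values:
        acc = to_float32(acc + to_float32(v))
    return acc


def sum_float64(values):
    """Naive left-to-right sum in ordinary 64-bit floats."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


values = [0.1] * 100_000  # the ideal answer is 10000

sum32 = sum_float32(values)
sum64 = sum_float64(values)
exact = math.fsum(values)  # correctly rounded sum of the 64-bit inputs

print("float32 naive sum:", sum32)
print("float64 naive sum:", sum64)
print("math.fsum        :", exact)
```

The 64-bit loop and `math.fsum` agree with 10000 to many digits; the 32-bit loop does not.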

I left academic science for industry over a decade ago because I caught a big
shot who then proceeded to threaten me instead of fix what I found.

------
kriro
My basic line of argument is fairly simple. If you work at a university and
your salary is paid via taxes you have an obligation to make your work
available to the taxpayers.

This means the software you write should be open sourced and the papers and
articles you write should be freely available.

I'll use the basic idea of this article as more fodder for that argument (it's
even more important now because...Big Data ZOMG) in the future :)

I don't fully agree with the thought that writing software and papers are
exclusive. With a little creativity you can get a paper or two out of most
academic software development projects (granted they won't usually be the
A-level "this advances my discipline" kind of papers but one can't only write
those anyways imo).

~~~
rimantas
If I work at intelligence agency and my work is paid by taxpayers, should the
results of it also be available to the public?

------
nn3
Re underpaid researchers:

The part that is always unclear to me is: there is lot of evidence that
research/university budgets are growing at a high rate. Where does all that
money go? It doesn't seem to go to the people who actually do science.

That seems to be the fundamental problem here.

~~~
dnautics
"overhead", i.e. administration. I told someone that I was a postdoc, and he
first estimated my salary at six figures. I told him to go lower. He said 75k.
When I told him 45k, his eyes bugged out and he said 'there's something wrong
here'.

I mean, I come from an excellent pedigree, too, top 10 undergraduate, top 10
grad school, and I currently work at a place where my boss is a nobel
laureate. (He makes at least 250k, according to the place's 990 forms)

However, my going price gets diluted by the idiots running around with rubber-
stamped PhDs. Maybe there needs to be a 'brain drain', if we can drain those
people selectively somehow.

~~~
lost_marbles
I'm really confused - why don't Ph.D.'s with excellent pedigrees just get paid
more money than "idiots running around with rubber-stamped PhDs"? Maybe the
market doesn't see such a big difference..

~~~
ananduri
I think you are right lost_marbles. The difference between a PhD with an
excellent pedigree and a rubber-stamped PhD should be their publication
records. The problem would still exist if these rubber-stamped PhDs weren't
there, I think. As a postdoc, you're going to get paid zilch, period.

------
marvin
This article has some incredible insights which have gone right past me.
Thanks for sharing. I'm currently studying for a Masters degree which is
centered on visualization techniques for exploring and analyzing large
datasets.

I have found myself thinking many times that a position in the industry, where
I can use my teaching, data processing and analysis skills to further some
business goal, seems like a much more preferable option than sitting around
writing research papers and applying for grants all day. Not to mention that
academia pays less and has worse overtime conditions than _any_ industry job I
could conceivably get.

This article really nails the key issues for why I am feeling this reluctance
towards an academic career.

~~~
draugadrotten
> visualization techniques for exploring and analyzing large datasets.

Would it be possible for you to treat the vast amounts written on the internet
regarding career choices as your "large dataset" and use your knowledge and
tools to explore and analyze this dataset, and visualize it to us?

~~~
marvin
Probably not. I only work with data that's more structured, like spreadsheet-
like data with very many records and dimensions.

------
calhoun137
I am an example of this trend, having dropped out of a physics PhD program to
pursue programming, and feeling great about the decision. There are two points
which are not made by the author and which I think deserve to be mentioned:

1.) Scientific research is _hard_. It is very frustrating to spend all day
working on something you can't be sure is going to even work. On the other
hand, programming is easy, it's mostly monkey work. Sure there are places
where you have to be clever and think things out carefully, but at the end of
the day, when you write code it just feels like you have so much more to show
than when you do research.

2.) In research, you need to get grant money to do anything, because research
is expensive. So much time is then spent writing grant proposals, and even
when you get money you still can't do everything you want because it's just
too expensive and/or time consuming. I'm not talking about LHC money here, but
just the standard money for a professor running a lab in a university.

In programming, you can work on (I'd estimate) 95+% of problems with nothing
more than your computer and some old fashioned hard work. There are so many
good, free to use open source libraries out there, making it pretty easy to
jump into whatever field is interesting to you. Best of all, when you finish a
project, there is no need to spend weeks writing a paper about it or any of
the nonsense associated with that, you can just publish your code.

------
001sky
_"Simple models and a lot of data trump more elaborate models based on less
data..."

If we make the leap and assume that this insight can be at least partially
extended to fields beyond natural language processing, what we can expect is a
situation in which domain knowledge is increasingly trumped by "mere" data-
mining skills._

This is a great point. For many years, domain knowledge was merely such
experience ex ante. In other words, the barrier to domain knowledge was access
to data. In lieu of this, perhaps theory to estimate it. As data becomes
larger, more free, and more amenable to analysis by a larger group of talent
the "domain knowledge" itself as a barrier to entry (prestige, effectiveness)
seemingly declines. While this is in theory good (more access, more analysis),
there is probably a corollary that we should expect turf-wars and restriction
to access, as those previously in positions of privilege fight to retain their
status as "keepers of the keys".

~~~
hgh
I've spent some time in the economics field and certainly found that getting
access to quality data was an enormous challenge. In fact a good amount of the
work is spent putting together data sets. This is probably the bulk of what
econ grad students do.

Besides this, it's important to recognize the difference between identifying
correlations and patterns in data vs understanding the mechanisms behind the
phenomena. Krugman makes a strong point: "The problem is that there is no
alternative to models. We all think in simplified models, all the time. The
sophisticated thing to do is not to pretend to stop, but to be self-conscious
-- to be aware that your models are maps rather than reality." [1]

Data-mining will help generate and support hypotheses, but this is
complementary to model building.

[1]
[http://web.mit.edu/krugman/www/dishpan.html](http://web.mit.edu/krugman/www/dishpan.html)

~~~
001sky
Econ is an interesting subject. Basically, it's taught using slide-ruler-y math
from the 1950's[1]. It's amazing that learning to build linear algebra models
without a slide-ruler (ie, by hand), combined with a year or two of statistics
and some calc, makes someone an expert in "Economics".

If you think about the huge increase in computer power available today, it
seems a field ripe for disruption. However, the guys that run the fed, and
guys like Krugman are basically from the "slide rule era" of Econ.

I'm sympathetic to the power of modeling and techniques, but I do think it's a
case of the more you know, the more you understand what you don't know. And I
think for most people the deeper you go into the field of Econ, the more this
becomes apparent. It's getting better, though, and I'm sure in another
generation (once the current tenures expire) things will look a good deal
different (hopefully better).

Not sure if your views are similar, though.

[1] You see little to no respect for dynamic systems that are chaotic or
non linear; bounded rationality and its antecedent effects; the role of
institutional underpinnings to markets, etc. Just to name a couple that are
glaringly relevant and empirically important, but not subject to "hand math".

~~~
truthteller
you should actually make an effort to learn and understand modern economics
before coming to such strong conclusions. there are good reasons why the
discipline exists in its current form.

it is clear from the "slide-ruler-y math" jibe that you have no idea about the
technical ability required to do research, especially in theoretical
microeconomics and econometrics. you'll need more than mastery of "Mathematics
for Engineering" and MATLAB to disrupt the economics profession.

------
otikik
> We always come across reports of how much talented Indians are and are
> conquering the world in the field of technology & business.

Um, I don't mean to be offensive, but I have never heard that.

------
nekopa
My biggest worry is that it's not science that is in trouble, it's humanity as
a whole that is in trouble.

The drain towards industry is to solve industry's problems: make a profit. And
yes, these can be extremely interesting problems, and hard ones too.

But what would our current world look like if the scientists from the last 500
years had bent their minds to solve the problems of merchants? (and don't get
me wrong, some of the problems merchants had evolved great solutions for
mankind)

------
robg
I'm an example of this trend. I trained for a decade in big data for
neuroimaging. Then I realized I was working long hours with little upside
beyond my own curiosity. I certainly wasn't changing the world in any tangible
way. I had originally gone into neuroscience to understand my family's history
of mental illness. Ten years later I was helping no one - not even myself.

I'm not convinced Big Science is in trouble though. Those who have the
motivation and talent to stay in the academy will continue to do so. Yes,
outstanding people will be lost but Science can progress from their
contributions to commercial efforts. Geoff Hinton ends up at Google but we'll
keep hearing from him the best use of his talents - the improvements to
products we all use every day at a massive scale.

------
michaelochurch
Scientists in actual science are seriously underpaid, but there's a case of
the Teacher-Executive Problem that's going to make it hard to fix that
([http://michaelochurch.wordpress.com/2013/11/03/software-
engi...](http://michaelochurch.wordpress.com/2013/11/03/software-engineer-
salaries-arent-inflated-at-least-not-for-the-99/)). The more useful you are,
the more need there is for many of you (individual "rock stars" are overrated
in the real world) and the greater the implicit multiplier on improving your
salary. Increasing scientific pay has a bigger cost load than increasing
executive pay because we actually need scientists (or teachers, in the
original formulation of TEP) in significant numbers.

Society has reached a point where the academic route means practically begging
for a job that won't even pay for a house, while the startup lottery offers,
at least, a chance. (And finance, better yet, offers a high likelihood of
being well-off.)

------
saraid216
I haven't read Wolfram's _A New Kind of Science_ , but something tells me that
this is a one page summary of it?

~~~
shoyer
Uh... no. A New Kind of Science is its own special kind of nonsense. For the
most part, Wolfram argues that cellular automata are a transformative paradigm
for science. It's not about the power of "big data" at all.

~~~
saraid216
Huh. All the discussion on HN that I saw on it seemed to revolve entirely
around big data. Nevermind, I guess.

