The Big Data Brain Drain: Why Science is in Trouble (jakevdp.github.io)
211 points by plessthanpt05 on Nov 9, 2013 | 94 comments



This is so true. I'm in a Ph.D. program and everyone around me is wasting so much time by reinventing the wheel every time they need to code something. So I spend my time making libraries to help them out, but then I get scolded because that's time that's not going directly toward getting publications. And few people use my code because they don't trust software as up to the scientific standard unless (a) they spent thousands of dollars on it, a la MATLAB, or (b) they wrote it themselves and, e.g., take a mean by manually iterating over an array, "just to make sure" the mean is calculated correctly. Ugh. It doesn't matter how many tests I can point them to. I can't wait to get out of here and work somewhere where coding is appreciated, where I can actually get paid, and where I have some choice as to which state I live in.


I hated the 'publish at all costs' attitude I felt while pursuing my PhD and during a post-doc project. IMHO that leads to the huge number of trash articles and conferences now plaguing academia.


Plus it tends to reward established networks of "friends" who assign each other as coauthors on papers rather than individuals doing the hard part of the work.


I am assuming that you are doing computer science, and in the current environment, focusing on the conceptual contribution and doing the minimal amount of engineering is solid advice.

I started in physics, and there someone could make a great career corroborating or disproving conceptual contributions. This is not a track in CS; it's practically career suicide.

From experience, most CS research cannot be trusted to be correct, and enabling people to build a career on replicating or corroborating studies would, in my opinion, be of great value. Even the research that is correct is often not fully implemented, so you not only have to implement the approach, you also have to discover how to realize it. That work is not publishable in CS, and it is a non-trivial amount of extremely risky work.


Nope, Psychology with a focus on complex systems, statistical physics, dynamical systems, that sort of thing. Everything from time series analyses that require hundreds of thousands of data points to plain old factorial ANOVAs.

Psychology is probably one of the worst sciences for the attitude described in the article. Being in the most "mathy" corner of the field doesn't really help.


Oh, man. Don't tell them about Kahan summation--they'll freak out and go rewrite everything.


I think I'm one of the "them"--now that I know about it, it seems like a pretty important thing to know. I can't help but wonder: how many other gems like this are out there that the "them" don't know about?
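For reference, the trick is a second accumulator that carries the low-order bits plain floating-point addition throws away. A minimal Python sketch (illustrative only; in practice math.fsum or a vetted library routine does this kind of thing for you):

    import math

    def kahan_sum(xs):
        """Compensated (Kahan) summation."""
        total = 0.0
        c = 0.0                   # running compensation for lost low-order bits
        for x in xs:
            y = x - c             # apply the stored correction to the next term
            t = total + y         # low-order bits of y may be lost here...
            c = (t - total) - y   # ...algebraically zero; captures exactly that loss
            total = t
        return total

    xs = [0.1] * 10_000_000
    print(sum(xs))          # naive: drifts (e.g. 999999.9998389754)
    print(kahan_sum(xs))    # compensated: 1000000.0
    print(math.fsum(xs))    # stdlib correctly-rounded reference: 1000000.0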


Just think, when you get paid your taxes will fund those people re-inventing wheels.


Hey, I have no problems with my taxes going toward science. The way it's done is far from perfect but the answer isn't to take away funding.


Far from perfect, okay. Blatant misuse of funds, not okay.


It's not really a blatant misuse of funds, though. My roommate is an intensely bright dude finishing up his math PhD, studying interactions between complex systems. He writes all his code in C, and he recompiles it every time he wants to change a variable (e.g. the input file, or the number of iterations).

He's been doing it this way for years because that's what he was taught. That's the level of software engineering acumen you'll get in academia. But it "works". I've offered to help him modify the code so it will accept command line arguments, and we're going to sit down and do that so he can run several instances in parallel and utilize all of those fancypants cores on the computer I loaned him, but... he didn't know you could do that. No one told him! How would he know where to start looking that up? How reasonable is it to expect him to grok all that, when he's deep in math-land?
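(The pattern itself is tiny. The fix in his case will be plain argc/argv in C; here is the same idea as a hypothetical Python sketch, with made-up parameter names:)

    import argparse

    # Parameters arrive from the command line instead of being baked
    # into the source, so there's no recompile between runs.
    parser = argparse.ArgumentParser(description="run one simulation")
    parser.add_argument("input_file")                      # e.g. data/run1.txt
    parser.add_argument("--iterations", type=int, default=1000)
    args = parser.parse_args()

    print(f"simulating {args.input_file} for {args.iterations} iterations")
    # ... the simulation itself goes here ...

(Running several instances in parallel is then one shell line away: launch one process per core, each with its own input file.)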

So it was blatant to me, a software developer of four years, that something was pretty wrong. But for him: he's about to finish his PhD. He's been published a couple of times. They're not practicing horribly inept software development; they're doing mathematics the best way they know how.


Yeah, libraries aren't a good way to start because there's not enough interest in using them.

There are opportunities to build standalone tools which blow away their predecessors by multiple orders of magnitude, though; after getting enough researchers to use one such tool, you might attract sustained curiosity from a few people wondering "how the hell did s/he do that?!" and organically grow a small library with a real user base. That's one of my own long term goals, anyway.


Well I'm self-taught so I have to start somewhere. I'm not sure I could put together a stand-alone tool and still complete my Ph.D. program. Anyways I've found most stand-alone tools just aren't flexible enough and I don't feel like making something I wouldn't use myself.


Fair enough, and definitely agree with not making something you wouldn't use. (The "most [existing] stand-alone tools aren't flexible enough" problem is, however, one of the reasons why there's so much room to do better...)


True that! Okay, you've convinced me to make it a long-term goal.


How would one take a mean of n elements without visiting all n elements? Won't the memory bandwidth and big-O complexity always be the same? Genuinely curious.


The language used in MATLAB and Octave is designed for vector processing to an extent most developers haven't seen before. MATLAB doesn't mean "Math Laboratory"; it means "Matrix Laboratory." Operations on row and column vectors are first-class language elements. You almost never have to manually iterate over an array to compute its statistics -- you'd just say M = mean(A [,dim]) where A is a vector or a matrix. If A is a matrix, M is itself a vector of per-column means.

MATLAB syntax is ugly but the underlying principles are pretty cool. Well-written code scales automatically on newer hardware, or at least it has the potential to. That's not true in languages where higher-order vectors are built from discrete scalars.


The good stuff of Matlab must be balanced against its perverse, pathological and obscene qualities.

The most vile aspect of Matlab is the faith every researcher has that producing something in Matlab is enough, when the reality is that code coming from Matlab will never escape; it will never be as useful as napkin-style pseudocode for the creation of any larger system.


In MATLAB, R, or numpy, it's the difference between `mean(n)` and manually looping. It's not an issue of algorithmic efficiency; it's an issue of lost productivity: they don't even write a function to reuse (all they understand is scripting), so they recode the loop every single time they have to sum or take the mean of something.
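Concretely, the contrast looks something like this (a numpy sketch; the same point holds in MATLAB or R):

    import numpy as np

    data = np.random.rand(1000)

    # Rewritten from scratch in every script, with a fresh chance
    # at a fencepost error each time:
    total = 0.0
    for i in range(len(data)):
        total += data[i]
    mean_by_hand = total / len(data)

    # The tested, vectorized, reusable alternative:
    mean_by_library = np.mean(data)

    assert abs(mean_by_hand - mean_by_library) < 1e-9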


Well, it is in part: NumPy and friends do all the heavy lifting in hand-tuned C. dis() your Python function for taking a mean and see the difference; it's huge.
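Both points are easy to check yourself; a rough sketch (exact numbers vary by machine):

    import dis
    import timeit
    import numpy as np

    def py_mean(xs):
        total = 0.0
        for x in xs:          # every iteration runs interpreted bytecode
            total += x
        return total / len(xs)

    dis.dis(py_mean)          # many bytecode ops executed per element

    data = np.random.rand(1_000_000)
    print(timeit.timeit(lambda: py_mean(data), number=10))   # pure-Python loop
    print(timeit.timeit(lambda: data.mean(), number=10))     # one call into C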


The point is not computational time; the point is that one could simply call an existing library function rather than hand-coding the loop oneself and risking making an error (a fencepost error, for example).


I can understand that you'd want to manually check what's happening. For example, when taking the mean over the rows of a 2D array with numpy's mean function, you might not be sure whether axis=0 or axis=1 refers to the rows.

But you'd only have to figure it out once and then learn to trust numpy, instead of rolling your own version every time.
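And that one-time sanity check is only a couple of lines, e.g.:

    import numpy as np

    a = np.array([[1., 2., 3.],
                  [4., 5., 6.]])    # 2 rows, 3 columns

    print(a.mean(axis=0))   # [2.5 3.5 4.5] -> collapses the rows: per-column means
    print(a.mean(axis=1))   # [2. 5.]       -> collapses the columns: per-row means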


You missed these key words: "manually iterating"

So looping in a high-level language rather than using vectorized functions.


It's probably more in reference to the layer the work is completed in. I haven't used MATLAB in years, but you can probably sum an array by iterating, or you can call a faster, more efficient library routine. You get much greater gains when doing this in higher dimensions. If you can do your operations at the matrix level, you get an order-of-magnitude improvement in speed in most languages.


I think the concern is over the manual component of it, especially if the set of n is big by human standards. (Say, double-checking a few hundred entries of some column by calculator.)


I'm a software developer working with big data, but I think the premise of this article ("the ability to effectively process data is superseding other more classical modes of research") is simply false.

The example problem domain ("automated language translation") is actually a stellar counter-example to the claim. Has anyone actually tried to use Google Translate for anything sophisticated? It's still truly horrible, by human standards. The field needs more research and deeper conceptual understanding, not less.

There may be some problems that can be solved by throwing software/hardware/data at them, but I don't think this is a good paradigm for the big unsolved problems in general.


> The example problem domain ("automated language translation") is actually a stellar counter-example to the claim. Has anyone actually tried to use Google Translate for anything sophisticated? It's still truly horrible, by human standards. The field needs more research and deeper conceptual understanding, not less.

Have you compared Google Translate with the previous attempts to do automated translation based on conceptual understanding?

There is a reason why Google Translate is held up as a success.


Hmm,

Being the best of a bad lot isn't enough.

Perhaps it is only a human reflex to believe that some contemplation is needed to solve problems that have resisted mounds of data being thrown at them. But being not-coincidentally human, I happen to find it plausible.


Google Translate is by no means a counter-example. Certainly, it's flawed and imperfect. But Google Translate is the state of the art. You will not find any existing purely automated system that does the task better. Sure, some hypothetical, possibly AI-complete system with rich language understanding would do better. Good luck building that, especially without processing a huge corpus.


Yes you will. You just don't have free access to it online.


Why didn't you name it then?


What about "deep learning" combined with lots of data?

Teams using the technique have won some Kaggle competitions while doing little feature engineering (which is usually the part where you put in domain knowledge).

And it has shown better results than existing systems on big problems like voice recognition and image recognition (the famous Google cat experiment).


Check out this essay: http://cacm.acm.org/magazines/2010/9/98038-science-has-only-...

"I believe that science still has only two legs—theory and experimentation. The "four legs" viewpoint seems to imply the scientific method has changed in a fundamental way. I contend it is not the scientific method that has changed, but rather how it is being carried out."


What about Google Search? It's all a black box these days. The only way this changes is if the data is opened up.

Giving a few hundred nerds in an ivory tower access to all that hardware and data just makes the black box blacker.


It's a nice article, but it overlooks one fact: There needs to be a brain drain out of academia, because academia can't absorb more than a fraction of its own production of talent.


A more rigorous stack-rank-and-yank may be appropriate in this situation. :-)

A big part of the problem is that current researchers did have to manage their programming, IT, etc. during undergrad, grad, and post-doc periods. They did a lot of hacking with C, Perl, and Maple/MATLAB/etc.

As with many managers who were promoted because they were stellar engineers, those skills fade but they continue to think that they are qualified to judge the difficulty of the work. The fact that they have "done it"[1] before leads to a logically incorrect assumption that it's not that difficult.

After all, they learned those things on their own while thinking about the tough stuff.

[1] Except for, of course, in a predictable, repeatable, safe, auditable, etc. way with fifty other users asking for high priority changes.


This is on the mark.

I know many academics who spent a lot of time throughout their school career writing code to get their research done.

They looked at this as a necessary evil, and as such learned the bare minimum to get by. They are smart people and were able to make the code help them solve their research problem. But they are busy thinking about their research, so usually their algorithms are fairly simple + straightforward (lots of nested loops and n^2 sort of things).

The main problem, in my experience, is that many of the research problems are actually fairly simple (algorithmically) and most research departments have access to fairly powerful computing facilities. Together, this means you can brute-force a lot of solutions-- there is no real push for understanding the algorithmic complexity.

As well, most academics are on a much MUCH longer timeline than your average business or startup. Did your algorithm break 1 month into processing your simulation? Fine, fix it and run it for another month. Or just take what you had and publish it anyways.

Just as academics look down upon the technical side of things, we are just as much to blame for idolizing the academics. Science (even in math and engineering) is a lot more 'sloppy' than we like to imagine. There are oodles of papers out there that are just downright incorrect-- and not on purpose!

(My creds: I have participated in academia as a student, a researcher, and a software developer.)


All true, but the issue isn't really that they could improve algorithmic complexity with more technical skills, it's that they could improve their overall productivity.

There's a huge resistance to using source control, so lots of time gets spent searching through deep folder structures and finding just which 'file1_v3 (4).doc' is the right one. Data gets lost due to simple mistakes.

They spend 20 minutes coding the Runge-Kutta algorithm every time they need to run a numerical simulation, without realizing or caring that (a) they could spend 25 minutes to create a function they could reuse, or (b) the function already exists.

In short, the issue isn't computing time, it's researcher time. But the idea of spending some time now to save time later is so foreign because of the focus on getting the publication out as quickly as possible.
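To make the Runge-Kutta point concrete: the pre-existing function is a one-line call, e.g. with SciPy (a sketch using a toy ODE; the right solver and tolerances depend on the actual problem):

    import numpy as np
    from scipy.integrate import odeint

    # Toy system: a damped oscillator y'' + 0.5 y' + y = 0,
    # rewritten as the first-order system [y, y'].
    def rhs(y, t):
        return [y[1], -0.5 * y[1] - y[0]]

    t = np.linspace(0.0, 20.0, 500)
    sol = odeint(rhs, [1.0, 0.0], t)   # adaptive LSODA integrator under the hood
    print(sol[-1, 0])                  # position at t = 20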


My experience in academic computer science has been the complete opposite.

In industry, what I've seen is that often engineers are scrambling to please managers or customers, with work divided among multiple people, so the code is usually poorly written and undocumented.

In academics, publications are of primary importance, so everything is documented. The longer timescale means there's more time to refine code that's designed for a single, focused problem. The limited scope of the programs used means code quality isn't an issue most of the time.

Also, in theoretical computer science at least, the focus is entirely on rigorous proofs and finding optimal algorithms. While in industry, it's more "get practical things done quickly so we can sell it".


> While in industry, it's more "get practical things done quickly so we can sell it".

That's a pretty short-sighted picture of industry - I'm sure those examples are out there, but I don't think they're common (or the companies long-lived).

Most places I've worked know that they'll have to maintain that code well into the future.

Not really so in academia (publish and forget) - which is why it's rare to see even basic measures taken for modularity and abstraction, e.g. the creation of types to represent entities in the problem domain. I think I've seen that done in Matlab, once.


Honestly,

The situation is simpler.

Academia has become abjectly miserable and abusive in its practices. It no longer offers good but low-paid jobs for smart non-conformists; it just offers its special brand of misery based on some long-past promise of this.

Given this, only the mediocre stay (and compounding it, anyone who stays has no reason to be better than mediocre). And that is a huge, huge loss to the whole project of the development of human knowledge, something that has a long history in Western society.


Something like this seems to get posted every time academia comes up. I'm sorry for the people having shitty experiences and being forced out of academia, but it's important not to exaggerate the problems. I've been through two UK universities, and I'm now at a US university, and I still meet plenty of very smart people doing interesting things.

Yes, there are shitty bits in academia, like the intense competition. But we should focus on where the problems are and how to fix them, not just write the whole enterprise off as shit.


With all respect towards your implied bitter experiences that led to your leaving academia, it's not the mediocre who stay (though I've met mediocre researchers). Mostly, it's the people who have a very good chance at a permanent job, which is a combination of the following factors:

* The incredibly workaholic

* The incredibly specialized into a well-funded field

* The incredibly skilled

* The incredibly good at the scientific method

* The incredibly savvy at the academic game

* The lucky bastards

* The even more incredible scientific polymaths

I'd say that any combination of two of the above factors will suffice to keep someone on the academic track for a while longer. Note how rare those factors actually are.


> With all respect towards your implied bitter experiences that led to your leaving academia...

Hey, parent here. Love your "incredible" assumptiveness about my experiences. "Respect" to you too, baby, dude.

Actually, I've never been an academic myself, not even slightly. OK, I have had friends and relatives who've been failed as well as very successful academics, so I have some idea of the culture, but be that as it may.

Your post seems mostly like a way to showcase the versatility of the adjective "incredible", its usefulness alongside new-age cretinisms, and your "implied success"; except, well, I don't see any.


The assumption seems to be that intellectuals still have a home within the university. The enthymeme in the air is that those unfortunates who were unable to find suitable employment in the academy do not deserve to live the life of the mind. But the university is no longer an institution “where teachers and students can pursue unconstrained the life of the mind.” This activity cannot be the exclusive domain of a tiny elite.

See http://mg.co.za/article/2013-11-01-universities-head-for-ext...


I think that "The incredibly specialized into a well-funded field" is not optional. The second one is.


Technical work, including indispensable scientific software development, tends to be considered of low value in academia. This is an ingrained attitude. I very recently left after having heard "oh, you're the technical guy" once too often from other academics.

Here's an example. The web page for the Globus Online GridFTP service, intended for users, adopts an overtly apologetic tone [1]. Users of this service are promised freedom from "low-value IT considerations and processes"--considerations and processes that the Globus Online team has humbly sought to undertake on their behalf. I have to laugh at the claim that there is "No need to involve your IT admin—all you need is Globus Online." The message is that information technology is of low academic value--unless you happen to have been one of the authors of publications that came out of the Globus Online project. If not, your career is sidelined.

Software development, system administration, network administration and desktop support have become somewhat specialized in the past 30 years, but in the minds of some principal investigators and academic administrators, these very different activities are conflated. An expert in numerical methods, computational fluid dynamics and dynamic downscaling methods for climate assessment models is a seasoned web developer with a portfolio, fluent in jQuery, underscore, backbone, responsive websites with bootstrap, CSS3, HTML5, PostgreSQL, PostGIS, the Google maps API, Cartodb visualizations, as well as an Android developer conversant with the SPen library for the Galaxy Note 10.1. It's as much effort to stay current technically as it is to keep up in the scientific literature.

There are faint signs of improvement. On January 14th, the NSF revised the biosketch format by changing the Publications section to Products [2]. "This change makes clear that products may include, but are not limited to, publications, data sets, software, patents, and copyrights." The previous biosketch format was awkward for software developers, inventors and producers of data sets. 

Recently, a number of prominent computer scientists, and scientific software developers affiliated with the Climate Code Foundation [3], published a Science Code Manifesto [4]. The manifesto includes the recommendation that "software contributions must be included in systems of scientific assessment, credit, and recognition." Software developers in the digital humanities may wish to add their names to the list of signatories.

Whether these developments reflect a broader understanding that software developers ought to enjoy greater recognition and opportunity for advancement in academia than they do currently remains to be seen. Greater career advancement opportunity for software developers, inventors and data set producers working in academia might do something to address the Ph.D. overproduction problem.

But these developments were too little and too late for me. I left.

[1] https://www.globusonline.org/forusers/

[2] http://www.nsf.gov/pubs/2013/nsf13004/nsf13004.jsp

[3] http://climatecode.org/

[4] http://sciencecodemanifesto.org


Yup. Science gets the software it deserves: industry not only (IME) pays somewhere between 2-3x as well, but companies like engineers and view the role as prestigious, instead of labs where you are viewed as the help, even when the results coming out of the lab are completely dependent on very sophisticated computer programs.


[deleted]


Also true, though it tends to be difficult to pin down what "business experience" means.

    What I am confident about is that there's no such thing
    as "business experience". Running a small business, 
    founding a new start-up, and being CEO of a major 
    corporation like Sprint are quite different 
    propositions,  requiring different skill sets, and 
    calling for different decisions. Managing a gigantic 
    organization of thousands of employees in a 
    multi-billion dollar transnational is simply not at 
    all like bootstrapping a small venture with one or 
    two underpaid partners and a shoestring budget.
    -- Zachary Ernst, Why I Jumped Off The Ivory Tower [1]
Suppose one had the ideal academic manager: fully cognizant of the specializations within the industry, with a 100% grant-proposal acceptance rate, available when needed, incapable of assigning the wrong people to the wrong project, never overpromising funding agencies solutions to open problems in computer science, and so on. The opportunity for academic advancement for a scientific programmer would still depend on publications, and in the case of a scientific programmer, that would include publications about the software one has written.

[1] http://zacharyernst.blogspot.com/2013/10/why-i-jumped-out-of...


To your last point, there's definitely no shortage of venues to publish and present when it comes to scientific software and development methodologies.


Agreed, though my point was that publications about software are weighted somewhat more heavily than the software itself.


In my experience established PIs tend to delegate a healthy fraction of the responsibility for managing technical staff, mentoring students, and even acquiring funding to their postdocs. In turn, faculty search committees see this experience as crucial, and competition for academic jobs is strong enough (~100 applicants per position, last I heard) that I would imagine any new PI should have a fair amount of experience under their belt.

At the PhD level and above, academia seems to revolve around managing one's personal brand. Publishing a few good papers isn't enough, you have to be "THE guy/girl" in an in-demand specialty to get a job. Your observation that this requires a business/PR/advertising mentality in addition to the ability to do solid research is spot on, but I think the game changes long before professor-hood, and I think that anyone who successfully obtains an entry-level professorship has already demonstrated competence at playing it.


Yes, I started as a research assistant at a world-ranked R&D organisation and we were paid peanuts: about 1/3 of what the civil service paid for the same role.


Interesting you should mention climate stuff:

http://wattsupwiththat.com/2013/07/27/another-uncertainty-fo...

Flipping a compiler flag gives completely different results! How can anyone trust this code, or any research built on it, regardless of their personal feelings about climate change?
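The mechanism, at least, is mundane: floating-point addition is not associative, so anything that changes evaluation order (optimization flags included) can change the result. A toy Python illustration:

    # Regrouping a sum, which is exactly what optimization flags and
    # parallel reductions can do, changes the rounded result.
    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)   # 1.0
    print(a + (b + c))   # 0.0  (the 1.0 is swallowed by -1e16)

In a chaotic simulation, a one-bit difference like this can diverge into a visibly different trajectory.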


HN really amazes me sometimes. I posted the exact same article with the exact same title 12 days ago [0] and didn't receive much traction. In fact, the only comment on the submission read

    this should have landed on the front page...
Anyways, I wonder if there is any explanation of this phenomenon.

[0] https://news.ycombinator.com/item?id=6623501


> Anyways, I wonder if there is any explanation of this phenomenon.

It's a basic principle of real-world science that the credit for stuff goes to people who didn't invent/discover it. That's the reason your submission wasn't noticed; you simply would have had to wait for the Nth time around.


Didn't hit the initial voting numbers needed to reach "escape velocity" from New->Hot.

Also consider it game-theoretically -- you're competing against lots of articles; perhaps other articles at the time were more interesting/compelling.


It's just random variation. Depending on the time of day, you need maybe 4 or 5 upvotes to make the front page and escape the /newest ghetto; but there's not much traffic on /newest so it's very easy for articles to not escape one time even if they escape the other time.


It's fine, really - I just wondered if there was more than randomness to it. Anyways, now I have attracted downvotes for my initial comment. Kind of surprised at that too! :)


You were in the right place at the wrong time.


Nicely written. A couple of things, somewhat independent of the economic issue: the "big data, dumb analysis" trend is bound to change; it's only a matter of time until volume is no longer an advantage. And "academia" is by no means a uniform discipline: some fields can only move ahead by gathering more data, while other fields more urgently need new testable hypotheses.


Compensation is a huge issue. Unlike pure biologists, computational scientists have plenty of job opportunities outside of academia and pharma.

As a staff programmer at a prestigious institution, I was making about $50k. I left for a large company and a few short years on I'm at around $200k.


"I have some serious doubts about whether the project will be able to attract a sufficient pool of applicants for these positions."

Really?? Certainly, academia has its inefficiencies, but if there is an area of academia that has an undersupply of PhDs, please correct me!


There are very few people in academia who can actually write code or run computational analyses with skill. Most are just trying to answer their very specific question, and will take all kinds of shortcuts. And God help you if you want to trace back exactly what commands (or, gasp, versions of programs) they used.

That's where the undersupply is. I've seen some very low-quality code from PhDs, even CS PhDs. One thing that is missing from a lot of academic curricula is software engineering technique. Hell, I've had to fight to get people to use git.

I'm more familiar with medical/biology research, where you have people that understand the domain, but not necessarily how to properly code. And if they, as a PI, need to manage a programmer (tech) / computational researcher, it can be difficult.


Ah, but this isn't just a problem with coding. A mechanical or electrical engineer would have been horrified by my vacuum system or electronic circuitry. ;-) When I was in grad school, the mantra was: It only has to work long enough and well enough to produce a result. My experiment was declared a success when I got the MTBF up from a few seconds to a few minutes.


IMO the life sciences are full of a very special kind of stupid wherein computational science is performed by people who could barely pass Calculus I let alone Linear Algebra or Statistics. And this would be harmless except that a lot of what they publish turns out to either be training set-based prediction of the training set or horrifying misapplication of null hypothesis significance testing.

And all that would merely constitute black box ignorance except that often if you try to help them make their work more reproducible or point out that even 32-bit floating point roundoff error can be initially indistinguishable from a bug that makes the results useless, a lot of them become agitated and tune out the possibility of any of this being relevant to their work.
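The float32 point is easy to demonstrate, e.g.:

    import numpy as np

    # float32 carries 24 significand bits: above 2**24 it cannot
    # represent every integer, so small increments round away to nothing.
    big = np.float32(2**24)                # 16777216.0
    print(big + np.float32(1.0) == big)    # True: the +1 vanished

A long float32 running total (a counter, a sum of squares, a log-likelihood) can therefore silently stall, looking exactly like a logic bug.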

I left academic science for industry over a decade ago because I caught a big shot doing exactly this, and he proceeded to threaten me instead of fixing what I found.


My basic line of argument is fairly simple. If you work at a university and your salary is paid via taxes you have an obligation to make your work available to the taxpayers.

This means the software you write should be open sourced and the papers and articles you write should be freely available.

I'll use the basic idea of this article as more fodder for that argument (it's even more important now because...Big Data ZOMG) in the future :)

I don't fully agree with the thought that writing software and writing papers are mutually exclusive. With a little creativity you can get a paper or two out of most academic software development projects (granted, they won't usually be the A-level "this advances my discipline" kind of papers, but one can't only write those anyways, imo).


If I work at an intelligence agency and my work is paid for by taxpayers, should the results of it also be available to the public?


Re underpaid researchers:

The part that is always unclear to me: there is a lot of evidence that research/university budgets are growing at a high rate. Where does all that money go? It doesn't seem to go to the people who actually do science.

That seems to be the fundamental problem here.


"overhead", i.e. administration. I told someone that I was a postdoc, he estimated my salary at first six figures. Then I told him down. He said 75k. When I told him 45k, his eyes bugged out and he said 'there's something wrong here'.

I mean, I come from an excellent pedigree, too, top 10 undergraduate, top 10 grad school, and I currently work at a place where my boss is a nobel laureate. (He makes at least 250k, according to the place's 990 forms)

However, my going price gets diluted by the idiots running around with rubber-stamped PhDs. Maybe there needs to be a 'brain drain', if we can drain those people selectively somehow.


I'm really confused - why don't PhDs with excellent pedigrees just get paid more money than "idiots running around with rubber-stamped PhDs"? Maybe the market doesn't see such a big difference.


I think you are right, lost_marbles. The difference between a PhD with an excellent pedigree and a rubber-stamped PhD should be their publication records. The problem would still exist if these rubber-stamped PhDs weren't there, I think. As a postdoc, you're going to get paid zilch, period.


This is a trend that stretches back a good 50 years, some say to the establishment of the postwar DoD funding system. I recommend Paula Stephan's How Economics Shapes Science [1] for a comprehensive view of both the underpayment and the budget-growth trends you mention.

(The bottom line is that the money goes to administration costs as mentioned, leading institutions to expand to stay relevant, thus building more buildings and hiring more PIs with full expectation for them to be funded via grants. Meanwhile, the average researcher's earnings are mostly unchanged.)

[1] http://www.amazon.com/Economics-Shapes-Science-Paula-Stephan...


It goes towards hiring more administrators to chase that grant money they need to make the budget even bigger.


This article has some incredible insights into things that had gone right past me. Thanks for sharing. I'm currently studying for a Masters degree which is centered on visualization techniques for exploring and analyzing large datasets.

I have found myself thinking many times that a position in industry, where I can use my teaching, data processing, and analysis skills to further some business goal, seems like a much more preferable option than sitting around writing research papers and applying for grants all day. Not to mention that academia pays less and has worse overtime conditions than any industry job I could conceivably get.

This article really nails the key reasons why I am feeling this reluctance toward an academic career.


> visualization techniques for exploring and analyzing large datasets.

Would it be possible for you to treat the vast amounts written on the internet regarding career choices as your "large dataset" and use your knowledge and tools to explore and analyze this dataset, and visualize it to us?


Probably not. I only work with data that's more structured, like spreadsheet-like data with very many records and dimensions.


I'm interested in doing a Master's in visualization. What program are you in? Would you recommend it?


I'm not in the United States. I am in the Visualization group at the Institute of Computer Science (Institutt for informatikk) in Bergen, Norway. It is one of a handful of research groups cooperating on data visualization in Europe, and is a very international research group. I think one of our PhD students and I are the only native Norwegians at the moment; we have a Turk, two Swiss, an Italian, a Russian, and a couple of other nationalities. Prominently, Vienna has another good visualization group, and I am pretty sure that UC Berkeley has one (Ben Shneiderman's, maybe?). I am currently the only Masters student in this group; we get almost no applicants.

Would warmly recommend it. Visualization is a very large field, so you'll have to carve out some niche. I am in information visualization, which is the branch most generally applicable if you want to do data mining-related things. But there are many other variants: Visualization of scientific/simulation data, e.g. flow rendering, combustion processes, climate simulation, as well as medical fields: CT/MRI volume rendering, real-time 3D ultrasound and quite a few others. Central themes at the moment are using GPUs to implement more advanced 3D volume rendering techniques, or even using GPUs to draw data which is not 3D but where there are performance issues when using CPU alone. For instance, drawing dynamic (25FPS, interactive) scatterplots of large (>1 million records) datasets.

I guess the definition of a "large" dataset varies by context, in visualization you hit this limit earlier than in statistics and non-visual data mining if you use "discrete" methods where every item is drawn on screen.


You only mentioned reasons against an academic career.

Presumably you have some for one. Why are you still uncertain?


Here's what future career paths very roughly look like for new grads (based on my own perceptions, not glassdoor):

Trying to Cure Cancer: $25k/yr, bump to $40k after 6 years

Engineering Medical Devices, Airplanes, etc: $60k/yr

Trying to Build the Next Twitter: $100k/yr, $150k/yr after 6 years

Helping Rich People Game the System to Get Richer: $150k/yr, $300k/yr after a few years if successful

The desire to stay in academia comes from having different priorities than the market. Many/most people do. The people who don't usually happen to specialize in a field that the market is currently smiling upon. It's great if your dream can be gently tweaked to be compatible with market considerations, but please understand that for most people this is not the case. There are many Big Problems lurking on the horizon where intermediate progress can't be monetized. Academia lets you work on them. The market doesn't.


> Here's what future career paths very roughly look like for new grads (based on my own perceptions, not glassdoor)

This sounds reasonable to me. Most people never have the experience of tripling or quadrupling their salary in a single career pivot, but that's what happens when you decide to bail on your sci/tech graduate research program and start working in the software industry. It takes a pretty crazy level of passion for your research domain to accept such insane opportunity costs. When I realized that by staying in science/academia long term I might never be able to afford to buy a house in a decent neighborhood or have kids it made it really easy for me to quit.


I am an example of this trend, having dropped out of a physics PhD program to pursue programming, and feeling great about the decision. There are two points which are not made by the author and which I think deserve to be mentioned:

1.) Scientific research is hard. It is very frustrating to spend all day working on something you can't be sure is even going to work. On the other hand, programming is easy; it's mostly monkey work. Sure, there are places where you have to be clever and think things out carefully, but at the end of the day, when you write code it just feels like you have so much more to show than when you do research.

2.) In research, you need to get grant money to do anything, because research is expensive. So much time is then spent writing grant proposals, and even when you get money you still can't do everything you want because it's just too expensive and/or time consuming. I'm not talking about LHC money here, but just the standard money for a professor running a lab in a university.

In programming, you can work on (I'd estimate) 95+% of problems with nothing more than your computer and some old-fashioned hard work. There are so many good, free-to-use open source libraries out there, making it pretty easy to jump into whatever field is interesting to you. Best of all, when you finish a project, there is no need to spend weeks writing a paper about it or any of the nonsense associated with that; you can just publish your code.


"Simple models and a lot of data trump more elaborate models based on less data..."

If we make the leap and assume that this insight can be at least partially extended to fields beyond natural language processing, what we can expect is a situation in which domain knowledge is increasingly trumped by "mere" data-mining skills.

This is a great point. For many years, domain knowledge was essentially experience acquired ex ante; in other words, the barrier to domain knowledge was access to data, or, in lieu of that, theory to estimate it. As data becomes larger, freer, and more amenable to analysis by a larger pool of talent, "domain knowledge" itself as a barrier to entry (prestige, effectiveness) seemingly declines. While this is in theory good (more access, more analysis), there is probably a corollary that we should expect turf wars and restriction of access, as those previously in positions of privilege fight to retain their status as "keepers of the keys".


I've spent some time in the economics field and certainly found that getting access to quality data was an enormous challenge. In fact a good amount of the work is spent putting together data sets. This is probably the bulk of what econ grad students do.

Besides this, it's important to recognize the difference between identifying correlations and patterns in data vs understanding the mechanisms behind the phenomena. Krugman makes a strong point: "The problem is that there is no alternative to models. We all think in simplified models, all the time. The sophisticated thing to do is not to pretend to stop, but to be self-conscious -- to be aware that your models are maps rather than reality." [1]

Data-mining will help generate and support hypotheses, but this is complementary to model building.

[1] http://web.mit.edu/krugman/www/dishpan.html


Econ is an interesting subject. Basically, it's taught using slide-ruler-y math from the 1950s [1]. It's amazing that learning to build linear algebra models without a slide rule (i.e., by hand), combined with a year or two of statistics and some calc, makes someone an expert in "Economics".

If you think about the huge increase in computing power available today, it seems a field ripe for disruption. However, the guys that run the Fed, and guys like Krugman, are basically from the "slide rule era" of econ.

I'm sympathetic to the power of modeling and these techniques, but I do think it's a case of "the more you know, the more you understand what you don't know." And I think for most people, the deeper you go into the field of econ, the more this becomes apparent. It's getting better, though, and I'm sure in another generation (once the current tenures expire) things will look a good deal different (hopefully better).

Not sure if your views are similar, though.

[1] You see little to no respect for dynamic systems that are chaotic or nonlinear; bounded rationality and its antecedent effects; the role of institutional underpinnings of markets, etc. Just to name a couple that are glaringly relevant and empirically important, but not amenable to "hand math".


You should actually make an effort to learn and understand modern economics before coming to such strong conclusions. There are good reasons why the discipline exists in its current form.

It is clear from the "slide-ruler-y math" jibe that you have no idea of the technical ability required to do research, especially in theoretical microeconomics and econometrics. You'll need more than mastery of "Mathematics for Engineering" and MATLAB to disrupt the economics profession.


> We always come across reports of how much talented Indians are and are conquering the world in the field of technology & business.

Um, I don't mean to be offensive, but I have never heard that.


My biggest worry is that it's not science that is in trouble, it's humanity as a whole that is in trouble.

The drain towards industry is to solve industry's problems: make a profit. And yes, these can be extremely interesting problems, and hard ones too.

But what would our current world look like if the scientists from the last 500 years had bent their minds to solve the problems of merchants? (and don't get me wrong, some of the problems merchants had evolved great solutions for mankind)


I'm an example of this trend. I trained for a decade in big data for neuroimaging. Then I realized I was working long hours with little upside beyond my own curiosity. I certainly wasn't changing the world in any tangible way. I had originally gone into neuroscience to understand my family's history of mental illness. Ten years later I was helping no one - not even myself.

I'm not convinced Big Science is in trouble, though. Those who have the motivation and talent to stay in the academy will continue to do so. Yes, outstanding people will be lost, but science can still progress from their contributions to commercial efforts. Geoff Hinton ends up at Google, but we'll keep hearing from him: the best use of his talents may well be improvements to products we all use every day, at massive scale.


Scientists in actual science are seriously underpaid, but there's a case of the Teacher-Executive Problem that's going to make it hard to fix that (http://michaelochurch.wordpress.com/2013/11/03/software-engi...). The more useful you are, the more need there is for many of you (individual "rock stars" are overrated in the real world), and the greater the implicit multiplier on improving your salary. Increasing scientific pay has a bigger cost load than increasing executive pay because we actually need scientists (or teachers, in the original formulation of TEP) in significant numbers.

Society has reached a point where the academic route means practically begging for a job that won't even pay for a house, while the startup lottery offers, at least, a chance. (And finance, better yet, offers a high likelihood of being well-off.)


I haven't read Wolfram's A New Kind of Science, but something tells me that this is a one-page summary of it?


Uh... no. A New Kind of Science is its own special kind of nonsense. For the most part, Wolfram argues that cellular automata are a transformative paradigm for science. It's not about the power of "big data" at all.


Huh. All the discussion on HN that I saw on it seemed to revolve entirely around big data. Nevermind, I guess.



