Ask HN: What's the biggest difference between professional coding and academia?
67 points by acalderaro on July 8, 2017 | 52 comments

Academic code typically just has to work once or a handful of times, for a small number of highly expert users, frequently just for the author. Ease of update is of the essence: you'll rewrite most of it many times as your understanding of the problem changes. You can use all sorts of ugly hacks so long as you get what you're after.

If any of it ever becomes commercially released or whatever, there'll need to be a complete rewrite that makes it usable and maintainable by people other than yourself. But most of the code will never get to that point because most of what you've done up until about a week ago is wrong and worthless, and the current, correct-until-next-week iteration is stuck together with duct tape.

Speed only matters on the infrequent hot paths, which is why Python is popular. The rule of thumb is that nobody cares about speed or resource consumption until the code needs to run on a cluster, but then you care a lot, because cluster time is metered and simulations can get huge. Fortran is still fairly popular because many math libraries are written in it, and porting would require huge effort from a very small group of very busy people.

Most of the coders are not software engineers and don't know / don't follow best practices; on the other hand, the popular best practices are not designed for their use case and frequently don't fit. Versioning (of the I-don't-know-which-of-the-fifty-copies-on-my-laptop-is-the-right-one type) is a big issue. Data loss happens. Git/GitHub/etc. have a steep learning curve, but so do all the various workflow systems designed for research use.
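A minimal sketch of the "speed only matters on the hot paths" point above: profile first, then optimize only what dominates. Everything here (`simulate_step`, `expensive_kernel`) is an invented stand-in, not code from any real research project.

```python
# Hypothetical sketch: profile before optimizing. Both functions are invented
# stand-ins for a simulation; the profile shows which one is worth tuning.
import cProfile
import io
import pstats

def simulate_step(x):
    # stand-in for the cheap setup code that is NOT worth optimizing
    return x * 1.0001

def expensive_kernel(data):
    # stand-in for the hot inner loop that dominates cluster time
    return sum(v * v for v in data)

def run(n=1000, iterations=100):
    data = [simulate_step(i) for i in range(n)]
    total = 0.0
    for _ in range(iterations):
        total += expensive_kernel(data)
    return total

profiler = cProfile.Profile()
profiler.enable()
run()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
# expensive_kernel should dominate the cumulative-time column; that hot path
# is the only place where dropping into C, Fortran, or NumPy would pay off
```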

We've had good luck with some academic code bases in production -- ETH Zurich puts out some great code [1,2].

[1] https://github.com/libigl/libigl

[2] https://github.com/pybind/pybind11

Well said. It's like iterating prototypes: why would you spend time and thought engineering it properly? The point of prototypes is they're quick and cheap.

My own experience with reusing code by making a framework in academia: it immediately prompted me to think of interesting cases not possible within it...

To be fair, pybind11 is not an ETHZ codebase ;). (original author here)

In academic systems papers, every performance claim needs to be backed up by an experiment. But you can get credit for a feature merely by arguing that it would be possible to implement with your design, even if you didn't actually do it.

In production software, this is flipped. Every feature claim needs to have an associated test, as it's a contract with your user. But when it comes to performance, everyone just waves their hands.

I'm being a little glib. But production software has to work. You'll spend far more time dealing with all of the "less interesting" details and edge cases than with research software. As ams6110 points out, this means more focus on testing, maintenance and good design. But I do want to emphasize testing - sometimes you'll spend more time testing something than actually implementing it. There's also often many more residual effects from dependencies elsewhere in the ecosystem you're working in. That's not typical in academic software.
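The "every feature claim needs an associated test" contract above can be illustrated with a toy sketch; `slugify` and its claimed behavior are invented for this example, and the point is only that each claim gets its own test.

```python
# Hypothetical feature claim: "slugify lowercases a title and joins the words
# with hyphens". In production style that claim is backed by tests.
import io
import re
import unittest

def slugify(title):
    """Turn a title into a URL slug: lowercase alphanumeric runs joined by '-'."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))

class SlugifyFeatureContract(unittest.TestCase):
    # one test per claimed behavior: the contract with the user
    def test_lowercases(self):
        self.assertEqual(slugify("Hello World"), "hello-world")

    def test_collapses_punctuation_and_whitespace(self):
        self.assertEqual(slugify("  Ask HN:  What's new? "), "ask-hn-what-s-new")

suite = unittest.defaultTestLoader.loadTestsFromTestCase(SlugifyFeatureContract)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
# result.wasSuccessful() is the "contract holds" signal
```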

I've encountered academic code that implements the state-of-the-art, most efficient algorithm for solving some problem.

The code that comes to mind had the following properties: over 20 years old; written in C and badly converted to C++ somewhere along the way (the stuff-all-the-globals-into-a-class approach); and a combinatorial explosion of #define and #ifdef statements (covering all the experiments in the original paper).

In the paper, it is clear that one of the experiments wins, and why. So...

Step 1: remove all dead code.

Step 2: observe that the algorithm needs no dynamic memory allocation; remove all but one call to malloc, calloc, realloc, and free.

Step 3: replace the use of float with correctly scaled 64-bit unsigned integers, with no loss of precision.

Step 4: rewrite entirely in modern C++. This has two benefits: (a) I get to use the <algorithm> library (judiciously; this simplifies the code enormously), and (b) the code can send clearer signals to the compiler than the mid-90s liberal sprinkling of the 'register' keyword.

The net result is no asymptotic improvement whatsoever — arguably a slight improvement for very large N as heap performance starts to interfere, but nothing worth the effort.

However, the code now has tests (step 0), is clean and maintainable, is 10% of the size, and is 5-30x faster (depending on the shape of the data).
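The C++ from that rewrite isn't shown in the thread, but the idea behind step 3 (floats replaced by correctly scaled integers) can be sketched in Python; the `SCALE` factor here is invented for illustration.

```python
# Hypothetical fixed-point sketch: values stored as scaled integers instead of
# floats. SCALE is an invented resolution; the original used 64-bit unsigned
# integers, whereas Python ints are arbitrary precision.
SCALE = 1000  # store values in thousandths

def to_fixed(x: float) -> int:
    return round(x * SCALE)

def fixed_add(a: int, b: int) -> int:
    return a + b            # addition is exact in fixed point

def fixed_mul(a: int, b: int) -> int:
    return a * b // SCALE   # rescale after multiplying

# 0.1 + 0.2 is famously inexact in binary floating point...
assert 0.1 + 0.2 != 0.3
# ...but exact once scaled to integers, with no precision loss at this resolution
assert fixed_add(to_fixed(0.1), to_fixed(0.2)) == to_fixed(0.3)
```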

This is not a good rule of thumb, it depends on what your research is. In most cases I've dealt with (security) the academic software displays terrible performance characteristics and is very buggy. The industry application that surfaces years later does not have these problems but it doesn't present anything novel.

I said academic systems papers; their evaluation criterion is generally performance.

Not really. Their evaluation criterion is usually only performance when there are existing, established ways of doing something.

In my academic systems papers, the evaluation criterion is generally performance. I have to admit I'm not quite sure what you mean.

Maintainability. A lot of academic code only has to last long enough for one project or thesis, and the only maintainer will be the original author. Real-world code will last longer[1], and be worked on by more people, including people of lesser skill without the original author around to guide them. Often, that code also has to run in more environments. This difference is reflected not only in the code itself, but even more importantly in the infrastructure around it - source control, tests, documentation, bug trackers, etc.

Ironically, an academic might get to spend a higher percentage of their time on pure coding than a professional coder does. They have other concerns. Maintainable code is not part of the desired outcome. It's consumable and expendable, not durable, so any time spent making it any better than "just barely good enough" is wasted. Why build a tank when all you need is a bicycle?

[1] At least the expectation. Some academic code lives on far longer than its authors intended, and some non-academic code vanishes pretty darn quickly. But in general, both the intent and the expectation is that non-academic code will live longer.

Ha. Check what I wrote two days ago on a different thread https://news.ycombinator.com/item?id=14708868 about great programmer habits.

> easier to maintain code is king. Unless you are writing something extremely time critical do not try to be clever. A little slower is okay (and yes, I am in the performance consultancy business) if it significantly decreases the maintenance burden. Clever hacks belong to toy projects and blog posts. The next person who maintains it will be stupid to the code -- even if it's yourself. That clever hack is now a nightmare to untangle. In short: always code under the assumption that you will need to understand this when the emergency phone kicks you out of bed after two hours of sleep in the middle of the night. The CTO of Cloudflare was woken to the news of Cloudbleed at 1:26am.

Now I correct myself: clever hacks belong to academia, toy projects and blog posts.
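A toy illustration of the clever-hack trade-off (both functions are invented examples, not from the quoted code): each counts set bits correctly, but only one is obvious to a maintainer woken at 3am.

```python
# Hypothetical example: two correct ways to count set bits in an integer.
def popcount_clever(n: int) -> int:
    # Kernighan's trick: n & (n - 1) clears the lowest set bit.
    # Without this comment, the loop is a small puzzle for the next maintainer.
    count = 0
    while n:
        n &= n - 1
        count += 1
    return count

def popcount_clear(n: int) -> int:
    # a little slower, instantly understandable
    return bin(n).count("1")

assert popcount_clever(0b1011) == popcount_clear(0b1011) == 3
```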

This is important. Plus everything related to debugging when shit happens at 3am and developers are sleeping and maybe contractually not reachable (think about big companies, not startups).

I was moved to Operations for two years after 10 years of Development. When I went back to Development, I started coding as if the 20+ years the code will live after the initial deployment are more important than the 6 months spent creating it. And they are more important for the company, because those years pay everybody's salaries and the shareholders' dividends. The first 6 months? Not so much.

Academia never has that problem. They also almost never have to deploy code to production.

"Beware of bugs in the above code; I have only proved it correct, not tried it." -- Donald Knuth.

This is the biggest difference between academic and professional programming in a single pithy statement, from a paper that Knuth wrote.

Academia only worries about getting results for publishing. Testability, maintainability, clean design, all take a back seat because once the paper is done the author will likely never touch the code again.

Having this exact problem right now. The authors don't respond to emails, either.

Because they're working on the code for other papers and fixing old code doesn't add anything to their CV.

Try to make your company offer them money to cooperate. They might be suddenly very interested in your questions.

I'm just a student doing research, trying to use their code. I don't even want them to fix the code, I just wanted to ask the primary sources about how they implemented a few things. I don't have to ask them, I could go to someone else or a forum and ask the same thing. Maybe I'm naive, but it doesn't take a whole lot of effort to just respond back and tell me that they don't have time.

Because they got new email addresses at faang.com :)

A number of Linux kernel developers have been working with a subset of the Usenix FAST (File systems and Storage Technologies) community. We have held a Linux FAST workshop after the FAST conference for the past few years.

A few years back, some of the researchers (professors and graduate students) claimed they were interested in more testing and in possibly taking some of their work (Betrfs[1], specifically) and productionalizing it. In response, I spent a lot of time on the kvm-xfstests[2] and gce-xfstests[3][4] testing infrastructure, cleaning it up, making it work in a turn-key fashion, and providing lots of documentation.

[1] http://betrfs.org

[2] https://github.com/tytso/xfstests-bld/blob/master/Documentat...

[3] https://github.com/tytso/xfstests-bld/blob/master/Documentat...

[4] https://thunk.org/gce-xfstests

Not a single researcher has used this code, despite the fact that I made it so easy that even a professor could use it. :-)

The problem is that trying to test and productionalize research code takes time away from the primary output of Academia, which is (a) graduating Ph.D. students, and (b) writing more papers, lest the professors perish. (Especially for those professors who have not yet received tenure.) So while academics might claim that they are interested in testing and trying to get their fruits of the research into production code, the reality is that the Darwinian nature of life in academia very much militates against this aspiration becoming a reality.

It turns out that writing a new file system really isn't that hard. It's taking the file system, testing it, finding all of the edge cases, optimizing it, making it scale to 32+ CPU's, and other such tasks to turn it into a production-ready system which takes a long time. If you take a look at how long it's taken for btrfs to become stable it's a good example of that fact. Sun worked on ZFS for seven years before they started talking about it externally, and then it was probably another 3 years before system administrators started trusting it with their production systems.

Academics aren't paid to code. Academics are paid to do research.

Professional coders are paid to code.

Maybe this is unusual, but I've seen labs hire CS grad students to write their code. I always assumed this was widespread practice.

That is usually done for large research projects that collaborate with industry and require high quality code (e.g. EU-funded project "HOBBIT"; https://project-hobbit.eu/)

JavaScript and the web generally are a really big deal. Unless you are working close to the metal, at some point you are probably going to have to write code that somehow works in a browser. The problem is that it doesn't resemble C++ or Java, and very few people figure out how these technologies actually work.

Academia isn't preparing developers for this reality. Many will try to fake it or hide behind imposter syndrome, which is fine if everybody in the company is an imposter; otherwise it is plainly obvious you are incompetent.

I suppose it depends on why the academic is writing code. I've written simulations for social science research -- basically to extrapolate the results of certain decision making strategies by an idealized decision maker in a toy problem. The theory is the subject, not the code. Many people who read my paper will not be programmers able to critique the code, and few will care whether it follows best practices. I've made it open source (https://github.com/joeclark-phd/bandito) because I think it's a good practice to promote, but I'm kind of an oddball. When I was studying other well-known simulation papers, I found their models were incompletely described and I had to contact the original authors to get implementation details in order to replicate them. This is a lot like the problem of data sharing -- all academics should be willing to share their data in order to prove that their results are as reported, but it's not an easily enforceable principle.

If you are talking about computer science academics, of course, that's a horse of a different color. In that case, the code is the topic, so I would guess that they're providing it! On the other hand, the majority of such research is probably solving niche problems and special cases, so it may not be very usable in your professional coding.

In academia you get to choose your problem. This means that if something isn't working you can restrict the inputs or only operate on some subspace of the problem.

In contrast, industry doesn't let you choose the problem: you need to solve whatever the problem is that the client has. This means generalising a lot further and having a less optimal solution that is more robust to input error or poorly calibrated measurements. Even if it does fail you should be able to identify why and explain to the user what they did wrong.

In academia this feedback process is generally to the person who wrote the software, so a cryptic error message including some algorithmic details might be sufficient to debug the inputs sufficiently.
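That contrast can be sketched as follows (the `calibrate` functions and their error messages are invented for illustration): the academic version fails with a cryptic algorithmic assertion, while the production version validates inputs up front and tells the user what they did wrong.

```python
# Hypothetical contrast: same bad input, two failure styles. Both functions
# and their messages are invented for illustration.

def calibrate_research(samples):
    # academic style: a cryptic assertion deep in the algorithm
    assert len(samples) > 1, "singular covariance"
    return sum(samples) / len(samples)

def calibrate_production(samples):
    # industry style: validate up front, explain the fix in the user's terms
    if not samples:
        raise ValueError("calibrate: 'samples' is empty; provide at least two "
                         "measurements from the sensor log")
    if len(samples) < 2:
        raise ValueError(f"calibrate: got {len(samples)} sample(s), need at "
                         "least 2 to estimate variance; check capture settings")
    return sum(samples) / len(samples)

try:
    calibrate_production([3.0])
except ValueError as err:
    message = str(err)  # actionable: names the argument and what to do
```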

In one sentence: Scientists build so they can study. Engineers study so they can build.

... one?

Do you know where this quote comes from?

My primary observation so far has been that academics are going to want to reach into the guts of my data analysis pipeline at every single step and inspect what's going on.

This informs my design choices quite a bit.

In academia you don't care about the quality of your code; in professional work you do care, but don't get the time to fix it ;).

On a more serious note: in addition to what others have already mentioned about quality, performance, and so on, I'd like to add that in a professional career you most likely work with a (larger) team. This means you will run into conflicts where code is reused for different purposes and you cannot simply change it. You also have to think about readability and documentation, as your colleagues have to be able to understand the code without losing too much time or needing you.

You will also always have to work with legacy code. Most likely code you want to change but can't considering the timelines.

You will have to sync your design with many others. You might have to convince them or discuss issues with conflicting requirements or deadlines. There will be times you can't finish your entire design and have to think of a staged introduction or even harder, change it so it can work with only 50% of the design.

Also, your code has to run for many years. You can't simply take an experimental third-party package maintained by a single person. Too risky. You have to think about hardware expiring or no longer being supported (especially with GPUs).

You have to think about licenses. Academia is usually free; professionally, you have to take a close look.

Operations. Forward and backward compatibility concerns. In professional coding, good enough beats cute implementations that no one will see.

The ability to even get a build of the software using anything but the exact OS version, tools, compiler, libraries, etc, that the author used. Even if you can get it to build, the chance that it works as intended is small.

The consensus here is that "professional" code is more maintainable than "academic". That's probably the ideal, but not entirely sure it holds up in practice. In particular, approaches which put a lot of emphasis on clarity and "testability" of individual functions/"units"/whatever can make it harder to understand and reason about what the program as a whole is doing.

Also, the focus on building software in teams seems to lead to architectures that need teams (vs. suites of manageable-size, "do one thing well" tools).

Slightly different take on this: http://yosefk.com/blog/why-bad-scientific-code-beats-code-fo...

What does "academia" mean to you? Are you an undergraduate student, a grad student, a postdoc, a research scientist, a professor (and if so, which level)? Relationships to software creation differ at each level.

See: Why do many talented scientists write horrible software? - Academia Stack Exchange https://academia.stackexchange.com/questions/17781/why-do-ma...

Wrote this recently, kind of an answer to this question (if you don't mind another disgruntled JavaScript rant): https://guscost.com/2017/06/19/future-driven-development/

Academic is often theory

Professional is often whatever works

This is fairly common with many academic vs professional differences, btw

In most cases, not enough testing

For what it is worth, LLVM was birthed in academia.

So was the Internet, broadly speaking. Same for a lot of technology we rely upon; the question is: what changed it from a one-shot proof-of-concept into a product?

The biggest difference: in academia you code on Sunday night and only a very small number of people actually read your code. In industry, you write code and then it goes through a pipeline of multiple reviews, so you end up spending most of your time addressing all those comments, and you also do the same for the code written by other people.

Obligatory xkcd: https://xkcd.com/664/

The superstructure of a software project: the size and density of the crew.

The number of // @TODO's in the code.

Testing and review.


Previous HN discussion: "Why can't you guys comment your fucking code"


Copy&pasting my response there:


Why is code coming out of research labs/universities so bad?


Academic projects are typically one-offs, not grounded in a wider context or value chain. Even if the researcher would like to build something long-term useful and robust, they don't have the requisite domain knowledge to go that deep. The problems are more isolated, there's little feedback from other people using your output.


Different incentives between academic research (publications count, citation count...) and industry (code maintainability, modularity, robustness, handling corner cases, performance...). Sometimes direct opposites (fear of being scooped if research too clear and accessible).


Lack of programming experience. Choosing the right abstraction boundaries and expressing them clearly and succinctly in code is HARD. Code invariants, dependencies, comments, naming things properly...

But it's a skill like any other. Many professional researchers never participated in an industrial project, so they don't know the tools, how to share or collaborate (git, SSH, code dissemination...), so they haven't built that muscle.

The GOOD NEWS is, contrary to popular opinion, it doesn't cost any more time to write good code than bad code (even for a one-off code base). It's just a matter of discipline and experience, and choosing your battles.

> Previous HN discussion: "Why can't you guys comment your fucking code"

Who is that clown? And why is the shit-post of a 4-day-old reddit account being discussed all over the interwebs like gospel?

That person very likely has regrets not finishing high school and is venting frustration in the form of misplaced anger.

What resonates with people, and why, is a rather deep question. Indicative of an arbitrage opportunity (lucrative), if you can really get to the bottom of it.

It would befit someone of your intellect to try to figure out why the post was so popular, instead of an arrogant dismissal.

What makes you think a highly upvoted online discussion is something that resonates with people, especially in the era of strong correlation between anonymous foul-mouthed posts and massive vote manipulation?

But let's discuss that post in case you couldn't assess the level of ignorance of that shit-bag:

- A universal claim, e.g., one starting with "every [JavaScript] project ...", is fairly easy to debunk (I guess it's fair that a high-school dropout like him didn't know that), and lo and behold, it did not take me more than 10 minutes of Google and GitHub searching to find JavaScript projects with a near-complete absence of code comments, and with variable names resembling the ones that moron was complaining about.

- He is a total hypocrite, as pointed out multiple times on reddit as well as HN, for pissing on other developers about short variable names and yet making a post and comments full of acronyms himself.

- If JS developers are 'inbred peasants' (his own characterization), the fact that one of those visits a machine-learning forum and throws a temper-tantrum at the whole community for variable naming and code comments, only goes further to confirm the impression that the JS community carries some of the least-educated, least-knowledgeable nasty teenagers who just discovered the developer console of a browser they use 24x7 to cast slurs on each other, and now they think they're the gods of computer science.

Even if you ignore all that, the biggest thrust of that shit-post is a wholly subjective one, that variable names he's encountering while reading machine learning code are _not to his liking_. That is it. I could just as well go ahead and say, ctx_h is a perfectly fine variable name, 'ctx' stands for the word 'context' (a well-known shorthand), the underscore is borrowed from the latex convention of subscripting, hence the 'h' is a subscript. And while it is not clear from the name what 'h' should stand for, it's obvious that ctx_h is a special case of some 'context', and it's completely fair to expect the reader to understand this source code in light of the paper associated with it, (which by the way is the source's documentation and, in a sense, a super-polished form of code-comments). Not to mention, this naming convention is practised even more faithfully in the mathematics community, where you would find names like x_i, a_0, all over a theorem or proof (again underscore representing a subscript). And yet my whole argument would be based on a subjective opinion.

While I completely admit that academics, by virtue of being domain experts first and software developers second, are more likely to suffer from a lack of clean coding and established software-engineering practices, it is far from being a black-and-white case. Not even close. Having spent half a decade in grad school after many years in the software industry, and having advocated the use of modern software-engineering practices, I recently took up work at one of the big software companies and was shocked to find the quality of their C++ code worse than any of the Fortran and C++ codebases I encountered at the university. And personally, I've found machine learning Python code to be a fair bit cleaner than most C++ code I've come across.

I'm not against criticism, and I think the machine learning community could use a lesson or two on software engineering, but if you're up for such an undertaking (criticizing a whole community) you'd better make sure you don't come across as a complete ignoramus and a hypocrite.

Because it quite obviously does, even by your own admission. Are you arguing against yourself now?

That reddit post is clearly tongue-in-cheek, written consciously in an exaggerated voice to spark interesting discussions (which it did) -- not a peer-reviewed journal article. But I have no doubt you're aware of that, please stop trolling.
