The eigenvector of “Why we moved from language X to language Y” (erikbern.com)
442 points by platz on March 16, 2017 | 189 comments

The research methodology in this blog post is fundamentally flawed. The author only counts how many people move from X to Y, but he doesn't count how many of them do not move at all. The whole diagonal of his (sample) transition matrix are actually missing values, but he treats them as zeroes. This greatly distorts the equilibrium distribution. As a result, he misinterprets each equilibrium probability as the "future popularity" of a language, when it at best only represents the future popularity of a language among those who constantly switch their languages.

Author here. You are absolutely right. As I mentioned in the notes, I think this matters a bit less than it might seem like (the stationary distribution does not change if you add a diagonal matrix) but clearly some languages will have a higher propensity for people to stay.

I think this flaw is even smaller than the issue of using Google statistics to infer transition probabilities. It's just a shitty proxy, at best.

At the end of the day, there are a lot of assumptions going into this analysis. I hope I didn't make it seem more serious than I meant it to be – it's really just a fun project and kind of a joke, not meant to be taken seriously.

That being said, I think the conclusions are at least "directionally" correct. They might be off by a factor of 2x or 5x or even 10x, but the stationary distribution exhibits an even bigger spread (multiple orders of magnitude) so I suspect the final ranking is still "roughly" correct (with a very liberal definition of "rough")

The thing I like most about this post is that it's falsifiable. We will know in ten years whether C and Java are still popular, and whether Go succeeds in the sense this data suggests. So thank you for being concrete and clear, even if it's all in fun and other people don't like it :)

> whether Go succeeds in the sense this data suggests

An interesting thing about this methodology is that it is extremely sensitive to the age of a language. It's possible to switch from an old language to a new language, but not the other way around -- so if you happen to do your measurements after a language has had some uptake but before it's been around for long enough that people have built significant projects on it and subsequently gotten sick of it, the future distribution by this method can only be 100% New Language. (Because sometimes people switch to New Language, but no one ever switches away.)
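A minimal sketch of that effect, assuming a made-up three-language chain in which nobody has yet switched away from the newest language. The probabilities, the language labels, and the self-loop on the new language are hypothetical and only there for illustration; rows are "from", columns are "to".

```python
import numpy as np

# Hypothetical row-stochastic transition matrix (rows: from, columns: to).
# Order: OldA, OldB, New. Nobody leaves New, so its row is a pure self-loop.
P = np.array([
    [0.0, 0.9, 0.1],
    [0.8, 0.0, 0.2],
    [0.0, 0.0, 1.0],
])

pi = np.full(3, 1.0 / 3.0)  # start from a uniform distribution
for _ in range(500):
    pi = pi @ P             # one step of the chain
    pi /= pi.sum()          # guard against drift from rounding

print(pi)  # essentially all of the mass ends up on New
```

Even small inflows (10% and 20% here) are enough: with no outflow, the new language absorbs everything in the long run.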

Actually, to predict the future distribution of language use, you also need to know the rate of people moving from nothing ("I just had a brilliant idea!") to each language. If everyone eventually transitions to Go, but everyone starts in Ruby, then the division of market share between Go and Ruby depends in part on how frequently people start new projects.

The sorted stochastic matrix shows that C contradicts your assumption that it's not possible to switch from a new language to an old one. Or, at least, it shows that portions of new language code are occasionally rewritten in C.

What's missing from the matrix is "no language" language or null language. That is, a column and row that represents people who start projects from scratch in a given language.

I agree that the analysis makes it abundantly clear people move to older languages, but the question is what new projects are started in, and how many projects represent new versus transitioned projects.

This analysis is interesting, and gives a rough idea of what people are moving from and to when they decide to do that, but not necessarily popularity.

What the author is indexing in the end isn't really predicted overall language use, it's predicted transition target frequency.

I have defined "new language" in my comment as one so new that no significant projects exist in that language, not as one which is newer than C but still arbitrarily old.

> if you happen to do your measurements after a language has had some uptake but before it's been around for long enough that people have built significant projects on it and subsequently gotten sick of it

By this definition, it is not possible to switch from a new language to anything.

It's stated as a binary, but really this defines a continuum of newness, and the metric of the OP is very sensitive to it.

I didn't interpret the results in this post as "predicting the future distribution of language use." Rather, I interpreted the ranking as an indicator of a qualitative trend in that distribution. I think the author also made this very clear.

And of course, all things trend toward newness, so your objection there seems more about time or human psychology than the methodology of the post.

From the post:

> I took the stochastic matrix sorted by the future popularity of the language (as predicted by the first eigenvector).

Emphasis in original.

Nevertheless, it seems quite obvious the author did not literally interpret the eigenvector as "x% of future projects will be written in Go." Rather, the conclusions he drew were along the lines of "Oh wow look Go is on top, C and Java are still relevant."

I reserve the right to respond to what people say. Commenting on the accuracy of a label is worthwhile regardless of whether the label was meant to be precise or loose.

This is a great point. It's a model that predicts something about the future. I could have backtested it on historical Google stats to figure out if it's a good model :)

On an unrelated note: interested in doing a guest post over at Math ∩ Programming? I see you've got lots of cool stuff with high-dimensional NN :)

That bit about the stationary distribution not changing if you add a diagonal matrix sounds completely wrong to me. Let me see if I understand what you mean. Given a matrix M with non-negative entries (and no row of just zeros), let S(M) denote the stochastic matrix you get by normalizing each row of M. You are saying that if M is any matrix and D is a diagonal matrix with non-negative entries then S(M) and S(M+D) have the same stationary distribution?

Moreover, the data is collected over the entire history. A matrix is a linear operator from time step T_i to T_(i+1). By conflating all historical observations into one matrix, it definitely is not an ordinary transition matrix.

That is apart from the fact that it is questionable whether it can be represented by an operator that is finite and linear.

It's more likely a stochastic process (infinite matrix) with births and deaths.

I would be surprised if it became true. :-)

That was my immediate thought upon reading. A little more accuracy in the description would be helpful. This should be presented as: "If this aggregated data reflected a constant across time, then we can see where language usage would end up in the 'long run distribution'." But the jump matrix shown here will change with time. Most likely, if search results could be binned up by time period, the time dependence of the distribution represented by the eigenvector(s) would be somewhat interesting to watch, even if not remotely predictive. Could animate that or contour plot it.... exercise for the author... :)

Right. Both the matrix S and the identity matrix will project the stationary distribution onto itself. So any linear combination of them will project the stationary distribution onto itself. Let me know if I'm saying something really stupid

Oh, you meant "multiple of the identity matrix" when you said "diagonal matrix"?

Doesn't that mean you're assuming the number of people staying with any language X is the same regardless of X?

[[1 1] [1 1]] and [[10 1] [1 1]] will have different stationary distributions, and the real values on the diagonal will likely not be a multiple of the identity matrix.
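A quick numerical check of those two matrices, as a sketch. `S` follows the row-normalisation defined earlier in the thread; the helper `stationary` and its iteration count are my own assumptions.

```python
import numpy as np

def S(M):
    """Row-normalise a non-negative matrix into a stochastic matrix."""
    M = np.asarray(M, dtype=float)
    return M / M.sum(axis=1, keepdims=True)

def stationary(P, iters=1000):
    """Stationary distribution of a row-stochastic matrix by repeated stepping."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        pi = pi @ P
    return pi / pi.sum()

a = stationary(S([[1, 1], [1, 1]]))
b = stationary(S([[10, 1], [1, 1]]))
print(a)  # both languages get an equal share
print(b)  # the first language now holds 11/13 of the mass
```

So a diagonal entry that differs between languages does move the stationary distribution.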

Another thing which seems at least as important:

How many projects are _started_ in a language, and how many _die_?

Take Java, for example: it may seem that there is a huge flow to Go. However, if there are enough new projects started in Java, then the number of Java projects might still rise faster than the number of Go projects.

And I share my Go reservations with you, given the whole error-handling (or lack thereof) philosophy as well as information emerging that it may require 100 lines of Go to do roughly the same amount of work as 20 lines of Elixir or Haskell, according to one example at https://medium.com/unbabel-dev/a-tale-of-three-kings-e0be17a...

(Although I concluded the Haskell-Elixir equivalency myself based on functional semantics)

Go is verbose. There is a lot of thought behind that; it is an intentional design aspect of the language. Personally I would not use gin; net/http, gorilla/mux, and httprouter are all solid choices.

What is the empirically-determined advantage of verbosity, then?

A lot of it boils down to making it easier for developers to work with each other, rather than any technological benefit.

The less magic that happens and the more code that is commonly used by all developers, the easier it is for others to read your code and understand what it does. Rob Pike and others have some interesting talks and blog posts on this

Also, the likelihood of people blogging about a language change is not the same across languages.

Some crowds (Go and nodejs enthusiasts) are notoriously vocal due to the hype.

Finally, it's common practice for companies to have their marketing department pay "media marketing specialists" to advertise their products (their language) by posting on forums.

Some languages and frameworks are simply too new to have a significant number of people moving away from them. Go and Vue fit that case. MySQL looks genuine, even if surprising.

Yeah, many languages will be almost entirely silent in transition. COBOL, FORTRAN, any proprietary language. And I would guess those transitions are usually to the "boring" languages: Java, C#, etc.

I wonder if the data could be normalized against usage statistics somehow. Maybe not perfectly effectively, but at least better than as-is, precisely because the number of people "switching" is so much smaller than the number using a given language.

Normalization would be extremely helpful indeed. C, C#, C++, Java, and Python are all very popular languages and have large values in both dimensions, which isn't particularly useful.

Presumably also on a time series. People used to switch from Ruby to Node.js, now they switch to ???

Agree. Depending on what one wants to measure, giving every sample equal weight can be potentially wrong.

If you mean to account for the larger variances of rarer events, that would be difficult. Perhaps some bootstrap sampling methods may help, but we need a true statistician here.
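For what it's worth, here is a rough sketch of what a bootstrap could look like, not a statistician's answer. The count matrix is hypothetical (not the post's data), and resampling each row as a multinomial is just one assumption about how the noise behaves.

```python
import numpy as np

def stationary(P, iters=500):
    """Stationary distribution of a row-stochastic matrix."""
    # Averaging with the identity keeps the same stationary distribution
    # but stops nearly periodic chains from oscillating.
    lazy = 0.5 * (P + np.eye(P.shape[0]))
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        pi = pi @ lazy
    return pi / pi.sum()

# Hypothetical counts of "moved from row to column" posts; not real data.
counts = np.array([[0, 40, 3], [25, 0, 2], [5, 4, 0]])

rng = np.random.default_rng(0)
draws = []
for _ in range(200):
    # Resample each row's destinations using the observed frequencies.
    boot = np.array([rng.multinomial(r.sum(), r / r.sum()) for r in counts])
    draws.append(stationary(boot / boot.sum(axis=1, keepdims=True)))

draws = np.array(draws)
mean, spread = draws.mean(axis=0), draws.std(axis=0)
print(mean, spread)  # spread gives a feel for how noisy each share is
```

Rows with few observations (the third one here) are resampled from very little data, which is exactly where the rare-event variance shows up.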

It would be interesting if the results could be controlled for time. Age of the language, year to year switch rates, etc.

At the bottom he writes, "the stationary distribution is actually independent of adding a constant diagonal (identity) matrix," but I'm not sure how that could be true (intuitively it doesn't make sense, but I don't know the math)

edit: An identity matrix wouldn't affect the stationary distribution, but if you had the actual "stay" probabilities they wouldn't all be the same, and thus not an identity matrix at all.

Adding a constant diagonal matrix indeed would not affect the equilibrium distribution. Mathematically, if the original transition matrix is A and you add a multiple of the identity matrix to it, then after normalisation by row sums, the new transition matrix becomes tA+(1-t)I for some 0<=t<=1. Since the Perron vector x of A corresponds to the eigenvalue 1, it must also be an eigenvector of tA+(1-t)I, because (tA+(1-t)I)x=tAx+(1-t)x=tx+(1-t)x=x. Hence the stationary distribution remains unchanged.
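The argument can be checked numerically. This is a sketch under assumptions: the 3x3 matrix, the value of t, and the diagonal D are made up, and the matrix is column-stochastic so that the Perron vector satisfies Ax = x as written above.

```python
import numpy as np

def perron(A):
    """Eigenvector of a column-stochastic matrix for eigenvalue 1, scaled to sum to 1."""
    vals, vecs = np.linalg.eig(A)
    x = np.real(vecs[:, np.argmax(np.real(vals))])
    return x / x.sum()

# Hypothetical column-stochastic matrix (each column sums to 1).
A = np.array([
    [0.0, 0.5, 0.3],
    [0.6, 0.0, 0.7],
    [0.4, 0.5, 0.0],
])

t = 0.25
x = perron(A)
same = perron(t * A + (1 - t) * np.eye(3))  # constant diagonal: unchanged

D = np.diag([5.0, 0.0, 0.0])                # only the first language "stays"
B = A + D
different = perron(B / B.sum(axis=0))       # renormalise the columns

print(x, same, different)
```

With the constant diagonal the vector is identical; with the uneven diagonal most of the mass moves to the language people stay with.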

But as you said, the stationary distribution does change if the missing diagonal is not constant. And there is no reason to believe it is constant in the first place. In the end, what the author measures is still different from what he thinks he measures.

I agree with you, but your last line is wrong: the footnotes make it clear the author knows the omission.

This is similar to people measuring popularity of a library or language by stackoverflow question counts.

Sure, you can't get a lot of questions for an obscure library that no one uses, but the mere fact that tons of people post questions about it doesn't mean it's good or popular. It might mean lots of beginners want to use it because it's been marketed and hyped as easy to use. (Angular comes to mind).

It might also mean that it's confusing and poorly designed, leading many, many of its users to ask questions about it.

Exactly! (That's what I intended to say but I just realized that I didn't emphasize the point).

Or that the official documentation is poor.

The real issue is that he is not counting the projects that start from zero ('why did I do this project in php') and the ones that end ('why did I finish this project in Haskell'). If he adds this data to his matrix somehow then it would be much more interesting.

> when it at best only represents the future popularity of a language among those who constantly switch their languages.

This is literally in the title

>The eigenvector of "Why we moved from language X to language Y"

The parent's point is that it is a very poor proxy for future language popularity.

I took this whole post as very tongue-in-cheek. No one should really be using this to predict the future. Instead, it was a fun data-analysis exercise.

I wouldn't call that a "flaw" so much as an "entirely different question".

No, it's definitely a flaw. The analysis doesn't model what he thinks it models. He claims that it doesn't matter by (correctly) pointing out that adding a constant diagonal doesn't change anything, and that hides the modelling assumption he makes:

That there is the same absolute number of people choosing to stay within every language studied.

His analysis would only be correct if the number of people who choose to stay with Java at any moment is equal to the number of people who choose to stay with Rust. That is absurd and hence the eigenvector is meaningless.

Doesn't exactly matter much given that this is just a bit of fun on a blog post though. :P

Doesn't it make sense if you ask: Given that someone will change the language they're using (and writes a public blog post about it), which are they most likely to change to?

It's not unreasonable to use that as a proxy for industry trends. I recall reading about manufacturing jobs which, sure, might have lots of factories not changing, but the ones that do change _definitely_ opt for more automation with fewer workers. That's still a trend worth thinking about.

> but he doesn't count how many of them do not move at all. The whole diagonal of his (sample) transition matrix are actually missing values, but he treats them as zeroes

Great point. It's interesting to think about how exactly this could bias the results -- would it bias it in favor of languages that developers tend to initially not choose for their project?

I do see that as very consistent with C, C++, and Java being up there. For new projects, developers love to choose anything but those, but then they find themselves gravitating towards them when the project gets bigger and practical concerns intrude

I assumed it was a joke.

Great. We'll wait for you to come up with better research then.

I wish more people would read 'Hack and HHVM', written by Owen Yamauchi, formerly a member of Facebook’s core Hack and HHVM teams.


The hidden lesson for me was that rewriting the code in <new-language> is not the only option. Another option is to slowly improve the language/runtime itself until you've essentially switched it out underneath the application, which is what happened at Facebook. Meanwhile keep refactoring the code to take advantage. (Granted, this isn't an option for a small company.)

I work at Facebook and sometimes write www code. When I was interviewing and thought about writing PHP code, I didn't get a warm and cozy feeling, being reminded of terrible PHP code I've seen (and written myself) in the 2000s as a "webdev". Thanks in part to the advances described in the book, the codebase is definitely not like that; it's easily the best large scale codebase I've ever seen (I've seen 3-4).

My thoughts in blog form (written 3 months after I joined):


As an unintended side-effect (maybe), the mere existence of HHVM acted as a spur to PHP generally; things have improved radically in PHP-land over the last few years.

I don't think the Google queries measure what the author thinks they do.

I noticed it lists 14 results from Haskell to Erlang, which I was skeptical of. When I google "move from Haskell to Erlang" or "switch from Haskell to Erlang" I do find results (such as Quora questions, versus questions, lecture notes) but none of those results are the type of article we're looking for.

If they really want to do this, I think they also need to validate that some of those keywords are in the title of each page.
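One way such a title check might look, as a sketch only. The function name `title_matches`, the verb list, and the sample titles are hypothetical; nothing here comes from the post's actual scraping code.

```python
import re

def title_matches(title, src, dst):
    """Return True if the title reads like 'moved/switched from <src> to <dst>'."""
    verb = r"(?:mov(?:e|ed|ing)|switch(?:ed|ing)?|migrat(?:e|ed|ing))"
    # re.escape keeps names such as C++ or C# literal; the lookahead stops
    # "C" from matching the start of "C++" or "C#".
    pattern = rf"\b{verb}\s+from\s+{re.escape(src)}\s+to\s+{re.escape(dst)}(?![\w+#])"
    return re.search(pattern, title, flags=re.IGNORECASE) is not None

# Hypothetical page titles.
print(title_matches("Why we moved from Haskell to Erlang", "Haskell", "Erlang"))
print(title_matches("Haskell vs Erlang: lecture notes", "Haskell", "Erlang"))
```

A filter like this would throw away the Quora and "versus" pages, at the cost of missing posts whose titles are phrased differently.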

I'm also concerned that for some blogs, a single post might appear several times in Google's estimated search results (due to crawling the same blog under different hostnames or the same post under different paths, or because of syndication or posts to link aggregators). So maybe some individual posts reflecting particular teams' decisions are reflected 5, 10, or 100 times in the Google result count.

Is the table itself cached anywhere?

I appreciate that the author wanted to implement their own eigen vector/value method, but really they should use:

Numerical stability can be hard to get right...

Yes, but... The matrix has all non-negative entries, and the author is after the highest eigenvalue/vector so I think this means stability is just not an issue. The only possible issue is time to convergence.

The nice thing about the power method is its conceptual simplicity. In cases like this, it's quite hard to screw it up. And it will scale far beyond those numpy functions (not that this is needed for this example.)

Also, did anyone mention PageRank yet?

That's not really the definition of numerical stability. You're thinking of linear system stability, which is a whole other topic. Numerical stability is how resilient a computation is to computational error. These computations are taking place using floating point values, and the issues arise in the error terms for floats. The author's algorithm is reproduced below:

    m /= m.sum(axis=0)[numpy.newaxis,:]
    u = numpy.ones(len(items))

    for i in xrange(100):
        u = numpy.dot(m, u)
        u /= u.sum()

Immediately concerning is the use of a dot product and a sum, which will lose a lot of information when the values being added are of different orders of magnitude. (For example `M + epsilon = M` in many floating point computations).

Also, while worrying about scalability is overkill for a computation of this size, large inputs are precisely where the numerical stability problem gets even worse. Imagine I have data following some kind of power law and compute `M+epsilon_0+...+epsilon_n` left to right. No matter how large `n` is, if M and the epsilons differ by enough orders of magnitude, all the information in the epsilons is lost, whereas `epsilon_0+...+epsilon_n+M` could be an entirely different number. Highly recommend checking out the LAPACK stability guide here http://www.netlib.org/lapack/lug/node72.html for more.
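A small sketch of the ordering effect in IEEE double precision. The values (`M = 1e16`, a thousand terms of 1.0) are chosen only to make the absorption visible.

```python
import math

M = 1e16
eps = [1.0] * 1000          # many small terms next to one huge one

# Left to right starting from M: each 1.0 is rounded away immediately.
left_to_right = M
for e in eps:
    left_to_right += e

# Small terms first: they accumulate to 1000.0 before meeting M.
small_first = sum(eps) + M

# math.fsum tracks the lost low-order bits and is order independent.
compensated = math.fsum([M] + eps)

print(left_to_right == M)          # the small terms vanished
print(small_first - left_to_right) # the information that was lost
```

Whether this matters for a 30-odd language matrix normalised to sum to 1 is a separate question; the sketch only shows the mechanism.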

Why do you say it will scale far beyond? Mat mul is N^3 as is eigenvalue solving.

It's actually the second highest eigenvalue. The highest eigenvalue is always 1 for stochastic matrices.

Power method is not matrix-matrix multiplication (which is not N^3, BTW [1]), but rather matrix-vector multiplication. So the power method is N^2*k where k is the number of iterations required to reach precision (usually polylogarithmic).

All this being said, scalability is _obviously_ a non-issue when talking about a matrix of programming languages. All methods are constant time.

[1]: https://en.wikipedia.org/wiki/Matrix_multiplication_algorith... Interesting tidbit: nobody can even prove it's not N^2 :)
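To make the N^2*k cost concrete, here is a sketch of power iteration with a convergence check instead of a fixed loop count. The tolerance, the random 50x50 matrix, and the function name are assumptions; the column normalisation matches `axis=0` in the quoted snippet.

```python
import numpy as np

def power_iteration(A, tol=1e-12, max_iter=10_000):
    """Dominant eigenvector of a non-negative matrix via repeated A @ u."""
    u = np.full(A.shape[0], 1.0 / A.shape[0])
    for k in range(1, max_iter + 1):
        v = A @ u              # one O(N^2) matrix-vector product
        v /= v.sum()           # renormalise so entries sum to 1
        if np.abs(v - u).sum() < tol:
            return v, k
        u = v
    return u, max_iter

rng = np.random.default_rng(0)
M = rng.random((50, 50))       # hypothetical dense positive matrix
A = M / M.sum(axis=0)          # column-stochastic
u, k = power_iteration(A)
print(k)                       # number of matrix-vector products used
```

The total work is k matrix-vector products, and k depends on how well separated the top eigenvalue is from the rest rather than on N directly.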

Oh right, of course, because you're iterating the distribution. Duh.

And yea, mat mul is not N^3 theoretically, but most implementations are. I've heard that some (mkl maybe) are 2.8, but haven't had someone point me to code. My personal attempts at implementing Strassen were slower than a tuned N^3 implementation, at least for matrices that fit into memory.

They are just describing power iteration which is a standard technique for finding the primary eigenvector.

It is commonly used for web scale recommendation problems, which likely explains its usage given the author's prior background. It also underlies PageRank.

AFAIK it is quite stable but converges slowly. From my experience within 50 iterations you will have converged to a stable result.

10000 vocal webdevs make a blogpost about moving from Node/Python/Ruby to Go because their app is slow as shit and the JVM isn't trendy enough for them. Also I wonder if I'm reading this correctly but are there actually people moving from Cassandra/DynamoDB to Mongo??

there is clearly a mistake in the database graph ... that it must be inverted... because also no one is moving from mariadb to mysql

Well the fact that C is going so strong guarantees we'll be dealing with easily preventable bugs for the next 100 years

Or moving to C is something so exceptional nowadays that it warrants an explanation and thus gets included in this matrix, whereas everybody and their dog thinks moving away from C is the most natural thing in the world and doesn't blog about it.

If that were true wouldn't we see greater results from C to X? Seems like people are moving away from C less often than Java or even Python strangely enough

Only if people using C are equally likely to blog as people using Java or Python.

Sadly, getting rid of C means getting rid of UNIX, as they are symbiotic, and UNIX vendors will surely never rewrite it in anything else or replace the POSIX standard.

C and, say, Rust or C++ can interface.

You don't need to rewrite, just stop writing extra stuff in C, maybe when you do a really big refactor in C, port it. In the end, we can migrate away from C gradually.

That makes me wonder, is there any chance in hell we get some Rust in the Linux source code?

Upstream? Likely never. Out of tree? https://github.com/tsgates/rust.ko/blob/master/README.md

I recall that Linus once said on the LKML that he could see himself accepting Rust code into the kernel. But I cannot find the source right now.

The problem isn't technical but rather political.

If you want to fix UNIX security issues related to C, first you need to convince kernel developers across all UNIX variants to move away from C.

Then even a fully modernized UNIX kernel needs to provide support to unsafe POSIX APIs.

As we all know, Windows is totally secure, GUIs are awesome and the Burroughs architecture is unique.

See you already know, improving the world one person at a time.

One issue that would make Rust more likely in Linux would be a GCC toolchain for Rust. I personally don't want to see fragmented development of the language, though. Perhaps a backend for MIR for GCC could be done?

But, Linus et al still love C...

No way will Torvalds allow modifying the build process to include Rust.

Write your own out-of-tree Rust modules if you like, but they'll never be merged in.

Really? Why can't the C bits be replaced with e.g. Rust?

They could be. But you'd need a reason in order to be able to fund it (or maybe you just want to scratch an itch).

But the codebase numbers in the 100's of millions of LOC, and replacing that with Rust (or anything else for that matter) will come with a number of requirements:

- it really needs to be better in terms of bugs

- it needs to be about as fast or faster

- it would have to come with similar start-up times for the runtime

- you'd need to find funding somewhere or convince people the need is so high they should volunteer their time

For a single individual this is likely not a viable project, even doing just the basics (coreutils) would take you a couple of years at a minimum.

What might be a better way is to reboot unix entirely starting from the kernel and working your way up into userland.

There is plenty there that could use a more modern look at things, after all UNIX really is showing its age and LINUX is a re-implementation of something that was already old when it was started.

The Linux ecosystem was created with the userland already having been written from scratch by the GNU Project. Then a Finnish student came along and wrote a kernel. So you can probably start anywhere, but maybe starting with a working kernel would be better.

Linux doesn't really resemble UNIX anymore. It's starting to look a little like Plan 9, to be honest... give it 20 more years.

> Linux doesn't really resemble UNIX anymore. It's starting to look a little like Plan 9, to be honest... give it 20 more years.

Hm. I'm not sure I see the similarities there. plan9: small, elegant, really an improvement on Unix in many respects but unfortunately somewhat theoretical rather than practically oriented.

Linux: bloated, blunt, practically oriented, gets the job done but it never feels like it's the shortest path.

No, because UNIX requires C semantics, so even if someone writes a UNIX-like OS in Rust, Ada, or whatever language it might be, for compatibility with UNIX software it would require a POSIX API to be available.

POSIX is defined in terms of C semantics, which includes C unsafety, like managing pointers and the respective length as separate entities, using null terminated strings or casting void* to specific data structures.

Which means the POSIX translation layer would need some sprinkles of unsafe code to be able to comply with the required semantics, thus opening it to the same exploits as C code.

This issue is visible in OSes that aren't written in C but do expose POSIX runtime layers, like mainframes.

Being mainframes, they usually restrict possible security exploits by running POSIX applications in their own containers or enclaves.

There used to be POSIX standards for both Ada and Fortran as well as C (POSIX.5 and POSIX.12). Though I don't think they can be described as successful.

If I remember right, the Fortran one was defined in terms of the C one, but the Ada one was written as if it was an independent specification.

I suppose this was only possible because POSIX misses out a lot of the fiddlier bits of Unix anyway (and the Ada one specified a spawn to use instead of fork+exec).

Yes I remember those, but you still have the safety problem.

For example, how does one make memcpy() safe in Ada? The very first step of converting the pointers + length into an access type must trust that the caller did the right thing.

I wouldn't say the OS requires unsafe POSIX compatibility. The whole system can be mostly unaware of POSIX if you can use the available APIs to create shims. Basically a wine equivalent - Linux didn't implement any windows APIs after all.

So all the unsafe bits can exist only in the process which is based on them.

Good luck getting your precious Rust to be compatible with your client's millions of lines of code.

For larger and legacy projects, maintainability and continuity are overwhelmingly the top priority; being flashy is the least of their concerns.

Writing Rust off as flashy is demeaning to the excellent work that has been going on in the Rust ecosystem to specifically address the area you mention with a safe language alternative.

That doesn't make the OP's claim, that rewriting 100 million lines of code in Rust would solve the whole problem, sound any less stupid.

Maybe. I think it would be possible to take a BSD or Linux (technically not a UNIX) and port the kernel space to Rust and user space to Go?

Not all UNIXes are closed source.

Might as well write it from scratch without POSIX and all the lessons we've learned since then.

I think Rust opens the door to designing kernels that are small and tight with most of the OS stuff that was traditionally in kernel space moved into user space.

Opens which door? Minix3 is microkernel based and userland compatible with NetBSD. XNU is originally based on the Mach microkernel and has part of FreeBSD bolted on. Both are open source, as are Mach proper and L4.

Perhaps someone could start with Minix3 or Darwin rather than Linux or BSD or from scratch. Replace one component at a time in Rust, or D, or Ada...

Easily machine verified kernels that have a high degree of having no security exploits because of buffer overruns or null pointer shenanigans etc...

All software has bugs. Any half arsed developer can create a lot of them in any language. C alone doesn't guarantee anything. Heck maybe there's fewer because C devs don't get lulled into thinking the language has their back?

>Heck maybe there's fewer because C devs don't get lulled into thinking the language has their back?

You make the serious mistake of assuming all or the majority of C devs know how powerful the language is, or how to properly use a language like C. All software has bugs, true. It's just that C bugs tend to be a wee bit more dangerous than a lot of other bugs because of the raw power of low level languages like C.

Writing secure code in low level languages is tough, and it requires knowledge of all the pitfalls and nasty corners of the language, which many devs don't have the time, interest or desire to learn.

Yes, you're right. Tools don't matter at all. /s

Not as much as understanding what you're writing and taking the time to write it correctly.

C is a tadpole in the ocean of easily preventable bugs.

And when it grows up, it becomes the toad that is C++?

I'm not sure I get your analogy.

Besides, Real Programmers™ eat mutable state and NULL for breakfast.

NULL is just a macro for Cheerios...

Then why do we see a major internet security bug that would be simply impossible in any other language every couple of months?


Note the types of vulnerabilities and the languages they use. DoS, File Inclusion, XSS, Exec Code, Dir Traversal, Priv Escalation, SQLI, Bypass, CSRF, Info leak, etc.

Many of them (including ones in C) have nothing to do with memory protection. Those that do (null pointer deref, use-after-free, memory leak, buffer overflow, etc) are all trivially protected with small kernel patches that have been around for 17 years, and of course most of these vulns would become trivial with proper mandatory access control. But for reasons that completely escape me, nobody has adopted these basic techniques to prevent small bugs from becoming big holes.

Now balance the common holes in C against all the other bugs in higher level languages with otherwise suitable memory protection, and consider that at least with C there are basic steps that prevent many of these from becoming problems, whereas with other languages you need a hodge-podge of different, more complicated countermeasures. C is actually easier to secure because its bugs are common and not difficult to catch by the kernel.

Honestly, if people spent as much time developing new industry best practices for use of the language as they do complaining about it, this would be a non-issue. But C isn't trendy, so let's all crap on it and pretend it's the only issue so we can have fun reinventing classical bugs with new languages.

> consider that at least with C there are basic steps that prevent many of these from becoming problems, whereas with other languages you need a hodge-podge of different, more complicated countermeasures. C is actually easier to secure because its bugs are common and not difficult to catch by the kernel.

I think this is completely wrong. In other languages you just don't have these problems in the first place because you are memory-safe by default. It's not like there's some point-buy system where all languages have to have the same number of opportunities for bugs and not having stupid memory safety bugs means you have more complex subtle bugs instead. You just eliminate a huge proportion of your bugs.

Not saying that C is not to blame, but there seems to be relatively little code written in languages other than C that is as widely used.

C++ comes to mind for writing browser engines, but those have their fair share (if not more) of security issues as well.

My sense is that most new line-of-business code these days is written in Java or C#. Which are still dated languages with serious flaws (e.g. goto fail would still be possible) but do at least address C's memory safety issues.

I don't know why so much Internet infrastructure managed to miss that shift. Possibly a case of "ain't broke, don't fix it" - Apache, OpenSSL and what have you haven't changed that much since the early '00s and few people are motivated to write a replacement. Databases mostly existed since then, and newer datastores do tend to use better languages.

You need to compare bugs on a per-usage basis (e.g. lines of code, programs written, some other similar metric), not a raw total. You're comparing numerators, not ratios.

Because of the vast amount of C code running the internet. That's all. Rewrite everything in <language> and you'd see just as many security vulnerabilities.

I very much doubt that. Buffer overflows simply wouldn't happen in most languages, and those are the majority of the vulnerabilities we see.

That appears not to be the case.

  pwillis@windows:~/Downloads$ wget -q -O allitems.csv  https://cve.mitre.org/data/downloads/allitems.csv
  pwillis@windows:~/Downloads$ ( for year in `seq 1999 2016` ; do TOTAL=`cat allitems.csv | grep -v RESERVED | grep "^CVE-$year" | wc -l`; BUFF=`cat allitems.csv | grep -v RESERVED | grep "^CVE-$year" | grep -i -e "buffer.*overflow\|overflow.*buffer" | wc -l`; PERCENT=`awk "BEGIN{print $BUFF/$TOTAL*100}" | cut -d. -f1`; echo "Year $year: $TOTAL CVEs, $BUFF buffer overflow related, $PERCENT% total" ; done )
  Year 1999: 1573 CVEs, 307 buffer overflow related, 19% total
  Year 2000: 1237 CVEs, 250 buffer overflow related, 20% total
  Year 2001: 1540 CVEs, 278 buffer overflow related, 18% total
  Year 2002: 2370 CVEs, 481 buffer overflow related, 20% total
  Year 2003: 1519 CVEs, 346 buffer overflow related, 22% total
  Year 2004: 2670 CVEs, 470 buffer overflow related, 17% total
  Year 2005: 4686 CVEs, 519 buffer overflow related, 11% total
  Year 2006: 7047 CVEs, 612 buffer overflow related, 8% total
  Year 2007: 6510 CVEs, 868 buffer overflow related, 13% total
  Year 2008: 7034 CVEs, 619 buffer overflow related, 8% total
  Year 2009: 4888 CVEs, 589 buffer overflow related, 12% total
  Year 2010: 4954 CVEs, 419 buffer overflow related, 8% total
  Year 2011: 4441 CVEs, 428 buffer overflow related, 9% total
  Year 2012: 5219 CVEs, 399 buffer overflow related, 7% total
  Year 2013: 5731 CVEs, 382 buffer overflow related, 6% total
  Year 2014: 7494 CVEs, 326 buffer overflow related, 4% total
  Year 2015: 6526 CVEs, 357 buffer overflow related, 5% total
  Year 2016: 7180 CVEs, 471 buffer overflow related, 6% total
According to this really shitty review of CVEs, buffer overflows have been less than 10% of tracked vulnerabilities since 2010 (6% or less since 2013). That's still a lot, of course.
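For anyone who would rather not re-read the file once per year, here is a rough single-pass sketch of the same tally in Python. It mirrors the grep filters above (skip RESERVED entries, match "buffer" and "overflow" in either order). The sample rows and the `tally` helper are hypothetical stand-ins, not the real layout of allitems.csv.

```python
import re
from collections import Counter

# Hypothetical sample lines; the real allitems.csv has more columns.
SAMPLE = """\
CVE-1999-0001,Buffer overflow in example daemon allows remote code execution
CVE-1999-0002,Cross-site scripting in example web app
CVE-1999-0003,** RESERVED ** entry
CVE-2000-0001,Stack overflow in a buffer used by example parser
CVE-2000-0002,SQL injection in example login form
"""

# Same pattern as the grep above, case-insensitive.
PATTERN = re.compile(r"buffer.*overflow|overflow.*buffer", re.IGNORECASE)
YEAR = re.compile(r"^CVE-(\d{4})-")


def tally(lines):
    """Return {year: (total, buffer_overflow_hits, truncated_percent)}."""
    total, hits = Counter(), Counter()
    for line in lines:
        match = YEAR.match(line)
        if match is None or "RESERVED" in line:
            continue  # not a CVE row, or a reserved placeholder
        year = int(match.group(1))
        total[year] += 1
        if PATTERN.search(line):
            hits[year] += 1
    # Integer division truncates the percentage, like `cut -d. -f1` above.
    return {y: (total[y], hits[y], 100 * hits[y] // total[y]) for y in sorted(total)}


result = tally(SAMPLE.splitlines())
for year, (n, overflow, percent) in result.items():
    print(f"Year {year}: {n} CVEs, {overflow} buffer overflow related, {percent}% total")
```

Like the one-liner, this only matches on description text, so it inherits the same blind spots.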

Fair enough; I was thinking of the occasional "the whole internet is broken" CVEs we see a few times a year. My impression is that they're mostly C and mostly memory-safety, but that's a human subjective thing.

The right thing would be to weight by severity and number of users impacted, and bundle all the memory-safety vulnerabilities together (i.e. buffer overflow, double free, use-after-free, aliasing violation). I'll add it to the big list of blog posts I want to write.

XSS (cross-site scripting) replaced buffer overflows as the most common vulnerability in 2005.

source: http://maxedv.com/wp-content/uploads/2011/12/Sourcefire-25-Y...

SEL4 is written in ... C.

Am I reading this incorrectly, or is there more movement from Swift to Objective-C than the other way around? Do I sense a methodological error?

Absolutely an error there. If you try to search the literal string "move from swift to Objective-c", Google rejects the precision given there are so few results, and searches only for the words independently. The bulk of results in that case are actually about moving from Objective-C to Swift.

If you search "move from Objective-c to Swift", there are actually enough results that it honors the literal string, yielding far fewer results.

Definitely a major methodology error for pairs with a large asymmetry.

Nonetheless I found the post humorous, and I don't think it was held with the conviction some of the top posts seem to think.

I'm guessing that there are a lot of articles saying "We tried to embrace Swift, but found that it's not ready for prime time, so we're going back to Objective-C for now." Meanwhile, no one's going to bother to write an article justifying their decision to switch from Objective-C to Swift, because it's an expected migration.

So maybe that's a flaw in the whole exercise: it's only taking into account remarkable/unusual language switches, the ones people think are worth writing articles about.

I had this exact reaction to the graphs that also showed:

1. Movement from Postgres to MySQL

2. Movement from MariaDB to MySQL (and NOT the other way around?!?)

3. Movement from PHP to Java (I remember the sort of people leaving Java for PHP 10 years ago, and I don't think they'd go back, or that PHP people would pick Java as their choice to move to)

I think maybe he has the axes labeled wrong?

MySQL is actually seeing a resurgence as people realize that ACID is valuable and performance is just fine for 99.9% of use-cases.

And Java is seeing a bit of a resurgence as well as people get fed up with shitty PHP and other dynamically typed languages. Java has some frameworks like Dropwizard and Spring Boot that make it not as terrible anymore.

> MySQL is actually seeing a resurgence as people realize that ACID is valuable and performance is just fine for 99.9% of use-cases.

That wouldn't explain why people jump from the database with better ACID (Postgres) to the one with generally worse ACID (MySQL). Or why people would move from the open source non-Oracle fork (MariaDB) to the Oracle-acquired original project that everyone forked away from (MySQL).

> And Java...

Oh, I agree. Java is awesome these days. I'm just making a disparaging blanket generalization about the people who jumped to shitty PHP to begin with.

Author here. Yes you are reading this incorrectly. Look at the contingency table.

objective c to swift: 5216

swift to objective c: 1639

(sorry about the small font size though. had to squint really hard)

The contingency table shows these very clearly, but I believe the question is about the future popularity table.

The future popularity table shows that Swift is to the left of Objective C (i.e. it has a smaller future probability). The coloring of the chart shows a much darker square for Swift -> Objective C than for Objective C -> Swift.

It seems surprising given your contingency table that your analysis would show that Objective C is going to be the more popular of the two.

If I understand it correctly, that means that most people who move away from Swift move to Objective C, but people moving away from Objective C also move to C# and Java.

Yes, this is correct. The second table shows conditional probabilities. Almost everyone moving from ObjC goes to Swift, but conversely out of the people moving from Swift it's more spread out
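A minimal sketch of how a count in the contingency table becomes a conditional probability: divide each row by its total. Only the two Objective-C/Swift counts (5216 and 1639) come from this thread. Every other count is invented, so the printed values are illustrative only and are not the post's actual numbers.

```python
# Raw "move from X to Y" counts, indexed as counts[source][destination].
# Only 5216 and 1639 are from the thread; the rest are made up.
counts = {
    "objc": {"swift": 5216, "java": 3000, "csharp": 2500},
    "swift": {"objc": 1639, "java": 200, "csharp": 161},
}


def row_normalize(counts):
    """Convert counts[src][dst] into P(dst | moved away from src)."""
    probs = {}
    for src, row in counts.items():
        total = sum(row.values())
        probs[src] = {dst: n / total for dst, n in row.items()}
    return probs


probs = row_normalize(counts)
print(f"ObjC -> Swift: {probs['objc']['swift']:.2f}")
print(f"Swift -> ObjC: {probs['swift']['objc']:.2f}")
```

With these made-up row totals, the smaller raw count ends up as the larger conditional probability, which is how a cell can look darker in the normalized chart despite having fewer search hits.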

This looks pretty interesting. Most striking to me is that Go is taking from other 'target' languages. You can see the 5x5 block of the other strongest target languages giving to Go, but not taking from it. To make what the Eigenvector says explicit:

Top 5 giving to Go directly: C, Python, Java, Ruby, Scala

Top 5 giving to C: C#, R, Java', C++, Fortran

Top 5 giving to Python: C', Perl, Java', C#, C++

Top 5 giving to Java: C', C++, PHP, Python', C#

Top 5 giving to Ruby: Python', PHP, Perl, Java', (C++ — only 215)

Top 5 giving to Scala: Java', Ruby, (Python', C#, PHP — only 100, 17, 16)

': language also in top 5 givers to Go.

The other top languages take from each other (there is migration in both directions), but currently Go mostly takes here. However, it does lose people to Rust, which is actually the strongest go-to language from Go. And C++ does not give to Go.

This might point to a discrepancy between Go marketing and reality (efficiency and replacing C++).

It would be great if you could repeat this exercise next year to see how things changed.

(besides: the script is nice and concise!)
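For readers wondering what "the eigenvector" amounts to in practice, here is a rough sketch (not the post's actual script) of getting a stationary distribution from a row-stochastic transition matrix by power iteration. The three languages X, Y, Z and all probabilities are invented for illustration.

```python
# Hypothetical transition matrix: P[i][j] is the probability that someone
# leaving language i moves to language j. Rows sum to 1; all values invented.
LANGS = ["X", "Y", "Z"]
P = [
    [0.0, 0.2, 0.8],
    [0.3, 0.0, 0.7],
    [0.5, 0.5, 0.0],
]


def stationary(P, iters=200):
    """Approximate the distribution pi with pi = pi * P by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n  # start from a uniform guess
    for _ in range(iters):
        # One step of the chain: new_pi[j] = sum over i of pi[i] * P[i][j]
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


pi = stationary(P)
for lang, share in sorted(zip(LANGS, pi), key=lambda pair: -pair[1]):
    print(f"{lang}: {share:.3f}")
```

Z comes out on top here simply because both other rows send most of their mass to it. The zero diagonal mirrors the fact that the search data contains no "stayed with the same language" counts.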

Or it could be that Go is a fairly new language and has so few users (relative to other top languages) that the search queries show up in only one direction. You should note what is happening with the other "new" languages.

Rob Pike has a great blog post that talks about the creation of Go and where its adopters were expected to come from vs. where many of them actually ended up migrating from


Highly skeptical of so many people migrating from C# to C, or Python to MATLAB, to give a couple of examples. This seems like a highly flawed methodology from many perspectives, as pointed out in comments.

Or Rust to COBOL? I had the same thought.

Then reading more I realized the results included things like optimizing certain portions of programs into a language for hot areas of code. Though the Rust to COBOL one I should go read. That's nuts.

Also C# to VB. That's pretty much a one-way street in .NET development. I have yet to meet a developer interested in moving from C# back to VB.

Indeed. C# devs tend to go full in on Microsoft and Microsoft isn't about C these days.

I find it amusing that for Javascript frameworks the approximate end-state is perpetual oscillation between React and Vue.

Anyone else routinely roll their eyes at "why we moved from x to y" blogs?

They are always just "We wanted to do this in a particular way. So we fought the framework till we decided to move to another that does things the way we thought they should be done. Now things are much better but we will fail to mention down the line all the new compromises we have to deal with"

Most often, it's a "We built this thing initially in X until the legacy technical debt and proto-duction compromises caught up with us, then we rewrote it in Y, leveraging all the domain knowledge and experience we've gained after doing it in X. Amazingly, the second system in Y is better/faster/has less bugs!"

This is not a "why" post. He did an N x N contingency table of how many people moved from language X to language Y, then rendered his findings as a set of directed graphs.

Who is the one person who rewrote their matlab homework in php?

Apparently it is a scientific program that was made into a web app https://www.ufz.de/index.php?en=39156

It should be noted that the premise dictates this eigenvector is limited in scope. It seems to apply mostly to people who both wish to create products (usually for some commercial venture) and have a habit of saying things like "Well this looks hard. Let's try reinventing this wheel with different tools and see what happens."

This is super interesting. It's also interesting to see the converse - who isn't moving anywhere. Go, Elixir, Dart, and Clojure all seem pretty happy!

This is an absolutely terrible methodology for asking that question. People tend to blog about major language changes. Very few C# shops/devs are publishing blog posts about how they're still using C# this year.

Unfortunately, you'd probably be hard-pressed to find a decent methodology.

But indeed, "we used this language for 10 years without thinking about switching" is a much more interesting metric than "we switched languages 3 months ago".

Yeah true. My mistake, but fair point. I'm not exactly sure how to measure that to be honest. Seeing what people used for ten years without switching would be pretty neat.

Just look at the labour market. Even when companies don't switch languages they constantly have employee turnover.

The only problem is that the labour market is a lagging indicator. Perhaps the best approach is to combine statistics about active switchers (like the OP) with labour market statistics.

Alas, this data says nothing about how many people aren't switching languages. The small number of people switching away from Go, Elixir, Dart, and Clojure might simply be because you can't switch away from something you've never used...

Yeah you're totally right. Also that they haven't been around for too long either. Didn't occur to me, but makes sense.

Is it though? The public sector in Denmark has been using Java and C# and will continue to do so. This is where a lot of the programming jobs are, mind you, yet if you look at who's switching to what and what languages are "trending" you'd see things like Go here as well.

Node, as an example, has been trending in my area for long enough for there to be job listings. Yet if I look at the job listings, 90% of them are for Java, C# and .NET, or PHP. In fact, in a region of 1.5 million people where Node has been trending for a while, there are 0 Node-related job openings this morning.

Our workhorses are Java, .NET, Objective-C, Swift, JavaScript and C++.

I don't see it changing until customers request any other kind of deliverables.

Oh and we have zero projects in any sort of public repos.

I really appreciate his method of breaking it down per-niche.

When it comes down to it, all languages are DSLs. Even LISP/Scheme are DSLs for making DSLs (like Butterick's "Beautiful Racket" earlier today).

Presenting them as per-niche directed graphs is probably less likely to steer newbies (and sadly, not-so-newbies) into another round of "let's redo everything in X!"

Agreed. Nevertheless, everything will be redone in JavaScript, by the look of things.

"Has a small, but awesome CUMmunity"

I wonder whether that typo was intentional ;)

The whole thing is like a "Yo Dawg, I hear you like Javascript" meme, so probably so.

I don't see any database that we actually care about.

SQL Server, Oracle, DB2, Informix.

Big corps don't write blog posts about their tech stack.

I can't really tell if you're joking, but I think your question illustrates a different flaw in the article.

People rarely write about the databases you listed, because their users have an entirely different mindset. Sure, people may switch from Oracle to DB2, or from SQL Server to Oracle, but some organisations just have "standard databases" that they work with. Switching would be a multi-year process, and certainly not something to be advertised, unless it's "Now with support for SQL Server" in the marketing material.

I am not joking, those are the type of databases I use daily on the programming stacks I posted in another thread.

My employer does enterprise consulting.

I love golang, but I hate trying to search HN articles for the word go... I need a "find go but not ago" search button in Firefox ;)

grep go | grep -v ago

I can't tell you the number of times that I've searched for Rust and gotten back results about protecting your old favorite car. And project names like corrode don't help narrow the search; in fact they raise the error rate.

It's funny, there was this company Yahoo! that was trying to organize the internet to try and fix this...

Another thing you can do is simply put a leading and trailing space in your search... " go " (quotes not needed)

If you use Ctrl+F for search, you can select "Whole Words" to "find go but not ago".

Ctrl-F has a 'Whole Words' option that should do what you want.

Edit: Oops, missed the same reply from xudongz
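The "Whole Words" idea can be sketched with a word-boundary regex. Unlike `grep go | grep -v ago`, which throws away any line that happens to contain "ago", this keeps titles that mention both. The titles below are invented examples, not real HN submissions.

```python
import re

# \b matches at a word boundary, so "ago" and "Going" do not count as "go".
WHOLE_WORD_GO = re.compile(r"\bgo\b", re.IGNORECASE)

# Hypothetical titles for illustration only.
titles = [
    "Why we moved from Python to Go",
    "Show HN: a project I started 3 years ago",
    "Go 1.8 released a month ago",
    "Going serverless",
]

matches = [title for title in titles if WHOLE_WORD_GO.search(title)]
for title in matches:
    print(title)
```

It still cannot tell the language from the ordinary verb "go", so it narrows the search rather than fixing it.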

I've only had success searching HN for Golang.

searching for "r" is even worse.

Search for " go"?

Only relatively small projects can afford a rewrite. So this is a statistic among projects that can afford a switch. And as far as I can see, this is the eigenvector of a trend: the first derivative of the actual state of things. It is more informative about the state of fashion today.

I always thought that there was no such thing as the best programming language in the world. After all, they only recycle the same recipe, again and again. Then I found Smalltalk and realized how wrong I was.

There is nothing that comes close to competing with the massive success that Smalltalk has been. It blows my mind how it can be so much better than anything else out there, including the usual suspects (Lisp, Haskell, blah blah).

But in the end it's not about the language, it's about the libraries. Hence why Python remains my No. 1 choice.

In the end, however, even Smalltalk is terribly outdated. The state of software is abysmal, trapped in its own futile efforts at maintaining backward compatibility, KISS and not reinventing the wheel.

In short, software is doing its best to keep innovation at a minimum, and as such pretty much everything sucks big time and is still stuck in the stone age.

I once considered becoming a professional coder working in a company doing the usual thing, I am glad I was wise enough not to choose that path. I would have killed myself right now with all this nonsense that makes zero logical sense.

But my hope is in AI, the sooner we get rid of coders, the better. Fingers crossed that is sooner than later. Bring on our robotic overlords.

Saying that I know a lot of people that really love coding and respect it as an art and science, so there is definitely hope.

The author is surprised that angular is holding up. I've been learning angular2 the past few weeks after having never used a single page application framework before and I'm loving every second of it. I'm never going back to ASP.NET MVC except to use it as an API.

Consider Vue.js https://vuejs.org/v2/guide/

* I find the template syntax is more sensible than react.

* It's not as total as angular. You can use just the small parts you'd like.

> I'm never going back to ASP.NET MVC except to use it as an API

Angular isn't going to help you write your server. That's such a strange statement.

Quite often ASP.NET MVC is/was taught with the Razor view syntax wrapped around the axle of everything else that MVC can do, because it was such a revelation compared to the suck of WebForms.

The Python/Ruby axes are interesting. You've got over twice as many Python to Erlang posts out there...and a fraction of the Python to Elixir's. Ruby has the opposite: a bunch of Ruby to Elixir and very few Ruby to Erlang.

I wonder why that is?

The Elixir syntax is quite similar to Ruby's, so if you are going to change from Ruby to a language running on the BEAM VM, switching to Elixir is easier than switching to Erlang because you don't have to relearn as much. I suppose that this effect is missing for Python to Elixir, and that therefore relatively more people choose to switch to Erlang.

Elixir is Ruby-like syntax on the Erlang VM. So, Rubyists enjoy it.

To me it's unsurprising that people move from "thing that was popular a while ago" to "newer thing".

Generally when a language/framework/toolset first hits, it looks magical and fixes loads of problems people are currently experiencing.

Its own crop of problems has yet to emerge (generally these only emerge once a sufficiently large number of projects have been using it for a sufficient length of time).

So at the moment Go is the new thing, and it's supplanting the older new things... come back in 3-4 years and it'll likely be on the losing end to something else.

This is certainly interesting. Great for happy hour bullshitting, but too flawed to take seriously.

I'm not dismissing it. Just wanting to define proper context. Else some twithole will start a shit storm over null.

The prevalence of Go may be caused by the stupid fact that Google search results include pages where "go" is used just as a common verb and not a language name.

@platz : The time component is missing in the analysis to make comparisons meaningful to me - so in comes a cubic meter of salt.

> Surprisingly, (to me, at least) Go is the big winner here. There’s a ton of search results for people moving from X to Go.

You mean ... Google search results?

I'm not trying to suggest that Google's search engine is intentionally biased toward Google projects, but I think it's reasonable to assume that their own projects wouldn't fall into whatever unintentional blind spots their search engine may have.

They didn't exactly make it search engine friendly either, given that the name is quite generic/short.

It would be interesting to do the same thing but with Bing and see if C# and .net comes up first ... ;)

I find it very hard to believe that no-one is moving to node.

They are. Mainly from Java, Python and PHP.

You may be reading the axes the wrong way around... look down the node column rather than the node row.

I find it very hard to believe that anyone would be moving to node.

Oi. Another person who thinks the number of search results returned is a real number that means something.... The fact that it gives even plausible results is impressive, as the number is made up by Google's servers.

Made up? Can you explain? I ask because I'm professionally working on a project which uses those results and I've often wondered about their validity. I know there are... issues with them in various ways, but what are you aware of?

It's not appropriate for me to give you the details, so I'll just say I wouldn't rely on that at all.

I half-expected Unicode to have an emoticon for this https://assets-cdn.github.com/images/icons/emoji/trollface.p...

But alas, this post had to use an image.
