A baffling scale transform on a chart of university course selection trends (columbia.edu)
174 points by luu on Jan 30, 2023 | 35 comments


https://twitter.com/benmschmidt/status/1562489422503120896

  It's a log scale centered on 100%, so that the relative positions of dots accurately show their proportions relative to each other. E.g. "doubled" is the same distance from "stayed the same" as is "halved." A weird hill I choose to die on is that log scales are often better.


Yep; I agree. The original has some layout issues (the 0 line should be highlighted). But the scale is better.

In the "fixed" chart the scale goes from -130% (meaningless - implies negative degrees were awarded) to about +130%. A course dropping its graduation rate to 0 would be a big deal - but this chart implies it wouldn't be as shocking as the CS graduate rate doubling.

The top entry is at 130% - meaning the number of graduates approximately doubled (2.3x). The lowest entry looks to be about -48%, so the number of graduates approximately halved. These are similar changes, so they should look like similar magnitudes in the chart's scale. But in the "corrected" chart, they look very different. In a geometric / logarithmic scale, they would look like they'd drifted a similar distance from the middle ("unchanged").
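
To put rough numbers on that, here is a minimal Python sketch comparing how far the two extremes sit from "unchanged" on a linear percent axis versus a log axis. The 130% and -48% figures are just the approximate readings quoted above, and `log_position` is a name made up for illustration.

```python
import math

def log_position(pct_change):
    """Position on a log axis, in units of doublings, for a percent change."""
    return math.log2(1 + pct_change / 100)

top, bottom = 130.0, -48.0  # approximate readings from the chart (assumed)

# Linear axis: distance from 0% is just the size of the percentage.
linear_ratio = abs(top) / abs(bottom)

# Log axis: distance from "unchanged" is |log2(scale factor)|.
log_ratio = abs(log_position(top)) / abs(log_position(bottom))

print(f"linear distance ratio: {linear_ratio:.2f}")
print(f"log distance ratio:    {log_ratio:.2f}")
```

On the linear axis the top entry sits about 2.7 times as far from the centre as the bottom one; on the log axis the ratio is about 1.3, which is much closer to "similar magnitudes".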

If I were going to improve this even more, I'd draw lines from each data point back to the centre point ("0") implying movement from the centre (which is what the chart is showing).

Nitpicking mode: I'd also label the centre as "unchanged from 2011" and I'd add the absolute numerical values to the chart. (There's plenty of room!) I also find the title a bit vague. "% change in undergraduate degrees 2011-21" made me wonder if it was talking about the enrollment or graduation rate. I'd mention that it's counting undergraduate degrees awarded, and that the data set is for the USA (as opposed to a particular college, or the whole world or something).

As it is, both variants seem a little underwhelming to me.

Original: https://statmodeling.stat.columbia.edu/wp-content/uploads/20...

"Fixed": https://statmodeling.stat.columbia.edu/wp-content/uploads/20...


Coming out really strong on the quadratic fit is pretty weird. People aren't going to judge you for keeping your mouth shut, but not recognizing a log scale?

I mean over this short a range (and I do wish he'd plotted -50% -30% 0 +40% +100% +140%) a linear interpolation is easy and pretty accurate. You'll probably notice the sqrt2 in there, if you look.


I think the original graph maker is 99% the way there in the followup. This data makes no sense at all in a graph like this, a table sorted by percentage would have been more illustrative, as well as something histogram flavored based on the absolute counts.

Log graphs might be better but without a log unit that people can actually grok I think it’s pointless. We’re looking at percent change. What does the difference in the distance between +25% to +15% vs +25% to -10% even tell you? Ahh yes the relative difference of the relative change.


Log scales are impossible to reason about unless you're dealing with exponential data. Otherwise, the transform is meaningless because log is literally an exponent, so not having an exponent in the data is equivalent to garbage in, garbage out, more or less.


I think there is a world of electrical and acoustic engineers that would beg to differ.

The dB scale is very common, useful, and easy to understand over a large range of amplitudes for very linear systems. Resonance plots are an example where data is unreadable without and easily readable with log plotting. Wide ranges of delta% (from unity=100%) seem like another good choice.


The change you experience in a sensation is related to how much you're already sensing that thing. This applies to hearing and vision and is known as Weber–Fechner law: https://en.wikipedia.org/wiki/Weber%E2%80%93Fechner_law

What this means in practice is that human senses are a logarithm of the input. Because hearing is a logarithm of the input, we deal with the perception of sound in exponential quantities of sound.


Log scales are also widely used for physical quantities that humans can't directly perceive, like radio-frequency electric fields. The logarithmic nature of human perception provides an additional benefit in some cases, but it's not the sole or primary benefit. For example, the amplitude response of any linear differential equation to a sinusoidally varying input is (roughly, at low Q, etc.) piecewise linear vs. frequency on a log-log Bode plot. It has no similarly useful structure on a linear plot. That structure is relevant to all manner of problems in electromagnetics, optics, acoustics, dynamics in mechanics, and other areas. Any introductory course in signals and systems will cover it.

That's not very relevant to this university chart, though. For a simpler example, stock price charts are sometimes logarithmic. On such a chart, if I buy a constant dollar amount and then sell it, I'll make the same dollar gain or loss for any buy and sell points the same vertical distance apart. I believe this chart's creator was thinking of an analogous property; the labels are placed in pairs geometrically equidistant from 0%, since (1 + 0.25)*(1 - 0.2) = (1 + 0.5)*(1 - 0.333...) = 1. I believe the rounding to 34% and not 33% is a mistake, but that the scale is otherwise fine.
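
A quick Python check of that pairing, as a sketch only: the label pairs below are the two mentioned in this comment, not data from the chart's creator, and `product` is a hypothetical helper.

```python
# Label pairs as described above (an assumption about the chart, not its source data).
pairs = [(25.0, -20.0), (50.0, -100.0 / 3.0)]

def product(up_pct, down_pct):
    """Net scale factor after applying both percent changes."""
    return (1 + up_pct / 100) * (1 + down_pct / 100)

for up, down in pairs:
    print(f"+{up:g}% and {down:.2f}% -> net factor {product(up, down):.4f}")

# The printed label is -34%, which misses the exact reciprocal slightly.
print(f"+50% and -34% -> net factor {product(50.0, -34.0):.4f}")
```

With -33.33...% the pair multiplies back to 1; with the -34% label the net factor is 0.99, which is why the rounding looks like a small mistake rather than a different scale.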


> piecewise linear vs. frequency on a log-log Bode plot. It has no similarly useful structure on a linear plot.

log-log graphs are not the same as a log graph. A log-log graph is useful for turning monomials into lines, which is exactly what a Bode plot is. But this is a good point. Maybe "exponential function" is the wrong terminology. "Function with an exponent" is more apt.
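
For what it's worth, the "monomials become lines" claim is easy to check numerically. A small sketch, where the monomial `a * x**k` and its constants are made-up values:

```python
import math

def loglog_slope(f, x1, x2):
    """Slope of f between x1 and x2 when both axes are logarithmic."""
    return (math.log(f(x2)) - math.log(f(x1))) / (math.log(x2) - math.log(x1))

# Hypothetical monomial y = a * x**k; a and k are arbitrary illustration values.
a, k = 3.0, -2.0

def monomial(x):
    return a * x**k

# Sample one decade at a time; the log-log slope should equal k on every interval.
slopes = [loglog_slope(monomial, x, 10 * x) for x in (0.1, 1.0, 10.0, 100.0)]
print(slopes)
```

The slope comes out equal to the exponent `k` on every interval, so the curve is a straight line on log-log axes.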

> stock price charts are sometimes logarithmic.

Stock price charts are logarithmic when the movement of a stock is exponential over the time frame being looked at. Stock prices, in general, are kind of related to perception as well. A $4 stock that moves +/- $2 is perceived very differently than a $100 stock that moves +/- $2. The perception of a price delta is relative to the current price, not the absolute dollar amount of the movement. This is why stock prices exhibit compound returns, which is of course an exponential function.


You could say that a log scale is useful only when the variable can be usefully regarded as the exponential of something, and I think you'd be tautologically right. That exponential structure just shows up very often, whether from human sensory perception, or from linear systems math, or from economics, or from many other causes.

I generally prefer log units to linear percentages for anything like a scale factor or a ratio. A lot of stuff comes out cleaner and more symmetric, for example because (1 + 0.10)*(1 - 0.10) isn't equal to one exactly, but exp(0.1)*exp(-0.1) is. The case for that in the university chart seems slightly pedantic but fine to me.
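
Here is that asymmetry as a small Python sketch. The helper names are my own, and natural-log units are just one possible choice of log unit.

```python
import math

def to_log_units(pct_change):
    """Convert a percent change to natural-log units: ln(1 + p)."""
    return math.log1p(pct_change / 100)

def from_log_units(u):
    """Convert natural-log units back to a percent change."""
    return math.expm1(u) * 100

# Percent changes do not cancel: +10% followed by -10% is not a factor of 1.
net_pct = (1 + 0.10) * (1 - 0.10)

# Log units do cancel: +0.1 and -0.1 sum to zero, i.e. a factor of exp(0).
net_log = math.exp(0.1 + (-0.1))

print(net_pct, net_log)
print(from_log_units(0.1), from_log_units(-0.1))
```

In percent terms, +0.1 log units is roughly +10.5% and -0.1 is roughly -9.5%, which is the same kind of lopsided pairing as the chart's +25% / -20% labels.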


The quote literally explains how to reason about it.


> "doubled" is the same distance from "stayed the same" as is "halved."

What's more natural: looking at a change relative to 0, or looking at a change relative to all the other points? Personally, I think looking at a change relative to 0 makes much more sense. Trying to figure out the meaning of the points by looking at relative values is garbage, because there is no underlying exponent to justify the use of the logarithm.


If the number of students enrolled in my course goes up by 25% one year, and then the next year the number of students enrolled in my course drops by 20%, how many students are now enrolled in my course compared to the beginning?

What if it were reversed: what if the number of students dropped by 20% one year, and then went up by 25% the following year? How many students are now enrolled in my course compared to the beginning in this scenario?


In any scenario like that, where a quantity X changes by Y% and then that new quantity changes by Z%, I almost always think about it like (1 + Z) * ((1 + Y) * X).

That is, I compute the new size after adjusting by Y% first, and then take Z% of that. To me, this is still treating values as magnitudes from 0 and the Y% change is totally independent from the Z% change. I guess you can multiply Y and Z first, but that’s not how I think about it.


rssoconner is saying that answering those questions becomes trivial when using the log scale. I find this argument persuasive despite having the same reaction you did to the original graph. The rewritten version is easier to understand immediately and is less confusing, but it's not as usable for answering questions about the data.

If the original author had included a brief explanation of the unexpected-looking axis labels and their advantages, the whole problem would have been avoided. And I think the argument that "log scales are standard" is not a valid defense here, as this whole thread demonstrates.


The point is that, by the above calculation, an increase of 25% and a decrease of 20% are opposite changes of equal magnitude: they perfectly cancel each other out. This is why they are plotted equidistant from 0% on the original chart.
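
Spelled out in Python for both orderings from the question upthread (a sketch; the starting enrollment of 100 is a made-up number and `apply_changes` is a hypothetical helper):

```python
def apply_changes(start, *pct_changes):
    """Apply successive percent changes to a starting count."""
    value = start
    for pct in pct_changes:
        value *= 1 + pct / 100
    return value

start = 100.0  # hypothetical enrollment, chosen only for illustration

up_then_down = apply_changes(start, 25, -20)
down_then_up = apply_changes(start, -20, 25)

print(up_then_down, down_then_up)
```

Both orders land back on the starting enrollment (up to floating-point rounding), because the two scale factors 5/4 and 4/5 are reciprocals and multiplication doesn't care about order.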


The scale is fine. It's either logarithmic or really close to it.

The centre is 1 (base + 0%). Where base = 1.

The first step to the right is 5/4 (base + 25%). The first step to the left is 4/5 (base - 20%). The distance represented is a multiplication by 5/4. (4/5 * 5/4 = 1. 1*5/4 = 5/4). And so on.

The second steps are 3/2 and 2/3 (base +50% and base -34%).

The third step on the right is 2/1, and if the one on the left were present, it would probably be 1/2.

From visual inspection, the relative lengths (apart from those that are equal) also look correct.

For instance, if you start at 5/4 and move a similar length to the right, you should get (5/4)^2 = 25/16. This is a bit more than 3/2. So the distance from 1 to 5/4 is longer than from 5/4 to 3/2.
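
If you'd rather compute the gaps than eyeball them, a rough Python sketch follows. The tick values are the ratios listed above, not measurements taken from the image.

```python
import math

# Tick values as ratios to the baseline, as listed above (not measured from the image).
ticks = [2/3, 4/5, 1, 5/4, 3/2, 2]

# Position of each tick on a log axis (natural log; the base only rescales).
positions = [math.log(t) for t in ticks]

# Gaps between neighbouring ticks.
gaps = [b - a for a, b in zip(positions, positions[1:])]

for (left, right), gap in zip(zip(ticks, ticks[1:]), gaps):
    print(f"{left:.3f} -> {right:.3f}: {gap:.4f}")
```

The two gaps on either side of the centre are equal (each is ln(5/4)), and the gap from 5/4 to 3/2 is ln(6/5), which is shorter, matching the argument above.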

Now, this IS indeed good evidence that better STEM education is needed.


The post author's having missed that it is a logarithmic scale doesn't say much for his chart reading. The choice of a logarithmic scale by the chart's creator doesn't say much for his presentation skills. It's a distraction, and it conveys nothing meaningful about the data to almost anyone. His explanation - that he wanted a halving to look visually equal to a doubling - is just weird, since hardly anyone would expect that in a chart about change in a single period. I'd call it "too clever by half."


Most people will not care about the scale at all. The scale only annoys people that are educated/bright/nitpicking enough to notice it is not linear and not educated/bright enough to instantly recognize that it's exponential.


Not so. It annoyed me, and I understood it for what it was immediately.


> The interval for -20%-0% is the same length as that for 0%-25%. So, the transformation is not symmetric around 0.

It is. +0% is 1.0, -20% is 0.8, and +25% is 1.25.

0.8, 1.0, and 1.25 are a geometric progression (x1.25)



This made my eyes bleed. Thank you :(


Ugh, this is terrible. Good post *groan*


Another issue with the chart is that it only shows percentage changes and for some weird reason drops a bunch of other fields (see [1]). So, while CS might show a dramatic increase, it is probably starting from a much smaller base. I would venture that if you looked at the raw numbers, the dramatic shift is not from humanities to STEM but from humanities to Nursing and Exercise Science (Healthcare).

[1] https://twitter.com/whyvert/status/1562310209879646208/photo...


Besides the log vs linear scale debate, the grouping by field is not an improvement. The color labels gave you this info while still preserving the sort order, giving you two dimensions at once.


Someone with an artistic eye likes the smoothness of the straight upward-trending line. Non-STEM major perhaps?


I mean, it's pretty standard to sort data. The original was sorted based on the change in number of degrees awarded. With that sorting you can see instantly whether, say, Gender Studies or Chemistry gained more students.

The sort-by-field tells you nothing new that wasn't told by the colors.


Sorting data wasn't the problem. The scaling of the x-axis arbitrarily to fit the curve as a straight line is the problem.


In case anyone wanted to factor it (for future use/reference)

  -50x^2 + 240x - 7 = 0

  roots: 12/5 ± √562/10
  f(x) = -50(x - 4.7706539182259)(x - 0.02934608177406)
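
A short Python sanity check of those roots, as a sketch using the plain quadratic formula with the coefficients given above:

```python
import math

# Coefficients of -50x^2 + 240x - 7 = 0, as given above.
a, b, c = -50.0, 240.0, -7.0

disc = b * b - 4 * a * c  # discriminant
r1 = (-b - math.sqrt(disc)) / (2 * a)
r2 = (-b + math.sqrt(disc)) / (2 * a)

# Closed form quoted above: 12/5 ± sqrt(562)/10
closed = (12 / 5 + math.sqrt(562) / 10, 12 / 5 - math.sqrt(562) / 10)

print(r1, r2)
print(closed)
```

The numeric roots agree with the closed form and with the decimal values in the factored version.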


2/3, 4/5, 1, 5/4, 3/2, 2

Not really sure why.


Interesting, music to my ears.


Not even wrong.


The author's lack of mathematical literacy is pretty striking. I don't mean not realizing it was a log ratio, but presenting the raw percentage-to-pixel expression as if it were meaningful. Clearly the constant term and overall scale factor just depend on the basic dimensions of the plot, so ultimately the mapping they found is x-0.21x^2.

Beyond that, they were just picking a hypothesis without thinking how a bad plot could have been generated. It's interesting to note that the correct mapping, from their perspective, would have been the Taylor expansion of log(1+x), which would start x-x^2/2+x^3/3, so their goodness of fit metric (large R^2) wasn't a useful indicator, and there wasn't a meaningful place to cut off the degree.
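
To see how the truncated series behaves over roughly the chart's range, here is a hedged Python sketch. The sample points are my own choice, not the author's data, and `log1p_taylor` is a hypothetical helper.

```python
import math

def log1p_taylor(x, degree):
    """Partial sum of the Taylor series x - x^2/2 + x^3/3 - ... up to `degree`."""
    return sum((-1) ** (n + 1) * x**n / n for n in range(1, degree + 1))

# Changes spanning roughly -50% to +100% (sample points chosen for illustration).
xs = [-0.5, -0.2, 0.0, 0.25, 0.5, 1.0]

for degree in (1, 2, 3, 6):
    worst = max(abs(log1p_taylor(x, degree) - math.log1p(x)) for x in xs)
    print(f"degree {degree}: worst error {worst:.4f}")
```

The worst-case error shrinks only slowly as the degree grows, since x = 1 sits on the edge of the series' radius of convergence. That fits the point that there is no natural degree at which to stop fitting.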


This may be a real-life example of the Dunning-Kruger effect.



