A couple cautions about drawing conclusions from this data though:
1. A trend in the outliers of a distribution does not imply the same trend in the mean values of the distribution. Although I'm sure there's an upward trend in the mean values, it's not good science to encourage people to jump to that second conclusion. (Not without the supporting data at least.)
2. What about sampling bias? Intuitively, it seems like the reporting weather stations would not be uniformly distributed geographically, but would rather show a higher concentration near urban areas, which store and release extra heat. How large is this effect, and how could we correct for it?
For anyone interested, you can also get at the raw data without signing up for enigma.io:
http://www.ncdc.noaa.gov/oa/climate/ghcn-daily/
Your first point is very well taken - we also urge caution in drawing too many conclusions regarding an overall trend in mean temperatures. However, I think our conclusion - that we are having more hot outlier days now than ever before - is interesting in and of itself.
Your second point is an interesting one, and I too would be curious to look further into the correlation between proximity to a city and temperature. However, to look for outliers, we created a long-running distribution of seasonal temperatures for each station individually - so in some sense the map is already corrected for this effect. Each anomaly you see is an anomaly for that station alone - meaning if an urban station regularly gets higher temperatures than a rural one, it will take a proportionally higher temperature to trigger an anomaly on the former than on the latter.
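To make that concrete, here's a minimal sketch (all station values are synthetic and the function name is made up for illustration) of why per-station thresholds neutralize a uniform urban offset: an urban station that runs uniformly hotter flags roughly the same 2% of its own days as a rural one.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical stations: urban runs ~4 degrees hotter on average, same variability.
rural = rng.normal(60.0, 10.0, size=50_000)
urban = rng.normal(64.0, 10.0, size=50_000)

def anomaly_rate(history, today):
    """Fraction of 'today' readings at or beyond this station's own 98th percentile."""
    hi = np.percentile(history, 98)
    return float(np.mean(np.asarray(today) >= hi))

# Each station is judged against its own history, so both flag ~2% of their
# own days, regardless of the constant urban offset.
print(anomaly_rate(rural, rural), anomaly_rate(urban, urban))  # both come out near 0.02
```

A constant bias shifts the station's threshold by the same amount, so it cancels; only a *change* in a station's surroundings over time would show up as extra anomalies.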
Furthermore, NOAA has done well at achieving good national distribution of these stations, so it's less of a concern than you might think.
That said, urban stations may still show artifacts compared to rural ones; e.g., when there is an extreme warm outlier, cities may be more likely to have another warm outlier the following day due to the heat storage effect you mention. I'm not sure.
About the second point: the U.S. population has been growing (from 179 million in 1960 to 308 million in 2010, according to the US Census). So a particular station in the same location in 2014 as in 1964 could well have more urban surroundings in 2014; in fact, on average this will surely be the case. Since more urban surroundings lead to higher temperatures, this must be a biasing factor. Does anybody have any idea how large this biasing factor is? Is there any literature on that issue?
"However, over the Northern Hemisphere land areas where urban heat islands are most apparent, both the trends of lower-tropospheric temperature and surface air temperature show no significant differences. In fact, the lower-tropospheric temperatures warm at a slightly greater rate over North America (about 0.28°C/decade using satellite data) than do the surface temperatures (0.27°C/decade), although again the difference is not statistically significant. "
Yes, it's been discussed extensively, even among the public, for at least a decade. In blogs, comment sections, and columns, "but it's just the urban heat island" is a common myth that pops up all the time and has to be debunked constantly.
Some GISS temperature data, for example, excludes urban stations (classified by night lights in satellite images). These rural-only stations also show similar trends.
I'm in danger of doing it also, but I think you are reacting defensively to a valid specific question because of your views on the larger climate reality. There is definitely an effect on individual stations as a result of changes to their surrounding environment. The question is whether the corrections applied to correct for it significantly affect the results of any given analysis.
The magnitude of the changes made is quite large compared to the effects being measured. They average to zero, but are bimodal, centered around roughly +1F and -1F:
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v3/techreports/Technical%20Report%20NCDC%20No12-02-3.2.0-29Aug12.pdf
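As a toy illustration of that point (all numbers here are synthetic, loosely modeled on the description above of adjustments that are bimodal near ±1F yet average to zero):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model of station adjustments: a 50/50 mix of bumps centered near +1F
# and -1F. Values are illustrative, not drawn from the actual NCDC report.
n = 10_000
sign = rng.choice([-1.0, 1.0], size=n)
adjustments = sign * rng.normal(1.0, 0.2, size=n)

print(f"mean adjustment:   {adjustments.mean():+.3f} F")          # near zero
print(f"typical magnitude: {np.abs(adjustments).mean():.3f} F")   # near 1 F
```

The mean is near zero, but the typical per-station adjustment is around a degree, which is indeed large next to decadal trends of a few tenths of a degree.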
I think it's a brilliant visualization, but if we are to presume the corrections are necessary and correct, it would also be reasonable to question what conclusions can be drawn from an analysis of data that does not include such corrections. At the least, I think it would be interesting to see their analysis applied to the more rural CRN1 and CRN2 stations versus the majority of lower quality CRN3, CRN4, and CRN5 stations that make up the bulk of the readings.
Adding to the parent above, a few more considerations on anomalies and sensor distribution.
As a quick data point: right now, the snowpack in the high Sierra in CA above 11,000 feet differs markedly from that below. This observed data conflicts with reported NOAA data because of sensor location issues.
We are having a drought in CA, and in the Sierras the snowpack, even versus last year, is anomalous only at certain altitudes. From what I've heard, our storms have only been precipitating above a threshold elevation (for various reasons). This is something we've experienced in previous years as well (to some extent, in 2013).
Wind-energy potential is also very non-uniform:
Just for a quick example. This involves altitude (e.g., wind/weather shadows) as much as micro-topography (ridgelines, etc.). Again, both can impact sensor-reported anomalies (micro-climates, etc.).
So a bit of caution when extrapolating to things like continental scale.
You are currently grouping low daily maximums and low daily minimums together, and the same for high. I would be interested in being able to compare those separately, to test and quantify urban-heat-related theories that daily minimums have increased much more than daily maximums have.
Great point. We grouped the anomalies into simpler categories in order to make the project a bit more digestible. However, some of our initial analyses suggested that daily minimums have indeed increased more than daily maximums. Do you have any links to journal articles that discuss this theory?
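For anyone who wants to try the split themselves, here's a rough sketch (station histories are synthetic, and the function and label names are made up; the 2% tail cutoffs mirror the approach described elsewhere in the thread) of labeling min- and max-temperature anomalies separately:

```python
import numpy as np

def tail_thresholds(values, pct=2.0):
    """Return the low/high percentile cutoffs for one station's history."""
    return np.percentile(values, pct), np.percentile(values, 100 - pct)

# Hypothetical per-station histories (deg F); real GHCN data would be loaded here.
rng = np.random.default_rng(0)
tmin_history = rng.normal(45, 8, size=5000)
tmax_history = rng.normal(65, 8, size=5000)

lo_min, hi_min = tail_thresholds(tmin_history)
lo_max, hi_max = tail_thresholds(tmax_history)

def classify(tmin, tmax):
    """Label a day by which element(s) fall in the station's own 2% tails."""
    labels = []
    if tmin <= lo_min: labels.append("cold-min")
    if tmin >= hi_min: labels.append("warm-min")
    if tmax <= lo_max: labels.append("cold-max")
    if tmax >= hi_max: labels.append("warm-max")
    return labels
```

Counting "warm-min" versus "warm-max" labels per year would then give the two trends to compare directly.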
The thing about temperature outliers, however, is that they're highly relevant to their real-world effects: a hard freeze at an unexpected time, or a colder seasonal low, can have huge impacts on agriculture and infrastructure (roads, pipes, overhead wires, and structures can be damaged by freezing or ice storms); similarly, heat can ruin crops and cause heat deaths, drought, and fires. Diseases, pests, and parasites can also be strongly influenced by weather.
And in terms of the energy surplus (or deficit) they represent, these are huge changes. It takes a lot of energy to heat (or cool) the world.
> Intuitively, it seems like the reporting weather stations would not be uniformly distributed geographically, but would rather show a higher concentration near urban areas
Why would this be intuitive at all?
First, if you play the animation and look at the dots on the map and have a basic understanding of how the US population is distributed, I just can't see how you would draw your conclusion. The dots are pretty evenly distributed geographically.
But even without the visualization, why would you assume there were more reporting stations in urban areas? Why would people be inclined to waste time and money supporting a new reporting station if there is already one close by?
Finally, the "urban heat islands are skewing the data and climate change is not real" argument has been debunked so thoroughly that it is hard to believe anyone brings it up in good faith. Especially when it is phrased as an "innocuous question" that could have been answered with 30 seconds and your favorite search engine.
> A trend in the outliers of a distribution does not imply the same trend in the mean values of the distribution
It does if the distribution is normal and the trend in the outliers is asymmetrical (as it is in this case). Because temperature distributions are (almost certainly) the result of many additive factors, they are approximately normal by the central limit theorem.
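A toy simulation of that argument (all numbers synthetic, just to illustrate): when a normal distribution's mean drifts up even slightly, exceedances of the historical 98th percentile grow while exceedances of the 2nd percentile shrink, producing exactly this kind of asymmetric outlier trend.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic "historical" temperatures; the normality assumption is the point here.
baseline = rng.normal(60.0, 10.0, size=100_000)
lo, hi = np.percentile(baseline, [2, 98])   # the 2% tail cutoffs

# Same distribution shape, mean shifted up by 1.5 degrees.
shifted = baseline + 1.5
warm = float(np.mean(shifted >= hi))   # fraction of warm outliers
cold = float(np.mean(shifted <= lo))   # fraction of cold outliers

print(f"warm tail: {warm:.3%}, cold tail: {cold:.3%}")
```

Running the converse direction, an asymmetric change in the tails of a distribution known to stay normal does pin down a mean shift, which is the parent's point.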
One comment I might have is that your model assumes the average temperature on a given day of the year is constant across years. I would have to spend some time thinking of how to control for it, but have you considered the impact of climatic oscillation? Your data seems to reflect these patterns to some degree, and removing/reducing them might make the overall shift clearer.
Yeah, that is an interesting point and highlights the difference between "weather" and "climate" - it also brings up a tension between two different objectives of the chart: to visualize weather patterns over time intuitively, and to draw general conclusions about climate trends. In the context of the first intention, climatic oscillations like El Niño are interesting signals - you can see how they affect weather throughout the country in unexpected ways. But in the context of the latter goal, they are noise which should be filtered out/corrected for.
That makes sense; my only thought was that if you are graphing "anomalies" you might want to filter out non-anomalous behavior. Higher highs or lower lows are actually to some degree expected in those years. I suppose it could be best not to control for oscillatory behavior though, as the effect of any climate shift on those oscillations is possibly significant.
Gorgeous visualization! Takes a really long time to watch though, even after clicking the "up" on speed a bunch of times, and so it's hard to get the same sense from watching the points on the map as you get instantly from the chart under the map.
Have you considered aggregating a little more, so that years move by faster?
You aren't blind :) We just had trouble fitting a y-scale on the top chart that was both clean and readable and didn't get in the way of the time slider... However this is just supposed to be an overview of the trend - the proportional bar chart further down the page is the same data, plotted in more depth and explained further.
Little known fact: Hawaii's record high temperature (100F) is the same as Alaska's record high temperature (100F), and both are the lowest record high temperatures for the 50 states.
Alaska's record low (-80F) is a little bit colder than Hawaii's record low (15F) though.
Hail is not the same as snow. When I lived in the mid-atlantic region, we got hail almost exclusively in the summer, as it was typically caused by thunderstorms.
Wouldn't it be better to use a rolling window rather than monthly distribution for anomaly detection? One might expect more anomalies in the shoulder months using the monthly distribution.
Yes, that's a good point. It would be better to create one distribution for each day of the year, with the values drawn from the 2-3 weeks before / after that day, but we chose a month to minimize the computational complexity. Presumably these artifacts wouldn't affect the aggregate trend over time, only the appearance of more outliers at the beginning of March / October, etc. each year. We also were drawing inspiration from NOAA's Climate Extremes Index which uses monthly data:
https://www.ncdc.noaa.gov/extremes/cei/
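A minimal sketch of the day-of-year window described above (assuming a ±15-day window and a per-station dict of historical values; the names here are made up for illustration):

```python
import numpy as np

def window_distribution(daily_temps, doy, half_width=15):
    """Pool all historical values within +/- half_width days of `doy`.

    daily_temps: dict mapping day-of-year (1..365) to an array of that day's
    temperatures across all years for one station. Wraps around the year
    boundary so early January pools with late December.
    """
    days = [((doy - 1 + off) % 365) + 1 for off in range(-half_width, half_width + 1)]
    return np.concatenate([daily_temps[d] for d in days])
```

Each calendar day then gets its own baseline, so the step at month boundaries (and the resulting shoulder-month artifacts) disappears, at the cost of computing 365 distributions per station instead of 12.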
Yes, this would be a good improvement to the data pipeline. I only built the front-end so I can't speak to exactly what difficulties would be involved, but I agree.
Really awesome visualization, but I didn't read the text because the columns were insanely wide on my 24" monitor. Might want to constrain the width a bit, chaps.
Hmm, that's odd - it was developed in Chrome and I haven't seen any Chrome-related issues after testing on several computers. Could be due to a badly-behaving browser extension? Let me know if you see any JS errors on the console, or if you figure out what was causing this...
Neat visualization, but 100 megabytes of RAM consumption! Sometimes as much as 8 MB of garbage collection during a single JavaScript frame, bloating the frame to over 50 ms of running time.
I bet there are some cheap wins you can find in there OP! Keep up the good work.
You should've seen it before we found all the cheap wins :) Seriously though, I'd be interested to hear any ideas you have - it has already been optimized quite a bit but it's just a lot of data (3 million rows over 50 years). It uses canvas and throws away all the circle references every day so that helps a lot.
I was wanting to do something just like this recently. Very pleasing to watch. Now we just need some slow-building dramatic melody to accompany us as we watch our planet melt away!
Theoretically no, the database we used contains weather station data from all over the world. However during our analysis we found that the quality of data from international stations was a lot more varied - as measured by many factors such as geographical distribution of stations, regularity of measurements and number of implausible outliers. I would still like to come back to this eventually & try to iron out as many of those issues as possible - but for the scope of this project it was much easier to just use US data as it is extremely well-curated.
"Armed with this refined dataset, we computed the historical range of low and high temperatures for each station for each month of the year. We then compared each station's daily temperatures to its corresponding monthly distribution. If one or both of these measurements fell in the bottom or top 2% on a given day, we labeled it an "anomaly" according to the typology above."
In case it wasn't apparent to you, there were a number of fairly obvious (let's call them "junior") errors of reasoning embedded in the original posting. As in, the kinds of things we would have readily lost points for in that undergraduate-level "quantitative reasoning" course we crammed through on the way to fulfilling our social science degree.
Wholly independent, mind you, of all the fancy-schmancy talk about generalized linear models and whatnot. Which, while being quite nice-sounding, serve only to distract from the more basic (and glaring) errors of inference in the surrounding text.
Which, in turn, is why the label "dataviz" is a more than appropriate description of the otherwise quite fun and entertaining HTML5 demo provided by the Enigma team.
Joe D'Aleo, the first director of meteorology at the Weather Channel who now works with WeatherBELL, is explaining to anyone who will listen that NOAA did exactly that.
"The NOAA, NASA and the Hadley Center press releases should be ignored. The reason which is expanded on with case studies in the full report is that the surface based data sets have become seriously flawed and can no longer be trusted for climate trend or model forecast assessment in decision making by congress or the EPA."
You have to ask yourself this question: If I saw a webpage like this, but with the opposite trend (or no trend), would I consider this to be strong evidence against global warming, or would I dismiss it as cherry picked data, a statistical artifact, etc.?
I think the solution to such an issue lies not at the macroeconomic level, but at the micro-behavioral level of our culture and life habits; if we solve it there, all the macro factors will adapt accordingly.