
Before anyone goes putting a ton of trust in these charts...

Compare the following: http://usafacts.org/metrics/31815 vs http://usafacts.org/metrics/12966

In the second chart half a million more people decide to die every 10th year?

( imgur link to screenshots in case the links don't work: http://imgur.com/a/tY02j )




>In the second chart half a million more people decide to die every 10th year?

Or it's from another source that gives different metrics obtained every 10 years (e.g. from census) with some extrapolation?

In any case, the differences in this instance are small over the long run; it's the lack of sources and other metadata that is more troubling.


I can't access USAfacts right now, but the census shouldn't matter for this if it's just crude deaths, since those are reported through a different system and are a precise count. I can't think of any reason why there would be a spike in the data every 10 years; it's definitely an error of some kind.


Why would it extrapolate 25% lower values on all other years? If the 10th years are the actual data points, the computed trend line through only census years shouldn't be so radically different from the drawn markers.

>In any case, the differences are small in this instance over the long run

The differences in those years are _huge_.


>If the 10th years are the actual data points, the computed trend line through only census years shouldn't be so radically different from the drawn markers.

Except if the trend line is based on the other data source, and the peaks on the census data.


It just says deaths in the screenshot though, not death rate so it shouldn't be dependent on anything else.


>The differences in those years are _huge_.

They look similar to me. One has less granular data than the other. What exactly are you implying?


>They look similar to me.

Do a quick calculation for me, please.

What percent of 2 million is 500 thousand? Because the errors are 500 thousand on 2-2.5 million. That's a huge amount of error.
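For anyone who wants the arithmetic spelled out, here is a minimal sketch. The function name `error_percent` is just an illustration; the 500 thousand and 2-2.5 million figures are the ones quoted above.

```python
def error_percent(error, baseline):
    """Return the error as a percentage of the baseline value."""
    return 100 * error / baseline

# A 500 thousand deviation against the low and high ends of the range.
for baseline in (2_000_000, 2_500_000):
    pct = error_percent(500_000, baseline)
    print(f"500,000 is {pct:.0f}% of {baseline:,}")
```

So the deviation is somewhere between a fifth and a quarter of the total, depending on the year.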


They probably were using death rates and multiplying by population (which gets readjusted every census.) Definitely a FIXME.


This would not result in the spikes that we are seeing. If it were a readjusted population, then the number would rise sharply every ten years and stay up.
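A rough sketch of that argument, with entirely made-up inputs: the death rate, growth rates, and base population below are assumptions for illustration, not USAFacts figures, and `estimated_deaths` is a hypothetical helper. It multiplies a fixed death rate by a population estimate that drifts low between censuses and is reset to the true value in years ending in 0.

```python
def estimated_deaths(start_year, end_year, death_rate=0.009,
                     true_growth=0.010, assumed_growth=0.008,
                     base_population=230_000_000):
    """Deaths computed as rate * estimated population (all inputs assumed)."""
    true_pop = base_population
    est_pop = base_population
    series = {}
    for year in range(start_year, end_year + 1):
        if year % 10 == 0:
            # Census year: the estimate is rebased to the true population.
            est_pop = true_pop
        series[year] = round(death_rate * est_pop)
        true_pop *= 1 + true_growth
        est_pop *= 1 + assumed_growth
    return series

series = estimated_deaths(1981, 2001)
for year in (1989, 1990, 1991):
    print(year, f"{series[year]:,}")
```

Under these assumptions the census year produces a step of a few percent that persists into the following year. It does not produce a 25% spike that vanishes a year later.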


Similarly, this chart says there are only 2,000-4,000 teachers in K-12 education: http://usafacts.org/metrics/34211. This is obviously not true.


Most likely missing scale. Probably in thousands.


Yep you are right. I see the numbers in the report and they are in the thousands: http://usafacts.org/report-slides?page=66.

This still shows the flaws in the website. The graph does not state its scale at all, and other graphs will not be so obviously incorrect.


Isn't this just likely to be caused by the census data?


Or more precisely, by the lack of census data in years where no actual enumeration is performed.

The big jumps on census years indicate that the census department does not estimate accurately when working with 9-year-old data.


What you're saying doesn't match the reality of what is actually shown in the graph at http://usafacts.org/metrics/12966

They aren't jumps suddenly correcting a bad estimate with new data. They are gigantic 25% spikes which are then immediately undone. There is no way to explain this chart by just saying that estimates worsen over time.


Births and deaths are derivatives of the population count. Thus a 25% spike is basically just an adjustment to reflect smaller (~2-3%) errors in the non-census years. To put numbers to this: imagine 100 people were born every year between 1991 and 2000, inclusive, but the statistics only recorded 98 births per year. We then did a census and discovered that there were 1000 people <10 years old in 2000. Since we thought only 882 people had been born in 1991-1999, we recorded 118 births for 2000, and then went back down to 100 in the next year.
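The same toy numbers as a short sketch. The function name `census_year_catchup` is hypothetical, and the model ignores deaths and migration among the under-10s, as the example above does.

```python
def census_year_catchup(true_per_year, recorded_per_year, n_years):
    """Births booked in the census year so the running total matches the census.

    Assumes the census counts everyone born in the n_years window and that
    all earlier years were recorded at recorded_per_year.
    """
    census_count = true_per_year * n_years                # 100 * 10 = 1000
    recorded_so_far = recorded_per_year * (n_years - 1)   # 98 * 9 = 882
    return census_count - recorded_so_far

births_2000 = census_year_catchup(true_per_year=100, recorded_per_year=98, n_years=10)
spike_pct = 100 * (births_2000 - 100) / 100
print(births_2000, f"{spike_pct:.0f}% above the true annual figure")
```

A 2% undercount per year accumulates into an 18% one-year correction in this toy model, which is the mechanism being proposed here.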


Depending on the estimate, the most recent census data might not be fully incorporated into the model until several years following the census?

I can't look at your link at the moment because the corporate firewall is currently blocking the domain.

But there really isn't any way to get around the fact that real census data are only collected once every 10 years (and the 1890 census was burned, so that point is missing).


When you can look at the links you'll see what I mean.

To give a description of the problem in text...

Both charts are labeled "Deaths", but I'm going to describe one of them for you.

The time span from 1981 to 1999 goes like this: 1,968,365 - 1,998,559 - 2,033,124 - 2,068,679 - 2,091,359 - 2,105,024 - 2,163,984 - 2,161,764 - [1,637,394] - [2,656,721] - 2,180,115 - 2,226,027 - 2,282,854 - 2,284,363 - 2,317,918 - 2,321,933 - 2,330,759 - 2,359,088 - 2,386,995

And then 2000 is [2,979,442]

And then 2001 is 2,430,225

And so on. Every 10 years, and also in 1989, there is a fluctuation by 500,000 deaths from the expected number given the surrounding trends.

All of the numbers that are _not_ between [] above look like a smooth upward trend, yeah? So WTF is happening in the three that have [] if the data isn't bogus? I say the data must be bogus.
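Here is a quick sketch of how one might flag those points mechanically, using the figures listed above. The method (comparing each year to the median of its neighbours) and the 10% threshold are my own assumptions, not anything USAFacts documents.

```python
from statistics import median

# Annual deaths as quoted above for 1981-2001.
deaths = {
    1981: 1_968_365, 1982: 1_998_559, 1983: 2_033_124, 1984: 2_068_679,
    1985: 2_091_359, 1986: 2_105_024, 1987: 2_163_984, 1988: 2_161_764,
    1989: 1_637_394, 1990: 2_656_721, 1991: 2_180_115, 1992: 2_226_027,
    1993: 2_282_854, 1994: 2_284_363, 1995: 2_317_918, 1996: 2_321_933,
    1997: 2_330_759, 1998: 2_359_088, 1999: 2_386_995, 2000: 2_979_442,
    2001: 2_430_225,
}

def flag_outliers(series, half_window=2, threshold=0.10):
    """Return years whose value deviates from the local median by > threshold."""
    years = sorted(series)
    values = [series[y] for y in years]
    flagged = []
    for i, year in enumerate(years):
        # Window of up to half_window neighbours on each side, clipped at the ends.
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        local = median(values[lo:hi])
        if abs(values[i] - local) / local > threshold:
            flagged.append(year)
    return flagged

print(flag_outliers(deaths))
```

With these settings only 1989, 1990, and 2000 are flagged; every other year sits within a few percent of its neighbours.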


If anyone would like to know more about the 1890 census (I was intrigued) this is a good article on it:

https://www.archives.gov/publications/prologue/1996/spring/1...


You mean people are more likely to die in census years? I mean, it could be true, but I'm not willing to bet on it.

[edit] So I went and put "why is death rate higher in census years?" into the google search bar, and the first result is "Causes of Death - Census". I didn't actually click on the link to find out, but that title certainly sounds like the census kills people. So maybe you're right. :)


I think it's just that the intermediary years are estimations and the chart isn't compensated.

The estimates are just bad and not adjusted historically. Doesn't make the chart bad... just makes the data a little wonky.


The intermediate years use statistical methods that better address some of the practical difficulties in getting complete and unique responses; the "actual enumeration" is well-known to be less accurate (in fact, the same methods used for between census estimates are also used by the census bureau to produce and publish estimates of the over- and under-counts in the actual enumeration.)

There was (probably still is, despite the fact that it seems to be a lost cause) a movement to use the better methods for all purposes, but given that it would be a constitutional change and the errors benefit the already politically powerful, there is pretty much no chance of it happening any time in the foreseeable future.

OTOH, I don't know that that actually has anything to do with the chart at issue: there is no information on sources or methodology, just "sources of this data are coming soon". If you don't have the sources ready to cite, you have no business publishing visualizations of the supposed data.


It sounds like you're inventing reasons to post-hoc rationalize bad data. What exactly are the intermediate years estimates of? What does "isn't compensated" even mean in this context?

>Doesn't make the chart bad... just makes the data a little wonky.

The chart is great. It perfectly represents exactly the data in the tabular form. The data is apparently garbage though.


It's the trend line generation, not the data. The start and end points are the same between the two graphs. To say that this nitpick throws shade on the entire project is a bit overstated.


>It's the trend line generation, not the data.

What you just said doesn't make any sense and is a post-hoc rationalization besides.

>The start and end points are the same between the two graphs.

Actually they are not. The starting numbers (1980) differ between the two charts by ~500,000 deaths.

>To say that this nitpick throws shade on the entire project is a bit overstated.

My very first search on the data came up with this. I suppose I could have kept searching but that puts me personally at a 100% error rate. Maybe I'm just really really unlucky, though.



