For readability's sake we did not present the statistical side of the results, but they were unambiguous for everything we mentioned in the post. Perhaps in the future we should present those results in footnotes to assure inquiring minds that we addressed valid concerns like yours (e.g., overfitting).
Love the examples of valid reasons folks don't like trees. Almost any of them seems like a viable hypothesis to explore for why folks fall in the 3% (though obviously we'd need very different data from what we have handy).
To address the specific comment about trees per neighborhood: unfortunately we didn't have granular enough data about the locations of respondents to analyze that; agreed that there's likely something there.
Thanks again for the comments.
Late followup edit: Thanks to your suggestions, we added some clarifying footnotes to the post.
- Phone survey results were the same as written survey results, so the issue isn't measurement error.
- All findings were backed by statistical testing, the p-values for which were almost always below 0.0001.
- Tests were replicated year over year. We didn't add a note about overfitting, but it doesn't seem like a big concern with 300 datapoints and just a few variables of particular interest.
It wasn't that hard to stick this into the post as footnotes; in the future we'll do better with that. Thanks for the help.
Thanks, I appreciate the updates and thoughts. I did dive a little into the data, and yes, there does seem to be something real going on. Interestingly, there are quite a few people who answered that their neighborhood had too many trees, but the city as a whole had about the right number. Makes me wonder if this is speaking about "fairness" rather than actual tree preferences.
The statistical testing is tricky, though. Did you go in with the thesis that there might be a correlation between age and ideal number of trees, or did you first pin down "too many trees" and then search the data for correlation? Or even more problematically, did you search for cross-correlations on all columns and then discover the age-tree link? If it was either of the latter two, your threshold for "significance" needs to be adjusted for the number of comparisons you made.
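To put a rough number on why this matters: if you run m tests at threshold alpha and none of the effects are real, the chance of at least one false positive is 1 - (1 - alpha)^m. Here is a minimal Python sketch of that formula. It assumes the tests are independent, which survey columns usually aren't, so treat it as a ballpark; the function name is only illustrative.

```python
def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Chance of at least one false positive across num_tests tests.

    Assumes every null hypothesis is true and the tests are independent.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


# Compare a single pre-planned test with searching across many columns.
for m in (1, 10, 50):
    print(m, round(familywise_error_rate(0.05, m), 3))
```

With a 0.05 threshold, ten comparisons already give roughly a 40% chance of a spurious "finding" under these assumptions.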
Went in there with the aim of exploring the trees question, and hypothesized there might be relationships between that and obvious demographic questions. The "crankiness" bit was much more driven by exploratory analysis.
We take a fairly pragmatic approach to the multiple comparisons problem you're highlighting. If p-values are < 0.00001 we don't really worry unless we're running a crazy number of analyses. And when we do get borderline p-values (like 0.01) while doing multiple comparisons, we'll mention that there may or may not be something actually happening there (as we did in our post about Baltimore parking tickets: http://blog.statwing.com/baltimore-parking-tickets-revisited...).
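For anyone who wants to see that rule of thumb next to a formal correction, here is a rough sketch using the Bonferroni adjustment, which divides the significance threshold by the number of tests. This is not our actual analysis code: the 0.05 threshold and the count of 100 comparisons are assumptions picked purely for illustration, and the function name is hypothetical.

```python
def bonferroni_significant(p_value: float, num_tests: int, alpha: float = 0.05) -> bool:
    """Return True if p_value clears the Bonferroni-adjusted threshold.

    The adjusted threshold is alpha / num_tests.
    """
    return p_value < alpha / num_tests


num_tests = 100  # hypothetical number of comparisons, not taken from the post

# A very small p-value survives the correction; a borderline one does not.
print(bonferroni_significant(0.00001, num_tests))
print(bonferroni_significant(0.01, num_tests))
```

Bonferroni is conservative, so a borderline result that fails it isn't necessarily noise, which is why we flag those cases as "may or may not be real" rather than dismissing them.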
For what it's worth, I think ceph_'s comment and link below (http://online.wsj.com/article/SB116165781554501615.html) about the relationship between trees and gentrification is the best available guess as to what's ultimately driving these attitudes. But yeah, it's probably a combination of a lot of things, many of which aren't actually about trees per se.