
We Used Broadband Data We Shouldn’t Have - Amorymeltzer
https://fivethirtyeight.com/features/we-used-broadband-data-we-shouldnt-have-heres-what-went-wrong/
======
code4tee
Nice to see they’re being so transparent about setting the record straight.
Mistakes like this happen in the journalism business, but few outlets make a
serious effort to correct misreporting beyond burying corrections where nobody
sees them.

~~~
stusmall
Overall they do such a good job of that. They seem to really value self-
improvement and reflection. It's part of what keeps me as a regular reader. One
editor even does a yearly public review of the work he did that year, what he
did wrong, and how he plans on improving.

[https://fivethirtyeight.com/tag/mea-culpa/](https://fivethirtyeight.com/tag/mea-culpa/)

~~~
taoistextremist
Harry Enten is probably my favorite personality of FiveThirtyEight, mainly
because of the amount of humility he has despite the fact that he's _very_
good at his job from what I can tell.

------
jobu
Unfortunately they don't discuss two of the biggest disparities I've seen in
reporting on broadband internet access:

1) How do they define "Access"? Does it mean actual subscriptions? Does it
mean the building/home is connected? Or does it mean a line passes the
household, but there's actually no way to connect to it? (Look up New York
City's lawsuit against Verizon's FIOS rollout.)

2) How do they define "Broadband"? In 2010 the FCC defined it as 4 Mbit/s
down, 1 Mbit/s up. In 2015 they redefined it as 25 Mbit/s down, 3 Mbit/s up.
Currently I have 50 Mbit/s down and 1 Mbit/s up, which Comcast absolutely
defines as "Broadband," but it doesn't meet the 2015 FCC definition.
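The two FCC thresholds above make the comparison mechanical. A minimal sketch (the thresholds are the real FCC figures from the comment; the function and dict names are just illustrative):

```python
# FCC broadband definitions by year: (min download, min upload) in Mbit/s.
FCC_DEFS = {
    2010: (4, 1),    # 4 Mbit/s down, 1 Mbit/s up
    2015: (25, 3),   # 25 Mbit/s down, 3 Mbit/s up
}

def is_broadband(down_mbps, up_mbps, year=2015):
    """Return True if the connection meets the FCC definition from `year`."""
    min_down, min_up = FCC_DEFS[year]
    return down_mbps >= min_down and up_mbps >= min_up

# The 50/1 plan described above: clears the download bar but fails on upload.
print(is_broadband(50, 1))        # False under the 2015 definition
print(is_broadband(50, 1, 2010))  # True under the 2010 definition
```

The asymmetry matters: under the 2015 rules a plan can have 10x the required download speed and still not count as broadband because of the upload cap.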

~~~
danjoc
>In 2015 they redefined it as 25 Mbit/s down, 3 Mbit/s up.

How can that be? That's the same time we achieved net neutrality. That's not
net neutral. That's fast lane on download, slow lane on upload. Asymmetric
bandwidth is a prime reason why we all can't host our own clouds and social
networks.

~~~
ahupp
I have never seen anyone define net neutrality as including symmetric
bandwidth. I am also skeptical that lack of upload bandwidth has much to do
with the lack of success of self-hosted social networks.

~~~
narag
I've heard two theories about capped upload bandwidth: making it possible to
upsell premium hosting services, and P2P damage control.

~~~
ahupp
I think the simplest explanation is that the transport uses a fixed allocation
of bandwidth between upstream and downstream flows, and they've chosen an
allocation that matches typical workloads. It looks like full-duplex DOCSIS
(data-over-coax protocol) was just announced last year:

[https://en.wikipedia.org/wiki/DOCSIS](https://en.wikipedia.org/wiki/DOCSIS)

~~~
fjsolwmv
That's begging the question. The up/down ratios they choose determine what
workloads are possible/typical. More upstream means there would be other
workflows enabled.

~~~
chimeracoder
> That's begging the question. The up/down ratios they choose determine what
> workloads are possible/typical. More upstream means there would be other
> workflows enabled.

Not really, because the options exist at different tiers. If you have 5/1,
10/3, 25/5, and 100/20 as options, and you notice that people choose 10/3 and
saturate their download but only use a third of their upload capacity, then
turning that into a symmetric system would result in them having to pay _more_
for the same usage.

On the other hand, if you noticed that people were choosing the 100/20 but
only peaking at 15 down and 15 up, then yes, it would be clear that customers
would prefer a symmetric system.
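The tier argument above can be made concrete with a toy model. The speed tiers are the ones named in the comment; the prices and the `cheapest_tier` helper are invented for illustration:

```python
# Hypothetical tier pricing: (down, up) in Mbit/s -> monthly price in dollars.
TIERS = {
    (5, 1): 30,
    (10, 3): 45,
    (25, 5): 60,
    (100, 20): 90,
}

def cheapest_tier(peak_down, peak_up):
    """Return the cheapest tier whose capacity covers the observed peaks."""
    fits = [(price, du) for du, price in TIERS.items()
            if du[0] >= peak_down and du[1] >= peak_up]
    return min(fits)[1] if fits else None

# A user who saturates 10 down but only ever uses 1 up fits the cheap
# asymmetric 10/3 tier...
print(cheapest_tier(10, 1))   # (10, 3)
# ...but if tiers were forced symmetric and they needed 10 up, the only
# tier covering them is 100/20, at double the price.
print(cheapest_tier(10, 10))  # (100, 20)
```

This is the sense in which symmetric-only tiers could make the typical user "pay more for the same usage," assuming the ISP's capacity cost scales with the larger of the two directions.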

As it stands, P2P is a pretty rare use case for residential broadband, even
when the capacity exists. Access to symmetric speeds is unlikely to increase
P2P usage, because those aren't the main blockers.

~~~
narag
_Not really, because the options exist at different tiers._

There aren't that many options where I live; in practice, none.

Anyway, those tiers are, after all, commercial options. Is the cost to
providers really different? If there isn't a substantial difference, then the
upsell theory stands.

If people don't want upload, why throttle it? It would be a nice
selling point: you offer a plus that looks nice and costs you nothing.

_Access to symmetric speeds is unlikely to increase P2P usage, because those
aren't the main blockers._

That's backwards. What they're supposedly doing is _discouraging_ usage. There
is more usage than they want (no matter that it's uncommon), so measures are
put in place to shrink it even further.

The measure does not affect most users so they can get away with it: "only
those few freeloaders want more".

Edit -> also consider this: the upload limitation isn't obviously negative for
P2P. You can still download at max speed, so you might think of the limitation
as a bad thing _for others_. You need to think for a while to understand why
it's also bad for you.

------
outsidetheparty
So if I'm reading this right, there are three data sources here:

1) US Census, which is based on surveying households "do you have broadband,
Y/N"

2) FCC data, which is based on ISPs self-reporting (In a footnote the article
says they're using Pai's new slower definition of broadband, 10Mbps, not the
2015 definition of 25Mbps.)

3) ASU/Iowa, which depends on a derived variable in commercially-purchased
data which "denotes interest in ‘high tech’ products and/or services
[including] personal computers and internet service providers" as a proxy for
broadband ownership

...and the first two roughly match each other, while the third doesn't. The
academics claim the company that sold them the data told them it was a
reasonable proxy for broadband, the company says they didn't say that.

------
ACow_Adonis
I just wanted to comment that I think this is brilliant, and the kind of
analysis and general skepticism of data we should see more of.

Just for context, if it's not obvious: I work with data. Both putting it
together and analyzing it. And one of my chief frustrations with academia (and
biggest lessons to people I advise) is a kind of "cultural reverence for the
data set".

Just because data is collected in no way assures that it's right or suitable,
even if a reputable name says it is.

Be skeptics. Private suppliers have incentive to sell you data. Private
industries have incentives to keep data from you (it constitutes competitive
advantage). Government data has political interference on what is collected,
even if you're lucky enough to live in a world where the actual collection is
independent and rigorous. Reports and responses to surveys and interviews
may be inaccurate even when people thought they were being honest, and on
socially contentious topics you usually don't even get that honesty.

And even if you managed to avoid all that, it doesn't mean your data isn't
problematic. Our census in my country, for example, is done in the winter
time. How good is that at tracking information in seasonal towns?

Proper data collection is some of the hardest work you can do, and proper
analysis comes from measuring, corroborating, justifying, and hypothesising on
the data you have. It does not involve just calculating a stat or, god forbid,
testing things for statistical significance just because they happen to be in
your data set.

For all those reasons, I highly commend this article. We need more of it.

------
dboreham
I'm really confused by this: surely if you want to make a data set that has
Internet speed binned by county, the way to do that is as follows:

1. Go to a large Internet services provider (Amazon, Google, Akamai,
Netflix).

2. Ask them to statistically sample the TCP flow rate observed in client
traffic, by source IP address.

3. Get a data set that geolocates IP addresses to ZIP code (Amazon for
example has this data).

4. Join the two.
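The join in steps 2-4 is mechanically simple. A toy sketch, using documentation IP ranges and invented speeds and ZIP codes, with the geolocation table simplified to a /24-prefix lookup (real GeoIP databases map CIDR blocks of varying sizes):

```python
from statistics import median

# Step 2: sampled per-client throughput (Mbit/s), keyed by source IP.
flow_samples = {
    "203.0.113.5": 42.0,
    "203.0.113.77": 8.5,
    "198.51.100.9": 95.0,
}

# Step 3: IP /24 prefix -> ZIP code (toy geolocation table).
geo = {
    "203.0.113": "59715",
    "198.51.100": "10001",
}

# Step 4: join on the prefix and aggregate a median speed per ZIP.
by_zip = {}
for ip, mbps in flow_samples.items():
    prefix = ip.rsplit(".", 1)[0]  # drop the last octet
    zip_code = geo.get(prefix)
    if zip_code:
        by_zip.setdefault(zip_code, []).append(mbps)

medians = {z: median(speeds) for z, speeds in by_zip.items()}
print(medians)  # {'59715': 25.25, '10001': 95.0}
```

The median (rather than the mean) is the natural aggregate here, since observed TCP flow rates are bounded below by the server and the workload, not just by the subscriber's line, a point the reply below expands on.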

~~~
oxguy3
How many people are represented by a single IP address? How do you deal with
changing IP addresses (most residential ISPs assign dynamic addresses)? How do
you distinguish between residential and commercial traffic? How do you deal
with cellular devices (for which GeoIP data is much less accurate)? How do you
account for people who don't use the service you got the data from? (Netflix
in particular has a limited subscriber base, but all the providers would
exclude, say, elderly folks who do nothing but email).

Getting good data at nationwide scale is never as easy as it sounds in your
head, unfortunately.

~~~
dboreham
All good questions:

1. Typically a single IP address represents a single connection. Yes, there
are providers who NAT multiple subscribers onto one IP, but they are rare
(because if they do that then they have to maintain NAT logs in order to
identify criminal subscribers to law enforcement -- easier to just have a 1:1
IP-to-subscriber mapping).

2. Residential addresses are in fact not really dynamic. Yes, they can change
from time to time, but for the most part they don't (see #1).

3. Cellular traffic can be identified because the cell carriers use specific
identifiable netblocks.

4. It doesn't matter if not everyone uses the sampling service, because that's
the point of sampling.

------
CharlesMerriam2
Responsible adults; I approve.

Notice we applaud the careful report on a research report in exactly the same
way we applaud the post-outage report.

------
samstave
Personally - there is something that I would like:

A monitor of exactly how much traffic is used by ads vs content.

So if I load a page, and say that page is just an article with text: what % of
the bandwidth consumption is the content I'm actually interested in vs. the
ads surrounding it?

The reason why this number is important is Mobile.

So a user signs up for "3 gigs of data" - how much of that 3 gigs is consumed
by ads and shit they don't want/need?

Actually - it would be good to have a standard on reporting for any given page
"this page weighs in at 50KB for content and 500KB for ads..."

Does this exist?
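There's no standard per-page "content vs. ads" report that I know of, but browser devtools can export every request a page makes (e.g. as a HAR file), and a rough split can be computed offline against an ad-domain blocklist. A sketch under those assumptions; the domain list is a two-entry stand-in for a real blocklist like EasyList, and the request sizes are invented:

```python
# Stand-in ad-domain blocklist (a real one has tens of thousands of entries).
AD_DOMAINS = {"doubleclick.net", "adservice.example.com"}

def split_page_weight(requests):
    """requests: list of (host, bytes). Returns (content_bytes, ad_bytes)."""
    content = ads = 0
    for host, size in requests:
        # Match the domain itself or any subdomain of it.
        if any(host == d or host.endswith("." + d) for d in AD_DOMAINS):
            ads += size
        else:
            content += size
    return content, ads

page = [
    ("news.example.com", 50_000),       # the article itself
    ("cdn.example.com", 120_000),       # images, CSS
    ("doubleclick.net", 300_000),       # ad scripts
    ("ads.doubleclick.net", 200_000),   # ad creatives
]
print(split_page_weight(page))  # (170000, 500000)
```

This is essentially what content blockers compute implicitly; the missing piece is anyone publishing the ratio per page.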

~~~
bluetwo
Large providers do scrape this kind of information, but I'm not sure anyone
discloses it.

------
bluetwo
I don't recall which FCC action it was last year, but as I recall, the large
providers are no longer required to at least show they attempted to offer
broadband to all households.

Previously they had to show state and federal governments this info. Now they
get to concentrate on providing access to the most profitable households while
ignoring the less profitable ones.

------
jinma
Kudos to FiveThirtyEight on being transparent and analyzing what happened. But
also...this was a series of mistakes, some of them pretty scary.

FiveThirtyEight's biggest mistake seems to be trusting an academic dataset
when they had no idea how it was collected. This is understandable, especially
when the data was published on the Arizona State University's Center for
Policy Informatics data portal. (You can go there right now and download the
bad data - scroll to CATALIST DATA here
[https://policyinformatics.asu.edu/broadband-data-portal/data...](https://policyinformatics.asu.edu/broadband-data-portal/dataaccess/countydata)) A university should be a trusted source. But
FiveThirtyEight took an unbelievable outlier from this dataset and wrote an
entire post about it ([https://fivethirtyeight.com/features/lots-of-people-in-citie...](https://fivethirtyeight.com/features/lots-of-people-in-cities-still-cant-afford-broadband/)). The dataset claims that only 29% of Washington
D.C.'s adults have broadband. (The real number according to the other datasets
FiveThirtyEight looked at in the new post is closer to 70%.) They even make a
point of how extreme the Washington D.C. datapoint is on the histogram in the
article as the only large county with such a low percent. That should be a
clue to question your data.
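The kind of sanity check that might have flagged that datapoint is simple to run. A sketch: the 29% D.C. figure and the ~70% corrected figure are from the post above, while the other large-county percentages are invented stand-ins clustered near 70%:

```python
from statistics import mean, stdev

# Broadband % for large counties; the last entry is the D.C. datapoint.
large_county_pct = [68, 72, 75, 70, 66, 74, 71, 29]

peers = large_county_pct[:-1]
m, s = mean(peers), stdev(peers)
z = (large_county_pct[-1] - m) / s  # z-score of D.C. against its peers
print(round(z, 1))  # -13.1: far beyond any plausible sampling noise
```

A value that many standard deviations from comparable counties isn't proof the data is wrong, but it's exactly the "question your data" trigger: either D.C. is genuinely anomalous (which demands an explanation) or the variable doesn't measure what you think it does.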

What I find worse is that the academic researchers published this dataset.
They bought behavioral marketing data and trusted a salesperson that the
variable HTIA (“Denotes interest in ‘high tech’ products and/or services as
reported via Share Force. This would include personal computers and internet
service providers. Blended with modeled data.”) was a good proxy for broadband
access. To be clear, HTIA includes modeled data, which means they took
demographics, voting records, and whatever other individual data they could
grab (maybe they have records of your purchases, I'm just guessing), and
predicted whether each adult in the US was interested in tech. This is the
kind of data companies buy for ad campaigns, figuring that if they advertise
to these adults, it might be better than random. There's no reason to think
the aggregates of these numbers would be accurate or calibrated correctly,
especially for an entirely different purpose (broadband vs high tech).

It's disturbing that these sorts of datasets are floating around in academia,
and it really makes you wonder what other bad data is being blindly trusted to
write blog posts, research papers, and news articles.

------
codezero
Does anyone here know if there is a way to opt-out of being in the Catalist
dataset?

------
zaroth
A few things I don't like about this:

"After further reporting, we can no longer vouch for the academics’ data set.
The preponderance of evidence we’ve collected has led us to conclude that it
is fundamentally flawed.... The idea behind the stories was to demonstrate
that broadband is not ubiquitous in the U.S. today, even as more of our lives
and the economy go online. We stand by this sentiment and the on-the-ground
reporting in the two stories even though we have lost confidence in the data
set."

If the data you used to reach a conclusion is fundamentally flawed, it's
pretty disingenuous to claim you stand by the _sentiment_. So they started
with a conclusion, set out to prove it, later found the data they used to
prove it was flawed, but still believe it's true.

The second thing I don't like is it seems readers are very confused between
_access_ and _usage_ and their sloppy wording often conflates the two. It
appears they were studying _usage_ (actual subscriptions) not _access_
(availability of a high speed connection).

Lastly, they also seem to disregard an LTE wireless connection as _usage_ of
broadband, when I would have assumed it would clearly be considered. If LTE
wireless is more commonly used as a form of access to broadband internet in
certain areas (i.e. rural areas where density can't justify running the fiber,
or dense metro areas where the LTE is so good there's no need for a wire),
then it's not surprising you'll find broadband "usage" is low in those areas,
even if those households are absolutely using broadband internet through an
LTE hotspot.

~~~
fny
> So they started with a conclusion, set out to prove it, later found the data
> they used to prove it was flawed, but still believe it's true.

Nate Silver and 538 are fairly hardcore Bayesians, and this is how pretty much
all Bayesian thinking works.

You start out with some prior "sentiment" (a.k.a. a prior belief), and then
use data to update that "sentiment".

In turn, invalid data means you revert back to your original prior sentiment,
and when you get new data you start Bayesian inference once again.
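A toy version of that update-then-revert cycle, using a Beta-Binomial model over "fraction of households with broadband." All numbers here are invented for illustration, not from the articles:

```python
def update(prior_a, prior_b, successes, failures):
    """Beta-Binomial conjugate update: returns the posterior Beta parameters."""
    return prior_a + successes, prior_b + failures

def belief(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

prior = (7, 3)  # prior sentiment: broadband adoption somewhere around 70%

# A (flawed) dataset claims 29 of 100 surveyed households have broadband.
posterior = update(*prior, successes=29, failures=71)
print(round(belief(*posterior), 2))  # 0.33 -- belief dragged toward the data

# When the dataset is discredited, its likelihood contribution is simply
# dropped, and you are back at the prior, waiting for better data.
print(belief(*prior))  # 0.7
```

The key point is that discarding bad data doesn't leave you believing nothing; it leaves you believing the prior, which is what "we stand by the sentiment" amounts to.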

Edit: Looking at the on-the-ground investigative reporting and the other
sources and studies they've cited in the related articles, I actually agree
that they have decent evidence to support their belief without these data
sets.[0][1]

I mean, ignoring the data sets, would you argue against the idea that many
people in cities can't afford broadband or that many places in rural America
have crummy internet infrastructure? Certainly, I'm less confident than I
would be with the additional data, but my confidence is still relatively high
without it.

I do agree with your other points. Mobile internet access--with or without
hotspots--is unreasonably ignored. The conflation between access and usage was
a lesser concern since I managed to navigate the articles well enough.

[0]: [https://fivethirtyeight.com/features/the-worst-internet-in-a...](https://fivethirtyeight.com/features/the-worst-internet-in-america/)

[1]: [https://fivethirtyeight.com/features/lots-of-people-in-citie...](https://fivethirtyeight.com/features/lots-of-people-in-cities-still-cant-afford-broadband/)

------
cpsempek
I'm probably being too harsh but...

Good on them for writing this, it's important to admit when you're wrong.
However, I feel like this outlet has a larger responsibility to be actual data
analysts as well as journalists (over say a more traditional journalist for a
more traditional news outlet). As such, why was the analysis done in the
postmortem article not done prior to publishing the original articles? A good
analyst is one you can trust, and trust for an analyst is built by drawing
conclusions from highly defensible data, and highly defensible data is data
which has undergone severe scrutiny from the analyst _before_ conclusions are
drawn, not after. Also, they should probably update the now erroneous articles
with a disclaimer indicating that much of the research is now invalid.

~~~
olympus
It's true that we have lost trust in their analysis, but there is an important
thing to remember here: fivethirtyeight is not a peer reviewed journal. You
should treat their articles as being in roughly the same category as a CNN
article on something Trump tweeted, not in the same category as an article
published in _Nature_. While they do tend to do better data analysis than the Associated
Press, 538 articles are not refereed and should not be treated as such. The
assumption of a "highly defensible" data analysis is a little strong for them.
This is very apparent on their sports section, which should give you a clue
about the actual rigor of their analyses. Don't cite a 538 article in your
academic research unless the 538 article is directly citing peer-reviewed
research (and even then you should probably just cite the original research).

If you think of them as a news organization that uses data as its gimmick to
sell page views, you will be less surprised at events like this (disappointed,
yes; surprised, no). They have the same incentive to sensationalize things as
a regular news organization. Their mission is not to increase the knowledge of
the human race, their mission is to bring in page views to sell ads and make
money.

It may sound like I'm coming down harsh on fivethirtyeight. I genuinely enjoy
reading some of their articles, but I make sure to remember what kind of
organization they actually are and don't fall for the trap of thinking they
are a think tank staffed with postdocs.

~~~
TheCowboy
This anchors motives too much in economic determinism. It's a useful mental
model to remember this is a factor, and it can come to dominate and cause
problems, but it is not always the one true factor to rule them all through
which we should view all motives.

You basically end up arguing that it is all about money, and real journalism
cannot happen under profit-seeking organizations. It also trivializes that
journalism's big challenge right now is how to do real journalism when the big
tech players have vacuumed up their ad revenue.

I also wouldn't treat 538 as the same as CNN writing about a thing Trump
tweeted. They actively talk about and discuss their journalistic goals. They
try to be openly self-critical about what and how they cover topics. They are
trying to compete by not doing the same thing as other organizations.

~~~
olympus
I think that real journalism can happen under profit seeking organizations,
because people find value in real journalism. However, real journalists don't
do original research, they compile insights from experts. Journalists have a
tendency to get it wrong when they try to add original insight in an article.
Just about every person who is an expert in a field has a story about when
their field made it into the news cycle and the journalists butchered some
important concept.

I'm a statistician, and we always work with investigators who are experts in
their field, unless we are researching statistical methods, in which case we act as our
own experts. The statisticians handle the data analysis and make sure that the
investigators don't make silly data mistakes. The investigators handle the
reasoning and mechanisms behind the research. When they work together they can
collect good data that they are familiar with and know how to interpret
correctly. When they work separately they are prone to mistakes.

Fivethirtyeight (and also the ASU researchers) fell into this trap. They were
not involved in the data collection, so they didn't really know what each
variable meant, and they just took the word of someone who (back to the
economic motives) had an incentive to sell the dataset rather than tell the
researchers to go look elsewhere.

I'll admit I was too harsh in comparing fivethirtyeight to an article about
Trump's tweets; 538 is typically decent investigative journalism. However, I
maintain that it isn't on the level of peer-reviewed academic research.
Articles on 538 don't go through peer review. They aren't submitted and left
to languish for months of revisions and follow-up questions. I'm not
saying that this mistake _would_ have been caught by a referee, but in the
peer-review process a referee _could_ have caught it by asking questions about
the analysis process and whether it's valid to use a variable named "X" as a
proxy for a variable named "broadband access."

~~~
azeotropic
>I think that real journalism can happen under profit seeking organizations,
because people find value in real journalism. However, real journalists don't
do original research, they compile insights from experts. Journalists have a
tendency to get it wrong when they try to add original insight in an article.
Just about every person who is an expert in a field has a story about when
their field made it into the news cycle and the journalists butchered some
important concept.

It's much much worse than this. I was once interviewed about my research by an
NPR reporter. He had already decided what his story was about and tried a
variety of tricks to get me to say some pithy quote he had devised so he could
use it on air. The problem was that my research actually debunked, rather than
supported, the story he wanted to run, and the pithy quote was scientifically
unsupportable.

I often wonder whether some of the quotes I hear on NPR are edited versions of
the interviewees saying "no it wouldn't be accurate to say x" cut down to "x".

