
The New York Times Releases Its Dataset of U.S. Confirmed Coronavirus Cases - infodocket
https://www.nytco.com/press/the-new-york-times-releases-its-dataset-of-u-s-confirmed-coronavirus-cases/
======
danso
Github repo:

[https://github.com/nytimes/covid-19-data](https://github.com/nytimes/covid-19-data)

~~~
pj_mukh
Noob Question: Anyone know an easy way to convert FIPS data (that this nytimes
dataset uses) to Postal code?

~~~
jake-low
I don't think it's generally possible. The FIPS code just identifies the
county: the first two digits are the state (e.g. 06 for California) and the
final three identify a county in that state. So technically that column is
redundant with the "county" and "state" columns; I suspect they've included it
to make it easier to join this dataset to other data that might format county
and state names differently.
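
A minimal sketch of that layout, assuming the FIPS value is handled as a five-character string (the helper name `split_fips` is made up for illustration, and the zero-padding step assumes a leading zero may have been lost if the column was parsed as a number):

```python
from typing import Tuple


def split_fips(fips: str) -> Tuple[str, str]:
    """Split a 5-digit county FIPS code into (state part, county part)."""
    # Restore a leading zero that numeric parsing may have dropped.
    code = fips.strip().zfill(5)
    if len(code) != 5 or not code.isdigit():
        raise ValueError(f"not a 5-digit county FIPS code: {fips!r}")
    return code[:2], code[2:]


state_part, county_part = split_fips("06037")
print(state_part, county_part)  # state part "06" is California
print(split_fips("6037"))       # same code after the leading zero was dropped
```

Keeping the code as text rather than an integer avoids the leading-zero problem in the first place.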

ZIP/postal codes are generally smaller (in the county I live in there are
almost a hundred ZIP codes). I'm not sure they're even guaranteed to be
entirely within one county either. We tend to think of ZIP codes as boundaries
but they're actually delivery routes (which if you squint can be converted
into boundaries by joining the properties that those routes serve together).

You might be interested in this article: [https://carto.com/blog/zip-codes-
spatial-analysis/](https://carto.com/blog/zip-codes-spatial-analysis/)

~~~
pj_mukh
Right, so a simple mapping I'd want is given a Zip Code, what is the
corresponding FIPS Code.

~~~
tomrod
Pick a lat/long, and you can map between the two. Or find the centroid of the
zip code and map accordingly. Or you might consider unions between the geometry
objects. QGIS 3 has a lot of great packages for this.
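
The centroid idea can be sketched as a point-in-polygon lookup. This is only an illustration under loud assumptions: the two rectangular "counties", their FIPS-like keys, and the ZIP centroids are all made up, and coordinates are treated as flat x/y values (real boundary files need a GIS tool that handles projections and odd shapes).

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (x, y), e.g. (longitude, latitude)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: toggle 'inside' for each edge crossed by a ray going right."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def county_for_point(point: Point, counties: Dict[str, List[Point]]) -> Optional[str]:
    """Return the key of the first polygon containing the point, or None."""
    for fips, polygon in counties.items():
        if point_in_polygon(point, polygon):
            return fips
    return None


# Toy data, entirely hypothetical.
counties = {
    "00001": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    "00003": [(2.0, 0.0), (4.0, 0.0), (4.0, 2.0), (2.0, 2.0)],
}
zip_centroids = {"00000": (1.0, 1.0), "99999": (3.0, 0.5)}

zip_to_fips = {z: county_for_point(c, counties) for z, c in zip_centroids.items()}
print(zip_to_fips)
```

Note that a centroid lookup assigns each ZIP code to exactly one county, so a ZIP code that straddles a county line is silently forced into one of them.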

------
ABeeSea
Oh boy, I can hear the bootcamp data scientists firing up their Medium
blogs from here.

~~~
wolco
Let's try to not put people down for self learning. Knowledge can be found
anywhere. Those that mock people like this will soon find themselves left
behind with ancient university based education. Unless you plan to re-enroll
every 10 years the world will pass you by if you are not self learning.

~~~
AndrewUnmuted
Self learning would imply that the person is teaching themselves, not that the
person is attending a coding boot camp.

People that claim to be data scientists after attending a coding boot camp
must be clowns. I’ve never come across such a person. All the data scientists
I know are smart self taught people, not people who went to a boot camp.

~~~
ahuth
It's possible that all the data scientists you know are not a representative
sample.

I've met smart data scientists and engineers who've gone to a bootcamp. People
who are good and bad at their jobs come from a huge variety of backgrounds,
and it's not helpful to look down on folks whose background is different than
your own.

~~~
CaptArmchair
The issue is with the word "scientist".

In academia, science is the pursuit of postulating a falsifiable theory, then
gathering facts to verify that theory, and then going through a process of
intensive peer review that either confirms, dispels or amends those findings.

The formality of the academic process is a necessity. The value of scientific
findings entirely depends on the trustworthiness of the research. That is, how
were the results obtained, which line of thinking was followed, did the
research exclude crucial biases, etc.?

The importance here is that academic research is used as a pillar to produce
products and services we all use in daily life. For instance, if you need a
hip replacement, you want to be sure that prosthetic was designed based on
rigorous scientific findings and studies that can vouch for safety and
comfort.

The difference then with "data scientists" is that they often don't apply
the same rigorous research practices. It's easy to pick a dataset and bang off
visualizations; it's a different story to actually come up with relevant
questions, assert the quality of the data at hand and publish your findings
towards a community of domain experts who are actually able to review your
findings.

One needs in-depth domain knowledge to do that. Good data scientists will
understand this limitation. They often work in a specific domain in a
supporting capacity: bringing technical skills and capabilities to domain
experts that don't have those skills.

Then there are those who purport to practice data science while grokking
datasets, creating visualisations and cobbling a blogpost together at the end
of the day. That's when you need to be really wary of what they publish, even
if the bigger picture contains truthiness.

Hence why I have extremely mixed feelings about what
[https://medium.com/@tomaspueyo](https://medium.com/@tomaspueyo) is doing.

To be sure, the core points of what he's saying are in line with what domain
experts are telling us. But the extreme number juggling is quite mind bending.
Moreover, the man is not a domain expert. He's an entrepreneur who happens to
know how to write viral blogposts such as "how to deliver your funny speech"
and "How to become the best in the world at something". What he does is
anything but scientific. And so, even though he's making a heartfelt plea
heard by many, one should be careful to not take the precise details in his
pieces at face value.

At the moment, we all are victims of our own confirmation bias. Each day
yields another data point, and given our desperate state, we want to see
trends that confirm improvement, a probability that one will survive this, low
mortality and so on. The reality is that we only have so few datapoints and
it's still far too soon to make conclusive assertions about how this will pan
out for the world at large and you in particular.

~~~
bart_spoon
Well, as a statistician, I hate to break it to you, but there is a massive
reproducibility crisis in science because heretofore the "rigorous research
practices" have often been more a veneer of rigor than actual
science. The amount of published research (often in esteemed journals) that
has been found to be unreproducible and based on faulty methods is absurdly
high. The modern scientific method has arguably been as effective, if not
more so, at building a tower that shields scientists and academics from
criticism of work of questionable validity.

I'm not saying random data science blogs aren't often wrong. But you've
probably been burned just as often by published science and simply haven't
realized it. And at least the data science blogs aren't behind expensive
paywalls, aren't couched in meaningless vernacular, and present the code/data
for reproducing their results, none of which can be said for a lot of science
these days.

~~~
CaptArmchair
Well, being professionally active in academia, I'm well aware of this issue
and that particular debate. And you're right.

Open science and open access are important movements as to the relationship
between publishers and academic research.

However, that debate doesn't negate my point as to citizen or data science.

For all the discussion about paywalls and the use of vernacular, the same
timeless critical considerations need applying: who wrote the article? what is
their background? are they a domain expert? where did they get their sources?
how did they come to conclusions? is their method sound? are they asking
relevant questions? etc. etc.

And I don't always see that happening online. On the contrary. The fluidity
and the speed at which information flows online seems to be a justification to
give in and accept what's being said at face value. The past few years should
have made it clear that such indulgence can lead to dire consequences.

With powerful tools comes responsibility. Sure, it's great to see people apply
free and open source tools to come to a new understanding of observations. But
that's only half of the story: you still have to apply critical thinking to
those conclusions. That's not something throwing more code or technical skills
can achieve.

Having a proper, critical debate takes time and experience.

~~~
bart_spoon
> who wrote the article?

This seems to be an improvement over academia imo. The person writing the
article should be immaterial. The same paper written by an unheard of
researcher should be treated no differently than the same paper written by a
well-known, tenured researcher at a prestigious institution. Unfortunately in
academia, who you are and who you know is often as big a deal as what you do
and what you know.

> What is their background? are they a domain expert?

Again, the methodology should stand alone. Being a "domain expert" or having a
particular background is only relevant if the readers of the article are
incapable of judging the results on their own merits, and have to instead rely
on appeals to authority. But appeals to authority introduce their own problems,
including a moat that protects the status quo and the ingroup from potentially
important new ideas and newcomers. And it's hardly scientific.

> where they did get their sources? how did they come to conclusions? is their
> method sound? are they asking relevant questions? etc. etc.

These need not be restricted to academic work. If a blog fails to properly
cite sources, you can move on. Plenty of academic papers are built on flimsy
premises, poor methodology, asking incorrect questions, etc. etc. There is
nothing differentiating the ability to assess the quality of something written
in a blog vs something written in an academic journal here.

> And I don't always see that happening online. On the contrary. The fluidity
> and the speed at which information flows online seems to be a justification
> to give in and accept what's being said at face value. The past few years
> should have made it clear that such indulgence can lead to dire
> consequences.

And yet, as noted, the more conservative approach of academia has resulted in
a huge amount of unreproducible "science". And now, in a time of crisis, the
stringent model is being set aside in favor of openly available preprints
shared online, without peer review, via social media. How can the traditional
model be considered useful when, in times without urgency, the validity of its
results is highly questionable, and in times of urgency, it's thrown aside for
the sake of actual progress? It seems to me the modern academic method is
largely a facade, something closer to an ornate religious ritual that has been
divorced from its actual intentions.

> But that's only half of the story: you still have to apply critical thinking
> to those conclusions. That's not something throwing more code or technical
> skills can achieve.

This assumes that those coding or utilizing technical skills are doing so
without critical thinking. Perhaps this is true some of the time, but I don't
see the situation being any different in academia.

> Having a proper, critical debate takes time and experience.

What constitutes a "proper, critical debate" is subjective. And from my
position, it would appear that the advent of the internet and freedom of
information is making the current academic model unsustainable and obsolete.
And so academia has an entrenched interest in defining "proper, critical
debate" in a way that protects their livelihood.

I'm not saying there isn't plenty of poorly written science and analysis being
done on blogs out there. I'm simply saying I don't think traditional academia
and science are much different.

------
jashkenas
Here’s a quickie animated bubble map, pulling in this new NYT data from
GitHub: [https://observablehq.com/@jashkenas/united-states-
coronaviru...](https://observablehq.com/@jashkenas/united-states-coronavirus-
daily-cases-map-covid-19)

... might be a useful starting point if you want to fork off your own
visualizations and analyses.

~~~
echelon
At first I thought your animation wasn't working. Then it got to the last few
days.

Wow.

~~~
jashkenas
Yeah. Wow indeed. The dataset actually goes all the way back to the end of
January, but I'm trimming it to start at the beginning of March, because the
early days are pretty much invisible at the current scale.

You can change the value of `startDate` in that notebook if you'd like to look
back further in time.
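
Outside the notebook, the same kind of trimming can be sketched in Python. This assumes the county file uses the header `date,county,state,fips,cases,deaths` with ISO-formatted dates; the sample rows below are made up so the snippet runs without a download, and `rows_since` is a hypothetical helper name.

```python
import csv
import io
from datetime import date
from typing import Dict, List

# Made-up rows in the assumed column layout.
SAMPLE = """date,county,state,fips,cases,deaths
2020-02-27,Example,Somewhere,00001,1,0
2020-03-01,Example,Somewhere,00001,2,0
2020-03-15,Example,Somewhere,00001,40,1
"""


def rows_since(csv_text: str, start_date: date) -> List[Dict[str, str]]:
    """Keep only the rows dated on or after start_date."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if date.fromisoformat(row["date"]) >= start_date]


trimmed = rows_since(SAMPLE, date(2020, 3, 1))
print([row["date"] for row in trimmed])
```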

~~~
SAI_Peregrinus
You could make the animation slow down exponentially, to make the growth look
linear. Basically turn it into a log plot. Misleading if you don't know what
you're looking at though.
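
To illustrate the log-plot point with a small sketch (the series below is made up and doubles at every step): after taking logarithms, exponential growth turns into equal-sized steps, which is what a straight line on a log plot means.

```python
import math
from typing import List


def log_scale(counts: List[float]) -> List[float]:
    """Return log10 of each count (counts must be positive)."""
    return [math.log10(c) for c in counts]


# Made-up series with pure exponential growth: it doubles every step.
counts = [100, 200, 400, 800, 1600]
logged = log_scale(counts)

# Differences between consecutive log values; each is log10(2), about 0.301.
steps = [b - a for a, b in zip(logged, logged[1:])]
print([round(s, 3) for s in steps])
```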

------
apearson
Data:
[https://github.com/nytimes/covid-19-data](https://github.com/nytimes/covid-19-data)

------
santiagobasulto
Anybody know about data indicating age? It's always "confirmed", "dead",
"recovered" but no info about age of patients.

~~~
reaperducer
My state publishes its statistics hourly. I looked at it just this morning,
and was surprised to see in the age breakdowns that there are more
confirmed cases in people 20-40 than in 60-80.

~~~
catalogia
Is that still true when you adjust for the relative size of those groups?
20-40 is a larger group of people than 60-80.
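
The adjustment is just a per-capita rate. A sketch with entirely made-up case counts and population sizes (not any state's real figures) shows how the ranking can flip once group size is taken into account:

```python
def cases_per_100k(cases: int, population: int) -> float:
    """Confirmed cases per 100,000 people in the group."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * 100_000 / population


# Hypothetical (cases, population) pairs for two age groups.
groups = {"20-40": (300, 1_500_000), "60-80": (200, 800_000)}

rates = {age: cases_per_100k(c, p) for age, (c, p) in groups.items()}
print(rates)
```

In this toy example the younger group has more cases in absolute terms but a lower rate per 100,000 people.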

~~~
reaperducer
It's hard to say, because I don't have the time to do the math right now. But
my state is generally considered to be retirement-friendly.

~~~
MiroF
Florida?

------
wrkronmiller
This website appears to have more fine-grained statistics for the New York
City area (ironically):
[https://accelerator.weather.com/bi/?perspective=dashboard&pa...](https://accelerator.weather.com/bi/?perspective=dashboard&pathRef=.public_folders%2FCOVID19%2FDashboards%2FDS%2FCOVID-19%2B%2528Coronavirus%2529%2BGlobal%2BStatistics)

~~~
zaphod12
I don't understand why these lists make it so hard to find the USA. Every
country is written out including the United Kingdom - but the USA is "US."
Sometimes we're first because of ethnocentrism, sometimes last, but PLEASE BE
CONSISTENT!

Also, thanks for this link - rant aside, it's the best dataset I've seen.

------
jake-low
This is phenomenal. I've been scraping the data from primary sources for just
Washington state for the past week [0], in order to make this chart which I
hacked together last weekend [1].

[0]: [https://github.com/jake-low/covid-19-wa-data](https://github.com/jake-
low/covid-19-wa-data)

[1]: [https://observablehq.com/@jake-low/covid-19-in-washington-
st...](https://observablehq.com/@jake-low/covid-19-in-washington-state)

Doing this for just one state was a pretty substantial effort. I imagine there
are multiple people at the Times who are spending several hours a day
reviewing and cleaning scraped data (seems every couple of days some
formatting change breaks your scripts, or a source publishes data that later
needs to be retracted).

The Times dataset appears to contain per-county case and death observations in
a time series, going all the way back to the first confirmed U.S. case in
January in Snohomish County, WA. This makes it by far the most comprehensive
time series dataset of U.S. COVID-19 cases publicly available.

Some people in this thread linked to the Johns Hopkins CSSE dataset; I've
looked at this data but it doesn't go back very far in time for the U.S., and
the tables are published as daily summaries with differing table schemas which
makes them hard to use out of the box. For some days earlier in March,
"sublocations" aren't even structured (for example the same column contains
"Boston, MA" and "Los Angeles County", making it very hard to use). No
disrespect to the team behind the JHU dataset; it attempts to cover the whole
world since the outbreak began which is an incredible and difficult goal. But
for mapping and studying the outbreak in the U.S., the Times dataset will
likely be the best choice right now.

Huge kudos to the New York Times team for making this data freely available.

------
yourapostasy
So before I duplicate any effort, has anyone found a tracker that takes this
dataset's death counts and updates a Pueyo/Khan Academy-style graph [1], at
different granular levels (world, nation, province/state)? Pueyo's analysis
technique uses deaths each day to impute a back-looking inferred actual
infected count instead of relying upon reported testing numbers, and that in
turn can be used to infer a two-week forward-looking infected and death count
based upon currently-known stats on average time to hospitalization and
mortality. After two weeks, it gets increasingly more speculative, but by the
time you have number of known deaths today, you almost have a baked-in outcome
two weeks from now.

[1]
[https://www.youtube.com/watch?v=mCa0JXEwDEk](https://www.youtube.com/watch?v=mCa0JXEwDEk)
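
For anyone who wants to play with the idea, here is a deliberately simplified sketch of that kind of back-calculation. It is not necessarily the exact method from the video: it assumes a single fatality rate, a fixed delay from infection to death, and a constant doubling time, and the parameter values in the example call are placeholders rather than estimates.

```python
def estimate_current_infections(
    deaths_today: float,
    fatality_rate: float,
    days_infection_to_death: float,
    doubling_days: float,
) -> float:
    """Back out infections from deaths, then grow them forward to today."""
    if not 0 < fatality_rate <= 1:
        raise ValueError("fatality_rate must be in (0, 1]")
    if doubling_days <= 0:
        raise ValueError("doubling_days must be positive")
    # Infections that existed when today's deaths were infected.
    infections_then = deaths_today / fatality_rate
    # Number of doublings that fit into the delay since then.
    growth = 2 ** (days_infection_to_death / doubling_days)
    return infections_then * growth


# Placeholder inputs, for illustration only.
estimate = estimate_current_infections(
    deaths_today=10,
    fatality_rate=0.01,
    days_infection_to_death=18,
    doubling_days=6,
)
print(round(estimate))
```

The result is very sensitive to the assumed doubling time, since that value sits in an exponent.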

~~~
JoeAltmaier
Not this data, but I like Johns Hopkins rendering of their data:

[https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.h...](https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6)

------
chadlavi
god damn it NYT, "New York City" is not a county. Why can't I get borough by
borough breakdowns? (each borough is its own county)

edit: someone else posted this, which DOES give county-by-county breakdowns:
[https://accelerator.weather.com/bi/?perspective=dashboard&pa...](https://accelerator.weather.com/bi/?perspective=dashboard&pathRef=.public_folders%2FCOVID19%2FDashboards%2FDS%2FCOVID-19%2B%2528Coronavirus%2529%2BGlobal%2BStatistics&id=iC2B38B09B142481EB83935F6419CA837&objRef=iC2B38B09B142481EB83935F6419CA837&options%5Bcollections%5D%5BcanvasExtension%5D%5Bid%5D=com.ibm.bi.dashboard.canvasExtension&options%5Bcollections%5D%5BfeatureExtension%5D%5Bid%5D=com.ibm.bi.dashboard.core-
features&options%5Bcollections%5D%5Bbuttons%5D%5Bid%5D=com.ibm.bi.dashboard.buttons&options%5Bcollections%5D%5Bwidget%5D%5Bid%5D=com.ibm.bi.dashboard.widgets&options%5Bcollections%5D%5BcontentFeatureExtension%5D%5Bid%5D=com.ibm.bi.dashboard.content-
features&options%5Bcollections%5D%5BboardModel%5D%5Bid%5D=com.ibm.bi.dashboard.boardModelExtension&options%5Bcollections%5D%5BsaveServices%5D%5Bid%5D=com.ibm.bi.dashboard.saveServices&options%5Bcollections%5D%5BserviceExtension%5D%5Bid%5D=com.ibm.bi.dashboard.serviceExtension&options%5Bcollections%5D%5BlayoutExtension%5D%5Bid%5D=com.ibm.bi.dashboard.layoutExtension&options%5Bcollections%5D%5BvisualizationExtension%5D%5Bid%5D=com.ibm.bi.dashboard.visualizationExtensionCA&options%5Bcollections%5D%5BcolorSetExtensions%5D%5Bid%5D=com.ibm.bi.dashboard.colorSetExtensions&options%5Bconfig%5D%5BsmartTitle%5D=false&options%5Bconfig%5D%5BeditPropertiesLabel%5D=true&options%5Bconfig%5D%5BnavigationGroupAction%5D=true&options%5Bconfig%5D%5BenableDataQuality%5D=false&options%5Bconfig%5D%5BmemberCalculation%5D=false&options%5Bconfig%5D%5BassetTags%5D%5B%5D=dashboard&options%5Bconfig%5D%5BfilterDock%5D=true&options%5Bconfig%5D%5BshowMembers%5D=true&options%5Bconfig%5D%5BassetType%5D=exploration&options%5Bconfig%5D%5BgeoService%5D=CA&isAuthoringMode=false&boardId=iC2B38B09B142481EB83935F6419CA837)

------
beefield
Is there anywhere (global) data about total number of tests made? The plain
number of positive cases is quite difficult to make sense of.

~~~
oli5679
[https://ourworldindata.org/coronavirus-testing-source-
data](https://ourworldindata.org/coronavirus-testing-source-data)

The key metric is tests/confirmed case.

When I've looked, this is very predictive of confirmed cases (2 weeks ago) per
death.

~~~
beefield
That was the best I had found too, but the quality isn't good enough for me.
Sources seem to be quite out of date and often based on news articles, not
actual daily data.

------
ravenstine
> In light of the current public health emergency, The New York Times Company
> is providing this database under the following free-of-cost, perpetual, non-
> exclusive license. Anyone may copy, distribute, and display the database, or
> any part thereof, and make derivative works based on it, provided (a) any
> such use is for non-commercial purposes only and (b) credit is given to The
> New York Times in any public display of the database, in any publication
> derived in part or in full from the database, and in any other public use of
> the data contained in or derived from the database.

How is this form of license valid for what's essentially just numbers from
publicly available data? Do I get to take numbers from various sources and
then require credit and that nobody can use them for commercial purposes?

~~~
chapium
"Just numbers" well not quite. They are data which can be interpreted as
information. This is like saying books are just a collection of letters.

This link may be informative and provide a US context. Sorry I don't have a
more content specific reference.

[https://academia.stackexchange.com/questions/63139/public-
da...](https://academia.stackexchange.com/questions/63139/public-dataset-
without-license-what-is-allowed)

~~~
paulmd
Phone books are actually legally just a collection of numbers and are not
considered a creative work, because there is nothing creative about the mere
act of collating them. Thus, they cannot be copyrighted.

[https://en.wikipedia.org/wiki/Feist_Publications,_Inc.,_v._R...](https://en.wikipedia.org/wiki/Feist_Publications,_Inc.,_v._Rural_Telephone_Service_Co).

If NYT provides a "value add" here beyond mere aggregation and deduplication
they are probably copyrightable. Otherwise they probably don't hold a
copyright here (although they may attempt to claim it, just like RTS did).

~~~
chapium
Interesting!

------
z3ugma
My pet peeve with nearly all maps showing this data is that they are on
geographical maps. A virus spread is only a weakly geographical phenomenon -
most infections happen through human-to-human contact so it makes more sense
to show on a population-weighted cartogram. Has anyone made a map like that? A
great example is the 538 hexagonal-tile electoral college map:
[https://projects.fivethirtyeight.com/2016-election-
forecast/](https://projects.fivethirtyeight.com/2016-election-forecast/)

or one of these population-weighted projections of different areas:

[http://news.bbc.co.uk/2/hi/in_pictures/8284655.stm](http://news.bbc.co.uk/2/hi/in_pictures/8284655.stm)

------
bengebre
Kudos to NYT, but I think the COVID Tracking Project data is probably better
because it attempts to measure total testing as well (positives and
negatives). I've been using it to report state-level testing statistics and
new case/death curves:

[https://www.deptofnumbers.com/covid19/](https://www.deptofnumbers.com/covid19/)

From the data I've learned that Washington state appears to be getting their
arms around this thing:

[https://www.deptofnumbers.com/covid19/washington/](https://www.deptofnumbers.com/covid19/washington/)

------
chasebank
[https://coronavirus.jhu.edu/map.html](https://coronavirus.jhu.edu/map.html)

~~~
5cott0
[https://github.com/CSSEGISandData/COVID-19](https://github.com/CSSEGISandData/COVID-19)

The NYT spent more effort patting themselves on the back than they did
compiling the data.

~~~
16bytes
Their data is of higher quality than upstream sources like the Johns Hopkins
CSSE set. Notably, it's in long-form (one record per day per area) instead of
wide-form and the schema is consistent.

This takes some know-how and a bunch of ETL effort. There is significant value
in their transformation of upstream data.
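
To make the long-form versus wide-form distinction concrete, here is a sketch that reshapes a tiny made-up wide table (one column per date) into one record per day per area. The column names and values are hypothetical and do not reproduce either dataset's actual schema.

```python
import csv
import io
from typing import Dict, List, Union

# Made-up wide-form table: one row per area, one column per date.
WIDE = """area,2020-03-01,2020-03-02,2020-03-03
Area A,1,2,4
Area B,0,1,1
"""


def wide_to_long(csv_text: str, id_column: str = "area") -> List[Dict[str, Union[str, int]]]:
    """Turn each (area, date column) cell into its own record."""
    reader = csv.DictReader(io.StringIO(csv_text))
    long_rows: List[Dict[str, Union[str, int]]] = []
    for row in reader:
        for column, value in row.items():
            if column == id_column:
                continue
            long_rows.append({"date": column, "area": row[id_column], "cases": int(value)})
    # Sort by date, then area, so the output reads as a time series.
    long_rows.sort(key=lambda r: (r["date"], r["area"]))
    return long_rows


long_rows = wide_to_long(WIDE)
for record in long_rows:
    print(record)
```

The long form keeps the same columns when a new day arrives (you only append rows), which is a large part of why it is easier to join and filter.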

------
peripitea
For those interested in data, the COVID Tracking Project
([http://covidtracking.com/](http://covidtracking.com/)) doesn't have county
data, but it _does_ have test data (total, positive, and negative). Very
helpful given how relatively useless all of these case counts are without
knowing how many tests were conducted.

~~~
gboesel
I quickly hacked a website together this weekend based on the COVID Tracking
Project data because I wanted to see how much testing was being done and
where.

I think that it's pretty interesting to sort by the columns and browse the
data.

The site's not great on mobile, but I'm working on that this weekend...

[http://VirusTracking.net](http://VirusTracking.net)

------
panpanna
GitHub link in case NYT is blocked at your... home-office?

[https://github.com/nytimes/covid-19-data](https://github.com/nytimes/covid-19-data)

------
s09dfhks
Their data set shows only 1 case in San Mateo County. That ain't right.

------
RcouF1uZ4gsC
And places it behind a login page, requiring people to sign up to see it.

~~~
gmaster1440
[https://github.com/nytimes/covid-19-data](https://github.com/nytimes/covid-19-data)

------
klaudius
Here's what I don't understand... If viruses and bacteria are all over your
body and do some useful things, how can some be bad? What separates 'bad'
virus from a good virus?

~~~
catalogia
Some berries taste great but some berries poison you. Real puzzling isn't it.

~~~
7952
It is interesting though. Why would a plant evolve to kill animals when the
purpose of a berry is to spread seed through consumption?

~~~
nwallin
Holly berries taste delicious to birds, but are slightly toxic to mammals
(enough to cause nausea, vomiting, and diarrhea, but not to kill you). Birds
don't have receptors for capsaicin, the chemical that makes chilies spicy and
deters most mammals (though not humans) from eating them. Deadly nightshade
isn't toxic to birds.

Birds tend to travel farther during any given time period than (typically
terrestrial) mammals do, making them better at spreading seeds over a wide
area. So it's evolutionarily advantageous for plants to optimize their fruits
to be eaten by birds instead of mammals.

