
Covid-19 Open Research Dataset - shinryudbz
https://pages.semanticscholar.org/coronavirus-research
======
axegon_
Call me cynical but isn't this a bit... Redundant? I mean sure, nlp, parse
some papers and make some sort of search/q&a type of thing? Fine, whatever.

Nice work on organizing the data and to a fair degree taking care of the
tedious pre-processing associated with any ml task but... Idk, I fail to see
how this could help. Personally I'll fiddle around with it in my spare time
but mostly for fun, I don't expect anything significantly useful to come out
of it.

On a higher level I feel like the success of the IT industry has made a lot of
people feel like it is the answer to every possible question. And sure,
bioinformatics is an incredible subject and I've invested a lot in studying it
in my spare time in the last couple of years but __purely__ out of curiosity.
But my 2 cents on the subject is that we (developers, ML engineers, etc.) cannot
provide the answer to the meaning of life. In this situation our safest course
of action is to work from home as much as possible and avoid making decisions
that could potentially make the situation worse. Basically protect ourselves
and those around us as much as possible. And by doing so, let people who have
the adequate training, knowledge and experience take care of the situation.
Sure, play around with it in your spare time, and if you come up with
something - share it. But the whole "stand back boys, let us men handle it"
mentality is what really bugs me. If anything history has taught us that this
goes badly 990 out of 1000 times (to be more in line with Bayesians). We all
want the underdog to win the game but... Come on...

~~~
gillesjacobs
> _I mean sure, nlp, parse some papers and make some sort of search /q&a type
> of thing? Fine, whatever._

I think you severely underestimate the effort and expertise required for
getting decent content-aware search, let alone a fully functional question-
answering pipeline.

These are their own subfields of research in text mining. The fact that you
conflate these as if they were some trivial task on a new dataset shows your
lack of understanding of the field.

Biomedical text mining [1] is a wide subfield with plenty of open datasets and
competitions such as the bi-annual ACL BioNLP workshop [1]. Furthermore,
existing knowledge-base creation and information extraction pipelines such as
protein-protein interaction extraction, NER, event extraction, drug-drug
interaction mining, etc. could be applied to this novel dataset and provide
useful insights for researchers and staff.

~~~
axegon_
> These are their own subfields of research in text mining. The fact that you
> conflate these as if they were some trivial task on a new dataset shows your
> lack of understanding of the field.

On the contrary, I've built many, but in this context I see it as a waste of
time. As I said in another reply, it's a resume/portfolio task with no real
world application.

------
pp19dd
It takes downloading one of these files, gunzipping it, extracting the tar
and opening up a JSON file at random to really understand just how distant the
title of the project is from its contents. I fully realize this is about
natural language processing, but... this is beyond reach. You'd have to teach
a computer to become a doctor first.
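
For anyone who wants to repeat that experiment, here is a minimal Python sketch. It assumes the download is a gzip-compressed tar containing JSON files; the function name and the archive path are hypothetical, and nothing here depends on the papers' actual JSON schema.

```python
import json
import random
import tarfile


def load_random_json(archive_path, seed=None):
    """Return (member_name, parsed_json) for one random .json file in a .tar.gz."""
    rng = random.Random(seed)
    # Mode "r:gz" covers both the gunzip and the tar-extraction step.
    with tarfile.open(archive_path, "r:gz") as archive:
        members = [m for m in archive.getmembers()
                   if m.isfile() and m.name.endswith(".json")]
        if not members:
            raise ValueError("no .json files found in archive")
        member = rng.choice(members)
        with archive.extractfile(member) as handle:
            return member.name, json.load(handle)


# Hypothetical usage (the path is a placeholder, not a real file name):
# name, paper = load_random_json("downloaded_subset.tar.gz")
# print(name, list(paper.keys()))
```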

Feels like where AI (and computing in general) should go in case of Covid-19
and other illnesses is analyzing and simulating how the heck we work to begin
with. Take a few minutes to watch these videos of supercomputer simulations
that show -- fragments -- of the fundamentals of how we exist.

Multi Scale Modeling of Chromatin and Nucleosomes -
[https://www.youtube.com/watch?v=4Z4KwuUfh0A](https://www.youtube.com/watch?v=4Z4KwuUfh0A)

DNA animation showing realtime DNA replication -
[https://www.youtube.com/watch?v=7Hk9jct2ozY](https://www.youtube.com/watch?v=7Hk9jct2ozY)

Somewhere in these mind-boggling processes is where the disruption called
Covid-19 puts a stick in our wheels. Compared to the complexity of the
simulations, abstract ideas contained in these files are so macroscopic by
comparison. It's both humbling and awe-inspiring.

~~~
typon
Even though the Multi Scale Modeling video is extremely impressive - it is
still an MD simulation that uses classical mechanics. A full atomistic quantum
simulation of such a large system is out of reach for even the largest super
computers. We barely know anything about biology.

~~~
ksk
They're already using MD simulation in certain areas of drug discovery
(computer aided drug design). Naturally, as you mentioned we're compute-
limited so the models are simpler and the questions we ask of it are also
'simple'. But that is not stopping us at all. While not exactly MD, my S.O.
works in drug discovery and they use simple computational screening models to
simulate the chemistry and screen out tens of thousands of possible drug
candidates.

------
clircle
How can applying a machine learning algorithm to this data (a collection of
research papers) help fight Covid? It’s possible most papers here are bogus or
low quality. Garbage in, garbage out?

~~~
google234123
I imagine they have some algorithm that takes an input consisting of the
title, abstract, and authors with the information about how impactful their
previous work has been (this is probably the most important factor) and
outputs some ranking or a likelihood that the research will be cited/clicked
on.

~~~
inertiatic
That's feasible, it's not even a hard thing to do (maybe hard to do very
accurately), I worked for a company that offered exactly that service, as a
small project.

Still, that probably doesn't answer the question: how does that really help
the fight?

I suppose we'll learn what they have in mind when those questions are asked in
the competition.

------
MuffinFlavored
Where are the most up to date, most reliable case numbers? I'm tracking US
day-by-day case growth.

These all have different numbers:

[http://covid19.fyi/](http://covid19.fyi/)

[https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_t...](https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_the_United_States)

[https://coronavirus.1point3acres.com/](https://coronavirus.1point3acres.com/)

~~~
MrAlexey
I recently had to deal with this problem when building out Covidly
(www.covidly.com)

Initially I tried using WHO and JHU, but quickly found their data to be
riddled with discrepancies, occasional bugs, and direct contradictions with
official statements from various countries.

I ended up aggregating multiple sources (including WHO/JHU/etc), performing
some sanity checks to remove outliers, then doing my best to merge the
remaining results.
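
To make the idea concrete, here is a rough Python sketch of that kind of merge for a single region. It is only an illustration under assumptions: the median-based outlier rule, the `tolerance` parameter, and the source names are hypothetical and are not the actual Covidly pipeline.

```python
from statistics import median


def merge_counts(reports, tolerance=0.5):
    """Merge one region's case counts reported by several sources.

    reports: dict mapping source name -> reported count (None if missing).
    tolerance: a report is dropped when it deviates from the median of all
        reports by more than this fraction (0.5 means 50%).
    Returns (merged_count, sources_used).
    """
    available = {src: n for src, n in reports.items() if n is not None}
    if not available:
        raise ValueError("no source reported a count")
    center = median(available.values())
    # Sanity check: keep only reports reasonably close to the median.
    kept = {src: n for src, n in available.items()
            if abs(n - center) <= tolerance * max(center, 1)}
    if not kept:
        # The sources disagree too much to pick a side; keep them all.
        kept = available
    return median(kept.values()), sorted(kept)
```

Called as `merge_counts({"who": 1000, "jhu": 1040, "scraper": 10})`, the sketch drops the `scraper` value and returns the median of the other two. Returning the list of sources used is one way to make the provenance of each merged number visible.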

Happy to share this data publicly if there's interest!

~~~
mattlutze
"Explosive growth" and "Virus is largely out of control" is dangerous risk
communication.

How often are you updating the data? If there's manual deconfliction, do you
clearly indicate how old data for a country or state/province is, or how
accurate the reporting is that your massaged summary comes from?

If you're meaning to put this out in the world as a source of information
please get some feedback first from people that do this sort of thing for a
living. Inaccuracy or excitable language can do more harm than good in
emergencies.

~~~
MrAlexey
You make a very good point about risk communication. I already had to make a
few updates (e.g. hiding the mortality rate) that were causing unnecessary
panic. I'll work on optimizing the existing language as well.

Regarding the data, it's updated and processed automatically every 10 minutes.

I really appreciate the feedback! Let me know if anything else stands out to
you.

~~~
mattlutze
Here, I would argue that showing mortality rates is not bad. The goal in risk
communication is to instill a level of concern equal to the current threat.
It's all about context.

If you show infection, death and recovery rates, you have to provide context
and help people understand what a thing means.

1-10 scales can make parsing difficult (3 and 4 have the same description
right now, for example). Governments, militaries and emergency aid orgs put a
lot of effort into color and coding systems.

Give Peter Sandman a Google, and check out his site here:

[https://www.psandman.com/](https://www.psandman.com/)

He's an expert in how to talk about scary, hard-to-visualize things (like a
viral pandemic).

Also, how old is the data being drawn from, what algorithm do you use to de-
conflict the sources, and how do you disclose this to your audience (other
than the general about page)? If a source has different refresh rates for
countries that it tracks, how are you reflecting that to your audience?

A note, China is missing from your nifty "First 20 days" graph, which maybe
you should just call "First 20 days after 200 cases" or something like this to
make it clearer what's being tracked.

------
anon1253
It might be fun to check out
[https://covid19.doctorevidence.com/](https://covid19.doctorevidence.com/)
they have loaded the CORD dataset with a dashboard-like interface and a query
language. E.g.
[https://search.doctorevidence.com/search?query=ss(6f1da786-6...](https://search.doctorevidence.com/search?query=ss\(6f1da786-67ab-11ea-9b44-001b21bedfe8\))
(user/pass covid19/covid19) for a direct link, and it provides integration
with the other medically relevant feeds.

------
m3kw9
Without proper search assists like Elasticsearch/Lucene etc., it’s not useful
for most people trying to read it. Maybe someone can set up a site with
Elasticsearch with the articles on there?

~~~
shardinator
I thought so too, I’m working on this. If anyone wants to collaborate please
let me know how I can get in touch (email, Twitter).

~~~
inertiatic
I'm not sure what the cost would be to have ES/Solr hosted (is there a free
option?).

I'm too poor/cheap to do so. If it were up to me, I'd set this up with a static
JS page that uses something similar in the browser (e.g. lunr.js) and allows
you to download that dataset on the spot.

~~~
shardinator
I have a few apps that I host with a cloud provider (scalingo.com) and
initially I can cover the costs. If there’s a lot of traffic because it proves
to be helpful there are a few groups I could ask to help with some funding.

For now I’m building a custom index with MongoDB - if anyone is familiar with
building ES queries, would be keen to chat about how that might be better.

I'll post a link here later today.

~~~
inertiatic
A Lucene-based solution would probably scale better and would allow you to
implement more complex behavior. You get analyzers, stopword filters, synonyms
etc. out of the box, and you can express things like "virus within 5 words of
lung" or "covid or virus but covid is way more important". I believe given the
rather small dataset and the fact that you won't even expose more complex
queries on your interface, the main benefits are the analyzers and various
filters you can use.

If you are already doing stemming, stopword filtering and maybe synonyms
you're probably fine.
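
To illustrate what the analyzer and proximity parts buy you, here is a toy Python sketch of a positional inverted index. It is not Lucene and not its API; the stopword list, the synonym map, and all names are hypothetical. In Lucene's classic query syntax, the proximity example would be written roughly as `"virus lung"~5` and the boosted one as `covid^4 OR virus`.

```python
import re
from collections import defaultdict

# Toy analysis resources (assumptions, not a real analyzer configuration).
STOPWORDS = {"a", "an", "and", "in", "is", "of", "or", "the", "to"}
SYNONYMS = {"lungs": "lung", "pulmonary": "lung", "viruses": "virus"}


def normalize(token):
    """Lowercase a token and map it to its canonical synonym, if any."""
    token = token.lower()
    return SYNONYMS.get(token, token)


def analyze(text):
    """Return (position, term) pairs with stopwords removed.

    Positions count every token, so distances still reflect the original text.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [(position, normalize(token))
            for position, token in enumerate(tokens)
            if token not in STOPWORDS]


class TinyIndex:
    def __init__(self):
        # term -> doc_id -> list of token positions
        self.postings = defaultdict(lambda: defaultdict(list))

    def add(self, doc_id, text):
        for position, term in analyze(text):
            self.postings[term][doc_id].append(position)

    def near(self, term_a, term_b, max_distance=5):
        """Doc ids where both terms occur within max_distance tokens."""
        a_docs = self.postings.get(normalize(term_a), {})
        b_docs = self.postings.get(normalize(term_b), {})
        hits = []
        for doc_id in a_docs.keys() & b_docs.keys():
            if any(abs(pa - pb) <= max_distance
                   for pa in a_docs[doc_id] for pb in b_docs[doc_id]):
                hits.append(doc_id)
        return sorted(hits)
```

With a few thousand papers this brute-force position check is fine; a real engine adds scoring, boosts, and much better analysis on top of the same structure.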

(My email is in my profile if you want to discuss this in further detail)

~~~
shardinator
Lucene does sound good, but I don't have any experience with it so I can't get
anything happening quickly. But very happy to have help if you can?

~~~
shardinator
The url is -
[http://covid19-research.osc-fr1.scalingo.io/](http://covid19-research.osc-fr1.scalingo.io/)

------
ISNIT
Have you put this in
[https://coronavirustechhandbook.com/data](https://coronavirustechhandbook.com/data)?

------
gillesjacobs
Many of the criticisms voiced in this thread stem from a lack of expertise in
biomedical Natural Language Processing and text mining.

Various annotated datasets and models already exist within the field which can
extract potentially useful information and be used in downstream tasks for
targeted document and information retrieval. Biomedical text mining [1] is
a wide subfield with plenty of open datasets and competitions such as the
bi-annual ACL BioNLP workshop [2].

\- Biomedical Named Entity Recognition: extract names of proteins, drugs,
diseases, symptoms, etc. and classify their biomedical category [3].
Extracting the terms of symptoms is crucial in document discovery and
modeling and knowledge-base creation. Several open datasets can be found here
[4].

\- Biomedical relation and event extraction: Traditionally focused on
extracting protein-protein interactions, which are crucial for virtually every
process in a living cell. Information about these interactions provides the
foundations for new therapeutic approaches. Recently, interest has
shifted to the extraction of complex relations such as biomolecular events.
[2] These methods can detect and classify the causal relations between the
genes and proteins in a sentence like "TNF-alpha is a rapid activator of IL-8
gene expression by...".

\- Document retrieval: Helping researchers and medical staff find relevant
topic-specific papers by improving search with topic modeling, document
similarity, named entities, etc.
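
As a very rough illustration of the first task, here is a dictionary-lookup baseline in Python. It is a sketch under assumptions: the gazetteer entries and labels are hypothetical toy data, and real biomedical NER systems are trained on annotated corpora rather than relying on a fixed word list.

```python
import re

# Hypothetical toy gazetteer mapping surface terms to entity categories.
GAZETTEER = {
    "tnf-alpha": "PROTEIN",
    "il-8": "GENE",
    "fever": "SYMPTOM",
    "dry cough": "SYMPTOM",
}


def tag_entities(text):
    """Return (start, end, surface, label) tuples, longest terms matched first."""
    spans = []
    taken = set()  # character offsets already covered by a match
    for term in sorted(GAZETTEER, key=len, reverse=True):
        # Require that the match is not glued to other word characters or hyphens.
        pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            covered = set(range(match.start(), match.end()))
            if covered & taken:
                continue  # overlaps a longer match found earlier
            taken |= covered
            spans.append((match.start(), match.end(),
                          match.group(0), GAZETTEER[term]))
    return sorted(spans)
```

The extracted entities can then be stored alongside each paper and used as search facets, which is where this ties into the document retrieval task.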

These are only some examples of common biomedical text mining tasks and there
are plenty more. Now of course, relying on previous annotated data is an issue
because the tagged categories might not be relevant for many of the issues
related to COVID19. However, even unsupervised modeling like using SciBERT to
create topic models or document clusters of related documents can be helpful
for scientific discovery.
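
A minimal unsupervised sketch of the "related documents" idea follows, in pure Python. Note the swap: it uses plain TF-IDF vectors and cosine similarity as a stand-in for SciBERT embeddings, and it only finds the nearest neighbour of a document, which is a building block for clustering rather than a full topic model. All names and the example texts are hypothetical.

```python
import math
import re
from collections import Counter


def tfidf_vectors(docs):
    """Map each doc id to a sparse {term: weight} TF-IDF vector."""
    tokenized = {doc_id: re.findall(r"[a-z0-9]+", text.lower())
                 for doc_id, text in docs.items()}
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized.values()
                       for term in set(tokens))
    n_docs = len(docs)
    vectors = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        # Smoothed IDF: terms present in every document get weight zero.
        vectors[doc_id] = {
            term: count * math.log((1 + n_docs) / (1 + doc_freq[term]))
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def most_similar(doc_id, vectors):
    """Return the id of the other document closest to doc_id."""
    others = (other for other in vectors if other != doc_id)
    return max(others, key=lambda other: cosine(vectors[doc_id], vectors[other]))
```

Replacing `tfidf_vectors` with embeddings from a pretrained model would leave `cosine` and `most_similar` unchanged, which is why this kind of pipeline is cheap to try on a new dataset.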

1\.
[https://en.wikipedia.org/wiki/Biomedical_text_mining](https://en.wikipedia.org/wiki/Biomedical_text_mining)

2\.
[https://aclweb.org/aclwiki/BioNLP_Workshop](https://aclweb.org/aclwiki/BioNLP_Workshop)

3\.
[https://www.hindawi.com/journals/cmmm/2015/571381/](https://www.hindawi.com/journals/cmmm/2015/571381/)

4\. [http://gcancer.org/clstmdata/](http://gcancer.org/clstmdata/)

5\.
[https://bmcbioinformatics.biomedcentral.com/articles/10.1186...](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3321-4)

------
dougb
This would be the perfect time for IBM to apply all Watson technologies and
resources to develop new insight into Covid-19.

~~~
heyitsguay
Watson is basically a brand name for IBM's data analytics consulting services.
My understanding is they're not that great at it, they haven't scored any
major wins outside of that Jeopardy run. I don't have any articles on hand but
I seem to recall reading about some failures with a medical partner in
particular, but then that's been a tough field for other big names like
Google, too.

