

Ask HN: Advice on a public health project. - XFrequentist

Hello HNers:<p>Question from a very non-technical person (with a bit of a statistics/biology background).<p>I'm an epidemiologist with a national public health agency. I'm working on a project to try to predict the emergence of new infectious diseases. Lots of data is reported (online or otherwise) by various groups, but not in a way our team can easily interpret or use.<p>We'd like to scrape data from these reports, populate a database of events, and produce a dashboard that would display events geographically and allow us to perform/display analyses on risk and context. The hope is that with the right data, timely analysis and a highly visual display, the next "big one" will be apparent in the early stages, while intervention and prevention are still possible.<p>Examples of similar projects (by groups we do or can collaborate with) can be found here:<p>healthmap.org<p>www.glews.net<p>empres-i.fao.org/empres-i/home<p>These are nice, but do not go nearly as far as we'd like in terms of analysis.<p>As a first step, I'm generating a list of criteria that indicate various risk levels (from various experts and the published literature). Next, I'll need to develop the database, and I have no clue how that should be accomplished (I assume copy-pasting into Excel is sub-optimal).<p>I was hoping HN might have some guidance on how the parsing/database populating ought to be done. I'm so informatically ignorant that I'm not even sure I'm posing an intelligent question, so please set me straight if I'm being dumb.<p>Basically, where should I start? What are the correct keywords I should be googling? Do I need to get access to the underlying SQL/XML database from my source websites? How should I structure my database? Do I need to hire a hacker? Are these sensible questions, or should I be asking something else?
======
srini1234
Your application spans several major areas of computing technology. First, you
would need a highly efficient data capture system. Most epidemic information
(flu, etc.) comes from unstructured inputs such as web search trends, texting,
hospital emergency call patterns, hospital staff reports, CDC tracking, PCP
tracking, etc. These inputs are like fire hoses, so you must have an efficient
input interface. Keywords here are XMPP, ejabberd, SMS gateway, etc.
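
To make the "unstructured input" problem concrete, here is a minimal sketch of turning one line of report text into a structured record. The pipe-separated report format, the `Event` fields and the `parse_report` name are all made up for illustration; real sources will each need their own parser, and this is not how any of the sites mentioned above actually publish data.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical report format, invented for this example only:
#   "2010-03-14 | Hanoi, Vietnam | avian influenza | 3 cases"
REPORT_PATTERN = re.compile(
    r"^\s*(\d{4})-(\d{2})-(\d{2})\s*\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|"
    r"\s*(\d+)\s+cases?\s*$",
    re.IGNORECASE,
)


@dataclass
class Event:
    """One reported disease event (field names are assumptions)."""
    reported_on: date
    location: str
    disease: str
    cases: int


def parse_report(line: str) -> Optional[Event]:
    """Return an Event for a well-formed line, or None if it does not parse."""
    match = REPORT_PATTERN.match(line)
    if match is None:
        return None
    year, month, day, location, disease, cases = match.groups()
    try:
        reported_on = date(int(year), int(month), int(day))
    except ValueError:
        # The digits matched the pattern but are not a real calendar date.
        return None
    # Lower-case the disease name so "Avian Influenza" and "avian influenza"
    # end up as the same key later on.
    return Event(reported_on, location, disease.lower(), int(cases))
```

Lines that do not fit the pattern come back as `None` rather than raising, so a batch job can count and log the rejects instead of stopping at the first odd report.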

Next, the data must be stored and sorted efficiently, so that the analytical
engines can easily produce intelligent outcomes. Here the keywords are: NoSQL,
Hadoop, MapReduce, BigTable, MongoDB, etc. Read the various High Scalability
case studies on sites such as Twitter, Facebook, Flickr, etc.
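
As a rough sketch of what "a database of events" could look like at the smallest scale, the example below uses SQLite from the Python standard library as a stand-in, not one of the NoSQL systems named above. The table layout, column names and helper functions are assumptions for illustration, and the rows in the usage note are invented.

```python
import sqlite3
from typing import List, Tuple


def create_event_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a small event database with one table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS events (
            id          INTEGER PRIMARY KEY,
            reported_on TEXT    NOT NULL,
            country     TEXT    NOT NULL,
            disease     TEXT    NOT NULL,
            cases       INTEGER NOT NULL,
            source      TEXT    NOT NULL
        )
        """
    )
    # Index the columns that the query below filters on.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_events_disease_date "
        "ON events (disease, reported_on)"
    )
    return conn


def add_event(conn: sqlite3.Connection, reported_on: str, country: str,
              disease: str, cases: int, source: str) -> None:
    """Insert one event; reported_on is an ISO date string (YYYY-MM-DD)."""
    conn.execute(
        "INSERT INTO events (reported_on, country, disease, cases, source) "
        "VALUES (?, ?, ?, ?, ?)",
        (reported_on, country, disease, cases, source),
    )
    conn.commit()


def cases_by_country(conn: sqlite3.Connection,
                     disease: str) -> List[Tuple[str, int]]:
    """Total reported cases per country for one disease, largest first."""
    return conn.execute(
        "SELECT country, SUM(cases) FROM events WHERE disease = ? "
        "GROUP BY country ORDER BY SUM(cases) DESC, country",
        (disease,),
    ).fetchall()
```

Usage would be along the lines of `conn = create_event_store()`, a few `add_event(conn, "2010-03-14", "Vietnam", "avian influenza", 3, "example-feed")` calls, then `cases_by_country(conn, "avian influenza")`. A single table like this is probably enough to prototype a dashboard; whether you later need the heavier systems depends on how much data the feeds actually produce.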

Next, you need to deploy an analytical engine so that visitors can run their
own decision-making queries. You will probably have to prepare some
standard/canned reports as well. Here the keywords are: Machine Learning (ML),
Weka, Bayesian Partitioning, Markov Chains, PMML, etc.

Then you have to actually write the web app and put it in a hosting facility
for the world to access. There are many options here. Keywords are: Python,
Rails, Heroku, AWS, EC2, Rackspace, Azure, etc.

Yes, it's a good idea to hire a hacker, at least part-time, or even to find a
technical co-founder. In your case, it is better if the person has some domain
knowledge (social health issues).

