The Dirtier Data Might Be A Better DATA – DR DHESI BR

6/15/2017

Colleagues from the School of Public Health and from academia have asked me a couple of times: what is the difference between conventional statistical methods and machine learning techniques in predicting disease outbreaks?

Well, besides real-time prediction, the biggest difference is that in computational science (in our work, not all) we ignore heterogeneity across data sets rather than trying to remove it, whereas in the traditional methods taught in schools of public health we are trained to reduce heterogeneity. A controlled environment is the best environment! Well, not true for me...

For example, people say, “I’m not going to use this sample because that patient had a different drug treatment.” In statistics, we have learned to take data sets and select samples so that there are no confounding factors.

Keep in mind that when we do this, we do not capture the heterogeneity of the disease or of the pattern we are trying to study. This is why we can't extrapolate and come up with novel findings: we have programmed ourselves into a controlled environment (homogeneity). In AIME, by contrast, we extract, compile and analyse that heterogeneity and complexity, combining various types of data sets to understand an outcome, so that the findings can be replicated in other countries.

It is okay to have heterogeneity. Using dirty data allows you to account for it (this is not just my view; Prof Khatri from Stanford University has published an article on this as well).

But remember, to be sure that heterogeneity does not compromise our results, we need stringent validation criteria, to make sure that the statistical associations we establish (in my case, between weather, environmental factors and disease outbreak conditions) did not arise by chance.
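One simple way to check whether an association could have arisen by chance is a permutation test. The sketch below is only an illustration of that idea, not the validation procedure used in AIME: the function names `pearson` and `permutation_p_value` are hypothetical, and it assumes the observations are exchangeable, which weather and case-count time series often are not because of autocorrelation and seasonality.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the correlation between x and y.

    Shuffling y breaks any real link with x, so the shuffled correlations
    show what "by chance" looks like for this particular data.
    """
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Here `x` could be, say, weekly rainfall and `y` the reported case counts for the same weeks. A small p-value means few shuffles produced an association as strong as the observed one.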

Another key point to take: this stringent validation has to be done independently, on data not tied to a previous publication or to a controlled environment.

I quote: “Dr Khatri from Stanford published a set of guidelines so anyone can do it. It compares several methods and is quite technical, but here is the punch line: Reproducibility is good (greater than 85 percent) when you use three to five data sets with a total of 200-250 samples. Which statistical/meta-analysis method you choose is not important. What really matters is not having a large, homogeneous data set but rather multiple heterogeneous data sets” (Dr Khatri). This is different from what we are taught in public health schools: large data sets, exclusion criteria and inclusion criteria! In Khatri’s method, we include all of them, then we validate them.
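To make "multiple heterogeneous data sets" concrete, here is a minimal sketch of one common way to pool an effect across several data sets: a random-effects meta-analysis with the DerSimonian-Laird estimator. This is an illustrative choice on my part, not a description of the guidelines quoted above or of what AIME does. The function name `random_effects_pool` is hypothetical, and it assumes each data set has already been reduced to one effect estimate and its variance.

```python
import math


def random_effects_pool(effects, variances):
    """Pool per-data-set effect estimates with a DerSimonian-Laird model.

    effects   -- one effect estimate per data set
    variances -- the sampling variance of each estimate (all > 0)
    Returns (pooled_effect, standard_error, tau2), where tau2 is the
    estimated between-data-set variance, i.e. the heterogeneity.
    """
    k = len(effects)
    if k != len(variances) or k < 2:
        raise ValueError("need at least two data sets with matching inputs")

    # Fixed-effect (inverse-variance) weights and pooled mean.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q measures how much the data sets disagree.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights widen each variance by the heterogeneity.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2
```

The point of the design is that heterogeneity is estimated (`tau2`) and carried into the pooled result, instead of being removed up front through exclusion criteria.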

I also believe that using heterogeneous data (not collected in a controlled manner for a specific study) is not just good but required, as there will always be heterogeneity across data sets. Many researchers prefer published papers and clean, homogeneous data sets, but keep in mind the research biases in the literature. The gist is that forming hypotheses based on what’s been published is akin to looking for your keys under a street lamp because that’s where the light is better. There are more things out in the dark that are yet to be found...

DR DHESI BR (MD,MPH)