As a data scientist, your gut and your training tell you to use perfect data for an analysis. This is often a product of a classical statistics education, which prepares you to submit research and analysis for publication. That is a fine and noble goal, but upon encountering real-world data, the cold reality of dirty data sets in, and you must learn to abandon the hope of perfection or face an endless loop of frustration.
My wife Sarah, who did her graduate work in public health, has often used the phrase: “Don’t let the perfect be the enemy of the good.” When confronted with imperfect data, my classical training would say that this data is beyond hope, that it could never be cleaned sufficiently, and that we would be unable to obtain anything truly meaningful from it. However, this is where we arrive at a key principle: this should not be an all-or-nothing decision. The data is neither good nor bad, but it certainly is viable. How can we improve our policies and strategies in the absence of perfect data? If it doesn’t meet the pristine standards of the classical approach, we must find ways to make the data work, so that it can inform the critical decisions necessary to move ahead.
To explain, I need to step back in time.
In graduate school, I had the same professor for all of my statistics classes. She was thorough, excellent, and meticulous. ...