Recently, I was talking to a friend who works as a statistician in manufacturing. He explained how his direct involvement in the manufacturing process helped him understand where biases and limitations in the data may originate. For example, he told me about an experiment involving a high-volume device. Designing the experiment was especially challenging because the device had to be filled completely with very costly ingredients for every run. The sample size was therefore heavily limited by cost constraints. It ended up being one of his most expensive experiments in manufacturing.
I had similar experiences in my career as a statistician when I learned about the burden of certain data entry forms. At the time, we were moving from paper forms (yes, people were entering all the data from paper into the clinical databases) to electronic forms that captured the data directly at the investigator's site. The tedious process and the slow system responses made me rethink how much data we should collect, and in which way, to make it as easy as possible for the investigator. That way, we hoped to reduce the amount of missing data.
Even John Snow, a data scientist over 150 years ago, investigated how the data was collected and which factors affected the data collection. His famous map of cholera deaths in a part of London in 1854 shows a building close to the water pump that was the likely culprit of the spread of the disease. Why was nobody dying in this building? He found out that the building housed a brewery with its own water supply (and, of course, beer supply). This saved the people there from taking water from the infected pump.
As a data scientist or statistician, make being a data detective part of your job as well. Learn how your data comes about.
- Under what circumstances do people enter the data?
- Which factors could affect the data collection and cleaning process?
- What incentives do people have to collect the data?
Reach out to your business partners close to the data collection site. Where is the brewery in your data?
Reference: John Snow: A Legacy of Disease Detectives