Medical data is distinctive in its large volume, heterogeneity, and complexity. Cleansing it therefore requires the costly, active participation of medical domain experts. In this paper we present a new data cleansing approach that uses Bayesian networks to correct errant attribute values. Bayesian networks capture expert domain knowledge as well as the uncertainty inherent in the cleansing process, both of which existing cleansing tools fail to model. Using contextual information when correcting errant values improves accuracy. Our approach operates in conjunction with models of the possible error types that we have identified through our cleansing activities. We evaluate our approach and apply it to correcting instances of these error types.
Prashant Doshi, Lloyd Greenwald, John R. Clarke
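
As an illustration of the idea summarized in the abstract, the following minimal sketch shows how a small Bayesian network might combine contextual information with an error model to propose a correction for a recorded attribute value. It is not the authors' implementation. The attribute names (`injury type`, `mechanism`), the three-node structure, every probability, and the helper names `posterior_true_value` and `correct` are hypothetical assumptions chosen only for this example.

```python
# Hypothetical three-node network (all names and numbers are assumptions):
#   true value -> context attribute
#   true value -> recorded value (via a simple error model)

# Prior belief over the true attribute value.
PRIOR = {"blunt": 0.5, "penetrating": 0.5}

# P(context | true value): how likely each context is for each true value.
CONTEXT_GIVEN_TRUE = {
    "blunt": {"fall": 0.98, "stab": 0.02},
    "penetrating": {"fall": 0.05, "stab": 0.95},
}

# Assumed probability that the recorded value differs from the true value.
ERROR_RATE = 0.1


def recorded_given_true(recorded, true_value):
    """Error model P(recorded | true) for a two-valued attribute."""
    return 1.0 - ERROR_RATE if recorded == true_value else ERROR_RATE


def posterior_true_value(recorded, context):
    """Return P(true value | recorded value, context) by direct enumeration."""
    unnormalized = {
        t: PRIOR[t] * CONTEXT_GIVEN_TRUE[t][context] * recorded_given_true(recorded, t)
        for t in PRIOR
    }
    total = sum(unnormalized.values())
    return {t: p / total for t, p in unnormalized.items()}


def correct(recorded, context):
    """Propose the most probable true value together with its posterior."""
    post = posterior_true_value(recorded, context)
    best = max(post, key=post.get)
    return best, post[best]


if __name__ == "__main__":
    # A record says "blunt" although the context says the mechanism was a stab.
    print(correct("blunt", "stab"))
    # The same recorded value is left unchanged when the context agrees with it.
    print(correct("blunt", "fall"))
```

Under these assumed numbers, a record of `blunt` in a `stab` context is corrected to `penetrating` with posterior probability of roughly 0.84, while the same record in a `fall` context is kept. The posterior also gives an expert a confidence value to review, which reflects the uncertainty the abstract says the cleansing process should model.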