Public-use sensor datasets are a useful scientific resource, but their provenance is easily disconnected from their content. To address this, we introduce a technique that directly associates provenance information with sensor datasets. Our technique is similar to traditional watermarking but is intended for application to unstructured datasets. Our approach is potentially imperceptible given sufficient margins of error in datasets, and is robust to a number of benign but likely transformations, including truncation, rounding, bit-flipping, sampling, and reordering. We provide algorithms for both one-bit and blind mark checking. Our algorithms are probabilistic in nature and are characterized by a combinatorial analysis.

Categories and Subject Descriptors
E.m [Data]: Miscellaneous; H.3.m [Information Systems]: Information Storage and Retrieval—Miscellaneous

General Terms
Design, Documentation, Reliability, Security

Keywords
Provenance, Self-identifying da...
Stephen Chong, Christian Skalka, Jeffrey A. Vaugha