Data integration systems offer users a uniform interface to a set of data sources. Previous work has typically assumed that the data sources are independent of each other; however, in scenarios involving large numbers of sources, such as the Web or large enterprises, there is an ecosystem of dependent sources, in which some sources copy parts of their data from others. This paper considers the new optimization problems that arise when answering queries over a large number of dependent sources: (1) the cost-minimization problem: what is the minimum cost we must incur to retrieve all answer tuples; (2) the maximum-coverage problem: given a bound on the cost, how can we obtain the maximum possible coverage; and (3) the source-ordering problem: for a set of data sources, what is the best order in which to query them so as to retrieve answer tuples as fast as possible. We consider these optimization problems under several cost models and show that, in general, they are intractable. We describe effect...
Anish Das Sarma, Xin Luna Dong, Alon Y. Halevy
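
To make the source-ordering and maximum-coverage problems from the abstract concrete, the following Python sketch applies a simple greedy heuristic to a toy instance. It is an illustration only and not the paper's algorithm: the function names `greedy_order` and `coverage_within_budget`, the example sources `A`, `B`, `C`, and the fixed per-source cost model are all assumptions introduced here.

```python
from typing import Dict, List, Set

# Hypothetical toy instance: B copies part of A, C is independent.
SOURCES: Dict[str, Set[str]] = {
    "A": {"t1", "t2", "t3"},
    "B": {"t1", "t2"},
    "C": {"t4"},
}
# Assumed cost model: one fixed, positive cost per source queried.
COST: Dict[str, float] = {"A": 3.0, "B": 1.0, "C": 1.0}


def greedy_order(sources: Dict[str, Set[str]], cost: Dict[str, float]) -> List[str]:
    """Order sources by new answer tuples gained per unit cost (a heuristic)."""
    remaining = dict(sources)
    covered: Set[str] = set()
    order: List[str] = []
    while remaining:
        # sorted() makes tie-breaking deterministic (alphabetical by name).
        best = max(sorted(remaining),
                   key=lambda s: len(remaining[s] - covered) / cost[s])
        order.append(best)
        covered |= remaining.pop(best)
    return order


def coverage_within_budget(order: List[str], sources: Dict[str, Set[str]],
                           cost: Dict[str, float], budget: float) -> Set[str]:
    """Query sources in the given order, skipping any that would exceed the budget."""
    covered: Set[str] = set()
    spent = 0.0
    for s in order:
        if spent + cost[s] > budget:
            continue
        spent += cost[s]
        covered |= sources[s]
    return covered


if __name__ == "__main__":
    order = greedy_order(SOURCES, COST)
    print(order)
    print(sorted(coverage_within_budget(order, SOURCES, COST, budget=2.0)))
```

Because `B` duplicates tuples already offered by `A`, the marginal value of each source depends on which sources were queried before it; this dependence is what separates these problems from the independent-source setting.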