Supporting top-k queries over distributed collections of schemaless XML data poses two challenges. First, although XML supports expressive query languages such as XPath and XQuery, writing an appropriate query in these languages requires schema knowledge, which may not be available in distributed systems with autonomous and dynamic sources. Thus, there is a need for approximate query processing. Second, retrieving the top-k results incurs large communication and processing costs, since partial result lists from numerous sites need to be combined and ranked to assemble the top-k answers. To address both of these issues, we present an approach for approximate XPath processing over distributed collections of XML data based on a clustered path index, where data is grouped based on structural information. Our method gradually generalizes a query by applying a set of structural transformations to it, and the retrieved results are ranked based on the edit distance between two path expressions. A ...