Top-k similarity joins have been extensively studied and used
in a wide spectrum of applications such as information retrieval, decision
making, spatial data analysis and data mining. Given two sets of objects U
and V, a top-k similarity join returns the k most similar pairs of objects from
U × V. In the conventional model of top-k similarity join processing, an object
is usually regarded as a point in a multi-dimensional space and the similarity
between two objects is usually measured by distance metrics such as Euclidean
distance. However, in many applications an object may be described
by multiple values (instances) and the conventional model is not applicable
since it does not address the distributions of object instances. In this paper,
we study top-k similarity join queries over multi-valued objects. We apply
quantile-based distance to explore the relative instance distribution among
the multiple instances of objects. Efficient and effective techniques to process
top-k...
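
For illustration only, the conventional model described above (each object a single point, similarity measured by Euclidean distance) can be sketched as a brute-force join over U × V. This is a minimal sketch and not the technique proposed in the paper: the function name top_k_join and the sample points are hypothetical, and it does not handle multi-valued objects or the quantile-based distance.

```python
import heapq
import math
from itertools import product


def top_k_join(U, V, k):
    """Return the k closest (distance, u, v) triples from U x V.

    Hypothetical helper: brute-force baseline for the conventional
    single-point model, using Euclidean distance.
    """
    # Enumerate every pair in the Cartesian product with its distance.
    pairs = ((math.dist(u, v), u, v) for u, v in product(U, V))
    # Keep the k pairs with the smallest distance (most similar first).
    return heapq.nsmallest(k, pairs, key=lambda t: t[0])


# Made-up sample data: each object is a single 2-D point.
U = [(0, 0), (5, 5)]
V = [(1, 0), (5, 6), (9, 9)]
result = top_k_join(U, V, 2)
```

Here result holds the two pairs at distance 1.0, namely ((0, 0), (1, 0)) and ((5, 5), (5, 6)). The sketch examines all |U| · |V| pairs, so it serves only as a reference point for what the join returns.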