Recent work on parallel joins and data skew has concentrated on algorithm design without considering the causes and characteristics of data skew itself. Existing analytic models of skew do not contain enough information to fully describe data skew in parallel implementations. Because the assumptions made about the nature of skew vary between authors, it is almost impossible to make valid comparisons of parallel algorithms. In this paper, a taxonomy of skew effects is developed, and a new performance model is introduced. The model is used to compare the performance of two parallel join algorithms.
Christopher B. Walton, Alfred G. Dale, Roy M. Jene