Query performance prediction is essential for many important tasks in cloud-based database management, including resource provisioning, admission control, and pricing. Recent work has built prediction models to estimate the execution time of traditional SQL queries. While suitable for typical OLTP/OLAP workloads, these existing approaches are insufficient to model the performance of complex data processing activities for deep analytics, such as data cleaning and integration. These activities are largely based on similarity operations, which differ radically from regular relational operators. In this paper, we consider prediction models for set similarity joins. We exploit knowledge of the optimization techniques and design details commonly found in set similarity join algorithms to identify relevant features, which are then used to construct prediction models based on statistical machine learning. An extensive experimental evaluation confirms the accuracy of our approac...