Discovering Interesting Subsets Using Statistical Analysis



In this paper we present algorithms for identifying interesting subsets of a given database of records. In many real-life applications, it is important to automatically discover subsets of records that are interesting with respect to a given measure. For example, in a customer support database, it is important to identify subsets of tickets whose service time is too large (or too small) compared with the service time of the remaining tickets. We use Student's t-test to check whether the measure values for a subset and its complement differ significantly. We first discuss the brute-force approach and then present a heuristic-based state-space search algorithm to discover interesting subsets of the given database. To apply the proposed heuristic-based approach to large data sets, we then present a sampling-based algorithm that combines sampling with the proposed heuristics to efficiently identify interesting subsets. We discuss an application of the...
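The subset-versus-complement test described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes that candidate subsets are described by a single attribute=value condition, that the pooled-variance (equal-variance) form of Student's t-test is used, and that significance is decided by comparing |t| against a critical value supplied by the caller. The names `student_t`, `interesting_subsets`, `tickets`, and `service_hours` are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import mean, variance


def student_t(sample_a, sample_b):
    """Two-sample Student's t statistic with pooled variance.

    Returns (t, degrees_of_freedom). Assumes equal variances in both groups.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two values")
    df = n_a + n_b - 2
    pooled = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    if pooled == 0:
        raise ValueError("pooled variance is zero; t is undefined")
    t = (mean(sample_a) - mean(sample_b)) / sqrt(pooled * (1 / n_a + 1 / n_b))
    return t, df


def interesting_subsets(records, measure, attributes, t_critical):
    """Brute-force scan over single attribute=value subsets (an assumption).

    A subset is reported when |t| between the subset and its complement
    exceeds t_critical.
    """
    results = []
    for attr in attributes:
        for value in sorted({r[attr] for r in records}):
            subset = [r[measure] for r in records if r[attr] == value]
            rest = [r[measure] for r in records if r[attr] != value]
            if len(subset) < 2 or len(rest) < 2:
                continue  # too few values to estimate a variance
            t, df = student_t(subset, rest)
            if abs(t) > t_critical:
                results.append((attr, value, t, df))
    return results


# Hypothetical customer-support tickets; all values are made up.
tickets = [
    {"region": "east", "priority": "low", "service_hours": 40.0},
    {"region": "east", "priority": "high", "service_hours": 42.0},
    {"region": "east", "priority": "low", "service_hours": 38.0},
    {"region": "west", "priority": "high", "service_hours": 5.0},
    {"region": "west", "priority": "low", "service_hours": 6.0},
    {"region": "west", "priority": "high", "service_hours": 4.0},
    {"region": "north", "priority": "low", "service_hours": 5.0},
    {"region": "north", "priority": "high", "service_hours": 7.0},
]

# A subset and its complement always partition all 8 records, so df = 6;
# 2.447 is the two-sided 5% critical value of Student's t for df = 6.
found = interesting_subsets(tickets, "service_hours", ["region", "priority"], 2.447)
for attr, value, t, df in found:
    print(f"{attr}={value}: t={t:.2f}, df={df}")
```

Because every candidate subset is tested against its own complement, the degrees of freedom stay fixed at the number of records minus two, which is why a single critical value suffices here. Enumerating conjunctions of several attribute conditions would grow exponentially, which is presumably what motivates the heuristic and sampling-based variants the abstract mentions.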

Maitreya Natu, Girish Palshikar
