WHIRL is an extension of relational databases that can perform "soft joins" based on the similarity of textual identifiers; these soft joins extend the traditional operation of joining tables based on the equivalence of atomic values. This paper evaluates WHIRL on a number of inductive classification tasks using data from the World Wide Web. We show that although WHIRL is designed for more general similarity-based reasoning tasks, it is competitive with mature inductive classification systems on these classification tasks. In particular, WHIRL generally achieves lower generalization error than C4.5, RIPPER, and several nearest-neighbor methods. WHIRL is also fast: up to 500 times faster than C4.5 on some benchmark problems. We also show that WHIRL can be efficiently used to select from a large pool of unlabeled items those that can be classified correctly with high confidence.
William W. Cohen, Haym Hirsh
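
To make the idea of a "soft join" concrete, the following is a minimal sketch, not WHIRL's implementation. It assumes that textual identifiers are compared with TF-IDF weighted cosine similarity, a common choice for text similarity that the abstract itself does not specify. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `soft_join`), the smoothed weighting formula, and the sample movie titles are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector (dict: term -> weight) per document.

    Assumed weighting: log(1 + tf) * log(1 + N / df). The smoothing keeps
    weights positive even for terms that occur in every document.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        term_freq = Counter(toks)
        vec = {
            t: math.log(1 + c) * math.log(1 + n_docs / doc_freq[t])
            for t, c in term_freq.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def soft_join(left, right, top_k=3):
    """Pair every left name with every right name and rank pairs by similarity.

    Unlike an equality join, no pair has to match exactly; the result is the
    top_k pairs with non-zero similarity, best first, as (score, left, right).
    """
    vectors = tfidf_vectors(list(left) + list(right))
    left_vecs, right_vecs = vectors[: len(left)], vectors[len(left):]
    pairs = [
        (cosine(lv, rv), l, r)
        for l, lv in zip(left, left_vecs)
        for r, rv in zip(right, right_vecs)
    ]
    pairs = [p for p in pairs if p[0] > 0.0]
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs[:top_k]


if __name__ == "__main__":
    listings = ["The Lost World Jurassic Park", "Men in Black"]
    reviews = ["Lost World, The: Jurassic Park (1997)", "Men In Black (1997)", "Face/Off"]
    for score, l, r in soft_join(listings, reviews, top_k=2):
        print(f"{score:.3f}  {l!r} ~ {r!r}")
```

An equality join on these two hypothetical columns would return nothing, because no title is written identically in both; the ranked similarity scores are what let differently formatted names be paired. This sketch scores every pair exhaustively, which is only practical for small inputs.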