

Unsupervised morphological segmentation and clustering with document boundaries

Many approaches to unsupervised morphology acquisition incorporate the frequency of character sequences with respect to each other to identify word stems and affixes. This typically involves heuristic search procedures and calibrating multiple arbitrary thresholds. We present a simple approach that uses no thresholds other than those involved in standard application of χ² significance testing. A key part of our approach is using document boundaries to constrain generation of candidate stems and affixes and clustering morphological variants of a given word stem. We evaluate our model on English and the Mayan language Uspanteko; it compares favorably to two benchmark systems which use considerably more complex strategies and rely more on experimentally chosen threshold values.
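The two ingredients named in the abstract, document-bounded candidate generation and χ² significance testing, can be illustrated with a minimal Python sketch. This is not the authors' procedure: the function names, the shared-prefix heuristic, and the `min_stem` parameter are all assumptions made for illustration. The only standard fact used is the χ² critical value of 3.841 for p = 0.05 with one degree of freedom.

```python
from collections import Counter
from itertools import combinations

# Standard critical value for p = 0.05 with 1 degree of freedom.
CHI2_CRITICAL_P05_DF1 = 3.841


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A zero row or column total makes the statistic undefined; treat as no evidence.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def common_prefix(w1, w2):
    """Longest shared initial substring of two words."""
    i = 0
    while i < min(len(w1), len(w2)) and w1[i] == w2[i]:
        i += 1
    return w1[:i]


def candidate_suffix_pairs(documents, min_stem=3):
    """Count suffix pairs from word pairs sharing a prefix within one document.

    Hypothetical heuristic: only words that occur in the same document are
    compared, so the document boundary limits which candidates are generated.
    """
    counts = Counter()
    for doc in documents:
        vocab = sorted(set(doc))
        for w1, w2 in combinations(vocab, 2):
            stem = common_prefix(w1, w2)
            if len(stem) >= min_stem:
                counts[(w1[len(stem):], w2[len(stem):])] += 1
    return counts


if __name__ == "__main__":
    docs = [["walk", "walks", "walked"], ["jump", "jumps"], ["talk", "talked"]]
    print(candidate_suffix_pairs(docs))
    # A candidate would be kept only if its statistic exceeds the critical value.
    print(chi_square_2x2(10, 0, 0, 10) > CHI2_CRITICAL_P05_DF1)
```

In a sketch like this, the only tunable cutoff besides the assumed `min_stem` is the conventional significance level, which is the sense in which such an approach avoids experimentally chosen thresholds.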
Taesun Moon, Katrin Erk, Jason Baldridge
Added 17 Feb 2011
Updated 17 Feb 2011
Type Journal
Year 2009