Statistically based parsers for large corpora, in particular the Penn Treebank (PTB), typically have not used all the linguistic information encoded in the annotated trees on which they are trained. Specifically, they have generally not used information that records the effects of derivations, such as empty categories and the representation of displaced phrases, as found in passive, topicalization, and wh-constructions. Here we explore ways to use this information to "unwind" derivations, yielding a regularized underlying syntactic structure that can be used as an additional source of information for more accurate parsing. In effect, we make use of two joint sets of tree structures for parsing: the surface structure and its corresponding underlying structure, in which arguments have been restored to their canonical positions. We present a pilot experiment on passives in the PTB indicating that through the use of these two syntactic representations we can improve overall...
Igor Malioutov, Robert C. Berwick
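
To make the idea of "unwinding" a passive concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes PTB-style bracketed trees in which a passive object position holds an empty category `(NP (-NONE- *-1))` co-indexed with a surface constituent such as `NP-SBJ-1`. The function names (`parse`, `show`, `unwind_passive`) and the example sentence are hypothetical, chosen only for illustration.

```python
import copy
import re

TRACE = re.compile(r"^\*-(\d+)$")  # matches empty-category leaves such as "*-1"


def parse(s):
    """Parse a PTB-style bracketed string into nested lists [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1                      # consume ")"
        return [label] + children

    return node()


def show(t):
    """Render a nested-list tree back into bracketed form."""
    if isinstance(t, str):
        return t
    return "(" + " ".join([t[0]] + [show(c) for c in t[1:]]) + ")"


def find_indexed(t, index):
    """Return the first constituent whose label ends in -index (e.g. NP-SBJ-1)."""
    if isinstance(t, str):
        return None
    if t[0].endswith("-" + index):
        return t
    for child in t[1:]:
        found = find_indexed(child, index)
        if found is not None:
            return found
    return None


def unwind_passive(tree):
    """Return a copy in which each "*-i" trace is filled by its co-indexed phrase.

    Assumption: the displaced phrase moves back into the trace position and
    leaves an unindexed empty element "*" behind in its surface position.
    """
    tree = copy.deepcopy(tree)

    def visit(t):
        if isinstance(t, str):
            return
        for child in t[1:]:
            visit(child)
        # Look for a node of the form [X, ["-NONE-", "*-i"]].
        if len(t) == 2 and isinstance(t[1], list) and t[1][0] == "-NONE-":
            m = TRACE.match(t[1][1])
            if m:
                antecedent = find_indexed(tree, m.group(1))
                if antecedent is not None:
                    moved = antecedent[1:]
                    antecedent[1:] = [["-NONE-", "*"]]
                    t[1:] = moved

    visit(tree)
    return tree


surface = parse(
    "(S (NP-SBJ-1 (DT the) (NN ball)) "
    "(VP (VBD was) (VP (VBN kicked) (NP (-NONE- *-1)))))"
)
underlying = unwind_passive(surface)
print(show(underlying))
```

The surface tree is left untouched, so both representations remain available side by side, in the spirit of the two joint tree structures described above. Note that this sketch treats every `*-i` empty category as a displaced argument; distinguishing passives from other constructions that use the same notation would require additional checks.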