Quatrième conférence francophone en Recherche d’Information et Applications, March 2007, pages 1 to 16

ABSTRACT. As part of the LegDoc project at Xerox Research Centre Europe, we are developing components for the semantic annotation of semi-structured documents. While certain semantic entities have regular forms and can be extracted easily, more complex and heterogeneous collections favor the deployment of machine learning methods. Moreover, real-world cases pose a technical challenge: training sets are often unavailable for specific annotation tasks. Because manual annotation is costly and error-prone, our approach consists in applying active