Abstract. Natural language parsing, a central task in natural language processing, is widely used across many AI applications. In this paper, we address the problem of evaluating parser performance, particularly its variation across datasets. We propose three simple statistical measures to characterize datasets and evaluate their correlation with parser performance. The results clearly show that different parsers differ in performance variation and sensitivity with respect to these measures. The method can be used to guide the choice of natural language parsers for new domain applications, as well as to combine parsers systematically for better parsing accuracy.