The JOS Morphosyntactically Tagged Corpus of Slovene

The JOS morphosyntactic resources for Slovene consist of the specifications, a lexicon, and two corpora: jos100k, a 100,000-word balanced monolingual sampled corpus annotated with hand-validated morphosyntactic descriptions (MSDs) and lemmas, and jos1M, a 1-million-word corpus that is partially hand-validated. Both corpora were sampled from the 600M-word Slovene reference corpus FidaPLUS. The JOS resources have a standardised encoding: the morphosyntactic specifications are of the MULTEXT-East type, and the corpora are encoded according to the Text Encoding Initiative Guidelines P5. The JOS resources are available as a dataset for research under a Creative Commons licence and are meant to facilitate the development of human language technologies (HLT) for Slovene.

Tomaz Erjavec, Simon Krek

Education | Hand Validated Morphosyntactic | JOS Morphosyntactic Resources | LREC 2008 | Monolingual Sampled Corpus |

Added	29 Oct 2010
Updated	29 Oct 2010
Type	Conference
Year	2008
Where	LREC
Authors	Tomaz Erjavec, Simon Krek
