Corpus-based development of large lexico-semantic resources for Polish

Tuesday 14th February, 15:00, room 55.410
Maciej Piasecki
G4.19 Research Group on Language Technology and Computational Linguistics, Department of Computational Intelligence, Wrocław University of Science and Technology, CLARIN-PL Language Technology Centre

plWordNet 3.0 emo has been under development since October 2005 and has become not only the largest wordnet in the world but also the pivotal element of a large and comprehensive system of lexico-semantic resources for Polish. Its construction was based on a unique corpus-based wordnet development method. The method was initially motivated by the lack of open resources for Polish that we could leverage, but we soon found it very fruitful: it allowed us to go beyond the descriptions in dictionaries and to achieve very good coverage of text in practical applications. The method relies on building large corpora for Polish (over 4 billion words) and a set of tools for extracting lexico-semantic knowledge from text and providing semi-automated support for the work of linguists. However, all final decisions are made by linguists, as we aim at a faithful description of the Polish lexico-semantic system.

During the talk we will discuss the unique model of plWordNet, which is based on lexical units as building blocks and on constitutive lexico-semantic relations as the foundation for the synset definition and the wordnet structure. We will briefly present the rich information stored in plWordNet, including, e.g., many types of links, emotive annotations, and manually defined links to Wikipedia. The structure of the whole complex system of lexico-semantic resources will be presented. The system links together, e.g., plWordNet 3.0 emo, Walenty (a large syntactic-semantic valency dictionary for Polish), enWordNet 1.0 (an expanded version of Princeton WordNet 3.1) and NELexicon 2.0 (a large lexicon of Polish proper names mapped to plWordNet upper-level synsets). The system is also enriched by a very large manually created mapping of plWordNet to enWordNet and a semi-automatically created mapping to SUMO. Finally, we will discuss the application of the corpus-based development method in relation to different types of words and different types of resources in the system.