ColWordNet incorporates and proposes the preferred combinations of words inherent in the practice of language

An extension of WordNet, the most widely used lexical resource, developed by members of the Research Group on Natural Language Processing, who will present it at the 2016 International Coling Conference in Osaka (Japan) in December.

28.10.2016

 

In the field of natural language processing, WordNet is probably the best-known lexical resource. WordNet is a lexical database of English that combines lexicographic information of the kind found in dictionaries, such as definitions and synonyms, with semantic information, such as hypernyms: general, more abstract terms that cover more specific ones. For example, car is a hypernym of convertible.

In practice, WordNet can link related words, such as cat/feline or Ford/car. These links are very important in Artificial Intelligence, since they are crucial when training an automatic system: taking these aspects of language into account helps a machine read a text properly.
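To make the idea of linked words concrete, here is a minimal sketch of a WordNet-like structure. It is a toy illustration only: the `Entry` class, the `LEXICON` table, the definitions and the helper functions are all hypothetical and are not part of WordNet or of any real library.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Entry:
    """One word in a toy lexical database (hypothetical, not real WordNet data)."""
    definition: str
    synonyms: Set[str] = field(default_factory=set)
    hypernym: Optional[str] = None  # the more general term, if any


# Hand-written sample entries, assumed for illustration only.
LEXICON: Dict[str, Entry] = {
    "convertible": Entry("a car with a folding or removable roof", hypernym="car"),
    "car": Entry("a road vehicle with an engine", {"automobile"}, "vehicle"),
    "vehicle": Entry("a machine used for transporting people or goods"),
    "cat": Entry("a small domesticated carnivorous mammal", hypernym="feline"),
    "feline": Entry("a member of the cat family", hypernym="animal"),
    "animal": Entry("a living organism that feeds on organic matter"),
}


def hypernym_chain(word: str) -> List[str]:
    """Follow hypernym links upwards, from the specific word to general terms."""
    chain: List[str] = []
    current = LEXICON.get(word)
    while current is not None and current.hypernym is not None:
        chain.append(current.hypernym)
        current = LEXICON.get(current.hypernym)
    return chain


def is_a(word: str, ancestor: str) -> bool:
    """True if `ancestor` appears somewhere above `word` in the hierarchy."""
    return ancestor in hypernym_chain(word)


print(hypernym_chain("convertible"))
print(is_a("cat", "feline"))
```

Storing only the immediate hypernym for each word keeps the data small, while more distant relations (a convertible is a vehicle) can still be recovered by following the chain.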

However, the WordNet reference database lacks one very important aspect: information on collocations, or preferred combinations of words. This is a feature of language that humans learn inductively through the practice of speech and that today’s standard dictionaries hardly take into account. Members of the Research Group on Natural Language Processing (TALN) have created an extended version of WordNet, ColWordNet, which adds to the database millions of links between lexemes belonging to a collocation.

Collocations are a type of phraseological unit. They consist of elements that have a certain mutual attraction, or combinatorial preference, and their meaning is transparent and compositional. Luis Espinosa-Anke, first author of the paper, explains that “collocations are combinations of words whereby we might say that the use of one is conditioned by the presence of the other”. For example, Spanish says “dar un paseo” (literally, “give a walk”) where English says “take a walk”, and Spanish speakers never say “tomar un paseo”.

This aspect of language is what the authors have introduced into the WordNet reference database in order to create ColWordNet, because “we do not want a machine to say ‘big rain’ or ‘colossal rain’, but we want it to refer to ‘heavy rain’. Although big/colossal/heavy are very close, almost synonymous, only one of them is correct when combined with ‘rain’”, adds Espinosa-Anke.
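The “heavy rain” example can be sketched as a lookup over collocation links. This is a simplified illustration, not ColWordNet’s actual data or format: the `COLLOCATIONS` table, the meaning labels (“intense”, “perform”) and the `pick_collocate` function are assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical collocation links: (base word, intended meaning) -> preferred collocate.
# The labels "intense" and "perform" are assumed here for illustration.
COLLOCATIONS: Dict[Tuple[str, str], str] = {
    ("rain", "intense"): "heavy",
    ("walk", "perform"): "take",
}


def pick_collocate(base: str, meaning: str, candidates: List[str]) -> Optional[str]:
    """Return the candidate that forms the preferred collocation with `base`.

    Returns None when no link is stored for the pair, or when the preferred
    collocate is not among the candidates.
    """
    preferred = COLLOCATIONS.get((base, meaning))
    if preferred in candidates:
        return preferred
    return None


# Near-synonyms that a system might otherwise treat as interchangeable.
print(pick_collocate("rain", "intense", ["big", "colossal", "heavy"]))
```

Without such a link, nothing in the synonym relations alone distinguishes “heavy” from “big” or “colossal”; the link records which of the near-synonyms the base word actually prefers.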

This specificity is characteristic of collocations, and measuring it, compared to other more grammaticalized phrases, is a complicated task. ColWordNet does not just read collocations from the Macmillan Collocations Dictionary; it also uses a machine learning technique to discover new collocational relationships between WordNet concepts.

The research was carried out by Luis Espinosa-Anke, Sara Rodríguez and Horacio Saggion, researchers with TALN; Leo Wanner, group leader and ICREA researcher with the Department of Information and Communication Technologies (DTIC); and José Camacho-Collados of Sapienza University of Rome (Italy). They will present it at the 26th International Conference on Computational Linguistics (Coling 2016), to be held from 11 to 16 December 2016 in Osaka (Japan). The work has been partly funded by the European projects MULTISENSOR (FP7), KRISTINA (H2020) and HARENES, by the Spanish Maria de Maeztu Unit of Excellence programme, and by MINECO/ERDF. It is very much in line with the DTIC-MDM Strategic Programme for the promotion of reproducibility, given that the results, whenever possible, are published under Creative Commons and other open licences in order to promote their use.

Reference work:

Luis Espinosa-Anke, José Camacho-Collados, Sara Rodríguez-Fernández, Horacio Saggion, Leo Wanner (2016), “Extending WordNet with Fine-Grained Collocational Information via Supervised Distributional Learning”, The 26th International Conference on Computational Linguistics (Coling 2016), 11-16 December, Osaka (Japan).
