Collocation resources

The currently available collocation dataset is a list of about 10,000 collocations in English collected and tagged in terms of Lexical Functions (LFs) by I. Mel’čuk. In order to facilitate the use of this dataset in downstream NLP applications, we disambiguated the collocation bases (or “keywords” in the terminology of LFs) with respect to BabelNet synsets.
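To give an idea of how such a dataset can be consumed downstream, the sketch below models an LF-tagged collocation entry as a simple record. The field names and the synset IDs are illustrative assumptions (consult the Readme for the actual file format); the LF names Magn and Oper1 are standard Lexical Functions from Mel'čuk's inventory.

```python
# Hypothetical record layout for an LF-tagged collocation entry.
# Field names and synset IDs are illustrative; see the dataset Readme
# for the real format.
from collections import namedtuple

Entry = namedtuple("Entry", ["base", "collocate", "lf", "babelnet_synset"])

entries = [
    # Magn = the LF for intensification ("heavy rain")
    Entry(base="rain", collocate="heavy", lf="Magn", babelnet_synset="bn:00000001n"),
    # Oper1 = the light-verb LF ("take a walk")
    Entry(base="walk", collocate="take", lf="Oper1", babelnet_synset="bn:00000002n"),
]

def collocates_for(base, lf):
    """Return all collocates recorded for a given base and Lexical Function."""
    return [e.collocate for e in entries if e.base == base and e.lf == lf]

print(collocates_for("rain", "Magn"))  # ['heavy']
```

Disambiguating the base against a BabelNet synset, as described above, is what lets an application distinguish e.g. the meteorological sense of "rain" from metaphorical uses when selecting collocates.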

Please consult the Readme here for a precise description of the dataset.

A subset of this dataset was used in the experiments described in L. Espinosa-Anke, L. Wanner, and S. Schockaert, “Collocation Classification with Unsupervised Relation Vectors”, ACL 2019 (short paper track), Florence, Italy.



A dataset of comparable size for French is in preparation and will be published soon.


Summarization and Information Extraction ready-made systems

Ready-made systems described in the paper "Summarization and Information Extraction in your Tablet", SEPLN 2015.

You will need the GATE system.

The applications (gapp files)  described in that paper can be downloaded from this page and executed in the GATE system.

For the summarization application, you will need to download and install SUMMA.

You will need to copy the summarization application SUMMA-APP-DEMO-SPANISH.gapp to the gapps directory under summa_plugin.

For any doubts, contact the authors.

SUMMA GATE Plug-In (Basic and Advanced Components)

Summarization components: SUMMA jar file, creole.xml and other resources.

Surface-syntactic corpus of Finnish in CoNLL format

A 2,000-sentence corpus of Finnish weather-related sentences has been annotated with surface-syntactic relations according to the Meaning-Text Theory. This corpus will be used to obtain resources that help improve the PESCaDO system and that will be adaptable to future projects. In more detail, the annotation will be used for i) training a parser to extract new data from Finnish webpages, and ii) automatically obtaining deeper levels of annotation (Deep Syntax and Semantics); the latter will allow for training a statistical generator that can be integrated into the Linguistic Generation module. The annotation is presented in the standard CoNLL (one word per line) format.
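As a minimal sketch of what the one-word-per-line CoNLL format looks like and how it can be read: each token is a tab-separated row (in CoNLL-X, the columns include ID, FORM, LEMMA, POS tags, HEAD, and DEPREL), and sentences are separated by blank lines. The Finnish sentence and the dependency labels below are illustrative, not taken from the corpus.

```python
# Minimal sketch of reading one-word-per-line CoNLL data.
# Tokens are tab-separated rows; sentences are separated by blank lines.

def read_conll(text):
    """Yield sentences as lists of (id, form, head, deprel) tuples."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        # CoNLL-X columns: ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, ...
        sentence.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    if sentence:
        yield sentence

# Illustrative two-token sentence: "Sataa lunta" ("It is snowing").
sample = (
    "1\tSataa\tsataa\tV\tV\t_\t0\tROOT\t_\t_\n"
    "2\tlunta\tlumi\tN\tN\t_\t1\tobj\t_\t_\n"
)
for sent in read_conll(sample):
    print(sent)  # [(1, 'Sataa', 0, 'ROOT'), (2, 'lunta', 1, 'obj')]
```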



Dependency parser

This is a very simple implementation of Nivre's arc-eager parsing algorithm, using LibSVM as the machine learner.
It consists of only ~500 lines of Java code (not counting the svm package).

java -jar Parser.jar -t <0|1> -l <0|1> -c trainingSet -i testSet

The -t option is 0 for testing only and 1 for training and testing.
The -l option is 0 for unlabelled parsing and 1 for labelled parsing.
Both trainingSet and testSet should be treebanks in CoNLL data format.
The system will output a file "output.txt" with the output trees.
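Parser output of this kind is typically evaluated with unlabelled and labelled attachment scores (UAS/LAS), which correspond to the -l 0 and -l 1 settings above. The sketch below is a generic evaluation, not a script shipped with the parser: it assumes gold and predicted trees have already been aligned into parallel lists of (head, deprel) pairs per token.

```python
# Hedged sketch: unlabelled/labelled attachment scores over aligned
# gold and predicted dependency trees. Not part of the parser itself.

def attachment_scores(gold, predicted):
    """gold, predicted: parallel per-token lists of (head, deprel) pairs.
    Returns (UAS, LAS): fraction of correct heads / correct heads+labels."""
    assert len(gold) == len(predicted)
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return uas_hits / n, las_hits / n

gold = [(2, "nsubj"), (0, "ROOT"), (2, "obj")]
pred = [(2, "nsubj"), (0, "ROOT"), (2, "iobj")]  # right head, wrong label
uas, las = attachment_scores(gold, pred)
print(uas, las)  # UAS = 1.0, LAS = 2/3
```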


ColWordNet

ColWordNet is a WordNet enriched with collocational information. The following are resources associated with the publication (coming soon):

Download the resource here. Retrofitted models with collocational information related to the following lexical functions: intense, weak, perform, create.

Evaluation data.

Reference paper:

Espinosa-Anke, L., Camacho-Collados, J., Rodríguez-Fernández, S., Saggion, H. and Wanner, L. Extending WordNet with Fine-Grained Collocational Information via Supervised Distributional Learning. COLING 2016.

Exploring Morphosyntactic Annotation Over a Spanish Corpus for Dependency Parsing: Results Table

Detailed results of the paper "Exploring Morphosyntactic Annotation Over a Spanish Corpus for Dependency Parsing", DepLing 2013.

Title of table: Classification according to general LAS improvement of feature combinations.


Extended Wikidata Hypernym Branch in the Music Domain with BabelNet and SensEmbed


(Version 0.9)


ExTaSem! Taxonomies

Version 0.9 of ExTaSem! Taxonomies produced in the domains of Food, Chemical, Science, Equipment, Artificial Intelligence and Terrorism.


Leveraging Spanish Google Ngrams for Correcting and Detecting Real-Word Spelling Errors

Source code for the real-word errors detector and corrector, including evaluation scripts for replicability.



MaltDiver

MaltDiver is a tool developed to visualize the transitions performed by the transition-based parsers included in MaltParser, and to show how the parsers interact with the sentences and the internal data structures.



MaltOptimizer

MaltOptimizer is a freely available tool developed to facilitate parser optimization using the open-source system MaltParser, a data-driven parser-generator that can be used to train dependency parsers given treebank data. MaltParser offers a wide range of parameters for optimization, including nine different parsing algorithms, two different machine learning libraries (each with a number of different learners), and an expressive specification language that can be used to define arbitrarily rich feature models.
MaltOptimizer is an interactive system that first performs an analysis of the training set in order to select a suitable starting point for optimization, and then guides the user through the optimization of the parsing algorithm, feature model, and learning algorithm.

See the website for more information.

MSR-NLP Definition Corpus

Benchmarking dataset for Definition Extraction (DE) evaluation in the NLP domain. More details in:

Espinosa-Anke, L., Ronzano, F. and Saggion, H. (2015). Weakly Supervised Definition Extraction. In Proceedings of RANLP 2015. Hissar, Bulgaria.

SSyntSpa Corpus

This corpus contains fine-grained syntactic dependency annotation of Spanish sentences. It is currently released in the one-word-per-line CoNLL'08 format. It was annotated manually by the NLP group at Universitat Pompeu Fabra, Barcelona.

This corpus has been derived from a section of the AnCora corpus; it can be downloaded through the AnCora webpage (look for AnCora Dependencies UPF).

ViZPar: A GUI for ZPar with Manual Feature Selection

TaxoEmbed

Click here to be redirected to the TaxoEmbed website.

Reference Paper:

Espinosa-Anke, L., Camacho-Collados, J., Delli Bovi, C. and Saggion, H. Supervised Distributional Hypernym Discovery via Domain Adaptation. EMNLP 2016.


Automatic Text Simplification: An Introduction

Lecture given at the 2012 Altamira Summer School, Alicante, Spain.