GLiCom Spanish Wordform List v.1 is a computational lexicon of inflected wordforms in Spanish.

This lexicon can be used in any application for text analysis in Spanish, in particular those in need of a lemmatizer, POS tagger, or named entity recogniser.



The GLiCom research group have interests in descriptive and theoretical research in the areas of Phonetics and Phonology, Morphology, Syntax, Semantics, Pragmatics and Discourse Structure, as well as the computational strategies involved in processing them (symbolic versus statistical approaches, finite state automata, context-free and mildly context-sensitive grammars, recent advances in Categorial Grammar, Unification Grammars, constraint-based systems, computational techniques for speech processing...).

The group is also interested in the development of linguistic tools applied to machine translation, text error correction, text-to-speech, and information retrieval.



The lexicon is distributed in two sublexicons:
  1. Word forms
  2. Verb-clitic combinations
The list of wordforms contains 1,152,242 entries, including (i) regular words (1,144,086), (ii) toponyms and anthroponyms (8,032), (iii) abbreviations and acronyms (775), and (iv) computational terms (124). Each entry consists of: form, lemma, morphosyntactic tag and word type.
The list of verb-clitic combinations contains 4,283,637 entries, exhaustively covering all formal combinations (including infinitive, gerund and imperative). Note that some clitic combinations may be formally possible although semantically implausible. Each entry consists of: form, lemma of the verb and combination of morphosyntactic tags of the verb and the pronoun(s).




Fully developed. The lexicon is formatted as plain text in a tab-separated file.
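
As an illustration of how a tab-separated wordform list could be consumed by a lemmatizer, the following Python sketch builds a lookup from form to its possible analyses. It assumes, without confirmation from the distributed files, that each row holds exactly four columns in the order form, lemma, morphosyntactic tag, word type. The function name load_wordforms, the sample rows, and the tag and word type values (TAG1, TAG2, regular) are made up for the example and do not reflect the actual tagset.

```python
import csv
import io
from collections import defaultdict


def load_wordforms(lines):
    """Index tab-separated rows as form -> [(lemma, tag, word_type), ...].

    Assumed column order (not confirmed by the fact sheet):
    form, lemma, morphosyntactic tag, word type.
    """
    index = defaultdict(list)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip blank or malformed rows
        form, lemma, tag, word_type = row
        index[form].append((lemma, tag, word_type))
    return index


# Hypothetical sample rows; tags and word types are placeholders.
SAMPLE = (
    "casas\tcasa\tTAG1\tregular\n"
    "casas\tcasar\tTAG2\tregular\n"
)

index = load_wordforms(io.StringIO(SAMPLE))

# An ambiguous form keeps every analysis; collect its candidate lemmas.
lemmas = sorted({lemma for lemma, _, _ in index["casas"]})
print(lemmas)
```

For a file on disk, the same function could be called on an open file object instead of the in-memory sample; the file encoding would need to be checked against the delivered data. Keeping a list of analyses per form, rather than a single lemma, preserves ambiguity so that a POS tagger can choose among candidates in context.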




© UPF 2015



Database available to interested parties, both for research and commercialization purposes. The database can be distributed in part or as a whole.



Computational lexicon available for licensing with technical cooperation



Marc Santandreu
Technology Transfer Unit 
(+34) 93 542 2896
[email protected]




Text analysis, wordform, lexicon, Spanish


Ref: TEC-0128

Fact Sheet