This corpus is developed under the project ONTOSEM (FFI2010-15006).

Corpus description

This diachronic corpus of Spanish, developed by Cristina Sánchez-Marco, comprises 674 texts dating from the 12th to the 20th century, with a total of 44,470,288 words. The texts come from open repositories and from collections held in various university libraries. Efforts have been made to keep the corpus representative and balanced, both in terms of genre and in the amount of material from each century covered. The genres represented are poetry, history, laws, didactics, prose, religion, medicine, letters, theater, and press. The objective of the project is to develop a tool that helps to understand linguistic change in general and the evolution of Peninsular Spanish in particular; therefore, all texts belong to that variety.

Tagging of the corpus

The corpus has been tagged with an existing linguistic analyzer, Freeling. Because this tool is designed for contemporary Spanish, it had to be extended to handle the lexical and orthographic variability of a diachronic corpus. The process is described below.

Freeling uses the Parole tagset, developed by the EAGLES group as a common tool for the computational processing of several European languages. In some respects, however, the tags used in this corpus differ from the original Parole ones. The changes have two main motivations: first, to adapt the tagset to the different historical stages of Spanish; and second, to make the output of the analysis suitable input for a syntactic analyzer, which is the next step of this project. On this page you can consult the list of tags used in this corpus. If you are already familiar with the Parole system, please refer to the list to see the changes that have been introduced.

Consulting the corpus

The corpus can be accessed through the Corpus Query Processor (CQP), both in its command-line version and in its graphical interface, CQPweb. In both cases, CQP supports searches for complex patterns using positional attributes, which relate to a single item (word, lemma, tag), and structural attributes, which relate to phrases (length, position in the text...) or texts (date, title, author, genre...). The available attributes can be consulted here. These attributes can be used to refine searches through the 'Restricted Query' option, instead of 'Standard Query', in the CQPweb interface. At its current stage of development, the CQPweb interface can automatically restrict searches by century, genre, and author.

CQPweb accepts queries written either in CQP syntax, as you would use in a terminal, or in a simplified language called 'Simple Query'. The manual for CQP, with the description of its syntax, can be found here, and a brief guide for conducting searches using 'Simple Query' is available here. This corpus can be accessed using the "guest" account on CQPweb.
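As an illustration of searching by positional attributes, the following minimal Python sketch assembles CQP token expressions of the form [attribute="regex"]. The attribute names pos and lemma are assumptions made for the example; the names actually defined for this corpus are those in the attribute list mentioned above. The helper functions cqp_token and cqp_sequence are hypothetical and are not part of CQP or CQPweb.

```python
def cqp_token(**constraints: str) -> str:
    """Build one CQP token expression, e.g. [lemma="cantar" & pos="VMI.*"].

    Keyword names are positional attribute names (assumed here: pos, lemma);
    values are regular expressions. Quotes inside patterns are not escaped.
    """
    if not constraints:
        return "[]"  # in CQP, an empty pair of brackets matches any token
    parts = [f'{attr}="{pattern}"' for attr, pattern in constraints.items()]
    return "[" + " & ".join(parts) + "]"


def cqp_sequence(*tokens: str) -> str:
    """Join token expressions into a query over consecutive tokens."""
    return " ".join(tokens)


# A clitic (tags starting with L) followed by an indicative form of cantar.
query = cqp_sequence(
    cqp_token(pos="L.*"),
    cqp_token(lemma="cantar", pos="VMI.*"),
)
print(query)
```

The resulting string can be pasted into a CQP-syntax query box; in the command-line version of CQP a query is normally terminated with a semicolon.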

Tagset

1. Adjectives

Position  Attribute  Value          Code
1         Category   Adjective      A
2         Type       Qualificative  Q
                     Ordinal        O
                     Possessive     X
3         Degree     -              0
                     Appreciative   A
4         Gender     Masculine      M
                     Feminine       F
                     Common         C
5         Number     Singular       S
                     Plural         P
                     Invariable     N
6         Person     -              0
                     First          1
                     Second         2
                     Third          3
7         Possessor  -              0
                     Singular       S
                     Plural         P

Qualificative and ordinal adjectives are tagged following the Parole system. However, there is a new category: possessive adjectives (AX). This includes the words labeled as Possessive Pronouns in the original Parole. The reasons for this are primarily distributional: Freeling tags very frequent sequences like la mía as determiner + pronoun, with a null noun; however, determiner + pronoun combinations are in principle ungrammatical (*el las, *el tú), and a syntactic analyzer should be able to recognize such a sequence as ungrammatical. Since these elements (mío, tuya, nuestra...) behave distributionally like adjectives (and it could even be argued that semantically they are adjectives), we have decided to tag them as possessive adjectives.

Another change to bear in mind concerns elements with participial morphology (-ado/ada/ados/adas): they all now form a separate new category (T; see below).

2. Adverbs

Position  Attribute  Value     Code
1         Category   Adverb    R
2         Type       General   G
                     Negative  N

There are no changes to the original Parole tagging system.

3. Determiners

Position  Attribute  Value          Code
1         Category   Determiner     D
2         Type       Demonstrative  D
                     Possessive     P
                     Interrogative  T
                     Exclamative    E
                     Indefinite     I
                     Article        A
                     Relative       R
                     Numeral        N
3         Person     First          1
                     Second         2
                     Third          3
4         Gender     Masculine      M
                     Feminine       F
                     Common         C
                     Neuter         N
5         Number     Singular       S
                     Plural         P
                     Invariable     N
6         Possessor  Singular       S
                     Plural         P

There are no changes to the original Parole tagging system.

4. Quantifiers

Position  Attribute  Value       Code
1         Category   Quantifier  Q
2         Gender     Masculine   M
                     Feminine    F
                     Common      C
3         Number     Singular    S
                     Plural      P
                     Invariable  N

This category did not exist in the original Parole tagging system. It includes, as lemmas, mucho, todo, and cada. Despite their differences, these three elements share common features:

  • They can license a nominal phrase: cada niño, todo niño, muchos niños.
  • Some can combine with a determiner: todos los niños, los muchos niños.
  • Some can behave as predicative adjectives or adverbs: son muchos, me gusta mucho.
  • They are very frequent.
  • Both traditional grammar and the Parole tagging system have difficulties treating them, often resorting to assigning them multiple grammatical categories (D, P, A, Adv...).

The Q tag is intended to provide a unified treatment for them and to generate suitable input for the syntactic analyzer. Note that all occurrences of todo, cada, and mucho are analyzed as quantifiers. The exact definition of this category, and the extension of the Q tag to other similar elements, will be settled in the next phase of the project.
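The positional layout of the Q tag can be sketched as follows. This is only an illustration built from the table above (category, gender, number); the function name quantifier_tag and the spelled-out feature names are hypothetical, and the sketch makes no claim about how the corpus tagger itself is implemented.

```python
# Lemmas that the tagset description assigns to the Q category.
QUANTIFIER_LEMMAS = {"mucho", "todo", "cada"}

# Codes copied from the quantifier table (positions 2 and 3).
GENDER_CODES = {"masculine": "M", "feminine": "F", "common": "C"}
NUMBER_CODES = {"singular": "S", "plural": "P", "invariable": "N"}


def quantifier_tag(lemma: str, gender: str, number: str) -> str:
    """Compose a three-character Q tag: category + gender + number."""
    if lemma not in QUANTIFIER_LEMMAS:
        raise ValueError(f"{lemma!r} is not treated as a quantifier")
    return "Q" + GENDER_CODES[gender] + NUMBER_CODES[number]


# muchos (as in "muchos niños") is masculine plural.
print(quantifier_tag("mucho", "masculine", "plural"))
# toda is feminine singular.
print(quantifier_tag("todo", "feminine", "singular"))
```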

5. Nouns

Position  Attribute                Value         Code
1         Category                 Noun          N
2         Type                     Common        C
                                   Proper        P
3         Gender                   Masculine     M
                                   Feminine      F
                                   Common        C
4         Number                   Singular      S
                                   Plural        P
                                   Invariable    N
5-6       Semantic classification  Being-Person  SP
                                   Organization  O0
                                   Place         G0
7         Degree                   Appreciative  A

There are no changes to the original Parole tagging system.

6. Verbs

Position  Attribute  Value          Code
1         Category   Verb           V
2         Type       Main           M
                     Auxiliary      A
                     Semiauxiliary  S
3         Mood       Indicative     I
                     Subjunctive    S
                     Imperative     M
                     Infinitive     N
                     Gerund         G
4         Tense      Present        P
                     Imperfect      I
                     Future         F
                     Past           S
                     Conditional    C
                     -              0
5         Person     First          1
                     Second         2
                     Third          3
6         Number     Singular       S
                     Plural         P
7         Gender     Masculine      M
                     Feminine       F

Examples of complete verbal paradigms:

                       Main verbs                    Auxiliary verbs
Tense                  Form         Lemma   Tag      Form      Lemma  Tag
Present Indicative     canto        cantar  VMIP1S0  soy       ser    VAIP1S0
                       cantas       cantar  VMIP2S0  eres      ser    VAIP2S0
                       canta        cantar  VMIP3S0  es        ser    VAIP3S0
                       cantamos     cantar  VMIP1P0  somos     ser    VAIP1P0
                       cantáis      cantar  VMIP2P0  sois      ser    VAIP2P0
                       cantan       cantar  VMIP3P0  son       ser    VAIP3P0
Past Imperfect         cantaba      cantar  VMII1S0  era       ser    VAII1S0
                       cantabas     cantar  VMII2S0  eras      ser    VAII2S0
                       cantaba      cantar  VMII3S0  era       ser    VAII3S0
                       cantábamos   cantar  VMII1P0  éramos    ser    VAII1P0
                       cantabais    cantar  VMII2P0  erais     ser    VAII2P0
                       cantaban     cantar  VMII3P0  eran      ser    VAII3P0
Simple Past            canté        cantar  VMIS1S0  fui       ser    VAIS1S0
                       cantaste     cantar  VMIS2S0  fuiste    ser    VAIS2S0
                       cantó        cantar  VMIS3S0  fue       ser    VAIS3S0
                       cantamos     cantar  VMIS1P0  fuimos    ser    VAIS1P0
                       cantasteis   cantar  VMIS2P0  fuisteis  ser    VAIS2P0
                       cantaron     cantar  VMIS3P0  fueron    ser    VAIS3P0
Future Indicative      cantaré      cantar  VMIF1S0  seré      ser    VAIF1S0
                       cantarás     cantar  VMIF2S0  serás     ser    VAIF2S0
                       cantará      cantar  VMIF3S0  será      ser    VAIF3S0
                       cantaremos   cantar  VMIF1P0  seremos   ser    VAIF1P0
                       cantaréis    cantar  VMIF2P0  seréis    ser    VAIF2P0
                       cantarán     cantar  VMIF3P0  serán     ser    VAIF3P0
Conditional            cantaría     cantar  VMIC1S0  sería     ser    VAIC1S0
                       cantarías    cantar  VMIC2S0  serías    ser    VAIC2S0
                       cantaría     cantar  VMIC3S0  sería     ser    VAIC3S0
                       cantaríamos  cantar  VMIC1P0  seríamos  ser    VAIC1P0
                       cantaríais   cantar  VMIC2P0  seríais   ser    VAIC2P0
                       cantarían    cantar  VMIC3P0  serían    ser    VAIC3P0
Present Subjunctive    cante        cantar  VMSP1S0  sea       ser    VASP1S0
                       cantes       cantar  VMSP2S0  seas      ser    VASP2S0
                       cante        cantar  VMSP3S0  sea       ser    VASP3S0
                       cantemos     cantar  VMSP1P0  seamos    ser    VASP1P0
                       cantéis      cantar  VMSP2P0  seáis     ser    VASP2P0
                       canten       cantar  VMSP3P0  sean      ser    VASP3P0
Imperfect Subjunctive  cantara      cantar  VMSI1S0  fuera     ser    VASI1S0
                       cantaras     cantar  VMSI2S0  fueras    ser    VASI2S0
                       cantara      cantar  VMSI3S0  fuera     ser    VASI3S0
                       cantáramos   cantar  VMSI1P0  fuéramos  ser    VASI1P0
                       cantarais    cantar  VMSI2P0  fuerais   ser    VASI2P0
                       cantaran     cantar  VMSI3P0  fueran    ser    VASI3P0
                       cantase      cantar  VMSI1S0  fuese     ser    VASI1S0
                       cantases     cantar  VMSI2S0  fueses    ser    VASI2S0
                       cantase      cantar  VMSI3S0  fuese     ser    VASI3S0
                       cantásemos   cantar  VMSI1P0  fuésemos  ser    VASI1P0
                       cantaseis    cantar  VMSI2P0  fueseis   ser    VASI2P0
                       cantasen     cantar  VMSI3P0  fuesen    ser    VASI3P0
Future Subjunctive     cantare      cantar  VMSF1S0  fuere     ser    VASF1S0
                       cantares     cantar  VMSF2S0  fueres    ser    VASF2S0
                       cantare      cantar  VMSF3S0  fuere     ser    VASF3S0
                       cantáremos   cantar  VMSF1P0  fuéremos  ser    VASF1P0
                       cantareis    cantar  VMSF2P0  fuereis   ser    VASF2P0
                       cantaren     cantar  VMSF3P0  fueren    ser    VASF3P0
Gerund                 cantando     cantar  VMG0000  siendo    ser    VAG0000
Imperative             canta        cantar  VMMP2S0  sé        ser    VAMP2S0
                       cante        cantar  VMMP3S0  sea       ser    VAMP3S0
                       cantemos     cantar  VMMP1P0  seamos    ser    VAMP1P0
                       cantad       cantar  VMMP2P0  sed       ser    VAMP2P0
                       canten       cantar  VMMP3P0  sean      ser    VAMP3P0
Infinitive             cantar       cantar  VMN0000  ser       ser    VAN0000

The tagging of verbs follows the Parole criteria, except for one important point: the treatment of past participles, which have been moved to a new category, T (see below).
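Because each character of a tag encodes one attribute, a verb tag can be unpacked position by position. The sketch below is one possible way to do this in Python, using the codes from the verb table; the name decode_verb_tag is hypothetical, and the use of 0 as "not applicable" for person, number and gender is inferred from the example tags (such as VMG0000) rather than stated in the table.

```python
# One (attribute, code -> value) pair per tag position, in order.
# A value of None means "not applicable"; apart from tense, the 0 codes
# are an assumption inferred from the example tags.
VERB_SCHEMA = [
    ("category", {"V": "verb"}),
    ("type", {"M": "main", "A": "auxiliary", "S": "semiauxiliary"}),
    ("mood", {"I": "indicative", "S": "subjunctive", "M": "imperative",
              "N": "infinitive", "G": "gerund"}),
    ("tense", {"P": "present", "I": "imperfect", "F": "future",
               "S": "past", "C": "conditional", "0": None}),
    ("person", {"1": "first", "2": "second", "3": "third", "0": None}),
    ("number", {"S": "singular", "P": "plural", "0": None}),
    ("gender", {"M": "masculine", "F": "feminine", "0": None}),
]


def decode_verb_tag(tag: str) -> dict:
    """Map a seven-character verb tag to its applicable features."""
    if len(tag) != len(VERB_SCHEMA):
        raise ValueError(f"expected {len(VERB_SCHEMA)} characters: {tag!r}")
    features = {}
    for code, (attribute, codes) in zip(tag, VERB_SCHEMA):
        if code not in codes:
            raise ValueError(f"unknown {attribute} code {code!r} in {tag!r}")
        if codes[code] is not None:
            features[attribute] = codes[code]
    return features


print(decode_verb_tag("VMIP1S0"))  # canto
print(decode_verb_tag("VAN0000"))  # ser
```

A table-driven schema like this keeps the codes in one place, so the same loop could be reused for other categories by swapping in a different schema.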

7. Participles

Position  Attribute  Value          Code
1         Category   Participle     T
2         Type       Main           M
                     Auxiliary      A
                     Semiauxiliary  S
3         Gender     Masculine      M
                     Feminine       F
                     Common         C
4         Number     Singular       S
                     Plural         P
                     Invariable     N

Examples:

Form      Lemma   Tag
cantado   cantar  TMMS
cantada   cantar  TMFS
cantados  cantar  TMMP
cantadas  cantar  TMFP
estado    estar   TAMS

As mentioned above, all elements with participial morphology, originally included in the categories of adjectives and verbs in Parole, have been brought together within this new category. The reason for this change is that our tagging system must be able to cover the early stages of a Romance language, Spanish, in which it is not always easy to determine whether an element with the morphology -ado/ada/ados/adas behaves like a verb or like an adjective. We have decided to unify their treatment, so it is important to note that all uses of these elements (even when their function is clearly verbal or adjectival according to traditional criteria) are labeled as T.
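Since every participial form carries a T tag, a single pattern over the tag is enough to pick these forms out of tagged output. The following Python sketch, built only from the participle table above, checks whether a tag is a well-formed T tag and splits it into its codes; the names PARTICIPLE_TAG and parse_participle_tag are hypothetical.

```python
import re

# T + type (M/A/S) + gender (M/F/C) + number (S/P/N), as in the table.
PARTICIPLE_TAG = re.compile(
    r"T(?P<type>[MAS])(?P<gender>[MFC])(?P<number>[SPN])"
)


def parse_participle_tag(tag: str):
    """Return the type, gender and number codes, or None if not a T tag."""
    match = PARTICIPLE_TAG.fullmatch(tag)
    return match.groupdict() if match else None


print(parse_participle_tag("TMFS"))     # cantada
print(parse_participle_tag("VMIP1S0"))  # a finite verb, so None
```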

8. Pronouns

Position  Attribute   Value          Code
1         Category    Pronoun        P
2         Type        Personal       P
                      Demonstrative  D
                      Possessive     X
                      Indefinite     I
                      Interrogative  T
                      Relative       R
                      Numeral        N
                      Exclamative    E
3         Person      First          1
                      Second         2
                      Third          3
4         Gender      Masculine      M
                      Feminine       F
                      Common         C
                      Neuter         N
5         Number      Singular       S
                      Plural         P
                      Invariable     N
6         Case        Nominative     N
                      Accusative     A
                      Dative         D
                      Oblique        O
7         Possessor   Singular       S
                      Plural         P
8         Politeness  Polite         P

In this category, a very significant change has been introduced compared to the original Parole tagging system: the category of personal pronouns has been divided into personal pronouns (commonly referred to as 'strong pronouns') and clitics (or 'weak pronouns'). The clitics now form a separate category (L). The following table shows the new distribution. Please note that not all elements that belong to the two categories are represented here.

Personal Pronouns (PP): yo, mí, nosotros, nosotras, conmigo, ti, usted, ustedes, vos, vosotras, vosotros, contigo, él, ella, ellas, ello, ellos

Clitics (L): me, nos, te, os, le, las, lo, los, les, se, y (archaic), en (archaic)

9. Clitics

Position  Attribute  Value                                Code
1         Category   Clitic                               L
2         Person     First                                1
                     Second                               2
                     Third                                3
                     Neuter                               0
3         Gender     Masculine                            M
                     Feminine                             F
                     Common                               C
                     Neuter                               N
4         Number     Singular                             S
                     Plural                               P
                     Neuter                               N
5         Case       Accusative                           A
                     Dative                               D
                     Other (different forms of SE, etc.)  O

Examples:

Form                                Lemma  Tag
me (me preocupo)                    me     L1CSO
se (... se vino a la corte)         se     L3CNO
se (...conseio de se defender ...)  se     L0CNO
lo (lo vio)                         lo     L3MSA
lo (lo siento mucho)                lo     L3CNA
os (os dio un caballo)              te     L2CPD
las (las vio)                       lo     L3FPA

All clitics (weak pronouns) are now placed within this new category L, which did not exist in the original Parole system. Clitics can be an important object of study in diachronic research. In the Parole system, they are not clearly distinguished from the rest of the personal pronouns, and the length of the resulting tags sometimes makes working with them cumbersome. To simplify this, we have created this new category, even though it means that a few elements that are pronouns now fall outside the 'pronouns' category. The lemma is always the masculine singular form of the clitic (for example, las has the lemma lo).
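The short L tags make it straightforward to group clitics by lemma or by case. As a minimal sketch, the Python snippet below counts the example tokens listed above; the variable and function names are hypothetical, and a real study would read (form, lemma, tag) triples from corpus output instead of a hard-coded list.

```python
from collections import Counter

# (form, lemma, tag) triples copied from the clitic examples above.
TOKENS = [
    ("me", "me", "L1CSO"),
    ("se", "se", "L3CNO"),
    ("se", "se", "L0CNO"),
    ("lo", "lo", "L3MSA"),
    ("lo", "lo", "L3CNA"),
    ("os", "te", "L2CPD"),
    ("las", "lo", "L3FPA"),
]

# Case codes from position 5 of the clitic table.
CASE_CODES = {"A": "accusative", "D": "dative", "O": "other"}


def clitic_case(tag: str) -> str:
    """Return the case encoded in the fifth character of an L tag."""
    if len(tag) != 5 or not tag.startswith("L"):
        raise ValueError(f"not a clitic tag: {tag!r}")
    return CASE_CODES[tag[4]]


by_lemma = Counter(lemma for _, lemma, tag in TOKENS if tag.startswith("L"))
by_case = Counter(clitic_case(tag) for _, _, tag in TOKENS)

print(by_lemma)
print(by_case)
```

Counting by lemma rather than by form groups os with te and las with lo, following the lemmatization shown in the examples.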

10. Conjunctions

Position  Attribute  Value          Code
1         Category   Conjunction    C
2         Type       Coordinating   C
                     Subordinating  S

There are no changes to the original Parole tagging system.

11. Interjections

Position  Attribute  Value         Code
1         Category   Interjection  I

There are no changes to the original Parole tagging system.

12. Prepositions

Position  Attribute  Value        Code
1         Category   Adposition   S
2         Type       Preposition  P
3         Form       Simple       S
                     Complex      C
4         Gender     -            0
5         Number     -            0

There are no changes to the original Parole tagging system.

13. Punctuation marks

Position  Attribute  Value             Code
1         Category   Punctuation mark  F

There are no changes to the original Parole tagging system.

14. Numbers

Position  Attribute  Value     Code
1         Category   Number    Z
2         Type       "Millar"  d

There are no changes to the original Parole tagging system.