OLDES
This corpus is developed under the project ONTOSEM (FFI2010-15006).
Corpus description
This diachronic corpus of Spanish, developed by Cristina Sánchez-Marco, comprises 674 texts spanning the 12th to the 20th century, with a total of 44,470,288 words. The texts are sourced from open repositories and collections available in various university libraries. Efforts have been made to ensure that the texts are representative and balanced, both across genres and across the centuries covered. The genres represented are poetry, history, laws, didactics, prose, religion, medicine, letters, theater, and press. The objective of the project is to develop a tool that helps understand linguistic change in general and the evolution of Peninsular Spanish in particular; therefore, all texts belong to that variety.
Tagging of the corpus
The corpus has been tagged with an existing linguistic analyzer, Freeling. This tool is designed for contemporary Spanish, and it was necessary to expand it to handle the lexical and orthographic variability of a diachronic corpus. Here is how this process was carried out.
Freeling uses the Parole tagset, developed by the EAGLES group as a common tool for the computerized processing of various European languages. In some respects, however, the tags used in this corpus differ from the original Parole ones. Changes have been introduced for two main reasons: first, to adapt the tagset to the different stages of Spanish; and second, to add new tags so that the analysis result can serve as suitable input for a syntactic analyzer, which is the next step of this project. On this page you can consult the list of tags used in this corpus. If you are already familiar with the Parole system, please refer to the list to see the changes that have been introduced.
Consulting the corpus
The corpus can be accessed through the Corpus Query Processor (CQP), both in its command-line version and in its graphical interface, CQPweb. In either case, CQP supports searches by complex patterns using positional attributes, which relate to a single item (word, lemma, tag), or structural attributes, which relate to phrases (length, position in the text...) or texts (date, title, author, genre...). The available attributes can be consulted here. These attributes can be used to refine searches through the 'Restricted Query' option, instead of 'Standard Query', in the CQPweb interface. In the current state of development, the CQPweb interface allows automatically restricting searches by century, genre, and author.
CQPweb allows using the CQP syntax, as you would in a terminal, or in a simplified language called 'Simple Query'. The manual for CQP, with the description of its syntax, can be found here, and a brief guide for conducting searches using 'Simple Query' is available here. This corpus can be accessed using the "guest" account on CQPweb.
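As an illustration of searching by positional attributes, the following minimal Python sketch assembles CQP token expressions as strings. It assumes that the positional attributes are named `word`, `lemma`, and `tag` in this corpus (the actual names should be checked in the attribute list), and the helper names `token` and `sequence` are hypothetical. The tag patterns use codes from the tagset described on this page.

```python
def token(**constraints):
    """Return one CQP token expression, e.g. [lemma="cantar" & tag="T.*"]."""
    if not constraints:
        return "[]"  # CQP's expression for "any token"
    parts = [f'{attr}="{regex}"' for attr, regex in constraints.items()]
    return "[" + " & ".join(parts) + "]"


def sequence(*tokens):
    """Join token expressions into a query over adjacent corpus positions."""
    return " ".join(tokens)


# A clitic immediately followed by an infinitive (mood code N in position 3).
clitic_infinitive = sequence(token(tag="L.*"), token(tag="V.N.*"))
print(clitic_infinitive)

# Any form of the lemma "cantar" tagged as a participle.
print(token(lemma="cantar", tag="T.*"))
```

The resulting strings can be pasted into the CQP syntax query box of CQPweb; in the command-line version of CQP, a query must additionally end with a semicolon.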
Tagset
1. Adjectives
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Adjective | A |
| 2 | Type | Qualificative | Q |
| | | Ordinal | O |
| | | Possessive | X |
| 3 | Degree | - | 0 |
| | | Appreciative | A |
| 4 | Gender | Masculine | M |
| | | Feminine | F |
| | | Common | C |
| 5 | Number | Singular | S |
| | | Plural | P |
| | | Invariable | N |
| 6 | Person | - | 0 |
| | | First | 1 |
| | | Second | 2 |
| | | Third | 3 |
| 7 | Possessor | - | 0 |
| | | Singular | 3 |
| | | Plural | P |
Qualificative and ordinal adjectives are tagged following the Parole system. However, there is a new category: possessive adjectives (AX). This includes words labeled as Possessive Pronouns in the original Parole. The reasons for this are primarily distributional: Freeling tags very frequent sequences like la mía with a null nominal category, as determiner + pronoun; however, these combinations should in principle be avoided in the language (*el las, *el tú), and a syntactic analyzer should be able to recognize the ungrammaticality of such a sequence. Since distributionally these elements (mío, tuya, nuestra...) behave like adjectives (and it could even be argued that semantically they are), we have made the decision to tag them as possessive adjectives.
Another change to bear in mind concerns elements with participial morphology (-ado/ada/ados/adas): they all now form a separate new category (T; see below).
2. Adverbs
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Adverb | R |
| 2 | Type | General | G |
| | | Negative | N |
There are no changes to the original Parole tagging system.
3. Determiners
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Determiner | D |
| 2 | Type | Demonstrative | D |
| | | Possessive | P |
| | | Interrogative | T |
| | | Exclamative | E |
| | | Indefinite | I |
| | | Article | A |
| | | Relative | R |
| | | Numeral | N |
| 3 | Person | First | 1 |
| | | Second | 2 |
| | | Third | 3 |
| 4 | Gender | Masculine | M |
| | | Feminine | F |
| | | Common | C |
| | | Neuter | N |
| 5 | Number | Singular | S |
| | | Plural | P |
| | | Invariable | N |
| 6 | Possessor | Singular | S |
| | | Plural | P |
There are no changes to the original Parole tagging system.
4. Quantifiers
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Quantifier | Q |
| 2 | Gender | Masculine | M |
| | | Feminine | F |
| | | Common | C |
| 3 | Number | Singular | S |
| | | Plural | P |
| | | Invariable | N |
This category did not exist in the original Parole tagging system. It includes, as lemmas, mucho, todo, and cada. Despite their differences, these three elements share common features:
- They can license a nominal phrase: cada niño, todo niño, muchos niños.
- Some can combine with a determiner: todos los niños, los muchos niños.
- Some can behave as predicative adjectives or adverbs: son muchos, me gusta mucho.
- They are very frequent.
- Both traditional grammar and the Parole tagging system have difficulties treating them, often resorting to assigning them multiple grammatical categories (D, P, A, Adv...).
The Q tag is intended to provide a unified treatment for them and to generate suitable input for the syntactic analyzer. It's important to note that all occurrences of todo, cada, and mucho are analyzed as quantifiers. However, the exact definition of this category and the extension of the Q tag to other similar elements will be defined in the next phase of the project.
5. Nouns
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Noun | N |
| 2 | Type | Common | C |
| | | Proper | P |
| 3 | Gender | Masculine | M |
| | | Feminine | F |
| | | Common | C |
| 4 | Number | Singular | S |
| | | Plural | P |
| | | Invariable | N |
| 5-6 | Semantic classification | Being-Person | SP |
| | | Organization | O0 |
| | | Place | G0 |
| 7 | Degree | Appreciative | A |
There are no changes to the original Parole tagging system.
6. Verbs
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Verb | V |
| 2 | Type | Main | M |
| | | Auxiliary | A |
| | | Semiauxiliary | S |
| 3 | Mood | Indicative | I |
| | | Subjunctive | S |
| | | Imperative | M |
| | | Infinitive | N |
| | | Gerund | G |
| 4 | Tense | Present | P |
| | | Imperfect | I |
| | | Future | F |
| | | Past | S |
| | | Conditional | C |
| | | - | 0 |
| 5 | Person | First | 1 |
| | | Second | 2 |
| | | Third | 3 |
| 6 | Number | Singular | S |
| | | Plural | P |
| 7 | Gender | Masculine | M |
| | | Feminine | F |
[Table: examples of complete verbal paradigms]
The tagging of verbs follows the Parole criteria, except for one important point: the treatment of past participles, which have been moved to a new category, T (see below).
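Because each position of a verb tag encodes one attribute, a search pattern for a set of features can be derived mechanically from the verb table. The sketch below is a hypothetical helper, not part of the corpus tools: it builds a regular expression over the tag, leaving a `.` wildcard for every attribute that is not specified. It assumes that every verb tag fills all seven positions.

```python
# Codes copied from the verb table, in tag order (positions 2 to 7).
VERB_POSITIONS = ["type", "mood", "tense", "person", "number", "gender"]
VERB_CODES = {
    "type": {"main": "M", "auxiliary": "A", "semiauxiliary": "S"},
    "mood": {"indicative": "I", "subjunctive": "S", "imperative": "M",
             "infinitive": "N", "gerund": "G"},
    "tense": {"present": "P", "imperfect": "I", "future": "F",
              "past": "S", "conditional": "C", "none": "0"},
    "person": {"first": "1", "second": "2", "third": "3"},
    "number": {"singular": "S", "plural": "P"},
    "gender": {"masculine": "M", "feminine": "F"},
}


def verb_tag_pattern(**features):
    """Return a regex over verb tags; unspecified positions become '.'."""
    unknown = set(features) - set(VERB_POSITIONS)
    if unknown:
        raise ValueError(f"unknown verb attributes: {sorted(unknown)}")
    pattern = "V"
    for position in VERB_POSITIONS:
        if position in features:
            pattern += VERB_CODES[position][features[position]]
        else:
            pattern += "."
    return pattern


# Imperfect subjunctive forms of any verb type, person, and number.
print(verb_tag_pattern(mood="subjunctive", tense="imperfect"))
```

A pattern produced this way can be used as the value of the tag attribute in a CQP query.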
7. Participles
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Participle | T |
| 2 | Type | Main | M |
| | | Auxiliary | A |
| | | Semiauxiliary | S |
| 3 | Gender | Masculine | M |
| | | Feminine | F |
| | | Common | C |
| 4 | Number | Singular | S |
| | | Plural | P |
| | | Invariable | N |
Examples:
| Form | Lemma | Tag |
|---|---|---|
| cantado | cantar | TMMS |
| cantada | cantar | TMFS |
| cantados | cantar | TMMP |
| cantadas | cantar | TMFP |
| estado | estar | TAMS |
As mentioned above, all elements with participial morphology, originally included in the categories of adjectives and verbs in Parole, have been brought together within this new category. The reason for this change is that our tagging system must be able to cover the early stages of a Romance language, Spanish, in which it is not always easy to determine whether an element with the morphology -ado/ada/ados/adas behaves like a verb or like an adjective. We have decided to unify their treatment, so it is important to note that all uses of these elements (even when their function is clearly verbal or adjectival according to traditional criteria) are labeled as T.
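To make the relation between the two systems concrete, the following sketch maps a verbal participle tag in the original layout onto the new T tag. It is only an illustration and not the procedure used to build the corpus. It assumes that the original tag has seven positions in the order given in the verb table and marks participles with `P` in the mood position (for example `VMP00SM`, the form found in standard Freeling output); the function name is hypothetical. Participles originally tagged as adjectives are not covered.

```python
def parole_participle_to_t(parole_tag):
    """Map an original-style participle tag such as VMP00SM to a T tag."""
    if len(parole_tag) != 7 or parole_tag[0] != "V" or parole_tag[2] != "P":
        raise ValueError(f"not a verbal participle tag: {parole_tag!r}")
    verb_type = parole_tag[1]  # M, A or S, kept unchanged
    number = parole_tag[5]     # position 6 of the verb tag
    gender = parole_tag[6]     # position 7 of the verb tag
    # In the T tag, gender comes before number.
    return "T" + verb_type + gender + number


for form, tag in [("cantado", "VMP00SM"), ("cantadas", "VMP00PF"),
                  ("estado", "VAP00SM")]:
    print(form, parole_participle_to_t(tag))
```

Under these assumptions the three forms receive TMMS, TMFP, and TAMS, which agrees with the examples in the table above.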
8. Pronouns
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Pronoun | P |
| 2 | Type | Personal | P |
| | | Demonstrative | D |
| | | Possessive | X |
| | | Indefinite | I |
| | | Interrogative | T |
| | | Relative | R |
| | | Numeral | N |
| | | Exclamative | E |
| 3 | Person | First | 1 |
| | | Second | 2 |
| | | Third | 3 |
| 4 | Gender | Masculine | M |
| | | Feminine | F |
| | | Common | C |
| | | Neuter | N |
| 5 | Number | Singular | S |
| | | Plural | P |
| | | Invariable | N |
| 6 | Case | Nominative | N |
| | | Accusative | A |
| | | Dative | D |
| | | Oblique | O |
| 7 | Possessor | Singular | S |
| | | Plural | P |
| 8 | Politeness | Polite | P |
In this category, a very significant change has been introduced compared to the original Parole tagging system: the category of personal pronouns has been divided into personal pronouns (commonly referred to as 'strong pronouns') and clitics (or 'weak pronouns'). The clitics now form a separate category (L). The following table shows the new distribution. Please note that not all elements that belong to the two categories are represented here.
| Personal Pronouns (PP) | Clitics (L) |
|---|---|
| yo, mi, nosotros, nosotras, conmigo, ti, tú, usted, ustedes, vos, vosotras, vosotros, contigo, él, ella, ellas, ello, ellos | me, nos, te, os, le, las, lo, lo, los, les, se, y (archaic), en (archaic) |
9. Clitics
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Clitic | L |
| 2 | Person | First | 1 |
| | | Second | 2 |
| | | Third | 3 |
| | | Neuter | 0 |
| 3 | Gender | Masculine | M |
| | | Feminine | F |
| | | Common | C |
| | | Neuter | N |
| 4 | Number | Singular | S |
| | | Plural | P |
| | | Neuter | N |
| 5 | Case | Accusative | A |
| | | Dative | D |
| | | Other (different forms of SE, etc.) | O |
Examples:
| Form | Lemma | Tag |
|---|---|---|
| me (me preocupo) | me | L1CSO |
| se (... se vino a la corte) | se | L3CNO |
| se (...conseio de se defender ...) | se | L0CNO |
| lo (lo vio) | lo | L3MSA |
| lo (lo siento mucho) | lo | L3CNA |
| os (os dio un caballo) | te | L2CPD |
| las (las vio) | lo | L3FPA |
All clitics (weak pronouns) are now placed within this new category L, which did not exist in the original Parole system. Clitics can be an important object of study in diachronic research. In the Parole system, they are not clearly distinguished from the rest of the personal pronouns, and the length of the resulting labels sometimes makes working with them cumbersome. To simplify this, we have created this new category, even though it means that a few elements that are pronouns are now outside the 'pronouns' category. The lemmas are always the singular masculine version of each clitic.
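Since clitic tags are short and strictly positional, they can be unpacked with a simple lookup. The sketch below is a hypothetical helper that decodes a five-character L tag using the codes listed in the clitic table; it is not part of the corpus tools.

```python
# One (attribute, code-to-value) pair per tag position after the leading "L".
CLITIC_FIELDS = [
    ("person", {"1": "first", "2": "second", "3": "third", "0": "neuter"}),
    ("gender", {"M": "masculine", "F": "feminine", "C": "common", "N": "neuter"}),
    ("number", {"S": "singular", "P": "plural", "N": "neuter"}),
    ("case", {"A": "accusative", "D": "dative", "O": "other"}),
]


def decode_clitic_tag(tag):
    """Return a dict of attribute values for a clitic tag such as L3MSA."""
    if len(tag) != 5 or tag[0] != "L":
        raise ValueError(f"not a clitic tag: {tag!r}")
    decoded = {}
    for code, (name, values) in zip(tag[1:], CLITIC_FIELDS):
        if code not in values:
            raise ValueError(f"unknown {name} code {code!r} in {tag!r}")
        decoded[name] = values[code]
    return decoded


print(decode_clitic_tag("L3MSA"))  # tag of "lo" in "lo vio"
print(decode_clitic_tag("L2CPD"))  # tag of "os" in "os dio un caballo"
```

Keeping the codes in a list ordered by position means that the same loop would work for any other category whose table is transcribed in the same way.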
10. Conjunctions
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Conjunction | C |
| 2 | Type | Coordinating | C |
| | | Subordinating | S |
There are no changes to the original Parole tagging system.
11. Interjections
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Interjection | I |
There are no changes to the original Parole tagging system.
12. Prepositions
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Adposition | S |
| 2 | Type | Preposition | P |
| 3 | Form | Simple | S |
| | | Complex | C |
| 4 | Gender | - | 0 |
| 5 | Number | - | 0 |
There are no changes to the original Parole tagging system.
13. Punctuation marks
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Punctuation mark | F |
There are no changes to the original Parole tagging system.
14. Numbers
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Number | Z |
| 2 | Type | "Millar" | d |
There are no changes to the original Parole tagging system.