OLDCA
This corpus was initially developed under the Spanish national project FFI2010-15006 and extended under project FFI2013-41301-P.
Corpus description
This diachronic corpus of Catalan includes 222 texts from the 11th century to the 17th century containing a total of 5,020,237 words. The selection of texts has been made based on criteria of representativeness in terms of genre, date, and geographical origin.
Tagging of the corpus
The corpus has been tagged with an existing linguistic analyzer, Freeling. This tool is designed for contemporary Catalan, so it had to be extended to handle the lexical and orthographic variability of a diachronic corpus. To do so, we applied the technique that Cristina Sánchez-Marco developed to adapt the Freeling version for contemporary Spanish to a diachronic corpus. Here is how this process was carried out.
Freeling uses the Parole tagset, developed by the EAGLES group as a common tool for the computerized processing of various European languages. However, the tags used in this corpus differ from the original Parole ones in some respects. Changes have been introduced for two main reasons: first, to adapt the tagset to the different stages of Catalan; and second, to make the output of the analysis suitable input for a syntactic analyzer, which is the next step of this project. On this page you can consult the list of tags used in this corpus. If you are already familiar with the Parole system, please refer to the list to see the changes that have been introduced.
Consulting the corpus
The corpus can be accessed through the Corpus Query Processor (CQP), either in its command-line version or through its graphical interface, CQPweb. In either case, CQP supports searches by complex patterns using positional attributes, which relate to a single item (word, lemma, tag), and structural attributes, which relate to phrases (length, position in the text...) or texts (date, title, author, genre...). The available attributes can be consulted here. These attributes can be used to refine searches through the 'Restricted Query' option, instead of 'Standard Query', in the CQPweb interface. In the current state of development, the CQPweb interface allows automatically restricting searches by century, genre, and author.
CQPweb accepts queries either in CQP syntax, as you would use in a terminal, or in a simplified language called 'Simple Query'. The manual for CQP, with the description of its syntax, can be found here, and a brief guide for conducting searches using 'Simple Query' is available here. This corpus can be accessed using the "guest" account on CQPweb.
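As a rough illustration of the CQP syntax mentioned above, the following Python sketch assembles query strings that could be pasted into the CQP-syntax box of CQPweb. The attribute names `word`, `lemma`, and `pos` are assumptions made for the example; the names actually used in this corpus are given in the attribute list. The helpers `token` and `sequence` are hypothetical and not part of CQP or CQPweb.

```python
def token(**constraints: str) -> str:
    """Build one CQP token expression, e.g. [lemma="cantar" & pos="T.*"]."""
    if not constraints:
        return "[]"  # CQP's wildcard token: matches any single token
    parts = [f'{attr}="{value}"' for attr, value in constraints.items()]
    return "[" + " & ".join(parts) + "]"


def sequence(*tokens: str) -> str:
    """Join token expressions into a query over adjacent tokens.

    The command-line cqp expects a terminating ';' after the query,
    which is deliberately not added here.
    """
    return " ".join(tokens)


# Any participle (category T) whose lemma is "cantar".
participles_of_cantar = sequence(token(lemma="cantar", pos="T.*"))

# An article followed by a possessive adjective (AX), as in "la meva".
article_possessive = sequence(token(pos="DA.*"), token(pos="AX.*"))

print(participles_of_cantar)
print(article_possessive)
```

Values inside the quotes are regular expressions in CQP, which is why a prefix such as `T.*` selects every tag of one category.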
Tagset
1. Adjectives
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Adjective | A |
| 2 | Type | Qualificative | Q |
| | | Ordinal | O |
| | | Possessive | X |
| 3 | Degree | - | 0 |
| | | Appreciative | A |
| 4 | Gender | Masculine | M |
| | | Feminine | F |
| | | Common | C |
| 5 | Number | Singular | S |
| | | Plural | P |
| | | Invariable | N |
| 6 | Person | - | 0 |
| | | First | 1 |
| | | Second | 2 |
| | | Third | 3 |
| 7 | Possessor | - | 0 |
| | | Singular | 3 |
| | | Plural | P |
Qualificative and ordinal adjectives are tagged following the Parole system. However, there is a new category: possessive adjectives (AX). This includes words labeled as Possessive Pronouns in the original Parole. The reasons for this are primarily distributional. Freeling tags very frequent sequences like la mía as determiner + pronoun, with a null nominal category. Such determiner + pronoun combinations are in principle ungrammatical (*el les, *el tu), and a syntactic analyzer should be able to recognize the ungrammaticality of such a sequence. Since these elements (meva, teva, nostra...) behave distributionally like adjectives (and it could even be argued that they are adjectives semantically), we have decided to tag them as possessive adjectives.
Another change to note concerns elements with participial morphology (-at/ada/ats/ades): they all now form a separate new category (T; see below).
2. Adverbs
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Adverb | R |
| 2 | Type | General | G |
| | | Negative | N |
There are no changes to the original Parole tagging system.
3. Determiners
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Determiner | D |
| 2 | Type | Demonstrative | D |
| | | Possessive | P |
| | | Interrogative | T |
| | | Exclamative | E |
| | | Indefinite | I |
| | | Article | A |
| | | Relative | R |
| | | Numeral | N |
| 3 | Person | First | 1 |
| | | Second | 2 |
| | | Third | 3 |
| 4 | Gender | Masculine | M |
| | | Feminine | F |
| | | Common | C |
| | | Neuter | N |
| 5 | Number | Singular | S |
| | | Plural | P |
| | | Invariable | N |
| 6 | Possessor | Singular | S |
| | | Plural | P |
Above, we have introduced the possessive adjective category (AX). This category has not entirely replaced the possessive determiner category. Cases that are not preceded by an article and are followed by a noun continue to be labeled as possessive determiners (ma, ton, sa, nostre...).
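The contextual rule just described can be sketched in Python as follows. This is only an illustration under stated assumptions: the source gives the outcome of the rule, not how the tagger implements it, so the function name `possessive_category`, the use of neighbouring tags as input, and the fallback to AX in every other context are all hypothetical. The rule relies only on the tag prefixes from the tables (`DA` for articles, `N` for nouns).

```python
from typing import Optional


def possessive_category(prev_tag: Optional[str], next_tag: Optional[str]) -> str:
    """Return "DP" (possessive determiner) or "AX" (possessive adjective).

    prev_tag / next_tag are the tags of the neighbouring tokens,
    or None at a sentence boundary.
    """
    preceded_by_article = prev_tag is not None and prev_tag.startswith("DA")
    followed_by_noun = next_tag is not None and next_tag.startswith("N")
    # Stated rule: no article before, noun after -> possessive determiner.
    if not preceded_by_article and followed_by_noun:
        return "DP"
    # Assumption: every other context falls back to the new AX category.
    return "AX"


# "ma casa": no article before, noun after (tag strings are illustrative).
print(possessive_category(None, "NCFS000"))
# "la meva casa": article before the possessive.
print(possessive_category("DA0FS0", "NCFS000"))
```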
4. Quantifiers
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Quantifier | Q |
| 2 | Gender | Masculine | M |
| | | Feminine | F |
| | | Common | C |
| 3 | Number | Singular | S |
| | | Plural | P |
| | | Invariable | N |
This category did not exist in the original Parole tagging system. It includes, as lemmas, molt, tot, and cada. Despite their differences, these three elements share common features:
- They can license a nominal phrase: cada nen, tot nen, molts nens.
- Some can combine with a determiner: tots els nens, els molts nens.
- Some can behave as predicative adjectives or adverbs: són molts, m'agrada molt.
- They are very frequent.
- Both traditional grammar and the Parole tagging system have difficulties treating them, often resorting to assigning them multiple grammatical categories (D, P, A, Adv...).
The Q tag is intended to provide a unified treatment for them and to generate suitable input for the syntactic analyzer. Note that all occurrences of tot, cada, and molt are analyzed as quantifiers. However, the exact definition of this category and the extension of the Q tag to other similar elements will be settled in the next phase of the project.
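Because the Q category is defined by lemma, its tags can be assembled mechanically from the quantifier table. The sketch below is a minimal illustration, not the project's code: `quantifier_tag` and `QUANTIFIER_LEMMAS` are hypothetical names, and the gender and number values passed in the examples are illustrative rather than taken from the corpus.

```python
from typing import Optional

# Lemmas that the tagset description assigns to the Q category.
QUANTIFIER_LEMMAS = frozenset({"molt", "tot", "cada"})

GENDER_CODES = {"M", "F", "C"}   # Masculine, Feminine, Common
NUMBER_CODES = {"S", "P", "N"}   # Singular, Plural, Invariable


def quantifier_tag(lemma: str, gender: str, number: str) -> Optional[str]:
    """Return a three-position Q tag, or None if the lemma is not a quantifier."""
    if lemma not in QUANTIFIER_LEMMAS:
        return None
    if gender not in GENDER_CODES or number not in NUMBER_CODES:
        raise ValueError(f"invalid gender/number codes: {gender!r}, {number!r}")
    return f"Q{gender}{number}"


print(quantifier_tag("molt", "M", "P"))  # e.g. the form "molts"
print(quantifier_tag("tot", "F", "P"))   # e.g. the form "totes"
```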
5. Nouns
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Noun | N |
| 2 | Type | Common | C |
| | | Proper | P |
| 3 | Gender | Masculine | M |
| | | Feminine | F |
| | | Common | C |
| 4 | Number | Singular | S |
| | | Plural | P |
| | | Invariable | N |
| 5-6 | Semantic classification | Being-Person | SP |
| | | Organization | O0 |
| | | Place | G0 |
| 7 | Degree | Appreciative | A |
The table above shows the original Parole tagging system. The values for semantic classification and grammatical degree are not yet implemented. For now, all proper names receive the label NP00000.
6. Verbs
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Verb | V |
| 2 | Type | Main | M |
| | | Auxiliary | A |
| | | Semiauxiliary | S |
| 3 | Mode | Indicative | I |
| | | Subjunctive | S |
| | | Imperative | M |
| | | Infinitive | N |
| | | Gerund | G |
| 4 | Tense | Present | P |
| | | Imperfect | I |
| | | Future | F |
| | | Past | S |
| | | Conditional | C |
| | | - | 0 |
| 5 | Person | First | 1 |
| | | Second | 2 |
| | | Third | 3 |
| 6 | Number | Singular | S |
| | | Plural | P |
| 7 | Gender | Masculine | M |
| | | Feminine | F |
Examples of complete verbal paradigms: [paradigm tables not preserved]
The tagging of verbs follows the Parole criteria, except for one important point: the treatment of past participles, which have been moved to a new category, T (see below).
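Since every tag is positional, a verb tag can be decoded by reading one character per row group of the verb table. The following Python sketch shows the idea; `decode_verb_tag` and `VERB_POSITIONS` are hypothetical names. It assumes that verb tags are always seven characters long and that `0` marks any position that does not apply (the table lists `0` explicitly only for tense). The sample tags follow the usual Freeling layout and are illustrative.

```python
from typing import Dict, Optional

# One (attribute, code -> value) pair per tag position after the leading "V".
VERB_POSITIONS = [
    ("type", {"M": "Main", "A": "Auxiliary", "S": "Semiauxiliary"}),
    ("mode", {"I": "Indicative", "S": "Subjunctive", "M": "Imperative",
              "N": "Infinitive", "G": "Gerund"}),
    ("tense", {"P": "Present", "I": "Imperfect", "F": "Future",
               "S": "Past", "C": "Conditional"}),
    ("person", {"1": "First", "2": "Second", "3": "Third"}),
    ("number", {"S": "Singular", "P": "Plural"}),
    ("gender", {"M": "Masculine", "F": "Feminine"}),
]


def decode_verb_tag(tag: str) -> Dict[str, Optional[str]]:
    """Decode a seven-character verb tag into named features."""
    if len(tag) != 7 or tag[0] != "V":
        raise ValueError(f"not a verb tag: {tag!r}")
    features: Dict[str, Optional[str]] = {}
    for (name, codes), code in zip(VERB_POSITIONS, tag[1:]):
        if code == "0":
            features[name] = None  # assumption: "0" means "not applicable"
        elif code in codes:
            features[name] = codes[code]
        else:
            raise ValueError(f"unknown {name} code {code!r} in {tag!r}")
    return features


print(decode_verb_tag("VMIP3S0"))
```

The same table-driven approach carries over to the other categories by swapping in the corresponding list of positions.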
7. Participles
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Participle | T |
| 2 | Type | Main | M |
| | | Auxiliary | A |
| | | Semiauxiliary | S |
| 3 | Gender | Masculine | M |
| | | Feminine | F |
| | | Common | C |
| 4 | Number | Singular | S |
| | | Plural | P |
| | | Invariable | N |
Examples:
| Form | Lemma | Tag |
|---|---|---|
| cantat | cantar | TMMS |
| cantada | cantar | TMFS |
| cantats | cantar | TMMP |
| cantades | cantar | TMFP |
| estat | estar | TAMS |
As mentioned above, all elements with participial morphology, originally included in the categories of adjectives and verbs in Parole, have been brought together within this new category. The reason for this change is that our tagging system must be able to cover the early stages of a Romance language, Catalan, in which it is not always easy to determine whether an element with the morphology -at/ada/ats/ades behaves like a verb or like an adjective. We have decided to unify their treatment, so it is important to note that all uses of these elements (even when their function is clearly verbal or adjectival according to traditional criteria) are labeled as T.
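To make the regrouping concrete, the sketch below maps a Parole-style participle verb tag onto the four-position T tag. It is an illustration only: the source does not describe how the conversion was implemented, and `parole_participle_to_t` is a hypothetical name. It assumes the input layout commonly produced by Freeling for participles (`V`, type, `P`, `0`, `0`, number, gender, as in `VMP00SM`) and does not cover participles that Parole tags as adjectives.

```python
def parole_participle_to_t(tag: str) -> str:
    """Map a Parole-style participle verb tag (e.g. VMP00SM) to a T tag."""
    if len(tag) != 7 or tag[0] != "V" or tag[2] != "P":
        raise ValueError(f"not a Parole participle verb tag: {tag!r}")
    verb_type = tag[1]   # M, A or S: kept as position 2 of the T tag
    number = tag[5]      # S or P
    gender = tag[6]      # M or F
    # The T tag orders gender before number, unlike the verb tag.
    return f"T{verb_type}{gender}{number}"


for old in ("VMP00SM", "VMP00PF", "VAP00SM"):
    print(old, "->", parole_participle_to_t(old))
```

Under these assumptions the results match the examples in the table above: a masculine singular participle of a main verb becomes TMMS, and one of an auxiliary becomes TAMS.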
8. Pronouns
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Pronoun | P |
| 2 | Type | Person | P |
| | | Demonstrative | D |
| | | Possessive | X |
| | | Indefinite | I |
| | | Interrogative | T |
| | | Relative | R |
| | | Numeral | N |
| | | Exclamative | E |
| 3 | Person | First | 1 |
| | | Second | 2 |
| | | Third | 3 |
| 4 | Gender | Masculine | M |
| | | Feminine | F |
| | | Common | C |
| | | Neuter | N |
| 5 | Number | Singular | S |
| | | Plural | P |
| | | Invariable | N |
| 6 | Case | Nominative | N |
| | | Accusative | A |
| | | Dative | D |
| | | Oblique | O |
| 7 | Possessor | Singular | S |
| | | Plural | P |
| 8 | Politeness | Polite | P |
In this category, a very significant change has been introduced compared to the original Parole tagging system: the category of personal pronouns has been divided into personal pronouns (commonly referred to as 'strong pronouns') and clitics (or 'weak pronouns'). The clitics now form a separate category (L). The following table shows the new distribution (with lemmas in parentheses). Please note that not all elements that belong to the two categories are represented here.
| Personal Pronouns (PP) | Clitics (L) |
|---|---|
| jo (jo), mi (jo), nosaltres (jo), nós (jo), tu (tu), vostè (tu), vostès (tu), vós (tu), vosaltres (tu), ella (ell), ell (ell), ells (ell), elles (ell) | em (em), et (et), el (el), la (el), l' (el), li (li), es (es), ens (em), us (et), els (els), les (el), ho (ho), hi (hi), en (en) |
9. Clitics
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Clitic | L |
| 2 | Person | First | 1 |
| | | Second | 2 |
| | | Third | 3 |
| 3 | Gender | Masculine | M |
| | | Feminine | F |
| | | Common | C |
| | | Neuter | N |
| 4 | Number | Singular | S |
| | | Plural | P |
| | | Neuter | N |
| 5 | Case | - | 0 |
| | | Accusative | A |
| | | Dative | D |
Examples:
| Form | Lemma | Tag |
|---|---|---|
| m' | em | L1CS0 |
| la | el | L3FSA |
| els | els | L0CP0 |
| ho | ho | L3NN0 |
| hi | hi | L3CN0 |
| se | es | L3CN0 |
All clitics (weak pronouns) are now placed within this new category L, which did not exist in the original Parole system. Clitics can be an important object of study in diachronic research. In the Parole system, they are not clearly distinguished from the rest of the personal pronouns, and the length of the resulting labels sometimes makes working with them cumbersome. To simplify this, we have created this new category, even though it means that a few elements that are pronouns are now outside the 'pronouns' category. The lemmas are always the singular masculine version of each clitic. There is one exception: els, which can be dative (li in singular) or accusative (el in singular), has els as its lemma.
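The lemmatization convention for clitics can be written down as a simple lookup. The Python sketch below uses only the form and lemma pairs listed in this document (the pronoun/clitic table and the examples above); the names `CLITIC_LEMMAS` and `clitic_lemma` are hypothetical, and the list is not exhaustive, since the corpus contains further spelling variants.

```python
from typing import Optional

# Form -> lemma pairs taken from the tables in this document.
CLITIC_LEMMAS = {
    "em": "em", "m'": "em", "ens": "em",
    "et": "et", "us": "et",
    "el": "el", "la": "el", "l'": "el", "les": "el",
    "li": "li",
    "es": "es", "se": "es",
    "els": "els",  # the exception: dative or accusative plural keeps "els"
    "ho": "ho",
    "hi": "hi",
    "en": "en",
}


def clitic_lemma(form: str) -> Optional[str]:
    """Return the lemma of a token already tagged as a clitic (L), if listed."""
    return CLITIC_LEMMAS.get(form.lower())


for form in ("ens", "les", "els", "m'"):
    print(form, "->", clitic_lemma(form))
```

Forms such as el, la, els, and les are also articles, so a lookup like this only makes sense for tokens that the tagger has already assigned to category L.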
10. Conjunctions
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Conjunction | C |
| 2 | Type | Coordinating | C |
| | | Subordinating | S |
There are no changes to the original Parole tagging system.
11. Interjections
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Interjection | I |
There are no changes to the original Parole tagging system.
12. Prepositions
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Adposition | S |
| 2 | Type | Preposition | P |
| 3 | Form | Simple | S |
| | | Complex | C |
| 4 | Gender | - | 0 |
| 5 | Number | - | 0 |
There are no changes to the original Parole tagging system.
13. Punctuation marks
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Punctuation mark | F |
There are no changes to the original Parole tagging system.
14. Numbers
| Position | Attribute | Value | Code |
|---|---|---|---|
| 1 | Category | Number | Z |
There are no changes to the original Parole tagging system.