This corpus was initially developed under the Spanish national project FFI2010-15006 and extended under project FFI2013-41301-P.

Corpus description

This diachronic corpus of Catalan includes 222 texts from the 11th century to the 17th century containing a total of 5,020,237 words. The selection of texts has been made based on criteria of representativeness in terms of genre, date, and geographical origin.

Tagging of the corpus

The corpus has been tagged with an existing linguistic analyzer, Freeling. This tool is designed for contemporary Catalan, so it had to be extended to handle the lexical and orthographic variability of a diachronic corpus. To do so, we applied the technique that Cristina Sánchez-Marco developed to adapt the contemporary Spanish version of Freeling to a diachronic corpus. Here is how this process was carried out.

Freeling uses the Parole tagset, developed by the EAGLES group as a common tool for the computational processing of various European languages. In some respects, however, the tags used in this corpus differ from the original Parole ones. Changes have been introduced for two main reasons: first, to adapt the tagset to the different historical stages of Catalan; second, to make the output of the analysis suitable as input for a syntactic analyzer, which is the next step of this project. On this page you can consult the list of tags used in this corpus. If you are already familiar with the Parole system, please refer to the list to see the changes that have been introduced.

Consulting the corpus

The corpus can be accessed through the Corpus Query Processor (CQP), either in its command-line version or through its graphical interface, CQPweb. In both cases, CQP supports searches by complex patterns using positional attributes, which relate to a single item (word, lemma, tag), and structural attributes, which relate to sentences (length, position in the text...) or texts (date, title, author, genre...). The available attributes can be consulted here. In the CQPweb interface, these attributes can be used to refine searches through the 'Restricted Query' option instead of 'Standard Query'. In its current state of development, the CQPweb interface allows searches to be restricted automatically by century, genre, and author.

CQPweb accepts queries either in CQP syntax, as you would type them in a terminal, or in a simplified language called 'Simple Query'. The manual for CQP, with the description of its syntax, can be found here, and a brief guide for conducting searches using 'Simple Query' is available here. This corpus can be accessed using the "guest" account on CQPweb.
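As an illustration of searching by positional attributes, the following minimal sketch builds CQP token expressions as strings. The attribute names `lemma` and `tag` are assumptions based on the attributes mentioned above (the actual names are given in the attribute list), the helper functions are hypothetical, and values containing double quotes are not escaped.

```python
def token(**constraints):
    """Build one CQP token expression, e.g. [lemma="cantar" & tag="VM.*"]."""
    parts = [f'{attr}="{value}"' for attr, value in constraints.items()]
    return "[" + " & ".join(parts) + "]"


def sequence(*tokens):
    """Join token expressions into a query that matches adjacent tokens."""
    return " ".join(tokens)


# Any form of the lemma "cantar" tagged as a main verb.
q1 = token(lemma="cantar", tag="VM.*")

# An article immediately followed by a possessive adjective (AX).
q2 = sequence(token(tag="DA.*"), token(tag="AX.*"))

print(q1)
print(q2)
```

In the command-line version of CQP a query is terminated with a semicolon, which would have to be appended to these strings.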

Tagset

1. Adjectives

Position  Attribute  Value          Code
1         Category   Adjective      A
2         Type       Qualificative  Q
                     Ordinal        O
                     Possessive     X
3         Degree     -              0
                     Appreciative   A
4         Gender     Masculine      M
                     Feminine       F
                     Common         C
5         Number     Singular       S
                     Plural         P
                     Invariable     N
6         Person     -              0
                     First          1
                     Second         2
                     Third          3
7         Possessor  -              0
                     Singular       S
                     Plural         P

Qualificative and ordinal adjectives are tagged following the Parole system. However, there is a new category: possessive adjectives (AX). It includes the words labeled as possessive pronouns in the original Parole. The reasons are primarily distributional. Freeling tags very frequent sequences like la mía, in which the noun is null, as determiner + pronoun; however, such combinations are in principle ruled out in the language (*el les, *el tu), and a syntactic analyzer should be able to recognize the ungrammaticality of such a sequence. Since these elements (meva, teva, nostra...) behave distributionally like adjectives (and it could even be argued that semantically they are adjectives), we have decided to tag them as possessive adjectives.

Another change to bear in mind concerns elements with participial morphology (-at/-ada/-ats/-ades): they all now form a separate new category (T; see below).

2. Adverbs

Position  Attribute  Value     Code
1         Category   Adverb    R
2         Type       General   G
                     Negative  N

There are no changes to the original Parole tagging system.

3. Determiners

Position  Attribute  Value          Code
1         Category   Determiner     D
2         Type       Demonstrative  D
                     Possessive     P
                     Interrogative  T
                     Exclamative    E
                     Indefinite     I
                     Article        A
                     Relative       R
                     Numeral        N
3         Person     First          1
                     Second         2
                     Third          3
4         Gender     Masculine      M
                     Feminine       F
                     Common         C
                     Neuter         N
5         Number     Singular       S
                     Plural         P
                     Invariable     N
6         Possessor  Singular       S
                     Plural         P

Above, we have introduced the possessive adjective category (AX). This category has not entirely replaced the possessive determiner category. Cases that are not preceded by an article and are followed by a noun continue to be labeled as possessive determiners (ma, ton, sa, nostre...).
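The contextual rule just described can be sketched as a small function over the tags of the neighbouring tokens. This is only an illustration, not the procedure used in the tagger: the function name is hypothetical, the example tags are illustrative, and the fallback to AX for every other context is an assumption.

```python
def possessive_category(prev_tag, next_tag):
    """Return 'DP' or 'AX' for a possessive form.

    prev_tag and next_tag are the tags of the neighbouring tokens,
    or None at the edge of a sentence.
    """
    preceded_by_article = prev_tag is not None and prev_tag.startswith("DA")
    followed_by_noun = next_tag is not None and next_tag.startswith("N")
    # Rule stated in the text: no article before, noun after -> determiner.
    if not preceded_by_article and followed_by_noun:
        return "DP"
    # Assumption: every other context is treated as a possessive adjective.
    return "AX"


print(possessive_category(None, "NCFS000"))      # context like "ma casa"
print(possessive_category("DA0FS0", "NCFS000"))  # context like "la meva casa"
```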

4. Quantifiers

Position  Attribute  Value       Code
1         Category   Quantifier  Q
2         Gender     Masculine   M
                     Feminine    F
                     Common      C
3         Number     Singular    S
                     Plural      P
                     Invariable  N

This category did not exist in the original Parole tagging system. It includes, as lemmas, molt, tot, and cada. Despite their differences, these three elements share common features:

  • They can license a nominal phrase: cada nen, tot nen, molts nens.
  • Some can combine with a determiner: tots els nens, els molts nens.
  • Some can behave as predicative adjectives or adverbs: són molts, m'agrada molt.
  • They are very frequent.
  • Both traditional grammar and the Parole tagging system have difficulties treating them, often resorting to assigning them multiple grammatical categories (D, P, A, Adv...).

The Q tag is intended to provide a unified treatment for them and to generate suitable input for the syntactic analyzer. It's important to note that all occurrences of tot, cada, and molt are analyzed as quantifiers. However, the exact definition of this category and the extension of the Q tag to other similar elements will be defined in the next phase of the project.
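Because the Q tag has only three positions, it can be assembled directly from the table above. The following is a minimal sketch with a hypothetical helper; it assumes the caller already knows the lemma, gender, and number of the form.

```python
QUANTIFIER_LEMMAS = {"molt", "tot", "cada"}
GENDER = {"masculine": "M", "feminine": "F", "common": "C"}
NUMBER = {"singular": "S", "plural": "P", "invariable": "N"}


def quantifier_tag(lemma, gender, number):
    """Build a three-position Q tag: category, gender, number."""
    if lemma not in QUANTIFIER_LEMMAS:
        raise ValueError(f"{lemma!r} is not currently tagged as a quantifier")
    return "Q" + GENDER[gender] + NUMBER[number]


print(quantifier_tag("molt", "masculine", "plural"))  # e.g. for "molts"
print(quantifier_tag("tot", "feminine", "singular"))  # e.g. for "tota"
```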

5. Nouns

Position  Attribute                Value         Code
1         Category                 Noun          N
2         Type                     Common        C
                                   Proper        P
3         Gender                   Masculine     M
                                   Feminine      F
                                   Common        C
4         Number                   Singular      S
                                   Plural        P
                                   Invariable    N
5-6       Semantic classification  Being-Person  SP
                                   Organization  O0
                                   Place         G0
7         Degree                   Appreciative  A

The table above shows the original Parole tagging system. At the moment, the values for semantic classification and grammatical degree are not yet implemented. Proper names, at this time, all receive the label NP00000.
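The current state of the noun tags can be summarized as a pattern. This sketch is an assumption rather than a specification: it supposes that the unimplemented positions 5 to 7 of common nouns are filled with 0, by analogy with NP00000, and the function name is hypothetical.

```python
import re

# Proper nouns: always NP00000 (stated in the text).
# Common nouns: NC + gender + number + "000" (the "000" padding is an assumption).
NOUN_TAG = re.compile(r"N(?:P00000|C[MFC][SPN]000)")


def is_current_noun_tag(tag):
    """Check whether a tag fits the noun tags as currently implemented."""
    return NOUN_TAG.fullmatch(tag) is not None


print(is_current_noun_tag("NP00000"))
print(is_current_noun_tag("NCMS000"))
```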

6. Verbs

Position  Attribute  Value          Code
1         Category   Verb           V
2         Type       Main           M
                     Auxiliary      A
                     Semiauxiliary  S
3         Mood       Indicative     I
                     Subjunctive    S
                     Imperative     M
                     Infinitive     N
                     Gerund         G
4         Tense      Present        P
                     Imperfect      I
                     Future         F
                     Past           S
                     Conditional    C
                     -              0
5         Person     First          1
                     Second         2
                     Third          3
6         Number     Singular       S
                     Plural         P
7         Gender     Masculine      M
                     Feminine       F
Examples of complete verbal paradigms:

                       Main verbs                  Auxiliary verbs
Tense                  Form       Lemma   Tag      Form     Lemma  Tag
Present Indicative     canto      cantar  VMIP1S0  soc      ser    VAIP1S0
                       cantes     cantar  VMIP2S0  ets      ser    VAIP2S0
                       canta      cantar  VMIP3S0  és       ser    VAIP3S0
                       cantem     cantar  VMIP1P0  som      ser    VAIP1P0
                       canteu     cantar  VMIP2P0  sou      ser    VAIP2P0
                       canten     cantar  VMIP3P0  són      ser    VAIP3P0
Past Imperfect         cantava    cantar  VMII1S0  era      ser    VAII1S0
                       cantaves   cantar  VMII2S0  eres     ser    VAII2S0
                       cantava    cantar  VMII3S0  era      ser    VAII3S0
                       cantàvem   cantar  VMII1P0  érem     ser    VAII1P0
                       cantàveu   cantar  VMII2P0  éreu     ser    VAII2P0
                       cantaven   cantar  VMII3P0  eren     ser    VAII3P0
Preterite              cantí      cantar  VMIS1S0  fui      ser    VAIS1S0
                       cantares   cantar  VMIS2S0  fores    ser    VAIS2S0
                       cantà      cantar  VMIS3S0  fou      ser    VAIS3S0
                       cantàrem   cantar  VMIS1P0  fórem    ser    VAIS1P0
                       cantàreu   cantar  VMIS2P0  fóreu    ser    VAIS2P0
                       cantaren   cantar  VMIS3P0  foren    ser    VAIS3P0
Future Indicative      cantaré    cantar  VMIF1S0  seré     ser    VAIF1S0
                       cantaràs   cantar  VMIF2S0  seràs    ser    VAIF2S0
                       cantarà    cantar  VMIF3S0  serà     ser    VAIF3S0
                       cantarem   cantar  VMIF1P0  serem    ser    VAIF1P0
                       cantareu   cantar  VMIF2P0  sereu    ser    VAIF2P0
                       cantaran   cantar  VMIF3P0  seran    ser    VAIF3P0
Conditional            cantaria   cantar  VMCP1S0  seria    ser    VACP1S0
                       cantaries  cantar  VMCP2S0  series   ser    VACP2S0
                       cantaria   cantar  VMCP3S0  seria    ser    VACP3S0
                       cantaríem  cantar  VMCP1P0  seríem   ser    VACP1P0
                       cantaríeu  cantar  VMCP2P0  seríeu   ser    VACP2P0
                       cantarien  cantar  VMCP3P0  serien   ser    VACP3P0
Present Subjunctive    canti      cantar  VMSP1S0  sigui    ser    VASP1S0
                       cantis     cantar  VMSP2S0  siguis   ser    VASP2S0
                       canti      cantar  VMSP3S0  sigui    ser    VASP3S0
                       cantem     cantar  VMSP1P0  siguem   ser    VASP1P0
                       canteu     cantar  VMSP2P0  sigueu   ser    VASP2P0
                       cantin     cantar  VMSP3P0  siguin   ser    VASP3P0
Imperfect Subjunctive  cantés     cantar  VMSI1S0  fos      ser    VASI1S0
                       cantessis  cantar  VMSI2S0  fossis   ser    VASI2S0
                       cantés     cantar  VMSI3S0  fos      ser    VASI3S0
                       cantéssim  cantar  VMSI1P0  fóssim   ser    VASI1P0
                       cantéssiu  cantar  VMSI2P0  fóssiu   ser    VASI2P0
                       cantessin  cantar  VMSI3P0  fossin   ser    VASI3P0
Gerund                 cantant    cantar  VMG0000  essent   ser    VAG0000
Imperative             canta      cantar  VMMP2S0  sigues   ser    VAMP2S0
                       canti      cantar  VMMP3S0  sigui    ser    VAMP3S0
                       cantem     cantar  VMMP1P0  siguem   ser    VAMP1P0
                       canteu     cantar  VMMP2P0  sigueu   ser    VAMP2P0
                       cantin     cantar  VMMP3P0  siguin   ser    VAMP3P0
Infinitive             cantar     cantar  VMN0000  ser      ser    VAN0000

The tagging of verbs follows the Parole criteria, except for one important point: the treatment of past participles, which have been moved to a new category, T (see below).
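Since every position of a verb tag encodes one attribute, a tag can be decoded by reading it character by character. The sketch below follows the attribute table for verbs; the function name is hypothetical, and reading 0 as "not applicable" for person, number, and gender is an assumption drawn from tags such as VMG0000. Codes that the table does not list (such as the C in the third position of the conditional forms in the paradigm) are passed through unchanged.

```python
VERB_FIELDS = [
    ("type", {"M": "main", "A": "auxiliary", "S": "semiauxiliary"}),
    ("mood", {"I": "indicative", "S": "subjunctive", "M": "imperative",
              "N": "infinitive", "G": "gerund"}),
    ("tense", {"P": "present", "I": "imperfect", "F": "future",
               "S": "past", "C": "conditional", "0": None}),
    ("person", {"1": 1, "2": 2, "3": 3, "0": None}),
    ("number", {"S": "singular", "P": "plural", "0": None}),
    ("gender", {"M": "masculine", "F": "feminine", "0": None}),
]


def decode_verb_tag(tag):
    """Map a seven-character verb tag to a dict of attribute values."""
    if len(tag) != 7 or tag[0] != "V":
        raise ValueError(f"not a verb tag: {tag!r}")
    # Unlisted codes are returned as they are instead of raising an error.
    return {name: values.get(code, code)
            for (name, values), code in zip(VERB_FIELDS, tag[1:])}


print(decode_verb_tag("VMIP1S0"))  # tag of "canto"
print(decode_verb_tag("VAG0000"))  # tag of "essent"
```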

7. Participles

Position  Attribute  Value          Code
1         Category   Participle     T
2         Type       Main           M
                     Auxiliary      A
                     Semiauxiliary  S
3         Gender     Masculine      M
                     Feminine       F
                     Common         C
4         Number     Singular       S
                     Plural         P
                     Invariable     N

Examples:

Form      Lemma   Tag
cantat    cantar  TMMS
cantada   cantar  TMFS
cantats   cantar  TMMP
cantades  cantar  TMFP
estat     estar   TAMS

As mentioned above, all elements with participial morphology, originally included in the categories of adjectives and verbs in Parole, have been brought together within this new category. The reason for this change is that our tagging system must be able to cover the early stages of a Romance language, Catalan, in which it is not always easy to determine whether an element with the morphology -at/ada/ats/ades behaves like a verb or like an adjective. We have decided to unify their treatment, so it is important to note that all uses of these elements (even when their function is clearly verbal or adjectival according to traditional criteria) are labeled as T.
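To make the relation between the endings -at/ada/ats/ades and the T tags concrete, here is a minimal sketch that derives the gender and number codes from the ending of a form. It is not how the tagger works: the function is hypothetical, it only covers these four endings, and it would wrongly accept any non-participle that happens to end in them, so it assumes the form is already known to be a participle.

```python
# Longer endings are listed first so that "-ades" is tried before "-at".
ENDINGS = [("ades", "FP"), ("ats", "MP"), ("ada", "FS"), ("at", "MS")]


def participle_tag(form, verb_type="M"):
    """Return the T tag for a participle in -at/-ada/-ats/-ades, else None."""
    for ending, gender_number in ENDINGS:
        if form.endswith(ending):
            return "T" + verb_type + gender_number
    return None


for form in ["cantat", "cantada", "cantats", "cantades"]:
    print(form, participle_tag(form))
print("estat", participle_tag("estat", verb_type="A"))
```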

8. Pronouns

Position  Attribute   Value          Code
1         Category    Pronoun        P
2         Type        Personal       P
                      Demonstrative  D
                      Possessive     X
                      Indefinite     I
                      Interrogative  T
                      Relative       R
                      Numeral        N
                      Exclamative    E
3         Person      First          1
                      Second         2
                      Third          3
4         Gender      Masculine      M
                      Feminine       F
                      Common         C
                      Neuter         N
5         Number      Singular       S
                      Plural         P
                      Invariable     N
6         Case        Nominative     N
                      Accusative     A
                      Dative         D
                      Oblique        O
7         Possessor   Singular       S
                      Plural         P
8         Politeness  Polite         P

In this category, a very significant change has been introduced compared to the original Parole tagging system: the category of personal pronouns has been divided into personal pronouns (commonly referred to as 'strong pronouns') and clitics (or 'weak pronouns'). The clitics now form a separate category (L). The following table shows the new distribution (with lemmas in parentheses). Please note that not all elements that belong to the two categories are represented here.

Personal Pronouns (PP)  Clitics (L)
jo (jo)                 em (em)
mi (jo)                 et (et)
nosaltres (jo)          el (el)
nós (jo)                la (el)
tu (tu)                 l' (el)
vostè (tu)              li (li)
vostès (tu)             es (es)
vós (tu)                ens (em)
vosaltres (tu)          us (et)
ella (ell)              els (els)
ell (ell)               les (el)
ells (ell)              ho (ho)
elles (ell)             hi (hi)
                        en (en)
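The distribution in this table amounts to a lookup from form to category and lemma, which can be sketched as follows. The dictionaries only contain the forms listed above, the function name is hypothetical, and the lookup ignores context, so forms that can also be articles (el, la, els, les) are always reported as clitics here.

```python
PERSONAL_PRONOUNS = {
    "jo": "jo", "mi": "jo", "nosaltres": "jo", "nós": "jo",
    "tu": "tu", "vostè": "tu", "vostès": "tu", "vós": "tu", "vosaltres": "tu",
    "ella": "ell", "ell": "ell", "ells": "ell", "elles": "ell",
}
CLITICS = {
    "em": "em", "et": "et", "el": "el", "la": "el", "l'": "el", "li": "li",
    "es": "es", "ens": "em", "us": "et", "els": "els", "les": "el",
    "ho": "ho", "hi": "hi", "en": "en",
}


def classify(form):
    """Return (category, lemma) for a listed form, or None if it is not listed."""
    form = form.lower()
    if form in PERSONAL_PRONOUNS:
        return ("PP", PERSONAL_PRONOUNS[form])
    if form in CLITICS:
        return ("L", CLITICS[form])
    return None


print(classify("nosaltres"))
print(classify("les"))
```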

9. Clitics

Position  Attribute  Value       Code
1         Category   Clitic      L
2         Person     First       1
                     Second      2
                     Third       3
3         Gender     Masculine   M
                     Feminine    F
                     Common      C
                     Neuter      N
4         Number     Singular    S
                     Plural      P
                     Neuter      N
5         Case       -           0
                     Accusative  A
                     Dative      D

Examples:

Form  Lemma  Tag
m'    em     L1CS0
la    el     L3FSA
els   els    L0CP0
ho    ho     L3NN0
hi    hi     L3CN0
se    es     L3CN0

All clitics (weak pronouns) are now placed within this new category L, which did not exist in the original Parole system. Clitics can be an important object of study in diachronic research. In the Parole system, they are not clearly distinguished from the rest of the personal pronouns, and the length of the resulting labels sometimes makes working with them cumbersome. To simplify this, we have created this new category, even though it means that a few elements that are pronouns are now outside the 'pronouns' category. The lemmas are always the singular masculine version of each clitic. There is one exception: els, which can be dative (li in singular) or accusative (el in singular), has els as its lemma.
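One practical consequence of the short, fixed-length L tags is that clitics can be selected by testing single character positions. The following sketch filters the examples from the table above; the helper name and the (form, lemma, tag) tuple layout are assumptions made for illustration.

```python
# Zero-based index of each attribute inside a five-character clitic tag.
CLITIC_POSITIONS = {"person": 1, "gender": 2, "number": 3, "case": 4}

EXAMPLES = [
    ("m'", "em", "L1CS0"), ("la", "el", "L3FSA"), ("els", "els", "L0CP0"),
    ("ho", "ho", "L3NN0"), ("hi", "hi", "L3CN0"), ("se", "es", "L3CN0"),
]


def select_clitics(tokens, **codes):
    """Return the forms of clitics whose tag has the given codes."""
    selected = []
    for form, lemma, tag in tokens:
        if not tag.startswith("L"):
            continue  # not a clitic
        if all(tag[CLITIC_POSITIONS[attr]] == code
               for attr, code in codes.items()):
            selected.append(form)
    return selected


print(select_clitics(EXAMPLES, person="3"))
print(select_clitics(EXAMPLES, case="A"))
```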

10. Conjunctions

Position  Attribute  Value          Code
1         Category   Conjunction    C
2         Type       Coordinating   C
                     Subordinating  S

There are no changes to the original Parole tagging system.

11. Interjections

Position  Attribute  Value         Code
1         Category   Interjection  I

There are no changes to the original Parole tagging system.

12. Prepositions

Position  Attribute  Value        Code
1         Category   Adposition   S
2         Type       Preposition  P
3         Form       Simple       S
                     Complex      C
4         Gender     -            0
5         Number     -            0

There are no changes to the original Parole tagging system.

13. Punctuation marks

Position  Attribute  Value             Code
1         Category   Punctuation mark  F

There are no changes to the original Parole tagging system.

14. Numbers

Position  Attribute  Value   Code
1         Category   Number  Z

There are no changes to the original Parole tagging system.