03.04 Constitution of the corpus

04.11.2021


A corpus is a complete set of units collected and stored according to criteria defined in advance, then clearly marked and edited in line with the objectives of the project. The units can be documents of any kind, analyzed either whole or broken down into smaller parts: for example, the scenes of a film, the news items in a newspaper, the photographs accompanying those items, or the main plots of the episodes of a fiction series. This structured set of units is collected so that it can be analyzed for patterns, recurrences, variations and, above all, meanings.

A corpus can be very large. To make analysis easier while such a corpus is still being built, it is common to break it down into segments. The corpus is in reality a work in progress: it can take a long time to assemble, and it can keep growing as new units are produced or found. It is therefore composed of corpora, that is, articulated parts of a larger corpus.
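As a rough illustration of this segmentation, the sketch below groups the units of a corpus into smaller corpora according to one criterion. The function name `split_into_subcorpora`, the unit fields and the sample data are hypothetical and chosen only for the example.

```python
from collections import defaultdict

def split_into_subcorpora(units, key):
    """Group units into smaller corpora using the value returned by `key`."""
    subcorpora = defaultdict(list)
    for unit in units:
        subcorpora[key(unit)].append(unit)
    return dict(subcorpora)

# Invented units: episodes of a fiction series, segmented by year.
units = [
    {"title": "Episode 1", "year": 2019},
    {"title": "Episode 2", "year": 2019},
    {"title": "Episode 3", "year": 2020},
]
by_year = split_into_subcorpora(units, key=lambda u: u["year"])

for year, members in sorted(by_year.items()):
    print(year, [u["title"] for u in members])
```

New units can be added to the list at any time and the grouping run again, which fits the idea of a corpus that keeps growing.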

There are several types of corpus (and corpora):

1. General.

2. Synchronous: they are formed according to a segment or temporal evolution.

3. Historical: when they recover units produced in the past, generally ordered according to an established temporal periodization.

4. Varied: they are completed according to a combination of criteria.

The units of the corpus are stored and recorded, for example in a database, using record fields that identify each unit. The same database can also store the units themselves, whether they are text documents, images, videos, sounds or any other element, in fields generally called "containers".
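A minimal sketch of such a database, assuming SQLite through Python's standard `sqlite3` module. The table and field names are hypothetical. The `content` column plays the role of a "container" field, while the other columns are record fields that identify the unit.

```python
import sqlite3

# Hypothetical schema: field names are illustrative, not prescribed by the text.
SCHEMA = """
CREATE TABLE units (
    id        INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,  -- record field: identifies the unit
    source    TEXT,           -- record field: where the unit comes from
    published TEXT,           -- record field: ISO date such as '2021-11-04'
    media     TEXT,           -- record field: text, image, video, sound
    content   BLOB            -- "container" field: the unit itself
)
"""

def create_corpus_db(path=":memory:"):
    """Open (or create) the database and set up the units table."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn

def add_unit(conn, title, source, published, media, content):
    """Register one unit and store its content; return the new record id."""
    cur = conn.execute(
        "INSERT INTO units (title, source, published, media, content) "
        "VALUES (?, ?, ?, ?, ?)",
        (title, source, published, media, content),
    )
    conn.commit()
    return cur.lastrowid

conn = create_corpus_db()
unit_id = add_unit(
    conn, "Example headline", "Example Daily", "2021-11-04", "text",
    "Body of the news item.".encode("utf-8"),
)
row = conn.execute(
    "SELECT title, media, content FROM units WHERE id = ?", (unit_id,)
).fetchone()
print(row[0], row[1], row[2].decode("utf-8"))
```

Storing the content as bytes means the same column can hold text, images or audio. For large video files, a file path stored in a text field may be a more practical design.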

The purpose of creating a corpus is, of course, to analyze it later by generating and applying categories and codes, in order to identify, for example, concordances or semantic meanings, in a comprehensive and systematic way.
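To make the idea of a concordance concrete, here is a small keyword-in-context sketch. The function name `concordance`, the `width` parameter and the sample sentence are assumptions made for the example. Dedicated corpus tools offer far richer versions of this.

```python
import re

def concordance(text, keyword, width=30):
    """Return (left context, match, right context) for each whole-word match."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(0), right))
    return lines

sample = "The corpus is stored in a database. Each corpus unit has a record."
hits = concordance(sample, "corpus", width=12)
for left, kw, right in hits:
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>12} [{kw}] {right}")
```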

The production of a corpus in turn gives rise to two methodological approaches: corpus-based and corpus-driven. The distinction comes from linguistics. Both approaches work with a corpus, but corpus-driven analysis systematically goes through all the units of the corpus in search of patterns and recurrences, whereas corpus-based analysis uses the corpus to expound, test or exemplify theories and descriptions that were formulated before large corpora made the other approach possible. Elena Tognini-Bonelli summarizes the corpus-driven approach this way:

In a corpus-driven approach the commitment of the linguist is to the integrity of the data as a whole, and descriptions aim to be comprehensive with respect to corpus evidence. The corpus, therefore, is seen as more than a repository of examples to back pre-existing theories or a probabilistic extension to an already well defined system. […] Examples are normally taken verbatim, in other words they are not adjusted in any way to fit the predefined categories of the analyst; recurrent patterns and frequency distributions are expected to form the basic evidence for linguistic categories; the absence of a pattern is considered potentially meaningful (Tognini-Bonelli, 2001: 65).

Normally, in the absence of a theory or even a set of well-established categories, a corpus-driven approach is recommended. While this approach is inductive (bottom-up, from detail to generalization), corpus-based analysis is deductive (top-down, from generalization to detail). One looks for the rule; the other seeks confirmation of the rule.
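The contrast can be sketched on a toy corpus. In this simplified illustration, `corpus_driven` counts every word pair in every unit and keeps those that recur, without any category fixed beforehand, while `corpus_based` only checks the corpus against categories defined in advance. The function names, the category lists and the three invented units are assumptions, and real analyses are considerably more elaborate.

```python
from collections import Counter

# Toy corpus: three invented units, already split into lowercase words.
units = [
    "the mayor opens the new library".split(),
    "the mayor visits the flooded library".split(),
    "the council opens a new park".split(),
]

def corpus_driven(units, n=2, min_count=2):
    """Inductive: count every n-gram in every unit, keep the recurrent ones."""
    counts = Counter()
    for tokens in units:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}

def corpus_based(units, categories):
    """Deductive: count the units that match categories defined beforehand."""
    return {
        name: sum(1 for tokens in units if words & set(tokens))
        for name, words in categories.items()
    }

patterns = corpus_driven(units)
support = corpus_based(
    units,
    {"institutions": {"mayor", "council"}, "disaster": {"flooded"}},
)
print(patterns)
print(support)
```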

Modern digital techniques allow the construction and management of very large corpora. The tendency is to handle what are called machine-readable texts (or, more generally, documents), which can also be marked up, for example using XML.
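As an illustration of such markup, the sketch below wraps one unit in XML using Python's standard `xml.etree.ElementTree` module and then reads it back. The element and attribute names (`unit`, `title`, `body`, `p`, `id`, `date`) are hypothetical. An actual project would normally follow an agreed schema.

```python
import xml.etree.ElementTree as ET

def unit_to_xml(unit_id, title, date, paragraphs):
    """Build an XML element for one unit (hypothetical element names)."""
    unit = ET.Element("unit", attrib={"id": unit_id, "date": date})
    ET.SubElement(unit, "title").text = title
    body = ET.SubElement(unit, "body")
    for text in paragraphs:
        ET.SubElement(body, "p").text = text
    return unit

unit = unit_to_xml(
    "u001", "Example headline", "2021-11-04",
    ["First paragraph.", "Second paragraph."],
)
xml_string = ET.tostring(unit, encoding="unicode")
print(xml_string)

# The marked-up unit is machine-readable: it can be parsed and queried again.
parsed = ET.fromstring(xml_string)
paragraphs = [p.text for p in parsed.iter("p")]
print(parsed.get("id"), paragraphs)
```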

To build the corpus, the operations to be performed are:

1) Define the principles and characteristics to be met by the units, in order to proceed then to their systematic and exhaustive collection.

2) Define the strategies for collecting these units, e.g., if any software is needed for this purpose.

3) Decide how to store and catalog these units. As noted above, creating a database is recommended, since it makes it possible not only to store the units but also to register them.

4) Depending on the format of the units, further operations may be needed before working with them: technical format conversions, transcription and editing of texts, orthographic modernization, translation into another language and so on, and finally, if necessary, markup.
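The first of these operations, defining the characteristics that units must meet, can be sketched as an explicit filter applied to candidate units. Everything in the example is hypothetical: the criteria (medium, period, keywords), the field names and the three candidates are invented to show the idea of systematic selection.

```python
from datetime import date

# Hypothetical inclusion criteria, written down before collection starts.
CRITERIA = {
    "media": {"newspaper"},
    "start": date(2021, 1, 1),
    "end": date(2021, 12, 31),
    "keywords": {"climate", "emissions"},
}

def meets_criteria(unit, criteria=CRITERIA):
    """Return True only if the unit satisfies every inclusion criterion."""
    in_period = criteria["start"] <= unit["published"] <= criteria["end"]
    right_media = unit["media"] in criteria["media"]
    words = set(unit["text"].lower().split())
    on_topic = bool(words & criteria["keywords"])
    return in_period and right_media and on_topic

candidates = [
    {"media": "newspaper", "published": date(2021, 3, 2),
     "text": "New climate law approved"},
    {"media": "newspaper", "published": date(2019, 5, 10),
     "text": "Climate summit ends"},
    {"media": "blog", "published": date(2021, 6, 1),
     "text": "Emissions fall again"},
]
corpus = [u for u in candidates if meets_criteria(u)]
print([u["text"] for u in corpus])
```

Writing the criteria as data rather than applying them by eye makes the selection reproducible and easy to document.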
