Emergence of phonological biases in large language models
We will explore whether the phonological biases that are a hallmark of human language processing also emerge in current large language models.
In the present project we will explore whether the phonological biases that are a hallmark of human language processing also emerge in current large language models. Such models, for example OpenAI’s ChatGPT, have captured the public’s attention because of their remarkable facility with language. In humans, acquiring a language leads to processing biases. For example, from their second year of life, humans rely more heavily on consonants than on vowels to identify words. This consonant bias has been extensively documented across modalities (oral, written), ages (from young infants to adults), native languages (English, French, Spanish, Dutch) and tasks (word learning, word reconstruction, masked priming). Crucially for the present project, it has been proposed that the consonant bias emerges in humans through mere exposure to the differences in relative frequency between consonants and vowels that are present in natural languages. We therefore hypothesize that similar phonological biases should be observable in current large language models, such as ChatGPT. These models are trained on massive amounts of linguistic data, from which they detect statistical dependencies and learn to predict which word most likely follows another in a given context. Such training might be enough for the models to pick up the differences in relative frequency between consonants and vowels, and thus allow the phonological biases observed in humans to emerge. The results from this project will provide insights at two levels. First, they will show what is shared and what is unique across diverse cognitive systems, both natural and artificial. Second, they will help to identify the conditions under which some of the complexities of human language might arise.
Phase 1 of the project will consist of data collection and modelling and will last 12 months. Once data collection and modelling are complete, phase 2 will consist of the analysis, carried out by the existing members of the group.
The present proposal on phonological biases stems from previous experimental work carried out under the ERC Starting Grant awarded to the PI (“Biological origins of linguistic constraints”; BioCon; ref: 312519). It will complement and enhance research carried out in Ministerio de Ciencia e Innovación project PID2021-123973NB-I00.