23/01/2023 COLT Seminar, given by Philip Rust

"Language Modelling with Pixels", by Philip Rust, University of Copenhagen

16.01.2023

When: January 23rd, 2023, 14.30 to 15.30
Where: Room 52.737 or online via Zoom (Passcode: 011358)

Description: 

Language models are defined over a finite set of inputs, which creates a vocabulary bottleneck when we attempt to scale the number of supported languages. Tackling this bottleneck results in a trade-off between what can be represented in the embedding matrix and computational issues in the output layer. In this talk, I will present how one can circumvent these issues by leveraging pixel-based language representations. Specifically, I will talk about PIXEL, the Pixel-based Encoder of Language. PIXEL is a pretrained language model that renders text as images, making it possible to transfer representations across languages based on orthographic similarity or the co-activation of pixels. The 86M-parameter PIXEL model is trained on the same English data as BERT to reconstruct the pixels of masked patches instead of predicting a distribution over tokens. In experiments on syntactic and semantic tasks in typologically diverse languages, including various non-Latin scripts, PIXEL substantially outperforms BERT on scripts that are not found in the pretraining data, but it is slightly weaker than BERT when working with Latin scripts. Furthermore, PIXEL is more robust than BERT to orthographic attacks and linguistic code-switching, further confirming the benefits of modelling language with pixels.
