COLT seminar: Philip Rust will speak about Language Modelling with Pixels on January 23rd

Title: "Language Modelling with Pixels", by Philip Rust, University of Copenhagen

When: January 23rd, 2023, 14.30 to 15.30

Where: Room 52.737 or online on Zoom (email us for the passcode ;) )

Description: Language models are defined over a finite set of inputs, which creates a vocabulary bottleneck when we attempt to scale the number of supported languages. Tackling this bottleneck results in a trade-off between what can be represented in the embedding matrix and computational issues in the output layer. In this talk, I will present how one can circumvent these issues by leveraging pixel-based language representations. Specifically, I will talk about PIXEL, the Pixel-based Encoder of Language. PIXEL is a pretrained language model that renders text as images, making it possible to transfer representations across languages based on orthographic similarity or the co-activation of pixels. The 86M-parameter PIXEL model is trained on the same English data as BERT to reconstruct the pixels of masked patches instead of predicting a distribution over tokens. In experiments on syntactic and semantic tasks in typologically diverse languages, including various non-Latin scripts, PIXEL substantially outperforms BERT on scripts that are not found in the pretraining data, but PIXEL is slightly weaker than BERT when working with Latin scripts. Furthermore, PIXEL exhibits stronger robustness than BERT to orthographic attacks and linguistic code-switching, further confirming the benefits of modelling language with pixels.

We're looking forward to seeing you there!



