The MultiLS (lexical simplification) dataset, part of the MLSP2024 shared task.

The MultiLS dataset was created as part of the MLSP2024 shared task. It contains 5,624 instances across 10 target languages. Each instance consists of a sentence from an educational text with one target word marked. Two annotations are given for each target word in its context. The first is an aggregate complexity score, derived from asking 10 annotators to rate the difficulty of the target token on a scale of 1-5. The second is a list of possible substitutions for the target word that would make the original sentence easier to understand whilst retaining its original meaning. The two corresponding tasks, predicting lexical complexity and generating substitutions, constitute important steps in the lexical simplification pipeline, a method of simplifying texts in a targeted manner for end users. Further information on the dataset and the protocols used to create it is available at the following link.

https://huggingface.co/datasets/MLSP2024/MLSP2024
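
To make the instance structure described above concrete, here is a minimal Python sketch. The class name, field names, and example sentence are hypothetical and do not reflect the dataset's actual schema. The aggregation shown (mean of the 1-5 ratings, rescaled to the range 0-1) is an assumption made for illustration only; the description above does not state how the aggregate score is computed.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List


@dataclass
class Instance:
    """Hypothetical container for one instance; field names are illustrative."""
    sentence: str                 # sentence from an educational text
    target: str                   # the marked target word
    ratings: List[int]            # per-annotator difficulty ratings on a 1-5 scale
    substitutions: List[str] = field(default_factory=list)  # simpler alternatives


def aggregate_complexity(ratings: List[int], low: int = 1, high: int = 5) -> float:
    """Average the ratings and rescale to [0, 1].

    This aggregation is an assumption, not the documented MultiLS procedure.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < low or r > high for r in ratings):
        raise ValueError(f"ratings must lie between {low} and {high}")
    return (mean(ratings) - low) / (high - low)


# Invented example data, not taken from the dataset.
example = Instance(
    sentence="The cell membrane regulates what enters the cell.",
    target="regulates",
    ratings=[1, 2, 2, 3, 1, 2, 2, 3, 2, 2],
    substitutions=["controls", "manages"],
)
score = aggregate_complexity(example.ratings)
```

Under this assumed scheme, a word that every annotator rates 1 scores 0.0 and a word that every annotator rates 5 scores 1.0.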