Piotr Przybyła presented the Śmigiel Dataset at the LREC 2026

Great to share that Piotr Przybyła presented the Śmigiel Dataset in the poster session at the LREC 2026 Conference this week in Palma de Mallorca, Spain.
Śmigiel Dataset: Laying Foundations for Investigating Machine-Generated Text Detection in Polish, Strebeyko, J., Wróblewska, A., & Przybyła, P. (2026).
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
We present Śmigiel, the first open dataset for training and evaluating machine-generated text (MGT) in Polish. The dataset includes a collection of human-written text fragments from six domains, which are used to prompt text generation by eight language models capable of producing credible Polish text. In addition to the raw corpus of over 462K generated texts, we also release a cleaned source- and domain-balanced dataset suitable for training and evaluating MGT detectors. Finally, we conduct preliminary experiments with text classifiers, showing that task difficulty depends on the text domain, the generating language model, and the availability of similar data in training. The results indicate that MGT detection in Polish can be approached with general-purpose classifiers that generalize well to new LLMs, but struggle to adapt to genres not represented in the training data.
http://www.lrec-conf.org/proceedings/lrec2026/pdf/2026.lrec2026-1.828.pdf
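
The abstract mentions a source- and domain-balanced dataset. As a rough illustration of what such balancing can look like, here is a minimal Python sketch that downsamples every (source, domain) group to the size of the smallest one. It is not the authors' procedure: the function `balance_by_group`, the field names `source`, `domain`, and `text`, and the toy records are all hypothetical and do not reflect the actual Śmigiel file format.

```python
import random
from collections import defaultdict


def balance_by_group(records, keys=("source", "domain"), seed=0):
    """Downsample so that every (source, domain) group has the same size.

    Hypothetical helper for illustration; the field names are assumptions.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in keys)].append(rec)
    if not groups:
        return []
    # Every group is cut down to the size of the smallest group.
    target = min(len(items) for items in groups.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for group_key in sorted(groups):
        balanced.extend(rng.sample(groups[group_key], target))
    return balanced


# Toy records with made-up sources and domains (not real Śmigiel data).
toy = (
    [{"source": "human", "domain": "news", "text": f"h-news-{i}"} for i in range(5)]
    + [{"source": "model_a", "domain": "news", "text": f"a-news-{i}"} for i in range(3)]
    + [{"source": "human", "domain": "reviews", "text": f"h-rev-{i}"} for i in range(4)]
    + [{"source": "model_a", "domain": "reviews", "text": f"a-rev-{i}"} for i in range(2)]
)
balanced = balance_by_group(toy)
```

Downsampling to the smallest group is only one option; it discards data, so a real pipeline might instead cap group sizes or weight examples during training.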