Low-resource text classification using cross-lingual models for bullying detection in the Ukrainian language

Oliinyk, V.; Matviichuk, I.

dc.contributor.author	Oliinyk, V.
dc.contributor.author	Matviichuk, I.
dc.date.accessioned	2023-05-16T08:43:38Z
dc.date.available	2023-05-16T08:43:38Z
dc.date.issued	2023
dc.description.abstract	This paper aims to build a bullying detection model for the Ukrainian language. Because no labeled datasets exist for bullying detection and classification in Ukrainian, a small Ukrainian dataset (4k samples) was gathered and used to test the models in this research. Given the very small number of Ukrainian datasets in general, this dataset is made publicly available for testing and benchmarking other text classification models. The paper studies modern approaches to text classification in low-resource languages. We apply the zero-shot technique and evaluate the performance of modern multilingual and cross-lingual state-of-the-art models and embeddings for text classification in Ukrainian, including mBERT, XLM-R, LASER and MUSE. Experimental results show that zero-shot approaches to the classification task achieve an F1 score of 67-69% for multilingual models trained on the English dataset only, which reach 88-91% test accuracy on English data. We also show that machine translation of English data can be used to estimate model performance in other languages: the best models, XLM-R and LASER, showed only a 0-2% difference in test accuracy compared to natural data. The zero-shot approach for the binary detection task showed even better results: 81%, compared to 91.59% on the original English data. We then enhance the best XLM-R model by training it on our natural Ukrainian dataset and confirm the benefits of augmenting a low-resource language dataset with machine translations from resource-rich English data. Finally, a model for bullying detection in the Ukrainian language is built, achieving an F1 score of 91.59% with a dataset of only 12k samples in different languages.	uk
dc.format.pagerange	Pp. 87-100	uk
dc.identifier.citation	Oliinyk, V. Low-resource text classification using cross-lingual models for bullying detection in the Ukrainian language / V. Oliinyk, I. Matviichuk // Адаптивні системи автоматичного управління : міжвідомчий науково-технічний збірник. – 2023. – № 1 (42). – С. 87-100. – Бібліогр.: 24 назви.	uk
dc.identifier.doi	https://doi.org/10.20535/1560-8956.42.2023.279093
dc.identifier.uri	https://ela.kpi.ua/handle/123456789/55725
dc.language.iso	en	uk
dc.publisher	КПІ ім. Ігоря Сікорського	uk
dc.publisher.place	Київ	uk
dc.relation.ispartof	Адаптивні системи автоматичного управління : міжвідомчий науково-технічний збірник, 2023, № 1 (42)	uk
dc.subject	multilingual models	uk
dc.subject	zero-shot classification	uk
dc.subject	bullying detection	uk
dc.subject	XLM-RoBERTa	uk
dc.subject	mBERT	uk
dc.subject	LASER	uk
dc.subject	MUSE	uk
dc.subject.udc	004.852	uk
dc.title	Low-resource text classification using cross-lingual models for bullying detection in the Ukrainian language	uk
dc.type	Article	uk

Files

File container

Name: 279093-643342-1-10-20230512.pdf
Size: 806.57 KB
Format: Adobe Portable Document Format

Collections

Адаптивні системи автоматичного управління : міжвідомчий науково-технічний збірник. – 2023. – № 1 (42)