Developing Language Resources for Recognizing Moral Aspects in the Serbian Language: Multi-label Categorization Through the Perspective of Moral Foundations Theory

Šošić, Milena; Stanković, Ranka; Graovac, Jelena

Please use this identifier to cite or link to this item: https://research.matf.bg.ac.rs/handle/123456789/3221

DC Field	Value	Language
dc.contributor.author	Šošić, Milena	en_US
dc.contributor.author	Stanković, Ranka	en_US
dc.contributor.author	Graovac, Jelena	en_US
dc.date.accessioned	2026-03-19T15:00:44Z	-
dc.date.available	2026-03-19T15:00:44Z	-
dc.date.issued	2025	-
dc.identifier.uri	https://research.matf.bg.ac.rs/handle/123456789/3221	-
dc.description.abstract	In this study, we present the development of an annotated corpus, a lexicon of moral words, and machine learning models for recognizing moral aspects of the Serbian language, using a multi-label annotation approach based on Moral Foundations Theory (MFT; Graham, 2009). Because Serbian lacks the linguistic resources for this kind of analysis that exist for other languages (Hoover, 2020), this study aims to expand the capabilities for analyzing moral discourse by applying computational linguistics technologies. The constructed Social-Moral.SR corpus contains ~13.6k conversational messages written in Serbian, collected from the social media platforms X and Reddit and annotated for moral categories. Corpus development began with automatic collection and keyword-based selection of messages, followed by pre-annotation with a language model that supports Serbian. The selected model, Falcon-7b-Instruct, provided the initial classification of the collected messages. Messages were categorized according to the ten categories of moral sentiment defined by MFT6: care/harm, fairness/cheating, loyalty/betrayal, authority/subversion, and purity/degradation. A critical challenge in pre-annotation lay in adapting the model to the Serbian language and culture, which required additional manual verification and adjustment to ensure the relevance of the resulting moral categorizations (He, 2024; Zangari, 2025). Agreement between annotators on the verified annotations was measured primarily with Cohen's kappa coefficient. 
The reliability coefficient, with an average value of 0.36, indicates a moderate but acceptable level of agreement among annotators, given the task and the number of categories in the annotation schema. Based on the annotated Social-Moral.SR corpus, a lexicon of moral words in Serbian, named MFD.SR and comprising ~4.3k lemma-PoS pairs in total, was developed automatically by combining natural language processing (NLP) resources already available for Serbian, such as PoS taggers, lemmatizers, NER taggers, and sentiment intensity evaluators (SRPOL). To systematically extract the words characteristic of each moral foundation, a class-based TF-IDF algorithm was applied. The resulting lexicon was verified through a survey that assessed the understanding of moral foundations among anonymous participants from the Serbian-speaking community. Measures of statistical significance (F-statistic) and correlation (Pearson coefficient), computed on the number of moral words recognized in the textual descriptions given in the survey answers, indicate a high level of statistical significance and correlation for each moral category, which confirms the lexicon’s ability to recognize moral words in written text. Moreover, the machine learning models Moral-BERT.SR and Moral-LLaMA.SR were developed by fine-tuning pre-trained BERT language models (multiple model versions) and a pre-trained LLaMA language model (LLaMA-3.2-3B-Instruct) on the annotated Social-Moral.SR corpus. The fine-tuned models are designed to recognize moral categories in texts from social networks, with training framed as a multi-label classification task. 
Moral-BERT.SR and Moral-LLaMA.SR achieved F1 scores of ~68% and ~56%, respectively, which are in line with results published in other studies on the same task (Bulla, 2022). Moral-LLaMA.SR showed a significant improvement over the LLaMA-3.2-3B-Instruct base model applied zero-shot to the moral classification of Serbian messages, indicating that the model needs to be fine-tuned for the task and language in use. The limitations observed with the fine-tuned LLaMA model underscore the current gaps in Serbian language support and the need to train models with a greater number of parameters on larger corpora. The developed models significantly improve the accuracy of moral category recognition in Serbian compared to the lexicon-based model, which makes them valuable tools for further research and analysis of moral content in Serbian conversational texts. This research represents a significant step towards developing new linguistic resources for the Serbian language in the context of moral psychology and computational linguistics. Despite numerous challenges, the developed linguistic resources provide a solid foundation for future research and applications, including moral reasoning analysis, moral sentiment recognition, and support for sociological studies in the Serbian language. Future work will focus on enriching and verifying the corpus, improving the models for automatic moral stance recognition, and refining methods for extracting moral words and phrases from texts. These efforts aim to increase the effectiveness of the developed linguistic resources for analyzing the moral aspects of the Serbian language.	en_US
dc.language.iso	en	en_US
dc.publisher	Beograd : Srpska akademija nauka i umetnosti	en_US
dc.subject	moral	en_US
dc.subject	conversational texts	en_US
dc.subject	annotation	en_US
dc.subject	Classification	en_US
dc.subject	linguistic resources	en_US
dc.subject	Serbian language	en_US
dc.title	Developing Language Resources for Recognizing Moral Aspects in the Serbian Language: Multi-label Categorization Through the Perspective of Moral Foundations Theory	en_US
dc.type	Conference Object	en_US
dc.relation.conference	Artificial Intelligence Conference (3 ; 2025 ; Belgrade)	en_US
dc.relation.publication	3. Artificial Intelligence Conference 2025 : Book of Abstracts	en_US
dc.identifier.url	https://www.mi.sanu.ac.rs/~ai_conf/2025/AI_Conference_Book_of_Abstracts.pdf	-
dc.contributor.affiliation	Informatics and Computer Science	en_US
dc.description.rank	M34	en_US
dc.relation.firstpage	149	en_US
dc.relation.lastpage	150	en_US
item.fulltext	No Fulltext	-
item.grantfulltext	none	-
item.openairetype	Conference Object	-
item.openairecristype	http://purl.org/coar/resource_type/c_18cf	-
item.languageiso639-1	en	-
item.cerifentitytype	Publications	-
crisitem.author.dept	Informatics and Computer Science	-
crisitem.author.orcid	0000-0002-9323-4695	-
Appears in Collections:	Research outputs
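
The class-based TF-IDF step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the weighting tf(t, c) * log(1 + A / f(t)) popularized by BERTopic, which may differ from the authors' exact formulation. The function names, the toy lemmas, and the handling of multi-label messages (a message contributes its tokens to every class it carries) are hypothetical choices made for illustration only.

```python
import math
from collections import Counter, defaultdict


def class_tfidf(docs, labels):
    """Score each term per class as tf(t, c) * log(1 + A / f(t)).

    docs   : list of token lists (e.g. lemmas of one message each)
    labels : list of label lists, one per message (multi-label)
    """
    # Merge the tokens of all messages that share a label into one bag.
    # Assumption: a multi-label message counts toward each of its classes.
    class_counts = defaultdict(Counter)
    for tokens, doc_labels in zip(docs, labels):
        for label in doc_labels:
            class_counts[label].update(tokens)

    # f(t): frequency of term t summed over all classes.
    total = Counter()
    for counts in class_counts.values():
        total.update(counts)

    # A: average number of tokens per class.
    avg_tokens = sum(total.values()) / len(class_counts)

    scores = {}
    for label, counts in class_counts.items():
        n_tokens = sum(counts.values())
        scores[label] = {
            term: (count / n_tokens) * math.log(1 + avg_tokens / total[term])
            for term, count in counts.items()
        }
    return scores


def top_terms(scores, label, k=3):
    """Return the k highest-scoring terms of a class (ties broken A-Z)."""
    ranked = sorted(scores[label].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy, hand-made lemmas (not taken from Social-Moral.SR).
docs = [
    ["pomoc", "dete", "briga"],
    ["briga", "pomoc", "narod"],
    ["izdaja", "narod", "drzava"],
    ["izdaja", "drzava", "izdaja"],
]
labels = [["care"], ["care"], ["betrayal"], ["betrayal"]]

scores = class_tfidf(docs, labels)
print(top_terms(scores, "care", k=2))      # ['briga', 'pomoc']
print(top_terms(scores, "betrayal", k=1))  # ['izdaja']
```

In this toy run, "narod" appears in both classes and therefore scores lower than the class-specific terms, which is the behaviour that makes such a weighting useful for building a per-category lexicon.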
