Please use this identifier to cite or link to this item:
https://research.matf.bg.ac.rs/handle/123456789/3221
Title: | Developing Language Resources for Recognizing Moral Aspects in the Serbian Language: Multi-label Categorization Through the Perspective of Moral Foundations Theory |
Authors: | Šošić, Milena; Stanković, Ranka; Graovac, Jelena |
Affiliations: | Informatics and Computer Science |
Keywords: | moral; conversational texts; annotation; Classification; linguistic resources; Serbian language |
Issue Date: | 2025 |
Rank: | M34 |
Publisher: | Beograd : Srpska akademija nauka i umetnosti |
Related Publication(s): | 3. Artificial Intelligence Conference 2025 : Book of Abstracts |
Conference: | Artificial Intelligence Conference (3 ; 2025 ; Belgrade) |
Abstract: | In this study, we present the development of an annotated corpus, a lexicon of moral words, and machine learning models for recognizing moral aspects of the Serbian language, using a multi-label annotation approach based on Moral Foundations Theory (MFT; Graham, 2009). Because languages such as Serbian lack linguistic resources for specific language aspects that already exist for other languages (Hoover, 2020), this study aims to expand the capabilities for analyzing moral discourse by applying advanced computational linguistics technologies. The constructed Social-Moral.SR corpus contains ~13.6k conversational messages written in Serbian, collected from the social media platforms X and Reddit and annotated for moral categories. Corpus development began with the automatic collection and keyword-based selection of messages, followed by pre-annotation using an advanced language model that understands Serbian. The selected model, Falcon-7b-Instruct, provided the initial classification of the messages obtained from social networks. The categorization followed the ten categories of moral sentiment defined by MFT6: care/harm, fairness/cheating, loyalty/betrayal, authority/subversion, and purity/degradation.
A critical challenge in pre-annotation lay in adapting the model to the Serbian language and culture, which required additional manual verification and adjustments to ensure the relevance of the obtained moral categorizations (He, 2024; Zangari, 2025). The verified annotations were evaluated using Cohen's kappa coefficient as the main measure of agreement between the annotators. The reliability coefficient, with an average value of 0.36, indicates a moderate but acceptable level of agreement among annotators for the given task and the number of categories in the established annotation schema. Based on the annotated Social-Moral.SR corpus, a lexicon of moral words in Serbian, named MFD.SR and comprising ~4.3k lemma-PoS pairs in total, was developed automatically by combining natural language processing (NLP) resources already available for Serbian, such as PoS taggers, lemmatizers, NER taggers, and sentiment intensity evaluators (SRPOL). To systematically identify the words characteristic of each moral foundation, a class-based TF-IDF algorithm was applied. The resulting lexicon was verified through a survey that assessed the understanding of moral foundations among anonymous participants from the Serbian-speaking community. Metrics of statistical significance (F-statistics) and correlation (Pearson coefficient), computed on the number of moral words recognized in the textual descriptions from the survey answers, indicate a high level of statistical significance and correlation for each moral category, which confirms the lexicon's ability to appropriately recognize moral words in written texts.
Moreover, the machine learning models Moral-BERT.SR and Moral-LLaMA.SR were developed by fine-tuning different distributions of pre-trained BERT (multiple model versions) and LLaMA (model LLaMA-3.2-3B-Instruct) language models, using the annotated Social-Moral.SR corpus as the training source. The fine-tuned models are designed to recognize moral categories in texts from social networks, with a multi-label annotation objective set for the task of recognizing moral signals in textual content. Moral-BERT.SR and Moral-LLaMA.SR achieved F1 scores of ~68% and ~56%, respectively, which is within the range of results published in other studies on the same task (Bulla, 2022). Moral-LLaMA.SR showed a significant performance improvement over the LLaMA-3.2-3B-Instruct base model (used zero-shot) on the moral classification of Serbian messages, indicating the need to fine-tune the model for the task and language in use. The limitations observed with the fine-tuned LLaMA model underscore the current gaps in Serbian language support and the need to train models with a greater number of parameters on larger corpora for enhanced performance. The developed models significantly improve the accuracy and understanding of morality within the Serbian language compared to the lexicon-based model, which makes them valuable tools for further research and analysis of moral content in conversational texts in Serbian discourse. This research represents a significant step towards developing new linguistic resources for the Serbian language in the context of moral psychology and computational linguistics. Despite numerous challenges, the developed linguistic resources provide a solid foundation for future research and applications, including moral reasoning analysis, moral sentiment recognition, and support for sociological studies in the Serbian language.
Future work will focus on enriching and verifying the corpus, enhancing the models for automatic moral stance recognition, and refining methods for extracting moral words and phrases from texts. These efforts aim to increase the effectiveness of the developed linguistic resources for analyzing the moral aspects of the Serbian language. |
URI: | https://research.matf.bg.ac.rs/handle/123456789/3221 |
Appears in Collections: | Research outputs |
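The abstract mentions a class-based TF-IDF step for extracting the words characteristic of each moral category but does not give its formula. The minimal Python sketch below assumes one commonly used variant, in which a term's frequency within a class is weighted by log(1 + A / f_t), where A is the average number of tokens per class and f_t is the term's total frequency across all classes. It also assumes that a message with several labels contributes its tokens to every one of its classes. The function names and the toy English tokens are hypothetical and are not taken from MFD.SR or the authors' pipeline.

```python
import math
from collections import Counter, defaultdict


def class_tf_idf(docs, labels):
    """Score terms per class with a class-based TF-IDF variant.

    docs:   list of token lists (e.g. lemmas of one message each)
    labels: list of label sets, aligned with docs (multi-label)
    Returns {label: {term: score}}.
    """
    # Pool the tokens of all messages that carry a given label.
    # Assumption: a multi-label message counts towards each of its classes.
    class_counts = defaultdict(Counter)
    for tokens, doc_labels in zip(docs, labels):
        for label in doc_labels:
            class_counts[label].update(tokens)

    if not class_counts:
        return {}

    # Total frequency of each term across all classes.
    term_totals = Counter()
    for counts in class_counts.values():
        term_totals.update(counts)

    # Average number of tokens per class (the "A" in the assumed formula).
    avg_tokens = sum(term_totals.values()) / len(class_counts)

    scores = {}
    for label, counts in class_counts.items():
        class_size = sum(counts.values())
        scores[label] = {
            term: (freq / class_size)
            * math.log(1 + avg_tokens / term_totals[term])
            for term, freq in counts.items()
        }
    return scores


def top_terms(scores, k=3):
    """Return the k highest-scoring terms per class (ties broken by term)."""
    return {
        label: [
            term
            for term, _ in sorted(
                term_scores.items(), key=lambda kv: (-kv[1], kv[0])
            )[:k]
        ]
        for label, term_scores in scores.items()
    }


if __name__ == "__main__":
    # Toy, invented data for illustration only.
    toy_docs = [
        ["help", "child", "protect"],
        ["protect", "victim", "help"],
        ["cheat", "unfair", "rule"],
        ["rule", "equal", "cheat"],
        ["help", "rule"],
    ]
    toy_labels = [
        {"care"},
        {"care"},
        {"fairness"},
        {"fairness"},
        {"care", "fairness"},
    ]
    for label, terms in top_terms(class_tf_idf(toy_docs, toy_labels)).items():
        print(label, terms)
```

In a lexicon-building setting, the tokens would presumably be lemma-PoS pairs rather than raw words, and the top-ranked terms per class would still need filtering and human verification before entering a lexicon.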