Please use this identifier to cite or link to this item:
https://research.matf.bg.ac.rs/handle/123456789/1663
Title: | Language Identification: The Case of Serbian |
Authors: | Zečević, Andjelka; Vujičić Stanković, Staša |
Affiliations: | Informatics and Computer Science |
Keywords: | Language identification; Serbian language |
Rank: | M63 |
Publisher: | Belgrade : University of Belgrade, Faculty of Mathematics |
Related Publication(s): | Natural Language Processing for Serbian – Resources and Application |
Abstract: | Serbian and the other national standard languages that are used in place of the common standard Serbo-Croatian have a phonologically based orthography. Serbian can be written in two alphabets (Latin and Cyrillic) and in two dialects (Ekavian and Ijekavian), which are directly reproduced in the written language. Consequently, Serbian is hard to identify, because there are very similar languages that share its alphabets and dialects. The problems typical of closely related languages are therefore strongly present in Serbian. The existing top-level tools do not give results comparable to those for other classes of languages, so it is necessary to locate the problem and use the cumulative linguistic knowledge to overcome it. This paper summarizes the first results towards that goal. We chose several top-level language identification tools and tested their sensitivity to both alphabets and both dialects. For testing purposes we created corpora comprising newspaper articles, literary works written by Serbian authors, and translations of many widely circulated novels. The results indicate that not all the tools support both Latin and Cyrillic scripts, and confirm that language identification of documents written in the Ijekavian variant is much more error-prone than that of documents written in the Ekavian variant. |
URI: | https://research.matf.bg.ac.rs/handle/123456789/1663 |
ISSN: | 978-86-7589-088-1 |
Appears in Collections: | Research outputs |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.