Please use this identifier to cite or link to this item:
https://research.matf.bg.ac.rs/handle/123456789/495
Title: A variant of N-gram based language classification
Authors: Tomović, Andrija; Janičić, Predrag
Affiliations: Informatics and Computer Science
Issue Date: 1-Jan-2007
Related Publication(s): Congress of the Italian Association for Artificial Intelligence AI*IA 2007
Journal: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Abstract: Rapid classification of documents is of high importance in many multilingual settings (such as international institutions or Internet search engines). This has been, for years, a well-known problem, addressed by different techniques, with excellent results. We address this problem by a simple n-gram-based technique, a variation of techniques of this family. Our n-gram-based classification is very robust and successful, even for 20-fold classification, and even for short text strings. We give a detailed study for different lengths of strings and sizes of n-grams, and we explore what classification parameters give the best performance. There is no requirement for vocabularies, but only for a few training documents. As a main corpus, we used an EU set of documents in 20 languages. Experimental comparison shows that our approach gives better results than four other popular approaches. © Springer-Verlag Berlin Heidelberg 2007.
URI: https://research.matf.bg.ac.rs/handle/123456789/495
ISBN: 9783540747819
ISSN: 03029743
DOI: 10.1007/978-3-540-74782-6_36
Appears in Collections: Research outputs
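The abstract refers to language classification by n-grams but this record does not spell out the authors' variant. As a rough illustration of the general family only, the sketch below builds character trigram frequency profiles from a few training strings and assigns a query to the language whose profile has the highest cosine similarity. The function names (`ngram_profile`, `cosine`, `classify`), the choice of cosine similarity, the default `n = 3`, and the toy training sentences are all assumptions made for this example; they are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text: str, profiles: dict, n: int = 3) -> str:
    """Return the language whose training profile is most similar to the text."""
    query = ngram_profile(text, n)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Toy training data (hypothetical); a real corpus would be far larger.
training = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "german": "der schnelle braune fuchs springt über den faulen hund und die katze sitzt auf der matte",
    "italian": "la volpe veloce salta sopra il cane pigro e il gatto dorme sul tappeto",
}
profiles = {lang: ngram_profile(sample) for lang, sample in training.items()}

print(classify("the dog sat on the mat", profiles))
```

No vocabulary is needed in this sketch, only a small amount of training text per language, which matches the requirement stated in the abstract. The similarity measure and n-gram size used here are placeholders and may differ from the parameters studied in the paper.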