Please use this identifier to cite or link to this item: https://research.matf.bg.ac.rs/handle/123456789/495
Title: A variant of N-gram based language classification
Authors: Tomović, Andrija
Janičić, Predrag 
Affiliations: Informatics and Computer Science 
Issue Date: 1-Jan-2007
Related Publication(s): Congress of the Italian Association for Artificial Intelligence AI*IA 2007
Journal: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Abstract: 
Rapid classification of documents is of high importance in many multilingual settings (such as international institutions or Internet search engines). This has been a well-known problem for years, addressed by different techniques with excellent results. We address this problem with a simple n-gram-based technique, a variation on techniques of this family. Our n-gram-based classification is very robust and successful, even for 20-fold classification, and even for short text strings. We give a detailed study for different string lengths and n-gram sizes, and we explore which classification parameters give the best performance. There is no requirement for vocabularies, only for a few training documents. As the main corpus, we used an EU set of documents in 20 languages. Experimental comparison shows that our approach gives better results than four other popular approaches. © Springer-Verlag Berlin Heidelberg 2007.
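For orientation only, here is a minimal sketch of a generic character n-gram language classifier. It is not the variant proposed in the paper, whose details the abstract does not give. The function names (`ngram_profile`, `cosine`, `classify`), the use of cosine similarity, the default n = 3, and the toy training sentences are all assumptions made for illustration. They stand in for the EU corpus and the paper's own parameters.

```python
from collections import Counter
import math


def ngram_profile(text, n=3):
    """Return relative frequencies of character n-grams in `text`."""
    # Normalise case and whitespace so profiles are comparable.
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse n-gram profiles."""
    dot = sum(v * q.get(gram, 0.0) for gram, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def classify(text, profiles, n=3):
    """Pick the language whose profile is most similar to the text's profile."""
    query = ngram_profile(text, n)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Hypothetical toy training data: one short document per language,
# no vocabulary or dictionary required.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat with the other animals",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze sitzt auf der matte",
    "fr": "le renard brun rapide saute par-dessus le chien paresseux et le chat est assis sur le tapis",
}
PROFILES = {lang: ngram_profile(doc) for lang, doc in TRAINING.items()}

print(classify("the dog and the cat", PROFILES))
```

A sketch like this needs only a few training documents per language, which matches the abstract's point about not requiring vocabularies. Real accuracy depends on the training data, the n-gram size, and the length of the input string.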
URI: https://research.matf.bg.ac.rs/handle/123456789/495
ISBN: 9783540747819
ISSN: 03029743
DOI: 10.1007/978-3-540-74782-6_36
Appears in Collections: Research outputs
