Title: A variant of n-gram based language-independent text categorization
Authors: Graovac, Jelena 
Affiliations: Informatics and Computer Science 
Issue Date: 2014
Rank: M23
Publisher: Sage Journals
Journal: Intelligent Data Analysis
Abstract: 
This paper presents a technique for automated categorization of text documents, based on byte-level n-gram profiles and a new dissimilarity measure between profiles. A k-nearest-neighbors classifier is used. The technique is language independent and has been applied to four document collections in English, Chinese and Serbian: Reuters-21578 newswire articles, 20-Newsgroups, Tancorp and Ebart. Evaluation used the micro- and macro-averaged F1 measure. The results confirm that the technique, although very simple, achieves better results than other n-gram based techniques on the Tancorp and 20-Newsgroups corpora. Compared with other state-of-the-art methods, it performs better than the “bag-of-words” k-nearest-neighbors classifier, and on the 20-Newsgroups corpus it outperforms even the “bag-of-words” support vector machines classifier. It can be applied to a variety of related problems.
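
The pipeline named in the abstract (byte-level n-gram profiles, a dissimilarity between profiles, and a nearest-neighbors vote) can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: the function names, the profile size, the n-gram length, and in particular the dissimilarity formula are assumptions. The abstract does not define the paper's new measure, so a squared relative-difference measure familiar from earlier n-gram profile work is used here only as a stand-in.

```python
from collections import Counter


def ngram_profile(text, n=3, size=100):
    """Map the `size` most frequent byte-level n-grams to normalized frequencies.

    Working on UTF-8 bytes rather than characters or words is what makes the
    approach independent of language and tokenization.
    """
    data = text.encode("utf-8")
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    if total == 0:  # text shorter than n bytes
        return {}
    return {gram: c / total for gram, c in counts.most_common(size)}


def dissimilarity(p1, p2):
    """Stand-in measure (an assumption, NOT the paper's new measure):
    sum of squared relative frequency differences over the union of n-grams."""
    d = 0.0
    for gram in p1.keys() | p2.keys():
        f1 = p1.get(gram, 0.0)
        f2 = p2.get(gram, 0.0)
        # f1 + f2 > 0 because gram occurs in at least one profile
        d += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return d


def knn_classify(doc, training, k=1, n=3, size=100):
    """Label `doc` by majority vote among its k least dissimilar training docs.

    `training` is a list of (text, label) pairs (hypothetical data layout).
    """
    profile = ngram_profile(doc, n, size)
    scored = sorted(
        (dissimilarity(profile, ngram_profile(text, n, size)), label)
        for text, label in training
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```

Because the profiles are built from raw bytes, the same code handles English, Chinese or Serbian text without a tokenizer; swapping in a different dissimilarity function only requires replacing `dissimilarity`.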
URI: https://research.matf.bg.ac.rs/handle/123456789/2552
DOI: 10.3233/ida-140663
Appears in Collections: Research outputs