Please use this identifier to cite or link to this item: https://research.matf.bg.ac.rs/handle/123456789/2552
Title: A variant of n-gram based language-independent text categorization
Authors: Graovac, Jelena 
Affiliations: Informatics and Computer Science 
Issue Date: 2014
Rank: M23
Publisher: Sage Journals
Journal: Intelligent Data Analysis
Abstract: 
A technique for automated categorization of text documents is presented, based on byte-level n-gram profiles and a new dissimilarity measure between profiles. A K nearest neighbors classifier is used. The technique is language independent. It has been applied to four document collections in English, Chinese and Serbian: Reuters-21578 newswire articles, 20-Newsgroups, Tancorp and Ebart. The evaluation used the micro- and macro-averaged F1 measure. The results confirm that the presented technique, although very simple, achieves better results than other n-gram based techniques on the Tancorp and 20-Newsgroups corpora. Compared to other state-of-the-art methods, it performs better than the “bag-of-words” K nearest neighbors classifier, and on the 20-Newsgroups corpus it performs even better than the “bag-of-words” Support vector machines classifier. It can be successfully used in a variety of related problems.
URI: https://research.matf.bg.ac.rs/handle/123456789/2552
DOI: 10.3233/ida-140663
Appears in Collections: Research outputs
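
The pipeline named in the abstract (byte-level n-gram profiles, a dissimilarity between profiles, and a K nearest neighbors vote) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not define the paper's new dissimilarity measure, so the sketch substitutes the relative-difference measure known from earlier n-gram profile work (Kešelj et al.). The function names, the n-gram length `n`, the profile size `top`, and the toy training texts are all hypothetical.

```python
from collections import Counter


def ngram_profile(text: str, n: int = 3, top: int = 1000) -> dict[bytes, float]:
    """Normalized frequencies of the `top` most common byte-level n-grams."""
    data = text.encode("utf-8")  # work on bytes, so no language-specific tokenizer is needed
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.most_common(top)}


def dissimilarity(p: dict[bytes, float], q: dict[bytes, float]) -> float:
    """Stand-in measure (NOT the paper's): sum of squared relative differences."""
    score = 0.0
    for gram in p.keys() | q.keys():
        a, b = p.get(gram, 0.0), q.get(gram, 0.0)
        score += (2.0 * (a - b) / (a + b)) ** 2  # a + b > 0 for every gram in the union
    return score


def knn_classify(text, training, k=1, n=3, top=1000):
    """Label `text` by majority vote among the k least dissimilar training documents."""
    query = ngram_profile(text, n, top)
    scored = sorted(
        (dissimilarity(query, ngram_profile(doc, n, top)), label)
        for doc, label in training
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: one English and one Serbian training document.
training = [
    ("the cat sat on the mat and the dog sat on the log", "en"),
    ("mačka sedi na prozoru i gleda kroz prozor", "sr"),
]
print(knn_classify("the dog sat on the mat", training))
print(knn_classify("mačka gleda kroz prozor", training))
```

Because profiles are built from raw UTF-8 bytes, the same code handles English, Serbian or Chinese input without change, which is the sense in which such a technique is language independent. A practical version would cache the training profiles instead of recomputing them for each query.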

Scopus citations: 21 (checked on Sep 24, 2025)

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.