Please use this identifier to cite or link to this item: https://research.matf.bg.ac.rs/handle/123456789/2457
Title: Language Independent n-Gram-Based Text Categorization with Weighting Factors: A Case Study
Authors: Graovac, Jelena; Pavlović Lažetić, Gordana; Kovačević, Jovana
Affiliations: Informatics and Computer Science 
Informatics and Computer Science 
Keywords: Arabic;byte-level n-gram;English;kNN;natural language text categorization
Issue Date: 2015
Publisher: Brazilian Computer Society Special Interest Group on Databases
Journal: Journal of Information and Data Management
Abstract: 
We introduce a new language-independent text categorization technique based on restricted-size n-gram profiles of both documents and categories, an n-gram weighting scheme, and a simple algorithm for comparing profiles. The technique requires no morphological analysis of the texts, no preprocessing, and no prior information about document content or language. We apply it to text categorization in two widely spoken yet paradigmatically quite different languages, English and Arabic, to demonstrate its language independence, using the publicly available document collections 20-Newsgroups and Mesleh-10, respectively. Experimental results, reported as macro- and micro-averaged F1 measures, indicate that the new technique outperforms other n-gram-based and bag-of-words machine learning techniques on English and Arabic text categorization.
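The profile-based approach described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names (`ngram_profile`, `dissimilarity`, `classify`), the default values of `n` and `size`, and the relative-difference dissimilarity measure are assumptions chosen for illustration. The abstract does not specify the weighting factors, so they are left out, as is the kNN step named in the keywords: the sketch simply picks the single nearest category profile.

```python
from collections import Counter
from typing import Dict


def ngram_profile(text: str, n: int = 3, size: int = 1000) -> Dict[bytes, float]:
    """Build a restricted-size profile: the `size` most frequent
    byte-level n-grams of `text`, mapped to normalized frequencies."""
    data = text.encode("utf-8")  # byte level, so no language-specific tokenization
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.most_common(size)}


def dissimilarity(p1: Dict[bytes, float], p2: Dict[bytes, float]) -> float:
    """Sum of squared relative frequency differences over the union of
    n-grams (an assumed measure; the paper's own may differ)."""
    total = 0.0
    for gram in p1.keys() | p2.keys():
        f1 = p1.get(gram, 0.0)
        f2 = p2.get(gram, 0.0)
        # f1 + f2 > 0 because gram occurs in at least one profile
        total += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return total


def classify(doc: str, category_profiles: Dict[str, Dict[bytes, float]],
             n: int = 3, size: int = 1000) -> str:
    """Return the category whose profile is least dissimilar to the document's."""
    doc_profile = ngram_profile(doc, n, size)
    return min(category_profiles,
               key=lambda c: dissimilarity(doc_profile, category_profiles[c]))


# Toy usage with made-up training text; real category profiles would be
# built from a training collection such as 20-Newsgroups or Mesleh-10.
categories = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog " * 3),
    "ar": ngram_profile("مرحبا بالعالم هذا نص عربي قصير " * 3),
}
label = classify("the dog jumps over the fox", categories)
```

Because the profiles are built from raw UTF-8 bytes, the same code handles English and Arabic input without any language-specific preprocessing, which is the property the abstract emphasizes.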
URI: https://research.matf.bg.ac.rs/handle/123456789/2457
Appears in Collections: Research outputs
