American Journal of Information Systems
ISSN (Print): 2374-1953 ISSN (Online): 2374-1988 Website: http://www.sciepub.com/journal/ajis Editor-in-chief: Sergii Kavun
Open Access
American Journal of Information Systems. 2013, 1(1), 26-30
DOI: 10.12691/ajis-1-1-4

Adapt Clustering Methods for Arabic Documents

Boumedyen Shannaq1

1 Computer Science and Information Technology Department, Mazoon College, “University College”, Muscat, Sultanate of Oman

Pub. Date: November 26, 2013

Cite this paper:
Boumedyen Shannaq. Adapt Clustering Methods for Arabic Documents. American Journal of Information Systems. 2013; 1(1):26-30. doi: 10.12691/ajis-1-1-4

Abstract

This research paper develops a new clustering method (FWC) and proposes a new approach to filtering data collected from internet resources. The focus is clustering, which groups data instances into subsets so that similar instances are grouped together while different instances belong to different groups. The instances are thereby organized into an efficient representation that characterizes the population being sampled, reducing the very large volume of retrieved data. This was done by removing dissimilar text files and grouping similar documents into homogeneous clusters. Arabic text files totaling 974 MB were collected, processed, analyzed, and filtered using common clustering methods. These clustering methods are presented in five groups: hierarchical, partitioning, density-based, model-based, and soft-computing methods. Following the methods, the challenges of clustering large data sets are discussed and tested with the proposed new clustering method. Two experiments were conducted to establish the effectiveness of the FWC method, and the results show that FWC outperformed existing clustering methods.

Keywords:
clustering, knowledge management, information retrieval system

This work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

