American Journal of Computing Research Repository
ISSN (Print): 2377-4606
ISSN (Online): 2377-4266
Website: http://www.sciepub.com/journal/ajcrr
Editor-in-chief: Vishwa Nath Maurya
Open Access
American Journal of Computing Research Repository. 2013, 1(1), 1-5
DOI: 10.12691/ajcrr-1-1-1
Open Access Article

Investigating the Distribution of Arabic and English Keywords and Their Progress Over Different Text File Formats

Boumedyen Shannaq¹

¹ Computer Science and Information Technology Department, Mazoon University College, Muscat, Sultanate of Oman

Pub. Date: November 13, 2013

Cite this paper:
Boumedyen Shannaq. Investigating the Distribution of Arabic and English Keywords and Their Progress Over Different Text File Formats. American Journal of Computing Research Repository. 2013; 1(1):1-5. doi: 10.12691/ajcrr-1-1-1

Abstract

This paper presents a systematic approach to text format categorization. It draws on corpus linguistics and demonstrates how text files in HTML, PDF, DOC, and TXT formats can be analyzed. The work compares Arabic and English texts across several file formats by calculating a distributing factor for the keyword distribution in the Arabic and English documents. All of the selected texts come from the computer technology domain. The text categorization process is applied to a text collection that consists of two main corpora, one Arabic and one English. The results show that the Arabic documents are well distributed in DOC files, whereas the English documents are well distributed in XML files. These results should contribute to building an effective electronic learning system for Arabic and English texts. The results and conclusions are presented with graphical outputs for better understanding.
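
The abstract does not define how the distributing factor is computed, so the following is only a minimal sketch under a stated assumption: for a given keyword, the factor for a file format is taken to be the fraction of that format's documents that contain the keyword. The function name `distribution_factor`, the `(format, text)` input layout, and the sample documents are all hypothetical, and the text is assumed to have already been extracted from the original files.

```python
from collections import defaultdict


def distribution_factor(docs, keyword):
    """Return {file_format: share of that format's documents containing keyword}.

    docs is a list of (file_format, extracted_text) pairs.
    NOTE: this is an assumed stand-in for the paper's distributing factor,
    not the definition used by the author.
    """
    totals = defaultdict(int)  # number of documents seen per format
    hits = defaultdict(int)    # documents per format that contain the keyword
    target = keyword.lower()
    for fmt, text in docs:
        totals[fmt] += 1
        # Whitespace tokenization; works for both English and Arabic text.
        if target in text.lower().split():
            hits[fmt] += 1
    return {fmt: hits[fmt] / totals[fmt] for fmt in totals}


# Hypothetical sample data for illustration only.
docs = [
    ("doc", "the network protocol handles routing"),
    ("doc", "a database index speeds up queries"),
    ("txt", "network latency affects the protocol"),
    ("txt", "the network uses packet switching"),
]
print(distribution_factor(docs, "network"))  # {'doc': 0.5, 'txt': 1.0}
```

Comparing the per-format values for the same keyword set in an Arabic corpus and an English corpus would give the kind of format-by-format comparison the abstract describes, although the paper's own measure may weight or normalize counts differently.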

Keywords:
information retrieval, text categorization, distributing factor, natural language processing, future trends

This work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
