Word Segmentation Model for Sindhi Text

Zeeshan Bhatti; Imdad Ali Ismaili; Waseem Javaid Soomro; Dil Nawaz Hakro

American Journal of Computing Research Repository. 2014, 2(1), 1-7
DOI: 10.12691/ajcrr-2-1-1

Zeeshan Bhatti¹, Imdad Ali Ismaili¹, Waseem Javaid Soomro¹ and Dil Nawaz Hakro¹

¹Institute of Information and Communication Technology, University of Sindh, Jamshoro

Pub. Date: January 01, 2014

Cite this paper:
Zeeshan Bhatti, Imdad Ali Ismaili, Waseem Javaid Soomro and Dil Nawaz Hakro. Word Segmentation Model for Sindhi Text. American Journal of Computing Research Repository. 2014; 2(1):1-7. doi: 10.12691/ajcrr-2-1-1

Abstract

This research addresses the problem of Sindhi word segmentation and discusses several techniques for solving it. Word segmentation is the preliminary phase of any tool based on Natural Language Processing (NLP): for a system to understand written text, it must first break the text into individual tokens for processing. Sindhi is written in a cursive, ligature-based Perso-Arabic script that is complex and rich, with a large number of characters, each of which has multiple glyphs depending on its position in the text. This paper proposes a Sindhi word tokenization model and implements various algorithms that tokenize Sindhi text into individual words, for corpus building and for creating a word repository for a Sindhi spell checker, grammar checker and other NLP applications. Tokenization proceeds in three stages. First, sentence boundaries are identified and each sentence is extracted into a list in which every element is a complete sentence. Next, the separated sentences are broken into words, with the hard space character treated as the word boundary; soft spaces are considered part of a word and are therefore not used as segmentation points. Finally, each word is filtered to remove special characters, validated, and saved as a token.
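
The three stages described in the abstract can be sketched roughly as follows. This is a minimal illustration in Python under stated assumptions, not the authors' implementation. The set of sentence terminators, the choice of the zero-width non-joiner (U+200C) as the "soft space", the rule for what counts as a special character, and the placeholder validation step are all assumptions made for the example, and every function name is hypothetical.

```python
import re
import unicodedata

# Assumption: sentences end at '.', '!', '?', the Arabic question mark
# (U+061F) or the Arabic full stop (U+06D4). The paper's set may differ.
SENTENCE_END = re.compile("[.!?\u061F\u06D4]+")

# Assumption: the "soft space" is the zero-width non-joiner (U+200C).
SOFT_SPACE = "\u200C"


def split_sentences(text):
    """Stage 1: return a list whose elements are complete sentences."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def split_words(sentence):
    """Stage 2: break a sentence at hard spaces (ordinary whitespace).

    U+200C is not whitespace, so it stays inside the word.
    """
    return sentence.split()


def clean_word(word):
    """Stage 3a: keep only letters, combining marks and the soft space.

    Punctuation, digits and symbols are treated as special characters
    and removed (an assumption about what the filter should drop).
    """
    return "".join(
        ch for ch in word
        if ch == SOFT_SPACE or unicodedata.category(ch)[0] in ("L", "M")
    )


def is_valid(token):
    """Stage 3b: placeholder validation, accepts any token with a letter."""
    return any(unicodedata.category(ch).startswith("L") for ch in token)


def tokenize(text):
    """Run the three stages and return the list of word tokens."""
    tokens = []
    for sentence in split_sentences(text):
        for word in split_words(sentence):
            token = clean_word(word)
            if is_valid(token):
                tokens.append(token)
    return tokens


if __name__ == "__main__":
    sample = "هي ڪتاب آهي. تون ڪير آهين؟"
    for token in tokenize(sample):
        print(token)
```

Keeping each stage in its own function mirrors the order given in the abstract and makes it easy to swap in a different boundary set or a lexicon-based validation step.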

Keywords:
word segmentation, Sindhi tokenization, Sindhi language, Sindhi spell checker

This work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
