Article citations

Akram, M. (2009) “Word Segmentation for Urdu OCR System”, Master’s Thesis, Department of Computer Science, National University of Computer & Emerging Sciences, Lahore, Pakistan.

has been cited by the following article:

Article

Word Segmentation Model for Sindhi Text

Institute of Information and Communication Technology, University of Sindh, Jamshoro


American Journal of Computing Research Repository. 2014, Vol. 2 No. 1, 1-7
DOI: 10.12691/ajcrr-2-1-1
Copyright © 2014 Science and Education Publishing

Cite this paper:
Zeeshan Bhatti, Imdad Ali Ismaili, Waseem Javaid Soomro, Dil Nawaz Hakro. Word Segmentation Model for Sindhi Text. American Journal of Computing Research Repository. 2014; 2(1):1-7. doi: 10.12691/ajcrr-2-1-1.

Correspondence to: Zeeshan Bhatti, Institute of Information and Communication Technology, University of Sindh, Jamshoro. Email: zeeshan.bhatti@usindh.edu.pk

Abstract

This research addresses the problem of Sindhi word segmentation and discusses various techniques for solving it. Word segmentation is the preliminary phase of any tool based on Natural Language Processing (NLP): for a system to understand written text, it must first be able to break that text into individual tokens for processing. Sindhi, written in a cursive, ligature-based Perso-Arabic script, is complex and rich, with a large number of characters, each of which has multiple glyphs depending on its position in the text. This paper proposes a Sindhi word tokenization model and implements various algorithms that tokenize Sindhi text into individual words, for corpus building and for creating a word repository for a Sindhi spell checker, grammar checker, and other NLP applications. Tokenization proceeds by first identifying sentence boundaries and extracting the sentences into a list in which each element is a complete sentence. Each sentence is then broken into words, with the hard space character treated as the word boundary; soft spaces are considered part of the word and are therefore not used for segmentation. Finally, each word is filtered to remove special characters, validated, and saved as a token.
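The pipeline described in the abstract (sentence splitting, word splitting on hard spaces, special-character filtering, validation) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The set of sentence terminators, the use of the zero-width non-joiner (U+200C) as the "soft space", the Unicode-category filter for special characters, and the validation rule are all assumptions made for illustration, as are the function names.

```python
import re
import unicodedata

# Assumed sentence terminators: Arabic full stop, Arabic question mark, ". ! ?"
SENTENCE_END = re.compile("[\u06D4\u061F.!?]+")
HARD_SPACE = " "        # U+0020, treated as the word boundary
SOFT_SPACE = "\u200C"   # zero-width non-joiner, assumed to be the "soft space"


def split_sentences(text):
    """Return a list in which each element is one sentence."""
    return [part.strip() for part in SENTENCE_END.split(text) if part.strip()]


def clean_word(word):
    """Drop punctuation (P*) and symbol (S*) characters.

    The soft space has category Cf, so it is kept inside the word.
    """
    return "".join(ch for ch in word if unicodedata.category(ch)[0] not in "PS")


def is_valid(word):
    """Assumed validation rule: a token must contain at least one letter."""
    return any(unicodedata.category(ch).startswith("L") for ch in word)


def tokenize(text):
    """Sentences, then hard-space split, then filtering and validation."""
    tokens = []
    for sentence in split_sentences(text):
        for raw in sentence.split(HARD_SPACE):
            word = clean_word(raw)
            if is_valid(word):
                tokens.append(word)
    return tokens


if __name__ == "__main__":
    # Placeholder Latin letters stand in for Sindhi text; the escapes are
    # a soft space, an Arabic comma, and two sentence terminators.
    sample = "ab" + SOFT_SPACE + "cd ef\u060C gh\u06D4 ij kl\u061F"
    print(tokenize(sample))
```

Because the soft space is neither a split point nor a filtered character, a word containing it survives as a single token, which is the behaviour the abstract describes.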

Keywords