American Journal of Computing Research Repository. 2014, 2(1), 1-7
DOI: 10.12691/ajcrr-2-1-1
Word Segmentation Model for Sindhi Text

Zeeshan Bhatti1, Imdad Ali Ismaili1, Waseem Javaid Soomro1 and Dil Nawaz Hakro1

1Institute of Information and Communication Technology, University of Sindh, Jamshoro

Pub. Date: January 01, 2014

Zeeshan Bhatti, Imdad Ali Ismaili, Waseem Javaid Soomro and Dil Nawaz Hakro. Word Segmentation Model for Sindhi Text. American Journal of Computing Research Repository. 2014; 2(1):1-7. doi: 10.12691/ajcrr-2-1-1


This research addresses the problem of Sindhi word segmentation and discusses various techniques for solving it. Word segmentation is the preliminary phase of any tool based on Natural Language Processing (NLP): for a system to understand written text, it must be able to break that text into individual tokens for processing. Sindhi, written in a cursive, ligature-based Perso-Arabic script, is complex and rich, with a large number of characters, each having multiple glyphs depending on its position in the text. This paper proposes a Sindhi word tokenization model and implements various algorithms that tokenize Sindhi text into individual words for corpus building and for creating a word repository for a Sindhi spell checker, grammar checker, and other NLP applications. Tokenization proceeds by first identifying sentence boundaries and extracting the sentences into a list, where each list element is a complete sentence. The segregated sentences are then broken into words, with the hard space character used as the word boundary; soft spaces are considered part of a word and are therefore ignored during segmentation. Finally, each word is filtered to remove special characters, validated, and saved as a token.

word segmentation, Sindhi tokenization, Sindhi language, Sindhi spell checker
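
The three stages described in the abstract (sentence extraction, splitting on hard spaces, and filtering special characters) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names are hypothetical, and two details are assumptions made for illustration: the set of sentence terminators (Arabic full stop U+06D4, Arabic question mark U+061F, and ".", "!", "?") and the use of the zero-width non-joiner (U+200C) to stand in for the "soft space". The validation step is reduced here to discarding empty tokens.

```python
import re
import unicodedata

# Assumption: these characters end a sentence in Sindhi text.
SENTENCE_END = re.compile("[\u06D4\u061F.!?]+")
# Assumption: the "soft space" is modelled as the zero-width non-joiner.
SOFT_SPACE = "\u200C"
# The "hard space" is the ordinary space character.
HARD_SPACE = " "


def split_sentences(text):
    """Stage 1: return a list in which each element is one sentence."""
    parts = SENTENCE_END.split(text)
    return [part.strip() for part in parts if part.strip()]


def split_words(sentence):
    """Stage 2: split on hard spaces only; soft spaces stay inside words."""
    return [word for word in sentence.split(HARD_SPACE) if word]


def clean_word(word):
    """Stage 3: keep letters, combining marks and soft spaces; drop the rest."""
    kept = []
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M")
        if ch.isalpha() or is_mark or ch == SOFT_SPACE:
            kept.append(ch)
    return "".join(kept)


def tokenize(text):
    """Run the three stages and return the list of non-empty tokens."""
    tokens = []
    for sentence in split_sentences(text):
        for word in split_words(sentence):
            token = clean_word(word)
            if token:  # simplified stand-in for the validation step
                tokens.append(token)
    return tokens
```

Calling `tokenize` on a string returns a flat list of tokens. A word containing U+200C is returned as a single token because only the hard space is treated as a word boundary.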

