Please use this identifier to cite or link to this item: http://20.198.91.3:8080/jspui/handle/123456789/8748
Title: Statistical sandhi splitter for Bengali compound words
Authors: Chatterjee, Soumyabrata
Advisors: Saha, Diganta
Keywords: Statistical Sandhi Splitter;Bengali Compound Words
Issue Date: 2022
Publisher: Jadavpur University, Kolkata, West Bengal
Abstract: Bengali is a rich agglutinative language with compound letters as well as compound words. Compounding is one of the most common methods of word formation in Bengali: new words are generated by combining two (rarely more) root words or stems according to the word-formation rules of the language. In machine translation and information retrieval, the context of a word is important for understanding its use and meaning, and root words play a major role in establishing that context. Some compound words are morphologically difficult to analyse as a whole and, if left unsplit, may degrade the performance of NLP applications. Sandhi splitting is therefore an important step in NLP applications for languages whose compound words are formed by sandhi rules. Here a statistical sandhi splitter for Bengali compound words is proposed. Our approach uses Conditional Random Fields (CRF), one of the most successful statistical learning methods in NLP for labelling and segmenting sequential data. A CRF is first trained to find the split point of a compound word, that is, the position where the morphological change occurs. From the segments obtained after splitting, a CRF model is then used again to predict the class label, namely the sandhi rule that was applied to form the compound word. Using the split point and the predicted label, the root words of the compound word are determined. The dataset was prepared from 515 words taken from a standard Bengali textbook. Previous work on compound splitting was mainly rule-based and relied on a vocabulary to determine root words. The proposed model can handle out-of-vocabulary words as well, is faster, and requires less manual effort. The model achieved an accuracy of 90% in the segmentation stage and 83% in the label assignment stage.
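The two-stage pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the feature set, the "S"/"O" tagging scheme, the rule labels, and the SANDHI_RULES table are all assumptions introduced here for illustration, and the CRF training step itself is left out. The sketch only shows how a compound word might be framed as a character sequence for a sequence labeller, and how a predicted split point plus a predicted rule label could be turned back into root words.

```python
# Illustrative sketch only. The features, tags, rule labels, and rule table
# below are hypothetical; the abstract does not specify them.

def char_features(word, i):
    """Context features for the character at position i (assumed feature set)."""
    return {
        "char": word[i],
        "prev": word[i - 1] if i > 0 else "<BOS>",
        "next": word[i + 1] if i < len(word) - 1 else "<EOS>",
        "position": i,
    }


def word_to_features(word):
    """One feature dict per character: the input a sequence labeller would see."""
    return [char_features(word, i) for i in range(len(word))]


def split_tags(word, split_index):
    """Training tags: 'S' at the split point, 'O' everywhere else."""
    return ["S" if i == split_index else "O" for i in range(len(word))]


# Hypothetical rule table: label -> (text restored at the end of the left root,
#                                    text restored at the start of the right root).
SANDHI_RULES = {
    "a+aa=aa": ("", "আ"),     # e.g. হিম + আলয় -> হিমালয়
    "aa+aa=aa": ("া", "আ"),   # e.g. বিদ্যা + আলয় -> বিদ্যালয়
}


def recover_roots(word, split_index, rule_label):
    """Rebuild the two roots from the split point and the predicted rule label.

    The character at split_index is the fused vowel sign; it is removed and
    replaced on each side by whatever the rule says was there originally.
    """
    left_suffix, right_prefix = SANDHI_RULES[rule_label]
    left = word[:split_index] + left_suffix
    right = right_prefix + word[split_index + 1:]
    return left, right


if __name__ == "__main__":
    word = "হিমালয়"
    print(split_tags(word, 3))
    print(recover_roots(word, 3, "a+aa=aa"))
```

In a full system, the output of word_to_features and split_tags would be passed to a CRF toolkit for the segmentation stage, and a second model would predict the rule label that recover_roots consumes.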
URI: http://20.198.91.3:8080/jspui/handle/123456789/8748
Appears in Collections: Dissertations

Files in This Item:
File: M.E. (Computer Science and Engineering) Soumyabrata Chatterjee.pdf
Size: 520.68 kB
Format: Adobe PDF


Items in IR@JU are protected by copyright, with all rights reserved, unless otherwise indicated.