
A Language Independent Approach to Develop Urdu Stemmer



Abstract

Over the last few years, a wide range of information in Indian regional languages such as Hindi, Urdu, Bengali, Tamil, and Telugu has become available on the web as e-data. Access to these data repositories remains low, however, because few efficient search engines or retrieval systems support these languages. Automatic information processing and retrieval has therefore become an urgent requirement. This paper presents an unsupervised approach to developing an Urdu stemmer. The system is trained on a dataset of 111,887 words taken from CRULP [22]. Two approaches are proposed for generating suffix rules: frequency-based stripping and length-based stripping. The evaluation was carried out on 1,200 words extracted from the Emille corpus. The experimental results show that the two algorithms achieve accuracies of 85.36% and 79.76%.
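The abstract names the two stripping strategies but does not spell out how suffix rules are derived or applied. The following is only a minimal sketch of one plausible reading, not the paper's algorithm: candidate suffixes are counted over every split of every training word, and a word is then stemmed by removing either its most frequent known suffix or its longest known suffix. The function names, the `min_stem_len` and `min_count` thresholds, and the Latin-script toy vocabulary (a stand-in for Urdu training words) are all assumptions made for illustration.

```python
from collections import Counter


def learn_suffixes(words, min_stem_len=2, min_count=2):
    """Count every candidate suffix produced by splitting each distinct word.

    Hypothetical helper: the thresholds are illustrative assumptions,
    not values taken from the paper.
    """
    counts = Counter()
    for word in set(words):
        # Split after at least `min_stem_len` characters; the rest is a candidate suffix.
        for i in range(min_stem_len, len(word)):
            counts[word[i:]] += 1
    # Keep only suffixes seen in at least `min_count` distinct words.
    return {suffix: n for suffix, n in counts.items() if n >= min_count}


def stem(word, suffixes, strategy="frequency", min_stem_len=2):
    """Strip one learned suffix, chosen by frequency or by length."""
    candidates = [
        word[i:] for i in range(min_stem_len, len(word)) if word[i:] in suffixes
    ]
    if not candidates:
        return word  # no known suffix matches; leave the word unchanged
    if strategy == "frequency":
        # Most frequent suffix wins; ties are broken in favour of the longer one.
        best = max(candidates, key=lambda s: (suffixes[s], len(s)))
    else:
        # Length-based: always remove the longest matching suffix.
        best = max(candidates, key=len)
    return word[: len(word) - len(best)]


# Toy Latin-script vocabulary standing in for real training data.
toy_words = ["walking", "talking", "jumping", "walked", "talked", "jumped"]
toy_suffixes = learn_suffixes(toy_words)
```

On this toy vocabulary, `stem("walking", toy_suffixes, "frequency")` returns `"walk"`, while `stem("walking", toy_suffixes, "length")` returns `"wa"`, because the longer suffix `"lking"` also occurs in two words. This illustrates why a purely length-based rule can over-stem; whether that explains the accuracy gap reported above is not something the abstract states.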
