
The Rule-Based Sundanese Stemmer




We propose an iterative Sundanese stemmer that removes derivational affixes before inflectional ones. This order was chosen because, in Sundanese affixation, a confix (a type of derivational affix) is applied in the last phase of the morphological process. Moreover, most Sundanese affixes are derivational, so removing derivational affixes first is reasonable. To handle ambiguity, the last recognized affix is returned as the result. As the baseline, we used a Confix-Stripping Approach that applies the Porter Stemmer to the Indonesian language. That stemmer handles similar affix types but uses a different stemming order. To observe whether the baseline stems Sundanese affixed words properly, we added features it did not cover, such as infix and allomorph removal. The evaluation used 4,453 unique affixed words collected from Sundanese online magazines. The experiment shows that, overall, our stemmer outperforms the modified baseline both in the accuracy of recognized affix types and in the proportion of properly stemmed affixed words. Our stemmer recognized 68.87% of the Sundanese affix types and correctly stemmed 96.79% of the affixed words; the modified baseline achieved 21.70% and 71.59%, respectively.
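The stemming order described above (derivational affixes first, then inflectional, repeated until no rule applies) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy lexicon `ROOTS`, the tiny rule tables, the derivational/inflectional split, and the function names `strip_once` and `stem` are hypothetical and are not the rule set or implementation from the paper. Allomorph handling is omitted.

```python
# Toy lexicon of root words (assumption; a real stemmer needs a full dictionary).
ROOTS = {"dahar", "budak", "buku"}

# Hypothetical rule tables: (affix type, leading part, trailing part).
# The split into derivational and inflectional is illustrative only.
DERIVATIONAL = [
    ("confix", "ka", "an"),  # ka- ... -an
    ("prefix", "di", ""),    # di-
    ("infix", "ar", ""),     # -ar- after the first letter, e.g. b-ar-udak
]
INFLECTIONAL = [
    ("suffix", "", "na"),    # -na
]


def strip_once(word, rules):
    """Apply the first matching rule; return (candidate, affix_type) or None."""
    for kind, head, tail in rules:
        if kind == "infix":
            # The infix is assumed to sit directly after the first letter.
            if len(word) > len(head) + 2 and word[1:1 + len(head)] == head:
                return word[0] + word[1 + len(head):], kind
        elif (word.startswith(head) and word.endswith(tail)
              and len(word) > len(head) + len(tail) + 1):
            return word[len(head):len(word) - len(tail)], kind
    return None


def stem(word, max_iter=5):
    """Iteratively strip affixes, trying derivational rules before inflectional.

    Returns (stem, last_recognized_affix_type); the affix type is None
    when no rule applied.
    """
    recognized = None
    for _ in range(max_iter):
        if word in ROOTS:
            break
        result = strip_once(word, DERIVATIONAL) or strip_once(word, INFLECTIONAL)
        if result is None:
            break
        word, recognized = result
    return word, recognized


for w in ["kadaharan", "barudak", "bukuna", "barudakna"]:
    print(w, stem(w))
```

In this sketch the affix type reported for a word is simply the last one stripped, which mirrors the abstract's statement that the last recognized affix is returned; how the actual stemmer resolves ambiguous cases is not specified in the abstract.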




