Journal: Journal of Theoretical and Applied Information Technology

A HYBRID METHOD OF FEATURE EXTRACTION AND NAIVE BAYES CLASSIFICATION FOR SPLITTING IDENTIFIERS

Abstract

Integrating natural language processing techniques into software systems has attracted the attention of many researchers. Such integration can take the form of analyzing the morphology of source code in order to extract meaningful information. Feature location is the process of identifying specific portions of the source code. Some of the most important information in source code lies in its identifiers (e.g. Student). Unlike in traditional text processing, identifiers in source code are often formed from multiple words, such as Employee-Name. These multi-word identifiers are not divided by white space; instead, they are formed using special characters (e.g. Employee_ID), CamelCase (e.g. EmployeeName), or abbreviations (e.g. EmpNm). This makes extracting the words within such identifiers more challenging. Several approaches have been proposed for splitting multi-word identifiers. However, there is still room for improvement in terms of accuracy. Such improvement can be achieved by utilizing more robust features that are able to analyze the morphology of identifiers. Therefore, this study proposes a hybrid method of feature extraction and a Naive Bayes classifier in order to split multi-word identifiers within source code. The dataset used in this study is an annotated benchmark that contains a large amount of Java code. Multiple experiments were conducted to evaluate the proposed features both independently and in combination. The results show that the combination of all features obtained the best result, achieving an F-measure of 64.7%. This finding indicates the usefulness of the proposed features for discriminating multi-word identifiers.
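
To make the splitting problem concrete, the following is a minimal sketch of a rule-based baseline that handles the special-character and CamelCase conventions mentioned in the abstract. It is not the hybrid method proposed in the paper, whose features are not described here; the function name `split_identifier` and the regular expression are assumptions chosen purely for illustration.

```python
import re

# Hypothetical baseline, NOT the paper's feature-extraction + Naive Bayes method.
# Token pattern, tried in order at each position:
#   1. a run of capitals not followed by a lowercase letter (acronyms such as "ID")
#   2. an optional capital followed by lowercase letters (ordinary words)
#   3. a run of digits
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier(identifier):
    """Split an identifier on special characters, then on CamelCase boundaries."""
    parts = []
    # Any run of non-alphanumeric characters (e.g. "_" or "-") acts as a separator.
    for chunk in re.split(r"[^A-Za-z0-9]+", identifier):
        parts.extend(_TOKEN.findall(chunk))
    return parts


if __name__ == "__main__":
    for name in ["Employee_ID", "EmployeeName", "Employee-Name", "EmpNm"]:
        print(name, split_identifier(name))
```

A baseline like this separates EmpNm into Emp and Nm only because the capital letter marks the boundary. It cannot split identifiers that carry no such cue (for example, an all-lowercase concatenation), which is the kind of case where a classifier trained on morphological features could plausibly help.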
