Moratuwa Engineering Research Conference

Analyzing source code identifiers for code reuse using NLP techniques and WordNet

Abstract

A massive amount of source code is freely and openly available. Reusing this open source code can reduce project duration and cost. Although several Code Search Engines (CSEs) exist, finding the most relevant code remains challenging. In this paper we propose a framework that addresses this challenge. The proposed solution takes a software architecture (a class diagram) in XML format, extracts information from the XML file, and fetches relevant projects from GitHub, SourceForge, and GoogleCode using three types of crawlers. It then finds the most relevant projects among the large number downloaded. This research considers only Java projects. Every .java file in each project is represented as an Abstract Syntax Tree (AST), from which identifiers (class names, method names, and attribute names) and comments are extracted. Action words (verbs) are extracted from the comments using Part-of-Speech (POS) tagging. The identifiers are then matched against the information from the XML file. Each matching identifier is awarded marks, and the marks are summed; if the total exceeds 50%, the .java file is considered relevant code. Otherwise, WordNet is used to obtain synonyms of the identifiers, and the matching process is repeated with those synonyms. For identifiers made of connected words, a camel case splitter and an N-gram technique are used to separate the words. The Stanford Spellchecker is used to identify abbreviated words. The results indicate successful identification of relevant source code.
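
The matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how marks are assigned, so the score here is assumed to be the fraction of design-side words found in a file's identifiers. The small `SYNONYMS` table is a hypothetical stand-in for WordNet, and the names `split_identifier`, `match_score`, and `is_relevant` are illustrative only.

```python
import re

# Hypothetical stand-in for WordNet lookups (the paper uses WordNet itself).
SYNONYMS = {
    "remove": {"delete", "erase"},
    "client": {"customer"},
}


def split_identifier(name):
    """Split a camelCase, PascalCase, or snake_case identifier into lowercase words."""
    # Acronym runs (XML), capitalized or lowercase words (Parser, get), digit runs.
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def synonyms(word):
    """Return synonyms of a word, looking the toy table up in both directions."""
    direct = set(SYNONYMS.get(word, ()))
    reverse = {key for key, values in SYNONYMS.items() if word in values}
    return direct | reverse


def match_score(design_names, file_names, use_synonyms=False):
    """Assumed scoring: fraction of design words that appear among the file's words."""
    wanted = {w for n in design_names for w in split_identifier(n)}
    found = {w for n in file_names for w in split_identifier(n)}
    if use_synonyms:
        found |= {s for w in list(found) for s in synonyms(w)}
    if not wanted:
        return 0.0
    return len(wanted & found) / len(wanted)


def is_relevant(design_names, file_names, threshold=0.5):
    """Exact matching first; fall back to synonym matching if the score is too low."""
    if match_score(design_names, file_names) > threshold:
        return True
    return match_score(design_names, file_names, use_synonyms=True) > threshold


if __name__ == "__main__":
    design = ["Customer", "removeCustomer"]      # names from the class diagram
    candidate = ["Client", "deleteClient"]       # names from a .java file's AST
    print(split_identifier("removeCustomer"))
    print(is_relevant(design, candidate))
```

In this sketch the candidate file shares no exact words with the design, so only the synonym pass can push it over the 50% threshold. A real system would replace the toy table with WordNet queries and add the N-gram and spell-checking steps for connected and abbreviated words.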
