Analyzing source code identifiers for code reuse using NLP techniques and WordNet

Abstract

A massive amount of source code is freely and openly available. Reusing open source code in projects can reduce project duration and cost. Although several Code Search Engines (CSEs) are available, finding the most relevant code can be challenging. In this paper we propose a framework to address this challenge. The proposed solution starts with a software architecture (class diagram) in XML format, extracts information from the XML file, and then fetches relevant projects from GitHub, SourceForge, and GoogleCode using three types of crawlers. It then finds the most relevant projects among the large number of downloaded projects. This research considers only Java projects. Every Java file in each project is represented as an Abstract Syntax Tree (AST) to extract identifiers (class names, method names, and attribute names) and comments. Action words (verbs) are extracted from comments using Part-of-Speech (POS) tagging. The identifiers are then matched against the information from the XML file. Matched identifiers are awarded marks, the marks are summed, and if the total exceeds 50%, the .java file is considered relevant code. Otherwise, WordNet is used to obtain synonyms of the identifiers, and the matching process is repeated with those synonyms. For identifiers made of connected words, a camel case splitter and an N-gram technique are used to separate the words. The Stanford Spellchecker is used to identify abbreviated words. The results indicate successful identification of relevant source code.
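The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the marking scheme, so the score here is assumed to be the fraction of design words found among the code identifiers, and the hypothetical `synonyms` dictionary is a hand-written stand-in for a WordNet lookup. The function names `split_identifier`, `match_score`, and `is_relevant` are illustrative only.

```python
import re


def split_identifier(name):
    """Split a camelCase, PascalCase, or snake_case identifier into lowercase words."""
    # Acronym runs (e.g. "XML"), capitalized or lowercase words, and digit runs.
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def words_of(identifiers):
    """Collect the set of words that appear in a list of identifiers."""
    words = set()
    for ident in identifiers:
        words.update(split_identifier(ident))
    return words


def match_score(design_names, code_identifiers, synonyms=None):
    """Fraction of design words matched in the code (assumed marking scheme)."""
    synonyms = synonyms or {}
    design_words = words_of(design_names)
    code_words = words_of(code_identifiers)
    if not design_words:
        return 0.0
    matched = 0
    for word in design_words:
        # A word matches directly or through one of its listed synonyms.
        candidates = {word} | set(synonyms.get(word, ()))
        if candidates & code_words:
            matched += 1
    return matched / len(design_words)


def is_relevant(design_names, code_identifiers, synonyms, threshold=0.5):
    """Try direct matching first, then repeat with synonyms if the score is too low."""
    if match_score(design_names, code_identifiers) > threshold:
        return True
    return match_score(design_names, code_identifiers, synonyms) > threshold


# Hypothetical class-diagram names and identifiers from one .java file.
design = ["Customer", "getOrderTotal"]
code = ["Client", "fetchOrderTotal"]
# Stand-in for WordNet synonym sets (assumption for illustration).
synonyms = {"customer": ["client"], "get": ["fetch"]}

print(match_score(design, code))
print(is_relevant(design, code, synonyms))
```

In this example only two of the four design words match directly, so the file does not pass the 50% threshold until the synonym pass is applied. The N-gram splitting and spellchecker-based abbreviation handling mentioned in the abstract are not covered by this sketch.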

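Bibliographic record

Source
Moratuwa Engineering Research Conference | 2017 | xxii 496 p. | 6 pages
Authors
P. Pirapuraj; Indika Perera
Original format
PDF
Chinese Library Classification
Power transmission and distribution engineering, power grids, and power systems
Keywords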
XML; Java; Data mining; Crawlers; Syntactics; Software architecture; Software;

Similar literature

1. TURNING WORDNET INTO AN INFORMATION RETRIEVAL RESOURCE: SYSTEMATIC POLYSEMY AND CONVERSION TO HIERARCHICAL CODES [J]. RADA MIHALCEA. International Journal of Pattern Recognition and Artificial Intelligence. 2003, No. 5
2. rust-code-analysis: A Rust library to analyze and extract maintainability information from source codes [J]. Luca Ardito, Luca Barbato, Marco Castelluccio. SoftwareX. 2020, No. 2
3. Spotting and Removing WSDL Anti-pattern Root Causes in Code-first Web Services Using NLP Techniques: A Thorough Validation of Impact on Service Discoverability [J]. Matías Hirsch, Ana Rodriguez, Juan Manuel Rodriguez. Computer Standards & Interfaces. 2018, No. FEBa
4. Analyzing source code identifiers for code reuse using NLP techniques and WordNet [C]. P. Pirapuraj, Indika Perera. Moratuwa Engineering Research Conference. 2017
5. SOME TECHNIQUES IN UNIVERSAL SOURCE CODING AND CODING FOR COMPOSITE SOURCES [D]. WALLACE, MARK STANLEY. 1982
6. Validity of ICD‐9 and ICD‐10 codes used to identify acute liver injury: A study in three European data sources [O]. Joan Forns, Miguel Cainzos‐Achirica, Maja Hellfritzsch
7. Identifying Source Code Reuse across Repositories using LCS-based Source Code Similarity [O]. 2015
