Conference: IEEE International Conference on Software Maintenance and Evolution

On the Impact of Tokenizer and Parameters on N-Gram Based Code Analysis

Abstract

Recent research shows that language models, such as n-gram models, are useful for a wide variety of software engineering tasks, e.g., code completion, bug identification, and code summarisation. However, such models require appropriate settings for numerous parameters. Moreover, the different ways one can read code essentially yield different models (based on the different sequences of tokens). In this paper, we focus on n-gram models and evaluate how the choice of tokenizer, smoothing technique, unknown threshold and n value impacts the predictive ability of these models. To this end, we compare multiple tokenizers and different parameter sets (smoothing, unknown threshold and n values) with the aim of identifying the most appropriate combinations. Our results show that the Modified Kneser-Ney smoothing technique performs best, while the n value depends on the choice of the tokenizer, with values of 4 or 5 offering a good trade-off between entropy and computation time. Interestingly, we find that tokenizers treating the code as simple text are the most robust ones. Finally, we demonstrate that the differences between the tokenizers are of practical importance and have the potential to change the conclusions of a given experiment.
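To make the interplay of tokenizer, n value and unknown threshold concrete, the following is a minimal sketch rather than the authors' experimental setup. It assumes two hypothetical tokenizers (`text_tokenizer`, which splits on whitespace only, and `lexer_like_tokenizer`, a rough regex split), and it swaps in simple add-k smoothing in place of Modified Kneser-Ney to stay short. All function names, the `k` value and the toy code strings are illustrative assumptions.

```python
import math
import re
from collections import Counter


def text_tokenizer(code):
    """Treat code as plain text: split on whitespace only (illustrative)."""
    return code.split()


def lexer_like_tokenizer(code):
    """Rough lexer-style split into identifiers, numbers and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def _prepare(tokens, vocab, n):
    """Pad with start symbols and map out-of-vocabulary tokens to <unk>."""
    return ["<s>"] * (n - 1) + [t if t in vocab else "<unk>" for t in tokens]


def train(tokens, n, unk_threshold, k=0.1):
    """Count n-grams; tokens seen fewer than unk_threshold times become <unk>."""
    freq = Counter(tokens)
    vocab = {t for t, c in freq.items() if c >= unk_threshold} | {"<unk>", "<s>"}
    seq = _prepare(tokens, vocab, n)
    positions = range(len(seq) - n + 1)
    ngrams = Counter(tuple(seq[i:i + n]) for i in positions)
    contexts = Counter(tuple(seq[i:i + n - 1]) for i in positions)
    return {"n": n, "k": k, "vocab": vocab, "ngrams": ngrams, "contexts": contexts}


def cross_entropy(model, tokens):
    """Average bits per token under add-k smoothing (assumes tokens is non-empty)."""
    n, k, vocab = model["n"], model["k"], model["vocab"]
    seq = _prepare(tokens, vocab, n)
    count = len(seq) - n + 1
    total = 0.0
    for i in range(count):
        gram = tuple(seq[i:i + n])
        numerator = model["ngrams"][gram] + k
        denominator = model["contexts"][gram[:-1]] + k * len(vocab)
        total -= math.log2(numerator / denominator)
    return total / count


# Toy data, assumed for illustration only.
train_code = "int a = 0 ; int b = a + 1 ; int c = b + 1 ;"
test_code = "int d = c + 1 ;"

for tokenizer in (text_tokenizer, lexer_like_tokenizer):
    for n in (1, 2, 3):
        model = train(tokenizer(train_code), n, unk_threshold=1)
        entropy = cross_entropy(model, tokenizer(test_code))
        print(tokenizer.__name__, n, round(entropy, 3))
```

On real corpora the two tokenizers produce different token sequences, so their entropies are not directly comparable per token. This is one reason the choice of tokenizer can change the conclusions of an experiment.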
