
An Improved Method for Cross-Project Defect Prediction by Simplifying Training Data



Abstract

Cross-project defect prediction (CPDP) on projects with limited historical data has attracted much attention. To the best of our knowledge, however, existing approaches usually perform poorly because the cross-project training data are of low quality. This study proposes an improved CPDP method that simplifies the training data, called TDSelector, which considers both the similarity of each training instance and the number of defects it has (denoted by defects), and demonstrates the effectiveness of the proposed method. Our work consists of three main steps. First, we constructed TDSelector as a linear weighted function of instances' similarity and defects. Second, we built the basic defect predictor used in our experiments with the Logistic Regression classification algorithm. Third, we analyzed how different combinations of similarity measures and defect normalization methods affect prediction performance, and then compared TDSelector with two existing methods. We evaluated our method on 14 projects collected from two public repositories. The results suggest that TDSelector performs, on average, better than both baseline methods, with AUC values increased by up to 10.6% and 4.3%, respectively. That is, including defects does help select high-quality training instances for CPDP. In addition, the combination of Euclidean distance and linear normalization is the preferred configuration for TDSelector. An additional experiment also shows that directly selecting the instances with more bugs as training data can further improve the performance of the bug predictor trained by our method.
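
The abstract does not give the exact scoring formula, so the following is only a minimal sketch of what a linear weighted selection over similarity and defects might look like. The weight `alpha`, the `1 / (1 + d)` mapping from Euclidean distance to similarity, the use of the closest target instance, and the top-`k` cutoff are all assumptions made for illustration, not details taken from the paper.

```python
import math

def euclidean_similarity(a, b):
    """Map Euclidean distance into (0, 1]; the 1/(1+d) mapping is an assumption."""
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + d)

def linear_normalize(values):
    """Min-max scale values to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def select_training_data(candidates, defects, target, alpha=0.5, k=2):
    """Score each candidate as alpha*similarity + (1-alpha)*normalized defects.

    Returns the indices of the k highest-scoring candidates and all scores.
    """
    # Similarity to the target project: closest target instance (assumption).
    sims = [max(euclidean_similarity(c, t) for t in target) for c in candidates]
    norm_defects = linear_normalize(defects)
    scores = [alpha * s + (1 - alpha) * n for s, n in zip(sims, norm_defects)]
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores

# Hypothetical toy data: metric vectors and defect counts of candidate instances.
candidates = [[1.0, 2.0], [1.0, 2.0], [10.0, 10.0]]
defects = [0, 3, 5]
target = [[1.0, 2.0]]

selected, scores = select_training_data(candidates, defects, target, alpha=0.5, k=2)
print(selected)
```

With `alpha=1.0` the ranking depends on similarity alone, and with `alpha=0.0` on the normalized defect count alone. The selected instances would then be used to train a classifier such as Logistic Regression, which the abstract names as the basic predictor.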

Bibliographic details

  • Source
    Mathematical Problems in Engineering | 2018, Issue 6 | 2650415.1-2650415.18 | 18 pages
  • Authors

    He Peng; He Yao; Yu Lvjun; Li Bing;

  • Author affiliations

    Hubei Univ, Sch Comp Sci & Informat Engn, Wuhan 430062, Hubei, Peoples R China;

    Hubei Univ, Sch Comp Sci & Informat Engn, Wuhan 430062, Hubei, Peoples R China;

    Hubei Univ, Sch Comp Sci & Informat Engn, Wuhan 430062, Hubei, Peoples R China;

    Wuhan Univ, Sch Comp, Wuhan 430072, Hubei, Peoples R China;

  • Original format: PDF
  • Language: English
