A Systematic Data Collection Procedure for Software Defect Prediction

Researchers: Mauša, G.; Galinac Grbac, T. and Dalbelo Bašić, B.

About the study:

  Empirical research depends heavily on the quality of the data it uses. Software Defect Prediction (SDP) data comes from development repositories that have no formal link between them, and the choice of linking technique may introduce significant bias. The SDP field lacks a set of guidelines for collecting and validating data and instead relies on a handful of commonly used practices. This study presents an exhaustive survey of the techniques and approaches used in the data collection process. It identifies the most damaging issues that induce bias, provides a number of measures for comparing different collection techniques, and presents a data collection procedure that uses a bug-code linking technique based on regular expressions.
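  To illustrate the general idea of regex-based bug-code linking, here is a minimal sketch in Python. It is not the BuCo Regex technique from the study: the pattern, the function name link_commits_to_bugs and the input shapes are all assumptions made for illustration. The sketch scans commit messages for bug identifiers and keeps only those that exist in the bug tracker.

```python
import re

# Hypothetical pattern (an assumption, not the pattern used in the study):
# a keyword such as "bug", "issue" or "fix" followed by a number, or a bare "#123".
BUG_ID_PATTERN = re.compile(
    r"\b(?:bug|issue|fix(?:es|ed)?)\s*[:#]?\s*(\d+)|#(\d+)",
    re.IGNORECASE,
)


def link_commits_to_bugs(commits, known_bug_ids):
    """Map each known bug ID to the commits whose message mentions it.

    commits: iterable of (commit_hash, message) pairs.
    known_bug_ids: set of integer IDs that exist in the bug tracker.
    """
    links = {}
    for commit_hash, message in commits:
        for match in BUG_ID_PATTERN.finditer(message):
            # Exactly one of the two capture groups is set for each match.
            bug_id = int(match.group(1) or match.group(2))
            # Discard numbers that do not correspond to a tracked bug.
            if bug_id not in known_bug_ids:
                continue
            hashes = links.setdefault(bug_id, [])
            if commit_hash not in hashes:
                hashes.append(commit_hash)
    return links
```

  Checking each candidate number against the set of known bug IDs is one simple way to reduce false links; a stricter or looser pattern shifts the balance between missed links and incorrect ones.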


  The quality of linked data has been addressed in other research fields as well, and it depends on the data linking approach. Without a standard for data linking in SDP studies, datasets may suffer from bias and results cannot be generalised. The contributions of this study are: the identification of concrete issues that must be discussed when applying a particular linking technique; a treatment of the threats to its validity and to the generalisation of its results; a set of guidelines for systematically evaluating its performance; and a comparison with existing techniques. The complete set of linking techniques, data collection approaches, existing datasets (in the upper part) and the projects from which they originate (in the lower part) is presented in the following figure:


  This study confirmed that our BuCo Regex linking technique demonstrates the best performance. It outperformed the other linking techniques and achieved results similar to those of the ReLink tool, as shown in the following figure for the Eclipse JDT dataset:

  The popular SZZ approach used an overly strict linking technique, which led to an increasing discrepancy in the number of linked bugs. The ReLink tool was examined more thoroughly in a manual comparison, where it was outperformed by the simpler BuCo Regex. In the Apache OpenNLP benchmark dataset it missed certain links, and in Eclipse JDT 2.0 it made incorrect predictions. The following figure shows the results for the Eclipse JDT 2.0 dataset:

  Furthermore, our results indicate that context influences the effectiveness of bug linking. We found that data quality may vary between different research communities, within the same research community, among projects that share a common development process, and even throughout the evolution of a single project. Hence, it is important to continue building a common evaluation network, as proposed in this study.
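  The two failure modes noted above for linking tools, missed links and incorrect predictions, can be quantified when a manually verified set of links is available. The sketch below uses precision, recall and F1 for this purpose. These are an assumption chosen for illustration; this summary does not name the measures the study actually uses, and the function name evaluate_links is hypothetical.

```python
def evaluate_links(predicted, verified):
    """Compare predicted bug-commit links with a manually verified set.

    Both arguments are sets of (bug_id, commit_hash) pairs.
    Precision drops when the technique predicts incorrect links;
    recall drops when it misses links that really exist.
    """
    true_positives = len(predicted & verified)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(verified) if verified else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

  Running the same evaluation per project, or per release of a single project, is one way to observe how linking quality varies with context.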

Supported by: project UIP-2014-09-7945 and research grant

Published in: ComSIS journal
