Software Defect Prediction (SDP) datasets are difficult to find and hard to collect. This page gives fellow researchers access to the datasets that we collected during our research. The collection procedure is described in our paper “A Systematic Data Collection Procedure for Software Defect Prediction”. If you use these datasets in your research, please cite this paper.
Here are the datasets of five consecutive releases of three major projects from the Eclipse open-source community:
To request the password for the datasets, please send an e-mail to goran.mausa[at]riteh.hr. If you use these datasets in your research, please reference the following paper: Mauša, G.; Galinac Grbac, T.; Dalbelo Bašić, B.: A Systematic Data Collection Procedure for Software Defect Prediction. Computer Science and Information Systems Journal, Vol. 13 (1), 2016, pp. 173–197.
The datasets are provided in .csv format, with a comma as the delimiter and a point as the decimal mark. The first row contains the column descriptions: column 1 is the file path, columns 2–49 are the independent variables (software metrics), identified by their abbreviations, and column 50 is the dependent variable (number of defects). Files whose path contains “.example” or “.tests” were removed from the datasets. A complete description of the features in the datasets is given in the Table of Metrics. Metrics #2, #3 and #38 can be deduced from the csv file, and metrics #1 and #51 are optional string parameters, so they are omitted from the datasets.
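As a rough illustration of the layout described above, the following Python sketch reads a file in this format and separates the file path, the metric columns, and the defect count. It is only a sketch under stated assumptions: the function name `read_sdp_dataset`, the sample rows, and the column abbreviations `LOC`, `CC` and `bug_cnt` are hypothetical, and the sample has two metric columns instead of the 48 found in the real files.

```python
import csv
import io


def read_sdp_dataset(handle):
    """Split an SDP dataset into metric names, file paths, metric rows, defect counts.

    Assumes the layout described above: first column is the file path,
    last column is the number of defects, everything in between is a metric.
    """
    reader = csv.reader(handle, delimiter=",")
    header = next(reader)            # first row holds the column descriptions
    metric_names = header[1:-1]      # columns 2-49 in the real datasets
    paths, metrics, defects = [], [], []
    for row in reader:
        if not row:                  # skip blank lines, if any
            continue
        paths.append(row[0])
        metrics.append([float(value) for value in row[1:-1]])
        defects.append(int(float(row[-1])))
    return metric_names, paths, metrics, defects


# Made-up sample data, for illustration only (hypothetical names and values).
sample = io.StringIO(
    "File,LOC,CC,bug_cnt\n"
    "/org/demo/Foo.java,120,3.5,2\n"
    "/org/demo/Bar.java,40,1.0,0\n"
)
names, paths, metrics, defects = read_sdp_dataset(sample)

# A common SDP preprocessing step: mark a file as defect-prone if it has any defect.
defective = [path for path, count in zip(paths, defects) if count > 0]
print(names, defective)
```

To read a real dataset, the same function could be called on a file opened with `open(path, newline="", encoding="utf-8")`; the encoding is an assumption and may need adjusting.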
The BuCo tool used to collect these datasets was first presented in the paper “Software defect prediction with bug-code analyzer – a data collection tool demo”. The technique we developed for linking the source code and issue tracking repositories is explained in the paper “Techniques for Bug–Code Linking”. Issues that arise in the data collection process for SDP datasets are discussed further in the paper “Data Collection for Software Defect Prediction – an Exploratory Case Study of Open Source Software Projects”.
These datasets have been used in the following papers:
- Mauša, G.; Galinac Grbac, T.: Assessing the Impact of Untraceable Bugs on the Quality of Software Defect Prediction Datasets, Proceedings of SQAMIA 2016, pp. 47–56, Budapest, Hungary.
- Mauša, G.; Galinac Grbac, T.; Dalbelo Bašić, B.: A Systematic Data Collection Procedure for Software Defect Prediction, Computer Science and Information Systems Journal, Vol. 13 (1), 2016, pp. 173–197.
- Vranković, A.; Galinac Grbac, T.: Structural dependencies between system fault distribution principles, Proceedings of SoftCOM PhD Forum 2016, Split, Croatia.
- Rubinic, E.; Mauša, G.; Galinac Grbac, T.: Software Defect Classification with a Variant of NSGA-II and Simple Voting Strategies, SSBSE 2015 Graduate Student Track, Bergamo, Italy.
- Mauša, G.; Bogunović, N.; Galinac Grbac, T.; Dalbelo Bašić, B.: Rotation Forest in Software Defect Prediction, Proceedings of SQAMIA 2015, pp. 35–43, Maribor, Slovenia.
- Mauša, G.; Galinac Grbac, T.; Dalbelo Bašić, B.: Data Collection for Software Defect Prediction – an Exploratory Case Study of Open Source Software Projects, Proceedings of MIPRO CTI 2015, pp. 513–519, Opatija, Croatia.
- Petric, J.; Galinac Grbac, T.: Software structure evolution and relation to system defectiveness, Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering (EASE 2014), London, UK.
- Mauša, G.; Perković, P.; Galinac Grbac, T.; Štajduhar, I.: Techniques for Bug–Code Linking, Proceedings of SQAMIA 2014, pp. 47–55, Lovran, Croatia.
- Mauša, G.; Galinac Grbac, T.; Dalbelo Bašić, B.: Software defect prediction with bug-code analyzer – a data collection tool demo, Proceedings of SoftCOM 2014, Split, Croatia.