Software Defect Prediction Data

Software Defect Prediction (SDP) datasets are difficult to find and hard to collect. This page makes the datasets we collected during our research available to fellow researchers. The collection procedure is described in our paper “A Systematic Data Collection Procedure for Software Defect Prediction”. If you use these datasets in your research, please cite this paper.


Here are the datasets of five subsequent releases of three major projects from the Eclipse open source community:

To request the password for these datasets, please send an e-mail to goran.mausa[at] . If you use these datasets in your research, please reference the following paper: Mauša, G.; Galinac Grbac, T.; Dalbelo Bašić, B.: A Systematic Data Collection Procedure for Software Defect Prediction. Computer Science and Information Systems Journal, Vol. 13 (1), 2016, pp. 173–197.

Datasets Description

The datasets are provided in .csv format, with a comma as the delimiter and a point as the decimal mark. The first row contains the column names: column 1 is the file path, columns 2–49 are the independent variables (software metrics), identified by their abbreviations, and column 50 is the dependent variable (number of defects). Files whose path contained “.example” or “.tests” were removed from the datasets. A complete description of the features in the datasets is given in the Table of Metrics. Metrics #2, #3 and #38 can be deduced from the csv file, and metrics #1 and #51 are optional string parameters, so they are omitted from the datasets.
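The column layout described above can be read with a few lines of Python. This is only a minimal sketch that uses the standard csv module: the function name load_sdp_dataset, the metric names and the file paths in the sample are hypothetical, and the sketch assumes that every metric value parses as a number and that the defect count is a whole number.

```python
import csv
import io


def load_sdp_dataset(handle):
    """Split one dataset into header, file paths, metric rows and defect counts."""
    reader = csv.reader(handle, delimiter=",")
    header = next(reader)  # first row: column names (metric abbreviations)
    if len(header) != 50:
        raise ValueError(f"expected 50 columns, found {len(header)}")
    paths, metrics, defects = [], [], []
    for row in reader:
        if not row:
            continue  # skip blank lines
        paths.append(row[0])                           # column 1: file path
        metrics.append([float(v) for v in row[1:49]])  # columns 2-49: metrics
        defects.append(int(float(row[49])))            # column 50: defect count
    return header, paths, metrics, defects


# Synthetic two-file sample with made-up names and values, for illustration only.
names = ["path"] + [f"m{i}" for i in range(2, 50)] + ["defects"]
rows = [
    ["org/eclipse/demo/A.java"] + ["1.5"] * 48 + ["2"],
    ["org/eclipse/demo/B.java"] + ["0"] * 48 + ["0"],
]
sample = "\n".join(",".join(r) for r in [names] + rows)

header, paths, metrics, defects = load_sdp_dataset(io.StringIO(sample))
print(len(header), len(metrics[0]), defects)
```

For a real dataset, the same function can be given an open file instead of the in-memory sample, for example open(filename, newline="", encoding="utf-8"), where the file name and the encoding are again assumptions.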

Data Collection

The BuCo tool used to collect these datasets was first presented in the paper “Software defect prediction with bug-code analyzer – a data collection tool demo”. The technique we developed for linking the source code and the issue tracking repositories is explained in the paper “Techniques for Bug–Code Linking”. Issues that arise in the data collection process for SDP datasets are discussed further in the paper “Data Collection for Software Defect Prediction – an Exploratory Case Study of Open Source Software Projects”.


These datasets have been used in the following papers:
