Researchers: Mauša, G.; Sarro, F. and Galinac Grbac, T.
About the study:
Genetic Programming (GP) has recently been proposed for classification and was found capable of dealing with unbalanced data, a common problem in Software Defect Prediction (SDP). This study investigates different learning techniques for GP: 5 fitness functions (accuracy, F-measure, G-mean and two novel ones) and 2 training scenarios (learning from the most recent release or from all previous releases). It also compares the GP configurations with two widely used machine learning techniques, namely Naïve Bayes (NB) and Support Vector Machine (SVM). The study is based on carefully collected SDP datasets from 7 subsequent releases of Eclipse PDE and 8 subsequent releases of the Apache Hadoop project.
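The two training scenarios can be illustrated with a minimal sketch. It assumes that each release is a list of labelled module records and that a model is evaluated on the release that follows its training data; the evaluation target, the function name build_training_pairs and the scenario labels "recent" and "all" are illustrative assumptions, not taken from the study.

```python
def build_training_pairs(releases, scenario):
    """Return (train, test) pairs for consecutive releases.

    releases: list of releases in chronological order, each a list of records.
    scenario: "recent" trains on the single preceding release,
              "all" trains on every preceding release combined.
    Assumption: the test set is always the next release in the sequence.
    """
    pairs = []
    for i in range(1, len(releases)):
        if scenario == "recent":
            train = list(releases[i - 1])
        elif scenario == "all":
            train = [record for release in releases[:i] for record in release]
        else:
            raise ValueError(f"unknown scenario: {scenario!r}")
        pairs.append((train, list(releases[i])))
    return pairs


# Toy data: three releases with placeholder module identifiers.
toy_releases = [["a1", "a2"], ["b1"], ["c1", "c2"]]
recent_pairs = build_training_pairs(toy_releases, "recent")
all_pairs = build_training_pairs(toy_releases, "all")
```

With n releases, both scenarios produce n - 1 train/test pairs; they differ only in how much history each training set contains.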
There may be significant value in combining Search-Based Software Engineering (SBSE) with predictive modeling. The use of SBSE in SDP is still largely limited to data preprocessing, feature selection and optimization of existing classification models. However, GP yields promising results when used for general-purpose classification, especially in the presence of unbalanced data.
The best overall performance and the highest minority-class accuracy are achieved by GP configurations guided by fitness functions based on Average weighted accuracy (Ave), F-measure (FM) and Geometric Mean (GM). The appropriate choice of learning technique is influenced by the development context of the project under study and the level of its data imbalance.
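To show why the choice of fitness function matters on unbalanced data, the following sketch computes accuracy, F-measure and G-mean from a confusion matrix using their standard textbook definitions. It is not the study's implementation: the function names are hypothetical, the defective class is assumed to be labelled 1, and Ave and the two novel functions are left out because their definitions are not given here.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    """Division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def accuracy(tp, fp, tn, fn):
    return _ratio(tp + tn, tp + fp + tn + fn)


def f_measure(tp, fp, tn, fn):
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return _ratio(2 * precision * recall, precision + recall)


def g_mean(tp, fp, tn, fn):
    tpr = _ratio(tp, tp + fn)  # accuracy on the defective (minority) class
    tnr = _ratio(tn, tn + fp)  # accuracy on the non-defective class
    return math.sqrt(tpr * tnr)


# Toy data: 2 defective modules out of 10, and a classifier that
# predicts "non-defective" for every module.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_trivial = [0] * 10
trivial_scores = {
    "accuracy": accuracy(*confusion_counts(y_true, y_trivial)),
    "f_measure": f_measure(*confusion_counts(y_true, y_trivial)),
    "g_mean": g_mean(*confusion_counts(y_true, y_trivial)),
}
```

On this toy data the trivial classifier scores 0.8 on accuracy but 0.0 on both F-measure and G-mean, which illustrates how plain accuracy can reward a model that never detects the minority class.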