Reto Zenger just finished his master’s thesis. Its goal was to investigate whether the concept of collaborative filtering can be applied to cross-project defect prediction. But first, you’ll need a little background.
One research field in computer science is concerned with discovering and predicting defects in software, also known as software bugs. A lot of effort is invested in developing and improving predictive models for this purpose. The idea behind it is to help developers avoid creating bugs and find bugs more quickly, and to help project managers plan the necessary personnel resources.
To predict bugs, machine learning algorithms are used to learn predictive models from source code repositories and bug tracking systems. These models are highly optimized for one single software project. No, they are not overfitted models! It has been shown that bug prediction models learned on one software project perform badly when predicting bugs in other software projects. The reason is that software projects differ a lot in size, people, organization, etc. That’s why not much effort has been invested in cross-project defect prediction.
Collaborative filtering is an approach to creating recommendations that exploits the similar preferences of people. In my current research on recommender systems, I’m using machine learning to build user preference models that predict the value of certain items. Further, I formulated techniques to compare machine learning models and compute their similarity. I use this approach to compute the preference similarity of two people and then apply traditional collaborative filtering.
Now, what if we learn a bug prediction model for each software project, compute the similarity of these models based on my research, and use a collaborative filtering approach to combine them to predict and discover bugs in other software projects that are similar in terms of their bug prediction models? Well, that was a long question, wasn’t it? But the hypothesis is set.
Reto Zenger used 19 different Eclipse projects, prepared tons of data for machine learning, built bug prediction models for all projects, and used my algorithms to determine the similarities among these bug prediction models. To predict bugs in a software project, he applied collaborative filtering. More specifically, he created a bug prediction by combining all bug prediction models and weighting their predictions by their similarity to the software project’s own bug prediction model. Some results are presented in the following.
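To make the weighting idea concrete, here is a minimal sketch in Python. It is not the code from the thesis. The function name, the use of a plain weighted average, and the fallback for all-zero similarities are assumptions made only for illustration. Each model contributes its predicted defect probability for a file, and the contributions are weighted by how similar that model is to the model of the project under investigation.

```python
def combine_predictions(probabilities, similarities):
    """Similarity-weighted average of per-model defect probabilities.

    probabilities: predicted defect probability from each project's model
    similarities:  similarity of each model to the target project's model
    Both names are hypothetical; the thesis may combine models differently.
    """
    if len(probabilities) != len(similarities):
        raise ValueError("need exactly one similarity per model prediction")
    if not probabilities:
        raise ValueError("need at least one model prediction")

    total_weight = sum(similarities)
    if total_weight == 0:
        # Assumption: with no similarity information, fall back to a plain mean.
        return sum(probabilities) / len(probabilities)

    weighted_sum = sum(p * s for p, s in zip(probabilities, similarities))
    return weighted_sum / total_weight


# Three models vote on one file; the most similar model dominates the result.
score = combine_predictions([0.9, 0.2, 0.4], [0.8, 0.1, 0.1])
print(round(score, 2))
```

A model with similarity 0 has no influence on the combined prediction, which is the intuition behind borrowing only from projects whose models resemble the target project’s model.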
In the following two figures, the Java implementation (J48) of the decision tree learner C4.5 is used to build bug prediction models for all projects. The green curve is the ROC curve of the collaborative filtering approach using all bug prediction models together with the model similarities. The blue curve is the ROC curve of the bug prediction model of the corresponding project. The bigger the area under the curve (a.k.a. AUC), the better. If these terms are new to you, please look up the meaning of AUC and the ROC curve.
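For readers who want to see what the AUC number means, here is a small Python sketch. It is not taken from the thesis, and the function and variable names are hypothetical. It uses the standard rank interpretation of AUC: the probability that a randomly chosen buggy file gets a higher score than a randomly chosen clean file, with ties counted as half.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for a buggy file, 0 for a clean file
    scores: predicted defect score for each file (higher = more bug-prone)
    """
    buggy = [s for label, s in zip(labels, scores) if label == 1]
    clean = [s for label, s in zip(labels, scores) if label == 0]
    if not buggy or not clean:
        raise ValueError("AUC needs at least one buggy and one clean file")

    wins = 0.0
    for b in buggy:
        for c in clean:
            if b > c:
                wins += 1.0   # buggy file ranked above clean file
            elif b == c:
                wins += 0.5   # tie counts as half a win
    return wins / (len(buggy) * len(clean))


# Two buggy and two clean files; one buggy file is ranked below a clean one.
print(auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))  # prints 0.75
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, which is why a larger area is better.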
As shown in both figures, the collaborative filtering approach for cross-project defect prediction performs better. In the following, the evaluation results for all evaluation metrics and all projects are consolidated. The results show clear evidence that collaborative filtering for cross-project defect prediction works and provides better bug predictions.
To conclude, applying collaborative filtering to cross-project defect prediction works and even outperforms the bug prediction models of the corresponding software projects.
Abstract from the master’s thesis
Reliable defect predictions enable a better management of the software developer’s effort during the process of software engineering. The identified bug-prone parts can be reengineered or tested with special care. However, defect prediction works only if enough data is available to learn the prediction models. If the data is not sufficient, prediction models of other projects can be applied. Traditional cross-project defect predictions achieve superficial results. That is why we propose a completely new approach. Based on the collaborative filtering framework RECOMIZER, we predict post-release defects of 19 Eclipse plug-ins. Therefore we measure the similarities between the prediction models derived from the different projects. Combining the defect models with the highest similarity to the model of the project under investigation, we perform cross-project defect prediction based on collaborative filtering. We are able to confirm our main hypothesis, that the performance of the defect predictions based on collaborative filtering outperforms the predictions we did while considering the model of the project under investigation only. We achieve a promising mean AUC of 0.745 using a Naive Bayes classifier. In the case of a J48 decision tree, we achieve a mean AUC of 0.734. We also analyze the similarities of the different defect models. The projects organized after their model similarity, rather build a clew than the expected clusters.
Reto Zenger: Collaborative defect prediction: applying collaborative filtering to cross-project defect prediction, University of Zurich, Faculty of Economics, 2011. (Master Thesis)