How to Evaluate Statistical Fault Localization

by  Tim Henderson

Cite as:

Tim A. D. Henderson. How To Evaluate Statistical Fault Localization. Blog. 2018. https://hackthology.com/how-to-evaluate-statistical-fault-localization.html
PDF. WEB.

Note

This is a conversion from a latex paper I wrote. If you want all formatting correct you should read the pdf version.

Introduction

Automatic fault localization is a software engineering technique to assist a programmer during the debugging process by suggesting suspicious locations that may be related to the root cause of the bug. The big idea behind behind automatic fault localization (or just fault localization) is by pointing the programmer towards the right area of the program the programmer will find the cause of the bug more quickly.

One approach to fault localization is Spectrum Based Fault Localization which is also known as Coverage Based Statistical Fault Localization (CBSFL) [1]-[3]. This approach uses test coverage information to rank the statements from most "suspicious" to least suspicious. To perform CBSFL, test cases are run through an instrumented program. The instrumentation collects coverage profiles which report each statement1 executed during the test run. A test oracle is used to label each execution profile with whether or not the test passed or failed. Such oracles can be either automatic or manual (i.e. a human). The labeled execution profiles are referred to as the coverage spectra.

CBSFL techniques score each program element (location in the program) by its "statistical suspiciousness" such that the most suspicious element has the highest score. The scores are computed by suspiciousness metrics which attempt to quantify the relationship between the execution each program element and the occurrence of program failure.

There have been a great many statistical fault localization suspiciousness metrics proposed [2] since the idea was first proposed by Jones, Harrold and Stasko in 2002 [1]. The majority of the metrics are computed from just a few values: the number of tests \(n\), the number of passing tests \(p\), the number failing tests \(f\), the number of test runs an element \(e\) was executed in \(n_e\), the number of passing test runs an element was executed in \(p_e\), and the number of failing test runs an element was executed in \(f_e\). For instance, using these simple statistics one can estimate the conditional probability of program failure \(F\) given that a particular element \(e\) was executed: \[\begin{aligned} {\textrm{Pr}\left[{F|e}\right]} &= \frac{{\textrm{Pr}\left[{F \cap e}\right]}}{{\textrm{Pr}\left[{e}\right]}} \approx \frac{\frac{f_{e}}{n}}{\frac{n_{e}}{n}} = \frac{f_e}{n_e}\end{aligned}\] While many of the studies in statistical fault localization use more complex formulas [2], [3] (with various technical motivations) most of the metrics are measures statistical association as in the equation above.

The runtime information used to compute the above statistics is commonly referred to as the coverage spectra of the program. Coverage spectra is a matrix \({\bf C}\) where \({\bf C}_{i,j} > 0\) indicates that the program element \(\mathcal{E}_i\) was executed at least once during test \(\mathcal{T}_j\). Additionally, there is an additional "test status" vector \({\bf S}\) where \({\bf S}_{j} = \text{"F''}\) if test \(\mathcal{T}_j\) failed and \({\bf S}_{j} = \text{"P''}\) if test \(j\) passed.

To collect the coverage spectra the program is instrumented to collect a runtime profile of the executed elements. This instrumentation can be done (in principle) at any granularity including: expression, statement, basic block2, function, class, file, or package. The statistical methods which make use of spectra are agnostic to their granularity. While the granularity does not effect the statistical computation it changes how the localization results are perceived by the programmer. In the past, programmers have indicated a desire for finer grained results: at statement, basic block, or function level [5] over very coarse grained results at the class, file, or package level.

Some statistical fault localization techniques use additional information to either improve accuracy or provide more explainable results. For instance, work on Causal Fault Localization uses additional static and dynamic information to control for statistical confounding [6]. In contrast Suspicious Behavior Based Fault Localization (SBBFL) uses runtime control flow information (the behavior) to identify groups of collaborating suspicious elements [7]. These techniques leverage data mining techniques [8] such as frequent [9]-[11] or significant pattern mining [7], [12]. When significant patterns are mined metrics (such as statistical fault localization suspiciousness metrics) are used to identify the most significant patterns [7], [13].

Finally, a variety of non-statistical (or mixed methods) techniques for fault localization have been explored [14]-[17]. These range from delta debugging [18] to nearest neighbor queries [19] to program slicing [20], [21] to information retrieval [22]-[24] to test case generation [25]-[27]. Despite differences in the technical and theoretical approach of these alternate methods they also suggest locations (or groups of locations) for the programmer to consider when debugging.

Evaluation Methods

Some of the earliest papers in fault localization do not provide a quantitative method for evaluating performance (as is seen in later papers [28]). For instance, in the earliest CBSFL paper [1] (by Jones et al.'s) the technique is evaluated using a qualitative visualization. At the time, this was entirely appropriate as Jones was proposing a technique for visualizing test coverage for assisting the debugging process. The test coverage visualization was driven by what is now called a statistical fault localization metric (Tarantula). The evaluation visualization aggregated the visualizations all of the programs included in the study.

While, the evaluation method used in the Jones paper effectively communicated the potential of CBSFL (and got many researchers excited about the idea) it was not good way to compare multiple fault localization techniques. In 2005 Jones and Harrold [29] conducted a study which compared their Tarantula technique to 3 other techniques: Set Union and Intersection [30], Nearest Neighbor [19], and Cause-Transitions [31]. These techniques all took unique approaches toward the fault localization problem and were originally evaluated in different ways. Jones and Harrold re-evaluated all 5 methods under a new common evaluation framework.

In the 2005 paper, Jones and Harrold evaluate the effectiveness of each technique by using the technique to rank the statements in the subject programs. Each technique ranked the statements from most likely to be the cause of the fault to least likely. For Tarantula, the statements are ranked using the Tarantula suspiciousness score:3

Definition. Tarantula Rank Score [29]

Given a set of locations \(L\) with their suspiciousness scores \(s(l)\) for \(l \in L\) the Rank Score for a location \(l \in L\) is: \[\begin{aligned} {{\left|{ \left\{ x ~:~ x \in L \wedge s(x) > s(l) \right\} }\right|}} + {{\left|{ \left\{ x ~:~ x \in L \wedge s(x) = s(l) \right\} }\right|}} \end{aligned}\]

For Set Union and Intersection, Nearest Neighbor and Cause-Transitions the statements are ranked using a System Dependence Graph (SDG) [32] technique from Renieres and Reiss [19] who first suggested the ranking idea. The ranks are then used to calculate the Tarantula Rank Score.

In the Jones and Harrold evaluation the authors do not use the Tarantula Rank Score directly but instead use a version normalized by program size:

Definition. Tarantula Effectiveness Score (Expense) [29]

The percentage of program elements that do not need to be examined to find the fault when the elements are arranged according to their rank. Formally: let \(n\) be the total number of program elements, and let \(r(f)\) be the Tarantula Rank Score of the faulty element \(f\) then the score is: \[\begin{aligned} \frac{n-r(f)}{n} \end{aligned}\]

Using the normalized effectiveness score Jones and Harrold directly compare the fault localization effectiveness of each of the considered methods. They did this in two ways. First, they presented a table (Table 2) which bucketed all the buggy versions form all the programs by the percentage given by the Tarantula Effectiveness Score. Second, they presented a figure (Figure 2) which showed the data in Table 2 as a cumulative curve.

The basic evaluation method presented by Jones and Harrold has become the standard evaluation method. Faulty statements are scored, ranked, rank-scored, normalized, and then aggregated over all versions and programs to provide an overall representation of the fault localization method's performance (a few examples: [2], [3], [33]-[36]). While the basic method has stayed fairly consistent, there has been some innovation in the scoring (both the Rank Score and the Effectiveness Scores).

For instance, Wong et al. [34] introduced the most commonly used Effectiveness Score the \(\mathcal{EXAM}\) score. This score is essentially the same as the Expense score except it gives the percentage of elements which need to be examined rather than those avoided.

Definition. \(\mathcal{EXAM}\) Score [34]

The percentage of program elements that need to be examined to find the fault when the elements are arranged according to their rank. Formally: let \(n\) be the total number of program elements, and let \(r(f)\) be the Tarantula Rank Score of the faulty element \(f\) then the score is: \[\begin{aligned} \frac{r(f)}{n} \end{aligned}\]

Ali et al. [37] identified an important problem with the Jones and Harrold evaluation: some fault localization metrics and algorithms rank statements equally. This is captured in the second term in the definition for the Tarantula Rank Score. However, Ali points out that this introduces bias towards algorithms that always assign unique scores (that are close together) rather than those that would score the same group of statement equally. The fix is to instead compute the expected number of statements the programmer would examine if they chose the next equally scored element at random.

Definition. Rank Score

Gives the expected number of locations a programmer would inspect before finding the bug. Formally, given a set of locations \(L\) with their suspiciousness scores \(s(l)\) for \(l \in L\) the Rank Score for a location \(l \in L\) is [37]: \[\begin{aligned} {{\left|{ \left\{ x ~:~ x \in L \wedge s(x) > s(l) \right\} }\right|}} + \frac{ {{\left|{ \left\{ x ~:~ x \in L \wedge s(x) = s(l) \right\} }\right|}} }{ 2 } \end{aligned}\]

Following Ali, we recommend utilizing the above definition for Rank Score over the Tarantula definition.

Parin and Orso [38] conducted a user study which looked at the programmer experience of using a statistical fault localization tool (Tarantula [1]). Among their findings they found that programmers would not look deeply through the list of locations and would instead only consider the first few items. As a result they encouraged studies to no longer report scores as percentages. While some studies still report the percentages most studies are now reporting the absolute (non-percentage) rank scores. Reporting as absolute scores is important for another reason, if percentage ranks are reported larger programs can have much larger absolute ranks for the same percentage rank. This biases the evaluation toward large programs even when the actual localization result is poor.

Steimann et al. [33] identified a number of threats to validity in CBSFL studies including: heterogeneous subject programs, poor test suites, small sample sizes, unclear sample spaces, flaky tests, total number of faults, and masked faults. For evaluation they used the Rank Score modified to deal with \(k\) faults tied at the same rank.

Definition. Steimann Rank Score

Gives the expected number of locations a programmer would inspect before finding the bug when multiple faulty statements have the same rank. Formally, given a set of locations \(L\) with their suspiciousness scores \(s(l)\) for \(l \in L\) the Rank Score for a location \(l \in L\) is [33]: \[\begin{aligned} & {{\left|{ \left\{ x ~:~ x \in L \wedge s(x) > s(l) \right\} }\right|}}\\ & + \frac{ {{\left|{ \left\{ x ~:~ x \in L \wedge s(x) = s(l) \right\} }\right|}} + 1 }{ {{\left|{ \left\{ x ~:~ x \in L \wedge s(x) = s(l) \wedge x \text{ is a faulty location} \right\}}\right|}} + 1 } \end{aligned}\]

Moon et al. [39] proposed Locality Information Loss (LIL) as an alternative evaluation framework. LIL models the localization result as a probability distribution constructed from the suspiciousness scores:

Definition. LIL Probability Distribution

Let \(\tau\) be a suspicious metric normalized to the \([0,1]\) range of reals. Let \(n\) be the number statements in the program. Let \(S\) be the set of statements. For all \(1 \le i \le n\) let \(s_i \in S\). The constructed probability distribution is: \[\begin{aligned} P_{\tau}(s_i) = \frac{\tau(s_i)}{\sum^{n}_{j=1} \tau(s_j)} \end{aligned}\]

LIL uses a measure of distribution divergence (Kullback-Leibler) to compute a score of how different the constructed distribution is from the "perfect" expected distribution. The advantage of the LIL framework is it does not depend on a list of ranked statements and can be applied to non-statistical methods (using a synthetic \(\tau\)). The disadvantage of LIL is it does not indicate programmer effort (as indicated by the Rank Score). However, it may be a better metric to use when evaluating fault localization systems as a component for automated bug repair systems.

Pearson et al. [28] re-evaluated a number of previous results using new real world subject programs with real defects and test suites. In contrast to previous work they made use of statistical hypothesis testing and confidence intervals to test the significance of the results. To evaluate the performance of each technique under study they used the \(\mathcal{EXAM}\) score reporting best, average, and worst case results for multi-statement faults.

T-Score [40] is designed for non-statistical fault localization methods which produce a small set of suspicious statements in the program. To evaluate how helpful these reports are T-Score uses the Program Dependence Graph (PDG) [32], [41] to compute a set of vertices in the graph that must be examined in order to reach any faulty vertex. This set is computed via a breadth first search from the set of vertices in the report. Finally, the score is computed as the percentage of examined vertices out of the total number of vertices in the graph.

Multiple Fault Evaluations

With the exception of LIL, the evaluation methods discussed so far are generally defined to operate with a single faulty location. However, there may be multiple faults or multiple locations associated with a single fault or both. Multiple faults can interact [42] and interfere with the performance of the fault localizer. For evaluation purposes one of the most popular methods is to take either best result [43], the average result [14], [44], or the worst result [43].

Evaluating Other Techniques

One of the challenges with the methods presented so far is they may not work well for evaluating alternate fault localization methods. For instance, information retrieval based localization methods do not necessarily score and rank every program location. Instead they produce a report of associated regions. Jones and Harrold [29] used a synthetic ranking system [19] based on the SDG [32] which in principle could be used in such situation. However, like the T-Score it uses an arbitrary method (minimal dependence spheres) to compute the number of SDG nodes which must be examined.

The LIL method could also potentially be used to evaluate alternative methods. It does not rely on ranking but instead on the suspiciousness scores which it converts into a probability distribution. To support evaluating report based localization the reports are converted to a probability distribution with all locations in the report set to a equal high probability and all locations not in the report set to a tiny probability.

Suspicious Behavior Based Fault Localization [7] requires particular care. These methods are produce a ranked set of "behaviors" which are structured groups of interacting program locations. The structure could be a call invocation structure [45]-[49], a general control flow structure [7], [13], [50], or even an information flow structure [51]. The structures are scored and ranked similar to CBSFL. However, unlike in CBSFL all program locations are not necessarily included. In the past, studies have used a variety of techniques to evaluated the effectiveness including precision and recall [13] and scores based off of the \(\mathcal{EXAM}\) score [7].

Another subtle special case involves comparing statistical techniques which operate on different granularity levels. As mentioned previously, coverage can be collected at any granularity level: expression, statement, basic block, method or function, class, file, and even non-structural elements such as paths. Any of the CBSFL metrics can be used with any of these granularities. However, a single study using any of the previous evaluation methods must keep the granularity consistent. This makes it impossible to compare across granularity levels. This is makes it particularly difficult to accurately compare method level behavioral approaches [45] to CBSFL.

Assumptions

The biggest assumption that all evaluation models make is so-called perfect bug understanding which assumes programmers will recognize a bug as soon as they "examine" the faulty location. This assumption is obviously false [38]. However, it continues to be a useful simplifying assumption for evaluation purposes of the localization algorithms. From the standpoint of automated fault localization there are really two tasks: 1) finding the fault and 2) explaining the fault. Assuming perfect bug understanding is reasonable for evaluating a tools performance on task 1. However, there is the important caveat that programmers need more assistance at task 2. As a research community we do not currently have a standard method for evaluating our algorithmic performance on task 2.

The second assumption is that programmers will follow the rank list or suspiciousness scores when debugging a program using a fault localization tool. This assumption is obviously false as well [38]. A programmer may follow the list for the very first item and even the second but where they go from there is likely only partially influenced by the list. The bigger influence will be from the conclusions they are drawing from what they learn upon inspecting each location.

References

[1] J. Jones, M. Harrold, and J. Stasko, "Visualization of test information to assist fault localization," Proceedings of the 24th International Conference on Software Engineering. ICSE 2002, 2002, doi:10.1145/581339.581397.

[2] Lucia, D. Lo, L. Jiang, F. Thung, and A. Budi, "Extended comprehensive study of association measures for fault localization," Journal of Software: Evolution and Process, vol. 26, no. 2, pp. 172-219, Feb. 2014, doi:10.1002/smr.1616.

[3] S.-F. Sun and A. Podgurski, "Properties of Effective Metrics for Coverage-Based Statistical Fault Localization," in 2016 ieee international conference on software testing, verification and validation (icst), 2016, pp. 124-134, doi:10.1109/ICST.2016.31.

[4] A. Aho, R. Sethi, M. S. Lam, and J. D. Ullman, Compilers: principles, techniques, and tools. 2007.

[5] P. S. Kochhar, X. Xia, D. Lo, and S. Li, "Practitioners' expectations on automated fault localization," in Proceedings of the 25th international symposium on software testing and analysis - issta 2016, 2016, pp. 165-176, doi:10.1145/2931037.2931051.

[6] G. G. K. Baah, A. Podgurski, and M. J. M. Harrold, "Causal inference for statistical fault localization," in Proceedings of the 19th international symposium on software testing and analysis, 2010, pp. 73-84, doi:10.1145/1831708.1831717.

[7] T. A. D. Henderson and A. Podgurski, "Behavioral Fault Localization by Sampling Suspicious Dynamic Control Flow Subgraphs," in IEEE conference on software testing, validation and verification, 2018.

[8] C. C. Aggarwal and J. Han, Eds., Frequent Pattern Mining. Cham: Springer International Publishing, 2014.

[9] R. Agrawal, T. Imieliński, and A. Swami, "Mining association rules between sets of items in large databases," ACM SIGMOD Record, vol. 22, no. 2, pp. 207-216, Jun. 1993, doi:10.1145/170036.170072.

[10] X. Yan and J. Han, "gSpan: graph-based substructure pattern mining," in 2002 ieee international conference on data mining, 2002. proceedings., 2002, pp. 721-724, doi:10.1109/ICDM.2002.1184038.

[11] C. C. Aggarwal, M. A. Bhuiyan, and M. A. Hasan, "Frequent Pattern Mining Algorithms: A Survey," in Frequent pattern mining, Cham: Springer International Publishing, 2014, pp. 19-64.

[12] X. Yan, H. Cheng, J. Han, and P. S. Yu, "Mining Significant Graph Patterns by Leap Search," in Proceedings of the 2008 acm sigmod international conference on management of data, 2008, pp. 433-444, doi:10.1145/1376616.1376662.

[13] H. Cheng, D. Lo, Y. Zhou, X. Wang, and X. Yan, "Identifying Bug Signatures Using Discriminative Graph Mining," in Proceedings of the eighteenth international symposium on software testing and analysis, 2009, pp. 141-152, doi:10.1145/1572272.1572290.

[14] R. Abreu, P. Zoeteweij, and A. Van Gemund, "An Evaluation of Similarity Coefficients for Software Fault Localization," in 2006 12th pacific rim international symposium on dependable computing (prdc'06), 2006, pp. 39-46, doi:10.1109/PRDC.2006.18.

[15] R. Abreu, P. Zoeteweij, R. Golsteijn, and A. J. C. van Gemund, "A practical evaluation of spectrum-based fault localization," Journal of Systems and Software, vol. 82, no. 11, pp. 1780-1792, 2009, doi:10.1016/j.jss.2009.06.035.

[16] P. Agarwal and A. P. Agrawal, "Fault-localization Techniques for Software Systems: A Literature Review," SIGSOFT Softw. Eng. Notes, vol. 39, no. 5, pp. 1-8, Sep. 2014, doi:10.1145/2659118.2659125.

[17] W. E. Wong, R. Gao, Y. Li, R. Abreu, and F. Wotawa, "A Survey on Software Fault Localization," IEEE Transactions on Software Engineering, vol. 42, no. 8, pp. 707-740, Aug. 2016, doi:10.1109/TSE.2016.2521368.

[18] A. Zeller, "Yesterday, My Program Worked. Today, It Does Not. Why?" SIGSOFT Softw. Eng. Notes, vol. 24, no. 6, pp. 253-267, Oct. 1999, doi:10.1145/318774.318946.

[19] M. Renieres and S. Reiss, "Fault localization with nearest neighbor queries," in 18th ieee international conference on automated software engineering, 2003. proceedings., 2003, pp. 30-39, doi:10.1109/ASE.2003.1240292.

[20] F. Tip, "A survey of program slicing techniques," Journal of programming languages, vol. 3, no. 3, pp. 121-189, 1995.

[21] X. Mao, Y. Lei, Z. Dai, Y. Qi, and C. Wang, "Slice-based statistical fault localization," Journal of Systems and Software, vol. 89, no. 1, pp. 51-62, 2014, doi:10.1016/j.jss.2013.08.031.

[22] A. Marcus, A. Sergeyev, V. Rajlieh, and J. I. Maletic, "An information retrieval approach to concept location in source code," Proceedings - Working Conference on Reverse Engineering, WCRE, pp. 214-223, 2004, doi:10.1109/WCRE.2004.10.

[23] J. Zhou, H. Zhang, and D. Lo, "Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports," Proceedings - International Conference on Software Engineering, pp. 14-24, 2012, doi:10.1109/ICSE.2012.6227210.

[24] T.-D. B. Le, R. J. Oentaryo, and D. Lo, "Information retrieval and spectrum based bug localization: better together," in Proceedings of the 2015 10th joint meeting on foundations of software engineering - esec/fse 2015, 2015, pp. 579-590, doi:10.1145/2786805.2786880.

[25] S. Artzi, J. Dolby, F. Tip, and M. Pistoia, "Directed Test Generation for Effective Fault Localization," in Proceedings of the 19th international symposium on software testing and analysis, 2010, pp. 49-60, doi:10.1145/1831708.1831715.

[26] S. K. Sahoo, J. Criswell, C. Geigle, and V. Adve, "Using likely invariants for automated software fault localization," in Proceedings of the eighteenth international conference on architectural support for programming languages and operating systems, 2013, vol. 41, p. 139, doi:10.1145/2451116.2451131.

[27] A. Perez, R. Abreu, and A. Riboira, "A Dynamic Code Coverage Approach to Maximize Fault Localization Efficiency," J. Syst. Softw., vol. 90, pp. 18-28, Apr. 2014, doi:10.1016/j.jss.2013.12.036.

[28] S. Pearson, J. Campos, R. Just, G. Fraser, R. Abreu, M. D. Ernst, D. Pang, and B. Keller, "Evaluating and Improving Fault Localization," in Proceedings of the 39th international conference on software engineering, 2017, pp. 609-620, doi:10.1109/ICSE.2017.62.

[29] J. A. Jones and M. J. Harrold, "Empirical Evaluation of the Tarantula Automatic Fault-localization Technique," in Proceedings of the 20th ieee/acm international conference on automated software engineering, 2005, pp. 273-282, doi:10.1145/1101908.1101949.

[30] H. Agrawal, J. Horgan, S. London, and W. Wong, "Fault localization using execution slices and dataflow tests," in Proceedings of sixth international symposium on software reliability engineering. issre'95, 1995, pp. 143-151, doi:10.1109/ISSRE.1995.497652.

[31] H. Cleve and A. Zeller, "Locating causes of program failures," Proceedings of the 27th international conference on Software engineering - ICSE '05, p. 342, 2005, doi:10.1145/1062455.1062522.

[32] S. Horwitz, "Identifying the Semantic and Textual Differences Between Two Versions of a Program," SIGPLAN Not., vol. 25, no. 6, pp. 234-245, Jun. 1990, doi:10.1145/93548.93574.

[33] F. Steimann, M. Frenkel, and R. Abreu, "Threats to the validity and value of empirical assessments of the accuracy of coverage-based fault locators," Proceedings of the 2013 International Symposium on Software Testing and Analysis - ISSTA 2013, p. 314, 2013, doi:10.1145/2483760.2483767.

[34] E. Wong, T. Wei, Y. Qi, and L. Zhao, "A Crosstab-based Statistical Method for Effective Fault Localization," in 2008 international conference on software testing, verification, and validation, 2008, pp. 42-51, doi:10.1109/ICST.2008.65.

[35] D. Landsberg, H. Chockler, D. Kroening, and M. Lewis, "Evaluation of Measures for Statistical Fault Localisation and an Optimising Scheme," in International conference on fundamental approaches to software engineering, 2015, vol. 9033, pp. 115-129, doi:10.1007/978-3-662-46675-9.

[36] Y. Zheng, Z. Wang, X. Fan, X. Chen, and Z. Yang, "Localizing multiple software faults based on evolution algorithm," Journal of Systems and Software, vol. 139, pp. 107-123, 2018, doi:10.1016/j.jss.2018.02.001.

[37] S. Ali, J. H. Andrews, T. Dhandapani, and W. Wang, "Evaluating the Accuracy of Fault Localization Techniques," 2009 IEEE/ACM International Conference on Automated Software Engineering, pp. 76-87, 2009, doi:10.1109/ASE.2009.89.

[38] C. Parnin and A. Orso, "Are Automated Debugging Techniques Actually Helping Programmers?" in ISSTA, 2011, pp. 199-209.

[39] S. Moon, Y. Kim, M. Kim, and S. Yoo, "Ask the Mutants: Mutating faulty programs for fault localization," Proceedings - IEEE 7th International Conference on Software Testing, Verification and Validation, ICST 2014, pp. 153-162, 2014, doi:10.1109/ICST.2014.28.

[40] C. Liu, C. Chen, J. Han, and P. S. Yu, "GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis," in Proceedings of the 12th acm sigkdd international conference on knowledge discovery and data mining, 2006, pp. 872-881, doi:10.1145/1150402.1150522.

[41] J. Ferrante, K. J. Ottenstein, and J. D. Warren, "The program dependence graph and its use in optimization," vol. 9. pp. 319-349, Jul-1987.

[42] N. DiGiuseppe and J. A. Jones, "On the influence of multiple faults on coverage-based fault localization," in Proceedings of the 2011 international symposium on software testing and analysis - issta '11, 2011, p. 210, doi:10.1145/2001420.2001446.

[43] W. E. Wong, Y. Qi, L. Zhao, and K. Y. Cai, "Effective fault localization using code coverage," Proceedings - International Computer Software and Applications Conference, vol. 1, no. Compsac, pp. 449-456, 2007, doi:10.1109/COMPSAC.2007.109.

[44] L. Naish, H. J. Lee, and K. Ramamohanarao, "A model for spectra-based software diagnosis," ACM Transactions on Software Engineering and Methodology, vol. 20, no. 3, pp. 1-32, 2011, doi:10.1145/2000791.2000795.

[45] C. Liu, H. Yu, P. S. Yu, X. Yan, H. Yu, J. Han, and P. S. Yu, "Mining Behavior Graphs for ‘Backtrace' of Noncrashing Bugs," in Proceedings of the 2005 siam international conference on data mining, 2005, pp. 286-297, doi:10.1137/1.9781611972757.26.

[46] F. Eichinger, K. Böhm, and M. Huber, "Mining Edge-Weighted Call Graphs to Localise Software Bugs," in European conference machine learning and knowledge discovery in databases, 2008, pp. 333-348, doi:10.1007/978-3-540-87479-9_40.

[47] T. M. Chilimbi, B. Liblit, K. Mehra, A. V. Nori, and K. Vaswani, "HOLMES: Effective Statistical Debugging via Efficient Path Profiling," in Proceedings of the 31st international conference on software engineering, 2009, pp. 34-44, doi:10.1109/ICSE.2009.5070506.

[48] A. Yousefi and A. Wassyng, "A Call Graph Mining and Matching Based Defect Localization Technique," in 2013 ieee sixth international conference on software testing, verification and validation workshops, 2013, pp. 86-95, doi:10.1109/ICSTW.2013.17.

[49] T. Diamantopoulos and A. Symeonidis, "Localizing Software Bugs using the Edit Distance of Call Traces," International Journal on Advances in Software, vol. 7, no. 1 & 2, pp. 277-288, 2014.

[50] Z. Mousavian, M. Vahidi-Asl, and S. Parsa, "Scalable Graph Analyzing Approach for Software Fault-localization," in Proceedings of the 6th international workshop on automation of software test, 2011, pp. 15-21, doi:10.1145/1982595.1982599.

[51] F. Eichinger, K. Krogmann, R. Klug, and K. Böhm, "Software-defect Localisation by Mining Dataflow-enabled Call Graphs," in Proceedings of the 2010 european conference on machine learning and knowledge discovery in databases: Part i, 2010, pp. 425-441.


  1. The coverage can be collected at other levels as well. For instance there has been work which collects it at the class, method, and basic block levels.

  2. A basic block is a sequence of sequential instructions, always entered from the first instruction and exited from the last [4].

  3. It is ahistorical to call Tarantula metric a suspiciousness score when referring to the 2002 paper [1]. Jones introduced the term suspiciousness score in the 2005 paper [29] for the purpose of ranking the statements. However, the term is now in common use and it was explained above.