Sampling Code Clones from Program Dependence Graphs with GRAPLE

by  Tim Henderson and Andy Podgurski

Tim A. D. Henderson and Andy Podgurski. Sampling Code Clones from Program Dependence Graphs with GRAPLE. SWAN 2016.


We present GRAPLE, a method to generate a representative sample of recurring (frequent) subgraphs of any directed labeled graph(s). GRAPLE is based on frequent subgraph mining, absorbing Markov chains, and Horvitz-Thompson estimation. It can be used to sample any kind of graph representation for programs. One of many software engineering applications for finding recurring subgraphs is detecting duplicated code (code clones) from representations such as program dependence graphs (PDGs) and abstract syntax trees. To assess the usefulness of clones detected from PDGs, we conducted a case study on a 73 KLOC commercial Android application developed over 5 years. Nine of the application's developers participated. To our knowledge, it is the first study to have professional developers examine code clones detected from PDGs. We describe a new PDG generation tool jpdg for JVM languages, which was used to generate the dependence graphs used in the study.


Code clones are similar fragments of program code. They can arise from copying and pasting, using certain design patterns or certain APIs, or adhering to coding conventions, among other causes. Code clones create maintenance hazards, because they often require subtle context-dependent adaptation and because other changes must be applied to each member of a clone class. To manage clone evolution the clones must first be found. Clones can be detected using any program representation: source code text, tokens, abstract syntax trees (ASTs), flow graphs, dependence graphs, etc. Each representation has advantages and disadvantages for clone detection.

PDG-based clone detection finds dependence clones corresponding to recurring subgraphs of a program dependence graph (PDG). Since PDGs are oblivious to semantics preserving statement reorderings they are well suited to detect semantic (functionally equivalent) clones. A number of algorithms find clones from PDGs. However, as Bellon notes, "PDG based techniques are computationally expensive and often report non-contiguous clones that may not be perceived as clones by a human evaluator." Most PDG-based clone detection tools are biased, detecting certain clones but not others.

The root cause of scalability problems with PDG-based clone detection is the number of dependence clones. The Background Section (in the pdf) illustrates this with an example in which we used an unbiased frequent subgraph mining algorithm to detect all dependence clones in Java programs. In programs with about 70 KLOC it detected around 10 million clones before disk space was exhausted. Processing all dependence clones is impractical even for modestly sized programs.

Instead of exhaustively enumerating all dependence clones, an unbiased random sample can be used to statistically estimate parameters of the whole "population" of clones, such as the prevalence of clones exhibiting properties of interest. For these reasons, we developed a statistically unbiased method for sampling dependence clones and for estimating parameters of the whole clone population.

We present GRAPLE (GRAph samPLE), a method to generate a representative sample of recurring subgraphs of any directed labeled graph(s). It can be used to sample subgraphs from any kind of program graph representation. GRAPLE is not a general purpose clone detector but it can answer questions about dependence clones that other PDG-based clone detection tools cannot. We conducted a preliminary case study on a commercial application and had its developers evaluate whether the sampled subgraphs represented code duplication. To our knowledge, it is the first study to have professional programmers examine dependence clones. GRAPLE has applications in bug mining, test case selection, and bioinformatics. The sampling algorithm also applies to frequent item sets, subsequences, and subtrees allowing code clone sampling from tokens and ASTs.


See the PDF for the complete paper. ACM has the exclusive rights to publish this paper. Tim Henderson and Andy Podgurski own the copyright. Follow the DOI for the ACM copy. This copy is posted here with the permission of ACM.