Semantic code clones are regions of duplicated code that may appear dissimilar but compute similar functions. Since in general it is algorithmically undecidable whether two or more programs compute the same function, locating all semantic code clones is infeasible. One way to dodge the undecidability issue and find potential semantic clones, using only static information, is to search for recurring subgraphs of a program dependence graph (PDG). PDGs represent control and data dependence relationships between statements or operations in a program. PDG-based clone detection techniques, unlike syntactically-based techniques, do not distinguish between code fragments that differ only because of dependence-preserving statement re-orderings, which also preserve semantics. Consequently, they detect clones that are difficult to find by other means. Despite this very desirable property, work on PDG-based clone detection has largely stalled, apparently because of concerns about the scalability of the approach. We argue, however, that the time has come to reconsider PDG-based clone detection, as a part of a holistic strategy for clone management. We present evidence that its scalability problems are not as severe as previously thought. This suggests the possibility of developing integrated clone management systems that fuse information from multiple clone detection methods, including PDG-based ones.
Fragments of similar code are typically scattered throughout large code bases. These repeated fragments or code clones often result from programmers copying and pasting code. Code clones (or just clones) may also result from limitations of a programming language, use of certain APIs or design patterns, following coding conventions, or a variety of other causes. Whatever their causes, existing clones need to be managed. When a programmer modifies a region of code that is cloned in another location in the program, they should make an active decision whether or not to modify the other location. Clearly, such decisions can only be made if the programmer is aware of the other location.
In general, there are 4 types of code clones:
- Type-1 Clones - Identical regions of code (excepting whitespace and comments).
- Type-2 Clones - Syntactically equivalent regions (excepting names, literals, types, and comments).
- Type-3 Clones - Syntactically similar regions (as in Type-2) but with minor differences such as statement additions or deletions.
- Type-4 Clones - Regions of code with functionally equivalent behavior but possibly with different syntactic structures.
Much of the research on code clone detection and maintenance has been geared toward Type-1 and Type-2 clone, as they are easier to detect and validate than Type-3 and Type-4 clones. The two most popular detection methods involve searching for clones in token streams and abstract syntax trees (ASTs).
An alternative approach to clone detection is to search for them in a Program Dependence Graph (PDG), which represents the control and data dependences between statements or operations in a program. Recurring subgraphs in PDGs represent potential dependence clones. Some of the previous work on PDG-based clone detection used forward and backward path-slicing to find clones. This method can detect matching slices, but it cannot detect all recurring subgraphs. The latter can be identified using frequent subgraph mining (FSM). However, for low frequency thresholds, the number of PDG subgraphs discovered by FSM may be enormous. For example, we found that for a Java program with 70,000 lines of code (LOC), over 700 million PDG subgraphs with 5 or more instances were discovered by FSM.
Since it is infeasible for developers to examine so many subgraphs, we previously developed GRAPLE, an algorithm to select representative samples of maximal frequent subgraphs. In this paper, the core sampling process remains the same as in GRAPLE but we present a new algorithm for traversing the k-frequent subgraph lattice. One tricky aspect of FSM is how to define exactly what "frequency" means in a large connected graph. In order to handle pathological cases that occur in real programs, we introduce a new metric to measure subgraph frequency (or "support"), called the Greedy Independent Subgraphs (GIS) measure. The results section details the first empirical examination of the scalability and speed of sampling dependence clones from large programs. The study showed that our new system can quickly sample from programs with 500 KLOC of code and successfully sample from programs with perhaps 2 MLOC. Finally, since at times the sampling algorithm may return several potential clones, which are quite similar to each other, we evaluate the performance of a density-based clustering algorithm on the samples collected.