An Improved Method for Code Cloning in Web Mining

International Journal of Latest Trends in Engineering and Technology (IJLTET)
Ramandeep Kaur
Student, CEC Landran, Punjab
Gurdeep Kaur
Asst. Prof., CEC Landran, Punjab
Parminder Singh
Asst. Prof., CEC Landran, Punjab
Abstract-Code cloning in web mining has been an active research area for many years. Clone detection is the process of
finding duplications in source code. Many techniques have been proposed to find duplicated, unwanted code, also
known as software clones. In this paper, we present an improved method for detecting code clones, based on the
k-means clustering algorithm. We have applied our algorithm in a clone detection tool called Deckard and analysed it
on large code bases written in Java. Our experimental results show that our tool is effective and efficient in accuracy
as well as in speed. Recognition of clones helps in designing the system for better maintenance. Cloned code is
problematic for several reasons: unnecessary duplicates increase the size of the source code and the maintenance cost,
and inconsistent changes to cloned code can create defects that lead to incorrect program behavior. Existing approaches
either do not scale to large code bases or are not robust against slight code modifications.
Keywords – Cloning, K-means, Cluster
I. INTRODUCTION
Code cloning is a phenomenon that usually arises in large systems. Code clones occur for several reasons, the most
common being the copying of a code fragment. Because it produces duplicated code, cloning is regarded as bad
practice. During maintenance, this unwarranted code gives rise to various problems:
1. When an error is repaired in a cloned fragment, every clone of that fragment must be checked for the same
error.
2. Compile time increases because code clones increase the size of the code.
Methods and tools for finding code clones are thus highly desired, especially in the software maintenance
community. Researchers have proposed a great number of approaches with suitable results. Nevertheless, code
clones still arise in large software systems, where they are one of the main factors that reduce maintainability.
Various code clone detection methods have been proposed to detect code duplication automatically in large scale
software. However, it is still difficult to use them to enhance maintainability, because there are many code
duplications that should persist. A code clone is a portion of code in a source file that is similar or identical to
another. Clones are a main issue in software development for many reasons: the source code becomes larger as well as
more difficult to understand. Cloning can seem a useful approach to development because it offers reuse and speeds
up implementation. However, its effect on the code can be very negative [2].
Types of code clones
Type I:
Identical code fragments except for variations in whitespace and comments, called Exact clones.
Type II:
Vol. 5 Issue 2 March 2015
ISSN: 2278-621X
Syntactically or structurally identical fragments except for variations in literals, identifiers, layouts, types and
comments, called Renamed clones.
Type III:
Copied fragments with further changes: statements can be added, changed or removed, in addition to variations in
literals, identifiers, layouts, types and comments. These are called Gapped clones.
Type IV:
Code fragments that perform the same functionality but are implemented by different syntactic variants, called
Semantic clones.
II. TECHNIQUES OF CLONING
A. String based technique
String based techniques use basic string transformations and comparison algorithms, which makes them
independent of the programming language. Comparing signatures calculated per line is one alternative for
identifying matching substrings. Line matching, which comes in two variants, is selected as the representative of
this category because it uses general string manipulations.
Simple Line Matching
This is the first variant of line matching. Both of its detection phases (transformation and comparison) are
straightforward. Only small changes are applied to the source, using string manipulation operations that require
little or no knowledge of the possible language constructs. Typical transformations are the removal of whitespace and
empty lines. During comparison, all lines are compared with each other using a string matching algorithm. This
results in a large search space, which is usually reduced by using hashing buckets: before comparison, every line is
hashed into one of n possible buckets, and afterwards only the pairs in the same bucket are compared.
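The bucketing idea can be illustrated with a minimal Python sketch. It is not the implementation of any particular tool: the function names, the default bucket count, and the use of CRC32 as the hash function are assumptions made only for this example.

```python
import zlib
from collections import defaultdict
from itertools import combinations


def normalize(line):
    """Transformation phase: remove all whitespace from a line."""
    return "".join(line.split())


def matching_line_pairs(lines, n_buckets=64):
    """Comparison phase: report pairs of line numbers whose normalized text is equal.

    n_buckets and the CRC32 hash are illustrative choices, not taken from the paper.
    """
    buckets = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        text = normalize(line)
        if text:  # empty lines are dropped
            bucket = zlib.crc32(text.encode("utf-8")) % n_buckets
            buckets[bucket].append((number, text))

    pairs = []
    for entries in buckets.values():
        # Only lines that landed in the same bucket are compared with each other.
        for (i, a), (j, b) in combinations(entries, 2):
            if a == b:
                pairs.append((i, j))
    return sorted(pairs)


source = ["int a = 1;", "", "int  a=1;", "return a;"]
print(matching_line_pairs(source))
```

Equal lines always hash to the same bucket, so bucketing never loses a match; it only avoids comparing lines that cannot be equal.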
Parameterized Line Matching
This is the second variant of line matching. It detects both identical and similar code fragments. The idea is that
literals and identifier names are the elements most likely to change when a code fragment is cloned, so they are
treated as changeable parameters. Fragments that differ only in the naming of these parameters are therefore
accepted as matches. To enable such parameterization, the set of transformations is extended with an additional
transformation that substitutes all literals and identifiers with one common identifier symbol such as "$". Because
of this additional replacement, the comparison no longer depends on the parameters, and no changes are needed to
the comparison algorithm itself.
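A minimal Python sketch of this extra transformation is given below. The regular expression and the small keyword set are assumptions made for illustration; a real tool would need language-specific rules to separate keywords from identifiers.

```python
import re

# Hypothetical, deliberately tiny keyword set; these words are kept as they are.
KEYWORDS = {"if", "else", "for", "while", "return", "int"}

# String literals, numeric literals, and identifier-like words (tried in that order).
TOKEN = re.compile(r'"[^"\n]*"|\d+(?:\.\d+)?|[A-Za-z_]\w*')


def parameterize(line):
    """Replace every literal and identifier with "$", then strip whitespace."""
    def substitute(match):
        word = match.group(0)
        return word if word in KEYWORDS else "$"

    return "".join(TOKEN.sub(substitute, line).split())


print(parameterize("int total = count + 1;"))
print(parameterize("int sum = n + 42;"))
```

Both calls yield the same parameterized line, so an unchanged line comparison step would report the two statements as a match.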
B. Token based techniques
This technique uses a more sophisticated transformation algorithm. It needs a lexer, as it constructs a token stream
from the source code. The availability of such tokens makes it possible to use enhanced comparison algorithms.
Parameterized matching with suffix trees is taken as the representative of this category, as it transforms the source
code into a token structure which is matched later on. Such techniques can also eliminate much more detail by
filtering out uninteresting code fragments.
Parameterized Matching With Suffix Trees
It consists of three consecutive steps that use a suffix tree as the internal representation. In the first step, a lexical
analyser passes over the source text, transforming literals and identifiers into parameter symbols, while the
typographical structure of each line is encoded in a non-parameter symbol. One symbol always refers to the same
literal, identifier or structure. The first step results in a parameterized string, or p-string. Once the p-string is
obtained, a criterion is needed to decide whether two sequences in this p-string are a parameterized match or not.
Two strings are a parameterized match if one can be changed into the other by applying a one-to-one mapping that
renames the parameter symbols. An additional encoding prev(S) of the parameter symbols helps to verify this
criterion. In this encoding, every first occurrence of a parameter symbol is substituted by 0, and every later
occurrence is substituted by the distance to the previous occurrence of the same symbol. Thus, when two sequences
have the same encoding, they are the same except for a systematic renaming of the parameter symbols. In the second
step, a data structure called a parameterized suffix tree (p-suffix tree) is built for the p-string produced by the
lexical analysis. A p-suffix tree is a generalisation of the suffix tree data structure that includes the prev() encoding
of every suffix of a p-string. The use of a suffix tree allows a more effective and efficient detection of maximal
parameterized matches. The last step finds maximal paths in the p-suffix tree that are longer than a predefined
character length.
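The prev() encoding can be sketched in a few lines of Python. The sketch covers only the encoding of a single token sequence, not the construction of the p-suffix tree, and the representation of tokens as strings with a separate set of parameter symbols is an assumption made for this example.

```python
def prev_encode(tokens, parameters):
    """Encode a token sequence with the prev() scheme.

    Non-parameter tokens are copied unchanged. The first occurrence of a
    parameter symbol becomes 0; every later occurrence becomes the distance
    (in positions) to the previous occurrence of the same symbol.
    """
    last_seen = {}
    encoded = []
    for position, token in enumerate(tokens):
        if token in parameters:
            previous = last_seen.get(token)
            encoded.append(0 if previous is None else position - previous)
            last_seen[token] = position
        else:
            encoded.append(token)
    return encoded


first = prev_encode(["x", "=", "x", "+", "y"], {"x", "y"})
second = prev_encode(["a", "=", "a", "+", "b"], {"a", "b"})
print(first, first == second)
```

The two sequences differ only by the renaming x to a and y to b, and they receive the same encoding. A sequence such as `x = y + y` is encoded differently, because no one-to-one renaming maps it onto `x = x + y`.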
C. PDG (Program dependency graph) based techniques
In this approach, the control and data flow dependencies of a function are represented by a program dependency
graph, and clones are recognized as isomorphic subgraphs. The detection accuracy is very high, as this approach can
detect code clones that other methods miss, such as reordered clones and semantic clones. Because it requires
complex computations, it is very difficult to apply to large software.
D. Metric based techniques
In this technique, the source code is first divided into functional units. Metrics are then calculated for each unit,
and units with similar metric values are reported as code clones.
Metric based techniques collect a number of metrics for code fragments and then compare the metric vectors rather
than the code or the abstract syntax tree (AST) directly. In most cases, the source code is first parsed into a control
flow graph (CFG) or an abstract syntax tree, on which the metrics are then calculated. Metric based approaches have
been applied to detect duplicate web pages and clones in web documents.
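As a rough illustration, the Python sketch below computes a small metric vector for each functional unit and groups units whose vectors are equal. The three metrics and the exact-equality test are assumptions chosen for brevity; practical tools use richer metrics, often computed on an AST or CFG, and a similarity threshold rather than equality.

```python
import re
from collections import defaultdict


def metric_vector(unit):
    """Return (non-blank lines, branching keywords, word/number tokens) for a unit."""
    lines = [line for line in unit.splitlines() if line.strip()]
    branches = re.findall(r"\b(?:if|for|while)\b", unit)
    tokens = re.findall(r"[A-Za-z_]\w*|\d+", unit)
    return (len(lines), len(branches), len(tokens))


def candidate_clones(units):
    """Group unit names whose metric vectors are identical."""
    groups = defaultdict(list)
    for name, text in units.items():
        groups[metric_vector(text)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


units = {
    "f": "int f(int x) {\n  if (x > 0) return x;\n  return 0;\n}",
    "g": "int g(int y) {\n  if (y > 1) return y;\n  return 2;\n}",
    "h": "int h() {\n  return 1;\n}",
}
print(candidate_clones(units))
```

Units with equal vectors are only candidates: unrelated code can share the same metric values, so a follow-up comparison of the fragments is normally needed.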
E. Tree based techniques
Tree based methods first transform the program into an abstract syntax tree (AST) or parse tree using a parser for
the target language. Tree matching techniques are then applied to find similar subtrees, and the corresponding code
segments are returned as clone pairs or clone classes. Literal values, variable names and other leaves (tokens) in the
source may be abstracted in the tree representation, allowing for more advanced detection of clones.
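The idea can be sketched with Python's standard `ast` module, applied here to Python source rather than Java for the sake of a self-contained example. Names and literal values are abstracted away, and two functions are grouped when the remaining trees are identical. The class and function names are hypothetical, and only top-level functions are examined.

```python
import ast
from collections import defaultdict


class AbstractLeaves(ast.NodeTransformer):
    """Replace variable names, argument names and literal values with placeholders."""

    def visit_Name(self, node):
        node.id = "_"
        return node

    def visit_arg(self, node):
        node.arg = "_"
        return node

    def visit_Constant(self, node):
        node.value = 0
        return node


def clone_groups(source):
    """Group top-level functions whose abstracted syntax trees are identical."""
    tree = ast.parse(source)
    groups = defaultdict(list)
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            name = node.name
            node.name = "_"  # the function's own name is abstracted as well
            AbstractLeaves().visit(node)
            groups[ast.dump(node)].append(name)
    return [names for names in groups.values() if len(names) > 1]


code = """
def add(a, b):
    return a + b + 1

def plus(x, y):
    return x + y + 2

def mul(a, b):
    return a * b
"""
print(clone_groups(code))
```

Mapping every name to the same placeholder is a coarse abstraction: it also matches fragments whose variables are used inconsistently, such as `a + b` against `a + a`. It also finds only whole-function matches, whereas tree matching in general compares arbitrary subtrees.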
III. LITERATURE SURVEY
Gayathri Devi et al. [2] describe a method for finding code clones using fragment distance with clustering. The
source code is first split into tokens. Similarity is then detected by distance and clustering until all clusters are
merged. Finally, the code fragments are analysed and identified using the distance clusters.
Deepak Sethi et al. [1] observe that code clones, or duplicated code, are one of the major factors that deteriorate the
structure and the design of software. Their method can be implemented using standard parsing technology. It detects
clones in arbitrary languages, then constructs and counts the clones without modifying the operation of the
program. The Solid SDD tool provides a way of visualizing clone detection results that is observably different from
the popular visualization using scatter plots.
Girija Gupta et al. [4] design and implement a code clone detector tool. The novel aspect of the work is a metric
based approach applied to Java source code, with the metrics calculated from Java bytecode. The source code is then
refactored in order to reduce code clones. Bytecode gives the source code a uniform representation, which is given as
input to the tool for calculating metric values, so the tool is able to find semantic clones to some extent. Because
bytecode is platform independent, the tool is more effective than previously existing tools. Abstract syntax tree
based and program dependence graph based approaches have some disadvantages: they take a lot of time and are
complex for the detection of clones. The proposed tool reduces this work by detecting potential clones with more
ease.
IV. PROPOSED METHODOLOGY
K MEANS CLUSTERING
K-means clustering is a partitioning based cluster analysis technique. The algorithm first selects k data values as the
starting cluster centers and then calculates the distance between each data value and each cluster center. Each data
value is assigned to the nearest cluster, the mean of every cluster is updated, and the process is repeated until the
stopping criterion is met [14].
K-means clustering aims to divide the data into k clusters in which each data value belongs to the cluster with the
nearest mean.
Basic k-means algorithm:
1. Choose the number of clusters, k.
2. Initialize the centers of the k clusters.
3. Assign each data point to the nearest cluster.
4. Update the position of each cluster center to the mean of all data points that belong to that cluster.
5. Repeat steps 3 and 4 until all objects are allocated to their clusters, that is, until the assignments stop changing.
Figure 1. Flowchart of k-means clustering: specify the number of clusters k, select the cluster centers, assign each
data point to the closest cluster, update the position of each cluster, and repeat until all objects are allocated.
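The basic algorithm can be sketched in plain Python as follows. This is a generic k-means sketch, not the implementation evaluated in this paper: taking the first k points as initial centers, using Euclidean distance, and stopping when the assignments no longer change are all assumptions made for the example.

```python
import math


def kmeans(points, k, max_iter=100):
    """Cluster equal-length numeric tuples into k groups.

    Returns (centers, labels), where labels[i] is the cluster index of points[i].
    """
    # Assumption: the first k points serve as the initial cluster centers.
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assign every point to the cluster with the nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        # Move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centers, labels


data = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.0)]
centers, labels = kmeans(data, k=2)
print(centers, labels)
```

In a clone detection setting, the points would be numeric vectors describing code fragments, so that fragments falling into the same cluster become clone candidates. How those vectors are built is outside the scope of this sketch.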
V. EXPERIMENT AND RESULTS
For the analysis of our proposed algorithm we used a laptop with 2 GB of RAM, a 2 GHz dual-core processor, and
Ubuntu 12.04 installed. We evaluated the lines of code and the execution time. Our results indicate that the
proposed method achieves better efficiency.
Figure 2. Lines of code for the different projects: (a) Liquibase, (b) Page Turner, (c) Record Breaker
Figure 2 shows the total number of cloned lines detected in each project. For Deckard, we used a variety of
configuration options: minT (the minimum number of tokens required for clones) was set to 30 or 50, stride (the
distance between two code segments) was set to 2, 4, 8, or 16, and similarity (how similar two points should be) was
set to 0.9, 0.95, or 1.0.
Figures 2(a), (b), and (c) show the cloned lines detected by Deckard. The number of cloned lines detected increases
as the similarity decreases.
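The size of this configuration space can be made concrete with a short Python sketch. It assumes that every combination of the three options was run, which the text does not state explicitly. The dictionary keys simply reuse the option names given above and are not meant to reproduce the tool's configuration file.

```python
from itertools import product

MIN_T = (30, 50)               # minimum number of tokens required for clones
STRIDE = (2, 4, 8, 16)         # distance between two code segments
SIMILARITY = (0.9, 0.95, 1.0)  # how similar two points should be

# Assumption: the full cross product of the three options is evaluated.
configs = [
    {"minT": min_t, "stride": stride, "similarity": similarity}
    for min_t, stride, similarity in product(MIN_T, STRIDE, SIMILARITY)
]

print(len(configs))
print(configs[0])
```

Under that assumption there are 2 x 4 x 3 = 24 configurations per project.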
Figure 3. Execution time (s) for the different projects
The results show that our method is more efficient when the k-means clustering algorithm is used. The execution
time is higher for Liquibase and lower for Page Turner.
VI. CONCLUSION
In this paper, we have presented a new technique for detecting code clones. Using k-means clustering, we are able
to find the positions of the clusters. Detecting code clones helps to improve the quality of the code. We have
evaluated our tool on large code bases written in Java. The results show that the Deckard tool can find more code
clones, with faster execution time and higher accuracy. We have also reviewed clone types and detection
techniques. The k-means clustering algorithm is simple to implement and takes little memory. We believe that our
technique is useful and scalable, and the tool finds a significant number of code clones. Identification and
subsequent unification of simple clones is useful in software maintenance. Our main goal was to measure the total
number of cloned lines of code and the execution time with the help of k-means clustering.
REFERENCES
[1] Deepak Sethi, Manisha Sehrawat, Bharat Bhushan Naib, “Detection of Code Clone using Datasets”, International Journal of Advanced
Research in Computer Science and Software Engineering, pp. 263-268, Volume 2, Issue 7, July 2012.
[2] D. Gayathri Devi, Dr. M. Punithavalli, “An Effective Software Clone Detection Using Distance Clustering”, International Journal of
Engineering and Technology (IJET), pp. 232-238, Vol 5 No 1, Feb-Mar 2013.
[3] Marius Muja and David G. Lowe, “Scalable Nearest Neighbor Algorithms for High Dimensional Data”, IEEE Transactions on
Pattern Analysis and Machine Intelligence, pp. 2227-2240, Vol. 36, No. 11, November 2014.
[4] Girija Gupta, Indu Singh, “A Novel Approach Towards Code Clone Detection and Redesigning”, International Journal of Advanced
Research in Computer Science and Software Engineering, pp. 331-338, Volume 3, Issue 9, September 2013.
[5] Chanchal K. Roy, James R. Cordy, Rainer Koschke, “Comparison and evaluation of code clone detection techniques and tools: A
qualitative approach”, Science of Computer Programming, Elsevier, pp. 470-495, 2009.
[6] Prajila Prem, “A Review on Code Clone Analysis and Code Clone Detection”, International Journal of Engineering and Innovative
Technology (IJEIT), pp. 43-46, Volume 2, Issue 12, June 2013.
[7] Mohammed Abdul Bari, Dr. Shahanawaj Ahamad, “Code Cloning: The Analysis, Detection and Removal”, International Journal of
Computer Applications, pp. 34-38, Volume 20, No. 7, April 2011.
[8] Doaa M. Shawky, Ahmed F. Ali, “An Approach for Assessing Similarity Metrics Used in Metric-based Clone Detection Techniques”,
Computer Science and Information Technology (ICCSIT), 3rd IEEE International Conference on, Volume 1, pp. 580-584, 2010.
[9] G. Anil Kumar, Dr. C.R.K. Reddy, Dr. A. Govardhan, Gousiya Begum, “Code Clone Detection with Refactoring Support Through Textual
Analysis”, International Journal of Computer Trends and Technology, pp. 147-150, Volume 2, Issue 2, 2011.
[10] C.K. Roy, J.R. Cordy, “Near-miss function clones in open source software: an empirical study”, Journal of Software Maintenance and
Evolution: Research and Practice, 2009.
[11] Swarupa S. Bongale, Prof. K. B. Manwade, Prof. G. A. Patil, “An Efficient Data Mining Approach for Complex Clone Detection in
Software”, International Journal of Advanced Research in Computer Science and Software Engineering, pp. 714-721, Volume 3, Issue 5,
May 2013.
[12] S. Mythili and Dr. S. Sarala, “Detection of Recurring Clones Using Weighted Frequent Itemset Mining”, International Journal of Software
Engineering and Its Applications, pp. 159-176, Vol. 8, No. 7, 2014.
[13] Chanchal K. Roy and James R. Cordy, Program Comprehension, 2008. ICPC 2008. The 16th IEEE International Conference on, June 2008.
[14] Er. Nikhil Chaturvedi and Er. Anand Rajavat, “An Improvement in K-mean Clustering Algorithm Using Better Time and Accuracy”,
International Journal of Programming Languages and Applications (IJPLA), pp. 13-19, Vol. 3, No. 4, October 2013.