Finding code clones in open source systems is important for efficient and safe reuse of existing open source software. In this paper, we propose a novel search model, open code clone search, to explore code clones in open source repositories on the Internet. Based on this search model, we have designed and implemented a prototype system named OpenCCFinder. This system takes a query code fragment as its input and returns the code fragments that contain code clones of the query. It utilizes publicly available code search engines as external resources. Using OpenCCFinder, we have conducted several case studies for Java code. These case studies show the applicability of our system.
Source code of open-source software is permitted to be reused when and only when the conditions of its license are satisfied. There are many different conditions for reuse, since various open-source licenses are in use. Therefore, the license of the source code may affect the frequency of reuse or the properties of the software in which the source is reused. To identify the relationship between software license and reuse, we are planning to classify copy-and-pasted code fragments based on the license of the fragments. This paper presents a preliminary, manual investigation on a small set of source files. The result indicates that the license of a fragment affects the quantity and the license of copied fragments.
The increasing performance-price ratio of computer hardware makes it possible to explore a distributed approach to code clone analysis. This paper presents D-CCFinder, a distributed approach to large-scale code clone analysis. D-CCFinder has been implemented with 80 PC workstations in our student laboratory, and a vast collection of open source software with about 400 million lines in total has been analyzed with it in about 2 days. The result has been visualized as a scatter plot, which showed the presence of frequently used code as easily recognizable patterns.
2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR), 2017
The clone-and-own approach is a natural way of source code reuse for software developers. To assess how known bugs and security vulnerabilities of a cloned component affect an application, developers and security analysts need to identify the original version of the component and understand how the cloned component differs from the original one. Although developers may record the original version information in a version control system and/or directory names, such information is often either unavailable or incomplete. In this research, we propose a code search method that takes as input a set of source files and extracts all the components including similar files from a software ecosystem (i.e., a collection of existing versions of software packages). Our method employs an efficient file similarity computation using the b-bit minwise hashing technique. We use an aggregated file similarity for ranking components. To evaluate the effectiveness of this tool, we analyzed 75 cloned components in Firefox and Android source code. The tool took about two hours to report the original components from 10 million files in Debian GNU/Linux packages. Recall of the top five components in the extracted lists is 0.907, while recall of a baseline using SHA-1 file hash is 0.773, according to the ground truth recorded in the source code repositories.
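The file similarity computation mentioned in this abstract can be illustrated with a minimal sketch of b-bit minwise hashing. This is not the authors' implementation: the whitespace tokenizer, the shingle size, the use of salted BLAKE2b as the hash family, and all function names are assumptions made for illustration, and the estimator is the simplified form that assumes the shingle sets are small relative to the hash space.

```python
import hashlib
from typing import List, Set, Tuple


def shingles(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Split text on whitespace and return the set of n-token windows."""
    tokens = text.split()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def _hash(item: Tuple[str, ...], seed: int) -> int:
    """64-bit hash of one shingle; the seed selects the hash function."""
    data = "\x00".join(item).encode("utf-8")
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def signature(items: Set[Tuple[str, ...]], k: int = 64, b: int = 2) -> List[int]:
    """For each of k hash functions, keep only the lowest b bits of the minimum hash."""
    mask = (1 << b) - 1
    return [min(_hash(item, seed) for item in items) & mask for seed in range(k)]


def estimate_similarity(sig_a: List[int], sig_b: List[int], b: int = 2) -> float:
    """Estimate Jaccard similarity, correcting for accidental b-bit collisions."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    observed = matches / len(sig_a)
    chance = 1.0 / (1 << b)  # probability that two unrelated b-bit values agree
    return max(0.0, (observed - chance) / (1.0 - chance))


if __name__ == "__main__":
    sig_a = signature(shingles("int main ( ) { return 0 ; }"))
    sig_b = signature(shingles("int main ( ) { return 1 ; }"))
    print(estimate_similarity(sig_a, sig_b))
```

Storing only b bits per hash function keeps each signature small, which is what makes comparing a query against millions of files practical; the price is a noisier estimate, which a larger k can offset.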
Free and open source software (FOSS) plays an important role in source code reuse practice. FOSS projects usually come with one or more software licenses written in the header part of source files, stating the requirements and conditions that must be followed when the code is reused. Removal or modification of the license statement by re-distributors makes the license inconsistent with that of its ancestor and may potentially cause license infringement. In this paper, we describe and categorize different types of license inconsistencies and propose a method to detect them. We then applied this method to Debian 7.5 and a collection of 10,514 Java projects on GitHub and present the license inconsistency cases found in these systems. With a manual analysis, we summarized various reasons behind these license inconsistency cases, some of which imply potential license infringement and require attention from the developers. This analysis also exposes the difficulty of discovering license infringements, highlighting the usefulness of finding and maintaining source code provenance.
11th IEEE International Software Metrics Symposium (METRICS'05)
Generally, code clones are regarded as one of the factors that make software maintenance more difficult. A code clone is a set of source code fragments identical or similar to each other. From the viewpoint of software maintainability, code clones should be removed. However, sometimes there are dependency relations among code fragments, each of which belongs to a different code clone, and it is advisable to refactor all such code clones at once. In this paper, we focus on the case in which such a code fragment corresponds to a method body in Java programs. We define a "chained method" as a set of methods that have dependency relations. A set of "chained methods" whose elements are each other's code clones is called a "chained clone", and an equivalence class of "chained clones" is called a "chained clone set". We propose a refactoring support method for "chained clone sets" that provides an appropriate refactoring pattern for them. Finally, we present the "chained clone set" refactoring support tool that we have developed, together with some case studies to show the usefulness of the proposed method.
One of the important uses of source code clone detection analysis is plagiarism detection, where a file is compared against a known corpus of source code to try to find potential matches. As the availability of Free and Open Source Software (FOSS) continues to increase, it has become important to know whether specific source code has been created from copies of FOSS software. Version 5.0.2 of Debian GNU/Linux contains approximately 323 million SLOC, distributed in approximately 1.45 million files. Current clone detection tools are incapable of dealing with a corpus of this size, and might either take literally months to complete a detection run, or might simply crash due to lack of resources. In this paper we propose a time- and space-efficient token-based method to detect clones of a source code file against a known corpus of source code. With an empirical study, we demonstrate that our method is capable of finding clones of a file in a corpus of 100,000 source code files in a few seconds.
Recently, code clones have been regarded as one of the factors that make software maintenance more difficult. A code clone is a code fragment in source code that is identical or similar to another. For example, if we modify a code fragment which has code clones, it is necessary to consider whether we have to modify each of its code clones. There are two ways of supporting maintenance with respect to code clones. One is to comprehend and manage code clones, and the other is to remove them. For the former, we have developed the code clone analysis environment Gemini. For the latter, several methods have been proposed, but it is difficult to apply them to industrial software for various reasons, such as high time complexity. In this paper, we propose a method that detects refactoring-oriented code clones in a practical amount of time. We also develop a characterization of code clones by some metrics, which suggest how to remove them. Then, we develop the refactoring support tool Cancer. We expect that Cancer can support software maintenance more effectively.
Proceedings of the 5th International Workshop on Software Clones, 2011
Code clone detection tools may report a large number of code clones, while software developers are interested in only a subset of code clones that are relevant to software development tasks such as refactoring. Our research group has supported many software developers with the code clone detection tool CCFinder and its GUI front-end Gemini. Gemini shows clone sets (i.e., sets of code clones identical or similar to each other) with several clone metrics including their length and the number of code clones; however, it is not clear how to use those metrics to extract code clones that are interesting to developers. In this paper, we propose a method that combines clone metrics to extract code clones for refactoring activity. We have conducted an empirical study on a web application developed by a Japanese software company. The result indicates that combinations of simple clone metrics are more effective at extracting refactoring candidates from detected code clones than individual clone metrics.
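The idea of combining clone metrics can be sketched as a simple filter over clone sets. The abstract names only two metrics, clone length and the number of code clones in a set; the `CloneSet` type, the field names, and the thresholds below are hypothetical and are not taken from Gemini or from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CloneSet:
    """Hypothetical record for one clone set reported by a detector."""
    name: str
    length: int      # size of each cloned fragment, e.g. in tokens
    population: int  # number of code clones in the set


def refactoring_candidates(
    clone_sets: List[CloneSet], min_length: int, min_population: int
) -> List[CloneSet]:
    """Keep only clone sets that satisfy BOTH metric thresholds."""
    return [
        cs for cs in clone_sets
        if cs.length >= min_length and cs.population >= min_population
    ]


if __name__ == "__main__":
    reported = [
        CloneSet("A", length=120, population=2),
        CloneSet("B", length=35, population=9),
        CloneSet("C", length=90, population=6),
    ]
    for cs in refactoring_candidates(reported, min_length=50, min_population=3):
        print(cs.name)
```

A single-metric filter would keep either the long but rare set A or the short but frequent set B; requiring both conditions narrows the list to sets that are long and frequent at the same time.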
Fourth International Workshop on Mining Software Repositories (MSR'07:ICSE Workshops 2007), 2007
Most studies of the evolution of software systems are based on the comparison of simple software metrics. In this paper, we present our preliminary investigation of the evolution of the Linux kernel using code-clone analysis and the code-clone coverage metrics. We examined 136 versions of the stable Linux kernel using a distributed extension of the code clone detection tool CCFinder. The result is shown as a heat map.
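The abstract does not define its code-clone coverage metrics. As a rough illustration only, one common line-based reading is the fraction of lines that belong to at least one detected clone; the function name and the inclusive `(start, end)` range representation below are assumptions.

```python
from typing import Iterable, Tuple


def clone_coverage(total_lines: int, clone_ranges: Iterable[Tuple[int, int]]) -> float:
    """Fraction of lines covered by at least one clone.

    Each range is an inclusive (start_line, end_line) pair; overlapping
    ranges are counted once.
    """
    if total_lines <= 0:
        return 0.0
    covered = set()
    for start, end in clone_ranges:
        covered.update(range(start, end + 1))
    return len(covered) / total_lines


if __name__ == "__main__":
    # Lines 1-3 and 2-5 overlap, so 5 of 10 lines are covered.
    print(clone_coverage(10, [(1, 3), (2, 5)]))
```

Computing such a value for every pair of versions yields a matrix, which is the kind of data a heat map can display.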
IEICE Transactions on Information and Systems, 2015
So far, many approaches for detecting code clones have been proposed based on different degrees of normalization (e.g., removal of white spaces, tokenization, and regularization of identifiers). Different degrees of normalization lead to different granularities of source code being detected as code clones. To investigate how normalization impacts code clone detection, this study proposes six approaches for detecting code clones that preprocess the input source files using different degrees of normalization. More precisely, in the preprocessing step each normalization is applied to the input source files and the files are then partitioned into equivalence classes. After that, code clones are detected from the set of files that represent each equivalence class, using a token-based code clone detection tool named CCFinder. The proposed approaches can be categorized into two types: approaches without normalization and approaches with normalization. The former detects only identical files, without any normalization. The latter detects identical files under different degrees of normalization, such as removal of all lines containing macros. From the case study, we observed that our proposed approaches detect code clones faster than the approach that uses only CCFinder. We also found that the approach without normalization is the fastest among the proposed approaches in many cases.
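The preprocessing described in this abstract (normalize each file, partition the files into equivalence classes, pass one representative per class to the clone detector) can be sketched as follows. The three normalization levels, the use of SHA-1 to compare normalized text, and all names are illustrative assumptions; the paper's six approaches are not reproduced here.

```python
import hashlib
import re
from collections import defaultdict
from typing import Dict, List


def normalize(source: str, level: str) -> str:
    """Apply one hypothetical normalization level to a file's text."""
    if level == "none":
        return source
    if level == "whitespace":
        # Remove every whitespace character.
        return re.sub(r"\s+", "", source)
    if level == "no_macros":
        # Drop lines whose first non-blank character is '#'.
        kept = [ln for ln in source.splitlines() if not ln.lstrip().startswith("#")]
        return "\n".join(kept)
    raise ValueError(f"unknown normalization level: {level}")


def partition(files: Dict[str, str], level: str) -> List[List[str]]:
    """Group file names whose normalized contents are identical."""
    classes = defaultdict(list)
    for name in sorted(files):
        text = normalize(files[name], level)
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        classes[digest].append(name)
    return sorted(classes.values())


def representatives(files: Dict[str, str], level: str) -> List[str]:
    """Pick the first file of each class as input for the clone detector."""
    return [members[0] for members in partition(files, level)]


if __name__ == "__main__":
    files = {
        "a.c": "int x;\n",
        "b.c": "int  x;\n",
        "c.c": "#define N 1\nint x;\n",
    }
    for level in ("none", "whitespace", "no_macros"):
        print(level, partition(files, level))
```

The fewer representatives remain after partitioning, the less input the clone detector has to process, which is consistent with the speed-up reported in the abstract; stronger normalization costs more preprocessing time but may merge more files.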
2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010), 2010
In Open Source System (OSS) development, software components are often imported and reused; for this reason we might expect that files are copied in multiple projects (file clones). In this paper, we propose a file clone detection tool called FCFinder and show the analysis performed with it on the FreeBSD Ports Collection, a large OSS project collection. We found many file clones among similar or related projects, which are systematically introduced from base projects.
Maintaining software systems is becoming a more complex and difficult task. Code clones are one of the factors that make software maintenance more difficult. A code clone is a code portion in source files that is identical or similar to another. If a fault is found in a code clone, it is necessary to correct the fault in all of its code clones. We have developed a maintenance support environment, Gemini, which provides the user with useful functions to analyze code clones and modify them. However, case studies revealed several problems; in particular, the clones provided by Gemini were not appropriate for merging into one module. In this paper, we extend the functionality of Gemini to cope with these problems. Finally, we apply the extended Gemini to several software systems and evaluate the applicability of the new functions. As the size and the complexity of software increase, it becomes important to develop high-quality software cost-effectively within a specified period. Software process improvement is one of the promising methods to attain this. Recently, it has been pointed out that the maintenance phase is the most expensive one in the entire software development process. Many research studies have reported that large software companies spend a great deal on maintaining existing systems. Maintenance of a software system is defined as modification of a software product after delivery to correct faults, to improve performance or other attributes, or to adapt the product to a modified environment. Code clones are one of the factors that make software maintenance more difficult. A code clone is a code portion in source files that is identical or similar to another. Clones are introduced because of various reasons such as reusing code
A variety of application results of code clone detection and analysis have been reported. There are many reports of code clone detection and analysis on open source software, whereas few reports on industrial systems are open to the public. This paper reports an experience of code clone analysis on a governmental project. In the project, a software system was developed by multiple Japanese vendors. We detected and analyzed code clones in the system and found that there were many code clones in the project; however, we concluded that the presence of the code clones did not have negative impacts on the maintenance of the system, for the following reasons: (1) when different modules are similar to each other in the design document, they also share many code clones in the source code; (2) code clones are located in trusted modules, which are libraries maintained by one of the companies.
Product Focused Software Process Improvement, 2004
Software maintenance is the most expensive activity in software development. Many software companies spend a large amount of money maintaining existing software systems. In perfective maintenance, refactoring has often been applied to software to improve understandability and reduce complexity. One of the targets of refactoring is code clones. A code clone is a code fragment in source code that is identical or similar to another. In an actual software development process, code clones are introduced for various reasons, such as reusing code by 'copy-and-paste'. Code clones are one of the factors that make software maintenance difficult. In this paper, we propose a method which removes code clones from object-oriented software by using existing refactoring patterns, especially "Extract Method" and "Pull Up Method". We have then implemented a refactoring support tool based on the proposed method. Finally, we have applied the tool to an open source program and actually performed refactoring.
Papers by Katsuro Inoue