CroLSim: Cross language software similarity detector using API documentation

Section 1: Publication

Publication Type

Authorship

Nafi, K. W., Roy, B., Roy, C. K., & Schneider, K. A.

Title

CroLSim: Cross language software similarity detector using API documentation

Year

2018

Publication Outlet

In 2018 IEEE 18th International Working Conference on Source Code Analysis and Manipulation (SCAM) (pp. 139-148). IEEE

DOI

https://doi.org/10.1109/SCAM.2018.00023

ISBN

ISSN

Citation

Nafi, K. W., Roy, B., Roy, C. K., & Schneider, K. A. (2018). CroLSim: Cross language software similarity detector using API documentation. In 2018 IEEE 18th International Working Conference on Source Code Analysis and Manipulation (SCAM) (pp. 139-148). IEEE. https://doi.org/10.1109/SCAM.2018.00023

Abstract

In today's open source era, developers look for similar software applications in source code repositories for a number of reasons, including exploring alternative implementations, reusing source code, or looking for a better application. However, while there are a great many studies for finding similar applications written in the same programming language, there is a marked lack of studies for finding similar software applications written in different languages. In this paper, we fill the gap by proposing a novel model CroLSim which is able to detect similar software applications across different programming languages. In our approach, we use the API documentation to find relationships among the API calls used by the different programming languages. We adopt a deep learning based word-vector learning method to identify semantic relationships among the API documentation which we then use to detect cross-language similar software applications. For evaluating CroLSim, we formed a repository consisting of 8,956 Java, 7,658 C#, and 10,232 Python applications collected from GitHub. We observed that CroLSim can successfully detect similar software applications across different programming languages with a mean average precision rate of 0.65, an average confidence rate of 3.6 (out of 5) with 75% high rated successful queries, which outperforms all related existing approaches with a significant performance improvement.

Plain Language Summary