Binary2Source function similarity detection is a fundamental task in software composition analysis. Existing binary2source matching work mainly adopts a 1-to-1 matching mechanism, in which one binary function is matched against one source function. Because of function inlining, however, the mapping can be 1-to-n: one binary function may correspond to multiple source functions. This mismatch causes existing binary2source matching methods to suffer a 30% performance loss under function inlining. To meet the need for binary-to-source function matching in the presence of function inlining, this study proposes a binary2source function similarity detection method for 1-to-n matching. The method generates source function sets to serve as the matching objects for inlined binary functions, making up for what the source function library lacks. The effectiveness of the proposed method is evaluated through a series of experiments. The experimental results indicate that the method not only improves existing binary2source function similarity detection but also identifies the inlined source functions, helping existing tools better cope with the challenges of inlining.
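As a rough illustration of what "source function sets" could look like, the following minimal sketch enumerates candidate sets from a source-level call graph. It is an assumption made for illustration only, not the generation procedure of the proposed method, which the abstract does not specify. The function name `candidate_function_sets`, the `max_inlined` bound, and the example call graph are all hypothetical.

```python
def candidate_function_sets(call_graph, root, max_inlined=2):
    """Enumerate candidate source function sets for one binary function.

    Assumption (not taken from the paper): a candidate set is the root
    function plus up to `max_inlined` callees, where every callee in the
    set is called by some function already in the set. This mimics a
    callee being inlined into its caller, possibly transitively.
    """
    start = frozenset([root])
    results = {start}
    frontier = [start]
    while frontier:
        current = frontier.pop()
        # Stop growing once the set already holds max_inlined callees.
        if len(current) - 1 >= max_inlined:
            continue
        for fn in current:
            for callee in call_graph.get(fn, ()):
                if callee in current:
                    continue  # also guards against recursion cycles
                grown = current | {callee}
                if grown not in results:
                    results.add(grown)
                    frontier.append(grown)
    # Deterministic order: smaller sets first, then alphabetical.
    return sorted(results, key=lambda s: (len(s), sorted(s)))


if __name__ == "__main__":
    # Hypothetical source-level call graph: caller -> list of callees.
    graph = {
        "parse": ["read_token", "report_error"],
        "read_token": ["peek"],
    }
    for fn_set in candidate_function_sets(graph, "parse"):
        print(sorted(fn_set))
```

Under this assumption, a 1-to-n matcher would compare an inlined binary function against each candidate set rather than against `parse` alone. The `max_inlined` bound keeps the number of sets manageable, since it otherwise grows quickly with call-graph size.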