Binary2Source Function Similarity Detection Method Under Function Inlining

Corresponding author:

Fan Ming, E-mail: mingfan@mail.xjtu.edu.cn

CLC number:

TP311

Funding:

National Natural Science Foundation of China (62232014, 62272377, 62372368, 62372367); Young Talent Support Program of the Shaanxi Association for Science and Technology

Abstract:

Binary2source function similarity detection is a fundamental task in software composition analysis. Existing binary2source matching methods mainly adopt a 1-to-1 matching mechanism, in which a single binary function is compared against a single source function. However, because of function inlining, the actual mapping is 1-to-n: a single binary function can correspond to multiple source functions. This mismatch causes existing methods to suffer a 30% performance loss under function inlining. To address binary-to-source function matching under function inlining, this study proposes a binary2source function similarity detection method for 1-to-n matching. The method generates source function sets as the matching objects for inlined binary functions, making up for the gap in the source function library. A series of experiments evaluates the effectiveness of the method. The results show that it not only improves the performance of existing binary2source function similarity detection but also identifies the inlined source functions, helping existing tools better cope with the challenges of inlining.
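The difference between 1-to-1 and 1-to-n matching can be illustrated with a toy sketch. This is not the paper's algorithm: the call graph, the per-function feature sets, the Jaccard similarity, and the names `candidate_sets` and `best_match` are all assumptions made up for illustration. The sketch enumerates source function sets (a root function plus callees that a compiler might have inlined into it) and picks the set whose merged features best match one binary function.

```python
# Hypothetical source-level call graph: caller -> direct callees.
CALL_GRAPH = {
    "parse": ["read_token", "skip_ws"],
    "read_token": ["is_digit"],
    "skip_ws": [],
    "is_digit": [],
}

# Hypothetical per-function features (e.g. library calls, string literals).
SOURCE_FEATURES = {
    "parse": {"strlen", "err: bad input"},
    "read_token": {"memcpy"},
    "skip_ws": {"isspace"},
    "is_digit": {"0-9"},
}


def candidate_sets(root, graph, max_inlined=2):
    """Enumerate the root plus up to max_inlined callees, where each added
    function is called by a function already in the set."""
    results = {frozenset([root])}
    frontier = [frozenset([root])]
    for _ in range(max_inlined):
        next_frontier = []
        for current in frontier:
            callees = {c for f in current for c in graph.get(f, [])} - current
            for callee in callees:
                extended = current | {callee}
                if extended not in results:
                    results.add(extended)
                    next_frontier.append(extended)
        frontier = next_frontier
    return results


def jaccard(a, b):
    """Jaccard similarity of two feature sets (a stand-in similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def best_match(binary_features, root, graph, features, max_inlined=2):
    """Return the candidate source function set most similar to the binary."""
    def merged(funcs):
        return set().union(*(features[f] for f in funcs))

    return max(
        candidate_sets(root, graph, max_inlined),
        key=lambda funcs: jaccard(binary_features, merged(funcs)),
    )


# A binary function into which read_token and skip_ws were inlined (assumed).
binary = {"strlen", "err: bad input", "memcpy", "isspace"}
one_to_one = jaccard(binary, SOURCE_FEATURES["parse"])
one_to_n = best_match(binary, "parse", CALL_GRAPH, SOURCE_FEATURES)
print(one_to_one, sorted(one_to_n))
```

In this toy setting, comparing the binary function against `parse` alone misses the features contributed by the inlined callees, while the set-based candidate recovers them and also names which source functions were inlined.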

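Cite this article

Jia A, Fan M, Xu X, Jin WX, Wang HJ, Liu T. Binary2Source function similarity detection method under function inlining. Journal of Software, 2025, 36(7): 1-19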
History
  • Received: 2024-08-22
  • Last revised: 2024-10-15
  • Published online: 2024-12-10