Malware Similarity Measurement Method Based on Multiplex Heterogeneous Graph
Author:
Affiliation:

CLC Number: TP311

    Abstract:

    Existing malware similarity measurement methods are not robust to code obfuscation and cannot model the complex relationships among malware samples. To address both problems, this study proposes a malware similarity measurement method based on a multiplex heterogeneous graph, called API relation graph enhanced multiple heterogeneous ProxEmbed (RG-MHPE). The method first uses the dynamic and static features of malware to construct a multiplex heterogeneous graph. It then proposes an enhanced proximity embedding method based on relational paths, because the original proximity embedding cannot be applied to similarity measurement on a multiplex heterogeneous graph. In addition, the study extracts knowledge from the API documentation on the MSDN website, builds an API relation graph, and learns the similarity between Windows APIs, which effectively slows the aging of the similarity measurement model. Experimental results show that RG-MHPE achieves the best similarity measurement performance and the strongest resistance to model aging among the compared methods.
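The idea of a multiplex heterogeneous graph with relational paths can be illustrated with a minimal sketch. This is not the RG-MHPE implementation: the node names (`m1`, `m2`, `CreateFileW`), the edge-type labels (`static_import`, `dynamic_call`), and the plain random walk used to sample paths are all assumptions made for illustration. The sketch only shows that two nodes can be linked by several edge types at once, and that a path between two malware samples can record which relation was followed at each step.

```python
from collections import defaultdict
import random


class MultiplexHeteroGraph:
    """Typed nodes; one node pair may be linked by several edge types."""

    def __init__(self):
        self.node_type = {}            # node id -> node type label
        self.adj = defaultdict(list)   # node id -> [(edge_type, neighbor), ...]

    def add_node(self, node, ntype):
        self.node_type[node] = ntype

    def add_edge(self, u, v, etype):
        # Undirected; parallel edges of different types are stored separately.
        self.adj[u].append((etype, v))
        self.adj[v].append((etype, u))


def edge_types(g, u, v):
    """Return the set of edge types that directly link u and v."""
    return {etype for etype, nbr in g.adj[u] if nbr == v}


def sample_relational_path(g, start, end, max_len, rng):
    """Random walk from start. Returns [node, edge_type, node, ...] if the
    walk reaches end within max_len steps, otherwise None."""
    path = [start]
    current = start
    for _ in range(max_len):
        neighbors = g.adj[current]
        if not neighbors:
            return None
        etype, nxt = rng.choice(neighbors)
        path += [etype, nxt]
        current = nxt
        if current == end:
            return path
    return None


# Hypothetical toy graph: two malware samples sharing one Windows API.
g = MultiplexHeteroGraph()
g.add_node("m1", "malware")
g.add_node("m2", "malware")
g.add_node("CreateFileW", "api")
g.add_edge("m1", "CreateFileW", "static_import")  # seen in static analysis
g.add_edge("m1", "CreateFileW", "dynamic_call")   # seen in dynamic analysis
g.add_edge("m2", "CreateFileW", "dynamic_call")

rng = random.Random(0)
samples = (sample_relational_path(g, "m1", "m2", 4, rng) for _ in range(200))
paths = [p for p in samples if p is not None]
```

Each sampled path alternates nodes and edge types, so two paths through the same nodes remain distinguishable when they follow different relations. In an embedding method, such sequences would be the input from which a proximity score between `m1` and `m2` is learned.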

Get Citation

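Gu Yonghao, Wang Yifei, Liu Weixin, Wu Tiejun, Meng Guozhu. Malware similarity measurement method based on multiplex heterogeneous graph. Journal of Software, 2023, 34(7): 3188-3205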

History
  • Received: March 16, 2021
  • Revised: August 20, 2021
  • Online: January 28, 2022
  • Published: July 06, 2023