Optimization of Concept Drift Malware Classification Based on BERT and Autoencoder

CLC number: TP311

Fund project: National Natural Science Foundation of China, General Program (62172168)

Abstract:

Software concept drift means that the structure and composition of software of the same type change over time. In malware classification, concept drift means that the structural and compositional features of samples from the same malware family change over time, so the performance of classification algorithms with fixed patterns degrades as time passes. Existing static malware classification methods suffer significant performance degradation under concept drift and therefore struggle to meet practical needs. To address this problem, and motivated by the commonalities between natural language understanding and the analysis of binary program byte streams, this paper proposes a highly accurate and robust malware classification method based on BERT and a custom autoencoder architecture. The method first extracts execution-oriented opcode sequences from malware through disassembly analysis, which reduces redundant information. It then uses BERT to capture the contextual semantics of the sequences and embed them as vectors, which lets it represent the deep program semantics of the malware. Next, it selects task-relevant features through geometric median subspace projection and a bottleneck autoencoder. Finally, a classifier built from fully connected layers outputs the classification result. The method is evaluated against nine state-of-the-art malware classification methods in both normal and concept drift scenarios. The results show that it achieves an F1 score of 99.49% in normal scenarios, higher than all nine compared methods, and that in concept drift scenarios its F1 score exceeds those of the compared methods by 10.78% to 43.71%.
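The abstract names geometric median subspace projection as one of the feature selection steps but does not spell it out on this page. The sketch below illustrates only the underlying building block, the geometric median of a set of vectors, computed with the standard Weiszfeld iteration. It is not the authors' implementation: the function names `euclidean` and `geometric_median`, the tolerance, and the toy 2-D points are all assumptions made for illustration, and the subspace projection itself is not shown.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def geometric_median(points, tol=1e-9, max_iter=1000):
    """Approximate the point minimizing the sum of Euclidean distances.

    Uses the Weiszfeld iteration, started from the coordinate-wise mean.
    Hypothetical helper for illustration only.
    """
    dim = len(points[0])
    current = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    for _ in range(max_iter):
        weights = []
        for p in points:
            # Clamp tiny distances so the inverse-distance weight stays finite.
            # (Plain Weiszfeld can stall if an iterate lands on a data point.)
            d = max(euclidean(p, current), 1e-12)
            weights.append(1.0 / d)
        total = sum(weights)
        updated = [
            sum(w * p[i] for w, p in zip(weights, points)) / total
            for i in range(dim)
        ]
        if euclidean(updated, current) < tol:
            return updated
        current = updated
    return current


# Toy data: four clustered points plus one distant outlier.
cluster_with_outlier = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (100.0, 100.0)]
median = geometric_median(cluster_with_outlier)
mean = [sum(p[i] for p in cluster_with_outlier) / len(cluster_with_outlier) for i in range(2)]
```

On this toy data the coordinate-wise mean is dragged toward the outlier, while the geometric median stays inside the cluster. That robustness to atypical samples is one plausible reason to prefer a median-based center when the feature distribution shifts over time.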

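Cite this article

Zhao Haojun, Zou Deqing, Xue Wenjie, Wu Yueming, Jin Hai. Optimization of Concept Drift Malware Classification Based on BERT and Autoencoder. Journal of Software (软件学报),,():1-17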

History
  • Received: 2023-12-09
  • Last revised: 2024-04-28
  • Published online: 2024-12-04