Supported by the National Natural Science Foundation of China (62072007, 62192733, 61832009, 62192731, 62192730)
Static analysis tools help developers detect potential defects early in the software development life cycle, but they often suffer from a high false positive rate among reported alarms. To improve the usability of these tools, researchers have proposed many warning identification techniques that automatically classify false positive alarms. However, existing approaches mainly rely on hand-engineered features or statement-level abstract syntax tree token sequences to represent the defective code, and therefore struggle to capture the semantics behind reported alarms. To overcome these limitations, this study uses the feature extraction and representation abilities of deep neural networks to learn code semantic representations from control flow graph paths for warning identification. A control flow graph is an abstract representation of a program's execution, so its path sequences can guide the model to learn semantic information related to the potential defect more precisely. A pre-trained language model is fine-tuned to encode the path sequences and capture the semantic features used for model building. Finally, extensive experiments on eight open-source projects compare the proposed approach with state-of-the-art baselines to verify its effectiveness.
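To make the idea of control flow graph path sequences concrete, the following is a minimal sketch and not the implementation used in this study. It assumes a toy graph stored as an adjacency dictionary, enumerates the acyclic entry-to-exit paths, and flattens each path into one text sequence of the kind a pre-trained language model could encode. The names `CFG`, `STATEMENTS`, `enumerate_paths`, and `path_to_sequence` are hypothetical and chosen for illustration only.

```python
from typing import Dict, List

# Hypothetical toy control flow graph: node id -> successor node ids.
CFG: Dict[str, List[str]] = {
    "entry": ["cond"],
    "cond": ["then", "else"],
    "then": ["exit"],
    "else": ["exit"],
    "exit": [],
}

# Hypothetical statement text attached to each node.
STATEMENTS: Dict[str, str] = {
    "entry": "String s = read();",
    "cond": "if (s != null)",
    "then": "use(s.length());",
    "else": "log(s);",
    "exit": "return;",
}


def enumerate_paths(cfg: Dict[str, List[str]], start: str, end: str) -> List[List[str]]:
    """Return all acyclic paths from start to end using depth-first search."""
    paths: List[List[str]] = []

    def dfs(node: str, path: List[str]) -> None:
        path.append(node)
        if node == end:
            paths.append(list(path))
        else:
            for succ in cfg.get(node, []):
                # Skip nodes already on the current path so loops terminate.
                if succ not in path:
                    dfs(succ, path)
        path.pop()

    dfs(start, [])
    return paths


def path_to_sequence(path: List[str], statements: Dict[str, str]) -> str:
    """Join the statements along one path into a single text sequence."""
    return " ".join(statements[node] for node in path)


paths = enumerate_paths(CFG, "entry", "exit")
sequences = [path_to_sequence(p, STATEMENTS) for p in paths]
```

Each string in `sequences` corresponds to one execution path through the toy graph. In a setting like the one described above, such sequences would be tokenized and passed to the fine-tuned encoder; that step is omitted here because the abstract does not specify the model or tokenizer.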