• Online First

    • Code Similarity Detection Supported by Longest Common Subsequence Embedding

      Online: May 22,2025 DOI: 10.13328/j.cnki.jos.007362

      Abstract:The longest common subsequence (LCS) is a practical metric for assessing code similarity. However, traditional LCS-based methods face challenges in scalability and in effectively capturing critical semantics for identifying code fragments that are textually different but semantically similar, due to their reliance on discrete representation-based token encoding. To address these limitations, this study proposes an LCS-oriented embedding method that encodes code into low-dimensional dense vectors, effectively capturing semantic information. This transformation enables the computationally expensive LCS calculation to be replaced with efficient vector arithmetic, further accelerated using an approximate nearest neighbor algorithm. To support this approach, an embeddable LCS-based distance metric is developed, as the original LCS metric is non-embeddable. Experimental results demonstrate that the proposed metric outperforms tree-based and literal similarity metrics in detecting complex code clones. In addition, two targeted loss functions and corresponding training datasets are designed to prioritize retaining critical semantics in the embedding process, allowing the model to identify textually different but semantically similar code elements. This improves performance in detecting complex code similarities. The proposed method demonstrates strong scalability and high accuracy in detecting complex clones. When applied to similar bug identification, it has reported 23 previously unknown bugs, all of which are confirmed by developers in real-world projects. Notably, several of these bugs are complex and challenging to detect using traditional LCS-based techniques.
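
      The following is a minimal, illustrative sketch of the classical token-level LCS metric that such embedding methods aim to approximate; the paper's embedding model and its embeddable distance variant are not reproduced here, and the normalization shown is just one common choice.

```python
# Minimal sketch: the classical token-level LCS similarity that the paper's
# embedding is trained to approximate (the embedding model itself is not shown).

def lcs_length(a, b):
    """Standard dynamic-programming LCS over two token sequences."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def lcs_distance(a, b):
    """One common normalization: 1 - |LCS| / max(|a|, |b|).
    (The paper derives an embeddable variant of such a metric; its exact
    formulation is not reproduced here.)"""
    if not a and not b:
        return 0.0
    return 1.0 - lcs_length(a, b) / max(len(a), len(b))

# Toy usage on token streams of two textually different code fragments.
frag1 = "int sum = 0 ; for ( i = 0 ; i < n ; i ++ ) sum += a [ i ] ;".split()
frag2 = "total = 0 ; for x in values : total += x".split()
print(lcs_distance(frag1, frag2))
```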

    • Heterogeneous Graph Attention Network for Entity Alignment

      Online: May 22,2025 DOI: 10.13328/j.cnki.jos.007371

      Abstract:Entity alignment (EA) aims to identify equivalent entities across different knowledge graphs (KGs). Embedding-based EA methods still have several limitations. First, the heterogeneous structures within KGs are not fully modeled. Second, the utilization of text information is constrained by word embeddings. Third, alignment inference algorithms are underexplored. To address these limitations, we propose a heterogeneous graph attention network for entity alignment (HGAT-EA). HGAT-EA consists of two channels: one for learning structural embeddings and the other for learning character-level semantic embeddings. The first channel employs a heterogeneous graph attention network (HGAT), which fully leverages heterogeneous structures and relation triples to learn entity embeddings. The second channel utilizes character-level literals to learn character-level semantic embeddings. HGAT-EA incorporates multiple views through these two channels and maximizes the use of heterogeneous structures through HGAT. In addition, HGAT-EA introduces three alignment inference algorithms. Experimental results validate the effectiveness of HGAT-EA. Based on these results, we provide detailed analyses of the various components of HGAT-EA and present the corresponding conclusions.
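
      As a rough illustration of the alignment-inference stage only, the sketch below ranks candidate entity pairs by cosine similarity over pre-computed embeddings and greedily selects one-to-one matches; the HGAT encoder and the paper's three actual inference algorithms are not reproduced, and all shapes and names are assumptions.

```python
# Illustrative alignment inference over entity embeddings from two KGs
# (random stand-ins for the structural + character-level channels).
import numpy as np

rng = np.random.default_rng(0)
emb_kg1 = rng.normal(size=(5, 16))   # 5 entities from KG1, 16-dim embeddings
emb_kg2 = rng.normal(size=(5, 16))   # 5 candidate entities from KG2

def cosine_matrix(a, b):
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def greedy_align(sim):
    """One simple inference strategy: repeatedly take the highest-scoring
    unmatched pair (the paper studies three such algorithms)."""
    sim = sim.copy()
    pairs = []
    for _ in range(min(sim.shape)):
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        pairs.append((int(i), int(j), float(sim[i, j])))
        sim[i, :] = -np.inf   # remove matched row
        sim[:, j] = -np.inf   # remove matched column
    return pairs

print(greedy_align(cosine_matrix(emb_kg1, emb_kg2)))
```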

    • ReproLink: Reproducibility-oriented Research Data Management System

      Online: May 22,2025 DOI: 10.13328/j.cnki.jos.007372

      Abstract:The reproducibility of scientific research results is a fundamental guarantee for the reliability of scientific research and the cornerstone of scientific and technological advancement. However, the research community is currently facing a serious reproducibility crisis, with many research results published in top journals and conferences being irreproducible. In the field of data science, the reproducibility of research results faces challenges such as heterogeneous research data from multiple sources, complex computational processes, and intricate computational environments. To address these issues, this study proposes ReproLink, a reproducibility-oriented research data management system. ReproLink constructs a unified model of research data, abstracting it into research data objects that consist of three elements: identifier, attribute set, and data entity. Through fine-grained modeling of the reproduction process, ReproLink establishes a precise method for describing multi-step, complex reproduction processes. By integrating code and operating environment modeling, ReproLink eliminates the uncertainties caused by different environments affecting code execution. Performance tests and case studies show that ReproLink performs well with data scales up to one million records, demonstrating practical value in real-world scenarios such as paper reproduction and data provenance tracking. The technical architecture of ReproLink has been integrated into Conow Software, the only integrated comprehensive management and service platform in China specifically designed for scientific research institutes, supporting the reproducibility needs of hundreds of such institutes across the country.
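
      A minimal sketch of the three-element research-data-object abstraction described above (identifier, attribute set, data entity); the field names, hashing choice, and example values are illustrative assumptions rather than ReproLink's actual schema.

```python
# Illustrative model of a research data object with identifier, attribute set,
# and data entity; a content hash supports simple reproducibility checks.
from dataclasses import dataclass, field
import hashlib

@dataclass
class ResearchDataObject:
    identifier: str                                  # persistent ID (assumed format)
    attributes: dict = field(default_factory=dict)   # descriptive metadata
    data_entity: bytes = b""                         # the payload itself

    def fingerprint(self) -> str:
        """SHA-256 of the data entity, useful for verifying reproduced outputs."""
        return hashlib.sha256(self.data_entity).hexdigest()

obj = ResearchDataObject(
    identifier="rdo:2025/example-0001",
    attributes={"creator": "lab-a", "format": "csv", "step": "preprocessing"},
    data_entity=b"id,value\n1,3.14\n",
)
print(obj.identifier, obj.fingerprint()[:12])
```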

    • Implicit Multi-scale Alignment and Interaction Method for Text-image Person Re-identification

      Online: May 14,2025 DOI: 10.13328/j.cnki.jos.007293

      Abstract:The purpose of text-image person re-identification is to use a text description to retrieve the target person from an image database. The main challenge of this technology is to embed image and text features into a common latent space to achieve cross-modal alignment. Many existing studies adopt separate pre-trained unimodal models to extract visual and textual features and then employ segmentation or attention mechanisms to obtain explicit cross-modal alignment. However, these explicit alignment methods generally lack the underlying alignment ability needed to effectively match multimodal features, and relying on preset cross-modal correspondences to achieve explicit alignment may distort modal information. This study proposes an implicit multi-scale alignment and interaction method for text-image person re-identification. First, a semantically consistent feature pyramid network is employed to extract multi-scale image features, and attention weights are used to fuse features at different scales, covering both global and local information. Second, the association between image and text is learned with a multivariate interaction attention mechanism, which effectively captures the correspondence between different visual features and textual information, narrows the gap between modalities, and achieves implicit multi-scale semantic alignment. Additionally, a foreground enhancement discriminator is adopted to highlight the target person and extract purer person features, helping to alleviate the information inequality between images and texts. Experimental results on three mainstream text-image person re-identification datasets, CUHK-PEDES, ICFG-PEDES, and RSTPReid, show that the proposed method effectively improves cross-modal retrieval performance, with Rank-1 accuracy 2%–9% higher than that of state-of-the-art algorithms.
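
      The sketch below illustrates only the attention-weighted fusion of multi-scale features mentioned in the abstract; the feature pyramid network, cross-modal interaction, and foreground enhancement discriminator are not reproduced, and all dimensions and the scoring vector are stand-in assumptions.

```python
# Attention-weighted fusion of per-scale feature vectors into one representation.
import numpy as np

rng = np.random.default_rng(1)
scales = rng.normal(size=(3, 8))   # features from 3 pyramid scales, 8-dim each
w = rng.normal(size=8)             # learned scoring vector (stand-in)

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

attn = softmax(scales @ w)          # one attention weight per scale
fused = attn @ scales               # weighted sum -> fused multi-scale feature
print(attn, fused.shape)
```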

    • Formal Analysis of Cross-chain Protocol IBC

      Online: May 14,2025 DOI: 10.13328/j.cnki.jos.007356

      Abstract:Since the advent of Bitcoin, blockchain technology has profoundly influenced numerous fields. However, the absence of effective communication mechanisms between heterogeneous and isolated blockchain systems has hindered the advancement and sustainable development of the blockchain ecosystem. In response, cross-chain technology has emerged as a rapidly evolving field and a focal point of research. The decentralized nature of blockchain, coupled with the complexity of cross-chain scenarios, introduces significant security challenges. This study presents a formal analysis of the IBC (inter-blockchain communication) protocol, one of the most widely adopted cross-chain communication protocols, to assist developers in designing and implementing cross-chain technologies with enhanced security. The IBC protocol is formalized using TLA+, a temporal logic specification language, and its critical properties are verified through the model-checking tool TLC. An in-depth analysis of the verification results reveals several issues impacting the correctness of packet transmission and token transfer. Corresponding recommendations are proposed to mitigate these security risks. The findings have been reported to the IBC developer community, with most of them receiving acknowledgment.
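
      The toy example below is not the paper's TLA+ specification; it only illustrates, on a drastically simplified two-chain token transfer, what explicit-state invariant checking (as performed by TLC) does: enumerate the reachable states and verify a property in each.

```python
# Toy explicit-state invariant check over a simplified two-chain token transfer.
# States, actions, and the invariant are illustrative assumptions.
from collections import deque

INIT = (10, 0, 0)   # (balance on chain A, in-flight packets, balance on chain B)

def next_states(s):
    a, inflight, b = s
    if a > 0:
        yield (a - 1, inflight + 1, b)        # escrow one token, send packet
    if inflight > 0:
        yield (a, inflight - 1, b + 1)        # packet received, mint voucher
        yield (a + 1, inflight - 1, b)        # packet timed out, refund

def invariant(s):
    return sum(s) == sum(INIT)                # total token supply is conserved

seen, queue = {INIT}, deque([INIT])
while queue:
    s = queue.popleft()
    assert invariant(s), f"invariant violated in state {s}"
    for t in next_states(s):
        if t not in seen:
            seen.add(t)
            queue.append(t)
print(f"checked {len(seen)} reachable states; invariant holds")
```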

    • Deep-learning-driven Software Vulnerability Prediction: Problems, Progress, and Challenges

      Online: May 14,2025 DOI: 10.13328/j.cnki.jos.007376

      Abstract:Software vulnerabilities are code segments that are prone to exploitation. Ensuring that software is not easily attacked is a crucial security requirement in software development. Software vulnerability prediction involves analyzing and predicting potential vulnerabilities in software code. Deep learning-driven software vulnerability prediction has become a popular research field in recent years, spanning a considerable period and yielding numerous studies and substantial achievements. To review the relevant findings and summarize the research hotspots, a survey of 151 studies on deep learning-driven software vulnerability prediction published between 2017 and 2024 is conducted. The survey organizes the research problems, progress, and challenges discussed in the literature, providing a reference for future research.

    • Detection of Timer Concurrency Bug in Linux Kernel

      Online: May 14,2025 DOI: 10.13328/j.cnki.jos.007377

      Abstract:A timer is used to schedule and execute delayed tasks in an operating system. It operates asynchronously in an atomic context and can execute concurrently with different threads at any time. If developers fail to account for all possible scenarios of multithread interleaving, various types of concurrency bugs may be introduced, posing a serious threat to the security of the operating system. Timer concurrency bugs are more difficult to detect than typical concurrency bugs because they involve not only multithread interleaving but also the delayed and repeated scheduling of timer handlers. Currently, there are no tools that can effectively detect such bugs. In this study, three types of timer concurrency bugs are summarized: sleeping timer bugs, timer deadlock bugs, and zombie timer bugs. To enhance detection efficiency, all timer-related code is first extracted through pointer analysis, reducing unnecessary analysis overhead. A context-sensitive, path-sensitive, and flow-sensitive interprocedural control flow graph is then constructed to provide a foundation for subsequent analysis. Based on static analysis techniques, including call graph traversal, lockset analysis, points-to analysis, and control flow analysis, three detection algorithms are designed to identify the different types of timer concurrency bugs. To evaluate their effectiveness, the proposed algorithms are applied to the Linux 5.15 kernel, where 328 real-world timer concurrency bugs are detected. A total of 56 patches are submitted to the Linux kernel community, with 49 patches merged into the mainline kernel, 295 bugs confirmed and fixed, and 14 CVE identifiers assigned. These results demonstrate the effectiveness of the proposed method. Finally, a systematic analysis of performance, false positives, and false negatives is conducted through comparative experiments, and methods for repairing the three types of bugs are summarized.
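
      As a simplified illustration of the sleeping-timer check only, the sketch below searches a toy call graph for call chains from a timer handler (which runs in atomic context) to a function that may sleep; the function names and graph are hypothetical, and the paper's analysis is additionally context-, path-, and flow-sensitive with lockset and points-to information.

```python
# Toy reachability check: a timer handler must not reach a sleeping function.
# The call graph below is hypothetical and acyclic, so no visited set is needed.
CALL_GRAPH = {
    "my_timer_fn":   ["update_stats", "refill_buffer"],
    "update_stats":  ["spin_lock_update"],
    "refill_buffer": ["kmalloc_gfp_kernel"],   # may sleep -> bug if reached
}
SLEEPING = {"kmalloc_gfp_kernel", "mutex_lock", "msleep"}

def reachable_sleepers(handler):
    """DFS over the call graph collecting call chains that end in a sleeper."""
    bugs, stack = [], [(handler, [handler])]
    while stack:
        fn, path = stack.pop()
        for callee in CALL_GRAPH.get(fn, []):
            if callee in SLEEPING:
                bugs.append(path + [callee])
            else:
                stack.append((callee, path + [callee]))
    return bugs

for chain in reachable_sleepers("my_timer_fn"):
    print("possible sleeping-timer bug:", " -> ".join(chain))
```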

    • Review on mmWave-based Human Perception

      Online: May 14,2025 DOI: 10.13328/j.cnki.jos.007378

      Abstract:With the rapid development of embedded technology, mobile computing, and the Internet of Things (IoT), an increasing number of sensing devices have been integrated into people’s daily lives, including smartphones, cameras, smart bracelets, smart routers, and headsets. The sensors embedded in these devices facilitate the collection of personal information such as location, activities, vital signs, and social interactions, thus fostering a new class of applications known as human-centric sensing. Compared with traditional sensing methods, including wearable-based, vision-based, and wireless signal-based sensing, millimeter wave (mmWave) signals offer numerous advantages, such as high accuracy, non-line-of-sight capability, passive sensing (without requiring users to carry sensors), high spatiotemporal resolution, easy deployment, and robust environmental adaptability. The advantages of mmWave-based sensing have made it a research focus in both academia and industry in recent years, enabling non-contact, fine-grained perception of human activities and physical signs. Based on an overview of recent studies, the background and research significance of mmWave-based human sensing are examined. The existing methods are categorized into four main areas: tracking and positioning, motion recognition, biometric measurement, and human imaging. Commonly used publicly available datasets are also introduced. Finally, potential research challenges and future directions are discussed, highlighting promising developments toward achieving accurate, ubiquitous, and stable human perception.

    • Black-box Adversarial Attack for Deep Vulnerability Detection Model

      Online: May 14,2025 DOI: 10.13328/j.cnki.jos.007379

      Abstract:In recent years, deep learning-based vulnerability detection models have demonstrated impressive capabilities in detecting vulnerabilities. Previous research has widely explored adversarial attacks that use variable renaming to introduce disturbances into source code and evade detection. However, the effectiveness of introducing multiple disturbances through various transformation techniques in source code has not been adequately investigated. In this study, multiple synonymous transformation operators are applied to introduce disturbances into source code. A combination optimization strategy based on genetic algorithms is proposed, enabling the selection of the source code transformation operators with the highest fitness to guide the generation of adversarial code segments capable of evading vulnerability detection. The proposed method is implemented in a framework named non-vulnerability generator (NonVulGen) and evaluated against deep learning-based vulnerability detection models. When applied to recently developed deep learning models, an average attack success rate of 91.38% is achieved against the CodeBERT-based model and 93.65% against the GraphCodeBERT-based model, representing improvements of 28.94% and 15.52% over state-of-the-art baselines, respectively. To assess the generalization ability of the proposed attack method, common models including Devign, ReGVD, and LineVul are targeted, achieving average success rates of 98.88%, 97.85%, and 92.57%, respectively. Experimental results indicate that adversarial code segments generated by NonVulGen cannot be effectively distinguished by deep learning-based vulnerability detection models. Furthermore, significant reductions in attack success rates are observed after retraining the models with adversarial samples generated based on the training data, with a decrease of 96.83% for CodeBERT, 97.12% for GraphCodeBERT, 98.79% for Devign, 98.57% for ReGVD, and 97.94% for LineVul. These findings reveal the critical challenge that adversarial attacks pose to deep learning-based vulnerability detection models and highlight the necessity of model hardening before deployment.
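
      The sketch below illustrates the genetic-algorithm idea of searching for a high-fitness combination of transformation operators; the operator names, the stubbed fitness function, and the hyperparameters are illustrative assumptions and do not reflect NonVulGen's actual configuration, where fitness would come from querying the target detector on the transformed code.

```python
# Toy GA over subsets of semantics-preserving transformation operators.
import random

OPERATORS = ["rename_vars", "loop_to_while", "insert_dead_code", "reorder_decls"]

def fitness(mask):
    # Stand-in score: in the real attack this would come from the target
    # detector (e.g. 1 - predicted probability of "vulnerable" after applying
    # the selected operators). A hash gives a fixed toy score per combination.
    return (hash(tuple(mask)) % 1000) / 1000.0

def evolve(pop_size=8, generations=20, mutation_rate=0.2):
    pop = [[random.randint(0, 1) for _ in OPERATORS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]                     # keep the fittest half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(OPERATORS))
            child = a[:cut] + b[cut:]                      # one-point crossover
            child = [1 - bit if random.random() < mutation_rate else bit
                     for bit in child]                     # bit-flip mutation
            children.append(child)
        pop = parents + children
    best = max(pop, key=fitness)
    return [op for op, bit in zip(OPERATORS, best) if bit]

print("selected operators:", evolve())
```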

    • Vulnerability Sample Generation Method Based on Abstract Syntax Tree Variation

      Online: May 07,2025 DOI: 10.13328/j.cnki.jos.007309

      Abstract:With the continuous development of information technology, the quantity and variety of software products are increasing, but even high-quality software may contain vulnerabilities. In addition, software is updated rapidly and software architectures are increasingly complex, causing vulnerabilities to gradually evolve into new forms. Consequently, traditional vulnerability detection methods and rules are difficult to apply to new vulnerability features. Due to the scarcity of zero-day vulnerability samples, zero-day vulnerabilities that appear during software evolution are difficult to find, which poses great potential risks to software security. This study proposes a vulnerability sample generation method based on abstract syntax tree mutation, which can simulate the structure and syntax rules of real vulnerabilities, generate vulnerability samples closer to real-world conditions, and provide a more effective solution for software security and reliability. The method analyzes the abstract syntax tree structure generated by Eclipse CDT, extracts the syntactic information in the nodes, reconstructs the nodes and abstract syntax trees, optimizes the abstract syntax tree structure, and designs a series of mutation operators, which are then applied to the optimized abstract syntax trees. The proposed method can generate mutated samples with the characteristics of UAF and CUAF vulnerabilities, which can be used for zero-day vulnerability detection and help improve the detection rate of such vulnerabilities. Experimental results show that this method reduces the number of invalid samples by 34% on average compared with the random mutation used in traditional detection methods and can generate more complex mutated samples, enhancing the coverage and accuracy of detection.
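
      The sketch below only illustrates the idea of defining a mutation operator over an AST and re-emitting mutated source; the paper operates on C/C++ ASTs produced by Eclipse CDT, whereas Python's ast module stands in here, and the specific operator (dropping a defensive reset after a release call) is a simplified, hypothetical analogue of a use-after-free-style mutation.

```python
# Illustrative AST mutation operator: parse source, rewrite nodes, unparse.
import ast

SRC = """
def close(handle):
    release(handle)
    handle = None
    return handle
"""

class DropResetAfterRelease(ast.NodeTransformer):
    """Delete assignments of the form `x = None` to mimic forgetting a reset."""
    def visit_Assign(self, node):
        if isinstance(node.value, ast.Constant) and node.value.value is None:
            return None            # returning None removes the statement
        return node

tree = ast.parse(SRC)
mutated = ast.fix_missing_locations(DropResetAfterRelease().visit(tree))
print(ast.unparse(mutated))        # mutated sample without the defensive reset
```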
