• Volume 34, Issue 5, 2023 Table of Contents
    • >Review Articles
    • Survey on Vision-language Pre-training

      2023, 34(5):2000-2023. DOI: 10.13328/j.cnki.jos.006774

      Abstract:In recent years, deep learning has achieved excellent performance in unimodal areas such as computer vision (CV) and natural language processing (NLP). With the development of technology, the importance and necessity of multimodal learning begin to unfold. Essential to multimodal learning, vision-language learning has received extensive attention from researchers in and outside China. Thanks to the development of the Transformer framework, more and more pre-trained models are applied to vision-language multimodal learning, and the performance of related tasks is improved qualitatively. This study systematically reviews the current work on vision-language pre-trained models. Firstly, the knowledge about pre-trained models is introduced. Secondly, the structure of pre-trained models is analyzed and compared from two perspectives. The commonly used vision-language pre-training techniques are discussed, and five downstream pre-training tasks are elaborated. Finally, the common datasets used in image and video pre-training tasks are expounded, and the performance of commonly used pre-trained models on different datasets under different tasks is compared and analyzed.

    • >Special Issue's Articles
    • Multimodal Pre-training Method for Vision-language Understanding and Generation

      2023, 34(5):2024-2034. DOI: 10.13328/j.cnki.jos.006770

      Abstract:Most existing vision-language pre-training methods focus on understanding tasks and use BERT-like loss functions (masked language modeling and image-text matching) during pre-training. Despite their good performance in the understanding of downstream tasks, such as visual question answering, image-text retrieval, and visual entailment, these methods cannot generate information. To tackle this problem, this study proposes unified multimodal pre-training for vision-language understanding and generation (UniVL). The proposed UniVL is capable of handling both understanding tasks and generation tasks. It expands existing pre-training paradigms and uses random masks and causal masks simultaneously, where causal masks are triangular masks that mask future tokens, and such pre-trained models can have autoregressive generation abilities. Moreover, several vision-language understanding tasks are turned into text generation tasks according to specifications, and the prompt-based method is employed for fine-tuning of different downstream tasks. The experiments show that there is a trade-off between understanding tasks and generation tasks when the same model is used, and a feasible way to improve both tasks is to use more data. The proposed UniVL framework attains comparable performance to recent vision-language pre-training methods in both understanding tasks and generation tasks. Moreover, the prompt-based generation method is more effective and even outperforms discriminative methods in few-shot scenarios.
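
The triangular causal mask mentioned in the abstract above can be illustrated with a minimal sketch. This is not the UniVL implementation; the function names `causal_mask` and `render` are hypothetical, and the mask is built in plain Python only to show which positions are hidden from each token.

```python
def causal_mask(n):
    """Build an n x n lower-triangular mask.

    Entry [i][j] is True when position i may attend to position j
    (j <= i) and False when j is a future token (j > i).
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def render(mask):
    """Render the mask as rows of 1s (visible) and 0s (masked)."""
    return "\n".join(" ".join("1" if v else "0" for v in row) for row in mask)


if __name__ == "__main__":
    # Each row i shows the tokens visible when predicting token i.
    print(render(causal_mask(4)))
```

A random mask, by contrast, hides arbitrary positions regardless of order; using both kinds of mask during pre-training is what the abstract describes as giving the model autoregressive generation ability.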

    • Text-based Person Search via Virtual Attribute Learning

      2023, 34(5):2035-2050. DOI: 10.13328/j.cnki.jos.006766

      Abstract:Text-based person search aims to find the image of a target person that matches a given text description in a person database, and it has attracted the attention of researchers from academia and industry. It faces two challenges: fine-grained retrieval and the heterogeneous gap between images and texts. Some methods use supervised attribute learning to obtain attribute-related features and build fine-grained associations between texts and images. The attribute annotations, however, are hard to obtain, which leads to poor performance of these methods in practice. How to extract attribute-related features without attribute annotations and establish fine-grained, cross-modal semantic associations therefore becomes a key problem. To address this issue, this study incorporates pre-training technology and proposes a text-based person search method via virtual attribute learning, which builds the cross-modal semantic associations between images and texts at a fine-grained level through unsupervised attribute learning. Specifically, in view of the invariance and cross-modal consistency of pedestrian attributes, a semantics-guided attribute decoupling method is proposed, which utilizes identity labels as the supervision signal to guide the model to decouple attribute-related features. Then, a feature learning module based on semantic reasoning is presented, which utilizes the relations between attributes to construct a semantic graph. This module uses the graph model to exchange information among attributes to enhance the cross-modal identification ability of features. The proposed approach is compared with existing methods on the public text-based person search dataset CUHK-PEDES and the cross-modal retrieval dataset Flickr30k, and the experimental results verify the effectiveness of the proposed approach.

    • Pre-training-driven Multimodal Boundary-aware Vision Transformer

      2023, 34(5):2051-2067. DOI: 10.13328/j.cnki.jos.006768

      Abstract:Convolutional neural networks (CNN) have continuously achieved performance breakthroughs in image forgery detection, but when faced with realistic scenarios where the means of tampering is unknown, the existing methods are still unable to effectively capture the long-term dependencies of the input image to alleviate the recognition bias problem, which affects the detection accuracy. In addition, due to the difficulty in labeling, image forgery detection usually lacks accurate pixel-level image labeling information. Considering the above problems, this study proposes a pre-training-driven multimodal boundary-aware vision transformer. To capture the subtle forgery traces invisible in the RGB domain, the method first introduces the frequency-domain modality of the image and combines it with the RGB spatial domain as a form of multimodal embedding. Secondly, the encoder of the backbone network is trained with ImageNet to alleviate the current problem of insufficient training samples. Then, the transformer module is integrated into the tail of this encoder to capture both low-level spatial details and global contexts, which improves the overall representation ability of the model. Finally, to effectively alleviate the problem of difficult localization caused by the blurred boundary of the forged regions, this study establishes a boundary-aware module, which can use the noise distribution obtained by the Scharr convolutional layer to pay more attention to the noise information rather than the semantic content and utilize the boundary residual block to sharpen the boundary information. In this way, the boundary segmentation performance of the model can be enhanced. The results of extensive experiments show that the proposed method outperforms existing image forgery detection methods in terms of recognition accuracy and has better generalization and robustness to different forgery methods.
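
As a rough illustration of the Scharr filtering step named in the abstract above, the sketch below applies the standard 3x3 Scharr kernels to a small grayscale image in plain Python. It is only an assumption about how such a layer responds to edges, not the paper's boundary-aware module; `correlate_valid` and the toy image are hypothetical.

```python
# Standard 3x3 Scharr kernels for horizontal (x) and vertical (y) gradients.
SCHARR_X = [[-3, 0, 3],
            [-10, 0, 10],
            [-3, 0, 3]]
SCHARR_Y = [[-3, -10, -3],
            [0, 0, 0],
            [3, 10, 3]]


def correlate_valid(image, kernel):
    """Slide a 3x3 kernel over the image without padding ("valid" mode)."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - 2):
        out_row = []
        for c in range(cols - 2):
            acc = 0
            for dr in range(3):
                for dc in range(3):
                    acc += kernel[dr][dc] * image[r + dr][c + dc]
            out_row.append(acc)
        out.append(out_row)
    return out


if __name__ == "__main__":
    # Toy image with a vertical edge between the second and third columns.
    image = [[0, 0, 1, 1] for _ in range(4)]
    print(correlate_valid(image, SCHARR_X))  # strong response at the edge
    print(correlate_valid(image, SCHARR_Y))  # no vertical intensity change
```

The horizontal kernel responds strongly where intensity changes across columns, while flat regions produce zero, which is why such filters emphasize boundaries and local noise rather than semantic content.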

    • Multimodal-guided Local Feature Selection for Few-shot Learning

      2023, 34(5):2068-2082. DOI: 10.13328/j.cnki.jos.006771

      Abstract:Deep learning models have yielded impressive results in many tasks. However, the success hinges on the availability of a large number of labeled samples for model training, and deep learning models tend to perform poorly in scenarios where labeled samples are scarce. In recent years, few-shot learning (FSL) has been proposed to study how to learn quickly from a small number of samples and has achieved good performance mainly by the use of meta-learning for model training. Nevertheless, two issues exist: 1) Existing FSL methods usually manage to recognize novel classes solely with the visual features of samples, without integrating information from other modalities. 2) By following the paradigm of meta-learning, a model aims at learning generic and transferable knowledge from massive similar few-shot tasks, which inevitably leads to a generalized feature space and insufficient and inaccurate representation of sample features. To tackle the two issues, this study introduces pre-training and multimodal learning techniques into the FSL process and proposes a new multimodal-guided local feature selection strategy for few-shot learning. Specifically, model pre-training is first conducted on known classes with abundant samples to greatly improve the feature representation ability of the model. Then, in the meta-learning stage, the pre-trained model is further optimized by meta-learning to improve its transferability or its adaptability to the few-shot environment. Meanwhile, the local feature selection is carried out on the basis of visual features and textual features of samples to enhance the ability to represent sample features and avoid sharp degradation of the model’s representation ability. Finally, the resultant sample features are utilized for FSL. The experiments on three benchmark datasets, namely, MiniImageNet, CIFAR-FS, and FC-100, demonstrate that the proposed FSL method can achieve better results.

    • Self-supervised Graph Contrastive Learning for Video Question Answering

      2023, 34(5):2083-2100. DOI: 10.13328/j.cnki.jos.006775

      Abstract:As a cross-modal understanding task, video question answering (VideoQA) requires the interaction of semantic information with different modalities to generate answers to questions given a video and the questions associated with it. In recent years, graph neural networks (GNNs) have made remarkable progress in VideoQA tasks due to their powerful capabilities in cross-modal information fusion and inference. However, most existing GNN approaches fail to improve the performance of VideoQA models due to their inherent deficiencies of overfitting or over-smoothing, as well as weak robustness and generalization. In view of the effectiveness and robustness of self-supervised contrastive learning methods in pre-training techniques, this study proposes a self-supervised graph contrastive learning framework GMC based on the idea of graph data augmentation in VideoQA tasks. The framework uses two independent data augmentation operations for nodes and edges to generate dissimilar subsamples and improves the consistency between predicted graph data distributions of the original samples and augmented subsamples for higher accuracy and robustness of the VideoQA models. The effectiveness of the proposed framework is verified by experimental comparisons with existing state-of-the-art VideoQA models and different GMC variants on the public dataset for VideoQA tasks.

    • Text-to-Chinese-painting Method Based on Multi-domain VQGAN

      2023, 34(5):2116-2133. DOI: 10.13328/j.cnki.jos.006769

      Abstract:With the development of generative adversarial networks (GANs), synthesizing images from textual descriptions has become an active research area. However, textual descriptions used for image generation are often in English, and the generated objects are mostly faces, flowers, birds, etc. Few studies have been conducted on the generation of Chinese paintings from Chinese descriptions. Text-to-image generation often requires an enormous number of labeled image-text pairs, and the cost of dataset production is high. With the advance in multimodal pre-training, the GAN generation process can be guided in an optimized way, which significantly reduces the demand for datasets and computational resources. In this study, a multi-domain vector quantization generative adversarial network (VQGAN) model is proposed to simultaneously generate Chinese paintings in multiple domains. Furthermore, a multimodal pre-trained model, WenLan, is used to calculate the distance loss between generated images and textual descriptions. The semantic consistency between images and texts is achieved by optimizing the hidden space variables input into the multi-domain VQGAN. Finally, an ablation experiment is conducted to compare different variants of the multi-domain VQGAN in terms of the FID and R-precision metrics, and a user study is carried out. The results demonstrate that the complete multi-domain VQGAN model outperforms the original VQGAN model in terms of image quality and text-image semantic consistency.

    • Research on Dual-adversarial MR Image Fusion Network Using Pre-trained Model for Feature Extraction

      2023, 34(5):2134-2151. DOI: 10.13328/j.cnki.jos.006772

      Abstract:With the popularization of multimodal medical images in clinical diagnosis and treatment, fusion technology based on spatiotemporal correlation characteristics has been developed rapidly. The fused medical images not only retain the unique features of source images with various modalities but also strengthen the complementary information, which can facilitate image reading. At present, most methods perform feature extraction and feature fusion by manually defining constraints, which can easily lead to the loss of useful information and unclear details in the fused images. In light of this, a dual-adversarial fusion network using a pre-trained model for feature extraction is proposed in this study to fuse MR-T1/MR-T2 images. The network consists of a feature extraction module, a feature fusion module, and two discriminator network modules. Due to the small scale of the registered multimodal medical image dataset, the feature extraction network cannot be fully trained. In addition, as the pre-trained model has powerful data representation ability, a pre-trained convolutional neural network model is embedded into the feature extraction module to generate the feature map. Then, the feature fusion network fuses the deep features and outputs fused images. Through accurate classification of the source and fused images, the two discriminator networks establish adversarial relations with the feature fusion network separately and eventually encourage it to learn the optimal fusion parameters. The experimental results illustrate the effectiveness of pre-trained technology in this method. Compared with six existing typical fusion methods, the proposed method can generate the fused results of optimal performance in visual effects and quantitative metrics.

    • End-to-end Image Captioning via Visual Region Aggregation and Dual-level Collaboration

      2023, 34(5):2152-2169. DOI: 10.13328/j.cnki.jos.006773

      Abstract:In recent years, Transformer-based pre-trained models have demonstrated powerful capabilities of modality representation, which leads to a shift towards a fully end-to-end paradigm for multimodal downstream tasks, such as image captioning tasks, and enables better performance and faster inference speed of models. However, the grid visual features extracted with such pre-trained models lack regional visual information, which results in inaccurate descriptions of the object content. Thus, the applicability of pre-trained models in image captioning remains largely unexplored. Therefore, this study proposes a novel end-to-end image captioning method based on visual region aggregation and dual-level collaboration (VRADC). Specifically, to learn regional visual information, this study designs a visual region aggregation module that aggregates grid features with similar semantics to obtain a compact visual region representation. Next, the dual-level collaboration module uses the cross-attention mechanism to learn more representative semantic information from the two visual features, which guides the model to generate more fine-grained image captions. The experimental results on the MSCOCO dataset and Flickr30k dataset show that the proposed VRADC-based method can significantly improve the quality of image captioning and achieves state-of-the-art performance.

    • Construction Mechanism and Technical Implementation of Blockchain-based Service Network

      2023, 34(5):2170-2180. DOI: 10.13328/j.cnki.jos.006392

      Abstract:Consortium blockchain technology is the main focus of China’s blockchain development and application. Traditional consortium blockchain applications face several bottlenecks, such as heterogeneous underlying technology platforms, a high technical threshold for applications, a high cost of chain formation, and difficult operation, maintenance, and supervision, which restrict the development of blockchain technology and applications. This study proposes the construction mechanism of a public blockchain infrastructure, named the blockchain-based service network (BSN), and expounds on the technical architecture and implementation of BSN. BSN has been a commercial platform in China since April 2020. It can reduce the cost of blockchain development, deployment, operation and maintenance, interaction, and supervision. BSN is conducive to the promotion and application of blockchain technology in enterprises, governments, industries, and other fields, and will provide reliable and controllable public infrastructure for the construction of innovative smart cities and the development of the digital economy in China.

    • COMPSPEN: Separation Logic Solver for Integrated Reasoning about Shape Properties and Data Constraints

      2023, 34(5):2181-2195. DOI: 10.13328/j.cnki.jos.006407

      Abstract:Separation logic is an extension of the classical Hoare logic for reasoning about pointers and dynamic data structures, and has been extensively used in the formal analysis and verification of fundamental software, including operating system kernels. Automated constraint solving is one of the key means to automate the separation-logic-based verification of these programs. The verification of programs manipulating dynamic data structures usually involves both shape properties, e.g., singly or doubly linked lists and trees, and data constraints, e.g., sortedness and the invariance of data sets/multisets. This paper introduces COMPSPEN, a separation logic solver capable of simultaneously reasoning about the shape properties and data constraints of linear dynamic data structures. First, the theoretical foundations of COMPSPEN are introduced, including the definition of the separation logic fragment SLIDdata as well as the decision procedures for the satisfiability and entailment problems of SLIDdata. Then, the implementation and the architecture of the COMPSPEN tool are presented. Finally, the experimental results for COMPSPEN are reported. A total of 600 test cases are collected, and the performance of COMPSPEN is compared with state-of-the-art separation logic solvers, including Asterix, S2S, Songbird, and SPEN. The experimental results show that COMPSPEN is the only tool capable of solving separation logic formulas involving set data constraints. Overall, it efficiently solves the satisfiability problem of separation logic formulas involving both shape properties and linear arithmetic data constraints on linear dynamic data structures, and it is also capable of solving the entailment problem.

    • Watch out for Version Mismatching and Data Leakage! A Case Study of Their Influence in Bug Report Based Bug Localization Models

      2023, 34(5):2196-2217. DOI: 10.13328/j.cnki.jos.006401

      Abstract:To reduce the labor cost of bug localization, researchers have proposed various automated information retrieval-based bug localization (IRBL) models, including models leveraging traditional features and models leveraging deep learning-based features. When evaluating the effectiveness of IRBL models, most existing studies neglect two problems: the software version mismatch between bug reports and the corresponding source code files in the testing data, and the data leakage caused by ignoring the chronological order of bug reports when training and testing the models. This study investigates the performance of existing models in realistic experimental settings and analyzes the impact of version mismatching and data leakage on the real performance of each model. First, six traditional information retrieval-based models (BugLocator, BRTracer, BLUiR, AmaLgam, BLIA, and Locus) and one novel deep learning model (CodeBERT) are selected as the research objects. Then, an empirical analysis is conducted on eight open-source projects under five different experimental settings. The experimental results demonstrate that directly applying CodeBERT to bug localization is not as effective as expected, since its accuracy depends on the version and source code size of the test project. The results also show that, compared with the traditional version mismatching setting, the traditional information retrieval-based models under the version matching setting improve by up to 47.2% and 46.0% in terms of MAP and MRR, respectively. Meanwhile, the effectiveness of the CodeBERT model is also affected by both data leakage and version mismatching. This means that the effectiveness of traditional information retrieval-based bug localization is underestimated, while the application of the deep learning-based CodeBERT to bug localization still needs more exploration.
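
The MAP and MRR metrics reported in the abstract above can be sketched as follows, using their common information-retrieval definitions. The function names and the toy file rankings are hypothetical and are not taken from the study; each "query" stands for one bug report with a ranked list of candidate source files.

```python
def reciprocal_rank(ranked, relevant):
    """Return 1/rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Average the precision values at the ranks where relevant items occur."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_over_queries(metric, queries):
    """Average a per-query metric over (ranked list, relevant set) pairs."""
    return sum(metric(ranked, relevant) for ranked, relevant in queries) / len(queries)


if __name__ == "__main__":
    # Hypothetical rankings for two bug reports.
    queries = [
        (["a.java", "b.java", "c.java"], {"b.java"}),
        (["x.java", "y.java", "z.java"], {"x.java", "z.java"}),
    ]
    print("MRR:", mean_over_queries(reciprocal_rank, queries))
    print("MAP:", mean_over_queries(average_precision, queries))
```

MRR rewards placing the first buggy file near the top, whereas MAP also accounts for how early every remaining buggy file appears in the ranking.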

    • Research on Sentiment Analysis in Software Engineering

      2023, 34(5):2218-2230. DOI: 10.13328/j.cnki.jos.006428

      Abstract:Sentiment analysis has various application scenarios in software engineering (SE), such as detecting developers’ emotions in commit messages and identifying developers’ opinions on Q&A forums. Nevertheless, commonly used out-of-the-box sentiment analysis tools cannot obtain reliable results in SE tasks, and misunderstanding of technical knowledge has been shown to be the main reason. Researchers have therefore started to customize SE-specific methods in supervised or distantly supervised ways. To assess the performance of these methods, researchers use SE-related annotated datasets to evaluate them in a within-dataset setting, that is, they train and test each method using data from the same dataset. However, an annotated dataset for an SE-specific sentiment analysis task is not always available. Moreover, building a manually annotated dataset is time-consuming and not always feasible. An alternative is to use datasets extracted from the same platform for similar tasks or datasets extracted from other SE platforms. To verify the feasibility of these practices, it is necessary to evaluate existing methods in within-platform and cross-platform settings. The within-platform setting refers to training and testing each method using data from the same platform but not the same dataset, and the cross-platform setting refers to training and testing each method using data from different platforms. This study comprehensively evaluates existing SE-customized sentiment analysis methods in within-dataset, within-platform, and cross-platform settings. Finally, the experimental results provide actionable insights for both researchers and practitioners.

    • Deep-SBFL: Spectrum-based Fault Localization Approach for Deep Neural Networks

      2023, 34(5):2231-2250. DOI: 10.13328/j.cnki.jos.006403

      Abstract:Deep neural networks have been widely used in fields such as autonomous driving and smart healthcare. Like traditional software, deep neural networks inevitably contain defects, and wrong decisions may cause serious consequences. Therefore, the quality assurance of deep neural networks has received extensive attention. However, deep neural networks are quite different from traditional software. Traditional software quality assurance methods cannot be directly applied to deep neural networks, and targeted quality assurance methods need to be designed. Software fault localization is one of the important methods to ensure software quality. The spectrum-based fault localization method has achieved good results in traditional software fault localization, but it cannot be directly applied to deep neural networks. In this study, based on traditional software fault localization methods, a spectrum-based fault localization approach for deep neural networks, named Deep-SBFL, is proposed. The approach first collects the neuron output information and the prediction results of the deep neural network as the spectrum. The spectrum is then processed into contribution information, which quantifies the contribution of neurons to the prediction results. Finally, a suspiciousness formula for the defect localization of deep neural networks is proposed. Based on the contribution information, the suspiciousness scores of neurons in the deep neural network are calculated and ranked to find the most likely defective neurons. To verify the effectiveness of the method, EInspect@n (the number of defects successfully located by inspecting the first n positions of the sorted list) and EXAM (the percentage of elements that must be checked before finding defective elements) are evaluated on a deep neural network trained on the MNIST data set. Experimental results show that this approach can effectively locate different types of defects in deep neural networks.
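
The two evaluation metrics defined in the abstract above can be sketched over a ranked list of neurons. The suspiciousness scores, neuron names, and function names below are hypothetical, and the paper's suspiciousness formula is not reproduced here. The sketch also assumes that EXAM counts elements up to and including the first defective one, which is one reading of the abstract's definition.

```python
def rank_by_suspiciousness(scores):
    """Sort element names from most to least suspicious."""
    return sorted(scores, key=scores.get, reverse=True)


def einspect_at_n(ranked, defective, n):
    """Count the defective elements found within the first n ranked positions."""
    return sum(1 for element in ranked[:n] if element in defective)


def exam(ranked, defective):
    """Percentage of the list inspected until the first defective element.

    Assumption: the defective element itself is counted as inspected.
    Returns 100.0 if no defective element appears in the list.
    """
    for position, element in enumerate(ranked, start=1):
        if element in defective:
            return 100.0 * position / len(ranked)
    return 100.0


if __name__ == "__main__":
    # Hypothetical suspiciousness scores for four neurons.
    scores = {"n1": 0.2, "n2": 0.9, "n3": 0.5, "n4": 0.1}
    defective = {"n3", "n4"}
    ranked = rank_by_suspiciousness(scores)
    print(ranked)
    print("EInspect@2:", einspect_at_n(ranked, defective, 2))
    print("EXAM (%):", exam(ranked, defective))
```

A higher EInspect@n and a lower EXAM both indicate that defective neurons are ranked closer to the top of the list.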

    • DeepRanger: Coverage-guided Deep Forest Testing Approach

      2023, 34(5):2251-2267. DOI: 10.13328/j.cnki.jos.006422

      Abstract:Compared with traditional software, deep learning software has a different structure. Even if a large amount of test data is used to test deep learning software, the adequacy of testing is still hard to evaluate, and many unknown defects may remain hidden. The deep forest is an emerging deep learning model that overcomes many shortcomings of deep neural networks. For example, deep neural networks require a large amount of training data, high-performance computing platforms, and many hyperparameters. However, there is no research on testing deep forests. Based on the structural characteristics of deep forests, this study proposes a set of testing coverage criteria, including random forest node coverage (RFNC), random forest leaf coverage (RFLC), cascade forest class coverage (CFCC), and cascade forest output coverage (CFOC). DeepRanger, a coverage-oriented test data generation method based on a genetic algorithm, is proposed to automatically generate new test data and effectively improve the model coverage of the test data. Experiments are carried out on the MNIST data set and gcForest, an open-source deep forest project. The experimental results show that the four proposed coverage criteria can effectively evaluate the adequacy of the test data set for the deep forest model. In addition, compared with a genetic algorithm based on random selection, DeepRanger, which is guided by coverage information, can improve the testing coverage of the deep forest model under test.

    • Text-oriented Construction for CPS Resource Capability Knowledge Graph

      2023, 34(5):2268-2285. DOI: 10.13328/j.cnki.jos.006410

      Abstract:Cyber-physical systems (CPS) play an increasingly important role in social life. The on-demand choreography of CPS resources is based on the software definition of CPS resources, and the definition of software interfaces depends on a full description of the capabilities of CPS resources. At present, the CPS field lacks both a knowledge base that can describe resources and their capabilities and an effective way to construct such a knowledge base. Targeting the textual descriptions of CPS resources, this study proposes to construct a CPS resource capability knowledge graph and designs a bottom-up automatic construction method. Given CPS resources, this method first extracts textual descriptions of the resources’ capabilities from code and texts, and generates a normalized expression of capability phrases based on a predefined representation pattern. Then, capability phrases are divided, aggregated, and abstracted based on the key components of the verb-object structure to generate a hierarchical abstract description of capabilities for different categories of resources. Finally, the CPS knowledge graph is constructed. Based on the Home Assistant platform, this study constructs a knowledge graph containing 32 resource categories and 957 resource capabilities. In the construction experiment, the results of manual construction and automatic construction using the proposed method are compared and analyzed from different dimensions. Experimental results show that this study provides a feasible method for the automatic construction of the CPS resource capability knowledge graph. This method helps to reduce the workload of manual construction, supplement the description of resource services and capabilities in the CPS field, and improve knowledge completeness.

    • Mutation Optimization of Directional Fuzzing for Cumulative Defects

      2023, 34(5):2286-2299. DOI: 10.13328/j.cnki.jos.006491

      Abstract:Many quantifiable state-out-of-bound software defects, such as access violations, memory exhaustion, and performance failures, are caused by a large quantity of input data. However, existing dependent data identification and mutation optimization technologies for grey-box fuzzing mainly focus on fixed-length data formats. They are not efficient in increasing the amount of cumulative data required to reach the accumulated buggy states. This study proposes a differential mutation method to accelerate feature state optimization during directed fuzzing. By monitoring the seeds that update the maximum or minimum state value of the cumulative defects, the offset and content of the effective mutations are determined. The frequency and distribution of the effective mutation offsets are then leveraged to distinguish whether the feature value of the defect depends on a fixed field or on cumulative data in the input. The effective mutation content is reused as material in the cumulative input mutation to accelerate bug reproduction or directed testing. Based on this idea, this study implements the fuzzing tool Jigsaw. The evaluation results on the experimental data set show that the proposed dependency detection method can efficiently detect the input data type that drives the feature value of cumulative defects, and that the mutation method significantly shortens the reproduction time of cumulative defects that require a large amount of special input data.

    • >Review Articles
    • Explainable Reinforcement Learning: Basic Problems Exploration and Method Survey

      2023, 34(5):2300-2316. DOI: 10.13328/j.cnki.jos.006485

      Abstract:Reinforcement learning is a technique that discovers optimal behavior strategies in a trial-and-error way, and it has become a general method for solving environmental interaction problems. However, as a machine learning method, reinforcement learning faces a problem common in machine learning: it is unexplainable. This problem limits the application of reinforcement learning in safety-sensitive fields, e.g., medical treatment and transportation, and it leads to a lack of universally applicable solutions in environmental simulation and task generalization. To address the problem, extensive research on explainable reinforcement learning (XRL) has emerged. However, the academic community still has an inconsistent understanding of XRL. Therefore, this study explores the basic problems of XRL and reviews existing works. To begin with, the study discusses the parent problem, i.e., explainable artificial intelligence, and summarizes its existing definitions. Next, it constructs a theoretical system of interpretability to describe the common problems of XRL and explainable artificial intelligence. To be specific, it distinguishes between intelligent algorithms and mechanical algorithms, defines interpretability, discusses factors that affect interpretability, and classifies the intuitiveness of interpretability. Then, based on the characteristics of reinforcement learning, the study defines three unique problems of XRL, i.e., environmental interpretation, task interpretation, and strategy interpretation. After that, the latest research on XRL is reviewed, and the existing methods are systematically classified. Finally, the future research directions of XRL are put forward.

    • Information Fusion Recommendation Approach Combining Attention CNN and GNN

      2023, 34(5):2317-2336. DOI: 10.13328/j.cnki.jos.006405 CSTR:

      Abstract (2048) HTML (1412) PDF 6.71 M (2908) Comment (0) Favorites

      Abstract:Sparsity has always been a primary challenge for recommendation systems, and information fusion recommendation can alleviate this problem by exploiting user preferences through comments, ratings, and trust information, so as to generate corresponding recommendations for target users. Full learning of user and item information is the key to building a successful recommendation system. Different users have different preferences for various items, and users’ interest preferences and social circles change dynamically. A recommendation method combining deep learning and information fusion is proposed to solve the problem of sparsity. Specifically, a new deep learning model named information fusion recommendation model combining attention CNN and GNN (ACGIF for short) is constructed. First, an attention mechanism is added to the CNN to process the comment information and learn the personalized representation of users and items from it. The model learns the comment representation based on comment coding and learns the user/item representation in the comment through user/item coding. It adds a personalized attention mechanism to filter comments with different levels of importance. Then, the rating and trust information are processed through the GNN. For each user, the diffusion process begins with the initial embedding, combining the relevant features and the free user latent vectors that capture the latent behavioral preferences. A layered influence propagation structure is designed to simulate how the user’s latent embedding evolves as the social diffusion process continues. Finally, the user-item preference vectors obtained from the first two parts are weighted and fused to obtain the final preference vector of the user for the item. The MAE and RMSE of the recommended results are employed as the experimental evaluation indicators on four public data sets. The experimental results show that the proposed model achieves better recommendation performance and running time than seven existing typical recommendation models.

    • Self-paced Learning Method with Adaptive Mixture Weighting

      2023, 34(5):2337-2349. DOI: 10.13328/j.cnki.jos.006438 CSTR:

      Abstract (782) HTML (1349) PDF 6.98 M (2643) Comment (0) Favorites

      Abstract:Self-paced learning (SPL) is a learning regime inspired by the learning process of humans and animals that gradually incorporates samples into the training set from easy to complex by assigning a weight to each training sample. SPL incorporates a self-paced (SP) regularizer into the objective function to control the learning process. At present, there are various forms of SP regularizers, and different regularizers may lead to distinct learning performance. The mixture weighting regularizer has the characteristics of both hard weighting and soft weighting. Therefore, it is widely used in many SPL-based applications. However, the current mixture weighting method only considers logarithmic soft weighting, which is relatively simple. In addition, in comparison with soft weighting or hard weighting, more parameters are introduced in the mixture weighting scheme. In this study, an adaptive mixture weighting SP regularizer is proposed to overcome the above issues. On the one hand, the representation form of weights can be adjusted adaptively during the learning process; on the other hand, the SP parameters introduced by mixture weighting can be adapted according to the characteristics of the sample loss distribution, so that no empirically tuned parameters are needed. The experimental results on action recognition and multimedia event detection show that the proposed method is able to adjust the weighting form and parameters adaptively.
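
To make the difference between hard and soft weighting concrete, the following minimal sketch shows two standard SP weighting rules from the SPL literature. It is not the adaptive mixture regularizer proposed in the paper; the function names and the age parameter `lam` are illustrative assumptions.

```python
def hard_weights(losses, lam):
    """Hard weighting: a sample is either fully included (1) or excluded (0).

    Samples whose loss is below the age parameter `lam` count as "easy".
    """
    return [1.0 if loss < lam else 0.0 for loss in losses]


def linear_soft_weights(losses, lam):
    """Linear soft weighting: the weight decays linearly with the loss.

    Samples with loss >= lam receive weight 0; a loss of 0 gives weight 1.
    """
    return [max(0.0, 1.0 - loss / lam) for loss in losses]


if __name__ == "__main__":
    sample_losses = [0.0, 0.5, 2.0]
    print(hard_weights(sample_losses, lam=1.0))
    print(linear_soft_weights(sample_losses, lam=1.0))
```

Increasing `lam` over training iterations admits progressively harder samples, which is the "easy to complex" schedule described in the abstract.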

    • Multimodal and Multi-granularity Graph Convolutional Networks for Elderly Daily Activity Recognition

      2023, 34(5):2350-2364. DOI: 10.13328/j.cnki.jos.006439 CSTR:

      Abstract (1459) HTML (1731) PDF 7.38 M (3472) Comment (0) Favorites

      Abstract:As the problem of the aging population becomes serious, more attention is paid to the safety of the elderly when they are at home alone. In order to provide early warning, alarm, and reporting of dangerous behaviors, several domestic and foreign research institutions are studying the intelligent monitoring of the daily activities of the elderly from the robot view. To promote the industrialization of these technologies, this work mainly studies how to automatically recognize the daily activities of the elderly, such as “drinking water”, “washing hands”, “reading a book”, and “reading a newspaper”. An investigation of daily activity videos of the elderly shows that the semantics of these activities are clearly fine-grained. For example, the semantics of “drinking water” and “taking medicine” are highly similar, and only a small number of video frames can accurately reflect their category semantics. To effectively address this problem in elderly behavior recognition, this work proposes a new multimodal multi-granularity graph convolutional network (MM-GCN), which applies graph convolutional networks to four modalities, i.e., the skeleton (“point”), bone (“line”), frame (“frame”), and proposal (“segment”), to model the activities of the elderly and capture the semantics under the four granularities of “point-line-frame-proposal”. Finally, experiments are conducted to validate the activity recognition performance of the proposed method on ETRI-Activity3D (110000+ videos, 50+ classes), which is the largest daily activity dataset for the elderly. Compared with the state-of-the-art methods, the proposed MM-GCN achieves the highest recognition accuracy. In addition, to verify the robustness of MM-GCN on general human action recognition tasks, an experiment is also carried out on the benchmark NTU RGB+D, and the results show that MM-GCN is comparable to the SOTA methods.

    • >Review Articles
    • Survey on Data Integration Technologies for Relational Data and Knowledge Graph

      2023, 34(5):2365-2391. DOI: 10.13328/j.cnki.jos.006808 CSTR:

      Abstract (2103) HTML (4143) PDF 7.38 M (5478) Comment (0) Favorites

      Abstract:In recent years, big data has come to be considered a critical strategic resource by many countries and regions. However, difficult data circulation and insufficient data regulation are common in the big data era, leading to serious data silos, poor data quality, and difficulty in unleashing the potential of data elements. This motivates researchers to explore data integration techniques for breaking data barriers, enabling data sharing, improving data quality, and activating the potential of data elements. Relational data and knowledge graphs, as two significant forms of data organization and storage, have been widely applied in real life. To this end, this study focuses on relational data and knowledge graphs to summarize and analyze the key technologies of data integration, including entity resolution, data fusion, and data cleaning. Finally, it discusses future research directions.

    • Theoretical Study on Multi-level Consistency Modeling in Distributed Databases

      2023, 34(5):2392-2412. DOI: 10.13328/j.cnki.jos.006460 CSTR:

      Abstract (1116) HTML (1055) PDF 9.69 M (2209) Comment (0) Favorites

      Abstract:A new architecture that supports multiple coordinators and multi-replica storage has emerged in distributed database systems, which brings new challenges to the correctness of transaction scheduling. The challenges are represented by new data anomalies caused by the lack of a central coordinator and data inconsistency caused by the multi-replica mechanism. Based on the definition of transaction isolation levels and consistency protocols for distributed systems, this study constructs a unified hybrid dependency graph model for transactional multi-level consistency in multi-coordinator and multi-replica distributed databases. The model provides a robust standard for evaluating the correctness of transaction scheduling, which can facilitate dynamic or static analysis of transaction scheduling in databases.

    • High-dimensional Learned Index Based on Space Division and Dimension Reduction

      2023, 34(5):2413-2426. DOI: 10.13328/j.cnki.jos.006414 CSTR:

      Abstract (806) HTML (1761) PDF 6.15 M (2575) Comment (0) Favorites

      Abstract:In recent years, research on big-data processing has often dealt with increased data scale and high data complexity. The frequent use of high-dimensional data poses challenges in applications, such as efficient querying of and fast access to the database. Hence, it is critical to design an effective high-dimensional index to increase query throughput and decrease the memory footprint. Kraska et al. proposed the learned index, which has proved superior on real-world low-dimensional datasets. With the wide and successful adoption of machine learning and deep learning in database management systems, more and more researchers aim to build learned indexes on high-dimensional datasets so as to improve query efficiency. However, current solutions fail to effectively utilize the distribution information of data and sometimes incur high overhead in the initialization of complex deep learning models. In this work, an improved high-dimensional learned index (IHDL index) is proposed based on the division of data space and dimension reduction. Specifically, the index utilizes multiple linear models on the dataset and decreases the initialization overhead while maintaining high query accuracy. Experiments on the synthetic dataset and the OSM dataset verify its superiority in terms of initialization overhead, query throughput, and memory footprint.

    • Elsa: Coordination-free Distributed KVS for Cross-region Architecture

      2023, 34(5):2427-2445. DOI: 10.13328/j.cnki.jos.006437 CSTR:

      Abstract (1081) HTML (1604) PDF 6.89 M (3419) Comment (0) Favorites

      Abstract:As distributed storage solutions with high performance and high scalability, key-value storage systems such as Redis, MongoDB, and Cassandra have been widely adopted in recent years. The multi-replication mechanism widely used in distributed storage systems improves system throughput and reliability, but it also increases the overhead of system coordination and replication consistency. For cross-region distributed systems, the long-distance replication coordination overhead may even become the performance bottleneck of the system, reducing system availability and throughput. The distributed key-value storage system called Elsa, proposed in this study, is a coordination-free multi-master key-value storage system designed for cross-region architecture. While ensuring high performance and high scalability, Elsa adopts conflict-free replicated data types (CRDT) to ensure strong eventual consistency between replicas without coordination, reducing the coordination overhead between system nodes. In this study, a cross-region distributed environment spanning 4 data centers and 8 nodes is set up on the Aliyun platform, and a large-scale distributed performance comparison experiment is carried out. The experimental results show that in the cross-region distributed environment, the throughput of Elsa has obvious advantages under highly concurrent contention loads, reaching up to 7.37 times that of the MongoDB cluster and 1.62 times that of the Cassandra cluster.
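
As a rough illustration of how a CRDT reaches strong eventual consistency without coordination, the sketch below implements a grow-only counter (G-Counter), a textbook state-based CRDT. The abstract does not say which data types Elsa uses, so the class and its method names are assumptions chosen for illustration only.

```python
class GCounter:
    """Grow-only counter: each replica increments only its own slot.

    Merging takes the element-wise maximum, which is commutative,
    associative, and idempotent, so replicas converge regardless of
    the order or number of times states are exchanged.
    """

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments observed from that replica

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise maximum over all replica slots.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)


if __name__ == "__main__":
    east, west = GCounter("east"), GCounter("west")
    east.increment(2)   # local write in one region
    west.increment(3)   # concurrent local write in another region
    east.merge(west)    # states exchanged in any order
    west.merge(east)
    print(east.value(), west.value())
```

Each replica accepts writes locally and exchanges state later, so no cross-region round trip sits on the write path.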

    • Fast Mining Algorithm of Frequent Itemset Based on Spark

      2023, 34(5):2446-2464. DOI: 10.13328/j.cnki.jos.006404 CSTR:

      Abstract (880) HTML (1377) PDF 17.79 M (2619) Comment (0) Favorites

      Abstract:Improving the efficiency of frequent itemset mining in big data is a hot research topic at present. With the continuous growth of data volume, the computing costs of traditional frequent itemset generation algorithms remain high. Therefore, this study proposes a fast mining algorithm of frequent itemsets based on Spark (Fmafibs for short). Taking advantage of bit-wise operations, a novel pattern growth strategy is designed. Firstly, the algorithm converts itemsets into BitStrings and exploits bit-wise operations to generate candidate itemsets. Secondly, to improve the processing efficiency of long BitStrings, a vertical grouping strategy is designed: the candidate itemsets are obtained by joining the frequent itemsets between different groups of the same transaction and are then aggregated and filtered to get the final frequent itemsets. Fmafibs is implemented in the Spark environment. The experimental results on benchmark datasets show that the proposed method is correct and can significantly improve mining efficiency.
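
The general idea of using bit-wise operations for itemset counting can be sketched as follows. This is a single-machine illustration of one common bit-vector encoding (one bit per transaction for each item), not the Fmafibs algorithm itself; the abstract does not specify its BitString layout, and the helper names here are hypothetical.

```python
def item_bitmaps(transactions):
    """Map each item to an integer whose bit `tid` is set when
    transaction number `tid` contains the item."""
    bitmaps = {}
    for tid, items in enumerate(transactions):
        for item in items:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << tid)
    return bitmaps


def support(itemset, bitmaps):
    """Support of an itemset = number of set bits in the AND of its
    items' bitmaps (transactions containing every item)."""
    items = list(itemset)
    mask = bitmaps.get(items[0], 0)
    for item in items[1:]:
        mask &= bitmaps.get(item, 0)
    return bin(mask).count("1")


if __name__ == "__main__":
    transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
    bitmaps = item_bitmaps(transactions)
    print(support(["a", "b"], bitmaps))
```

A candidate is kept as frequent when its support meets a chosen minimum threshold; the AND and bit count replace a scan over the transactions.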

    • Verifiable Attribute-based Timed Signatures and Its Applications

      2023, 34(5):2465-2481. DOI: 10.13328/j.cnki.jos.006396 CSTR:

      Abstract (1114) HTML (1296) PDF 7.14 M (2638) Comment (0) Favorites

      Abstract:A verifiable timed signature (VTS) scheme allows one to time-lock a signature on a known message for a given amount of time T such that after performing a sequential computation for time T, anyone can extract the signature from the time-lock. Verifiability ensures that anyone can publicly check whether a time-lock contains a valid signature on the message without solving it first, and that the signature can be obtained by solving the time-lock for time T. This study first proposes the notion of verifiable attribute-based timed signatures (VABTS) and then gives an instantiation of VABTS. The instantiated VABTS scheme not only simultaneously supports identity privacy preservation, dynamic user revocation, traceability, and timing but also solves the problem of key escrow in attribute-based schemes. In addition, VABTS has many applications. This study lists two application scenarios of VABTS: building a privacy-preserving payment channel network for permissioned blockchains and realizing fair privacy-preserving multi-party computation. Finally, formal security analysis and performance evaluation show that the instantiated VABTS scheme is secure and efficient.

    • Password Hardening Encryption Services Against Malicious Server

      2023, 34(5):2482-2493. DOI: 10.13328/j.cnki.jos.006440 CSTR:

      Abstract (889) HTML (1234) PDF 9.00 M (2158) Comment (0) Favorites

      Abstract:Password hardening encryption (PHE) is a primitive that has emerged in recent years. It resists offline attacks arising from keyword guessing by the server by adding a third party that provides cryptographic services and joins the decryption process. This primitive enhances the password authentication protocol and adds encryption functionality. This study presents an active attack by the server on the first scheme that introduced this primitive. The attack draws on a cutting-edge threat called the algorithm substitution attack, which is undetectable and makes the server capable of launching offline attacks. This result shows that the original PHE scheme cannot resist attacks from a malicious server. The study then summarizes the properties that a scheme resistant to algorithm substitution attacks should have. After that, it presents a PHE scheme that can resist such attacks from a malicious server, together with simulation results. Finally, the study summarizes the results and gives an outlook on future systematic research on interactive protocols under algorithm substitution attacks.

    • Unified Image Aesthetic and Emotional Prediction Based on Deep Multi-task Learning

      2023, 34(5):2494-2506. DOI: 10.13328/j.cnki.jos.006487 CSTR:

      Abstract (842) HTML (1278) PDF 9.13 M (2113) Comment (0) Favorites

      Abstract:Image aesthetic assessment and emotional analysis aim to enable computers to identify the aesthetic and emotional responses of human beings caused by visual stimulations, respectively. Existing research usually treats them as two independent tasks. However, people’s aesthetic and emotional responses do not appear in isolation. On the contrary, from the perspective of psychological cognition, the two responses are interrelated and mutually influenced. Therefore, this study follows the idea of deep multi-task learning to deal with image aesthetic assessment and emotional analysis under a unified framework and explore their relationship. Specifically, a novel adaptive feature interaction module is proposed to correlate the backbone networks of the two tasks and achieve a unified prediction. In addition, a dynamic feature interaction mechanism is introduced to adaptively determine the degree of feature interaction between the tasks according to the feature dependencies. Because the two tasks differ in complexity and convergence speed, the study proposes a novel gradient balancing strategy for updating the structural parameters of the multi-task network, which ensures that the network parameters of each task can be smoothly learned under the unified prediction framework. Furthermore, the study constructs a large-scale unified image aesthetic and emotional dataset, UAE. According to the study, UAE is the first image collection containing both aesthetic and emotional labels. Finally, the model and code of the proposed method as well as the UAE dataset have been released at https://github.com/zhenshen-mla/Aesthetic-Emotion-Dataset.

Contact Information
  • Journal of Software
  • Sponsor: Institute of Software, CAS, China
  • Postal code: 100190
  • Phone: 010-62562563
  • Email: jos@iscas.ac.cn
  • Website: https://www.jos.org.cn
  • Serial numbers: ISSN 1000-9825
  •           CN 11-2560/TP
  • Domestic price: 70 yuan
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address:4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code:100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn