Abstract: With the development of Internet information technology, large-scale graphs have become ubiquitous in social networks, computer networks, and biological information networks. Given the storage and performance limitations of traditional graph data management technology when dealing with large-scale graphs, distributed management technology has become a research hotspot in both industry and academia. Core decomposition computes the core number of each vertex in a graph and plays a key role in many applications, including community search, protein structure analysis, and network structure visualization. Existing distributed core decomposition algorithms apply a broadcast message-delivery mechanism based on the vertex-centric model, which may generate a large amount of redundant communication and computation overhead and lead to memory overflow when processing large-scale graphs. To address these issues, this study proposes novel distributed core decomposition algorithms based on the global activation and peeling calculation frameworks, respectively. In addition, several strategies are designed to improve algorithm performance. Based on the locality of vertex core numbers, the study proposes a new message-pruning strategy and a new worker-centric computing mode, thereby improving the efficiency of the proposed algorithms. To verify these strategies, this study deploys the proposed models and algorithms on the distributed cluster of the National Supercomputing Center in Changsha, and the effectiveness and efficiency of the proposed methods are evaluated through extensive experiments on real and synthetic datasets. The total time performance of the algorithms is improved by 37% to 98%.
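The peeling framework referenced above can be illustrated with a minimal single-machine sketch; the distributed global-activation and worker-centric variants proposed in the paper are not reproduced here, and the edge-list input format is an assumption for illustration only.

```python
# A minimal single-machine peeling sketch of core decomposition.
from collections import defaultdict

def core_numbers(edges):
    """Return {vertex: core number} for an undirected graph given as an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    degree = {v: len(ns) for v, ns in adj.items()}
    remaining = set(adj)
    core, k = {}, 0
    while remaining:
        # Peel a vertex of minimum residual degree; its core number is the
        # larger of that degree and the current core level k.
        v = min(remaining, key=degree.get)
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core

# Example: a triangle with a pendant vertex -> cores a, b, c = 2 and d = 1.
print(core_numbers([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]))
```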
Abstract: Raft is one of the most popular distributed consensus protocols. Since it was proposed in 2014, Raft and its variants have been widely used in many kinds of distributed systems. To prove the correctness of the Raft protocol, developers use the TLA+ formal specification to model and verify its design. However, due to the gap between the abstract formal specification and the practical implementation, distributed systems that implement the Raft protocol can still violate the protocol design and introduce intricate bugs. This study proposes a novel testing technique based on the TLA+ formal specification to unearth bugs in Raft implementations. Specifically, the study maps the formal specification to the corresponding system implementation and then uses the specification-defined state space to guide testing of the implementation. To evaluate the feasibility and effectiveness of the proposed approach, the study applies it to two different Raft implementations and finds 3 previously unknown bugs.
Abstract: Code search methods based on deep learning perform the search task by computing the similarity between the representations of code and of description statements. However, this manner does not consider the real probability distribution of relevance between the code and the description. To solve this problem, this study proposes a code search method based on a generative adversarial game, which combines the code-description relevance modeling of classical probabilistic models with the feature extraction of vector space models. The generative adversarial game is then adopted to apply the probability distribution between the code and the description to the alternate training of the generator and the discriminator. Meanwhile, the code encoder and the description encoder are optimized, and high-quality code representations and description representations are generated for the code search task. Finally, experimental verification is carried out on a public dataset, and the results show that the proposed method improves the Recall@10, MRR@10, and NDCG@10 metrics by 8.4%, 32.5%, and 24.3%, respectively, compared with the DeepCS method.
Abstract: Embedded systems are becoming increasingly complex, and the requirements analysis of their software systems has become a bottleneck in embedded system development. Device dependency and interleaving execution logic are typical characteristics of embedded software systems, necessitating effective requirement analysis methods to decouple the requirements based on device dependencies. Starting from the idea of environment-based modeling in requirement engineering, this study proposes a projection-based requirement analysis approach from system requirements to software requirements for embedded software systems, helping requirement engineers to effectively decouple the requirements. The study first summarizes the system requirement and software requirement descriptions of embedded software systems, defines the requirement decoupling strategies of embedded software systems based on interactive environment characteristics, and designs the specification process from system requirements to software requirements. A real case study is carried out in the spacecraft sun search system, and five representative case scenarios are quantitatively evaluated through two metrics of coupling and cohesion, which demonstrate the effectiveness of the proposed approach.
Abstract: In recent years, service-oriented IoT architectures have received extensive attention from academia and industry. By encapsulating IoT resources into intelligent IoT services, interconnecting and coordinating these resource-constrained and capacity-evolving IoT services has become a widely adopted and flexible mechanism for building IoT applications. Running on capacity-fluctuating and resource-varying edge devices, IoT services may experience QoS degradation or resource mismatches during execution, making it difficult for IoT applications to continue and possibly inducing failures. Therefore, quantitative monitoring of IoT services at runtime has become key to guaranteeing the robustness of IoT applications. Different monitoring mechanisms have been proposed in the recent literature, but they lack formal interpretation and suffer from strong domain dependence and empirical subjectivity. Based on formal methods such as signal temporal logic (STL), the problem of IoT service monitoring can be formulated as a temporal logic task to achieve runtime quantitative monitoring. However, STL and its extensions suffer from non-differentiability, loss of soundness, and inapplicability in dynamic environments. Moreover, existing works are inadequate for the monitoring of composite services, lacking integrity, linkage, and dynamics. To solve these problems, this study proposes a compositional signal temporal logic (CSTL) to achieve quantitative monitoring of different QoS constraints and time constraints upon intra-, inter-, and composite services. Specifically, CSTL extends an accumulative operator based on positively and negatively biased Riemann sums to emphasize the robust satisfaction of all sub-formulae over their entire time domains and to evaluate qualitative and quantitative constraint satisfaction for IoT service monitoring. Besides, CSTL extends a compositional operator based on constraint types and composite structures, as well as dynamic variables that vary with the environment, to effectively monitor QoS variations and temporal violations of composite services. As a result, temporal and QoS constraints upon intra-, inter-, and composite services can be specified by CSTL formulae and formally interpreted with qualitative and quantitative satisfaction at runtime. Extensive evaluations show that the proposed CSTL performs better than baseline techniques in terms of expressiveness, applicability, and robustness.
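As an illustration of the accumulative idea, the following sketch contrasts the classical worst-case robustness of an "always" constraint with a simple Riemann-sum style score over a sampled QoS signal; the threshold, weights, and signal values are assumptions, and this is only a toy approximation, not the CSTL semantics defined in the paper.

```python
# Classical "always" robustness (a min over the window) versus a toy
# Riemann-sum style accumulative score that rewards margins over the whole
# time domain and weights violations more heavily.
def always_robustness(signal, threshold):
    """STL robustness of G(x >= threshold): the worst-case margin."""
    return min(x - threshold for x in signal)

def accumulative_robustness(signal, threshold, dt=1.0, neg_weight=2.0):
    """Sum of margins over time; negative margins are biased more strongly."""
    total = 0.0
    for x in signal:
        margin = x - threshold
        total += dt * (margin if margin >= 0 else neg_weight * margin)
    return total

qos_ok = [3.0, 2.5, 2.8, 2.6]    # e.g., throughput staying above the required 2.0
qos_bad = [3.0, 2.5, 1.4, 2.6]   # a transient dip below the requirement
print(always_robustness(qos_ok, 2.0), accumulative_robustness(qos_ok, 2.0))
print(always_robustness(qos_bad, 2.0), accumulative_robustness(qos_bad, 2.0))
```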
Abstract: With the rapid development of deep neural networks (DNNs), the accuracy of DNNs has become comparable to or even surpassed that of humans in some specific tasks. However, like traditional software, DNNs are inevitably prone to defects. If defective DNN models are applied to safety-critical fields, they may cause serious accidents. Therefore, it is urgent to propose effective methods for detecting defective DNN models. Traditional differential testing methods rely on the outputs of testing targets under the same test input as the basis for difference analysis. However, even different DNN models trained with the same program and dataset may produce different outputs under the same test input. Therefore, it is difficult to directly apply traditional differential testing to detecting defective DNN models. To solve the above problems, this study proposes interpretation-analysis-based differential testing (IADT), a differential testing method for DNN models. IADT uses interpretation methods to generate explanations of model behavior and uses statistical methods to analyze significant differences in these behavior interpretations to detect defective models. Experiments carried out on real defective models show that the introduction of interpretation methods makes IADT effective in detecting defective DNN models: the F1-score of IADT in detecting defective models is 0.8%–6.4% greater than that of DeepCrime, while the time consumed by IADT is only 4.0%–5.4% of that of DeepCrime.
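A hedged sketch of the interpretation-then-statistics idea is given below: attribution vectors from several models are compared pairwise, and a model whose explanations differ significantly from the others is flagged. The attribution source, cosine similarity, and Mann-Whitney U test are illustrative choices, not necessarily those used by IADT.

```python
# Flag a model whose behavior explanations differ significantly from its peers.
import numpy as np
from scipy.stats import mannwhitneyu

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def flag_outlier_models(attributions, alpha=0.01):
    """attributions: array of shape (num_models, num_inputs, num_features),
    i.e., one explanation vector per model and test input."""
    m, n, _ = attributions.shape
    sims = {(i, j): [cosine(attributions[i, t], attributions[j, t]) for t in range(n)]
            for i in range(m) for j in range(i + 1, m)}
    flagged = []
    for k in range(m):
        involving = [s for (i, j), v in sims.items() if k in (i, j) for s in v]
        others = [s for (i, j), v in sims.items() if k not in (i, j) for s in v]
        # One-sided test: are similarities involving model k significantly lower?
        _, p = mannwhitneyu(involving, others, alternative="less")
        if p < alpha:
            flagged.append(k)
    return flagged

# Synthetic example: model 2 produces inverted attributions and is flagged.
atts = np.random.rand(4, 50, 30)
atts[2] = -atts[2]
print(flag_outlier_models(atts))
```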
Abstract: Current sentiment analysis research is generally based on big data-driven models, which rely heavily on expensive annotation and computational costs. Therefore, research on sentiment analysis in low-resource scenarios is particularly urgent. However, existing research on sentiment analysis in low-resource scenarios mainly focuses on a single task, making it difficult for models to acquire external task knowledge. Therefore, this study formulates successive sentiment analysis in low-resource scenarios, aiming to let models learn multiple sentiment analysis tasks over time with continual learning methods. This makes full use of data from different tasks and learns sentiment information across tasks, thus alleviating the problem of insufficient training data for a single task. Successive sentiment analysis in low-resource scenarios raises two core problems: preserving sentiment information for a single task and fusing sentiment information between different tasks. To solve these two problems, this study proposes continual attention modeling for successive sentiment analysis in low-resource scenarios. A sentiment masked adapter (SMA) is first constructed to generate hard-attention sentiment masks for different tasks, which preserves sentiment information for each task and mitigates catastrophic forgetting. Secondly, dynamic sentiment attention (DSA) is proposed, which dynamically fuses features extracted by different adapters based on the current time step and task similarity, thereby fusing sentiment information between different tasks. Experimental results on multiple datasets show that the proposed approach significantly outperforms state-of-the-art benchmark approaches. Additionally, experimental analysis indicates that, compared with other benchmark approaches, the proposed approach has the best sentiment information retention and fusion abilities while maintaining high operational efficiency.
Abstract: Intelligent connected vehicles (ICVs) hold a significant strategic position in national development plans, representing a key technology underpinning automotive industry innovation and a core element of national competitiveness. The culmination of ICV development is the realization of autonomous driving capabilities, that is, autonomous vehicles. The security of autonomous vehicles has direct implications for public security, personal safety, and property. However, a comprehensive and systematic investigation of these security issues is still lacking. A thorough examination of the security threats faced by autonomous vehicles can therefore guide security reinforcement and promote wide adoption. By collating relevant research from both academia and industry, this study conducts a methodical and comprehensive analysis of the security issues of autonomous driving. It first elaborates on the architecture of autonomous vehicles and the corresponding security considerations. Subsequently, from a model-centric perspective, the analysis delineates nine potential attack vectors across the three domains of physical inputs, informational inputs, and the driving model itself, and discusses each vector alongside its associated attack methods and corresponding security mitigations. Finally, through a quantitative analysis of research literature from the past seven years, the current landscape of autonomous vehicle security research is examined, and promising directions for future research are identified.
Abstract: With the proliferation of massive data and the ever-growing demand for intelligent applications, ensuring data security has become a critical measure for enhancing data quality and realizing data value. The cloud-edge-client architecture has emerged as a promising technology for efficient data processing and optimization. Federated learning (FL), an efficient decentralized machine learning paradigm that provides privacy protection for data, has garnered extensive attention from academia and industry in recent years. However, FL has inherent vulnerabilities that render it highly susceptible to poisoning attacks. Most existing methods for defending against poisoning attacks rely on a continuous update space, but in practical scenarios, these methods may be less robust when facing flexible attack strategies and varied attack scenarios. Therefore, this study proposes FedDiscrete, a defense method for resisting poisoning attacks in cloud-edge FL (CEFL) systems. The key idea is to compute local rankings on the client side from the scores of network model edges, thereby creating a discrete update space. To ensure fairness among clients participating in an FL task, this study also introduces a contribution metric. In this way, FedDiscrete can penalize potential attackers when allocating the updated global ranking. Extensive experiments demonstrate that the proposed method exhibits significant advantages and robustness against poisoning attacks and is applicable to both independent and identically distributed (IID) and non-IID scenarios, providing protection for CEFL systems.
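The ranking-based discrete update space can be sketched as follows, assuming average-rank aggregation and a Spearman-correlation contribution score; both are illustrative choices and may differ from the actual FedDiscrete design.

```python
# Server-side aggregation of client rankings plus a contribution score.
import numpy as np
from scipy.stats import rankdata, spearmanr

def aggregate_rankings(client_scores):
    """client_scores: (num_clients, num_edges) local scores of network model edges."""
    local_ranks = np.vstack([rankdata(-s) for s in client_scores])  # rank 1 = most important edge
    global_rank = rankdata(local_ranks.mean(axis=0))                # aggregate by average rank
    contributions = []
    for r in local_ranks:
        rho, _ = spearmanr(r, global_rank)   # agreement with the aggregate ranking
        contributions.append(rho)
    return global_rank, np.array(contributions)

base = np.arange(20, dtype=float)
honest = np.vstack([base + np.random.normal(0, 1, 20) for _ in range(4)])
attacker = -base                              # a client inverting edge importance
g, c = aggregate_rankings(np.vstack([honest, attacker]))
print(np.round(c, 2))                         # the last client receives a markedly lower score
```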
Abstract: Cloud storage has become an important part of the digital economy, as it brings great convenience to users' data management. However, complex and diverse network environments and third parties that are not fully trusted pose great threats to user privacy. To protect user privacy, data is usually encrypted before storage, but the ciphertext generated by traditional encryption techniques hinders subsequent data retrieval. Public-key encryption with keyword search (PEKS) provides confidential retrieval while guaranteeing data encryption, but traditional PEKS schemes are vulnerable to keyword guessing attacks due to the small number of common keywords. Public-key authenticated encryption with keyword search (PAEKS) introduces authentication into PEKS, which further improves security. However, most existing PAEKS schemes are designed based on foreign cryptographic algorithms, which does not meet the needs of independent cryptographic innovation in China. This study proposes an SM9-PAEKS scheme, which effectively improves user-side retrieval efficiency by redesigning the algorithm structure and offloading time-consuming operations to a resource-rich cloud server. The security of the scheme is proved in the random oracle model under the q-BDHI and Gap-q-BCCA1 security assumptions. Finally, theoretical analysis and experimental results show that, compared with the scheme with the optimal communication cost among similar schemes, SM9-PAEKS reduces the total computational overhead by at least 59.34% with only 96 bytes of additional communication cost, and the reduction in the computational overhead of keyword trapdoor generation is particularly significant, about 77.55%. This study not only helps to enrich the applications of national cryptographic algorithms but also provides theoretical and technical support for data encryption and retrieval in cloud storage.
Abstract: Stochastic block models can fit the generation of various networks and mine implicit structures and potential connections within them, giving them significant advantages in community detection. General stochastic block (GSB) models discover general communities based on link communities, but they are only applicable to directed, non-attributed networks. This study proposes a degree-corrected general stochastic block (DCGSB) model for undirected attributed networks, which models both network topology and node attributes. The DCGSB model assumes that the generation of network topology and node attributes follows power-function distributions, and node degrees are introduced to characterize the scale-free property of networks, allowing the model to better fit the generation of real networks. The expectation-maximization algorithm is employed to estimate the parameters of the DCGSB model, and node-community memberships are obtained by hard partition to complete community detection. Experiments are conducted on three real attributed-network datasets with different network structures, and the proposed model is compared with ten existing community detection algorithms. The results show that the DCGSB model not only inherits the advantages of GSB models in identifying general communities but also outperforms the ten algorithms in community detection thanks to the introduction of attribute information and node degrees.
Abstract: Cloud storage auditing guarantees the security of data stored in the cloud, enabling data owners to easily verify the integrity of data. However, a vast amount of data in the cloud can lead to significant computational overhead during cloud storage auditing when verifying data integrity and modifying data ownership. To solve this problem as well as provide practical solutions, this study proposes a dynamic auditing scheme for cloud storage with efficient data ownership sharing. An efficient validation structure is constructed to aggregate information for data verification, avoiding a large number of bilinear pairing operations that incur high computational costs. An efficient mechanism for data ownership sharing is designed based on the chameleon hash function’s ability to generate new collisions. It allows updating the secret key of the corresponding user for shared data ownership without modifying the ciphertexts stored in the cloud. In addition, the proposed scheme achieves fine-grained data sharing, encrypted data auditing, and dynamic data modification. The security and performance analyses show that the proposed scheme ensures the security of data in the cloud without affecting its performance, which means it is a practical scheme.
Abstract: Multi-label text classification aims to assign several predefined labels or categories to text. To fully explore the correlations among labels, current methods typically utilize a label relation graph and integrate it with graph neural networks to obtain the representations of label features. However, such methods often overly rely on the initial graph construction, overlooking the inherent label correlations in the current text. Consequently, classification results heavily depend on the statistics of datasets and may overlook label-related information within the text. Therefore, this study proposes an algorithm for multi-label text classification based on feature-fused dynamic graph networks. It designs dynamic graphs to model label correlations within the current text and integrates feature fusion with graph neural networks to form label representations based on the current text, thus achieving more accurate multi-label text classifications. Experimental results on three datasets demonstrate the effectiveness and feasibility of the proposed model as it shows excellent performance in multi-label text classifications.
Abstract: With the development of deep learning technologies such as Transformer-based pre-trained models, large language models (LLMs) have shown excellent comprehension and creativity. They not only have an important impact on downstream tasks such as abstractive summarization, dialogue generation, machine translation, and data-to-text generation but also show promising applications in multimodal fields such as image description and visual narratives. While LLMs have significant performance advantages, deep learning-based LLMs are susceptible to hallucinations, which may reduce system performance and even seriously affect the trustworthiness and broad application of LLMs. The accompanying legal and ethical risks have become the main obstacles to their further development and deployment. Therefore, this survey provides an extensive investigation and technical review of hallucinations in LLMs. Firstly, the hallucinations in LLMs are systematically summarized, and their origins and causes are analyzed. Secondly, a systematic overview of hallucination evaluation and mitigation is provided, in which the evaluation and mitigation methods for different tasks are categorized and thoroughly compared. Finally, the future challenges and research directions concerning hallucinations in LLMs are discussed from the perspectives of evaluation and mitigation.
Abstract: Multimodal information extraction is the task of extracting structured knowledge from unstructured or semi-structured multimodal data (such as text and images). It includes multimodal named entity recognition, multimodal relation extraction, and multimodal event extraction. This study analyzes multimodal information extraction tasks and identifies the component shared by the above three subtasks, namely the multimodal representation and fusion module. Moreover, it reviews the commonly used datasets and mainstream research methods of the three subtasks. Finally, it outlines research trends in multimodal information extraction and analyzes the existing problems and challenges in this field to provide a reference for future research.
Abstract: With the significant success of deep learning in fields such as computer vision and natural language processing, researchers in software engineering have begun to explore its integration into solving software engineering tasks. Existing research indicates that deep learning exhibits advantages in various code-related tasks, such as code retrieval and code summarization, that traditional methods and machine learning cannot match. Deep learning models trained for code-related tasks are referred to as deep code models. However, similar to natural language processing and image processing models, the security of deep code models faces numerous challenges due to the vulnerability and inexplicability of neural networks. It has become a research focus in software engineering. In recent years, researchers have proposed numerous attack and defense methods for deep code models. Nevertheless, there is a lack of a systematic review of research on deep code model security, hindering the rapid understanding of subsequent researchers in this field. To provide a comprehensive overview of the current research, challenges, and latest findings in this field, this study collects 32 relevant papers and categorizes existing research results into two main classes: backdoor attack and defense techniques, and adversarial attack and defense techniques. This study systematically analyzes and summarizes the collected papers based on the above two categories. Subsequently, it outlines commonly used experimental datasets and evaluation metrics in this field. Finally, it analyzes key challenges in this field and suggests feasible future research directions, aiming to provide valuable guidance for further advancements in the security of deep code models.
Abstract: Deep learning has yielded remarkable achievements in many computer vision tasks. However, deep neural networks typically require a large amount of training data to prevent overfitting, while in practical applications labeled data may be extremely limited. Data augmentation has thus become an effective way to enhance the adequacy and diversity of training data and is an essential step in successfully applying deep learning models to image data. This study systematically reviews image data augmentation methods and proposes a new taxonomy that provides a fresh perspective for studying image data augmentation. The advantages and limitations of the methods in each category are introduced, and their underlying ideas and application value are elaborated. In addition, commonly used public datasets and performance evaluation metrics are presented for three typical computer vision tasks: semantic segmentation, image classification, and object detection. Experimental comparisons of data augmentation methods are conducted on these three tasks. Finally, the current challenges and future development trends of data augmentation are discussed.
Abstract: Dynamic searchable symmetric encryption has attracted much attention because it allows users to securely search and dynamically update encrypted documents stored on a semi-trusted cloud server. However, most searchable symmetric encryption schemes only support single-keyword search and fail to achieve conjunctive search while protecting forward and backward privacy. In addition, most schemes are not robust, meaning they cannot handle irrational update requests from a client, such as repeatedly adding or deleting a certain keyword/file-identifier pair, or deleting a non-existent keyword/file-identifier pair. To address these challenges, this study proposes RFBC, a robust conjunctive dynamic symmetric searchable encryption scheme that preserves both forward and backward privacy. In this scheme, the server constructs two Bloom filters for each keyword, which store the relevant hash values of the keyword/file-identifier pairs to be added and deleted, respectively. When the client sends update requests, the server uses the two Bloom filters to detect and filter irrational update requests, thereby guaranteeing the robustness of the scheme. In addition, by combining the status information of the least frequent keyword among the queried keywords, the Bloom filters, and the update counter, RFBC realizes conjunctive search by filtering out file identifiers that do not contain the remaining keywords. Finally, by defining the leakage function, RFBC is proved to be forward private and Type-III backward private through a series of security analyses. Experimental results show that, compared with related schemes, RFBC greatly improves computation and communication efficiency. Specifically, the computational overhead of update operations in RFBC is about 28% and 61.7% of that in ODXT and BDXT, respectively; the computational overhead of search operations is about 21.9% and 27.3% of that in ODXT and BDXT, respectively; and the communication overhead of search operations is about 19.7% and 31.6% of that in ODXT and BDXT, respectively. Moreover, as the proportion of irrational updates increases, RFBC exhibits a significantly larger improvement in search efficiency over both BDXT and ODXT.
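A simplified sketch of the per-keyword Bloom-filter robustness check is shown below; the hash functions, filter sizes, and update-request format are assumptions for illustration rather than the RFBC construction.

```python
# Per-keyword Bloom filters record added and deleted pairs so that the server
# can reject update requests that make no sense.
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}|{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def contains(self, item):
        return all(self.bits[p] for p in self._positions(item))

def is_rational_update(op, keyword, file_id, added, deleted):
    """added/deleted: Bloom filters over pairs previously added/deleted for this keyword."""
    pair = f"{keyword}/{file_id}"
    if op == "add":   # rational only if the pair is not currently present
        return not added.contains(pair) or deleted.contains(pair)
    if op == "del":   # rational only if the pair currently exists
        return added.contains(pair) and not deleted.contains(pair)
    return False

added, deleted = BloomFilter(), BloomFilter()
added.add("w1/f3")
print(is_rational_update("add", "w1", "f3", added, deleted))  # False: already present
print(is_rational_update("del", "w1", "f9", added, deleted))  # False: never added
```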
Abstract: Code comment generation is an important research task in software engineering. Mainstream comment generation methods train deep learning models to generate comments and rely on metrics such as BLEU to evaluate comment quality on open code comment datasets. These evaluations mainly reflect the similarity between generated comments and the manual reference comments in the datasets. However, the quality of the manual reference comments in open comment datasets varies widely, which has led to growing doubts about the effectiveness of these metrics. Therefore, for code comment generation tasks, there is an urgent need for direct and effective methods to evaluate code comment quality. Such methods can improve the quality of open comment datasets and enhance the evaluation of generated comments. This study surveys and analyzes existing quantifiable methods for code comment quality evaluation and applies a set of multi-dimensional metrics to directly evaluate the quality of code comments in mainstream open datasets, comments generated by traditional methods, and comments generated by ChatGPT. The study reveals the following findings. 1) The quality of code comments in mainstream open datasets needs improvement, with issues such as inaccuracy, poor readability, excessive simplicity, and a lack of useful information. 2) Comments generated by traditional methods are more lexically and semantically similar to the code but lack information that is most useful to developers, such as the high-level intention of the code. 3) One important reason for the low BLEU scores of generated comments is the large number of poor-quality reference comments in the datasets, which lack relevance to the code or exhibit poor naturalness; such reference comments should be filtered or improved. 4) Comments generated by LLMs such as ChatGPT are rich in content but tend to be lengthy, and their quality evaluation needs to be tailored to developer intentions and specific scenarios. Based on these findings, this study provides several suggestions for future research on code comment generation and comment quality evaluation.
Abstract: Software concept drift means that the structure and composition of the same type of software change over time. In malware classification, concept drift means that the structural and compositional characteristics of malware samples from the same family can change over time, causing the performance of fixed-mode malware classification algorithms to decline over time. Existing static malware classification methods suffer significant performance degradation under concept drift, making it difficult to meet the needs of practical applications. To address this problem, and given the commonalities between natural language understanding and binary byte stream analysis, a highly accurate and robust malware classification method is proposed based on BERT and a custom autoencoder architecture. The method extracts execution-oriented malware opcode sequences through disassembly analysis to reduce redundant information. It then uses BERT to capture the contextual semantics of the sequences and perform vector embedding, thereby effectively understanding the deep program semantics of malware samples, and selects effective task-related features through geometric median subspace projection and bottleneck autoencoders. Finally, a classifier composed of fully connected layers outputs the classification results. The practical effectiveness of the proposed method is validated through comparative experiments with nine state-of-the-art malware classification methods in both normal and concept drift scenarios. Experimental results show that the proposed method achieves an F1 score of 99.49% in normal scenarios, outperforming the nine methods, and in concept drift scenarios its F1 score is 10.78% to 43.71% higher than those of the nine methods.
Abstract: With the continuous development of computer vision and artificial intelligence (AI) in recent years, embodied AI has received widespread attention from academia and industry worldwide. Embodied AI emphasizes that an agent should actively obtain real feedback from the physical world by interacting with the environment in a contextualized way and become more intelligent by learning from that feedback. As one of the concrete tasks of embodied AI, object goal navigation requires an agent to search for and navigate to a specified object goal (e.g., find a sink) in a previously unknown, complex, and semantically rich scenario. Object goal navigation has great potential for applications in smart assistants that support daily human activities and serves as a fundamental, antecedent task for other interaction-based embodied AI research. This study systematically classifies current research on object goal navigation. Firstly, background knowledge on environment representation and autonomous visual exploration is introduced, and existing object goal navigation methods are classified and analyzed from three perspectives. Secondly, two categories of higher-level object rearrangement tasks are introduced, along with datasets for realistic indoor environment simulation, evaluation metrics, and a generic training paradigm for navigation strategies. Finally, the performance of existing object goal navigation strategies is compared and analyzed on different datasets, the challenges in this field are summarized, and development trends are discussed.
Abstract: With the advent of the big data era, massive volumes of user data have empowered numerous data-driven industry applications, such as smart grids, intelligent transportation, and product recommendation. In scenarios where real-time data is crucial, the business value embedded in data diminishes rapidly over time. Consequently, data analysis systems require high throughput and low latency. Big data stream processing systems, exemplified by Apache Flink, have been widely applied. Flink enhances system throughput by parallelizing computing tasks across cluster nodes. However, current research indicates that Flink has weak single-node performance and poor cluster scalability. To improve the throughput of stream processing systems, researchers have focused on optimizing the design of control planes, the implementation of system operators, and vertical scalability. However, the data flow in streaming analysis applications has received little attention. These applications are driven by event streams and employ stateful processing functions, as in low-voltage detection in smart grids and advertising recommendation. This study analyzes the data flow characteristics of typical streaming analysis applications, identifies three bottlenecks that limit scalability, and proposes corresponding optimization strategies: a key-level watermark strategy, a dynamic load distribution strategy, and a key-value based exchange strategy. Based on these strategies, this study implements Trilink on top of Flink and applies it to applications such as low-voltage detection, bridge arch crown monitoring, and the Yahoo Streaming Benchmark. Experimental results show that the modified system, Trilink, achieves more than a 5-fold increase in throughput in a single-machine environment and over a 1.6-fold improvement in horizontal scaling acceleration in an 8-node setup, compared with Flink.
Abstract: Federated learning, a framework for training global machine learning models through distributed iterative collaboration without sharing private data, has gained prevalence. FedProto, a widely used federated learning approach, employs abstract class prototypes, termed feature maps, to enhance model convergence speed and generalization capacity. However, this approach overlooks verification of the accuracy of the aggregated feature maps, risking model training failures caused by incorrect feature maps. This study investigates a feature map poisoning attack on FedProto, revealing that malicious actors can degrade inference accuracy by up to 81.72% by tampering with training data labels. To counter such attacks, a dual defense mechanism based on knowledge distillation and feature map validation is proposed. Experimental results on real datasets demonstrate that this defense strategy can improve the inference accuracy of the compromised model by 1 to 5 times, with only a marginal 2% increase in running time.
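One possible form of the feature map validation step is sketched below: client-submitted class prototypes are screened against a robust (median) reference before aggregation. The distance measure and threshold are illustrative assumptions, and the knowledge distillation component is omitted.

```python
# Screen client-submitted class prototypes against a robust reference.
import numpy as np

def validate_prototypes(protos, tau=0.3):
    """protos: (num_clients, dim) prototypes reported for one class.
    Returns the indices of accepted clients."""
    reference = np.median(protos, axis=0)           # robust coordinate-wise reference
    ref_norm = np.linalg.norm(reference) + 1e-12
    accepted = []
    for i, p in enumerate(protos):
        cos = float(p @ reference) / (np.linalg.norm(p) * ref_norm + 1e-12)
        if 1.0 - cos <= tau:                        # small cosine distance to the reference
            accepted.append(i)
    return accepted

honest = np.random.normal(1.0, 0.1, size=(9, 16))
poisoned = np.random.normal(-1.0, 0.1, size=(1, 16))       # e.g., produced by label tampering
print(validate_prototypes(np.vstack([honest, poisoned])))  # client index 9 is rejected
```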
Abstract: Reinforcement learning has achieved remarkable results in decision-making tasks like intelligent dialogue systems, yet its efficiency diminishes notably in scenarios with intricate structures and scarce rewards. Researchers have integrated the skill discovery framework into reinforcement learning, aiming to maximize skill disparities to establish policies and boost agent performance in such tasks. However, the constraint posed by the limited diversity of sampled trajectory data confines existing skill discovery methods to learning a single skill per reinforcement learning episode. Consequently, this limitation results in subpar performance in complex tasks requiring sequential skill combinations within a single episode. To address this challenge, a group-wise contrastive learning based sequence-aware skill discovery method (GCSSD) is proposed, which integrates contrastive learning into the skill discovery framework. Initially, to augment trajectory data diversity, the complete trajectories interacting with the environment are segmented and grouped, employing contrastive loss to learn skill embedding representations from grouped trajectories. Subsequently, skill policy training is conducted by combining the skill embedding representation with reinforcement learning. Lastly, to enhance performance in tasks featuring diverse sequential skill combinations, the sampled trajectories are segmented into skill representations and embedded into the learned policy network, facilitating the sequential combination of learned skill policies. Experimental results demonstrate the efficacy of the GCSSD method in tasks characterized by sparse rewards and sequential skill combinations, showcasing its capability to swiftly adapt to tasks with varying sequential skill combinations using learned skills.
Abstract: The rise of video platforms has led to the rapid dissemination of videos, integrating them into various aspects of social life. Videos transmitted over the network may contain harmful content, so cyberspace security supervision urgently needs to accurately identify harmful videos that are transmitted in encrypted form. Existing methods collect traffic data at main network access points to extract the features of encrypted video traffic and identify harmful videos by matching traffic features against harmful video databases. However, with the evolution of encryption protocols for video transmission, HTTP/2, which uses new multiplexing technologies, has been widely deployed, and traditional traffic analysis methods based on HTTP/1.1 features fail to identify encrypted videos delivered over HTTP/2. Moreover, current research mostly focuses on videos played at a fixed resolution, and few studies have considered the impact of resolution switching on video identification. To address the above problems, this study analyzes the factors that cause offsets in the length of audio/video data during HTTP/2 transmission and proposes a method to precisely reconstruct corrected fingerprints for encrypted videos by calculating the size of combined audio and video segments in the encrypted traffic. The study also proposes an encrypted video identification model based on the hidden Markov model and the Viterbi algorithm, which uses the corrected fingerprints of encrypted videos and a large plaintext video fingerprint database. The model applies dynamic programming to handle adaptive video resolution switching. The proposed model achieves identification accuracies of 98.41% and 97.91% for encrypted videos with fixed and adaptive resolutions, respectively, on fingerprint databases with about 400 000 entries built from Facebook and Instagram. The study validates the generality and generalizability of the proposed method on three further video platforms: Triller, Twitter, and Mango TV. The practical value of the proposed method is further demonstrated through comparisons with similar work in terms of recognition effectiveness, generalization, and time overhead.
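The Viterbi decoding step can be sketched generically as follows; constructing the fingerprint-derived hidden Markov model from corrected segment lengths is omitted, and all probabilities are assumed inputs rather than values from the paper.

```python
# Generic log-space Viterbi decoding; observations are integer indices into a
# discretized set of corrected segment lengths.
import numpy as np

def viterbi(log_start, log_trans, log_emit, observations):
    """log_start: (S,), log_trans: (S, S), log_emit: (S, O); returns the most
    likely hidden state path and its log probability."""
    S, T = len(log_start), len(observations)
    dp = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    dp[0] = log_start + log_emit[:, observations[0]]
    for t in range(1, T):
        for s in range(S):
            scores = dp[t - 1] + log_trans[:, s]
            back[t, s] = int(np.argmax(scores))
            dp[t, s] = scores[back[t, s]] + log_emit[s, observations[t]]
    path = [int(np.argmax(dp[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(dp[-1]))
```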
Abstract: Machine learning has become increasingly prevalent in daily life. Various machine learning methods have been proposed to make predictions from historical data, making people's lives more convenient. However, machine learning comes with a significant challenge: privacy leakage. Merely deleting a user's data from the training set is not sufficient to avoid privacy leakage, as the trained model may still harbor this information. The conventional approach to this problem is to retrain the model on a new training set that excludes the user's data. However, retraining can be costly, prompting the search for more efficient ways to "unlearn" specific data while yielding a model comparable to a retrained one. This study summarizes the current literature on this topic, categorizing existing unlearning methods into three groups: training-based, editing-based, and generation-based methods. Additionally, various metrics for assessing unlearning methods are introduced. The study also evaluates current unlearning methods in deep learning and concludes with future research directions in this field.
Abstract: As cities continue to grow, urban transportation systems face more and more challenges, such as traffic congestion and traffic safety. Traffic simulation is one way to address urban traffic problems: it uses virtual-real integrated computing technologies to process real-time traffic data and optimize urban traffic efficiency, and it is an important means of realizing the parallel-city theory in intelligent transportation. However, traditional computing systems often suffer from insufficient computing resources and long simulation delays when running large-scale urban traffic simulations. To solve these problems, this study proposes a parallel traffic simulation algorithm for parallel cities based on the parallel-city theory and the heterogeneous architecture of China's new-generation Tianhe supercomputer. The algorithm accurately simulates traffic elements such as vehicles, roads, and traffic signals, and applies road network partitioning, parallel vehicle driving, and parallel signal-light control to achieve high-performance traffic simulation. The algorithm runs on the Tianhe supercomputing platform with 16 nodes and more than 25 000 cores and simulates real traffic scenarios involving 2.4 million vehicles, 7 797 intersections, and 170 000 lanes within the Fifth Ring Road of Beijing. Compared with traditional single-node simulation, the proposed algorithm reduces the simulation time of each step from 2.21 s to 0.37 s, achieving nearly 6-fold acceleration. An urban traffic simulation at the scale of one million vehicles has thus been successfully implemented on a domestic heterogeneous supercomputing platform.
Abstract: Traffic flow prediction is an important foundation and a hot research direction for traffic management in intelligent transportation systems (ITS). Traditional traffic flow prediction methods typically rely on a large amount of high-quality historical observation data to achieve accurate predictions, but their accuracy decreases significantly in the more common scenario of data scarcity in traffic networks. To address this problem, a transfer learning model based on spatial-temporal graph convolutional networks (TL-STGCN) is proposed, which leverages traffic flow features from a data-rich source network to assist in predicting future traffic flow in a data-scarce target network. Firstly, a spatial-temporal graph convolutional network with temporal attention is employed to learn the spatial and temporal features of the traffic flow data in both the source and target networks. Secondly, domain-invariant spatial-temporal features are extracted from the representations of the two networks using transfer learning techniques. Lastly, these domain-invariant features are used to predict future traffic flow in the target network. To validate the effectiveness of the proposed model, experiments are conducted on real-world datasets. The results demonstrate that TL-STGCN outperforms existing methods, achieving the lowest mean absolute error, root mean square error, and mean absolute percentage error, which shows that TL-STGCN provides more accurate traffic flow predictions for data-scarce traffic networks.
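The abstract does not specify how domain invariance is enforced; one common choice, sketched below under that assumption, is to penalize the maximum mean discrepancy (MMD) between source- and target-network feature batches produced by a shared encoder.

```python
# Penalize the discrepancy between source and target feature distributions.
import torch

def gaussian_mmd(source_feats, target_feats, sigma=1.0):
    """source_feats, target_feats: (batch, dim) tensors of encoder outputs."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    k_ss = kernel(source_feats, source_feats).mean()
    k_tt = kernel(target_feats, target_feats).mean()
    k_st = kernel(source_feats, target_feats).mean()
    return k_ss + k_tt - 2 * k_st

# Hypothetical combined objective: supervised prediction loss on the source
# network plus a weighted alignment term on the shared encoder's outputs.
# total_loss = prediction_loss + lambda_mmd * gaussian_mmd(source_feats, target_feats)
```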
Abstract: Multi-tenant cloud databases offer services more cheaply and conveniently, with advantages such as pay-per-use, on-demand scaling, automatic deployment, high availability, self-maintenance, and resource sharing. More and more enterprises and individuals are now hosting their database services on database-as-a-service (DaaS) platforms. These platforms provide services to multiple tenants in accordance with their service-level agreements (SLAs) while improving revenue for the providers. However, due to the dynamic, heterogeneous, and competitive characteristics of multiple tenants and their workloads, adaptively planning and scheduling resources according to dynamic workloads while complying with multi-tenant SLAs is a very challenging task for DaaS providers. Focusing on common types of multi-tenant cloud databases, such as relational databases, this survey first analyzes in detail the challenges faced by resource planning and scheduling of multi-tenant cloud databases and outlines the related key scientific issues. It then provides a framework of related techniques and a summary of existing research in four areas: resource planning and scheduling technologies, resource forecasting technologies, resource elastic scaling technologies, and resource planning and scheduling tools for existing databases. Lastly, this survey suggests future research directions for resource planning and scheduling technologies for multi-tenant cloud databases.
Abstract: The density-based spatial clustering of applications with noise (DBSCAN) algorithm is one of the clustering analysis methods in the field of data mining. It has a strong capability of discovering complex relationships between objects and is insensitive to noise data. However, existing DBSCAN methods only support the clustering of unimodal objects, struggling with applications involving multi-model data. With the rapid development of information technology, data has become increasingly diverse in real-life applications and contains a huge variety of models, such as text, images, geographical coordinates, and data features. Thus, existing clustering methods fail to effectively model complex multi-model data and cannot support efficient multi-model data clustering. To address these issues, in this study, a density-based clustering algorithm in multi-metric spaces is proposed. Firstly, to characterize the complex relationships within multi-model data, this study uses a multi-metric space to quantify the similarity between objects and employs aggregated multi-metric graph (AMG) to model multi-model data. Next, this study employs differential distances to balance the graph structure and leverages a best-first search strategy combined with pruning techniques to achieve efficient multi-model data clustering. The experimental evaluation on real and synthetic datasets, using various experimental settings, demonstrates that the proposed method achieves at least one order of magnitude improvement in efficiency with high clustering accuracy, and exhibits good scalability.
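A compact sketch of density-based clustering under an aggregated multi-metric distance is given below; the per-metric weights, the plain DBSCAN expansion, and the object format are illustrative, and the paper's AMG modeling, best-first search, and pruning are not reproduced.

```python
# DBSCAN-style clustering over a weighted combination of per-metric distances.
import numpy as np

def aggregated_distance(a, b, weights=(0.5, 0.5)):
    d_text = np.linalg.norm(a["text_vec"] - b["text_vec"])   # metric 1: feature vectors
    d_geo = np.linalg.norm(a["coord"] - b["coord"])          # metric 2: coordinates
    return weights[0] * d_text + weights[1] * d_geo

def dbscan(objects, eps, min_pts, dist=aggregated_distance):
    labels = [None] * len(objects)       # None = unvisited, -1 = noise
    cluster = 0
    for i in range(len(objects)):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(len(objects)) if dist(objects[i], objects[j]) <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1               # provisionally noise; may become a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # noise point absorbed as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = [k for k in range(len(objects)) if dist(objects[j], objects[k]) <= eps]
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)
        cluster += 1
    return labels

objs = [{"text_vec": np.random.rand(8), "coord": np.array(c, dtype=float)}
        for c in [(0, 0), (0.1, 0.1), (5, 5), (5.1, 5.2), (9, 0)]]
print(dbscan(objs, eps=1.5, min_pts=2))   # two small clusters and one noise point
```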
Abstract: With the popularity of mobile devices and users' growing requirements for privacy protection, user authentication on mobile devices has attracted widespread attention. Recently, the audio infrastructure of mobile devices has provided greater flexibility and scalability for the design of novel user authentication schemes with excellent performance. After surveying a large number of related works, this study first classifies acoustic sensing-based user authentication schemes on mobile devices according to their authentication metrics and sensing methods and describes the corresponding attack models. It then analyzes and compares single-authentication-metric-based and acoustic sensing-based user authentication schemes on mobile devices. Finally, in light of the problems of existing work, this study gives two metrics, security and practicability, for measuring the performance of user authentication systems and discusses future research directions.
Abstract: Safety verification of continuous dynamical systems is an important research problem, yet over the years various verification methods have remained very limited in the scale of problems they can handle. For a given continuous dynamical system, this study proposes an algorithm that generates a set of compositional probably approximately correct (PAC) barrier certificates through a counterexample-guided approach. A formal description of the infinite-time-domain safety verification problem is given in terms of probability and statistics. By establishing and solving a mixed-integer program based on the Big-M method, the barrier certificate problem is transformed into a constrained optimization problem, and nonlinear inequalities are linearized on intervals using the mean value theorem. Finally, this study implements the compositional PAC barrier certificate generator CPBC and evaluates its performance on 11 benchmark systems. The experimental results show that CPBC can successfully verify the safety of each dynamical system under different specified safety requirement thresholds. Compared with existing methods, the proposed method can more efficiently generate reliable probabilistic barrier certificates for complex or high-dimensional systems, with verified examples reaching up to hundreds of dimensions.
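For reference, one standard (non-probabilistic) formulation of the barrier-certificate conditions that a PAC variant relaxes statistically is shown below; the notation (system dynamics, initial, unsafe, and state sets) is assumed, not taken from the paper.

```latex
% One standard formulation of barrier-certificate conditions for \dot{x} = f(x)
% with initial set X_0, unsafe set X_u, and state space X; a PAC variant
% requires such conditions to hold only with high probability over samples.
\begin{align*}
  B(x) &\le 0, && \forall x \in X_0,\\
  B(x) &> 0,  && \forall x \in X_u,\\
  \nabla B(x) \cdot f(x) &\le 0, && \forall x \in \{x \in X \mid B(x) = 0\}.
\end{align*}
```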
Abstract: Wide area network (WAN) has become critical infrastructure in the 21st century, connecting new businesses, new infrastructure, and various emerging applications. In recent years, there has been an explosive growth in data volume, accompanied by the continuous emergence of new application forms such as large-scale WAN-based models, digital economy, metaverse, and holographic society. In addition, the emergence of new service architectures, such as China’s “East Data, West Computing” project, computing power networks, and data fields, has posed increasingly high requirements for the data transmission quality of WAN. For instance, WAN must deliver not only timely but also real-time services, making latency a critical deterministic metric to meet. Therefore, wide area deterministic network emerges as a new paradigm of WAN. This study systematically reviews the connotation of deterministic networks and the development of traditional technologies related to deterministic networks. It introduces new applications of wide area deterministic network, discusses their new characteristics and transmission challenges, and proposes new goals for them. Based on the aforementioned new applications, characteristics, challenges, and goals, this study summarizes the main research progress in the field of wide area deterministic network in detail and provides future research directions. It is hoped that this study will provide reference and assistance for research in this field.
Abstract: The growth of the Internet poses privacy challenges, prompting the development of anonymous communication systems, the most widely used of which is Tor (the second-generation onion router). However, the notable anonymity offered by Tor has inadvertently made it a breeding ground for criminal activities, attracting miscreants engaged in illegal trading and cybercrime. One of the most prevalent techniques for de-anonymizing Tor is passive traffic analysis, wherein anonymity is compromised by passively observing network traffic. This study delves into the fundamental concepts of Tor and traffic analysis, elucidates application scenarios and threat models, and classifies existing works into two categories: traffic identification and classification, and flow correlation. The traffic collection methods, feature extraction techniques, and algorithms of these works are then compared and analyzed. Finally, the primary challenges faced by current research in this domain are summarized, and future research directions are proposed.
Abstract: The prosperity of open-source software has spurred robust growth in the software industry and has also fostered a supply chain development model based on open-source software. Essentially, the open-source software supply chain is a complex topological network composed of the key elements of the open-source ecosystem and their interrelations. Its globalized production advantages help improve the development efficiency of the software industry. However, the open-source software supply chain also has characteristics such as intricate dependencies, widespread propagation, and an expanded attack surface, introducing new security risks. Although existing security management based on vulnerabilities and threat intelligence can achieve early warning and proactive defense, the efficiency of vulnerability handling is severely affected by delays in obtaining vulnerability threat information and by the lack of attack techniques and mitigation measures. To address these issues, a vulnerability threat intelligence sensing method for the open-source software supply chain is designed and implemented, which includes two parts. 1) Construction of a cyber threat intelligence (CTI) knowledge graph: relevant technologies are utilized to analyze and process security intelligence in real time, and in particular, the SecERNIE model and a software package naming matrix are introduced to address the challenges of vulnerability-threat correlation mining and open-source software aliasing, respectively. 2) Vulnerability risk information push: based on the software package naming matrix, software package filtering rules are established to enable real-time filtering and pushing of vulnerabilities in open-source systems. This study validates the effectiveness and applicability of the proposed method through experiments. Results show that, compared with traditional vulnerability platforms such as NVD, the proposed method advances the sensing time by an average of 90.03 days and increases the coverage rate of operating system software by 74.37%; using the SecERNIE model, the relationships between 63 492 CVE vulnerabilities and attack technique entities are mapped. Specifically, for the openEuler operating system, the traceable system software coverage rate reaches 92.76%, with 6 239 security vulnerabilities detected. This study also identifies 891 vulnerability-attack correlations in openEuler and obtains corresponding solutions that serve as a reference for vulnerability handling. Two typical attack scenarios are verified in a real attack environment, demonstrating the efficacy of the proposed method in vulnerability threat perception.
Abstract: Recently, the graph convolutional network (GCN), as a powerful graph embedding technology, has been widely applied in the field of recommendation. The main reason is that most of the information in recommender systems can be modeled as graph-structured data, and the GCN, as a deep learning model that operates on graph structures, helps explore the potential interactions between users and items in such data and thereby enhances the performance of recommender systems. Since modeling recommender systems usually requires collecting and processing a large amount of sensitive data, it may face the risk of privacy leakage. Differential privacy, a privacy protection model with a solid theoretical foundation, has been widely used in recommender systems to address personal privacy leakage. Currently, research based on differential privacy is mainly oriented to independent and identically distributed data models. However, the data in GCN-based recommender systems are highly correlated rather than independent, making existing privacy protection methods less effective. To solve this problem, this study proposes a graph convolutional collaborative filtering recommendation algorithm based on Rényi differential privacy (RDP-GCF for short), which aims to balance privacy protection and utility while ensuring the security of user-item interaction data. The algorithm first uses GCN techniques to learn the embedding vectors of users and items. The Gaussian mechanism is then used to randomize the embedding vectors, and a sampling-based method is used to amplify the privacy budget and minimize the injected differential noise, thereby improving the performance of the recommender system. Lastly, the final embedding vectors of users and items are obtained by weighted fusion and applied to the recommendation task. The proposed algorithm is validated through experiments on three publicly available datasets. The results show that, compared with existing similar methods, the proposed algorithm achieves a better balance between privacy protection and data utility.
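The perturbation step can be sketched as follows: embedding rows are clipped to bound their L2 sensitivity, Gaussian noise is added, and the privacy cost is tracked with the standard Rényi differential privacy bound of the Gaussian mechanism. The clipping bound, noise scale, and Rényi order are illustrative assumptions, and the sampling-based amplification is not shown.

```python
# Clip, perturb with Gaussian noise, and track the RDP cost of the release.
import numpy as np

def clip_rows(embeddings, clip_norm):
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))

def gaussian_perturb(embeddings, clip_norm=1.0, noise_multiplier=0.5, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    clipped = clip_rows(embeddings, clip_norm)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=clipped.shape)
    return clipped + noise

def rdp_epsilon(alpha, noise_multiplier):
    """RDP of the Gaussian mechanism at order alpha when the noise standard
    deviation equals noise_multiplier times the L2 sensitivity."""
    return alpha / (2 * noise_multiplier ** 2)

emb = np.random.randn(100, 32)
print(gaussian_perturb(emb).shape, rdp_epsilon(alpha=10, noise_multiplier=0.5))
```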
Abstract: Star-JOIN queries based on local differential privacy (LDP) have attracted a lot of attention from researchers in recent years. Existing Star-JOIN queries based on the OLH mechanism and hierarchical tree structures face issues such as privacy leakage risks at the root node and the lack of guidance on selecting an appropriate τ value for the τ-truncation mechanism. To remedy the shortcomings of the existing algorithms, this study proposes an effective Star-JOIN query algorithm, longitudinal path random response for join (LPRR-JOIN), to satisfy the requirements of LDP. In the LPRR-JOIN algorithm, full advantage is taken of the longitudinal path structure of the hierarchical tree and the GRR mechanism to propose an algorithm called LPRR to perturb users’ tuples. This algorithm utilizes the combinations of nodes along the longitudinal paths of all attributes as the perturbation domain. In the LPRR-JOIN algorithm, tuples are mapped by each user to corresponding node combinations, followed by local perturbation of the mapped tuples using the GRR mechanism. To guard against frequency attacks on the fact table, the algorithm permits users to locally truncate the count of their tuples based on a threshold τ, where tuples are deleted if their count exceeds τ and supplemented if it falls below τ. Two solutions are proposed within LPRR-JOIN to compute the optimal τ value. The first is to solve the optimization equation over bias caused by τ-truncation and perturbation variance due to LPRR. The second is to obtain the distribution of τ under the constraints of LDP and compute the median value from the distribution. The LPRR-JOIN algorithm employs an overall error function constructed from the bias and perturbation variance resulting from τ-truncation to derive an optimal τ value through the optimization of the error objective function. Additionally, by integrating a user grouping strategy, the algorithm ascertains the overall distribution of τ values and identifies a suitable τ value using the median. When compared with current algorithms across three diverse multi-relational datasets, experimental outcomes demonstrate the superiority of the LPRR-JOIN algorithm in query response performance.
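The GRR perturbation at the core of LPRR can be sketched for a generic finite domain as follows; in LPRR the domain would be the node combinations along the longitudinal paths, while the domain, privacy budget, and user data below are illustrative assumptions.

```python
import math
import random

def grr_perturb(value, domain, epsilon, rng=random):
    """Generalized randomized response (GRR) over a finite domain: keep the
    true value with probability p = e^eps / (e^eps + d - 1), otherwise report
    a uniformly chosen different value, which satisfies epsilon-LDP."""
    d = len(domain)
    p = math.exp(epsilon) / (math.exp(epsilon) + d - 1)
    if rng.random() < p:
        return value
    return rng.choice([v for v in domain if v != value])

def grr_estimate_counts(reports, domain, epsilon):
    """Unbiased frequency estimation from GRR reports."""
    d, n = len(domain), len(reports)
    p = math.exp(epsilon) / (math.exp(epsilon) + d - 1)
    q = (1 - p) / (d - 1)
    raw = {v: 0 for v in domain}
    for r in reports:
        raw[r] += 1
    return {v: (raw[v] - n * q) / (p - q) for v in domain}

# Toy example: each "node combination" along a longitudinal path is one domain value.
domain = ["n0", "n1", "n2", "n3"]
random.seed(0)
true_values = [random.choice(domain[:2]) for _ in range(5000)]
reports = [grr_perturb(v, domain, epsilon=2.0) for v in true_values]
print(grr_estimate_counts(reports, domain, epsilon=2.0))
```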
Abstract: With the rapid development of quantum computing, especially the optimization and progress of the Shor quantum algorithm and its variants, the current classical public-key cryptography based on factoring large integers and discrete logarithm problems is facing serious security threats. To cope with quantum attacks, post-quantum cryptography has been proposed, among which lattice-based cryptography is commonly viewed as the most promising one due to its outstanding performance in security, bandwidth, and efficiency. Most of the existing lattice-based post-quantum cryptographic schemes use cyclotomic rings, especially power-of-two cyclotomic rings, as their underlying algebraic structures. However, targeted attacks against cyclotomic rings have been proposed, exploiting subfields, small Galois groups, and ring homomorphisms in these rings. This study uses the large-Galois-group prime-degree prime-ideal field as the new underlying algebraic structure, which has the characteristics of high security, prime order, a large Galois group, and an inert modulus. First, this study proposes a post-quantum digital signature scheme based on the large-Galois-group prime-degree prime-ideal field, named Dilithium-Prime, and recommended parameter sets are provided. Next, considering that the traditional number theoretic transform (NTT) algorithm cannot be used to multiply polynomials efficiently in the large-Galois-group prime-degree prime-ideal field, this study designs efficient polynomial multiplication strategies for Dilithium-Prime, including NTT for the large-Galois-group prime-degree prime-ideal field and small polynomial multiplication. Finally, this study provides a portable C language implementation of Dilithium-Prime, along with the implementation details and constant-time implementation techniques, and compares Dilithium-Prime with other lattice-based digital signature schemes. The experimental results show that the public key size, secret key size, and signature size of Dilithium-Prime are reduced by 1.8%, 10.2%, and 1.8%, respectively, compared to those of CRYSTALS-Dilithium. The efficiency of the signature algorithm is improved by 11.9%, while the key generation algorithm and the verification algorithm are 2.0× and 2.5× slower than those of CRYSTALS-Dilithium, respectively. However, Dilithium-Prime can withstand the cryptographic attacks against cyclotomic rings, which is exactly what CRYSTALS-Dilithium lacks. Compared to NCC-Sign, Dilithium-Prime's key generation algorithm, signature algorithm, and verification algorithm are 4.2×, 35.3×, and 7.2× faster, respectively, under the same security level and bandwidth.
Abstract: Advanced persistent threat (APT) is a novel form of cyberattack that is well-organized, stealthy, persistent, adversarial, and destructive, resulting in catastrophic consequences for global network security. Traditional APT attack defenses tend to construct models to detect whether attacks are malicious or to identify their malicious family categories, primarily employing a passive defense strategy and lacking comprehensive and in-depth exploration of APT attack attribution and inference. In light of this, this study surveys intelligent methods for APT attack attribution and inference. Firstly, an overall defense chain framework for APT attacks is proposed, which can effectively distinguish and correlate APT attack detection, attribution, and inference. Secondly, the work related to the four tasks of APT attack detection is reviewed in detail. Thirdly, APT attack attribution research is systematically summarized for regions, organizations, attackers, addresses, and attack models. Then, APT attack inference is divided into four aspects: attack intent inference, attack path perception, attack scenario reconstruction, and attack blocking and countermeasures, and relevant works are summarized and compared in detail. Finally, the hot topics, development trends, and challenges in the field of APT attack defense are discussed.
Abstract: Dynamic information networks (DIN), which contain evolving objects in the real world and the links among them, are often modeled as a series of static undirected graph snapshots. A community consists of a group of well-connected objects in an information network. In a DIN, there is often a community whose size increases over time while its members remain well-connected throughout that period. The evolving trajectory of such a community over time forms a sequence of the community on several snapshots of the DIN, which is termed a lasting enlarging community sequence in this study. It is meaningful to search for lasting enlarging community sequences in a DIN. However, no previous research has paid attention to such community sequences. This study formally defines the q-based lasting enlarging community sequence (qLEC) in a DIN by combining set inclusion with the triangle-connected k-truss model. A two-phase search algorithm is developed, which includes computing candidate vertex sets of communities from the beginning to the end of the time window and performing community sequence search from the end to the beginning of the time window. This study also provides optimization strategies based on early termination and TCP index compression to reduce time and space costs. Extensive experiments demonstrate that the qLEC model has practical significance compared to existing dynamic community models. The two-phase search algorithm effectively finds qLEC-based lasting enlarging community sequences, and the proposed optimization strategies significantly reduce the spatiotemporal cost of the two-phase algorithm.
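The triangle-connected k-truss model used by qLEC builds on standard k-truss peeling, sketched below: edges whose support (the number of triangles through the edge) falls under k-2 are removed until the remainder stabilizes. The toy graph and k are illustrative, and the set-inclusion and triangle-connectivity constraints of the qLEC definition are not reproduced here.

```python
from itertools import combinations

def k_truss(edges, k):
    """Return the edge set of the k-truss: iteratively remove edges whose
    support (number of triangles containing the edge) is below k - 2."""
    adj, E = {}, set()
    for u, v in edges:
        if u == v:
            continue
        E.add((min(u, v), max(u, v)))
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    changed = True
    while changed:
        changed = False
        for (u, v) in list(E):
            if len(adj[u] & adj[v]) < k - 2:   # support too low: peel the edge
                E.remove((u, v))
                adj[u].discard(v)
                adj[v].discard(u)
                changed = True
    return E

# Toy snapshot: a 4-clique plus a pendant edge; the 4-truss keeps only the clique.
edges = list(combinations([1, 2, 3, 4], 2)) + [(4, 5)]
print(sorted(k_truss(edges, k=4)))
```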
Abstract: The minimum weakly connected dominating set problem is a classic NP-hard problem with wide applications in various fields. This study proposes an efficient local search algorithm to solve this problem. The algorithm employs a method to construct an initial solution based on locked vertices and frequency feedback. This method ensures the inclusion of vertices that are certain or highly likely to be in the optimal solution, resulting in a high-quality initial solution. Furthermore, the study introduces a method to avoid cycling based on two-hop configuration checking, age properties, and tabu strategies. A perturbation strategy is also proposed to enable the algorithm to effectively escape from local optima. Additionally, effective vertex selection methods are presented to assist the algorithm in choosing vertices suitable for addition to or removal from the candidate solution by combining two scoring functions, Dscore and Nscore, with the strategies for avoiding cycling. Finally, the proposed local search algorithm is evaluated on four benchmark test instances and compared with four state-of-the-art algorithms and the CPLEX solver. Experimental results demonstrate that the proposed algorithm achieves better performance.
Abstract: In the Q1 model, this study proposes low-data quantum key-recovery attacks against Lai-Massey structures, Misty structures, Type-1 generalized Feistel structures, SMS4-like generalized Feistel structures, and MARS-like generalized Feistel structures. The attack only needs to select a constant number of plaintext-ciphertext pairs, analyze the encryption process of the block cipher structure, and recover the key by searching and calculating some intermediate states and round keys using Grover's algorithm. The attack belongs to the Q1 model, which is more practical than the Q2 model since no quantum superposition query is required. For the 3-round Lai-Massey structure, compared with other quantum attacks, this attack requires only $O(1)$ data, belongs to the Q1 model, and even reduces the complexity product (time × data × classical memory × quantum bits) by a factor of $n2^{n/4}$. For the 6-round Misty structure, this attack still retains the advantage of low data complexity, and for the 6-round Misty L/R-FK structure in particular, it reduces the complexity product by a factor of $2^{n/2}$. For the 9-round 3-branch Type-1 generalized Feistel structure, this attack matches other quantum attacks in the complexity product while retaining the advantage of low data complexity and remaining a chosen-plaintext attack. In addition, low-data quantum key-recovery attacks on SMS4-like generalized Feistel structures and MARS-like generalized Feistel structures are also given in this study, complementing their security evaluation in the Q1 model.
Abstract: In the field of federated learning, incentive mechanisms play a crucial role in enticing high-quality data contributors to engage in federated learning and acquire superior models. However, existing research in federated learning often neglects the potential misuse of these incentive mechanisms. Specifically, participants may manipulate their locally trained models to dishonestly maximize their rewards. This issue is thoroughly examined in this study. Firstly, the problem of reward fraud in federated learning is clearly defined, and the concept of the reward-cost ratio is introduced to assess the effectiveness of various reward fraud techniques and defense mechanisms. Following this, an attack method named the "gradient scale-up attack" is proposed, focusing on manipulating model gradients to exploit the incentive system. This attack calculates corresponding scaling factors and utilizes them to increase the contribution of the local model to gain more rewards. Finally, an efficient defense mechanism is proposed, which identifies malicious participants by examining the L2-norms of model updates, effectively thwarting gradient scale-up attacks. Through extensive analysis and experimental validation on datasets such as MNIST, the findings of this research demonstrate that the proposed attack method significantly increases rewards, while the corresponding defense method effectively mitigates fraudulent behavior by participants.
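A minimal sketch of the gradient scale-up attack and the L2-norm screening defense described above, assuming a toy setting where update magnitude drives the measured contribution; the scaling factor, the median-based threshold, and the single-layer updates are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def scale_up_update(update, factor):
    """Gradient scale-up attack: multiply the local update by a factor so that
    contribution metrics based on update magnitude are inflated."""
    return {name: factor * g for name, g in update.items()}

def l2_norm(update):
    return float(np.sqrt(sum(np.sum(g * g) for g in update.values())))

def filter_by_norm(updates, threshold):
    """Norm-based defense: discard updates whose L2-norm exceeds a threshold
    (here a multiple of the median norm of all received updates)."""
    return [u for u in updates if l2_norm(u) <= threshold]

rng = np.random.default_rng(0)
honest = [{"w": rng.normal(scale=0.1, size=10)} for _ in range(9)]
malicious = scale_up_update({"w": rng.normal(scale=0.1, size=10)}, factor=20.0)
updates = honest + [malicious]

median_norm = float(np.median([l2_norm(u) for u in updates]))
kept = filter_by_norm(updates, threshold=3.0 * median_norm)
print(len(updates), "received,", len(kept), "kept after norm screening")
```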
Abstract: Point cloud self-supervised representation learning is conducted in an unlabeled pre-training manner, exploring the structural relationships of 3D topological geometric spaces and capturing feature representations. This approach can be applied to downstream tasks, such as point cloud classification, segmentation, and object detection. To enhance the generalization and robustness of the pretrained models, this study proposes a multi-modal self-supervised method for learning point cloud representations. The method is based on bidirectional fit mask reconstruction and comprises three main components: (1) The “bad teacher” model, guided by the inverse density scale, employs a bidirectional fit strategy that utilizes inverse density noise representation and global feature representation to expedite the convergence of the mask region towards the true value. (2) The StyleGAN-based auxiliary point cloud generation model, grounded in local geometric information, generates stylized point clouds and fuses them with mask reconstruction results while adhering to threshold constraints. The objective is to mitigate the adverse effects of noise on representation learning during the reconstruction process. (3) The multi-modal teacher model aims to enhance the diversity of the 3D feature space and prevent the collapse of modal information. It relies on the triple feature contrast loss function to fully extract the latent information contained in the point cloud-image-text sample space. The proposed method is evaluated on ModelNet, ScanObjectNN, and ShapeNet datasets for fine-tuning tasks. Experimental results demonstrate that the pretrained model achieves state-of-the-art performance in various point cloud recognition tasks, including point cloud classification, linear support vector machine classification, few-shot classification, zero-shot classification, and part segmentation.
Abstract: Service descriptions provide limited information about application scenarios, creating a gap between Mashup service component Web API recommendations based on functional similarity calculation and desired expectations. Consequently, there is a need to enhance the accuracy of function matching. While some researchers utilize collaborative associations among Web APIs to enhance recommendation compatibility, they overlook the adverse effects of functional associations on Mashup service creation, thereby limiting the enhancement of recommendation diversity. To address this issue, this study proposes a Web API recommendation method for Mashup service components that integrates latent related words and heterogeneous association compatibility. The study extracts latent related words associated with application scenarios for both Mashup requirements and Web APIs, integrating them into the generation of function vectors. By enhancing the accuracy of functional similarity matching, it obtains a high-quality candidate set of Web API components. Function association and collaboration association are modeled as heterogeneous service associations. The study utilizes heterogeneous association compatibility to replace collaboration compatibility in traditional methods, thus enhancing the recommendation diversity of Web APIs. Compared with existing methods, the proposed approach improves Recall, Precision, and NDCG by 4.17% to 16.05%, 4.46% to 16.62%, and 5.57% to 17.26%, respectively, while the diversity index ILS is reduced by 8.22% to 15.23%. The Recall and Precision values for cold-start Web API recommendation are 47.71% and 46.58% of those for non-cold-start Web API recommendation, respectively. Experimental results demonstrate that the proposed method not only enhances the quality of Web API recommendation but also yields favorable results for cold-start Web API recommendations.
Abstract: Event detection (ED) aims to detect event triggers in unstructured text and classify them into pre-defined event types, which can be applied to knowledge graph construction, public opinion monitoring, and so on. However, data sparsity and imbalance severely impair system performance and usability, and most existing methods cannot address these issues well. This is because, during detection, they regard events of different types as independent and identify or classify them through classifiers or spatial distance similarity. Some work considers the correlation between event elements under a broader category and employs multi-task learning for mutual enhancement, but it overlooks the shared properties of triggers across different event types. Research related to modeling event connections requires designing numerous rules and extensive data annotation, which leads to limited applicability and weak generalizability. Therefore, this study proposes an event detection method based on meta-attributes. It aims to learn the shared intrinsic information contained in samples across different event types, including (1) extracting type-agnostic semantics of triggers through semantic mapping from the representations of special symbols; (2) concatenating the semantic representations of triggers and samples in each event type as well as the label embedding, and inputting them into a trainable similarity measurement layer, thereby modeling a public similarity metric related to triggers and event categories. By combining these representations in a measuring layer, the proposed method mitigates the effects of data sparsity and imbalance. Additionally, a full fusion model is constructed by integrating the type-agnostic semantics into the classification method. Experiments on the ACE2005 and MAVEN datasets under various degrees of sparsity and imbalance verify the effectiveness of the proposed method and build a connection between the conventional and few-shot settings.
Abstract: In recent years, research on the interconnect technology of superconducting qubits has made important progress, providing an effective way to build a distributed computing architecture for superconducting quantum computers. The distributed superconducting architecture imposes strict constraints on the execution of quantum circuits in terms of network topology, qubit connectivity, and quantum state transfer protocols. To execute and schedule quantum circuits on a distributed architecture, a circuit mapping process is required to transform the quantum circuits to adapt to the underlying architecture and then to distribute the transformed circuits to multiple QPUs. The distributed circuit mapping process necessitates the insertion of additional quantum operations into the original circuit. Such operations, especially the inter-QPU state transfer operations, are susceptible to noise, leading to high error rates. Therefore, minimizing the number of such additional operations inserted by the mapping process is critical to improving the overall computation success rate. This study constructs an abstract model of distributed quantum computing based on the technical features of the interconnect technology of superconducting qubits and today's superconducting QPUs. Moreover, this study proposes a distributed quantum circuit mapping approach based on this abstract model. The proposed approach consists of two main components: the distributed qubit mapping algorithm and the qubit state routing algorithm. The former formulates the problem of distributing qubits to different QPUs as a combinatorial optimization problem and employs simulated annealing enhanced with local search to find the initial mapping that yields the lowest total routing cost. The latter constructs several heuristic qubit routing rules for different scenarios and integrates them systematically to minimize the additional operations inserted by the mapping process. The abstract model shields any technical details of the underlying architecture that are irrelevant to circuit mapping, which makes the mapping method applicable to a class of such networks rather than a specific one. Moreover, the approach proposed in this study can be used as an ancillary tool to design and evaluate the network topology of distributed systems. The experimental results show that, compared to the baseline approach, the proposed approach reduces the number of intra-chip operations (SWAP gates) and inter-chip operations (ST gates) by 69.69% and 85.88% on average, respectively, with a time overhead similar to that of existing algorithms.
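The simulated-annealing formulation of the distributed qubit mapping step can be sketched as follows, under the simplifying assumptions that the cost is just the count of two-qubit gates crossing QPUs and that per-QPU loads stay balanced via swap moves; the local-search enhancement and the routing-cost model of the actual approach are omitted.

```python
import math
import random

def inter_qpu_cost(assign, gates):
    """Cost model: number of two-qubit gates whose operands sit on different QPUs."""
    return sum(1 for a, b in gates if assign[a] != assign[b])

def anneal_assignment(n_qubits, n_qpus, gates, steps=20000, t0=2.0, seed=0):
    """Simulated annealing over balanced qubit-to-QPU assignments.
    Each move swaps the QPUs of two qubits, which keeps per-QPU loads fixed."""
    rng = random.Random(seed)
    assign = [i % n_qpus for i in range(n_qubits)]      # balanced initial assignment
    cost = inter_qpu_cost(assign, gates)
    best, best_cost = assign[:], cost
    for step in range(steps):
        t = t0 * (1.0 - step / steps) + 1e-3            # linear cooling schedule
        i, j = rng.randrange(n_qubits), rng.randrange(n_qubits)
        if assign[i] == assign[j]:
            continue
        assign[i], assign[j] = assign[j], assign[i]
        new_cost = inter_qpu_cost(assign, gates)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / t):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = assign[:], cost
        else:
            assign[i], assign[j] = assign[j], assign[i]  # undo rejected move
    return best, best_cost

# Toy circuit: two tightly coupled groups {0,1,2} and {3,4,5} with one cross gate.
gates = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
assignment, cost = anneal_assignment(n_qubits=6, n_qpus=2, gates=gates)
print(assignment, "inter-QPU gates:", cost)
```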
Abstract: Software traceability is considered critical to trustworthy software engineering, ensuring software reliability through the tracking of the software development process. Despite significant progress in automatic software traceability recovery techniques in recent years, their application in real-world commercial software projects does not meet expectations. An investigation into the application of learning-based software traceability recovery classifier models in commercial software projects is conducted. It uncovers three critical challenges faced in industrial settings. These challenges contribute to underperforming traceability models: low-quality raw data, data sparsity, and class imbalance. In response to these challenges, STRACE(AL+SSL) is proposed. It is a software traceability recovery framework that integrates active learning and semi-supervised learning. By strategically selecting valuable annotated samples and generating high-quality pseudo-labeled samples, STRACE(AL+SSL) effectively harnesses unlabeled data to address data-related challenges. Multiple comparative experiments are conducted with nearly one million issue-commit trace pair samples from 10 different enterprise projects. The results of these experiments validate the effectiveness of the proposed framework for real-world software traceability recovery tasks. The ablation results show that the unlabeled samples selected by the active learning in STRACE(AL+SSL) play a crucial role in the traceability recovery task. Additionally, the optimal combination of sample selection strategies in STRACE(AL+SSL) is confirmed. This includes CBST-Adjust for the semi-supervised sample rebalancing strategy and SMI_Flqmi, which is recognized for its cost-effectiveness and efficiency in active learning.
Abstract: Log-structured merge tree (LSM-tree)-based key-value storage is widely used in many applications due to its excellent read and write performance. Most existing LSM-trees utilize a multi-level structure to store data. Although the multi-level data structure can serve moderately write-intensive applications well, it is not well suited for highly write-intensive applications. This is because storing data in multiple levels introduces the write amplification problem, where new data insertion triggers the reorganization of a large portion of the data already stored in multiple levels. This huge (and sometimes frequent) data reorganization is expensive and degrades write performance in many highly write-intensive applications. In addition, the multi-level structure does not provide consistently excellent read performance for hot data, because it cannot optimize the read operation of hot data by merging overlapping ranges in a timely manner. To address the above two challenges, this study proposes LazyStore, a novel single-level LSM-tree based on a hybrid storage architecture. LazyStore solves the write amplification problem by storing data in a single logical level instead of multiple logical levels. As a result, expensive multi-level data reorganization is largely eliminated. To further improve write performance, LazyStore distributes data at the logical level to multiple storage devices, such as DRAM, NVM, and SSD, based on the capacity and read/write performance of each storage device. Furthermore, LazyStore introduces real-time merge operations to improve the read performance of hot data ranges. Experiments show that LazyStore improves write performance by 3 times and reduces write amplification by nearly 4 times compared to other multi-level LSM-trees. For hot range reads, LazyStore's real-time data merge optimization can reduce the latency of range query processing by a factor of two.
Abstract: Domain adaptation (DA) is a group of machine learning tasks where the training set (source domain) and the test set (target domain) exhibit different distributions. Its key idea lies in how to overcome the negative impact caused by these distributional differences, in other words, how to design an effective training strategy to obtain a classifier with high generalization performance by minimizing the difference between data domains. This study focuses on the tasks of unsupervised DA (UDA), where annotations are available in the source domain but absent in the target domain. This problem can be considered as how to use partially annotated data and unannotated data to train a classifier in a semi-supervised learning framework. Accordingly, two kinds of semi-supervised learning techniques, namely pseudo labels (PLs) and consistency regularization (CR), are used to augment and annotate data in the observed domain for learning the classifier, so that the classifier can obtain better generalization performance in UDA tasks. This study proposes augmentation-based UDA (A-UDA), in which the unannotated data in the target domain are augmented by random augmentation, and high-confidence data are annotated by adding pseudo-labels based on the predicted output of the model. The classifier is trained on the augmented data set. The distribution distance between the source domain and the target domain is calculated using the maximum mean discrepancy (MMD). By minimizing this distance, the classifier achieves high generalization performance. The proposed method is evaluated on multiple UDA tasks, including MNIST-USPS, Office-Home, and ImageCLEF-DA. Compared to other existing methods, it achieves better performance on these tasks.
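The MMD term used to align the source and target domains can be sketched directly from its definition with an RBF kernel; the kernel bandwidth and feature dimensions below are illustrative assumptions, and in A-UDA this quantity would be computed on learned features and minimized during training.

```python
import numpy as np

def mmd_rbf(x, y, gamma=1.0):
    """Biased empirical estimate of squared MMD with an RBF kernel:
    MMD^2 = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    def kernel(a, b):
        sq = np.sum(a * a, 1)[:, None] + np.sum(b * b, 1)[None, :] - 2 * a @ b.T
        return np.exp(-gamma * sq)
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(200, 16))   # source-domain features
target = rng.normal(0.5, 1.0, size=(200, 16))   # shifted target-domain features
print("MMD^2 (shifted):", round(float(mmd_rbf(source, target)), 4))
print("MMD^2 (same dist):", round(float(mmd_rbf(source, rng.normal(0, 1, (200, 16)))), 4))
```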
Abstract: During the path coverage testing of a message passing interface (MPI) program based on evolutionary optimization, the fitness of evolutionary individuals needs to be evaluated by repeatedly executing the MPI program. However, repeated execution of an MPI program often requires high computational costs. Therefore, this study proposes an approach to generate test cases for path coverage of MPI programs guided by surrogate-assisted multi-task evolutionary optimization, which significantly reduces the actual execution times of MPI programs, thereby improving testing efficiency. Firstly, surrogate models are trained for each target sub-path in the target path of an MPI program. Then, the fitness of evolutionary individuals is estimated using the surrogate model corresponding to each target sub-path, and a candidate set of test cases is formed. Finally, all surrogate models are updated based on the candidate set and the actual fitness for each target sub-path. The proposed approach is applied to the basis path coverage testing of seven benchmark MPI programs and compared with several state-of-the-art approaches. The experimental results show that the proposed approach significantly improves testing efficiency while ensuring high effectiveness in generating test cases.
Abstract: As both Android frameworks and malware continue to evolve, the performance of existing malware classifiers degrades significantly over time. This study proposes droid slow aging (DroidSA), a method for Android malware detection based on API clustering and call graph optimization. Firstly, API clustering is performed before malware detection to generate cluster centers that reflect API functionality. To make clustering results more accurate, this study obtains embeddings fully reflecting the semantic similarity of APIs by designing API sentences to summarize vital features such as API names and permissions and using NLP tools to mine the semantic information of API sentences. Then, call graphs are extracted from apps and optimized by removing unknown methods while preserving the connectivity among API nodes. Call graph optimization enables detection methods to extract more robust contextual information of APIs which reflects the mode of app behavior. DroidSA extracts pairs of function calls from the optimized call graphs and abstracts the APIs in the pairs into cluster centers obtained in API clustering to better adjust to the changes in Android frameworks and malware. Finally, one-hot encoding is used to generate feature vectors, and the best-performing classifier is selected from random forests, support vector machines, and the k-nearest neighbors algorithm for malware detection. Experimental results demonstrate that DroidSA achieves an average F1-Measure of 96.7% for malware detection. Under the experimental setup where temporal bias is eliminated, DroidSA trained with apps from 2012 to 2013 achieves an average F1-Measure of 82.6% when detecting malware developed from 2014 to 2018. Compared with the state-of-the-art detection methods MaMaDroid and MalScan, DroidSA stably maintains high detection metrics with minimal impact from temporal changes and effectively detects evolved malware.
Abstract: Local differential privacy (LDP) is widely used to collect and analyze sensitive data while protecting user privacy. However, it is vulnerable to data poisoning attacks by malicious users. The k-subset mechanism and the wheel mechanism are LDP schemes with optimal utility for frequency estimation. Yet, their resistance to data poisoning attacks lacks in-depth analysis and evaluation. Therefore, data poisoning attack methods are designed to assess the resistance to data poisoning attacks of both the k-subset mechanism and the wheel mechanism. First, the random perturbed-value attack and random item attack are discussed, and then the maximal gain attack methods against the k-subset mechanism and the wheel mechanism are constructed. The attack methods can be exploited to maximize the frequencies of target items selected by attackers, which is achieved by sending carefully crafted poisoning data to the data collector via fake users. Theoretically, the attack gains are rigorously analyzed and compared, and the effects of data poisoning attacks are experimentally evaluated, demonstrating their impact on the k-subset mechanism and the wheel mechanism. Finally, defensive measures are proposed to mitigate the effects of data poisoning attacks.
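A maximal-gain-style poisoning attack of the kind analyzed above can be sketched for a k-subset-style mechanism: every fake user submits a subset that contains all target items, maximizing the raw counts the aggregator observes for them. The genuine-user behavior below is simplified (the real k-subset mechanism includes the true value only with a calibrated probability), and the server-side debiasing step is omitted, so the sketch only illustrates how crafted reports inflate the target counts.

```python
import random

def craft_poisoned_report(domain, k, targets, rng=random):
    """Maximal-gain style poisoned report for a k-subset style mechanism:
    every target item is placed in the reported subset and the remaining
    slots are filled with arbitrary non-target items, so each fake user
    contributes the largest possible raw count to the targets."""
    assert len(targets) <= k <= len(domain)
    padding = [v for v in domain if v not in targets]
    rng.shuffle(padding)
    return set(targets) | set(padding[: k - len(targets)])

random.seed(0)
domain, k, targets = list(range(20)), 4, {3, 7}

# Simplified genuine users: true item plus k-1 random others.
genuine = []
for _ in range(1000):
    true_item = random.randrange(len(domain))
    others = [v for v in domain if v != true_item]
    random.shuffle(others)
    genuine.append({true_item} | set(others[: k - 1]))

fakes = [craft_poisoned_report(domain, k, targets) for _ in range(100)]
raw = {v: sum(v in r for r in genuine + fakes) for v in domain}
print({t: raw[t] for t in targets},
      "average per-item count:", round(sum(raw.values()) / len(domain), 1))
```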
Abstract: The most significant feature of social network sentiment data is its dynamic nature. To tackle public sentiment drift analysis, this study proposes a Gaussian mixture based hierarchical variational auto-encoder (GHVAE) model for detecting sentiment drifts. Specifically, the GHVAE applies a Gaussian mixture distribution as the prior of the latent distribution, which matches the multi-center property of the latent space and thereby improves model performance. Moreover, the built-in drift measurement algorithm in the original HVAE model is revised to enlarge the distances among large drift scores and improve the classification performance. Several comparative and ablation experiments are conducted to validate the performance of the GHVAE. The results indicate that the innovations in the GHVAE bring improvements in sentiment drift detection.
Abstract: Keyword-based auditing (KA) technology is a crucial measure to achieve cost-effectiveness in cloud auditing applications. Different from probabilistic auditing, which verifies outsourced data by random sampling and verification, KA considers the auditing requirements of multi-user and multi-attribute data by performing keyword searches and targeted audits, and can significantly reduce auditing costs. However, existing KA schemes usually focus only on the auditing efficiency of target data while paying little attention to remedial measures such as fault localization and data recovery after audit failures, and this lack of remediation cannot guarantee data availability. Therefore, this study proposes a keyword-based multi-cloud auditing scheme (referred to as KMCA) that leverages smart contracts to enable targeted auditing, batch fault localization, and data recovery. Specifically, the targeted auditing module defines the keyword-file mapping based on the searchable encryption index structure and employs the false-positive rate characteristic of Bloom filters to hide keyword frequency and protect privacy. The fault localization module uses a binary search approach to locate error-prone cloud servers in batches and perform fine-grained localization of corrupted data. The data recovery module formulates multi-cloud redundant storage and data recovery strategies to avoid single-point failure and improve storage fault tolerance. Under the random oracle model, KMCA is provably secure. Performance analysis shows that KMCA is feasible.
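The keyword-file mapping guarded by Bloom filters can be illustrated with a minimal filter per keyword, as sketched below; the bit-array size, number of hash probes, and keyword/file identifiers are illustrative assumptions, not KMCA's actual index structure.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes over an m-bit array; membership
    queries may yield false positives but never false negatives."""
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

# Keyword -> file-identifier mapping guarded by one Bloom filter per keyword.
index = {"cve-2024": BloomFilter(), "payroll": BloomFilter()}
index["cve-2024"].add("file_017")
index["payroll"].add("file_042")
print("file_017" in index["cve-2024"], "file_042" in index["cve-2024"])
```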
Abstract: Multi-view data depicts objects from different perspectives, with features in different views exhibiting correlated, complementary, and diverse information. Therefore, it is crucial to make full use of this information for the processing of multi-view data. However, the processing and analysis of multi-view data are difficult due to the inherent challenges of dealing with a vast number of features and the presence of noise features. Unsupervised multi-view feature selection, emerging as a critical component in multi-view data learning, efficiently learns more accurate and compact representations from the original high-dimensional multi-view data without relying on label information, thereby remarkably improving the performance of data analysis. This study reviews and categorizes existing unsupervised multi-view feature selection models based on the similarities and differences in their working mechanisms, while also detailing their limitations. Furthermore, this study points out promising future research directions in the field of unsupervised feature selection.
Abstract: As merchant review websites develop rapidly, the efficiency improvement brought by recommender systems has made rating prediction one of the emerging research tasks in recent years. Existing rating prediction methods are usually limited to collaborative filtering algorithms and various types of neural network models, and do not take full advantage of the rich semantic knowledge learned in advance by current pre-trained models. To address this problem, this study proposes a personalized rating prediction method based on pre-trained language models. The method analyzes the historical reviews of users and merchants to provide users with rating predictions as a reference before consumption. It first designs a pre-training task for the model to learn to capture key information in the text. Next, the review text is processed by a fine-grained sentiment analysis method to obtain aspect terms in the review text. Subsequently, the method designs an aspect term embedding layer to incorporate the aforementioned external domain knowledge into the model. Finally, it utilizes an information fusion strategy based on the attention mechanism to fuse the global and local semantic information of the input text. The experimental results show that the method achieves significant improvements in both automatic evaluation metrics compared to the benchmark models.
Abstract: As mobile data grows every day, how to predict wireless traffic accurately is crucial for the efficient and sensible allocation of communication and network resources. However, most existing prediction methods use a centralized training architecture, which involves large-scale traffic data transmission, leading to security issues such as user privacy leakage. Federated learning can train a global model while keeping data stored locally, which protects users' privacy and effectively reduces the burden of frequent data transmission. However, in wireless traffic prediction, the amount of data from a single base station is limited, and traffic patterns vary among different base stations, making it difficult to capture the traffic patterns and resulting in poor generalization of the global model. In addition, traditional federated learning methods employ averaging in model aggregation, ignoring the differences in client contributions, which further degrades the performance of the global model. To address the above issues, this study proposes an attention-based "intra-cluster average, inter-cluster attention" federated wireless traffic prediction model (DualICA). The model first clusters base stations based on their traffic data to better capture the traffic variation characteristics of base stations with similar traffic patterns. At the same time, a warm-up model is designed to alleviate data heterogeneity with a small amount of base station data and improve the generalization ability of the global model. The study introduces the attention mechanism in the aggregation stage to quantify the contributions of different clients to the global model and incorporates the warm-up model in the model iteration process to improve the prediction accuracy of the model. Extensive experiments are conducted on two real-world datasets (Milano and Trento), and the results show that DualICA outperforms all baseline methods. The mean absolute error performance gain over the state-of-the-art method is up to 10.1% and 9.6% on the two datasets, respectively.
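The inter-cluster attention aggregation idea can be sketched as follows: each cluster or client model receives a softmax attention weight derived from its similarity to the current global model before weighted averaging. The similarity measure, temperature, and flattened parameter vectors are illustrative assumptions, not the exact DualICA aggregation rule.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_aggregate(global_model, client_models, temperature=1.0):
    """Weight each client model by an attention score computed from its
    similarity (negative L2 distance here) to the current global model,
    then return the weighted average as the new global model."""
    dists = np.array([np.linalg.norm(c - global_model) for c in client_models])
    weights = softmax(-dists / temperature)
    new_global = np.sum([w * c for w, c in zip(weights, client_models)], axis=0)
    return new_global, weights

rng = np.random.default_rng(0)
global_model = rng.normal(size=32)                       # flattened model parameters
clients = [global_model + rng.normal(scale=s, size=32) for s in (0.1, 0.1, 1.0)]
new_global, weights = attention_aggregate(global_model, clients)
print("attention weights:", np.round(weights, 3))
```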
Abstract: The label distribution in the real world often shows the long-tail effect, where a small number of categories account for the vast majority of samples. The temporal action detection problem is no exception. The existing temporal action detection methods often focus on the head categories with a large number of samples, while neglecting the few-sample categories. This study systematically defines the long-tail temporal action detection problem and proposes a weighted class-rebalancing self-training method (WCReST) based on a semi-supervised learning framework. WCReST makes full use of the large-scale unlabeled data that exists in the real world to rebalance the label distribution in the training samples to improve the model’s fit for the tail categories. Additionally, a pseudo-label loss weighting method is proposed for the temporal action detection task to enhance the stability of model training. Experiments are conducted on the THUMOS14 and HACS Segments datasets, using video samples from the THUMOS15 and ActivityNet1.3 datasets to form corresponding unlabeled datasets. In addition, the Dance dataset is collected to meet the application requirements of video review, which includes 35 action categories, 6632 labeled videos, and 13264 unlabeled videos, preserving the significant long-tail effect in data distribution. A variety of baseline models are used to conduct experiments on the THUMOS14, HACS Segments, and Dance datasets. The results demonstrate that the proposed WCReST can improve the model’s detection performance on tail action categories and can be applied to different baseline temporal action detection models to enhance their performance.
Abstract: In recent years, multi-agent reinforcement learning methods have demonstrated excellent decision-making capabilities and broad application prospects in successful cases such as AlphaStar, AlphaDogFight, and AlphaMosaic. In multi-agent decision-making systems in real-world environments, the decision-making space of a task is often a parameterized action space with both discrete and continuous action variables. The complex structure of this type of action space makes traditional multi-agent reinforcement learning algorithms no longer applicable. Therefore, research on parameterized action spaces holds important significance for real-world applications. This study proposes a factored multi-agent centralised policy gradients algorithm for parameterized action spaces in multi-agent settings. By utilizing the factored centralised policy gradient algorithm, effective coordination among multiple agents is ensured. Then, the output of the dual-headed policy in the parameterized deep deterministic policy gradient algorithm is employed to achieve effective coupling in the parameterized action space. Experimental results under different parameter settings in the hybrid predator-prey scenario show that the algorithm performs well on classic multi-agent parameterized action space collaboration tasks. Additionally, the algorithm's effectiveness and feasibility are validated in a multi-cruise-missile collaborative penetration task with complex and highly dynamic properties.
Abstract: Constructing post-quantum key encapsulation mechanisms based on Lattice (especially NTRU Lattice) is one of the popular research fields in Lattice-based cryptography. Commonly, most Lattice-based cryptographic schemes are constructed over cyclotomic rings, which, however, are vulnerable to some attacks due to their abundant algebraic structures. An optional and more secure underlying algebraic structure is the large-Galois-group prime-degree prime-ideal number field. NTRU-Prime is an excellent NTRU-based key encapsulation mechanism over the large-Galois-group prime-degree prime-ideal number field and has been widely adopted as the default in the OpenSSH standard. This study aims to construct a key encapsulation mechanism over the same algebraic structure but with better performance than NTRU-Prime. Firstly, this work studies the security risks of cyclotomic rings, especially the attacks on quadratic power cyclotomic rings, and demonstrates the security advantages of a large-Galois-group prime-degree prime-ideal number field in resisting these attacks. Next, an NTRU-based key encapsulation mechanism named CNTR-Prime over a large-Galois-group prime-degree prime-ideal number field is proposed, along with the corresponding detailed analysis and parameter sets. Then, a pseudo-Mersenne incomplete number theoretic transform (NTT) is provided, which can compute polynomial multiplication efficiently over a large-Galois-group prime-degree prime-ideal number field. In addition, an improved pseudo-Mersenne modular reduction algorithm is proposed, which is utilized in pseudo-Mersenne incomplete NTT. It is faster than Barrett reduction by 2.6% in software implementation and is 2 to 6 times faster than both Montgomery reduction and Barrett reduction in hardware implementation. Finally, a C-language implementation of CNTR-Prime is presented. When compared to SNTRU-Prime, CNTR-Prime has advantages in security, bandwidth, and implementation efficiency. For example, CNTR-Prime-761 has an 8.3% smaller ciphertext size, and its security is strengthened by 19 bits for both classical and quantum security. CNTR-Prime-761 is faster in key generation, encapsulation, and decapsulation algorithms by 25.3×, 10.8×, and 2.0×, respectively. The classical and quantum security of CNTR-Prime-653 is already comparable to that of SNTRU-Prime-761, with a 13.8% reduction in bandwidth, and it is faster in key generation, encapsulation, and decapsulation by 33.9×, 12.6×, and 2.3×, respectively. This study provides an important reference for subsequent research, analysis, and optimization of similar Lattice-based cryptographic schemes.
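The pseudo-Mersenne reduction idea behind the improved modular reduction can be sketched for a modulus of the form q = 2^k - c, using the identity 2^k ≡ c (mod q) to fold the high bits down; the modulus below is illustrative and is not a CNTR-Prime parameter, and the constant-time refinements required in an actual implementation are omitted.

```python
def pseudo_mersenne_reduce(x, k, c):
    """Reduce x modulo q = 2^k - c using the identity 2^k ≡ c (mod q):
    repeatedly rewrite x = hi*2^k + lo as hi*c + lo until x fits in k bits,
    then apply final conditional subtractions."""
    q = (1 << k) - c
    while x >> k:
        hi, lo = x >> k, x & ((1 << k) - 1)
        x = hi * c + lo            # x decreases by hi*q each pass, value preserved mod q
    while x >= q:
        x -= q
    return x

# Illustrative pseudo-Mersenne modulus (not a CNTR-Prime parameter): q = 2^13 - 1.
k, c = 13, 1
q = (1 << k) - c
for x in (q - 1, q, 12345678, (q - 1) * (q - 1)):
    assert pseudo_mersenne_reduce(x, k, c) == x % q
print("all reductions match x % q")
```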
Abstract: When writing code, software developers often refer to code snippets that implement similar functions in the project. A code generation model shares a similar idea when generating code fragments: it uses the code context provided in the input as a reference. Code completion technology based on retrieval augmentation follows this idea: external code retrieved from a retrieval library is used as additional context information to prompt the generation model to complete unfinished code fragments. Existing retrieval-augmented code completion methods directly concatenate the input code and the retrieval results as the input of the generation model. This brings a risk that the retrieved code fragments may not help the model but instead mislead it, resulting in inaccurate or irrelevant completions. In addition, the retrieved external code is concatenated with the input code and fed to the model regardless of whether it is truly relevant, so the effect of this method largely depends on the accuracy of the code retrieval stage. If no usable code fragments are returned in the retrieval phase, the subsequent code completion may also be affected. An empirical study is conducted on the retrieval augmentation strategies in existing code completion methods. Through qualitative and quantitative experiments, the impact of each stage of retrieval augmentation on the code completion effect is analyzed. The empirical study identifies three factors that affect retrieval augmentation: code granularity, code retrieval methods, and post-processing methods. Based on the conclusions of the empirical study, an improved method is designed, and a code completion method MAGIC (multi-stage optimization for retrieval augmented code completion) is proposed, which improves retrieval augmentation by optimizing the code retrieval strategy in stages. Improved strategies such as code segmentation, retrieval-reranking, and template prompt generation are designed to effectively enhance the auxiliary effect of the code retrieval module on the code completion model. Meanwhile, these strategies also reduce the interference of irrelevant code in the code generation phase and improve the quality of the generated code. The experimental results on a Java code dataset show that, compared with existing retrieval-augmented code completion methods, the proposed method improves the edit similarity and exact match metrics by 6.76% and 7.81%, respectively. Compared with a large code model with 6B parameters, the proposed method saves 94.5% of GPU memory and 73.8% of inference time while improving the edit similarity and exact match metrics by 5.62% and 4.66%, respectively.
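The retrieve-then-rerank stage can be illustrated with a deliberately simple lexical pipeline: a coarse Jaccard retrieval over a snippet library followed by a finer reranking of the top candidates, with the best candidate prepended as a prompt. The snippet library, similarity measures, and prompt format are illustrative assumptions and only stand in for MAGIC's actual segmentation, retrieval-reranking, and template prompt strategies.

```python
import re

def tokens(code):
    """Crude identifier-level tokenization used for lexical retrieval."""
    return set(re.findall(r"[A-Za-z_]\w*", code.lower()))

def retrieve(query, library, top_n=5):
    """Stage 1: coarse retrieval by Jaccard similarity of identifier sets."""
    q = tokens(query)
    scored = [(len(q & tokens(c)) / max(1, len(q | tokens(c))), c) for c in library]
    return [c for s, c in sorted(scored, reverse=True)[:top_n] if s > 0]

def rerank(query, candidates):
    """Stage 2: finer reranking by the fraction of candidate tokens that also
    appear in the query (a stand-in for a learned or template-aware reranker)."""
    q = tokens(query)
    return sorted(candidates, key=lambda c: len(q & tokens(c)) / len(tokens(c)), reverse=True)

library = [
    "public int sumList(List<Integer> xs) { int s = 0; for (int x : xs) s += x; return s; }",
    "public String readFile(Path p) throws IOException { return Files.readString(p); }",
    "public void clearCache() { cache.clear(); }",
]
unfinished = "public int sumArray(int[] xs) {"
best = rerank(unfinished, retrieve(unfinished, library, top_n=2))[0]
prompt = "// Retrieved reference:\n" + best + "\n// Complete:\n" + unfinished
print(prompt)
```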
Abstract: Machine translation (MT) aims to build an automatic translation system to transform a given sequence in the source language into a target language sequence with identical semantic information. MT has been an important research direction in natural language processing and artificial intelligence due to its wide range of application scenarios. In recent years, the performance of neural machine translation (NMT) has greatly surpassed that of statistical machine translation (SMT), making it the mainstream method in MT research. However, NMT generally takes the sentence as the translation unit, and in document-level translation scenarios, discourse errors such as the mistranslation of words and incoherent sentences may occur if each sentence is translated independently, separated from its discourse context. Therefore, incorporating document-level information into the translation procedure is a more reasonable and natural way to resolve discourse errors. This is the goal of document-level neural machine translation (DNMT), which has become a popular direction in MT research. This study reviews and summarizes work in DNMT research in terms of discourse evaluation methods, datasets, models, and other aspects to help researchers efficiently grasp the research status and future directions of DNMT. Meanwhile, this study also introduces the prospects and challenges in DNMT, hoping to bring some inspiration to researchers.
Abstract: Self-training, a common strategy for tackling annotated-data scarcity, typically involves acquiring auto-annotated data with high confidence generated by a teacher model as reliable data. However, in low-resource scenarios for relation extraction (RE) tasks, this approach is hindered by the limited generalization capacity of the teacher model and the confusable relational categories in the tasks. Consequently, efficiently identifying reliable data from automatically labeled data becomes challenging, and a large amount of low-confidence noisy data is generated. Therefore, this study proposes a self-training approach for low-resource relation extraction (ST-LRE). This approach aids the teacher model in selecting reliable data based on paraphrase-based predictions, and extracts reliable ambiguous data from low-confidence data based on a partial-labeling mode. Considering the candidate categories of ambiguous data, this study proposes a negative training approach based on the set of negative labels. Finally, a unified approach capable of both positive and negative training is proposed for the integrated training of reliable data and ambiguous data. In the experiments, ST-LRE consistently demonstrates significant improvements in low-resource scenarios on two widely used RE datasets, SemEval2010 Task-8 and Re-TACRED.
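The negative training step over the set of negative labels can be sketched directly: instead of maximizing the probability of a given label, the loss pushes probability mass away from labels known to be wrong. The toy logits and label sets below are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def positive_nll(logits, label):
    """Standard positive training: -log p(label)."""
    return float(-np.log(softmax(logits)[label]))

def negative_loss(logits, negative_labels):
    """Negative training: for each label known to be wrong,
    minimize -log(1 - p(label)), pushing probability away from it."""
    p = softmax(logits)
    return float(sum(-np.log(1.0 - p[y]) for y in negative_labels))

logits = np.array([2.0, 1.5, 0.3, -1.0])      # scores over 4 relation types
print("reliable sample, gold=0:", round(positive_nll(logits, 0), 4))
print("ambiguous sample, negatives={2,3}:", round(negative_loss(logits, {2, 3}), 4))
```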
Abstract: Learned indexes are assisting or gradually replacing traditional index structures due to their low memory usage and high query performance. However, the online retraining caused by data updates makes them unable to adapt to scenarios with frequent data updates. To avoid index reconstruction due to frequent data updates without significantly increasing memory consumption, this study proposes an adaptive update-distribution-aware learned index named DRAMA. It uses an LSM-tree-like delayed learning method to actively learn the characteristics of the data update distribution, approximate fitting techniques to quickly establish the update-distribution model, a model merging strategy to replace frequent retraining, and a hybrid compression technique to reduce the memory usage of model parameters in the index. The index is constructed and validated on both real and synthetic datasets. The results show that, compared to traditional indexes and state-of-the-art learned indexes, the proposed index can effectively reduce query latency in a data update environment without additional memory consumption.
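The basic learned-index mechanism that DRAMA builds on can be sketched with a single linear segment: fit a line from key to position, record the maximum prediction error, and answer lookups with a bounded local search around the prediction. The linear model and the toy keys are illustrative; DRAMA's delayed learning, model merging, and compression strategies are not reproduced here.

```python
import bisect

class LinearSegmentIndex:
    """A single learned segment: a least-squares line from key to array position,
    plus the maximum observed prediction error, so a lookup only needs a
    bounded local search around the predicted slot."""
    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        xs, ys = self.keys, range(n)
        mean_x, mean_y = sum(xs) / n, (n - 1) / 2
        var_x = sum((x - mean_x) ** 2 for x in xs) or 1.0
        self.slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
        self.intercept = mean_y - self.slope * mean_x
        self.err = max(abs(self._predict(x) - y) for x, y in zip(xs, ys))

    def _predict(self, key):
        return min(len(self.keys) - 1, max(0, int(round(self.slope * key + self.intercept))))

    def lookup(self, key):
        pos = self._predict(key)
        lo, hi = max(0, pos - self.err), min(len(self.keys), pos + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else None

index = LinearSegmentIndex([3, 8, 21, 55, 144, 377, 987])
print(index.lookup(55), index.lookup(100))
```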
Abstract: The knowledge graph completion task aims to reveal missing fact triples within the knowledge graph based on existing fact triples (head entity, relation, tail entity). Existing research primarily focuses on utilizing the structural information within the knowledge graph. However, these efforts overlook the fact that other modal information contained within the knowledge graph may also be helpful for knowledge graph completion. In addition, since task-specific knowledge is typically not integrated into general pre-trained models, incorporating task-related knowledge into the modal information extraction process becomes crucial. Moreover, given that different modal features contribute uniquely to knowledge graph completion, effectively preserving useful multimodal information poses a significant challenge. To address these issues, this study proposes a multimodal knowledge graph completion method that incorporates task knowledge. It utilizes a multimodal encoder fine-tuned for the current task to acquire entity vector representations across different modalities. Subsequently, a modal fusion-filtering module based on recurrent neural networks is utilized to eliminate task-independent multimodal features. Finally, the study utilizes a simple isomorphic graph network to represent and update all features, thus effectively accomplishing multimodal knowledge graph completion. Experimental results demonstrate the effectiveness of the proposed approach in extracting information from different modalities. Furthermore, they show that the method enhances entity representation capability through additional multimodal filtering and fusion, consequently improving the performance of multimodal knowledge graph completion tasks.
Abstract: Existing multi-view attributed graph clustering methods usually learn consistent information and complementary information in a unified representation of multiple views. However, under such fusion-then-learning approaches, not only is the specific information of the original views lost, but consistency and complementarity are also difficult to balance in the unified representation. To retain the original information of each view, this study adopts a learning-then-fusing approach. Firstly, the shared representation and specific representation of each view are learned separately before fusion, so that the consistent information and complementary information of multiple views are learned in a more fine-grained manner. On this basis, a multi-view attributed graph clustering model based on shared and specific representations (MSAGC) is constructed. Specifically, the primary representation of each view is obtained by a multi-view graph encoder, and then the shared information and specific information of each view are obtained. The consistent information of multiple views is learned by aligning the view-shared information, the complementary information of multiple views is utilized by combining the view-specific information, and redundant information is handled through a difference constraint. After that, a multi-view decoder is trained to reconstruct the topological structure and attribute feature matrix of the graph. Finally, an additional self-supervised clustering module makes the graph representation learning and clustering tasks tend to be consistent. The effectiveness of MSAGC is well verified on real multi-view attributed graph datasets.
Abstract: In the field of software engineering, code repositories contain a wealth of knowledge resources, which can provide developers with examples of programming practices. If the repetitive patterns that frequently occur in source code can be effectively extracted in the form of code templates, programming efficiency could be significantly improved. In current practice, developers often reuse existing solutions by searching through source code. However, this method typically generates a large number of similar and redundant results, increasing the burden of subsequent filtering. Moreover, template mining techniques based on cloned code often fail to cover extensive patterns constructed from dispersed small clones, thereby limiting the practicality of the templates. A method is proposed for extracting and retrieving code templates based on code clone detection. This method achieves more efficient function-level code template extraction by stitching together multiple fragment-level clones and by extracting and aggregating the shared parts of method-level clones, thereby addressing the issue of template quality. Based on the mined code templates, this study proposes a triplet representation of code structural features that effectively supplements plain text features and implements an efficient and concise structural representation. In addition, this study presents a template feature retrieval method that combines structural and textual search to retrieve these templates by matching features of the programming context. The tool implemented based on this method, CodeSculptor, demonstrates a significant capability to extract high-quality code templates in a test against a codebase containing 45 high-quality Java open-source projects. The results show that the templates mined by the tool achieve an average code reduction of 60.87%, with 92.09% produced by stitching fragment-level clones, a proportion of templates that is not identifiable by traditional methods. This proves the superior performance of the method in recognizing and constructing code templates. Furthermore, the accuracy of the top-5 search results in code template search and recommendation is 96.87%. A preliminary case study on 9600 randomly selected templates reveals that most of the sampled code templates are complete and coherent in semantics, thus affirming their practicality. Nonetheless, there are a few meaningless templates, highlighting the future potential to refine the proposed template extraction strategy. The user study further shows that code development tasks can be completed more efficiently with CodeSculptor.
Abstract: As a distributed approach to problem solving, crowdsourcing reduces costs and efficiently utilizes resources. While blockchain technology is introduced to solve the problem of over-centralization in traditional crowdsourcing platforms, its transparency brings the risk of privacy leakage. Traditional anonymous authentication can hide users' identities, but anonymity can be abused and makes worker selection more difficult. In this study, a decentralized accountable attribute-based authentication scheme is proposed and combined with blockchain to design a novel crowdsourcing scheme. Using decentralized attribute-based encryption and non-interactive zero-knowledge proofs, the scheme protects the privacy of users' identities while providing linkability and traceability, and the requester can devise access policies to select workers. In addition, the scheme improves the security of the system by implementing the attribute authorization authority and the tracking group through the threshold secret sharing technique. Experimental simulation and analysis demonstrate that the scheme meets the time and storage overhead requirements of practical applications.
Abstract: A cuckoo filter is a space-efficient approximate membership query data structure, widely used in network systems for applications such as network routing, network measurement, and network caching. However, the traditional design of cuckoo filters has not adequately considered the scenario in network systems where some or all queries in the collection are known, and these queries come with associated costs. This limitation results in the suboptimal performance of existing cuckoo filters in such situations. To address this, the variable hashing-fingerprint cuckoo filter (VHCF) has been developed. VHCF introduces variable fingerprint hashing technology, taking into account the known query collection and their associated costs. By searching for the optimal fingerprint hash function for each hash bucket, the overall cost of false positives is significantly reduced. In addition, this study proposes a single-hash technology to reduce the additional computational overhead caused by the variable-hash technology. A theoretical analysis of the operational complexity and false positive rate of VHCF is also provided. Finally, experimental and theoretical results both demonstrate that VHCF achieves a significantly lower false positive rate than existing cuckoo filters and their variants while ensuring comparable query throughput. Specifically, VHCF only needs to allocate 1–2 bits for each hash index unit, which can reduce the false positive rate to 12.5%–50% of the original.
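To make the variable fingerprint-hashing idea concrete, the sketch below picks, for a single bucket, the fingerprint hash (indexed by a small salt, matching the 1–2 bit hash index mentioned above) that minimizes the total cost of false positives over a known, cost-weighted negative query set. The bucket layout, the hash family, and the cost model are simplified assumptions made for illustration, not the actual VHCF implementation.

```python
import hashlib

def fingerprint(item: str, salt: int, bits: int = 8) -> int:
    """Candidate fingerprint hash: a salted hash truncated to `bits` bits."""
    digest = hashlib.blake2b(item.encode(), salt=salt.to_bytes(8, "little")).digest()
    return int.from_bytes(digest[:4], "little") % (1 << bits)

def best_salt_for_bucket(stored, negatives_with_cost, candidate_salts, bits=8):
    """Pick the fingerprint hash (salt) that minimizes the total cost of false
    positives for this bucket, given the known negative queries that map to it
    and their per-query costs."""
    best_salt, best_cost = None, float("inf")
    for salt in candidate_salts:
        fps = {fingerprint(x, salt, bits) for x in stored}
        # A negative query is a false positive if its fingerprint collides
        # with any stored fingerprint under this hash choice.
        cost = sum(c for q, c in negatives_with_cost
                   if fingerprint(q, salt, bits) in fps)
        if cost < best_cost:
            best_salt, best_cost = salt, cost
    return best_salt, best_cost

# Toy usage: one bucket storing two items, three known negative queries with costs.
stored = ["flow:10.0.0.1", "flow:10.0.0.2"]
negatives = [("flow:10.0.0.9", 5.0), ("flow:10.0.0.7", 1.0), ("flow:10.0.0.3", 0.5)]
salt, cost = best_salt_for_bucket(stored, negatives, candidate_salts=range(4))
print(f"chosen hash index: {salt}, residual false-positive cost: {cost}")
```

Here `range(4)` corresponds to a 2-bit hash index stored per bucket; a real filter would also have to fall back gracefully when no candidate hash avoids all costly collisions.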
Abstract: Smart contracts are scripts running on the Ethereum blockchain capable of handling intricate business logic, with most written in Solidity. As security concerns surrounding smart contracts intensify, a formal verification method employing the modeling, simulation, and verification language (MSVL) alongside propositional projection temporal logic (PPTL) is proposed. A SOL2M converter is developed, facilitating semi-automatic modeling from Solidity to MSVL programs. However, a proof of the operational semantic equivalence of Solidity and MSVL is lacking. This study initially defines Solidity’s operational semantics using big-step semantics across four levels: semantic elements, evaluation rules, expressions, and statements. Subsequently, it establishes equivalence relations between states, expressions, and statements in Solidity and MSVL. In addition, leveraging the operational semantics of both languages, it employs structural induction to prove expression equivalence and rule induction to establish statement equivalence.
Abstract: Formal methods have made significant strides in the field of requirements consistency verification. However, as the complexity of embedded system requirements continues to increase, verifying requirements consistency faces the challenge of dealing with an excessively large state space. To effectively reduce the verification state space, while also considering the strong dependency among devices in embedded system requirements, this study proposes a compositional verification method for ensuring the consistency of requirements in complex embedded systems. This method is based on requirement decomposition and identification of dependencies among requirements. By leveraging these dependencies, it assembles verification subsystems, enabling the compositional verification of complex embedded system requirements and facilitating the initial identification of inconsistencies. Specifically, the problem frames approach is employed for requirement modeling and decomposition, while a domain-specific device knowledge base is utilized for modeling the physical characteristics of devices. During the assembly of verification subsystems, models of expected software behavior are generated and dynamically integrated with physical device models. Finally, the feasibility and effectiveness of this method are validated through a case study of an airborne reconnaissance control system, demonstrating a significant reduction in the verification state space through five case evaluations. This method thus provides a practical solution for verifying the requirements of complex embedded systems.
Abstract: Requirements for the effective real-time analysis of instant data modification of database systems have driven the rapid development of Hybrid Transactional/Analytical Processing (HTAP) database systems, which support both OLTP and OLAP workloads. To realize fair comparisons and healthy development, it is crucial to define and implement new benchmarks to evaluate new features of HTAP database systems. Firstly, this study analyzes the key characteristics of HTAP database systems and summarizes the distinct technologies in their implementations. Secondly, the difficulties of designing HTAP database systems and the challenges of constructing HTAP benchmarks are extracted. Based on these, the design dimensions of HTAP benchmarks are proposed, including data generation, workload generation, evaluation metrics, and consistency model supportability. This study compares the differences between existing HTAP benchmarks in terms of design dimensions and implementation technologies and sums up their merits and defects in different dimensions. Additionally, the published benchmarks are demonstrated, and their ability to evaluate key features and support horizontal comparisons among HTAP database systems is analyzed. Finally, this study concludes the requirements for HTAP benchmarks and some future research directions, pointing out that semantically consistent workload control and fresh data access metrics are the key issues in defining benchmarks for HTAP database systems.
Abstract: In the field of model-based diagnosis, all minimal hitting sets (MHSs) of minimal conflict sets (MCSs) are the candidate diagnoses of the device to be diagnosed, so the calculation of MHSs is a key step in generating candidate diagnoses. MHS computation is a classic NP-hard constraint solving problem, and its difficulty grows exponentially with problem size. The Boolean algorithm is a typical method for calculating MHSs. However, during solving, most of the runtime is taken up by the minimization of intermediate solution sets. This study proposes the BWSS (Boolean with suspicious sets) algorithm, which combines suspicious set clusters for calculating MHSs. By analyzing the spanning tree rule of the Boolean algorithm in depth, the sets that cause a candidate solution to become a superset are identified. When extending elements to the root node, a candidate solution is minimal if it has an empty intersection with at least one set in the suspicious set cluster; otherwise, the solution is removed. A recursive strategy is employed to ensure that all and only MHSs are generated at the end of the algorithm. In addition, each candidate solution contains at least m (m≥1) elements, or even the entire solution, that require no complex minimization. Theoretically, the BWSS algorithm has far lower complexity than the Boolean algorithm. Experimental results on random data and massive reference circuit data show that, compared with many other state-of-the-art methods, the proposed algorithm reduces runtime by several orders of magnitude.
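The minimality property that such algorithms must enforce can be stated directly: a hitting set is minimal exactly when removing any single element leaves some conflict set unhit. The sketch below is a naive checker that only illustrates the definition whose violation (superset candidates) BWSS filters out; it is not the BWSS algorithm itself, and the conflict sets are a made-up toy example.

```python
def is_hitting_set(candidate, conflict_sets):
    """True if the candidate intersects every minimal conflict set."""
    return all(candidate & cs for cs in conflict_sets)

def is_minimal_hitting_set(candidate, conflict_sets):
    """A hitting set is minimal iff removing any single element leaves at least
    one conflict set unhit, i.e., every element is actually needed."""
    if not is_hitting_set(candidate, conflict_sets):
        return False
    return all(not is_hitting_set(candidate - {e}, conflict_sets) for e in candidate)

# Toy diagnosis example: three minimal conflict sets over components a-d.
mcs = [frozenset("ab"), frozenset("bc"), frozenset("ad")]
print(is_minimal_hitting_set({"a", "b"}, mcs))        # True: hits all sets, both needed
print(is_minimal_hitting_set({"a", "b", "c"}, mcs))   # False: a superset, 'c' is redundant
```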
Abstract: The network traffic measurement technology of programmable switches is capable of handling high-speed network traffic and offers significant advantages in terms of flexibility and real-time processing. However, due to the necessity of configuring the internal logic of switches using the complex P4 programming language, the deployment of measurement tasks becomes intricate and error-prone. Furthermore, measurement accuracy is often constrained by the measurement resources available within the switch. This study conducts a detailed exploration of intent-based networking and network traffic measurement technology and introduces an intent-driven distributed network traffic measurement method. Firstly, an intent representation format based on measurement intent primitives is designed, and an intent compiler is developed to translate abstract intent representations into executable P4 code. Secondly, a distributed network traffic measurement approach is introduced, utilizing the resources of multiple switches to collaboratively complete a measurement task in a distributed manner. The dynamic allocation of measurement resources and counter-configuration algorithms are exemplified with heavy-hitter measurements. Finally, experimental results demonstrate the feasibility and certain advantages of the proposed method.
Abstract: Recently, deep learning has received increasing attention from researchers due to its excellent performance in various scenarios, but these methods often rely on the independent and identically distributed (i.i.d.) assumption. Domain adaptation is proposed to mitigate the impact of distribution shift; it uses labeled source domain data and unlabeled target domain data to achieve better performance on target data. Existing methods are devised for static data, while methods for time series data need to capture the dependencies between variables. Although these methods use feature extractors for time series data, such as recurrent neural networks, to learn the dependencies between variables, they often extract redundant information. This information is easily entangled with semantic information, affecting the model performance. To solve these problems, this study proposes path-signature-based time-series domain adaptation (PSDA). On the one hand, this method uses the path signature transformation to capture sparse dependencies between variables and eliminate redundant correlations while preserving semantic dependencies, thereby facilitating the extraction of discriminative features from temporal data. On the other hand, invariant dependency relationships are preserved by constraining the consistency of dependency relationships among different domains, and the changing dependency relationships between domains are excluded, which is conducive to extracting generalized features from temporal data. Based on the above methods, the study further proposes a distance metric and a generalization bound theory and obtains the best experimental results on multiple standard time series domain adaptation datasets.
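The path signature transformation at the core of PSDA maps a multivariate series to iterated integrals that encode ordered dependencies between channels. The NumPy sketch below computes the depth-2 truncated signature of a piecewise-linear path; how the signatures are normalized and fed into the adaptation model is beyond this illustration, and the series used here is synthetic.

```python
import numpy as np

def signature_depth2(path: np.ndarray):
    """Truncated path signature (levels 1 and 2) of a piecewise-linear path.

    path: array of shape (T, d), one d-dimensional observation per time step.
    Returns (S1, S2) where S1[i] is the total increment in channel i and
    S2[i, j] is the iterated integral of dx_i then dx_j, which captures
    ordered dependencies between channels i and j.
    """
    increments = np.diff(path, axis=0)                 # shape (T-1, d)
    S1 = increments.sum(axis=0)                        # level-1 terms
    # Chen's identity for piecewise-linear paths: earlier increments times the
    # current increment, plus half of the current increment with itself.
    prefix = np.cumsum(increments, axis=0) - increments    # sum of strictly earlier increments
    S2 = prefix.T @ increments + 0.5 * increments.T @ increments
    return S1, S2

# Toy multivariate series: 50 steps, 3 channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((50, 3)).cumsum(axis=0)
S1, S2 = signature_depth2(x)
# The antisymmetric part of S2 (the Levy area) encodes which channel tends to "lead".
levy_area = 0.5 * (S2 - S2.T)
print(S1.shape, S2.shape, float(levy_area[0, 1]))
```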
Abstract: The rich development ecosystem of Python provides a lot of third-party libraries, significantly boosting developers’ efficiency and quality. Third-party library developers encapsulate underlying code, enabling upper-layer application developers to swiftly accomplish tasks by calling relevant APIs. However, APIs of third-party libraries are not constant. Owing to fixes, refactoring, and feature additions, these libraries undergo continuous updates. Incompatible changes are seen in some APIs after updates, leading to abnormal termination or inconsistent results in upper-layer applications. Therefore, the API compatibility of Python third-party libraries has become one of the issues that need to be solved. There have been related studies focusing on API compatibility issues of Python third-party libraries, but the causes of these issues have yet to be fully classified, so fine-grained causes cannot be provided. An empirical study is conducted on the symptoms and causes of API compatibility issues in Python third-party libraries, and a targeted static detection method is proposed. Initially, this study gathers 108 pairs of incompatible API versions by combining version update logs and regression tests across 6 version pairs of the Flask and Pandas libraries. Subsequently, an empirical study is conducted on the collected data, summarizing the symptoms and causes of compatibility issues. Finally, this study proposes a static analysis-based detection method for incompatible Python APIs, providing syntactic-level causes of incompatible API issues. This study conducts experimental evaluations on 12 version pairs of 4 popular Python third-party libraries. The results show that the proposed method performs well in terms of effectiveness, generalization, time performance, memory performance, and usefulness.
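The kind of syntactic-level comparison such a static detection performs can be sketched with Python's ast module: parse two versions of a module and report functions whose signatures changed incompatibly. The example below handles only removed APIs, removed parameters, and newly required parameters; the module sources are hypothetical, and the actual method described above covers many more causes than this sketch.

```python
import ast

def api_signatures(source: str):
    """Map top-level function names to (positional parameter names, default count)."""
    sigs = {}
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = [a.arg for a in node.args.args]
            sigs[node.name] = (args, len(node.args.defaults))
    return sigs

def diff_apis(old_src: str, new_src: str):
    """Report syntactic-level incompatibilities between two module versions."""
    old, new = api_signatures(old_src), api_signatures(new_src)
    issues = []
    for name, (old_args, _) in old.items():
        if name not in new:
            issues.append(f"{name}: API removed")
            continue
        new_args, new_defaults = new[name]
        removed = set(old_args) - set(new_args)
        if removed:
            issues.append(f"{name}: parameter(s) removed: {sorted(removed)}")
        # Parameters without defaults that did not exist before break old call sites.
        required_new = set(new_args[: len(new_args) - new_defaults]) - set(old_args)
        if required_new:
            issues.append(f"{name}: new required parameter(s): {sorted(required_new)}")
    return issues

old_version = "def read(path, sep=','):\n    pass\n"
new_version = "def read(path, delimiter, sep=','):\n    pass\n"
print(diff_apis(old_version, new_version))   # reports the new required parameter 'delimiter'
```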
Abstract: In this study, the problem of mining cluster frequent patterns in time-ordered transaction data is discussed for the first time. To deal with the redundant operations of the Naive algorithm when solving this problem, the improved cluster frequent pattern mining (ICFPM) algorithm is proposed. The algorithm uses two optimization strategies. On the one hand, it uses the defined parameter minCF to effectively reduce the search space of mining results; on the other hand, it refers to the discriminative results of (n–1)-itemsets to accelerate the discrimination of cluster frequent n-itemsets. The algorithm also applies the ICFPM-list structure to reduce the overhead of candidate n-itemset construction. Simulation experiments based on two real-world datasets demonstrate the effectiveness of the ICFPM algorithm. Compared with the Naive algorithm, the ICFPM algorithm improves substantially in terms of time and space efficiency, which makes it an effective method for cluster frequent pattern mining.
Abstract: DEFAULT, a new lightweight cryptosystem presented at Asiacrypt in 2021, is designed to protect the information security of Internet of Things (IoT) devices, such as microchips, microcontrollers, and sensors. Under the ciphertext-only attack assumption, a statistical fault analysis of the DEFAULT cipher that exploits algebraic relationships is proposed. The statistical fault analysis uses the random nibble-oriented fault model. It not only combines the statistical distributions of the intermediate states before and after the fault injections but also takes advantage of the algebraic relationship and novel distinguishers, including Anderson Darling test-square Euclidean imbalance (AD-SEI), Anderson Darling test-maximum likelihood estimate (AD-MLE), and Anderson Darling test-Hamming weight (AD-HW). The analysis requires at least 1344 faults to achieve a reliability of 99% in recovering the 128-bit secret key of DEFAULT. The theoretical analysis and experimental results show that the DEFAULT lightweight cryptosystem is not resistant to the statistical fault attack based on the algebraic relationship. This study provides an important reference for the security analysis of other lightweight cryptosystems.
Abstract: In real-world scenarios, rich interaction relationships often exist among users on different platforms such as e-commerce, consumer reviews, and social networks. Constructing these relationships into a graph structure and applying graph neural networks (GNNs) for malicious user detection has become a research trend in related fields in recent years. However, due to the small proportion of malicious users, as well as their disguises and high labeling costs, traditional GNN methods are limited by problems such as data imbalance, data inconsistency, and label scarcity. This study proposes a semi-supervised graph representation learning-based method for detecting malicious nodes. The method improves the GNN method for node representation learning and classification. Specifically, a class-aware malicious node detection (CAMD) method is constructed, which introduces a class-aware attention mechanism, inconsistent GNN encoders, and class-aware imbalance loss functions to solve the problems of data inconsistency and imbalance. Furthermore, to address the limitation of CAMD in detecting malicious nodes with scarce labels, a graph contrastive learning-based method, CAMD+, is proposed. CAMD+ introduces data augmentation, self-supervised graph contrastive learning, and class-aware graph contrastive learning to enable the model to learn more information from unlabeled data and fully utilize scarce label information. Finally, a large number of experimental results on real-world datasets verify that the proposed methods outperform all baseline methods and demonstrate good detection performance in situations with different degrees of label scarcity.
Abstract: As software vulnerabilities grow in type, volume, and complexity, researchers have proposed various techniques to help developers discover, detect, and localize vulnerabilities. However, researchers still need to exert considerable effort to manually repair these vulnerabilities. In recent years, some researchers have focused on automated software vulnerability repair. However, current advanced techniques merely treat this task as a generic text generation problem and do not localize the defects. As a result, the generation space of the repair program is large, and the generated repair programs are of low quality. Providing developers with such low-quality repairs affects the efficiency and effectiveness of vulnerability repair. To solve the above problems, a chain-of-thought-based repair approach for general types of vulnerabilities, named CotRepair, is proposed in this study. By utilizing chain-of-thought technology, the model first predicts the locations that are most likely to contain vulnerable code and then generates the repair program more accurately based on the predicted locations. The experimental results show that CotRepair outperforms the baselines in various metrics, and the effectiveness of the proposed approach is demonstrated from multiple aspects.
Abstract: In the task of numerical question-answering with texts and tables, models are required to perform numerical reasoning based on given texts and tables. The goal is to generate a computational program consisting of multi-step numerical calculations, and the program’s results are used as the answer to the question. To model the texts and tables, current work linearizes the table into a series of cell sentences through templates and then designs a generator based on the texts and cell sentences to produce the computational program. However, this approach faces a specific problem: the differences between cell sentences generated by templates are minimal, making it difficult for the generator to distinguish between cell sentences that are essential for answering the question (supporting cell sentences) and those irrelevant to the question (distracting cell sentences). Ultimately, the model generates incorrect computational programs based on distracting cell sentences. To tackle this issue, this study proposes an approach called multi-granularity cell semantic contrast (MGCC) for the generator. The main purpose of this approach is to enlarge the representation distances between supporting and distracting cell sentences, thereby helping the generator differentiate between them. Specifically, this contrast mechanism is composed of coarse-grained cell semantic contrasts and fine-grained constituent element contrasts, including contrasts in row names, column names, and cell values. The experimental results validate that the proposed MGCC approach enables the generator to achieve better performance than the benchmark model on the FinQA and MultiHiertt numerical reasoning datasets. On the FinQA dataset, it leads to an improvement of up to 3.38% in answer accuracy. Notably, on the more challenging hierarchical table dataset MultiHiertt, it yields a 7.8% increase in the accuracy of the generator. Compared with GPT-3 combined with chain of thought (CoT), MGCC results in respective improvements of 5.44% and 1.69% on the FinQA and MultiHiertt datasets. The subsequent analytical experiments further verify that the multi-granularity cell semantic contrast approach contributes to the model’s improved differentiation between supporting and distracting cell sentences.
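The coarse-grained cell semantic contrast can be understood as a contrastive objective that pulls supporting cell sentences toward the question representation and pushes distracting ones away. The PyTorch sketch below is a generic InfoNCE-style formulation over assumed sentence encodings; it illustrates the idea only and is not the exact MGCC loss or encoder.

```python
import torch
import torch.nn.functional as F

def cell_contrast_loss(question, supporting, distracting, temperature=0.1):
    """InfoNCE-style cell semantic contrast: each supporting cell sentence is a
    positive for the question; all distracting cell sentences are negatives.

    question:    (d,)   encoded question
    supporting:  (P, d) encoded supporting cell sentences
    distracting: (N, d) encoded distracting cell sentences
    """
    q = F.normalize(question, dim=-1)
    pos = F.normalize(supporting, dim=-1)
    neg = F.normalize(distracting, dim=-1)
    pos_sim = pos @ q / temperature                       # (P,) cosine similarities
    neg_sim = neg @ q / temperature                       # (N,)
    logits = torch.cat([pos_sim.unsqueeze(1),             # positive in column 0,
                        neg_sim.expand(pos_sim.size(0), -1)], dim=1)  # negatives after
    # Negative log-probability of each positive against the distracting cells, averaged.
    return (torch.logsumexp(logits, dim=1) - pos_sim).mean()

# Toy usage with random 768-dimensional encodings: 2 supporting, 5 distracting cells.
loss = cell_contrast_loss(torch.randn(768), torch.randn(2, 768), torch.randn(5, 768))
print(float(loss))
```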
Abstract: The interactions between elements in contemporary software systems are notably intricate, encompassing relationships between packages, classes, and functions. Accurate comprehension of these relationships is pivotal for optimizing system structures and enhancing software quality. Analyzing inter-package relationships can help unveil dependencies between modules, thereby assisting developers in more effectively managing and organizing software architectures. On the other hand, a clear understanding of inter-class relationships contributes to the creation of code repositories that are more scalable and maintainable. Moreover, a clear understanding of inter-function relationships facilitates rapid identification and resolution of logical errors within programs, consequently enhancing the robustness and reliability of the software. However, current predictions of software system interactions confront challenges such as granularity disparities, inadequate features, and version changes. To address these challenges, this study constructs corresponding software network models at three granularities: software packages, classes, and functions. It introduces a novel approach combining local and global features to reinforce the analysis and prediction of software systems through feature extraction and link prediction on software networks. This approach is based on the construction and handling of software networks, involving specific steps such as leveraging the node2vec method to learn local features of software networks and combining Laplacian eigenvector encoding to comprehensively represent the global positional information of nodes. Subsequently, the Graph Transformer model is employed to further optimize the feature vectors of node attributes, culminating in the completion of the interaction prediction task of the software system. Extensive experimental validations are conducted on three Java open-source projects, encompassing within-version and cross-version interaction prediction tasks. The experimental results demonstrate that, compared to benchmark methods, the proposed approach achieves an average increase of 8.2% and 8.5% in AUC and AP values, respectively, in within-version prediction tasks, and an average increase of 3.5% and 2.4% in AUC and AP values, respectively, in cross-version prediction tasks.
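One building block of this pipeline, the Laplacian eigenvector encoding of global node positions, can be sketched compactly. The snippet below uses plain NumPy on a toy adjacency matrix; the node2vec training and the Graph Transformer itself are omitted, and the graph is a hypothetical example rather than data from the studied projects.

```python
import numpy as np

def laplacian_positional_encoding(adj: np.ndarray, k: int) -> np.ndarray:
    """Global positional features: eigenvectors of the symmetric normalized
    Laplacian associated with the k smallest non-trivial eigenvalues."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(lap)          # eigenvalues returned in ascending order
    return eigvecs[:, 1:k + 1]                # skip the trivial first eigenvector

# Toy software network: 5 nodes (e.g., classes) with undirected dependency edges.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]
A = np.zeros((5, 5))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
pos_enc = laplacian_positional_encoding(A, k=2)    # shape (5, 2)
# In the described pipeline, features like these would be concatenated with node2vec
# embeddings before being fed to the Graph Transformer for link prediction.
print(pos_enc.round(3))
```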
Abstract: Social media text summarization aims to provide concise summaries for large-scale social media short texts (referred to as posts) targeting specific topics. Given the brief and informal contents of posts, traditional methods confront the challenges of sparse features and insufficient information. Recent research endeavors have leveraged social relationships among posts to refine post contents and remove redundant information, but these efforts neglect the presence of unreliable noise relationships in real social media contexts, leading to erroneous assessments of post importance and diversity. Therefore, this study proposes a novel unsupervised model DSNSum, which improves summarization performance by removing noise relationships in the social networks. Firstly, the noise relationships in real social relationship networks are statistically verified. Secondly, two noise functions are designed based on sociological theories, and a denoising graph auto-encoder (DGAE) is constructed to mitigate the influence of noise relationships and cultivate post contents of credible social relationships. Finally, a sparse reconstruction framework is utilized to select posts that maintain coverage, importance, and diversity to form a summary of a certain length. Experimental results on a total of 22 topics from two real social media platforms (Twitter and Sina Weibo) demonstrate the efficacy of the proposed model and provide new insights for subsequent research in related fields.
Abstract: As human pose estimation (HPE) methods based on graph convolutional networks (GCNs) cannot sufficiently aggregate the spatiotemporal features of skeleton joints, which restricts discriminative feature extraction, a parallel multi-scale spatio-temporal graph convolutional network (PMST-GNet) model is built in this paper to improve the performance of 3D HPE. Firstly, a diagonally dominant spatiotemporal attention graph convolutional layer (DDA-STGConv) is designed to construct a cross-domain spatiotemporal adjacency matrix and model the joint features based on self-constraint and attention mechanism constraints, thereby enhancing information interaction among nodes. Then, a graph topology aggregation function is devised to construct different graph topologies, and a parallel multi-scale sub-graph network module (PM-SubGNet) is constructed with DDA-STGConv as the basic unit. Finally, a multi-scale feature cross fusion block (MFEB) is designed, by which multi-scale information among PM-SubGNets can interact to improve the feature representation of the GCN, thereby better extracting the context information of skeleton joints. The experimental results on the mainstream 3D HPE datasets Human3.6M and MPI-INF-3DHP show that the proposed PMST-GNet model performs well in 3D HPE and is superior to current mainstream GCN-based algorithms such as Sem-GCN, GraphSH, and UGCN.
Abstract: Many computational problems on graphs are NP-hard, so a natural strategy is to restrict them to some special graphs. This approach has seen many successes in the last few decades, and many efficient algorithms have been designed for problems on graph classes including graphs of bounded degree, graphs of bounded tree-width, and planar graphs, to name a few. As a matter of fact, many such algorithmic results can be understood in the framework of so-called algorithmic meta-theorems. These are general results that provide efficient algorithms for deciding logical properties on structured graphs, which are also known as model-checking problems. Most existing algorithmic meta-theorems rely on modern structural graph theory, and they are often concerned with fixed-parameter tractable algorithms, i.e., efficient algorithms in the sense of parameterized complexity. On many well-structured graphs, the model-checking problems for some natural logics, e.g., first-order logic and monadic second-order logic, turn out to be fixed-parameter tractable. Due to their varying expressive power, the tractability of the model-checking problems of these logics differs greatly depending on the underlying graph classes. Therefore, understanding the maximum graph classes that admit efficient model-checking algorithms is a central question for algorithmic meta-theorems. For example, it has long been known that efficient model-checking of first-order logic is closely related to the sparsity of input graphs. After decades of effort, our understanding of sparse graphs is now fairly complete, so much of the current research has been focused on well-structured dense graphs, where challenging open problems are abundant. A few deep algorithmic meta-theorems have already been proved for dense graph classes, while the research frontier is still expanding. This survey aims to give an overview of the whole area in order to provide impetus for research on algorithmic meta-theorems in China.
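As a concrete illustration (added here as an example, not taken from the survey) of the logical properties such meta-theorems cover: for every fixed k, the k-dominating set problem is definable by a first-order sentence, so any graph class with fixed-parameter tractable FO model-checking immediately yields an FPT algorithm for it.

```latex
% k-dominating set expressed in first-order logic (k is the fixed parameter):
\varphi_k \;=\; \exists x_1 \cdots \exists x_k\, \forall y\,
  \bigvee_{i=1}^{k} \bigl( y = x_i \,\lor\, E(x_i, y) \bigr)
% A graph G satisfies \varphi_k iff G has a dominating set of size at most k, so an
% efficient FO model-checking algorithm on a class gives an FPT algorithm there.
```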
Abstract: In recent years, research achievements in deep learning have found widespread applications globally. To enhance the training efficiency of large-scale deep learning models, industry practices often involve constructing GPU clusters and configuring efficient task schedulers. However, deep learning training tasks exhibit complex performance characteristics such as performance heterogeneity and placement topological sensitivity. Scheduling without considering performance can lead to issues such as low resource utilization and poor training efficiency. In response to this challenge, a great number of schedulers of deep learning training tasks based on performance modeling have emerged. These schedulers, by constructing accurate performance models, delve into the intricate performance characteristics of tasks. Based on this understanding, they design more optimized scheduling algorithms, thereby forming more efficient scheduling solutions. This study begins with a modeling design perspective, providing a categorized review of the performance modeling methods employed by current schedulers. Subsequently, based on the optimized scheduling approaches from performance modeling by schedulers, a systematic analysis of existing task scheduling efforts is presented. Finally, this study outlines prospective research directions for performance modeling and scheduling in the future.
Abstract: As crash test sequences generated by Android automated testing tools may include many redundant events, which makes test replay, defect comprehension, and repair difficult, a great number of test sequence reduction works have been proposed. However, current works only focus on application interface changes and ignore the internal state changes during program execution. Moreover, current works only model application states at a single, abstract granularity, such as control widget granularity or activity granularity, resulting in long test sequences after reduction or inefficient reduction. This study proposes an Android test sequence reduction method that combines multiple granularities based on event labeling. By taking the Android lifecycle management mechanism and data flow analysis into account to label critical events that trigger crashes, this method narrows the sequence reduction space and designs a strategy of rough selection at low granularity and detailed reduction at high granularity. Finally, a crash test sequence set containing complex scenarios such as inter-application interaction and user input is collected, and comparison with other test sequence reduction works on this set verifies the effectiveness of the method proposed in this study.
Abstract: In recent years, there has been rapid advancement in the application of artificial intelligence technology to sequential decision-making and adversarial game scenarios, resulting in significant progress in domains such as Go, games, poker, and Mahjong. Notably, systems like AlphaGo, OpenAI Five, AlphaStar, DeepStack, Libratus, Pluribus, and Suphx have achieved or surpassed human expert-level performance in these areas. While these applications primarily focus on zero-sum games involving two players, two teams, or multiple players, there has been limited substantive progress in addressing mixed-motive games. Unlike zero-sum games, mixed-motive games necessitate comprehensive consideration of individual returns, collective returns, and equilibrium. These games are extensively applied in real-world applications such as public resource allocation, task scheduling, and autonomous driving, making research in this area crucial. This study offers a comprehensive overview of key concepts and relevant research in the field of mixed-motive games, providing an in-depth analysis of current trends and future directions both domestically and internationally. Specifically, this study first introduces the definition and classification of mixed-motive games. It then elaborates on game solution concepts and objectives, including Nash equilibrium, correlated equilibrium, and Pareto optimality, as well as objectives related to maximizing individual and collective gains, while considering fairness. Furthermore, the study engages in a thorough exploration and analysis of game theory methods, reinforcement learning methods, and their combination based on different solution objectives. In addition, the study discusses relevant application scenarios and experimental simulation environments before concluding with a summary and outlook on future research directions.
Abstract: The cuckoo filter is an efficient probabilistic data structure that can quickly determine whether an element exists in a given set. The cuckoo filter is widely used in computer networks, IoT applications, and database systems. These systems usually involve the handling of massive amounts of data and numerous concurrent requests in practice. A cuckoo filter that supports high concurrency can significantly improve system throughput and data processing capabilities, which is crucial to system performance enhancement. Therefore, a cuckoo filter that supports lock-free concurrency is designed. The filter achieves high-performance lookup, insertion, and deletion through a two-stage query, the separation of path exploration and element migration, and atomic migration based on multi-word compare-and-swap. Theoretical analysis and experimental results indicate that the lock-free concurrent cuckoo filter significantly outperforms the most cutting-edge existing algorithms in concurrent performance. The lookup throughput of the lock-free concurrent cuckoo filter is on average 1.94 times that of a cuckoo filter using fine-grained locks.
Abstract: Previous pre-trained language models (PLMs) have demonstrated excellent performance in numerous tasks of natural language understanding (NLU). However, they generally suffer from shortcut learning, i.e., learning spurious correlations between non-robust features and labels, which results in poor generalization in out-of-distribution (OOD) test scenarios. Recently, the outstanding performance of generative large language models (LLMs) in understanding tasks has attracted widespread attention, but the extent to which they are affected by shortcut learning has not been fully studied. In this paper, the shortcut learning effect of generative LLMs in three NLU tasks is investigated for the first time, using the LLaMA series models and FLAN-T5 models as representatives. The results show that the shortcut learning problem still exists in generative LLMs. Therefore, a hybrid data augmentation framework based on controllable explanations is proposed as a mitigation strategy for the shortcut learning problem in generative LLMs. The framework is data-centric, constructing a small-scale mixed dataset composed of model-generated controllable explanation data and part of the original prompting data for model fine-tuning. The experimental results in three representative NLU tasks show that the framework can effectively mitigate shortcut learning and significantly improve the robustness and generalization of the model in OOD test scenarios, without sacrificing, and in some cases even improving, the model performance in in-distribution test scenarios. The solution code is available at https://github.com/Mint9996/HEDA.
Abstract: In recent years, deep learning has achieved excellent performance in software engineering (SE) tasks. Excellent performance in practical tasks depends on large-scale training sets, and collecting and labeling large-scale training sets require a lot of resources and costs, which limits the wide application of deep learning techniques in practical tasks. With the release of pre-trained models (PTMs) in the field of deep learning, researchers in SE have begun to pay attention to PTMs and introduced PTMs into SE tasks. PTMs have brought a qualitative leap in SE tasks, which makes intelligent software engineering enter a new era. However, no existing study has systematically summarized the successes, failures, and opportunities of pre-trained models in SE. To clarify the work in this cross-field (pre-trained models for software engineering, PTM4SE), this study systematically reviews the current studies related to PTM4SE. Specifically, the study first describes the framework of intelligent software engineering methods based on pre-trained models and then analyzes the commonly used pre-trained models in SE. Meanwhile, it introduces the downstream tasks in SE with pre-trained models in detail and compares and analyzes the performance of pre-trained model techniques on these tasks. The study then presents the datasets used in SE for training and fine-tuning the PTMs. Finally, it discusses the challenges and opportunities for PTM4SE. The collated PTMs and datasets in SE are published at https://github.com/OpenSELab/PTM4SE.
Abstract: Pre-training knowledge graph (KG) models facilitate various downstream tasks in e-commerce applications. However, large-scale social KGs are highly dynamic, and the pre-training models need to be updated regularly to reflect the changes in node features caused by user interactions. This paper proposes an efficient incremental update framework for pre-training KG models. The framework mainly includes a bidirectional imitation distillation method to make full use of the different types of facts in new data, a sampling strategy based on samples’ normality and abnormality to select the most valuable facts from all new facts and reduce the training data size, and a reverse replay mechanism to generate high-quality negative facts that are more suitable for the incremental training of social KGs in e-commerce. Experimental results on real-world e-commerce datasets and related downstream tasks demonstrate that the proposed framework can incrementally update the pre-training KG models more effectively and efficiently compared to state-of-the-art methods.
Abstract: In recent years, online transactions of digital collections have been increasing, with platforms such as Alibaba Auctions and OpenSea facilitating their circulation in the market. However, bidders’ bidding privacy is at risk of being disclosed during an online auction. To address this issue, this study proposes a privacy-preserving online auction approach based on the homomorphic property of SM2, which not only protects the users’ bidding privacy but also ensures the usability of the bidding data. Specifically, this study constructs a homomorphic encryption scheme based on SM2, encrypting bidders’ bidding information and constructing a piece of noisy bidding information to conceal the private data. The efficiency of the online auction privacy preservation approach is improved by integrating the Chinese remainder theorem and baby-step giant-step (CRT-BSGS) method into the homomorphic encryption process with SM2, which proves to be more efficient than the Paillier algorithm. Finally, the security and efficiency of the proposed scheme are verified in detail.
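In an additively homomorphic elliptic-curve scheme of this kind, decryption yields the plaintext only "in the exponent", so recovering a bounded bid value requires solving a small discrete logarithm, which is where baby-step giant-step (BSGS) comes in. The sketch below illustrates BSGS over a plain multiplicative group modulo a prime; the SM2 curve arithmetic, the CRT decomposition, and the noisy-bid construction described above are omitted, and the parameters are toy values.

```python
from math import isqrt

def bsgs(g, h, p, bound):
    """Baby-step giant-step: find m in [0, bound) with g**m == h (mod p).

    Runs in O(sqrt(bound)) group operations instead of O(bound) brute force,
    which is what makes recovering bounded bid values from such ciphertexts practical.
    """
    m = isqrt(bound) + 1
    # Baby steps: table mapping g**j -> j for j in [0, m).
    table = {}
    e = 1
    for j in range(m):
        table.setdefault(e, j)
        e = e * g % p
    # Giant steps: h * (g**-m)**i, looking for a collision with the table.
    factor = pow(g, -m, p)          # modular inverse power (Python 3.8+)
    gamma = h
    for i in range(m):
        if gamma in table:
            return i * m + table[gamma]
        gamma = gamma * factor % p
    return None

# Toy usage: recover a "bid" of 123456 hidden as g**m mod p.
p, g = 1_000_000_007, 5
bid = 123456
print(bsgs(g, pow(g, bid, p), p, bound=1 << 20))   # prints 123456
```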
Abstract: The cross-shard state transition protocol is the basis for ensuring the atomicity of cross-shard transactions, and its efficiency directly affects the performance of the sharding system. The cross-shard transaction process of existing protocols can be divided into three phases: source-shard state move-out, cross-shard state transition, and destination-shard state move-in. These phases are executed sequentially and are tightly coupled. This paper proposes the ChannelLink cross-shard state transition protocol based on off-chain state channels. Since off-chain channels are highly flexible and can be confirmed instantly, the ChannelLink protocol can effectively decouple the tightly coupled three-phase process, reducing the average cost of cross-shard transactions and improving state transition efficiency. On this basis, this paper designs a low-overhead off-chain channel routing algorithm. This algorithm solves for the optimal state routing scheme by improving a genetic algorithm based on the characteristics of state transition transactions and the off-chain channel topology. It reduces the user’s cross-shard state transition overhead and guarantees transition efficiency. Finally, this paper implements a ChannelLink protocol prototype system and uses Bitcoin transactions and the Lightning Network state to construct the dataset for experimental verification. Results show that in a scenario with 16 shards and a cross-shard transaction ratio of 5.21%, the sharding system integrated with the ChannelLink protocol can improve throughput by 7.04%, reduce transaction confirmation latency by 52.51%, and reduce the cost of cross-shard state transition by more than 45.44%. Meanwhile, the performance advantages of the ChannelLink protocol gradually increase as the number of shards and the cross-shard transaction ratio increase.
Abstract: Although convolutional neural networks (CNNs) are widely used in image recognition due to their excellent generalization performance, adversarial samples contaminated by noise can easily deceive fully trained network models, posing security risks. Many existing defense methods improve the robustness of models, but most inevitably sacrifice model generalization. To alleviate this issue, a label-filtered weight parameter regularization method is proposed to balance the generalization and robustness of models by using the label information of samples during model training. Many previous robust model training methods suffer from two main issues: 1) the robustness of models is mainly enhanced by increasing the quantity or complexity of training set samples, which not only diminishes the dominant role of clean samples in model training but also significantly increases the workload of training tasks; 2) the label information of samples is used only for comparison with model predictions to control the direction of model parameter updates, neglecting the additional information hidden in sample labels. The proposed method selects the weight parameters that play a decisive role in classifying samples by filtering the correct labels of samples and the classification labels of adversarial samples, and regularizes these parameters during optimization to achieve a balance between model generalization and robustness. Experiments and analysis on the MNIST, CIFAR-10, and CIFAR-100 datasets demonstrate that the proposed method achieves good training results.
Abstract: This study considers slot filling as a crucial component of task-oriented dialogue systems, which serves downstream tasks by identifying specific slot entities in utterances. However, in a specific domain, it necessitates a large amount of labeled data, which is costly to collect. In this context, cross-domain slot filling emerges and efficiently addresses the issue of data scarcity through transfer learning. However, existing methods overlook the dependencies between slot types in utterances, leading to the suboptimal performance of existing models when transferring to new domains. To address this issue, a cross-domain slot filling method based on slot dependency modeling is proposed in this study. Leveraging the prompt learning approach based on generative pre-trained models, a prompt template integrating slot dependency information is designed, establishing implicit dependency relationships between different slot types and fully exploiting the predictive performance of slot entities in the pre-trained model. Furthermore, to enhance the semantic dependencies between slot types, slot entities, and utterance texts, discourse filling subtask is introduced in this study to strengthen the inherent connections between utterances and slot entities through reverse filling. Transfer experiments across multiple domains demonstrate significant performance improvements in zero-shot and few-shot settings achieved by the proposed model. Additionally, a detailed analysis of the main structures in the model and ablation experiments are conducted in this study to further validate the necessity of each part of the model.
Abstract: Serverless computing is an emerging cloud computing model based on the “function as a service (FaaS)” paradigm. Functions serve as the fundamental unit for deployment and scheduling, providing users with massively parallel and automatically scalable function execution services without the need to manage underlying resources. For users, serverless computing helps them alleviate the burden of managing cluster-level infrastructure, enabling them to focus on business-layer development and innovation. For service providers, applications are decomposed into fine-grained functions, leading to significantly improved scheduling efficiency and resource utilization. These significant advantages have swiftly drawn attention from the industry and propelled serverless computing into popularity. However, the distinct computing mode of serverless computing, divergent from traditional cloud computing, along with its stringent limitations on various aspects of tasks, poses numerous obstacles to application migration. The escalating complexity of migrated tasks also imposes higher performance requirements on serverless computing. Therefore, performance optimization technology for serverless computing systems has emerged as a critical research topic. This study reviews and summarizes research efforts on the performance optimization of serverless computing from four perspectives and introduces existing systems. Firstly, this study introduces the optimization technologies for typical tasks, including task adaptation and system optimization for specific task types. Secondly, it reviews the optimization work on sandbox environments, encompassing sandbox solutions and cold start optimization methods, which play a crucial role in the execution of serverless functions. Thirdly, it provides an overview of the optimization of I/O and communication technologies, which are major performance bottlenecks of serverless applications. Lastly, it briefly outlines related resource scheduling technologies, including platform-oriented and user-oriented scheduling strategies, which determine system resource utilization and task execution efficiency. In conclusion, it summarizes the current issues and challenges of performance optimization technologies for serverless computing and anticipates potential future research directions.
Abstract: Serverless computing is an emerging paradigm of cloud computing, allowing developers to focus only on application logic development without the need to manage complex underlying tasks. This paradigm allows developers to quickly build applications of smaller granularity, namely at the function level. With the increasing popularity of serverless computing, major cloud computing vendors have introduced their commercial serverless platforms one after another. However, the characteristics of these platforms have yet to be systematically studied and reliably compared. A comprehensive analysis of these characteristics can help developers choose an appropriate serverless platform while developing and executing serverless applications in the right way. To this end, an empirical study is conducted on the characteristics of mainstream commercial serverless platforms. This study involves such mainstream serverless platforms as AWS Lambda, Google Cloud Functions, Microsoft Azure Functions, and Alibaba Function Compute. The study is divided into two major parts: feature summarization and runtime performance analysis. In the feature summarization, the official documents of these serverless platforms are examined, and their key features are summarized and compared in terms of development, deployment, and runtime. In the runtime performance analysis, representative benchmarks are applied to analyze the runtime performance offered by these serverless platforms on a multidimensional basis. Specifically, key factors in the cold-start performance of applications, such as programming languages and memory sizes, are first analyzed. Furthermore, the task execution performance of these serverless platforms is discussed. Based on the results of the feature summarization and runtime performance analysis, this study sums up a series of findings and provides practical insights and potential research opportunities for developers, cloud computing vendors, and researchers.
Abstract: Microkernels migrate system services to user mode. Thanks to this isolated architecture, microkernels offer high reliability, which meets the needs of the aerospace field. SPARC processors are widely applied in the control equipment of spacecraft, satellite payloads, and planetary vehicles. However, the register window mechanism of SPARC affects the performance of inter-process communication (IPC) on microkernels, and its inter-processor interrupt (IPI) mechanism also seriously reduces the efficiency of cross-core IPC. As a key mechanism, IPC is vital to the overall performance of applications on microkernels. Based on observations of the register window mechanism, this study redesigns and implements a register bank mechanism, in which register windows are allocated and managed by the kernel, and thus implements BankedIPC on SPARC. At the same time, as IPI underperforms on SPARC, FlexIPC is designed to optimize the performance of cross-core IPC. These approaches are employed to optimize the general synchronous IPC implemented on the self-developed microkernel ChCore. Tests show that the average IPC performance of the microkernel on SPARC improves by about two times after optimization, and application performance improves by up to 15%.
Abstract: Multi-path transmission technology establishes multiple transmission paths between communication parties via various network interfaces on devices. In this way, bandwidth aggregation, load balance, and path redundancy will be achieved to increase transmission throughput and reliability. These benefits allow the multipath transmission technology to be widely used in several application scenarios such as servers, terminals, and data centers. As a part and parcel of network architecture and transmission technology studies, the technology is of research significance and value. To this end, this study systematically analyzes the multi-path transmission technology in terms of its concepts and core mechanisms. Firstly, the basic concepts, standardized process and application value of multi-path transmission are outlined. Secondly, the core mechanisms of the multi-path transmission technology are enunciated, including congestion control, packet scheduling, path management, retransmission mechanism, security mechanism, and the mechanism for specialized applications. Classification methods and the main research results of each mechanism are elaborated, and the advantages, disadvantages and the development direction of mechanisms are summarized. Finally, this study probes into challenges faced by multi-path transmission technology research and envisions the prospect for relevant studies.
Abstract: Capturing an accurate view of IP geolocation is of great interest to the networking research community, as it has many uses ranging from network measurement and mapping to analyzing the network’s infrastructure. However, the scale of today’s Internet, coupled with the rapid development of Internet applications, makes it very challenging to acquire a complete and accurate snapshot of IP geolocation technology. To the best of our knowledge, there is no systematic survey of the relevant research in this field. To fill this gap, this study presents, for the first time, a systematic summary of research on client-independent IP geolocation, in which clients do not participate in the geolocation process. This study examines the major research studies that have been conducted on topics related to IP geolocation in the 22 years since the first IP-based geolocation technology was proposed. To this end, these prior studies are classified according to the measurement method, that is, active, passive, and hybrid. The main techniques for each category are described, identifying their significant advantages and limitations. Also, the primary experience and lessons learned from these past efforts are presented. Afterwards, the latest progress in IP geolocation in both academia and industry is shown. Finally, the survey concludes with promising future directions, hoping to promote the development of IP geolocation.
Abstract: The assessment of adversarial robustness requires a complete and accurate evaluation of deep learning models’ noise resistance by combining the attack ability and noise magnitude of adversarial samples. However, the lack of completeness in the adversarial robustness evaluation metric system is a key problem with the existing adversarial attack and defense methods. The existing work on adversarial robustness evaluation lacks analysis and comparison of the evaluation metric system. The impact of attack success rate and different norms on the completeness of the robustness evaluation metric system and the restrictions on designing attack and defense methods are neglected. In this study, the adversarial robustness evaluation metric system is discussed in two dimensions: norm selection and metric indicators. The theoretical analysis of robustness evaluation completeness is carried out from three aspects: the inclusion relation of the evaluation metric domain, robustness description granularity, and the order relationship of the robustness evaluation metric system. The following conclusions are drawn: using noise statistical quantities such as the mean results in a larger and more comprehensive definition domain of evaluation indicators compared to using attack success rates, while also ensuring that any two adversarial sample sets can be compared. Using the $L_2 $ norm is more complete in the description of adversarial robustness evaluation compared to using other norms. Extensive experiments on 23 models and 20 adversarial attacks across 6 datasets validate these conclusions.
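To make the contrast between the two metric families concrete, the snippet below computes an attack success rate alongside L2 noise statistics over the same adversarial sample set. The data are synthetic placeholders and the aggregation choices are illustrative only, not the exact evaluation protocol of the study.

```python
import numpy as np

def robustness_metrics(clean, adversarial, model_correct_on_adv):
    """Contrast two evaluation styles over one adversarial sample set.

    clean, adversarial:   (N, ...) arrays of original and perturbed inputs
    model_correct_on_adv: (N,) bool array, is the model still correct on each sample?
    """
    noise = (adversarial - clean).reshape(len(clean), -1)
    l2 = np.linalg.norm(noise, ord=2, axis=1)
    return {
        "attack_success_rate": float(np.mean(~model_correct_on_adv)),
        # Noise statistics are defined over ALL samples, so any two adversarial
        # sample sets remain comparable regardless of how many attacks succeed.
        "mean_l2_perturbation": float(l2.mean()),
        "median_l2_perturbation": float(np.median(l2)),
    }

# Toy data: 100 flattened "images" with small random perturbations.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=(100, 32 * 32))
x_adv = np.clip(x + rng.normal(0, 0.03, size=x.shape), 0, 1)
still_correct = rng.uniform(size=100) > 0.4
print(robustness_metrics(x, x_adv, still_correct))
```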
Abstract: Benefiting from the rapid development of information technology and the widespread adoption of medical information systems, a vast amount of medical knowledge has been accumulated in medical databases, including patient clinical treatment events and medical expert consensus. It is crucial to extract knowledge from these medical facts and effectively manage and utilize them, which can advance the automation and intelligence of diagnosis and treatment. Knowledge graphs, as a novel knowledge representation tool, can effectively mine and organize information from abundant medical facts and have received extensive attention in the medical field. However, existing medical knowledge graphs often suffer from limitations such as small scale, numerous restrictions, and poor scalability, leading to a limited ability to express knowledge from medical facts. To address these issues, this study proposes a bilayer medical knowledge graph architecture and employs information extraction techniques on both English patient clinical treatment events and Chinese medical expert consensus to construct a billion-scale medical knowledge graph that is cross-lingual, multimodal, dynamically updated, and highly scalable, aiming to provide more accurate and intelligent medical services.
Abstract: With the application of artificial intelligence (AI) and end-to-end recognition methods in handwritten mathematical expression recognition, there has been a significant improvement in recognition accuracy. However, in contrast to tests on public datasets, real-world applications involving human input introduce more uncertain factors into recognition algorithms in practice. Factors such as personalized stroke information, ambiguous handwritten characters, and uncertain formula structures can significantly impact the performance of the recognition method. To address these challenges, HchMER, a hybrid human-machine intelligence method for handwritten mathematical expression recognition, is proposed. HchMER combines handwritten mathematical formula recognition algorithms, knowledge bases, and user feedback to enhance the machine's comprehension of user-input mathematical expressions, thereby improving the editing speed and accuracy of handwritten mathematical expressions. To assess the effectiveness of HchMER, it is compared with MyScript Math Recognition (MyScript) and a mature commercial product named “Microsoft Ink Equation” (InkEquation). Results show that HchMER outperformed MyScript and InkEquation in accuracy by 23.2% and 26.51%, respectively. In terms of average completion time, HchMER exceeded MyScript by 44.46% (9.6 s) but fell short of InkEquation by 11.48% (4.05 s). Furthermore, participants affirm HchMER in a questionnaire survey and semi-structured interviews.
Abstract: Persistent memory (PM), serving as a supplement to and potential replacement for main memory, offers a lower cost for data storage while ensuring data persistence. However, traditional index structures tailored for PM, such as B+ trees, fail to fully exploit the distribution characteristics of data for optimizing reading and writing performance on PM. Recent research endeavors have sought to enhance indexes’ reading and writing performance on PM and support index persistence through the data distribution awareness of learned indexes. Nonetheless, existing designs of persistent learned index structures suffer from additional PM accesses and poor performance when confronted with real-world data. To address the performance degradation of persistent learned indexes under real data distributions, this study proposes PLTree, a learned index with a DRAM/PM hybrid architecture. PLTree optimizes reading and writing performance under real data distributions through the following approaches: (1) a two-stage approach to construct the index, eliminating last-mile search in internal nodes and reducing PM accesses; (2) model-based search for efficient query performance on PM, with queries accelerated by leveraging metadata in DRAM; and (3) a log-based hierarchical overflow buffer structure tailored to PM characteristics to optimize writing performance. The results show that, compared with existing persistent memory indexes (APEX, FPTree, uTree, NBTree, and DPTree), PLTree achieves 1.9× to 34× better index construction performance across various datasets. In single-threaded scenarios, PLTree exhibits an average query and insertion performance improvement of 1.26× to 4.45× and 2.63× to 6.83×, respectively. In multi-threaded scenarios, PLTree surpasses the baselines by up to 10.2× and 23.7× in query and insertion performance, respectively.
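The model-based search that learned indexes such as PLTree rely on can be illustrated with a toy leaf node: a linear model predicts a key's position, and a binary search bounded by the model's maximum training error corrects the prediction. The sketch below is a DRAM-only, single-level simplification; PLTree's two-stage construction, PM layout, and overflow buffers are not modeled.

```python
import bisect
import numpy as np

class LearnedNode:
    """Toy leaf node: sorted keys plus a linear model position ~= a*key + b."""
    def __init__(self, keys):
        self.keys = np.sort(np.asarray(keys, dtype=np.float64))
        pos = np.arange(len(self.keys))
        self.a, self.b = np.polyfit(self.keys, pos, deg=1)       # least-squares line
        pred = np.rint(self.a * self.keys + self.b).astype(int)
        self.err = int(np.max(np.abs(pred - pos)))               # max prediction error

    def lookup(self, key):
        """Predict a position, then binary-search only within [pred-err, pred+err]."""
        pred = int(np.rint(self.a * key + self.b))
        lo = max(0, pred - self.err)
        hi = min(len(self.keys), pred + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, max(lo, hi))
        return i if i < len(self.keys) and self.keys[i] == key else None

# Toy usage: skewed (lognormal) keys, the kind of realistic distribution that
# degrades naive learned indexes without bounded-error search.
rng = np.random.default_rng(7)
keys = np.unique(rng.lognormal(mean=0, sigma=1.2, size=10_000))
node = LearnedNode(keys)
print(node.err, node.lookup(float(keys[1234])), node.lookup(-1.0))
```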
Abstract: As artificial intelligence and 5G technology are applied in the automotive industry, intelligent connected vehicles have come into being. An intelligent connected vehicle is a complex distributed heterogeneous system composed of a large number of electronic control units (ECUs) from different suppliers, which collaborate through in-vehicle network protocols represented by CAN. However, an attacker could exploit a variety of interfaces of an intelligent connected vehicle to penetrate the in-vehicle network and then attack the in-vehicle network and its components such as ECUs. Therefore, in-vehicle network security for intelligent connected vehicles has become one of the focuses of vehicle security research in recent years. On the basis of introducing the structure of intelligent connected vehicles, ECUs, the CAN bus, and the on-board diagnostic protocol, this study first summarizes the research progress of reverse engineering techniques for in-vehicle network protocols. Reverse engineering aims to obtain the implementation details of in-vehicle network protocols, which are usually not disclosed in the automotive industry, and is also a prerequisite for implementing in-vehicle network attacks and defenses. The remaining part is developed from the two angles of attack and defense. On the one hand, the attack vectors and main attack technologies of the in-vehicle network are summarized, including the attack technologies implemented through physical access and remote access, as well as those targeting the ECU and the CAN bus. On the other hand, existing in-vehicle network defense technologies are discussed, including intrusion detection technology based on feature extraction and machine learning methods, and security enhancement technology for in-vehicle network protocols based on cryptographic approaches. Finally, future research directions are prospected.
Abstract: As a fine-grained sentiment analysis method, aspect-based sentiment analysis is playing an increasingly important role in many application scenarios. However, with the ubiquity of social media and online reviews, cross-domain aspect-based sentiment analysis faces two major challenges: insufficient labeled data in the target domain and textual distribution differences between the source and target domains. Currently, many data augmentation methods attempt to alleviate these issues, yet the target domain text generated by these methods often suffers from shortcomings such as lack of fluency, limited diversity of generated data, and excessive convergence toward the source domain. To address these issues, this study proposes a method for cross-domain aspect-based sentiment analysis based on data augmentation with a large language model (LLM). This method leverages the rich language knowledge of large language models to construct appropriate prompts for the cross-domain aspect-based sentiment analysis task. It mines similar texts between the target domain and the source domain and uses in-context learning to guide the LLM to generate labeled text data in the target domain with domain-associated keywords. This approach addresses the lack of data in the target domain and the domain-specificity problem, effectively improving the accuracy and robustness of cross-domain sentiment analysis. Experiments on multiple real datasets show that the proposed method can effectively enhance the performance of the baseline model in cross-domain aspect-based sentiment analysis.
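The following minimal Python sketch illustrates only the general prompting step described above; the retrieval measure, prompt wording, and field names are illustrative assumptions rather than the paper's implementation.

# Minimal sketch: retrieve similar source-domain examples, then build an
# in-context prompt asking an LLM to generate labeled target-domain text
# around domain-associated keywords. All names here are hypothetical.
from collections import Counter
import math

def cosine_sim(a, b):
    va, vb = Counter(a.split()), Counter(b.split())
    num = sum(va[w] * vb[w] for w in set(va) & set(vb))
    den = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return num / den if den else 0.0

def build_prompt(source_examples, target_seed, keywords, k=2):
    ranked = sorted(source_examples, key=lambda ex: cosine_sim(ex["text"], target_seed), reverse=True)
    shots = "\n".join(f'Review: {ex["text"]}\nAspect: {ex["aspect"]}, Sentiment: {ex["label"]}'
                      for ex in ranked[:k])
    return (f"{shots}\n\nGenerate a restaurant review mentioning the aspects "
            f"{', '.join(keywords)}, and label each aspect's sentiment in the same format.")

source_examples = [
    {"text": "the laptop battery lasts all day", "aspect": "battery", "label": "positive"},
    {"text": "the screen is too dim outdoors", "aspect": "screen", "label": "negative"},
]
print(build_prompt(source_examples, "the soup was cold but the staff were friendly",
                   keywords=["food", "service"]))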
Abstract: As embedded systems are widely applied, their requirements are becoming increasingly complex, making requirements analysis a critical stage in embedded system development. How to correctly describe and model requirements has become a primary issue. This study systematically investigates the current requirements descriptions of embedded systems and conducts a comprehensive comparative analysis to deepen the understanding of the core concerns of embedded system requirements. The study first applies the systematic literature review method to identify, retrieve, summarize, and analyze the relevant literature published between January 1979 and November 2023. Through the automatic retrieval and snowballing processes, 150 papers closely related to the topic are finally selected to ensure the comprehensiveness of the review. The study analyzes the existing capabilities of embedded requirements description languages in terms of their description concerns, description contents, requirements analysis elements, etc. Finally, it summarizes the challenges facing current requirements descriptions. Moreover, aiming at the task of intelligent synthesis of embedded software, it puts forward requirements for the expressive ability of embedded system requirement description languages.
Abstract: Multi-modal affective computing is a fundamental and important research task in the field of affective computing, using multi-modal signals to understand the sentiment of user-generated videos. Although existing multi-modal affective computing approaches have achieved good performance on benchmark datasets, they generally ignore the problem of modal reliability bias in multi-modal affective computing tasks, whether in designing complex fusion strategies or learning modal representations. This study believes that compared to text, acoustic and visual modalities often express sentiment more realistically. Therefore, voice and vision have high reliability, while text has low reliability in affective computing tasks. However, the feature extraction tools commonly used for different modalities differ in learning ability, so the textual modality is represented more strongly than the acoustic and visual modalities (e.g., GPT-3 for text vs. ResNet for images). This further exacerbates the problem of modal reliability bias, which is unfavorable for high-precision sentiment judgment. To mitigate the bias caused by modal reliability, this study proposes a model-agnostic multi-modal reliability-aware affective computing approach (MRA) based on cumulative learning. MRA captures the modal reliability bias by designing a single textual-modality branch and gradually shifting the focus from sentiments expressed in the low-reliability textual modality to the high-reliability acoustic and visual modalities during the model learning process. Thus, MRA effectively alleviates inaccurate sentiment predictions caused by the low-reliability textual modality. Comparative experiments conducted on multiple benchmark datasets demonstrate that the proposed approach MRA can effectively highlight the importance of the high-reliability acoustic and visual modalities and mitigate the bias of the low-reliability textual modality. Additionally, the model-agnostic approach significantly improves the performance of multi-modal affective computing, indicating its effectiveness and generality in multi-modal affective computing tasks.
Abstract: As image data grows explosively on the Internet and image application fields widen, the demand for large-scale image retrieval is increasing greatly. Hash learning provides significant storage and retrieval efficiency for large-scale image retrieval and has attracted intensive research interest in recent years. Existing surveys on hash learning are confronted with the problems of weak timeliness and unclear technical routes. Specifically, they mainly cover the hashing methods proposed five to ten years ago, and few of them summarize the relationships between the components of hashing methods. In view of this, this study makes a comprehensive survey of hash learning for large-scale image retrieval by reviewing the hash learning literature published in the past twenty years. First, the technical route of hash learning and the key components of hashing methods are summarized, including the loss function, optimization strategy, and out-of-sample extension. Second, hashing methods for image retrieval are classified into two categories: unsupervised hashing methods and supervised ones. For each category of hashing methods, the research status and evolution process are analyzed. Third, several image benchmarks and evaluation metrics are introduced, and the performance of some representative hashing methods is analyzed through comparative experiments. Finally, the future research directions of hash learning are summarized considering its limitations and new challenges.
Abstract: A scene sketch is made up of multiple foreground and background objects and can directly and generally express complex semantic information. It has a wide range of practical applications in real life and has gradually become one of the research hotspots in the fields of computer vision and human-computer interaction. As the basic task of semantic understanding of scene sketches, scene sketch semantic segmentation is rarely studied. Most existing methods are adapted from the semantic segmentation of natural images and cannot overcome the sparsity and abstraction of sketches. To solve the above problems, this study proposes a graph Transformer model that works directly on sketch strokes. The model combines the temporal-spatial information of sketch strokes to solve the semantic segmentation task of free-hand scene sketches. First, the vector scene sketch is constructed into a graph with strokes as the nodes and temporal and spatial correlations between strokes as the edges. The temporal-spatial global context information of the strokes is then captured by an edge-enhanced Transformer module. Finally, the encoded temporal-spatial features are optimized for multi-class classification learning. The experimental results on the SFSD scene sketch dataset show that the proposed method can effectively segment scene sketches using stroke temporal-spatial information and achieves excellent performance.
Abstract: In the era of big data, the growing sample scale and the dynamic update and variation of dimensionality greatly increase the computational burden. Most of these data sets do not exist in the form of a single data type but are more often hybrid data containing both symbolic and numerical data. For this reason, scholars have proposed many feature selection algorithms for hybrid data. However, most existing algorithms are only applicable to static data or small-scale incremental data and cannot handle large-scale dynamically changing data, especially large-scale incremental data sets with changing data distributions. To address this limitation, this paper proposes a multi-granulation incremental feature selection algorithm for dynamic hybrid data based on an information fusion mechanism by analyzing the variations and updates of the granularity space and granularity structure in dynamic data. The algorithm focuses on the mechanism of granularity space construction in dynamic hybrid data, the mechanism of dynamic update of multiple data granularity structures, and the mechanism of information fusion for data distribution variations. Finally, the paper verifies the feasibility and efficiency of the proposed algorithm by comparing experimental results with those of other algorithms on UCI data sets.
Abstract: As mobile devices are widely used, the performance of their graphics processors has improved steadily. To meet users' continuous pursuit of an excellent experience, the screen resolution and refresh rate of mobile devices increase year by year. At the same time, the programmable shading pipeline in mobile games is becoming more complex, which makes game applications the main source of power consumption for mobile devices. This paper studies the rendering pipeline in mobile games and proposes a motion-aware rendering frame rate adjustment method to ensure rendering quality in power-saving mode. Unlike previous prediction models that only consider rendering errors of historical frames, this method builds a nonlinear model between camera pose and inter-frame rendering error and predicts the error based on the new frame's camera pose, thus achieving more accurate frame rate adjustment strategies. In addition, the method also includes a lightweight scene recognition module that can adjust the error threshold according to the specific scene where the player is located, thereby adopting different degrees of frame rate adjustment. Quantitatively compared with the prediction model that only considers historical frame errors, the proposed model improves the prediction accuracy on game frame sequences by more than 30%. At the same time, in the qualitative comparison of user experiments, under the same frame-skipping ratio, the proposed algorithm achieves higher rendering quality and better user experience. The algorithm integrates historical frame errors and camera information to predict future frame errors more accurately. It also combines prediction and scene recognition results to achieve better dynamic frame rate adjustment performance.
Abstract: The time-sensitive networking standards developed by the IEEE 802.1 Task Group can be applied to build highly reliable, low-latency, low-jitter Ethernet, and the extension of time-sensitive networking to the wireless field is also a hot topic. Compared with traditional wired communication, wireless time-sensitive networking can not only achieve highly reliable and low-delay communication but also offers higher flexibility, stronger mobility, and lower wiring and maintenance costs. Therefore, wireless time-sensitive networking is considered a promising technology for emerging applications such as autonomous driving, collaborative robotics, and remote medical control. Generally, wireless networks can be divided into infrastructure-based wireless networks and infrastructure-less wireless networks. The latter can be divided into two categories based on mobility: mobile ad hoc networks and wireless sensor networks. Therefore, this paper mainly studies and summarizes the application scenarios, related technologies, routing protocols, and high-reliability, low-delay transmission of these three types of networks.
Abstract: While keeping frequent application updates, Android application developers need to detect Android runtime permission (ARP) bugs as quickly as possible. Android applications cannot be effectively tested for permission-related behaviors with existing automated testing tools, since these tools are rarely designed with ARP bugs in mind. This study proposes a state transition graph guided testing approach for detecting ARP bugs in Android applications. First, it analyzes the APK file of the application under test for permission misuse, instruments the APIs that may cause ARP bugs in the APK file, and re-signs the APK file. Then, it installs the APK file and dynamically explores the application to generate its state transition graph (STG). Finally, it detects ARP bugs quickly by automated testing under the guidance of the STG. To evaluate the effectiveness of the approach, the study implements a prototype tool RPBDroid and conducts comparative experiments with the ARP bug detection tools SetDroid and PermDroid and the automated testing tool APE. The experimental results show that RPBDroid successfully detects 15 ARP bugs in 17 applications, detecting 14, 12, and 14 more ARP bugs than APE, SetDroid, and PermDroid, respectively. In addition, RPBDroid reduces the average time required to detect ARP bugs by 86.42%, 86.72%, and 86.70% in comparison with SetDroid, PermDroid, and APE.
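The following minimal Python sketch illustrates the general idea of STG-guided testing under stated assumptions (it is not RPBDroid's code): once the STG is available, a breadth-first search yields the shortest event sequence to each state that exercises an instrumented permission-related API, and those sequences can drive the test executor.

# Minimal sketch: BFS over a state transition graph to find the shortest
# event sequences reaching states of interest. All states and events below
# are hypothetical.
from collections import deque

def guided_paths(stg, start, target_states):
    """stg: {state: [(event, next_state), ...]} -> {target: [events...]}"""
    paths = {}
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, events = queue.popleft()
        if state in target_states and state not in paths:
            paths[state] = events
        for event, nxt in stg.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, events + [event]))
    return paths

stg = {
    "Main": [("tap_settings", "Settings"), ("tap_camera", "Camera")],
    "Settings": [("toggle_location", "LocationDialog")],
    "Camera": [],
    "LocationDialog": [],
}
print(guided_paths(stg, "Main", {"Camera", "LocationDialog"}))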
Abstract: Identity-based matchmaking encryption is a new cryptographic primitive that allows both the receiver and the sender to specify each other's identity and to communicate only when the identities match. Meanwhile, it provides a non-interactive secret handshake protocol to get rid of real-time interaction and further improve participant privacy. This study proposes an identity-based matchmaking encryption (IB-ME) scheme in prime-order groups under the symmetric external Diffie-Hellman (SXDH) assumption in the standard model. The scheme realizes short parameters and reduces the number of matching operations during decryption, making it the most efficient identity-based matchmaking encryption scheme so far. Additionally, this study puts forward the first inner-product with equality matchmaking encryption (IPE-ME) scheme under the SXDH assumption in the standard model. Technically, it first constructs the two schemes in composite-order groups, then simulates them in prime-order groups with the dual pairing vector space (DPVS) technique, and further reduces the parameter size by decreasing the required dimension of the dual basis. Finally, the proposed IPE-ME scheme is obtained by replacing the equality policy in the first layer of the IB-ME scheme with an inner-product policy.
Abstract: A function-as-a-service (FaaS) workflow, composed of multiple function services, can realize a complex business application by orchestrating and controlling the function services. Current FaaS workflow execution systems achieve data transfer among function services mainly through centralized data storage, resulting in heavy data transmission overhead and significantly affecting application performance. In cases of high concurrency, frequent data transmission will also cause serious contention for network bandwidth resources, resulting in application performance degradation. To address the above problems, this study analyzes the fine-grained data dependencies between function services and proposes a critical path-based FaaS workflow deployment optimization method. In addition, the study designs a dependency-aware data access and management mechanism to effectively reduce the data transmission between function services, thereby reducing the data transmission latency and end-to-end execution latency of FaaS workflow applications. The study implements a FaaS workflow system, FineFlow, and conducts experiments based on five real-world FaaS workflow applications. The experimental results show that FineFlow can effectively reduce the data transmission latency (the highest reduction and the average reduction are 74.6% and 63.8%, respectively) compared with a FaaS workflow platform that uses a centralized storage-based function interaction mechanism. On average, FineFlow reduces the latency of end-to-end FaaS workflow executions by 19.6%. In particular, for the FaaS workflow application with fine-grained data dependencies, FineFlow can further reduce its data transmission latency and end-to-end execution latency by 28.4% and 13.8% respectively compared with the state-of-the-art work. In addition, FineFlow can effectively alleviate the impact of network bandwidth fluctuations on application performance by reducing cross-node data transmission, improving the robustness of application performance against network bandwidth changes.
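A minimal sketch of a generic critical-path computation on a workflow DAG, assuming illustrative execution times and transfer costs; it is not FineFlow's deployment algorithm, but it shows the kind of analysis such a critical path-based optimization can build on (functions on the longest path are natural candidates for co-located deployment).

# Minimal sketch: longest (critical) path in a DAG where node weights are
# function execution times and edge weights approximate data-transfer costs.
def critical_path(exec_time, edges):
    """exec_time: {fn: seconds}; edges: {fn: [(succ, transfer_cost), ...]}"""
    order, seen = [], set()
    def visit(u):                      # post-order DFS: successors appear before u
        if u in seen:
            return
        seen.add(u)
        for v, _ in edges.get(u, []):
            visit(v)
        order.append(u)
    for u in exec_time:
        visit(u)
    best, succ = {}, {}
    for u in order:                    # successors are already processed
        best[u] = exec_time[u]
        for v, w in edges.get(u, []):
            if exec_time[u] + w + best[v] > best[u]:
                best[u] = exec_time[u] + w + best[v]
                succ[u] = v
    start = max(best, key=best.get)
    path = [start]
    while path[-1] in succ:
        path.append(succ[path[-1]])
    return path, best[start]

exec_time = {"parse": 2, "resize": 4, "ocr": 6, "merge": 1}
edges = {"parse": [("resize", 3), ("ocr", 5)], "resize": [("merge", 2)], "ocr": [("merge", 2)]}
print(critical_path(exec_time, edges))   # (['parse', 'ocr', 'merge'], 16)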
Abstract: Elephant flow identification is a fundamental task in network measurement. Mainstream methods generally employ sketch data structures to quickly count network traffic and efficiently find elephant flows. However, a rapid influx of packets significantly decreases the identification accuracy of elephant flows under network traffic jitter. To this end, this study proposes an elastic identification method for elephant flows that supports network traffic jitter, named RobustSketch. The method first designs a stretchable mice flow filter based on a cyclic sketch chain, which adaptively increases and reduces the number of sketches according to the real-time packet arrival rate, so that all packets arriving within the current period are always completely recorded and mice flows are filtered accurately even under traffic jitter. It then designs a scalable elephant flow record table based on dynamic segmented hashing, which adaptively adds and removes segments according to the number of candidate elephant flows filtered out by the mice flow filter, so that all candidate elephant flows are fully recorded while high storage space utilization is maintained. Furthermore, error bounds of the proposed mice flow filter and elephant flow record table are derived through theoretical analysis. Finally, the proposed elephant flow identification method RobustSketch is evaluated experimentally with real network traffic samples. Experimental results indicate that the identification accuracy of the proposed method is significantly higher than that of existing methods and stays stably above 99% even under network traffic jitter, while its average relative error is reduced by more than 2.7 times, enhancing both the accuracy and the robustness of elephant flow identification.
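For background, the sketch below shows the generic count-min-sketch idea that such methods build on (not RobustSketch itself): per-flow packet counts are estimated in small memory, and flows whose estimate exceeds a threshold are reported as candidate elephant flows. The widths, depths, and threshold are illustrative.

# Minimal count-min sketch for heavy-hitter (elephant flow) detection.
import random

class CountMinSketch:
    def __init__(self, width=1024, depth=4, seed=7):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.salts = [rng.getrandbits(32) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def add(self, flow_id, count=1):
        for row, salt in enumerate(self.salts):
            self.table[row][hash((salt, flow_id)) % self.width] += count

    def estimate(self, flow_id):
        return min(self.table[row][hash((salt, flow_id)) % self.width]
                   for row, salt in enumerate(self.salts))

cms, threshold = CountMinSketch(), 1000
packets = ["flowA"] * 5000 + ["flowB"] * 30 + ["flowC"] * 2000
for pkt in packets:
    cms.add(pkt)
elephants = {f for f in set(packets) if cms.estimate(f) >= threshold}
print(elephants)   # expected: {'flowA', 'flowC'}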
Abstract: Internet service providers employ routing protection algorithms to meet real-time, low-latency, and high-availability application needs. However, existing routing protection algorithms have the following three problems. (1) The failure protection ratio is generally low under the premise of not changing the traditional routing protocol forwarding mechanism. (2) The traditional routing protocol forwarding mechanism must be changed to pursue a high failure protection ratio, which is difficult to deploy in practice. (3) The optimal next hop and backup next hop cannot be used simultaneously, which leads to poor network load balancing capability. For these three problems, this study proposes a routing protection algorithm based on the shortest path serialization graph, which does not change the forwarding mechanism, supports incremental deployment, and uses both the optimal next hop and backup next hops without routing loops, achieving a high failure protection ratio. The proposed algorithm includes two main steps. (1) A sequence number is calculated for each node. (2) The shortest path serialization graph is generated from the node sequence numbers and reverse-order search rules, and the next hop set between node pairs is calculated according to the backup next hop calculation rules. Tests on real and simulated network topologies show that the proposed scheme has significant advantages over other routing protection schemes in the average number of backup next hops, the failure protection ratio, and path stretch.
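As background only (this is not the serialization-graph algorithm proposed above), the following sketch shows the classical loop-free condition commonly used in routing protection to validate a backup next hop: a neighbor n of source s is a safe backup toward destination d when dist(n, d) < dist(n, s) + dist(s, d), so forwarded packets can never loop back through s.

# Minimal sketch: Dijkstra plus the classical loop-free alternate check.
import heapq

def dijkstra(graph, src):
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def loop_free_backups(graph, s, d):
    dist = {u: dijkstra(graph, u) for u in graph}
    return [n for n in graph[s] if dist[n][d] < dist[n][s] + dist[s][d]]

graph = {
    "s": {"a": 1, "b": 4},
    "a": {"s": 1, "d": 2, "b": 1},
    "b": {"s": 4, "a": 1, "d": 1},
    "d": {"a": 2, "b": 1},
}
print(loop_free_backups(graph, "s", "d"))   # both 'a' and 'b' qualify here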
Abstract: Database management systems (DBMSs) are the infrastructure for efficient storage, management, and analysis of data, playing a pivotal role in modern data-intensive applications. Vulnerabilities in DBMSs pose a great threat to the security of data and the operation of applications. Fuzzing is one of the most popular dynamic vulnerability detection techniques and has been applied to analyze DBMSs, uncovering many vulnerabilities. This study analyzes the requirements and the difficulties involved in testing a DBMS and proposes a foundational framework for DBMS fuzzing. It also analyzes the challenges encountered by DBMS fuzzers and identifies the dimensions that necessitate support. It introduces typical DBMS fuzzers from the perspective of discovering different types of vulnerabilities and summarizes key techniques in DBMS fuzzing, including SQL statement synthesis, code coverage tracking, and test oracle construction. Several popular DBMS fuzzers are evaluated in terms of coverage, syntax and semantic correctness of the generated test cases, and the ability to find vulnerabilities. Finally, it presents the problems faced by current DBMS fuzzing research and practices and prospects for future research directions in DBMS fuzzing.
Abstract: Many two-party threshold schemes for the SM2 digital signature have been proposed in recent years, which can significantly enhance the security of the signing private key. According to the method of key splitting, the public schemes can be divided into two types: multiplicative key splitting and additive key splitting. Further, these schemes can be subdivided into various two-party threshold schemes according to different constructions of the signature random number. This study proposes a framework for two-party threshold schemes for the SM2 digital signature, which provides a secure basic calculation process for two-party threshold schemes and leaves the signature random number open to various constructions. With the proposed framework and various constructions of the random number, this study instantiates the framework and obtains a variety of two-party threshold schemes for the SM2 digital signature. The instantiations cover 23 known two-party threshold schemes as well as a variety of new ones.
Abstract: This article is recommended by Professor Bai Ying of the CCF Technical Committee on Software Engineering.
It was published in Proceedings of the 11th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE), ACM, September 2017, pp. 315-325.
The original article is available at: https://doi.org/10.1145/3106237.3106242.
Readers who wish to cite this article should cite the original publication.
Abstract: This article is recommended by Professor Bai Ying of the CCF Technical Committee on Software Engineering.
It was published in Proceedings of the 11th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE), ACM, September 2017, pp. 303-314.
The original article is available at: https://doi.org/10.1145/3106237.3106239.
Readers who wish to cite this article should cite the original publication.
Abstract: GitHub, a popular social-software-development platform, has fostered a variety of software ecosystems where projects depend on one another and practitioners interact with each other. Projects within an ecosystem often have complex inter-dependencies that impose new challenges in bug reporting and fixing. In this paper, we conduct an empirical study on cross-project correlated bugs, i.e., causally related bugs reported to different projects, focusing on two aspects: 1) how developers track the root causes across projects; and 2) how the downstream developers coordinate to deal with upstream bugs. Through manual inspection of bug reports collected from the scientific Python ecosystem and an online survey with developers, this study reveals the common practices of developers and the various factors in fixing cross-project bugs. These findings provide implications for future software bug analysis in the scope of ecosystems, as well as shed light on the requirements of issue trackers for such bugs.
Abstract: This article is recommended by Professor Bai Ying of the CCF Technical Committee on Software Engineering.
It was published in Proceedings of the 39th International Conference on Software Engineering, pp. 27-37, Buenos Aires, Argentina, May 20-28, 2017, IEEE Press, Piscataway, NJ, USA, ©2017, ISBN: 978-1-5386-3868-2.
The original article is available at: http://dl.acm.org/citation.cfm?id=3097373.
Readers who wish to cite this article should cite the original publication.
Abstract: This article is recommended by Professor Bai Ying of the CCF Technical Committee on Software Engineering.
It was published in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016), ACM, New York, NY, USA, pp. 871-882. DOI: https://doi.org/10.1145/2950290.2950364.
The original article is available at: http://dl.acm.org/citation.cfm?id=2950364.
Readers who wish to cite this article should cite the original publication.
Abstract: This article is recommended by Professor Bai Ying of the CCF Technical Committee on Software Engineering.
It was published in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 133-143, Seattle, WA, USA, November 2016.
The original article is available at: http://dl.acm.org/citation.cfm?id=2950327.
Readers who wish to cite this article should cite the original publication.
Abstract: This article is recommended by Professor Bai Ying of the CCF Technical Committee on Software Engineering.
It was published in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE'16), pp. 810-821, November 13-18, 2016.
The original article is available at: https://doi.org/10.1145/2950290.2950310.
Readers who wish to cite this article should cite the original publication.
Abstract: This article is recommended by Professor Bai Ying of the CCF Technical Committee on Software Engineering.
It was published at FSE'16: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering.
The original article is available at: http://dl.acm.org/citation.cfm?id=2950340.
Readers who wish to cite this article should cite the original publication.
Abstract: This article is recommended by Professor Bai Xiaoying (Tsinghua University) of the CCF Technical Committee on Software Engineering.
The original article was published in ASE 2016: Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. Full text: http://dx.doi.org/10.1145/2970276.2970307.
Important note: readers who cite this article should cite the original publication.
Abstract: Sensor networks, formed by the convergence of sensor, micro-electro-mechanical system (MEMS), and network technologies, are a novel technology for acquiring and processing information. In this paper, the architecture of wireless sensor networks is briefly introduced. Next, some valuable applications are explained and forecast. In combination with existing work, research hotspots including power-aware routing and medium access control schemes are discussed and presented in detail. Finally, taking into account application requirements, several future research directions are put forward.
Abstract: Automatic generation of poetry has always been considered a hard nut to crack in natural language generation. This paper reports pioneering research on a genetic algorithm for the automatic generation of SONGCI (Song ci poetry). In light of the characteristics of Chinese ancient poetry, this paper designs a coding method based on level and oblique tones, a syntactically and semantically weighted fitness function, a selection operator combining elitism and roulette wheel selection, a partially mapped crossover operator, and a heuristic mutation operator. As shown by tests, the system constructed on the basis of the computing model designed in this paper is basically capable of generating Chinese SONGCI with some aesthetic merit. This work represents progress in the field of automatic generation of Chinese poetry.
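A minimal Python sketch of the evolutionary loop described above, with deliberately simplified operators (a toy fitness function, roulette-wheel selection with elitism, one-point crossover, and random mutation); the actual system encodes tonal patterns and uses partially mapped crossover and heuristic mutation.

# Minimal genetic-algorithm skeleton; fitness and operators are illustrative only.
import random

VOCAB = list("月花风雪山水云春秋夜")           # toy character set

def fitness(line):
    return len(set(line))                      # toy objective: character diversity

def roulette(pop, fits):
    r = random.uniform(0, sum(fits))
    acc = 0.0
    for ind, f in zip(pop, fits):
        acc += f
        if acc >= r:
            return ind
    return pop[-1]

def evolve(pop_size=30, length=7, generations=50):
    pop = ["".join(random.choices(VOCAB, k=length)) for _ in range(pop_size)]
    for _ in range(generations):
        fits = [fitness(ind) for ind in pop]
        children = [max(pop, key=fitness)]     # elitism: keep the best individual
        while len(children) < pop_size:
            p1, p2 = roulette(pop, fits), roulette(pop, fits)
            cut = random.randrange(1, length)
            child = list(p1[:cut] + p2[cut:])  # one-point crossover
            if random.random() < 0.1:          # mutation
                child[random.randrange(length)] = random.choice(VOCAB)
            children.append("".join(child))
        pop = children
    return max(pop, key=fitness)

print(evolve())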
Abstract: Cloud Computing is a fundamental change happening in the field of Information Technology. It represents a movement towards intensive, large-scale specialization. On the other hand, it brings not only convenience and efficiency but also great challenges in the field of data security and privacy protection. Currently, security is regarded as one of the greatest problems in the development of Cloud Computing. This paper describes the major security requirements in Cloud Computing, key security technologies, standards, and regulations, and provides a Cloud Computing security framework. This paper argues that the changes in the above aspects will result in a technical revolution in the field of information security.
Abstract: Android is a modern and highly popular software platform for smartphones. According to reports, Android accounted for a huge 81% of all smartphones in 2014 and shipped over 1 billion units worldwide for the first time ever. Apple, Microsoft, Blackberry, and Firefox trailed a long way behind. At the same time, the increased popularity of Android smartphones has attracted hackers, leading to a massive increase in Android malware applications. This paper summarizes and analyzes the latest advances in Android security from multidimensional perspectives, covering Android architecture, design principles, security mechanisms, major security threats, classification and detection of malware, static and dynamic analyses, machine learning approaches, and security extension proposals.
Abstract: The current research status and recent progress of clustering algorithms are summarized in this paper. First, representative clustering algorithms are analyzed and summarized from several aspects, such as algorithm ideas, key techniques, and advantages and disadvantages. Second, several typical clustering algorithms and well-known data sets are selected, and simulation experiments are conducted to compare both accuracy and running efficiency; the clustering behavior of one algorithm on different data sets is analyzed, as is that of different algorithms on the same data set. Finally, the research hotspots, difficulties, and shortcomings of data clustering, together with some open problems, are addressed by integrating the information from the two aspects above. This work can serve as a valuable reference for data clustering and data mining.
Abstract: This paper surveys the current technologies adopted in cloud computing as well as the systems in enterprises. Cloud computing can be viewed from two different aspects. One is the cloud infrastructure, which is the building block for the upper-layer cloud applications. The other is, of course, the cloud applications. This paper focuses on the cloud infrastructure, including the systems and current research. Some attractive cloud applications are also discussed. Cloud computing infrastructure has three distinct characteristics. First, the infrastructure is built on top of large-scale clusters that contain a large number of cheap PC servers. Second, the applications are co-designed with the fundamental infrastructure so that the computing resources can be maximally utilized. Third, the reliability of the whole system is achieved by software built on top of redundant hardware rather than by hardware alone. All these technologies serve the two important goals of distributed systems: high scalability and high availability. Scalability means that the cloud infrastructure can be expanded to a very large scale, even to thousands of nodes. Availability means that the services remain available even when quite a number of nodes fail. From this paper, readers will capture the current status of cloud computing as well as its future trends.
Abstract: Evolutionary multi-objective optimization (EMO), whose main task is to deal with multi-objective optimization problems by evolutionary computation, has become a hot topic in the evolutionary computation community. After briefly summarizing the EMO algorithms before 2003, the recent advances in EMO are discussed in detail, and the current research directions are concluded. On the one hand, more new evolutionary paradigms have been introduced into the EMO community, such as particle swarm optimization, artificial immune systems, and estimation of distribution algorithms. On the other hand, in order to deal with many-objective optimization problems, many new dominance schemes different from traditional Pareto dominance have come forth. Furthermore, the essential characteristics of multi-objective optimization problems are deeply investigated. This paper also gives an experimental comparison of several representative algorithms. Finally, several viewpoints for the future research of EMO are proposed.
Abstract: This paper offers some reflections on the following four aspects: 1) from the law of the development of things, revealing the development history of software engineering technology; 2) from the natural characteristics of software, analyzing the construction of each abstraction layer of the virtual machine; 3) from the point of view of software development, proposing the research content of the software engineering discipline and studying the pattern of industrialized software production; 4) based on the emergence of Internet technology, exploring the development trend of software technology.
Abstract: This paper surveys the state of the art of sentiment analysis. First, three important tasks of sentiment analysis are summarized and analyzed in detail, including sentiment extraction, sentiment classification, and sentiment retrieval and summarization. Then, the evaluation methods and corpora for sentiment analysis are introduced. Finally, the applications of sentiment analysis are summarized. This paper aims to take a deep insight into the mainstream methods and recent progress in this field, making detailed comparisons and analyses.
Abstract: With the rapid development of e-business, Web applications have evolved from localization to globalization, from B2C (business-to-customer) to B2B (business-to-business), and from a centralized fashion to a decentralized fashion. Web services are a new application model for decentralized computing and an effective mechanism for data and service integration on the Web. Thus, Web services have become a solution to e-business. It is important and necessary to carry out research on new architectures of Web services, on combinations with other techniques, and on the integration of services. In this paper, a survey is presented on various aspects of Web services research, from the basic concepts to the principal research problems and the underlying techniques, including data integration in Web services, Web service composition, semantic Web services, Web service discovery, Web service security, solutions for Web services in the P2P (peer-to-peer) computing environment, grid services, etc. This paper also presents a summary of the current state of the art of these techniques, a discussion of future research topics, and the challenges of Web services.
Abstract: Wireless sensor networks, a novel technology for acquiring and processing information, have been proposed for a multitude of diverse applications. The problem of self-localization, that is, determining where a given node is physically or relatively located in the network, is a challenging one, and yet extremely crucial for many applications. In this paper, the evaluation criteria for performance and the taxonomy of self-localization systems and algorithms for wireless sensor networks are described, the principles and characteristics of recent representative localization approaches are discussed and presented, and the directions of research in this area are introduced.
Abstract: Network community structure is one of the most fundamental and important topological properties of complex networks, within which the links between nodes are very dense, but between which they are quite sparse. Network clustering algorithms, which aim to discover all natural network communities from given complex networks, are fundamentally important for both theoretical research and practical applications, and can be used to analyze the topological structures, understand the functions, recognize the hidden patterns, and predict the behaviors of complex networks including social networks, biological networks, the World Wide Web, and so on. This paper reviews the background, the motivation, the state of the art, as well as the main issues of existing works related to discovering network communities, and tries to draw a comprehensive and clear outline for this new and active research area. This work is hopefully beneficial to researchers from the communities of complex network analysis, data mining, intelligent Web, and bioinformatics.
Abstract: Considered as the next-generation computing model, cloud computing plays an important role in scientific and commercial computing and draws great attention from both academia and industry. In a cloud computing environment, a data center consists of a large number of computers, usually up to millions, and stores petabytes or even exabytes of data, which may easily lead to failures of computers or data. Such a large number of computers not only poses great challenges to the scalability of the data center and its storage system, but also results in high hardware infrastructure cost and power cost. Therefore, fault tolerance, scalability, and power consumption of the distributed storage of a data center become key issues in cloud computing technology, in order to ensure data availability and reliability. In this paper, a survey is made on the state of the art of the key technologies in cloud computing in the following aspects: design of the data center network, organization and arrangement of data, strategies to improve fault tolerance, and methods to save storage space and energy. First, many kinds of classical topologies of data center networks are introduced and compared. Second, current fault-tolerant storage techniques are discussed, and data replication and erasure code strategies are compared in particular. Third, the main current energy-saving technologies are addressed and analyzed. Finally, challenges in distributed storage are reviewed and future research trends are predicted.
Abstract: In many areas such as science, simulation, the Internet, and e-commerce, the volume of data to be analyzed grows rapidly. Parallel techniques that can be expanded cost-effectively should be invented to deal with such big data. Relational data management techniques have gone through a history of nearly 40 years and now encounter the tough obstacle of scalability, as relational techniques cannot handle large data easily. In the meantime, non-relational techniques, with MapReduce as a typical representative, emerge as a new force and expand their applications from Web search to territories that used to be occupied by relational database systems. They confront relational techniques with high availability, high scalability, and massive parallel processing capability. The relational technique community, after losing the Web search market, begins to learn from MapReduce, while MapReduce also borrows valuable ideas from the relational community to improve performance. Relational techniques and MapReduce compete with and learn from each other; a new data analysis platform and a new data analysis eco-system are emerging. Eventually the two camps of techniques will find their right places in the new eco-system of big data analysis.
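For readers unfamiliar with the MapReduce model mentioned above, the following single-process Python sketch mimics its map, shuffle, and reduce phases on a word-count example; real systems distribute these phases across a cluster, but the programming model is the same.

# Minimal local imitation of the MapReduce phases (word count).
from collections import defaultdict

def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield word, 1                  # emit (key, value) pairs

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)          # group values by key
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data analysis", "big relational data", "data data data"]
print(reduce_phase(shuffle(map_phase(docs))))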
Abstract: Nowadays it has been widely accepted that the quality of software highly depends on the process that is carried out in an organization. As part of the effort to support software process engineering activities, the research on software process modeling and analysis is to provide an effective means to represent and analyze a process and, by doing so, to enhance the understanding of the modeled process. In addition, an enactable process model can provide direct guidance for the actual development process. Thus, the enforcement of the process model can directly contribute to the improvement of the software quality. In this paper, a systematic review is carried out to survey the recent development in software process modeling. 72 papers from 20 conference proceedings and 7 journals are identified as the evidence. The review aims to promote a better understanding of the literature by answering the following three questions: 1) What kinds of paradigms are existing methods based on? 2) What kinds of purposes does the existing research have? 3) What kinds of new trends are reflected in the current research? After providing the systematic review, we present our software process modeling method based on a multi-dimensional and integration methodology that is intended to address several core issues facing the community.
Abstract: The appearance of plenty of intelligent devices equipped for short-range wireless communications boosts the fast rise of wireless ad hoc network applications. However, in many realistic application environments, nodes form a disconnected network for most of the time due to nodal mobility, low density, lossy links, etc. The conventional communication model of mobile ad hoc networks (MANETs) requires at least one path from the source to the destination node, which results in communication failure in these scenarios. Opportunistic networks utilize the communication opportunities arising from node movement to forward messages in a hop-by-hop way, and implement communication between nodes based on the "store-carry-forward" routing pattern. This networking approach, totally different from the traditional communication model, has captured great interest from researchers. This paper first introduces the concepts and theories of opportunistic networks and some current typical applications. Then it elaborates on popular research problems, including opportunistic forwarding mechanisms, mobility models, and opportunistic data dissemination and retrieval. Some other interesting research points, such as communication middleware, cooperation, security problems, and new applications, are stated briefly. Finally, the paper concludes and looks forward to possible research focuses for opportunistic networks in the future.
Abstract: With the explosive growth of network applications and complexity, the threat of Internet worms against network security becomes increasingly serious. Especially in the Internet environment, the variety of propagation paths and the complexity of application environments result in worms with a much higher frequency of outbreaks, deeper latency, and wider coverage, and Internet worms have become a primary issue faced by malicious code researchers. In this paper, the concept and research status of Internet worms, together with their functional components and execution mechanism, are first presented; then the scanning strategies and propagation models are discussed; and finally the critical techniques of Internet worm prevention are given. Some major problems and research trends in this area are also addressed.
Abstract: This paper makes a comprehensive survey of recommender system research to help readers understand this field. First, the research background is introduced, including commercial application demands, academic institutes, conferences, and journals. After formally and informally describing the recommendation problem, a comparative study is conducted based on categorized algorithms. In addition, the commonly adopted benchmark datasets and evaluation methods are presented, and the main difficulties and future directions are summarized.
Abstract: This paper studies uncertain graph data mining and especially investigates the problem of mining frequent subgraph patterns from uncertain graph data. A data model is introduced for representing uncertainties in graphs, and expected support is employed to evaluate the significance of subgraph patterns. By using the apriori property of expected support, a depth-first search-based mining algorithm is proposed with an efficient method for computing expected supports and a technique for pruning the search space, which reduces the number of subgraph isomorphism tests needed to compute expected support from the exponential scale to the linear scale. Experimental results show that the proposed algorithm is 3 to 5 orders of magnitude faster than a naïve depth-first search algorithm, and is efficient and scalable.
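The following sketch illustrates the expected-support notion under the common edge-independence assumption (an illustration only, not the paper's algorithm): it enumerates the possible worlds of a tiny uncertain graph, which is exactly the exponential cost that the proposed apriori-based pruning and efficient computation method avoid.

# Minimal sketch: expected support of a fixed edge set in one uncertain graph,
# assuming independent edge existence probabilities.
from itertools import product

def expected_support(edge_probs, pattern_edges):
    edges = list(edge_probs)
    exp_sup = 0.0
    for outcome in product([0, 1], repeat=len(edges)):          # possible worlds
        p = 1.0
        present = set()
        for e, kept in zip(edges, outcome):
            p *= edge_probs[e] if kept else (1 - edge_probs[e])
            if kept:
                present.add(e)
        if set(pattern_edges) <= present:                        # pattern survives
            exp_sup += p
    return exp_sup

edge_probs = {("a", "b"): 0.9, ("b", "c"): 0.5, ("a", "c"): 0.3}
print(expected_support(edge_probs, [("a", "b"), ("b", "c")]))    # 0.45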
Abstract: This paper introduces the concrete details of combining automated reasoning techniques with planning methods, including planning as satisfiability using propositional logic, conformant planning using modal logic and disjunctive reasoning, planning as nonmonotonic logic, and flexible planning as fuzzy description logic. After considering the experimental results of the International Planning Competition and relevant papers, it concludes that planning methods based on automated reasoning techniques are helpful and can be adopted. It also proposes the challenges and possible hotspots.
Abstract: Network abstraction brought about the birth of software-defined networking (SDN). SDN decouples the data plane and the control plane and simplifies network management. The paper starts with a discussion of the background of the birth and development of SDN and sorts out its architecture, which includes the data layer, control layer, and application layer. The key technologies are then elaborated according to the hierarchical architecture of SDN, with the characteristics of consistency, availability, and tolerance analyzed in particular. Moreover, the latest achievements in typical application scenarios are introduced. Future work is summarized at the end.
Abstract: Sensor networks integrate sensor techniques, embedded computation techniques, distributed computation techniques, and wireless communication techniques. They can be used for testing, sensing, collecting, and processing information about monitored objects and for transferring the processed information to users. Sensor networks are a new research area of computer science and technology with broad application prospects, and both academia and industry are very interested in them. The concepts and characteristics of sensor networks and the data in such networks are introduced, and the issues of sensor networks and the data management of sensor networks are discussed. The advances in research on sensor networks and the data management of sensor networks are also presented.
Abstract: Batch computing and stream computing are two important forms of big data computing. The research and discussion on batch computing in big data environments are comparatively sufficient. However, how to efficiently deal with stream computing to meet requirements such as low latency, high throughput, and continuously reliable running, and how to build efficient stream big data computing systems, are great challenges in big data computing research. This paper studies the system architecture and the key issues of stream computing in big data environments. First, it gives a brief summary of three application scenarios of stream computing in business intelligence, marketing, and public service, and shows the distinctive features of stream computing in big data environments, such as real-time, volatile, bursty, irregular, and unbounded data. A well-designed stream computing system must be optimized in system structure, data transmission, application interfaces, high availability, and so on. Subsequently, the paper offers detailed analyses and comparisons of five typical open-source stream computing systems for big data environments. Finally, it specifically addresses some new challenges of stream big data systems, such as scalability, fault tolerance, consistency, load balancing, and throughput.
Abstract: Context-Aware recommender systems, aiming to further improve performance accuracy and user satisfaction by fully utilizing contextual information, have recently become one of the hottest topics in the domain of recommender systems. This paper presents an overview of the field of context-aware recommender systems from a process-oriented perspective, including system frameworks, key techniques, main models, evaluation, and typical applications. The prospects for future development and suggestions for possible extensions are also discussed.
Abstract: In a multi-hop wireless sensor network (WSN), the sensors closest to the sink tend to deplete their energy faster than other sensors, which is known as the energy hole problem around the sink. No more data can be delivered to the sink after an energy hole appears, while a considerable amount of energy is wasted and the network lifetime ends prematurely. This paper investigates the energy hole problem and, based on an improved corona model with levels, concludes that assigning different transmission ranges to nodes in different coronas is an effective approach to achieving an energy-efficient network. It proves that finding the optimal transmission ranges for all coronas is a multi-objective optimization problem (MOP), which is NP-hard. The paper proposes an ACO (ant colony optimization)-based distributed algorithm to prolong the network lifetime, which can help nodes in different areas adaptively find approximately optimal transmission ranges based on the node distribution. Furthermore, the simulation results indicate that the network lifetime under this solution approximates that obtained using the optimal list. Compared with existing algorithms, the ACO-based algorithm not only extends the network lifetime by more than a factor of two, but also performs well under non-uniform node distributions.
Abstract: With the recent development of cloud computing, the importance of cloud databases has been widely acknowledged. Here, the features, influence and related products of cloud databases are first discussed. Then, research issues of cloud databases are presented in detail, which include data model, architecture, consistency, programming model, data security, performance optimization, benchmark, and so on. Finally, some future trends in this area are discussed.
Abstract: Intrusion detection is a highlighted topic of network security research in recent years. In this paper, first the necessity of intrusion detection is presented, and its concepts and models are described. Then, many intrusion detection techniques and architectures are summarized. Finally, the existing problems and the future direction in this field are discussed.
Abstract: Software architecture (SA) has recently emerged as one of the primary research areas in software engineering and one of the key technologies for the development of large-scale software-intensive systems and software product lines. The history and the major directions of SA are summarized, and the concept of SA is brought up based on analyzing and comparing several classical definitions of SA. Based on a summary of the activities involved in SA, two categories of SA study are extracted, and the advances in research on SA are subsequently introduced from seven aspects. Additionally, some disadvantages of the study on SA are discussed, and the causes are explained at the same time. Finally, the paper concludes with some significantly promising tendencies in research on SA.
Abstract: Many application-oriented NoSQL database systems have been developed to satisfy the new requirements of big data management. This paper surveys research on typical NoSQL databases based on the key-value data model. First, the characteristics of big data and the key technical issues supporting big data management are introduced. Then, frontier efforts and research challenges are given, including system architecture, data model, access mode, index, transaction, system elasticity, load balance, replica strategy, data consistency, flash cache, MapReduce-based data processing, and new-generation data management systems. Finally, research prospects are given.
Abstract: In recent years, transfer learning has attracted a vast amount of attention and research. Transfer learning is a new machine learning method that applies knowledge from related but different domains to target domains. It relaxes the two basic assumptions in traditional machine learning: (1) the training data (also referred to as the source domain) and the test data (also referred to as the target domain) follow the independent and identically distributed (i.i.d.) condition; (2) there are enough labeled samples to learn a good classification model. It aims to solve the problem that there are few or even no labeled data in target domains. This paper surveys the research progress of transfer learning and introduces the authors' own work, especially on building transfer learning models by applying generative models at the concept level. Finally, the paper introduces applications of transfer learning, such as text classification and collaborative filtering, and suggests future research directions of transfer learning.
Abstract: The Internet traffic model is the key issue for network performance management, quality of service management, and admission control. The paper first summarizes the primary characteristics of Internet traffic, as well as the metrics of Internet traffic. It also illustrates the significance and classification of traffic modeling. Next, the paper chronologically categorizes the research activities of traffic modeling into three phases: 1) traditional Poisson modeling; 2) self-similar modeling; and 3) new research debates and new progress. Thorough reviews of the major research achievements of each phase are conducted. Finally, the paper identifies some open research issues and points out possible future research directions in the traffic modeling area.
Abstract: Routing technology at the network layer is pivotal in the architecture of wireless sensor networks. As an active branch of routing technology, cluster-based routing protocols excel in network topology management, energy minimization, data aggregation and so on. In this paper, cluster-based routing mechanisms for wireless sensor networks are analyzed. Cluster head selection, cluster formation and data transmission are three key techniques in cluster-based routing protocols. As viewed from the three techniques, recent representative cluster-based routing protocols are presented, and their characteristics and application areas are compared. Finally, the future research issues in this area are pointed out.
Abstract: For most peer-to-peer file-swapping applications, sharing is a voluntary action, and peers are not held accountable for their irresponsible bartering history. This situation indicates that trust between participants cannot be established simply by traditional trust mechanisms. A reasonable trust construction approach comes from social network analysis, in which trust relations between individuals are established upon the recommendations of other individuals. Current P2P trust models cannot guarantee the convergence of the iterative trust computation and take no account of security problems of the model itself, such as sybil attacks and slandering. This paper presents a novel recommendation-based global trust model and gives a distributed implementation method. Mathematical analyses and simulations show that, compared to current global trust models, the proposed model is more robust against trust security problems and more complete in the iterative computation of peer trust.
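A minimal sketch of iterative global trust aggregation in the spirit of recommendation-based models (the paper's exact formulation may differ): each peer's global trust value is repeatedly recomputed as the trust-weighted average of the normalized recommendations it receives, until the values converge.

# Minimal sketch: power-iteration style aggregation over normalized local trust.
def global_trust(local, iterations=50):
    """local[i][j]: normalized recommendation of peer i for peer j (each row sums to 1)."""
    n = len(local)
    t = [1.0 / n] * n                    # uniform initial trust
    for _ in range(iterations):
        t = [sum(t[i] * local[i][j] for i in range(n)) for j in range(n)]
    return t

local = [
    [0.0, 0.7, 0.3],
    [0.6, 0.0, 0.4],
    [0.5, 0.5, 0.0],
]
print([round(x, 3) for x in global_trust(local)])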
Abstract: Constrained optimization problems (COPs) are mathematical programming problems frequently encountered in science and engineering applications. Solving COPs has become an important research area of evolutionary computation in recent years. In this paper, the state of the art of constrained optimization evolutionary algorithms (COEAs) is surveyed from the two basic aspects of COEAs (i.e., constraint-handling techniques and evolutionary algorithms). In addition, this paper discusses some important issues of COEAs. More specifically, several typical algorithms are analyzed in detail. Based on the analyses, it is concluded that, to obtain competitive results, a proper constraint-handling technique needs to be considered in conjunction with an appropriate search algorithm. Finally, the open research issues in this field are also pointed out.
Abstract: An ad hoc network is a collection of wireless mobile nodes dynamically forming a temporary network without the use of any existing network infrastructure or centralized administration. Due to the bandwidth constraints and dynamic topology of mobile ad hoc networks, multipath-supported routing is a very important research issue. In this paper, we present an entropy-based metric to support stable multipath on-demand routing (SMDR). The key idea of the SMDR protocol is to construct a new entropy metric and to select stable multiple paths with its help, so as to reduce the number of route reconstructions and provide QoS guarantees in ad hoc networks whose topology changes continuously. Simulation results show that, with the proposed multipath routing protocol, packet delivery ratio, end-to-end delay, and routing overhead ratio can be improved in most cases. It is a viable approach to multipath routing decisions.
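The abstract does not spell out the entropy construction, so the sketch below is only one plausible reading (an assumption): normalize per-link stability estimates along a route into a distribution and use their Shannon entropy to summarize how evenly the route's stability is spread over its links.

```python
import math

def path_entropy(link_stabilities):
    """Shannon entropy of normalized link-stability weights along one route.

    link_stabilities: positive per-hop stability estimates (e.g., expected
    link lifetimes).  The entropy summarizes how evenly stability is spread
    over the route's links; how SMDR actually defines and uses its entropy
    metric is not reproduced here.
    """
    total = sum(link_stabilities)
    probs = [s / total for s in link_stabilities]
    return -sum(p * math.log(p) for p in probs if p > 0)

# Route A has evenly stable links; route B hinges on one strong and two weak links.
print(path_entropy([5.0, 5.0, 5.0]), path_entropy([9.0, 0.5, 0.5]))
```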
Abstract: As an important application for acceleration in the cloud, distributed caching technology has received considerable attention in industry and academia. This paper starts with a discussion on the combination of cloud computing and distributed caching technology, giving an analysis of its characteristics, typical application scenarios, stages of development, standards, and several key elements that have promoted its development. In order to systematically understand the state-of-the-art progress and weak points of distributed caching technology, the paper builds a multi-dimensional framework, DctAF, consisting of six dimensions derived from analyzing the characteristics of cloud computing and the boundary of caching techniques. Based on DctAF, current techniques are analyzed and summarized, and comparisons among several influential products are made. Finally, the paper describes and highlights several challenges that cache systems face and examines current research through in-depth analysis and comparison.
Abstract: The recommendation system is one of the most important technologies in e-commerce. With the development of e-commerce, the numbers of users and commodities grow rapidly, resulting in extreme sparsity of user rating data. Traditional similarity measures perform poorly in this situation, making the quality of recommendation systems decrease dramatically. To address this issue, a novel collaborative filtering algorithm based on item rating prediction is proposed. This method predicts the ratings of items that users have not rated by using item similarity, and then uses a new similarity measure to find the target users' neighbors. The experimental results show that this method can effectively alleviate the extreme sparsity of user rating data and provide better recommendation results than traditional collaborative filtering algorithms.
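As an illustration of the general idea (predict missing ratings from item-item similarity first, then search for neighbors on the densified matrix), here is a small Python sketch using plain cosine similarity over co-rated users; the paper's specific similarity measure is not reproduced.

```python
import numpy as np

def item_based_predictions(ratings):
    """Fill unrated entries using cosine item-item similarity.

    ratings: users x items matrix, with 0 meaning "not rated".
    Returns a dense matrix where missing ratings are predicted from the
    user's ratings of similar items (a generic sketch, not the paper's
    exact similarity measure).
    """
    r = np.asarray(ratings, dtype=float)
    mask = r > 0
    n_items = r.shape[1]
    sim = np.zeros((n_items, n_items))
    for i in range(n_items):
        for j in range(n_items):
            both = mask[:, i] & mask[:, j]          # users who rated both items
            if both.any():
                a, b = r[both, i], r[both, j]
                denom = np.linalg.norm(a) * np.linalg.norm(b)
                sim[i, j] = a @ b / denom if denom else 0.0
    pred = r.copy()
    for u in range(r.shape[0]):
        for i in range(n_items):
            if not mask[u, i]:
                w = sim[i, mask[u]]                 # similarities to items u rated
                rated = r[u, mask[u]]
                pred[u, i] = w @ rated / w.sum() if w.sum() else rated.mean()
    return pred

ratings = [[5, 3, 0, 1],
           [4, 0, 0, 1],
           [1, 1, 0, 5],
           [1, 0, 4, 4]]
print(np.round(item_based_predictions(ratings), 2))
```

On the resulting dense matrix, user-user similarity can be computed without the distortion caused by missing values, which is the motivation for predicting item ratings before the neighbor search.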
Abstract: The crucial technologies related to personalization are introduced in this paper, which include the representation and modification of user profile, the representation of resource, the recommendation technology, and the architecture of personalization. By comparing with some existing prototype systems, the key technologies about how to implement personalization are discussed in detail. In addition, three representative personalization systems are analyzed. At last, some research directions for personalization are presented.
Abstract: Computer forensics is the technology field that attempts to provide thorough, efficient, and secure means to investigate computer crime. Computer evidence must be authentic, accurate, complete, and convincing to juries. In this paper, the stages of computer forensics are presented, and the theories and realization of forensics software are described. An example of forensic practice is also given. The deficiencies of computer forensics techniques and anti-forensics are also discussed. The conclusion is that, with the improvement of computer science and technology, forensics techniques will become more integrated and thorough.
Abstract: Widespread deployment of interactive information visualization is difficult. Non-specialist users need a general development method and a toolkit that supports the generic data structures suited to tree, network, and multi-dimensional data, special visualization and interaction techniques, and well-known generic information tasks. This paper presents a model-driven development method for interactive information visualization. First, an interactive information visualization interface model (IIVM) is proposed. Then, the development method for interactive information visualization based on IIVM is presented. The Daisy toolkit is introduced, which includes the Daisy model builder, the Daisy IIV generator, and a runtime framework with the Daisy library. Finally, an application example is given. Experimental results show that Daisy can provide a general solution for developing interactive information visualization.
Abstract: Botnets are one of the most serious threats to the Internet. Researchers have done plenty of research and made significant progress. However, botnets keep evolving and have become more and more sophisticated. Due to the underlying security limitations of current systems and Internet architecture, and the complexity of botnets themselves, how to effectively counter the global threat of botnets is still a very challenging issue. This paper first introduces the evolution of botnets' propagation, attack, and command-and-control mechanisms. Then it summarizes recent advances in botnet defense research and categorizes them into five areas: botnet monitoring, botnet infiltration, analysis of botnet characteristics, botnet detection, and botnet disruption. The limitations of current botnet defense techniques, the evolving trend of botnets, and some possible directions for future research are also discussed.
Abstract: Visual language techniques have exhibited more advantages in describing various software artifacts than one-dimensional textual languages during software development, ranging from requirement analysis and design to testing and maintenance, as diagrammatic and graphical notations have been well applied in modeling systems. In addition to an intuitive appearance, graph grammars provide a well-established foundation for defining visual languages with the power of precise modeling and verification on computers. This paper discusses the issues and techniques for a formal foundation of visual languages, reviews related practical graphical environments, presents a spatial graph grammar formalism, and applies the spatial graph grammar to defining the behavioral semantics of UML diagrams and to developing a style-driven framework for software architecture design.
Abstract: Software defect prediction has been an active part of software engineering since it emerged in the 1970s. It plays a very important role in the analysis of software quality and the balancing of software cost. This paper investigates and discusses the motivation, evolution, solutions, and challenges of software defect prediction technologies, and it also categorizes, analyzes, and compares representative prediction technologies. Some case studies of software defect distribution models are given to aid understanding.
Abstract: In recent years, there have been extensive studies and rapid progress in automatic text categorization, which is one of the hotspots and key techniques in the information retrieval and data mining field. Highlighting the state-of-the-art challenging issues and research trends for content information processing of the Internet and other complex applications, this paper presents a survey of the up-to-date development in text categorization based on machine learning, including models, algorithms, and evaluation. It is pointed out that problems such as nonlinearity, skewed data distribution, labeling bottlenecks, hierarchical categorization, scalability of algorithms, and categorization of Web pages are the key problems in the study of text categorization. Possible solutions to these problems are also discussed. Finally, some future directions of research are given.
Abstract: In this paper, a framework is proposed for handling faults in service composition by analyzing fault requirements. Petri nets are used in the framework for fault detection and handling, focusing on the unavailability of services, component failures, and network failures. The corresponding fault models are given. Based on the models, the correctness criterion of fault handling is given to analyze the fault handling model, and its correctness is proven. Finally, CTL (computational tree logic) is used to specify the related properties and the enforcement algorithm of fault analysis. The simulation results show that this method can ensure the reliability and consistency of service composition.
Abstract: The knapsack problem (KP) is a well-known combinatorial optimization problem that includes the 0-1 KP, bounded KP, multi-constraint KP, multiple KP, multiple-choice KP, quadratic KP, dynamic KP, discounted KP, and other variants. The KP can be considered a mathematical model extracted from a variety of real-world fields and therefore has wide applications. Evolutionary algorithms (EAs) are universally considered an efficient tool for solving KPs approximately and quickly. This paper presents a survey of solving the KP with EAs over the past ten years. It not only discusses various KP encoding mechanisms and the handling of infeasible individuals, but also provides useful guidelines for designing new EAs to solve KPs.
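For readers unfamiliar with how an EA handles infeasible knapsack solutions, the sketch below shows a plain binary-encoded GA for the 0-1 KP that repairs overweight individuals by greedily dropping low value/weight items. It is an illustrative toy, not an algorithm from the survey; penalty functions and other encodings discussed in the survey are equally common choices.

```python
import random

def ga_knapsack(values, weights, capacity,
                pop_size=60, generations=200, p_mut=0.02, seed=1):
    """A simple GA for the 0-1 knapsack problem with binary encoding.

    Infeasible individuals (total weight > capacity) are repaired by greedily
    dropping items with the worst value/weight ratio, one common way of
    handling infeasible solutions.
    """
    rng = random.Random(seed)
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i] / weights[i])  # worst ratio first

    def repair(ind):
        w = sum(weights[i] for i in range(n) if ind[i])
        for i in order:
            if w <= capacity:
                break
            if ind[i]:
                ind[i], w = 0, w - weights[i]
        return ind

    def fitness(ind):
        return sum(values[i] for i in range(n) if ind[i])

    pop = [repair([rng.randint(0, 1) for _ in range(n)]) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        next_pop = pop[:2]                               # elitism: keep the two best
        while len(next_pop) < pop_size:
            a, b = rng.sample(pop[:pop_size // 2], 2)    # parents from the top half
            cut = rng.randrange(1, n)                    # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (rng.random() < p_mut) for bit in child]  # bit-flip mutation
            next_pop.append(repair(child))
        pop = next_pop
    best = max(pop, key=fitness)
    return fitness(best), [i for i in range(n) if best[i]]

# Classic toy instance: the best value is 220 (the items with values 100 and 120).
print(ga_knapsack(values=[60, 100, 120], weights=[10, 20, 30], capacity=50))
```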
Abstract: As an application of mobile ad hoc networks (MANET) to intelligent transportation information systems, the most important goal of vehicular ad hoc networks (VANET) is to dramatically reduce the high number of accidents and their fatal consequences. One of the most important factors contributing to the realization of this goal is the design of effective broadcast protocols. This paper briefly introduces the characteristics and application fields of VANET. Then, it discusses the characteristics, performance, and application areas of various categories of broadcast protocols in VANET, with analysis and comparison. According to the characteristics of VANET and its application requirements, the paper proposes ideas and breakthrough directions for the design of information broadcast models for inter-vehicle communication.
Abstract: Data deduplication technologies can be divided into two categories: a) identical data detection techniques, and b) similar data detection and encoding techniques. This paper presents a systematic survey of these two categories of data deduplication technologies and analyzes their advantages and disadvantages. Besides, since data deduplication technologies can affect the reliability and performance of storage systems, this paper also surveys the various technologies proposed to cope with these two kinds of problems. Based on the analysis of the current state of research on data deduplication technologies, this paper draws the following conclusions: a) how to mine data characteristic information in data deduplication has not been completely solved, and how to use such information to effectively eliminate duplicate data also needs further study; b) from the perspective of storage system design, further study is needed on how to introduce proper mechanisms to overcome the reliability limitations of data deduplication techniques and to reduce the additional system overheads they cause.
Abstract: Combinatorial testing can use a small number of test cases to test systems while preserving fault detection ability. However, the test case generation problem for combinatorial testing is NP-complete. The efficiency and complexity of this testing method have attracted many researchers from the areas of combinatorics and software engineering. This paper summarizes the research work on this topic in recent years, including various combinatorial test criteria, the relations between the test generation problem and other NP-complete problems, mathematical methods for constructing test cases, computer search techniques for test generation, and fault localization techniques based on combinatorial testing.
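As a concrete illustration of greedy test case construction, the following Python sketch builds a 2-way (pairwise) covering test set one test at a time; each test is seeded with a still-uncovered value pair, so the loop always terminates. It is a simplified heuristic in the spirit of one-test-at-a-time methods such as AETG, shown only as an example of the test generation problem, not a method advocated above.

```python
from itertools import combinations

def pairwise_tests(domains):
    """Greedy generation of a 2-way (pairwise) covering test set.

    domains[i] is the list of values of parameter i.  Each constructed test
    is seeded with one still-uncovered pair, which guarantees progress.
    """
    n = len(domains)
    uncovered = {((i, vi), (j, vj))
                 for i, j in combinations(range(n), 2)
                 for vi in domains[i] for vj in domains[j]}

    def covered_by(test):
        # All value pairs fixed so far in a (possibly partial) test.
        return {((i, test[i]), (j, test[j]))
                for i, j in combinations(range(n), 2)
                if test[i] is not None and test[j] is not None}

    tests = []
    while uncovered:
        (i, vi), (j, vj) = next(iter(uncovered))     # seed with an uncovered pair
        test = [None] * n
        test[i], test[j] = vi, vj
        for k in range(n):
            if test[k] is None:
                # Pick the value covering the most still-uncovered pairs.
                test[k] = max(domains[k],
                              key=lambda v: len(covered_by(test[:k] + [v] + test[k + 1:])
                                                & uncovered))
        uncovered -= covered_by(test)
        tests.append(tuple(test))
    return tests

# 3 parameters with 3*2*2 = 12 exhaustive combinations; pairwise needs far fewer tests.
print(pairwise_tests([[0, 1, 2], ["a", "b"], ["on", "off"]]))
```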
Abstract: The Web search engine has become a very important tool for finding information efficiently in the massive Web data. With the explosive growth of Web data, traditional centralized search engines find it increasingly hard to keep up with people's growing information needs. With the rapid development of peer-to-peer (P2P) technology, the notion of P2P Web search has been proposed and has quickly become a research focus. The goal of this paper is to give a brief summary of current P2P Web search technologies in order to facilitate future research. First, some main challenges for P2P Web search are presented. Then, key techniques for building a feasible and efficient P2P Web search engine are reviewed, including system topology, data placement, query routing, index partitioning, collection selection, relevance ranking, and Web crawling. Finally, three recently proposed P2P Web search prototypes are introduced.
Abstract: Sensor networks, formed by the convergence of sensor, micro-electro-mechanical system, and networking technologies, are a novel technology for acquiring and processing information. In this paper, the architecture of wireless sensor networks is briefly introduced. Next, some valuable applications are explained and forecasted. Combining this with existing work, hot research topics including power-aware routing and medium access control schemes are discussed and presented in detail. Finally, taking application requirements into account, several future research directions are put forward.
Abstract: This paper summarizes the current state of research and recent progress in clustering algorithms. First, representative clustering algorithms are analyzed and summarized from several aspects, such as algorithmic ideas, key technologies, and advantages and disadvantages. Second, several typical clustering algorithms and well-known data sets are selected, and simulation experiments are carried out with respect to both accuracy and running efficiency; the clustering behavior of one algorithm on different data sets is analyzed and compared with that of different algorithms on the same data set. Finally, by integrating the information from these two aspects, the research hotspots, difficulties, and shortcomings of data clustering, as well as some open problems, are addressed. This work can serve as a valuable reference for data clustering and data mining.
Abstract: This paper surveys the state of the art of sentiment analysis. First, three important tasks of sentiment analysis are summarized and analyzed in detail, including sentiment extraction, sentiment classification, and sentiment retrieval and summarization. Then, the evaluation methods and corpora for sentiment analysis are introduced. Finally, the applications of sentiment analysis are summarized. This paper aims to take a deep look into the mainstream methods and recent progress in this field, making detailed comparisons and analyses.
Abstract: Cloud computing is a fundamental change happening in the field of information technology. It represents a movement toward intensive, large-scale specialization. On the other hand, it brings not only convenience and efficiency but also great challenges in data security and privacy protection. Currently, security is regarded as one of the greatest problems in the development of cloud computing. This paper describes the major requirements of cloud computing security, key technologies, standards, and regulations, and provides a cloud computing security framework. This paper argues that the changes in the above aspects will result in a technical revolution in the field of information security.
Abstract: Network community structure is one of the most fundamental and important topological properties of complex networks: within communities the links between nodes are very dense, while between communities they are quite sparse. Network clustering algorithms, which aim to discover all natural network communities from given complex networks, are fundamentally important for both theoretical research and practical applications, and can be used to analyze the topological structures, understand the functions, recognize the hidden patterns, and predict the behaviors of complex networks, including social networks, biological networks, the World Wide Web, and so on. This paper reviews the background, motivation, state of the art, and main issues of existing work on discovering network communities, and tries to draw a comprehensive and clear outline for this new and active research area. This work is hopefully beneficial to researchers from the communities of complex network analysis, data mining, intelligent Web, and bioinformatics.
Abstract: This paper surveys the technologies currently adopted in cloud computing as well as the systems used in enterprises. Cloud computing can be viewed from two different aspects: one is the cloud infrastructure, which is the building block for the upper-layer cloud applications, and the other is the cloud applications themselves. This paper focuses on the cloud infrastructure, including the systems and current research. Some attractive cloud applications are also discussed. Cloud computing infrastructure has three distinct characteristics. First, the infrastructure is built on top of large-scale clusters that contain a large number of cheap PC servers. Second, the applications are co-designed with the fundamental infrastructure so that computing resources can be maximally utilized. Third, the reliability of the whole system is achieved by software built on top of redundant hardware rather than by hardware alone. All these technologies serve the two important goals of distributed systems: high scalability and high availability. Scalability means that the cloud infrastructure can be expanded to a very large scale, even to thousands of nodes. Availability means that the services remain available even when quite a number of nodes fail. From this paper, readers will capture the current status of cloud computing as well as its future trends.
Abstract: Evolutionary multi-objective optimization (EMO), whose main task is to deal with multi-objective optimization problems by evolutionary computation, has become a hot topic in the evolutionary computation community. After briefly summarizing the EMO algorithms before 2003, the recent advances in EMO are discussed in detail and the current research directions are summarized. On the one hand, more new evolutionary paradigms have been introduced into the EMO community, such as particle swarm optimization, artificial immune systems, and estimation of distribution algorithms. On the other hand, in order to deal with many-objective optimization problems, many new dominance schemes different from traditional Pareto dominance have emerged. Furthermore, the essential characteristics of multi-objective optimization problems are deeply investigated. This paper also gives an experimental comparison of several representative algorithms. Finally, several viewpoints for the future research of EMO are proposed.
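Since Pareto dominance underlies most of the traditional dominance schemes mentioned above, the following minimal Python sketch shows the dominance test and the successive non-dominated fronts used for ranking in many Pareto-based EMO algorithms (for minimization). The many-objective dominance relaxations surveyed above replace exactly this test.

```python
def dominates(a, b):
    """Pareto dominance for minimization: a is no worse in every objective
    and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_fronts(points):
    """Partition objective vectors into successive non-dominated fronts,
    the core ranking step of many Pareto-based EMO algorithms."""
    remaining = list(range(len(points)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i]) for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts

pts = [(1, 5), (2, 3), (4, 1), (3, 4), (5, 5)]
print(non_dominated_fronts(pts))   # first front: indices 0, 1, 2
```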
Abstract: This paper first introduces the key features of big data in different processing modes and their typical application scenarios, as well as corresponding representative processing systems. It then summarizes three development trends of big data processing systems. Next, the paper gives a brief survey on system supported analytic technologies and applications (including deep learning, knowledge computing, social computing, and visualization), and summarizes the key roles of individual technologies in big data analysis and understanding. Finally, the paper lays out three grand challenges of big data processing and analysis, i.e., data complexity, computation complexity, and system complexity. Potential ways for dealing with each complexity are also discussed.
Abstract: This paper makes a comprehensive survey of recommender system research, aiming to help readers understand this field. First, the research background is introduced, including commercial application demands and the relevant academic institutes, conferences, and journals. After describing the recommendation problem both formally and informally, a comparison study is conducted based on categorized algorithms. In addition, the commonly adopted benchmark datasets and evaluation methods are presented, and the main difficulties and future directions are summarized.
Abstract: The graphics processing unit (GPU) has been developing rapidly in recent years at a speed exceeding Moore's law, and as a result, various applications associated with computer graphics have advanced greatly. At the same time, the high processing power, parallelism, and programmability available on contemporary GPUs provide an ideal platform for general-purpose computation. Starting from an introduction to the development history and architecture of the GPU, the technical fundamentals of the GPU are described in this paper. Then, in the main part of the paper, the development of various applications of general-purpose computation on the GPU is introduced, and among those applications, fluid dynamics, algebraic computation, database operations, and spectrum analysis are introduced in detail. Our own experience with fluid dynamics is also presented, and the development of software tools in this area is introduced. Finally, a conclusion is drawn, and future developments and new challenges for both hardware and software in this area are discussed.
Abstract: Automatic generation of poetry has always been considered a hard nut to crack in natural language generation. This paper reports some pioneering research on a genetic algorithm and its automatic generation of SONGCI. In light of the characteristics of Chinese ancient poetry, this paper designs a coding method based on level and oblique tones, a syntactically and semantically weighted fitness function, a selection operator combining elitism and roulette wheel, a partially mapped crossover operator, and a heuristic mutation operator. As shown by tests, the system built on the computing model designed in this paper is basically capable of generating Chinese SONGCI with some aesthetic merit. This work represents progress in the field of automatic generation of Chinese poetry.
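To illustrate just the selection stage named above (elitism combined with roulette-wheel selection), here is a small, self-contained Python sketch; the fitness used is a trivial placeholder, whereas the paper's fitness weights syntactic and semantic features of SONGCI lines.

```python
import random

def select_next_generation(population, fitness, elite_k=2, rng=random.Random(0)):
    """Elitism + roulette-wheel selection.

    The top `elite_k` individuals survive unchanged; the remaining slots are
    filled by drawing individuals with probability proportional to fitness
    (fitness must be non-negative for the roulette wheel to make sense).
    """
    ranked = sorted(population, key=fitness, reverse=True)
    elites = ranked[:elite_k]
    total = sum(fitness(ind) for ind in population)

    def spin():
        r = rng.uniform(0, total)
        acc = 0.0
        for ind in population:
            acc += fitness(ind)
            if acc >= r:
                return ind
        return population[-1]

    return elites + [spin() for _ in range(len(population) - elite_k)]

# Toy example: "individuals" are candidate lines scored by a stand-in fitness.
pop = ["aabb", "abab", "bbaa", "abba"]
score = lambda s: 1 + s.count("ab")   # placeholder for the syntactic/semantic fitness
print(select_next_generation(pop, score))
```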
Abstract: Few-shot learning is defined as learning models that solve problems from small samples. In recent years, under the trend of training models with big data, machine learning and deep learning have achieved success in many fields. However, in many real-world application scenarios, there is not a large amount of data or labeled data for model training, and labeling a large number of unlabeled samples costs a lot of manpower. Therefore, how to learn from a small number of samples has become a problem that currently needs attention. This paper systematically reviews the current approaches to few-shot learning. It introduces the corresponding models in three categories: fine-tuning based, data augmentation based, and transfer learning based. The data augmentation based approaches are further subdivided into unlabeled data based, data generation based, and feature augmentation based approaches, and the transfer learning based approaches into metric learning based, meta-learning based, and graph neural network based methods. The paper then summarizes the few-shot datasets and the experimental results of the aforementioned models, as well as the current situation and challenges of few-shot learning. Finally, future technological developments of few-shot learning are discussed.
Abstract: Symbolic propagation methods based on linear abstraction play a significant role in neural network verification. This study proposes the notion of multi-path back-propagation for these methods. Existing methods can be viewed as using only a single back-propagation path to calculate the upper and lower bounds of each node in a given neural network, and are thus specific instances of the proposed notion. Leveraging multiple back-propagation paths effectively improves the accuracy of this kind of method. For evaluation, the proposed multi-path method is quantitatively compared with the state-of-the-art tool DeepPoly on the ACAS Xu, MNIST, and CIFAR10 benchmarks. The experimental results show that the proposed method achieves significant accuracy improvements while introducing only a low extra time cost. In addition, the multi-path back-propagation method is compared with the Optimized LiRPA, which is based on global optimization, on the MNIST dataset; the results show that the proposed method still has an accuracy advantage.
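For readers unfamiliar with bound-propagation-based verification, the sketch below propagates elementwise bounds through one affine layer followed by ReLU using plain interval arithmetic. It only illustrates how every node receives an upper and a lower bound; it is much looser than the symbolic back-substitution of DeepPoly-style linear abstraction and is not the multi-path method proposed above.

```python
import numpy as np

def affine_relu_bounds(lower, upper, W, b):
    """Propagate elementwise lower/upper bounds through y = relu(W x + b).

    Plain interval arithmetic: the output center is W c + b and the output
    radius is |W| r, where c and r are the input interval's center and radius.
    ReLU is monotone, so clamping the bounds at zero is sound.
    """
    center = (lower + upper) / 2.0
    radius = (upper - lower) / 2.0
    y_center = W @ center + b
    y_radius = np.abs(W) @ radius
    y_low, y_up = y_center - y_radius, y_center + y_radius
    return np.maximum(y_low, 0.0), np.maximum(y_up, 0.0)

l = np.array([-1.0, 0.0])
u = np.array([1.0, 2.0])
W = np.array([[1.0, -1.0], [0.5, 0.5]])
b = np.array([0.0, 0.1])
print(affine_relu_bounds(l, u, W, b))
```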
Abstract: Probabilistic graphical models are powerful tools for compactly representing complex probability distributions, efficiently computing (approximate) marginal and conditional distributions, and conveniently learning parameters and hyperparameters in probabilistic models. As a result, they have been widely used in applications that require some sort of automated probabilistic reasoning, such as computer vision and natural language processing, as a formal approach to deal with uncertainty. This paper surveys the basic concepts and key results of representation, inference and learning in probabilistic graphical models, and demonstrates their uses in two important probabilistic models. It also reviews some recent advances in speeding up classic approximate inference algorithms, followed by a discussion of promising research directions.
Abstract: Computer-aided detection/diagnosis (CAD) can improve the accuracy of diagnosis, reduce false positives, and provide decision support for doctors. The main purpose of this paper is to analyze the latest developments in computer-aided diagnosis tools. Focusing on the anatomical sites of the four cancers with the highest mortality, major recent publications on CAD applications in different medical imaging areas are reviewed in this survey according to imaging techniques and diseases. Furthermore, a multidimensional analysis is made of the research in terms of image data sets, algorithms, and evaluation methods. Finally, the existing problems, research trends, and development directions in the field of medical image CAD systems are discussed.
Abstract: Considered the next-generation computing model, cloud computing plays an important role in scientific and commercial computing and draws great attention from both academia and industry. In a cloud computing environment, a data center consists of a large number of computers, usually up to millions, and stores petabytes or even exabytes of data, which easily leads to computer or data failures. Such a large number of computers not only poses great challenges to the scalability of the data center and its storage system, but also results in high hardware infrastructure costs and power costs. Therefore, the fault tolerance, scalability, and power consumption of the distributed storage of a data center become key issues in cloud computing technology for ensuring data availability and reliability. This paper surveys the state of the art of the key cloud computing technologies in the following aspects: design of data center networks, organization and arrangement of data, strategies for improving fault tolerance, and methods for saving storage space and energy. First, many kinds of classical data center network topologies are introduced and compared. Second, current fault-tolerant storage techniques are discussed, and data replication and erasure code strategies are compared in particular. Third, the main current energy-saving technologies are presented and analyzed. Finally, challenges in distributed storage are reviewed and future research trends are predicted.
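As a quick worked comparison of the two redundancy strategies discussed (replication versus erasure coding), the following illustrative arithmetic contrasts 3-way replication with a Reed-Solomon-style RS(6, 3) layout; the numbers are textbook examples, not results from the survey.

```python
def storage_overhead(scheme):
    """Storage blow-up and tolerated failures for two classic redundancy schemes.

    scheme: ("replication", copies) or ("erasure", data_blocks, parity_blocks),
    the latter being a Reed-Solomon-style RS(k, m) layout.  Illustrative
    arithmetic only.
    """
    if scheme[0] == "replication":
        copies = scheme[1]
        return {"overhead": copies, "failures_tolerated": copies - 1}
    _, k, m = scheme
    return {"overhead": (k + m) / k, "failures_tolerated": m}

print(storage_overhead(("replication", 3)))   # 3x space, tolerates 2 failures
print(storage_overhead(("erasure", 6, 3)))    # 1.5x space, tolerates 3 failures
```

The trade-off, as the survey's comparison suggests, is that erasure coding saves space at the cost of higher reconstruction traffic and computation on failure.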
Abstract: Context-Aware recommender systems, aiming to further improve performance accuracy and user satisfaction by fully utilizing contextual information, have recently become one of the hottest topics in the domain of recommender systems. This paper presents an overview of the field of context-aware recommender systems from a process-oriented perspective, including system frameworks, key techniques, main models, evaluation, and typical applications. The prospects for future development and suggestions for possible extensions are also discussed.
Abstract: Android is a modern and highly popular software platform for smartphones. According to reports, Android accounted for 81% of all smartphones in 2014 and shipped over one billion units worldwide for the first time ever, with Apple, Microsoft, Blackberry, and Firefox trailing a long way behind. At the same time, the increased popularity of Android smartphones has attracted hackers, leading to a massive increase in Android malware applications. This paper summarizes and analyzes the latest advances in Android security from multidimensional perspectives, covering Android architecture, design principles, security mechanisms, major security threats, classification and detection of malware, static and dynamic analyses, machine learning approaches, and security extension proposals.
Abstract: In many areas such as science, simulation, the Internet, and e-commerce, the volume of data to be analyzed grows rapidly. Parallel techniques that can be expanded cost-effectively should be invented to deal with big data. Relational data management techniques have gone through a history of nearly 40 years, and now encounter the tough obstacle of scalability: relational techniques cannot handle large data easily. In the meantime, non-relational techniques, with MapReduce as a typical representative, have emerged as a new force and expanded their applications from Web search to territories that used to be occupied by relational database systems. They challenge relational techniques with high availability, high scalability, and massive parallel processing capability. The relational technique community, after losing the big deal of Web search, has begun to learn from MapReduce, and MapReduce has also borrowed valuable ideas from the relational community to improve performance. Relational techniques and MapReduce compete with and learn from each other, and a new data analysis platform and eco-system are emerging. Eventually, the two camps of techniques will find their proper places in the new eco-system of big data analysis.
Abstract: Wireless Sensor Networks, a novel technology about acquiring and processing information, have been proposed for a multitude of diverse applications. The problem of self-localization, that is, determining where a given node is physically or relatively located in the networks, is a challenging one, and yet extremely crucial for many applications. In this paper, the evaluation criterion of the performance and the taxonomy for wireless sensor networks self-localization systems and algorithms are described, the principles and characteristics of recent representative localization approaches are discussed and presented, and the directions of research in this area are introduced.
Abstract: Network abstraction brought about the birth of software-defined networking (SDN). SDN decouples the data plane and the control plane, and simplifies network management. The paper starts with a discussion of the background and development of SDN, outlining its architecture, which consists of the data layer, control layer, and application layer. Then the key technologies are elaborated according to the hierarchical architecture of SDN, with particular analysis of the characteristics of consistency, availability, and tolerance. Moreover, the latest achievements in typical application scenarios are introduced. Future work is summarized in the end.
Abstract: Task parallel programming model is a widely used parallel programming model on multi-core platforms. With the intention of simplifying parallel programming and improving the utilization of multiple cores, this paper provides an introduction to the essential programming interfaces and the supporting mechanism used in task parallel programming models and discusses issues and the latest achievements from three perspectives: Parallelism expression, data management and task scheduling. In the end, some future trends in this area are discussed.
Abstract: The development of the mobile Internet and the popularity of mobile terminals produce massive trajectory data of moving objects in the era of big data. Trajectory data has spatio-temporal characteristics and rich information. Trajectory data processing techniques can be used to mine the patterns of human activities and behaviors, the moving patterns of vehicles in the city, and the changes of the atmospheric environment. However, trajectory data can also be exploited to disclose moving objects' privacy information (e.g., behaviors, hobbies, and social relationships). Accordingly, attackers can easily access moving objects' privacy information by digging into their trajectory data such as activities and check-in locations. On another front of research, quantum computation presents an important theoretical direction for mining big data due to its scalable and powerful storage and computing capacity. Applying quantum computing approaches to handle trajectory big data could make some complex problems solvable and achieve higher efficiency. This paper reviews the key technologies of processing trajectory data. First, the concept and characteristics of trajectory data are introduced, and the pre-processing methods, including noise filtering and data compression, are summarized. Then, trajectory indexing and querying techniques, and the current achievements of mining trajectory data, such as pattern mining and trajectory classification, are reviewed. Next, an overview of the basic theories and characteristics of privacy preservation with respect to trajectory data is provided. The supporting techniques of trajectory big data mining, such as processing frameworks and data visualization, are presented in detail. Some possible ways of applying quantum computation to trajectory data processing, as well as the implementation of some core trajectory mining algorithms by quantum computation, are also described. Finally, the challenges of trajectory data processing and promising future research directions are discussed.
Abstract: In this paper, existing intrusion tolerance and self-destruction technologies are integrated into autonomic computing in order to construct an autonomic dependability model based on SM-PEPA (semi-Markov performance evaluation process algebra), which is capable of formal analysis and verification. It can hierarchically anticipate threats to dependability (TtD) at different levels in a self-management manner to satisfy the special dependability requirements of mission-critical systems. Based on this model, a quantification approach is proposed from the view of steady-state probability to evaluate autonomic dependability. Finally, this paper analyzes the impacts of the model parameters on autonomic dependability in a case study, and the experimental results demonstrate that improving the detection rate of TtD as well as the success rate of self-healing will greatly increase the autonomic dependability.
Abstract: This paper surveys the state of the art of speech emotion recognition (SER), and presents an outlook on the trend of future SER technology. First, the survey summarizes and analyzes SER in detail from five perspectives, including emotion representation models, representative emotional speech corpora, emotion-related acoustic features extraction, SER methods and applications. Then, based on the survey, the challenges faced by current SER research are concluded. This paper aims to take a deep insight into the mainstream methods and recent progress in this field, and presents detailed comparison and analysis between these methods.
Abstract: Attribute-based encryption (ABE) takes attributes as the public key and associates the ciphertext and the user's secret key with attributes, so that it can support expressive access control policies. This dramatically reduces the network bandwidth and the sending node's operation cost in fine-grained access control of shared data. Therefore, ABE has broad application prospects in the area of fine-grained access control. After analyzing the basic ABE system and its two variants, key-policy ABE (KP-ABE) and ciphertext-policy ABE (CP-ABE), this study elaborates the research problems relating to ABE systems, including access structure design for CP-ABE, attribute key revocation, key abuse, and multi-authority ABE, with an extensive comparison of their functionality and performance. Finally, this study discusses the problems that remain to be solved and the main research directions in ABE.
Abstract: In recent years, the rapid development of Internet technology and Web applications has triggered an explosion of various data on the Internet, which generates a large amount of valuable knowledge. How to organize, represent, and analyze this knowledge has attracted much attention. Knowledge graphs were developed to organize this knowledge in a semantic and visualized manner. Knowledge reasoning over knowledge graphs has thus become one of the hot research topics and plays an important role in many applications such as vertical search and intelligent question answering. The goal of knowledge reasoning over knowledge graphs is to infer new facts or identify erroneous facts according to existing ones. Unlike traditional knowledge reasoning, knowledge reasoning over knowledge graphs is more diversified, due to the simplicity, intuitiveness, flexibility, and richness of knowledge representation in knowledge graphs. Starting with the basic concept of knowledge reasoning, this paper presents a survey of the recently developed methods for knowledge reasoning over knowledge graphs. Specifically, the research progress is reviewed in detail from two aspects: one-step reasoning and multi-step reasoning, each including rule-based reasoning, distributed-embedding-based reasoning, neural-network-based reasoning, and hybrid reasoning. Finally, future research directions and the outlook of knowledge reasoning over knowledge graphs are discussed.
Abstract: Nowadays it has been widely accepted that the quality of software highly depends on the process that is carried out in an organization. As part of the effort to support software process engineering activities, research on software process modeling and analysis aims to provide an effective means to represent and analyze a process and, by doing so, to enhance the understanding of the modeled process. In addition, an enactable process model can provide direct guidance for the actual development process. Thus, the enforcement of the process model can directly contribute to the improvement of software quality. In this paper, a systematic review is carried out to survey recent development in software process modeling. 72 papers from 20 conference proceedings and 7 journals are identified as the evidence. The review aims to promote a better understanding of the literature by answering the following three questions: 1) What kinds of paradigms are existing methods based on? 2) What kinds of purposes does the existing research have? 3) What kinds of new trends are reflected in the current research? After providing the systematic review, we present our software process modeling method, which is based on a multi-dimensional and integration methodology and is intended to address several core issues facing the community.
Abstract: The appearance of numerous intelligent devices equipped for short-range wireless communication boosts the fast rise of wireless ad hoc network applications. However, in many realistic application environments, nodes form a disconnected network most of the time due to node mobility, low density, lossy links, etc. The conventional communication model of mobile ad hoc networks (MANET) requires at least one path to exist from the source to the destination node, which results in communication failure in these scenarios. Opportunistic networks utilize the communication opportunities arising from node movement to forward messages hop by hop, and implement communication between nodes based on the "store-carry-forward" routing pattern. This networking approach, totally different from the traditional communication model, has captured great interest from researchers. This paper first introduces the concepts and theories of opportunistic networks and some current typical applications. Then it elaborates popular research problems, including opportunistic forwarding mechanisms, mobility models, and opportunistic data dissemination and retrieval. Some other interesting research points, such as communication middleware, cooperation, security problems, and new applications, are stated briefly. Finally, the paper concludes and looks forward to possible research foci for opportunistic networks in the future.
Abstract: Uncertainty exists widely in the subjective and objective world. Among all kinds of uncertainty, randomness and fuzziness are the most important and fundamental. In this paper, the relationship between randomness and fuzziness is discussed. Uncertain states and their changes can be measured by entropy and hyper-entropy, respectively. Taking advantage of entropy and hyper-entropy, the uncertainty of chaos, fractals, and complex networks produced by their various evolutions and differentiations is further studied. A simple and effective way is proposed to simulate uncertainty by means of knowledge representation, which provides a basis for the automation of both logical and image thinking with uncertainty. AI (artificial intelligence) with uncertainty is a new cross-discipline that covers computer science, physics, mathematics, brain science, psychology, cognitive science, biology, and philosophy, and results in the automation of representation, processing, and thinking for uncertain information and knowledge.
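One widely cited way to operationalize expectation, entropy, and hyper-entropy is the forward normal cloud generator associated with this line of work on uncertainty. The Python sketch below is given under that assumption, with illustrative parameter values, and is not claimed to be the exact construction used in the paper.

```python
import numpy as np

def forward_normal_cloud(Ex, En, He, n_drops=1000, rng=np.random.default_rng(0)):
    """Forward normal cloud generator with expectation Ex, entropy En,
    and hyper-entropy He.

    Each cloud drop is a sample x together with its membership degree mu.
    He perturbs the per-drop entropy, which is how second-order uncertainty
    ("hyper-entropy") enters the model.
    """
    En_i = rng.normal(En, He, n_drops)      # per-drop entropy, perturbed by He
    x = rng.normal(Ex, np.abs(En_i))        # drop positions
    mu = np.exp(-(x - Ex) ** 2 / (2 * En_i ** 2))   # membership degrees
    return x, mu

x, mu = forward_normal_cloud(Ex=25.0, En=3.0, He=0.3)   # illustrative parameters
print(x[:5], mu[:5])
```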
Abstract: In recent years, applying deep learning (DL) to image semantic segmentation (ISS) has become widespread due to its state-of-the-art performance and high-quality results. This paper systematically reviews the contribution of DL to the field of ISS. Different methods of ISS based on DL (ISSbDL) are summarized. These methods are divided into ISS based on regional classification (ISSbRC) and ISS based on pixel classification (ISSbPC) according to the image segmentation characteristics and segmentation granularity. Then, the methods of ISSbPC are surveyed from two points of view: ISS based on fully supervised learning (ISSbFSL) and ISS based on weakly supervised learning (ISSbWSL). The representative algorithms of each method are introduced, and the basic workflow, framework, advantages, and disadvantages of these methods are analyzed and compared in detail. In addition, the related experiments on ISS are analyzed and summarized, and the common data sets and performance evaluation indexes in ISS experiments are introduced. Finally, possible research directions and trends are given and analyzed.
Abstract: Design problems are ubiquitous in scientific research and industrial applications. In recent years, Bayesian optimization, which acts as a very effective global optimization algorithm, has been widely applied to design problems. By structuring the probabilistic surrogate model and the acquisition function appropriately, the Bayesian optimization framework can obtain the optimal solution within a small number of function evaluations, and it is thus very suitable for solving extremely complex optimization problems whose objective functions cannot be expressed in closed form, or are non-convex, multimodal, and computationally expensive. This paper provides a detailed analysis of Bayesian optimization in terms of methodology and application areas, and discusses its research status and open problems for future research. This work is hopefully beneficial to researchers from the related communities.
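As one concrete example of how the probabilistic surrogate and the acquisition function interact, the widely used expected improvement criterion (here for minimization, with Gaussian-process posterior mean $\mu(x)$, standard deviation $\sigma(x)$, and incumbent best observation $f_{\min}$) is

$$\mathrm{EI}(x)=\mathbb{E}\big[\max(f_{\min}-f(x),\,0)\big]=\big(f_{\min}-\mu(x)\big)\,\Phi(z)+\sigma(x)\,\varphi(z),\qquad z=\frac{f_{\min}-\mu(x)}{\sigma(x)},$$

where $\Phi$ and $\varphi$ are the standard normal CDF and PDF. This is given only as an illustration; the survey above covers a range of surrogate models and acquisition functions.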
Abstract: Ultrasonography is the first choice of imaging examination and preoperative evaluation for thyroid and breast cancer. However, the ultrasonic characteristics of benign and malignant nodules commonly overlap, and diagnosis relies heavily on the operator's experience rather than on quantitative and stable methods. In recent years, medical image analysis based on computer technology has developed rapidly, and a series of landmark breakthroughs have been made, providing effective decision support for medical imaging diagnosis. In this work, the research progress of computer vision and image recognition technologies for thyroid and breast ultrasound images is studied. The key technologies involved in the automatic diagnosis of ultrasound images form the main line of the work. The major algorithms of recent years, such as ultrasound image preprocessing, lesion localization and segmentation, and feature extraction and classification, are summarized and analyzed. Moreover, a multi-dimensional analysis is made of the algorithms, data sets, and evaluation methods. Finally, the existing problems related to the automatic analysis of these two kinds of ultrasound images are discussed, and research trends and development directions in the field of ultrasound image analysis are outlined.
Abstract: The honeypot is a proactive defense technology introduced by the defense side to change the asymmetric situation of the network attack-and-defense game. Through the deployment of honeypots, i.e., security resources without any production purpose, defenders can deceive attackers into illegally using the honeypots and then capture and analyze the attack behaviors, in order to understand the attack tools and methods and to learn the attackers' intentions and motivations. Honeypot technology has received sustained attention from the security community, made considerable progress, and been widely applied, becoming one of the main technical means of Internet security threat monitoring and analysis. In this paper, the origin and evolution of honeypot technology are presented first. Next, the key mechanisms of honeypot technology are comprehensively analyzed, the development of honeypot deployment structures is reviewed, and the latest applications of honeypot technology in Internet security threat monitoring, analysis, and prevention are summarized. Finally, the problems of honeypot technology, its development trends, and further research directions are discussed.
Abstract: The paper offers some thoughts on the following four aspects: 1) from the law of the development of things, revealing the development history of software engineering technology; 2) from the point of view of the natural characteristics of software, analyzing the construction of every abstraction layer of the virtual machine; 3) from the point of view of software development, proposing the research content of the software engineering discipline and investigating the pattern of industrialized software production; 4) based on the appearance of Internet technology, exploring the development trend of software technology.
Abstract: Batch computing and stream computing are two important forms of big data computing. Research and discussion on batch computing in big data environments are comparatively sufficient. But how to efficiently handle stream computing so as to meet requirements such as low latency, high throughput, and continuously reliable running, and how to build efficient stream big data computing systems, are great challenges in big data computing research. This paper surveys the computing architecture and key issues of stream computing in big data environments. First, it gives a brief summary of three application scenarios of stream computing: business intelligence, marketing, and public service. It also presents the distinctive features of stream computing in big data environments, such as real-time processing, volatility, burstiness, irregularity, and infinity. A well-designed stream computing system is always optimized in system structure, data transmission, application interfaces, high availability, and so on. Subsequently, the paper offers detailed analyses and comparisons of five typical open-source stream computing systems for big data environments. Finally, it specifically addresses some new challenges for stream big data systems, such as scalability, fault tolerance, consistency, load balancing, and throughput.
Abstract: The distributed denial of service (DDoS) attack is a major threat to the current network. Based on the attack packet level, the study divides DDoS attacks into network-level DDoS attacks and application-level DDoS attacks. Next, the study analyzes the detection and control methods for these two kinds of DDoS attacks in detail, and it also analyzes the drawbacks of different control methods implemented at different network positions. Finally, the study analyzes the drawbacks of current detection and control methods and discusses the development trend of DDoS filter systems; the corresponding technological challenges are also presented.
Abstract: This paper presents a survey on the theory of provable security and its applications to the design and analysis of security protocols. It clarifies what the provable security is, explains some basic notions involved in the theory of provable security and illustrates the basic idea of random oracle model. It also reviews the development and advances of provably secure public-key encryption and digital signature schemes, in the random oracle model or the standard model, as well as the applications of provable security to the design and analysis of session-key distribution protocols and their advances.
Abstract: Under new application modes, traditional hierarchical data centers face several limitations in size, bandwidth, scalability, and cost. In order to meet the needs of new applications, data center networks should fulfill requirements such as high scalability, low configuration overhead, robustness, and energy saving at low cost. First, the shortcomings of traditional data center network architectures are summarized, and the new requirements are pointed out. Second, the existing proposals are divided into two categories, i.e., server-centric and network-centric. Then, several representative architectures of these two categories are reviewed and compared in detail. Finally, the future directions of data center networks are discussed.
Abstract: Source code bug (vulnerability) detection is the process of judging whether there are unexpected behaviors in program code. It is widely used in software engineering tasks such as software testing and software maintenance, and plays a vital role in software functional assurance and application security. Traditional vulnerability detection research is based on program analysis, which usually requires strong domain knowledge and complex calculation rules and faces the problem of state explosion, resulting in limited detection performance and leaving considerable room for improvement in false positive and false negative rates. In recent years, the vigorous development of the open source community has accumulated massive amounts of data centered on open source code. In this context, the feature learning capability of deep learning can automatically learn semantically rich code representations, thereby providing a new way of vulnerability detection. This study collected the latest high-quality papers in this field and systematically summarized the current methods from two aspects: vulnerability code datasets and deep learning vulnerability detection models. Finally, it summarizes the main challenges faced by research in this field and looks forward to possible future research foci.
Abstract: Deep learning has achieved great success in the field of computer vision, surpassing many traditional methods. However, in recent years, deep learning technology has been abused in the production of fake videos, making fake videos, represented by Deepfakes, flood the Internet. Such forgery techniques produce pornographic movies, fake news, and political rumors by tampering with or replacing the face information of the original videos, or by synthesizing fake speech. In order to eliminate the negative effects brought by such forgery technologies, many researchers have conducted in-depth research on the identification of fake videos and proposed a series of detection methods to help institutions or communities identify such fake videos. Nevertheless, the current detection technology still has many limitations, such as being tied to specific data distributions or specific compression ratios, and lags far behind the generation technology of fake videos. In addition, different researchers approach the problem from different angles, and the data sets and evaluation metrics used are not uniform. So far, the academic community still lacks a unified understanding of deep forgery and detection technology, and the architecture of research in this area is not clear. In this review, the development of deep forgery and detection technologies is reviewed, and existing research works are systematically summarized and scientifically classified. Finally, the social risks posed by the spread of Deepfakes technology are discussed, the limitations of detection technology are analyzed, and the challenges and potential research directions of detection technology are discussed, aiming to provide guidance for follow-up researchers to further promote the development and deployment of Deepfakes detection technology.
Abstract: With the proliferation of Chinese social networks (especially the rise of Weibo), social productivity and lifestyles are increasingly and profoundly influenced by Chinese Internet public events. Due to the lack of effective technical means, the efficiency of information processing is limited. This paper proposes a method for calculating the information entropy of public events. First, a mathematical model of event information content is built. Then, the multidimensional random variable information entropy of public events is calculated based on Shannon information theory. Furthermore, a new technical index for quantitative analysis of Internet public events is put forward, laying a foundation for further research.
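The abstract does not give the concrete formula; as a point of reference, the standard Shannon entropy of a multidimensional (joint) random variable, on which the described index is presumably built, takes the following form. The event-specific probability model p is an assumption here, not taken from the paper.

```latex
% Joint Shannon entropy of a multidimensional random variable (X_1, ..., X_n);
% the probability distribution p(x_1, ..., x_n) over event attributes is assumed.
H(X_1,\ldots,X_n) = -\sum_{x_1,\ldots,x_n} p(x_1,\ldots,x_n)\,\log_2 p(x_1,\ldots,x_n)
```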
Abstract: Blockchain is a distributed public ledger technology that originates from the digital cryptocurrency Bitcoin. Its development has attracted wide attention in industry and academia. Blockchain has the advantages of decentralization, trustworthiness, anonymity, and immutability; it breaks through the limitations of traditional center-based technology and has broad development prospects. This paper introduces the research progress of blockchain technology and its applications in the field of information security. Firstly, the basic theory and models of blockchain are introduced from five aspects: basic framework, key technologies, technical features, application modes, and application areas. Secondly, from the perspective of the current state of blockchain research in information security, this paper summarizes the research progress of blockchain in authentication, access control, and data protection technologies, and compares the characteristics of the various studies. Finally, the application challenges of blockchain technology are analyzed, and the development outlook of blockchain in the field of information security is highlighted. This study intends to provide a reference for future research work.
Abstract: The rapid development of the Internet leads to increasing system complexity and uncertainty. Traditional network management cannot meet these requirements and must evolve toward fusion-based cyberspace situational awareness (CSA). Based on an analysis of functional shortcomings and development requirements, this paper introduces CSA, including its origin, concept, objectives, and characteristics. Firstly, a CSA research framework is proposed and the research history is surveyed, based on which the main aspects and existing issues of the research are analyzed. Assessment methods are divided into three categories: mathematical models, knowledge reasoning, and pattern recognition. The paper then discusses CSA from three aspects, namely models, knowledge representation, and assessment methods, and details the main ideas, assessment processes, merits, and shortcomings of representative methods, comparing many typical approaches. The current application research of CSA in fields such as security, transmission, survivability, and system evaluation is presented. Finally, the paper points out the development directions of CSA and offers conclusions from the perspectives of the problem system, technical system, and application system.
Abstract: Combinatorial testing can test systems with a small number of test cases while preserving fault detection ability. However, the test case generation problem for combinatorial testing is NP-complete. The efficiency and complexity of this testing method have attracted many researchers from the areas of combinatorics and software engineering. This paper summarizes recent research on this topic, including various combinatorial test criteria, the relations between the test generation problem and other NP-complete problems, mathematical methods for constructing test cases, computer search techniques for test generation, and fault localization techniques based on combinatorial testing.
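To make the pairwise (2-way) coverage criterion behind these techniques concrete, below is a minimal sketch, not drawn from the surveyed papers, that reports which parameter-value pairs a candidate test suite still leaves uncovered; the example parameters and test suite are purely illustrative.

```python
from itertools import combinations, product

def uncovered_pairs(parameters, tests):
    """Return the set of 2-way (pairwise) parameter-value interactions
    not yet covered by the given test suite."""
    # Every required pair: one value from each of two distinct parameters.
    required = set()
    for (i, vals_i), (j, vals_j) in combinations(enumerate(parameters), 2):
        for a, b in product(vals_i, vals_j):
            required.add(((i, a), (j, b)))
    # Remove each pair that some test already exercises.
    for test in tests:
        for (i, a), (j, b) in combinations(enumerate(test), 2):
            required.discard(((i, a), (j, b)))
    return required

# Illustrative example: 3 binary parameters covered pairwise by only 4 tests.
params = [["chrome", "firefox"], ["linux", "windows"], ["en", "zh"]]
suite = [("chrome", "linux", "en"), ("chrome", "windows", "zh"),
         ("firefox", "linux", "zh"), ("firefox", "windows", "en")]
print(uncovered_pairs(params, suite))  # -> set(): the suite is pairwise-complete
```

The example shows the economy the abstract refers to: 4 tests cover all 12 pairwise interactions, whereas exhaustive testing would need 8.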
Abstract: Map matching is a key preprocessing step in location-based services that matches GPS points onto a digital road network. Analysis of map-matched trajectory data can facilitate many urban computing applications such as intelligent transportation systems and trip planning. This survey provides a systematic summary of existing research achievements in map matching. With the rapid development of urban traffic, the cost of acquiring and processing vehicle location information is increasing, low-sampling-rate GPS tracking data is growing, and the accuracy of existing algorithms is not adequate. In recent years, map matching algorithms based on the hidden Markov model (HMM) have been widely studied. An HMM can smoothly assimilate noisy data with path constraints by choosing a maximum-likelihood path. The accuracy of HMM-based algorithms can reach 90% under certain conditions, which confirms the validity of HMM-based map matching at low sampling rates. A perspective on future work in this research area is also discussed.
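As an illustration of the maximum-likelihood path selection described above, the following is a minimal Viterbi sketch for HMM map matching, not the algorithm of any specific surveyed paper; candidate generation and the emission/transition models (e.g., a Gaussian over GPS error and a route-versus-straight-line distance comparison) are assumed and passed in as callables.

```python
def viterbi_map_match(candidates, emission_logp, transition_logp):
    """
    candidates:      one list of candidate road segments per GPS point.
    emission_logp:   emission_logp(t, cand)      -> log P(observation t | cand)
    transition_logp: transition_logp(prev, cand) -> log P(cand | prev)
    Returns the maximum-likelihood sequence of candidates (the matched path).
    """
    # Scores for the first GPS point come from its emission probabilities alone.
    scores = {c: emission_logp(0, c) for c in candidates[0]}
    back = [{}]  # back[t][c] = best predecessor of candidate c at time step t
    for t in range(1, len(candidates)):
        new_scores, back_t = {}, {}
        for c in candidates[t]:
            # Pick the predecessor maximizing accumulated score plus transition.
            prev = max(scores, key=lambda p: scores[p] + transition_logp(p, c))
            new_scores[c] = (scores[prev] + transition_logp(prev, c)
                             + emission_logp(t, c))
            back_t[c] = prev
        scores = new_scores
        back.append(back_t)
    # Backtrack from the best candidate of the last GPS point.
    path = [max(scores, key=scores.get)]
    for t in range(len(candidates) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```

The recursion is what lets the HMM "smooth" noisy points: an outlier with a poor emission score can still be matched sensibly if the transition model penalizes implausible detours along the road network.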