Abstract: Structured data analysis typically requires multi-attribute queries over tabular data, making efficient multi-dimensional indexes a key supporting component of database systems. However, existing multi-dimensional indexing methods face limitations in high-dimensional scenarios. Traditional multi-dimensional indexing methods partition data uniformly based on data distribution but lack awareness of query features, resulting in limited filtering effectiveness. In contrast, although existing learned multi-dimensional indexes introduce query awareness, they often produce highly unbalanced partitions, leading to some oversized partitions and substantially increased scanning costs. To this end, this study proposes LA-tree, a novel learned tree-based multi-dimensional index that balances awareness of both data distribution and query workload. In the offline construction phase, LA-tree formulates the selection of partitioning dimensions at each node as an optimization problem that minimizes the overall scan ratio of the query workload, and puts forward a hierarchical greedy search algorithm that unifies uniform partitioning with query awareness. In the online query phase, lightweight linear and piecewise linear models are introduced to transform traditional numerical comparisons into fast mapping computations, thereby reducing filtering latency while ensuring the completeness of query results. In dynamic settings, an adaptive incremental update mechanism based on scan volume monitoring is proposed to efficiently adapt to changes in data and query workloads via local subtree reconstruction, thereby avoiding the high cost of rebuilding the entire index. Experimental results demonstrate that LA-tree outperforms existing methods on multiple real-world and benchmark datasets. In static settings, the query time is reduced by an average of 52% compared with the best baseline method, while in dynamic settings, the update cost is reduced by 97% compared with reconstruction-based methods. Additionally, low query latency and a lightweight index size are maintained.
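To make the mapping idea concrete, the following minimal Python sketch fits a piecewise linear model from sorted keys to approximate positions, so that a range predicate becomes two mapping computations instead of repeated comparisons; the segment count, fitting scheme, and error handling here are illustrative assumptions rather than LA-tree's actual models.

```python
# Minimal sketch of piecewise-linear key-to-position mapping (learned-index style);
# illustrative only, not LA-tree's actual model or error-bound handling.
import numpy as np

class PiecewiseLinearMap:
    """Map a sorted key column to approximate positions with a few linear segments."""
    def __init__(self, keys, segments=8):
        self.keys = np.sort(np.asarray(keys, dtype=float))
        ranks = np.arange(len(self.keys))
        # Split the key domain into equal-width segments and fit y = a*x + b per segment.
        self.bounds = np.linspace(self.keys[0], self.keys[-1], segments + 1)
        self.coef = []
        for lo, hi in zip(self.bounds[:-1], self.bounds[1:]):
            mask = (self.keys >= lo) & (self.keys <= hi)
            if mask.sum() >= 2:
                a, b = np.polyfit(self.keys[mask], ranks[mask], deg=1)
            else:
                a, b = 0.0, (float(ranks[mask][0]) if mask.any() else 0.0)
            self.coef.append((a, b))

    def predict(self, key):
        seg = min(np.searchsorted(self.bounds, key, side="right") - 1, len(self.coef) - 1)
        a, b = self.coef[max(seg, 0)]
        return int(np.clip(a * key + b, 0, len(self.keys) - 1))

keys = np.random.uniform(0, 1000, 10000)
plm = PiecewiseLinearMap(keys)
lo, hi = plm.predict(250.0), plm.predict(300.0)   # a range query becomes two mapping computations
```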
Abstract: Programs with recursive data structures, such as lists and trees, are widely used in computer science. Program verification problems are often translated into satisfiability modulo theories (SMT) formulas for solving. Recursive data structures are usually converted into first-order logic formulas combining algebraic data types (ADTs) with other theories such as integers. To express properties of recursive data structures, programs often include recursive functions, which are represented in SMT using assertions with quantifiers and uninterpreted functions. This study focuses on solving methods for SMT formulas that contain both ADTs and recursive functions. Existing techniques are reviewed from three perspectives: SMT solvers, automated theorem provers, and constrained Horn clause (CHC) solvers. Furthermore, the study conducts unified experiments to compare state-of-the-art tools on different benchmarks. It investigates the advantages and limitations of existing solving tools and techniques on various types of problems and explores potential optimization directions, providing valuable analyses and references for researchers.
Abstract: Intelligent question answering (QA) systems utilize information retrieval and natural language processing techniques to deliver automated responses to user inquiries. Like other artificial intelligence software, intelligent QA systems are prone to bugs. These bugs can degrade user experience, cause financial losses, or even trigger social panic. Therefore, it is crucial to detect and fix bugs in intelligent QA systems promptly. Automated testing approaches fall into two categories. The first approach synthesizes hypothetical facts based on questions and predicted answers, then generates new questions and expected answers to detect bugs. The second approach generates semantically equivalent test inputs by injecting knowledge from existing datasets, ensuring the answer to the question remains unchanged. However, both methods have limitations in practical use. They rely heavily on the output or training set of the intelligent QA system, which results in poor testing effectiveness and generalization, especially for large-language-model-based intelligent QA systems. Moreover, these methods primarily assess semantic understanding while neglecting the logical reasoning capabilities of intelligent QA systems. To address this gap, a logic-guided testing technique named QALT is proposed. It designs three logically related metamorphic relations and uses semantic similarity measurement and dependency parsing to generate high-quality test cases. The experimental results show that QALT detects a total of 9247 bugs in two different intelligent QA systems, which is 3150 and 3897 more bugs than the two current state-of-the-art techniques (i.e., QAQA and QAAskeR), respectively. Based on the statistical analysis of manually labeled results, QALT detects approximately 8073 true bugs, which is 2142 more than QAQA and 4867 more than QAAskeR. Moreover, the test inputs generated by QALT successfully reduce the metamorphic relation (MR) violation rate from 22.33% to 14.37% when used for fine-tuning the intelligent QA system under test.
Abstract: Accurate workload forecasting is essential for effective cloud resource management. However, existing models typically employ fixed architectures to extract sequential features from different perspectives, which limits the flexibility of combining various model structures to further improve forecasting performance. To address this limitation, a novel ensemble framework SAC-MWF is proposed based on the soft actor-critic (SAC) algorithm for multi-view workload forecasting. A set of feature sequence construction methods is developed to generate multi-view feature sequences at low computational cost from historical windows, enabling the model to focus on workload patterns from different perspectives. Subsequently, a base prediction model and several feature prediction models are trained on historical windows and their corresponding feature sequences, respectively, to capture workload dynamics from different views. Finally, the SAC algorithm is employed to integrate these models to generate the final forecast. Experimental results on three datasets demonstrate that SAC-MWF performs excellently in terms of effectiveness and computational efficiency.
Abstract: As the foundation of AI, deep learning frameworks play a vital role in driving the rapid progress of AI technologies. However, due to the lack of unified standards, compatibility across different frameworks remains limited. Faithful model transformation enhances interoperability by converting a source model into an equivalent model in the target framework. However, the large number and diversity of deep learning frameworks, combined with the increasing demand for custom frameworks, lead to high conversion costs. To address this issue, this study proposes an automatic AI source code migration method between frameworks based on a domain knowledge graph. The method integrates domain knowledge graphs and abstract syntax trees to systematically manage migration challenges. First, the source code is transformed into a framework-specific abstract syntax tree, from which general dependency information and operator-specific details are extracted. By applying the operator and parameter mappings stored in the domain knowledge graph, the code is migrated to the target framework, generating equivalent target model code while significantly reducing engineering complexity. Compared with existing code migration tools, the proposed method supports mutual migration among widely used deep learning frameworks, such as PyTorch, PaddlePaddle, and MindSpore. The approach has proven to be both mature and reliable, with part of its implementation open-sourced in Baidu’s official migration tool, PaConvert.
Abstract: Code comments serve as natural-language descriptions of source code functionality, helping developers quickly understand the code’s semantics and functionality, thus improving software development and maintenance efficiency. However, writing and maintaining code comments is time-consuming and labor-intensive, often leading to issues such as absence, inconsistency, and obsolescence. Therefore, the automatic generation of comments for source code has attracted significant attention. Existing methods typically use information retrieval techniques or deep learning techniques for automatic code comment generation, but both have their limitations. Some research has integrated these two techniques, but such approaches often fail to effectively leverage the advantages of both methods. To address these issues, this study proposes a semantic reranking-based code comment generation method, SRBCS. SRBCS employs a semantic reranking model to rank and select comments generated by various approaches, thus integrating multiple methods and maximizing their respective strengths in the comment generation process. SRBCS is compared with 11 code comment generation approaches on two subject datasets. Experimental results demonstrate that SRBCS effectively integrates different approaches and outperforms existing methods in code comment generation.
Abstract: Root cause analysis refers to identifying the underlying factors that lead to abnormal failures in complex systems. Causal-based backward reasoning methods, founded on structural causal models, are among the optimal approaches for implementing root cause analysis. Most current causality-driven root cause analysis methods require the prior discovery of the causal structure from data as a prerequisite, making the effectiveness of the analysis heavily dependent on the success of this causal discovery task. Recently, score function-based intervention identification has gained significant attention. By comparing the variance of score function derivatives before and after interventions, this approach detects the set of intervened variables, showing potential to overcome the constraints of causal discovery in root cause analysis. However, mainstream score function-based intervention identification is often limited by the score function estimation step. The analytical solutions used in existing methods struggle to effectively model the real distribution of high-dimensional complex data. In light of recent advances in data generation, this study proposes a diffusion model-guided root cause analysis strategy. Specifically, the proposed method first estimates the score functions corresponding to data distributions before and after the anomaly using diffusion models. It then identifies the set of root cause variables by observing the variance of the first-order derivatives of the overall score function after weighted fusion. Furthermore, to solve the issue of computational overhead raised by the pruning operation, an acceleration strategy is proposed to estimate the score function from the initially trained diffusion model, avoiding the re-training cost of the diffusion model after each pruning operation. Experimental results on simulated and real-world datasets demonstrate that the proposed method accurately identifies the set of root cause variables. Furthermore, ablation studies show that the guidance provided by the diffusion model is critical to the improved performance.
Abstract: Recommendation systems have become a key technology for mitigating information overload in the era of big data, with widespread applications in E-commerce and other fields. However, traditional centralized data collection methods expose significant risks of user privacy leakage. Federated learning enables collaborative model training across multiple data holders without the need to share raw user data, thus protecting privacy. Federated recommendation systems have therefore gained considerable attention from both academia and industry. Existing federated recommendation algorithms place the model training process in a distributed environment, effectively avoiding the centralized storage of sensitive user data on a single server. However, these approaches still face challenges related to privacy leakage and high communication costs. To address these issues, this study proposes a communication-efficient federated recommendation algorithm based on differential privacy. The algorithm introduces a general sub-model selection strategy that strengthens privacy protection of user interaction data on the client side through a randomized response mechanism. On the server side, it employs maximum likelihood estimation to infer the true interaction frequencies of items and optimize the sub-model selection process. This strategy achieves an effective balance between privacy protection and model utility. The proposed algorithm is applicable not only to matrix factorization-based recommendation models but also to deep learning-based models, demonstrating high flexibility and adaptability across various recommendation scenarios. Furthermore, to reduce communication overhead, a global model partitioning strategy is proposed to address the complex structures and large parameter sizes of deep learning models. Differentiated optimization strategies are applied to shallow and deep networks to effectively mitigate communication costs. Theoretical analysis shows that the method satisfies differential privacy, and experimental results on real-world datasets demonstrate that the proposed approach preserves user data privacy without significantly compromising model utility, while substantially improving communication efficiency in federated recommendation systems.
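A hedged sketch of the two privacy-related steps described above is given below: a client-side randomized response over 0/1 interaction bits and a server-side maximum likelihood recovery of item frequencies. The privacy parameter, keep probability, and data shapes are illustrative assumptions rather than the paper's exact design.

```python
# Sketch of randomized response (client) and MLE frequency recovery (server);
# epsilon and the data layout are illustrative assumptions.
import numpy as np

def randomized_response(interactions, epsilon):
    """Each client perturbs its 0/1 interaction bits, keeping the true value with probability p."""
    p = np.exp(epsilon) / (np.exp(epsilon) + 1.0)            # keep probability
    flip = np.random.rand(*interactions.shape) > p
    return np.where(flip, 1 - interactions, interactions), p

def estimate_item_frequency(perturbed, p):
    """Server-side maximum likelihood estimate of true per-item interaction frequency."""
    observed = perturbed.mean(axis=0)                         # fraction of reported 1s per item
    # E[observed] = p*f + (1-p)*(1-f)  =>  f = (observed + p - 1) / (2p - 1)
    return np.clip((observed + p - 1.0) / (2.0 * p - 1.0), 0.0, 1.0)

true = (np.random.rand(1000, 50) < 0.1).astype(int)           # 1000 users, 50 items (toy data)
perturbed, p = randomized_response(true, epsilon=2.0)
print(estimate_item_frequency(perturbed, p)[:5], true.mean(axis=0)[:5])
```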
Abstract: In recent years, deep learning has developed rapidly and achieved significant success in computer vision, with model evaluation and improvement remaining central concerns for researchers. However, the commonly used model comparison paradigm relies on training (or validation) and testing on closed datasets, and then identifies hard samples based on discrepancies between predictions and ground-truth labels, which provide feedback on model weaknesses and directions for improvement. This paradigm suffers from two major limitations: 1) the limited size and coverage of datasets often fail to faithfully reflect the true weaknesses of models; 2) procedures such as pretraining may introduce data leakage, resulting in potential biases in the demonstrated performance. To address these issues, this study proposes a general visual hard sample mining algorithm based on maximum discrepancy competition, which automatically mines real hard samples to reveal models’ deficiencies. The proposed algorithm follows the principle of “comparing models through competition” and optimizes the discovery of potential hard samples by jointly exploiting the intra-task and cross-task prediction dissimilarities, aiming to provide new test benchmarks for the field of computer vision in a controllable and efficient manner. Experimental results demonstrate that the constructed benchmark named GHS-CV exposes models’ weaknesses more effectively than single-task hard sample benchmarks (i.e., the semantic segmentation hard sample set SS-C and the salient object detection hard sample set SOD-C). Specifically, compared to DeepLabv3+ on SS-C, the mIoU drops by about 20% on GHS-CV, while compared to VST on SOD-C, the Fβ decreases by about 36%.
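The selection principle behind maximum discrepancy competition can be illustrated with a single-task sketch that ranks an unlabeled pool by how much two models disagree; the joint intra-task and cross-task dissimilarity actually used to build GHS-CV is more elaborate and is not reproduced here.

```python
# Single-task sketch of maximum-discrepancy hard sample mining; the prediction
# maps and pool below are stand-ins, not real model outputs.
import numpy as np

def max_discrepancy_samples(pred_a, pred_b, top_k=100):
    """pred_a, pred_b: per-sample prediction maps of two models, shape (N, H, W)."""
    disagreement = np.abs(pred_a - pred_b).mean(axis=(1, 2))   # mean pixel-wise discrepancy
    return np.argsort(-disagreement)[:top_k]                    # indices of candidate hard samples

rng = np.random.default_rng(0)
pred_a = rng.random((1000, 32, 32))      # stand-in outputs of model A on an unlabeled pool
pred_b = rng.random((1000, 32, 32))      # stand-in outputs of model B
hard_idx = max_discrepancy_samples(pred_a, pred_b)
```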
Abstract: Fault localization is one of the most expensive, tedious, and time-consuming activities in software debugging, and it is also an indispensable step in software maintenance. Due to the variability of faults, fault localization is even more challenging in software product lines. Although significant progress has been made in fault localization for single-system software, research on fault localization for variability in software product lines is still insufficient. Meanwhile, existing methods face challenges such as low efficiency and poor root cause localization due to the issues of repeated generation and checking of feature interactions, as well as the propagation of faults between program statements. To address this, this study proposes an efficient and accurate fault localization method for software product lines, which performs localization at both the feature level and the statement level. At the feature level, based on observations of inclusion relationships and identical subsets between suspicious feature selection sets, the method identifies suspicious feature interactions more efficiently. At the statement level, a reduced causal model with mediator variables is used, combining causal effects and spectrum-based effects to achieve more precise fault localization. Four advanced fault localization methods for software product lines are selected, and experiments are conducted on six real-world software product line systems for comparison. The results demonstrate that the proposed method significantly outperforms other mainstream methods in terms of localization efficiency and accuracy.
Abstract: Spatio-temporal logical analysis refers to accurately expressing spatio-temporal relationships between entities using logical symbols. Traditional spatio-temporal logical analysis adopts two paradigms: closed-domain and open-domain approaches. Closed-domain methods predefine symbolic systems for representing spatio-temporal logic and then translate natural language into logical expressions. While ensuring accurate representation of spatio-temporal relationships, such methods face limitations in handling complex relationships due to the constraints of artificial definitions. Open-domain approaches extract keywords to represent spatio-temporal relationships using natural language itself. Although capable of covering complex relationships, these methods suffer from the semantic ambiguity inherent in natural language, resulting in imprecise logical representations. The purpose of this study is to convert natural language expressions of spatio-temporal relationships into logical language, enabling more precise representation of spatio-temporal information. To address the aforementioned issues, this study draws on the linguistic observation that spatio-temporal relationships in language are primarily expressed through localizers. By defining the semantics of localizers through logical symbols, the proposed framework aims to overcome both the insufficiency of coverage and the lack of precision. Accordingly, a spatio-temporal logical framework for localizers is established, including 1) the design of annotation specifications that define the logical expression scope of localizers and provide detailed annotation guidelines; 2) manual annotation of 6190 samples from the People’s Daily and CTB datasets to construct a task-specific corpus based on the proposed specifications; 3) application of large language models to perform logical reasoning on localizer-triggered spatio-temporal expressions, achieving an accuracy exceeding 70% based on corpus-driven inference.
Abstract: As autonomous driving applications are rapidly popularized, their safety has become a common focus of both academia and industry. Autonomous driving system (ADS) testing is an effective means of addressing this problem. Currently, the mainstream testing method is scenario-based simulation testing, which evaluates the decisions of the ADS under test by simulating various elements of driving scenarios, such as roads and pedestrians. However, existing methods mainly focus on the construction and dynamic generation of critical driving scenarios, neglecting the influence of configuration changes of the vehicle itself, such as its weight and torque, on the decision-making of the ADS deployed on the vehicle. To address this issue, based on the previous work SAFEVAR, this study proposes SAFEVCS, an efficient search method for safety-critical vehicle configurations. SAFEVCS employs a search algorithm to explore the vehicle configuration settings (VCS) that expose safety vulnerabilities of the ADS. Furthermore, to improve the diversity of the search results, SAFEVCS introduces fuzzing to optimize the conditions and constraints of the crossover and mutation operators in the search algorithm. To improve search efficiency, SAFEVCS further incorporates vehicle dynamics knowledge, which enables self-adaptive search termination and deduplication strategies. To evaluate the effectiveness and execution efficiency of SAFEVCS, the study takes SAFEVAR as the baseline for comparison and carries out extensive experiments under three driving scenarios. The experimental results show that the VCS generated by SAFEVCS can effectively expose the safety vulnerabilities of the ADS. In both sunny and rainy weather conditions, under the simulation scenario of pedestrians crossing the road, the obtained solution set significantly decreases the safety performance of the ADS under test, and under the same experimental environment, the simulation efficiency is increased by approximately 2.5 times.
Abstract: As research on audio adversarial attacks advances, simultaneously improving the transferability of adversarial audio across different models and ensuring its imperceptibility (that is, high auditory similarity to the original audio) has become a research hotspot. This study proposes a new method called the speaker information attack (SIAttack) that can simultaneously improve the imperceptibility and transferability of adversarial audio. Specifically, the core idea of this method is to decouple speaker information from content information in the audio, and then apply small perturbations only to the speaker information, thereby achieving efficient attacks on speaker recognition systems while keeping the content information unchanged. Experiments on four speaker recognition models and three mainstream commercial APIs show that the audio generated by SIAttack is almost indistinguishable from the original audio and can mislead all tested models with a high success rate. Additionally, the transfer success rate on speaker recognition models can reach up to 100%.
Abstract: As a kind of graph structure in which node interactions carry timestamps, temporal graphs have more modeling advantages than static graphs. For example, they can detect money laundering, order brushing, equity relationships, financial fraud, and circular guarantees within a certain time interval. A cycle models behavior that forms a loop in a temporal graph. Existing temporal cycle detection or mining methods mostly focus on detecting complete cycles with non-decreasing timestamps, but overlook the analysis and discovery of approximate cycles within a certain time interval. Discovering such approximate cycles can expose fraudulent behavior with more sophisticated cheating techniques. To address the problem of discovering approximate cycles that have already appeared within a certain time interval but are not fully visible in a single data source, this study first proposes an approximate cycle detection method based on depth-first search, referred to as the baseline method (Baseline). It first mines complete cycles composed of edges satisfying non-decreasing timestamp order in each window, and then employs nodes that meet certain criteria as the start and end points of approximate cycles. In subsequent windows, it mines paths composed of edges within a certain time interval, namely time-interval approximate cycles. To address the shortcomings of Baseline, this study then proposes an improved approximate cycle detection method, referred to as the improved method (Improved). It first utilizes node activity to better select candidate start and end points, then improves the index features by adopting active paths and hotspots, and finally accelerates detection through bidirectional search and connection from start and end points to hotspots. Extensive experiments on real and synthetic data demonstrate the efficiency and effectiveness of the proposed methods.
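The notion of a complete cycle with non-decreasing timestamps, which the baseline mines per window, can be illustrated with a small depth-first search sketch; window handling, deduplication of rotations, and the approximate-cycle extension are omitted, and the graph below is a made-up example.

```python
# DFS sketch for cycles whose edge timestamps never decrease; illustrative only,
# rotations of the same cycle are not deduplicated.
from collections import defaultdict

def temporal_cycles(edges, max_len=6):
    """edges: list of (u, v, t). Yield cycles as lists of (u, v, t) with non-decreasing t."""
    adj = defaultdict(list)
    for u, v, t in edges:
        adj[u].append((v, t))

    def dfs(source, node, last_t, path, visited):
        for v, t in adj[node]:
            if t < last_t or len(path) >= max_len:
                continue
            if v == source:
                yield path + [(node, v, t)]        # closed a cycle back at the source
            elif v not in visited:
                yield from dfs(source, v, t, path + [(node, v, t)], visited | {v})

    for source in list(adj):
        yield from dfs(source, source, float("-inf"), [], {source})

edges = [("a", "b", 1), ("b", "c", 2), ("c", "a", 3), ("c", "a", 1)]
print(list(temporal_cycles(edges)))
```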
Abstract: In recent years, recommender systems based on graph neural networks (GNN) have made good use of the structure of interaction data to learn user and item representations. However, existing GNN-based recommendation models often ignore the temporal information of interactions during aggregation, which makes it difficult to model the changing characteristics of users’ interests. As a result, the recommendation model overfits the data and the recommendation results lack diversity, making it difficult to satisfy the increasingly diversified needs of users. To this end, a temporal information-enhanced diversified recommendation model is proposed. First, an attention mechanism is employed to capture and fuse temporal information and interaction information from historical user-item interactions. Meanwhile, a feature disentanglement module is designed to disentangle smoothed global features from salient, highly discriminative key signals, reducing feature redundancy and improving representational clarity. Subsequently, neighbor selection is adopted to highlight inter-node differences and conduct graph convolution, with a layer attention mechanism employed to alleviate over-smoothing. Finally, the learning of items in long-tail categories is enhanced by reweighting the loss to improve diversity.
Abstract: As multimodal multiobjective optimization faces challenges of reasonably defining the individual crowdedness and dynamically balancing the decision space and objective space in individual diversity calculation, there is still significant room for performance improvement in existing multimodal multiobjective optimization algorithms. To this end, this study proposes a multimodal multiobjective differential evolution algorithm based on adaptive individual diversity (MMODE-AID). First, based on the average Euclidean distance of individuals’ nearest neighbors in the decision space or objective space, the crowdedness of individuals can be defined by multiplying the relative distances between individuals, which can more reasonably measure the true crowdedness of each individual in the corresponding space. Second, based on the overall crowdedness of the decision space and objective space, the relative crowdedness of individuals in the corresponding space is obtained, which can reasonably and dynamically balance the influence of the current state of the decision space and objective space on individual diversity calculation during the evolution process, and is conducive to the sufficient search of each equivalent Pareto optimal solution set. By employing differential evolution as the basic optimization framework, MMODE-AID evaluates individual fitness based on adaptive individual diversity. Meanwhile, it can obtain a population with excellent performance in decision space distribution, objective space distribution and convergence during offspring generation and environmental selection. MMODE-AID is compared with seven advanced multimodal multiobjective optimization algorithms on 39 benchmark test problems and one real-world application problem to validate the algorithm’s performance. The experimental results demonstrate that MMODE-AID exhibits significant competitive advantages in solving multimodal multiobjective optimization problems. The source code and original experimental data of MMODE-AID are publicly available on GitHub: https://github.com/CIA-SZU/ZQ.
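One plausible reading of the nearest-neighbor-based crowdedness measure is sketched below for a single space (decision or objective); the neighbor count k and the normalization by the population's average nearest-neighbor distance are assumptions for illustration, not MMODE-AID's exact formula.

```python
# Hedged numeric sketch of a nearest-neighbor crowdedness measure in one space;
# k and the normalization are illustrative assumptions.
import numpy as np

def crowdedness(points, k=3):
    """For each individual, multiply its k nearest-neighbor distances after normalizing
    by the population's average nearest-neighbor distance (larger value = less crowded)."""
    pts = np.asarray(points, dtype=float)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    sorted_d = np.sort(dist, axis=1)
    knn = sorted_d[:, :k]                     # k nearest-neighbor distances per individual
    mean_nn = sorted_d[:, 0].mean()           # average nearest-neighbor distance in this space
    return np.prod(knn / mean_nn, axis=1)

pop = np.random.rand(20, 2)                   # 20 individuals in a 2-D decision space (toy data)
fitness_component = crowdedness(pop)
```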
Abstract: As a widely employed interpreted language, Python faces performance challenges in execution efficiency. Just-in-time (JIT) compilers have been introduced to the Python ecosystem to dynamically compile bytecode into machine code, significantly improving program operation speed. However, the complex optimization strategies of JIT compilers may introduce program defects, thereby affecting program stability and reliability. Existing fuzz testing methods for Python interpreters struggle to effectively detect deep optimization defects and non-crashing defects in JIT compilers. To this end, this study proposes PjitFuzz, a coverage-guided defect detection method for Python JIT compilers. First, PjitFuzz proposes five mutation rules based on JIT optimization strategies to generate program variants that trigger the optimization strategies of Python JIT compilers. Second, a coverage-guided dynamic mutation rule selection method is designed to integrate the advantages of different mutation rules and generate diverse program variants. Third, a checksum-based code block insertion strategy is developed to effectively record changes in variable values during program execution and detect inconsistency in the output. Finally, differential testing is performed by combining different JIT compilation options to effectively detect defects in Python JIT compilers. This study compares PjitFuzz with two state-of-the-art Python interpreter fuzzing methods, FcFuzzer and IFuzzer. The experimental results show that PjitFuzz improves defect detection capability by 150% and 66.7% respectively, and outperforms existing methods in terms of code coverage by 28.23% and 15.68% respectively. For the validity rate of generated test programs, PjitFuzz outperforms the comparative methods by 42.42% and 62.74% respectively. In an eight-month experiment, PjitFuzz has discovered and reported 16 defects, 12 of which have been confirmed by developers.
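The final differential-testing step can be pictured as running the same generated program under two interpreter configurations and flagging any disagreement in output or exit code; the interpreter commands and JIT options below are placeholders, not the configurations actually used by PjitFuzz.

```python
# Sketch of differential testing across interpreter configurations; the commands,
# flags, and file name are hypothetical placeholders.
import subprocess

def run(cmd, program_path):
    result = subprocess.run(cmd + [program_path], capture_output=True, text=True, timeout=30)
    return result.returncode, result.stdout

def differential_test(program_path, config_a, config_b):
    """Report an inconsistency if the two configurations disagree on exit code or output."""
    ra, oa = run(config_a, program_path)
    rb, ob = run(config_b, program_path)
    return None if (ra, oa) == (rb, ob) else {"a": (ra, oa), "b": (rb, ob)}

# Hypothetical usage: one JIT-enabled configuration versus one interpreted-only configuration.
issue = differential_test("variant_0001.py", ["pypy3"], ["pypy3", "--jit", "off"])
```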
Abstract: As a superset of JavaScript, TypeScript provides a rich set of features, such as static type support and object-oriented programming capabilities. It is widely adopted by many mainstream frameworks such as Angular, Vue, and React, and has become a core technology for building large-scale applications. Its compiler is responsible for compiling TypeScript code into standard JavaScript code. However, the TypeScript compiler itself may contain bugs, resulting in unexpected errors in the generated JavaScript code. Although numerous studies have been conducted on JavaScript engine testing, there has been no systematic study dedicated to testing the TypeScript compiler. Existing JavaScript engine testing methods have difficulty generating a large number of TypeScript programs with specific types and effectively mutating these types, which makes it difficult to detect bugs related to the complex type system of the TypeScript compiler. To this end, TscFuzz, a TypeScript compiler testing framework based on syntax and type mutation, is proposed. To obtain a large number of seed programs containing specific TypeScript types, TscFuzz designs a set of prompts tailored to the type system features that distinguish TypeScript from JavaScript, guiding a large language model (LLM) to generate a series of programs featuring these specific types. Next, a set of type-specific mutation operators is designed to conduct targeted testing of the TypeScript type system via type mutation. Finally, based on differential testing with a cross-version strategy, TscFuzz compares the outputs of different versions of the TypeScript compiler to detect bugs. Additionally, Node.js is employed to verify the semantic correctness of the JavaScript programs output by the compiler. Experimental results demonstrate that TscFuzz detects five bugs within 72 hours, two and three more bugs than the baseline methods DIE and FuzzJIT, respectively; its bug detection effectiveness is significantly better than that of the baseline methods. Meanwhile, after three months of testing, TscFuzz successfully identifies 12 real TypeScript bugs, eight of which have been confirmed and seven repaired.
Abstract: Quality of service (QoS)-aware cloud API recommendation systems play an important role in solving cloud API overload problems, differentiating cloud API performance, and achieving high-quality cloud API selection. However, due to the openness of the network environment and the monetary nature of cloud APIs, recommendation systems are susceptible to poisoning attacks, which cause the recommendation results to deviate from fairness and credibility. Existing defense methods against poisoning attacks mainly adopt a “detection and defense” strategy, which utilizes detection algorithms to filter out malicious users before model training to mitigate the influence of the attacks. However, due to the performance limitations of detection algorithms, malicious users inevitably cannot be completely filtered out. To this end, this study proposes a continuous defense method against poisoning attacks on QoS-aware cloud API recommendation systems based on trusted data augmentation, from a “learning to defend from attacks” perspective. First, this study establishes a defense framework against poisoning attacks based on trusted data augmentation and enhances the robustness of the recommendation system by generating high-quality trusted user data for model training. Second, the study designs a trusted user generation algorithm based on the diffusion model, which employs iterative denoising to learn the real-world QoS data distribution of cloud APIs and generate high-quality trusted user vectors, thus mitigating the influence of poisoned data on model training. Finally, extensive experiments are conducted on real-world cloud API QoS datasets, and 11 recommendation algorithms from three categories are utilized to comprehensively evaluate the effectiveness and universality of the proposed defense method. Experimental results indicate that the proposed framework of continuous defense against poisoning attacks based on trusted data augmentation is effective, and the generated trusted users can significantly improve the robustness of cloud API recommendation systems.
Abstract: The current software market is witnessing an intensified trend of product homogenization, where functional innovation has become a decisive factor in maintaining competitive advantage. This shift has transformed the paradigm of modern requirements engineering from passive requirements extraction to proactive creative requirements capture. Existing approaches to enhancing requirements creativity primarily follow two paths: (1) fostering collaborative innovation in workshops through scenario modeling and facilitation methods, and (2) rapidly generating novel solutions by deconstructing and recombining existing requirements based on combinatorial innovation theory. However, both methods face a core challenge in balancing innovation quality with participation costs. The breakthrough advancements in generative AI technologies offer new opportunities to address this dilemma. This study proposes a business modeling-driven human-AI multi-agent collaborative framework with TRIZ infusion for creative requirements capture (BMHACT). The framework adopts the unified process business modeling collaborative architecture to design prompt-based definitions for five agent roles: business process analyst, business designer, and other relevant roles. The multi-agent team collaboratively generates creative requirements through a structured workflow: system vision collection→process pain point identification→technical contradiction analysis→TRIZ innovation principle matching→requirement solution generation. Domain experts and client representatives then evaluate the requirements for creativity. An empirical study on a portal system for a small-scale mechanical manufacturing enterprise demonstrates that, compared to the requirement reuse-based method and the adversarial-sample-based retrospective requirement generation method, BMHACT reduces iteration cycles by 50% and 28.6%, shortens total process duration by 66.7% and 33.3%, increases the clarity novelty usefulness (CNU) by 22.9% and 10.7%, and achieves a 2.16× and 2.14× higher per-round CNU improvement rate. These results validate BMHACT’s superiority in enhancing requirements innovation quality while reducing collaboration costs.
Abstract: Heuristic test case generation methods that incorporate machine learning techniques can significantly improve testing efficiency. Existing studies focus on building efficient surrogate models with partial test cases, but ignore the influence of both the initial population quality and the surrogate model on multi-path testing efficiency. Therefore, this study proposes a test case reduction and generation method combining K-means and support vector regression (SVR). The randomly generated test cases are clustered into several clusters using K-means, and only the test cases within a certain distance of the cluster center are retained, with a path coverage matrix constructed for these test cases. This matrix is employed to evaluate the path coverage potential of test cases and the coverage difficulty of paths. Based on these two criteria, the test cases are ranked, and several test cases are selected from different clusters to construct the reduced test case set, which serves as the initial genetic population. This not only increases the diversity of the initial population and reduces its redundancy, but also helps to reduce the number of iterations needed to cover multiple paths. Meanwhile, the test cases before clustering and their fitness values are employed as samples to train an SVR fitness prediction model designed for multi-path coverage, and the new test cases generated by genetic evolution are then used to update the model, thus improving model accuracy and reducing the time consumed by executing the instrumented program. In this way, both population quality and testing efficiency are improved. The experimental results show that on fifteen programs, the proposed method performs better on indicators such as coverage rate and average evolutionary generations. Specifically, in terms of coverage rate, the proposed method improves by at least 7% and up to 49% over three types of baseline methods, and by approximately 10% up to a maximum of 25% over five competitive methods. The proposed method provides guidance for research on multi-path testing combined with machine learning.
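The two building blocks described above, K-means-based test case reduction and an SVR fitness surrogate, can be sketched with scikit-learn as follows; the keep ratio, feature encoding, and stand-in fitness function are illustrative assumptions, not the paper's configuration.

```python
# Sketch of K-means test case reduction plus an SVR fitness surrogate;
# the keep ratio and the toy fitness function are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVR

def reduce_test_cases(test_cases, n_clusters=5, keep_ratio=0.3):
    """Cluster random test cases and keep only those closest to each cluster center."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(test_cases)
    d = np.linalg.norm(test_cases - km.cluster_centers_[km.labels_], axis=1)
    kept = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        keep = idx[np.argsort(d[idx])][: max(1, int(keep_ratio * len(idx)))]
        kept.extend(keep.tolist())
    return np.array(kept)

rng = np.random.default_rng(0)
cases = rng.uniform(0, 100, size=(300, 4))                       # 300 random test inputs, 4 parameters
fitness = np.sin(cases).sum(axis=1) + rng.normal(0, 0.1, 300)    # stand-in for path-based fitness

initial_population = cases[reduce_test_cases(cases)]             # seed for the genetic search
surrogate = SVR(kernel="rbf").fit(cases, fitness)                # predicts fitness without running the program
predicted = surrogate.predict(initial_population)
```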
Abstract: Scade is a well-known commercial tool widely used in the development of safety-critical embedded control software, whose modeling language is a synchronous language extended from Lustre, a synchronous data-flow language. Correct compilation of synchronous languages, including Lustre, has attracted much attention in recent years and has been addressed in many studies through formal verification. To build a formally verified compiler for such a language, it is common practice to first compile the source program into a C-like program, and then compile that into low-level machine-dependent code using a formally verified backend compiler such as the CompCert compiler, where the correct compilation of temporal operators is crucial. This study introduces the formally verified compilation of Scade-like temporal operators, which is used in a formally verified compiler project in which a Lustre-extended synchronous language is translated into Clight, the front-end intermediate language of the CompCert compiler. The compilation and formal verification of temporal operators are divided into two key stages, both implemented in the interactive proof assistant Coq.
Abstract: Deep reinforcement learning has achieved significant breakthroughs in various fields, with policy gradient algorithms widely adopted due to their suitability for handling nonlinear and high-dimensional state spaces. However, in practical applications, existing policy gradient algorithms still suffer from high variance, which slows convergence and may cause suboptimal solutions. To tackle this challenge, a variance optimization method for policy gradients is proposed from a latent causal model perspective. By introducing latent variables to characterize unobserved random information, a latent variable causal model is constructed and learned. Utilizing this model, a causal value function is proposed and combined with long short-term memory (LSTM) networks to differentiate the temporal impact of unobserved information on value estimation. This approach improves the accuracy of action advantage function estimation and reduces policy gradient variance. Experiments demonstrate that the proposed latent variable causal model outperforms state-of-the-art algorithms across multiple tasks, with better performance and stability.
Abstract: K-clique enumeration is an important problem in subgraph matching, and the bitmap algorithm has been proven to be an effective method for solving the K-clique enumeration problem. Currently, state-of-the-art K-clique enumeration algorithms are accelerated by GPU. Previous studies have not investigated the impact of sparsity in real-world graph data on bitmap-based K-clique enumeration algorithms. Instead, static parallelization methods and bitmap construction strategies are commonly used on GPU, which result in low computational efficiency. This study proposes a thread-parallel load-balancing scheduling algorithm for bitmap tasks, which resolves the thread divergence problem while achieving high parallelism in the bitmap algorithm. Furthermore, it introduces a dynamic bitmap construction algorithm, enabling bitmaps to be constructed and activated at appropriate times for efficient execution of the bitmap algorithm. A GPU-friendly K-clique enumeration system, KCMiner, is implemented, which adaptively selects optimization strategies for K-clique enumeration tasks. Experimental results on GPU platforms show that the proposed method achieves up to 7.36 times speedup over the baseline K-clique enumeration algorithm and up to 30.2 times speedup over the baseline subgraph matching system.
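The bitmap idea underlying K-clique enumeration can be shown with a tiny CPU-side counting sketch in which neighbor sets are bitmasks and extending a partial clique is a bitwise AND; the GPU load-balancing scheduling and dynamic bitmap construction of KCMiner are not reproduced here.

```python
# Bitset-based K-clique counting sketch (CPU, Python integers as bitmaps);
# illustrative only, without the GPU scheduling described above.
def count_k_cliques(n, edges, k):
    adj = [0] * n
    for u, v in edges:
        adj[u] |= 1 << v
        adj[v] |= 1 << u

    def extend(cand, remaining):
        if remaining == 0:
            return 1
        total, bits = 0, cand
        while bits:
            v = (bits & -bits).bit_length() - 1      # lowest set bit = next candidate vertex
            bits &= bits - 1
            higher = ~((1 << (v + 1)) - 1)           # keep only vertices with larger index
            total += extend(cand & adj[v] & higher, remaining - 1)
        return total

    return sum(extend(adj[u] & ~((1 << (u + 1)) - 1), k - 1) for u in range(n))

# 4-vertex clique plus a pendant edge: four triangles and one 4-clique.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
print(count_k_cliques(5, edges, 3), count_k_cliques(5, edges, 4))   # 4 1
```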
Abstract: The rapid development of quantum computers poses significant threats to existing cryptographic systems. The implementation and migration of post-quantum cryptographic algorithms are therefore of utmost importance. Among these, NTRU lattice-based cryptographic schemes have gained attention due to their simplicity and computational efficiency. The CTRU-Prime scheme, based on NTRU lattices, stands out for its excellent performance in security, bandwidth, and implementation efficiency. Given the powerful capabilities of GPUs in handling large-scale parallel processing tasks, this study presents the first high-throughput implementation of CTRU-Prime using Tensor Core and compute unified device architecture (CUDA) Core. The underlying algebraic structure of CTRU-Prime is large-Galois-group prime-degree prime-ideal number field (LPPNF), which not only resists attacks targeting cyclotomic rings but also presents challenges for the implementation of polynomial multiplication. First, two GPU implementations of polynomial multiplication over LPPNF are proposed. The CUDA Core-based Pseudo-Mersenne incomplete NTT polynomial multiplication uses layer fusion techniques to optimize memory access patterns, achieving a throughput of 256.98 times. The Tensor Core-based schoolbook polynomial multiplication converts polynomial multiplication into matrix operations, leveraging low-precision matrix-multiply-and-accumulate (MMA) operations, achieving a throughput of 177.24 times. Next, an overall architecture for CTRU-Prime on the GPU platform is presented, focusing on throughput. This architecture combines batch mode and single mode, multi-stream technology, and multi-thread techniques. Optimization strategies such as fused kernels, coalesced global memory access, and optimized memory access patterns are employed to accelerate memory access and computation speeds of various kernel functions. Experimental results show that, on the RTX 3060 platform, CTRU-Prime-653, CTRU-Prime-761, and CTRU-Prime-1277 can perform key generation at rates of 63000, 54000, and 16000 times per second, respectively; key encapsulation at rates of 635000, 2745000, and 1601000 times per second, respectively; and key decapsulation at rates of 351000, 2622000, and 1524000 times per second, respectively. These rates are 68.85, 79.78, and 66.84 times higher for key generation, 10.32, 46.57, and 46.81 times higher for key encapsulation, and 11.43, 89.19, and 90.32 times higher for key decapsulation compared to the C implementation. Compared to the latest Kyber implementation, the key encapsulation throughput is 1.46 times higher, and the key decapsulation throughput is 1.74 times higher, making it 26 times more efficient than other high-throughput NTRU lattice-based GPU implementations.
Abstract: With the rapid development of 5G technology, the 5G-AKA protocol, as the core security mechanism of 5G, has attracted widespread attention. Although the deployment of the 5G-AKA protocol has promoted the high-speed interconnection of communication networks, it has also raised users’ concerns about privacy leakage. During protocol interactions, operators collect a large amount of data, and once the data is leaked, it poses a serious threat to users. Therefore, this study proposes an anonymous authentication and key agreement protocol based on SM2 to enhance the privacy of the user authentication process and minimize the disclosure of user information. It extends the Chinese cryptographic SM2 digital signature algorithm to sign multiple messages, combines the ElGamal algorithm to encrypt the user’s identity, and adopts zero-knowledge proof technology to ensure the anonymity of user credentials, thereby achieving anonymous authentication of the user’s identity. The protocol protects the identity privacy of legitimate users in network activities and effectively blocks the illegal acquisition of user information. Additionally, the protocol holds malicious users accountable, allowing authorized regulatory agencies to recover a user’s identity through a legal process. Finally, experimental evaluations of the protocol are conducted, with deployment and implementation carried out on Windows and Raspberry Pi 4B platforms. The evaluation results show that the time consumed by the anonymous authentication and key agreement process is at the millisecond level, fully demonstrating the efficiency and practicality of the protocol.
Abstract: The popularization of GPS mobile devices and 5G Internet technology has led to the rapid growth of trajectory data. How to efficiently store, manage, and analyze massive trajectory data has become a hot research issue in the current environment. The traditional single-node trajectory index is limited by memory capacity, disk I/O speed, and other factors, and is no longer capable of managing large-scale trajectory data. Spark, as a distributed framework based on in-memory computing, has natural advantages in processing massive data. Therefore, this study proposes a distributed trajectory data indexing and query scheme based on the Spark platform. To improve the data storage capacity of a single node in a distributed cluster and the efficiency of trajectory queries, a trajectory encoding technique, Z-order trajectory encoding (ZTE), is proposed. This technique encodes the minimum adjacent subspaces covered by the trajectory minimum bounding rectangle (MBR), which can represent trajectories of different granularities and their movement directions, and is used to determine the relationship between a trajectory and the query space. Based on this technique, this study further organizes the ZTE codes of trajectories into a partial-order structure and designs a subspace partial-order branch (SPB). Combined with the hash mapping table IDMap, a local index is constructed. This index avoids the inefficiency caused by the dead space formed by the overlapping of minimum bounding rectangles in R-tree-like indexes and enables fast pruning. To support efficient retrieval of massive trajectory data, the study designs a distributed trajectory index named SPBSpark based on the SPB-branch local index. SPBSpark mainly consists of three components: data partition, local index, and global index. The proposed index effectively supports three types of queries: spatiotemporal range query, k-nearest neighbor query, and moving object trajectory query. Finally, the study selects the distributed trajectory indexes TrajSpark and LocationSpark, which are also based on the Spark framework, as comparison systems. Through comparative simulation experiments, the spatial utilization of the SPBSpark index is improved by about 15% compared with LocationSpark. In terms of query performance, SPBSpark achieves a 2–3 times performance improvement compared with TrajSpark and LocationSpark.
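The Z-order encoding that ZTE builds on can be illustrated by interleaving the bits of quantized coordinates into a single code per grid cell; the grid resolution, the spatial extent, and how MBR-covered subspaces are selected are assumptions here, not the paper's exact scheme.

```python
# Illustrative Morton (Z-order) encoding of 2-D cells; the bbox and resolution
# are made-up values, and ZTE's MBR-subspace selection is not reproduced.
def z_order(x, y, bits=16):
    """Interleave the bits of integer cell coordinates x and y into one code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def cell_of(lon, lat, bbox, bits=16):
    """Quantize a point inside bbox=(min_lon, min_lat, max_lon, max_lat) and encode its cell."""
    min_lon, min_lat, max_lon, max_lat = bbox
    scale = (1 << bits) - 1
    x = int((lon - min_lon) / (max_lon - min_lon) * scale)
    y = int((lat - min_lat) / (max_lat - min_lat) * scale)
    return z_order(x, y, bits)

bbox = (116.0, 39.5, 117.0, 40.5)                                 # hypothetical spatial extent
trajectory = [(116.32, 40.05), (116.33, 40.06), (116.35, 40.08)]
codes = [cell_of(lon, lat, bbox) for lon, lat in trajectory]      # per-point Z codes along the path
```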
Abstract: Crowd intelligence is a crucial component of the next generation of artificial intelligence. Researching how to stimulate and converge the innovative power of “people” in open-source communities can significantly enhance development efficiency. Community detection, as a technical approach to studying the relationships among developers in open-source projects, plays a vital role in exploring and understanding social networks. However, current research has predominantly focused on large-scale social networks such as Facebook, while systematic studies on community detection in project-level open source software developer social networks (OSS-DSN) remain limited. This study first collects real-world data and analyzes the features of OSS-DSN. Then, it benchmarks several overlapping and non-overlapping community detection algorithms on these real datasets, comparing algorithm performance across multiple metrics and dimensions. Finally, based on synthetic OSS-DSN, it generates networks efficiently and performs algorithm evaluations using ground-truth data for comparative analysis. Differences in characteristics between small- and medium-scale social networks and large-scale networks are identified, and the influence of these differences on community detection metrics and algorithm performance is explored. The study provides a new benchmark and offers important insights into communication and collaboration in open-source software communities.
Abstract: Many code files become oversized and take on excessive responsibilities as software evolves, which severely affects software maintainability and comprehensibility. Developers often need to refactor such files by decomposing a large code file into several smaller ones. Existing studies mainly focus on class file decomposition and are not fully applicable to decomposing complex header files, because header file decomposition faces unique challenges: it needs to consider the build dependencies of the entire software project to reduce compilation cost and ensure that the decomposed files are free of cyclic dependencies. To address these challenges, this study proposes HeaderSplit, an automated approach for decomposing and refactoring complex header files. It first constructs a code element graph that captures multiple types of code relationships, including co-usage relationships that reflect project build dependencies. Then, a node coarsening process and a multi-view graph clustering algorithm are applied to identify clusters of closely related code elements. A heuristic algorithm is further introduced to eliminate cyclic dependencies in the clustering results. After the decomposition plan is confirmed, HeaderSplit automatically performs the refactoring, generating new sub-header files and updating the include statements in all code files that directly or indirectly include the original header file. HeaderSplit is evaluated on both synthetic and real complex header files. The results are as follows. 1) HeaderSplit improves accuracy by 11.5% compared with existing methods and demonstrates higher cross-project stability. 2) The decomposed sub-files have higher modularity and no cyclic dependencies, indicating better architectural design. 3) Using HeaderSplit to decompose complex header files can reduce recompilation costs in their evolution history by 15%–60%. 4) HeaderSplit efficiently performs automated refactoring, completing the decomposition and refactoring of header files in large-scale software projects with millions of lines of code within five minutes, showing high practical value.
Abstract: In software engineering, eliciting non-functional requirements (NFR) remains a critical yet often overlooked task in requirements engineering practice. Traditional NFR elicitation methods predominantly rely on the experience and manual analysis of requirements engineers, leading to inefficiency, omissions, and inconsistencies. Recent breakthroughs in large language models (LLM) in natural language processing have provided new technological means for the automated NFR elicitation. However, directly employing LLM for NFR generation often faces challenges such as hallucination and insufficient domain expertise. To address these issues, this study proposes an automated NFR elicitation method based on LLM to achieve high-quality NFR generation. A structured and correlated dataset comprising 3856 functional requirements and 5723 NFR is constructed, establishing 22647 FR-NFR association pairs. The proposed method integrates retrieval-augmented generation (RAG) technology through three core modules: a semantic case retrieval module based on the maximum marginal relevance algorithm, a prompt engineering module designed for NFR generation, and an optimized LLM generation module. Through professional evaluation by software engineering experts and automatic metrics including BLEU and ROUGE, experimental results demonstrate that the proposed method outperforms existing approaches in terms of completeness, accuracy, and testability of requirements.
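The maximum marginal relevance (MMR) selection used by the semantic case retrieval module can be sketched as follows; the embedding vectors and the trade-off parameter are placeholders, not the paper's configuration.

```python
# Sketch of MMR-based case retrieval: pick documents relevant to the query yet
# dissimilar to already selected ones; embeddings and lambda are placeholders.
import numpy as np

def mmr_select(query_vec, doc_vecs, k=3, lam=0.7):
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    remaining = list(range(len(doc_vecs)))
    selected = []
    while remaining and len(selected) < k:
        scores = []
        for i in remaining:
            relevance = cos(query_vec, doc_vecs[i])
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            scores.append(lam * relevance - (1 - lam) * redundancy)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
docs = rng.normal(size=(10, 64))       # stand-in embeddings of stored FR-NFR cases
query = rng.normal(size=64)            # embedding of the new functional requirement
print(mmr_select(query, docs))         # indices of retrieved example cases
```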
Abstract: With the rapid advancement of deep learning and computer vision, grayscale image colorization has evolved from traditional handcrafted feature-based methods to data-driven deep neural network paradigms. However, existing evaluation systems for grayscale image colorization models face two challenges. First, due to the limitations of evaluation metrics and the highly ill-posed nature of the colorization task, traditional quantitative metrics such as PSNR, SSIM, and FID cannot effectively quantify the performance of grayscale image colorization models. Second, it is time-consuming, laborious, and often infeasible to conduct qualitative analyses through large-scale subjective experiments. To address these issues, a new evaluation method for grayscale image colorization models based on hard sample mining is proposed. The method aims to efficiently identify representative samples for model comparison through multi-dimensional evaluation (including image quality, aesthetic expression, and color difference), and then conduct a controlled small-scale subjective experiment to reliably compare different models, revealing their respective advantages and shortcomings. Experimental results show that the proposed method can efficiently and accurately find hard samples and reveal the strengths and weaknesses of the models while drastically reducing the scale of subjective experiments, providing a new paradigm for grayscale image colorization model evaluation and indicating directions for model optimization.
Abstract: The widespread adoption of blockchain technology has driven the development of multi-chain applications, creating a need for cross-chain technology to address information isolation across different blockchains. However, when a large number of transactions occur concurrently across blockchains, existing cross-chain technologies are unable to process them in parallel, resulting in low scalability. Blockchain sharding offers a potential solution, but its impact on scalability is limited by inefficient transaction allocation and cross-chain transaction methods. Therefore, this study proposes a two-phase adaptive transaction allocation model for a relay chain sharding environment. In the first phase, the model generates an allocation scheme to reduce cross-shard transactions and balance shard load with performance. In the second phase, it fine-tunes transactions in unstable queues after allocation to mitigate delays caused by load surges. In the first stage, this study also includes a transaction allocation prediction method that leverages historical cross-chain data to forecast transaction size and volume, calculating load based on these predictions and transaction throughput. An inter-shard allocation method further refines transaction distribution. In the second stage, the relay chain directs transactions to specific shards based on the allocation scheme, adapting dynamically if load surges lead to a mismatch between shard load and performance. A stability analysis method assesses transaction queue changes, allowing for fine-tuning across shards to reduce waiting times and increase throughput. Experimental results show that this model significantly improves transaction throughput and reduces processing delays compared to existing methods.
Abstract: Large language models demonstrate significantly superior performance in reasoning tasks compared to traditional models, yet still struggle to meet the demands of complex tasks in terms of computational cost and response quality. Against this backdrop, model interconnection enables the sharing, integration, and complementation of large model capabilities by constructing a collaborative paradigm among models. The cascade architecture represents a typical form of such collaboration, where multiple large models are organized in a chain-like sequence to enhance system performance through step-by-step optimization. Routing in model cascades aims to select appropriate cascade paths and serves as a key factor in improving system capabilities. However, current routing evaluation and selection methods lack systematic consideration of model collaboration relationships. To address this, this study proposes a dynamic routing method based on collaboration relationships. It first builds a model collaboration graph through a mutual evaluation mechanism, and then employs a dynamic collaborative routing algorithm to analyze responses hop by hop and optimize path selection. The mutual evaluation mechanism uses gradient-based mutual assessment to quantify the quality of pairwise model collaboration. Based on the resulting collaboration quality information, the dynamic collaborative routing algorithm adopts a model “consensus rule” to analyze each hop’s response and determine the routing order, thus enabling dynamic path adjustment. Experimental results show that the proposed routing algorithm outperforms both non-preset and non-targeted routing methods in terms of accuracy and response win rate on benchmark task datasets. On the OMGEval dataset, the win rate is improved by up to 45% compared to non-preset routing.
Abstract: With the continuous development of blockchain technology and applications, the demand for interaction between blockchains is increasing. However, the lack of effective interoperability between different blockchain systems limits the further development of blockchain technology. To address the problem of heterogeneous interconnection between blockchains, cross-chain technology has emerged and quickly become a prominent research topic. Specifically, the cross-chain message passing (XCMP) protocol, one of the most popular cross-chain communication protocols, not only provides a secure and efficient communication mechanism but also offers a broad platform for future blockchain innovation and applications. However, the XCMP protocol is still in a phase of continuous development and improvement, facing security challenges such as replay attacks, denial-of-service attacks, and delay attacks. This study formally verifies and improves the XCMP protocol, aiming to provide solid support for the development of more secure and feature-rich decentralized applications based on it. First, Z language, a formal description language based on classical set theory and first-order predicate logic, is used to summarize, refine, and formally model the 10 key security goals and protocol contents of the XCMP protocol. The security goals are then verified using Z/EVES, an automated verification tool supporting the Z language. The verification results show that the XCMP protocol does not meet three of the security goals. Second, after a comprehensive analysis of the verification results, the study introduces a commitment mechanism, a supervision mechanism, and a polling mechanism to address the unmet security goals of the XCMP protocol, proposing an enhanced cross-chain message passing (E-XCMP) protocol. Finally, the E-XCMP protocol is formally modeled, and its security and reliability are evaluated using the security protocol analysis tool Scyther and the automatic verification tool Z/EVES. The evaluation results show that the E-XCMP protocol not only meets the three previously unmet security goals but also effectively solves security issues such as replay attacks, denial-of-service attacks, and delay attacks, demonstrating strong security and reliability.
Abstract: Software programming assistants based on large language models (LLMs), such as Copilot, significantly enhance programmer productivity. However, LLMs have large computing and storage requirements and are difficult to deploy locally. Building a lightweight, small LLM can meet computing, storage, and deployment requirements, but it leads to a greater accuracy loss in code generation compared to large LLMs. Knowledge distillation (KD) techniques allow small LLMs (student models) to approximate the output distributions of large LLMs (teacher models) on target training datasets, thus reducing accuracy loss in code generation. Cutting-edge KD techniques in artificial intelligence are based on the Kullback-Leibler (KL) divergence loss function, which measures and reduces accuracy loss due to discrepancies in the output distributions between student and teacher models. However, student models struggle to learn in the near-zero distribution regions of teacher models. Consequently, researchers have employed the Reverse KL (RKL) divergence loss function to address this issue in near-zero distribution regions. This study finds that RKL faces learning challenges in high-probability distribution regions and complements the KL divergence loss function. For some datasets, low-quality outputs from teacher models lead to poor learning outcomes for the student models. This study proposes an adaptive knowledge distillation (AKD) method that uses prompts to enhance teacher model output quality and constructs an adaptive loss function to adjust learning priorities based on the distributional differences between student and teacher models. This ensures the student model effectively learns in both primary and near-zero probability regions. Using the AKD method, this study trains a lightweight code generation model based on StarCoder-1B/7B (student/teacher models) and the CodeAlpaca dataset, evaluating accuracy loss and code quality issues. Experimental results show that the lightweight model size is reduced by 85.7%. On the HumanEval and MBPP data sets, prompts with clear instructions improve teacher model code generation quality, reducing the average accuracy loss of the trained student model by 6%. The AKD-trained model’s average accuracy loss compared to the teacher model (StarCoder-7B) is 17.14%, a 30.6% reduction over the original student model. The AKD-trained model’s accuracy loss is reduced by an average of 19.9% compared to state-of-the-art KD and RKD methods. Regarding inference memory requirements, the KD and RKD methods require 54.7 GB, while the AKD method only adds 3 GB. In terms of training time, the AKD method incurs a 30% increase. However, even when the KD and RKD methods are trained for the same duration, their average performance improves by only 3%, which is 16.9% lower than that of the AKD method. Therefore, the additional training cost of the AKD method is justified. Moreover, applying the AKD method to the CodeLlama and CodeGen model series reduces accuracy loss by an average of 19.2% compared to state-of-the-art KD and RKD methods, demonstrating the generalizability of the AKD method.
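As background for the loss design discussed above, forward KL penalizes the student for missing the teacher's high-probability tokens, while reverse KL penalizes mass placed where the teacher is near zero; the sketch below only computes the two terms and a fixed interpolation, and the per-region adaptive weighting of AKD is deliberately not reproduced (the weight w is a placeholder).

    import numpy as np

    def softmax(logits):
        z = logits - logits.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def kl(p, q, eps=1e-12):
        """Row-wise KL(p || q) between probability vectors."""
        return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

    teacher = softmax(np.random.randn(4, 32000))   # teacher token distributions (toy data)
    student = softmax(np.random.randn(4, 32000))   # student token distributions (toy data)

    forward_kl = kl(teacher, student)   # sensitive to the teacher's high-probability regions
    reverse_kl = kl(student, teacher)   # sensitive to the teacher's near-zero regions
    w = 0.5                             # placeholder weight; AKD adapts the emphasis instead
    loss = float(np.mean(w * forward_kl + (1 - w) * reverse_kl))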
Abstract: Unlearning has significant application value in safeguarding privacy, mitigating the impact of contaminated samples, and processing redundant data. However, existing unlearning methods are mostly applied to black-box models such as neural networks, while achieving efficient single-class and multi-class unlearning in interpretable TSK fuzzy classification systems remains challenging. To address this, this study proposes a TSK fuzzy unlearning method for classification (TSK-FUC). First, the rule base is divided into three subsets using the normalized activation strengths of rule antecedent parameters on the (single/multi-class) forgotten data: 1) a deleted rule set that is highly relevant to the forgotten data, 2) a retained rule set with low relevance to the forgotten data, and 3) an updated rule set showing overlapping relevance to both the retained and forgotten data. Subsequently, differential processing strategies are applied: the deleted rule set is directly removed to eliminate major information residues and reduce the number of system parameters; the retained rule set is fully preserved to reduce the parameter adjustment scope during unlearning; and for the updated rules, class-specific noise is added to the consequent parameters to further eliminate information related to the forgotten data, thus achieving single-class and multi-class unlearning. Experimental results on 16 benchmark datasets demonstrate that TSK-FUC accurately partitions the rule space and exhibits effective single-class and multi-class unlearning performance through differentiated processing in established zero-order and first-order TSK fuzzy classification systems. This method maintains the interpretability of the rule base while rendering the TSK fuzzy classification system structurally more lightweight after unlearning.
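To make the rule-partitioning step concrete, the sketch below scores each rule by its mean normalized firing strength on the forgotten samples and splits the rule base with two thresholds; the thresholds and the averaging criterion are illustrative assumptions, not the partition rule defined in the paper.

    import numpy as np

    def partition_rules(norm_strengths, hi=0.6, lo=0.1):
        """norm_strengths: (n_forgotten_samples, n_rules) normalized firing strengths
        of every rule on the forgotten data. Thresholds hi/lo are placeholders."""
        relevance = norm_strengths.mean(axis=0)              # per-rule relevance to the forgotten data
        deleted = np.where(relevance >= hi)[0]               # highly relevant: remove outright
        retained = np.where(relevance <= lo)[0]              # barely relevant: keep unchanged
        updated = np.setdiff1d(np.arange(relevance.size),    # overlapping relevance: perturb consequents
                               np.union1d(deleted, retained))
        return deleted, retained, updated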
Abstract: HTAP databases are capable of simultaneously supporting OLTP and OLAP workloads within a single system. Workload identification is a critical entry point for routing and dispatching queries during execution. Queries can be reasonably optimized and resources properly allocated only if each query is accurately identified as OLTP or OLAP. Therefore, accurate identification of workload types is a key factor in the performance of HTAP databases. However, existing workload identification methods mainly rely on rules and cost-based measures over SQL statements, as well as machine learning approaches, to differentiate workloads. These methods neither consider the inherent characteristics of query statements nor utilize the structural information in execution plans, resulting in low workload identification accuracy. To improve workload identification accuracy, this study proposes an intelligent method for identifying OLTP and OLAP workloads. This method extracts and encodes features from SQL statements and execution plans, builds an SQL statement encoder based on BERT, combines convolutional neural networks and attention mechanisms to construct the execution plan encoder, and integrates the two types of features to build a classifier. The model enables intelligent identification of workloads in HTAP hybrid workloads. Experimental verification shows that the proposed model can accurately identify OLTP and OLAP workloads with high identification accuracy. Additionally, the robustness of the model is validated across multiple datasets, and the model is integrated into the TiDB database to verify its performance improvement on the database.
Abstract: With the growing demand for wireless communication quality in high-speed railway (HSR), ensuring communication reliability in high-mobility scenarios has become a critical challenge. Constructing a reliable channel model is the key to addressing this issue. To build a highly general and reliable channel model, composite wireless communication channel modeling requires full consideration of the actual operating environment and channel propagation characteristics. With rigorous mathematical modeling and logical reasoning capabilities, the formal method demonstrates significant advantages in complex wireless channel modeling. Focusing on the typical HSR communication scenario of viaducts, this study proposes a high-order logic model of composite wireless communication channels based on a small-scale fading model using the formal method. To address the long-tail characteristic of composite channels, the theorem proving technique is used to verify that the probability density function (PDF) of the composite wireless communication channel conforms to the distribution of the modified Bessel function of the second kind.
Abstract: The propositional satisfiability problem (SAT) and the satisfiability modulo theories problem (SMT) are fundamental problems in computer science, with significant applications in circuit design, software analysis and verification, and other fields. At present, extensive research has been conducted on their solving techniques. In practical applications, SAT/SMT solvers often need to solve a series of closely related formulas. Compared to solving each problem from scratch using an independent solver, incremental solving techniques can reuse previously obtained search information, including previous solutions and learned clauses, thus effectively improving solving efficiency. Currently, incremental SAT/SMT solving has received extensive attention and research, and has been successfully applied in fields such as bounded model checking, symbolic execution, and the maximum satisfiability problem (MaxSAT). This study provides a detailed review and categorization of incremental SAT/SMT solving techniques, covering both complete and incomplete algorithms. In addition, the applications of incremental SAT/SMT solving techniques in practical scenarios are comprehensively summarized. Finally, the development directions in this field are summarized and discussed.
Abstract: The 12-lead electrocardiogram (ECG) is the most commonly used signal source for testing cardiac activity, and its automatic classification and interpretability are crucial for the early screening and diagnosis of cardiovascular diseases. Most ECG classification studies focus on single-label classification, where each ECG record corresponds to only one type of cardiac dysfunction. However, in clinical practice, patients with cardiovascular diseases often have multiple concurrent heart diseases, making multi-label ECG classification more aligned with real-world needs. Existing deep learning-based multi-label ECG classification methods have mostly concentrated on label correlation analyses or neural network modifications, neglecting the fundamental issue in multi-label learning: the inherent imbalance between positive and negative labels. To address this issue, this study proposes a novel strategy that balances positive and negative labels during training by pushing away only one pair of labels each time. Specifically, it maximizes the margin between positive and negative labels and derives a new loss function to mitigate the imbalance issue. Furthermore, to address the insufficiency of interpretability in existing ECG methods, which hinders diagnostic assistance, the study introduces a temporal saliency rescaling method to visualize the experimental results of the proposed method, aiding in the localization and interpretation of different diseases. Experiments conducted on the PhysioNet Challenge 2021 ECG dataset, which includes 8 subsets, demonstrate that the proposed method outperforms state-of-the-art multi-label ECG classification methods.
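For intuition, a generic pairwise hinge between positive-label and negative-label scores is sketched below; it enumerates all positive-negative pairs, whereas the paper pushes away a single pair at a time and derives its own loss, so the hinge form and the margin value here are illustrative only.

    import numpy as np

    def pairwise_label_margin_loss(scores, labels, margin=1.0):
        """scores: (n_samples, n_labels) model outputs; labels: 0/1 matrix.
        Encourages every positive-label score to exceed every negative-label
        score by at least `margin` for the same ECG record."""
        total, pairs = 0.0, 0
        for s, y in zip(scores, labels):
            pos, neg = s[y == 1], s[y == 0]
            for sp in pos:
                for sn in neg:
                    total += max(0.0, margin - (sp - sn))
                    pairs += 1
        return total / max(pairs, 1)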
Abstract: With the rapid development of the HarmonyOS ecosystem, security issues related to HarmonyOS applications have gradually become a key research focus. In the Android domain, various mature static analysis frameworks have been widely applied to security detection tasks. However, static analysis frameworks for HarmonyOS applications are still in the early stages of development. The OpenHarmony community is currently working on static analysis based on the source code of HarmonyOS applications using ArkTS. However, in practical security detection tasks, obtaining application source code is often difficult, which limits the applicability of this approach. To address this challenge, this study proposes HarmonyFlow, a static analysis framework for HarmonyOS applications based on the Ark intermediate representation (Panda IR). This framework provides basic information interfaces for Panda IR, designs a field-sensitive pointer analysis algorithm tailored to ArkTS syntax features, and implements extended analysis interfaces that interact with pointer analysis. Specifically, 318 instructions in Panda IR are semantically categorized and processed, and a customized pointer flow graph design is further developed. To support ArkTS syntax features, new propagation rules for pointer sets are introduced, and the semantics of special calls are accurately modeled. In addition, based on the pointer analysis results, inter-procedural data dependencies are optimized, and alias analysis capabilities are provided. The experimental evaluation of HarmonyFlow covers three aspects: ArkTS syntax feature coverage, pointer analysis accuracy, and pointer analysis speed. Experimental results show that HarmonyFlow can correctly handle key ArkTS syntax features. The precision and recall rates for call-edge identification in 9 open-source HarmonyOS applications are 98.33% and 92.22%, respectively, with an average runtime of 96 s for 35 real-world HarmonyOS applications.
Abstract: With the continuous advancement of compilation technology, modern compilers support richer programming models and more complex compilation optimizations, which makes manually adjusting compilation options for optimal performance extremely challenging. Although various automated compilation tuning methods have been proposed, traditional heuristic search algorithms often struggle to avoid being trapped in local optima when confronted with vast search spaces. Moreover, most existing tuning methods target single-core or multi-core architectures, limiting their use in large-scale parallel computing systems. To address these issues, this study designs and implements a distributed compilation tuning framework, SWTuner, based on machine learning methodologies. By introducing AUC-Bandit-based distributed meta-search strategies, machine learning model-guided performance prediction, and SHAP-based compilation option analysis and filtering, the resource utilization and search efficiency during the compilation tuning process are significantly improved. Experimental results show that SWTuner performs excellently in tuning typical test cases on the new-generation Sunway supercomputer, not only reducing search time but also achieving notable reductions in actual execution power consumption during the search process compared to other tuning methods. During the tuning process, the random forest model employed by SWTuner demonstrates good generalization capability and prediction accuracy, effectively reducing search space dimensionality while maintaining tuning effectiveness, providing an efficient and reliable solution for automatic compilation tuning in high-performance computing.
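To illustrate the model-guided and SHAP-based parts in isolation, one can fit a random forest on (option vector, runtime) samples and rank options by mean absolute SHAP value; the synthetic data, the binary flag encoding, and the hyperparameters below are placeholders and do not reflect SWTuner's actual pipeline.

    import numpy as np
    import shap
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(200, 30)).astype(float)            # 30 binary compilation options (placeholder)
    y = X @ rng.normal(size=30) + rng.normal(scale=0.1, size=200)   # synthetic runtime measurements

    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
    explainer = shap.TreeExplainer(model)            # SHAP values for tree ensembles
    shap_values = explainer.shap_values(X)
    importance = np.abs(shap_values).mean(axis=0)    # rank options by mean |SHAP|
    top_options = np.argsort(importance)[::-1][:10]  # candidate options to keep in the search space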
Abstract: Existing static malware similarity measurement methods are affected by static anti-antivirus techniques, and the model features are either easily confused or fail to fully capture malware semantics. This study proposes a malware similarity measurement method called heterogeneous graph matching network-based similarity (HGMSim) to address the above problems. This method first uses the disassembly tool IDA Pro to extract a malware’s call graph, which is then abstracted into a heterogeneous graph to effectively capture the heterogeneous semantics of different function node types and their call relationships. Meanwhile, cross-graph edges are established for similar function nodes of the same type in two call graphs to mine the implicit neighbor semantics between nodes in different call graphs, and a heterogeneous graph matching network is constructed. Then, the study proposes a heterogeneous graph embedding method based on local node graph matching strategy and implements malware similarity measurement to solve the problem of difficulty in distinguishing malware with highly similar graph structures between different families. Finally, experimental results show that HGMSim performs best in malware similarity measurement.
Abstract: In privacy-preserving inference using convolutional neural network (CNN) models, previous research has employed methods such as homomorphic encryption and secure multi-party computation to protect client data privacy. However, these methods typically suffer from excessive prediction time overhead. To address this issue, an efficient privacy-preserving CNN prediction scheme is proposed. This scheme exploits the different computational characteristics of the linear and non-linear layers in CNNs and designs a matrix decomposition computation protocol and a parameterized quadratic polynomial approximation for the ReLU activation function. This enables efficient and secure computation of both the linear and non-linear layers, while mitigating the prediction accuracy loss caused by the approximations. The computations in both the linear and non-linear layers can be performed using lightweight cryptographic primitives, such as secret sharing. Theoretical analysis and experimental results show that, while ensuring security, the proposed scheme improves prediction speed by a factor of 2 to 15, with only about a 2% loss in prediction accuracy.
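To make the non-linear-layer idea concrete, a degree-2 polynomial can be fitted to ReLU over a bounded activation range by least squares, after which the layer needs only additions and multiplications and can be evaluated over secret-shared values; the fitting interval below is an illustrative assumption, not the parameterization chosen in the paper.

    import numpy as np

    # Fit relu(x) ~ a*x^2 + b*x + c on an assumed activation range [-5, 5].
    xs = np.linspace(-5.0, 5.0, 1001)
    relu = np.maximum(xs, 0.0)
    a, b, c = np.polyfit(xs, relu, deg=2)

    def relu_poly(x):
        """MPC-friendly stand-in for ReLU: a quadratic polynomial."""
        return a * x * x + b * x + c

    print(float(np.max(np.abs(relu_poly(xs) - relu))))  # worst-case error on the fitted range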
Abstract: The black-box vulnerability scanner is an essential tool for Web application vulnerability detection, capable of identifying potential security threats effectively before a Web application is launched, thus enhancing the overall security of the application. However, most current black-box scanners primarily collect the attack surface through user operation simulation and regular expression matching. The simulation of user operations is vulnerable to interception by input validation mechanisms and struggles with handling complex event operations, while regular expression matching is ineffective in processing dynamic content. As a result, the scanner cannot effectively address hidden attack surfaces within JavaScript code or dynamically generated attack surfaces, leading to suboptimal vulnerability detection in some Web applications. To resolve these issues, this study proposes a JavaScript Exposure Scanner (JSEScan), a vulnerability scanner enhancement framework based on JavaScript code analysis. The framework integrates static and dynamic code analysis techniques, bypassing form validation and event-triggering restrictions. By extracting attack surface features from JavaScript code, JSEScan identifies attack surfaces and synchronizes them across multiple scanners, enhancing their vulnerability detection capabilities. The experimental results demonstrate that JSEScan increases coverage by 81.02% to 242.15% compared to using a single scanner and uncovers an additional 239 security vulnerabilities when compared to multiple scanners working concurrently, showing superior attack surface collection and vulnerability detection capabilities.
Abstract: Android application developers need to quickly and accurately reproduce error reports to ensure application quality. However, existing methods often rely solely on crash information provided in stack traces to generate event sequences, making it difficult to accurately locate the crash page and offer effective guidance for dynamic exploration to trigger the crash. To address this issue, this study proposes a component-aware automatic crash reproduction method for Android applications, called CReDroid, which effectively reproduces the crash by leveraging both the title and stack trace of the crash report. First, CReDroid dynamically explores the application under test to construct a component transition graph (CTG) and combines the dynamic exception information from the stack traces with the static component interaction data from the CTG to accurately locate the target crash component. Second, based on the critical operations in the crash report title and the reachable paths in the CTG, CReDroid designs an adaptive strategy that uses the contextual relationship between the current page’s component and the crash component to assign priority scores to GUI widgets. The dynamic exploration process is globally optimized through reinforcement learning to effectively reduce inaccuracies in the prediction process. This study evaluates CReDroid using 74 crash reports and compares its performance with state-of-the-art crash reproduction tools, including CrashTranslator, ReCDroid, and ReproBot, as well as widely used automated testing tools, Monkey and APE. The experimental results show that CReDroid successfully reproduces 57 crash reports, which is 13, 25, 27, 30, and 17 more than CrashTranslator, ReCDroid, ReproBot, Monkey, and APE, respectively. Moreover, for the successfully reproduced crashes, CReDroid reduces the average reproduction time by 26.71%, 94.96%, 71.65%, 84.72%, and 88.56%, compared to CrashTranslator, ReCDroid, ReproBot, Monkey, and APE.
Abstract: The computation of signatures is typically performed on physically insecure devices such as mobile phones or small IoT devices, which may lead to private key exposure and subsequently compromise the entire cryptographic system. Key-insulated signature schemes serve as a method to mitigate the damage caused by private key exposure. In a key-insulated cryptosystem, the public key remains constant throughout the entire time period, and the fixed private key is stored on a physically secure device. At the beginning of each time period, the insecure device interacts with the physically secure device storing the fixed private key to obtain the temporary private key for the current time slice. A secure identity-based key-insulated signature scheme must satisfy both unforgeability and key insulation. Key insulation ensures that even if an adversary obtains temporary private keys for multiple time periods, they cannot forge signatures for other periods. SM9 is a commercial identity-based cryptographic standard independently developed by China. This study applies the key-insulated method to the SM9 identity-based signature scheme to resolve the private key exposure issue present in the original scheme. First, a security model for identity-based key-insulated signatures is presented. Then, an identity-based key-insulated signature scheme based on SM9 is constructed. Finally, detailed security proofs and experimental analysis are provided.
Abstract: This article is recommended by Professor Bai Ying of the CCF Technical Committee on Software Engineering.
The article was published in Proceedings of the 11th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE), ACM, September 2017, pp. 315-325.
Original link: https://doi.org/10.1145/3106237.3106242
Readers who wish to cite this article should cite the original source.
Abstract: This article is recommended by Professor Bai Ying of the CCF Technical Committee on Software Engineering.
The article was published in Proceedings of the 11th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE), ACM, September 2017, pp. 303-314.
Original link: https://doi.org/10.1145/3106237.3106239
Readers who wish to cite this article should cite the original source.
Abstract: GitHub, a popular social-software-development platform, has fostered a variety of software ecosystems where projects depend on one another and practitioners interact with each other. Projects within an ecosystem often have complex inter-dependencies that impose new challenges in bug reporting and fixing. In this paper, we conduct an empirical study on cross-project correlated bugs, i.e., causally related bugs reported to different projects, focusing on two aspects: 1) how developers track the root causes across projects; and 2) how the downstream developers coordinate to deal with upstream bugs. Through manual inspection of bug reports collected from the scientific Python ecosystem and an online survey with developers, this study reveals the common practices of developers and the various factors in fixing cross-project bugs. These findings provide implications for future software bug analysis in the scope of ecosystem, as well as shed light on the requirements of issue trackers for such bugs.
Abstract: This article is recommended by Professor Bai Ying of the CCF Technical Committee on Software Engineering.
The article was published in Proceedings of the 39th International Conference on Software Engineering, pp. 27-37, Buenos Aires, Argentina, May 20-28, 2017, IEEE Press, Piscataway, NJ, USA, ©2017, ISBN: 978-1-5386-3868-2.
Original link: http://dl.acm.org/citation.cfm?id=3097373
Readers who wish to cite this article should cite the original source.
Abstract: This article is recommended by Professor Bai Ying of the CCF Technical Committee on Software Engineering.
The article was published in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016), ACM, New York, NY, USA, pp. 871-882. DOI: https://doi.org/10.1145/2950290.2950364
Original link: http://dl.acm.org/citation.cfm?id=2950364
Readers who wish to cite this article should cite the original source.
Abstract: This article is recommended by Professor Bai Ying of the CCF Technical Committee on Software Engineering.
The article was published in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 133-143, Seattle, WA, USA, November 2016.
Original link: http://dl.acm.org/citation.cfm?id=2950327
Readers who wish to cite this article should cite the original source.
Abstract: This article is recommended by Professor Bai Ying of the CCF Technical Committee on Software Engineering.
The article was published in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE'16), pp. 810-821, November 13-18, 2016.
Original link: https://doi.org/10.1145/2950290.2950310
Readers who wish to cite this article should cite the original source.
Abstract: This article is recommended by Professor Bai Ying of the CCF Technical Committee on Software Engineering.
The article was published at FSE'16, in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering.
Original link: http://dl.acm.org/citation.cfm?id=2950340
Readers who wish to cite this article should cite the original source.
Abstract: This article is recommended by Professor Bai Xiaoying (Tsinghua University) of the CCF Technical Committee on Software Engineering.
The original article was published in ASE 2016: Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. Full-text link: http://dx.doi.org/10.1145/2970276.2970307.
Important note: readers who cite this article should cite the original source.
Abstract: Social recommender systems have recently become one of the hottest topics in the domain of recommender systems. The main task of a social recommender system is to alleviate the data sparsity and cold-start problems and to improve recommendation performance by utilizing users' social attributes. This paper presents an overview of the field of social recommender systems, including trust inference algorithms, key techniques, and typical applications. The prospects for future development and suggestions for possible extensions are also discussed.
Abstract: This paper presents several new insights into system software, one of the basic concepts in the computing discipline, from the three perspectives of essential features, characteristics of the times, and the future development trend. The first insight is that system software stems theoretically and technically from the universal Turing machine and the idea of the stored program, with the essential feature of "manipulating the execution of a computing system". There are two typical manipulation modes: encoding and then loading, and executing and controlling. The second insight is that system software, in the Internet age, is a kind of software that continuously provides substantial online services, which lays the foundation for the newly emerged "software-as-a-service" paradigm. The final insight is about its development trend: system software will evolve online continuously. Driven by innovations in computing systems, the integration of cyber and physical spaces, and intelligence technologies, system software will become the core of the future software ecology.
Abstract: With the rapid development of cloud computing technology, its security issues have become more and more prominent and have received much attention in both industry and academia. High security risks are widespread in traditional cloud architectures. Hacking into a virtual machine destroys the availability of cloud services or resources. Untrusted cloud storage makes it more difficult to share or search users' private data. The risk of privacy leakage arises from various outsourced computation and application requirements. From the perspective of security and privacy-preserving technologies in cloud computing, this paper first introduces related research progress on cloud virtualization security, cloud data security, and cloud application security. In addition, it analyzes the characteristics and application scopes of typical schemes and compares their effectiveness in security defense and privacy preservation. Finally, the paper discusses current limitations and possible directions for future research.
Abstract: In recent years, transfer learning has attracted a vast amount of attention and research. Transfer learning is a new machine learning method that applies knowledge from related but different domains to target domains. It relaxes the two basic assumptions in traditional machine learning: (1) the training data (also referred to as the source domain) and the test data (also referred to as the target domain) follow the independent and identically distributed (i.i.d.) condition; (2) there are enough labeled samples to learn a good classification model. It thus aims to solve problems where there are few or even no labeled data in the target domain. This paper surveys the research progress of transfer learning and introduces the authors' own work, especially on building transfer learning models by applying generative models at the concept level. Finally, the paper introduces applications of transfer learning, such as text classification and collaborative filtering, and suggests future research directions for transfer learning.
Abstract: Network abstraction brings about the naissance of software-defined networking (SDN). SDN decouples the data plane and the control plane, and simplifies network management. The paper starts with a discussion of the background of the naissance and development of SDN, and outlines its architecture, which includes the data layer, control layer, and application layer. Then the key technologies are elaborated according to the hierarchical architecture of SDN. The characteristics of consistency, availability, and tolerance are especially analyzed. Moreover, the latest achievements for specific application scenarios are introduced. Future work is summarized in the end.
Abstract: Sensor networks, which emerge from the convergence of sensor, micro-electro-mechanical system (MEMS), and network technologies, are a novel technology for acquiring and processing information. In this paper, the architecture of wireless sensor networks is briefly introduced. Next, some valuable applications are explained and forecasted. Combining with existing work, hot spots including power-aware routing and media access control schemes are discussed and presented in detail. Finally, taking account of application requirements, several future research directions are put forward.
Abstract: Automatic generation of poetry has always been considered a hard nut to crack in natural language generation. This paper reports some pioneering research on a possible genetic algorithm and its automatic generation of SONGCI. In light of the characteristics of classical Chinese poetry, this paper designs a coding method based on level and oblique tones, a syntactically and semantically weighted fitness function, a selection operator combining elitism and roulette-wheel selection, a partially mapped crossover operator, and a heuristic mutation operator. As shown by tests, the system constructed on the basis of the computing model designed in this paper is basically capable of generating Chinese SONGCI with some aesthetic merit. This work represents progress in the field of automatic generation of Chinese poetry.
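The partially mapped crossover (PMX) operator named above is a standard permutation crossover; a textbook sketch over integer permutations follows (how the level-and-oblique-tone coding of a ci poem is mapped onto a permutation is not shown and would follow the paper's encoding).

    import random

    def pmx(p1, p2):
        """Partially mapped crossover for two permutations of the same elements."""
        size = len(p1)
        a, b = sorted(random.sample(range(size), 2))
        child = [None] * size
        child[a:b] = p1[a:b]                  # copy the mapping segment from parent 1
        for i in range(a, b):                 # place parent 2's segment genes via the mapping
            gene = p2[i]
            if gene in child[a:b]:
                continue
            pos = i
            while a <= pos < b:               # follow the p1 -> p2 mapping out of the segment
                pos = p2.index(p1[pos])
            child[pos] = gene
        for i in range(size):                 # remaining positions are inherited from parent 2
            if child[i] is None:
                child[i] = p2[i]
        return child

    print(pmx([1, 2, 3, 4, 5, 6, 7, 8, 9], [9, 3, 7, 8, 2, 6, 5, 1, 4]))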
Abstract: Learning to rank (L2R) techniques try to solve ranking problems using machine learning methods, and have been well studied and widely used in various fields such as information retrieval, text mining, personalized recommendation, and biomedicine. The main task of L2R-based recommendation algorithms is to integrate L2R techniques into recommendation algorithms and to study how to organize large numbers of users and item features, build user models better suited to user preferences and requirements, and improve the performance and user satisfaction of recommendation algorithms. This paper surveys L2R-based recommendation algorithms in recent years, summarizes the problem definition, compares key technologies, and analyzes evaluation metrics and their applications. In addition, the paper discusses the future development trend of L2R-based recommendation algorithms.
Abstract: Mobile recommender systems have recently become one of the hottest topics in the domain of recommender systems. The main task of mobile recommender systems is to improve the performance and accuracy along with user satisfaction utilizing mobile context, mobile social network and other information. This paper presents an overview of the field of mobile recommender systems including key techniques, evaluation and typical applications. The prospects for future development and suggestions for possible extensions are also discussed.
Abstract: Cloud Computing is a fundamental change happening in the field of Information Technology. It represents a movement towards intensive, large-scale specialization. On the other hand, it brings about not only convenience and efficiency, but also great challenges in the field of data security and privacy protection. Currently, security has been regarded as one of the greatest problems in the development of Cloud Computing. This paper describes the major requirements in Cloud Computing, key security technologies, standards, regulations, etc., and provides a Cloud Computing security framework. This paper argues that the changes in the above aspects will result in a technical revolution in the field of information security.
Abstract: With the growth of social networks, social recommendation has become a hot research topic in recommender systems. Matrix factorization based (MF-based) recommendation models have gradually become a key component of social recommendation due to their high extensibility and flexibility. Thus, this paper focuses on MF-based social recommendation methods. Firstly, it reviews the existing social recommendation models according to their model construction strategies. Next, it conducts a series of experiments on real-world datasets to demonstrate the performance of different social recommendation methods from three perspectives: whole users, cold-start users, and long-tail items. Finally, the paper analyzes the problems of MF-based social recommendation models, and discusses possible future research directions and development trends in this research area.
Abstract: Android is a modern and highly popular software platform for smartphones. According to reports, Android accounted for 81% of all smartphones in 2014 and shipped over 1 billion units worldwide for the first time ever. Apple, Microsoft, Blackberry, and Firefox trailed a long way behind. At the same time, the increased popularity of Android smartphones has attracted hackers, leading to a massive increase in Android malware applications. This paper summarizes and analyzes the latest advances in Android security from multidimensional perspectives, covering Android architecture, design principles, security mechanisms, major security threats, classification and detection of malware, static and dynamic analyses, machine learning approaches, and security extension proposals.
Abstract: The research status and recent progress of clustering algorithms are summarized in this paper. First, representative clustering algorithms are analyzed and summarized from several aspects, such as algorithm ideas, key technologies, and advantages and disadvantages. Then, several typical clustering algorithms and well-known datasets are selected, and simulation experiments are conducted in terms of both accuracy and running efficiency; the behavior of each algorithm on different datasets is analyzed and compared with that of other algorithms on the same datasets. Finally, research hotspots, difficulties, shortcomings, and open problems of data clustering are discussed by integrating the two aspects of information above. This work can give a valuable reference for data clustering and data mining.
Abstract: This paper surveys the current technologies adopted in cloud computing as well as the systems in enterprises. Cloud computing can be viewed from two different aspects. One is the cloud infrastructure, which is the building block for the upper-layer cloud applications. The other is of course the cloud application. This paper focuses on the cloud infrastructure, including the systems and current research. Some attractive cloud applications are also discussed. Cloud computing infrastructure has three distinct characteristics. First, the infrastructure is built on top of large-scale clusters which contain a large number of cheap PC servers. Second, the applications are co-designed with the fundamental infrastructure so that the computing resources can be maximally utilized. Third, the reliability of the whole system is achieved by software building on top of redundant hardware instead of mere hardware. All these technologies serve the two important goals of distributed systems: high scalability and high availability. Scalability means that the cloud infrastructure can be expanded to very large scale, even to thousands of nodes. Availability means that the services are available even when quite a number of nodes fail. From this paper, readers will capture the current status of cloud computing as well as its future trends.
Abstract: Evolutionary multi-objective optimization (EMO), whose main task is to deal with multi-objective optimization problems by evolutionary computation, has become a hot topic in the evolutionary computation community. After briefly summarizing the EMO algorithms before 2003, the recent advances in EMO are discussed in detail. The current research directions are summarized. On the one hand, more new evolutionary paradigms have been introduced into the EMO community, such as particle swarm optimization, artificial immune systems, and estimation of distribution algorithms. On the other hand, in order to deal with many-objective optimization problems, many new dominance schemes different from traditional Pareto dominance have come forth. Furthermore, the essential characteristics of multi-objective optimization problems are deeply investigated. This paper also gives an experimental comparison of several representative algorithms. Finally, several viewpoints for the future research of EMO are proposed.
Abstract: Recommender systems have been successfully adopted as an effective tool to alleviate information overload and assist users in making decisions. Recently, it has been demonstrated that incorporating social relationships into recommender models can enhance recommendation performance. Despite this remarkable progress, a majority of social recommendation models have overlooked item relations, a key factor that can also significantly influence recommendation performance. In this paper, an approach is first proposed to acquire item relations by measuring correlations among items. Then, a co-regularized recommendation model is put forward to integrate the item relations with social relationships by introducing a co-regularization term into the matrix factorization model. Meanwhile, it is shown that the co-regularization term is a special case of the weighted atomic norm. Finally, based on the proposed model, a recommendation algorithm named CRMF is constructed. CRMF is compared with existing state-of-the-art recommendation algorithms through evaluations on four real-world datasets. The experimental results demonstrate that CRMF not only effectively alleviates the user cold-start problem but also obtains more accurate rating predictions for various users.
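An illustrative form of such a co-regularized objective, with a social regularizer over trusted user pairs and an item-relation regularizer over correlated items, is shown below; the symbols and weights are generic placeholders rather than the exact formulation of CRMF.

    \min_{U,V}\; \sum_{(u,i)\in\Omega} \left(r_{ui} - \mathbf{u}_u^{\top}\mathbf{v}_i\right)^2
    + \alpha \sum_{(u,f)\in\mathcal{S}} s_{uf}\,\lVert \mathbf{u}_u - \mathbf{u}_f \rVert^2
    + \beta \sum_{(i,j)\in\mathcal{I}} c_{ij}\,\lVert \mathbf{v}_i - \mathbf{v}_j \rVert^2
    + \lambda \left(\lVert U \rVert_F^2 + \lVert V \rVert_F^2\right)

Here \Omega is the set of observed ratings, \mathcal{S} and \mathcal{I} are the social and item-relation pairs with strengths s_{uf} and c_{ij}, and \alpha, \beta, \lambda are regularization weights.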
Abstract: Graph embedding is a fundamental technique for graph data mining. Real-world graphs not only consist of complex network structures but also contain diverse vertex information. How to integrate the network structure and vertex information into the graph embedding procedure is a big challenge. To deal with this challenge, this paper proposes a graph embedding method that is based on deep learning techniques while taking into account prior knowledge about vertex information. The basic idea of the proposed method is to regard the vertex features as prior knowledge and to learn the representation vectors by optimizing an objective function that simultaneously preserves the similarity of the network structure and of the vertex features. The time complexity of the proposed method is O(|V|), where |V| is the number of vertices in the graph. This indicates that the proposed method is suitable for large-scale graph analysis. Experiments on several datasets demonstrate that, compared with state-of-the-art baselines, the proposed method achieves favorable and stable results for the task of node classification.
Abstract: Group recommender systems have recently become one of the most prevalent topics in recommender systems. As an effective solution to the problem of group recommendation, group recommender systems have been applied to news, music, movies, food, and so forth by extending individual recommendation to group recommendation. The existing group recommender systems usually employ either a preference aggregation strategy or a recommendation aggregation strategy, but neither method is fully satisfactory, and each has its own advantages and disadvantages. The preference aggregation strategy suffers from a fairness problem between group members, whereas the recommendation aggregation strategy pays less attention to the interaction between group members. This paper proposes an enhanced group recommendation method based on preference aggregation that incorporates the advantages of both aggregation methods. Further, the paper demonstrates that group preference and personal preference are similar, which is also considered in the proposed method. Experimental results on the Movielens dataset show that the proposed method outperforms baselines in terms of effectiveness.
Abstract: The development of the mobile internet and the popularity of mobile terminals produce massive trajectory data of moving objects in the era of big data. Trajectory data has spatio-temporal characteristics and rich information. Trajectory data processing techniques can be used to mine the patterns of human activities and behaviors, the moving patterns of vehicles in the city, and the changes of the atmospheric environment. However, trajectory data can also be exploited to disclose moving objects' privacy information (e.g., behaviors, hobbies, and social relationships). Accordingly, attackers can easily access moving objects' privacy information by digging into their trajectory data such as activities and check-in locations. On another research front, quantum computation presents an important theoretical direction for mining big data due to its scalable and powerful storage and computing capacity. Applying quantum computing approaches to handle trajectory big data could make some complex problems solvable and achieve higher efficiency. This paper reviews the key technologies of processing trajectory data. First, the concept and characteristics of trajectory data are introduced, and the pre-processing methods, including noise filtering and data compression, are summarized. Then, the trajectory indexing and querying techniques, and the current achievements of mining trajectory data, such as pattern mining and trajectory classification, are reviewed. Next, an overview of the basic theories and characteristics of privacy preserving with respect to trajectory data is provided. The supporting techniques of trajectory big data mining, such as processing frameworks and data visualization, are presented in detail. Some possible ways of applying quantum computation to trajectory data processing, as well as the implementation of some core trajectory mining algorithms by quantum computation, are also described. Finally, the challenges of trajectory data processing and promising future research directions are discussed.
Abstract: Event-Based social networks (EBSNs) have experienced rapid growth in people's daily life. Hence, event recommendation plays an important role in helping people discover interesting online events and attend offline activities face to face in the real world. However, event recommendation is quite different from traditional recommender systems, and there are several challenges:(1) One user can only attend a scarce number of events, leading to a very sparse user-event matrix; (2) The response data of users is implicit feedback; (3) Events have their life cycles, so outdated events should not be recommended to users; (4) A large number of new events which are created every day need to be recommended to users in time. To cope with these challenges, this article proposes to jointly model heterogeneous social and content information for event recommendation. This approach explores both the online and offline social interactions and fuses the content of events to model their joint effect on users' decision-making for events. Extensive experiments are conducted to evaluate the performance of the proposed model on Meetup dataset. The experimental results demonstrate that the proposed model outperforms state-of-the-art methods.
Abstract: Since the factorization machine (FM) model can effectively solve the sparsity problem of high-dimensional data feature combination with high prediction accuracy and computational efficiency, it has been widely studied and applied in the field of click-through-rate (CTR) prediction and recommender systems. The review of the progress on the subsequent research on FM and its related models will help to promote the further improvement and application of the model. By comparing the relationship between the FM model and the polynomial regression model and the factorization model, the flexibility and generality of the FM model are described. Considering width extension, the strategies, methods, and key technologies are summarized from the dimensions of high-order feature interaction, field-aware feature interaction and hierarchical feature interaction, as well as feature extraction, combining, intelligent selection and promotion based on feature engineering. The integration approaches and benefits of FM model with other models, especially the combination with deep learning models are compared and analyzed, which provides insights into the in-depth expansion of traditional models. The learning and optimization methods of FM models and the implementation based on different parallel and distributed computing frameworks are summarized, compared, and analyzed. Finally, the authors forecast the difficult points, hot spots and development trends in the FM model that need to be further studied.
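For reference, the second-order factorization machine on which this line of work builds predicts

    \hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i
    + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j,
    \qquad
    \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j
    = \tfrac{1}{2}\sum_{f=1}^{k}\left[\Big(\sum_{i=1}^{n} v_{i,f}\, x_i\Big)^{2} - \sum_{i=1}^{n} v_{i,f}^{2}\, x_i^{2}\right],

where the identity on the right reduces the pairwise interaction term to linear time in the number of non-zero features, which is the source of the computational efficiency mentioned above.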
Abstract: The development of the Internet has brought convenience to the public, but it also troubles users in making choices among enormous amounts of data. Thus, recommender systems based on user understanding are urgently needed. Different from traditional techniques that usually focus on individual users, social-based recommender systems perform better by integrating social influence modeling to achieve more accurate user profiling. However, current works usually model influence in a simple manner, while deep discussions of the intrinsic mechanism have been largely ignored. To solve this problem, this paper studies the social influence among users that affects both ratings and user attributes, and then proposes a novel trust-driven PMF (TPMF) algorithm to merge these two mechanisms. Furthermore, to deal with the fact that different users should have personalized parameters, the study clusters users according to rating correlation and then maps them to corresponding weights, thereby achieving personalized selection of users' model parameters. Comprehensive experiments on open datasets validate that TPMF and its derived algorithm can effectively predict users' ratings compared with several state-of-the-art baselines, which demonstrates the capability of the presented influence mechanism and technical framework.
Abstract: Recommending valuable and interesting content for microblog users is an important way to improve user experience. In this study, tags are considered as users' interests, and a microblog recommendation method based on hypergraph random walk tag extension and tag probability correlation is proposed, following an analysis of the characteristics and existing limitations of microblog recommendation algorithms. Firstly, microblogs are treated as hyperedges, while each term is taken as a hypervertex, and weighting strategies for both hyperedges and hypervertexes are established. A random walk is conducted on the hypergraph to obtain a number of keywords for the tag expansion of microblog users. Then the weight of each tag for each user is enhanced based on a relevance weighting scheme, and the user-tag matrix is constructed. Probability correlation between tags is calculated to construct the tag similarity matrix, which is then used to update the user-tag matrix so that it contains both user interest information and the relationships between tags. Experimental results show that the algorithm is effective in microblog recommendation.
Abstract: The newly emerging event-based social network (EBSN), with events at its core, combines online relationships with offline activities to promote the formation of real and effective social relationships among users. However, excessive activity information makes it difficult for users to distinguish and choose events. Context-aware local event recommendation is an effective solution to this information overload problem, but most existing local event recommendation algorithms learn users' preferences for contextual information only indirectly from statistics of historical event participation and ignore latent correlations among them, which harms recommendation effectiveness. To take full advantage of the latent correlations between users' event preferences and contextual information, the proposed collective contextual relation learning (CCRL) algorithm models relations among users' participation records and related contextual information such as event organizer, description text, venue, and starting time. Then the multi-relational Bayesian personalized ranking (MRBPR) algorithm is adapted for collective contextual relation learning and local event recommendation. Experimental results on the Meetup dataset demonstrate that the proposed algorithm outperforms state-of-the-art local event recommendation algorithms in terms of many metrics.
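The base criterion that MRBPR extends is the standard Bayesian personalized ranking objective over (user, attended event, non-attended event) triples; the multi-relational extension over the contextual relations is not reproduced here:

    \max_{\Theta}\; \sum_{(u,i,j)\in D_S} \ln \sigma\!\left(\hat{x}_{ui}(\Theta) - \hat{x}_{uj}(\Theta)\right) - \lambda_{\Theta}\,\lVert \Theta \rVert^2,

where \hat{x}_{ui} is the predicted preference of user u for event i, \sigma is the logistic function, and D_S contains triples in which user u attended event i but not event j.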
Abstract: The explosive growth of digital data brings great challenges to relational database management systems in addressing issues such as scalability and fault tolerance. Cloud computing techniques have been widely used in many applications and have become a standard, effective approach to managing large-scale data because of their high scalability, high availability, and fault tolerance. The existing cloud-based data management systems cannot efficiently support complex queries, such as multi-dimensional queries and join queries, because they lack index or view techniques, limiting the application of cloud computing in many respects. This paper conducts in-depth research on index techniques for cloud data management to highlight their strengths and weaknesses. It also introduces the authors' preliminary work on an index for massive IoT data in the cloud environment. Finally, it points out some challenges in index techniques for big data in the cloud environment.
Abstract: This paper offers reflections from the following four aspects: 1) from the law of the development of things, revealing the development history of software engineering technology; 2) from the natural characteristics of software, analyzing the construction of each abstraction layer of the virtual machine; 3) from the perspective of software development, proposing the research content of the software engineering discipline and studying the pattern of industrialized software production; 4) based on the emergence of Internet technology, exploring the development trend of software technology.
Abstract: Context-Aware recommender systems, aiming to further improve performance accuracy and user satisfaction by fully utilizing contextual information, have recently become one of the hottest topics in the domain of recommender systems. This paper presents an overview of the field of context-aware recommender systems from a process-oriented perspective, including system frameworks, key techniques, main models, evaluation, and typical applications. The prospects for future development and suggestions for possible extensions are also discussed.
Abstract: This paper surveys the state of the art of sentiment analysis. First, three important tasks of sentiment analysis are summarized and analyzed in detail, including sentiment extraction, sentiment classification, sentiment retrieval and summarization. Then, the evaluation and corpus for sentiment analysis are introduced. Finally, the applications of sentiment analysis are concluded. This paper aims to take a deep insight into the mainstream methods and recent progress in this field, making detailed comparison and analysis.
Abstract: With the rapid development of e-business, web applications have evolved from localization to globalization, from B2C (business-to-customer) to B2B (business-to-business), and from a centralized fashion to a decentralized fashion. Web services are a new application model for decentralized computing and an effective mechanism for data and service integration on the Web; thus, web services have become a solution to e-business. It is important and necessary to carry out research on new architectures of web services, on combinations with other proven techniques, and on the integration of services. This paper surveys various aspects of web services research, from the basic concepts to the principal research problems and the underlying techniques, including data integration in web services, web service composition, semantic web services, web service discovery, web service security, solutions for web services in the P2P (Peer-to-Peer) computing environment, and grid services. It also presents a summary of the current state of the art of these techniques, a discussion of future research topics, and the challenges facing web services.
Abstract: Information flow analysis is a promising approach for protecting the confidentiality and integrity of information manipulated by computing systems. Taint analysis, its most widely practiced form, is extensively used in software security assurance. This survey summarizes the latest advances in taint analysis, especially the solutions applied to applications on different platforms. First, the basic principle of taint analysis is introduced along with the general techniques for taint propagation implemented by dynamic and static analyses. Then, the proposals applied in different platform frameworks, including techniques for detecting privacy leakage on Android and finding security vulnerabilities on the Web, are analyzed. Finally, further research directions and future work are discussed.
Abstract: Network community structure is one of the most fundamental and important topological properties of complex networks: links within a community are very dense, whereas links between communities are quite sparse. Network clustering algorithms, which aim to discover all natural communities in a given complex network, are fundamentally important for both theoretical research and practical applications; they can be used to analyze the topological structures, understand the functions, recognize the hidden patterns, and predict the behaviors of complex networks, including social networks, biological networks, the World Wide Web, and so on. This paper reviews the background, motivation, state of the art, and main issues of existing work on discovering network communities, and tries to draw a comprehensive and clear outline of this new and active research area. This work is hopefully beneficial to researchers from the communities of complex network analysis, data mining, intelligent Web, and bioinformatics.
Abstract: Wireless sensor networks, a novel technology for acquiring and processing information, have been proposed for a multitude of diverse applications. The problem of self-localization, that is, determining where a given node is physically or relatively located within the network, is challenging yet extremely crucial for many applications. In this paper, performance evaluation criteria and a taxonomy for self-localization systems and algorithms in wireless sensor networks are described, the principles and characteristics of recent representative localization approaches are discussed, and directions for research in this area are introduced.
Abstract: Considered the next-generation computing model, cloud computing plays an important role in scientific and commercial computing and draws great attention from both academia and industry. In a cloud computing environment, a data center consists of a large number of computers, usually up to millions, and stores petabytes or even exabytes of data, which may easily lead to failures of computers or data. Such a large number of computers not only poses great challenges to the scalability of the data center and its storage system, but also results in high hardware infrastructure costs and power costs. Therefore, fault tolerance, scalability, and power consumption of the distributed storage in a data center become key parts of cloud computing technology for ensuring data availability and reliability. This paper surveys the state of the art of the key technologies in cloud computing with respect to the design of data center networks, the organization and arrangement of data, strategies to improve fault tolerance, and methods to save storage space and energy. First, several classical topologies of data center networks are introduced and compared. Second, current fault-tolerant storage techniques are discussed, and data replication and erasure code strategies are compared in particular. Third, the main current energy-saving technologies are addressed and analyzed. Finally, challenges in distributed storage are reviewed and future research trends are predicted.
Abstract: Cyber-Physical Systems (CPSs) have great potentials in several application domains. Time plays an important role in CPS and should be specified in the very early phase of requirements engineering. This paper proposes a framework to model and verify timing requirements for the CPS. To begin with, a conceptual model is presented for providing basic concepts of timing and functional requirements. Guided by this model, the CPS software timing requirement specification can be obtained from CPS environment properties and constraints. To support formal verification, formal semantics for the conceptual model is provided. Based on the semantics, the consistency properties of the timing requirements specification are defined and expressed as CTL formulas. The timing requirements specification is transformed into a NuSMV model and checked by this well-known model checker.
Abstract: In many areas such as science, simulation, the Internet, and e-commerce, the volume of data to be analyzed grows rapidly. Parallel techniques that can be scaled out cost-effectively are needed to deal with such big data. Relational data management techniques have a history of nearly 40 years, and they now encounter the tough obstacle of scalability, since relational techniques cannot handle very large data easily. In the meantime, non-relational techniques, with MapReduce as a typical representative, have emerged as a new force and expanded their applications from Web search to territories that used to be occupied by relational database systems, confronting relational techniques with high availability, high scalability, and massive parallel processing capability. The relational technique community, after losing the big deal of Web search, has begun to learn from MapReduce, while MapReduce also borrows valuable ideas from the relational community to improve performance. Relational techniques and MapReduce compete with and learn from each other; a new data analysis platform and a new data analysis ecosystem are emerging. Eventually, the two camps of techniques will find their right places in the new ecosystem of big data analysis.
Abstract: This paper firstly presents a summary of AADL (architecture analysis and design language), including its progress over the years and its modeling elements. Then, it surveys the research and practice of AADL from a model-based perspective, such as AADL modeling, AADL formal semantics, model transformation, verification and code generation. Finally, the potential research directions are discussed.
Abstract: Nowadays it has been widely accepted that the quality of software highly depends on the process that is carried out in an organization. As part of the effort to support software process engineering activities, the research on software process modeling and analysis is to provide an effective means to represent and analyze a process and, by doing so, to enhance the understanding of the modeled process. In addition, an enactable process model can provide a direct guidance for the actual development process. Thus, the enforcement of the process model can directly contribute to the improvement of the software quality. In this paper, a systematic review is carried out to survey the recent development in software process modeling. 72 papers from 20 conference proceedings and 7 journals are identified as the evidence. The review aims to promote a better understanding of the literature by answering the following three questions: 1) What kinds of paradigms are existing methods based on? 2) What kinds of purposes does the existing research have? 3) What kinds of new trends are reflected in the current research? After providing the systematic review, we present our software process modeling method based on a multi-dimensional and integration methodology that is intended to address several core issues facing the community.
Abstract: The proliferation of intelligent devices equipped with short-range wireless communication has boosted the rapid rise of wireless ad hoc network applications. However, in many realistic application environments, nodes form a disconnected network most of the time due to node mobility, low density, lossy links, and so on. The conventional communication model of mobile ad hoc networks (MANETs) requires at least one path from the source to the destination node, which leads to communication failure in these scenarios. Opportunistic networks utilize the communication opportunities arising from node movement to forward messages hop by hop, and implement communication between nodes based on the "store-carry-forward" routing pattern. This networking approach, totally different from the traditional communication model, has captured great interest from researchers. This paper first introduces the concepts and theory of opportunistic networks and some current typical applications. It then elaborates on the popular research problems, including opportunistic forwarding mechanisms, mobility models, and opportunistic data dissemination and retrieval. Some other interesting research points, such as communication middleware, cooperation and security problems, and new applications, are described briefly. Finally, the paper concludes and looks forward to possible research focuses for opportunistic networks in the future.
Abstract: This paper presents a comprehensive survey of recommender system research to help readers understand this field. First, the research background is introduced, including commercial application demands, academic institutes, conferences, and journals. After describing the recommendation problem both formally and informally, a comparative study is conducted of the categorized algorithms. In addition, the commonly adopted benchmark datasets and evaluation methods are presented, and the main difficulties and future directions are summarized.
Abstract: With the explosive growth of network applications and complexity, the threat of Internet worms to network security is becoming increasingly serious. Especially in the Internet environment, the variety of propagation vectors and the complexity of application environments result in worms with a much higher outbreak frequency, deeper latency, and wider coverage, and Internet worms have become a primary issue faced by malicious code researchers. In this paper, the concept and research status of Internet worms are first presented, together with their function components and execution mechanism; then the scanning strategies and propagation models are discussed; and finally the critical techniques of Internet worm prevention are given. Some major problems and research trends in this area are also addressed.
Abstract: This paper studies uncertain graph data mining, and in particular the problem of mining frequent subgraph patterns from uncertain graph data. A data model is introduced for representing uncertainties in graphs, and expected support is employed to evaluate the significance of subgraph patterns. Using the apriori property of expected support, a depth-first-search-based mining algorithm is proposed with an efficient method for computing expected support and a technique for pruning the search space, which reduces the number of subgraph isomorphism tests needed to compute expected support from exponential to linear scale. Experimental results show that the proposed algorithm is 3 to 5 orders of magnitude faster than a naïve depth-first search algorithm, and is efficient and scalable.
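The abstract above does not spell out the expected-support measure; a common formalization under the possible-world semantics with independent edge probabilities, given here as an assumption rather than the paper's exact definition, is:

```latex
% Expected support of a subgraph pattern P over an uncertain graph database D,
% assuming each edge e exists independently with probability p(e).
\begin{align*}
\Pr(W \mid G) &= \prod_{e \in W} p(e) \prod_{e \in E(G) \setminus W} \bigl(1 - p(e)\bigr),
  && W \subseteq E(G) \text{ a possible world of } G,\\
\mathrm{esup}(P) &= \frac{1}{|D|} \sum_{G \in D} \;\sum_{W \subseteq E(G)} \Pr(W \mid G)\,
  \mathbb{1}\bigl[\,P \sqsubseteq W\,\bigr],
\end{align*}
% where P \sqsubseteq W means that P is subgraph-isomorphic to the world W.
```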
Abstract: This paper introduces the concrete details of combining automated reasoning techniques with planning methods, including planning as satisfiability using propositional logic, conformant planning using modal logic and disjunctive reasoning, planning as nonmonotonic logic, and flexible planning as fuzzy description logic. After considering the experimental results of the International Planning Competition and the relevant literature, it concludes that planning methods based on automated reasoning techniques are helpful and can be adopted. It also identifies the challenges and possible research hotspots.
Abstract: Sensor networks integrate sensor techniques, embedded computing techniques, distributed computing techniques, and wireless communication techniques. They can be used for testing, sensing, collecting, and processing information about monitored objects and transferring the processed information to users. Sensor networks are a new research area of computer science and technology with broad application prospects, and both academia and industry are very interested in them. The concepts and characteristics of sensor networks and the data in such networks are introduced, and the issues of sensor networks and sensor network data management are discussed. Advances in research on sensor networks and sensor network data management are also presented.
Abstract: Batch computing and stream computing are two important forms of big data computing. Research and discussion on batch computing in big data environments are comparatively mature, but how to efficiently handle stream computing so as to meet requirements such as low latency, high throughput, and continuously reliable operation, and how to build efficient stream big data computing systems, remain great challenges in big data computing research. This paper studies the system architectures and key issues of stream computing in big data environments. First, it gives a brief summary of three application scenarios of stream computing in business intelligence, marketing, and public service, and describes the distinctive features of stream computing in big data environments, such as real-time behavior, volatility, burstiness, irregularity, and infinity. A well-designed stream computing system is optimized in terms of system structure, data transmission, application interfaces, high availability, and so on. Subsequently, the paper offers detailed analyses and comparisons of five typical open-source stream computing systems for big data environments. Finally, it specifically addresses some new challenges for stream big data systems, such as scalability, fault tolerance, consistency, load balancing, and throughput.
Abstract: Intrusion detection is a highlighted topic of network security research in recent years. In this paper, first the necessity of intrusion detection is presented, and its concepts and models are described. Then, many intrusion detection techniques and architectures are summarized. Finally, the existing problems and the future direction in this field are discussed.
Abstract: With the recent development of cloud computing, the importance of cloud databases has been widely acknowledged. Here, the features, influence and related products of cloud databases are first discussed. Then, research issues of cloud databases are presented in detail, which include data model, architecture, consistency, programming model, data security, performance optimization, benchmark, and so on. Finally, some future trends in this area are discussed.
Abstract: In a multi-hop wireless sensor network (WSN), the sensors closest to the sink tend to deplete their energy faster than other sensors, a phenomenon known as the energy hole around the sink. Once an energy hole appears, no more data can be delivered to the sink, a considerable amount of energy is wasted, and the network lifetime ends prematurely. This paper investigates the energy hole problem and, based on an improved corona model with levels, concludes that assigning different transmission ranges to nodes in different coronas is an effective approach to achieving an energy-efficient network. It proves that determining the optimal transmission ranges for all coronas is a multi-objective optimization problem (MOP), which is NP-hard. The paper proposes an ACO (ant colony optimization)-based distributed algorithm to prolong the network lifetime, which helps nodes in different areas adaptively find approximately optimal transmission ranges based on the node distribution. Simulation results indicate that the network lifetime under this solution approximates that obtained with the optimal transmission range list. Compared with existing algorithms, the ACO-based algorithm not only extends the network lifetime by more than a factor of two, but also performs well under non-uniform node distributions.
Abstract: Many application-specific NoSQL database systems have been developed to satisfy the new requirements of big data management. This paper surveys research on typical NoSQL databases based on the key-value data model. First, the characteristics of big data and the key technical issues in supporting big data management are introduced. Then, frontier efforts and research challenges are presented, including system architecture, data model, access mode, indexing, transactions, system elasticity, load balancing, replica strategies, data consistency, flash caching, MapReduce-based data processing, and new-generation data management systems. Finally, research prospects are given.
Abstract: Software architecture (SA) has emerged as one of the primary research areas in software engineering and one of the key technologies for developing large-scale software-intensive systems and software product lines. The history and major directions of SA research are summarized, and the concept of SA is presented based on an analysis and comparison of several classical definitions. By summing up the activities involved in SA, two categories of SA research are identified, and the advances in SA research are then introduced from seven aspects. Additionally, some shortcomings of SA research are discussed and their causes are explained. Finally, some significantly promising trends in SA research are presented as a conclusion.
Abstract: Routing technology at the network layer is pivotal in the architecture of wireless sensor networks. As an active branch of routing technology, cluster-based routing protocols excel in network topology management, energy minimization, data aggregation and so on. In this paper, cluster-based routing mechanisms for wireless sensor networks are analyzed. Cluster head selection, cluster formation and data transmission are three key techniques in cluster-based routing protocols. As viewed from the three techniques, recent representative cluster-based routing protocols are presented, and their characteristics and application areas are compared. Finally, the future research issues in this area are pointed out.
Abstract: Sensor networks, formed by the convergence of sensor, micro-electro-mechanical system, and network technologies, are a novel technology for acquiring and processing information. In this paper, the architecture of wireless sensor networks is briefly introduced. Next, some valuable applications are explained and forecast. In combination with existing work, the research hotspots, including power-aware routing and medium access control schemes, are discussed in detail. Finally, taking application requirements into account, several future research directions are put forward.
Abstract: The research status and recent progress of clustering algorithms are summarized in this paper. First, representative clustering algorithms are analyzed and categorized in terms of their underlying ideas, key techniques, and advantages and disadvantages. Then, several typical clustering algorithms and well-known datasets are selected, and simulation experiments are carried out with respect to both accuracy and running efficiency; the behavior of each algorithm on different datasets is analyzed by comparing it with the clustering of the same dataset under different algorithms. Finally, by integrating the results of these two analyses, the research hotspots, difficulties, and shortcomings of data clustering, as well as some open problems, are addressed. This work provides a valuable reference for data clustering and data mining.
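As an illustration of the accuracy-versus-efficiency experiments the abstract above describes, the following hypothetical comparison runs two typical clustering algorithms on a synthetic dataset with scikit-learn; the chosen algorithms, metric, and data are assumptions, not the paper's actual setup.

```python
import time
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data with known ground-truth labels.
X, y_true = make_blobs(n_samples=3000, centers=4, cluster_std=1.2, random_state=42)

algorithms = {
    "k-means": KMeans(n_clusters=4, n_init=10, random_state=42),
    "agglomerative": AgglomerativeClustering(n_clusters=4),
}

for name, algo in algorithms.items():
    start = time.perf_counter()
    labels = algo.fit_predict(X)
    elapsed = time.perf_counter() - start
    # Accuracy side: agreement with ground truth; efficiency side: wall-clock time.
    print(f"{name:>14}: ARI={adjusted_rand_score(y_true, labels):.3f}, time={elapsed:.3f}s")
```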
Abstract: Cloud computing is a fundamental change happening in the field of information technology, representing a movement toward intensive, large-scale specialization. On the other hand, it brings not only convenience and efficiency but also great challenges in data security and privacy protection. Currently, security is regarded as one of the greatest problems in the development of cloud computing. This paper describes the major requirements in cloud computing security, the key technologies, standards, and regulations, and provides a cloud computing security framework. It argues that the changes in the above aspects will result in a technical revolution in the field of information security.
Abstract: Evolutionary multi-objective optimization (EMO), whose main task is to solve multi-objective optimization problems by evolutionary computation, has become a hot topic in the evolutionary computation community. After briefly summarizing EMO algorithms developed before 2003, this paper discusses recent advances in EMO in detail and summarizes the current research directions. On the one hand, new evolutionary paradigms have been introduced into the EMO community, such as particle swarm optimization, artificial immune systems, and estimation of distribution algorithms. On the other hand, to handle many-objective optimization problems, many new dominance schemes different from traditional Pareto dominance have emerged. Furthermore, the essential characteristics of multi-objective optimization problems are deeply investigated. The paper also gives an experimental comparison of several representative algorithms. Finally, several viewpoints on future EMO research are proposed.
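For reference, the Pareto-dominance relation at the core of the abstract above can be sketched as follows; minimization is assumed, and the naive filter is for illustration only.

```python
import numpy as np

def dominates(a, b):
    """Pareto dominance for minimization: a dominates b if it is no worse in
    every objective and strictly better in at least one."""
    a, b = np.asarray(a), np.asarray(b)
    return bool(np.all(a <= b) and np.any(a < b))

def non_dominated(points):
    """Naive O(n^2) filter returning the non-dominated subset of a population."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

population = [(1.0, 5.0), (2.0, 2.0), (3.0, 4.0), (4.0, 1.0)]
print(non_dominated(population))   # (3.0, 4.0) is dominated by (2.0, 2.0)
```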
Abstract: This paper surveys the technologies currently adopted in cloud computing as well as the systems used in enterprises. Cloud computing can be viewed from two aspects: the cloud infrastructure, which is the building block for upper-layer cloud applications, and the cloud applications themselves. This paper focuses on the cloud infrastructure, including existing systems and current research, and some attractive cloud applications are also discussed. Cloud computing infrastructure has three distinct characteristics. First, the infrastructure is built on top of large-scale clusters containing a large number of cheap PC servers. Second, the applications are co-designed with the underlying infrastructure so that computing resources can be utilized maximally. Third, the reliability of the whole system is achieved by software built on top of redundant hardware rather than by hardware alone. All these technologies serve the two important goals of distributed systems: high scalability and high availability. Scalability means that the cloud infrastructure can be expanded to a very large scale, even to thousands of nodes; availability means that the services remain available even when quite a number of nodes fail. From this paper, readers can capture the current status of cloud computing as well as its future trends.
Abstract: This paper first introduces the key features of big data in different processing modes and their typical application scenarios, as well as corresponding representative processing systems. It then summarizes three development trends of big data processing systems. Next, the paper gives a brief survey on system supported analytic technologies and applications (including deep learning, knowledge computing, social computing, and visualization), and summarizes the key roles of individual technologies in big data analysis and understanding. Finally, the paper lays out three grand challenges of big data processing and analysis, i.e., data complexity, computation complexity, and system complexity. Potential ways for dealing with each complexity are also discussed.
Abstract: Automatic generation of poetry has always been considered a hard nut to crack in natural language generation. This paper reports some pioneering research on a genetic algorithm for the automatic generation of SONGCI (Song ci poems). In light of the characteristics of classical Chinese poetry, the paper designs a coding method based on level and oblique tones, a fitness function weighted by syntactic and semantic features, a selection operator combining elitism and roulette-wheel selection, a partially mapped crossover operator, and a heuristic mutation operator. Tests show that the system built on this computing model is basically capable of generating Chinese SONGCI with some aesthetic merit. This work represents progress in the field of automatic generation of Chinese poetry.
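The abstract above names an elitism-and-roulette-combined selection operator without detailing it; the sketch below shows one conventional way such a combination is often implemented, with hypothetical candidate verses and fitness scores.

```python
import numpy as np

rng = np.random.default_rng(0)

def select(population, fitness, n_elite=2):
    """Elitism + roulette-wheel selection (a common combination, assumed here):
    the n_elite fittest individuals survive unchanged, and the remaining slots
    are filled by sampling with probability proportional to fitness."""
    fitness = np.asarray(fitness, dtype=float)
    order = np.argsort(fitness)[::-1]                 # best individuals first
    elite = [population[i] for i in order[:n_elite]]
    probs = fitness / fitness.sum()                   # assumes positive fitness values
    drawn = rng.choice(len(population), size=len(population) - n_elite, p=probs)
    return elite + [population[i] for i in drawn]

# Toy example: candidate verses scored by a (hypothetical) fitness function.
candidates = ["verse-A", "verse-B", "verse-C", "verse-D", "verse-E"]
scores = [0.9, 0.4, 0.7, 0.2, 0.6]
print(select(candidates, scores))
```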
Abstract: Few-shot learning means learning models that solve problems from a small number of samples. In recent years, under the trend of training models with big data, machine learning and deep learning have achieved success in many fields. However, in many real-world application scenarios there is not a large amount of data, or not a large amount of labeled data, for model training, and labeling a large number of unlabeled samples costs a great deal of manpower. Therefore, how to learn from a small number of samples has become a problem that demands attention. This paper systematically reviews current approaches to few-shot learning. It introduces the corresponding models in three categories: fine-tuning based, data augmentation based, and transfer learning based. The data augmentation based approaches are further subdivided into unlabeled-data based, data generation based, and feature augmentation based approaches, and the transfer learning based approaches are subdivided into metric learning based, meta-learning based, and graph neural network based methods. The paper then summarizes the few-shot datasets and the experimental results of the aforementioned models, discusses the current situation and challenges of few-shot learning, and finally looks ahead to its future technological development.
Abstract: Android is a modern and the most popular software platform for smartphones. According to reports, Android accounted for a huge 81% of all smartphones in 2014 and shipped over 1 billion units worldwide for the first time ever, with Apple, Microsoft, Blackberry, and Firefox trailing far behind. At the same time, the increased popularity of Android smartphones has attracted hackers, leading to a massive increase in Android malware applications. This paper summarizes and analyzes the latest advances in Android security from multidimensional perspectives, covering Android architecture, design principles, security mechanisms, major security threats, classification and detection of malware, static and dynamic analyses, machine learning approaches, and security extension proposals.
Abstract: The graphics processing unit (GPU) has been developing rapidly in recent years at a speed exceeding Moore's law, and as a result, various applications associated with computer graphics have advanced greatly. At the same time, the high processing power, parallelism, and programmability available on contemporary GPUs provide an ideal platform for general-purpose computation. Starting from an introduction to the development history and architecture of the GPU, this paper describes the technical fundamentals of the GPU. The main part of the paper then introduces the development of various applications of general-purpose computation on the GPU, among which fluid dynamics, algebraic computation, database operations, and spectrum analysis are described in detail. Our own experience with fluid dynamics is also reported, and the development of software tools in this area is introduced. Finally, a conclusion is drawn, and future developments and new challenges in both hardware and software in this field are discussed.
Abstract: Probabilistic graphical models are powerful tools for compactly representing complex probability distributions, efficiently computing (approximate) marginal and conditional distributions, and conveniently learning parameters and hyperparameters in probabilistic models. As a result, they have been widely used in applications that require some sort of automated probabilistic reasoning, such as computer vision and natural language processing, as a formal approach to deal with uncertainty. This paper surveys the basic concepts and key results of representation, inference and learning in probabilistic graphical models, and demonstrates their uses in two important probabilistic models. It also reviews some recent advances in speeding up classic approximate inference algorithms, followed by a discussion of promising research directions.
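As a small, concrete illustration of the marginal and conditional computations mentioned in the abstract above, the following enumerates the joint distribution of a made-up three-variable Bayesian network; practical systems use the exact and approximate inference algorithms the survey covers.

```python
from itertools import product

# A made-up 3-node Bayesian network: Cloudy -> Rain -> WetGrass.
p_cloudy = {True: 0.5, False: 0.5}
p_rain = {True: {True: 0.8, False: 0.2},     # P(Rain | Cloudy)
          False: {True: 0.1, False: 0.9}}
p_wet = {True: {True: 0.9, False: 0.1},      # P(WetGrass | Rain)
         False: {True: 0.2, False: 0.8}}

def joint(c, r, w):
    """Joint probability via the chain rule of the network."""
    return p_cloudy[c] * p_rain[c][r] * p_wet[r][w]

# Marginal P(WetGrass = True): sum the joint over all other assignments.
p_wet_true = sum(joint(c, r, True) for c, r in product([True, False], repeat=2))
# Conditional P(Rain = True | WetGrass = True): renormalize the joint.
p_rain_and_wet = sum(joint(c, True, True) for c in [True, False])
print(round(p_wet_true, 4), round(p_rain_and_wet / p_wet_true, 4))
```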
Abstract: Symbolic propagation methods based on linear abstraction play a significant role in neural network verification. This study proposes the notion of multi-path back-propagation for these methods. Existing methods are viewed as using only a single back-propagation path to calculate the upper and lower bounds of each node in a given neural network, being specific instances of the proposed notion. Leveraging multiple back-propagation paths effectively improves the accuracy of this kind of method. For evaluation, the proposed method is quantitatively compared using multiple back-propagation paths with the state-of-the-art tool DeepPoly on benchmarks ACAS Xu, MNIST, and CIFAR10. The experiment results show that the proposed method achieves significant accuracy improvement while introducing only a low extra time cost. In addition, the multi-path back-propagation method is compared with the Optimized LiRPA based on global optimization, on the dataset MNIST. The results show that the proposed method still has an accuracy advantage.
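For background, the sketch below propagates interval bounds through a single affine-plus-ReLU layer; this is the coarser interval abstraction that linear symbolic methods such as the one in the abstract above tighten, not the paper's multi-path method itself.

```python
import numpy as np

def affine_relu_bounds(W, b, lower, upper):
    """Propagate elementwise input bounds [lower, upper] through y = ReLU(Wx + b).
    Splitting W into positive and negative parts yields sound pre-activation bounds."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    pre_low = W_pos @ lower + W_neg @ upper + b
    pre_up = W_pos @ upper + W_neg @ lower + b
    return np.maximum(pre_low, 0.0), np.maximum(pre_up, 0.0)   # ReLU is monotone

# Toy layer with a 2-dimensional input box [-1, 1] x [-1, 1].
W = np.array([[1.0, -2.0], [0.5, 0.5]])
b = np.array([0.0, -0.1])
low, up = affine_relu_bounds(W, b, np.array([-1.0, -1.0]), np.array([1.0, 1.0]))
print(low, up)   # certified output range of this layer for any input in the box
```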
Abstract: Ultrasonography is the first choice of imaging examination and preoperative evaluation for thyroid and breast cancer. However, the ultrasonic characteristics of benign and malignant nodules commonly overlap, and diagnosis relies heavily on the operator's experience rather than on quantitative and stable methods. In recent years, medical image analysis based on computer technology has developed rapidly, with a series of landmark breakthroughs that provide effective decision support for medical imaging diagnosis. This work studies the research progress of computer vision and image recognition technologies for thyroid and breast ultrasound images, taking the key technologies involved in the automatic diagnosis of ultrasound images as its main line. The major algorithms of recent years are summarized and analyzed, including ultrasound image preprocessing, lesion localization and segmentation, and feature extraction and classification. Moreover, a multi-dimensional analysis is made of the algorithms, datasets, and evaluation methods. Finally, the existing problems in the automatic analysis of these two kinds of ultrasound images are discussed, and the research trends and development directions in the field of ultrasound image analysis are outlined.
Abstract: Computer-aided detection/diagnosis (CAD) can improve the accuracy of diagnosis, reduce false positives, and provide decision support for doctors. The main purpose of this paper is to analyze the latest developments in computer-aided diagnosis tools. Focusing on the incidence sites of the four most fatal cancers, major recent publications on CAD applications in different medical imaging areas are reviewed according to imaging technique and disease. Furthermore, a multidimensional analysis is made of the research in terms of image datasets, algorithms, and evaluation methods. Finally, existing problems, research trends, and development directions in the field of medical image CAD systems are discussed.
Abstract: Network abstraction has brought about the emergence of software-defined networking (SDN). SDN decouples the data plane from the control plane and simplifies network management. This paper starts with a discussion of the background and development of SDN and outlines its architecture, which includes the data layer, control layer, and application layer. The key technologies are then elaborated according to the hierarchical architecture of SDN, with the characteristics of consistency, availability, and tolerance analyzed in particular. Moreover, the latest achievements in typical application scenarios are introduced. Future work is summarized at the end.
Abstract: Task parallel programming model is a widely used parallel programming model on multi-core platforms. With the intention of simplifying parallel programming and improving the utilization of multiple cores, this paper provides an introduction to the essential programming interfaces and the supporting mechanism used in task parallel programming models and discusses issues and the latest achievements from three perspectives: Parallelism expression, data management and task scheduling. In the end, some future trends in this area are discussed.
Abstract: The Internet traffic model is a key issue for network performance management, quality of service management, and admission control. This paper first summarizes the primary characteristics of Internet traffic as well as its metrics, and illustrates the significance and classification of traffic modeling. Next, it chronologically categorizes the research activities on traffic modeling into three phases: 1) traditional Poisson modeling; 2) self-similar modeling; and 3) new research debates and new progress, and thoroughly reviews the major research achievements of each phase. Finally, the paper identifies some open research issues and points out possible future research directions in the area of traffic modeling.
Abstract: The development of the mobile Internet and the popularity of mobile terminals produce massive trajectory data of moving objects in the era of big data. Trajectory data have spatio-temporal characteristics and rich information. Trajectory data processing techniques can be used to mine the patterns of human activities and behaviors, the movement patterns of vehicles in cities, and changes in the atmospheric environment. However, trajectory data can also be exploited to disclose moving objects' private information (e.g., behaviors, hobbies, and social relationships); attackers can easily access such private information by digging into trajectory data such as activities and check-in locations. On another front, quantum computation presents an important theoretical direction for mining big data due to its scalable and powerful storage and computing capacity, and applying quantum computing approaches to trajectory big data could make some complex problems solvable and achieve higher efficiency. This paper reviews the key technologies for processing trajectory data. First, the concept and characteristics of trajectory data are introduced, and pre-processing methods, including noise filtering and data compression, are summarized. Then, trajectory indexing and querying techniques and the current achievements in trajectory data mining, such as pattern mining and trajectory classification, are reviewed. Next, an overview of the basic theories and characteristics of privacy preservation for trajectory data is provided, and the supporting techniques for trajectory big data mining, such as processing frameworks and data visualization, are presented in detail. Some possible ways of applying quantum computation to trajectory data processing, as well as quantum implementations of some core trajectory mining algorithms, are also described. Finally, the challenges of trajectory data processing and promising future research directions are discussed.
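As one concrete example of the trajectory compression mentioned among the pre-processing methods above, here is a sketch of the classical Douglas-Peucker algorithm (chosen here for illustration; the survey itself covers a range of techniques).

```python
import numpy as np

def douglas_peucker(points, epsilon):
    """Recursively simplify a trajectory: a point is kept only if it lies farther
    than epsilon from the chord joining the current segment's endpoints."""
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    seg = end - start
    seg_len = np.linalg.norm(seg)
    d = points - start
    if seg_len == 0:                       # degenerate chord: distance to a point
        dists = np.linalg.norm(d, axis=1)
    else:                                  # perpendicular distance to the chord
        dists = np.abs(seg[0] * d[:, 1] - seg[1] * d[:, 0]) / seg_len
    idx = int(np.argmax(dists))
    if dists[idx] <= epsilon:
        return np.vstack([start, end])
    left = douglas_peucker(points[: idx + 1], epsilon)
    right = douglas_peucker(points[idx:], epsilon)
    return np.vstack([left[:-1], right])   # drop the duplicated split point

traj = [(0, 0), (1, 0.1), (2, -0.1), (3, 5), (4, 6), (5, 7)]
print(douglas_peucker(traj, epsilon=0.5))
```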
Abstract: In this paper, existing intrusion tolerance and self-destruction technologies are integrated into autonomic computing in order to construct an autonomic dependability model based on SM-PEPA (semi-Markov performance evaluation process algebra), which is capable of formal analysis and verification. The model can hierarchically anticipate threats to dependability (TtD) at different levels in a self-managing manner to satisfy the special dependability requirements of mission-critical systems. Based on this model, a quantification approach built on steady-state probabilities is proposed to evaluate autonomic dependability. Finally, the paper analyzes the impact of the model parameters on autonomic dependability in a case study, and the experimental results demonstrate that improving the detection rate of TtD as well as the success rate of self-healing greatly increases autonomic dependability.
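As a simplified illustration of the steady-state evaluation mentioned above, the snippet below computes the stationary distribution of a small continuous-time Markov chain with NumPy; the three-state generator matrix is hypothetical and merely stands in for the paper's SM-PEPA model.

```python
import numpy as np

# Hypothetical 3-state generator matrix Q (each row sums to 0):
# states: 0 = healthy, 1 = threatened, 2 = self-healing.
Q = np.array([
    [-0.2,  0.2,  0.0],   # healthy    -> threatened at rate 0.2
    [ 0.0, -1.0,  1.0],   # threatened -> self-healing at rate 1.0
    [ 0.9,  0.1, -1.0],   # healing succeeds (0.9) or the threat persists (0.1)
])

# Solve pi Q = 0 with sum(pi) = 1 by appending the normalization constraint.
A = np.vstack([Q.T, np.ones(3)])
b = np.concatenate([np.zeros(3), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.round(pi, 4))   # long-run fraction of time spent in each state
```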
Abstract: Attribute-based encryption (ABE) takes attributes as the public key and associates the ciphertext and the user's secret key with attributes, so that it can support expressive access control policies. This dramatically reduces the network bandwidth and the sending node's processing overhead in the fine-grained access control of shared data, so ABE has broad application prospects in fine-grained access control. After analyzing the basic ABE scheme and its two variants, key-policy ABE (KP-ABE) and ciphertext-policy ABE (CP-ABE), this study elaborates the research problems related to ABE, including access structure design for CP-ABE, attribute key revocation, key abuse, and multi-authority ABE, with an extensive comparison of their functionality and performance. Finally, it discusses the problems that remain to be solved and the main research directions for ABE.
Abstract: This paper surveys the state of the art of speech emotion recognition (SER), and presents an outlook on the trend of future SER technology. First, the survey summarizes and analyzes SER in detail from five perspectives, including emotion representation models, representative emotional speech corpora, emotion-related acoustic features extraction, SER methods and applications. Then, based on the survey, the challenges faced by current SER research are concluded. This paper aims to take a deep insight into the mainstream methods and recent progress in this field, and presents detailed comparison and analysis between these methods.
Abstract: In recent years, the rapid development of Internet technology and Web applications has triggered an explosion of data on the Internet, which contains a large amount of valuable knowledge. How to organize, represent, and analyze this knowledge has attracted much attention. Knowledge graphs were developed to organize such knowledge in a semantic and visualized manner, and knowledge reasoning over knowledge graphs has become one of the hot research topics, playing an important role in applications such as vertical search and intelligent question answering. The goal of knowledge reasoning over knowledge graphs is to infer new facts or identify erroneous facts from existing ones. Unlike traditional knowledge reasoning, knowledge reasoning over knowledge graphs is more diverse, owing to the simplicity, intuitiveness, flexibility, and richness of knowledge representation in knowledge graphs. Starting from the basic concept of knowledge reasoning, this paper surveys recently developed methods for knowledge reasoning over knowledge graphs. Specifically, the research progress is reviewed in detail from two aspects, one-step reasoning and multi-step reasoning, each covering rule-based reasoning, distributed-embedding-based reasoning, neural-network-based reasoning, and hybrid reasoning. Finally, future research directions and an outlook on knowledge reasoning over knowledge graphs are discussed.
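Among the distributed-embedding-based reasoning methods covered above, TransE is a representative example; the sketch below shows its translation-based scoring of candidate triples with made-up, untrained embeddings.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 50
# Made-up embedding tables; a real system would train these on the knowledge graph,
# so the ranking printed here is meaningless and only illustrates the scoring rule.
entities = {name: rng.standard_normal(dim) for name in ["Paris", "France", "Berlin"]}
relations = {"capital_of": rng.standard_normal(dim)}

def transe_score(h, r, t):
    """TransE models a fact (h, r, t) as h + r being close to t; higher score = more plausible."""
    return -np.linalg.norm(entities[h] + relations[r] - entities[t])

# Rank candidate tails for the query (Paris, capital_of, ?).
for tail in ["France", "Berlin"]:
    print(tail, round(transe_score("Paris", "capital_of", tail), 3))
```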
Abstract: The honeypot is a proactive defense technology introduced by the defending side to change the asymmetric situation of the network attack-defense game. By deploying honeypots, i.e., security resources with no production purpose, defenders can deceive attackers into illegally using them, and then capture and analyze the attack behaviors to understand the attackers' tools and methods and infer their intentions and motivations. Honeypot technology has attracted sustained attention from the security community, made considerable progress, and been widely applied, becoming one of the main technical means for monitoring and analyzing Internet security threats. In this paper, the origin and evolution of honeypot technology are presented first. Next, the key mechanisms of honeypot technology are comprehensively analyzed, the development of honeypot deployment structures is reviewed, and the latest applications of honeypot technology to the monitoring, analysis, and prevention of Internet security threats are summarized. Finally, the problems of honeypot technology, its development trends, and further research directions are discussed.
Abstract: Uncertainty exists widely in the subjective and objective world. Among all kinds of uncertainty, randomness and fuzziness are the most important and fundamental. This paper discusses the relationship between randomness and fuzziness. Uncertain states and their changes can be measured by entropy and hyper-entropy, respectively. Using entropy and hyper-entropy, the uncertainty of chaos, fractals, and complex networks arising from their various forms of evolution and differentiation is further studied. A simple and effective way is proposed to simulate uncertainty by means of knowledge representation, which provides a basis for automating both logical and image thinking under uncertainty. AI (artificial intelligence) with uncertainty is a new cross-discipline that covers computer science, physics, mathematics, brain science, psychology, cognitive science, biology, and philosophy, and results in the automated representation, processing, and thinking of uncertain information and knowledge.
Abstract: Design problems are ubiquitous in scientific research and industrial applications. In recent years, Bayesian optimization, a very effective global optimization algorithm, has been widely applied to design problems. By appropriately structuring the probabilistic surrogate model and the acquisition function, the Bayesian optimization framework can obtain the optimal solution within a small number of function evaluations, making it well suited to extremely complex optimization problems whose objective functions cannot be expressed in closed form, or are non-convex, multimodal, and computationally expensive. This paper provides a detailed analysis of Bayesian optimization in terms of methodology and application areas, and discusses its current research status and open problems for future research. This work is hopefully beneficial to researchers from the related communities.
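To make the surrogate-plus-acquisition loop described above concrete, here is a compact sketch of one-dimensional Bayesian optimization with a Gaussian-process surrogate (RBF kernel) and the expected-improvement acquisition; the kernel parameters and test function are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def rbf(a, b, length=0.3):
    """Squared-exponential kernel matrix between two 1-D point sets."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """GP posterior mean and standard deviation at test points Xs given data (X, y)."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks, Kss = rbf(X, Xs), rbf(Xs, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = np.clip(np.diag(Kss - Ks.T @ np.linalg.solve(K, Ks)), 1e-12, None)
    return mu, np.sqrt(var)

def expected_improvement(mu, sigma, best, xi=0.01):
    """EI acquisition for maximization."""
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

f = lambda x: np.sin(3 * x) + 0.5 * np.cos(5 * x)   # hypothetical expensive objective
rng = np.random.default_rng(0)
X = rng.uniform(0, 2, 3)                            # small initial design
y = f(X)
grid = np.linspace(0, 2, 400)                       # candidate points for the acquisition

for _ in range(15):                                 # sequential BO iterations
    mu, sigma = gp_posterior(X, y, grid)
    x_next = grid[np.argmax(expected_improvement(mu, sigma, y.max()))]
    X, y = np.append(X, x_next), np.append(y, f(x_next))

print("best x, f(x):", X[np.argmax(y)], y.max())
```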
Abstract: The popularity of the Internet and the boom of the World Wide Web foster innovative changes in software technology that give birth to a new form of software—networked software, which delivers diversified and personalized on-demand services to the public. With the ever-increasing expansion of applications and users, the scale and complexity of networked software are growing beyond the information processing capability of human beings, which brings software engineers a series of challenges to face. In order to come to a scientific understanding of this kind of ultra-large-scale artificial complex systems, a survey research on the infrastructure, application services, and social interactions of networked software is conducted from a three-dimensional perspective of cyberization, servicesation, and socialization. Interestingly enough, most of them have been found to share the same global characteristics of complex networks such as “Small World” and “Scale Free”. Next, the impact of the empirical study on software engineering research and practice and its implications for further investigations are systematically set forth. The convergence of software engineering and other disciplines will put forth new ideas and thoughts that will breed a new way of thinking and input new methodologies for the study of networked software. This convergence is also expected to achieve the innovations of theories, methods, and key technologies of software engineering to promote the rapid development of software service industry in China.
Abstract: In recent years, Deep Learning (DL) has been widely applied to Image Semantic Segmentation (ISS) owing to its state-of-the-art performance and high-quality results. This paper systematically reviews the contribution of DL to the field of ISS. Different methods of ISS based on DL (ISSbDL) are summarized and divided into ISS based on Regional Classification (ISSbRC) and ISS based on Pixel Classification (ISSbPC) according to the segmentation characteristics and granularity. The ISSbPC methods are then surveyed from two points of view: ISS based on Fully Supervised Learning (ISSbFSL) and ISS based on Weakly Supervised Learning (ISSbWSL). The representative algorithms of each method are introduced and analyzed, and the basic workflow, framework, advantages, and disadvantages of these methods are analyzed and compared in detail. In addition, the related experiments on ISS are analyzed and summarized, and the common datasets and performance evaluation indexes used in ISS experiments are introduced. Finally, possible research directions and trends are given and analyzed.
Abstract: The rapid development of the Internet leads to an increase in system complexity and uncertainty. Traditional network management cannot meet these requirements and should evolve toward fusion-based Cyberspace Situational Awareness (CSA). Based on an analysis of functional shortcomings and development requirements, this paper introduces CSA, including its origin, concept, objectives, and characteristics. First, a CSA research framework is proposed and the research history is reviewed, based on which the main aspects and existing issues of the research are analyzed. Assessment methods are divided into three categories: mathematical models, knowledge reasoning, and pattern recognition. The paper then discusses CSA from three aspects, namely models, knowledge representation, and assessment methods, and details the main ideas, assessment processes, merits, and shortcomings of recent methods, comparing many typical ones. Current applied research on CSA in the fields of security, transmission, survivability, system evaluation, and so on is presented. Finally, the paper points out the development directions of CSA and draws conclusions regarding the issue system, technical system, and application system.
Abstract: Blockchain is a distributed public ledger technology that originated with the digital cryptocurrency bitcoin. Its development has attracted wide attention in both industry and academia. Blockchain has the advantages of decentralization, trustworthiness, anonymity, and immutability; it breaks through the limitations of traditional center-based technology and has broad development prospects. This paper introduces the research progress of blockchain technology and its applications in the field of information security. First, the basic theory and model of blockchain are introduced from five aspects: basic framework, key technologies, technical features, application modes, and application areas. Second, from the perspective of the current research on blockchain in information security, the paper summarizes the research progress of blockchain in authentication, access control, and data protection technologies, and compares the characteristics of the various studies. Finally, the application challenges of blockchain technology are analyzed, and the outlook for blockchain in the field of information security is highlighted. This study intends to provide a useful reference for future research.
Abstract: The emergence of numerous intelligent devices equipped with short-range wireless communication has driven the rapid rise of wireless ad hoc network applications. However, in many realistic application environments, nodes form a disconnected network for most of the time due to node mobility, low density, lossy links, etc. The conventional communication model of mobile ad hoc networks (MANETs) requires at least one path to exist from the source to the destination node, which results in communication failure in these scenarios. Opportunistic networks utilize the communication opportunities arising from node movement to forward messages hop by hop, and implement communication between nodes based on the "store-carry-forward" routing pattern. This networking approach, totally different from the traditional communication model, has attracted great interest from researchers. This paper first introduces the concepts and theories of opportunistic networks and some current typical applications. It then elaborates on the popular research problems, including opportunistic forwarding mechanisms, mobility models, and opportunistic data dissemination and retrieval. Other interesting research points, such as communication middleware, cooperation and security problems, and new applications, are stated briefly. Finally, the paper concludes and looks forward to possible research focuses for opportunistic networks in the future.
Abstract: This paper offers some reflections from the following four aspects: 1) from the law of the development of things, revealing the development history of software engineering technology; 2) from the perspective of the natural characteristics of software, analyzing the construction of each abstraction layer of the virtual machine; 3) from the perspective of software development, proposing the research content of the software engineering discipline and studying the pattern of industrialized software production; 4) based on the emergence of Internet technology, exploring the development trend of software technology.
Abstract: Batch computing and stream computing are two important forms of big data computing. Research and discussion on batch computing in the big data environment are comparatively sufficient, but how to efficiently handle stream computing to meet requirements such as low latency, high throughput, and continuously reliable operation, and how to build efficient stream big data computing systems, remain great challenges in big data computing research. This paper surveys the computing architecture and the key issues of stream computing in big data environments. Firstly, it gives a brief summary of three application scenarios of stream computing in business intelligence, marketing, and public service, and describes the distinctive features of stream computing in the big data environment, such as real-time processing, volatility, burstiness, irregularity, and infinity. A well-designed stream computing system is always optimized in terms of system structure, data transmission, application interfaces, high availability, and so on. Subsequently, the paper offers detailed analyses and comparisons of five typical open-source stream computing systems in the big data environment. Finally, it specifically addresses some new challenges of stream big data systems, such as scalability, fault tolerance, consistency, load balancing, and throughput.
Abstract: In recent years, there have been extensive studies and rapid progress in automatic text categorization, which is one of the hotspots and key techniques in the information retrieval and data mining field. Highlighting the state-of-the-art challenging issues and research trends for content information processing of the Internet and other complex applications, this paper presents a survey of the up-to-date development in text categorization based on machine learning, including models, algorithms, and evaluation. It points out that problems such as nonlinearity, skewed data distribution, labeling bottleneck, hierarchical categorization, scalability of algorithms, and categorization of Web pages are the key problems in the study of text categorization; possible solutions to these problems are also discussed respectively. Finally, some future directions of research are given.
Abstract: Many application-oriented NoSQL database systems have been developed to satisfy the new requirements of big data management. This paper surveys research on typical NoSQL databases based on the key-value data model. First, the characteristics of big data and the key technical issues in supporting big data management are introduced. Then, frontier efforts and research challenges are presented, including system architecture, data model, access mode, indexing, transactions, system elasticity, load balancing, replica strategy, data consistency, flash cache, MapReduce-based data processing, and new-generation data management systems. Finally, research prospects are given.
Abstract: With the proliferation of Chinese social networks (especially the rise of Weibo), the productivity and lifestyle of the country's society are more and more profoundly influenced by Chinese Internet public events. Due to the lack of effective technical means, the efficiency of information processing is limited. This paper proposes a method for calculating the information entropy of public events. First, a mathematical model of event information content is built. Then, the multidimensional random variable information entropy of public events is calculated based on Shannon information theory. Furthermore, a new technical index for the quantitative analysis of Internet public events is put forward, laying a foundation for further research work.
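As a rough illustration of the entropy-based index described in the abstract above (not the authors' exact formulation), the following Python sketch computes the joint Shannon entropy of a multidimensional discrete random variable from observed event records; the attribute dimensions (topic, sentiment, region) and the sample data are hypothetical.

```python
from collections import Counter
from math import log2

def joint_entropy(records):
    """Shannon entropy H(X) = -sum p(x) log2 p(x) over joint outcomes.

    Each record is a tuple of discrete attribute values describing one
    observation of a public event (hypothetical dimensions here:
    topic, sentiment, region).
    """
    counts = Counter(records)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Hypothetical observations of one Internet public event
records = [
    ("policy", "negative", "north"),
    ("policy", "negative", "south"),
    ("policy", "neutral",  "north"),
    ("sports", "positive", "south"),
]
print(f"Joint entropy: {joint_entropy(records):.3f} bits")
```

Higher joint entropy would indicate a more dispersed, information-rich event; how the authors weight or combine the dimensions into their final quantitative index is not specified in the abstract.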
Abstract: Source code bug (vulnerability) detection is the process of judging whether there are unexpected behaviors in program code. It is widely used in software engineering tasks such as software testing and software maintenance, and plays a vital role in ensuring software functionality and application security. Traditional vulnerability detection research is based on program analysis, which usually requires strong domain knowledge and complex calculation rules and faces the problem of state explosion; as a result, detection performance is limited, and there is considerable room for improvement in false positive and false negative rates. In recent years, the vigorous development of the open-source community has accumulated massive amounts of data centered on open-source code. In this context, the feature learning capabilities of deep learning can automatically learn semantically rich code representations, thereby providing a new way for vulnerability detection. This study collects the latest high-quality papers in this field and systematically summarizes and explains current methods from two aspects: vulnerability code datasets and deep learning vulnerability detection models. Finally, it summarizes the main challenges faced by research in this field and looks forward to possible future research focuses.
Abstract: The distributed denial of service (DDoS) attack is a major threat to the current network. Based on the attack packet level, this study divides DDoS attacks into network-level DDoS attacks and application-level DDoS attacks. It then analyzes the detection and control methods of these two kinds of DDoS attacks in detail, as well as the drawbacks of different control methods implemented at different network positions. Finally, the study analyzes the shortcomings of current detection and control methods, discusses the development trend of DDoS filtering systems, and outlines the corresponding technological challenges.
Abstract: This paper presents a survey on the theory of provable security and its applications to the design and analysis of security protocols. It clarifies what provable security is, explains some basic notions involved in the theory of provable security, and illustrates the basic idea of the random oracle model. It also reviews the development and advances of provably secure public-key encryption and digital signature schemes, in the random oracle model or the standard model, as well as the applications of provable security to the design and analysis of session-key distribution protocols and their advances.
Abstract: Machine learning has become a core technology in areas such as big data, the Internet of Things, and cloud computing. Training machine learning models requires a large amount of data, which is often collected by means of crowdsourcing and contains a large amount of private data, including personally identifiable information (such as phone numbers and ID numbers) and sensitive information (such as financial and healthcare data). How to protect these data at low cost and with high efficiency is an important issue. This paper first introduces the concept of machine learning, explains the various definitions of privacy in machine learning, and demonstrates the kinds of privacy threats encountered in machine learning. It then elaborates on the working principles and outstanding features of the mainstream technologies for machine learning privacy protection, summarizing the research achievements in this field under differential privacy, homomorphic encryption, and secure multi-party computation, respectively. On this basis, the paper comparatively analyzes the main advantages and disadvantages of different privacy-preserving mechanisms for machine learning. Finally, the development trend of privacy preservation for machine learning is forecast, and possible research directions in this field are proposed.
Abstract: Under new application modes, traditional hierarchical data centers face several limitations in size, bandwidth, scalability, and cost. To meet the needs of new applications, data center networks should fulfill requirements such as high scalability, low configuration overhead, robustness, and energy saving at low cost. First, the shortcomings of traditional data center network architectures are summarized, and the new requirements are pointed out. Secondly, the existing proposals are divided into two categories, i.e., server-centric and network-centric. Then, several representative architectures of these two categories are reviewed and compared in detail. Finally, the future directions of data center networks are discussed.
Abstract: The recommendation system is one of the most important technologies in e-commerce. With the development of e-commerce, the numbers of users and commodities grow rapidly, resulting in extremely sparse user rating data. Traditional similarity measures perform poorly in this situation, causing the quality of recommendation systems to degrade dramatically. To address this issue, a novel collaborative filtering algorithm based on item rating prediction is proposed. This method predicts the ratings of items that users have not rated by using item similarity, and then uses a new similarity measure to find the target user's neighbors. The experimental results show that this method can efficiently alleviate the extreme sparsity of user rating data and provide better recommendation results than traditional collaborative filtering algorithms.
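To make the item rating prediction step concrete, the following Python sketch fills unrated entries of a user-item rating matrix using item-item cosine similarity. This is a common stand-in for the prediction step described in the abstract; the paper's own (new) similarity measure and neighbor selection may differ, and the rating matrix below is hypothetical.

```python
import numpy as np

def item_based_fill(R):
    """Fill unrated entries (0) of a user-item rating matrix R by
    predicting them from the ratings of similar items (cosine
    similarity between item columns)."""
    R = R.astype(float)
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0
    sim = (R.T @ R) / np.outer(norms, norms)   # item-item cosine similarity
    np.fill_diagonal(sim, 0.0)                 # ignore self-similarity
    filled = R.copy()
    for u, i in zip(*np.where(R == 0)):
        rated = np.where(R[u] > 0)[0]          # items user u has rated
        w = sim[i, rated]
        if w.sum() > 0:
            # similarity-weighted average of the user's existing ratings
            filled[u, i] = (w @ R[u, rated]) / w.sum()
    return filled

# Hypothetical 4 users x 5 items rating matrix (0 = unrated)
R = np.array([[5, 3, 0, 1, 0],
              [4, 0, 0, 1, 2],
              [1, 1, 0, 5, 0],
              [0, 1, 5, 4, 0]])
print(np.round(item_based_fill(R), 2))
```

After the sparse matrix is densified in this way, neighbor search over users becomes more reliable, which is the motivation the abstract gives for predicting the missing ratings first.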
Abstract: Deep learning has achieved great success in the field of computer vision, surpassing many traditional methods. However, in recent years, deep learning technology has been abused in the production of fake videos, making fake videos represented by Deepfakes flood the Internet. By tampering with or replacing the face information of original videos and by synthesizing fake speech, this technique is used to produce pornographic movies, fake news, and political rumors. In order to eliminate the negative effects brought about by such forgery technologies, many researchers have conducted in-depth research on the identification of fake videos and proposed a series of detection methods to help institutions and communities identify them. Nevertheless, current detection techniques still have many limitations, such as dependence on specific data distributions and specific compression ratios, and lag far behind the generation techniques for fake videos. In addition, different researchers approach the problem from different angles, and the datasets and evaluation metrics used are not uniform. So far, the academic community still lacks a unified understanding of deep forgery and detection technology, and the research architecture in this area remains unclear. This review surveys the development of deep forgery and detection technologies, and systematically summarizes and scientifically classifies existing research works. Finally, the social risks posed by the spread of Deepfakes technology are discussed, the limitations of detection technology are analyzed, and the challenges and potential research directions of detection technology are discussed, aiming to provide guidance for follow-up researchers to further promote the development and deployment of Deepfakes detection technology.