TU Yao-Feng, CHEN Xiao-Qiang, ZHOU Shi-Jun, BIAN Fu-Sheng, WU Fei, CHEN Bing
2022, 33(3):774-796. DOI: 10.13328/j.cnki.jos.006441
Abstract: New hardware and the environments built around it have changed traditional computing, storage, and network systems, and have also invalidated long-standing design assumptions of upper-level software. In particular, heterogeneous computing architectures composed of general-purpose processors and dedicated accelerators have changed the design of the underlying framework of the database system and the cost model of query optimization. Database systems need to adapt to the characteristics of the new hardware to fully exploit its potential. This study proposes Geno, a cost-based query optimizer for CPU/GPU/FPGA heterogeneous computing, which can flexibly schedule and optimize the use of various computing resources. The main contributions are as follows: (1) it is found that adjusting cost parameters according to the actual hardware capabilities of the system environment can significantly improve the accuracy of query plans, and a calculation method and a calibration tool for the cost of heterogeneous resources are proposed; (2) by estimating the capabilities of heterogeneous hardware such as GPUs and FPGAs and calibrating the actual hardware capabilities of the database system, a cost model for query processing in a heterogeneous computing environment is established; (3) GPU and FPGA operators supporting selection, projection, join, and aggregation are implemented, together with pipelined designs for both GPU and FPGA operators; (4) operator assignment and scheduling are solved through cost-based evaluation, and a heterogeneous collaborative execution plan is generated to co-optimize heterogeneous computing resources and make full use of the advantages of each. Experiments show that the parameter values calibrated by Geno match the actual hardware capabilities better. Compared with PostgreSQL and the GPU database HeteroDB, Geno generates more reasonable query plans. In the TPC-H scenario, with row storage, Geno's execution time is 64%-93% lower than that of PostgreSQL and 1%-39% lower than that of HeteroDB; with column storage, Geno's execution time is 87%-92% lower than that of PostgreSQL and 1%-81% lower than that of HeteroDB. Compared with row storage, Geno reduces query execution time by 32%-89% under column storage.
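As a rough illustration of the calibrated cost model idea in this abstract, the following Python sketch shows how per-device costs measured by a calibration tool could drive the choice of the cheapest execution target for a single operator. All parameter names and cost numbers are invented for illustration and are not taken from Geno.

    # Per-device cost parameters as a calibration tool might record them (illustrative).
    CALIBRATED = {
        "cpu":  {"per_tuple": 0.0010, "startup": 0.0,  "per_mb": 0.0},
        "gpu":  {"per_tuple": 0.0002, "startup": 50.0, "per_mb": 0.01},   # launch + PCIe transfer
        "fpga": {"per_tuple": 0.0004, "startup": 30.0, "per_mb": 0.01},
    }

    def operator_cost(device, n_tuples, input_mb):
        p = CALIBRATED[device]
        return p["startup"] + p["per_tuple"] * n_tuples + p["per_mb"] * input_mb

    def cheapest_device(n_tuples, input_mb):
        return min(CALIBRATED, key=lambda d: operator_cost(d, n_tuples, input_mb))

    # Small inputs stay on the CPU (startup and transfer dominate); large scans move to the GPU.
    print(cheapest_device(10_000, 1))          # cpu
    print(cheapest_device(50_000_000, 4_000))  # gpu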
QIAO Shao-Jie, YANG Guo-Ping, HAN Nan, QU Lu-Lu, CHEN Hao, MAO Rui, YUAN Chang-An, Louis Alberto GUTIERREZ
2022, 33(3):797-813. DOI: 10.13328/j.cnki.jos.006448
Abstract: Cardinality and cost estimation guide the selection of execution plans, and the accuracy of cardinality estimation is critical to the query optimizer. However, the cost and cardinality estimation techniques in traditional databases cannot provide accurate estimates because they do not consider correlations across multiple tables. Recently, the application of artificial intelligence technology to databases (artificial intelligence for databases, AI4DB) has attracted wide attention, and the results show that learning-based estimation methods are superior to traditional ones. However, the existing learning-based methods have several drawbacks. First, most of them can only estimate the cardinality and ignore cost estimation. Second, they can only handle simple query statements and do not work for complex queries, such as multi-table and nested queries; it is also difficult for them to deal with string values. To solve these problems, a novel method based on Tree-GRU (tree-gated recurrent unit) is proposed that estimates both the cardinality and the cost. In addition, an effective feature extraction and encoding technique is applied that considers both the query and the execution plan, and the extracted features are embedded into the Tree-GRU. For string values, a neural network is used to automatically capture the relationship between substrings and whole strings and to embed the strings, so that sparse strings can be handled easily by the estimator. Extensive experiments were conducted on the JOB and Synthetic datasets, and the results show that the proposed model outperforms the state-of-the-art algorithms in all aspects.
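The following PyTorch sketch illustrates the general shape of a tree-structured GRU estimator of the kind described above; the aggregation of child states, the feature dimensions, and the two prediction heads are assumptions made for illustration, not the paper's actual architecture.

    import torch
    import torch.nn as nn

    class PlanNode:
        def __init__(self, feature, children=()):
            self.feature = feature          # encoded operator and predicate features
            self.children = list(children)

    class TreeGRUEstimator(nn.Module):
        # Each plan node's hidden state comes from a GRUCell fed with the node's
        # features and an aggregate of its children's states; the root state feeds
        # two heads predicting cardinality and cost (e.g., in log space).
        def __init__(self, feat_dim, hidden_dim):
            super().__init__()
            self.cell = nn.GRUCell(feat_dim, hidden_dim)
            self.card_head = nn.Linear(hidden_dim, 1)
            self.cost_head = nn.Linear(hidden_dim, 1)
            self.hidden_dim = hidden_dim

        def encode(self, node):
            if node.children:
                child_states = torch.stack([self.encode(c) for c in node.children])
                h = child_states.mean(dim=0)          # child aggregation (assumption)
            else:
                h = torch.zeros(self.hidden_dim)
            return self.cell(node.feature.unsqueeze(0), h.unsqueeze(0)).squeeze(0)

        def forward(self, root):
            h = self.encode(root)
            return self.card_head(h), self.cost_head(h)

    # Usage: a two-node plan (a scan under a join-like parent) with random features.
    feat_dim, hidden = 8, 16
    scan = PlanNode(torch.randn(feat_dim))
    root = PlanNode(torch.randn(feat_dim), children=[scan])
    log_card, log_cost = TreeGRUEstimator(feat_dim, hidden)(root)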
YU Xiang, CHAI Cheng-Liang, ZHANG Xin-Ning, TANG Nan, SUN Ji, LI Guo-Liang
2022, 33(3):814-831. DOI: 10.13328/j.cnki.jos.006452
Abstract: Learned database query optimizers, which are typically powered by (deep) learning models, have attracted significant attention recently, because they can offer performance similar to or even better than that of state-of-the-art commercial optimizers that require hundreds of expert-hours to tune. A crucial factor in successfully training learned optimizers is the training queries. Unfortunately, a query workload sufficient for training learned optimizers is not always available. This study proposes AlphaQO, a framework that generates queries for learned optimizers with reinforcement learning (RL). AlphaQO is a loop system consisting of two main components, a query generator and a learned optimizer. The query generator aims at generating “hard” queries, i.e., queries for which the learned optimizer provides poor estimates. The learned optimizer is trained on the generated queries and provides feedback, in terms of numerical rewards, to the query generator: if the generated queries are good (i.e., hard for the optimizer), the query generator receives a high reward; otherwise, it receives a low reward. This process is performed iteratively, with the goal that, within a small budget, the learned optimizer can be trained to generalize well to a wide range of unseen queries. Extensive experiments show that AlphaQO can generate a relatively small number of queries and train a learned optimizer that outperforms commercial optimizers. Moreover, learned optimizers need far fewer queries from AlphaQO than randomly generated queries in order to be trained well.
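The generate-train feedback loop described above can be pictured with the following toy Python sketch; the generator, the optimizer, and the reward shaping are stand-ins invented for illustration and are not AlphaQO's actual components.

    import random

    class RandomQueryGenerator:
        # Toy stand-in for the RL query generator (illustrative assumption).
        def __init__(self, tables):
            self.tables = tables
            self.weights = {t: 1.0 for t in tables}   # preference learned from rewards
        def sample(self):
            return tuple(random.choices(self.tables, weights=self.weights.values(), k=2))
        def update(self, query, reward):
            for t in query:
                self.weights[t] += reward             # hard queries reinforce their shape

    class ToyLearnedOptimizer:
        # Toy stand-in: the 'error' shrinks for query shapes seen during training.
        def __init__(self):
            self.seen = set()
        def estimation_error(self, query):
            return 0.1 if query in self.seen else 1.0
        def train_on(self, queries):
            self.seen.update(queries)

    def alphaqo_style_loop(generator, optimizer, rounds=5, batch=8):
        # Generate queries, reward the generator by how poorly the optimizer
        # estimates them, and train the optimizer on the same batch.
        for _ in range(rounds):
            queries = [generator.sample() for _ in range(batch)]
            for q in queries:
                generator.update(q, reward=optimizer.estimation_error(q))
            optimizer.train_on(queries)

    gen, opt = RandomQueryGenerator(["a", "b", "c", "d"]), ToyLearnedOptimizer()
    alphaqo_style_loop(gen, opt)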
LIU Rui-Cheng, ZHANG Jun-Chen, LUO Yong-Ping, JIN Pei-Quan
2022, 33(3):832-848. DOI: 10.13328/j.cnki.jos.006456
Abstract: Non-volatile memory (NVM) brings new opportunities for data storage and management, but it also requires existing indices to be revisited according to the properties of NVM. Based on the access characteristics of NVM, this study focuses on optimizing the access, persistence, range-query, and other operations of tree indices on NVM. A two-layer heterogeneous index called HART is presented. HART combines the high range-query performance of the B+-tree with the fast node-search speed of the Radix tree. The index structure is redesigned and the path-compression strategy of the Radix tree is improved. Moreover, a write-efficient node is proposed for NVM, and the Radix leaf nodes are stored with links connecting them. Experiments are conducted on both simulated NVM and real Intel Optane persistent memory modules, and different variants of HART are compared with several existing NVM indices. The results show that HART achieves better point-query and write performance than existing B+-tree-like indices, and it outperforms the existing Radix-tree-based WOART index in range-query performance. As a result, HART delivers high overall performance.
WEI Xing-Da, LU Fang-Ming, CHEN Rong, CHEN Hai-Bo, ZANG Bin-Yu
2022, 33(3):849-866. DOI: 10.13328/j.cnki.jos.006444
Abstract: The emergence of hardware transactional memory (HTM) has greatly boosted transaction processing throughput in in-memory databases. However, the group commit used in in-memory databases, which aims at reducing the impact of slow persistent storage, leads to high transaction commit latency. Non-volatile memory (NVM) opens opportunities for reducing transaction commit latency. However, HTM cannot work with NVM directly: flushing data to NVM always causes HTM transactions to abort. This study proposes a technique called Parity Version to decouple HTM execution from NVM writes, so that transactions can correctly and efficiently use NVM to reduce their commit latency under HTM. The technique has been integrated into DBX, a state-of-the-art HTM-based database, yielding DBXN, a low-latency and high-throughput in-memory transaction processing system. Evaluations using typical OLTP workloads, including TPC-C, show that DBXN achieves 99% lower latency and 2.1 times higher throughput than DBX.
ZHAO Hong-Yao, ZHAO Zhan-Hao, YANG Wan-Qing, LU Wei, LI Hai-Xiang, DU Xiao-Yong
2022, 33(3):867-890. DOI: 10.13328/j.cnki.jos.006454
Abstract: The concurrency control algorithm is a key component for guaranteeing the correctness and efficiency of transaction execution. Thus far, substantial effort has been devoted to proposing new concurrency control algorithms in both the database industry and academia. This study abstracts a common paradigm from the state of the art and summarizes the core idea of concurrency control algorithms as “ordering-and-verifying”. The existing concurrency control algorithms are then re-presented following the ordering-and-verifying paradigm. Based on extensive experiments on 3TS, an open-source memory-based distributed transaction testbed, the advantages and disadvantages of the mainstream state-of-the-art concurrency control algorithms are systematically demonstrated. Finally, the preferable application scenario of each algorithm is summarized, and some valuable references are provided for follow-up research on concurrency control algorithms for in-memory databases.
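The ordering-and-verifying idea can be illustrated with the following minimal, single-threaded Python sketch (an OCC-style toy, not code from 3TS or the paper): a committing transaction is first ordered by taking a commit sequence number and then verified against the versions it read.

    import itertools

    class OrderingVerifyingStore:
        # Every item carries the commit order number that last wrote it.
        def __init__(self):
            self.data = {}                 # key -> (value, version)
            self.commit_seq = itertools.count(1)

        def begin(self):
            return {"reads": {}, "writes": {}}

        def read(self, txn, key):
            value, version = self.data.get(key, (None, 0))
            txn["reads"][key] = version    # remember the version observed
            return value

        def write(self, txn, key, value):
            txn["writes"][key] = value     # buffer writes until commit

        def commit(self, txn):
            order = next(self.commit_seq)                       # ordering step
            for key, seen_version in txn["reads"].items():      # verifying step
                if self.data.get(key, (None, 0))[1] != seen_version:
                    return False                                # conflict: abort
            for key, value in txn["writes"].items():
                self.data[key] = (value, order)
            return True

    # Two interleaved transactions on the same key: the second commit fails verification.
    store = OrderingVerifyingStore()
    t1, t2 = store.begin(), store.begin()
    store.read(t1, "x"); store.read(t2, "x")
    store.write(t1, "x", 1); store.write(t2, "x", 2)
    print(store.commit(t1), store.commit(t2))   # True False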
TU Yao-Feng, CHEN He-Dui, WANG Han-Yi, YAN Zong-Shuai, KONG Lu, CHEN Bing
2022, 33(3):891-908. DOI: 10.13328/j.cnki.jos.006443
Abstract: Persistent memory (PM) is non-volatile, byte-addressable, low-latency, and large-capacity, which breaks the boundary between traditional internal and external memory and has a disruptive impact on existing software architectures. However, current PM hardware still suffers from problems such as uneven wear and read/write asymmetry; in particular, serious I/O performance degradation occurs when the CPU accesses PM across NUMA (non-uniform memory access) nodes. A NUMA-aware PM storage engine optimization is proposed and applied to Zhongxing's new-generation database system GoldenX, significantly reducing the overhead of accessing persistent memory across NUMA nodes. The main innovations include: (1) a data-space distribution strategy and a distributed access model across NUMA nodes under a DRAM+PM hybrid memory architecture, which enable efficient use of the PM data space; (2) to address the high latency of accessing PM across NUMA nodes, an access method based on I/O proxy routines, which converts the cost of a cross-NUMA PM access into the cost of a remote DRAM memory copy plus a local PM access, together with a Cache Line Area cache page mechanism that alleviates I/O write amplification and improves the efficiency of local PM access; (3) an extension of the traditional tablespace concept so that each tablespace has both independent table-data storage and dedicated WAL (write-ahead logging) storage, and, for this distributed WAL storage architecture, a transaction processing mechanism based on global sequence numbers that addresses the single-point WAL performance bottleneck; and (4) NUMA-aware optimization mechanisms and algorithms for transaction processing, checkpointing, and disaster recovery. Experimental results show that the proposed methods effectively improve the performance of the PM storage engine under the NUMA architecture, by 105%-317% in various YCSB test scenarios and by 90%-134% in TPC-C.
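The I/O proxy idea can be sketched as follows in Python (a toy illustration with no real NUMA binding or persistent memory): one worker per NUMA node performs all PM writes for that node, so a thread on another node pays only a DRAM-style handoff instead of a remote PM write.

    import queue
    import threading

    class NumaIOProxy:
        def __init__(self, num_nodes):
            self.queues = [queue.Queue() for _ in range(num_nodes)]
            for q in self.queues:
                # A real engine would pin each worker to its NUMA node.
                threading.Thread(target=self._run, args=(q,), daemon=True).start()

        def _run(self, q):
            while True:
                buf, done = q.get()
                # Here the real engine would write buf to node-local PM and flush it.
                done.set()

        def write(self, home_node, buf):
            done = threading.Event()
            self.queues[home_node].put((buf, done))   # hand off, like a remote DRAM copy
            done.wait()                               # wait for the local PM write to finish

    proxy = NumaIOProxy(num_nodes=2)
    proxy.write(home_node=1, buf=b"page-0")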
LI Hai-Xiang, LI Xiao-Yan, LIU Chang, DU Xiao-Yong, LU Wei, PAN An-Qun
2022, 33(3):909-930. DOI: 10.13328/j.cnki.jos.006442
Abstract: Data anomalies, which refer to specific data operation patterns that may destroy the consistency of a database, have no unified definition. Known data anomalies include Dirty Write, Dirty Read, Non-repeatable Read, Phantom, Read Skew, Write Skew, etc. To improve the efficiency of concurrency control algorithms, data anomalies are also used to define isolation levels, because weak isolation levels can improve the efficiency of transaction processing systems. This work systematically studies data anomalies and the corresponding isolation levels. Twenty-two new data anomalies that have not been reported by other research are identified, and all data anomalies are classified. Based on this classification, two new isolation levels with different granularity are proposed, which reveal the rule of defining isolation levels based on data anomalies and make the understanding of data anomalies and isolation levels more concise.
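As a concrete illustration of one anomaly in this catalog, the following toy Python snippet (not from the paper) shows Write Skew: two transactions read overlapping data under a snapshot, write disjoint items, and together violate an invariant that either transaction alone would have preserved.

    db = {"x": 1, "y": 1}              # invariant to preserve: x + y >= 1

    snapshot_t1 = dict(db)             # T1 reads both x and y from its snapshot
    snapshot_t2 = dict(db)             # T2 reads both x and y from its snapshot

    if snapshot_t1["x"] + snapshot_t1["y"] >= 2:
        db["x"] = 0                    # T1 writes only x
    if snapshot_t2["x"] + snapshot_t2["y"] >= 2:
        db["y"] = 0                    # T2 writes only y (disjoint write set)

    # Either transaction alone keeps x + y >= 1; this interleaving does not.
    print(db, "invariant holds:", db["x"] + db["y"] >= 1)   # {'x': 0, 'y': 0} False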
Lü Wei-Feng, ZHENG Zhi-Ming, TONG Yong-Xin, ZHANG Rui-Sheng, WEI Shu-Yue, LI Wei-Hua
2022, 33(3):931-949. DOI: 10.13328/j.cnki.jos.006455
Abstract: In recent years, promoting the synergy and intelligence of social governance and improving a social governance system of co-construction, co-governance, and sharing have become important national development directions. As a production factor, data plays an increasingly critical role in social governance. How to realize secure querying, collaborative management, and intelligent analysis of massive multi-party data is the key to improving the effectiveness of social governance. In major public events such as the prevention and control of COVID-19, distributed social governance faces low computing efficiency, poor trustworthy multi-party coordination, and difficult decision-making for complex tasks. In response to these challenges, this study proposes a big-data-based distributed intelligent social governance system built on secure multi-party computation, blockchain technology, and precise intelligence theory. The proposed system can support various social governance applications and provide decision-making support for the improvement of social governance in the new era.
YU Zi-Sheng, LI Rui-Yuan, GUO Yang, JIANG Zhong-Yuan, BAO Jie, ZHENG Yu
2022, 33(3):950-967. DOI: 10.13328/j.cnki.jos.006445
Abstract: Time series similarity search is one of the most basic operations in temporal data analysis and has a wide range of application scenarios. Existing distributed methods face the problems of dimension explosion, an overly large scan range, and time-consuming similarity computation. To this end, this study proposes a distributed time series similarity search algorithm, KV-Search. First, time series are segmented into blocks and stored in a key-value database, which addresses the problem of high and growing dimensionality. Second, a lower bound is calculated based on the Chebyshev distance, and invalid data is filtered out in advance using key-range scanning, which reduces data transmission and computation overhead. Third, a block-based time series representation is used to compute the distance lower bound, which avoids computing over higher-dimensional raw data. KV-Search is implemented on HBase, and a set of extensive experiments is conducted using both real and synthetic time series data. The results show that KV-Search is superior to the baseline methods in efficiency and scalability.
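The block-summary filtering idea can be illustrated with the following Python sketch, which is a plausible reconstruction rather than the paper's exact bound: per-block (min, max) summaries yield a lower bound on the Chebyshev distance, so candidates whose bound already exceeds the threshold can be skipped before any raw data is read.

    def block_summaries(series, block_len):
        # Summarize a series as per-block (min, max) pairs; in key-value storage
        # these small summaries, not the raw points, are scanned first.
        return [(min(series[i:i + block_len]), max(series[i:i + block_len]))
                for i in range(0, len(series), block_len)]

    def chebyshev_lower_bound(sum_q, sum_c):
        # Within a block, every query point lies in [q_min, q_max] and every candidate
        # point in [c_min, c_max], so any aligned pair differs by at least the gap
        # between the two intervals; the largest per-block gap bounds the L-infinity distance.
        lb = 0.0
        for (q_min, q_max), (c_min, c_max) in zip(sum_q, sum_c):
            gap = max(c_min - q_max, q_min - c_max, 0.0)
            lb = max(lb, gap)
        return lb

    def chebyshev(a, b):
        return max(abs(x - y) for x, y in zip(a, b))

    q = [0.0, 0.1, 0.2, 0.1, 0.0, 0.1]
    c = [5.0, 5.2, 5.1, 4.9, 5.0, 5.3]
    lb = chebyshev_lower_bound(block_summaries(q, 3), block_summaries(c, 3))
    assert lb <= chebyshev(q, c)   # candidates with lb > threshold can be pruned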
LI Pan-Pan, SONG Shao-Xu, WANG Jian-Min
2022, 33(3):968-984. DOI: 10.13328/j.cnki.jos.006453
Abstract: With the integration of informatization and industrialization, the Internet of Things and the industrial Internet have flourished, producing a large amount of industrial big data represented by time series. Time series contain many valuable patterns, among which symmetric patterns are widespread in various kinds of time series. Mining symmetric patterns has important research value in fields such as behavior analysis, trajectory tracking, and anomaly detection. However, the data volume of time series is often as high as tens or even hundreds of gigabytes. Mining symmetric patterns with a direct nested-query algorithm can take months or even years, and typical acceleration techniques such as indexing, lower bounds, and triangle inequalities yield speedups of at most one or two orders of magnitude. Therefore, inspired by the dynamic time warping algorithm, this study proposes a method that mines all symmetric patterns of a time series in O(w×|T|) time. Specifically, given a length constraint on symmetric patterns, the symmetric subsequences are computed by interval dynamic programming, and then the largest number of non-overlapping symmetric patterns is selected with a greedy strategy. In addition, an algorithm for mining symmetric patterns over time series data streams is studied, which dynamically adjusts the window size according to the characteristics of the data in the window to ensure the integrity of symmetric patterns. Experiments on one synthetic dataset and three real datasets under different data volumes show that, compared with other symmetric pattern mining methods, the proposed method performs better in terms of both mining results and time overhead.
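The interval dynamic programming idea can be sketched as follows in Python; the recurrence, threshold, and normalization are an illustrative DTW-inspired reconstruction, not the paper's exact formulation, but they show how a length limit w keeps the work at O(w×|T|).

    def symmetric_intervals(T, w, eps):
        # cost[i][j] measures, DTW-style, how well T[i..j] matches its own reverse;
        # only intervals no longer than w are filled, giving O(w * |T|) work.
        n = len(T)
        cost = [[0.0] * n for _ in range(n)]
        found = []
        for length in range(2, min(w, n) + 1):
            for i in range(0, n - length + 1):
                j = i + length - 1
                inner = cost[i + 1][j - 1] if j - i >= 2 else 0.0
                cost[i][j] = abs(T[i] - T[j]) + min(inner, cost[i + 1][j], cost[i][j - 1])
                if cost[i][j] / length <= eps:        # per-point cost under threshold
                    found.append((i, j))
        return found

    def greedy_non_overlapping(intervals):
        # Maximize the number of chosen patterns: classic greedy by right end point.
        chosen, last_end = [], -1
        for i, j in sorted(intervals, key=lambda p: p[1]):
            if i > last_end:
                chosen.append((i, j))
                last_end = j
        return chosen

    T = [1, 2, 3, 2, 1, 0, 5, 6, 6, 5]
    print(greedy_non_overlapping(symmetric_intervals(T, w=6, eps=0.1)))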
YUE Xiao-Fei, SHI Lan, ZHAO Yu-Hai, JI Hang-Xu, WANG Guo-Ren
2022, 33(3):985-1004. DOI: 10.13328/j.cnki.jos.006447
Abstract: Apache Flink, an emerging distributed computing framework, supports the execution of large-scale iterative programs on clusters, but its default static resource allocation mechanism cannot allocate resources appropriately so that iterative jobs complete on time. In response to this problem, the position taken here is that users should be relied on to actively express performance constraints rather than passively reserve resources, and RABORP, a dynamic resource allocation strategy based on runtime prediction, is proposed to formulate and implement dynamic resource allocation plans for Flink iterative jobs with explicit runtime limits. The main idea is to predict the runtime of each iteration superstep, and then to perform initial resource allocation at job submission and dynamic adjustment at the synchronization barriers between supersteps according to the predicted results, so that the iterative job completes within the user-specified runtime limit using the minimum set of resources. Comparative experiments were carried out by executing a variety of typical Flink iterative jobs. The results show that the established runtime prediction model can accurately predict the runtime of each superstep, and that, compared with current state-of-the-art algorithms, the proposed dynamic resource allocation strategy improves various performance indicators in both single-job and multi-job scenarios.
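The predict-then-allocate step can be pictured with the following Python sketch; the near-linear scaling assumption and the function names are illustrative and do not reflect RABORP's actual prediction model.

    def plan_resources(predicted_superstep_runtimes, deadline, base_slots, max_slots=256):
        # Smallest slot count whose (assumed near-linear) speedup over the baseline
        # parallelism still finishes the remaining supersteps within the deadline.
        base_total = sum(predicted_superstep_runtimes)
        for slots in range(base_slots, max_slots + 1):
            if base_total * base_slots / slots <= deadline:
                return slots
        return None   # infeasible within the resource cap

    # Three remaining supersteps predicted at 40 s each with 4 slots, 60 s left: -> 8
    print(plan_resources([40, 40, 40], deadline=60, base_slots=4))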
LIU Rui-Qi, LI Bo-Yang, GAO Yu-Jin, LI Chang-Sheng, ZHAO Heng-Tai, JIN Fu-Sheng, LI Rong-Hua, WANG Guo-Ren
2022, 33(3):1005-1017. DOI: 10.13328/j.cnki.jos.006451
Abstract: With the rapid development of big data and machine learning, distributed big data computing engines for machine learning have emerged. These systems can support both batch distributed learning and incremental learning and verification, with low latency and high performance. However, some of them adopt a random task scheduling strategy that ignores the performance differences among nodes, which easily leads to load imbalance and performance degradation. At the same time, for some tasks, scheduling fails if their resource requirements are not met. In response to these problems, a heterogeneous task scheduling framework is proposed to ensure the efficient execution of tasks. Specifically, for the task scheduling module, the framework provides a probabilistic random scheduling strategy, resource-Pick_kx, and a deterministic smooth weighted round-robin algorithm, both built around the heterogeneous computing resources of the nodes. The resource-Pick_kx algorithm computes a probability from each node's performance and schedules tasks randomly according to these probabilities, so that higher-performance nodes are more likely to receive tasks. The smooth weighted round-robin algorithm sets weights according to node performance at the start and smooths them during scheduling, so that tasks are preferentially dispatched to the highest-performance nodes. In addition, for scenarios where resources do not meet task requirements, a container-based vertical scaling mechanism is proposed to customize task resources, create nodes that join the cluster, and re-schedule the task. The performance of the framework is tested experimentally on benchmarks and public datasets. Compared with the current strategy, the performance of the proposed framework is improved by 10% to 20%.
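Both scheduling strategies named above have simple cores, sketched below in Python; resource-Pick_kx is approximated as a performance-proportional random choice (an assumption based on the description), while the second strategy follows the standard smooth weighted round-robin algorithm.

    import random

    def pick_kx(nodes, perf):
        # Probabilistic scheduling: choose a node with probability proportional to
        # its measured performance, so faster nodes receive more tasks.
        return random.choices(nodes, weights=[perf[n] for n in nodes], k=1)[0]

    class SmoothWeightedRR:
        # Standard smooth weighted round-robin: higher-weight nodes are chosen more
        # often, and the choices are spread out over time rather than bursty.
        def __init__(self, weights):
            self.weights = dict(weights)               # node -> static weight
            self.current = {n: 0 for n in weights}     # node -> smoothed counter
        def next(self):
            total = sum(self.weights.values())
            for n, w in self.weights.items():
                self.current[n] += w
            chosen = max(self.current, key=self.current.get)
            self.current[chosen] -= total
            return chosen

    perf = {"fast": 5, "mid": 3, "slow": 1}
    print(pick_kx(list(perf), perf))
    rr = SmoothWeightedRR(perf)
    print([rr.next() for _ in range(9)])   # 'fast' appears 5 times, 'mid' 3, 'slow' once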
CUI Peng-Jie, YUAN Ye, LI Cen-Hao, ZHANG Can, WANG Guo-Ren
2022, 33(3):1018-1042. DOI: 10.13328/j.cnki.jos.006449
Abstract: A graph is an important data structure that describes the relationships between entities, and it is widely used in information science, physics, biology, environmental ecology, and other scientific fields. Nowadays, with the growing scale of graph data, processing large-scale graphs on distributed systems has become popular, and many specialized distributed systems, including Pregel, GraphX, PowerGraph, and Gemini, have been proposed. However, compared with current state-of-the-art shared-memory graph processing systems, these specialized distributed graph processing systems do not deliver satisfactory or stable performance advantages on real-world graph datasets. Several representative distributed graph processing systems are analyzed, and the major challenges that affect their performance are summarized. This study proposes RGraph, an efficient RDMA-based distributed graph processing system. The key idea of RGraph is to improve performance by making full use of the advantages of RDMA. For graph partitioning, RGraph adopts chunk-based partitioning to avoid destroying the native locality of real-world graphs, so as to preserve locality in vertex accesses. For load balancing, RGraph proposes a task migration mechanism based on RDMA one-sided READ and a fine-grained task preemption method among threads to ensure dynamic load balance both across and within nodes, so that all computing resources can be fully utilized. For communication, RGraph effectively encapsulates IB verbs and implements a concurrent RDMA communication stack that matches graph computation semantics. Compared with traditional MPI, RGraph's communication stack reduces inter-server communication latency by up to 2.1 times. Finally, five real-world large-scale graph datasets and one synthetic dataset are used to evaluate RGraph on an HPC cluster with eight servers. The experiments show that RGraph has obvious performance advantages: compared with PowerGraph, RGraph achieves a 10.1-16.8 times performance improvement, and compared with existing state-of-the-art CPU-based distributed graph processing systems, RGraph still achieves a 2.89-5.12 times improvement. Meanwhile, RGraph maintains a stable performance advantage on extremely skewed power-law graphs.
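Chunk-based partitioning as described here simply assigns contiguous vertex ranges to servers; the following small Python sketch (which balances by vertex count only, a simplification) shows the idea.

    def chunk_partition(num_vertices, num_servers):
        # Each server owns one contiguous vertex range, preserving the natural
        # ordering (and hence locality) of real-world graph datasets.
        chunk = (num_vertices + num_servers - 1) // num_servers
        return [(s * chunk, min((s + 1) * chunk, num_vertices))
                for s in range(num_servers)]

    print(chunk_partition(10, 3))   # [(0, 4), (4, 8), (8, 10)]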
ZHOU Xu, WENG Tong-Feng, YANG Zhi-Bang, LI Bo-Ren, ZHANG Ji, LI Ken-Li
2022, 33(3):1043-1056. DOI: 10.13328/j.cnki.jos.006457
Abstract: Tip decomposition plays a pivotal role in mining cohesive subgraphs in bipartite graphs. It is a popular research topic with wide applications in document clustering, spam group detection, and the analysis of affiliation networks. With the explosive growth of bipartite graph data in these scenarios, distributed methods are necessary for effective storage. For this reason, this work studies the problem of tip decomposition on bipartite graphs in a distributed environment for the first time. First, a new relay-based communication mode is proposed to realize effective message transmission when the given bipartite graph is decomposed in a distributed environment. Second, a distributed butterfly counting algorithm (DBC) and a distributed tip decomposition algorithm (DTD) are designed; in particular, a controllable parallel vertex activation strategy is proposed to avoid memory overflow when DBC processes large-scale bipartite graphs. Finally, a vertex-priority-based message pruning strategy and a message-validity pruning strategy are introduced to further improve efficiency by reducing redundant communication and computation overhead. The experiments are deployed on the high-performance distributed computing platform of the National Supercomputing Center, and the effectiveness and efficiency of the proposed algorithms are verified on several real datasets.
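The objects being computed can be illustrated with a small sequential Python sketch (not the distributed DBC/DTD algorithms themselves): per-vertex butterfly counts on one side of a bipartite graph and a peeling loop that yields tip numbers.

    from itertools import combinations

    def butterflies_per_vertex(adj_u):
        # adj_u: left vertex -> set of right neighbours.
        # Butterflies shared by a pair (u, v) = C(|N(u) & N(v)|, 2); credit both vertices.
        counts = {u: 0 for u in adj_u}
        for u, v in combinations(adj_u, 2):
            common = len(adj_u[u] & adj_u[v])
            b = common * (common - 1) // 2
            counts[u] += b
            counts[v] += b
        return counts

    def tip_decomposition(adj_u):
        # Peel the left vertex with the fewest remaining butterflies; the running
        # maximum of peeled counts is its tip number (counts are recomputed from
        # scratch here purely for clarity, which a real system would avoid).
        adj = {u: set(n) for u, n in adj_u.items()}
        tip, k = {}, 0
        while adj:
            counts = butterflies_per_vertex(adj)
            u = min(counts, key=counts.get)
            k = max(k, counts[u])
            tip[u] = k
            del adj[u]
        return tip

    graph = {"u1": {"a", "b"}, "u2": {"a", "b"}, "u3": {"a", "b", "c"}, "u4": {"c"}}
    print(tip_decomposition(graph))   # u4 has tip number 0; u1, u2, u3 form a 2-tip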
LIU Yi-Xuan, CHEN Hong, LIU Yu-Han, LI Cui-Ping
2022, 33(3):1057-1092. DOI: 10.13328/j.cnki.jos.006446
Abstract: With the era of big data and the development of artificial intelligence, federated learning (FL) has emerged as a distributed machine learning approach. It allows multiple participants to train a global model collaboratively while keeping their training datasets on local devices, and it was created to break up data silos and preserve the privacy and security of data. However, a large number of privacy risks remain during data exchange, where local data is threatened not only by the model user, as in centralized training, but also by any dishonest participant. It is therefore necessary to study technologies that achieve rigorous privacy preservation. This paper surveys the research progress and trends of privacy-preserving techniques for FL. First, the architectures and types of FL are introduced; then privacy risks and attacks, including reconstruction and inference strategies, are illustrated. The main privacy protection technologies are introduced according to their privacy-preserving mechanisms, and the privacy defense strategies built on these technologies are presented and abstracted into three levels: local, central, and local & central. Challenges and future directions of privacy preservation in federated learning are discussed at last.
LIU Yi-Fei, WANG Ning, WANG Zhi-Gang, GU Yu, WEI Zhi-Qiang, ZHANG Xiao-Jian, YU Ge
2022, 33(3):1093-1110. DOI: 10.13328/j.cnki.jos.006450
Abstract: The era of big data comes with ever-growing demands for frequency estimation over sensitive multi-dimensional categorical data. Existing works are devoted to designing privacy protection algorithms based on centralized differential privacy or local differential privacy; however, these models provide either a weak level of privacy protection or low accuracy of the published results. Therefore, building on the emerging shuffled differential privacy model, which remedies the shortcomings of the above models, this study designs data collection mechanisms that provide frequency distribution estimation with high security and high availability. Considering the multi-dimensional characteristics of the data and the heterogeneity among different attributes, two mechanisms are first proposed: SRR-MS with multiple shufflers and ARR-SS with one shuffler. Then, to combine the advantages of both, PSRR-SS with a single shuffler is proposed, which eliminates the heterogeneity among attributes by padding the attribute domains with dummy values. The degree of privacy protection and the error level of the three strategies are analyzed in detail theoretically, and the performance of the proposed mechanisms on frequency estimation is evaluated on four real datasets. In addition, the proposed mechanisms are used as the perturbation component of techniques that generate synthetic data, and the training results of stochastic gradient descent on the synthetic data are evaluated. The experimental results show that the proposed methods outperform the existing algorithms.
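The local perturbation plus shuffling pipeline can be sketched as follows in Python; the domain padding, the generalized randomized response, and the shuffler step are an illustrative reconstruction of the described design, not the paper's exact mechanisms.

    import math
    import random

    def padded_grr_report(value, domain, pad_to, epsilon):
        # Pad the attribute domain with dummy values up to a common size so that
        # heterogeneous attributes look identical, then apply k-ary randomized response.
        padded = list(domain) + [f"__dummy_{i}" for i in range(pad_to - len(domain))]
        k = len(padded)
        p_true = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
        if random.random() < p_true:
            return value
        return random.choice([v for v in padded if v != value])

    def shuffle(reports):
        # The shuffler permutes reports, breaking the link between a report and its
        # sender, which is what amplifies the local privacy guarantee.
        random.shuffle(reports)
        return reports

    reports = shuffle([padded_grr_report("blue", ["red", "blue", "green"],
                                         pad_to=8, epsilon=1.0) for _ in range(1000)])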
LI Shu-Yuan, JI Yu-Dian, SHI Ding-Yuan, LIAO Wang-Dong, ZHANG Li-Peng, TONG Yong-Xin, XU Ke
2022, 33(3):1111-1127. DOI: 10.13328/j.cnki.jos.006458
Abstract: In the era of big data, data is of great value as an essential factor of production, and it is of great significance to analyze, mine, and utilize large-scale data via data sharing. However, due to the heterogeneous dispersion of data and increasingly rigorous privacy protection regulations, data owners cannot share data arbitrarily; this dilemma turns data owners into data silos. A data federation performs collaborative queries while preserving the privacy of the data silos. This study implements a multi-party secure relational data federation system. The system is designed around the federated computation idea that “data stays, computation moves”. It provides adaptation interfaces for different kinds of relational databases, which shield the data heterogeneity of multiple data owners. The system implements a library of basic multi-party secure operators based on secret sharing, in which the result reconstruction process is optimized; on this basis, it supports query operations such as sum, average, maximum, equi-join, and theta-join. By making full use of the multi-party setting to reduce data interaction among data owners, the system lowers the secure computation overhead and thus effectively supports efficient data sharing. Finally, experiments are carried out on the benchmark dataset TPC-H. The results show that the proposed system can support the participation of more data owners and achieves up to 3.75 times higher execution efficiency than current data federation systems such as SMCQL and Conclave.
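The kind of secret-sharing calculator referred to above can be illustrated with a minimal additive-secret-sharing SUM in Python; the modulus, the party count, and the protocol details are illustrative assumptions, not the system's actual implementation.

    import random

    P = 2**61 - 1   # a public prime modulus (illustrative choice)

    def share(secret, n_parties):
        # Additive secret sharing: n-1 random shares plus one that makes the total
        # equal the secret modulo P; fewer than n shares reveal nothing about it.
        shares = [random.randrange(P) for _ in range(n_parties - 1)]
        shares.append((secret - sum(shares)) % P)
        return shares

    def secure_sum(private_values, n_parties=3):
        # Each owner splits its value across the parties; every party only ever sees
        # shares and adds them up; combining the per-party subtotals reconstructs
        # the global SUM without exposing any individual input.
        subtotals = [0] * n_parties
        for v in private_values:
            for i, s in enumerate(share(v, n_parties)):
                subtotals[i] = (subtotals[i] + s) % P
        return sum(subtotals) % P

    print(secure_sum([12, 30, 7]))   # 49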
XU Zhao-Zhao, SHEN De-Rong, NIE Tie-Zheng, KOU Yue
2022, 33(3):1128-1140. DOI: 10.13328/j.cnki.jos.006099
Abstract: In recent years, information technology and electronic medical records have become increasingly widespread in medical institutions, resulting in a large amount of medical data in hospital databases. Decision trees are widely used in medical data analysis because of their high classification precision, fast computation, and simple, easily understood classification rules. However, due to the inherently high-dimensional feature space and high feature redundancy of medical data, the classification precision of traditional decision trees is low. To address this, this paper proposes a hybrid feature selection algorithm (GRRGA) that combines information-gain-ratio ranking and grouping with a group evolution genetic algorithm. First, a filtering algorithm based on the information gain ratio is used to rank the original feature set; then, the ranked features are grouped according to an equal-division density principle; finally, a group evolution genetic algorithm is used to search over the ranked feature groups. Two kinds of evolution, in-population and out-population, use two different fitness functions to control the evolution process of the group evolution genetic algorithm. The experimental results show that the average precision of the GRRGA algorithm on six UCI datasets is 87.13%, which is significantly better than that of traditional feature selection algorithms. In addition, compared with two other classification algorithms, the feature selection performance of the proposed GRRGA algorithm is the best. More importantly, the precision of the bagging method on the arrhythmia and cancer medical datasets is 84.7% and 78.7%, respectively, which fully demonstrates the practical significance of the proposed algorithm.
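The ranking-and-grouping stage can be illustrated with the following Python sketch, which computes the standard information gain ratio and cuts the ranked features into roughly equal groups; the genetic search stage is omitted, and the grouping rule is a simplification of the paper's density principle.

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def gain_ratio(feature_values, labels):
        # Standard information gain ratio for one discrete feature.
        n = len(labels)
        cond, split = 0.0, 0.0
        for v, cnt in Counter(feature_values).items():
            subset = [l for f, l in zip(feature_values, labels) if f == v]
            cond += cnt / n * entropy(subset)
            split -= cnt / n * math.log2(cnt / n)
        return (entropy(labels) - cond) / split if split > 0 else 0.0

    def rank_and_group(features, labels, num_groups):
        # Rank features by gain ratio, then cut the ranked list into equal-sized groups.
        ranked = sorted(features, key=lambda name: gain_ratio(features[name], labels),
                        reverse=True)
        size = max(1, -(-len(ranked) // num_groups))   # ceiling division
        return [ranked[i:i + size] for i in range(0, len(ranked), size)]

    features = {"fever": [1, 1, 0, 0], "age_band": [2, 2, 1, 0], "noise": [0, 1, 0, 1]}
    labels = ["sick", "sick", "well", "well"]
    print(rank_and_group(features, labels, num_groups=2))   # [['fever', 'age_band'], ['noise']]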
LIU Jun-Peng, HUANG Kai-Yu, LI Jiu-Yi, SONG Ding-Xin, HUANG De-Gen
2022, 33(3):1141-1152. DOI: 10.13328/j.cnki.jos.006201
Abstract: The over-translation and under-translation problems in neural machine translation can be alleviated by coverage models. Existing methods usually store the coverage information in a single form, such as a coverage vector or a coverage score, and do not take the relationship between different coverage forms into consideration, leading to insufficient use of the information. This study proposes a multi-coverage mechanism based on the consistency of the translation history and the complementarity between different models. The concept of a word-level coverage score is defined first, and then the coverage information stored in both the coverage vector and the coverage score is incorporated into the attention mechanism simultaneously, aiming to reduce the influence of information loss. Two models are introduced according to different fusion methods. Experiments are carried out on a Chinese-to-English translation task based on the sequence-to-sequence model. The results show that the proposed method significantly enhances translation performance and improves alignment quality between source and target words. Compared with the model that uses only a coverage vector, the number of over-translation and under-translation errors is further reduced.
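One plausible way to fold both kinds of coverage information into attention is sketched below with NumPy; the shapes, update rules, and fusion are assumptions made for illustration and do not reproduce the paper's two models.

    import numpy as np

    def coverage_attention_step(enc_states, dec_state, cov_vector, cov_score,
                                W_h, W_s, w_v, w_s, v):
        # enc_states: (src_len, d), dec_state: (d,),
        # cov_vector: (src_len, d) accumulated attention per source word,
        # cov_score: (src_len,) scalar coverage per source word.
        energy = np.tanh(enc_states @ W_h + dec_state @ W_s
                         + cov_vector @ w_v + np.outer(cov_score, w_s))
        scores = energy @ v                                   # (src_len,)
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()
        new_cov_vector = cov_vector + attn[:, None]           # accumulate attention
        new_cov_score = np.clip(cov_score + attn, 0.0, 1.0)   # word-level coverage score
        return attn, new_cov_vector, new_cov_score

    # Toy shapes: 5 source words, hidden size 8, random parameters.
    src_len, d = 5, 8
    rng = np.random.default_rng(0)
    attn, cov_v, cov_s = coverage_attention_step(
        rng.normal(size=(src_len, d)), rng.normal(size=d),
        np.zeros((src_len, d)), np.zeros(src_len),
        rng.normal(size=(d, d)), rng.normal(size=(d, d)),
        rng.normal(size=(d, d)), rng.normal(size=d), rng.normal(size=d))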