GAO Shan, YUAN Wan-Zhu, LU Wei, WANG Lan, ZHANG Jing, DU Xiao-Yong
2023, 34(3):1010-1026. DOI: 10.13328/j.cnki.jos.006787
Abstract:Government data governance is undergoing a transition from "physical data aggregation" to "logical semantic unification". The long-term "autonomy" of government information silos has led to a wide spectrum of metadata curation issues, such as attributes that share a name but differ in meaning, or attributes that differ in name but share a meaning. Instead of rebuilding or modifying legacy information systems, or physically aggregating data from isolated systems, logical semantic unification addresses this problem by unifying the semantic expression of the metadata in government information silos, thereby achieving standardized metadata governance. This work semantically aligns the metadata of each government information silo with existing standard metadata. Specifically, the standard metadata names are treated as semantic labels, and the semantic meanings of the columns of relations in each silo are identified, so as to establish an alignment between column names and standard metadata and achieve standardized governance of silo metadata.
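As a reading aid, the sketch below illustrates the general idea of aligning silo column names to standard metadata labels by semantic similarity. It is not the paper's algorithm: the `embed` function is a toy character-bigram encoder standing in for whatever semantic representation the authors actually use, and the similarity threshold is arbitrary.

```python
from collections import Counter
from math import sqrt

def embed(name: str) -> Counter:
    # Toy stand-in for a learned semantic encoder: a character-bigram bag.
    s = name.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def align(silo_columns, standard_metadata, threshold=0.5):
    # Assign each silo column the best-matching standard metadata label,
    # or None when nothing is similar enough.
    std = {m: embed(m) for m in standard_metadata}
    result = {}
    for col in silo_columns:
        e = embed(col)
        label, score = max(((m, cosine(e, v)) for m, v in std.items()),
                           key=lambda t: t[1])
        result[col] = label if score >= threshold else None
    return result

print(align(["id_number", "citizen_name"], ["identity_card_number", "name"]))
```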
ZHAO Dong-Ming, QIU Yuan-Hui, KANG Rui, SONG Shao-Xu, HUANG Xiang-Dong, WANG Jian-Min
2023, 34(3):1027-1048. DOI: 10.13328/j.cnki.jos.006789
Abstract:Time series data is widely used in energy, manufacturing, finance, climate, and many other fields. Aggregate queries are quite common in time series analysis scenarios to quickly obtain a summary of massive data. Storing pre-computed metadata is an effective way to accelerate aggregate queries. However, most existing time series databases slice data with fixed time windows, which requires real-time sorting and partitioning. In IoT applications with high write concurrency and throughput, these additional costs are unacceptable. This study proposes a physical metadata management solution in Apache IoTDB for accelerating aggregate queries, in which data is sliced according to the physical storage sharding of files. Both synchronous and asynchronous computing are adopted to keep writing performance ahead of queries. Out-of-order data streams are another major challenge in IoTDB applications. This study abstracts files with overlapping time ranges into out-of-order file groups and provides metadata for each group. Aggregate queries are then rewritten into three sub-queries and efficiently executed on physical metadata and time series data. Experiments on various datasets show that the proposed solution improves aggregate query performance and validate the effectiveness of the different computing strategies.
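To make the metadata-based acceleration concrete, here is a minimal sketch (not the IoTDB implementation) of answering a SUM query from per-chunk pre-aggregates: chunks fully covered by the query range contribute their stored totals, while partially covered chunks fall back to scanning raw points. Names such as `ChunkMeta` and `points_of` are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class ChunkMeta:
    start: int     # min timestamp in the chunk (e.g., one file shard)
    end: int       # max timestamp in the chunk
    total: float   # pre-aggregated sum of the chunk's values

def aggregate_sum(chunk_metas, points_of, t_lo, t_hi):
    """Answer SUM over [t_lo, t_hi]: use metadata for fully covered chunks,
    scan raw points only for chunks that partially overlap the range."""
    s = 0.0
    for meta in chunk_metas:
        if meta.end < t_lo or meta.start > t_hi:
            continue                      # disjoint chunk: skip entirely
        if t_lo <= meta.start and meta.end <= t_hi:
            s += meta.total               # covered chunk: metadata only
        else:
            s += sum(v for t, v in points_of(meta) if t_lo <= t <= t_hi)
    return s
```

The paper's rewriting into three sub-queries additionally separates sequential chunks from out-of-order file groups; the sketch folds both into the "partially covered" branch.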
PENG Jin-Feng, SHEN De-Rong, KOU Yue, NIE Tie-Zheng
2023, 34(3):1049-1064. DOI: 10.13328/j.cnki.jos.006791
Abstract:With the development of the information society, data have grown larger in scale and richer in type. Data have become an important strategic resource and a vital guarantee of scientific management for countries and enterprises. Nevertheless, as the data generated in social life keep increasing, a large amount of dirty data comes along with them, and data quality issues ensue. In the field of data science, detecting errors accurately and comprehensively has always been a pain point. Although many traditional methods based on constraints or statistics have been widely used, they are usually limited by prior knowledge and labor cost. Recently, some novel methods detect errors by utilizing deep learning models to perform inference on time series data or to analyze contextual data, and achieve better performance. However, these methods tend to be applicable only to specific areas or specific types of errors, which is not general enough for complex real-world cases. Based on these observations, this study takes advantage of both traditional methods and deep learning models and proposes a comprehensive error detection method (CEDM) that can deal with multiple types of errors from multiple views. First, under the pattern view, basic detection rules are constructed based on statistical analysis with constraints over multiple dimensions, including attributes, cells, and tuples. Then, under the semantic view, data semantics are captured by word embeddings, and attribute relevance, cell dependency, and tuple similarity are analyzed; the basic rules are thereby extended and updated based on the semantic relations in different dimensions. Finally, errors of multiple types can be detected comprehensively and accurately from multiple views. Extensive experiments on real and synthetic datasets demonstrate that the proposed method outperforms state-of-the-art error detection methods and has higher generalization ability, being applicable to multiple areas and multiple error types.
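The following toy sketch illustrates the pattern-view/semantic-view combination in spirit only: attribute-level regular-expression rules are applied directly, and a crude name-token overlap stands in for the embedding-based attribute relevance used to extend rules to related attributes. All rule contents and column names are invented for illustration.

```python
import re

# Pattern view: attribute-level rules (here, regular expressions).
RULES = {"phone": re.compile(r"^\d{11}$"),
         "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")}

def name_related(a: str, b: str) -> bool:
    # Toy stand-in for embedding-based attribute relevance:
    # two attributes are related if their names share a token.
    return bool(set(a.lower().split("_")) & set(b.lower().split("_")))

def extend_rules(attributes):
    # Semantic view: propagate a rule to attributes related to a ruled one,
    # e.g. 'contact_phone' inherits the 'phone' rule.
    extended = dict(RULES)
    for attr in attributes:
        for ruled, rx in RULES.items():
            if attr not in extended and name_related(attr, ruled):
                extended[attr] = rx
    return extended

def detect_errors(rows):
    rules = extend_rules(rows[0].keys())
    return [(i, attr) for i, row in enumerate(rows)
            for attr, rx in rules.items()
            if attr in row and not rx.match(str(row[attr]))]

rows = [{"contact_phone": "13800000000", "email": "a@b.com"},
        {"contact_phone": "call me", "email": "not-an-email"}]
print(detect_errors(rows))   # both cells of the second tuple are flagged
```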
DING Xiao-Ou, LI Ying-Ze, WANG Chen, WANG Hong-Zhi, LI Hao-Xuan
2023, 34(3):1065-1086. DOI: 10.13328/j.cnki.jos.006793
Abstract:Time series data generated by intelligent devices is growing rapidly and faces serious data quality problems. The demand for time series data quality management and for data quality improvement based on data repairing techniques is increasingly urgent. Time series data is characterized by ordered time windows and strong associations between rows and columns, which poses additional challenges for expressing data quality semantics over time series. This study proposes the definition and construction of time series data quality rules that take the associations on both rows and columns into account, extending existing data quality rule systems with time windows and multi-order expression operations. In addition, a discovery method for time series data quality rules is proposed. Experimental results on real time series datasets verify that the proposed method can effectively and efficiently discover hidden data quality rules from time series data, and that constructing predicates from associated row and column information yields higher performance than existing data quality rule discovery methods.
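As an illustration of a windowed quality rule (not the rule system defined in the paper), the sketch below checks a speed-style constraint: within a time window, the value change rate between two observations must not exceed a bound. The window size and rate bound are arbitrary.

```python
def window_violations(series, window, max_rate):
    """Flag points whose change rate with respect to any earlier point
    inside the time window exceeds max_rate.
    `series` is a list of (timestamp, value) pairs sorted by timestamp."""
    bad = []
    for i, (ti, vi) in enumerate(series):
        for j in range(i - 1, -1, -1):
            tj, vj = series[j]
            if ti - tj > window:
                break                      # outside the window: stop looking back
            if ti > tj and abs(vi - vj) / (ti - tj) > max_rate:
                bad.append(i)
                break
    return bad

# A sensor sampled once per second: the spike at t=2 (and the drop after it)
# violates a maximum change rate of 5 units per second.
series = [(0, 20.0), (1, 20.3), (2, 60.5), (3, 20.8)]
print(window_violations(series, window=2, max_rate=5.0))   # [2, 3]
```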
LIANG Zheng, WANG Hong-Zhi, DAI Jia-Jia, SHAO Xin-Yue, DING Xiao-Ou, MU Tian-Yu
2023, 34(3):1087-1108. DOI: 10.13328/j.cnki.jos.006794
Abstract:Entity matching determines whether two records from different datasets refer to the same real-world entity and is indispensable for tasks such as big data integration, social network analysis, and web semantic data management. As deep learning technology that has achieved great success in natural language processing and computer vision, pre-trained language models have also achieved better results than traditional methods on entity matching tasks and have attracted the attention of many researchers. However, the performance of entity matching based on pre-trained language models is unstable and the matching results are hard to explain, which brings great uncertainty to applying this technology in big data integration. At the same time, existing interpretation methods for entity matching models are mainly model-agnostic methods oriented toward general machine learning, and their applicability to pre-trained language models is limited. Therefore, taking BERT-based entity matching models such as Ditto and JointBERT as examples, this study proposes three interpretation methods for pre-trained language model based entity matching to address this problem. (1) Since the serialization step is sensitive to the order of relational attributes, dataset meta-features and attribute similarity are used to generate attribute-order counterfactuals for misclassified samples. (2) As a supplement to traditional attribute importance measurement, the attention weights of the pre-trained language model are used to measure and visualize how the model processes inputs. (3) Based on the serialized sentence vectors, k-nearest neighbor search is used to recall well-interpretable samples similar to the misclassified ones, enhancing the low-confidence predictions of the pre-trained language model. Experiments on real public datasets show that, while improving model performance through the enhancement method, the proposed approach can reach 68.8% of the fidelity upper limit in the attribute-order search space, providing decision explanations for pre-trained language model based entity matching. New perspectives such as attribute-order counterfactuals and attribute association understanding are also introduced.
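For readers unfamiliar with how such matchers consume relational records, the sketch below shows a Ditto-style serialization of a record pair and how permuting the attribute order produces the kind of ordering counterfactual discussed above. The exact special tokens used by Ditto/JointBERT may differ, and the records are invented.

```python
def serialize(record: dict, order=None) -> str:
    # Ditto-style linearization: attribute name/value pairs become text tokens.
    keys = order or list(record)
    return " ".join(f"COL {k} VAL {record[k]}" for k in keys)

def serialize_pair(left: dict, right: dict, order=None) -> str:
    return f"{serialize(left, order)} [SEP] {serialize(right, order)}"

a = {"title": "iphone 13 128gb", "price": "699"}
b = {"title": "apple iphone 13 (128 gb)", "price": "699.00"}
print(serialize_pair(a, b))                               # original order
print(serialize_pair(a, b, order=["price", "title"]))     # order counterfactual
```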
ZHANG Yuan-Yuan, LI Shu-Yuan, SHI Ye-Xuan, ZHOU Nan, XU Yi, XU Ke
2023, 34(3):1109-1125. DOI: 10.13328/j.cnki.jos.006795
Abstract:Recently, many countries and regions have enacted data security policies, such as the General Data Protection Regulation proposed by the EU. The release of related laws and regulations has aggravated the problem of data silos, which makes it difficult to share data among data owners. Data federation is a possible solution to this problem: multiple data owners jointly perform query tasks without disclosing their original data, combining privacy computing technologies such as secure multi-party computation. This concept has become a research trend in recent years, and a series of representative systems such as SMCQL and Conclave have been proposed. However, for the fundamental join query in relational database systems, existing data federation systems still have the following problems. First, only a single type of join query is supported, which makes it difficult to meet query requirements under complex join conditions. Second, there is much room for performance improvement, because existing systems often invoke security tool libraries directly, incurring high running time and communication overhead. Therefore, a data federation join algorithm is proposed to address the above issues. The main contributions of this study are as follows. Firstly, multiparty-oriented federated security operators are designed and implemented, which can support a variety of operations. Secondly, a federated θ-join algorithm and an optimization strategy are proposed to significantly reduce the cost of secure computation. Finally, the performance of the proposal is verified on the benchmark dataset TPC-H. The experimental results show that the proposed algorithm can reduce the runtime and communication overhead by 61.33% and 95.26%, respectively, compared with the existing data federation systems SMCQL and Conclave.
CHEN Lu, GUO Yu-Xiang, GE Cong-Cong, ZHENG Bai-Hua, GAO Yun-Jun
2023, 34(3):1126-1147. DOI: 10.13328/j.cnki.jos.006781
Abstract:With the emergence and accumulation of massive data, data governance has become an important means of improving data quality and maximizing data value. Error detection is crucial for improving data quality and has attracted a surge of interest from both industry and academia. Various detection methods tailored to a single data source have been proposed. Nevertheless, in many real-world scenarios, data is not centrally stored and managed. Different sources of correlated data can be employed to improve the accuracy of error detection. Unfortunately, due to privacy and security concerns, cross-source data often cannot be integrated centrally. To this end, this study proposes FeLeDetect, a cross-source data error detection method based on federated learning. First, a graph-based error detection model (GEDM) is presented to capture sufficient data features from each data source. Then, a federated co-training algorithm (FCTA) is investigated to collaboratively train GEDM over different data sources without privacy leakage. Furthermore, a series of optimization methods are designed to reduce the communication cost of federated learning and the manual labeling effort. Extensive experiments on three real-life datasets demonstrate that GEDM achieves an average F1-score improvement of 10.3% in the local scenario and 25.2% in the centralized scenario, outperforming all five existing state-of-the-art single-source competitors, and that FeLeDetect further improves the F1-score of local GEDM by 23.2% on average.
QIAO Shao-Jie, LIN Yu-Feng, HAN Nan, YANG Guo-Ping, LI He, YUAN Guan, MAO Rui, YUAN Chang-An, Louis Alberto GUTIERREZ
2023, 34(3):1148-1167. DOI: 10.13328/j.cnki.jos.006784
Abstract:In the context of big data, ensuring credible data sharing is a basic requirement of data federation. Using blockchain technology to replace the traditional client-server architecture can improve the security of federated learning (FL). However, the huge communication cost and storage consumption generated by model parameter validation and data persistence in existing work have become urgent problems in data federation. To tackle these problems, an efficient decentralized federated learning framework (EDFL) is proposed, which reduces system overhead and significantly improves the learning efficiency of FL. Firstly, a consensus mechanism based on proof-of-contribution (PoC) is proposed, in which the block generator is elected based on historical contribution rather than through a competition mechanism; this avoids the block-generation latency caused by mining and asynchronously alleviates congestion in model parameter validation. Secondly, a role-adaptive incentive algorithm is presented; because it is based on the work intensity of participating nodes and the roles assigned by EDFL, it can motivate legitimate nodes to actively conduct model training and effectively identify malicious nodes. Thirdly, a blockchain partition storage strategy is proposed, which evenly distributes local reconstruction code chunks across the nodes of the network, reducing local storage consumption and achieving higher data recovery efficiency. Lastly, the learning efficiency, storage scalability, and security of EDFL are evaluated on the real-world FEMNIST dataset. Experimental results show that EDFL outperforms state-of-the-art blockchain-based FL frameworks in all three aspects.
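One plausible reading of contribution-based block-generator election is sampling the next generator with probability proportional to accumulated contribution, as in the sketch below; this is an illustrative interpretation, not EDFL's actual election rule, and the contribution values are made up.

```python
import random

def elect_block_generator(contributions, rng=None):
    """Pick the next block generator with probability proportional to each
    node's accumulated historical contribution (no mining competition)."""
    rng = rng or random.Random()
    nodes = list(contributions)
    weights = [max(contributions[n], 0.0) for n in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]

history = {"node-a": 1.0, "node-b": 2.5, "node-c": 6.0}
picks = [elect_block_generator(history, random.Random(i)) for i in range(1000)]
print(picks.count("node-c") / len(picks))   # roughly 6 / 9.5 of the elections
```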
WANG Yong, LI Guo-Liang, LI Kai-Yu
2023, 34(3):1168-1192. DOI: 10.13328/j.cnki.jos.006786
Abstract:Federated learning is a collaborative machine learning framework in which multiple participants keep their training datasets locally. How to evaluate the data contribution of each participant is one of the critical problems in federated learning. However, contribution evaluation in federated learning faces multiple challenges. First, to evaluate participant contribution, data value needs to be quantified; however, data valuation is challenging because it is subjective, task- and context-dependent, and vulnerable to malicious data. Second, participant contribution evaluation is a classic cooperative game problem, and a fair yet rational cooperative evaluation scheme is needed to achieve an optimal equilibrium among all participants. Third, contribution evaluation schemes often involve exponential computational complexity, and data valuation by training models in federated learning is itself quite time-consuming. In recent years, researchers have conducted extensive studies on participant contribution evaluation in federated learning to tackle the above challenges. This study first introduces the background of federated learning and contribution evaluation. Then, data valuation metrics, contribution evaluation schemes, and the corresponding optimization technologies are surveyed in turn. Finally, the remaining challenges of contribution evaluation and potential future work are discussed.
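As background for the survey, the Shapley value is the canonical cooperative-game scheme for contribution evaluation, and permutation sampling is a standard way to avoid its exponential cost. The sketch below shows this generic estimator; `utility` is an assumed callback (e.g., validation accuracy of a model trained on a coalition) and the toy utility at the end is invented for illustration.

```python
import random

def monte_carlo_shapley(participants, utility, rounds=200, seed=0):
    """Estimate each participant's Shapley value by permutation sampling.
    `utility(coalition)` returns the value of the model trained on `coalition`."""
    rng = random.Random(seed)
    phi = {p: 0.0 for p in participants}
    for _ in range(rounds):
        order = list(participants)
        rng.shuffle(order)
        coalition, prev = [], utility(frozenset())
        for p in order:
            coalition.append(p)
            cur = utility(frozenset(coalition))
            phi[p] += cur - prev           # marginal contribution of p
            prev = cur
    return {p: v / rounds for p, v in phi.items()}

# Toy utility: each participant adds a fixed amount of "accuracy".
gains = {"A": 0.05, "B": 0.10, "C": 0.01}
print(monte_carlo_shapley(gains, lambda s: sum(gains[p] for p in s)))
```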
FU Peng-Tao, LUO Lai-Long, GUO De-Ke, ZHAO Xiang, LI Shang-Sen, WANG Huai-Min
2023, 34(3):1193-1212. DOI: 10.13328/j.cnki.jos.006782
Abstract:With the rapid development of information technology, the volume of data keeps growing exponentially, and the value of data is hard to mine. This brings significant challenges to the efficient management and control of each stage of the data life cycle, such as data collection, cleaning, storage, and sharing. A sketch uses a hash table, matrix, or bit vector to track core characteristics of data, such as frequency, cardinality, and membership. This mechanism makes the sketch itself metadata, which has been widely used in scenarios such as sharing, transmission, and updating. The rapid-flow characteristic of big data has spawned dynamic sketches. Existing dynamic sketches can expand or shrink in capacity with the size of the data stream by dynamically maintaining a list of probabilistic data structures in a chain or tree structure, but they suffer from excessive space overhead and time overhead that grows with the dataset cardinality. This study designs a dynamic sketch for big data governance based on jump consistent hashing. The proposed method simultaneously achieves space overhead that grows linearly with the dataset cardinality and constant time overhead for data processing and analysis, effectively supporting demanding processing and analysis tasks in big data governance. The validity and efficiency of the proposed method are verified by comparing it with traditional methods on various synthetic and real-world datasets.
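Jump consistent hashing (Lamping and Veach) is a published algorithm, reproduced below together with a small experiment showing why it suits a growing structure: when the bucket count increases, only about 1/(n+1) of the keys are relocated, so capacity can grow without a full rehash. The experiment is illustrative and independent of the paper's sketch design.

```python
def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Map a 64-bit key to a bucket in [0, num_buckets) (Lamping & Veach, 2014)."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int(float(b + 1) * float(1 << 31) / float((key >> 33) + 1))
    return b

# Growing from 8 to 9 buckets relocates only ~1/9 of the keys,
# which is what lets a sketch expand without rebuilding all counters.
keys = range(100_000)
moved = sum(1 for k in keys if jump_consistent_hash(k, 8) != jump_consistent_hash(k, 9))
print(moved / 100_000)
```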
TU Yao-Feng, NIU Jia-Hao, WANG De-Zheng, GAO Hong, XU Jin, HONG Ke, YANG Fang
2023, 34(3):1213-1235. DOI: 10.13328/j.cnki.jos.006783
Abstract:Big data has become a basic national strategic resource, and the opening and sharing of data are at the core of China's big data strategy. Cloud-native technology and the lakehouse architecture are reshaping the big data infrastructure and promoting data sharing and value dissemination. The development of the big data industry and its technology requires stronger data security and data sharing capabilities. However, data security in an open environment has become a bottleneck that restricts the development and utilization of big data technology. The issues of data security and privacy protection have become increasingly prominent in both open-source big data ecosystems and commercial big data systems. Dynamic data protection systems in open big data environments face challenges in data availability, processing efficiency, and system scalability. This study proposes BDMasker, a dynamic data protection system for open big data environments. Through precise query analysis and query rewriting based on a query dependency model, it accurately perceives but does not change the original business request, so the entire dynamic desensitization process has zero impact on the business. Furthermore, its multi-engine-oriented unified security strategy framework realizes the vertical expansion of dynamic data protection capabilities and their horizontal expansion across multiple computing engines, and the distributed computing capability of big data execution engines can be used to improve the data protection performance of the system. The experimental results show that the precise SQL analysis and rewriting technology proposed in BDMasker is effective, that the system has good scalability and performance, and that the overall performance fluctuates within 3% in the TPC-DS and YCSB benchmark tests.
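A toy flavor of policy-driven query rewriting is sketched below: sensitive columns in the SELECT list are wrapped with a masking UDF before execution. BDMasker itself works on a parsed query dependency model across multiple engines; the regex-on-text approach, the policy table, and the UDF names here are purely illustrative.

```python
import re

MASK_POLICY = {"phone": "mask_phone", "id_card": "mask_id"}   # column -> masking UDF

def rewrite(sql: str) -> str:
    """Wrap sensitive columns in the SELECT list with a masking UDF,
    leaving the rest of the query untouched."""
    def repl(m):
        col = m.group(0)
        if col in MASK_POLICY:
            return f"{MASK_POLICY[col]}({col}) AS {col}"
        return col
    head, sep, tail = sql.partition(" FROM ")
    head = re.sub(r"[A-Za-z_][A-Za-z_0-9]*", repl, head)
    return head + sep + tail

print(rewrite("SELECT name, phone FROM users WHERE dept = 'hr'"))
# SELECT name, mask_phone(phone) AS phone FROM users WHERE dept = 'hr'
```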
CHEN Zi-Hao, XU Chen, QIAN Wei-Ning, ZHOU Ao-Ying
2023, 34(3):1236-1258. DOI: 10.13328/j.cnki.jos.006785
Abstract:As an essential part of big data governance applications, data analysis is characterized by long execution times and large hardware requirements, making it essential to optimize its execution efficiency. Earlier, data analysts could execute analysis algorithms using traditional matrix computation tools. However, with the explosive growth of data volume, traditional tools can no longer meet the performance requirements of applications, and distributed matrix computation systems for big data analysis have emerged. This study reviews the progress of distributed matrix computation systems from both technical and system perspectives. First, from the perspective of the mature data management field, it analyzes the challenges faced by distributed matrix computation systems along four dimensions: programming interface, compilation optimization, execution engine, and data storage. Second, it discusses and summarizes the technologies in each of these four dimensions. Finally, it investigates future research and development directions of distributed matrix computation systems.
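As a tiny illustration of the data-partitioning concern such systems deal with, the block-partitioned matrix multiplication below splits operands into tiles; in a distributed engine each tile product would become a task shipped to a worker. This is textbook blocking, not any particular system's execution plan.

```python
import numpy as np

def blocked_matmul(A, B, bs=128):
    """Multiply A (n x m) by B (m x p) tile by tile; each (i, j, k) tile
    product is an independent unit of work a distributed engine could schedule."""
    n, m = A.shape
    m2, p = B.shape
    assert m == m2
    C = np.zeros((n, p), dtype=np.result_type(A, B))
    for i in range(0, n, bs):
        for j in range(0, p, bs):
            for k in range(0, m, bs):
                C[i:i + bs, j:j + bs] += A[i:i + bs, k:k + bs] @ B[k:k + bs, j:j + bs]
    return C

A, B = np.random.rand(300, 200), np.random.rand(200, 250)
assert np.allclose(blocked_matmul(A, B), A @ B)
```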
PANG Jun, LIU Xiao-Qi, GU Yu, WANG Xin, ZHAO Yu-Hai, ZHANG Xiao-Long, YU Ge
2023, 34(3):1259-1276. DOI: 10.13328/j.cnki.jos.006788
Abstract:Link prediction in knowledge graphs is the most effective method for graph completion and can effectively improve the data quality of knowledge graphs. However, real-life relations are often multivariate (n-ary), and knowledge graphs containing such multivariate relations are called knowledge hypergraphs (KHGs). Unfortunately, existing knowledge graph link prediction methods cannot be directly applied to knowledge hypergraphs, and existing knowledge hypergraph link prediction models ignore the equality (there is no sequential relationship among the elements of a multivariate relation) and completeness (a multivariate relation is not valid if it lacks elements) of real-life multivariate relations. To address these problems, a knowledge hypergraph multivariate representation model is first proposed, which can directly model the multivariate relations in a knowledge hypergraph. Then, a multi-granularity neural network-based hypergraph prediction method (HPMG) is studied, which divides relations into multiple granularities and performs learning and prediction jointly from the different granularities. Next, to address the problem of inadequate feature fusion in HPMG, HPMG+ is proposed for knowledge hypergraph link prediction based on a multi-granularity attention network, which combines global and local attention to achieve differentiated fusion of different features and further improves model performance. Finally, extensive experimental results on real datasets verify that the proposed methods significantly outperform all baseline methods on hyperedge prediction.
QIAO Lian-Peng, HOU Hui-Wen, WANG Guo-Ren
2023, 34(3):1277-1291. DOI: 10.13328/j.cnki.jos.006792
Abstract:In recent years, community search over heterogeneous information networks has attracted more and more attention and has been widely used in graph data analysis. Nevertheless, existing community search problems on heterogeneous information networks do not consider attribute fairness on subgraphs. This work combines attribute fairness with kPcore mining on heterogeneous information networks and proposes a maximum core mining problem based on attribute fairness. To solve this problem, a subgraph model called FkPcore is proposed. When enumerating FkPcores, the basic algorithm Basic-FkPcore traverses all path instances and enumerates a large number of kPcores and their subgraphs. To improve efficiency, the Adv-FkPcore algorithm is proposed, which avoids examining all kPcores and their subgraphs during FkPcore enumeration. In addition, to improve the efficiency of obtaining P_neighbor, a traversal method with vertex signs (TMS) and a TMS-based FkPcore enumeration algorithm called Opt-FkPcore are proposed. Extensive experiments on heterogeneous information networks demonstrate the effectiveness and efficiency of the proposed methods.
QIN Hao, WANG Ping-Hui, ZHANG Ruo-Fei, QIN Zun-Ying
2023, 34(3):1292-1309. DOI: 10.13328/j.cnki.jos.006790
Abstract:Surveillance video keyframe retrieval and attribute search have many application scenarios in traffic, security, education, and other fields. Applying deep learning models to massive video data alleviates manpower consumption to a certain extent, but it suffers from privacy disclosure, heavy consumption of computing resources, and long processing time. Based on the above scenarios, this study proposes a safe and fast video retrieval model for massive surveillance video. In particular, given the abundant computing power in the cloud and the limited computing power of surveillance cameras, a heavyweight model is deployed in the cloud, and the proposed tolerance training strategy is used for customized knowledge distillation; the distilled lightweight model is then deployed inside the surveillance camera. At the same time, a local encryption algorithm is used to encrypt the sensitive parts of images, and combined with cloud TEE technology and a user authorization mechanism, privacy protection is achieved with very low resource consumption. By properly controlling the "tolerance" of the distillation strategy, the time consumed in the camera-side video ingestion stage and the cloud-side retrieval stage can be balanced, ensuring extremely low retrieval latency on the premise of high accuracy. Compared with traditional retrieval methods, the proposed model is secure, efficient, scalable, and low-latency. Experimental results show that the proposed model provides 9×-133× acceleration over traditional retrieval methods on multiple open datasets.
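For reference, vanilla temperature-based knowledge distillation looks like the PyTorch sketch below: the student mimics the cloud teacher's softened outputs while also fitting the hard labels. The paper's "tolerance" control over distillation is not reproduced here, and the temperature and weight values are arbitrary.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Standard KD loss: soft-target KL term (scaled by T^2) + hard-label CE."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

student_logits = torch.randn(8, 10, requires_grad=True)   # lightweight camera model
teacher_logits = torch.randn(8, 10)                       # heavyweight cloud model
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```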
CHEN Xiang, YU Chi, YANG Guang, PU Xue-Lian, CUI Zhan-Qi
2023, 34(3):1310-1329. DOI: 10.13328/j.cnki.jos.006690
Abstract:Bash is the default shell command language of Linux and plays an important role in the development and maintenance of Linux systems. Nevertheless, understanding the purpose and functionality of Bash code is a challenging task. Therefore, this study proposes ExplainBash, an automatic Bash code comment generation method based on dual information retrieval. Specifically, the method performs dual retrieval based on semantic similarity and lexical similarity, aiming to generate high-quality code comments. For semantic similarity, CodeBERT and the BERT-whitening operator are used to learn code semantic representations, and the Euclidean distance is used to compute semantic similarity; for lexical similarity, code is represented as a set of code tokens, and the edit distance is used to compute lexical similarity. A high-quality corpus is constructed based on the corpus shared in the NL2Bash study and the data shared in the NLC2CMD competition. Nine state-of-the-art baselines from the automatic code comment generation domain are then selected, covering both information retrieval-based and deep learning-based methods. Results of the empirical study and a human study verify the effectiveness of the proposed method. Ablation experiments are also designed to analyze the rationality of the settings in the proposed method (such as the retrieval strategy and the BERT-whitening operator). Finally, a browser plug-in is developed based on the proposed method to facilitate the comprehension of Bash code.
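The dual-scoring idea can be sketched as below: a semantic score from embedding distance and a lexical score from token-level edit distance are mixed to rank candidate snippets, and the comment of the top snippet is reused. The `embed` callback stands in for the CodeBERT plus BERT-whitening encoder, and the mixing weight is arbitrary.

```python
import math

def edit_distance(a, b):
    # Levenshtein distance between two token sequences.
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[len(b)]

def lexical_sim(code_a: str, code_b: str) -> float:
    ta, tb = code_a.split(), code_b.split()
    return 1.0 - edit_distance(ta, tb) / max(len(ta), len(tb), 1)

def dual_retrieve(query_code, corpus, embed, alpha=0.5):
    """Return the comment of the corpus snippet with the best mixed score.
    `corpus` is a list of (code, comment) pairs; `embed` maps code to a vector."""
    qv = embed(query_code)
    def score(code):
        sem = 1.0 / (1.0 + math.dist(qv, embed(code)))   # semantic similarity
        return alpha * sem + (1.0 - alpha) * lexical_sim(query_code, code)
    best_code, best_comment = max(corpus, key=lambda pair: score(pair[0]))
    return best_comment
```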
JI Shou-Ling, WANG Qin-Ying, CHEN An-Ying, ZHAO Bin-Bin, YE Tong, ZHANG Xu-Hong, WU Jing-Zheng, LI Yun, YIN Jian-Wei, WU Yan-Jun
2023, 34(3):1330-1364. DOI: 10.13328/j.cnki.jos.006717
Abstract:In recent years, the vigorous development of open-source software and modern software development and supply models have greatly facilitated the rapid iteration and evolution of open-source software, resulting in increased social benefits. The emerging collaborative development model of open source has transformed the software supply process from a relatively linear path into a complex network structure. Within the complex and intertwined supply relationships of open-source software, the overall security risk has significantly increased, drawing growing attention from the academic and industrial communities. This work defines a new open-source software supply chain model and, based on attacks that have occurred over the past decade, summarizes the threat model and security trends of the open-source software supply chain. For securing the open-source software supply chain, this work provides a systematic overview from the perspectives of risk identification and reinforced defense, and also highlights new challenges and opportunities.
WANG Yu, ZHU Meng-Xia, YANG Shang-Hui, LU Xue-Song, ZHOU Ao-Ying
2023, 34(3):1365-1395. DOI: 10.13328/j.cnki.jos.006715
Abstract:Knowledge tracing is an important cognitive diagnosis method that is often used in digital education platforms such as online learning platforms and intelligent tutoring systems. By analyzing students' interactions with course assignments, knowledge tracing models can simulate in real time their mastery of the knowledge concepts in a course. The simulation results can be used to predict students' future learning performance and help them plan personalized learning paths. Over the past 20 years, knowledge tracing models have been built on theories from statistics and cognitive science. With the availability and application of educational big data, models based on deep neural networks, referred to as "deep knowledge tracing models", have gradually replaced traditional models due to their simple theoretical foundations and superior predictive performance, and have become a new research hotspot in the field of knowledge tracing. Organized by neural network architecture, the algorithmic details of recent representative deep knowledge tracing models are illustrated, and a comprehensive performance evaluation of these models is conducted on five publicly available datasets. Finally, some use cases and future research directions of deep knowledge tracing are discussed.
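As a concrete example of the model family being surveyed, a minimal deep knowledge tracing (DKT) network is sketched below: an LSTM reads a student's interaction sequence and outputs, at each step, the probability of answering each knowledge concept correctly. The hyperparameters and input encoding convention are illustrative.

```python
import torch
import torch.nn as nn

class DKT(nn.Module):
    def __init__(self, num_skills: int, hidden_size: int = 64):
        super().__init__()
        # Interaction (skill s, correctness c) is one-hot encoded at index
        # s + c * num_skills, hence 2 * num_skills input dimensions.
        self.rnn = nn.LSTM(2 * num_skills, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, num_skills)

    def forward(self, x):                     # x: (batch, seq_len, 2 * num_skills)
        h, _ = self.rnn(x)
        return torch.sigmoid(self.out(h))     # per-step mastery of every skill

model = DKT(num_skills=10)
x = torch.zeros(1, 5, 20)                     # one student, five interactions
x[0, 0, 3] = 1.0                              # step 0: skill 3 answered incorrectly
print(model(x).shape)                         # torch.Size([1, 5, 10])
```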
JIANG Dong, YUAN Ye, ZHANG Xiao-Wei, WANG Guo-Ren
2023, 34(3):1396-1424. DOI: 10.13328/j.cnki.jos.006751
Abstract:In the big data era, an enormous amount of data is collected in every industry with the development of information technology. Data is the foundation of the digital economy and contains great value. However, owing to the lack of efficient and feasible data-sharing mechanisms, data owners seldom exchange data with each other, which leads to the formation of data islands and hinders the healthy development of the big data industry. Hence, assigning a proper price to data and designing an efficient data market platform have become important ways to eliminate data islands and secure sufficient data flow. This study systematically sorts out the technical issues regarding data pricing and trading. Specifically, the difficulties and underlying principles of data pricing and trading are introduced. The life cycle of data in the data market is divided into four stages: data collection and integration, data management and analysis, data pricing, and data trading. Drawing on research in big data management, methods applicable to the first two stages are elaborated. After that, data pricing methods are categorized, and their usage scenarios, advantages, and shortcomings are analyzed. Moreover, the classification of data markets is introduced, and the impact of market types and participants' behavior on the trading process and prices is studied, with game theory and auctions as examples. Finally, future research directions of data pricing and trading are discussed.
ZHU Rui, SONG Fu-Yao, WANG Bin, YANG Xiao-Chun, ZHANG An-Zhen, XIA Xiu-Feng
2023, 34(3):1425-1450. DOI: 10.13328/j.cnki.jos.006718
Abstract:The k representative skyline query is a type of query derived from the traditional skyline query. Given a d-dimensional dataset D, a skyline query finds all objects in D that are not dominated by other objects, which helps users select high-quality objects based on their preferences. However, the set of skyline objects may be large in many cases, and users have to choose target objects from a large number of candidates, so neither selection speed nor selection quality can be guaranteed. Compared with the traditional skyline query, the k representative skyline query chooses the k most "representative" objects from all skyline objects, which effectively solves this problem. Given a sliding window W and a continuous query q, q monitors the objects in the window; when the window slides, q returns the k skyline objects with the largest group dominance size in the window. The key idea behind existing algorithms is to monitor the skyline objects in the current window: when the skyline set is updated, the algorithm updates the k representative skyline set. However, the cost of monitoring the skyline set is usually high, and when the skyline set is large, the computational cost of choosing k representative skyline objects is also high. Thus, existing algorithms cannot work efficiently in high-speed streaming environments. This study proposes the r-approximate k representative skyline query. To support this type of query, a novel framework named PAKRS (predict-based approximate k representative skyline) is proposed. Firstly, PAKRS partitions the current window into a group of sub-windows. Next, the predicted result sets of a few future windows are constructed according to the partition result. In this way, the earliest moment at which newly arrived objects may become skyline objects can be predicted. Secondly, an index named r-GRID is proposed, which helps PAKRS select the r-approximate k representative skyline with O(k/s+k/m) computational cost in 2-dimensional space and O(2Ld/m+2Ld/s) computational cost in d-dimensional space (d>2), where L is a small integer smaller than k. Theoretical analysis shows that the computational complexity of PAKRS is lower than that of state-of-the-art efforts. Extensive experiments have been conducted to confirm the efficiency and effectiveness of the proposed algorithms. Experimental results show that the running time of PAKRS is about 1/4 that of PBA (prefix-based algorithm), about 1/6 that of GA (greedy algorithm), and about 1/3 that of ε-GA (ε-constraint greedy algorithm).
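To ground the terminology, the sketch below computes a skyline (smaller is better on every dimension) and then greedily picks k representatives by group dominance size, i.e., how many points the chosen objects dominate in total. It is a static, brute-force illustration, not the sliding-window PAKRS framework, and the sample points are invented.

```python
def dominates(a, b):
    # a dominates b: no worse on every dimension, strictly better on at least one.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    return [p for p in points if not any(dominates(q, p) for q in points)]

def k_representative_skyline(points, k):
    """Greedily choose k skyline points maximizing group dominance size."""
    dominated = {s: {p for p in points if dominates(s, p)} for s in skyline(points)}
    chosen, covered = [], set()
    while dominated and len(chosen) < k:
        best = max(dominated, key=lambda s: len(dominated[s] - covered))
        covered |= dominated.pop(best)
        chosen.append(best)
    return chosen

pts = [(1, 9), (2, 4), (3, 3), (5, 1), (6, 6), (7, 7)]
print(k_representative_skyline(pts, k=2))
```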
ZENG Meng, ZOU Bei-Ji, ZHANG Wen-Sheng, YANG Xue-Bing, ZHU Cheng-Zhang
2023, 34(3):1451-1469. DOI: 10.13328/j.cnki.jos.006710
Abstract:The Hadoop distributed file system (HDFS) is designed for storing and managing large files, whereas storing and processing a large number of small files consumes a lot of NameNode memory and access time. The small file problem has therefore become an important factor restricting HDFS performance. Aiming at the massive small files in multi-modal medical data, a small file storage method based on two-layer hash coding and HBase is proposed to optimize the storage of massive small files on HDFS. When merging small files, extendible hashing is utilized to build index file buckets, so that the index file can expand dynamically as needed and file appends are supported. To read a file with O(1) time complexity and improve the efficiency of file lookup, the MWHC hash function is used to store the position of each file's index information within the index file, so there is no need to read the index information of all files; only the index information of the corresponding bucket is read. To meet the storage needs of multi-modal medical data, HBase is used to store the index information, and an identification column is set to distinguish medical data of different modalities, which facilitates the storage and management of multi-modal data and improves file reading speed. To further optimize storage performance, an LRU-based metadata prefetching mechanism is established, and the LZ4 compression algorithm is utilized to compress the merged files. The experiments compare file access performance and NameNode memory usage. The results show that, compared with the original HDFS and the HAR, MapFile, TypeStorage, and HPF small file storage methods, the proposed method achieves shorter file access time and can improve the overall performance of HDFS when processing massive small files in multi-modal medical data.
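The basic merge-and-index idea can be pictured with the sketch below: small files are appended into one large file and located later through an offset index with a single seek. In the proposed method the index lives in HBase and is addressed through two layers of hashing (extendible hashing plus MWHC); the plain dictionary here is only a stand-in.

```python
import os

def merge_small_files(paths, merged_path):
    """Append small files into one large file; return {name: (offset, length)}."""
    index = {}
    with open(merged_path, "wb") as out:
        for p in paths:
            with open(p, "rb") as f:
                data = f.read()
            index[os.path.basename(p)] = (out.tell(), len(data))
            out.write(data)
    return index

def read_small_file(merged_path, index, name):
    offset, length = index[name]          # O(1) lookup, then one seek + one read
    with open(merged_path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```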
CHENG Guan-Jie, DENG Shui-Guang, WEN Ying-Ying, YAN Xue-Qiang, ZHAO Ming-Yu
2023, 34(3):1470-1490. DOI: 10.13328/j.cnki.jos.006778
Abstract:With the rapid development of the Internet of Things (IoT), the number of smart devices has increased sharply, and identity authentication has become the primary requirement for ensuring IoT security. Blockchain, as a distributed ledger technology, provides a trusted collaboration environment and a secure data management platform, and using blockchain technology to drive IoT authentication has become a hotspot in academia and industry. This study analyzes the main requirements of authentication mechanism design under cloud computing and cloud-edge collaboration, and summarizes the challenges of applying blockchain technology to IoT scenarios. Relevant research on IoT authentication mechanisms is presented and classified into three categories: key-based, certificate-based, and identity-based authentication. Moreover, existing studies on blockchain-based IoT authentication are analyzed, and the related literature is reviewed according to authentication objects and additional attributes. This study also summarizes security analysis methods for blockchain-based IoT authentication mechanisms from formal and informal perspectives, and finally discusses the prospects of this technology.
LI Tao, YANG An-Jia, WENG Jian, GUO Zi-Fan
2023, 34(3):1491-1511. DOI: 10.13328/j.cnki.jos.006716
Abstract:As the amount of data generated by the Industrial Internet grows, more and more companies choose to outsource the storage of their Industrial Internet data to cloud servers to save storage costs. To prevent the outsourced data from being tampered with or deleted, companies need to audit data integrity regularly. This study proposes a public auditing scheme for Industrial Internet data based on smart contracts. In particular, a series of game-theory-based smart contracts are designed that can efficiently deter malicious participants, including the third-party auditor and the cloud server. Compared with existing collusion-resistant public auditing schemes, the proposed scheme does not rely on complex cryptographic tools to resist malicious participant behavior, and is thus more efficient and better suited to Industrial Internet applications in which huge amounts of data need to be updated frequently. Moreover, the game-based contracts designed in this study, as a stand-alone component, can be effectively combined with existing public auditing schemes to yield a public auditing scheme with better security at no loss of efficiency. Finally, a series of tests are conducted on the proposed contracts in a local environment and on Ropsten, the public Ethereum test network. The results show that the designed contracts are cheap to run and adaptable to the operating environment, have little impact on the efficiency of the original integrity auditing solution, and are more efficient than other integrity schemes that resist malicious auditor behavior.
ZHU Jin-Yu, ZHANG Yu, ZENG Liang-Wei, ZHANG Hong-Li, FANG Bin-Xing
2023, 34(3):1512-1522. DOI: 10.13328/j.cnki.jos.006321
Abstract:The regional network border describes the topological border nodes in cyberspace among countries and regions in the real world. By combining active and passive measurement techniques, this study proposes a dual-stage method for discovering regional network border (RNB) nodes. The first stage discovers candidate sets of regional network border nodes using directed topology measurement and multi-source geolocation. The second stage accurately identifies border nodes from the candidate sets using multi-source information weighted geolocation and dual-PING geolocation. The experiment took China as the target region and discovered 1,644 border nodes. Compared with the CAIDA dataset, 37% of the border nodes discovered by the proposed approach are exclusive discoveries, obtained at only 2.5% of the measurement cost. The accuracy is 99.3% under manual verification and 75% under verification by an ISP operator.
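Stage one of such a pipeline can be illustrated with the sketch below: walking each traceroute path and keeping the last hop geolocated inside the target region whenever the next hop leaves it yields border-node candidates. The geolocation table and addresses are fabricated placeholders; the paper fuses multiple geolocation sources and then verifies candidates with weighted and dual-PING geolocation.

```python
# Toy geolocation table (the addresses and labels are placeholders).
GEO = {"10.0.0.1": "CN", "10.0.0.2": "CN", "172.16.0.9": "US", "172.16.0.10": "US"}

def border_candidates(traceroute_paths, target_region="CN"):
    """Collect hops inside the target region whose next hop is geolocated outside it."""
    candidates = set()
    for path in traceroute_paths:
        for cur, nxt in zip(path, path[1:]):
            if GEO.get(cur) == target_region and GEO.get(nxt) not in (None, target_region):
                candidates.add(cur)
    return candidates

paths = [["10.0.0.1", "10.0.0.2", "172.16.0.9", "172.16.0.10"]]
print(border_candidates(paths))   # {'10.0.0.2'}
```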