LI Jian-Zhong, WANG Hong-Zhi, GAO Hong
2016, 27(7):1605-1625. DOI: 10.13328/j.cnki.jos.005038
Abstract: The rapid development of information technology has given rise to the big data era. Big data has become an important asset of the information society, providing unprecedentedly rich information for people to perceive, understand and control the physical world. However, with the growth in data scale, dirty data comes along. Dirty data leads to low quality and usability of big data, and seriously harms the information society. In recent years, data usability problems have drawn the attention of both academia and industry. In-depth studies have been conducted, and a series of research results have been obtained. This paper introduces the concept of data usability, discusses the challenges and research issues, reviews the research results, and explores future research directions in this area.
DING Xiao-Ou, WANG Hong-Zhi, ZHANG Xiao-Ying, LI Jian-Zhong, GAO Hong
2016, 27(7):1626-1644. DOI: 10.13328/j.cnki.jos.005040
Abstract: Recently, with the rapid growth of data quantity, users employ a variety of indicators to evaluate and improve data quality from different dimensions. In the course of data quality management, it has been found that many important factors influencing data availability are not completely isolated: in an evaluation mechanism that guides data cleaning rules, these dimensions may be associated with each other. In this paper, several data quality dimensions studied in the literature and used in real information systems are discussed, and the definitions and properties of these dimensions are summarized. In addition, a multi-dimensional data quality assessment framework is proposed. According to four important properties of data availability, namely accuracy, completeness, consistency and currency, the operation methods on a data set and the relationships among these dimensions are constructed. Finally, a multi-dimensional data quality assessment strategy is created. The effectiveness of the proposed strategy is verified by experiments.
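To make the four dimensions concrete, the following is a minimal sketch (not the paper's framework) that scores three of them on a toy table; the field names and the freshness window are hypothetical.

    # Scoring completeness, consistency (w.r.t. one functional dependency)
    # and currency on a toy relation.
    from datetime import datetime, timedelta

    records = [
        {"id": 1, "city": "Beijing", "zip": "100000", "updated": datetime(2016, 6, 30)},
        {"id": 2, "city": "Beijing", "zip": "100001", "updated": datetime(2015, 1, 10)},
        {"id": 3, "city": None,      "zip": "100000", "updated": datetime(2016, 7, 1)},
    ]

    def completeness(rows, fields):
        # fraction of non-null cells among the inspected fields
        cells = [(r[f] is not None) for r in rows for f in fields]
        return sum(cells) / len(cells)

    def consistency(rows, lhs, rhs):
        # fraction of rows not violating the functional dependency lhs -> rhs
        seen, bad = {}, 0
        for r in rows:
            if r[lhs] is not None and seen.setdefault(r[lhs], r[rhs]) != r[rhs]:
                bad += 1
        return 1 - bad / len(rows)

    def currency(rows, field, now, window):
        # fraction of rows updated within the freshness window
        return sum(r[field] >= now - window for r in rows) / len(rows)

    now = datetime(2016, 7, 7)
    print(completeness(records, ["city", "zip"]))                  # 5/6
    print(consistency(records, "zip", "city"))                     # zip -> city violated once
    print(currency(records, "updated", now, timedelta(days=180)))  # 2/3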
CHEN Yu, ZHAO Su-Yun, CHEN Hong, LI Cui-Ping, SUN Hui
2016, 27(7):1645-1654. DOI: 10.13328/j.cnki.jos.005036
Abstract: This paper introduces random sampling into traditional fuzzy rough set methods and proposes a random sampling based statistical rough set model. The work focuses on how to incorporate random sampling into traditional rough sets. First, random sampling is used to propose the concept of k-limit, which can dramatically reduce the amount of computation when computing the lower approximation. Then, statistical upper and lower approximations are formulated. Through mathematical reasoning, theorems and proofs are provided to validate the reliability of the new model. Finally, numerical experiments illustrate the efficiency of the proposed statistical rough sets.
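The following is a minimal sketch of the general idea (not the paper's exact k-limit construction): deciding lower-approximation membership by checking only k sampled elements of each equivalence class instead of the whole class, which trades a small, bounded error probability for less computation.

    import random

    def lower_approximation(universe, equiv_class_of, target, k):
        # Classically, x is in the lower approximation iff [x] is a subset of
        # target; here we test only k sampled elements of [x].
        result = set()
        for x in universe:
            cls = list(equiv_class_of(x))
            sample = cls if len(cls) <= k else random.sample(cls, k)
            if all(y in target for y in sample):   # may err with small probability
                result.add(x)
        return result

    # partition the integers 0..9 by parity; target is almost the even class
    blocks = {0: {0, 2, 4, 6, 8}, 1: {1, 3, 5, 7, 9}}
    target = {0, 2, 4, 6, 8, 9}
    print(lower_approximation(range(10), lambda x: blocks[x % 2], target, k=3))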
LI Tian-Yi, GU Yu, MA Qian, LI Fang-Fang, YU Ge
2016, 27(7):1655-1670. DOI: 10.13328/j.cnki.jos.005033
Abstract: As a method of assessing the validity of conflicting information provided by various data sources, truth discovery has been widely researched in the database community. However, most existing truth discovery solutions are not suitable for applications involving data streams, mainly because they rely on iterative processes. This paper studies the problem of continuous truth discovery over a special kind of data stream: sensor data streams. Combining the characteristics of sensor data and its applications, a strategy is proposed that reduces the iterative processes by changing the frequency of assessing source reliability, thereby improving the efficiency of truth discovery over multi-source sensor data streams. First, definitions are given for when the relative and cumulative errors are small, together with necessary conditions on the variation of source reliability between adjacent time points. Next, a probabilistic model is built to predict the probability of meeting these necessary conditions. Then, by integrating the above conclusions, the maximal assessment period of source reliability is derived under the condition that the cumulative prediction error stays below a given threshold at a certain confidence level, thus transforming the truth discovery problem into an optimization problem. Furthermore, an algorithm, CTF-Stream (continuous truth finding over sensor data streams), is constructed to assess source reliability at an adaptive frequency. CTF-Stream utilizes historical data to dynamically determine when source reliability needs to be re-assessed, and finds the truth with the accuracy required by users while improving efficiency. Finally, both the efficiency and accuracy of the presented methods are validated by extensive experiments on real sensor datasets.
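For readers unfamiliar with the iterative process being reduced, the following is a minimal sketch of one generic truth-discovery iteration (not CTF-Stream itself): estimate the truth as a reliability-weighted average, then update each source's reliability from its deviation. CTF-Stream's contribution is deciding how rarely this re-assessment must run on a stream.

    def truth_discovery(readings, iterations=10, eps=1e-6):
        # readings: {source: [v_1, ..., v_T]} -- one value per time point
        sources = list(readings)
        T = len(next(iter(readings.values())))
        weight = {s: 1.0 for s in sources}
        for _ in range(iterations):
            truths = [
                sum(weight[s] * readings[s][t] for s in sources) / sum(weight.values())
                for t in range(T)
            ]
            for s in sources:   # lower average error => higher reliability
                err = sum(abs(readings[s][t] - truths[t]) for t in range(T)) / T
                weight[s] = 1.0 / (err + eps)
        return truths, weight

    readings = {"s1": [20.1, 21.0], "s2": [20.0, 21.1], "s3": [25.0, 17.0]}
    truths, weight = truth_discovery(readings)   # s3 ends up with low weight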
JIN Che-Qing, LIU Hui-Ping, ZHOU Ao-Ying
2016, 27(7):1671-1684. DOI: 10.13328/j.cnki.jos.005037
Abstract: Along with the development of the economy and information technology, large amounts of data are produced in many applications. However, due to factors such as hardware equipment, manual operations, and multi-source data integration, serious data quality issues such as data inconsistency arise, which makes it more challenging to manage data effectively. Hence, it is crucial to develop new data cleaning technology to improve data quality and better support data management and analysis. Existing work in this area mainly focuses on the situation where functional dependencies are used to describe data inconsistency. Once violations are detected, some tuples must be modified via updates (rather than insertions or deletions) to satisfy the functional dependency set. Besides functional dependencies, there also exist other types of constraints, such as hard constraints, quantity constraints, equivalence constraints, and non-equivalence constraints, and repairing becomes more difficult when more kinds of inconsistency are involved. This paper addresses the general scenario where functional dependencies and other constraints co-exist, and designs a corresponding data repair algorithm to improve data quality effectively. Experimental results show that the proposed method performs effectively and efficiently.
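As background, the following is a minimal sketch of detecting functional dependency violations, the step any FD-based repair starts from (the paper's repair algorithm, which also handles the other constraint types, is not reproduced; the relation and FD are hypothetical).

    from collections import defaultdict

    def fd_violations(tuples, lhs, rhs):
        # group tuple indices by their LHS value; a group violates lhs -> rhs
        # if it contains more than one distinct RHS value
        groups = defaultdict(list)
        for i, t in enumerate(tuples):
            groups[tuple(t[a] for a in lhs)].append(i)
        return [idx for idx in groups.values()
                if len({tuple(tuples[i][a] for a in rhs) for i in idx}) > 1]

    rows = [
        {"zip": "100000", "city": "Beijing"},
        {"zip": "100000", "city": "Shanghai"},   # violates zip -> city
        {"zip": "200000", "city": "Shanghai"},
    ]
    print(fd_violations(rows, ["zip"], ["city"]))   # [[0, 1]]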
XU Yao-Li, LI Zhan-Huai, CHEN Qun, ZHONG Ping
2016, 27(7):1685-1699. DOI: 10.13328/j.cnki.jos.005041
Abstract: Various techniques have been proposed to repair inconsistent relational data that violates functional dependencies, optimizing the repair plan by the metric of repair cost. However, they may fall short when the erroneous data occurs on the left-hand side of a functional dependency, or when repair cost is not a reliable optimization indicator. In this paper, a novel repairing approach based on the possible world model is proposed. It first constructs candidate repair plans and then estimates their possible world probabilities, which are measured by quantifying both the repair cost and the appropriateness of candidate values with regard to other related attribute values in the relational data. Finally, extensive experiments on synthetic datasets show that the proposed approach performs considerably better than the cost-based approach in repair quality.
ZHANG Zhi-Gang, JIN Che-Qing, WANG Xiao-Ling, ZHOU Ao-Ying
2016, 27(7):1700-1714. DOI: 10.13328/j.cnki.jos.005035
Abstract: Important locations mainly refer to the places where people spend much of their time in daily life, including home and working places. The development and popularization of smart cell phones bring great convenience to people's daily life. Besides making calls and surfing the Internet, the logs generated when phones visit base stations also contribute to mining users' patterns, such as important location discovery. However, it is challenging to deal with this kind of trajectory data, due to its huge volume, data inaccuracy and the diversity of cell phone users. In this research, a general framework is proposed to improve the usability of trajectory data. The framework includes a filter to improve data usability and a model to produce the mining results. Two concrete strategies, namely GPMA (grid-based parallel mining algorithm) and SPMA (station-based parallel mining algorithm), can be embedded into this framework separately. Moreover, three optimization techniques are developed for better performance: (1) a data fusion method, (2) an algorithm to find users who have no working place, and (3) an algorithm to find people who work at night and correct their important locations. Theoretical analysis and extensive experimental results on real datasets show that the proposed algorithms are efficient, scalable, and effective.
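The following is a minimal sketch of the grid-based idea behind an algorithm like GPMA (an illustration only; the log schema, hour ranges and cell size are hypothetical): snap each base-station log to a grid cell and take the most visited night-time cell as home and the most visited working-hour cell as the workplace.

    from collections import Counter, defaultdict

    CELL = 0.01   # grid cell size in degrees

    def cell_of(lat, lng):
        return (round(lat / CELL), round(lng / CELL))

    def important_locations(logs):
        # logs: iterable of (user, hour_of_day, lat, lng)
        night, day = defaultdict(Counter), defaultdict(Counter)
        for user, hour, lat, lng in logs:
            bucket = night if (hour >= 22 or hour < 7) else day if 9 <= hour < 17 else None
            if bucket is not None:
                bucket[user][cell_of(lat, lng)] += 1
        return {u: {"home": night[u].most_common(1)[0][0] if night[u] else None,
                    "work": day[u].most_common(1)[0][0] if day[u] else None}
                for u in set(night) | set(day)}

    logs = [("u1", 23, 39.90, 116.40), ("u1", 2, 39.90, 116.40),
            ("u1", 10, 39.95, 116.30), ("u1", 14, 39.95, 116.30)]
    print(important_locations(logs))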
WANG Jia-Ying, WANG Bin, YANG Xiao-Chun
2016, 27(7):1715-1728. DOI: 10.13328/j.cnki.jos.005034
Abstract: With the rapid development of third and next generation sequencing techniques, genomic sequences such as DNA grow explosively, and processing such big data efficiently is a great challenge. Research has found that although these sequences are very large, they are highly similar to each other, so it is feasible to reduce the space cost by storing only their differences from a reference sequence. New studies show that it is possible to search directly on the compressed data. The goal of this study is to improve the scalability of such indexing and searching techniques to meet the growing demands of big data. Building on an existing method, this work focuses on compressing sequences against the reference sequence. Several optimization techniques are proposed to perform efficient exact and approximate search with arbitrary query length on the compressed data. The process is further improved by utilizing parallel computing to increase the query efficiency for big data. An experimental study demonstrates the efficiency of the proposed method.
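The following is a minimal sketch of referential compression, the general idea of storing only differences from a reference (not the paper's index structure): encode a target sequence as copy operations against the reference plus literal characters for the mismatching spans.

    from difflib import SequenceMatcher

    def ref_compress(reference, target):
        ops = []
        for tag, i1, i2, j1, j2 in SequenceMatcher(None, reference, target,
                                                   autojunk=False).get_opcodes():
            if tag == "equal":
                ops.append(("copy", i1, i2 - i1))      # (reference offset, length)
            elif tag in ("replace", "insert"):
                ops.append(("literal", target[j1:j2])) # raw characters
            # 'delete' needs no output: those reference bases are simply skipped
        return ops

    def ref_decompress(reference, ops):
        return "".join(reference[op[1]:op[1] + op[2]] if op[0] == "copy" else op[1]
                       for op in ops)

    ref = "ACGTACGTACGT"
    tgt = "ACGTTCGTACGT"   # one substitution relative to the reference
    ops = ref_compress(ref, tgt)
    assert ref_decompress(ref, ops) == tgt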
HUANG Dong-Mei, GENG Xia, WEI Li-Fei, SU Cheng
2016, 27(7):1729-1740. DOI: 10.13328/j.cnki.jos.005039
Abstract: Remote sensing images have multi-temporal, multi-semantic and multi-spectral characteristics, and play an important role in many industries as well as in the military. The efficiency and accuracy of ciphertext queries over huge amounts of remote sensing images directly affect the general applicability and real-time performance of remote sensing big data, and secure retrieval over ciphertext is an important criterion for the usability of remote sensing images. This paper first proposes an encryption/query scheme for remote sensing images consisting of an encryption algorithm and a query algorithm, and then constructs an image query scheme on encrypted remote sensing images based on an improved Henon mapping. Based on the imaging principle and the characteristics of multi-band remote sensing images, the paper uses the improved Henon mapping to encrypt the gray values of the remote sensing image for each band. According to the characteristics of remote sensing big data, the process extracts a feature vector from the gray value statistics of each band, and then queries the target remote sensing image with a matching algorithm. Finally, experiments are carried out on encrypting and processing big data from Landsat 8 remote sensing satellites. The experimental results show that the presented algorithms effectively improve the security and accuracy of remote sensing image usage, while the scheme has low computational complexity and communication cost.
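The following is a minimal sketch of gray-value encryption with the classical Henon map (the paper uses an improved variant whose exact form is not given here): iterate the map, derive one key byte per pixel, and XOR it with the gray value, so decryption is the same operation.

    def henon_keystream(n, x=0.1, y=0.3, a=1.4, b=0.3, skip=1000):
        # classical Henon map: x' = 1 - a*x^2 + y, y' = b*x; skip a transient
        # so the usable stream sits on the chaotic attractor
        stream = []
        for i in range(skip + n):
            x, y = 1 - a * x * x + y, b * x
            if i >= skip:
                stream.append(int(abs(x) * 10**6) % 256)
        return stream

    def xor_encrypt(gray_values, key=(0.1, 0.3)):
        ks = henon_keystream(len(gray_values), *key)
        return [g ^ k for g, k in zip(gray_values, ks)]   # decryption is identical

    band = [12, 200, 37, 255, 0, 128]    # gray values of one band
    cipher = xor_encrypt(band)
    assert xor_encrypt(cipher) == band   # XOR with the same keystream restores it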
PAN Jian-Dong, CHEN Li-Qian, HUANG Da-Ming, SUN Hao, ZENG Qing-Kai
2016, 27(7):1741-1756. DOI: 10.13328/j.cnki.jos.004836
Abstract: Abstract interpretation provides a general framework to generate program invariants automatically. However, most existing numerical abstract domains under this framework can only express constraint sets that are geometrically convex. Therefore, using convex abstract domains to analyze program structures involving disjunctive semantics (whose constraint sets are non-convex) may lead to imprecise results. This paper presents a novel approach based on loop decomposition and deduction to improve invariant generation for loop structures with explicit and implicit disjunctive semantics. The approach alleviates the severe semantic loss that occurs during abstract interpretation of such loop structures. Experimental results show that, compared with existing approaches, the presented approach generates more precise invariants for loop structures involving disjunctive semantics, which is also helpful for reasoning about certain security properties.
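The following is a minimal illustration of why convex domains lose disjunctive information (generic interval-domain behavior, not the paper's technique): joining the two branch states of "if (*) x = -1; else x = 1;" yields the interval [-1, 1], which also admits the infeasible value x == 0.

    from dataclasses import dataclass

    @dataclass
    class Interval:
        lo: float
        hi: float
        def join(self, other):   # least upper bound in the interval lattice
            return Interval(min(self.lo, other.lo), max(self.hi, other.hi))

    then_branch = Interval(-1, -1)   # state after x = -1
    else_branch = Interval(1, 1)     # state after x = 1
    print(then_branch.join(else_branch))   # Interval(lo=-1, hi=1): 0 is included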
GE Xu-Jun, WANG Ling, XU Li-Hua, GUO Jian, ZHU Hui-Biao
2016, 27(7):1757-1771. DOI: 10.13328/j.cnki.jos.004834
Abstract: Model-driven development is currently a highly regarded software development paradigm among software developers and researchers, and model-based testing techniques are usually applied during development to ensure the quality of software systems. With the growing size and complexity of software systems, maintaining the consistency between software models and their implementations becomes more and more challenging. While traditional model-based testing focuses on ensuring that the software implementation complies with its designed model, this work particularly addresses the situation where the implementation is modified while the software models are left outdated due to workarounds or other unexpected changes during development. The paper presents an automated consistency checking framework, ProMiner, which extends traditional model-based testing by mining software properties that represent the identified inconsistencies as linear temporal logic (LTL) formulas. Experiments show that this extended consistency checking technique effectively helps software designers to narrow down the specific locations of software models that need to be updated with respect to the running implementation.
XU Hong-Zhen, ZENG Guo-Sun, WANG Xiao-Yan
2016, 27(7):1772-1788. DOI: 10.13328/j.cnki.jos.004841
Abstract: How to verify the correctness of the dynamic evolution process using model checking techniques is a current challenge in the research field of dynamic software architecture evolution, and existing approaches in this direction rarely consider the relevant conditions of the evolution. To solve this problem, this paper proposes a state model of dynamic software architecture evolution based on the conditional state transition system. The approach maps software architecture hypergraphs to states, and the application of evolution rules to the conditional state transition relation. It also provides a method, with corresponding algorithms, for mapping conditional hypergraph grammars of dynamic software architecture evolution to conditional state transition systems, thereby constructing the conditional state transition system of a dynamic software architecture evolution. Furthermore, the bisimulation equivalence between the conditional hypergraph grammar of dynamic software architecture evolution and the conditional state transition system under this mapping is proved. Finally, the paper presents a case study that applies the proposed method and model checking to verify corresponding properties of dynamic software architecture evolution, demonstrating the effectiveness of the proposed method.
WANG Yi-Zhuo, CHEN Xu, JI Wei-Xing, SU Yan, WANG Xiao-Jun, SHI Feng
2016, 27(7):1789-1804. DOI: 10.13328/j.cnki.jos.004842
Abstract: The task-based parallel programming model has become the mainstream parallel programming model for improving the performance of parallel computer systems by exploiting task parallelism. This paper presents a novel task-based parallel programming model that supports hardware fault tolerance. The model incorporates fault tolerance mechanisms into the task-based parallel programming model and aims to improve both system performance and reliability. It uses the task as the basic unit of scheduling, execution, fault detection and recovery, and supports fault tolerance at the application level. A buffer-commit computation model is used to tolerate transient faults, and an application-level diskless checkpointing technique is employed to tolerate permanent faults. A work-stealing scheduling scheme supporting fault tolerance is adopted to achieve dynamic load balancing. Experimental results show that the proposed model provides hardware fault tolerance with low performance overhead.
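The following is a minimal sketch of application-level diskless checkpointing with XOR parity, a standard technique in this area (the paper's buffer-commit model and scheduler are not reproduced): the parity checkpoint lives in memory rather than on disk, and any single lost worker state can be rebuilt from it.

    from functools import reduce

    def xor_bytes(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def take_checkpoint(states):
        # states: {worker_id: state_bytes of equal length}; parity of all states
        return reduce(xor_bytes, states.values())

    def recover(parity, surviving_states):
        # the lost worker's state is the parity XORed with all survivors
        return reduce(xor_bytes, surviving_states, parity)

    states = {0: b"taskA", 1: b"taskB", 2: b"taskC"}
    parity = take_checkpoint(states)
    lost = recover(parity, [states[0], states[2]])   # worker 1 failed permanently
    assert lost == states[1]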
2016, 27(7):1805-1821. DOI: 10.13328/j.cnki.jos.005053
Abstract: The privacy protection problem in location based services has received continuous attention in recent years, especially location privacy protection in location based nearest neighbor queries. Existing work, however, often neglects or fails to accommodate query users' privacy preference requirements. The constraint of privacy preference complicates the trade-off between location privacy protection and quality of service in privacy-aware location based query services. Several issues need to be addressed: (1) privacy preferences and privacy models conflict diametrically in the individuality and commonality they focus on; (2) there is a dilemma between privacy preferences and query performance, since preferences require intermediate query results to be dynamic and adjustable while simplified intermediate query results commonly promise good performance; and (3) privacy preferences invite intersection inference attacks on candidate answer sets in continuous location based queries. In this survey, the privacy preference problem in location based nearest neighbor query is identified and presented. The performance of existing location obfuscation and query techniques, as well as their ability to accommodate users' privacy preferences, is discussed in terms of the location obfuscation principle and the inherent restricting mechanism between nearest neighbor query performance and location protection. Subsequently, the inherent restricting mechanism between location privacy preserving nearest neighbor queries and privacy preference support is detailed, and some major problems originating from location privacy preferences are demonstrated. Finally, some possible solutions to these problems are elaborated and future research work is suggested.
LIANG Jun-Bin, ZOU Shao-Jun, CHEN Ning-Jiang, LI Tao
2016, 27(7):1822-1840. DOI: 10.13328/j.cnki.jos.004926
Abstract: For data gathering in large-scale wireless sensor networks, not only the energy consumption of nodes but also the data collection delay should be considered, and it is a challenging problem to balance the energy consumption of nodes while minimizing the data collection delay at the same time. In order to balance the energy consumption of nodes, this paper utilizes a mobile data collector to collect data in the network, and proposes an algorithm named DC-Collection to minimize both data collection delay and energy consumption. First, DC-Collection constructs a shortest path tree; if the network is not connected, more than one shortest path tree is constructed (i.e., there is a set of shortest path trees in the network). Second, some nodes in the trees are selected as collective nodes or lingering nodes, where a collective node is the root of a height-limited tree that receives the data sent from its descendant nodes, and a lingering node is a normal node that the mobile data collector visits at a given time to collect sensing data. The mobile data collector can collect the data of all nodes in the network as long as it traverses the locations of all lingering nodes. The heights of the trees rooted at the collective nodes are limited to be smaller than h, and at least one lingering node exists in the communication area of each collective node. Third, DC-Collection adjusts the structures of the height-limited trees so that nodes with higher energy levels possess more descendants, maximizing the network lifetime by balancing the energy consumption of nodes. Finally, the mobile data collector starts from the sink, traverses the locations of the lingering nodes in sequence to collect data, and after collecting all the data it returns to the starting point and uploads the data to the sink. Theoretical analyses and simulations show that, compared with existing algorithms, DC-Collection can not only balance the energy consumption of nodes to prolong the network lifetime, but also shorten the path that the mobile data collector walks, reducing the data collection delay.
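The following is a minimal sketch of one simple way to carve a shortest path tree into height-limited subtrees (an illustration of the structure DC-Collection relies on, not its actual energy-aware selection rule): every node whose depth is a multiple of h becomes a collective node, so each subtree has height smaller than h.

    def collective_nodes(parent, depth, h):
        # parent: {node: parent or None}; depth: {node: hop distance from root}
        assign = {}
        for node in parent:
            cur = node
            while depth[cur] % h != 0:   # climb to the nearest collective ancestor
                cur = parent[cur]
            assign[node] = cur
        return assign

    # a chain 0-1-2-3-4 rooted at node 0, partitioned with h = 2
    parent = {0: None, 1: 0, 2: 1, 3: 2, 4: 3}
    depth = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}
    print(collective_nodes(parent, depth, h=2))   # {0: 0, 1: 0, 2: 2, 3: 2, 4: 4}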
ZHOU Ai-Ping, CHENG Guang, GUO Xiao-Jun, LIANG Yi-Xin
2016, 27(7):1841-1860. DOI: 10.13328/j.cnki.jos.004843
Abstract: The detection of super points is very important for network applications such as network security and network management. Due to the imbalance between the massive amount of traffic data in high-speed networks and the limited system resources, it is a great challenge to accurately monitor traffic in high-speed networks online. With the development of multi-core processors, their parallelism has become an effective means of improving performance. Considering that existing methods based on flow sampling suffer from heavy computational load, low detection accuracy and poor timeliness, this article presents a new approach, PDS (parallel data streaming). The method constructs a parallel reversible sketch and builds a compact summary of node connection degrees; the addresses of super points are then reconstructed by a simple calculation without storing node address information, which makes PDS efficient and precise. The experimental results show that PDS is superior to the CSE (compact spread estimator) and JM (joint data streaming and sampling) methods, and thus satisfies the application requirements of traffic monitoring in high-speed networks.
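The following is a minimal sketch of estimating a source's connection degree (spread) with a per-source bitmap and linear counting, the generic building block behind estimators such as CSE (PDS's reversible parallel structure is not reproduced; the bitmap size and threshold are illustrative).

    import math, hashlib

    M = 64   # bits per source bitmap

    def h(src, dst):
        digest = hashlib.md5(f"{src}-{dst}".encode()).digest()
        return int.from_bytes(digest[:4], "big") % M

    def update(bitmaps, src, dst):
        # duplicate (src, dst) pairs hash to the same bit, so repeats are free
        bitmaps.setdefault(src, [0] * M)[h(src, dst)] = 1

    def spread(bitmaps, src):
        zeros = bitmaps[src].count(0)
        # linear counting estimate: n ~= M * ln(M / zero_bits)
        return M * math.log(M / zeros) if zeros else float(M)

    bitmaps = {}
    for dst in range(40):                # one source contacting 40 destinations
        update(bitmaps, "10.0.0.1", dst)
    print(round(spread(bitmaps, "10.0.0.1")))   # close to 40; flag as a super
                                                # point if above a set threshold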
LI Xue-Jun, WU Yang, LIU Xiao, CHENG Hui-Min, ZHU Er-Zhou, YANG Yun
2016, 27(7):1861-1875. DOI: 10.13328/j.cnki.jos.004879
Abstract: A scientific workflow is a complicated data-intensive application. How to achieve an effective data placement scheme in the hybrid cloud environment has become a crucial issue nowadays, especially with the new challenges brought by security requirements. Traditional data placement strategies usually adopt a load-balancing-based partition model to allocate datasets; although such placement schemes achieve good load balancing, their data transfer time may not be optimal. In contrast to traditional strategies, this paper focuses on the hybrid cloud environment and proposes a data dependency destruction-based partition model that minimizes the destruction of data dependencies during partitioning. In addition, it presents a novel datacenter-oriented data placement strategy that allocates highly dependent datasets to the same datacenter according to the new partition model, thus significantly reducing the data transfer time between datacenters. Experimental results show that the proposed strategy can effectively reduce data transfer time during workflow execution.
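The following is a minimal sketch of dependency-aware placement, a greedy stand-in for the paper's partition model (the datasets, weights and capacities are hypothetical): place each dataset in the datacenter where its summed dependency with already-placed datasets is largest, subject to a capacity limit, so that cross-datacenter dependencies, and hence transfers, shrink.

    def place(datasets, dep, centers, capacity):
        # dep[(a, b)] = dependency weight, e.g., number of tasks using both a and b
        placement = {}
        for d in datasets:
            def gain(c):
                return sum(dep.get((d, o), 0) + dep.get((o, d), 0)
                           for o, oc in placement.items() if oc == c)
            feasible = [c for c in centers
                        if sum(oc == c for oc in placement.values()) < capacity[c]]
            placement[d] = max(feasible, key=gain)
        return placement

    dep = {("d1", "d2"): 5, ("d2", "d3"): 1, ("d3", "d4"): 4}
    print(place(["d1", "d2", "d3", "d4"], dep, ["dc1", "dc2"],
                {"dc1": 2, "dc2": 2}))   # only the weight-1 edge is cut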
XU Si-Yao, LIN Wei-Wei, James Z. WANG
2016, 27(7):1876-1887. DOI: 10.13328/j.cnki.jos.004918
Abstract: This paper proposes a novel method of allocating virtual machines by statistical resource multiplexing based on the characteristics of virtual machines' peak workloads. In a cloud environment, if the peak workloads of multiple virtual machines overlap, the resource utilization of the cloud system can be significantly low when all these virtual machines enter their non-peak phases. If the overlap of workload peaks among virtual machines can be avoided, resource utilization will not fluctuate so much (heavily loaded during peak periods and largely idle during non-peak periods). Since the workload of an application usually follows a cyclic pattern, historical data can be analyzed to predict future workload. This paper models the workload characteristics of virtual machines by monitoring their peak workloads, and uses a similarity matrix of VM workloads to allocate virtual machines so that their workload peaks do not overlap. A performance study using CloudSim demonstrates that, compared to the random allocation algorithm, the proposed virtual machine allocation algorithm improves CPU utilization by 8.9% to 12.4% under different workloads, while the number of hosts needed is reduced by 8.2% to 11.0% under the same workload requirements.
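The following is a minimal sketch of peak-aware allocation in the spirit of this method (the paper's similarity matrix construction and CloudSim setup are not reproduced): describe each VM by a binary peak vector over daily time slots, use the dot product as the peak-overlap similarity, and place each VM on the host where the added overlap is smallest.

    def overlap(u, v):
        return sum(a & b for a, b in zip(u, v))   # number of shared peak slots

    def allocate(vm_peaks, n_hosts, per_host):
        hosts = [[] for _ in range(n_hosts)]      # each entry: list of VM ids
        for vm, peaks in vm_peaks.items():
            def added(hid):
                return sum(overlap(peaks, vm_peaks[o]) for o in hosts[hid])
            hid = min((h for h in range(n_hosts) if len(hosts[h]) < per_host),
                      key=added)
            hosts[hid].append(vm)
        return hosts

    vm_peaks = {               # 6 slots per day: 1 = peak, 0 = off-peak
        "vm1": [1, 1, 0, 0, 0, 0],
        "vm2": [1, 1, 0, 0, 0, 0],   # peaks together with vm1
        "vm3": [0, 0, 0, 1, 1, 0],   # complements vm1
    }
    print(allocate(vm_peaks, n_hosts=2, per_host=2))   # vm1 and vm3 co-locate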