Abstract: To solve the problems of data sparseness and knowledge acquisition in translation disambiguation and WSD (word sense disambiguation), this paper introduces a fully unsupervised method based on Web mining and the indirect association of bilingual words on the Web, which provides new knowledge for translation disambiguation. The method assumes that word sense can be determined by the indirect association of bilingual words. Based on the Web, this paper revises four common measures of indirect association and designs three decision methods. These methods are evaluated on the gold-standard Multilingual Chinese-English Lexical Sample Task dataset of SemEval-2007. The experimental results show that the model achieves state-of-the-art results (Pmar=44.4%) and outperforms the best system in SemEval-2007.
ZHANG Xiang-Rong, YANG Chun, JIAO Li-Cheng
Abstract: A Synthetic Aperture Radar (SAR) target recognition approach based on KPCA (kernel principal component analysis) and Laplacian regularized least squares classification is proposed. The KPCA feature extraction method can not only extract the main characteristics of the target, but also reduce the input dimension effectively. Laplacian regularized least squares classification is a semi-supervised learning method: in the target recognition process, the training set is treated as labeled samples and the test set as unlabeled samples. Since the test samples are considered in the learning process, high recognition accuracy is obtained. Experimental results on the MSTAR (moving and stationary target acquisition and recognition) SAR datasets show its good performance and robustness to the azimuth interval. Compared with template matching, support vector machines and the regularized least squares learning method, the proposed method achieves higher SAR target recognition accuracy. In addition, the effect of the number of labeled points on target identification performance is analyzed under different conditions.
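For illustration, here is a minimal numpy sketch of a Laplacian regularized least squares step in the style described above (not the paper's exact formulation): a kernel expansion over all labeled and unlabeled samples is fitted by solving one linear system. The RBF kernel, the k-NN graph, and the weights gamma_A and gamma_I are assumptions, and the KPCA feature extraction stage is omitted.

import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Gaussian (RBF) kernel matrix between rows of X and rows of Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def laplacian_rls(X, y_labeled, gamma_A=1e-2, gamma_I=1e-2, k=5):
    """Semi-supervised Laplacian RLS sketch (Belkin et al. style).
    X: all samples, labeled rows first; y_labeled: +1/-1 labels of the
    first l rows. Returns predicted labels for all rows."""
    n, l = len(X), len(y_labeled)
    K = rbf_kernel(X, X)
    # k-NN graph weights -> unnormalized graph Laplacian L = D - W
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(d2[i])[1:k + 1]      # skip the point itself
        W[i, nn] = W[nn, i] = 1.0
    L = np.diag(W.sum(1)) - W
    J = np.diag([1.0] * l + [0.0] * (n - l))  # selects the labeled points
    Y = np.concatenate([y_labeled, np.zeros(n - l)])
    A = J @ K + gamma_A * l * np.eye(n) + (gamma_I * l / n**2) * (L @ K)
    alpha = np.linalg.solve(A, Y)
    return np.sign(K @ alpha)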
LI Ji-Hong, WANG Rui-Bo, WANG Wei-Lin, LI Guo-Chen
Abstract: Based on the semantic knowledge base of the Chinese FrameNet (CFN) developed in-house at Shanxi University, automatic labeling of CFN semantic roles is cast as a word-level sequential tagging problem by applying the IOB (inside/outside/begin) strategy to the exemplified sentences in the CFN corpus, and a Conditional Random Fields (CRF) model is adopted. The basic tagging unit is the word. The word, its part of speech, its position relative to the target word, the target word itself, and their combinations are chosen as features. Various model templates are formed by varying the window size of each feature, and an orthogonal array from statistics is employed to screen for better templates. All experiments are based on the 6 692 exemplified sentences of 25 frames selected from the CFN corpus. A separate model is trained for each frame on its exemplified sentences by 2-fold cross-validation, and the identification and classification of semantic roles are performed simultaneously. Finally, with the target word in a sentence and its frame name given, the experimental results on all 25 frames reach a precision, recall, and F1-measure of 74.16%, 52.70%, and 61.62%, respectively.
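As a toy illustration of the IOB strategy and the feature template described above (the function names and the exact template below are invented for this sketch, not taken from the paper):

def to_iob(words, spans):
    """Convert role spans [(start, end, role), ...] over a word sequence
    into word-level IOB tags, e.g. B-Agent / I-Agent / O."""
    tags = ["O"] * len(words)
    for start, end, role in spans:       # end is inclusive
        tags[start] = "B-" + role
        for i in range(start + 1, end + 1):
            tags[i] = "I-" + role
    return tags

def features(words, pos, target_idx, i, window=2):
    """Window features for position i: word, POS, the target word, and the
    position of i relative to the target word."""
    feats = {"rel_pos": i - target_idx, "target": words[target_idx]}
    for d in range(-window, window + 1):
        j = i + d
        if 0 <= j < len(words):
            feats["w[%d]" % d] = words[j]
            feats["pos[%d]" % d] = pos[j]
    return feats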
NING Ji-Feng, WU Cheng-Ke, JIANG Guang, LIU Shi-Gang
Abstract: A new external force field for active contour models, called anisotropic gradient vector flow, is presented to solve the problem that gradient vector flow (GVF) has difficulty entering indentations. The diffusion term of GVF is the isotropic and highly smoothing Laplacian operator, which diffuses at the same speed along the tangent and normal directions. The diffusion of the Laplacian operator can, however, be decomposed into the tangent and normal directions according to the local image structure: diffusion along the tangent direction enhances the edge, while diffusion along the normal direction removes noise and propagates the force field. This paper develops an anisotropic gradient vector flow based on this analysis of the diffusion process of GVF along the tangent and normal directions. In the proposed method, the diffusion speeds along the normal and tangent directions are obtained adaptively from the local structure of the image. The experimental results show that, compared with GVF, the proposed method, by treating these two diffusion actions separately, can enter long, thin indentations and improves the segmentation.
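For reference, a minimal numpy sketch of the baseline isotropic GVF iteration that the paper refines (the parameters mu and dt and the periodic-boundary Laplacian are assumptions; the anisotropic variant would replace the single speed mu with direction-dependent speeds derived from the local image structure):

import numpy as np

def gvf(f, mu=0.2, iters=200, dt=0.5):
    """Classic isotropic GVF external force field for an edge map f.
    Returns the force field components (u, v)."""
    fx, fy = np.gradient(f)
    u, v = fx.copy(), fy.copy()
    mag2 = fx**2 + fy**2
    lap = lambda a: (np.roll(a, 1, 0) + np.roll(a, -1, 0) +
                     np.roll(a, 1, 1) + np.roll(a, -1, 1) - 4 * a)
    for _ in range(iters):
        # mu * Laplacian smooths and propagates the field; the data term
        # anchors it to the gradient where edges are strong (mag2 large).
        u = u + dt * (mu * lap(u) - (u - fx) * mag2)
        v = v + dt * (mu * lap(v) - (v - fy) * mag2)
    return u, v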
LUAN Hua, DU Xiao-Yong, WANG Shan
Abstract: The computation of data cubes usually produces huge outputs. There are two popular methods to alleviate this problem, iceberg cubes and closed cubes, and the two can be combined. Given the importance and usability of the closed iceberg cube, computing it efficiently becomes a key research issue. A cache-conscious computation method is proposed in this paper. The data are aggregated in a bottom-up manner, and in the meantime the closed cells covering the aggregate cells are discovered and output. Two pruning strategies are used to avoid unnecessary recursive calls, and Apriori pruning is utilized to support iceberg cube computation. To reduce the number of memory-related stalls and produce the aggregate results efficiently, multiple dimensions are pre-sorted and software prefetching is introduced into the data scans. A comprehensive and detailed performance study is conducted on both synthetic and real data sets. The results show that the proposed closed iceberg cube computation method is efficient and effective.
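A minimal sketch of bottom-up iceberg aggregation with Apriori pruning, in the spirit of the method described above (closedness checking, pre-sorting and the cache-conscious optimizations are omitted; rows are assumed to be tuples indexed by dimension number):

def buc(rows, dims, min_sup, prefix=()):
    """BUC-style bottom-up cube: yield (cell, count) for every group-by
    whose count >= min_sup. dims is a list of dimension indices."""
    if len(rows) < min_sup:
        return                      # Apriori pruning: no descendant of this
                                    # cell can reach min_sup either
    yield prefix, len(rows)
    for d in range(len(dims)):
        groups = {}
        for r in rows:
            groups.setdefault(r[dims[d]], []).append(r)
        for val, part in groups.items():
            yield from buc(part, dims[d + 1:], min_sup,
                           prefix + ((dims[d], val),))

# Example: 3 rows over 2 dimensions, iceberg threshold 2.
rows = [("a", "x"), ("a", "y"), ("a", "x")]
print(list(buc(rows, [0, 1], min_sup=2)))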
GU Yu, YU Ge, HU Xiao-Long, WANG Yi
Abstract: Because wireless radio frequency signals are used in RFID (radio frequency identification) communication, many false negative and false positive readings may be produced, leading to inaccurate event detection. In RFID-based applications, monitored objects often move in groups. In this paper, by analyzing group changes based on the defined association degree and dynamic clusters, a series of models and algorithms for association degree maintenance and data cleaning are proposed. In particular, by compressing the graph model, an association degree maintenance strategy based on a splitting and recombining list model is discussed to improve the time and space performance. Simulation experiments demonstrate the efficiency and accuracy of the proposed data cleaning model.
CHEN Hua-Hui, SHI Bai-Le, QIAN Jiang-Bo, CHEN Ye-Fang
Abstract: In many real-life applications, such as stock markets, network monitoring, and sensor networks, data are modeled as dynamically evolving time series that are continuous and unbounded in nature, and many such data streams usually occur in parallel. Clustering is useful in analyzing such parallel data streams, and this paper is concerned with grouping these evolving streams. For this purpose, a synopsis is maintained dynamically for each data stream. The construction of the synopsis is based on the Discrete Wavelet Transform and utilizes the amnesic feature of data streams. Using the synopses, approximate distances between streams and the cluster centers can be computed quickly, and an efficient online version of the classical K-means clustering algorithm is developed. Experiments demonstrate the effectiveness of the proposed method.
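A minimal sketch of a wavelet synopsis and the approximate distance it supports (keeping the largest Haar coefficients is an assumption about the synopsis; the amnesic weighting of older data and the online K-means loop are omitted):

import numpy as np

def haar_synopsis(x, keep=16):
    """Keep the `keep` largest Haar wavelet coefficients of a window as the
    stream synopsis (coefficient position -> value). The window length is
    assumed to be a power of two."""
    coeffs, a = [], np.asarray(x, dtype=float)
    while len(a) > 1:
        coeffs.extend((a[0::2] - a[1::2]) / np.sqrt(2))   # detail coefficients
        a = (a[0::2] + a[1::2]) / np.sqrt(2)              # averages
    coeffs.extend(a)
    coeffs = np.array(coeffs)
    top = np.argsort(np.abs(coeffs))[-keep:]
    return dict(zip(top.tolist(), coeffs[top]))

def approx_dist(s1, s2):
    """Approximate Euclidean distance between two streams from their
    synopses: the Haar basis is orthonormal, so L2 distance is preserved
    on the kept coefficients (truncation adds a bounded error)."""
    keys = set(s1) | set(s2)
    return sum((s1.get(k, 0.0) - s2.get(k, 0.0)) ** 2 for k in keys) ** 0.5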
XIAO Bo, ZHANG Liang, XU Qian-Fang, LIN Zhi-Qing, GUO Jun
Abstract: The hyperclique pattern is a new type of association pattern in which items are highly affiliated with each other: the presence of an item in one transaction strongly implies the presence of every other item of the same hyperclique pattern. The maximal hyperclique pattern is a more compact representation of a group of hyperclique patterns and is desirable for many applications. The standard algorithms for mining these two kinds of patterns are different. This paper presents a fast algorithm called hybrid hyperclique pattern growth (HHCP-growth), based on the FP-tree (frequent pattern tree), which unifies the mining processes of the two patterns. The algorithm adopts a recursive mining method and exploits several efficient pruning strategies. Propositions are presented and proved to establish the effectiveness of the strategies and the validity of the algorithm. The experimental results show that HHCP-growth is more efficient than the standard hyperclique pattern and maximal hyperclique pattern mining algorithms, particularly for large-scale datasets or at low support levels.
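For concreteness, a sketch of the standard h-confidence measure that defines hyperclique patterns (the measure is standard; the brute-force computation below is only for illustration, whereas HHCP-growth works on the FP-tree):

def h_confidence(itemset, transactions):
    """h-confidence of an itemset: supp(itemset) / max single-item support.
    An itemset is a hyperclique pattern when its h-confidence reaches a
    user-given threshold hc."""
    n = len(transactions)
    supp = sum(1 for t in transactions if itemset <= t) / n
    max_item = max(sum(1 for t in transactions if {i} <= t)
                   for i in itemset) / n
    return supp / max_item if max_item else 0.0

# Example: {a, b} has h-confidence 0.5 / 0.75 = 0.667 here.
ts = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]
print(h_confidence({"a", "b"}, ts))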
Abstract: It is of great importance in database theory to compare the correlation of different XML (extensible markup language) streams in limited space. In the study of these problems, several measures have been proposed to capture the difference between XML trees, e.g. the tree-edit distance. This paper proposes a natural measure l0 employing Hamming norms, i.e. the number of distinct sub-trees between two XML trees, to estimate the correlation. Furthermore, a probabilistic estimation algorithm involving space-bounded pseudorandom generators, stable distributions and hash functions is presented in the data stream model. Theoretical time/space complexity analysis, a correctness proof and experimental simulation show that this algorithm gives the desired approximation.
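A sketch of the stable-distribution technique behind such Hamming-norm estimation (this illustrates the generic p-stable sketch, not the paper's exact algorithm: in the XML setting the coordinates i would be hashed sub-tree identifiers, and a space-bounded pseudorandom generator would replace the explicit projection matrix):

import numpy as np

def stable_rvs(alpha, size, rng):
    """Symmetric alpha-stable samples via the Chambers-Mallows-Stuck method."""
    th = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    return (np.sin(alpha * th) / np.cos(th) ** (1 / alpha)
            * (np.cos((1 - alpha) * th) / w) ** ((1 - alpha) / alpha))

class L0Sketch:
    """p-stable sketch with small p: ||x||_p^p approaches the Hamming norm
    (number of nonzero entries) as p -> 0; estimated via the median trick."""
    def __init__(self, dim, m=400, p=0.02, seed=0):
        rng = np.random.default_rng(seed)
        self.p = p
        self.S = stable_rvs(p, (m, dim), rng)   # projection matrix
        self.y = np.zeros(m)
        # calibration: median of |standard p-stable|, estimated empirically
        self.c = np.median(np.abs(stable_rvs(p, 100000, rng)))

    def update(self, i, delta):                 # stream update: x[i] += delta
        self.y += delta * self.S[:, i]

    def estimate(self):
        # each y_j ~ ||x||_p * Z with Z standard p-stable, so divide out c
        return (np.median(np.abs(self.y)) / self.c) ** self.p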
WANG Zhi-Hui, XU Jian, WANG Wei, SHI Bai-Le
Abstract: To prevent privacy disclosure, the anonymity of sensitive attributes must be preserved in data sharing. The attribute values on quasi-identifiers often have to be generalized before data sharing to avoid linking attacks and thus achieve anonymity. Data generalization increases the uncertainty of attribute values and results in a loss of information to some extent. Traditional data generalization is often based on predefined hierarchies, which causes over-generalization and much unnecessary information loss. In this paper, the attributes in a quasi-identifier are classified into two categories, ordered and unordered attributes, and more flexible generalization strategies are proposed for each. At the same time, the loss of information is defined quantitatively, based on the change in the uncertainty of attribute values during data generalization. Furthermore, data anonymization is modeled as a clustering problem with special constraints, and a clustering-based approach, called L-clustering, is presented for the l-diversity model. L-clustering can meet the requirement of preserving the anonymity of sensitive attributes in data sharing, and greatly reduces the amount of information loss caused by the data generalization performed for anonymization.
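A toy sketch of one way to quantify the information loss of generalizing an equivalence class on ordered attributes (interval width over domain width is an assumption in the spirit of the paper, not its exact definition; an unordered attribute would instead count the fraction of distinct domain values merged):

def info_loss(group, ordered_attrs, domains):
    """Normalized loss of generalizing one equivalence class: for each
    ordered attribute, the width of the generalized interval divided by
    the width of the attribute's domain, averaged over attributes."""
    loss = 0.0
    for a in ordered_attrs:
        vals = [r[a] for r in group]
        lo, hi = domains[a]
        loss += (max(vals) - min(vals)) / (hi - lo)
    return loss / len(ordered_attrs)

# Example: generalizing ages {25, 32} over domain [0, 100] costs 0.07.
recs = [{"age": 25}, {"age": 32}]
print(info_loss(recs, ["age"], {"age": (0, 100)}))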
Abstract: Although axiomatic systems and proof methods for temporal logic have so far found relatively few applications in the query language modeling of temporal databases, as Gabbay et al. showed in 1994, the model of time axes must still be built axiomatically, with soundness and completeness, so that the time axes are depicted at a fine granularity. In this way, the essential characteristics of time and of temporal attributes in databases can be described exactly, and the result can be used in temporal querying. The research therefore starts from the essential characteristics of time attributes. First, this paper expounds the order relations and first-order logic properties of time axes. Second, it models them axiomatically using tense logic and dynamic logic, aiming to reflect the properties of the axes at a fine granularity through logical analysis. In the dynamic logic modeling, in contrast to the static nature of temporal logic systems, the paper mainly deals with the dynamics of the new system. Based on the Lin.Z system of tense logic, the paper constructs a dynamic Lin.Z system with parameters and functions; the parameters of this system are based on actions, which allows actions to be indexed numerically and treated functionally. The results capture the properties and representation of a rule's lifecycle and of the point "Now" in temporal databases, which benefits subsequent research on temporal knowledge representation and temporal querying.
DAI Dong-Bo, TANG Chun-Lei, XIONG Yun
Abstract: Many current sequence clustering algorithms are based on the hypothesis that a sequence can be characterized by its local features, without differentiating between the global and local similarity of sequences in different applications; this hypothesis suits biological sequences, such as DNA and proteins, that contain conserved sub-patterns. However, in some domains, such as comparing customers' purchase behaviors in retail transaction databases or matching patterns in time series data, frequent sub-patterns are difficult to form, and it is more reasonable to cluster such sequence data based on global similarity. Besides, among sequence clustering algorithms based on local similarity, the ability of sub-patterns to characterize sequences should be improved. Therefore, this paper proposes two clustering algorithms for different application fields, GSClu (global similarity clustering) and LSClu (local similarity clustering), based on global and local similarity respectively. GSClu uses the bisecting k-means technique, and LSClu adopts sub-patterns with gap constraints, to cluster the sequence data of the corresponding application field. The sequence data in the experiments include retail transaction data and protein data. The experimental results show that GSClu and LSClu achieve fast processing rates and high clustering quality.
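A minimal sketch of the bisecting k-means scheme underlying GSClu (X is assumed to be a numeric feature matrix already derived from the sequences; the global sequence similarity measure itself is not shown):

import numpy as np

def two_means(X, iters=20, seed=0):
    """Plain 2-means split used inside bisecting k-means."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), 2, replace=False)]
    for _ in range(iters):
        lab = np.argmin(((X[:, None] - c[None]) ** 2).sum(-1), axis=1)
        # keep the old centroid if one side ever becomes empty
        c = np.array([X[lab == j].mean(0) if (lab == j).any() else c[j]
                      for j in (0, 1)])
    return lab

def bisecting_kmeans(X, k):
    """Repeatedly split the largest cluster with 2-means until k clusters."""
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        idx = clusters.pop(i)
        lab = two_means(X[idx])
        clusters += [idx[lab == 0], idx[lab == 1]]
    return clusters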
DAI Dong-Bo, XIONG Yun, ZHU Yang-Yong
Abstract: Sequence data are ubiquitous in many domains, such as text, Web access logs and biological databases, and similarity search over this kind of data is very important for knowledge acquisition and analysis. Reference-based indexing is an effective technique for sequence similarity search; its main idea is to designate some sequences in the database as reference sets, then filter out the sequences unrelated to the query sequence and obtain the answer efficiently. This paper presents a similarity search algorithm, IRI (improved reference indexing), which builds on the current reference-based indexing technique and is more powerful in terms of filtration. First, previous query results are used to accelerate the current query. Then, upper and lower bounds based on sequence characteristics are proposed to make the bounds tighter and improve the filtration capability. Finally, to avoid time-consuming edit distance computation, only partial edit distances between prefix sequences need to be computed, which makes the algorithm run faster. Real data, including DNA and protein sequence data, are used in the experiments. Comprehensive experimental results show that IRI is more efficient than the current reference-based indexing algorithm RI (reference indexing).
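A sketch of the filtering idea that reference-based indexing builds on (the tightened bounds and prefix-based partial edit distances of IRI are omitted; this shows only the generic triangle-inequality pruning, with distances to references assumed precomputed):

def filter_candidates(q_to_refs, db_to_refs, threshold):
    """Triangle-inequality filtering: for any reference r,
    |d(q,r) - d(s,r)| <= d(q,s) <= d(q,r) + d(s,r),
    so many sequences s can be pruned or accepted without ever computing
    the expensive edit distance d(q,s).
    q_to_refs: list of d(q, r_i); db_to_refs: {s: list of d(s, r_i)}."""
    answers, candidates = [], []
    for s, d_s in db_to_refs.items():
        lower = max(abs(dq - ds) for dq, ds in zip(q_to_refs, d_s))
        upper = min(dq + ds for dq, ds in zip(q_to_refs, d_s))
        if lower > threshold:
            continue                     # safely pruned
        if upper <= threshold:
            answers.append(s)            # accepted without verification
        else:
            candidates.append(s)         # needs a real edit-distance check
    return answers, candidates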
LI Fang-Min, MA Xiao-Lin, FANG Yi-Lin, WU Peng
Abstract: After surveying the research on MIMO (multiple-input multiple-output) multi-hop wireless networks, this paper analyzes the MIMO mechanisms for multi-hop wireless networks and discusses the effect of MIMO on each network layer and on the overall design of multi-hop wireless networks. Focusing on cross-layer design, it expounds the fundamental mechanisms of existing representative protocols and algorithms for MIMO-based multi-hop wireless networks, and compares and analyzes their characteristics, performance and deficiencies. Finally, reviewing current research developments, it summarizes the obstacles to the practical application of MIMO-based multi-hop wireless networks and points out the importance of designing an adaptive, high-performance cross-layer model based on MIMO technology for multi-hop wireless networks.
MAO Jian-Bing, MAO Yu-Ming, LENG Su-Peng, BAI Xiang
Abstract: With the rapid development of wireless applications, the IEEE 802.11 task group proposed the QoS-supporting enhanced distributed channel access (EDCA) MAC mechanism, based on the legacy IEEE 802.11 DCF, to provide service differentiation for different types of traffic flows in the network. This paper proposes a novel Markov chain-based analytical model for IEEE 802.11 EDCA. Unlike existing analytical models for EDCA, the proposed model incorporates all three major features of EDCA, i.e., Wmin/Wmax, AIFS (arbitration inter-frame space), and TXOP (transmission opportunity). With this model, the paper evaluates the performance of EDCA analytically, including each priority's transmission throughput, medium access delay, packet loss rate, etc. It analyzes not only the saturated EDCA operation case but also the non-saturated case. Simulation results confirm the accuracy of the model analysis. Based on the model analysis, this paper proposes an admission control scheme, D-TXOP (dynamic TXOP). When making admission decisions, the scheme adaptively adjusts the TXOP parameter settings of different priorities according to their QoS requirements, which greatly enhances the network capacity.
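As a simplified illustration of this style of Markov analysis, the sketch below solves the classical Bianchi fixed point for the plain DCF case only; the paper's EDCA model extends this kind of analysis with per-priority Wmin/Wmax, AIFS and TXOP, none of which are captured here:

def dcf_fixed_point(n, W, m, iters=500):
    """Bianchi-style fixed point for n saturated DCF stations with minimum
    contention window W and m backoff stages: tau is a station's
    transmission probability, p its conditional collision probability."""
    tau = 0.1
    for _ in range(iters):
        p = 1 - (1 - tau) ** (n - 1)
        new = (2 * (1 - 2 * p)
               / ((1 - 2 * p) * (W + 1) + p * W * (1 - (2 * p) ** m)))
        tau = 0.5 * tau + 0.5 * new      # damped iteration for stability
    return tau, p

print(dcf_fixed_point(n=10, W=32, m=5))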
TONG Yun-Hai, TAO You-Dong, TANG Shi-Wei, YANG Dong-Qing
Abstract: In research on privacy-preserving data publishing, present methods usually remove the individual identification attributes and then anonymize the quasi-identifier attributes. This paper analyzes the situation in which multiple records correspond to one individual and proposes the principle of identity-reserved anonymity, which reserves more information while maintaining individual privacy. Generalization and loss-join approaches are developed to meet this requirement. The algorithms are evaluated in an experimental scenario and shown to reserve more information, demonstrating the practical applicability of the approaches.
WANG Peng, LUO Jun-Zhou, LI Wei, BIAN Zheng-Ai, QU Yan-Sheng
Abstract: Using a multi-agent system to control a network is an important way to manipulate controllable networks. Reasonable control of the network relies on the belief reachability of the multi-agent system, which means that the belief of every agent must be consistent with the real state of the network before decisions are made. To study the belief reachability and convergence rate of a multi-agent system, a new model, named the multi-agent system belief distance updating model, which describes the updating process of the distance between an agent's belief and the real network state, is proposed based on the traditional belief updating model, and the rationality of the new model is proved. By transforming the belief updating process into a linear system, the model simplifies the analysis of the belief reachability and convergence rate of the multi-agent system. Based on this model, a sufficient and necessary condition for belief reachability and an upper bound on the convergence rate of the multi-agent system are proved. Besides, the belief reachability and convergence rate of multi-agent systems in fully coupled networks and in scale-free networks are also discussed, considering the characteristics of these two complex network types. The model is applicable to all multi-agent environments and provides a good tool for analyzing the belief reachability of multi-agent systems.
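A toy sketch of the linear-system view (the update matrix A below and its interpretation are assumptions for illustration; the paper's actual updating model defines its own dynamics):

import numpy as np

def belief_reachable(A, tol=1e-9):
    """If the belief-distance vector evolves as d(t+1) = A d(t), beliefs
    converge to the real network state iff the spectral radius of A is
    below 1; the radius also bounds the geometric convergence rate."""
    rho = max(abs(np.linalg.eigvals(A)))
    return rho < 1 - tol, rho

# Example: two agents average each other's beliefs with weight decay 0.8.
A = 0.8 * np.array([[0.5, 0.5],
                    [0.5, 0.5]])
ok, rate = belief_reachable(A)
print(ok, rate)        # True, 0.8: reachable, error shrinks by 0.8 per step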
Abstract: ACO (ant colony optimization) can find the optimal path from source nodes to sink nodes in data query processing for wireless sensor networks. But once all query tasks and query results are transmitted along that single path, the energy of the path will be exhausted. This paper therefore puts forward an Energy Balance Data Query protocol based on Ant Colony Optimization (EBDQ). Through the pheromone, the protocol rewards or punishes a path according to its energy consumption, spreads the energy consumption over different paths, and gradually reduces the energy consumption of the whole network. Both theoretical analysis and simulation results show that EBDQ can prolong the lifetime and reduce the delay of the network.
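A minimal sketch of an energy-aware pheromone rule in the spirit of EBDQ (the reward form q/energy_cost, the evaporation rate rho and the residual-energy heuristic are assumptions, not the protocol's exact equations):

import random

def update_pheromone(tau, path, path_energy_cost, rho=0.1, q=1.0):
    """Evaporate pheromone everywhere, then deposit an amount inversely
    proportional to the path's energy cost, so cheap paths are rewarded
    and energy-hungry paths are punished."""
    for edge in tau:
        tau[edge] *= (1 - rho)                 # evaporation
    for edge in zip(path, path[1:]):
        tau[edge] += q / path_energy_cost      # reward

def choose_next(node, neighbors, tau, energy, alpha=1.0, beta=1.0):
    """Probabilistic next hop weighted by pheromone and residual energy,
    which spreads traffic (and energy drain) over different paths."""
    w = [tau[(node, n)] ** alpha * energy[n] ** beta for n in neighbors]
    return random.choices(neighbors, weights=w)[0]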
Abstract: Tracing propagation paths online when a worm breaks out on a large scale can improve the network's resistance to attack. The existing tracing approaches for obtaining worm propagation paths are all based on off-line analysis and usually have low accuracy. This paper proposes an online Accumulation Algorithm with sliding detection windows, which can quickly and efficiently trace the origin and the initial causal edges of the worm. The algorithm resolves conflicts in choosing causal edges and tackles the problem of merging propagation paths in the consecutive reconstruction phases. The algorithm's accuracy and performance are analyzed. Experimental results reveal that the online Accumulation Algorithm can uncover causal edges even at the initial stage of propagation, and that it achieves a detection accuracy higher than 90% while its running time is only 1% of that of related works.
Abstract: Worms search for targets by means of service requests, and anomalous service requests give an indication of worm propagation. A worm detection system is proposed that uses a positive selection algorithm to characterize normal service requests with self-strings. Bloom filters are used to represent hosts' self-strings and to monitor the network for suspicious service requests. Based on worm properties, the discovered suspicious service requests are correlated in the form of binary trees, and a non-parametric CUSUM (cumulative sum) algorithm is used to monitor the anomaly value of the binary trees so as to detect worm propagation in a timely and accurate manner. Experimental results on the GTNetS (Georgia Tech Network Simulator) platform show that the proposed system detects worms effectively and that its influence on normal network traffic is minor.
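A minimal sketch of the non-parametric CUSUM test used in this kind of detector (beta and the threshold are tuning parameters; the anomaly values of the binary trees would be the input sequence):

def cusum(values, beta, threshold):
    """Non-parametric CUSUM: accumulate positive drift above the offset
    beta; an alarm fires when the statistic crosses the threshold."""
    s = 0.0
    for t, x in enumerate(values):
        s = max(0.0, s + x - beta)
        if s > threshold:
            return t          # alarm time: worm-like anomaly detected
    return None

# Example: background anomaly values around 1, jumping to 5 at step 10.
print(cusum([1] * 10 + [5] * 10, beta=2.0, threshold=6.0))   # fires at step 12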
Abstract: IP multicast and MPLS (multi-protocol label switching) were proposed to support emerging network applications effectively, extending the current IP routing and switching model in different ways, but properly combining IP multicast with MPLS is still a work in progress. In this paper, a novel architecture, called Rendezvous MPLS Multicast, is proposed to support IP multicast in MPLS networks, with the aims of making multicast routing scalable, reducing the label space, and solving some practical implementation problems. A two-layered control plane structure without multicast routing protocols is implemented by overlaying a novel service control plane over the existing routing control plane to support multicast state aggregation, and the interface between the two planes is formulated to support their interaction and cooperation. Moreover, the label distribution process for MPLS P2MP (point to multi-point) connections with the RSVP-TE (resource reservation protocol-traffic engineering) protocol is extended to support the aggregation of multiple label switching paths with traffic engineering and quality-of-service guarantees. An algorithm for selecting Rendezvous Routers for Rendezvous MPLS Multicast is presented. A test-bed with Linux-based implementations of the MPLS multicast router and the IP multicast service control system has also been constructed. The experimental results show the architecture's efficiency in terms of label space reduction and multicast traffic balancing, with the whole system adapting to dynamic changes of group members and network topology.
CHEN Feng, ZHANG Yi, SU Jin-Shu, HAN Wen-Bao
Abstract: The attack graph is a model-based vulnerability analysis technology that can automatically analyze the interrelations among vulnerabilities in a network and the potential threats resulting from them. Since state-based attack graphs cannot be applied to real large networks because of the combinatorial explosion in the number of attack paths, research has shifted to attribute-based attack graphs. Based on attribute-based attack graphs, this paper discusses loop attack paths and the optimization of security measures. For the former, an iterative algorithm is presented to find all the non-loop attack paths to the key attributes with depth less than a given number n. The latter is proved to be an NP-complete problem, and a greedy algorithm with polynomial time complexity is proposed to solve it.
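A sketch of depth-bounded enumeration of non-loop attack paths (a plain DFS for illustration, not the paper's iterative algorithm; the attribute graph is assumed to be a simple adjacency mapping):

def attack_paths(graph, start, target, max_depth):
    """Enumerate all loop-free paths from start to the key attribute
    `target` with depth less than max_depth."""
    paths, stack = [], [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == target:
            paths.append(path)
            continue
        if len(path) >= max_depth:
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:            # forbid loops
                stack.append((nxt, path + [nxt]))
    return paths

g = {"vuln1": ["vuln2", "root"], "vuln2": ["root"]}
print(attack_paths(g, "vuln1", "root", max_depth=4))
# [['vuln1', 'root'], ['vuln1', 'vuln2', 'root']]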
Abstract: Disk I/O is often the bottleneck in computer systems because of the mechanical characteristics of disk systems. In order to tune system performance effectively, collecting disk I/O workload characteristics is an essential step in performance optimization. In contrast with other I/O characteristics collection methods, this paper presents an online I/O characteristics analysis method based on the Xen 3.0 virtual machine system. In virtual machine environments, this method is transparent to the unmodified operating system. With this method, several essential I/O characteristics metrics can be collected, such as disk I/O block size, I/O latency, I/O arrival interval, I/O spatial locality statistics, I/O temporal locality statistics and the I/O operation hotspot distribution. Testing and analysis show that this method incurs little system overhead and has little impact on the I/O performance of the application system. In addition, the I/O characteristics analysis results are demonstrated on a large file copy workload and on Filebench benchmark workloads generated by the filemicro and varmail personalities.
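For illustration, a sketch of how such metrics could be computed from a collected request trace (the tuple layout and the simple seek-distance notion of spatial locality are assumptions; the Xen-level collection mechanism itself is not shown):

import statistics

def io_stats(trace):
    """trace: list of per-request tuples (arrival_ts, sector, size, latency).
    Returns simple versions of the workload metrics listed above."""
    arrivals = [t for t, _, _, _ in trace]
    intervals = [b - a for a, b in zip(arrivals, arrivals[1:])]
    seeks = [abs(s2 - s1)
             for (_, s1, _, _), (_, s2, _, _) in zip(trace, trace[1:])]
    return {
        "avg_block_size": statistics.mean(sz for _, _, sz, _ in trace),
        "avg_latency": statistics.mean(lat for _, _, _, lat in trace),
        "avg_arrival_interval": statistics.mean(intervals),
        "avg_seek_distance": statistics.mean(seeks),  # smaller = more local
    }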
WANG Peng-Yu, CHEN Yun-Ji, SHEN Hai-Hua, CHEN Tian-Shi, ZHANG Heng
Abstract: Memory consistency verification is an important part of the functional validation of CMPs (chip multi-processors). Since checking an execution of a parallel program against a memory consistency model is known to be an NP-hard problem, in practice incomplete verification methods with time complexity higher than O(n^3) are used for memory consistency verification. In this paper, a memory consistency verification tool with linear time complexity, LCHECK, is introduced. In a multi-processor system that supports store atomicity, there must be a time order between two operations with disjoint execution periods: the earlier operation in time order must be observed by the later operation. LCHECK localizes memory consistency verification based on this time order, inferring order edges and checking correctness over bounded sets of operations. LCHECK has been used in the verification of an industrial CMP, Godson-3, and has found many bugs in the memory subsystem of Godson-3.
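A toy sketch of the time-order idea described above (operation execution intervals are assumed to be given; LCHECK's actual edge inference and bounded checking are considerably more involved):

def time_order_edges(ops):
    """Derive time-order edges between operations with disjoint execution
    periods: if operation a ends before operation b starts, a must be
    observed before b. ops: list of (op_id, start_time, end_time)."""
    edges = []
    for a_id, _, a_end in ops:
        for b_id, b_start, _ in ops:
            if a_end < b_start:
                edges.append((a_id, b_id))
    return edges

def has_cycle(nodes, edges):
    """A cycle in the combined order graph signals a consistency violation."""
    adj = {n: [] for n in nodes}
    for a, b in edges:
        adj[a].append(b)
    state = {n: 0 for n in nodes}          # 0 = new, 1 = on stack, 2 = done
    def dfs(n):
        state[n] = 1
        for m in adj[n]:
            if state[m] == 1 or (state[m] == 0 and dfs(m)):
                return True
        state[n] = 2
        return False
    return any(state[n] == 0 and dfs(n) for n in nodes)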