Abstract:
LIU Wei , MENG Xiao-Feng , LING Yan-Yan
Abstract:A flood of information is hidden behind the Web-based query interfaces with specific query capabilities, which makes it difficult to capture the characteristics of the Web database, such as the topic and the frequency of updates. This poses a great challenge for Deep Web data integration. To address this problem, a graph-based approach WDB-Sampler for Web database sampling is proposed in this paper, which can incrementally obtain sample records from a Web database through its query interface. That is, a number of samples are obtained for the current query, and one of them is transformed into the next query. The important characteristic of this approach is it can adapt to different kinds of attributes on the query interfaces. The extensive experiments on the local simulation Web databases and the real Web databases prove that the approach can achieve high-quality samples from a Web database at a lower cost.
KOU Yue , SHEN De-Rong , LI Dong , NIE Tie-Zheng
Abstract:According to analyzing the traditional entity identification methods, a deep Web entity identification mechanism based on semantics and statistical analysis (SS-EIM) is presented in this paper, which includes text matching model, semantics analysis model and group statistics model. Also a three-phase gradual refining strategy is adopted, which includes text initial matching, representation relationship abstraction and group statistics analysis. Based on the text characteristics, semantic information and constraints, the identification result is revised continuously to improve the accuracy. By performing the self-adaptive knowledge maintenance strategy, the content of representation relationship knowledge database can be more complete and effective. The experiments demonstrate the feasibility and effectiveness of the key techniques of SS-EIM.
YANG Shao-Hua , LIN Hai-Lüe , HAN Yan-Bo
Abstract:A substantial fraction of the Web consists of pages that are dynamically generated using a common template populated with data from databases, such as product description pages on e-commerce sites. The objective of the proposed research is to automatically detect the template behind these pages and extract embedded data (e.g., product name, price...). The template detection problem is formalized and an analysis of the underlying structure of template-generated pages is made. A template detection approach is presented and the detected templates are used to extract data from instance pages. Comparing with many other existing work, the approach is applicable for both "list pages" and "detail pages". Experimental results on two large third-party test beds show that the approach can achieve high extraction accuracy.
LING Yan-Yan , MENG Xiao-Feng , LIU Wei
Abstract:An approach based on the word frequency is proposed in this paper to estimate the size of Web database. It obtains a random sample on a certain attribute by analyzing the attribute correlations among all the textual attributes in the query interface. The size of a Web database can be estimated by submitting probing queries which are generated by top-k frequent words to the query interface of a Web database. The experiments on several real-world databases have proved that this approach is effective and can achieve high accuracy in estimating the size of Web databases.
YUAN Liu , LI Zhan-Huai , CHEN Shi-Liang
Abstract:A semantic annotation method for Web database query result is proposed in this paper by adopting the deep annotation procedure in semantic Web. As a global schema Web database should be followed, domain ontology is introduced to the annotation procedure for a completed and consistent annotation result. The query interface and the query result features are analyzed in detail, the strategy of query condition reconfigured is adopted, and then the semantic markups of query result are determined. By collecting Web database from different domains, the experiments indicate that the approach proposed can annotate the Web database query result properly under the support of domain ontology.
WANG Hui , LIU Yan-Wei , ZUO Wan-Li
Abstract:In hidden Web domain, general-purpose search engines (i.e., Google and Yahoo) have their shortcomings. They cover less than one-third of the data stored in document databases. Unlike the surface Web, if combined, they cover roughly the same data. Hidden Web is a highly important information source since the content provided by many hidden Web sites is often of very high quality. This paper proposes a three-step framework to automatically identify domain-specific hidden Web entries. With those obtained query interfaces, they can be integrated to obtain a unified interface which is given to users to query. Eight large-scale experiments demonstrate that the technique can find domain-specific hidden Web entries accurately and efficiently.
XU He-Xiang , WANG Xin-Yin , WANG Shu-Yun , HU Yun-Fa
Abstract:Based on the research on the dependence of the components in the deep Web integration (executive partial order and knowledge dependency), a knowledge-based method is given to process the changes in such integration, which includes environmental changes processing model, a self-adaptive software architecture and algorithm. This method can provide a reference to the further research or toward application for the large-scale deep Web integration. The experimental results show that the method can not only process the changes, but also highly improve the performance of the integrated system.
MA Jun , SONG Ling , HAN Xiao-Hui , YAN Po
Abstract:New techniques are discussed for enhancing the classification precision of deep Web databases, which include utilizing the content texts of the HTML pages containing the database entry forms as the context and a unification processing for the database attribute labels. An algorithm to find out the content texts in HTML pages is developed based on multiple statistic characteristics of the text blocks in HTML pages. The unification processing for database attributes is to let the attribute labels that are closed semantically be replaced with delegates. The domain and language knowledge found in learning samples is represented in hierarchical fuzzy sets and an algorithm for the unification processing is proposed based on the presentation. Based on the pre-computing a k-NN (k nearest neighbors) algorithm is given for deep Web database classification, where the semantic distance between two databases is calculated based on both the distance between the content texts of the HTML pages and the distance between database forms embedded in the pages. Various classification experiments are carried out to compare the classification results done by the algorithm with pre-computing and the one without the pre-computing in terms of classification precision, recall and F1 values.
SONG Jie , WANG Da-Ling , BAO Yu-Bin , SHEN De-Rong
Abstract:In this paper, the page block based Web archive collecting and storing approach is proposed. The algorithms of layout-based page partition, extracting topic from block, version comparison and incremental storage implementation are introduced in detail. The prototype system is implemented and tested to verify the proposed approach. Theoretics and experiments show that, the proposed approach adapts the Web archive management well, and provides a valuable data resource to the Web archive based query, search, data mining and knowledge discovering applications.
LANG Hao , WANG Bin , LI Jin-Tao , DING Fan
Abstract:Predicting query performance (PQP) has recently been recognized by the IR (information retrieval) community as an important capability for IR systems. In recent years, research work carried out by many groups has confirmed that predicting query performance is a good method to figure out the robustness problem of the IR system and useful to give feedback to users, search engines and database creators. In this paper, the basic predicting query performance approaches for text retrieval are surveyed. The data for experiments and the methods for evaluation are introduced, the contributions of different factors to overall retrieval variability across queries are presented, the main PQP approaches are described from Pre-Retrieval to Post-Retrieval aspects, and some applications of PQP are presented. Finally, several primary challenges and open issues in PQP are summarized.
LIN Zi-Yu , YANG Dong-Qing , SONG Guo-Jie , WANG Teng-Jiao , TANG Shi-Wei
Abstract:In this paper, data mining based on the log of active decision engine is introduced to find the CUBE using pattern of analysis rules, which can be used as important reference information for materialized views selection. Based on it, a 3A probability model is designed, and the greedy algorithm, called PGreedy (probability greedy), is proposed, which takes into account the probability distribution of CUBE. Also view keeping rule is adopted to achieve better performance for dynamic view adjusting. Experimental results show that PGreedy algorithm can achieve better performance than BPUS (benefit per unit space) algorithm in real-time active data warehouses environment.
WANG Xiao-Qing , ZHENG Yan-Xing , SHI Mei-Lin
Abstract:The problem of multiple data sources selection (MDSS) in DSE (data-sharing environments) is addressed and the algorithm MDSSA (MDSS algorithm) is presented. MDSSA introduces the concept of Pareto optimization which reduces the search space greatly. By means of a novel normal-measure based nonlinear cost function, MDSSA computes approximate Pareto optimal paths to each data source first, and then gives the optimal data source and its corresponding path by comparing the cost of all candidate paths, resulting in finding more effective paths and much shorter response time. Extensive simulations show the efficiency of the algorithm.
PENG Zhao-Hui , ZHANG Jun , WANG Shan
Abstract:In this paper, a result presentation method of keyword search over databases named S-CBR (schema-based classification, browsing and retrieving) which includes three phases of result classification, user browsing, and retrieving is proposed. Firstly, S-CBR uses database schema and query keywords to produce the first-level classes automatically and classify results into them. For large classes, it further partitions them based on the content of keyword nodes. Furthermore, S-CBR gives each class a readable description, and presents the description and each result graphically to help users understand the results easily. Users also can retrieve the class which they are interested in based on the information of the first-level classes to find the results they need or acquire more relevant results. Experimental results verify the method's effectiveness and efficiency.
GUO Yu-Hong , TONG Yun-Hai , TANG Shi-Wei , YANG Dong-Qing
Abstract:After the current definition of the inverse frequent set mining problem is expanded and its three practical applications are explored, an FP-tree-based method is proposed for the inverse mining problem. First, the method divides target constraints into some sub constraints and each time it solves a sub linear constraint problem. After some iterations, it finds an FP-tree satisfying the whole given constraints. Then, based on the FP-tree it generates a temporary database TempD that only involves frequent items. The target datasets are obtained by scattering infrequent items into TempD. Theoretic analysis and experiments show that the method is right and efficient. Moreover, compared with the current methods, the method can output more than one target data set.
ZHANG Ming , YIN Ping , DENG Zhi-Hong , YANG Dong-Qing
Abstract:This paper proposes SVM+BiHMM, a hybrid statistic model of metadata extraction based on SVM (support vector machine) and BiHMM (bigram HMM (hidden Markov model)). The BiHMM model modifies the HMM model with both Bigram sequential relation and position information of words, by means of distinguishing the beginning emitting probability from the inner emitting probability. First, the rule based extractor segments documents into line-blocks. Second, the SVM classifier tags the blocks into metadata elements. Finally, the SVM+BiHMM model is built based on the BiHMM model, with the emitting probability adjusted by the Sigmoid function of SVM score, and the transition probability trained by Bigram HMM. The SVM classifier benefits from the structure patterns of document line data while the Bigram HMM considers both words' Bigram sequential relation and position information, so the complementary SVM+BiHMM outperforms HMM, BiHMM, and SVM methods in the experiments on the same task.
Abstract:Although count operator was used effectively in the process of data preprocessing, abusive use would cause the inconsistent problem of attribute relationship. To solve that problem, after proposing three attribute construction rules, time-serial count operator, a new algorithm for time-serial correlative model without inconsistent problem of attribute relationship is proposed. The time-serial increment count operator can remarkably reduce the high computing cost of time-serial count operator if the assumption is satisfied. The results of experiments prove the above conclusion.
Abstract:In the cases of Web applications, RDF(S) can be used to semantically describe the distributed information sources in enterprises to make the information retrieval more accurate. In this paper, a distributed RDF(S) model is presented to describe distributed and heterogeneous RDF(S)s. Based on this model, a method to query distributed RDF(S) descriptions is also presented, which can retrieve data of concepts as well as that of instances. By this method, users can retrieve RDF(S) descriptions and the information they describe in the same way and implement distributed RDF(S)s' integration.
Abstract:Through the theoretical analysis and research works on some famous mining algorithms, a new mining algorithm named Cherry is proposed in this paper. It bases on FP-tree technology and adopts a novel Cherry-Items-detecting technology. This novel technology can find those prefixes which result to the unclosed or redundant frequent itemsets without maintaining the frequent closed itemsets mined so far in the main memory. In the performance test, the Cherry algorithm is compared with other state of the art algorithms, such as FP-CLOSE, LCMv2 and DCI-CLOSE, in many synthetic and real data sets. The experimental results demonstrate that the Cherry algorithm outperforms them in low support.
JIAN Qiang , GONG Zheng-Hu , ZHU Pei-Dong , GUI Chun-Mei
Abstract:In wireless sensor network, medium access control (MAC) has been at the core of effective communication. Since the traditional MAC layer protocols don't adapt themselves to the performance traits and technique request of wireless sensor network, many MAC protocols for wireless sensor network are studied. Design principles and classification methods for MAC protocols in wireless sensor network are summarized, and fundamental mechanism of each recent representative MAC protocol is analyzed in detail. The characteristics, performance, and application areas of various MAC protocols in wireless sensor network are adequately compared. Finally, the status of current research development are concluded and the open research issues on MAC layer design are pointed out.
OU Zhong-Hong , SONG Mei-Na , ZHAN Xiao-Su , SONG Jun-De
Abstract:The great success of P2P (peer-to-peer) system based on Internet makes researchers focus on the mobile network environment, which is more distributed, with wider participants and more autonomic than the fixed P2P system. The popularity of intelligent terminals and the maturity of mobile application environment bring a bright prospect for mobile P2P networks. But the current research on the mobile P2P network is short of an accurate definition and a great number of questions are still needed to be studied deeply. In this paper, the basic concept of mobile P2P network is introduced firstly, including the definition and characteristics of mobile P2P network, the differences between mobile P2P network and mobile Ad Hoc network and the key techniques of the former. Then a comprehensive survey of the key techniques of mobile P2P networks, such as mobile P2P network architecture, resource discovery strategy, network structure consistency, data dissemination strategy, security and privacy mechanism, is given. The research results of the key techniques are also analyzed in depth, furthermore, the shortcomings and problems are outlined. In the end, the future research and development trend of mobile P2P networks is discussed.
CAI Zhi-Ping , LIU Fang , ZHAO Wen-Tao , LIU Xiang-Hui , YIN Jian-Ping
Abstract:Network monitoring systems are deployed by Internet Service Providers (ISP) and enterprises to obtain network performance information and enhance the global network performance. A constant monitoring is also required to enforce and ensure both connectivity and security of the infrastructure. Designing and developing the infrastructure and optimization algorithms for network monitoring system have raised a great deal of interest recently. One of the key issues in this domain is to minimize the overhead in terms of deploying cost, maintenance cost and additional traffic. Due to different measurement approaches and polling infrastructure, there are many network measurement deploying models. Unfortunately, the optimization problems of these models are generally NP-hard. These optimization problems used to be solved by integer programming, designing approximation algorithms or mapping it to classic optimization problems. In this paper, the chief researches of network measurement deploying models and optimization algorithms in the world are introduced. Lastly, some open issues are presented to be further studied in the network measurement deploying models field.
ZUO Lin , LIU Shao-Hua , FENG Yu-Lin , WEI Jun , LI Yang
Abstract:The support of reliability as adopted in conventional replication or transaction processing techniques is not enough due to their distinct objectives: Replication guarantees the liveness of computational operations by using forward error recovery, while transaction processing guarantees the safety of application data by using backward error recovery. Combining the two mechanisms for stronger reliability is a challenging task. Current solutions, however, are typically on the assumption of simple transaction pattern where only a server transaction exists at the middle-tier application server, and seldom think about some complex patterns, such as client transaction or nested transaction. To address this problem, four typical transaction patterns in J2EE application are recognized first. Then a Web application server replication scheme based state synchronization point concept, RSCTP (replication scheme for complex transaction pattern), is presented to uniformly provide exactly-once semantic reliability support for these complex transaction patterns. In this scheme, EJB components are replicated to endow business logics with high availability. In addition, by replicating transaction coordinator, the blocking problem of 2PC protocol during distributed transactions processing is eliminated. Different transaction scenarios are also discussed to illustrate the effectivity of this scheme. This scheme has been implemented, and it has been integrated into J2EE compatible application server, OnceAS, and the performance evaluation shows that its overhead is acceptable.
LIU Ying , LIU Xue-Hui , SUN Chun-Juan , WU En-Hua
Abstract:A general efficient algorithm for entropy encoding of the connectivity information of meshes is presented in this paper. In comparison to the previous encoding methods, which use only Huffman or arithmetic coding method to encode operator series, this coding method can efficiently compress connectivity information by first calculating Huffman code for every symbol in connectivity series, followed by encoding the Huffman code through using a context-based arithmetic coding method. Experimental results indicate that this method can be applied to almost all the connectivity compression algorithms for meshes. The compression result by using this entropy encoding method is generally higher than the entropy of the series-the best compression result that most connectivity compression algorithms of mesh can obtain respectively.
LI Shi-Jian , XU Cong-Fu , YANG Yang , PAN Yun-He
Abstract:In this paper, firstly, the number of positions for beacon is deduced to send a signal according to the acreage of ROI (region of interest); then a simple method is presented to calculate the coordinates of the sending positions in rectangular ROI; and then, a method is advanced based on virtual force to arrange the positions in arbitrary ROI; further, the wandering salesman problem (WSP) algorithm is applied to the positions sequence so as to get the optimal path touring it. When mobile beacon moves according to the optimal path and emits RF signals at every position, the sensors in ROI could work out their position with multilateration. Experimental results demonstrate that the proposed localization method is efficient and flexible.
Abstract:This paper shows how to efficiently transform any 3-round public-coin honest verifier zero knowledge argument system for any language in NP into a 4 round (round-optimal) concurrent zero knowledge argument for the same language in the bare public-key model. The transformation has the following properties: 1) incurs only O(1) (small constant, about 20) additional modular exponentiations. Compared to the concurrent zero knowledge protocol proposed by Di Crescenzo and Visconti in ICALP 2005, in which their transformation requires an overhead of ((n), the protocol is significantly more efficient under the same intractability assumptions; 2) yields a perfect zero knowledge argument under DL assumption. Note that the Di Crescenzo, et al.'s argument system enjoys only computational zero knowledge property. The transformation relies on a specific 3-round honest verifier zero knowledge proof of knowledge for committed discrete log. Such protocols that require only O(1) modular exponentiations based on different kinds of commitment scheme are developed and they may be of independent interest.