HE Yi-Chao , WANG Xi-Zhao , LI Wen-Bin , ZHAO Shu-Liang
2017, 28(2):185-202. DOI: 10.13328/j.cnki.jos.004937
Abstract:The randomized time-varying knapsack problem (RTVKP) is both a kind of dynamic knapsack problem and a kind of dynamic combinational optimization problem. Currently, the leading algorithms for solving RTVKP include the exact algorithm base on dynamic programming, approximation algorithm base on greedy-choice strategy and evolutionary algorithm base on genetic algorithm. First, in this paper, an exact algorithm base on dynamic programming to solve RTVKP is presented, along with comparison of its time complexity with the existing exact algorithms. Results show that the proposed algorithm is more suitable to solve RTVKP whose profit is larger. Then, the greedy correction and optimization strategy is combined with differential evolution and particle swarm optimization respectively to solve RTVKP. The numerical results on 5 instances of RTVKP show that the evolutionary algorithms which combine the differential evolution, particle swarm optimization and genetic algorithm with Greedy correction and optimization strategy respectively are more suitable to solve the hard RTVKP whose scale and oscillation frequency are larger while having bigger data.
SONG Li-Hua , WANG Hai-Tao , JI Xiao-Jun , ZHANG Xing-Yuan
2017, 28(2):203-215. DOI: 10.13328/j.cnki.jos.005098
Abstract:Being unbound to the state space size, mechanical theorem proving is an important method in ensuring software's correctness and avoiding serious damage from program bugs. File comparison algorithms constitute a large family of algorithms which find wide range of application domains including bio-informatics, information retrieval and network security. This paper presents a work on formalization of fcomp, an efficient line oriented file comparison algorithm suggested by Miller and Myers in 1985, in the interactive theorem prover Isabelle/HOL. A small bug in fcomp's bound variable iteration is identified, and the termination and correctness of the modified algorithm is established. Formal analysis of time complexity is also performed which coincides with the algorithm designers' own results. The presented work lays a valuable foundation for subsequent formal checking of other file comparison algorithms.
WANG Ju , CHEN Guang-Xi , YU Quan
2017, 28(2):216-233. DOI: 10.13328/j.cnki.jos.004950
Abstract:In the framework of description logics, the theories and their related algorithms of conservative extensions, modularity and module extraction are the core notions and vital tools in engineering semantic Web construction, ontology construction, ontology merging and reuse. Among other important contributions in this area, Lutz, et al. have shown that the conservative extension problem for ALC is decidable but its complexity is 2ExpTime-complete while the complexity of the deciding algorithm for light-weight εL is 1ExpTime-complete. Their deciding algorithms are basically depends on tableau algorithm which is substantially a reasoning mechanism in first-order predicate logic. Although, theoretically speaking, those results and algorithms are significant and valuable, both existing theories and methods appear to be complicated, difficult to understand, and hard to implement by engineers working on the semantic Web and ontology construction. This paper will not discuss and analyze the current theories and their algorithms. Instead, it independently proposes a second-order linear reasoning mechanism for all members of DL-Lite family. A proof of the completeness of the second-order deduction system is provided. The proposed mechanism is intuitive, easy to manipulate, and much easier to implement in the engineering sector. It is uniformly applicable for members of DL-Lite family such as εL, FL0, FLε, and vL. Most of all, this system is consistent with graph deduction that facilitates design and construction of the necessary graphs by using space to exchange with time cost. As a result, the time complexity is reduced significantly such that if the space and deduction graphs are sophisticatedly equipped, the deciding time can be reduced topolynomial.
DUAN Lu , SONG Yong-Hong , ZHANG Yuan-Lin
2017, 28(2):234-245. DOI: 10.13328/j.cnki.jos.005032
Abstract:The recognition of the information area with common format in the non-fixed format questionnaire is the major problem in existing questionnaire layout recognition algorithm. To address those problems, a new approach for questionnaire layout analysis based on regional connectivity and neural networks is proposed. First, a center valid graphics is generated by preprocessing the scanned image firstly. Then, a rapid skew correction algorithm is applied for questionnaire images. Next, many questionnaire rows are obtained by using horizontal projection profile segmentation algorithms. After that, the first connected region for each row is extracted to estimate the existence of form region. Based on the analysis of general questionnaire row and table row, a large amount of possible answers region are generated. Finally, the neural network is used to determine the type of possible information areas. Experiments show that the proposed algorithm can automatically identify common questionnaire.
LIU Bing-Yu , WANG Cui-Rong , WANG Cong , WANG Jun-Wei , WANG Xing-Wei , HUANG Min
2017, 28(2):246-261. DOI: 10.13328/j.cnki.jos.005116
Abstract:With the dramatic increase of microblog users, microblog websites have become the platform for a wide spectrum of users to get information. Due to the fact that blog is a special type of text with restricted length, traditional community detection algorithms cannot effectively solve the sparse problem of micro blog. To address the issue, the DC-DTM (discovery community by dynamic topic model) algorithm is proposed in this paper. First, the algorithm maps microblog as a directed-weighted network, in which the direction is the concerned relationship, and the weight is the topic's similarity of different nodes calculated by DTM (dynamic topic model). DTM is a microblog topic model which can not only mine the topics of each microblog accurately but also calculate author's influence a topic. Second, the algorithm uses label propagation WLPA (weighted lebel propagation), with low complexity, to find communities in microblog. The initial process selects nodes with the largest influence as the initial nodes, and propagates the label in the order of node's influences, from large to small. The algorithm overcomes the adverse phenomenon in the traditional label propagation algorithm, and has better stability. Experiments on real data show that the DTM model can be very good for the topic mining in microblog and DC-DTM algorithm can effectively discover the communities of microblog.
HOU Bo-Yi , CHEN Qun , YANG Jing-Ying , LI Zhan-Huai
2017, 28(2):262-277. DOI: 10.13328/j.cnki.jos.005018
Abstract:Extracting attribute names and values from textual product descriptions is important for many e-business applications such as user demand forecasting and product comparison and recommendation. The existing approaches first use supervised or semi-supervised classification techniques to extract attribute names and values, and then match them by analyzing their grammatical dependency. However, those methods have following limitations:(1) They require human intervention to label some attributes, values and the matching relationship between them; (2) The matching accuracy may be greatly affected by language habits, semantic logic, and the quality of corpus and candidates sets. To address these issues, this paper proposes an unsupervised approach for attribute name and value extraction and matching in Chinese textual merchandise descriptions. Taking advantage of search engine, it extracts the candidate set of attribute names with respect to a value by analyzing grammatical relation based on the principle of small probability event. A new algorithm for computing the matching probability between attribute names and values is also designed based on relative conditional deselect probability and Page Rank. The proposed approach can effectively extract attribute names and values from Chinese textual merchandise descriptions and match them without any human intervention, no matter whether the attribute name appears in the textual description or not. Finally, the performance of the proposed approach is evaluated on the textual descriptions of 4 types of merchandise using the search engine of Baidu. The experimental results show that the new approach for attribute name extraction can improve recall by 20%, compared with the approach of directly extracting attribute names from textual descriptions. Moreover, the new approach achieves considerably higher matching accuracy (above 30% if measured by the percentage of rank-1, above 0.3 if measured by MRR) than the existing techniques based on grammatical dependency analysis for non-quantization attributes.
ZHONG Zhao-Man , GUAN Yan , HU Yun , LI Cun-Hua
2017, 28(2):278-291. DOI: 10.13328/j.cnki.jos.005030
Abstract:Mining user interests on microblog is the basis for personalized recommendation and community classification. A descriptive model of microblog network is proposed based on the in-depth analysis over the characteristics of microblog in the work, revealing properties of multi-mode microblog. The representation and mining method of profile-based static user interests and microblog post-based dynamic user interests are proposed respectively according to the characteristics of microblog network. For mining inactive users with little profile and few microblog posts, a method of follower-based interest mining is proposed. In the case study of Sina microblog, users in fashion, business management, education, military and culture are selected for experimental analysis and comparison of interest mining and similarity calculation. Experimental results show that the proposed representation and mining method can effectively improve user interest mining comparing with other state-of-the-art methods.
ZHANG Bo , HAO Jie , MA Gang , SHI Zhong-Zhi
2017, 28(2):292-309. DOI: 10.13328/j.cnki.jos.005047
Abstract:Canonical correlation analysis (CCA) is a statistical analysis tool for analyzing the correlation between two sets of random variables. CCA requires the data be rigorously paired or one-to-one correspondence among different views due to its correlation definition. However, such requirement is usually not satisfied in real-world applications due to various reasons. Often, only a few paired and a lot of unpaired multi-view data are given, because unpaired multi-view data are relatively easier to be collected and pairing them is difficult, time consuming and even expensive. Such data is referred as semi-paired multi-view data. When facing semi-paired multi-view data, CCA usually performs poorly. To tackle this problem, a semi-paired variant of CCA, named SemiPCCA, is proposed based on the probabilistic model for CCA. The actual meaning of "semi-" in SemiPCCA is "semi-paired" rather than "semi-supervised" as in popular semi-supervised learning literature. The estimation of SemiPCCA model parameters is affected by the unpaired multi-view data which reveal the global structure within each modality. By using artificially generated semi-paired multi-view data sets, the experiment shows that SemiPCCA effectively overcome the over-fitting problem of traditional CCA and PCCA (probabilistic CCA) under the condition of insufficient paired multi-view data and performs better than the original CCA and PCCA. In addition, an automatic image annotation method based on the SemiPCCA is presented. Through estimating the relevance between images and words by using the labelled and unlabeled images together, this method is shown to be more accurate than previous published methods.
LI Miao , GU Yu , CHEN Mo , YU Ge
2017, 28(2):310-325. DOI: 10.13328/j.cnki.jos.005050
Abstract:With the proliferation of geo-positioning techniques, there has been increasing popularity of online location-based services. Specifically, reverse top-k spatial preference queries provide such services to retrieve the users that deem a given database object as one of their top-k results. The attributes of the query object are given by the spatial distance from users' preference. However in real world, users not only consider the non-spatial attributes about the objects, but also hope to find the spatial objects based on the qualities of features in their spatial neighborhood. While reverse top-k spatial preference queries have significant amount of real-life applications such as market analysis, for example, to predict the popularity of a facility in a region, they face a great challenge to compute the score of the spatial attributes online. This paper presents a processing framework and some optimal techniques including pruning and user preference grouping methods. Theoretical analysis and experimental evaluation demonstrate the efficiency of the proposed algorithms and the improvement on running time and I/O.
SHAN Jing , SHEN De-Rong , KOU Yue , NIE Tie-Zheng , YU Ge
2017, 28(2):326-340. DOI: 10.13328/j.cnki.jos.005117
Abstract:With the development of social network, information diffusion problem has received a lot of attention because of its extensive application prospects, and influence maximization problem is a hot topic of information diffusion. Influence maximization aims at selecting nodes that maximize the expected influence as initial nodes of information diffusion, and most work on influence maximization adopts probabilistic models such as independent cascade model. However, most existing solutions of influence maximization view the information diffusion process as an automatic process, and ignore the role of social network websites during the process. Besides, the probabilistic models have some issues in that, for example, they cannot guarantee the information to be delivered effectively, and they cannot adapt the dynamic networks. To tackle the problem, this paper proposes an approach for hot spread node selection based on overlapping community search. This approach selects influence maximized nodes step by step through the iterative promotion model according to users' behavior feedback, and makes the social network websites play the controller's role during information diffusion process. The paper also proposes a new method to measure the influence of nodes based on overlapping community structure, and utilizes this information measure method to select hot spread nodes. Two exact algorithms are proposed including a basic algorithm and an optimized algorithm, as well as an approximate algorithm are presented. Comprehensive experiments demonstrate the performance and accuracy of both exact and approximate algorithms, and the effectiveness of the iterative hot spread node selection method.
JIANG Huo-Wen , ZENG Guo-Sun , MA Hai-Ying
2017, 28(2):341-351. DOI: 10.13328/j.cnki.jos.005015
Abstract:To prevent privacy disclosure, table data generally needs to be anonymized before being published. Existing anonymity methods seldom distinguish different types of quasi-identifier in generalization, and also lack investigation into optimization of both information loss and time efficiency. In this paper, a greedy clustering-anonymity method is proposed using the ideas of greedy algorithm and clustering algorithm. The method makes distinct generalizations according to the type of quasi-identifier to conduct different calculations on information loss, and this providing reduction and reasonable estimate on information loss. Moreover, with regard to distance between tuples, or distance between a tuple and an equivalence class, two definitions are put forward in order to achieve minimum information loss in merging generalization. When establishing a new cluster, the tuple with the minimum distance in the ongoing cluster is always chosen to add. It ensures that the total information loss is close to minimum. Since the number of tuples in establishing each cluster is subject to k and the size of every cluster is equal to or just greater than k, the amount of calculation on distances and therefore the running time are reduced. Experimental results show that the proposed method is effective in reducing both information loss and running time.
ZHAO Chuan , JIANG Han , WEI Xiao-Chao , XU Qiu-Liang
2017, 28(2):352-360. DOI: 10.13328/j.cnki.jos.005019
Abstract:Oblivious transfer is a fundamental tool in modern cryptography. It plays an important role in the research of security protocols. In recent years, many oblivious transfer variants with more powerful functionalities are proposed to fit in different kinds of requirements and scenarios. In this paper, a new oblivious transfer variant, called cut-and-choose bilateral oblivious transfer, is proposed. Based on homomorphic encryption, an efficient one-round protocol of this primitive is constructed along with rigorous security proof in semi-honest model. When applied in security protocols based on cut-and-choose technique (especially in secure two-party computation), cut-and-choose bilateral oblivious transfer enables a more modular high-level description of the protocol framework, and also reduces the round complexity of the protocols. Besides, as a basic tool in the information security area, this primitive itself is of independent research interest.
XIONG Si-Chun , YANG Chao , MA Jian-Feng , ZHANG Jun-Wei
2017, 28(2):361-371. DOI: 10.13328/j.cnki.jos.005023
Abstract:As the most widely used graphical password scheme on mobile terminals, Android unlock pattern (AUP) is not quite uniformly distributed in its theoretical password space when in practical use, which exposes a tremendous hazard that can be easily exploited by the attacker to expedite dictionary attack or violence crack. To address this issue, this paper proposes a new scheme, Android-unlock-pattern based on random point exclusion (AUP-RPE), which helps the user to avoid habitual choices by the new interface arrangement. In addition, patterns in real-life use are collected by performing a large-scale user study with over 1 100 people. Modeling based on those patterns shows the entropy of AUP-RPE increases over 3 orders of magnitude than the entropy of AUP, which means that AUP-RPE has a much stronger security.
GONG Lin-Ming , LI Shun-Dong , WANG Dao-Shun , DOU Jia-Wei
2017, 28(2):372-383. DOI: 10.13328/j.cnki.jos.005048
Abstract:The analysis on the well-known optimal asymmetric encryption and its improved schemes reveal some drawbacks. For one, these schemes use plaintext padding mechanism and hash functions to hide the statistic property of plaintext, and the property of Hash function makes it difficult to prove that these schemes or their variants are secure in the standard model. Many research works show that, assuming that RSA problem and their variants are difficult, it is difficult to prove the RSA-OAEP schemes or their improvements secure against adaptive chosen cipher-text attack in the standard model. In addition, because these schemes encrypt randomized message using padding mechanism, the randomized message is k-bit longer than the plain-text. This increases the computational complexity of these schemes. To address the problem, this paper proposes an RSA-type encryption scheme based pairing functions. This scheme has the following advantages. First, the scheme does not use hash function to hide the statistical property of plain-text, which makes it possible to prove its security in the standard model. In this scheme, the randomized message can be shorter than the plain-text. Second, it is proved in the standard model that the scheme is secure against adaptive chosen cipher-text attacks. Third, when used in sign-encryption, it is not necessary for the users to negotiate the order of signature modulus or the encryption modulus.
MAO Wei-Xuan , CAI Zhong-Min , TONG Li
2017, 28(2):384-397. DOI: 10.13328/j.cnki.jos.005061
Abstract:Existing techniques of malware detection depend on observations of sufficient malware samples. However, only a few samples can be obtained when a novel malware first appears in the World Wide Web, which brings challenges to detect novel malware and its variants. This paper studies the anomaly and similarity of processes with respect to their access behaviors under data flow dependency network, and defines estimated risk for malware detection. Furthermore, the study proposes a malware detection method based on active learning by minimizing the estimated risk. This method achieves encouraging performance even with small samples, and is applicable to defending against rapidly increasing novel malware. Experimental results on a real-world dataset, which consists of access behaviors of 8 340 benign and 7 257 malicious processes, demonstrate better performance of the presented method than traditional malware detection method based on statistical classifier. Even with only 1% known samples, the new method achieves 5.55% error rate, which is 36.5% lower than the error rate of traditional statistical classifier based method.
LIU Ke-Nan , TONG Wei , FENG Dan , LIU Jing-Ning , ZHANG Ju
2017, 28(2):398-410. DOI: 10.13328/j.cnki.jos.005059
Abstract:At present, virtualization technology has been widely applied in data centers. However, VCPU (virtual CPU) scheduling strategy still faces intolerable I/O delay, especially for I/O-latency sensitive VMs which suffer from significant performance degradation when competing with CPU-intensive VMs. This paper presents a flexible and efficient VCPU scheduling algorithm FLMS (flexible I/O latency and multi-processor sensitive scheduler) which utilizes VM classification, VCPU binding and flexible slicer to reduce VM response delay. The work also redesigns the load balancing strategy to ensure optimal VCPU migration. FLMS is suitable for the current mainstream virtualization solutions. It has a 30% improvement comparing with the latest software virtualization. With hardware-assisted virtualization, FLMS makes it possible for VMs to achieve near bare-metal performance and ensures the fairness of the whole system
SUN Jing-Hao , SUN Jing-Chang , GUAN Nan , DENG Qing-Xu
2017, 28(2):411-428. DOI: 10.13328/j.cnki.jos.005025
Abstract:Schedulability test for EDF (earliest deadline first) systems is one of the classical NP-hard problems in the study of real-time system. Current researches mainly focus on the synchronous systems with the utilization U strictly less than 1, which can be decided exactly in pseudo-polynomial time. However, these results cannot be easily extended to the synchronous systems with U≤1 or to the asynchronous systems even with U<1. In this paper, a unified integer programming formulation, where the associated scale is independent of utilization U, is proposed for the EDF schedulability problems in both of the synchronous and asynchronous systems. The polyhedral structure of the formulation is investigated and a kind of facet inequalities is derived, resulting in a linear relaxation approach with polynomial-time complexity. Numerical results on a large scale randomly generated asynchronous and synchronous instances show that the proposed method can obtain a tight gap (0.78% and 1.27% respectively on average) between the relaxation and the optimal integer solutions. Furthermore, the comparison with the QPA exhibits that the new method is available for 70% synchronous instances and exponentially reduces the calculation time especially in situations when U>0.99. Finally, experiments on asynchronous systems find that nearly 96% instances can be exactly solved by the method, which is 29.27% lesser than the traditional method. For the rest of the instances, the upper bound of the schedulability test can be sharply reduced. For most instances, the new bound is 104 smaller than the traditional ones.
ZHAO Xiao-Gang , HU Qi-Ping , DING Ling , SHEN Zhi-Dong
2017, 28(2):429-442. DOI: 10.13328/j.cnki.jos.005026
Abstract:Today the ever-growing energy cost, especially cooling cost of data centers, draws much attention for carbon emission reduction. This paper presents an energy efficient scheduling strategy based on model prediction control (MPC) to reduce cooling cost in data centers. It uses dynamic voltage frequency scaling technology to adjust the frequencies of computing nodes of a cluster in a way to minimize heat recirculation effect among the nodes. The maximum inlet temperature of nodes can be kept under temperature limits with little stable error. The method can also deal with inner disturbance (system model variation) by dynamic frequencies regulation among the nodes. Analysis shows good scalability and small overhead, making the method applicable in huge data centers. A temperature-aware controller is designed to reduce inlet temperatures to improve energy efficiency of data centers. Using a simulated online bookstore run in a heterogeneous data center the proposed method is proved to have larger throughput in both normal and emergency cases compared with existing solutions such as safe least recirculation heat temperature controller and traditional feedback temperature controller. The MPC-based scheduling method also has less inlet temperature and cooling cost comparing with those two methods under same workload.
HE Pan , ZHENG Zhi-Hao , YUAN Yue , TAN Chun
2017, 28(2):443-456. DOI: 10.13328/j.cnki.jos.005031
Abstract:In long-time running reliable software systems, as the demand for continuous execution time and task response speed increases, the redundant component needs to be instantly switched when failure occurs. However, reliability optimization is often conducted under the assumption that cold standby redundancy is only activated when all active components fail. This paper tackles the redundancy allocation problem for a mixed redundancy strategy with instant switching to ensure system reliability as well as performance. The redundancy allocation model is built to minimize redundancy configuration cost under the transient availability and job completion rate constraints. Two system performance metrics are analyzed on top of the state transition diagram using Markov-chain theory. A numerical method is used to compute the non-linear model, and a genetic algorithm is used to solve the optimization model based on the double-element encoding mechanism. Illustrative examples are presented to explain the analysis of system transient availability and job completion rate as well as the allocation result under constraints. Experiment results indicate that with the same redundancy, the job completion rate of systems with the new mixed strategy is higher than the systems with traditional strategy. Thus, different redundancy should be allocated for different kinds of redundancy strategies, even under the same constraints.
CAO Jie , ZENG Guo-Sun , KUANG Gui-Juan , ZHANG Jian-Wei , MA Hai-Ying , HU Ke-Kun , NIU Jun
2017, 28(2):457-472. DOI: 10.13328/j.cnki.jos.005054
Abstract:Low resource utilization is becoming much more serious in cloud platform which allocates processor resources according to the peak load while providing single service application and facing dynamic variation of resource demand. To address the problem, this study uses cloud virtual machine (VM) center to provide a variety of reasonable service applications simultaneously. Gray wave forecasting algorithm is adopted to predict the future load of service requests and a VM service utility function is proposed by taking resource requirements and service priorities into account. Each VM inside a physical machine dynamically configures physical resources to maximize the service utility value of the physical machine. Besides, by applying the global load balancing and multi-time physical resource redistribution for each virtual machine in the same physical machine, the number of physical resources assigned to the VMs whose service request amount is much larger is further increased. In the end, on-demand resource reconfiguration algorithm ODRGWF based on grey wave forecasting is put forward. The simulation results show that the proposed algorithm can effectively improve processor resource utilization, which is of practical significance to improve user request completion rate and service quality.