Volume 31,Issue 4,2020 Table of Contents

Survey of Online Learning Algorithms for Streaming Data Classification

2020, 31(4):912-931. DOI: 10.13328/j.cnki.jos.005916

Abstract (3471) HTML (4945) PDF 692.47 K (10330) Comment (0) Favorites

Abstract:The objective of streaming data classification is to learn incrementally a decision function that maps input variables to a label variable, from continuously arriving streaming data, so as to accurately classify the test data that may arrive anytime. The online learning paradigm, as an incremental machine learning technology, is an effective tool for classification of streaming data. This paper mainly summarizes, from the perspective of online learning, the recent development of algorithms for streaming data classification. Specifically, the basic framework and the performance evaluation methodology of online learning are first introduced. Then, the latest development of online learning algorithms for general streaming data, for alleviating the "curse of dimensionality" problem in high-dimensional streaming data, and for resolving the "concept drifting" problem in evolving streaming data are reviewed respectively. Finally, future challenges and promising research directions for classification of high-dimensional and evolving streaming data are also discussed.

Concept Drift Detection Method Based on Online Performance Test

GUO Hu-Sheng , ZHANG Ai-Juan , WANG Wen-Jian

2020, 31(4):932-947. DOI: 10.13328/j.cnki.jos.005917

Abstract (2105) HTML (2370) PDF 1.69 M (3924) Comment (0) Favorites

Abstract:Concept drift is a common problem in dynamic streaming data mining, but the false concept drift generated by the mixed noise data or too small scale size training data will cause similar results to the concept drift, that is, the instability fluctuation of model online testing performance, which leads to confusion between them, and the false alarm of concept drift. To address the problem which is easy to confuse the authenticity of concept drift, concept drift detection method based on online performance test, namely CDPT, is presented. With CDPT, the latest acquired data are evenly divided into groups, and online learning is performed on each group sub sets. At the same time, the classification accuracy vectors obtained by training and testing of each group sub sets are recorded, and the accuracy difference between adjacent learning time units is calculated. The effective fluctuation points are obtained according to the testing accuracy decline threshold. Then, the effective fluctuation points in different groups are integrated by cross checking to eliminate the detection interference caused by the instability of the model due to the small training samples in the online learning process of streaming data, and the consistent fluctuation points are obtained according to the consistency of accuracy fluctuation. Finally, by tracking the classification accuracy of online learning, the change of testing accuracy can be achieved of neighborhood reference points of consistent fluctuation points, and the decline and convergence of model testing accuracy can be compared of neighborhood reference points of consistent fluctuation points, so as to effectively detect the true concept drift points of the consistent fluctuation points. The experimental results demonstrate that the proposed CDPT method can effectively identify the true concept drift occurring in the online learning process of streaming data, effectively avoid the negative impact of too small training samples or noise on the detection results, and improve the generalization performance of the model.

Novel Deep Reinforcement Learning Algorithm Based on Attention-based Value Function and Autoregressive Environment Model

LIANG Xing-Xing , FENG Yang-He , HUANG Jin-Cai , WANG Qi , MA Yang , LIU Zhong

2020, 31(4):948-966. DOI: 10.13328/j.cnki.jos.005930

Abstract (3257) HTML (1571) PDF 2.18 M (7650) Comment (0) Favorites

Abstract:Recently, deep reinforcement learning (DRL) is believed to be promising in continuous decision-making and intelligent scheduling problems, and some examples such as AlphaGo, OpenAI Five, and Alpha Star have demonstrated the great generalization capability of the paradigm. However, the inefficient utility of collected experience dataset in DRL restricts the universal extension to more practical scenarios and complicated tasks. As the auxiliary, the model-based reinforcement learning can well capture the dynamics of environment and bring the reduction in experience sampling. This study aggregates the model-based and model-free reinforcement learning algorithms to formulate an end-to-end framework, where the autoregressive environment model is constructed, and attention layer is incorporated to forecast state value function. Experiments on classical CartPole-V0 and so on witness the effectiveness of proposed framework in simulating environment and advancing utility of dataset. Finally, penetration mission as the practical instantiation is successfully completed with the framework.

Cooperative Prediction Method Based on Dynamic and Static Representation for Crowdfunding

ZHANG Kai , ZHAO Hong-Ke , LIU Qi , PAN Zhen , CHEN En-Hong

2020, 31(4):967-980. DOI: 10.13328/j.cnki.jos.005921

Abstract (2166) HTML (1396) PDF 1.59 M (5259) Comment (0) Favorites

Abstract:Crowdfunding is an emerging finance platform for creators to fund their efforts by soliciting relatively small contributions from a large number of individuals using the Internet. Due to the unique rules, a campaign succeeds in trading only when it collects adequate funds in a given time. To prevent creators and backers from wasting time and efforts on failing campaigns, dynamically estimating the success probability of a campaign is very important. However, existing crowdfunding systems neither have the mechanism of dynamic predictive tracking, nor consider the dynamic interaction between project sponsors and investors on the platform. To address these issues, a novel dynamic and static collaborative prediction model is designed based on long and short-term memory network. This model focuses on user behavior, including the emotional tendency of reviews and the dynamic incremental information in the financing process, so as to deeply mine and analyze the interaction between financing projects and investors. Firstly, for the static features and dynamic user behavior data on the platform, their deep characterization is obtained by different embedding methods. On this basis, a collaborative prediction model based on attention mechanism is further designed to understand the impact of timing information of project financing on the final results. Finally, experiments on real crowdfunding datasets show that the proposed dynamic and static representation prediction method is more effective than other prediction methods.

Weakly Supervised Learning Framework Based on k Labeled Samples

FU Zhi , WANG Hong-Jun , LI Tian-Rui , TENG Fei , ZHANG Ji

2020, 31(4):981-990. DOI: 10.13328/j.cnki.jos.005919

Abstract (2448) HTML (1926) PDF 1.09 M (4260) Comment (0) Favorites

Abstract:Clustering is an active research topic in the field of machine learning. Weakly supervised learning is an important research direction in semi-supervised learning, which has wide range of application scenarios. In the research of clustering and weakly supervised learning, it is proposed that a framework of weakly supervised learning is based on k labeled samples. Firstly, the framework expands labeled samples by clustering and clustering confidence level. Secondly, the energy function of the restricted Boltzmann machine is improved, and a learning model of the restricted Boltzmann machine based on k labeled samples is proposed. Finally, the model of ratiocination and algorithm are proposed. In order to test the framework and the model, a series of public data sets are chosen for comparative experiments. The experimental results show that the proposed weakly supervised learning framework based on k labeled samples is more effective.

LUO Xiao-Hui , LI Fan-Zhang , ZHANG Li , GAO Jia-Jun

2020, 31(4):991-1001. DOI: 10.13328/j.cnki.jos.005922

Abstract (2058) HTML (1552) PDF 1.24 M (5016) Comment (0) Favorites

Abstract:Manifold learning is one of the most important research directions nowadays. The performance of manifold learning methods is affected by the choice of reduced dimension. When the reduced dimension is the intrinsic dimension, it is easily to handle the original data. However, intrinsic dimension estimation is still a challenge of manifold learning. In this study, a novel unsupervised method is proposed, called similar manifold learning based on selective cluster ensemble (SML-SCE), which avoids the estimation of intrinsic dimension and achieves a promising performance. SML-SCE generates representative anchors with modified balanced K-means based hierarchical K-means (MBKHK) to construct similarity matrix efficiently. Moreover, multiple similar low-dimensional embeddings in different dimensions are obtained, which are the different presentations of original data. The diversity of these similar low-dimensional embeddings is benefit to the ensemble learning. Therefore, selective cluster ensemble method is taken advantage of as the combination rule. For the clustering results obtained by K-means in similar low-dimensional embeddings, the normalized mutual information (NMI) is calculated between clusterings as weight. Finally, the low weight clusterings is discarded and a selective vote scheme is adopted based on weight to obtain the final clustering. Extensive experiments on several data sets demonstrate the validity of the proposed method.

Conditional Graphical Generative Adversarial Networks

LI Chong-Xuan , ZHU Jun , ZHANG Bo

2020, 31(4):1002-1008. DOI: 10.13328/j.cnki.jos.005924

Abstract (1981) HTML (1776) PDF 868.40 K (3990) Comment (0) Favorites

Abstract:Generative adversarial networks (GANs) have been promise on generating realistic images and hence have been studied widely. Notably, graphical generative adversarial networks (graphical-GAN) introduce Bayesian networks to the GAN framework to learn the underlying structures of data in an unsupervised manner. This study proposes a conditional version of graphical-GAN, which can leverage coarse side information to enhance the graphical-GAN and learn finer and more complex structures, in weakly-supervised learning settings. The inference and learning of conditional graphical-GAN follows a similar protocol to graphical-GAN. Two instances of conditional graphical-GAN are presented. The conditional Gaussian mixture GAN can learn fine clusters from mixture data given a coarse label. The conditional state space GAN can learn the dynamics of videos with multiple objects given the labels of the objects..

Spectral Clustering Based Unsupervised Feature Selection Algorithms

XIE Juan-Ying , DING Li-Juan , WANG Ming-Zhao

2020, 31(4):1009-1024. DOI: 10.13328/j.cnki.jos.005927

Abstract (2459) HTML (2215) PDF 2.26 M (5782) Comment (0) Favorites

Abstract:Gene expression data usually comprise small number of samples with tens of thousands of genes. There are a large number of genes unrelated to diseases in this kind of data. The primary task is to detect those key essential genes when analyzing this kind of data. The common feature selection algorithms depend on labels of data, but it is very difficult to get labels for data. To overcome the challenges, especially for gene expression data, the unsupervised feature selection idea is proposed, named as FSSC (feature selection by spectral clustering). FSSC groups all of features into clusters by a spectral clustering algorithm, so that similar features are in same clusters. The feature discernibility and independence are defined, and the feature importance is defined as the product of its discernibility and independence. The representative feature is selected from each cluster to construct the feature subset. According to the spectral clustering algorithms used in FSSC, three kinds of unsupervised feature selection algorithms named as FSSC-SD (FSSC based on standard deviation), FSSC-MD (FSSC based on mean distance) and FSSC-ST (FSSC based on self-tuning) are developed. The SVM (support vector machines) and KNN (K-nearest neighbors) classifiers are adopted to test the performance of the selected feature subsets in experiments. Experimental results on 10 gene expression datasets show that FSSC-SD, FSSC-MD, and FSSC-ST algorithms can select powerful features to classify samples.

Distant Supervision Neural Network Relation Extraction Base on Noisy Observation

YE Yu-Xin , XUE Huan , WANG Lu , OUYANG Dan-Tong

2020, 31(4):1025-1038. DOI: 10.13328/j.cnki.jos.005929

Abstract (2167) HTML (1603) PDF 1.45 M (4785) Comment (0) Favorites

Abstract:The great advantage of distant supervision relation extraction is to generate labeled data automatically through knowledge bases and natural language texts. This simple automatic alignment mechanism liberates people from heavy labeling work, but inevitably produces various incorrect labeled data meanwhile, which would have an influential effect on the construction of high-quality relation extraction models. To handle noise labels in the distant supervision relation extraction, here it is assumed that the final label of sentence is based on noisy observations generated by some unknown factors. Based on this assumption, a new relation extraction model is constructed, which consists of encoder layer, attention based on noise distribution layer, real label output layer, and noisy observation layer. In the training phase, transformation probabilities are learned from real label to noisy label by using automatically labeled data, and in the testing phase, the real label is obtained through the real label output layer. This study proposes to combine the noise observation model with deep neural network. The attention mechanism of noise distribution is focused based on deep neural network, and unbalanced samples are denoised of under the framework of deep neural network, aiming to further improve the performance of distant supervision relation extraction based on noisy observation. To examine its performance, the proposed method is applied to a public dataset. The performance of distant supervision relation extraction model is evaluated under different distribution families. The experimental results illustrate the proposed method is more effective with higher precision and recall, compared to the existing methods.

Model Parameter Re-optimization for Linear Hashing Based on Similarity Drive

NIE Xiu-Shan , LIU Xing-Bo , XI Xiao-Ming , YIN Yi-Long

2020, 31(4):1039-1050. DOI: 10.13328/j.cnki.jos.005918

Abstract (2215) HTML (1707) PDF 1.45 M (3792) Comment (0) Favorites

Abstract:By designing and optimizing an objective function, and combining the distribution of samples, hash learning learns the hash codes of samples. In the existing hashing models, linear model is widely used due to its conciseness and high efficiency. For the parameter optimization of linear hashing model, a model parameter re-optimization method is propose based on similarity drive, which can improve the precision of the existing linear model-based hashing algorithms. Given a hashing method, this method is firstly run for several times with obtaining several hash matrices. Then, some bits are selected for these hash matrices to obtain a new final hash matrix based on the similarity preserving degree and a fusion strategy. Finally, this new hash matrix is used to re-optimize the model parameters, and a better hash model is obtained for out-of-sample extension. Extensive experiments are performed based on three benchmark datasets and the results demonstrate the superior performance of the proposed framework.

Individual Convergence of NAG with Biased Gradient in Nonsmooth Cases

LIU Yu-Xiang , CHENG Yu-Jia , TAO Qing

2020, 31(4):1051-1062. DOI: 10.13328/j.cnki.jos.005926

Abstract (2001) HTML (1357) PDF 1.36 M (3791) Comment (0) Favorites

Abstract:Stochastic method has become the first choice for dealing with large-scale regularization and deep learning optimization problems. The acquisition of its convergence rate heavily depends on the unbiased gradient of objective functions. However, for machine learning problems, many scenarios can result in the appearance of biased gradient. In contrast to the unbiased gradient cases, the well-known Nesterov accelerated gradient (NAG) accumulates the error caused by the bias with the iteration. As a result, the optimal convergence will no longer hold and even the convergence cannot be guaranteed. Recent research shows that NAG is also an accelerated algorithm for the individual convergence of projection sub-gradient methods in non-smooth cases. However, until now, there is no report about the affect when the subgradient becomes biased. In this study, for non-smooth optimization problems, it is proved that NAG can obtain a stable individual convergence bound when the subgradient bias is bounded, and the optimal individual convergence can still be achieved while the subgradient errors decrease at an appropriate. As an application, an inexact projection subgradient method is obtained in which the projection needs not calculate accurately. The derived algorithm can approach the stable learning accuracy more quick while keeping the convergence. The experiments verify the correctness of theoretical analysis and the performance of inexact methods.

Rule Inference Network Model for Classification

HUANG De-Gen , ZHANG Yun-Xia , LIN Hong-Mei , ZOU Li , LIU Zhuang

2020, 31(4):1063-1078. DOI: 10.13328/j.cnki.jos.005920

Abstract (2857) HTML (1853) PDF 1.62 M (4571) Comment (0) Favorites

Abstract:The black-box working mechanism of artificial neutral network brings the confusion of interpretability. Therefore, a rule inference network is proposed based on rule-base inference methodology using the evidential reasoning approach (RIMER). It is interpretable by the rules and the inference engine in RIMER. In the present work, the partial derivatives of inference functions are proved as the theoretical fundamental of the proposed model. The framework and the learning algorithm of rule inference network for classification are presented. The feed forward of rule inference network using the inference process in RIMER contributes for the interpretability. Meanwhile, parameters in belief rule base such as attribute weights, rule weights and belief degree of consequents are trained by gradient descent as in neural network for belief rule base establishment. Moreover, the gradient is simplified by proposing the "pseudo gradient" to reduce the learning complex during the training process. The experimental results indicate the advantages of proposed rule inference network on both interpretability and learning capability. It shows that the rule inference network performs well when the scale of the training dataset is small, and when the training data scale increases, it also achieves comforting results.

Multi-label Text Classification Method Based on Label Semantic Information

XIAO Lin , CHEN Bo-Li , HUANG Xin , LIU Hua-Feng , JING Li-Ping , YU Jian

2020, 31(4):1079-1089. DOI: 10.13328/j.cnki.jos.005923

Abstract (3738) HTML (1920) PDF 1.24 M (7873) Comment (0) Favorites

Abstract:Multi-label classification has been a practical and important problem since the boom of big data. There are many practical applications, such as text classification, image recognition, video annotation, multimedia information retrieval, etc. Traditional multi-label text classification algorithms regard labels as symbols without inherent semantics. However, in many scenarios these labels have specific semantics, and the semantic information of labels have corresponding relationship with the content information of the documents, in order to establish the connection between them and make use of them, a label semantic attention multi-label classification (LASA) method is proposed based on label semantic attention. The texts and labels of the document are relied on to share the word representation between the texts and labels. For documents embedding, bi-directional long short-term memory (Bi-LSTM) is used to obtain the hidden representation of each word. The weight of each word in the document is obtained by using the semantic representation of the label, thus taking into account the importance of each word to the current label. In addition, labels are often related to each other in the semantic space, by using the semantic information of the labels, the correlation of the labels is considered to improve the classification performance of the model. The experimental results on the standard multi-label classification datasets show that the proposed method can effectively capture important words, and its performance is better than the existing state-of-the-art multi-label classification algorithms.

Integrating Latent Item-item Complementarity with Personalized Recommendation Systems

SHAO Ying-Wei , ZHANG Min , MA Wei-Zhi , WANG Chen-Yang , LIU Yi-Qun , MA Shao-Ping

2020, 31(4):1090-1100. DOI: 10.13328/j.cnki.jos.005925

Abstract (2727) HTML (2262) PDF 1.17 M (5619) Comment (0) Favorites

Abstract:Domain-specific personalized recommendation algorithm is getting more popular nowadays. In particularly, item-item relationship (e.g. complementary good, substitute good) has already been considered in the development of recommendation algorithms. In terms of its potential application for sellers, the ability to notice actual item-item complementarity from data is of paramount importance, as it helps sellers to gain a market competitive advantage via designing better pricing strategies (e.g. bundling or pricing discount). For recommender systems, integrating algorithms with the item-item relationship is also more likely to generate better recommendation results. Therefore, how to mine item-item complementarity is a research problem deserving of study. Even though most existing methodologies leverage on co-occurrence relationship, yet, the recommendation accuracy might easily be adversely affected by noise data due to the complex dynamics in the online shopping environment. In light of the research on economics, the latent complementarity discovery model (LCDM) is proposed in an attempt to more accurately describe the item-item relationship from a different perspective. Specifically, complementarity discovery model (CDM) is firstly proposed based on cross-price elasticity of demand, which jointly utilizes item pricing and purchase history to discover item-item complementarity relationship. Comparing with existing mining methods based on item co-occurrence relationship, the proposed method yields 10.6% increase in user label consistency. Then, LCDM is constructed by integrating dual-item attention with item-item complementarity insight mined from CDM. Lastly, from the comparison experiments conducted on real-world dataset, LCDM has made a significant improvement in recommendation performance, in which there is a 54.4% and 125.8% increase in Recall@5 and NDCG@5 respectively.

Graph Convolutional Network for Recommender Systems

GE Yao , CHEN Song-Can

2020, 31(4):1101-1112. DOI: 10.13328/j.cnki.jos.005928

Abstract (3653) HTML (1891) PDF 1.34 M (5921) Comment (0) Favorites

Abstract:Graph convolutional network (GCN) is a deep learning model for graph signal processing and has been used in many real-world applications due to its powerful ability of feature extraction. As the recommendation problem can be viewed as link prediction of graph signals, recently several GCN based methods have been proposed for recommender systems. A recommender system involves two kinds of interactions, with one representing interactions between users and items and the other representing interactions among users (or items). However, existing methods focus on either heterogeneous or homogeneous interactions only, thus their modeling expressiveness is limited. In this study, a new GCN based recommendation algorithm is proposed to jointly utilize these two types of interactions. Specifically, a heterogeneous convolutional operator is used to mine information from the spectrum of user-item graphs, while a homogeneous convolutional operator is used to enforce similar vertices to be similar in the hidden space. Finally, the experiments on benchmark datasets show that the proposed method achieves better performance compared with several state-of-the-art methods.

NP-completeness of d-regular (k,s)-SAT Problem

FU Zu-Feng , XU Dao-Yun

2020, 31(4):1113-1123. DOI: 10.13328/j.cnki.jos.005896

Abstract (1509) HTML (1347) PDF 1.16 M (3503) Comment (0) Favorites

Abstract:The study on the NP-completeness of regular SAT problem is a subject which has important theoretical value. It is proved that there is a critical function f(k) such that all formulas in (k,f(k))-CNF are satisfiable, but (k,f(k)+1)-SAT is NP-complete, and there is such a critical function about regular (k,s)-SAT problem too. This work studies the regular SAT problem with stronger regular constraints. The regular (k,s)-CNF is the subclass of CNF in which each clause of formula has exactly k distinct literals and each variable occurs exactly s times. The d-regular (k,s)-CNF is a regular (k,s)-CNF formula that the absolute value of the difference between positive and negative occurrence number of each variable in the formula is at most d. A polynomial reduction from a k-CNF formula is presented to a d-regular (k,s)-CNF formula and it is proved that there is a critical function f(k,d) such that all formulas in d-regular (k,f(k,d))-CNF are satisfiable, but d-regular (k,f(k,d)+1)-SAT is already NP-complete. By adding new variables and new clauses, the reduction method not only can alter the ration of clauses to variables of any one CNF formula, but also can restrict the difference between positive and negative occurrences number of each variable. This shows that only using constrained density (the ration of clauses to variables) to character structural features of a CNF formula is not enough. The study of the critical function f(k,d) is helpful to construct some hard instance under stronger regular constraints.

State-of-the-art Survey of Consensus Mechanisms on DAG-based Distributed Ledger

GAO Zheng-Feng , ZHENG Ji-Lai , TANG Shu-Yang , LONG Yu , LIU Zhi-Qiang , LIU Zhen , GU Da-Wu

2020, 31(4):1124-1142. DOI: 10.13328/j.cnki.jos.005982

Abstract (5776) HTML (4673) PDF 1.83 M (9368) Comment (0) Favorites

Abstract:Since the emergence of Bitcoin in 2008, various decentralized consensus schemes have been brought about to realize a decentralized ledger. Most existing schemes adopt a blockchain, which is the fundamental building block of the consensus of Bitcoin, to store and extend the ledger. However, classical blockchain is heavily bounded in its scalability. Specifically, its throughput of transactions is far from satisfactory and transactions are confirmed at a slow rate, which greatly limit the practical application of blockchain. To face this issue, novel consensus schemes based on direct acyclic graphs (DAGs) are introduced in an attempt of achieving better performance. Due to its high concurrency feature, the research of DAG-based distributed ledger is getting more and more attention. With a systematic survey, it is proposed that DAG ledgers can be classified into three categories by the feature of its underground consensus mechanisms, i.e., DAG with a main chain, DAG of parallel chains, and naive DAG. To begin with, the pivotal features and characteristics of current consensus system associated with each category are introduced. After that, a comprehensive evaluation regarding different aspects of current systems is conducted. Finally, several open challenges on DAG-Based consensus schemes are identified to consider in future research endeavors.

Topic Analysis on Chinese Programming Question and Answer Websites

JIANG Jing , Lü Jiang-Feng , ZHANG Li

2020, 31(4):1143-1161. DOI: 10.13328/j.cnki.jos.005987

Abstract (2488) HTML (3384) PDF 1.93 M (6103) Comment (0) Favorites

Abstract:Programming question and answer website is a network platform where software developers can exchange technical knowledge by posting and answering questions. With the development of Internet and growth in the number of software developers, programming question and answer websites accumulate extensive discussion contents of software engineering knowledge. Researchers have applied topic analysis on English question and answer websites in recent years, yet there are few similar studies on Chinese programming question and answer websites. Analyzing these contents can help developers know more about the trends of techniques. It also benefits website administrator to improve the forum for better user experience, etc. This study applies latent Dirichlet allocation (LDA) to automatically cluster the main topics in 92 383 questions on OSCHINA. Then, several analyses are applied to these topics, including trend analysis, difficulty analysis, and keyword analysis. Several findings are as follow:(1) Topics concluded from user discussion can be divided into 6 categories, including front-end development, back-end development, databases, operating systems, general techniques, and others. Within those categories, front-end development contains the most question posts. (2) Using trend analysis, it is found that in back-end development, developers are paying more attention to more up-to-date and advanced topics (distributed systems, system design & Web interfaces) rather than basic topics (project deployment, server configuration). (3) It is also found that data presentation is the most difficult topic, as it has the highest ratio of questions which are never answered while its popularity is above average. (4) The trend of different specific techniques is analyzed in one topic. For instance, the popularity of Java in the technique learning topic is obviously higher than the popularity of Python.

Active Learning Approach for Crowdsourcing-enhanced Data Cleaning

YE Chen , WANG Hong-Zhi , GAO Hong , LI Jian-Zhong

2020, 31(4):1162-1172. DOI: 10.13328/j.cnki.jos.005801

Abstract (1917) HTML (1765) PDF 1.16 M (4598) Comment (0) Favorites

Abstract:Traditional methods usually adopt machine learning algorithms for data cleaning. Although these methods can solve some problems, there still are computational difficulties, lack of sufficient knowledge, and other limitations. In recent years, with the rise of the crowdsourcing, more and more research has introduced crowdsourcing into the process of data cleaning, providing the extra knowledge needed for machine learning. Since workers on the crowdsourcing platforms require to be paid, it is essential to study how to effectively combine machine learning algorithms with crowdsourcing on a limited budget. This study proposes two active learning models to support crowdsourcing-enhanced data cleaning. By using active learning technology to reduce crowdsourcing cost, data cleaning based on real crowdsourcing platform is realized for given data sets, which can reduce cost and improve data quality at the same time. Experimental results on the real-world datasets show the effectiveness of the proposed methods.

Adaptive Online Kernel Density Estimation Method

DENG Qi-Lin , QIU Tian-Yu , SHEN Fu-Rao , ZHAO Jin-Xi

2020, 31(4):1173-1188. DOI: 10.13328/j.cnki.jos.005674

Abstract (2456) HTML (2993) PDF 1.73 M (7918) Comment (0) Favorites

Abstract:Based on observed data, density estimation is the construction of an estimate of an unobservable underlying probability density function. With the development of data collection technology, real-time streaming data becomes the main subject of many related tasks. It has the properties of that high throughput, high generation speed, and the underlying distribution of data may change over time. However, for the traditional density estimation algorithms, parametric methods make unrealistic assumptions on the estimated density function while non-parametric ones suffer from the unacceptable time and space complexity. Therefore, neither parametric nor non-parametric ones could scale well to meet the requirements of streaming data environment. In this study, based on the analysis of the learning strategy in competitive learning, it is proposed a novel online density estimation algorithm to accomplish the task of density estimation for such streaming data. And it is also pointed out that it has pretty close relationship with the Gaussian mixture model. Finally, the proposed algorithm is compared with the existing density estimation algorithms. The experimental results show that the proposed algorithm could obtain better estimates compared with the existing online algorithm, and also get comparable estimation performance compared with state-of-the-art offline density estimation algorithms.

Spatial Context-aware Mention Target Recommendation Method

TANG Xiao-Yue , ZHOU Kang , WANG Kai

2020, 31(4):1189-1211. DOI: 10.13328/j.cnki.jos.005616

Abstract (2082) HTML (1596) PDF 2.33 M (4817) Comment (0) Favorites

Abstract:As a newly emerging social media user interactive service, mention mechanism is playing an important role in both information sharing and online social interacting. Researches on mention mechanism can provide us valuable resources to reveal the correlation between users' latent preferences and their explicit interacting behaviors and can be constructed as the data foundation for many applications such as information dissemination monitoring, business intelligence, and personalized recommendation. However, most of the previous works focused on the information diffusion aspect, lacking the in-depth study on its interaction attribute from the common users' perspective. This study aims to construct a recommendation system to automatically recommend target users for given social media posts based on the analysis and modeling of common users' mention behaviors. This study first analyzes two large-scale real-world datasets to explore the mention mechanism from the aspect of users' interactions and finds that, users' mention behaviors are impacted by both the semantic and the spatial context of their mention activities. Secondly, based on a unified definition of the joint semantic and spatial context-aware mention behavior, a joint latent probabilistic generative model named JUMBM (joint user mention behavior model) is built to simulate the generating process of users' mention activities. Specially, JUMBM is able to simultaneously capture users' movement patterns, geographical area-dependent semantic interests, and the geographical clustering patterns of the targets users. Besides, a hybrid pruning algorithm is proposed to achieve a fast high-dimensional retrieval and facilitate the online top-k query answering. Extensive experiments on real-world datasets demonstrate the significant superiority of the proposed approach over the baseline methods to make more effective and efficient recommendations.

CHS-BPR: Combining Content-aware and Heterogeneous-aware for Event Recommendation

SHANG Yan-Min , CAO Ya-Nan , LIU Yan-Bing

2020, 31(4):1212-1224. DOI: 10.13328/j.cnki.jos.005544

Abstract (2023) HTML (1530) PDF 1.54 M (4435) Comment (0) Favorites

Abstract:The Web has grown into one of the most important channels to communicate social events nowadays. However, the sheer volume of events available in event-based social networks (EBSNs) often undermines the users' ability to choose the events that best fit their interests. Recommender systems appear as a natural solution for this problem. Different from classic recommendation problems (e.g. movies), event recommendation generally faces three complex problems:Heterogeneous social relationships (online and offline) among users, the implicit feedback data and the content-context information of users/events. How to effectively fuse this information for event recommendation is a common concern for scholars in this field. This work presents a Bayesian latent factor model that combines users/items content-context information and heterogeneous social information for event recommendation. Experimental results on several real-world datasets demonstrate the proposed method can efficiently tackle with implicit feedback characteristic for event recommendation.

Accelerating Core Decomposition in Complex Network on GPUs

ZHANG Heng , CUI Qiang , HOU Peng-Peng , WU Yan-Jun , ZHAO Chen

2020, 31(4):1225-1239. DOI: 10.13328/j.cnki.jos.005627

Abstract (1946) HTML (1580) PDF 1.71 M (4124) Comment (0) Favorites

Abstract:To analysis the complex networks, core decomposition is a basic and efficient strategy to distinguish the relative importance of nodes and to discover a special family of core subgraphs in networks. After core decomposition, every node in each k-core subgraph connects to other k neighbor nodes internally. The core decomposition has been widely applied in several application scenarios, e.g., user behavior analysis in social networks, visualization of complex networks, static analysis in large system software project, etc. With the increasing scale and complexity of networks, existing works, which mostly focus on the multi-core CPU-based implementation of core decomposition, cannot satisfy the high performance of core decomposition in large-scale complex networks. Meanwhile, GPU provides not only massive parallelism degree (up to 10 000 threads) but also efficient memory I/O bandwidth (approximately 100 GB/s), which makes it an excellent hardware platform for large graph structure analytic, such as BFS (breadth first search), SSSP (single source shortest path) algorithms. This study proposes two strategies to enhance the parallel performance of core decomposition on GPU-based platform. An algorithm, RLCore, is first presented which exploits GPU-based bottom-up traverse approach and recursively distinguishes the core levels of nodes by considering their degree and edges. Then, a second optimal algorithm is proposed to improve performance and scalability, namely ESCore, based on the locality property of core decomposition. In ESCore, nodes gather and update their core level values from their neighbors, until there is no update among nodes. Compared to RLCore, ESCore strategy reduces the synchronization overhead from multi-thread contention when scaling to massive parallelism, whereas the iteration number of ESCore is depended on the convergence of nodes. From the evaluation results, two proposed acceleration algorithms achieve maximum 4.06 billion TEPS (traversed edges per second), which corresponds to up to 33.6X speedup compared to a single threaded CPU execution.

微信服务号

微信订阅号

>Review Articles

>Review Articles

Current Issue

Volume

Issue