Abstract:For many practical classification applications, the class label of new data needs to be confirmed eventually by domain expert, and the result of classifier only plays an assistant role. In addition, with the implicit values of big data calling more people's attention, classifier training is going through a transition from single dataset to distributed space dataset, and assistant classification in big data environment will also become an important branch of future classification applications. However, existing classification research lacks attention to this kind of application. Assistant classification in big data environment faces with the following three problems: 1) the training set is distributed big dataset, 2) in space, the class distributions of local datasets contained in the training set are commonly different, and 3) in time, the training set is dynamic and its class distribution may change. To address the above problems, this paper proposes a distributed assistant associative classification approach in big data environment. Firstly, a distributed associative classifier constructing algorithm in big data environment is constructed. With the new algorithm, the class distribution difference in space of the classification dataset is considered by horizontal weighting, and the performance deficiency of associative classification algorithms to imbalanced class distribution datasets is improved by giving a measure framework of "antecedent space support-correlation coefficient". Next, an adaptive factor based dynamic adjustment method for assistant associative classifier is proposed. This method can make full use of domain experts' real-time feedback to adjust classifier dynamically in the applying process of the used classifier, to improve its performance facing dynamic datasets, and to slow down its retraining frequency. Experimental results demonstrate that the presented approach can relative quickly train associative classifiers with higher classification accuracy for distributed datasets, and can improve their performance when datasets are continually expanding and changing. Thus it's an effective approach for assistant classification applications in big data environment.