Abstract: With the rapid development of deep learning technology, research on its quality assurance is attracting increasing attention. Meanwhile, mature sensor technology has made test data easy to collect, but labeling the collected data remains expensive. To reduce labeling cost, existing work attempts to select a test subset from the original test set, ensuring only that the overall accuracy (the accuracy of the target deep learning model on all test inputs in the test set) of the subset is similar to that of the original test set. However, by focusing solely on estimating overall accuracy, such work ignores other properties of the original test set; for example, the selected subset may not cover all categories of test inputs in the original test set. This study proposes DMOS (deep multi-objective selection), a test subset selection method based on multi-objective optimization. It first analyzes the data distribution of the original test set with the HDBSCAN (hierarchical density-based spatial clustering of applications with noise) clustering method, then designs optimization objectives based on the characteristics of the clustering results, and finally performs multi-objective optimization to find appropriate selection solutions. Extensive experiments are carried out on 8 pairs of classic deep learning test sets and models. The results show that the best test subset selected by DMOS (corresponding to the Pareto optimal solution with the best performance) not only covers more categories of test inputs in the original test set, but also estimates the accuracy of each category of test inputs extremely closely. It also keeps the overall accuracy and test adequacy close to those of the original test set: the average error of overall accuracy estimation is only 1.081%, which is 0.845% lower than that of PACE (practical accuracy estimation), an improvement of 43.87%; the average error of per-category accuracy estimation is only 5.547%, which is 2.926% lower than PACE, an improvement of 34.53%; and the average estimation error over the five test adequacy measures is only 8.739%, which is 7.328% lower than PACE, an improvement of 45.61%.
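
The abstract's pipeline (HDBSCAN clustering followed by multi-objective subset selection) can be illustrated with a minimal sketch. This is not the authors' implementation: the features, the min_cluster_size and budget values, and the two objectives (cluster coverage and per-cluster proportion matching) are illustrative assumptions, and a simple random-search Pareto filter stands in for the paper's multi-objective optimizer.

```python
# Minimal sketch of the DMOS idea: cluster test inputs with HDBSCAN, then
# search for a fixed-size subset that (1) covers as many clusters as possible
# and (2) matches the per-cluster proportions of the original test set.
import numpy as np
import hdbscan  # pip install hdbscan

rng = np.random.default_rng(0)
# Placeholder test-input features: five synthetic groups so clusters form.
X = np.vstack([rng.normal(loc=3 * k, size=(400, 32)) for k in range(5)])
labels = hdbscan.HDBSCAN(min_cluster_size=20).fit_predict(X)
clusters = np.unique(labels[labels >= 0])  # ignore noise points (label -1)
orig_prop = np.array([(labels == c).mean() for c in clusters])

def objectives(subset_idx):
    """Return (cluster coverage to maximize, proportion error to minimize)."""
    sub = labels[subset_idx]
    coverage = np.mean([np.any(sub == c) for c in clusters])
    sub_prop = np.array([(sub == c).mean() for c in clusters])
    return coverage, np.abs(sub_prop - orig_prop).sum()

budget, n_candidates = 100, 500
cands = [rng.choice(len(X), size=budget, replace=False) for _ in range(n_candidates)]
scores = np.array([objectives(s) for s in cands])

# Keep the Pareto-optimal candidates (higher coverage, lower proportion error).
pareto = []
for i, (cov_i, err_i) in enumerate(scores):
    dominated = np.any((scores[:, 0] >= cov_i) & (scores[:, 1] <= err_i) &
                       ((scores[:, 0] > cov_i) | (scores[:, 1] < err_i)))
    if not dominated:
        pareto.append(cands[i])
print(f"{len(pareto)} Pareto-optimal subsets of size {budget}")
```

In the paper, the selected "best" subset corresponds to the Pareto optimal solution with the best performance; in this sketch, any element of `pareto` plays that role, and the random search would be replaced by a proper multi-objective optimizer.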