2021, 32(3):689-711.
DOI: 10.13328/j.cnki.jos.006176
Abstract:
As the basis of data management and analysis, data quality has increasingly become a research hotspot in related fields. Moreover, good data quality helps optimize and advance big data and artificial intelligence technology. In general, physical failures or technical defects in data collection and recording devices cause anomalies in the collected data. These anomalies significantly affect subsequent data analysis and artificial intelligence processes, so data should be processed and cleaned before use. Existing smoothing-based repairing methods over-repair a large number of originally correct data points into wrong values, and constraint-based methods such as sequential dependency and SCREEN cannot accurately repair data under complex conditions because their constraints are relatively simple. This study therefore proposes a time series data repairing method under multi-speed constraints, based on the principle of minimum repairing, and uses dynamic programming to find the optimal repair of the anomalous data. Specifically, multiple speed intervals are used to constrain the time series data, and a set of repair candidate points is generated for each data point according to the speed constraints. The optimal repair solution is then selected from these candidates by dynamic programming. To verify the feasibility and effectiveness of the method, experiments are conducted on one synthetic data set, two real data sets, and another real data set containing real anomalies, under different anomaly rates and data sizes. The experimental results show that, compared with existing smoothing-based and constraint-based methods, the proposed method performs better in terms of RMS error and time cost.
In addition, clustering and classification accuracy measured on several data sets demonstrates the impact of data quality on subsequent data analysis and artificial intelligence. The proposed method can improve the quality of data analysis and artificial intelligence results.
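The idea summarized in the abstract (speed intervals, per-point repair candidates, and a dynamic program that minimizes the total repair) can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the abstract does not say how candidates are generated, so the sketch assumes each point's candidates are its observed value plus the values reachable from its immediate neighbors' observed values at the bounds of each speed interval. The names `repair_multi_speed`, `candidate_values`, `speed_ok`, and the tolerance `EPS` are hypothetical, and timestamps are assumed to be strictly increasing.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (min_speed, max_speed), an assumed representation
EPS = 1e-9  # tolerance for floating-point speed comparisons (assumption)


def speed_ok(t0: float, x0: float, t1: float, x1: float,
             intervals: Sequence[Interval]) -> bool:
    """True if the speed from (t0, x0) to (t1, x1) lies in any allowed interval."""
    speed = (x1 - x0) / (t1 - t0)
    return any(lo - EPS <= speed <= hi + EPS for lo, hi in intervals)


def candidate_values(times: Sequence[float], values: Sequence[float], i: int,
                     intervals: Sequence[Interval]) -> List[float]:
    """Assumed candidate rule: the observed value, plus values reached from each
    neighbor's observed value at the lower and upper bound of every interval."""
    cands = {values[i]}
    for j in (i - 1, i + 1):
        if 0 <= j < len(values):
            dt = times[i] - times[j]
            for lo, hi in intervals:
                cands.add(values[j] + lo * dt)
                cands.add(values[j] + hi * dt)
    return sorted(cands)


def repair_multi_speed(times: Sequence[float], values: Sequence[float],
                       intervals: Sequence[Interval]) -> List[float]:
    """Pick one candidate per point so that consecutive points satisfy the speed
    constraints and the total absolute change is minimal (dynamic programming)."""
    n = len(values)
    if n == 0:
        return []
    inf = float("inf")
    cands = [candidate_values(times, values, i, intervals) for i in range(n)]
    # cost[i][k]: minimal total repair for points 0..i if point i takes cands[i][k]
    cost = [[abs(c - values[0]) for c in cands[0]]]
    back = [[-1] * len(cands[0])]
    for i in range(1, n):
        row_cost, row_back = [], []
        for c in cands[i]:
            best, arg = inf, -1
            for k, p in enumerate(cands[i - 1]):
                if cost[i - 1][k] < best and speed_ok(times[i - 1], p,
                                                      times[i], c, intervals):
                    best, arg = cost[i - 1][k], k
            row_cost.append(best + abs(c - values[i]) if arg >= 0 else inf)
            row_back.append(arg)
        cost.append(row_cost)
        back.append(row_back)
    k = min(range(len(cands[-1])), key=lambda idx: cost[-1][idx])
    if cost[-1][k] == inf:
        raise ValueError("no feasible repair within the candidate sets")
    repaired = [0.0] * n
    for i in range(n - 1, -1, -1):  # follow back-pointers from the last point
        repaired[i] = cands[i][k]
        k = back[i][k]
    return repaired


# A single spike under one speed interval [-1, 1] is pulled back into line.
print(repair_multi_speed([0, 1, 2, 3, 4], [1.0, 2.0, 10.0, 4.0, 5.0], [(-1.0, 1.0)]))
```

With a second interval such as `(4.0, 5.0)`, a jump whose speed falls in that interval is accepted as valid and left unchanged, which is the kind of case a single speed range cannot express. Because candidates are restricted to a small finite set, this sketch may fail to find a repair that a richer candidate set would allow.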