Order-Sensitive Missing Value Imputation Technology for Multi-Source Sensory Data

Fund Project:

National Natural Science Foundation of China (61472071, 61272179); National Key Basic Research Program of China (973) (2012CB316201); Fundamental Research Funds for Central Universities (N140404013)

    Abstract:

    In recent years, sensing data has grown explosively with the widespread use of sensor networks. Owing to inherent hardware limitations, the randomness of the deployment environment, and inadvertent errors during data processing, raw sensing data contain a large number of missing values. Imputing these values is essential because most existing analysis tools cannot handle data sets that contain missing values. Many missing data imputation algorithms have been proposed, but their accuracy is difficult to guarantee when the missing values are lumped together. Moreover, existing algorithms do not consider the imputation order, which influences imputation accuracy. To address these issues, this paper proposes OMSMVI, an order-sensitive missing value imputation framework for multi-source sensory data. OMSMVI makes full use of the multi-dimensional relevancy of sensing data, namely temporal, spatial, and attribute relevancy, and constructs similarity graphs centered on the missing sources based on this relevancy. During imputation, values that have already been imputed are used as observations to impute subsequent missing values. Taking the overall distribution of missing sources into account, the framework performs order-sensitive imputation: the imputation order is determined before a specific missing value imputation (MVI) method is applied. Order-sensitive imputation mitigates the loss of accuracy caused by low similarity between a missing source and its neighbors when missing sources are dense. Finally, a new neighborhood-based imputation algorithm, NI, which modifies the KNN imputation algorithm, is introduced into the OMSMVI framework. NI uses multi-dimensional similarity to find neighbors that resemble the missing source in multiple dimensions. NI thereby avoids the difficulty of choosing the parameter K in KNN and further improves imputation accuracy compared with KNN. Experiments on two real sensor data sets compare OMSMVI with baseline MVI methods and verify its accuracy and effectiveness.
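
    The idea of order-sensitive imputation with reuse of imputed values can be illustrated with a minimal sketch. It is not the OMSMVI or NI algorithm itself, whose details are not given in this abstract. The sketch assumes a single precomputed similarity score per pair of sources (standing in for the combined temporal, spatial, and attribute relevancy), a hypothetical similarity threshold in place of K, a weighted-average estimate, and a greedy ordering rule that imputes first the source with the strongest support from known neighbors. All function and parameter names are illustrative.

```python
from typing import Dict, List, Optional, Tuple

Similarity = Dict[str, Dict[str, float]]


def usable_neighbors(source: str, known: Dict[str, float],
                     similarity: Similarity, threshold: float) -> Dict[str, float]:
    """Neighbors of `source` that already have a value and are similar enough."""
    return {n: w for n, w in similarity.get(source, {}).items()
            if n in known and w >= threshold}


def order_sensitive_impute(values: Dict[str, Optional[float]],
                           similarity: Similarity,
                           threshold: float = 0.5
                           ) -> Tuple[Dict[str, float], List[str]]:
    """Greedy sketch: impute the best-supported missing source first,
    then treat its imputed value as an observation for later sources."""
    known = {s: v for s, v in values.items() if v is not None}
    missing = {s for s, v in values.items() if v is None}
    if not known:
        raise ValueError("at least one observed value is required")

    order: List[str] = []
    while missing:
        # Assumed ordering rule: total similarity to sources with known values.
        best = max(sorted(missing),
                   key=lambda s: sum(usable_neighbors(
                       s, known, similarity, threshold).values()))
        nbrs = usable_neighbors(best, known, similarity, threshold)
        if nbrs:
            # Similarity-weighted average of the neighbors' values.
            estimate = sum(w * known[n] for n, w in nbrs.items()) / sum(nbrs.values())
        else:
            # Assumed fallback when no neighbor qualifies: mean of known values.
            estimate = sum(known.values()) / len(known)
        known[best] = estimate  # the imputed value now acts as an observation
        missing.remove(best)
        order.append(best)
    return known, order


if __name__ == "__main__":
    readings = {"A": 10.0, "B": 20.0, "C": None, "D": None}
    sim = {"C": {"A": 0.9, "B": 0.6, "D": 0.8},
           "D": {"C": 0.8, "B": 0.2}}
    filled, order = order_sensitive_impute(readings, sim)
    print(order)  # → ['C', 'D']
    print(filled)
```

    In this toy input, source D has no sufficiently similar observed neighbor, so imputing C first and reusing its value gives D a usable neighbor. This is the situation the abstract describes for dense missing sources.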

Citation

Ma Q, Gu Y, Li FF, Yu G. Order-sensitive missing value imputation technology for multi-source sensory data. Journal of Software, 2016, 27(9): 2332-2347.

History
  • Received: September 25, 2015
  • Revised: January 12, 2016
  • Online: September 2, 2016