Constrained Query of Order-Preserving Submatrix Based on Signature and Trie
Author:
Affiliation:

Fund Project:

National Program on Key Basic Research Project of China (973) (2012CB316203); National Natural Science Foundation of China (61033007, 61272121, 61332014, 61572367, 61472321, 61502390); National High-Tech R&D Program of China (863) (2015AA015307); Fundamental Research Funds for the Central Universities (3102015JSJ0011); Graduate Starting Seed Fund of the Northwestern Polytechnical University (Z2012128)

  • Article
  • | |
  • Metrics
  • |
  • Reference [62]
  • |
  • Related [20]
  • | | |
  • Comments
    Abstract:

    The advances of microarray technology have made large amount of gene expression data available from a variety of different experimental conditions. Analyzing the microarray data plays a key role in understanding gene functions, gene regulation and cellular process. Order-Preserving Submatrix (OPSM) is an important model in microarray data analysis, which captures the identical tendency of gene expressions across a subset of conditions. In the process of analyzing mechanism of gene expression, OPSM search undoubtedly saves the time and effort of biologists. However, OPSM retrieval mainly depends on keyword search, resulting a weak control on the obtained clusters. Typically, the analyst can determine the ad-hoc parameters which are far from the declarative specification of desired properties on operation and concept. Motivated by obtaining much more accurate query relevancy, this paper proposes two types of OPSM indexing and constrained query methods based on signature and Trie. Extensive experiments conducted on real datasets demonstrate the proposed methods have better behaviors than the state-of-the-art methods on efficiency and effectiveness.

    Reference
    [1] Ben-Dor A, Chor B, Karp R, Yakhini Z. Discovering local structure in gene expression data:The order-preserving submatrix problem. In:Proc. of the 6th Annual Int'l Conf. on Computational Biology (RECOMB). ACM Press, 2002. 49-57.[doi:10.1145/565196.565203]
    [2] Hartigan JA. Direct clustering of a data matrix. Journal of the American Statistical Association, 1972,67(337):123-129.[doi:10. 1080/01621459.1972.10481214]
    [3] Cheng Y, Church GM. Biclustering of expression data. In:Proc. of the 8th Int'l Conf. on Intelligent Systems for Molecular Biology (ISMB). AAAI Press, 2000. 93-103.
    [4] Yang J, Wang W, Wang H, Yu PS. δ-Clusters:Capturing subspace correlation in a large data sets. In:Proc. of the 18th Int'l Conf. on Data Engineering (ICDE). IEEE Press, 2002. 517-528.[doi:10.1109/ICDE.2002.994771]
    [5] Cho H, Dhillon IS, Guan Y, Sra S. Minimum sum-squared residue co-clustering of gene expression data. In:Proc. of the 12th SIAM Int'l Conf. on Data Mining (SDM). SIAM Press, 2004. 114-125.[doi:10.1137/1.9781611972740.11]
    [6] Divina F, Aguilar-Ruiz JS. Biclustering of expression data with evolutionary computation. IEEE Trans. on Knowledge and Data Engineering, 2006,18(5):590-602.[doi:10.1109/TKDE.2006.74]
    [7] Deodhar M, Gupta G, Ghosh J, Cho H, Dhillon IS. A scalable framework for discovering coherent co-clusters in noisy data. In:Proc. of the 26th Annual Int'l Conf. on Machine Learning (ICML). ACM Press, 2009. 241-248.[doi:10.1145/1553374.1553405]
    [8] Cho H. Data transformation for sum squared residue. In:Proc. of the 14th Pacific-Asia Conf. on Advances in Knowledge Discovery and Data Mining (PAKDD). Berlin, Heidelberg:Springer-Verlag, 2010. 48-55.[doi:10.1007/978-3-642-13657-3_8]
    [9] Odibat O, Reddy CK. A generalized framework for mining arbitrarily positioned overlapping co-clusters. In:Proc. of the 19th SIAM Int'l Conf. on Data Mining (SDM). SIAM Press, 2011. 343-354.[doi:10.1137/1.9781611972818.30]
    [10] Ayadi W, Elloumi M, Hao JK. BicFinder:A biclustering algorithm for microarray data analysis. Knowledge and Information Systems, 2012,30(2):341-358.[doi:10.1007/s10115-011-0383-7]
    [11] Truong DT, Battiti R, Brunato M. Discovering non-redundant overlapping biclusters on gene expression data. In:Proc. of the 13th IEEE Int'l Conf. on Data Mining (ICDM). IEEE Press, 2013. 747-756.[doi:10.1109/ICDM.2013.36]
    [12] Ayadi W, Hao JK. A memetic algorithm for discovering negative correlation biclusters of DNA microarray data. Neurocomputing, 2014,145:14-22.[doi:10.1016/j.neucom.2014.05.074]
    [13] Chen S, Liu J, Zeng T. MMSE:A generalized coherence measure for identifying linear patterns. In:Proc. of the IEEE Int'l Conf. on Bioinformatics and Biomedicine (BIBM). IEEE Press, 2014. 489-492.[doi:10.1109/BIBM.2014.6999206]
    [14] Denitto M, Farinelli A, Bicego M. Biclustering gene expressions using factor graphs and the max-sum algorithm. In:Proc. of the 24th Int'l Conf. on Artificial Intelligence (IJCAI). AAAI Press, 2015. 925-931.
    [15] Wang H, Wang W, Yang J, Yu PS. Clustering by pattern similarity in large data sets. In:Proc. of the 28th ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD). ACM Press, 2002. 394-405.[doi:10.1145/564691.564737]
    [16] Liu J, Wang W. OP-Clustering by tendency in high dimensional space. In:Proc. of the 3th IEEE Int'l Conf. on Data Mining (ICDM). IEEE Press, 2003. 187-194.[doi:10.1109/ICDM.2003.1250919]
    [17] Wang H, Pei J, Yu PS. Pattern-Based similarity search for microarray data. In:Proc. of the 11th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (KDD). ACM Press, 2005. 814-819.[doi:10.1145/1081870.1081978]
    [18] Kriegel HP, Kroger P, Renz M, Wurst SHR. A generic framework for efficient subspace clustering of high-dimensional data. In:Proc. of the 5th IEEE Int'l Conf. on Data Mining (ICDM). IEEE Press, 2005. 250-257.[doi:10.1109/ICDM.2005.5]
    [19] Jiang D, Pei J, Zang A. A general approach to mining quality pattern-based clusters from expression data. In:Proc. of the 10th Int'l Conf. on Database Systems for Advanced Applications (DASFAA). Springer-Verlag, 2005. 188-200.[doi:10.1007/11408079_18]
    [20] Gao BJ, Griffith OL, Ester M, Jones SJM. Discovering significant OPSM subspace clusters in massive gene expression data. In:Proc. of the 12th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (KDD). ACM Press, 2006. 922-928.[doi:10.1145/1150402.1150529]
    [21] Zhang X, Wang W. An efficient algorithm for mining coherent patterns from heterogeneous microarrays. In:Proc. of the 19th Int'l Conf. on Scientific and Statistical Database Management (SSDBM). IEEE Computer Society Press, 2007. 32.[doi:10.1109/SSDBM.2007.30]
    [22] Yan L, Sun Z, Wu Y, Zhang B. Biclustering nonlinearly correlated time series gene expression data. Journal of Computer Research and Development, 2008,45(11):1865-1873(in Chinese with English abstract).
    [23] Zhang M, Wang W, Liu J. Mining approximate order preserving clusters in the presence of noise. In:Proc. of the 24th Int'l Conf. on Data Engineering (ICDE). IEEE Press, 2008. 160-168.[doi:10.1109/ICDE.2008.4497424]
    [24] Chui CK, Kao B, Yip KY, Lee SD. Mining order-preserving submatrices from data with repeated measurements. In:Proc. of the 8th IEEE Int'l Conf. on Data Mining (ICDM). IEEE Press, 2008. 133-142.[doi:10.1109/ICDM.2008.12]
    [25] Zhao Y, Yu J, Wang G, Chen L, Wang B, Yu G. Maximal subspace coregulated gene clustering. IEEE Trans. on Knowledge and Data Engineering, 2008,20(1):83-98.[doi:10.1109/TKDE.2007.190670]
    [26] Trapp AC, Prokopyev OA. Solving the order-preserving submatrix problem via integer programming. INFORMS Journal on Computing, 2010,22(3):387-400.[doi:10.1287/ijoc.1090.0358]
    [27] Fang Q, Ng W, Feng J. Discovering significant relaxed order-preserving submatrices. In:Proc. of the 16th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining. ACM Press, 2010. 433-442.[doi:10.1145/1835804.1835861]
    [28] Fang Q, Ng W, Feng J, Li Y. Mining bucket order-preserving submatrices in gene expression data. IEEE Trans. on Knowledge and Data Engineering, 2012,24(12):2218-2231.[doi:10.1109/TKDE.2011.180]
    [29] Gao BJ, Griffith OL, Ester M, Xiong H, Zhao Q, Jones SJM. On the deep order-preserving submatrix problem:A best effort approach. IEEE Trans. on Knowledge and Data Engineering, 2012,24(2):309-325.[doi:10.1109/TKDE.2010.244]
    [30] Yip KY, Kao B, Zhu X, Chui CK, Lee SD, Cheung DW. Mining order-preserving submatrices from data with repeated measurements. IEEE Trans. on Knowledge and Data Engineering, 2013,25(7):1587-1600.[doi:10.1109/TKDE.2011.167]
    [31] An P. Research on biclustering methods for gene expression data analysis[MS. Thesis]. Suzhou:Soochow University, 2013(in Chinese with English abstract).
    [32] Fang Q, Ng W, Feng J, Li Y. Mining order-preserving submatrices from probabilistic matrices. ACM Trans. on Database Systems, 2014,39(1):No.6.[doi:10.1145/2533712]
    [33] Cho S, Na JC, Park K, Sim JS. A fast algorithm for order-preserving pattern matching. Information Processing Letters, 2015,115(2):397-402.[doi:10.1016/j.ipl.2014.10.018]
    [34] Alqadah F, Bader JS, Anand R, Reddy CK. Query-Based biclustering using formal concept analysis. In:Proc. of the 12th SIAM Int'l Conf. on Data Mining (SDM). SIAM Press, 2012. 648-659.[doi:10.1137/1.9781611972825.56]
    [35] Pensa RG, Robardet C, Boulicaut JF. Towards constrained co-clustering in ordered 0/1 data sets. In:Proc. of the 16th Int'l Symp. on Methodologies for Intelligent Systems (ISMIS). Springer-Verlag, 2006. 425-434.[doi:10.1007/11875604_49]
    [36] Pensa RG, Robardet C, Boulicaut JF. Constraint-Driven co-clustering of 0/1 data. In:Proc. of the Constrained Clustering:Advances in Algorithms, Data Mining and Knowledge Discovery Series. 2008. 123-148.[doi:10.1201/9781584889977.ch6]
    [37] Pensa RG, Boulicaut JF. Constrained co-clustering of gene expression data. In:Proc. of the 8th SIAM Int'l Conf. on Data Mining (SDM). SIAM Press, 2008. 25-36.[doi:10.1137/1.9781611972788.3]
    [38] Jiang T, Li Z, Chen Q, Li K, Wang Z, Pan W. Towards order-preserving submatrix search and indexing. In:Proc. of the 20th Int'l Conf. on Database Systems for Advanced Applications (DASFAA). Part Ⅱ. Berlin, Heidelberg:Springer-Verlag, 2015. 309-326.[doi:10.1007/978-3-319-18123-3_19]
    [39] Helmer S, Moerkotte G. Evaluation of main memory join algorithm for joins with subset join predicates. In:Proc. of the 23rd Int'l Conf. on Very Large Database (VLDB). ACM Press, 1997. 386-395.
    [40] KiWi Software 1.0. 2012. http://www.bcgsc.ca/platform/bioinfo/ge/kiwi/
    [41] Broad Institute. Datasets.rar and 5q_gct_file.gct. 1999. http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
    [42] Jiang T, Li Z, Chen Q, Wang Z, Li K, Wang Z. Parallel partitioning and mining gene expression data with butterfly network. In:Proc. of the 24th Int'l Conf. on Database and Expert Systems Applications (DEXA). Part I. Berlin, Heidelberg:Springer-Verlag, 2013. 129-144.[doi:10.1007/978-3-642-40285-2_13]
    [43] Jiang T, Li Z, Chen Q, Wang Z, Li K, Pan W. OMEGA:An order-preserving submatrix mining, indexing and search. In:Proc. of the European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD). Part Ⅲ. Berlin, Heidelberg:Springer-Verlag, 2015. 303-307.[doi:10.1007/978-3-319-23461-8_35]
    [44] Sim K, Gopalkrishnan V, Zimek A, Cong G. A survey on enhanced subspace clustering. Data Mining and Knowledge Discovery, 2013,26:332-397.[doi:10.1007/s10618-012-0258-x]
    [45] Kriegel HP, Kroger P, Zimek A. Clustering of high-dimensional data:A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans. on Knowledge Discovery from Data, 2009,3(1):1-58.[doi:10.1145/1497577.1497578]
    [46] Yue F, Sun L, Wang K, Wang Y, Zuo W. State-of-the-Art of cluster analysis of gene expression data. Acta Automatica Sinica, 2008,34(2):113-120(in Chinese with English abstract).[doi:10.3724/SP.J.1004.2008.00113]
    [47] Jiang D, Tang C, Zhang AD. Cluster analysis for gene expression data:A survey. IEEE Trans. on Knowledge and Data Engineering, 2004,16(11):1370-1386.[doi:10.1109/TKDE.2004.68]
    [48] Madeira SC, Oliveira AL. Biclustering algorithms for biological data analysis:A survey. IEEE Trans. on Computational Biology and Bioinformatics, 2004,1(1):24-45.[doi:10.1109/TCBB.2004.2]
    [49] Smet RD, Marchal K. An ensemble method for querying gene expression compendia with experimental lists. In:Proc. of the 2010 IEEE Int'l Conf. on Bioinformatics and Biomedicine (BIBM). IEEE Press, 2010. 314-318.[doi:10.1109/BIBM.2010. 5706583]
    [50] Zou Q, Guo M, Liu Y, Wang J. A classification method for class-imbalanced data and its application on bioinformatics. Journal of Computer Research and Development, 2010,47(8):1407-1414(in Chinese with English abstract).
    [51] Zou Q, Li X, Jiang W, Lin Z, Li G, Chen K. Survey of mapreduce frame operation in bioinformatics. Briefings in Bioinformatics, 2014,15(4):637-647.[doi:10.1093/bib/bbs088]
    [52] Chen W, Cheng Y, Zhang S, Pan Q. Heuristic clustering method based on neighbor-seeds for 454 sequencing data. Ruan Jian Xue Bao/Journal of Software, 2014,25(5):929-938(in Chinese with English abstract). http://www.jos.org.cn/1000-9825/4547.htm[doi:10.13328/j.cnki.jos.004547]
    [53] Zou Q, Hu Q, Guo M, Wang G. HAlign:Fast multiple similar DNA/RNA sequence alignment based on the centre star strategy. Bioinformatics, 2015,31(15):2475-2481.[doi:10.1093/bioinformatics/btv177]
    [54] Dhollander T, Sheng QZ, Lemmens K, Moor BD, Marchal K, Moreau Y. Query-Driven module discovery in microarray data. Bioinformatics, 2007,23(19):2573-2580.[doi:10.1093/bioinformatics/btm387]
    [55] Zhao H, Cloots L, Bulcke TV, Wu Y, Smet RD, Storms V, Meysman P, Engleen K, Marchal K. Query-Based biclustering of gene expression data using probabilistic relational models. BMC Bioinformatics, 2011,12(s1):S37.[doi:10.1186/1471-2105-12-S1-S37]
    [56] Tseng VS, Chen LC, Kao CP. Constrained clustering for gene expression data mining. In:Proc. of the 12th Pacific-Asia Conf. on Advances in Knowledge Discovery and Data Mining (PAKDD). Berlin, Heidelberg:Springer-Verlag, 2008. 759-766.[doi:10. 1007/978-3-540-68125-0_73]
    附中文参考文献:
    [22] 闫雷鸣,孙志挥,吴英杰,张柏礼.联合聚类非线性相关的时序基因表达数据.计算机研究与发展,2008,45(11):1865-1873.
    [31] 安平.基因表达数据的双聚类分析方法研究[硕士学位论文].苏州:苏州大学,2013.
    [46] 岳峰,孙亮,王宽全,王永吉,左旺孟.基因表达数据的聚类分析研究进展.自动化学报,2008,34(2):113-120.[doi:10.3724/SP.J.1004. 2008.00113]
    [50] 邹权,郭茂祖,刘扬,王峻.类别不平衡的分类方法及在生物信息学中的应用.计算机研究与发展,2010,47(8):1407-1414.
    [52] 陈伟,程咏梅,张绍武,潘泉.邻域种子的启发式454序列聚类方法.软件学报,2014,25(5):929-938. http://www.jos.org.cn/1000-9825/25/929.htm[doi:10.13328/j.cnki.jos.004547]
    Cited by
    Comments
    Comments
    分享到微博
    Submit
Get Citation

姜涛,李战怀,尚学群,陈伯林,李卫榜,殷知磊.基于数字签名与Trie的保序子矩阵约束查询.软件学报,2017,28(8):2175-2195

Copy
Share
Article Metrics
  • Abstract:2726
  • PDF: 4646
  • HTML: 1926
  • Cited by: 0
History
  • Received:January 20,2016
  • Revised:May 20,2016
  • Online: August 15,2017
You are the first2032443Visitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address:4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code:100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063