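Learning Pair-wise Relationship Models for Ranking Why-not Problem Explanations
Authors: Qi Danrui (祁丹蕊), Song Shaoxu (宋韶旭), Wang Jianmin (王建民)

Author biographies:

Qi Danrui (born 1997), female, from Chifeng, Inner Mongolia, master's student; main research area: data cleaning. Song Shaoxu (born 1981), male, Ph.D., associate professor, doctoral supervisor, CCF professional member; main research area: databases. Wang Jianmin (born 1968), male, Ph.D., professor, doctoral supervisor, CCF senior member; main research areas: databases and workflow.

Corresponding author:

Song Shaoxu, E-mail: sxsong@tsinghua.edu.cn

Funding:

National Key Research and Development Program of China (2016YFB1001101); National Natural Science Foundation of China (61572272, 71690231)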

    Abstract:

    Database users often fail to obtain an expected answer in query results because databases are frequently incomplete, with missing data. This is known as the Why-not problem: "why does the expected tuple not appear in the results?" Existing methods explain the Why-not problem by enumerating possible tuple values, but the number of enumerated explanations is often too large for users to explore. Integrity constraints, such as functional dependencies, are employed to rule out unqualified explanations. However, many attributes in the reduced explanations are represented merely as variables, which users may still not understand, and, owing to data sparsity, many unreasonable explanations are still recommended to users. This work proposes ranking Why-not explanations by studying the pair-wise relationships between tuples. First, the form of a Why-not explanation is redefined to contain no variables, so that users can understand it easily. Second, the equality/inequality relationships within tuple pairs are represented, and statistical models are learned over this {0,1} representation of tuple pairs rather than over the original data, which avoids the sparsity problem of learning directly on raw values. A number of models can be used to infer the probability, including statistical distribution, classification, and regression. Finally, the explanations are evaluated and ranked according to the inferred probability. Experiments show that computing the probability of pair-wise relationships with statistical, classification, and regression methods finds explanations for Why-not questions and returns relatively high-quality ones.
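
The {0,1} tuple-pair representation described in the abstract can be illustrated with a small sketch. This is not the authors' model: it assumes the simplest possible "statistical distribution", namely the empirical frequency of each equality pattern among existing tuple pairs, and scores a candidate explanation by the average frequency of the patterns it forms with existing tuples. The relation, its attributes (name, city, zip), and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_vector(t1, t2):
    """Encode a tuple pair as a {0,1} vector: 1 where attribute values are equal, else 0."""
    return tuple(int(a == b) for a, b in zip(t1, t2))


def learn_pattern_distribution(tuples):
    """Empirical distribution of equality patterns over all pairs of existing tuples."""
    counts = Counter(pair_vector(a, b) for a, b in combinations(tuples, 2))
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()}


def score(candidate, tuples, dist):
    """Average probability of the patterns the candidate forms with each existing tuple.

    Assumption for illustration only: unseen patterns get probability 0.
    """
    return sum(dist.get(pair_vector(candidate, t), 0.0) for t in tuples) / len(tuples)


def rank_explanations(candidates, tuples):
    """Sort candidate explanations (fully instantiated tuples) by descending score."""
    dist = learn_pattern_distribution(tuples)
    return sorted(candidates, key=lambda c: score(c, tuples, dist), reverse=True)


# Hypothetical relation with attributes (name, city, zip).
existing = [
    ("Ann", "Beijing", "100084"),
    ("Bob", "Beijing", "100084"),
    ("Cat", "Shanghai", "200000"),
    ("Dan", "Shanghai", "200000"),
]

# Two candidate explanations for a missing tuple, both without variables.
candidates = [
    ("Eve", "Beijing", "200000"),  # city and zip disagree with the observed pairing
    ("Eve", "Beijing", "100084"),  # consistent with the observed pairing
]

ranked = rank_explanations(candidates, existing)
```

In this toy data the second candidate ranks first, because every pattern it forms with existing tuples (same city and same zip, or neither) has been observed, whereas the first candidate only forms patterns that never occur. Learning over equality patterns rather than raw values is what sidesteps sparsity: the name "Eve" never appears in the data, yet the candidate can still be scored.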

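Cite this article

Qi DR, Song SX, Wang JM. Learning pair-wise relationship models for ranking Why-not problem explanations. Journal of Software (软件学报), 2019,30(3):620-647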

History
  • Received: 2018-07-21
  • Last revised: 2018-09-20
  • Published online: 2019-03-06