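PandaDB: Intelligent Management System for Heterogeneous Data
About the authors:

Shen Zhihong (沈志宏, born 1977), male, Ph.D., professor-level senior engineer, doctoral supervisor. Main research areas: big data, graph data management, and the Semantic Web.
Zhao Zihao (赵子豪, born 1994), male, Ph.D. candidate, CCF student member. Main research areas: distributed graph databases and fused data querying.
Wang Huajin (王华进, born 1987), male, Ph.D., assistant research fellow, CCF professional member. Main research areas: distributed computing and big data analysis technology.
Liu Zhongxin (刘忠新, born 1989), male, engineer. Main research areas: distributed storage and databases.
Hu Chuan (胡川, born 1998), male, master's student. Main research area: knowledge graph visualization.
Zhou Yuanchun (周园春, born 1975), male, Ph.D., research fellow, doctoral supervisor, CCF senior member. Main research areas: scientific big data and knowledge graphs.

Corresponding author:

Shen Zhihong, E-mail: bluejoe@cnic.cn

Fund project:

Strategic Priority Research Program of the Chinese Academy of Sciences, Category B (XDB38030300); Key Project of the National Natural Science Foundation of China (61836013); Innovation Method Special Project of the Ministry of Science and Technology (2019IM020100); Informatization Program of the Chinese Academy of Sciences (XXH13503)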

    Abstract:

    As big data applications mature, demand is growing for the fused management and analysis of large-scale structured and unstructured data. However, structured and unstructured data differ in how they are stored and managed, how information is obtained from them, and how they are retrieved, and these differences pose technical challenges for fused management and analysis. This study proposes an extended property graph model suited to the fused management and semantic computing of heterogeneous data, and defines the related property operators and query syntax. Based on this intelligent property graph model, the study implements PandaDB, an intelligent fusion management system for heterogeneous data, and describes its overall architecture, storage mechanism, query mechanism, property co-storage, AI algorithm integration and scheduling mechanism, and distributed architecture. Performance tests and application cases show that PandaDB's co-storage mechanism, distributed architecture, and semantic indexing mechanism perform well for ad hoc queries and analysis over large-scale heterogeneous data, and that the system can be applied in practice to fused data management scenarios such as entity disambiguation and visualization of academic knowledge graphs.

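Cite this article

Shen Zhihong, Zhao Zihao, Wang Huajin, Liu Zhongxin, Hu Chuan, Zhou Yuanchun. PandaDB: Intelligent management system for heterogeneous data. Ruan Jian Xue Bao/Journal of Software, 2021,32(3):763-780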

History
  • Received: 2020-07-20
  • Last revised: 2020-09-03
  • Published online: 2021-01-21
  • Publication date: 2021-03-06