Elsa:一种面向跨区域架构的无协调分布式键值存储系统
作者:
作者简介:

崔玉龙(1998-),男,硕士生,主要研究领域为数据库,分布式系统;付国(1996-),男,硕士生,主要研究领域为数据库,分布式系统,流计算;张岩峰(1982-),男,博士,教授,博士生导师,CCF高级会员,主要研究领域为大数据处理,分布式系统,云计算;于戈(1968-),男,博士,教授,博士生导师,CCF会士,主要研究领域为数据库,分布式系统,嵌入式系统

通讯作者:

张岩峰,zhangyf@mail.neu.edu.cn

中图分类号:

TP311

基金项目:

国家自然科学基金(62072082,61672141);CCF-华为数据库创新研究计划(CCF-HuaweiDBIR2020009B);辽宁省重点研发计划(2020JH2/10100037)


Elsa: Coordination-free Distributed KVS for Cross-region Architecture
Author:
Fund Project:

National Natural Science Foundation of China (62072082, 61672141); CCF-Huawei Database Innovation Research Funding (CCF-HuaweiDBIR2020009B); Key R\&D Program of Liaoning Province (2020JH2/10100037)

  • 摘要
  • | |
  • 访问统计
  • |
  • 参考文献 [40]
  • |
  • 相似文献 [20]
  • |
  • 引证文献
  • | |
  • 文章评论
    摘要:

    作为具备高性能和高可伸缩性的分布式存储解决方案,键值存储系统近年来被广泛采用,例如Redis、MongoDB、Cassandra等.分布式存储系统中广泛使用的多副本机制一方面提高了系统吞吐量和可靠性,但同时也增加了系统协调和副本一致性的额外开销.对于跨域分布式系统来说,远距离的副本协调开销甚至可能成为系统的性能瓶颈,降低系统的可用性和吞吐量.提出分布式键值存储系统Elsa,这是一种面向跨区域架构的无协调键值存储系统.Elsa在保证高性能和高可拓展性的基础上,采用无冲突备份数据结构(CRDT)技术来无协调的保证副本间的强最终一致性,降低了系统节点间的协调开销.在阿里云上构建了跨4数据中心8节点的跨区域分布式环境,进行了大规模分布式性能对比实验,实验结果表明:在跨域的分布式环境下,对于高并发争用的负载,Elsa系统的性能具备明显的优势,最高达到MongoDB集群的7.37倍,Cassandra集群的1.62倍.

    Abstract:

    As a distributed storage solution with high performance and high scalability, key-value storage systems have been widely adopted in recent years, such as Redis, MongoDB, Cassandra, etc. On the one hand,the multi-replication mechanism widely used in distributed storage system improves system throughput and reliability, but also increases the extra overhead of system coordination and replicationconsistency. For the cross-region distributed system, the long-distance replication coordination overhead may even become the performance bottleneck of the system, reducing system availability and throughput. The distributed key-value storage system called Elsa, proposed in this study, is a coordination-free multi-master key-value storage system that is designed for cross-region architecture. On the basis of ensuring high performance and high scalability, Elsa adopts the conflict-free replicated data types (CRDT) technology to ensure strong eventual consistency between replications without coordination, reducing the coordination overhead between system nodes. In this study, across-region distributed environment spanning 4 data centers and 8 nodes on aliyun platform is set up and a large-scale distributed performance comparison experiment is carried out.The experimental results show that under the cross-region distributed environment, the throughput of Elsa has obvious advantages for high concurrent contention loads, reaching up to 7.37 times of the MongoDB cluster and 1.62 times of the Cassandra cluster.

    参考文献
    [1] Seeger M. Key-value stores: A pracitical overview. Medien Informatik, 2009. 1–21.
    [2] Brewer E. A certain freedom: Thoughts on the CAP theorem. In: Proc. of the 29th ACM SIGACT-SIGOPS Symp. on Principles of Distributed Computing. Zurich: ACM, 2010. 335.
    [3] Brewer E. CAP twelve years later: How the "rules" have changed. Computer, 2012, 45(2): 23–29. [doi: 10.1109/MC.2012.37]
    [4] Brewer EA. Towards robust distributed systems (abstract). In: Proc. of the 19th Annual ACM Symp. on Principles of Distributed Computing. Portland: ACM, 2000. 343477.
    [5] DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W. Dynamo: Amazon's highly available key-value store. ACM SIGOPS Operating Systems Review, 2007, 41(6): 205–220. [doi: 10.1145/1323293.1294281]
    [6] Lu Y, Yu XY, Cao L, Madden S. Aria: A fast and practical deterministic OLTP database. Proceedings of the VLDB Endowment, 2020, 13(12): 2047–2060. [doi: 10.14778/3407790.3407808]
    [7] Sovran Y, Power R, Aguilera MK, Li JY. Transactional storage for geo-replicated systems. In: Proc. of the 23rd ACM Symp. on Operating Systems Principles. Cascais: ACM, 2011. 385–400.
    [8] Lamport L. Paxos made simple. ACM SIGACT News, 2001, 32(4): 51–58.
    [9] 王江, 章明星, 武永卫, 陈康, 郑纬民. 类Paxos共识算法研究进展. 计算机研究与发展, 2019, 56(4): 692-707.
    Wang J, Zhang MX, Wu YW, Chen K, Zheng WM. Paxos-like consensus algorithms: A review. Journal of Computer Research and Development, 2019, 56(4): 692–707 (in Chinese with English abstract).
    [10] Ongaro D, Ousterhout J. In search of an understandable consensus algorithm. In: Proc. of the 2014 USENIX Annual Technical Conf. Philadelphia: ACM, 2014. 305–320.
    [11] Preguiça N, Baquero C, Shapiro M. Conflict-free replicated data types CRDTs. In: Sakr S, Zomaya A, eds. Encyclopedia of Big Data Technologies. Cham: Springer, 2018.
    [12] Shapiro M, Preguiça N, Baquero C, Zawirski M. Conflict-free replicated data types. In: Proc.of the 13th Int’l Symp. on Self-Stabilizing Systems. Grenoble: Springer, 2011. 386–400.
    [13] Alvaro P, Conway N, Hellerstein JM, Marczak WR. Consistency analysis in bloom: A CALM and collected approach. In: Proc. of the 5th Biennial Conf. on Innovative Data Systems Research. Asilomar: CIDR, 2011. 249–260.
    [14] Hellerstein JM, Alvaro P. Keeping CALM: When distributed consistency is easy. Communications of the ACM, 2020, 63(9): 72–81. [doi: 10.1145/3369736]
    [15] Alvaro P, Conway N, Hellerstein JM, Maier D. Blazes: Coordination analysis for distributed programs. In: Proc. of the 30th IEEE Int’l Conf. on Data Engineering. Chicago: IEEE, 2014. 52–63.
    [16] Wu CG, Faleiro J, Lin YH, Hellerstein J. Anna: A KVS for any scale. In: Proc.of the 34th IEEE Int’l Conf. on Data Engineering (ICDE). Paris: IEEE, 2018. 401–412.
    [17] Yu WH, Rostad S. A low-cost set CRDT based on causal lengths. In: Proc. of the 7th Workshop on Principles and Practice of Consistency for Distributed Data. Heraklion: ACM, 2020. 5.
    [18] Kleppmann M. Moving elements in list CRDTs. In: Proc. of the 7th Workshop on Principles and Practice of Consistency for Distributed Data. Heraklion: ACM, 2020. 4.
    [19] Auvolat A, Taïani F. Merkle search trees: Efficient state-based crdts in open networks. In: Proc. of the 38th Symp. on Reliable Distributed Systems (SRDS). Lyon: IEEE, 2019. 221–22109.
    [20] 纪业, 魏恒峰, 黄宇, 吕建. CRDT协议的TLA+描述与验证. 软件学报, 2020, 31(5): 1332-1352. http://www.jos.org.cn/1000-9825/5956.htm
    Ji Y, Wei HF, Huang Y, Lü J. Specifying and verifying CRDT protocols using TLA+. Ruan Jian Xue Bao/Journal of Software, 2020, 31(5): 1332-1352 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/5956.htm
    [21] Bailis P, Davidson A, Fekete A, Ghodsi A, Hellerstein JM, Stoica I. Highly available transactions: Virtues and limitations. Proceedings of the VLDB Endowment, 2013, 7(3): 181–192. [doi: 10.14778/2732232.2732237]
    [22] Berenson H, Bernstein P, Gray J, Melton J, O'neil E, O'neil P. A critique of ANSI SQL isolation levels. ACM SIGMOD Record, 1995, 24(2): 1–10. [doi: 10.1145/568271.223785]
    [23] Ji YH, Chai YP, Zhou X, Ren LP, Qin YJ. Smart intra-query fault tolerance for massive parallel processing databases. Data Science and Engineering, 2020, 5(1): 65–79. [doi: 10.1007/s41019-019-00114-z]
    [24] Mao YD, Kohler E, Morris RT. Cache craftiness for fast multicore key-value storage. In: Proc. of the 7th ACM European Conf. on Computer Systems. Bern: ACM, 2012. 183–196.
    [25] Lehman PL, Yao SB. Efficient locking for concurrent operations on B-trees. ACM Transactions on Database Systems, 1981, 6(4): 650–670. [doi: 10.1145/319628.319663]
    [26] Hewitt C. Actor model of computation: Scalable robust information systems. arXiv:1008.1459, 2010.
    [27] Vernon V. Reactive Messaging Patterns with the Actor Model: Applications and Integration in Scala and Akka. Upper Saddle River: Addison-Wesley Professional, 2015.
    [28] Demers A, Greene D, Houser C, Irish W, Larson J, Shenker S, Sturgis H, Swinehart D, Terry D. Epidemic algorithms for replicated database maintenance. In: Proc. of the 6th Annual ACM Symp. on Principles of Distributed Computing. Vancouver: ACM, 1987. 1–12.
    [29] ZeroMQ. 2021. https://github.com/zeromq/zeromq4-x
    [30] Feng JH, Li JH. Google protocol buffers research and application in online game. In: IEEE Conf. Anthology. IEEE, 2013. 1–4.
    [31] Peng D, Dabek F. Large-scale incremental processing using distributed transactions and notifications. In: Proc. of the 9th USENIX Conf. on Operating Systems Design and Implementation. Vancouver: ACM, 2010. 251–264.
    [32] Corbett JC, Dean J, Epstein M, Fikes A, Frost C, Furman JJ, Ghemawat S, Gubarev A, Heiser C, Hochschild P, Hsieh W, Kanthak S, Kogan E, Li HY, Lloyd A, Melnik S, Mwaura D, Nagle D, Quinlan S, Rao R, Rolig L, Saito Y, Szymaniak M, Taylor C, Wang R, Woodford D. Spanner: Google’s globally distributed database. ACM Transactions on Computer Systems, 2013, 31(3): 8. [doi: 10.1145/2491245]
    [33] Kulkarni SS, Demirbas M, Madappa D, Avva B, Leone M. Logical physical clocks. In: Proc. of the 18th Int’l Conf. on Principles of Distributed Systems. Cortina d’Ampezzo: Springer, 2014. 17–23.
    [34] Karger D, Sherman A, Berkheimer A, Bogstad B, Dhanidina R, Iwamoto K, Kim B, Matkins L, Yerushalmi Y. Web caching with consistent hashing. Computer Networks, 1999, 31(11–16): 1203–1213. [doi: 10.1016/S1389-1286(99)00055-9]
    [35] MySQL. 2021. https://www.mysql.com/cn/
    [36] Avni H, Aliev A, Amor O, Avitzur A, Bronshtein I, Ginot E, Goikhman S, Levy E, Levy I, Lu FY, Mishali L, Mo YQ, Pachter N, Sivov D, Veeraraghavan V, Vexler V, Wang L, Wang P. Industrial-strength OLTP using main memory and many cores. Proceedings of the VLDB Endowment, 2020, 13(12): 3099–3111. [doi: 10.14778/3415478.3415537]
    [37] MongoDB. 2021. https://www.mongodb.com/zh-cn
    [38] Cassandra. 2021. https://cassandra.apache.org/_/index.html
    引证文献
引用本文

崔玉龙,付国,张岩峰,于戈. Elsa:一种面向跨区域架构的无协调分布式键值存储系统.软件学报,2023,34(5):2427-2445

复制
分享
文章指标
  • 点击次数:
  • 下载次数:
  • HTML阅读次数:
  • 引用次数:
历史
  • 收稿日期:2021-04-02
  • 最后修改日期:2021-06-23
  • 在线发布日期: 2021-10-20
  • 出版日期: 2023-05-06
文章二维码
您是第位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号