Efficient Framework for BERT Model Training Based on Federated Learning
(基于联邦学习的BERT模型高效训练框架)

Authors: 王鑫澳, 陈珂, 寿黎但, 骆歆远, 陈刚
CLC number: TP18
Foundation item: Zhejiang Provincial "Jianbing" (Pioneer) R&D Program (2024C01021)
Abstract:

High-quality training data is crucial for pre-trained language models (PLMs), yet privacy concerns often prevent data from many professional domains from being collected centrally for model training. Federated learning makes it possible to train models while safeguarding data privacy. However, federated learning clients typically have limited resources and cannot train a pre-trained language model in full. This study addresses that problem. First, it formally defines the task of completing model training under limited resources and optimizes the training outcome by trading off computational and communication costs. Second, it introduces FedBT, an efficient framework for training the BERT model on federated learning clients that covers two scenarios: further pre-training and downstream task fine-tuning. Adapted to the application scenario, FedBT trains only the key parameters of the BERT model on the clients and uploads only the updated parameters to the server for aggregation, which significantly reduces both the computational and the communication costs of training. Finally, extensive experiments on datasets from several professional domains show that, compared with conventional federated training of the full model, FedBT reduces client-side training costs to 34.31% and communication costs to 7.04% in the further pre-training scenario, and to 48.26% and 20.19%, respectively, in the downstream fine-tuning scenario, while achieving accuracy close to that of conventional federated learning that trains the entire model.
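To make the mechanism described in the abstract concrete, the sketch below illustrates selective-parameter federated training in the spirit of FedBT. It is a minimal illustration in PyTorch, not the paper's implementation: the TinyBertLike toy model, the prefix-based select_key_params rule, and the plain FedAvg-style aggregate function are hypothetical stand-ins for FedBT's actual choice of key parameters and aggregation details. The pattern shown is the one described above: each client freezes most of the model, trains a small set of parameters locally, and uploads only those tensors to the server.

```python
# Minimal sketch (not the paper's code) of selective-parameter federated training:
# clients freeze most of a BERT-like model, locally train only a few "key"
# parameters, and upload just those tensors for server-side averaging.
import copy
import torch
import torch.nn as nn


class TinyBertLike(nn.Module):
    """Toy stand-in with BERT-style parameter names (embeddings / encoder.layer.N / classifier)."""

    def __init__(self, in_dim=16, dim=32, n_layers=4, n_classes=2):
        super().__init__()
        self.embeddings = nn.Linear(in_dim, dim)
        self.encoder = nn.Module()
        self.encoder.layer = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_layers)])
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, x):
        h = torch.relu(self.embeddings(x))
        for block in self.encoder.layer:
            h = torch.relu(block(h))
        return self.classifier(h)


def select_key_params(model, prefixes):
    """Freeze every parameter except those whose names start with one of the prefixes."""
    keys = []
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in prefixes)
        if param.requires_grad:
            keys.append(name)
    return keys


def client_update(model, batches, keys, lr=1e-3):
    """Local training of only the unfrozen parameters; return just the key tensors."""
    optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for x, y in batches:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    state = model.state_dict()
    return {k: state[k].detach().clone() for k in keys}


def aggregate(updates, sample_counts):
    """FedAvg restricted to the uploaded key parameters."""
    total = sum(sample_counts)
    return {k: sum(n * u[k] for u, n in zip(updates, sample_counts)) / total
            for k in updates[0]}


if __name__ == "__main__":
    torch.manual_seed(0)
    prefixes = ("encoder.layer.3.", "classifier.")    # hypothetical "key" parameters
    global_model = TinyBertLike()
    keys = select_key_params(global_model, prefixes)

    updates, counts = [], []
    for _ in range(3):                                # three simulated clients
        local = copy.deepcopy(global_model)           # frozen/unfrozen flags are copied too
        data = [(torch.randn(8, 16), torch.randint(0, 2, (8,))) for _ in range(4)]
        updates.append(client_update(local, data, keys))
        counts.append(sum(len(y) for _, y in data))

    # Merge only the uploaded tensors back into the global model.
    global_model.load_state_dict(aggregate(updates, counts), strict=False)

    uploaded = sum(v.numel() for v in updates[0].values())
    total = sum(p.numel() for p in global_model.parameters())
    print(f"uploaded {uploaded} of {total} parameters per round")
```

In this toy round, only the tensors of one encoder block and the classifier head travel between clients and server, which is where the communication savings reported in the abstract come from; FedBT applies the same idea to the much larger BERT model and, per the abstract, chooses which parameters to train differently for further pre-training and for downstream fine-tuning.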

Cite this article:

王鑫澳, 陈珂, 寿黎但, 骆歆远, 陈刚. 基于联邦学习的BERT模型高效训练框架 (Efficient framework for BERT model training based on federated learning). Ruan Jian Xue Bao/Journal of Software: 1–24 (in Chinese with English abstract).
History:
  • Received: 2024-03-20
  • Revised: 2024-05-05
  • Published online: 2025-01-24