Adaptive Knowledge Distillation for Lightweight Large Code Models
CSTR:
Author:
Affiliation:

Author Biography:

Corresponding Author:

CLC Number:

TP311

Fund Projects:

National Natural Science Foundation of China (62202074, 62372071); China Postdoctoral Science Foundation (2022M710519); Key Project of Chongqing Technology Innovation and Application Development (CSTB2023TIAD-STX0015, CSTB2022TIAD-KPX0068); Chongqing Special Funding for Postdoctoral Researchers Staying in or Coming to Chongqing (2021LY23)


Abstract:

Software programming assistants based on large language models (LLMs), such as Copilot, significantly enhance programmer productivity. However, LLMs have large computing and storage requirements and are difficult to deploy locally. Building lightweight, small-parameter LLMs can meet computing, storage, and deployment requirements, but at a greater accuracy loss in code generation than large-parameter LLMs. Knowledge distillation (KD) techniques let a small LLM (the student model) fit the output distribution of a large LLM (the teacher model) on a target training dataset, reducing the accuracy loss in code generation. State-of-the-art KD techniques rely on the Kullback-Leibler (KL) divergence loss function, which measures and reduces the accuracy loss caused by discrepancies between the student and teacher output distributions, but the student model struggles to learn the near-zero regions of the teacher distribution. Researchers have therefore adopted the reverse KL divergence loss function (RKL) to address learning in those near-zero regions. This study finds that RKL in turn has learning problems in high-probability regions and is complementary to the KL divergence loss; moreover, for some data the teacher model produces low-quality outputs, which degrades student learning. This study proposes an adaptive knowledge distillation (AKD) method that uses prompts to improve the quality of teacher outputs and constructs an adaptive loss function that adjusts learning priorities according to the distribution gap between the student and teacher models, ensuring that the student learns effectively in both the main-probability and near-zero-probability regions. Using AKD, a lightweight code generation model is trained with StarCoder-1B/7B as the student/teacher models and the CodeAlpaca dataset, and its accuracy loss and code-quality issues are evaluated. Experimental results show that the lightweight model is 85.7% smaller. On the HumanEval and MBPP datasets, prompts with explicit task instructions improve the teacher's code generation quality and reduce the trained student's average accuracy loss by 6%. The AKD-trained model's average accuracy loss relative to the teacher model (StarCoder-7B) is 17.14%, a 30.6% average reduction over the original student model, and its accuracy loss is on average 19.9% lower than that of the state-of-the-art KD and RKD methods. Regarding inference memory requirements, the KD and RKD methods require 54.7 GB, whereas the AKD method adds only 3 GB. Regarding training time, the AKD method takes 30% longer; however, even when the KD and RKD methods are trained for the same duration, their average performance improves by only 3%, which is 16.9% lower than that of the AKD method, so the additional training cost of AKD is justified. Moreover, applying the AKD method to the CodeLlama and CodeGen model series reduces accuracy loss by an average of 19.2% compared with the state-of-the-art KD and RKD methods, demonstrating the generalizability of the AKD method.
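To make the adaptive loss described in the abstract concrete, below is a minimal PyTorch sketch of a token-level mixture of forward and reverse KL divergence whose mixing weight depends on the gap between the teacher and student distributions. It is an illustration under stated assumptions, not the paper's exact formulation: the gating rule (total-variation distance with a stop-gradient), the temperature handling, and the name adaptive_kd_loss are all assumed for this sketch.

import torch
import torch.nn.functional as F

def adaptive_kd_loss(student_logits, teacher_logits, temperature=1.0):
    # student_logits, teacher_logits: [batch, seq_len, vocab_size]
    log_p = F.log_softmax(teacher_logits / temperature, dim=-1)  # teacher log-probs
    log_q = F.log_softmax(student_logits / temperature, dim=-1)  # student log-probs
    p, q = log_p.exp(), log_q.exp()

    # Forward KL(p || q): pushes the student to cover the teacher's main probability regions.
    fkl = (p * (log_p - log_q)).sum(dim=-1)
    # Reverse KL(q || p): penalizes student mass placed where the teacher is near zero.
    rkl = (q * (log_q - log_p)).sum(dim=-1)

    # Per-token gate (assumed heuristic): when the two distributions differ a lot,
    # weight the forward KL more so the main regions are fitted first; as the gap
    # shrinks, the reverse KL takes over and cleans up the near-zero regions.
    gap = 0.5 * (p - q).abs().sum(dim=-1)  # total-variation distance in [0, 1]
    alpha = gap.detach()                   # no gradients through the gate itself
    return (alpha * fkl + (1.0 - alpha) * rkl).mean()

In practice such a distillation term would typically be masked over padding positions and combined with the ordinary cross-entropy loss on ground-truth tokens; those details are omitted from this sketch.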

Cite this article:

舒善富, 刘超, 孙毓忠, 张洪宇, 高翠芸, 张小洪. Adaptive knowledge distillation for lightweight large code models. Journal of Software (软件学报), , (): 1-24.

History
  • Received: 2024-11-03
  • Revised: 2025-01-05
  • Accepted:
  • Available online: 2025-12-03
  • Published: