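Group-wise Contrastive Learning Based Sequence-aware Skill Discovery

CLC number: TP18

Fund Project: National Natural Science Foundation of China (62206133, 62276142); Key Research and Development Program of Jiangsu Province (BE2021093); State Key Laboratory for Novel Software Technology, Nanjing University (KFKT2022B12); Open Fund of the Guangxi Key Laboratory of Multi-source Information Mining and Security (MIMS22-01); Jiangsu Province Double-Innovation Doctor Program (JSSCBS20210539)
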
    Abstract:

    Reinforcement learning has achieved remarkable results in decision-making tasks such as intelligent dialogue systems, but its learning efficiency is low in complex tasks with sparse rewards. To improve it, researchers have introduced skill discovery frameworks into reinforcement learning, which build skill policies by maximizing the differences between skills. However, limited by the diversity of sampled trajectory data, existing skill discovery methods learn only one skill per reinforcement learning episode, and therefore perform poorly on complex tasks that require a sequential combination of skills within a single episode. To address this problem, a group-wise contrastive learning based sequence-aware skill discovery method (GCSSD) is proposed, which integrates contrastive learning into the skill discovery framework. First, to increase trajectory diversity, each complete trajectory collected from interaction with the environment is segmented and the segments are grouped; a contrastive loss built on the grouped segments is used to learn skill embedding representations. Second, skill policies are trained by combining the skill embedding representations with reinforcement learning. Finally, to improve performance on tasks with different sequential skill combinations, sampled trajectories are encoded segment by segment into skill representations, which are embedded into the policy network so that the learned skill policies can be combined sequentially. Experimental results show that GCSSD trains well on sparse-reward tasks that require sequential skill combinations, and that it can reuse learned skills to adapt quickly to tasks whose sequential skill combinations differ from those seen in training.
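The first step of the abstract (segment each trajectory, group the segments, learn embeddings with a contrastive loss) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the function names `segment_trajectory`, `mean_embed`, and `group_contrastive_loss` are hypothetical, the encoder is a placeholder (a normalised mean of states instead of a learned network), the loss is a generic InfoNCE-style objective in which segments sharing a group id are positives, and the group ids are supplied by hand because the abstract does not state the grouping rule.

```python
import math
from typing import List, Sequence

Vector = List[float]


def segment_trajectory(states: Sequence[Vector], seg_len: int) -> List[List[Vector]]:
    """Split one full trajectory into consecutive, non-overlapping segments.

    Assumption: a trailing remainder shorter than seg_len is dropped.
    """
    if seg_len <= 0:
        raise ValueError("seg_len must be positive")
    n_full = len(states) // seg_len
    return [list(states[i * seg_len:(i + 1) * seg_len]) for i in range(n_full)]


def mean_embed(segment: Sequence[Vector]) -> Vector:
    """Placeholder encoder: L2-normalised mean state of a segment.

    Stands in for a learned embedding network (an assumption).
    """
    dim = len(segment[0])
    mean = [sum(s[d] for s in segment) / len(segment) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]


def group_contrastive_loss(
    embeddings: Sequence[Vector],
    group_ids: Sequence[int],
    temperature: float = 0.1,
) -> float:
    """InfoNCE-style loss where segments with the same group id are positives."""
    n = len(embeddings)
    total, n_anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and group_ids[j] == group_ids[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        # Similarity of the anchor to every other segment, scaled by temperature.
        logits = {
            j: sum(a * b for a, b in zip(embeddings[i], embeddings[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over all non-anchor segments.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(v - max_logit) for v in logits.values())
        )
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        n_anchors += 1
    return total / n_anchors if n_anchors else 0.0


# Usage: a toy 8-step trajectory whose first half moves along x, second half along y.
trajectory = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
segments = segment_trajectory(trajectory, seg_len=2)
embeddings = [mean_embed(seg) for seg in segments]
loss_matched = group_contrastive_loss(embeddings, group_ids=[0, 0, 1, 1])
loss_mismatched = group_contrastive_loss(embeddings, group_ids=[0, 1, 0, 1])
```

In this toy setup the loss is lower when the group ids agree with the similarity structure of the embeddings (`loss_matched`) than when they cut across it (`loss_mismatched`), which is the signal a learned encoder would be trained on.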

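Cite this article

Yang Shangdong, Yu Miaoying, Chen Xingguo, Chen Lei. Group-wise contrastive learning based sequence-aware skill discovery. Journal of Software,,():1-15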

History
  • Received: 2023-09-20
  • Revised: 2023-12-25
  • Published online: 2024-11-20