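[Keywords (Chinese)]
[Abstract (translated from Chinese)]
Endoscope-based minimally invasive surgical robots are increasingly used in clinical practice. Providing surgeons with accurate surgical instrument segmentation in endoscopic video is important for improving operating accuracy and patient prognosis. At present, training a surgical instrument segmentation model in a deep learning framework requires a large amount of accurately annotated intraoperative video data. Annotating video is costly, however, which to some extent limits the application of deep learning to this task. Current semi-supervised methods use prediction and frame interpolation to improve the temporal information and data diversity of sparsely annotated videos, thereby raising segmentation accuracy with limited labeled data, but they fall short in interpolation quality and in capturing the temporal features of consecutive frames. To address this problem, a semi-supervised segmentation framework with a spatiotemporal Transformer is proposed. The method improves the temporal consistency and data diversity of sparsely annotated video datasets through high-accuracy frame interpolation and pseudo-label generation. A Transformer module is placed at the bottleneck of the segmentation network, and its self-attention mechanism is used to analyze global context from both temporal and spatial perspectives. This enhances high-level semantic features, improves the network's perception of complex environments, and helps it overcome the various disturbances in surgical videos, which in turn improves segmentation. Using only 30% of the labeled data, the proposed semi-supervised spatiotemporal Transformer network achieves a mean Dice of 82.42% and a mean IoU of 72.01% on the MICCAI 2017 surgical instrument segmentation challenge dataset, exceeding existing methods by 7.68% and 8.19%, respectively, and outperforming fully supervised methods.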
[Keywords]
[Abstract]
As surgical robots become more widely used in clinical practice, providing surgeons with precise semantic segmentation of surgical instruments in endoscopic video is of great significance for improving operating accuracy and patient prognosis. Training surgical instrument segmentation models requires a large number of accurately labeled video frames, and the high cost of labeling video data limits the application of deep learning to this task. Current semi-supervised methods enhance the temporal information and data diversity of sparsely labeled videos by predicting and interpolating frames, which improves segmentation accuracy when labeled data are limited. However, these methods are limited in the quality of their interpolated frames and in their extraction of temporal features from consecutive frames. To tackle this issue, this study proposes a semi-supervised segmentation framework with a spatiotemporal Transformer, which improves the temporal consistency and data diversity of sparsely labeled video datasets by interpolating frames with high accuracy and generating pseudo-labels. The Transformer module is integrated at the bottleneck of the segmentation network, where its self-attention mechanism analyzes global contextual information from both temporal and spatial perspectives. This enhances high-level semantic features and improves the network's perception of complex environments, helping it overcome the various distractions in surgical videos and thus improving segmentation performance. Using only 30% of the labeled data, the proposed semi-supervised spatiotemporal Transformer framework achieves an average Dice of 82.42% and an average IoU of 72.01% on the MICCAI 2017 Surgical Instrument Segmentation Challenge dataset, exceeding the state-of-the-art method by 7.68% and 8.19%, respectively, and outperforming fully supervised methods.
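The two reported metrics, Dice and IoU, both measure the overlap between a predicted mask and a ground-truth mask. The sketch below shows the standard per-mask definitions for binary masks. It is only an illustration: the function name `dice_and_iou`, the list-of-lists mask format, and the convention of scoring two empty masks as 1.0 are assumptions made here, and the abstract does not say how the paper averages scores across frames or classes.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as equally sized 2D lists of 0/1."""
    # Flatten both masks and coerce every pixel to 0 or 1.
    p = [int(bool(v)) for row in pred for v in row]
    t = [int(bool(v)) for row in target for v in row]
    if len(p) != len(t):
        raise ValueError("masks must have the same number of pixels")

    intersection = sum(a & b for a, b in zip(p, t))
    pred_area, target_area = sum(p), sum(t)
    union = pred_area + target_area - intersection

    if union == 0:
        # Both masks are empty: scored as perfect agreement (a convention assumed here).
        return 1.0, 1.0

    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy example: 2 overlapping pixels, 3 predicted, 3 in the ground truth.
dice, iou = dice_and_iou([[1, 1, 0], [0, 1, 0]], [[1, 0, 0], [0, 1, 1]])
```

For a single mask pair the two scores are tied by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU. That identity need not hold for averaged scores such as those reported above.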
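[CLC number]
[Funding]
Shenzhen Key Basic Research Project (JCYJ20200109110208764, JCYJ20200109110420626); National Natural Science Foundation of China (U1813204, 61802385); Natural Science Foundation of Guangdong Province (2021A1515012604)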