Abstract: The rapid development of generative technologies has revealed their potential for real-world applications. The core objective of pose-guided person image and video generation is to transform the person in the input into a specified pose while maintaining a high level of appearance consistency. The technology applies to fields such as virtual try-on and fashion, advertising video generation and editing, and multimodal content creation, where it can improve user experience and drive technological innovation. Despite significant progress, it still faces several challenges, including the effective extraction and rearrangement of appearance information during pose transfer, the generation of unseen information, consistency preservation, and efficient model training and deployment. Organized around these challenges, this study analyzes in detail the strategies that current mainstream pose-guided generation methods use to address them and discusses their feasibility and limitations in practical applications. It also examines the generative models and pose representation methods commonly used in pose-guided generation, and reviews the datasets used in this field, including their sizes and characteristics, together with the evaluation benchmarks. Furthermore, this study discusses applications of the technology in virtual try-on, video generation and editing, and multimodal content generation, and highlights the challenges that remain, such as the retention of personalized information, generation in complex scenes, and model efficiency and real-time performance. Finally, it discusses potential future development trends of pose-guided generation technology, aiming to provide researchers with a systematic summary and reference that promotes its application and innovation across industries.