Abstract: In speech recognition, it is important to select a training corpus that covers almost all phonetic phenomena so that acoustic models can be trained robustly. Traditionally, the corpus is first selected manually and then tested and supplemented, which cannot provide sufficient sample coverage for the various statistical modeling methods. This paper develops an algorithm that automatically selects training samples from a large-scale text corpus. The algorithm not only covers almost all phonetic phenomena but also ensures that ideal samples of triphones or class-triphones are included and that enough data is available for training, which makes it possible to train acoustic models reliably.
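
The abstract does not spell out the selection procedure, so the following is only an illustrative sketch of one common way to approach coverage-driven selection: a greedy pass that repeatedly picks the sentence adding the most still-needed triphone occurrences. It is not the paper's algorithm. The function names `triphones` and `select_sentences`, the `sil` boundary padding, the `left-center+right` label format, and the `min_count` threshold are all assumptions made for this example.

```python
from collections import Counter


def triphones(phones):
    """Return the triphones of a phone sequence, padded with 'sil' (assumed boundary symbol)."""
    padded = ["sil"] + list(phones) + ["sil"]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


def select_sentences(corpus, min_count=1):
    """Greedily select sentence indices until each triphone is covered.

    corpus    : list of phone sequences, one per candidate sentence
    min_count : hypothetical target number of occurrences per triphone;
                capped at what the corpus can actually supply.
    """
    units = [Counter(triphones(p)) for p in corpus]

    # How often each triphone occurs in the whole candidate pool.
    total = Counter()
    for u in units:
        total.update(u)
    need = {t: min(min_count, n) for t, n in total.items()}

    have = Counter()
    remaining = set(range(len(corpus)))
    selected = []

    def gain(i):
        # Number of still-needed triphone occurrences sentence i would add.
        return sum(
            min(c, need[t] - have[t])
            for t, c in units[i].items()
            if have[t] < need[t]
        )

    while remaining:
        # Ties are broken in favour of the lowest sentence index.
        best = max(sorted(remaining), key=gain)
        if gain(best) == 0:
            break  # every triphone already meets its target
        selected.append(best)
        remaining.remove(best)
        have.update(units[best])
    return selected
```

Raising `min_count` in this sketch trades a larger selected set for more training examples per triphone, which loosely mirrors the abstract's two goals of coverage and sufficient data. Handling class-triphones would require an additional mapping from phones to phone classes, which is not shown here.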