Abstract: As deep learning technology has matured, intelligent speech recognition software has come into wide use, and the deep neural networks inside this software play a crucial role. Recent studies have shown that adversarial examples, which contain only minor perturbations, significantly threaten the security and robustness of deep neural networks. Researchers usually use generated adversarial examples as test cases, feeding them into intelligent speech recognition software to check whether they cause the software to misjudge, and then adopt defense methods to improve the security and robustness of the software. For adversarial example generation, black-box intelligent speech software is more common in practice and therefore has practical research value. However, existing generation methods have some limitations. This study therefore proposes a target adversarial example generation method for black-box speech software based on the firefly algorithm and a gradient evaluation method, namely the firefly-gradient adversarial example generation method. Given a set target text, perturbations are added to the original speech example. Either the firefly algorithm or the gradient evaluation method is then chosen to optimize the adversarial example, according to the edit distance between the text recognized from the currently generated adversarial example and the target text, until the target adversarial example is generated. To verify the effectiveness of the method, this study conducts an experimental evaluation on common speech recognition software using three different types of speech datasets (the Common Speech dataset, the Google Command dataset, and the LibriSpeech dataset) and recruits volunteers to evaluate the generated adversarial examples. Experimental results show that the proposed method can effectively improve the success rate of target adversarial example generation.
For example, for the DeepSpeech speech recognition software, the success rate of generating adversarial examples on the Common Speech dataset is 13% higher than that of the compared method.
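
The edit-distance criterion described above can be illustrated with a minimal sketch. It computes the Levenshtein distance between the text recognized from a candidate adversarial example and the target text, then picks an optimizer from that distance. The abstract does not state the switching rule, so the function `choose_optimizer`, its `threshold` value, and the direction of the switch (firefly search when far from the target, gradient evaluation when close) are illustrative assumptions, not the method's actual settings.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def choose_optimizer(current_text: str, target_text: str, threshold: int = 2) -> str:
    """Hypothetical selection rule (an assumption, not taken from the source):
    use the firefly search while the transcription is far from the target,
    and switch to gradient evaluation once it is within `threshold` edits."""
    if edit_distance(current_text, target_text) > threshold:
        return "firefly"
    return "gradient"
```

Under these assumptions, `choose_optimizer("open the dor", "open the door")` selects the gradient step because a single edit remains, whereas a transcription far from the target text would select the firefly search.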