Abstract:GUI fuzzing plays a crucial role in enhancing the reliability and compatibility of mobile apps. However, most existing GUI fuzzing methods are inefficient, mainly because they are coarse-grained, relying solely on single-modal features to understand the GUI pages holistically. The excessive abstraction of app states leads to the neglect of many details, resulting in an insufficient understanding of GUI states and widgets. To address this issue, a GUI fuzzing framework called GUIFuzzer for mobile apps is proposed based on multi-modal representation. This framework leverages multi-modal features, such as visual features, layout context features, and fine-grained meta-attribute features, to jointly infer the semantics of GUI widgets. Then, it trains a multi-level reward-driven deep reinforcement learning model to optimize the GUI event selection strategy, thus improving the efficiency of fuzz testing. The proposed framework is evaluated on a large number of real apps. Experimental results show that GUIFuzzer significantly improves the coverage of fuzz testing compared with existing competitive baselines. A case study is also conducted on customized search for specific targets, namely sensitive API triggering, which further demonstrates the practicality of the GUIFuzzer framework.