Abstract: Sentiment analysis has various application scenarios in software engineering (SE), such as detecting developers’ emotions in commit messages and identifying developers’ opinions on Q&A forums. Nevertheless, commonly used off-the-shelf sentiment analysis tools cannot produce reliable results on SE tasks, and the misinterpretation of technical knowledge has been shown to be the main reason. Researchers have therefore begun to customize SE-specific methods in supervised or distantly supervised ways. To assess the performance of these methods, researchers evaluate them on SE-related annotated datasets in a within-dataset setting, that is, they train and test each method on data from the same dataset. However, an annotated dataset for a given SE-specific sentiment analysis task is not always available. Moreover, building a manually annotated dataset is time-consuming and not always feasible. An alternative is to use datasets extracted from the same platform for similar tasks, or datasets extracted from other SE platforms. To verify the feasibility of these practices, existing methods need to be evaluated in within-platform and cross-platform settings, that is, training and testing each method on data from the same platform but different datasets, and training and testing each method on data from different platforms, respectively. This study comprehensively evaluates existing SE-customized sentiment analysis methods in within-dataset, within-platform, and cross-platform settings. The experimental results provide actionable insights for both researchers and practitioners.
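To make the three evaluation settings concrete, the minimal sketch below enumerates train/test dataset pairs according to their platform of origin. The platforms and dataset names (e.g. "github_gold", "so_4423") are illustrative assumptions, not the actual benchmarks used in this study.

```python
from itertools import product

# Hypothetical dataset registry: each entry records the SE platform it was
# extracted from and an illustrative dataset name.
datasets = [
    {"platform": "GitHub", "name": "github_gold"},
    {"platform": "StackOverflow", "name": "so_4423"},
    {"platform": "Jira", "name": "jira_issues"},
]

def pairs(setting):
    """Yield (train, test) dataset pairs that belong to a given evaluation setting."""
    for train, test in product(datasets, repeat=2):
        same_dataset = train["name"] == test["name"]
        same_platform = train["platform"] == test["platform"]
        if setting == "within-dataset" and same_dataset:
            yield train, test  # train/test splits drawn from one dataset
        elif setting == "within-platform" and same_platform and not same_dataset:
            yield train, test  # same platform, different datasets
        elif setting == "cross-platform" and not same_platform:
            yield train, test  # datasets from different platforms

for setting in ("within-dataset", "within-platform", "cross-platform"):
    for train, test in pairs(setting):
        print(f"{setting}: train on {train['name']}, test on {test['name']}")
```

Under these assumptions, each classifier would be trained on the first dataset of a pair and evaluated on the second, which is how the three settings differ only in where the test data comes from.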