Abstract:With the fast development of open source software and wide application of development supporting tools, there have been a great many of open software development data on the Internet. To improve the software development efficiency and product quality, more and more practitioners and researchers attempt to obtain insights of software development from the data. To facilitate the data analyses and their reproduction and comparison, building and using shared datasets are proposed and practiced. However, the existing datasets are lack of traceability of dataset construction process, application scope, and consideration of data variation over time and with environment changes, which threat the data quality and analysis validity. To address these problems, an advanced approach is proposed for sharing and using the software development datasets. It constructs datasets with multiple levels and multiple versions. Through multiple levels, the datasets remain the raw data, intermediate data, and final data to possess data traceability. Meanwhile, by multiple versions, users can compare and observe the data variety to verify and improve data quality and analysis validity. Based on the previously constructed Mozilla issue tracking dataset, it is demonstrated that how to build and use multi-level and multi-version software development dataset and verified that the proposed approach can help users efficiently use the dataset.