WU Xin, WU Jian-Yu, ZHOU Ming-Hui, WANG Zhi-Qiang, YANG Li-Yun
2022, 33(1):1-25. DOI: 10.13328/j.cnki.jos.006279
Abstract:Developers usually select different open source licenses to constrain the conditions under which their open source software may be used, in order to protect intellectual property rights effectively and sustain the long-term development of the software. However, since a wide variety of licenses is available in the open source community, developers generally find it difficult to understand the differences between them. Moreover, existing open source license selection tools require developers to understand the license terms and identify their business needs, which makes it even harder for developers to make the right choice. Although there has been extensive research on open source licenses, there is still no systematic analysis of the actual difficulties developers face when choosing one, and thus no clear understanding of them. For this reason, this study attempts to understand the difficulties faced by open source developers in choosing open source licenses, analyzes the components of open source licenses and the factors influencing license selection, and provides references for developers choosing open source licenses. This study conducts a questionnaire-based random survey of 200 developers who participated in open source projects on GitHub. Through a thematic synthesis of the 53 responses, it is found that developers often face difficulties in selecting open source licenses due to the complexity of the terms and unknown considerations. By analyzing the ten open source licenses most widely used across 3 346 168 repositories on GitHub, this study establishes a framework of open source licenses that contains 10 dimensions. Drawing on the Theory of Planned Behavior, nine factors that affect license selection are put forward from three aspects: behavioral attitude, subjective norms, and perceived behavioral control. The relevance of these factors is verified by a developer survey. Furthermore, the relationship between project characteristics and license selection is verified by fitting an ordered regression model. The results of this research can deepen developers' understanding of the contents of open source licenses, provide decision support for developers selecting appropriate licenses based on their own needs, and provide a reference for implementing open source license selection tools based on developers' needs.
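To make the fitting step concrete, below is a minimal sketch of the kind of ordered (ordinal) regression the abstract describes, using statsmodels' OrderedModel. The project features, the license-restrictiveness outcome, and all data are hypothetical stand-ins, not the authors' actual GitHub dataset.

```python
# A minimal ordinal-regression sketch; features and data are hypothetical.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "stars": rng.poisson(50, n),          # project popularity
    "contributors": rng.poisson(5, n),    # team size
    "has_dependents": rng.integers(0, 2, n),
})
# Ordinal outcome: license restrictiveness (0=permissive .. 2=strong copyleft)
score = 0.01 * X["stars"] + 0.2 * X["contributors"] + rng.normal(0, 1, n)
y = np.asarray(pd.cut(score, bins=3, labels=False))

res = OrderedModel(y, X, distr="logit").fit(method="bfgs", disp=False)
print(res.summary())
```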
GUO Zhao-Qiang, LIU Shi-Ran, TAN Ting-Ting, LI Yan-Hui, CHEN Lin, ZHOU Yu-Ming, XU Bao-Wen
2022, 33(1):26-54. DOI: 10.13328/j.cnki.jos.006292
Abstract:Technical debt is a metaphor for sacrificing long-term code quality to meet short-term goals. In particular, technical debt introduced intentionally by developers is called self-admitted technical debt (SATD), which usually exists in software projects in the form of code comments. SATD poses great challenges to the quality and robustness of software. To facilitate finding and repaying it as early as possible and thus assure software quality, great progress has been made in recent years in investigating the characteristics of SATD and proposing identification models for it. Nevertheless, applying these results in practice is still challenging. This paper offers a systematic survey of recent research achievements on SATD. First, the research problems in this field are introduced. Then, the current main research work is described in detail. After that, related techniques are discussed. Finally, the opportunities and challenges in this field are summarized and future research directions are outlined.
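As an illustration of how SATD surfaces in code comments, here is a minimal keyword-matching sketch. The pattern list and file extensions are illustrative assumptions, not the pattern set of any specific identification model surveyed.

```python
# Keyword-based SATD detection in code comments; patterns are illustrative.
import re
from pathlib import Path

SATD_PATTERNS = re.compile(
    r"\b(TODO|FIXME|XXX|hack|workaround|temporary fix|kludge)\b",
    re.IGNORECASE,
)
COMMENT = re.compile(r"(//.*?$|#.*?$|/\*.*?\*/)", re.MULTILINE | re.DOTALL)

def find_satd(root: str):
    """Yield (file, comment) pairs whose comments self-admit debt."""
    for f in Path(root).rglob("*.*"):
        if f.suffix not in {".java", ".c", ".cpp", ".py", ".js"}:
            continue
        try:
            text = f.read_text(errors="ignore")
        except OSError:
            continue
        for m in COMMENT.finditer(text):
            if SATD_PATTERNS.search(m.group()):
                yield f, m.group().strip()

for f, comment in find_satd("."):
    print(f, "->", comment[:80])
```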
2022, 33(1):55-77. DOI: 10.13328/j.cnki.jos.006337
Abstract:Source code summaries help software developers comprehend programs faster and better, and assist maintenance developers in accomplishing their tasks efficiently. Since writing summaries manually is costly and inefficient, researchers have tried to summarize source code automatically. In recent years, neural network-based techniques have become the mainstream of automatic source code summarization and a hot research topic in intelligent software engineering. Firstly, this paper describes the concept of source code summaries and the definition of automatic source code summarization, presents its development history, and reviews the methods and metrics for evaluating the quality of generated summaries. Then, it analyzes the general framework and the main challenges of neural network-based automatic code summarization algorithms. In addition, it focuses on the classification of representative algorithms and the design principles, characteristics, and restrictions of each category. Finally, it discusses future trends in neural network-based source code summarization techniques.
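The abstract mentions metrics for evaluating generated summaries; sentence-level BLEU is one metric commonly used in this literature. Below is a minimal NLTK sketch; the reference and candidate summaries are made up.

```python
# Sentence-level BLEU between a reference summary and a generated one.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["returns the maximum value in the given list".split()]
candidate = "return the max value of a list".split()

smooth = SmoothingFunction().method1  # smoothing for short sentences
score = sentence_bleu(reference, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```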
LI Hao-Feng, MENG Hai-Ning, ZHENG Heng-Jie, CAO Li-Qing, LI Lian
2022, 33(1):78-101. DOI: 10.13328/j.cnki.jos.006345
Abstract:Pointer analysis is the basis of compiler optimization and static analysis, and many applications are built on top of it. Low-precision pointer analysis leads to high false positive and false negative rates in these applications, and adding context-sensitive information is an important means of improving precision. Since the object-oriented paradigm was put forward, it has been widely adopted, and mainstream languages and platforms such as Java, C++, C#, and .NET support object-oriented features. Therefore, pointer analysis for object-oriented languages is receiving more and more attention. This study investigates context-sensitive pointer analysis for object-oriented languages using the systematic literature review (SLR) method. After analyzing and categorizing the relevant literature, five research questions about context-sensitive pointer analysis for object-oriented languages are summarized.
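To illustrate why context sensitivity improves precision, here is a toy stand-alone example, not a real analysis framework: an identity-like method called from two sites, where a context-insensitive analysis conflates the two arguments while a 1-call-site-sensitive one keeps them apart.

```python
# Toy illustration of call-site sensitivity in points-to analysis.
# mini-program: id(p) { return p; }  a = id(new A); b = id(new B);
calls = [("site1", "objA"), ("site2", "objB")]

# Context-insensitive: one abstract parameter shared by all call sites,
# so both a and b appear to point to both objects (a false positive).
insensitive = {obj for _, obj in calls}
print("insensitive: a and b both point to", insensitive)

# 1-call-site-sensitive: the parameter is indexed by the calling context,
# so each result keeps only the object actually passed at that site.
sensitive = {site: {obj} for site, obj in calls}
print("sensitive:   a points to", sensitive["site1"],
      "; b points to", sensitive["site2"])
```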
ZHAO Jing-Sheng, SONG Meng-Xue, GAO Xiang, ZHU Qiao-Ming
2022, 33(1):102-128. DOI: 10.13328/j.cnki.jos.006304
Abstract:Natural language processing is a core technology of artificial intelligence. Text representation is fundamental and necessary work in natural language processing, and it affects or even determines the quality and performance of natural language processing systems. This study discusses the basic principles of text representation, the formalization of natural language, language models, and the connotation and extension of text representation. A macro-level technical classification of text representation is analyzed. The mainstream text representation technologies and methods are analyzed, categorized, and summarized, including the vector space model, topic models, graph-based models, neural network-based models, and representation learning. Event-based, semantic-based, and knowledge-based text representation technologies are also introduced. The development trends and directions of text representation technology are predicted and further discussed. Neural network-based deep learning and representation learning for text will play an important role in natural language processing, and the strategy of pre-training followed by fine-tuning will gradually become the mainstream technology. Text representation needs to be analyzed in light of the specific problem at hand, and the integration of technology and application is the driving force.
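As a concrete instance of the vector space model mentioned above, here is a minimal TF-IDF sketch with scikit-learn; the three-document corpus is illustrative.

```python
# Vector space model with TF-IDF weighting and cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "natural language processing represents text as vectors",
    "topic models represent documents over latent topics",
    "neural networks learn dense text representations",
]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)           # documents x terms, TF-IDF weighted
print(cosine_similarity(X[0], X[2]))  # similarity of doc 0 and doc 2
```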
LI Hang-Yu, WANG Nan-Nan, ZHU Ming-Rui, YANG Xi, GAO Xin-Bo
2022, 33(1):129-149. DOI: 10.13328/j.cnki.jos.006306
Abstract:In recent years, deep neural networks (DNNs) have achieved outstanding performance on many AI tasks, such as computer vision (CV) and natural language processing (NLP). However, network design relies heavily on expert knowledge, which is time-consuming and error-prone. As a result, neural architecture search (NAS), one of the important sub-fields of automated machine learning (AutoML), has attracted more and more attention, aiming to automatically design deep neural networks with superior performance. In this study, the development of NAS is reviewed in detail and systematically summarized. Firstly, the overall research framework of NAS is given, and the function of each research component is analyzed. Next, according to the stages of development in the NAS field, existing methods are divided into four aspects, and the characteristics of each stage are introduced in detail. Then, the datasets often used to validate the effectiveness of NAS methods are introduced, and normalized evaluation criteria for the NAS field are summarized, so as to ensure fair experimental comparison and promote the long-term development of the field. Finally, the open challenges of NAS research are proposed and discussed.
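For intuition about the basic NAS loop, below is a toy random-search baseline over a tiny hypothetical architecture space; the search space and the stand-in evaluate function are assumptions for illustration, since a real NAS method trains each candidate (or a cheap proxy) to obtain validation accuracy.

```python
# Toy random-search NAS loop; the space and evaluator are hypothetical.
import random

SPACE = {
    "depth": [2, 4, 8],
    "width": [64, 128, 256],
    "op": ["conv3x3", "conv5x5", "sep_conv"],
}

def sample():
    return {k: random.choice(v) for k, v in SPACE.items()}

def evaluate(arch):
    # placeholder: a real method would train `arch` and return val accuracy
    return random.random()

best, best_acc = None, -1.0
for _ in range(20):
    arch = sample()
    acc = evaluate(arch)
    if acc > best_acc:
        best, best_acc = arch, acc
print("best architecture:", best, "proxy accuracy:", round(best_acc, 3))
```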
ZHAO Gang, WANG Qian-Ge, YAO Feng, ZHANG Yan-Feng, YU Ge
2022, 33(1):150-170. DOI: 10.13328/j.cnki.jos.006311
Abstract:Graph neural networks (GNNs) process graph-structured data with deep learning techniques. They combine graph propagation operations with deep learning algorithms to fully utilize graph structure information and vertex features in the learning process. GNNs have been widely used in a range of applications, such as node classification, graph classification, and link prediction, showing promising effectiveness and interpretability. However, existing deep learning frameworks (such as TensorFlow and PyTorch) do not provide efficient storage and message passing support for GNN training, which limits their usage on large-scale graph data. At present, a number of large-scale GNN systems have been designed around the data characteristics of graph structures and the computational characteristics of GNNs. This study first briefly reviews GNNs and summarizes the challenges faced in designing GNN systems. Then, existing work on GNN training systems is reviewed, and these systems are analyzed from multiple aspects such as system architecture, programming model, message passing optimization, graph partitioning strategy, and communication optimization. Finally, several open source GNN systems are chosen for experimental evaluation to compare them in terms of accuracy, efficiency, and scalability.
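To make the graph propagation operation concrete, here is a minimal numpy sketch of one GCN-style propagation step, H' = σ(ÂHW) with symmetric normalization; the toy graph and weights are random stand-ins.

```python
# One GCN-style propagation step over a random toy graph.
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 5, 8, 4
A = rng.integers(0, 2, (n, n))
A = np.maximum(A, A.T)                      # undirected toy graph
A_hat = A + np.eye(n)                       # add self-loops
deg = A_hat.sum(axis=1)
D_inv_sqrt = np.diag(deg ** -0.5)
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt    # symmetric normalization

H = rng.normal(size=(n, d_in))              # vertex features
W = rng.normal(size=(d_in, d_out))          # learnable weights
H_next = np.maximum(A_norm @ H @ W, 0)      # propagate + ReLU
print(H_next.shape)                         # (5, 4)
```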
WANG Zhao-Hui, SHEN Hua-Wei, CAO Qi, CHENG Xue-Qi
2022, 33(1):171-192. DOI: 10.13328/j.cnki.jos.006323
Abstract:Graph data, which are ubiquitous in the real world, naturally represent complex interactions between the elements of composite objects. The classification of graph data is a very important and extremely challenging research topic, with key applications in fields such as bio/chemical informatics, including molecular attribute classification and drug discovery. However, a comprehensive review of research on graph classification is still lacking. This survey first formulates the problem of graph classification and describes its main challenges; it then categorizes graph classification methods into similarity-based methods and graph neural network-based methods. Moreover, evaluation metrics for graph classification, benchmark datasets, and comparison results are given. Finally, the application scenarios of graph classification are summarized, and research trends in graph classification are discussed.
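As an example of the similarity-based family, below is a minimal sketch of Weisfeiler-Lehman relabeling, the core of a classic graph kernel for graph classification; the two toy graphs are fabricated.

```python
# Weisfeiler-Lehman label refinement and a histogram-based kernel value.
from collections import Counter

def wl_iteration(adj, labels):
    """One WL step: new label = hash of (own label, sorted neighbor labels)."""
    return {
        v: hash((labels[v], tuple(sorted(labels[u] for u in adj[v]))))
        for v in adj
    }

def wl_histogram(adj, labels, h=2):
    hist = Counter(labels.values())
    for _ in range(h):
        labels = wl_iteration(adj, labels)
        hist.update(labels.values())
    return hist

g1 = {0: [1, 2], 1: [0], 2: [0]}           # a star-shaped toy graph
g2 = {0: [1, 2], 1: [0, 2], 2: [0, 1]}     # a triangle
h1 = wl_histogram(g1, {v: 1 for v in g1})
h2 = wl_histogram(g2, {v: 1 for v in g2})
# kernel value = dot product of the label histograms
k = sum(h1[label] * h2.get(label, 0) for label in h1)
print("WL kernel value:", k)
```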
GE Yi-Zhou, LIU Heng, WANG Yan, XU Bai-Le, ZHOU Qing, SHEN Fu-Rao
2022, 33(1):193-210. DOI: 10.13328/j.cnki.jos.006342
Abstract:Present machine learning methods have surpassed human intelligence on image recognition and some other tasks. However, recent machine learning methods, especially deep learning methods, rely heavily on large amounts of annotated data that human cognition often does not need. This weakness greatly limits the application of deep learning methods to practical problems. To address it, learning from few examples has attracted more and more research interest. To better understand the few-shot learning problem, this study extensively discusses several popular families of few-shot learning methods, including data augmentation methods, transfer learning methods, and meta-learning methods. By examining the processes and core ingredients of the different algorithms, the advantages and disadvantages of existing methods in solving few-shot learning problems can be clearly seen. At the end of this paper, promising future research directions in few-shot learning are highlighted.
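As one concrete meta-learning instance, here is a minimal numpy sketch of a prototypical-network episode; the random embeddings stand in for the output of a learned encoder.

```python
# One prototypical-network episode: class prototypes + nearest-prototype rule.
import numpy as np

rng = np.random.default_rng(0)
n_way, k_shot, dim = 3, 5, 16

# support set: k_shot embedded examples per class (random stand-ins)
support = rng.normal(size=(n_way, k_shot, dim))
prototypes = support.mean(axis=1)            # one prototype per class

query = rng.normal(size=(dim,))              # an embedded query example
dists = np.linalg.norm(prototypes - query, axis=1)
print("predicted class:", int(dists.argmin()))
```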
LIU Wen-Feng, ZHANG Yu, ZHANG Hong-Li, FANG Bin-Xing
2022, 33(1):211-232. DOI: 10.13328/j.cnki.jos.006218
Abstract:Domain name system (DNS) measurement research is an important way to understand the DNS. This paper reviews DNS measurement work published between 1992 and 2019 on 18 topics across four aspects: components, structure, traffic, and security. Firstly, in the aspect of components, the four resolver-related topics are public resolvers, open resolvers, resolver caching, and resolver selection policies; the four authoritative-server-related topics are performance, anycast deployment, hosting, and misconfigurations. Secondly, in the aspect of structure, there are three topics: the dependency structure between stub resolvers and resolvers, the dependency structure of resolvers, and the dependency structure of domain name resolution. Then, in the aspect of traffic, there are three topics: query traffic characteristics, abnormal root query traffic, and traffic interception. Moreover, in the aspect of security, there are four topics: DNSSEC cost and risk, DNSSEC deployment, DNS encryption deployment, and malicious domain name detection. Finally, future research topics are discussed.
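For flavor, below is a minimal dnspython sketch of the kind of active probe used in open-resolver measurements: query a candidate IP directly and see whether it answers recursively for outside clients. The probed address and query name are just examples.

```python
# Probe whether an IP behaves as an open recursive resolver.
import dns.resolver

def is_open_resolver(ip: str, name: str = "example.com") -> bool:
    r = dns.resolver.Resolver(configure=False)
    r.nameservers = [ip]
    r.lifetime = 3  # seconds before giving up
    try:
        answer = r.resolve(name, "A")
        print(ip, "answered, TTL =", answer.rrset.ttl)  # hints at caching
        return True
    except Exception:
        return False

print(is_open_resolver("8.8.8.8"))
```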
JIA Lin-Peng, PEI Qi, WANG Xin, ZHANG Han-Wen, YU Lei, ZHANG Jun, SUN Yi
2022, 33(1):233-253. DOI: 10.13328/j.cnki.jos.006219
Abstract:Offchain channel networks (OCNs) can effectively improve the performance of blockchain systems. The key component enabling an OCN to operate efficiently and stably in the long term is its routing algorithm. This study presents the OCN architecture and the basic model of offchain channel routing algorithms. From the perspectives of single-path routing and multi-path routing, typical routing algorithms are systematically reviewed and discussed. Meanwhile, an evaluation system for offchain channel routing algorithms is established in terms of effectiveness, concurrency, scalability, channel balance, routing centralization, cost-effectiveness, privacy protection, goodput, latency, success rate, and efficiency. Finally, these algorithms are compared, and the challenging research issues and technology trends of offchain routing algorithms are discussed.
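To illustrate single-path routing under channel balance constraints, here is a toy networkx sketch: channels whose balance cannot carry the payment are filtered out, then the cheapest remaining path is selected. The topology, balances, and fees are fabricated.

```python
# Single-path routing over a toy payment-channel graph.
import networkx as nx

channels = [  # (sender, receiver, balance, fee)
    ("A", "B", 10, 1), ("B", "D", 4, 1),
    ("A", "C", 8, 2),  ("C", "D", 8, 1),
]
amount = 5

G = nx.DiGraph()
for u, v, bal, fee in channels:
    if bal >= amount:                 # channel can carry the payment
        G.add_edge(u, v, fee=fee)

path = nx.shortest_path(G, "A", "D", weight="fee")
print("route:", path)                # ['A', 'C', 'D'] (B-D lacks balance)
```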
WANG Zhan-Feng, CHENG Guang, MA Wei-Jun, ZHANG Jia-Wei, SUN Zhong-Hao, HU Chao
2022, 33(1):254-273. DOI: 10.13328/j.cnki.jos.006303
Abstract:Protocol reverse engineering is widely used in intrusion detection systems, deep packet inspection, fuzz testing, C&C malware detection, and other fields. First, the formal definition and basic principles of protocol reverse engineering are given. Then, existing protocol reverse engineering methods based on network traces are analyzed in detail from the two aspects of protocol format extraction and protocol state machine inference, and the basic modules, main principles, and characteristics of these algorithms are explained. Finally, the existing algorithms are compared from several aspects, and the development trends of protocol reverse engineering technology are discussed.
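As a toy instance of protocol format extraction, the sketch below aligns equal-position bytes across captured messages and marks positions as constant (likely fixed protocol fields) or variable; the messages are fabricated.

```python
# Infer constant vs. variable byte positions from a set of messages.
msgs = [
    b"\x01\x00HELLOalice\x00",
    b"\x01\x00HELLObob\x00\x00\x00",
    b"\x01\x00HELLOcarol\x00",
]
width = min(len(m) for m in msgs)
template = [
    f"{msgs[0][i]:02x}" if len({m[i] for m in msgs}) == 1 else "??"
    for i in range(width)
]
print(" ".join(template))
# the constant prefix 01 00 48 45 4c 4c 4f ("HELLO") emerges as a field
```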
FU Yong-Quan, ZHAO Hui, WANG Xiao-Feng, LIU Hong-Ri, AN Lun
2022, 33(1):274-296. DOI: 10.13328/j.cnki.jos.006338
Abstract:Network behavior typically describes the interaction process among different kinds of network elements. It is driven by diverse network service protocols and applications, forms evolving and diverse behavior patterns, and reflects the attributes of network scenarios on a given topology during certain periods. Network behavior emulation includes the runtime framework, background traffic emulation, and foreground traffic emulation, which together project network behaviors from the production network environment into a test cyber environment and provide an on-demand, flexible mirroring capability. Its application scenarios continue to evolve, including performance analysis and evaluation, product and technique evaluation, network intrusion detection, and the research and development of network attack and defense techniques. To summarize existing research results and limitations and to analyze future development trends, this study categorizes the relevant definitions of and research frameworks for emulating network behaviors, summarizes the state-of-the-art research progress in terms of the framework, background traffic, and foreground traffic, and systematically surveys both commercial and open-source software tools. Finally, this study proposes future research topics on network behavior emulation.
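As a small illustration of statistics-driven background traffic emulation, the sketch below draws flow arrivals from a Poisson process with heavy-tailed flow sizes, a common modeling assumption; the rate and size parameters are illustrative.

```python
# Generate toy background traffic: Poisson arrivals, heavy-tailed sizes.
import numpy as np

rng = np.random.default_rng(0)
rate = 50.0                                     # flows per second
n_flows = 200

inter_arrivals = rng.exponential(1.0 / rate, n_flows)
start_times = np.cumsum(inter_arrivals)
sizes = (rng.pareto(1.5, n_flows) + 1) * 1500   # bytes, heavy-tailed

for t, s in list(zip(start_times, sizes))[:5]:
    print(f"t={t:.3f}s  flow of {int(s)} bytes")
```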
WANG Chu-Yu, XIE Lei, ZHAO Yan-Chao, ZHANG Da-Qing, YE Bao-Liu, LU Sang-Lu
2022, 33(1):297-323. DOI: 10.13328/j.cnki.jos.006344
Abstract:With the rapid development and deployment of Internet of Things (IoT) technology, the demands of IoT applications have shifted from connecting ubiquitous passive objects to the fusion of humans, computers, and objects. As one of the key technologies in IoT, radio frequency identification (RFID) has become a significant medium for battery-less sensing, owing to the light weight, labeling capability, and easy deployment of RFID tags. To clarify the research progress and methods, this study focuses on battery-less sensing research based on RFID technology. Following the workflow of sensing research, it describes and analyzes existing work on four aspects: signal sources, sensing modes, sensing targets, and application scenarios. It introduces the research progress in RFID-based sensing from these four aspects and discusses the advantages and disadvantages of the different technologies within each. Finally, the existing research is summarized and promising directions for future research are presented.
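To make one common sensing mode concrete: the backscatter phase reported by a reader wraps with the round-trip distance to a tag, θ = (4πd/λ) mod 2π, so phase changes track small displacements. Below is a minimal sketch; the channel frequency and phase readings are illustrative.

```python
# Radial displacement implied by a change in RFID backscatter phase.
import numpy as np

c = 3e8
f = 920.625e6                 # an example UHF RFID channel frequency
lam = c / f

def displacement(phase1, phase2):
    """Radial movement (m) from an unwrapped phase change.

    theta = (4 * pi * d / lam) mod 2*pi  (round trip), so
    delta_d = delta_theta * lam / (4 * pi).
    """
    dtheta = np.unwrap([phase1, phase2])[1] - phase1
    return dtheta * lam / (4 * np.pi)

print(f"{displacement(0.5, 1.3) * 1000:.2f} mm")   # ~ +20.7 mm
```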
WEI Song-Jie, LÜ Wei-Long, LI Sha-Sha
2022, 33(1):324-355. DOI: 10.13328/j.cnki.jos.006280
Abstract:Having originated as an Internet financial technology, blockchain is now prevailing in many application scenarios and attracting attention from both academia and industry. Typical blockchain systems are characterized by decentralization, trustworthiness, openness, autonomy, anonymity, and immutability, which bring trustworthiness to data management and value exchange in distributed computing environments without a centralized trust authority. However, blockchain is still a continuously evolving new technology: its mechanisms, peripheral facilities, and users' security maturity are yet to be improved, resulting in various security threats and frequent security incidents. This paper first overviews blockchain technology and its potential security vulnerabilities when used for token transactions and exchange. Then the most common security problems are enumerated and analyzed, with Bitcoin and Ethereum as two sample systems. The security problems encountered by blockchain peripheral facilities and users are presented, and their root causes are probed. Finally, the surveyed problems are categorized and possible countermeasures or defenses are proposed to address them. Promising research areas and directions of technology evolution are briefly covered for the future.
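To illustrate the immutability property mentioned above, here is a minimal hash-chaining sketch: each block commits to its predecessor's digest, so altering any block invalidates all later links. The payloads are toy data, not a real blockchain protocol.

```python
# Minimal hash chain showing why past blocks resist tampering.
import hashlib

def block_hash(prev_hash: str, payload: str) -> str:
    return hashlib.sha256((prev_hash + payload).encode()).hexdigest()

chain = []
prev = "0" * 64                      # genesis predecessor
for tx in ["alice->bob:5", "bob->carol:2", "carol->dave:1"]:
    prev = block_hash(prev, tx)
    chain.append((tx, prev))

# tamper with the first block and recompute its digest
tampered_hash = block_hash("0" * 64, "alice->bob:500")
print("stored:  ", chain[0][1][:16])
print("tampered:", tampered_hash[:16], "-> every later block fails to verify")
```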
ZHAO Ye-Zi, WANG Lu, XU Yan-Ning, ZENG Zheng, GE Liang-Sheng, ZHU Jun-Qiu, XU Zi-Lin, ZHAO Yu, MENG Xiang-Xu
2022, 33(1):356-376. DOI: 10.13328/j.cnki.jos.006334
Abstract:Nowadays, the demand for photorealistic rendering in the movie, animation, game, and other industries keeps increasing, and highly realistic rendering of 3D scenes usually requires substantial computation time and storage to compute global illumination. How to guarantee rendering quality while improving rendering speed is still one of the core open problems in graphics, and data-driven machine learning methods have opened up a new approach. In recent years, researchers have mapped a variety of highly realistic rendering methods to machine learning problems, thereby greatly reducing the computational cost. This article summarizes and analyzes recent research progress on machine learning-based highly realistic rendering methods, including machine learning-based optimization of global illumination computation, deep learning-based physical material modeling, deep learning-based optimization of participating media rendering, and machine learning-based Monte Carlo denoising. This article discusses in detail how the various rendering methods are mapped to machine learning methods, summarizes the construction of network models and training datasets, and compares the approaches in terms of rendering quality, rendering time, network capability, and other aspects. Finally, this article proposes possible ideas and future prospects for the combination of machine learning and realistic rendering.
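As a sketch of the learned Monte Carlo denoising direction, below is a tiny untrained PyTorch network that maps a noisy radiance image plus auxiliary feature buffers (albedo, normals) to a denoised image. The architecture and data are illustrative assumptions, not any specific published denoiser.

```python
# Skeleton of a feature-conditioned CNN denoiser for Monte Carlo renders.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    def __init__(self, in_ch=9, width=32):  # 3 radiance + 3 albedo + 3 normal
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, 3, 3, padding=1),
        )

    def forward(self, noisy, albedo, normal):
        x = torch.cat([noisy, albedo, normal], dim=1)
        return self.net(x)

model = TinyDenoiser()
noisy = torch.rand(1, 3, 64, 64)     # low-sample-count render (stand-in)
albedo = torch.rand(1, 3, 64, 64)    # auxiliary feature buffers
normal = torch.rand(1, 3, 64, 64)
out = model(noisy, albedo, normal)
print(out.shape)                     # torch.Size([1, 3, 64, 64])
```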