GAO Meng , TENG Jun-Yuan , WANG Zheng
2021, 32(10):2977-2992. DOI: 10.13328/j.cnki.jos.006024 CSTR:
Abstract:Security problems in software systems caused by integer overflow are common, yet existing model checking techniques see few engineering applications because they suffer from state space explosion and do not support the detection of interrupt-driven programs. This paper systematically analyzes the distribution and characteristics of integer overflow in aerospace embedded software through real cases. Building on bounded model checking, it proposes a program model reduction technique based on the dependence of integer overflow variables. It also presents an interference-variable dependency sequentialization method for interrupt-driven programs, based on a characteristic abstraction of interrupt functions. Experiments on benchmark programs and real aerospace embedded software show that the method not only improves analysis efficiency but also makes existing model checking techniques applicable to integer overflow detection in interrupt-driven programs, while preserving the integer overflow detection rate.
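As a rough illustration of the property such a checker verifies, the following Python sketch tests whether signed 32-bit addition or multiplication leaves the representable range. It is not the paper's model reduction or sequentialization method; the function names are hypothetical, and brute-force enumeration over a small bounded input range stands in for a real model checker.

```python
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1


def add_overflows(a: int, b: int) -> bool:
    """Return True if a + b falls outside the signed 32-bit range."""
    return not (INT32_MIN <= a + b <= INT32_MAX)


def mul_overflows(a: int, b: int) -> bool:
    """Return True if a * b falls outside the signed 32-bit range."""
    return not (INT32_MIN <= a * b <= INT32_MAX)


def wrap_int32(x: int) -> int:
    """Wrap an unbounded Python int to a two's-complement 32-bit value."""
    return (x + 2 ** 31) % 2 ** 32 - 2 ** 31


def find_overflow_inputs(op_overflows, lo: int, hi: int):
    """Enumerate every (a, b) with lo <= a, b <= hi that triggers an overflow."""
    return [
        (a, b)
        for a in range(lo, hi + 1)
        for b in range(lo, hi + 1)
        if op_overflows(a, b)
    ]
```

Python integers are unbounded, so the true result can be compared directly with the 32-bit limits; in C the same check has to be done before the operation, because signed overflow is undefined behavior.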
QIAO Jia-Lin , HUANG Xiang-Dong , YANG Yi-Fan , WANG Jian-Min , WU Kai
2021, 32(10):2993-3013. DOI: 10.13328/j.cnki.jos.006026 CSTR:
Abstract:As one of the core components of Apache Hadoop, the Hadoop distributed file system (HDFS) has been widely used in industry. HDFS adopts a multi-replica mechanism to ensure data reliability, but node failures, network partitions, and write failures may leave the replicas inconsistent. HDFS is considered to provide weaker data consistency than traditional file systems, and it is difficult for users to understand when inconsistency will occur. At present, no work has verified its consistency mechanism. Inconsistent data increases the uncertainty of upper-layer applications, so research on the data consistency model is required. The large scale of HDFS makes the analysis more difficult. To comprehend the system, code reading, abstraction, colored Petri net (CPN) modeling, and state-space analysis are conducted. The contributions are as follows. (1) Colored Petri nets are used to model the file reading and writing processes of HDFS; the model describes the functions of the internal components and their cooperation mechanism in detail. (2) Data layer consistency and operation layer consistency of HDFS are analyzed with state-space tools based on the colored Petri net model, identifying the data consistency guaranteed by the system. (3) A time-point repeatable read method is proposed to verify operation layer consistency, and a serial repeatable strategy is used to reduce state-space complexity. Based on these contributions, directions for HDFS application development are proposed to help improve data consistency. The CPN modeling method and techniques can also be applied to the analysis of other distributed information systems.
DAI Qi-Ming , MAO Run-Feng , HUANG Huang , RONG Guo-Ping , SHEN Hai-Feng , SHAO Dong
2021, 32(10):3014-3035. DOI: 10.13328/j.cnki.jos.006276 CSTR:
Abstract:DevOps practices have been widely implemented by software companies to increase the frequency of product delivery and deployment. However, faced with increasingly challenging network security, security problems in software systems are becoming prominent. Time-consuming security practices are difficult to implement effectively in software development activities because of rapid delivery. How to integrate security control measures into software processes to realize continuous security urgently needs to be investigated for companies transitioning to DevOps. DevSecOps, a solution for realizing continuous security in DevOps, has attracted widespread attention from academia and industry and has gradually become a hot research topic in software engineering. In recent years, as DevSecOps research and practice have developed rapidly, people have gained a more comprehensive understanding of DevSecOps, and more relevant security practices have been introduced. Hence, this paper summarizes DevSecOps from five aspects: background, characteristics, practice, benefits, and challenges, aiming to introduce the core content of DevSecOps in detail to the software engineering community in China for the first time. By focusing on the latest theoretical research on DevSecOps and the current state of corporate practice, it also aims to provide a reference for practitioners implementing DevSecOps. Hopefully, this paper can provide a foundation for researchers exploring DevSecOps and encourage more researchers to participate in DevSecOps research.
XU Dong-Qin , LI Jun-Hui , ZHU Mu-Hua , ZHOU Guo-Dong
2021, 32(10):3036-3050. DOI: 10.13328/j.cnki.jos.006207 CSTR:
Abstract:Given an AMR (abstract meaning representation) graph, AMR-to-text generation aims to generate text with the same meaning. Related studies show that the performance of AMR-to-text generation is severely limited by the size of the manually annotated dataset. To alleviate the dependence on manually annotated data, this study proposes a novel multi-task pre-training approach for AMR-to-text generation. In particular, based on a large-scale automatically constructed AMR dataset, three relevant pre-training tasks are defined: AMR denoising auto-encoding, sentence denoising auto-encoding, and AMR-to-text generation itself. In addition, to fine-tune the pre-trained models, the vanilla fine-tuning method is extended to multi-task learning fine-tuning, which enables the final model to maintain performance on both AMR-to-text generation and the pre-training tasks. With an automatic dataset of 0.39M sentences, detailed experiments on two AMR benchmarks show that the proposed pre-training approach significantly improves the performance of AMR-to-text generation, with improvements of 12.27 BLEU on AMR 2.0 and 7.57 BLEU on AMR 3.0. This greatly advances the state of the art, reaching 40.30 BLEU on AMR 2.0 and 38.97 BLEU on AMR 3.0. To the best of our knowledge, this is the best result achieved so far on AMR 2.0, and AMR-to-text generation performance on AMR 3.0 is reported for the first time.
ZHANG Wei , LIN Ze-Yi , CHENG Jian , KE Ming-Yu , DENG Xiao-Ming , WANG Hong-An
2021, 32(10):3051-3067. DOI: 10.13328/j.cnki.jos.006217 CSTR:
Abstract:In recent years, hand gestures have been widely used as an input channel in human-computer interaction, virtual reality, and other fields. In particular, with the emergence of advanced human-computer interaction technology and the rapid development of computer technology (such as deep learning, GPUs, and parallel computing), gesture understanding and interaction methods have made breakthroughs. This paper reviews the research progress of dynamic gesture understanding and typical interaction applications. Firstly, the core concepts of gesture interaction are elaborated. Secondly, the progress of dynamic gesture recognition and detection is introduced. Thirdly, representative applications of dynamic gesture interaction are described. Finally, future development trends of gesture interaction are discussed.
WANG Shuang-Cheng , ZHENG Fei , ZHANG Li
2021, 32(10):3068-3084. DOI: 10.13328/j.cnki.jos.006012 CSTR:
Abstract:The Bayesian network is a powerful tool for studying causal relationships between variables. Causal learning based on Bayesian networks consists of two parts, structure learning and parameter learning, of which structure learning is the core. At present, Bayesian networks are mainly used to discover causality in non-time-series data (non-time-series causality), and what is learned from the data is the causal relationship between general variables. In this study, the causality of time series is learned through time series preconditioning, time series variable sorting, construction of a transformed dataset, local greedy search-scoring, and so on. By combining time series preconditioning (including segmentation), structure learning of the causal relationship for time series segments, construction of a causality structure dataset, variable sorting of the causal relationship, local greedy search-scoring, maximum likelihood parameter estimation, and so on, a meta causal relationship (used to study the randomness of causal relationships) is established. Thus, two levels of causality learning can be realized, laying the foundation for further quantitative causal analysis. Experiments and analyses are carried out on simulated, UCI, and financial time series, and the results verify the validity, reliability, and practicability of learning causal relationships and meta causality based on Bayesian networks.
ZHU Er-Zhou , SUN Yue , ZHANG Yuan-Xiang , GAO Xin , MA Ru-Hui , LI Xue-Jun
2021, 32(10):3085-3103. DOI: 10.13328/j.cnki.jos.006016 CSTR:
Abstract:Clustering analysis is a hot research topic in the fields of statistics, pattern recognition, and machine learning. Through effective clustering analysis, the intrinsic structure and characteristics of datasets can be discovered. However, because clustering is unsupervised, existing clustering methods still suffer from instability and inaccuracy when processing different types of datasets. To solve these problems, a hybrid clustering algorithm, K-means-AHC, is first proposed by combining the K-means algorithm with the hierarchical clustering algorithm. Then, based on inflection point detection, a new clustering validity index, DAS (difference of average synthesis degree), is proposed to evaluate the results of the K-means-AHC clustering algorithm. Finally, by combining the K-means-AHC algorithm with the DAS index, an effective method for finding the optimal number of clusters and the optimal partition of a dataset is designed. The K-means-AHC algorithm is tested on many kinds of datasets. The experimental results show that the proposed algorithm improves the accuracy of clustering analysis without too much time overhead. In addition, the new DAS index is superior to commonly used clustering validity indexes in evaluating clustering results.
XI Liang , YAO Zhi-Yu , ZHANG Feng-Bin
2021, 32(10):3104-3121. DOI: 10.13328/j.cnki.jos.006017 CSTR:
Abstract:The artificial immune system (AIS) is an important branch of artificial intelligence and is widely used in many fields such as anomaly detection, data mining, and machine learning. Detectors are its core knowledge set, and application performance is determined by the generation, optimization, and detection of the detectors. At present, the problem space of AIS is mainly the real-valued shape-space. However, detectors in the real-valued shape-space have unsolved problems, such as holes in the non-self shape-space, slow generation, overlapping and redundant detectors, and the curse of dimensionality, which lead to unsatisfactory detection performance. In view of this, based on the neighborhood shape-space (a new shape-space) and the improved neighborhood negative selection algorithm, a multi-source-inspired neighborhood negative selection algorithm (MSNNSA) is proposed by introducing the chaotic map and the genetic algorithm. Based on this algorithm, multi-source-inspired immune detector generation and detection methods in the neighborhood shape-space are built to make detector construction and generation more targeted, so that the generated detectors are better distributed. The method also improves detector generation efficiency and detection performance, and overcomes the shortcomings of the real-valued shape-space mentioned above. Experimental results show that the proposed method improves generation efficiency, overall detection performance, and stability.
LI Sheng-Jie , LI Xiang , ZHANG Yue , WANG Ya-Sha , ZHANG Da-Qing
2021, 32(10):3122-3138. DOI: 10.13328/j.cnki.jos.006027 CSTR:
Abstract:As one of the most common daily behaviors, walking can reveal much important information, such as a person's identity and health condition. Fine-grained walking information such as walking velocity, walking direction, the number of steps, and stride length can provide important references for indoor tracking, gait analysis, elder care, and other context-aware applications. Thus, sensing human walking with ambient Wi-Fi signals has attracted wide attention from researchers in recent years. To sense human walking, current methods usually need to gather a large amount of walking data and then extract signal features from it through empirical observation or off-line training. However, due to the lack of theoretical guidance, the extracted signal features are indirect and often contain redundant information about the environment and the sensing target. Therefore, whenever the environment or the sensing target changes, these systems have to regather data and relearn the signal features for the new situation. This causes difficulties in real-life deployment, where wireless environments vary. Different from these works, this study achieves walking recognition in continuous daily activities without any learning requirement. Moreover, fine-grained parameters such as walking velocity, walking direction, the number of steps, and stride length are estimated to provide crucial context for upper-layer context-aware applications. Specifically, by analyzing the relationship between channel state information (CSI) and the Doppler effect introduced by human movement, a Doppler velocity model is first established, revealing the theoretical relationship between human movement and CSI variation. Then, by utilizing the MUSIC algorithm, the Doppler velocity is obtained from Wi-Fi CSI; it serves as an effective signal feature that reveals human movement and is unrelated to the environment and the human target. Finally, by studying the relationship between Doppler velocity and real human walking velocity, walking behavior is recognized and fine-grained walking parameters are estimated. Extensive experiments with different volunteers in different environments demonstrate the accuracy and robustness of the system. The system achieves an accuracy of 95.5% in walking recognition, a relative median error of 12.2% in walking velocity estimation, a median error of 9° in walking direction estimation, an accuracy of 90% in step counting, and a median error of 0.12 m in stride length estimation.
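The link between human movement and CSI variation that the abstract refers to can be sketched with a commonly used single-reflection model. The symbols below are our own assumptions, not notation taken from the paper: $H_s(f)$ is the static component, $a(f,t)$ the complex attenuation of the path reflected by the person, $d(t)$ its length, and $\lambda$ the carrier wavelength.

```latex
% CSI as a static part plus one dynamic path reflected by the moving person
H(f, t) = H_s(f) + a(f, t)\, e^{-j 2\pi d(t)/\lambda}

% The rate of change of the reflected path length is the Doppler frequency
% shift scaled by the wavelength
f_D(t) = -\frac{1}{\lambda}\,\frac{\mathrm{d}\, d(t)}{\mathrm{d}t},
\qquad
v_{\text{path}}(t) = \frac{\mathrm{d}\, d(t)}{\mathrm{d}t} = -\lambda\, f_D(t)
```

Under this assumed model, a person moving so that the reflected path shortens produces a positive Doppler shift; relating the path length change rate to the actual walking velocity also requires the geometry of the transmitter, receiver, and person.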
GUO Jun-Jun , LIU Zhen-Cheng , YU Zheng-Tao , HUANG Yu-Xin , XIANG Yan
2021, 32(10):3139-3150. DOI: 10.13328/j.cnki.jos.006028 CSTR:
Abstract:Because training cases for few-shot charges are insufficient and the case descriptions of confusing charges are similar, the prediction performance of existing methods on few-shot charges and confusing charges is not promising. To address these drawbacks, a novel prediction method for few-shot and confusing charges is proposed, based on a bi-directional mutual attention mechanism with auxiliary sentences of the case. In the proposed model, firstly, the auxiliary sentence of the case is constructed from judicial domain knowledge and serves as external knowledge for mapping the case description to the corresponding charge. Secondly, multi-granularity features of the case description and the auxiliary sentence are extracted at both the word level and the character level. At the same time, the auxiliary sentence and the case description are used to build bi-directional mutual attention. Finally, a tendency representation of the case description guided by the auxiliary sentence is derived, which improves the prediction accuracy for few-shot and confusing charges. Experimental results on benchmark criminal case data show that the proposed model increases the F1 value and prediction accuracy by 13.2% and 4.5%, respectively, and increases the F1 values for few-shot charges and confusing charges by 4.3% and 8.2%, respectively, which significantly enhances prediction performance for few-shot and confusing charges.
2021, 32(10):3151-3175. DOI: 10.13328/j.cnki.jos.006030 CSTR:
Abstract:The flower pollination algorithm (FPA) is a novel, simple, and efficient optimization algorithm proposed in recent years. It has been widely used in various fields, but its search strategy has some defects that impede its application. Therefore, this paper introduces an improved flower pollination algorithm based on multiple strategies. First, a new global search strategy is adopted, using two groups of random individual difference vectors and Lévy flight to increase population diversity and expand the search range, making it easier for the algorithm to escape local optima and improving its exploitation ability. Second, an elite mutation strategy is used in the local search, and a new local pollination strategy is developed by combining it with a random individual mutation mechanism. Elite individuals guide the evolution direction of other individuals and improve the search speed of the algorithm. The random individual mutation strategy keeps the population diverse and enhances the continuous optimization capability of the algorithm. In addition, the two mutation strategies are balanced through a linearly decreasing probability rule so that they complement each other and improve the optimization capability of the algorithm. Finally, a new solution is generated by a cosine function search factor strategy to replace any solution that has not improved, raising solution quality. The stability and effectiveness of the algorithm are demonstrated by simulation experiments on five kinds of classical test functions and by statistical analysis. The experimental results show that the improved algorithm is competitive with existing classical and state-of-the-art improved algorithms. The proposed algorithm is also used to solve the route planning problem of the unmanned combat aerial vehicle (UCAV) in the military field. The test results show that the proposed algorithm also has certain advantages in solving practical engineering problems.
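For readers unfamiliar with the Lévy flight step mentioned above, here is a minimal Python sketch of the global pollination move in the standard FPA, x_new = x + gamma * L * (best - x), with the Lévy step L drawn using Mantegna's algorithm. This is the baseline algorithm only, not the multi-strategy variant proposed in the paper; the function names and the default values of `gamma` and `beta` are assumptions chosen for illustration.

```python
import math
import random


def levy_sigma(beta: float) -> float:
    """Scale of the numerator Gaussian in Mantegna's algorithm."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    return (num / den) ** (1 / beta)


def levy_step(beta: float, rng: random.Random) -> float:
    """Draw one heavy-tailed step length, approximating a Levy flight."""
    u = rng.gauss(0.0, levy_sigma(beta))
    v = rng.gauss(0.0, 1.0)
    while v == 0.0:  # avoid division by zero in the (very unlikely) exact-zero case
        v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)


def global_pollination(x, best, rng, gamma: float = 0.1, beta: float = 1.5):
    """Move solution x toward the current best with a per-dimension Levy step."""
    return [
        xi + gamma * levy_step(beta, rng) * (bi - xi)
        for xi, bi in zip(x, best)
    ]
```

A call such as `global_pollination([0.0, 0.0], [1.0, 2.0], random.Random(42))` returns a candidate of the same dimension. Most steps are small, but the heavy tail occasionally produces a long jump, which is what helps the search leave local optima.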
WEI Jian-Hao , XIA Ye-Feng , GONG Xue-Qing
2021, 32(10):3176-3202. DOI: 10.13328/j.cnki.jos.006203 CSTR:
Abstract:Traditional database systems are built around a query-at-a-time model, in which concurrent queries are executed independently. Due to the limitations of this model, traditional databases cannot optimize multiple queries at a time. Multi-query sharing technology shares the common parts of queries to improve the overall response time and throughput of the system. This study divides multi-query execution modes into two categories and introduces their respective prototype systems: multi-query prototype systems based on the global query plan and those based on on-demand simultaneous pipelining. The advantages of the two kinds of systems and their applicable scenarios are also discussed. Next, according to the stage of query processing, multi-query sharing techniques are divided into two major types: those in the query compilation phase and those in the query execution phase. Taking these two directions as threads, research results on topics such as multi-query plan representation, multi-query expression combination, multi-query sharing algorithms, and multi-query optimization are reviewed. On this basis, applications of shared query technology in relational and non-relational databases are also introduced. Finally, the opportunities and challenges faced by shared query technology are analyzed.
ZHU Yue-An , JIAN Huai-Bing , LONG Yong-Chao , LI Bin , WANG Shu , WU Xi-Liang , ZHONG Zhi-Chu , ZHANG Yan-Song
2021, 32(10):3203-3218. DOI: 10.13328/j.cnki.jos.006023 CSTR:
Abstract:In recent years, write-heavy applications have become more and more prevalent. How to handle this kind of workload efficiently is an intensive research direction in the field of database systems. The overhead of write operations mainly comes from two factors. One is at the hardware level, i.e., the I/O cost caused by write operations, which cannot be removed in the short term. The other is the dual-copy software architecture, i.e., the multiple writes caused by modifying the in-memory data copy and generating log records. The log-as-database architecture (hereafter referred to as a single-copy system) can reduce both the I/O and the software cost caused by writes. However, existing log-as-database systems either are built on top of special infrastructure such as InfiniBand or NVRAM (non-volatile random access memory), which is far from widely available, or are constructed with the help of other systems such as Dynamo, which lack flexibility and generality. This study builds from scratch a single-copy system called LogStore for commodity environments. It adopts the log-as-database design philosophy to boost write performance and minimize the gap between primary and secondary. Embedding the consensus module into the system, rather than depending on auxiliary systems, makes it more flexible and controllable. A novel execution model that binds each thread to a certain partition, combined with multi-version concurrency control, eliminates read-write conflicts, write-write conflicts, and context switch overhead in a lock-free style. The YCSB benchmark is used to assess system performance thoroughly. Compared with the prevalent key-value store HBase and its single-copy implementation LogBase, the proposed system achieves about 4x better performance. In terms of crash recovery, LogStore can finish recovery within one minute for TB-scale data volumes, an order of magnitude less recovery time than LogBase.
LI Huo-Ran , LIU Xuan-Zhe , MEI Qiao-Zhu , MEI Hong
2021, 32(10):3219-3235. DOI: 10.13328/j.cnki.jos.006199 CSTR:
Abstract:Smartphones and smartphone apps have undergone explosive growth in the past decade. However, smartphone battery technology has not been able to keep pace with the rapid growth in the capacity and functionality of devices and apps. As a result, the battery has always been a bottleneck in a user's daily experience of smartphones. An accurate estimate of the remaining battery life could help users schedule their activities and use their smartphones more efficiently. Existing studies on battery life prediction have been primitive due to the lack of real-world smartphone usage data at scale. This paper presents a novel method that uses state-of-the-art machine learning models for battery life prediction, based on comprehensive and real-time usage traces collected from smartphones. The method is evaluated using a dataset collected from 51 users over 21 months, which covers comprehensive and fine-grained smartphone usage traces including system status, sensor indicators, system events, and app status. We find that the battery life of a smartphone can be accurately predicted based on how the user uses the device in real time, in the current session, and in history. In conclusion, the proposed model significantly raises prediction accuracy.
LIU Zhen , HAN Yi-Liang , YANG Xiao-Yuan , LIU Shu-Guang
2021, 32(10):3236-3253. DOI: 10.13328/j.cnki.jos.006013 CSTR:
Abstract:To save bandwidth and computation without sacrificing security when constructing a multi-receiver signcryption scheme, this study extends the paradigm of re-using all randomness to another common scenario, proposes the re-use of partial randomness, and redefines the multi-receiver signcryption scheme, the reproducible signcryption scheme, and the security model for the re-use of partial randomness. It then states and proves the reproducibility theorem: the re-use of partial randomness is secure provided that the scheme is reproducible. Finally, it proves that the lattice-based LWWD16 signcryption scheme is a reproducible signcryption scheme with the re-use of partial randomness, and constructs the first lattice-based multi-message, multi-receiver signcryption scheme with the re-use of partial randomness, which satisfies indistinguishability against adaptive chosen ciphertext attacks (IND-CCA2) and existential unforgeability against chosen message attacks (EUF-CMA). Efficiency analysis shows that the multi-message, multi-receiver signcryption scheme with the re-use of partial randomness effectively saves bandwidth and computation, and it provides a general construction method for multi-message, multi-receiver signcryption.
ZHOU Jie-Ying , HE Peng-Fei , QIU Rong-Fa , CHEN Guo , WU Wei-Gang
2021, 32(10):3254-3265. DOI: 10.13328/j.cnki.jos.006062 CSTR:
Abstract:As a security defense technique that protects networks from attacks, the network intrusion detection system plays a crucial role in guaranteeing computer system and network security. Machine learning has been widely used in intrusion detection to achieve high intelligence and accuracy. Targeting the multi-classification problem of unbalanced data in network intrusion detection, this paper improves the current multi-classification method and proposes an intrusion detection model, RF-GBDT, which uses the random forest model for feature conversion and the gradient boosting decision tree model for classification. The model is divided into three parts: feature selection, feature conversion, and classifier. Experimental tests of the RF-GBDT model were carried out on the UNSW-NB15 dataset. Compared with three other algorithms in the same field, RF-GBDT not only reduces training time but also achieves a higher detection rate and a lower false alarm rate. The area under the receiver operating characteristic curve (AUC) on the test dataset reaches 98.57%. RF-GBDT has significant advantages in solving the multi-classification problem of unbalanced data in network intrusion detection and is a feasible method for network intrusion detection.
ZHANG Ming-Wu , HUANG Jia-Jun , HARN Lein
2021, 32(10):3266-3282. DOI: 10.13328/j.cnki.jos.006086 CSTR:
Abstract:With the rapid development of medical information systems, cloud-based medical information systems store massive electronic health records (EHRs) in medical clouds and employ the powerful storage and computing capacity of the clouds to manage EHRs in a safe and unified manner. Although traditional encryption mechanisms can protect the privacy of medical data on semi-honest cloud servers, performing safe and efficient range-based search over encrypted EHRs remains an open problem. To address this problem, this work proposes a range-based multi-keyword searchable encryption scheme. It implements searchable encryption for complex query structures with scalar-product-preserving encryption and supports queries with conjunctive keywords, ranges, and wildcard characters. Furthermore, the indexes and trapdoors are created in a randomized manner to hide the search pattern and protect the privacy of search statements. The Hadamard product is adopted to reduce the dimension of the required key matrix. Theoretical analysis and experimental results show that the scheme efficiently protects the privacy of users' search strategies while guaranteeing the privacy of medical data. The scheme improves retrieval efficiency and reduces the time for index and trapdoor creation, achieving range-based search of medical data in multi-user and multi-file medical environments.
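The idea behind scalar-product-preserving encryption can be sketched in a few lines of Python. The sketch assumes the basic textbook form, in which a secret invertible matrix M encrypts an index vector p as M^T p and a query vector q as M^{-1} q, so that the server can compute p . q without seeing either vector. It deliberately omits the randomization and the Hadamard-product dimension reduction described in the abstract; the 2x2 key and all function names are hypothetical.

```python
from fractions import Fraction


def mat_vec(m, v):
    """Multiply matrix m (list of rows) by column vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    """Return the transpose of matrix m."""
    return [list(col) for col in zip(*m)]


def inverse_2x2(m):
    """Exact inverse of a 2x2 integer matrix using rational arithmetic."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("key matrix must be invertible")
    return [
        [Fraction(d, det), Fraction(-b, det)],
        [Fraction(-c, det), Fraction(a, det)],
    ]


def encrypt_index(key, p):
    """Data owner: encrypted index is M^T p."""
    return mat_vec(transpose(key), p)


def make_trapdoor(key, q):
    """Query user: trapdoor is M^{-1} q."""
    return mat_vec(inverse_2x2(key), q)


def server_score(enc_index, trapdoor):
    """Cloud server: the dot product of the ciphertexts equals p . q."""
    return sum(a * b for a, b in zip(enc_index, trapdoor))
```

The scalar product survives because (M^T p) . (M^{-1} q) = p^T M M^{-1} q = p^T q. Exact fractions are used here so the equality holds without floating-point error.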
2021, 32(10):3283-3292. DOI: 10.13328/j.cnki.jos.006018 CSTR:
Abstract:Rain streaks can severely degrade the quality of captured images and affect outdoor vision. However, because rain varies in shape, direction, and density across images, removing rain from a single image is a difficult task. This study proposes a single-image de-raining method using an ensemble recurrent dual-attention-residual network, called RDARENet. In the network, since contextual information is very important for rain removal, a multi-scale dilated convolution network is first adopted to acquire a large receptive field. Because rain streaks can be regarded as the accumulation of multiple rain streak layers, residual blocks with channel attention and spatial attention mechanisms are used to extract the features of the rain streaks and restore the background layer information. The channel attention assigns different weights to the rain streak layers, and the spatial attention enhances the representation of a region through the relationship between adjacent spatial features. As the network deepens, to prevent the loss of low-level information, a cascaded residual network and a long short-term memory (LSTM) network are used to transfer low-level feature information to higher levels and remove rain streaks stage by stage. At the output of the network, an ensemble learning method weights the output of each stage through a gating network and sums the weighted outputs to obtain the clean image. Extensive experiments demonstrate that rain removal and the restoration of texture details are greatly improved.
CHEN Xing-Shu , CAI Meng-Juan , WANG Wei , WANG Qi-Xu , JIN Xin
2021, 32(10):3293-3309. DOI: 10.13328/j.cnki.jos.006022 CSTR:
Abstract:Virtual machine introspection is a method for acquiring information about a target virtual machine and for monitoring and analyzing its running status from outside that virtual machine. To address the poor portability and low efficiency of semantic reconstruction in existing virtual machine introspection methods, an improved semantic reconstruction method is proposed in this study. In this method, constraint conditions are derived from the characteristics of the members of the process structure, and the offsets of the key members of the process structure are obtained automatically without knowing the kernel version of the target virtual machine. The resulting offsets can be provided to open-source or self-developed virtual machine introspection tools to complete semantic reconstruction. The VMOffset prototype system is implemented on the KVM (kernel-based virtual machine) virtualization platform, and its effectiveness and performance are analyzed experimentally on virtual machines running operating systems with different kernel versions. The results show that VMOffset automatically completes process-level semantic reconstruction for each target virtual machine and introduces a performance loss of no more than 0.05%, incurred only in the startup phase of the target virtual machine.
WU Hua , YU Zhen-Hua , CHENG Guang , HU Xiao-Yan
2021, 32(10):3310-3330. DOI: 10.13328/j.cnki.jos.006025 CSTR:
Abstract:Encrypted video identification is an urgent problem in the fields of network security and network management. Existing methods match the transmission fingerprint of an encrypted video against the fingerprints in a video fingerprint database. Existing research mainly focuses on the matching and recognition algorithms, but there is neither dedicated research on the data sources used for matching nor an analysis of precision and false positive rate in large-scale video fingerprint databases. As a result, the practicality of existing methods cannot be guaranteed. To address this problem, this study first analyzes why the ciphertext length of an application data unit (ADU) encrypted by TLS drifts relative to the plaintext length. For the first time, HTTP header features and TLS fragment features are used as fitting features for ADU length restoration. The study then proposes HHTF, an accurate fingerprint restoration method for encrypted ADUs, and applies HHTF to encrypted video recognition. A large fingerprint database of 200,000 videos was built based on a simulation of real Facebook videos. Theoretical derivation and calculation demonstrate that the accuracy, precision, and recall can reach 100% with a false positive rate of 0, while requiring only one-tenth of the ADUs needed by the existing method. The experimental results on the simulated large-scale video fingerprint database are consistent with the theoretical calculations. The HHTF method makes it possible to recognize encrypted video transmissions in large-scale video fingerprint database scenarios, which is of great practical value.