CAI Yu, SUN Cheng-Guo, DU Zhao-Hui, LIU Zi-Xing, KANG Meng-Bo, LI Shuang-Shuang
2021, 32(8):2289-2306. DOI: 10.13328/j.cnki.jos.006002
Abstract: Improving the efficiency of heterogeneous HPL requires fully utilizing the computing power of both the accelerators and the CPU. The accelerators integrate more computing cores and are responsible for the main computation, while the general-purpose CPU handles task scheduling and also participates in the computation. Under the premise of reasonable task division and load balancing, optimizing CPU-side computing performance is particularly important for improving overall efficiency. Optimizing basic linear algebra subprogram (BLAS) functions for the architectural characteristics of a specific platform can often make full use of the general-purpose CPU's computing capability and thus improve overall system efficiency. The BLAS-like Library Instantiation Software (BLIS) is an open-source BLAS framework with the advantages of easy development, portability, and modularity. Based on the architecture of the heterogeneous system platform and the characteristics of the HPL algorithm, this study uses the three-level cache, vectorized instructions, and multi-threaded parallelism to optimize the BLAS functions called by the CPU, applies auto-tuning to optimize the matrix blocking parameters, and eventually forms the HygonBLIS library. Compared with MKL, the overall performance of HPL using HygonBLIS is improved by 11.8% in the heterogeneous environment.
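As a rough illustration of the cache-blocking idea behind BLIS-style BLAS tuning described above, the following sketch shows a blocked matrix multiplication whose block sizes are the kind of parameters an auto-tuner would search over. It is not code from the paper; the block sizes and loop order are hypothetical.

```python
# Minimal sketch, assuming a BLIS-style three-level blocking of GEMM.
import numpy as np

def blocked_gemm(A, B, C, mc=64, kc=64, nc=64):
    """C += A @ B; (mc, kc, nc) are tunable block sizes for the cache levels."""
    m, k = A.shape
    _, n = B.shape
    for jc in range(0, n, nc):          # panels of B/C (outer cache level)
        for pc in range(0, k, kc):      # panels of A/B (middle cache level)
            for ic in range(0, m, mc):  # blocks of A/C (inner cache level)
                C[ic:ic+mc, jc:jc+nc] += A[ic:ic+mc, pc:pc+kc] @ B[pc:pc+kc, jc:jc+nc]
    return C

# Usage: an auto-tuner would sweep (mc, kc, nc) and keep the fastest configuration.
A = np.random.rand(256, 256); B = np.random.rand(256, 256); C = np.zeros((256, 256))
blocked_gemm(A, B, C)
assert np.allclose(C, A @ B)
```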
LI Lei-Sheng, YANG Wen-Hao, MA Wen-Jing, ZHANG Ya, ZHAO Hui, ZHAO Hai-Tao, LI Hui-Yuan, SUN Jia-Chang
2021, 32(8):2307-2318. DOI: 10.13328/j.cnki.jos.006003
Abstract: Mainstream supercomputers increasingly adopt heterogeneous systems with accelerators. The growth of the accelerators' floating-point performance requires the other components, including the CPU, memory, bus, and network, to match their speed. High performance Linpack (HPL) is the traditional benchmark for high-performance computers. Complex heterogeneous systems bring both opportunities and challenges to benchmarking with HPL. Therefore, for heterogeneous supercomputers, a new task partitioning scheme between the CPU and the accelerators is proposed, using balance-point theory to guide the optimization of HPL. A look-ahead algorithm is proposed to coordinate the collaboration of the CPU and the accelerators, together with a contiguous row-swap algorithm that enables parallelism among the CPU, the accelerators, and the network. Besides, new panel factorization and row-swap implementations are designed for the accelerated system, improving the effectiveness and efficiency of accelerator usage. With 4 GPUs per computing node, an HPL efficiency of 79.51% is achieved on a single node.
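The look-ahead idea mentioned above can be sketched as overlapping CPU panel factorization with the accelerator's trailing-matrix update. The sketch below is illustrative only, not the authors' implementation; all function names are hypothetical placeholders.

```python
# Minimal sketch, assuming a look-ahead pipeline between CPU and accelerator.
from concurrent.futures import ThreadPoolExecutor

def factorize_panel(panel):           # CPU work (placeholder)
    return f"LU({panel})"

def update_trailing(panel_lu, block): # accelerator work (placeholder)
    return f"{block} updated with {panel_lu}"

def hpl_lookahead(panels, trailing_blocks):
    with ThreadPoolExecutor(max_workers=2) as pool:
        lu = factorize_panel(panels[0])
        for i in range(len(panels) - 1):
            # start factorizing the *next* panel as soon as its columns are ready
            next_lu = pool.submit(factorize_panel, panels[i + 1])
            # meanwhile the accelerator applies the current panel to the trailing matrix
            for b in trailing_blocks[i]:
                update_trailing(lu, b)
            lu = next_lu.result()     # synchronize before the next iteration
    return lu

hpl_lookahead(["P0", "P1", "P2"], [["B0", "B1"], ["B2"], []])
```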
SHUI Chao-Yang, YU Xian-Zhi, WANG Yin-Shan, TAN Guang-Ming
2021, 32(8):2319-2328. DOI: 10.13328/j.cnki.jos.006004
Abstract: As heterogeneous systems become one of the most important choices for building supercomputers, how to orchestrate the CPU and the accelerators to leverage the great computing power of heterogeneous systems is of great significance. HPL is the most important benchmark in the HPC field, but the traditional HPL algorithm, which targets CPU-only systems, cannot achieve high performance by merely offloading the matrix multiplication workload to the accelerators. To solve this problem, this work proposes an HPL performance model and a multithreaded fine-grained pipelining algorithm for a heterogeneous system built from a domestic processor and a domestic accelerator. Meanwhile, a lightweight cross-platform heterogeneous framework is implemented to carry out a cross-platform HPL algorithm. The proposed performance model predicts HPL performance accurately on similar heterogeneous systems. On the NVIDIA platform, the proposed HPL algorithm outperforms NVIDIA's proprietary counterpart by 9%. On the domestic-processor-domestic-accelerator platform, the final optimized Linpack program achieves 2.3 PFLOPS on 512 nodes, with a floating-point efficiency of 71.1%.
SUN Qiao, SUN Jia-Chang, MA Wen-Jing, ZHAO Yu-Wen
2021, 32(8):2329-2340. DOI: 10.13328/j.cnki.jos.006005
Abstract: HPL (high performance Linpack) is a widely used benchmark for measuring computer performance. Over the decades, optimizing and tuning HPL to evaluate contemporary cutting-edge computer platforms has constantly drawn attention in both industry and academia. For current heterogeneous HPC platforms with multiple accelerating co-processors, this paper proposes a high-performance HPL benchmark approach, Hetero-HPL. In Hetero-HPL, the mapping between the process set and the (co-)processor set becomes adjustable, so that the computation within each computing node avoids inter-process message exchange and each important procedure of the HPL algorithm can make full use of the node's hardware resources, such as memory, CPU cores, co-processors, and the PCI-e bus. Without redundant computation or communication, the working set of Hetero-HPL is not restricted by the pinned-memory size limit of a single allocation; it is distributed so that the workload is balanced among all the co-processors and massive fine-grained parallelism can be exploited. On an experimental platform with four co-processors, Hetero-HPL reaches an efficiency of 76.5% (the efficiency of the dgemm function is 84%) on one computing node, and further experiments suggest that Hetero-HPL is also feasible in a distributed environment.
LIU Fang-Fang, WANG Zhi-Jun, WANG Quan, WU Li-Xin, MA Wen-Jing, YANG Chao, SUN Jia-Chang
2021, 32(8):2341-2351. DOI: 10.13328/j.cnki.jos.006006
Abstract: The HPCG benchmark is a new standard for supercomputer ranking. It mainly evaluates how fast a supercomputer can solve a large-scale sparse linear system, which is closer to real applications, and it has attracted extensive attention recently. Research on parallelizing HPCG on domestic heterogeneous many-core supercomputers is very important, not only to improve the HPCG ranking of Chinese supercomputers but also to provide a reference on parallel algorithms and optimization techniques for many applications. This work studies the parallelization and optimization of HPCG on a domestically produced complex heterogeneous supercomputer, leveraging a blocked graph coloring algorithm to explore parallelism for the first time on this system, and proposes a graph coloring algorithm for structured grids. The parallelism produced by this algorithm is higher than that of the traditional JPL and CC algorithms, with better coloring quality. With this algorithm, the number of HPCG iterations is reduced by 3 and the total performance is improved by 6%. This study also analyzes the data transfer cost of each component in the complex heterogeneous system and provides a task partitioning method that is more suitable for HPCG; the neighbor communication cost in SpMV and SymGS is hidden by inner-outer region partitioning. In the whole-system test, an HPCG performance of 1.67% of the system's peak GFLOPS is achieved, and the weak-scaling efficiency on the whole system reaches 92% relative to single-node performance.
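To make the role of graph coloring concrete, the sketch below shows a plain greedy coloring of a 2D structured grid: points sharing a color have no stencil dependency and can be processed in parallel, e.g., within SymGS. This is a generic illustration, not the paper's blocked algorithm; the grid size and 5-point stencil are hypothetical.

```python
# Minimal sketch, assuming a 5-point stencil on a 2D structured grid.
def color_structured_grid(nx, ny):
    colors = {}
    for j in range(ny):
        for i in range(nx):
            # neighbours already colored under lexicographic ordering
            used = {colors.get((i - 1, j)), colors.get((i, j - 1))}
            c = 0
            while c in used:       # pick the smallest color unused by neighbours
                c += 1
            colors[(i, j)] = c
    return colors

coloring = color_structured_grid(8, 8)
num_colors = max(coloring.values()) + 1
print(f"{num_colors} colors; points of one color form an independent, parallel set")
```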
SUN Qiao, LI Lei-Sheng, ZHAO Hai-Tao, ZHAO Hui, WU Chang-Mao
2021, 32(8):2352-2364. DOI: 10.13328/j.cnki.jos.006007
Abstract: Task parallelism is one of the fundamental patterns for designing parallel algorithms. Due to algorithmic complexity and distinctive hardware features, however, implementing algorithms with task parallelism often remains challenging. For the new SW26010 many-core CPU platform, this study proposes a general runtime framework, SWAN, which supports nested task parallelism. SWAN provides high-level abstractions for implementing task parallelism so that programmers can focus mainly on the algorithm itself and enjoy enhanced productivity. In terms of performance, the shared resources and information managed by SWAN are partitioned in a fine-grained manner to avoid fierce contention among worker threads. The core data structures within SWAN take advantage of the platform's high-bandwidth memory access mechanism, fast on-chip scratchpad cache, and atomic operations to reduce SWAN's own overhead. Besides, SWAN provides dynamic load-balancing strategies at runtime to keep the threads fully occupied. In the experiments, a set of recursive algorithms with nested parallelism, including the N-queens problem, binary-tree traversal, quick sort, and convex hull, are implemented with SWAN on the target platform. The experimental results show that each algorithm gains a significant speedup, from 4.5x to 32x, over its serial counterpart, which suggests that SWAN offers both high usability and high performance.
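The following sketch shows the shape of a nested task-parallel recursion of the kind listed above (here quick sort) expressed with spawn/sync semantics. It uses Python's executor as a stand-in; SWAN's actual API on SW26010 is not shown here, and the interface below is hypothetical.

```python
# Minimal sketch, assuming a spawn/sync style of nested task parallelism.
from concurrent.futures import ProcessPoolExecutor

def quick_sort(xs, pool=None, depth=2):
    if len(xs) <= 1:
        return xs
    pivot, rest = xs[0], xs[1:]
    lo = [x for x in rest if x < pivot]
    hi = [x for x in rest if x >= pivot]
    if pool is None or depth == 0:                    # fall back to serial recursion
        return quick_sort(lo) + [pivot] + quick_sort(hi)
    left = pool.submit(quick_sort, lo, None, 0)       # "spawn" a nested child task
    right = quick_sort(hi, pool, depth - 1)           # continue in the current task
    return left.result() + [pivot] + right            # "sync" with the child task

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        print(quick_sort([5, 2, 9, 1, 7, 3], pool))
```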
XU Shun, WANG Wu, ZHANG Jian, JIANG Jin-Rong, JIN Zhong, CHI Xue-Bin
2021, 32(8):2365-2376. DOI: 10.13328/j.cnki.jos.006008
Abstract: It is very important to develop high-performance computing algorithms and software adapted to China's heterogeneous supercomputers, which is also of great significance for narrowing the gap between China's HPC hardware and its HPC software. This article first briefly introduces the current trends and challenges of high-performance computing application software and analyzes the computational characteristics of typical applications, including N-body simulation in computational cosmology, earth system models, phase-field dynamics in computational materials science, molecular dynamics, quantum chemistry, and lattice QCD. It then discusses how to use domestic heterogeneous computing systems and summarizes typical application algorithms and common software issues, including core algorithms, algorithm development, and code optimization strategies. Finally, a summary of high-performance computing algorithms and software for heterogeneous computing is given.
LIU Li, ZHU Jian-Cheng, HAN Guang-Jie, BI Yuan-Guo
2021, 32(8):2379-2390. DOI: 10.13328/j.cnki.jos.006188
Abstract: Data-driven fault diagnosis models for specific mechanical equipment lack generalization capability. As a core component of various types of machinery, the health status of bearings is meaningful for analyzing the derivative failures of different machines. This study proposes a bearing health monitoring and fault diagnosis algorithm based on joint feature extraction with a 1D-CNN (one-dimensional convolutional neural network). The algorithm first partitions the original vibration signal of the bearing into segments. The signal segments serve as feature learning spaces and are fed into the 1D-CNN in parallel to extract a representative feature domain for each working condition. To avoid processing overlapping information generated by faults, a bearing health status discrimination model is built in advance from the feature domain sensitive to the health status. If this health model recognizes that the bearing is not in a healthy state, the feature domain is reconstructed jointly with the original signal and coupled with an autoencoder for failure mode classification. Experiments are carried out on the bearing data provided by Case Western Reserve University. The results demonstrate that the proposed algorithm inherits the accuracy and robustness of deep learning models and achieves higher diagnosis accuracy with lower latency.
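The "segment, then extract features per segment" structure described above can be illustrated with a small sketch: the vibration signal is cut into fixed-length segments and each segment is reduced to a feature vector by a 1D convolution plus pooling. This is not the paper's network; the kernel, segment length, and pooled features are hypothetical.

```python
# Minimal sketch, assuming fixed-length segmentation and a single 1D filter.
import numpy as np

def segment(signal, seg_len):
    n = len(signal) // seg_len
    return signal[:n * seg_len].reshape(n, seg_len)        # one row per segment

def conv1d_features(seg, kernel):
    resp = np.convolve(seg, kernel, mode="valid")           # 1D convolution
    return np.array([resp.max(), resp.mean(), resp.std()])  # crude pooled features

signal = np.sin(np.linspace(0, 200, 4096)) + 0.1 * np.random.randn(4096)
kernel = np.array([1.0, -2.0, 1.0])                         # hypothetical learned filter
features = np.stack([conv1d_features(s, kernel) for s in segment(signal, 256)])
print(features.shape)   # (num_segments, num_features) -> input to a health classifier
```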
GONG Cheng, LU Ye, DAI Su-Rong, LIU Fang-Xin, CHEN Xin-Wei, LI Tao
2021, 32(8):2391-2407. DOI: 10.13328/j.cnki.jos.006189
Abstract: Deep neural network (DNN) quantization is an efficient model compression method in which parameters and intermediate results are expressed with low bit widths. The data bit width directly affects memory footprint, computing power, and energy consumption. Previous research on model quantization lacks effective quantitative analysis, which makes the quantization loss of these methods unpredictable. This study proposes an ultra-low loss quantization (μL2Q) method for DNN compression, which reveals the inherent relationship between quantization bit width and quantization loss, effectively guiding the selection of the quantization bit width and reducing the quantization loss. First, the original data is mapped to data with a standard normal distribution, and then the optimal parameter configuration is sought to minimize the quantization loss under the target bit width. Finally, μL2Q is encapsulated and integrated into two popular deep learning training frameworks, Caffe and Keras, to support the design and training of end-to-end model compression. Experimental results show that, compared with three state-of-the-art families of quantization solutions, μL2Q still guarantees accuracy and delivers accuracy improvements of 1.94%, 3.73%, and 8.24% on typical neural networks at the same quantization bit width, respectively. In addition, salient object detection experiments verify that μL2Q is also competent for more complex computer vision tasks.
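The "map to a standard normal distribution, then pick the parameter that minimizes loss for the target bit width" flow described above can be sketched as follows. This is not the μL2Q formulation itself; the uniform symmetric quantizer, search grid, and MSE criterion are hypothetical stand-ins.

```python
# Minimal sketch, assuming z-score normalization plus a searched uniform quantizer.
import numpy as np

def quantize(w, bits):
    mu, sigma = w.mean(), w.std()
    z = (w - mu) / sigma                              # map weights toward N(0, 1)
    levels = 2 ** (bits - 1) - 1
    best_step, best_err = None, np.inf
    for step in np.linspace(0.01, 4.0 / levels, 200): # search the quantization step
        q = np.clip(np.round(z / step), -levels, levels) * step
        err = np.mean((z - q) ** 2)
        if err < best_err:
            best_step, best_err = step, err
    q = np.clip(np.round(z / best_step), -levels, levels) * best_step
    return q * sigma + mu, best_err                   # de-normalize back

w = np.random.randn(10000) * 0.05
w_q, mse = quantize(w, bits=4)
print(f"4-bit quantization MSE on normalized weights: {mse:.5f}")
```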
YANG Shi-Gui, WANG Yuan-Yuan, LIU Wei-Chen, JIANG Xu, ZHAO Ming-Xiong, FANG Hui, YANG Yu, LIU Di
2021, 32(8):2408-2424. DOI: 10.13328/j.cnki.jos.006190
Abstract: With the increasing number of cores in computers, temperature-aware multi-core task scheduling algorithms have become a research hotspot in computer systems. In recent years, machine learning has shown great potential in various fields, and much work using machine learning techniques to manage system temperature has emerged. Among these techniques, reinforcement learning is widely used for temperature-aware task scheduling because of its strong adaptability. However, the state-of-the-art temperature-aware task scheduling algorithms based on reinforcement learning do not model the system effectively, and it is difficult for them to achieve a good trade-off among temperature, performance, and complexity. Therefore, this study proposes a new reinforcement-learning-based multi-core temperature-aware scheduling algorithm, ReLeTA. The new algorithm adopts a more comprehensive state modeling method and a more effective reward function to help the system further reduce the temperature. Experiments on three different real computer platforms show the effectiveness and scalability of the proposed method. Compared with existing methods, ReLeTA controls the system temperature better.
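To show where a state model and a reward function sit in such a scheduler, the sketch below gives the skeleton of a reinforcement-learning task-mapping loop whose state combines per-core temperatures and utilization and whose reward penalizes the peak temperature. It is a generic Q-learning illustration, not ReLeTA; the state layout, reward weight, and epsilon-greedy policy are hypothetical.

```python
# Minimal sketch, assuming tabular Q-learning over a discretized thermal state.
import random

NUM_CORES = 4
q_table = {}                                   # (state, action) -> value

def make_state(temps, utils):
    # discretize temperatures/utilizations into a compact, hashable state
    return tuple(int(t // 5) for t in temps) + tuple(int(u * 10) for u in utils)

def reward(temps, alpha=1.0):
    return -alpha * max(temps)                 # cooler peak temperature -> higher reward

def choose_core(state, eps=0.1):
    if random.random() < eps:
        return random.randrange(NUM_CORES)     # explore
    return max(range(NUM_CORES), key=lambda a: q_table.get((state, a), 0.0))

def update(state, action, r, next_state, lr=0.1, gamma=0.9):
    best_next = max(q_table.get((next_state, a), 0.0) for a in range(NUM_CORES))
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + lr * (r + gamma * best_next - old)

temps, utils = [55.0, 60.0, 48.0, 52.0], [0.2, 0.9, 0.1, 0.4]
s = make_state(temps, utils)
core = choose_core(s)                          # map the incoming task to this core
new_temps = [56.0, 58.0, 50.0, 53.0]           # sensors after the task has run
update(s, core, reward(new_temps), make_state(new_temps, utils))
```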
SHAO Ming-Li, CAO E, HU Ming, ZHANG Yue, CHEN Wen-Jie, CHEN Ming-Song
2021, 32(8):2425-2438. DOI: 10.13328/j.cnki.jos.006191
Abstract: Intelligent traffic light control can effectively improve the order and efficiency of road traffic. In urban traffic networks, special vehicles with urgent tasks have higher requirements on traffic efficiency. However, current intelligent traffic light control algorithms generally treat all vehicles equally, without considering the priority of special vehicles, while traditional methods basically adopt signal preemption to guarantee the priority of special vehicles, which greatly affects the passage of ordinary vehicles. Therefore, this study proposes a priority-vehicle-aware traffic light optimization control method. It learns traffic light control strategies through continuous interaction with the road environment; the weight of special vehicles is increased in the state definition and the reward function, and Double DQN and Dueling DQN are used to improve the performance of the model. Finally, experiments are carried out in the urban traffic simulator SUMO. After training stabilizes, compared with the fixed-time control method, the proposed method reduces the average waiting time of special vehicles and ordinary vehicles by about 68% and 22%, respectively. Compared with the method that does not consider priority, the average waiting time of special vehicles is also improved by about 35%. These results prove that the proposed method not only improves the efficiency of all vehicles but also gives special vehicles higher priority. The experiments also show that the method can be extended to multi-intersection scenarios.
GUO Zhen-Bei, LI Fu-Liang, LIANG Bo-Cheng, ZHANG Xiao-Rui, SUN Lei
2021, 32(8):2439-2456. DOI: 10.13328/j.cnki.jos.006192
Abstract: WiFi Direct (WFD), supported by Android, has been widely used in device-to-device (D2D) communication. Compared with Bluetooth, WFD has advantages in data transmission rate and connection distance, and it can establish a connection more quickly than a WiFi hotspot. Therefore, it is widely used to form D2D communication networks and to support studies on edge computing, traffic offloading, mobile crowdsourcing, and so on. However, it also brings high energy consumption, which remains a major concern for battery-constrained devices. Existing studies pay more attention to measuring and optimizing the performance of WFD-based networks, while few focus on energy consumption. This study proposes an energy-saving mechanism for WFD based on power control, which supplements WFD's default energy-saving mechanism. First, a WFD-based communication group is constructed and a measurement analysis of the default energy-saving mechanism is conducted. The measurement results show that the energy consumption of the group owner is always higher than that of a group member. Then, the proposed energy-saving mechanism is described in detail; it reduces the transmission consumption of devices and balances the energy consumption of the group owner by switching the roles of the devices. Finally, the proposed mechanism is evaluated with simulation experiments, and the results show that it reduces energy consumption by 11.86% with a throughput loss of 2%.
ZOU Min-Hui, ZHOU Jun-Long, SUN Jin, WANG Cheng-Liang
2021, 32(8):2457-2468. DOI: 10.13328/j.cnki.jos.006193
Abstract: Computing systems based on the emerging resistive random-access memory (RRAM) device have received much attention due to their capability of performing matrix-vector multiplication in memory. However, the security of RRAM computing systems has not received enough attention. An attacker can gain access to the neural network models stored in an RRAM computing system by illegally accessing an unauthorized system and then carrying out a black-box attack. The goal of this study is to thwart such attacks. The proposed defense method is based on a benign Trojan: when the RRAM computing system is not authorized, the Trojan in the system is extremely easy to activate, which degrades the prediction accuracy of the system's output and thus prevents the system from operating normally; when the RRAM computing system is authorized, the Trojan is extremely difficult to activate accidentally, so the system operates normally. Experiments show that the method reduces the output prediction accuracy of an unauthorized RRAM computing system to less than 15%, with a hardware overhead of less than 4.5% of the RRAM devices in the system.
LUO Wu, SHEN Qing-Ni, WU Zhong-Hai, WU Peng-Fei, DONG Chun-Tao, XIA Yu-Tang
2021, 32(8):2469-2504. DOI: 10.13328/j.cnki.jos.006153
Abstract: With the popularity of cloud computing and mobile computing, browser applications have become diverse and large-scale, and browser security issues are increasingly prominent. The browser's same-origin policy was proposed to protect Web application resources. Since then, the introduction of the same-origin policy in RFC 6454 and the W3C and HTML5 standards has driven modern browsers (e.g., Chrome, Firefox, Safari, and Edge) to implement it as their basic access control policy. In practice, however, the same-origin policy faces several problems: handling the security threats introduced by third-party scripts, limiting the permissions of same-origin frames, and assigning more permissions to cross-origin frames when they collaborate with other browser mechanisms. It also cannot guarantee the safety of cross-domain or cross-origin communication mechanisms, nor security under memory attacks. This paper reviews existing research on the security of the browser's same-origin policy. It first describes the same-origin policy rules, then summarizes the threat model and research directions, including insufficient same-origin policy rules and their defenses, attacks and defenses on cross-domain and cross-origin mechanisms, and same-origin policy security under memory attacks. Finally, it discusses future research directions for the security of the browser's same-origin policy.
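For readers unfamiliar with the rule itself, the core same-origin check compares the (scheme, host, port) triple of two URLs, as standardized in RFC 6454. The sketch below restates that generic rule; it is not code from the surveyed work.

```python
# Minimal sketch of the (scheme, host, port) same-origin comparison.
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def origin(url):
    p = urlsplit(url)
    port = p.port or DEFAULT_PORTS.get(p.scheme)
    return (p.scheme, p.hostname, port)

def same_origin(a, b):
    return origin(a) == origin(b)

print(same_origin("https://example.com/a", "https://example.com:443/b"))  # True
print(same_origin("https://example.com/a", "http://example.com/a"))       # False (scheme)
print(same_origin("https://example.com/a", "https://api.example.com/a"))  # False (host)
```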
HUANG Zi-Jie, CHEN Jun-Hua, GAO Jian-Hua
2021, 32(8):2505-2521. DOI: 10.13328/j.cnki.jos.006082
Abstract: Code smells are symptoms of poor design and implementation choices. Detecting and identifying code smells precisely provides guidance for software refactoring and leads to improved software usability and reliability. Design problems of software systems can be quantified through code smell metrics. JavaScript has become one of the most widely used programming languages; the class is a design pattern of JavaScript, and loose coupling and strong cohesion are characteristics of a well-designed class. Prior work measured coupling and cohesion code smells of JS programs at lower levels, i.e., the function and statement levels, which can provide refactoring suggestions about basic implementations but is not enough to identify design problems. This paper proposes JS4C, a method to detect coupling and cohesion code smells of JS classes, including FE, DC, and Blob. The method is a static analysis approach that works on both server- and client-side applications; it iterates over every class in the software system and takes advantage of textual patterns in the source code. While detecting code smells, JS4C also determines the intensity of each smell. Missing type information in static analysis is compensated for by extended object type inference and the non-strict coupling dispersion (NSCDISP) metric during structural analysis. Experiments on six open-source projects indicate that JS4C correctly detects coupling and cohesion design problems.
BAO Xi-Gang, ZHOU Chun-Lai, XIAO Ke-Jing, QIN Biao
2021, 32(8):2522-2544. DOI: 10.13328/j.cnki.jos.006215
Abstract: Visual question answering (VQA) is an interdisciplinary direction at the intersection of computer vision and natural language processing, and it has received extensive attention in recent years. In visual question answering, an algorithm is required to answer questions based on specific pictures (or videos). Since the first visual question answering dataset was released in 2014, several large-scale datasets have been released and a large number of algorithms have been proposed based on them. Existing research has focused on improving visual question answering, but in recent years visual question answering has been found to rely heavily on language bias and the distribution of the datasets; in particular, since the release of the VQA-CP dataset, the accuracy of many models has dropped greatly. This paper introduces the algorithms proposed and the datasets released in recent years, with particular attention to research on strengthening robustness. The visual question answering algorithms are summarized, and their motivations, details, and limitations are introduced. Finally, the challenges and prospects of visual question answering are discussed.
JU Sheng-Gen, LI Tian-Ning, SUN Jie-Ping
2021, 32(8):2545-2556. DOI: 10.13328/j.cnki.jos.006114
Abstract: Fine-grained named entity recognition aims to locate entities in text and classify them into predefined fine-grained categories. At present, Chinese fine-grained named entity recognition methods only use pre-trained language models to encode the characters in a sentence and do not take into account that category label information can help distinguish entity categories. Since a sentence to be predicted carries no entity labels, this paper uses an associated memory network to capture the entity label information of sentences in the training set and incorporates this label information into the representation of the predicted sentence. In this method, sentences with entity labels in the training set are used as memory units, and the pre-trained language model is used to obtain the contextual representations of the original sentence and of the sentences in the memory units. Then, the label information of the sentences in the memory units is combined with the representation of the original sentence through an attention mechanism to improve recognition. On the CLUENER 2020 Chinese fine-grained named entity recognition task, this method improves performance over the baseline methods.
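The step of attending from the input sentence over memory sentences and mixing in their label information can be pictured with the small sketch below. It is not the paper's model; the vector dimensions, dot-product attention form, and concatenation are hypothetical choices.

```python
# Minimal sketch, assuming dot-product attention over memory-sentence representations.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_memory(query, memory_reprs, memory_label_embs):
    """query: (d,); memory_reprs: (m, d); memory_label_embs: (m, d)."""
    scores = memory_reprs @ query / np.sqrt(len(query))   # similarity to each memory sentence
    weights = softmax(scores)                              # attention weights over memories
    label_context = weights @ memory_label_embs            # weighted label information
    return np.concatenate([query, label_context])          # label-enriched representation

d, m = 8, 5
enriched = attend_memory(np.random.randn(d), np.random.randn(m, d), np.random.randn(m, d))
print(enriched.shape)   # (2 * d,) -> fed to the entity classifier
```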
PU Yong-Lin, YU Jiong, LU Liang, LI Zi-Yang, BIAN Chen, LIAO Bin
2021, 32(8):2557-2579. DOI: 10.13328/j.cnki.jos.006074
Abstract: As one of the most popular platforms in big data stream computing, Storm suffers from high energy consumption and low energy efficiency because energy-saving strategies were not considered in its design. Traditional energy-efficient strategies that ignore Storm's performance constraints may affect the real-time performance of the cluster. To address this issue, models of resource constraints, optimal executor reallocation, and data migration are established, and an energy-efficient strategy based on executor reallocation and data migration in Storm (ERDM) is proposed, which consists of a resource constraint algorithm and a data migration algorithm. The resource constraint algorithm evaluates whether the cluster is suitable for data migration according to the CPU, memory, and network bandwidth utilization of each worker node. The data migration algorithm designs the optimal migration method according to the resource constraint model and the optimal executor reallocation model. Moreover, ERDM allocates executors so as to reduce the communication cost between nodes. ERDM is evaluated by measuring cluster performance and energy efficiency in a big data stream computing environment. The experimental results show that, compared with existing work, the proposed strategy effectively reduces communication cost and energy consumption while improving cluster performance.
SHI Tuo, LI Jian-Zhong, GAO Hong
2021, 32(8):2580-2596. DOI: 10.13328/j.cnki.jos.006216
Abstract: The battery-free sensor network is an emerging IoT network architecture that aims to overcome the energy and lifetime limitations of traditional wireless sensor networks. In a battery-free sensor network, battery-free nodes harvest energy from the ambient environment through specific energy harvesting components. Since the energy in the ambient environment is unlimited, the lifetime of a battery-free sensor network is not bounded by energy, which overcomes the lifetime limitation of wireless sensor networks. However, because ambient energy is usually very weak and unevenly distributed, the coverage problem in battery-free sensor networks is much more complex than that in traditional wireless sensor networks. To solve the coverage problem and use the harvested energy more reasonably, this study considers a battery-free sensor network in which battery-free nodes have multi-level communication radii, and defines a coverage problem in such networks. The problem is proved to be NP-hard, and an approximation algorithm is proposed to solve it. The approximation ratio of the algorithm is analyzed, and simulations are carried out to evaluate its performance. The results demonstrate that the algorithm is effective and efficient.
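For intuition about approximation algorithms on coverage problems of this kind, the sketch below applies the classic greedy rule of picking the cheapest marginal coverage first, where each candidate is a (node, radius) choice with an energy cost. It is a generic illustration, not the authors' algorithm; the cost model and inputs are hypothetical.

```python
# Minimal sketch, assuming a greedy cost-per-new-target coverage heuristic.
def greedy_cover(targets, candidates):
    """candidates: list of (name, covered_targets, energy_cost)."""
    uncovered, chosen = set(targets), []
    while uncovered:
        best = min(
            (c for c in candidates if c[1] & uncovered),
            key=lambda c: c[2] / len(c[1] & uncovered),   # cost per newly covered target
            default=None,
        )
        if best is None:
            break                                          # remaining targets are uncoverable
        chosen.append(best[0])
        uncovered -= best[1]
    return chosen, uncovered

candidates = [("n1@r2", {1, 2, 3}, 3.0), ("n2@r1", {3, 4}, 1.0), ("n3@r3", {4, 5}, 2.5)]
print(greedy_cover({1, 2, 3, 4, 5}, candidates))
```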
ZHANG Qi-Fei, GUI Chao, SONG Ying, SUN Bao-Lin, DAI Zhi-Feng
2021, 32(8):2597-2612. DOI: 10.13328/j.cnki.jos.005985
Abstract: Opportunistic networks utilize the contact opportunities created by node movement to forward data between node pairs. Data is piggybacked on its carrier's movement, which preserves node independence but affects data transmission performance. This study designs a routing algorithm for opportunistic networks based on node movement characteristics. Considering data transmission factors, data content, and application demands, a data forwarding priority evaluation model is developed and used, together with the division of node activity ranges, to determine the data transmission rules. A transfer strategy with differentiated message replicas is proposed to balance transmission efficiency against system overhead. A node free-motion degree function is constructed from the activity range distribution, centrality, and energy level to evaluate each node's mobility, and a utility function is then derived for relay node selection. The simulation results demonstrate that the proposed algorithm achieves a higher packet delivery ratio and lower delivery latency while satisfying application demands and restraining network overhead.
WEI Xin, WANG Xin-Yan, YU Zhuo, GUO Shao-Yong, QIU Xue-Song
2021, 32(8):2613-2628. DOI: 10.13328/j.cnki.jos.006033
Abstract: Aiming at the information exchange requirements across trust domains in IoT scenarios, this paper constructs an authentication architecture for the IoT based on blockchain and edge computing. First, based on a consortium chain, the architecture and process for cross-domain authentication in the IoT are designed, creating a secure environment for cross-domain information exchange. In addition, an edge gateway is introduced to shield the heterogeneity and sensitive information of devices, and an authentication protocol for cross-trust-domain authentication is designed on top of the edge gateway, which strengthens privacy preservation in the IoT. Finally, the performance analysis shows that the design can resist common attacks in IoT scenarios, and the simulation results show that it outperforms the traditional approach in both computation and communication overhead and is applicable to the IoT.