Abstract: As privacy concerns gain increasing attention in today's machine learning (ML) world, building an ML serving system with data security guarantees has become critically important. Meanwhile, trusted execution environments (e.g., Intel SGX) have been widely used to develop trusted applications and systems. For instance, Intel SGX offers developers hardware-based secure containers (i.e., enclaves) that guarantee application confidentiality and integrity. This paper presents S3ML, an SGX-based secure serving system for ML inference. S3ML leverages Intel SGX to host ML models and protect users' privacy. To build a practical secure serving system, S3ML addresses several challenges in running model servers inside SGX enclaves. To ensure availability and scalability, a frontend ML inference service typically consists of many backend model server instances. When these instances run inside SGX enclaves, new system architectures and protocols are needed to synchronize cryptographic certificates and keys so that distributed secure enclave clusters can be constructed. S3ML introduces a dedicated module, the attestation-based enclave configuration service, which is responsible for generating, persisting, and distributing certificates and keys among clients and model server instances. Existing infrastructure can then be reused for transparent load balancing and failover, ensuring high service availability and scalability. In addition, SGX enclaves rely on a special memory region, the enclave page cache (EPC), which has a limited size and is contended by all enclaves on a host; the performance of SGX-based applications is therefore vulnerable to EPC interference. To satisfy the service-level objectives (SLOs) of ML inference services, S3ML first adopts lightweight ML frameworks and models to reduce EPC consumption. Furthermore, through offline analysis, we find that EPC paging throughput is a feasible indirect monitoring metric for SLO satisfaction. Based on this finding, S3ML uses real-time EPC paging information to control service load balancing and scaling activities. We implement S3ML based on Kubernetes, TensorFlow Lite, and Occlum, and demonstrate its system overhead, feasibility, and effectiveness through extensive experiments on a series of popular ML models.