Abstract: Big data processing frameworks such as Hadoop and Spark are widely used for data processing and analysis in industry and academia. These frameworks adopt a distributed architecture and are generally developed in object-oriented languages such as Java and Scala. They use the Java virtual machine (JVM) as the runtime environment on cluster nodes to execute computing tasks, relying on the JVM's automatic memory management to allocate and reclaim data objects. However, current JVMs were not designed for big data workloads, which leads to problems such as long garbage collection (GC) pauses and the high cost of data serialization and deserialization. As reported by users and researchers, GC time can account for more than 50% of the overall application execution time in some cases. JVM memory management has therefore become a performance bottleneck for big data processing frameworks. This study systematically reviews recent research on JVM optimization for big data processing frameworks. Its contributions are threefold. First, it summarizes the root causes of the performance degradation that big data applications suffer when executed on the JVM. Second, it surveys existing JVM optimization techniques for big data processing frameworks, classifies them into categories, and compares and analyzes the advantages and disadvantages of each category, including optimization effects, application scope, and the burden imposed on users. Finally, it proposes future directions for JVM optimization that can help further improve the performance of big data processing frameworks.