Abstract: Large-scale spatial join aggregate (SJA) processing is usually difficult to implement on a single machine, so parallel computing on a cluster has become the key to processing large-scale SJA operations efficiently. Map-Reduce is the mainstream parallel computing technique for massive data on clusters. However, Map-Reduce does not directly support parallel SJA processing in a way that is both efficient and straightforward, because it requires a second reduce operation. This paper proposes a novel parallel computing model, Map-Reduce-Combine (MRC), which processes large-scale SJA efficiently and simply on a cluster. MRC adds to Map-Reduce a Combine phase that efficiently merges the partial aggregate results distributed among different Reducers, a distribution caused by the multiple assignment of spatial objects. For spatial objects assigned only once, a filter optimization method is proposed that picks out the results of single-assignment objects already obtained in the Reduce phase, further improving the performance of SJA processing. Extensive experiments on large real spatial datasets demonstrate the efficiency, effectiveness, scalability, and simplicity of the proposed parallel computing model for processing SJA on massive spatial data.
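The three phases described in the abstract can be illustrated with a minimal single-process sketch. It is not the paper's implementation. It assumes a uniform grid partitioning, a count aggregate, rectangles as the multiply assigned objects, and points as the joined objects, so that each point falls in exactly one cell and no join pair is counted twice. All names (`CELL`, `map_phase`, `reduce_phase`, `combine_phase`, `spatial_join_count`) are hypothetical.

```python
from collections import defaultdict

CELL = 10.0  # hypothetical grid cell size (assumption, not from the paper)


def cells_of_rect(rect):
    """Return the grid cells overlapped by rect = (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = rect
    return [(cx, cy)
            for cx in range(int(xmin // CELL), int(xmax // CELL) + 1)
            for cy in range(int(ymin // CELL), int(ymax // CELL) + 1)]


def map_phase(rects, points):
    """Assign every rectangle to each cell it overlaps and every point to one cell."""
    partitions = defaultdict(lambda: ([], []))
    for rid, rect in rects.items():
        cells = cells_of_rect(rect)
        single = len(cells) == 1  # flag used later by the filter step
        for cell in cells:
            partitions[cell][0].append((rid, rect, single))
    for x, y in points:
        partitions[(int(x // CELL), int(y // CELL))][1].append((x, y))
    return partitions


def reduce_phase(partitions):
    """Count matching points per rectangle within each cell.

    Filter step: a count for a single-assignment rectangle is already final,
    so it bypasses the Combine phase.
    """
    final, partial = {}, []
    for cell_rects, cell_points in partitions.values():
        for rid, (xmin, ymin, xmax, ymax), single in cell_rects:
            count = sum(1 for x, y in cell_points
                        if xmin <= x <= xmax and ymin <= y <= ymax)
            if single:
                final[rid] = count
            else:
                partial.append((rid, count))
    return final, partial


def combine_phase(final, partial):
    """Merge partial counts of multiply assigned rectangles into final results."""
    combined = defaultdict(int)
    for rid, count in partial:
        combined[rid] += count
    result = dict(final)
    result.update(combined)
    return result


def spatial_join_count(rects, points):
    """Run the three simulated phases in sequence."""
    final, partial = reduce_phase(map_phase(rects, points))
    return combine_phase(final, partial)
```

In this sketch, a rectangle spanning four cells produces four partial counts that only the Combine phase sums, while a rectangle contained in one cell is finished after Reduce. Joining two sets of extended objects would additionally need a duplicate-avoidance rule, which the sketch leaves out.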