Abstract:Many new data applications, such as wireless sensor networks, and traditional data applications that process images produce massive probabilistic data. In probabilistic data management, the Top-k similarity join operation returns the most similar k pairs of probabilistic data and thus has important value. Histogram is one of the most frequently-used data models for representing probabilistic data. Earth Mover's Distance (EMD) is more robust in quantifying the similarities between histogram-represented probabilistic data. However, EMD computation has a cubic time complexity, which brings great challenges to the EMD-based Top-k similarity joins. Based on the MapReduce framework, this paper utilizes the good properties of EMD's dual problem and proposes two EMD-based Top-k similarity join algorithms on massive probabilistic dataset. A baseline solution named Top-k BNLJ algorithm is first developed on the idea of block-nested-loop join, and the novel Top-k DLPJ algorithm is then built on the improved idea of data locality preserving data partition strategy which significantly reduces the amount of data transferred during MapReduce jobs. Extensive experiments on large real-world datasets, with millions of probabilistic data, have verified the efficiency, effectiveness and scalability of Top-k DLPJ algorithm.