Abstract: Existing malware detection techniques depend on observing a sufficient number of malware samples. However, only a few samples can be obtained when a novel malware first appears on the World Wide Web, which makes it difficult to detect the novel malware and its variants. This paper studies the anomaly and similarity of processes with respect to their access behaviors in a data flow dependency network, and defines an estimated risk for malware detection. It then proposes a malware detection method based on active learning that minimizes this estimated risk. The method achieves encouraging performance even with few samples, and is applicable to defending against rapidly increasing novel malware. Experimental results on a real-world dataset, which consists of the access behaviors of 8 340 benign and 7 257 malicious processes, show that the presented method outperforms a traditional malware detection method based on a statistical classifier. Even with only 1% of the samples known, the new method achieves a 5.55% error rate, which is 36.5% lower than the error rate of the traditional statistical classifier based method.