Abstract: Reference counting (refcount) is a common memory management technique in modern software. Refcount errors often lead to severe memory errors such as memory leaks and use-after-free. Many efforts to harden refcount security take known refcount fields as their input. However, because of the complexity of software code, identifying refcount fields in source code is challenging. Traditional identification methods rely mainly on code pattern matching and are limited in two ways: summarizing the patterns requires expert experience and is laborious, and the manually summarized patterns do not cover all cases, which results in low recall. To address these issues, this study proposes to characterize a field by its name and by the code behaviour associated with it, and designs a multimodal deep-learning-based approach. The study implements a prototype of the approach for Linux kernel code. In the evaluation, the prototype achieves a precision of 96.98% and a recall of 93.54%. In contrast, the traditional code-pattern-based identification method did not report any refcount fields on the testing set. In addition, sixty-one refcount fields implemented with insecure data types are identified in the latest Linux kernel. So far, twenty-one of them have been reported to the Linux community, and six have been confirmed.