Subjective and objective code similarity measures
2022-09-23, 16:15–17:00, Tesla

To set the scene, we begin this session with an analysis of the characteristics of malicious dropper VBA code employed by APT actors in South Asia. This code evolves regularly but retains some consistent features, such as using VBA Forms to store the executable payload in a lightly obfuscated format.

The similarity of the code used by the opposing groups is easy to spot for a human researcher but not as obvious to machine algorithms.

The focus of the presentation is on similarity algorithms such as Normalized Compression Distance, Winnowing and Jaccard similarity, as well as common diffing algorithms. Similarity algorithms are one of the main tools we can use to successfully cluster malicious executables as well as source code. Their background is often in the domain of natural language processing and plagiarism/copycat detection.
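
As a rough illustration of two of the measures named above, the following Python sketch computes a Normalized Compression Distance and a Jaccard similarity for a pair of code samples. It is not the implementation used in the talk: the choice of zlib as the compressor, whitespace tokenization, the shingle size k=3 and the function names are all assumptions made for this example.

```python
import zlib


def ncd(a: bytes, b: bytes) -> float:
    """Normalized Compression Distance: (C(ab) - min(C(a), C(b))) / max(C(a), C(b)).

    zlib is an assumed stand-in compressor; any real compressor only
    approximates the ideal measure.
    """
    ca = len(zlib.compress(a, 9))
    cb = len(zlib.compress(b, 9))
    cab = len(zlib.compress(a + b, 9))
    return (cab - min(ca, cb)) / max(ca, cb)


def shingles(text: str, k: int = 3) -> set:
    """Set of k-token shingles; inputs shorter than k tokens yield one shingle."""
    tokens = text.split()
    return {tuple(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity |A & B| / |A | B| over the two shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)
```

Lower NCD values and higher Jaccard values both indicate more similar samples. With a real compressor, the NCD of a sample against itself is small but usually not exactly zero.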

We will describe how they operate and discuss their performance on a small set of samples attributed to the groups described in the first part of the session.

We compare the effectiveness of the algorithms on unmodified code and code with various levels of normalization.
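
To give a sense of what normalization can mean here, the sketch below applies a few simple rewriting steps to VBA source before comparison. The specific steps (lowercasing, replacing string and numeric literals with placeholders, stripping apostrophe comments, collapsing whitespace) and the names normalize_vba and token_jaccard are assumptions for illustration only, not the normalization levels evaluated in the session.

```python
import re


def normalize_vba(source: str) -> str:
    """Hypothetical light normalization of VBA source (illustrative only)."""
    lines = []
    for line in source.splitlines():
        line = line.lower()                      # VBA is case-insensitive
        line = re.sub(r'"[^"]*"', '"STR"', line)  # mask string literals first
        line = re.sub(r"'.*$", "", line)         # then strip apostrophe comments
        line = re.sub(r"\b\d+\b", "0", line)     # mask standalone numeric literals
        line = " ".join(line.split())            # collapse whitespace
        if line:
            lines.append(line)
    return "\n".join(lines)


def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-separated token sets."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0
```

Two samples that differ only in letter case, comments, embedded payload strings or constants look identical after this kind of normalization, which is why a similarity score can change noticeably between raw and normalized input. This sketch does not handle Rem comments or line continuations.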

We conclude with a discussion of the scalability of similarity algorithms when applied to a large set of samples.