Malware Identification Through STAMINA

To make malware identification easier, Microsoft’s Threat Protection Intelligence Team partnered with Intel Labs to explore the use of deep learning against malware. The new approach, called static malware-as-image network analysis (STAMINA), builds on an earlier collaboration between the two research teams.

With STAMINA, Microsoft and Intel’s goal is to turn code into grayscale images. Each image is built from a simple stream of pixels, with each byte of the file becoming one pixel, and its dimensions vary with factors such as the size of the file.

In this form, a deep learning system can analyze the textural and structural patterns of the images and then classify the underlying code as malicious or benign.

In their whitepaper, the joint team of researchers wrote:

“Classical malware detection approaches involve extracting the binary signatures or fingerprints of the malware. However, the rapid increase of signatures, often in exponential growth, makes the signature matching less straightforward.”

According to the researchers, their novel approach was able to outperform previous and existing malware classifiers.

How the STAMINA Malware Identification Process Works

STAMINA’s malware identification process consists of four steps: preprocessing, transfer learning, evaluation, and interpretation. The team illustrated the first three steps of the STAMINA approach in the diagram below:

The first three steps of the STAMINA method. | Image credit: Intel

In the preprocessing stage, STAMINA converts a binary file into a one-dimensional pixel stream. The system then reshapes the stream into a two-dimensional image so that computer vision techniques and transfer learning can be applied.
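The conversion described above can be sketched in a few lines of Python. This is a minimal illustration and not the researchers' code: the function name `bytes_to_image`, the fixed default width of 256 pixels, and the zero padding of the last row are all assumptions made for the example. With a fixed width, the height of the image grows with the size of the file.

```python
import numpy as np


def bytes_to_image(data: bytes, width: int = 256) -> np.ndarray:
    """Turn raw file bytes into a 2-D grayscale image (one byte = one pixel).

    The width and the zero padding are assumptions made for this sketch.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    # Step 1: one-dimensional pixel stream; each byte is an intensity 0-255.
    pixels = np.frombuffer(data, dtype=np.uint8)
    # Step 2: pad with zeros so the stream fills a whole number of rows.
    height = max(1, -(-pixels.size // width))  # ceiling division
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[:pixels.size] = pixels
    # Step 3: reshape the stream into rows of `width` pixels.
    return padded.reshape(height, width)


# Ten bytes at width 4 need three rows; the last two pixels are padding.
image = bytes_to_image(bytes(range(10)), width=4)
print(image.shape)  # (3, 4)
```

In practice the bytes would come from a file on disk, for example by reading it in binary mode and passing the result to the function.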

After reshaping the pixel stream, the images are resized to the input dimensions that the transfer learning techniques expect. In the STAMINA approach, the researchers used transfer learning to train a classifier for static malware classification.
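Resizing is needed because pre-trained image models expect a fixed input size, while images built from files have different shapes. The sketch below shows one simple way to do it, nearest-neighbor sampling with NumPy. The function name `resize_nearest`, the 224 by 224 default (a common input size for image models), and the choice of interpolation are assumptions for illustration; the article does not state which resizing method the researchers used.

```python
import numpy as np


def resize_nearest(image: np.ndarray, out_height: int = 224,
                   out_width: int = 224) -> np.ndarray:
    """Resize a 2-D grayscale image with nearest-neighbor sampling."""
    if image.ndim != 2:
        raise ValueError("expected a 2-D grayscale image")
    in_height, in_width = image.shape
    # For every output row/column, pick the source row/column it maps to.
    rows = np.arange(out_height) * in_height // out_height
    cols = np.arange(out_width) * in_width // out_width
    # np.ix_ builds the row/column grid so that every pair is selected.
    return image[np.ix_(rows, cols)]


# A tall 1000 x 256 image is brought down to the fixed model input size.
tall = np.zeros((1000, 256), dtype=np.uint8)
print(resize_nearest(tall).shape)  # (224, 224)
```

Nearest-neighbor sampling keeps the original byte values intact, which is why it is used here; smoother interpolation methods would blend neighboring bytes.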

However, limited datasets made it challenging for the researchers to train a deep learning system from scratch. So, they tried a different approach. The researchers explained:

“What has been done in the computer vision space is that, for specific tasks, models pre-trained on a large number of images are used, and transfer learning is conducted on target tasks.”

After transfer learning is complete, the next step is evaluation.

The STAMINA approach is evaluated with metrics such as accuracy, false-positive rate, recall, F1 score, and area under the receiver operating characteristic (ROC) curve. The evaluation draws on Microsoft’s dataset of 2.2 million hashes of malware binaries.
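For readers unfamiliar with these metrics, the sketch below shows how accuracy, false-positive rate, recall, and F1 score follow from the four confusion-matrix counts. It uses only the standard definitions; the function name `classification_metrics` and the toy labels (1 for malware, 0 for benign) are illustrative assumptions and are not taken from the whitepaper.

```python
def classification_metrics(y_true, y_pred):
    """Compute basic binary-classification metrics (1 = malware, 0 = benign)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malware caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malware missed

    def ratio(num, den):
        # Guard against empty classes instead of dividing by zero.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "false_positive_rate": ratio(fp, fp + tn),
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Toy example: 4 malware and 4 benign samples, one mistake of each kind.
truth = [1, 1, 1, 0, 0, 0, 0, 1]
preds = [1, 1, 0, 0, 0, 1, 0, 1]
print(classification_metrics(truth, preds))
```

In this toy case all four metrics work out to 0.75 for accuracy, recall, and F1, and 0.25 for the false-positive rate, since one of four benign samples is flagged and one of four malware samples is missed.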

During their experiment, the joint team of researchers achieved 87.05 percent recall at a 0.1 percent false-positive rate, and 99.66 percent recall and 99.07 percent accuracy at a 2.58 percent false-positive rate overall.
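Results quoted at a fixed false-positive rate are typically obtained by choosing the classifier's decision threshold so that no more than that share of benign files is flagged, and then measuring how much malware is still caught. The sketch below shows that procedure under stated assumptions: the function name `recall_at_fpr`, the score convention (higher means more likely malicious), and the toy scores are all hypothetical and do not reproduce the researchers' evaluation code.

```python
def recall_at_fpr(scores, labels, max_fpr):
    """Return (recall, threshold) with the false-positive rate capped at max_fpr.

    A sample is flagged as malware when its score is strictly above the
    threshold. Labels: 1 = malware, 0 = benign.
    """
    benign = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    malicious = [s for s, y in zip(scores, labels) if y == 1]
    if not benign or not malicious:
        raise ValueError("need at least one benign and one malicious sample")
    # Number of benign samples we are allowed to flag by mistake.
    allowed = int(max_fpr * len(benign))
    # Put the threshold at the highest-scoring benign sample we must NOT flag.
    threshold = benign[allowed] if allowed < len(benign) else float("-inf")
    recall = sum(1 for s in malicious if s > threshold) / len(malicious)
    return recall, threshold


# Toy scores: four benign samples followed by four malicious ones.
scores = [0.1, 0.2, 0.3, 0.8, 0.9, 0.7, 0.4, 0.25]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(recall_at_fpr(scores, labels, max_fpr=0.25))  # (0.75, 0.3)
print(recall_at_fpr(scores, labels, max_fpr=0.0))   # (0.25, 0.8)
```

The two calls show the trade-off behind such figures: tolerating one benign false alarm in four raises recall from 0.25 to 0.75 on this toy data.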
