Introducing the TypeRefHash (TRH)

06/23/2020
G DATA Blog

We introduce the TypeRefHash (TRH), an alternative to the ImpHash for .NET binaries, where the ImpHash does not work. Our evaluation shows that it can be used effectively to identify .NET malware families.

Update (20.10.2022)

After publishing this post, we were notified that Joe Desimone (@dez_ on Twitter) used a very similar approach for hashing the TypeRef table in his ClrGuard (Code on GitHub) in 2017. He also presented his work at DerbyCon 2017 (Video on YouTube).

Introduction

The ImpHash was introduced in 2014 by FireEye [1]. It has since been used by many malware analysts and implemented in tools like VirusTotal to identify similar malware samples by their imports. In theory, if programs use the same imports, they use similar source code. 

.NET samples usually import only mscoree.dll, so there are only a handful of different ImpHashes across all .NET binaries. Therefore, the ImpHash cannot be used here. This motivated us to find an alternative, the TypeRefHash (TRH). To show the imported DLLs, functions and the TypeRef table, we used the online tool penet.io.
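To illustrate why the ImpHash collapses for .NET files, here is a minimal sketch of an ImpHash-style calculation. It assumes the commonly used scheme (lowercased "dll.function" pairs with the DLL extension stripped, joined by commas in import-table order, then hashed with MD5). The function name `imphash_like` and the import lists are hypothetical, and resolving imports by ordinal is left out.

```python
import hashlib


def imphash_like(imports):
    """Compute an ImpHash-style MD5 over (dll_name, function_name) pairs.

    The pairs are used in the order given, which mirrors the order of the
    import table. This is a simplified sketch, not a full implementation.
    """
    parts = []
    for dll, func in imports:
        lib = dll.lower()
        # Strip a trailing .dll/.ocx/.sys extension from the library name.
        for ext in (".dll", ".ocx", ".sys"):
            if lib.endswith(ext):
                lib = lib[: -len(ext)]
                break
        parts.append(f"{lib}.{func.lower()}")
    return hashlib.md5(",".join(parts).encode("utf-8")).hexdigest()


# A typical .NET executable: a single import, so every such file
# ends up with the same value.
dotnet_exe = [("mscoree.dll", "_CorExeMain")]

# A hypothetical native program with two imports.
native = [("KERNEL32.dll", "CreateFileW"), ("KERNEL32.dll", "WriteFile")]

print(imphash_like(dotnet_exe))
print(imphash_like(native))
```

Because the input for a .NET executable is always the same single pair, the resulting value says nothing about the program itself. The sketch also shows that the result depends on the order of the import table.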

.NET files store the namespaces and names of their referenced types in a so-called metadata table. We can use these to construct an identifier like the ImpHash. Similar to the combination of DLL and function name in the Import table, the TypeRef table contains a list of type names and their corresponding namespaces. For example, a .NET binary may import the type DebuggerBrowsableState from the namespace System.Diagnostics.

Calculation

To calculate the TRH we extract the TypeRef table and resolve the indices to the corresponding strings. 

  1. Order the entries by TypeNamespace and then by TypeName.
  2. Concatenate each TypeNamespace and its TypeName with a dash. If the namespace is empty, the concatenated string starts with the dash.
  3. Join all strings with commas and calculate the SHA256 hashsum of the resulting UTF-8 byte string.

We use SHA256 instead of MD5, which is used for the ImpHash, because we already see MD5 collisions in our data sets. We order the entries in the table to prevent attacks where a different TypeRefHash could be created for a sample by just reordering the table. A similar attack was shown for the ImpHash by Balles and Sharfuddin [2]. We chose a dash and a comma as the separators, as they are not valid in namespaces and type names in .NET.
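The three steps can be sketched in a few lines of Python. This is a minimal sketch and not the reference implementation: it assumes the TypeRef entries are already resolved to strings, that ordering is a plain code-point comparison, and that the digest is printed as lowercase hex. The function names and the sample entries (including the placeholder type NestedType with an empty namespace) are hypothetical.

```python
import hashlib


def trh_input_string(type_refs):
    """Build the string that is hashed, from (namespace, name) pairs."""
    # Step 1: order by namespace, then by type name.
    ordered = sorted(type_refs, key=lambda ref: (ref[0], ref[1]))
    # Step 2: join namespace and name with a dash.
    # An empty namespace yields a string that starts with the dash.
    entries = [f"{namespace}-{name}" for namespace, name in ordered]
    # Step 3 (first half): join all entries with commas.
    return ",".join(entries)


def type_ref_hash(type_refs):
    """Step 3 (second half): SHA256 over the UTF-8 bytes of the string."""
    data = trh_input_string(type_refs).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


refs = [
    ("System.Diagnostics", "DebuggerBrowsableState"),
    ("", "NestedType"),  # placeholder entry with an empty namespace
    ("System", "Object"),
]

print(trh_input_string(refs))
print(type_ref_hash(refs))
```

Shuffling `refs` does not change the result, which is the property the ordering step is meant to provide.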

Imagine we have a .NET sample with the following simplified TypeRef table: 

#  TypeName (Resolved)              TypeNamespace (Resolved)
0  CompilationRelaxationsAttribute  System.Runtime.CompilerServices
1  RuntimeCompatibilityAttribute    System.Runtime.CompilerServices
2  TargetFrameworkAttribute         System.Runtime.Versioning
3  DebuggingModes                   (empty)
4  AssemblyFileVersionAttribute     System.Reflection

 

This results in the following ordered and concatenated strings. Note that TypeRefs with an empty namespace are sorted to the beginning of the list.

-DebuggingModesSystem 
System.Reflection-AssemblyFileVersionAttribute 
System.Runtime.CompilerServices-CompilationRelaxationsAttribute 
System.Runtime.CompilerServices-RuntimeCompatibilityAttribute 
System.Runtime.Versioning-TargetFrameworkAttribute 

 

This is concatenated to the following final string:

 

-DebuggingModesSystem,System.Reflection-AssemblyFileVersionAttribute,System.Runtime.CompilerServices-CompilationRelaxationsAttribute,System.Runtime.CompilerServices-RuntimeCompatibilityAttribute,System.Runtime.Versioning-TargetFrameworkAttribute 

 

 

The resulting TRH is the SHA256 hashsum of the above string.  

 

63AE8074B4C2EF8E36FE3272BE23B445CEAB495E14877935C457E75CFB5E5A1E

 

You can find the TRH reference implementation in the PeNet library here.

Evaluation

How well can a TypeRefHash identify a certain malware family? To answer this, we evaluated .NET samples that we received between mid-May and mid-June 2020 and looked at the corresponding hashes for seven families. We chose these families because we were able to collect a significant number of samples for each of them, which makes a meaningful evaluation possible.

We looked at the following families: 

Malware Family       # Samples
AsyncRAT             558
Blackshades          5035
Bladabindi           7793
DiscordTokenGrabber  159
Nanocore             1335
QuasarRAT            517
RevengeRAT           276

 

We inspected the distribution of different TypeRefHashes for those families. In the following figures, the blue sections depict the most common TRH for that family. If five or fewer samples shared the same TypeRefHash, we aggregated those TRHs in the shaded areas to keep the chart readable.

We can see that in most cases one TypeRefHash dominates a family. Blackshades in particular could be identified very successfully, with the two most common TRHs comprising 97% of all analysed samples.
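The aggregation used for the figures, and the share covered by the most common TRHs, can be sketched as follows. The function names and the toy hash values are hypothetical; the threshold of five is the one stated above.

```python
from collections import Counter


def summarize_trhs(hashes, threshold=5):
    """Split TRH counts into frequent hashes and one aggregated remainder.

    Hashes that occur `threshold` times or fewer are summed up, like the
    shaded areas in the figures.
    """
    counts = Counter(hashes)
    frequent = {h: n for h, n in counts.most_common() if n > threshold}
    rare_total = sum(n for n in counts.values() if n <= threshold)
    return frequent, rare_total


def top_share(hashes, k=2):
    """Fraction of samples covered by the k most common TRHs."""
    if not hashes:
        return 0.0
    counts = Counter(hashes)
    covered = sum(n for _, n in counts.most_common(k))
    return covered / len(hashes)


# Toy data for one family: 100 samples with four different TRHs.
family_hashes = ["aaa"] * 90 + ["bbb"] * 7 + ["ccc"] * 2 + ["ddd"]

print(summarize_trhs(family_hashes))
print(top_share(family_hashes, k=2))
```

In this toy data set the two most common values cover 97 of 100 samples, and the remaining three samples fall into the aggregated group.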

The most common TypeRefHash for each family is listed in the following table:

Malware Family       Most common TRH
AsyncRAT             4807b5cd7256fad54967dfe3c394c27d16bad1ac95b0306911a3546025bd6ccf
Blackshades          306db7dcdf4dd7bbf2eaa054a8c050fb97cbe84c0da87528c6e508ac5e11607b
Bladabindi           695409c18e59ff8a2c04f5572f61d35157ea1ce34e6f3db4975dfbaeb5d7e07f
DiscordTokenGrabber  6f917770f111b5e0f6bd7b1ccd3adf491fbc756bf031fe107233d3b19d4737d
Nanocore             31feea84c77a972ebe0bfc87ac90630ad824e91965b664c47d0d2b0761b30d16
QuasarRAT            03d72f6a261029edbd5028d814b27b075f5c3c62219dbfe8a349998909d07b9a
RevengeRAT           faaf850b8f9ce7eeed4c9d18b2fbd70ef1c9dde8d920c6e333829f3150d9ca08

 

The distribution can be seen in the following figures. 

We can see that for five families, we hit the right samples in 100% of the cases. When looking at the most common TypeRefHash of QuasarRAT, we also found one CardinalRAT sample. Only for RevengeRAT are our results less accurate, as we found 15 Bladabindi samples and one AsyncRAT sample. We also found two samples known to be clean. Therefore, the TypeRefHash cannot be used effectively for some malware families, such as RevengeRAT.
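This kind of check can be sketched as follows: given labelled samples, count which families share a given TRH and derive a precision value for the expected family. The function names, family labels and hash values below are hypothetical toy data, not our evaluation data.

```python
from collections import Counter


def families_sharing_hash(samples, trh):
    """Count the family labels of all samples that carry the given TRH."""
    return Counter(family for family, sample_trh in samples if sample_trh == trh)


def precision(samples, family, trh):
    """Fraction of samples with this TRH that belong to the expected family."""
    matched = families_sharing_hash(samples, trh)
    total = sum(matched.values())
    return matched[family] / total if total else 0.0


# Toy data: (family label, TRH) pairs.
samples = (
    [("FamilyA", "h1")] * 8
    + [("FamilyB", "h1")] * 2
    + [("FamilyB", "h2")] * 5
)

print(families_sharing_hash(samples, "h1"))
print(precision(samples, "FamilyA", "h1"))
print(precision(samples, "FamilyB", "h2"))
```

A precision of 1.0 corresponds to the case where a TRH only hits samples of the expected family. Lower values indicate a hash that is shared with other families or clean files.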

Summary

As the ImpHash cannot be used with .NET binaries, we developed a similar method called TypeRefHash (TRH). The TRH is a SHA256 hashsum over the imported .NET namespaces and types. This is similar to the ImpHash, which is an MD5 hashsum over the imported DLLs and their functions. 

Our evaluation showed that the TRH can be used to identify malware families with a precision similar to that of the ImpHash for non-.NET files. Depending on the family, the TRH can be unique to one malware family or can be found in multiple families.

You can find the reference implementation in the PeNet library here.

You can find a list with the samples used for the evaluation with their corresponding family name and TRH here.

A command line tool to compute the TRH on Windows and Linux can be found here.

References

[1]: https://www.fireeye.com/blog/threat-research/2014/01/tracking-malware-import-hashing.html (accessed: 17.06.2020) 

[2]: Balles, C. and Sharfuddin, A., 2019. Breaking Imphash.  https://arxiv.org/ftp/arxiv/papers/1909/1909.07630.pdf (accessed: 17.06.2020) 

Disclaimer: The PeNet library and penet.io are both projects from one of the authors of this blog entry (Stefan Hausotte).

from Phillip Kemkes
R&D Engineer

Stefan Hausotte
Team Lead Automated Threat Analysis