Decrypting malware ciphers
In our daily work of Protective Monitoring we see a lot of encoded/encrypted traffic – from webpages served over HTTPS, to passwords being obscured using Base64, to zipped binary data, and PGP emails.
Sometimes, however, something sticks out as being a bit different…
An encryption mystery…
While monitoring a client's internet traffic recently we started to see regular connections out from a computer on their network to an external site. It was sending small batches of letters and numbers that didn't match any of the normal kinds of encryption we see.
We captured the traffic, identified the machine and the likely cause of the traffic (a Browser Helper Object), and reported it to the customer. They were able to remove the suspected application, which stopped the connections being made.
All-in-all a good day – successful detection, response and remediation.
However, there was still an element of this particular case that didn't sit right with me.
If this traffic was not using 'normal' encryption, what was it using? And what data was it sending?
Could the data it was sending be personal, sensitive or confidential?
So I started to analyse the data to see if I could detect the type of encryption in use.
Note: samples shown here are trimmed to remove personal data
Analysis – Step 1: Character Sets and Regularity
Almost immediately it became apparent that the encryption wasn't consistent with the more complex cryptography you see in schemes like Public Key cryptography.
For one, the character set was not broad enough. I was only seeing upper-case letters and a few numbers, which is inconsistent with Base64 (which also uses lower-case letters) and with other methods of transferring binary-encoded encrypted data over HTTP.
The other reason was the many repeating sequences – which just wouldn't happen with a more complex cryptographic algorithm.
No, what we were seeing was a substitution cipher.
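As a rough illustration of these two checks (not the scripts used at the time), the Python sketch below profiles the characters in a sample and lists repeated substrings. The function names and the toy sample are made up for the example; no real captured traffic is shown.

```python
import string
from collections import Counter

UPPER_AND_DIGITS = set(string.ascii_uppercase + string.digits)


def charset_profile(sample):
    """Summarise which kinds of character a captured sample uses."""
    chars = set(sample)
    return {
        "distinct": len(chars),
        "only_upper_and_digits": chars <= UPPER_AND_DIGITS,
        "has_lowercase": any(c.islower() for c in chars),
        # '+', '/' and '=' are the non-alphanumeric Base64 characters.
        "has_base64_symbols": bool(chars & set("+/=")),
    }


def repeated_ngrams(sample, n):
    """Return every length-n substring that occurs more than once."""
    counts = Counter(sample[i:i + n] for i in range(len(sample) - n + 1))
    return {gram: count for gram, count in counts.items() if count > 1}


# Made-up sample purely for illustration.
toy = "0A1B0A1B2C"
print(charset_profile(toy))
print(repeated_ngrams(toy, 4))
```

A long sample with no lower-case letters and plenty of repeated substrings points away from Base64-wrapped strong encryption and towards something much simpler.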
Analysis – Step 2: Frequency Analysis
With that diagnosis in hand it was an obvious next step to run some statistical analysis on the charset.
Only 36 distinct characters were in use, which is not enough for a 1:1 substitution. That was not a problem, however, as the length of every captured sample was divisible by two, as were the repeating patterns.
Splitting the text into pairs allowed for new statistical analysis, this time matching 68 unique pairs – consistent with A-Z, a-z, 0-9 and a few symbols too (26 + 26 + 10 + 6).
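A minimal Python sketch of the pair-splitting step follows. The helper name and the toy string are assumptions for illustration only. With 36 characters there are 36 × 36 = 1296 possible pairs, so seeing only 68 of them is itself a strong hint that each pair stands for one plaintext character.

```python
def split_pairs(sample):
    """Split a sample into consecutive two-character units."""
    if len(sample) % 2 != 0:
        raise ValueError("sample length is not divisible by two")
    return [sample[i:i + 2] for i in range(0, len(sample), 2)]


# Made-up sample purely for illustration.
toy = "0A1B0A2C"
pairs = split_pairs(toy)
print(pairs)
print(len(set(pairs)), "distinct pairs seen, out of", 36 ** 2, "possible")
```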
Analysing the frequency distribution of these pairs showed a curve closely matching 'normal' text, with some letters like E, T and A occurring far more often than X, V and Q. This contrasts starkly with strong encryption, where every symbol occurs with roughly equal frequency.
At this point I was pretty certain that the data was indeed a 1:2 substitution cipher.
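The frequency count can be sketched in Python as below. The index of coincidence is an addition of mine, not something used in the original analysis: it is a standard single-number measure of how uneven a distribution is. For symbols drawn uniformly from an alphabet of k symbols it tends towards 1/k (about 0.015 for 68 symbols), and natural-language text scores noticeably higher. The function names and toy data are hypothetical.

```python
from collections import Counter


def pair_frequencies(pairs):
    """Return (pair, count) tuples, most frequent first."""
    return Counter(pairs).most_common()


def index_of_coincidence(pairs):
    """Probability that two pairs picked at random (without replacement) match."""
    counts = Counter(pairs)
    total = sum(counts.values())
    if total < 2:
        raise ValueError("need at least two pairs")
    return sum(n * (n - 1) for n in counts.values()) / (total * (total - 1))


# Made-up pairs purely for illustration.
toy_pairs = ["AA", "AA", "BB", "CC"]
print(pair_frequencies(toy_pairs))
print(index_of_coincidence(toy_pairs))
```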
Cracking the code – First Clues
Once the type of cipher was determined, I could make some brute-force attempts at breaking it.
I applied letter substitutions via some quick command-line scripts. However, this proved unsuccessful, as I did not know the idioms or the likely content of the data in these samples.
At this point someone else I had asked to look at the code pointed out a possible substitution that seemed worth investigating.
The observation was that treating the letters preceded by a 0 as literals formed some patterns, with occasional short strings.
Mostly these were not words (text like "CHMM" and "SLNG"), but there were a few strings in there that looked interesting, such as "EXE".
Assuming this new substitution to be correct and the non-word strings to be codewords or variable names I worked at matching other patterns with known plaintext that could appear in the remaining enciphered text.
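That observation can be sketched in a few lines of Python: any pair starting with 0 is shown as its second character, and everything else is masked. Only the "0 plus letter" rule comes from the analysis above; the function name, the placeholder character and the other pairs in the toy input are made up.

```python
def reveal_zero_literals(pairs, placeholder="?"):
    """Show pairs of the form '0X' as the literal 'X'; mask all other pairs."""
    return "".join(p[1] if p[0] == "0" else placeholder for p in pairs)


# Made-up pairs purely for illustration.
toy_pairs = ["0E", "0X", "0E", "3K", "0C"]
print(reveal_zero_literals(toy_pairs))
```

Reading the masked output by eye is what surfaces short strings worth chasing.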
Cracking the code – Pattern Recognition and Known Values
With the data in the sample being sent from a particular machine on the client's network I collected some data about that machine in an attempt to find known plaintexts in the code, which would give me additional substitutions.
Among the data I collected were the hostname, local IP address, username and MAC address, plus the date and time at which the sample was sent.
My big break in the decryption was finding four of the letters from the MAC address at the right distance from each other in the sample code.
I substituted the remaining numbers into the sample, and replicated this substitution across the rest of the data.
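This kind of known-plaintext search can be sketched as follows, assuming a partially revealed text in which unknown characters are shown as "?". The MAC address, the pairs and the function names are all made up for the example, and whether the real samples carried the MAC with or without separators is not something this sketch assumes to know.

```python
def find_crib(partial, crib, min_hits=4):
    """Offsets where the crib fits the partial text with at least min_hits known matches."""
    offsets = []
    for off in range(len(partial) - len(crib) + 1):
        hits = 0
        for seen, expected in zip(partial[off:off + len(crib)], crib):
            if seen == "?":
                continue          # unknown position, cannot confirm or refute
            if seen != expected:
                hits = -1         # a known character contradicts the crib
                break
            hits += 1
        if hits >= min_hits:
            offsets.append(off)
    return offsets


def extend_key(pairs, crib, offset, key):
    """Return a copy of key with the pair-to-character mappings implied by the crib."""
    new_key = dict(key)
    for pair, ch in zip(pairs[offset:offset + len(crib)], crib):
        if new_key.setdefault(pair, ch) != ch:
            raise ValueError(f"{pair} would map to two different characters")
    return new_key


# Made-up data: the hex letters are already known, the digits are not.
pairs = ["ZZ", "ZY", "1Q", "0A", "1R", "0B", "1S", "0C",
         "1T", "0D", "1U", "0E", "1V", "0F", "ZX"]
partial = "???A?B?C?D?E?F?"
mac = "0A1B2C3D4E5F"
offset = find_crib(partial, mac)[0]
print(offset, extend_key(pairs, mac, offset, {}))
```

Each successful alignment hands over the mappings for every previously unknown character in the crib, which can then be applied across the rest of the data.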
Initially this didn't add much, however it did show one more pattern: "19?.16?.1?.??"
Anyone who has set up their own router at home will probably recognise the pattern from the private IPv4 address ranges.
I was then able to substitute the local IP address of the client machine into that sequence and widen the known values further.
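Checking a candidate value against a partially decoded pattern like this is easy to automate. The sketch below turns the pattern into a regular expression, treating each "?" as an unknown digit (an assumption that suits an IP address), and uses the standard ipaddress module to confirm the candidate is in a private range. The candidate address is made up.

```python
import ipaddress
import re


def pattern_to_regex(partial, unknown="?"):
    """Compile a partial decode into a regex, treating each unknown as one digit."""
    body = "".join(r"\d" if ch == unknown else re.escape(ch) for ch in partial)
    return re.compile(body)


def fits(partial, candidate):
    """True if the candidate plaintext is consistent with the partial decode."""
    return pattern_to_regex(partial).fullmatch(candidate) is not None


partial = "19?.16?.1?.??"
candidate = "192.168.10.42"  # made-up address for illustration
print(fits(partial, candidate))
print(ipaddress.ip_address(candidate).is_private)
```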
By this point I knew I was onto something and just needed to plug away at it.
From this point on I was able to guess at a few special characters that were used to separate variable names and values, and records in the list.
The addition of the dot from the IP address also showed a pattern of ???.?????.??? appearing in the sample a number of times.
Once I realised that the first three characters were the same one repeated, I was able to add w, c, o and m to the growing decryption dictionary. I guessed a few more when I worked out that the homepage of the local browser was set to Yahoo!: www.???oo.com
As the letters fell into place the partial words were more easily filled in with less guess-work involved.
Before long I had a complete plain-to-encrypted key and was able to pop it into a Bash script to decode all the packets we had previously captured.
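The decoding step itself is just a dictionary lookup per pair. The original was a shell script; the Python below is an equivalent sketch, and the key shown is a made-up fragment for illustration, not the real key.

```python
def decode(sample, key, unknown="?"):
    """Decode a pair-substitution sample using a pair-to-character dictionary."""
    if len(sample) % 2 != 0:
        raise ValueError("sample length is not divisible by two")
    pairs = (sample[i:i + 2] for i in range(0, len(sample), 2))
    # Pairs missing from the key are shown as a placeholder rather than dropped.
    return "".join(key.get(pair, unknown) for pair in pairs)


# Hypothetical key fragment and sample.
toy_key = {"0E": "E", "0X": "X", "1Q": "."}
print(decode("0E0X0E1Q9Z", toy_key))
```

Keeping unknown pairs visible as placeholders makes it obvious when a capture contains characters the key does not yet cover.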
Analysing the decrypted data
Now that I could read these packets of encrypted traffic I set about analysing them.
The sender was indeed a Browser Helper Object (BHO), and it was regularly sending software version information to a remote server.
Periodically it would send a shorter sample and then receive an update in a binary file.
However, the larger samples I had been analysing were much more interesting…
These larger samples contained full OS and hardware information, including MAC addresses, hardware device models and firmware versions, BIOS version, security permission levels, and anti-malware defence information, such as whether features were turned on or off. In short, the sort of information an attacker would love to get their hands on, and more than any Browser Helper Object needs to know.
We have since been able to report all this to our client, so they can be more aware of the dangers of user-installed software, and in case they wanted to take any further steps at their end, such as user education or policy changes.
'Paying it forward'
I'll admit – cracking a secret cipher is fun, but the reason why we went to these efforts was to keep our clients safe.
The encryption system used by this software is not confined to this BHO: we have found samples with similar patterns on www.virustotal.com and other malware reporting sites, linked with known viruses and trojans.
As this scheme is currently in use we are providing some useful tools for detection/decryption below.
Firstly, if you are using some form of monitoring system you can get it to filter each of the following strings to detect the presence of the cipher in use:
Secondly, a rather clever colleague, who goes by the name of Trinity, recognised a pattern in the key dictionary I had built. Using an algorithm he devised, I have written a Python command-line program to decode the samples, which we are now sharing on BitBucket in case you come across this cipher in use and want to decrypt it yourself.
As a final note, here are a couple of samples we have found online of this cipher in use in sending data to malicious sites, and in hiding commands in shellcode: