How to analyze malicious PDF files

Open the PDF file in a text editor such as nano, gedit or SciTE. Some editors do not display every encoding, so if something seems to be missing, try opening the file in a different editor. SciTE tends to handle this type of work best.

Look for clear-text scripts and Base64-encoded blobs, and decode them to see what they contain. This can be done with base64dump.py:

python base64dump.py /path/to/pdf/file

This will give you a list of all the blobs it’s found. Start with the biggest, and extract them to see what’s in them:

python base64dump.py -s <ID number> -S /path/to/pdf/file
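
If base64dump.py is not at hand, the same idea can be roughed out with the Python standard library. This is only a minimal sketch, not how base64dump.py works internally: the function name find_base64_blobs and the 16-character minimum run length are assumptions made for illustration.

```python
import base64
import re

# Runs of at least 16 base64 characters, optionally followed by '=' padding.
# The 16-character threshold is an arbitrary choice for this sketch.
B64_RUN = re.compile(rb"[A-Za-z0-9+/]{16,}={0,2}")

def find_base64_blobs(data):
    """Return (offset, decoded_bytes) pairs, largest decoded blob first."""
    blobs = []
    for match in B64_RUN.finditer(data):
        run = match.group()
        if len(run) % 4 != 0:
            continue  # not a whole number of 4-character base64 groups
        try:
            decoded = base64.b64decode(run, validate=True)
        except ValueError:
            continue  # looked like base64 but did not decode cleanly
        blobs.append((match.start(), decoded))
    blobs.sort(key=lambda item: len(item[1]), reverse=True)
    return blobs
```

Read the file with open(path, "rb").read() and pass the bytes in. Expect false positives: any long alphanumeric run whose length is a multiple of four will decode to something, so review the biggest hits first.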

To look for embedded shellcode, try the other encodings (pu for %u escapes, bu for \u escapes, hex for hexadecimal):

python base64dump.py -e pu /path/to/pdf/file
python base64dump.py -e bu /path/to/pdf/file
python base64dump.py -e hex /path/to/pdf/file

If you find something, dump it as a binary with the -d option and redirect the output to a file.
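
To see what the %u decoding amounts to, here is a small sketch using only the standard library. The function name decode_percent_u is made up for illustration, and the little-endian byte order is an assumption (it matches how such strings typically end up in memory on x86); it is not taken from base64dump.py itself.

```python
import re
import struct

# One or more consecutive %uXXXX escapes, e.g. "%u9090%u4142".
PCT_U_RUN = re.compile(r"(?:%u[0-9A-Fa-f]{4})+")
PCT_U_WORD = re.compile(r"%u([0-9A-Fa-f]{4})")

def decode_percent_u(text):
    """Return a list of byte strings, one per run of %uXXXX escapes."""
    results = []
    for run in PCT_U_RUN.finditer(text):
        words = [int(h, 16) for h in PCT_U_WORD.findall(run.group())]
        # Assumption: each 16-bit value is stored little-endian.
        results.append(struct.pack("<%dH" % len(words), *words))
    return results
```

Each returned byte string can be written to a file opened in "wb" mode and handed to a disassembler or shellcode emulator.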

Objects that indicate malicious behaviour/intent:

  • /Launch (run external or embedded executable)
  • /EmbeddedFiles (files embedded in the document)
  • /JS (embedded JavaScript)
  • /JavaScript (embedded JavaScript)
  • /XFA (XML Forms Architecture, which can carry embedded JavaScript)
  • /RichMedia (embedded Flash)
  • /URI (request resource from site)
  • /SubmitForm (post data to site)
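
As a rough illustration of checking for the names listed above, the sketch below counts them in the raw bytes of a file. It is a simplified stand-in, not a replacement for a proper tool: the function name count_suspicious_names is hypothetical, and this version neither decodes hex-obfuscated names (such as /J#61vaScript) nor looks inside compressed streams.

```python
import re

# The names from the list above.
SUSPICIOUS = [b"/Launch", b"/EmbeddedFiles", b"/JS", b"/JavaScript",
              b"/XFA", b"/RichMedia", b"/URI", b"/SubmitForm"]

def count_suspicious_names(data):
    """Return {name: count} for each suspicious name found in raw PDF bytes."""
    counts = {}
    for name in SUSPICIOUS:
        # The lookahead stops /JS from matching inside longer names.
        pattern = re.compile(re.escape(name) + rb"(?![A-Za-z0-9])")
        counts[name.decode("ascii")] = len(pattern.findall(data))
    return counts
```

A count of zero does not make the file clean; it only means the names do not appear in plain form outside compressed streams.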

Streams can contain malicious (or benign) content.

To get a quick list of objects in a PDF file:

python pdfid.py /path/to/file

If you spot any of the suspicious objects, run pdf-parser to search for JavaScript payloads or anything in your dirty words list:

python pdf-parser.py /path/to/file --search JavaScript

or

python pdf-parser.py /path/to/file --searchstream JavaScript

To see the entire object:

python pdf-parser.py /path/to/file --object 123

If you see an object with something suspicious, dump it and review it:

python pdf-parser.py /path/to/file --object 123 --filter --raw -d /path/to/output/file
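
For the common case of a FlateDecode (zlib) stream, the filtering step can be approximated with the standard library. This is a sketch under stated assumptions, not pdf-parser's implementation: the function name inflate_streams is hypothetical, streams are located with a plain regular expression rather than by parsing objects, and only a single zlib filter is attempted.

```python
import re
import zlib

# Everything between a "stream" keyword and the next "endstream".
# The lookbehind keeps the tail of "endstream" from being read as "stream".
STREAM = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)

def inflate_streams(data):
    """Return (index, inflated_bytes) for each stream zlib can decompress."""
    results = []
    for index, match in enumerate(STREAM.finditer(data)):
        try:
            # decompressobj tolerates the trailing newline before endstream.
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not zlib data, or a different or stacked filter
        results.append((index, inflated))
    return results
```

Streams that use other filters (ASCIIHexDecode, ASCII85Decode, LZWDecode, or several chained together) are skipped here, so a missing result does not mean the stream is empty.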