Comments (7)

IV1T3 commented on June 18, 2024

O-checker could provide a good starting point. In particular, pdfanalysis.py illustrates how to detect specific malware inside a PDF.

GitHub: https://github.com/yotsubo/o-checker

Presentation: https://www.blackhat.com/docs/us-16/materials/us-16-Otsubo-O-checker-Detection-of-Malicious-Documents-through-Deviation-from-File-Format-Specifications.pdf

Whitepaper: https://www.blackhat.com/docs/us-16/materials/us-16-Otsubo-O-checker-Detection-of-Malicious-Documents-through-Deviation-from-File-Format-Specifications-wp.pdf

from django-middleware-fileuploadvalidation.

IV1T3 commented on June 18, 2024

Interesting PDF analysis tool: https://blog.didierstevens.com/programs/pdf-tools/
Also shipped as a forensics package in Kali: https://tools.kali.org/forensics/pdfid
PyPI package usable as a library: https://github.com/mlodic/pdfid

Example Output of pdfid.py Release 1.0.4 (2021/03/25):

{  
    '/AA': 0,
    '/AcroForm': 0,
    '/Colors > 2^24': 0,
    '/EmbeddedFile': 0,
    '/Encrypt': 0,
    '/JBIG2Decode': 0,
    '/JS': 0,
    '/JavaScript': 0,
    '/Launch': 0,
    '/ObjStm': 0,
    '/OpenAction': 0,
    '/Page': 36,
    '/RichMedia': 0,
    '/XFA': 0,
    'endobj': 469,
    'endstream': 186,
    'filename': 'analyzing.pdf',
    'header': '%PDF-1.4',
    'obj': 469,
    'startxref': 1,
    'stream': 186,
    'trailer': 1,
    'version': '0.2.7',
    'xref': 1
}

The output is described as follows:

Almost every PDF document will contain the 7 structural keywords that PDFiD reports first (obj, endobj, stream, endstream, xref, trailer, startxref; the dictionary above sorts its keys alphabetically instead), with stream and endstream being slightly less universal. I’ve found a couple of PDF documents without xref or trailer, but these are rare (BTW, this is not an indication of a malicious PDF document).

/Page gives an indication of the number of pages in the PDF document. Most malicious PDF documents have only one page.

/Encrypt indicates that the PDF document has DRM or needs a password to be read.

/ObjStm counts the number of object streams. An object stream is a stream object that can contain other objects, and can therefore be used to obfuscate objects (by using different filters).

/JS and /JavaScript indicate that the PDF document contains JavaScript. Almost all malicious PDF documents that I’ve found in the wild contain JavaScript (to exploit a JavaScript vulnerability and/or to execute a heap spray). Of course, you can also find JavaScript in PDF documents without malicious intent.

/AA and /OpenAction indicate an automatic action to be performed when the page/document is viewed. All malicious PDF documents with JavaScript I’ve seen in the wild had an automatic action to launch the JavaScript without user interaction.

The combination of automatic action and JavaScript makes a PDF document very suspicious.
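The rule above can be sketched as a small check on a PDFiD-style dictionary like the example output. This is only an illustration: the function name `is_very_suspicious` is hypothetical and not part of pdfid, and the rule is a heuristic rather than a verdict.

```python
def is_very_suspicious(counts: dict) -> bool:
    """Return True when JavaScript and an automatic action are both present.

    `counts` is assumed to be a PDFiD-style dict mapping keyword -> count,
    shaped like the example output above. Missing keys count as 0.
    """
    has_js = counts.get("/JS", 0) > 0 or counts.get("/JavaScript", 0) > 0
    has_auto_action = counts.get("/AA", 0) > 0 or counts.get("/OpenAction", 0) > 0
    return has_js and has_auto_action


# Values taken from the example output: no JavaScript, no automatic action.
sample = {"/AA": 0, "/JS": 0, "/JavaScript": 0, "/OpenAction": 0, "/Page": 36}
print(is_very_suspicious(sample))                                   # False
print(is_very_suspicious({**sample, "/JS": 1, "/OpenAction": 1}))   # True
```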

/JBIG2Decode indicates if the PDF document uses JBIG2 compression. This is not necessarily an indication of a malicious PDF document, but requires further investigation.

/RichMedia is for embedded Flash.

/Launch counts launch actions.

/XFA is for XML Forms Architecture.

A number that appears between parentheses after the counter represents the number of obfuscated occurrences. For example, /JBIG2Decode 1(1) tells you that the PDF document contains the name /JBIG2Decode and that it was obfuscated (using hexcodes, e.g. /JBIG#32Decode).
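To make the hexcode obfuscation concrete, here is a minimal sketch of how a `#xx` escape inside a PDF name can be decoded before comparison. The helper names `normalize_pdf_name` and `is_obfuscated` are hypothetical; this is not PDFiD's own code.

```python
import re

# A '#' followed by two hex digits encodes one character inside a PDF name.
_HEX_ESCAPE = re.compile(r"#([0-9A-Fa-f]{2})")


def normalize_pdf_name(name: str) -> str:
    """Replace every #xx escape with the character it encodes."""
    return _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), name)


def is_obfuscated(name: str) -> bool:
    """A name counts as obfuscated if decoding changes it."""
    return normalize_pdf_name(name) != name


print(normalize_pdf_name("/JBIG#32Decode"))  # /JBIG2Decode
print(is_obfuscated("/JBIG#32Decode"))       # True
print(is_obfuscated("/JBIG2Decode"))         # False
```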

BTW, all the counters can be skewed if the PDF document is saved with incremental updates.

Because PDFiD is just a string scanner (supporting name obfuscation), it will also generate false positives. For example, a simple text file starting with %PDF-1.1 and containing words from the list will also be identified as a PDF document.
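The false-positive case is easy to reproduce with a deliberately naive scanner. The sketch below is not PDFiD's implementation; `naive_scan` and its short keyword list are assumptions made for illustration, and it ignores name obfuscation entirely.

```python
import re

# A small, illustrative subset of the keywords discussed above.
KEYWORDS = (b"/JS", b"/JavaScript", b"/OpenAction")


def naive_scan(data: bytes) -> dict:
    """Count keyword occurrences in raw bytes without parsing the PDF."""
    result = {"header": None}
    if data.startswith(b"%PDF-"):
        result["header"] = data[:8].decode("latin-1")
    for keyword in KEYWORDS:
        # The lookahead stops /JS from also matching inside /JSomething.
        pattern = re.escape(keyword) + rb"(?![A-Za-z0-9])"
        result[keyword.decode()] = len(re.findall(pattern, data))
    return result


# A plain text file, not a real PDF, still looks like one to the scanner.
fake = b"%PDF-1.1\nThis is just a note mentioning /JS and /OpenAction.\n"
print(naive_scan(fake))
# {'header': '%PDF-1.1', '/JS': 1, '/JavaScript': 0, '/OpenAction': 1}
```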

IV1T3 commented on June 18, 2024

Submitted PR to extend functionality for DMF: mlodic/pdfid#3

IV1T3 commented on June 18, 2024

Add more keywords to the PDF M score:

/JS 0           # Indicates the presence of JavaScript
/JavaScript 0   # Indicates the presence of JavaScript
/AA 0           # Indicates an automatic action when the page/document is viewed
/OpenAction 0   # Indicates an automatic action when the document is opened
/AcroForm 0     # Indicates the presence of an AcroForm, which could contain JavaScript
/JBIG2Decode 0  # Indicates the use of JBIG2 compression, which could be used to obfuscate content
/RichMedia 0    # Indicates the presence of rich media within the PDF, such as Flash
/Launch 0       # Counts the launch actions
/EmbeddedFile 0 # Indicates that there are embedded files within the PDF
/XFA 0          # Indicates the presence of XML Forms Architecture (XFA) forms within the PDF
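One way such a score could be computed from the keywords listed above is a weighted sum over the keywords that occur at least once. The weights and the name `m_score` below are purely hypothetical and are not taken from the project; they only illustrate the idea.

```python
# Hypothetical weights: higher means "more worrying when present".
HYPOTHETICAL_WEIGHTS = {
    "/JS": 3,
    "/JavaScript": 3,
    "/AA": 2,
    "/OpenAction": 2,
    "/AcroForm": 1,
    "/JBIG2Decode": 1,
    "/RichMedia": 2,
    "/Launch": 3,
    "/EmbeddedFile": 2,
    "/XFA": 1,
}


def m_score(counts: dict, weights: dict = HYPOTHETICAL_WEIGHTS) -> int:
    """Sum the weight of every keyword that appears at least once."""
    return sum(w for key, w in weights.items() if counts.get(key, 0) > 0)


print(m_score({"/JS": 0, "/OpenAction": 0}))   # 0
print(m_score({"/JS": 1, "/OpenAction": 4}))   # 5
```

Scoring presence rather than raw counts keeps a document with many harmless repetitions of one keyword from outranking one that combines several different indicators.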

IV1T3 commented on June 18, 2024

TODO:

  • Sanitize PDF file by removing Actions and JS

IV1T3 commented on June 18, 2024

Just created a new PR for mlodic/pdfid#4.

This feature allows PDFs to be sanitized in memory.
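As a rough illustration of in-memory sanitization, one simple technique is to swap the letter case of the active names so that a reader no longer recognizes them, while every byte offset in the file stays the same. This is a sketch under that assumption and not necessarily what the PR does; the function name `disarm` is hypothetical here, and the approach does not reach hex-obfuscated names or names hidden inside compressed object streams.

```python
import re

# Names that trigger scripts or automatic actions. The lookahead avoids
# touching longer names that merely start with one of these.
_ACTIVE_NAMES = re.compile(
    rb"/(JavaScript|JS|OpenAction|AA|Launch)(?![A-Za-z0-9])"
)


def disarm(pdf_bytes: bytes) -> bytes:
    """Swap the case of active names; PDF names are case-sensitive.

    The output has exactly the same length as the input, so cross-reference
    offsets inside the file remain valid.
    """
    return _ACTIVE_NAMES.sub(lambda m: m.group(0).swapcase(), pdf_bytes)


raw = b"<< /Type /Catalog /OpenAction << /S /JavaScript /JS (app.alert(1)) >> >>"
clean = disarm(raw)
print(clean)
print(len(clean) == len(raw))  # True
```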

IV1T3 commented on June 18, 2024

PDF detection and sanitization now implemented as of commit 2a7fe33.
