Comments (7)
O-checker could provide a good starting point. In particular, pdfanalysis.py
illustrates how to detect specific malware inside a PDF.
GitHub: https://github.com/yotsubo/o-checker
from django-middleware-fileuploadvalidation.
Interesting PDF analysis tool: https://blog.didierstevens.com/programs/pdf-tools/
Also included as a forensics package in Kali: https://tools.kali.org/forensics/pdfid
PyPI package usable as a library: https://github.com/mlodic/pdfid
Example Output of pdfid.py
Release 1.0.4 (2021/03/25):
{
'/AA': 0,
'/AcroForm': 0,
'/Colors > 2^24': 0,
'/EmbeddedFile': 0,
'/Encrypt': 0,
'/JBIG2Decode': 0,
'/JS': 0,
'/JavaScript': 0,
'/Launch': 0,
'/ObjStm': 0,
'/OpenAction': 0,
'/Page': 36,
'/RichMedia': 0,
'/XFA': 0,
'endobj': 469,
'endstream': 186,
'filename': 'analyzing.pdf',
'header': '%PDF-1.4',
'obj': 469,
'startxref': 1,
'stream': 186,
'trailer': 1,
'version': '0.2.7',
'xref': 1
}
The output is described as follows:
Almost every PDF document will contain the first 7 words (obj through startxref), and to a lesser extent stream and endstream. I’ve found a couple of PDF documents without xref or trailer, but these are rare (BTW, this is not an indication of a malicious PDF document).
/Page gives an indication of the number of pages in the PDF document. Most malicious PDF documents have only one page.
/Encrypt indicates that the PDF document has DRM or needs a password to be read.
/ObjStm counts the number of object streams. An object stream is a stream object that can contain other objects, and can therefore be used to obfuscate objects (by using different filters).
/JS and /JavaScript indicate that the PDF document contains JavaScript. Almost all malicious PDF documents that I’ve found in the wild contain JavaScript (to exploit a JavaScript vulnerability and/or to execute a heap spray). Of course, you can also find JavaScript in PDF documents without malicious intent.
/AA and /OpenAction indicate an automatic action to be performed when the page/document is viewed. All malicious PDF documents with JavaScript I’ve seen in the wild had an automatic action to launch the JavaScript without user interaction.
The combination of automatic action and JavaScript makes a PDF document very suspicious.
/JBIG2Decode indicates whether the PDF document uses JBIG2 compression. This is not necessarily an indication of a malicious PDF document, but it requires further investigation.
/RichMedia is for embedded Flash.
/Launch counts launch actions.
/XFA is for XML Forms Architecture.
A number that appears between parentheses after the counter represents the number of obfuscated occurrences. For example, /JBIG2Decode 1(1) tells you that the PDF document contains the name /JBIG2Decode and that it was obfuscated (using hex codes, e.g. /JBIG#32Decode).
BTW, all the counters can be skewed if the PDF document is saved with incremental updates.
Because PDFiD is just a string scanner (supporting name obfuscation), it will also generate false positives. For example, a simple text file starting with %PDF-1.1 and containing words from the list will also be identified as a PDF document.
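To make the obfuscation counter and the false-positive caveat concrete, here is a minimal sketch of a name scanner. It is not PDFiD's actual implementation. The helper names `decode_name` and `count_names` and the short `NAMES` list are made up for illustration. It only relies on the fact that a PDF name may encode any character as `#` followed by two hex digits.

```python
import re

# Subset of the names from the example output above (illustrative only).
NAMES = ["/JS", "/JavaScript", "/AA", "/OpenAction", "/JBIG2Decode", "/Launch"]

# "#xx" hex escape inside a PDF name, e.g. "#32" for the character "2".
_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")
# A name token: "/" followed by characters that are not whitespace or delimiters.
_NAME = re.compile(rb"/[^\s/<>\[\]()%{}]+")


def decode_name(raw: bytes) -> bytes:
    """Replace hex escapes in a name, e.g. /JBIG#32Decode -> /JBIG2Decode."""
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)


def count_names(data: bytes) -> dict:
    """Return {name: [total hits, obfuscated hits]} for the names in NAMES."""
    counts = {name: [0, 0] for name in NAMES}
    for match in _NAME.finditer(data):
        raw = match.group(0)
        decoded = decode_name(raw).decode("latin-1")
        if decoded in counts:
            counts[decoded][0] += 1
            if raw != decoded.encode("latin-1"):
                # The name only matched after decoding, so it was obfuscated.
                counts[decoded][1] += 1
    return counts


sample = b"%PDF-1.1\n1 0 obj << /Filter /JBIG#32Decode /OpenAction 2 0 R >> endobj\n/JS (x)"
print(count_names(sample))
```

Note that `sample` is not a valid PDF at all, just a few bytes of text, yet the scanner still reports hits. That is the same kind of false positive described above for a pure string scanner.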
Submitted PR to extend functionality for DMF: mlodic/pdfid#3
Add more to PDF M score:
/JS 0 #This indicates the presence of JavaScript
/JavaScript 0 #This indicates the presence of JavaScript
/AA 0 #This indicates the presence of automatic action on opening
/OpenAction 0 #This indicates the presence of automatic action on opening
/AcroForm 0 #This indicates the presence of AcroForm which could contain JavaScript
/JBIG2Decode 0 #This indicates the use of JBIG2 compression which could be used for obfuscating content
/RichMedia 0 #This indicates the presence of rich media within the PDF such as Flash
/Launch 0 #This counts the launch actions
/EmbeddedFile 0 #This indicates there are embedded files within the PDF
/XFA 0 #This indicates the presence of XML Forms within the PDF
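A rough sketch of how these counters could feed into a score, assuming the pdfid output is available as a dict like the example output above. The weights, the bonus value, and the function name `pdf_score` are hypothetical and only meant to show the shape of the idea. They are not what the middleware actually uses.

```python
# Hypothetical weights per indicator, chosen only for illustration.
WEIGHTS = {
    "/JS": 3,
    "/JavaScript": 3,
    "/AA": 2,
    "/OpenAction": 2,
    "/AcroForm": 1,
    "/JBIG2Decode": 1,
    "/RichMedia": 2,
    "/Launch": 3,
    "/EmbeddedFile": 2,
    "/XFA": 1,
}
# Extra penalty when JavaScript is combined with an automatic action
# (hypothetical value).
COMBINATION_BONUS = 5


def pdf_score(counts: dict) -> int:
    """Sum the weights of all indicators that occur at least once."""
    score = sum(w for name, w in WEIGHTS.items() if counts.get(name, 0) > 0)
    has_js = counts.get("/JS", 0) > 0 or counts.get("/JavaScript", 0) > 0
    has_auto = counts.get("/AA", 0) > 0 or counts.get("/OpenAction", 0) > 0
    if has_js and has_auto:
        score += COMBINATION_BONUS
    return score


print(pdf_score({"/Page": 36, "/JS": 1, "/OpenAction": 1}))
```

Each indicator is counted once regardless of how often it occurs, since the counters can be skewed by incremental updates.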
TODO:
- Sanitize PDF file by removing Actions and JS
Just created a new PR: mlodic/pdfid#4.
This feature will allow sanitizing PDFs in memory.
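For illustration, one simple way to neutralize active content in memory is to rewrite the relevant names so that a reader no longer recognizes them. The sketch below swaps the case of each name, which keeps the byte length unchanged so existing offsets in the file stay valid. It is an assumption about one possible approach, not a description of what the PR does. The function name `disarm` and the list of names are hypothetical.

```python
import re

# Names that trigger scripts or automatic actions. The lookahead avoids
# touching longer names that merely start with one of these (e.g. /JSON).
_ACTIVE_NAMES = re.compile(rb"/(JavaScript|JS|OpenAction|AA|Launch)(?![A-Za-z0-9#])")


def disarm(data: bytes) -> bytes:
    """Return a copy of the PDF bytes with active names case-swapped.

    Limitations of this sketch: names inside compressed streams or object
    streams and hex-obfuscated names (e.g. /J#53) are not handled, and
    matching bytes inside stream data would be rewritten as well.
    """
    return _ACTIVE_NAMES.sub(lambda m: b"/" + m.group(1).swapcase(), data)


original = b"<< /OpenAction 2 0 R /JS (app.alert(1)) /JSON 1 >>"
cleaned = disarm(original)
print(cleaned)
```

Because everything operates on `bytes`, the uploaded file never has to be written to disk before it is cleaned.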
PDF detection and sanitization now implemented as of commit 2a7fe33.