pdfid's People

Contributors

dealbreaker973, iv1t3, manfred-kaiser, mlodic

pdfid's Issues

pdf disarm doesn't work

Hi, I ran the provided test file and printed the return value of disarmed_pdf_buffers = disarm_pdfs_by_buffer(filenames, file_buffers) to check its content. This is the output:

STARTING DISARM
/JS -> /js
PDFiD 0.2.7 ./Dante.pdf
 PDF Header: %PDF-1.7
 obj                   20
 endobj                20
 stream                 5
 endstream              5
 xref                   1
 trailer                1
 startxref              1
 /Page                  1
 /Encrypt               0
 /ObjStm                0
 /JS                    1
 /JavaScript            0
 /AA                    0
 /OpenAction            0
 /AcroForm              0
 /JBIG2Decode           0
 /RichMedia             0
 /Launch                0
 /EmbeddedFile          0
 /XFA                   0
 /Colors > 2^24         0

{'buffers': []} <-- nothing in the returned buffer

Also, in testing 3.1, analyze_pdfs_by_buffer actually loads the file by filename rather than analyzing the sanitized buffer, which I believe is not the expected behavior.
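For reference, here is a minimal sketch of the behavior I expected from a buffer-based disarm: one sanitized buffer returned per input buffer. This is not pdfid's actual code. The helper names disarm_buffer and disarm_buffers and the keyword list are assumptions for illustration; only the /JS -> /js rewrite is confirmed by the output above.

```python
import re

# Hypothetical keyword list: only the /JS -> /js rewrite appears in the output above.
# The negative lookahead keeps longer names such as /JSX untouched.
_ACTIVE_NAMES = re.compile(rb"/(JavaScript|JS|AA|OpenAction)(?![A-Za-z0-9#])")


def disarm_buffer(data: bytes) -> bytes:
    """Swap the case of active-content names so they are no longer recognized."""
    return _ACTIVE_NAMES.sub(lambda m: b"/" + m.group(1).swapcase(), data)


def disarm_buffers(buffers):
    """Return a dict shaped like the one printed above, one disarmed buffer per input."""
    return {"buffers": [disarm_buffer(b) for b in buffers]}


if __name__ == "__main__":
    print(disarm_buffers([b"<< /JS (x) /JSX 1 >>"]))
```

With a result like this, the follow-up analysis could run on the returned bytes directly instead of reopening the file by name.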

PDF parsing error

Hi, I got the following error when testing the library on a PDF exploit generated by the Metasploit module exploit/windows/fileformat/adobe_pdf_embedded_exe_nojs: NameError: name 'name' is not defined.

I believe that the error was introduced in the following line:

filename_dict['%s_hexcode_count' % name] = int(node.getAttribute('HexcodeCount'))

Deleting lines 697 and 698 of pdfid.py avoids the error, at the cost of no longer recording the hexcode count.

Unbound variable when scanning PDF with hex characters

If a PDF is given with hex characters (for example obfuscated JS tags like /JavaScript --> /#4AavaScript), the following error is encountered:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/worker/venv/lib/python3.10/site-packages/pdfid/pdfid.py", line 1096, in PDFiDMain
    ProcessFile(filename, options, plugins, list_of_dict["reports"], disarmed_buffers["buffers"])
  File "/home/worker/venv/lib/python3.10/site-packages/pdfid/pdfid.py", line 819, in ProcessFile
    PDFID2Dict(xmlDoc, options.nozero, options.force, list_of_dict)
  File "/home/worker/venv/lib/python3.10/site-packages/pdfid/pdfid.py", line 698, in PDFID2Dict
    filename_dict['%s_hexcode_count' % name] = int(node.getAttribute('HexcodeCount'))
NameError: name 'name' is not defined
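For context on the obfuscation mentioned above: in a PDF name, #XX stands for the byte with hexadecimal value XX, so /#4AavaScript is just another spelling of /JavaScript. A small sketch of the decoding follows; the function names decode_pdf_name and count_hex_escapes are hypothetical and are not part of pdfid.

```python
import re

# A PDF name may spell any byte as '#' followed by two hexadecimal digits.
_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def decode_pdf_name(name: bytes) -> bytes:
    """Replace every #XX escape in a PDF name with the byte it encodes."""
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), name)


def count_hex_escapes(name: bytes) -> int:
    """Count the #XX escapes in a PDF name."""
    return len(_HEX_ESCAPE.findall(name))


if __name__ == "__main__":
    obfuscated = b"/#4AavaScript"
    print(decode_pdf_name(obfuscated), count_hex_escapes(obfuscated))
```

Presumably a non-zero HexcodeCount attribute is how pdfid reports names written this way, which would explain why only such files reach the failing line.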

The code responsible is in the function PDFID2Dict, where line 698 references a variable, name, that is not defined within the scope of the function (or anywhere else, for that matter):

pdfid/pdfid/pdfid.py

Lines 683 to 720 in f7674ff

def PDFID2Dict(xmlDoc, nozero, force, list_of_dict):
    filename_dict = {}
    filename_dict['version'] = xmlDoc.documentElement.getAttribute('Version')
    filename_dict['filename'] = xmlDoc.documentElement.getAttribute('Filename')
    if xmlDoc.documentElement.getAttribute('ErrorOccured') == 'True':
        filename_dict['error_occured'] = xmlDoc.documentElement.getAttribute('ErrorMessage')
        return
    if not force and xmlDoc.documentElement.getAttribute('IsPDF') == 'False':
        filename_dict['error_occured'] = ' Not a PDF document\n'
        return
    filename_dict['header'] = xmlDoc.documentElement.getAttribute('Header')
    for node in xmlDoc.documentElement.getElementsByTagName('Keywords')[0].childNodes:
        if not nozero or nozero and int(node.getAttribute('Count')) > 0:
            filename_dict[node.getAttribute('Name')] = int(node.getAttribute('Count'))
            if int(node.getAttribute('HexcodeCount')) > 0:
                filename_dict['%s_hexcode_count' % name] = int(node.getAttribute('HexcodeCount'))
    if xmlDoc.documentElement.getAttribute('CountEOF') != '':
        filename_dict['eof'] = int(xmlDoc.documentElement.getAttribute('CountEOF'))
    if xmlDoc.documentElement.getAttribute('CountCharsAfterLastEOF') != '':
        filename_dict['after_last_eof'] = int(xmlDoc.documentElement.getAttribute('CountCharsAfterLastEOF'))
    for node in xmlDoc.documentElement.getElementsByTagName('Dates')[0].childNodes:
        filename_dict[node.getAttribute('Value')] = node.getAttribute('Name')
    if xmlDoc.documentElement.getAttribute('TotalEntropy') != '':
        filename_dict['entropy'] = {
            "total": xmlDoc.documentElement.getAttribute('TotalEntropy'),
            "bytes": '%10s bytes' % xmlDoc.documentElement.getAttribute('TotalCount')
        }
    if xmlDoc.documentElement.getAttribute('StreamEntropy') != '':
        filename_dict['entropy_inside_streams'] = {
            "total": xmlDoc.documentElement.getAttribute('StreamEntropy'),
            "bytes": '%10s bytes' % xmlDoc.documentElement.getAttribute('StreamCount')
        }
    if xmlDoc.documentElement.getAttribute('NonStreamEntropy') != '':
        filename_dict['entropy_outside_streams'] = {
            "total": xmlDoc.documentElement.getAttribute('NonStreamEntropy'),
            "bytes": '%10s bytes' % xmlDoc.documentElement.getAttribute('NonStreamCount')
        }
    list_of_dict.append(filename_dict)

I cannot provide a fix since I do not know what name is supposed to be in the first place. If anyone can help, that would be much appreciated. :)
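One possible reading, and it is only a guess, is that name was meant to be the keyword's Name attribute, since that attribute is already used as the dictionary key on the line above. The sketch below shows the keyword loop rewritten under that assumption. The function keywords_to_dict and the SAMPLE document are hypothetical stand-ins built from the attribute names in the quoted code, not real pdfid output.

```python
from xml.dom import minidom

# Hypothetical XML that mirrors the attribute names read by PDFID2Dict.
SAMPLE = (
    '<PDFiD><Keywords>'
    '<Keyword Name="/JavaScript" Count="1" HexcodeCount="1"/>'
    '<Keyword Name="/JS" Count="0" HexcodeCount="0"/>'
    '</Keywords></PDFiD>'
)


def keywords_to_dict(xml_doc, nozero=False):
    """Collect keyword counts, binding `name` before it is used in a key."""
    result = {}
    keywords = xml_doc.documentElement.getElementsByTagName('Keywords')[0]
    for node in keywords.childNodes:
        if node.nodeType != node.ELEMENT_NODE:
            continue  # skip whitespace text nodes, if any
        name = node.getAttribute('Name')  # assumption: the intended `name`
        count = int(node.getAttribute('Count'))
        if nozero and count == 0:
            continue
        result[name] = count
        hexcode_count = int(node.getAttribute('HexcodeCount'))
        if hexcode_count > 0:
            result['%s_hexcode_count' % name] = hexcode_count
    return result


if __name__ == "__main__":
    print(keywords_to_dict(minidom.parseString(SAMPLE)))
```

If that reading is right, the same one-line binding inside the existing loop in PDFID2Dict would keep the hexcode count instead of dropping it.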
