
Comments (7)

brechtm commented on June 30, 2024

The size of cab['folder_data[0]'].uncompressed_data matches the sum of the sizes of the files. However, it is a string, while it should obviously be bytes for binary files (such as TTF files). I suspect I need to encode this string using a particular 8-bit encoding to be able to write out the files, but I haven't yet found out which. Encoding as latin_1 produces a corrupt TTF file: Font Book in macOS can open it, but it still differs from the TTF produced by cabextract.

from hachoir.

vstinner commented on June 30, 2024

Someone should enhance the validate() function of hachoir/parser/archive/cab.py to accept your CAB file. According to your error message, it seems like your CAB archive doesn't start with the 4 bytes: MSCF.


brechtm commented on June 30, 2024

I think encoding uncompressed_data as latin_1 is indeed the way to go here, since the very first version of Unicode used the code points of ISO-8859-1 as the first 256 Unicode code points (see Wikipedia). However, as stated above, the resulting TTF is corrupt.
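As a minimal sketch of why latin_1 is the natural choice: it maps code points 0 to 255 onto the byte values 0 to 255 one-to-one, so it round-trips any string that was built with one character per byte. The helper name `str_to_bytes` is made up for this illustration and is not part of hachoir; the sketch assumes the string really does contain one code point per original byte.

```python
def str_to_bytes(text: str) -> bytes:
    """Recover raw bytes from a str that holds one code point per byte.

    Hypothetical helper for illustration only. Raises UnicodeEncodeError
    if the string contains a code point above 255.
    """
    return text.encode("latin_1")


# Every byte value survives the str -> bytes -> str round trip unchanged.
all_bytes_as_text = "".join(chr(value) for value in range(256))
raw = str_to_bytes(all_bytes_as_text)
assert raw == bytes(range(256))
assert raw.decode("latin_1") == all_bytes_as_text
```

This only fixes the type of the data. It cannot repair bytes that were already wrong before the conversion, which is consistent with the TTF still being corrupt.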

Examining this more closely, I see that the TTF differs from the known good version in only two bytes (though other extracted files differ in more bytes). I have not been able to determine the cause of this, but I suspect that there is a bug in the LZX decompression code. That's not unlikely, since there aren't any tests for it and the LZX algorithm specification is known to have some errors.

I'd love to track down and fix this bug, but the use case doesn't allow for spending more time on this problem, unfortunately.

> Someone should enhance the validate() function of hachoir/parser/archive/cab.py to accept your CAB file. According to your error message, it seems like your CAB archive doesn't start with the 4 bytes: MSCF.

I think there are probably other files stored in the /section_rsrc besides the CAB, so I would need to somehow get the offset/length from the other fields.
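One rough way to get the offset and length without relying on the surrounding fields is to look for the MSCF signature and read the cabinet size stored in the CAB header itself. The sketch below assumes the standard CFHEADER layout (4-byte signature, 4 reserved bytes, then the total cabinet size as a little-endian 32-bit integer). The function name `carve_cab` is hypothetical, and a naive scan like this can hit a false positive if the bytes MSCF happen to occur earlier in the data.

```python
import struct

CAB_MAGIC = b"MSCF"
MIN_HEADER_SIZE = 36  # size of a CFHEADER without optional fields


def carve_cab(blob: bytes) -> bytes:
    """Return the first embedded cabinet found in blob (illustrative only)."""
    start = blob.find(CAB_MAGIC)
    if start < 0:
        raise ValueError("no MSCF signature found")
    # cbCabinet: total size of the cabinet file, at offset 8 in the header
    (cab_size,) = struct.unpack_from("<I", blob, start + 8)
    end = start + cab_size
    if cab_size < MIN_HEADER_SIZE or end > len(blob):
        raise ValueError("implausible cabinet size: %d" % cab_size)
    return blob[start:end]
```

The carved bytes could then be wrapped in an input stream and handed to the CAB parser, though locating the right resource entry through the parsed field tree is the more robust route.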


nneonneo commented on June 30, 2024

In the .exe, it seems like one of the raw_res[] entries contains the file you want. For example, in arial32.exe, /section_rsrc/raw_res[1] contains exactly the .cab file.

The issue with uncompressed_data being a string arose because the lzx module was missed when moving from Python 2 to 3. It should be bytes, not a string; I will devise a fix. Thanks for the report!


nneonneo commented on June 30, 2024

Secondly, it looks like I forgot to handle the Intel jump fixups in LZX. This has now been fixed in #66. Thanks for bringing it to my attention.
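For context on the Intel jump fixups: an LZX encoder may rewrite the 32-bit operand of x86 CALL instructions (opcode 0xE8) from a relative to an absolute offset to improve compression, and the decoder has to reverse that after decompressing each frame. The sketch below follows the commonly documented form of this step. It is not the code from #66, and the function name and its parameters (`frame`, `curpos`, `filesize`) are made up for illustration. It assumes the translation is enabled for the stream, that `filesize` is the translation size from the LZX header, and that `curpos` is the offset of the frame within the uncompressed output.

```python
import struct


def undo_e8_translation(frame: bytes, curpos: int, filesize: int) -> bytes:
    """Reverse the E8 call translation on one decompressed frame (sketch)."""
    out = bytearray(frame)
    i = 0
    end = len(out) - 10  # the last 10 bytes of a frame are never translated
    while i < end:
        if out[i] != 0xE8:
            i += 1
            continue
        pos = curpos + i  # position of the 0xE8 byte in the whole output
        (abs_off,) = struct.unpack_from("<i", out, i + 1)
        if -pos <= abs_off < filesize:
            # Convert the absolute offset back to a relative one.
            rel_off = abs_off - pos if abs_off >= 0 else abs_off + filesize
            struct.pack_into("<i", out, i + 1, rel_off)
        i += 5  # skip the opcode and its 4-byte operand
    return bytes(out)
```

Because any 0xE8 byte is treated as a CALL opcode, even non-executable data such as a font file can be altered by the encoder, so skipping this step leaves a handful of wrong bytes in otherwise correct output.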


nneonneo commented on June 30, 2024

With #66 applied, the following code correctly extracts the files from arial32.exe:

from hachoir.parser.program import ExeFile
from hachoir.parser.archive import CabFile
from hachoir.stream import FileInputStream
from io import BytesIO


f = FileInputStream("arial32.exe")
exe = ExeFile(f)
rsrc = exe["section_rsrc"]
for content in rsrc.array("raw_res"):
    # get directory[][][] and corresponding name
    # this is a bit hacky, ideally API would provide this linkage directly
    directory = content.entry.inode.parent
    name_field = directory.name.replace("directory", "name")
    if name_field in rsrc and rsrc[name_field].value == "CABINET":
        break
else:
    raise Exception("No CABINET raw_res found")

cabdata = content.getSubIStream()
cab = CabFile(cabdata)
# request substream to force generation of uncompressed_data
cab["folder_data[0]"].getSubIStream()
folder_data = BytesIO(cab["folder_data[0]"].uncompressed_data)
for file in cab.array("file"):
    with open(file["filename"].value, "wb") as outf:
        outf.write(folder_data.read(file["filesize"].value))
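The final loop above works because the files are read back to back, in directory order, from the folder's uncompressed data. That splitting step can be tried in isolation with the sketch below, which uses made-up names and sizes in place of the parsed `filename` and `filesize` fields. The helper `split_folder_data` is hypothetical, and it assumes the files are contiguous and listed in storage order; a more defensive version would seek to the per-file offset recorded in each CAB file entry instead.

```python
from io import BytesIO


def split_folder_data(folder_data: bytes, entries):
    """Split concatenated folder data into {name: bytes} (illustrative only).

    entries is an iterable of (name, size) pairs in storage order.
    """
    stream = BytesIO(folder_data)
    files = {}
    for name, size in entries:
        chunk = stream.read(size)
        if len(chunk) != size:
            raise ValueError("folder data too short for %r" % name)
        files[name] = chunk
    return files


files = split_folder_data(b"aaabbcccc", [("a.ttf", 3), ("b.inf", 2), ("c.txt", 4)])
```

The length check catches the case where the sizes in the directory add up to more than the decompressed data, which would otherwise silently produce truncated files.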


brechtm commented on June 30, 2024

@nneonneo Many thanks for the fixes and the sample code. Highly appreciated!

Looks like you figured out what was wrong with the LZX decompression code very quickly. Sure, you worked on that code 10 years ago, but still. 😁

