I'm attempting to extract the Microsoft


Q: Extracting files from Win32 Cabinet Self-Extractor? about hachoir HOT 7 CLOSED

vstinner commented on June 30, 2024

Q: Extracting files from Win32 Cabinet Self-Extractor?

from hachoir.

Comments (7)

brechtm commented on June 30, 2024

The size of cab['folder_data[0]'].uncompressed_data matches the sum of the sizes of the files. However, it is a string, while it should obviously be bytes for binary files (such as TTF files). I suspect I need to encode this string using a particular 8-bit encoding to be able to write out the files, but I haven't yet found out which. Encoding with latin_1 produces a corrupt TTF file: Font Book in macOS can open it, but it still differs from the TTF produced by cabextract.


vstinner commented on June 30, 2024

Someone should enhance the validate() function of hachoir/parser/archive/cab.py to accept your CAB file. According to your error message, it seems like your CAB archive doesn't start with the 4 bytes: MSCF.
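A minimal sketch of the kind of signature check described here, assuming it amounts to comparing the first four bytes against `MSCF`. The helper name `looks_like_cab` is hypothetical and is not hachoir's actual `validate()` code.

```python
CAB_SIGNATURE = b"MSCF"  # magic bytes at the start of a Cabinet file


def looks_like_cab(data: bytes) -> bool:
    """Return True if `data` starts with the Cabinet signature."""
    return data[:4] == CAB_SIGNATURE


# A self-extractor is a Windows executable, so it starts with "MZ", not "MSCF".
print(looks_like_cab(b"MSCF\x00\x00\x00\x00"))  # True
print(looks_like_cab(b"MZ\x90\x00"))            # False
```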


brechtm commented on June 30, 2024

I think encoding uncompressed_data as latin_1 is indeed the way to go here, since the very first version of Unicode used the code points of ISO-8859-1 as its first 256 code points (per Wikipedia). However, as stated above, the resulting TTF is corrupt.
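The latin_1 property mentioned here can be checked with a short, self-contained snippet. It only demonstrates the lossless round trip between bytes and the first 256 code points; it says nothing about whether the decompressed data itself is correct.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decoding as latin_1 maps byte N to code point U+00NN ...
text = all_bytes.decode("latin_1")
assert [ord(ch) for ch in text] == list(range(256))

# ... and encoding maps it straight back, so the round trip is lossless.
assert text.encode("latin_1") == all_bytes
```

A string containing any code point above U+00FF cannot be encoded this way and raises `UnicodeEncodeError`.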

Examining this closer, I see that the TTF only differs from the known good version in two bytes (though other extracted files differ in more bytes). I have not been able to determine the cause of this, but I suspect that there is a bug in the LZX decompression code. That's not unlikely, since there aren't any tests for it and the LZX algorithm specification is known to have some errors.
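A byte-level comparison of this kind can be sketched as follows. The function name `differing_offsets` is made up for illustration, and the sample data stands in for the two TTF files, which would normally be read with `open(path, "rb").read()`.

```python
def differing_offsets(a: bytes, b: bytes) -> list:
    """Return the offsets at which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs differ in length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Stand-ins for the known good file and the file extracted with hachoir.
good = bytes([0x00, 0x01, 0xE8, 0x10, 0x20, 0x30])
bad = bytes([0x00, 0x01, 0xE8, 0x11, 0x20, 0x31])
print(differing_offsets(good, bad))  # [3, 5]
```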

I'd love to track down and fix this bug, but the use case doesn't allow for spending more time on this problem, unfortunately.

> Someone should enhance the validate() function of hachoir/parser/archive/cab.py to accept your CAB file. According to your error message, it seems like your CAB archive doesn't start with the 4 bytes: MSCF.

I think there are probably other files stored in the /section_rsrc besides the CAB, so I would need to somehow get the offset/length from the other fields.
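One way to get an offset and length without relying on other fields is to carve the cabinet out of the raw bytes. This is a sketch, not what hachoir does: it assumes the cabinet header stores the total cabinet size as a 32-bit little-endian integer at offset 8 (the `cbCabinet` field of the CAB header), and that the first `MSCF` match is the real header. A stray `MSCF` byte sequence elsewhere in the data would fool it. The name `carve_cab` is hypothetical.

```python
import struct


def carve_cab(blob: bytes) -> bytes:
    """Return the embedded cabinet found in `blob`, located by its signature."""
    offset = blob.find(b"MSCF")
    if offset < 0:
        raise ValueError("no MSCF signature found")
    if offset + 12 > len(blob):
        raise ValueError("truncated cabinet header")
    # Total cabinet size: 4 bytes, little-endian, 8 bytes into the header.
    (size,) = struct.unpack_from("<I", blob, offset + 8)
    end = offset + size
    if size < 12 or end > len(blob):
        raise ValueError("cabinet size field is implausible")
    return blob[offset:end]


# Synthetic example: a 16-byte fake "cabinet" surrounded by unrelated bytes.
fake_cab = b"MSCF" + bytes(4) + struct.pack("<I", 16) + b"\xaa" * 4
blob = b"junk" + fake_cab + b"trailing"
assert carve_cab(blob) == fake_cab
```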


nneonneo commented on June 30, 2024

In the .exe, one of the raw_res[] entries appears to contain the file you want. For example, in arial32.exe, /section_rsrc/raw_res[1] contains exactly the .cab file.

The issue with uncompressed_data being a string is due to the lzx module having been missed when moving from Python 2 to 3. It should not be a string; I will devise a fix. Thanks for the report!


nneonneo commented on June 30, 2024

Secondly, it looks like I forgot to handle the Intel jump fixups in LZX. This has now been fixed in #66. Thanks for bringing it to my attention.
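For readers unfamiliar with the fixups: LZX compressors can rewrite the relative targets of x86 `CALL` instructions (opcode `0xE8`) as absolute offsets to improve compression, and the decoder has to undo this. The sketch below is based on my understanding of how open-source LZX decoders handle a single decompressed frame. It is not the code from #66, the function and parameter names are made up, and real decoders apply further conditions (for example, only translating the early frames of a stream).

```python
import struct


def undo_e8_translation(frame: bytes, frame_offset: int, file_size: int) -> bytes:
    """Reverse the E8 call translation for one decompressed frame.

    frame_offset: position of this frame within the uncompressed stream.
    file_size: translation size recorded in the LZX stream header.
    """
    out = bytearray(frame)
    end = len(out) - 10  # the last 10 bytes of a frame are never translated
    i = 0
    while i < end:
        if out[i] != 0xE8:
            i += 1
            continue
        curpos = frame_offset + i  # stream position of the E8 opcode
        (value,) = struct.unpack_from("<i", out, i + 1)
        if -curpos <= value < file_size:
            # Convert the absolute target back into a relative displacement.
            rel = value - curpos if value >= 0 else value + file_size
            struct.pack_into("<i", out, i + 1, rel)
        i += 5  # skip the opcode and its 4-byte operand
    return bytes(out)


frame = b"\x90\xe8" + struct.pack("<i", 100) + bytes(10)
fixed = undo_e8_translation(frame, frame_offset=0, file_size=1000)
assert fixed == b"\x90\xe8" + struct.pack("<i", 99) + bytes(10)
```

Skipping this step leaves a few operand bytes wrong after each affected `0xE8`, which would be consistent with extracted files that differ from the known good ones in only a handful of bytes.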


nneonneo commented on June 30, 2024

With #66 applied, the following code extracts the files correctly from arial32.exe:

```python
from hachoir.parser.program import ExeFile
from hachoir.parser.archive import CabFile
from hachoir.stream import FileInputStream
from io import BytesIO


f = FileInputStream("arial32.exe")
exe = ExeFile(f)
rsrc = exe["section_rsrc"]
for content in rsrc.array("raw_res"):
    # get directory[][][] and corresponding name
    # this is a bit hacky, ideally API would provide this linkage directly
    directory = content.entry.inode.parent
    name_field = directory.name.replace("directory", "name")
    if name_field in rsrc and rsrc[name_field].value == "CABINET":
        break
else:
    raise Exception("No CABINET raw_res found")

cabdata = content.getSubIStream()
cab = CabFile(cabdata)
# request substream to force generation of uncompressed_data
cab["folder_data[0]"].getSubIStream()
folder_data = BytesIO(cab["folder_data[0]"].uncompressed_data)
for file in cab.array("file"):
    with open(file["filename"].value, "wb") as outf:
        outf.write(folder_data.read(file["filesize"].value))
```
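The final loop relies on the files being stored back to back in the uncompressed folder data, in the same order as the file entries. A hachoir-free sketch of that splitting step, with a size check added, might look like this. The helper name `split_folder_data` and the sample file names are hypothetical.

```python
from io import BytesIO


def split_folder_data(data: bytes, entries) -> dict:
    """Split `data` into per-file chunks.

    entries: (name, size) pairs in the order the files are stored.
    """
    total = sum(size for _, size in entries)
    if total != len(data):
        raise ValueError(f"sizes add up to {total}, but data has {len(data)} bytes")
    stream = BytesIO(data)
    return {name: stream.read(size) for name, size in entries}


# Synthetic folder data holding two tiny "files".
chunks = split_folder_data(b"AAAABB", [("a.ttf", 4), ("b.inf", 2)])
assert chunks == {"a.ttf": b"AAAA", "b.inf": b"BB"}
```

Checking the total up front turns a mismatch between the file table and the decompressed data into an immediate error instead of silently short files.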


brechtm commented on June 30, 2024

@nneonneo Many thanks for the fixes and the sample code. Highly appreciated!

Looks like you figured out what was wrong with the LZX decompression code very quickly. Sure, you worked on that code 10 years ago, but still. 😁

