
hachoir's Issues

hachoir-metadata parses and outputs incorrect duration metadata for ogv files

For some reason, hachoir-metadata has been producing bad duration output for ogv files. I've been testing with the "Computer Chronicles" collection on archive.org. To reproduce:

$ wget https://archive.org/download/CC517_commodore_64/CC517_commodore_64.ogv

from hachoir import parser as hachoir_parser
from hachoir import metadata as hachoir_metadata

video_file = 'CC517_commodore_64.ogv'
parser = hachoir_parser.createParser(video_file, video_file)
hachoir_metadata.extractMetadata(parser).exportPlaintext()

['Common:', '- Title: Commodore 64', '- Duration: 1 min 14 sec 423 ms', '- Location: http://www.archive.org/details/CC517_commodore_64', '- Copyright: http://creativecommons.org/licenses/by-nc-nd/2.0/', '- Producer: Xiph.Org libTheora I 20081020 3 2 1', '- MIME type: video/theora', '- Endianness: Little endian', 'Video:', '- Image width: 400 pixels', '- Image height: 300 pixels', '- Pixel format: 4:2:0', '- Compression: Theora', '- Frame rate: 30.0 fps', '- Comment: Quality: 0', '- Format version: Theora version 3.2 (revision 1)', 'Audio:', '- Channel: stereo', '- Sample rate: 44.1 kHz', '- Compression: Vorbis', '- Format version: Vorbis version 0']

hachoir-metadata reports that the duration is 1 minute and 14 seconds, but if you open the file in VLC and watch it, it's actually 28 minutes and 31 seconds.

Using Python 3.8.2 and hachoir 3.1.1.

Please release 3.1.3 (or 3.1.3a1)

I am aware that this project is unmaintained, but my project depends on the fixes introduced following #65. I was planning on simply using a URL in my dependency specification (https://github.com/vstinner/hachoir/archive/8000dbeb9aad587e8dc5be8202796cdfb67f899e.zip), but PyPI will not accept such a dependency.

I was hoping you could release a new version of hachoir including these fixes, if you can find the time. Alternatively, I can fork the project and publish it as e.g. hachoir-reloaded. I'd rather not fork though, if that can be avoided.

Rename Hachoir3 project to Hachoir?

The original Python2-only "Hachoir" project hosted on Bitbucket didn't get much love:

  • last commit: 2015-10-26
  • latest hachoir-core release (hachoir-core 1.3.3): 2010-02-26

I propose to rename Hachoir3 to Hachoir.

Hopefully, releases of the old Python2-only Hachoir project can co-exist, since they were published under different names (hachoir-core, hachoir-metadata, etc.).

Support NTFS USA stream patching

For the NTFS parser I recently learned about the "update sequence array", which is a set of binary patches applied to specific offsets in the stream. Essentially, suppose you have a USA value of "05 00, 6F 20". Then, later in the file, you might see

68 65 6C 6C 05 00 77 6F 72 6C 64

(with the 05 00 at a specific offset - 510 bytes from the start of a 512-byte sector). You are supposed to apply the patch here, to fix it up into

68 65 6C 6C 6F 20 77 6F 72 6C 64

The patches are always at predictable addresses, but in general they could land in the middle of any field set, causing massive breakage when the NTFS parser attempts to parse the USA substitute values instead of the intended bytes.

Is there a way I could temporarily patch the stream reader to return the correct bytes, or is there some other option for properly handling this kind of weakly context-dependent stream patching?
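
For reference, the fixup mechanics themselves are simple; below is a minimal sketch of applying USA fixups to a raw record, independent of hachoir's stream classes (the function name, signature, and error handling are my own):

def apply_usa_fixups(record: bytes, usa_offset: int, usa_count: int,
                     sector_size: int = 512) -> bytes:
    """Undo NTFS update sequence array fixups on a multi-sector record.

    Entry 0 of the USA is the update sequence number (USN); entries
    1..usa_count-1 hold the original bytes that were overwritten at the
    last two bytes of each sector.
    """
    usa = record[usa_offset:usa_offset + 2 * usa_count]
    usn = usa[0:2]
    fixed = bytearray(record)
    for i in range(1, usa_count):
        end = i * sector_size                  # one past the end of sector i-1
        if fixed[end - 2:end] != usn:
            raise ValueError("USN mismatch: record is corrupt")
        fixed[end - 2:end] = usa[2 * i:2 * i + 2]
    return bytes(fixed)

With the example above, the USN is 05 00 and the stored original bytes are 6F 20, so the last two bytes of the sector (05 00) are replaced by 6F 20. The open question remains where in hachoir's stream pipeline such a transformation could be hooked.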

hachoir-grep: TypeError: decoding str is not supported

$ hachoir-grep foo tests/files/*.png
Traceback (most recent call last):
  File "/home/jwilk/.local/bin/hachoir-grep", line 8, in <module>
    sys.exit(main())
  File "/home/jwilk/.local/lib/python3.8/site-packages/hachoir/grep.py", line 183, in main
    values, pattern, filenames = parseOptions()
  File "/home/jwilk/.local/lib/python3.8/site-packages/hachoir/grep.py", line 66, in parseOptions
    pattern = str(arguments[0], "ascii")
TypeError: decoding str is not supported

(originally reported by Samuel Thibault in https://bugs.debian.org/969914)
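
The str() call only accepts an encoding argument when decoding bytes; under Python 3 the command-line argument is already str, so this looks like a Python 2 leftover. A plausible fix in parseOptions() (a sketch; I haven't checked whether any caller still passes bytes):

# hachoir/grep.py, parseOptions()
pattern = arguments[0]
if isinstance(pattern, bytes):        # defensive: only decode real bytes
    pattern = pattern.decode("ascii")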

Cannot parse 32 bit bpp tga images

version 3.0a3

Tried parsing and reading metadata of a TGA image, and received the following warning:

[warn] Skip parser 'TargaFile': Unknown bits/pixel value 32

Get markers out of PNG file using hachoir

I am working on a project where I need all the tags present in PNG/JPEG and other similar file formats. I found that the hachoir-urwid utility does the job: when I open a file with hachoir-urwid, it gives all the headers/tags/data and annotates at every level.

My problem is that it gives the output in an interactive manner; I have to press Enter at every '+' to expand the inner details further.

  1. file:twitter_32x32.png: PNG picture: 32x32x24 (414 bytes)
    0) id= "\x89PNG\r\n\x1a\n": PNG identifier ('\x89PNG\r\n\x1A\n') (8 bytes)

Can someone please help me out? I am unable to find a utility that directly prints the complete output to stdout instead of an interactive screen.
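
A non-interactive alternative is to walk the field tree yourself. A minimal sketch, assuming the name/description/is_field_set attributes that hachoir fields expose (file name taken from the example above):

from hachoir.parser import createParser

def dump_fields(field_set, indent=0):
    # Print every field, recursing into sub-field sets, without the urwid UI.
    for field in field_set:
        print("  " * indent + "%s: %s" % (field.name, field.description))
        if field.is_field_set:
            dump_fields(field, indent + 1)

parser = createParser("twitter_32x32.png")
dump_fields(parser)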

Packaging scripts to be available automatically on install?

There are several useful scripts in the root of the repository, like hachoir-urwid. It would be cool if these were packaged so that they are available whenever hachoir is installed (as described here); see the sketch after the list below. Ideally:

  1. pip install hachoir3
  2. hachoir-urwid
  3. it works
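
A sketch of what this could look like with setuptools console_scripts entry points (the module paths are assumptions based on the tracebacks elsewhere in this tracker):

from setuptools import setup

setup(
    # ... name, version, packages, etc.
    entry_points={
        "console_scripts": [
            "hachoir-metadata = hachoir.metadata.main:main",
            "hachoir-urwid = hachoir.urwid_ui:main",
        ],
    },
)

pip then generates native wrappers for each entry point at install time, which would also sidestep the Windows shebang problem described further down.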

file is not closed

hachoir/parser/guess.py

def createParser(filename, real_filename=None, tags=None):
    """
    Create a parser from a file or returns None on error.

    Options:
    - file (str|io.IOBase): Input file name or
        a byte io.IOBase stream  ;
    - real_filename (str): Real file name.
    """
    if not tags:
        tags = []
    stream = FileInputStream(filename, real_filename, tags=tags)
    guess = guessParser(stream)
    if guess is None:
        stream.close()
    return guess
You should return the stream along with the guess, and let the caller close the stream.
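
Until the API changes, one mitigation is that parsers returned by createParser() can be used as context managers, which closes the underlying stream on exit (as the FLAC report below does). A minimal sketch with a placeholder file name:

from hachoir.parser import createParser
from hachoir.metadata import extractMetadata

with createParser("sample.bin") as parser:   # "sample.bin" is a placeholder
    metadata = extractMetadata(parser)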

[Feature Request] Provide "Builder" API

A Builder API allowing a user to construct binary data according to a defined set of FieldSets.

Taking an example from the docs:
https://hachoir.readthedocs.io/en/latest/developer.html#parser-with-sub-field-sets

The user should be able to create the value of the data variable roughly as follows:

stream = BuilderByteStream()
creatable = MyFormat(stream)
creatable["signature"].value = b"MYF"
creatable["count"].value = len(pointlist)
for subfieldset, point in zip(creatable["point"], pointlist):
    subfieldset["letter"].value = point["letter"]
    subfieldset["code"].value = point["code"]
data = stream.to_bytes()

How to exhaust memory with one reputably sourced image

Using hachoir commit 5b9e05a on Windows 10 x64.

Steps...

  1. Save the code below as test.py
  2. Download the only image in the BACKGROUND section from the reputable source
    https://fanart.tv/series/331821/the-looming-tower/
    (direct image link is https://fanart.tv/api/download.php?type=download&image=88183&section=1)
  3. Save the image file to the same folder as the test code as fanart.jpg (the jpg image CRC is BA866C09)
  4. Run python -V (output: Python 3.7.4, 32 bit)
  5. Run python.exe test.py

Result:

An infinite loop exhausts memory until Python crashes with MemoryError.

[warn] [/exif/content/ifd[0]] [Autofix] Fix parser error: stop parser, found unparsed segment: start 1408, length 8, found unparsed segment: start 1480, length 8, found unparsed segment: start 1552, length 8, found unparsed segment: start 1712, length 16
[warn] [/exif/content/ifd[1]] [Autofix] Fix parser error: stop parser, found unparsed segment: start 1408, length 8, found unparsed segment: start 1480, length 8, found unparsed segment: start 1552, length 8, found unparsed segment: start 1712, length 16
[warn] [/exif/content/ifd[2]] [Autofix] Fix parser error: stop parser, found unparsed segment: start 1408, length 8, found unparsed segment: start 1480, length 8, found unparsed segment: start 1552, length 8, found unparsed segment: start 1712, length 16
...
and so on...
...
[warn] [/exif/content/ifd[231]] [Autofix] Fix parser error: stop parser, found unparsed segment: start 1408, length 8, found unparsed segment: start 1480, length 8, found unparsed segment: start 1552, length 8, found unparsed segment: start 1712, length 16
...
and so on...

Test code

The test.py script is located alongside the cloned hachoir3 folder (or whatever the clone is named).

import os
import sys
  
HACHOIR_CLONE_PATH = 'hachoir3'  
  
sys.path.insert(1, os.path.abspath(os.path.join(os.path.dirname(__file__),
                                   HACHOIR_CLONE_PATH)))  
  
from hachoir import parser
from hachoir import metadata

path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'fanart.jpg'))

try:
    parser = parser.createParser(path)
    metadata = metadata.extractMetadata(parser)
except Exception as e:
    print('Unable to extract metadata %r' % e)

scripts don't work directly on Windows

λ hachoir-metadata
'hachoir-metadata' is not recognized as an internal or external command,
operable program or batch file.

but it's in PATH

λ which hachoir-metadata
/c/Users/JayXon/AppData/Local/Programs/Python/Python36/Scripts/hachoir-metadata

I have to do this:

λ python C:/Users/JayXon/AppData/Local/Programs/Python/Python36/Scripts/hachoir-metadata
Usage: hachoir-metadata [options] files

Options:
  -h, --help            show this help message and exit
  --type                Only display file type (description)
  --mime                Only display MIME type
  --level=LEVEL         Quantity of information to display from 1 to 9 (9 is
                        the maximum)
  --raw                 Raw output
  --bench               Run benchmark
  --force-parser=FORCE_PARSER
                        List all parsers then exit
  --parser-list         List all parsers then exit
  --profiler            Run profiler
  --version             Display version and exit
  --quality=QUALITY     Information quality (0.0=fastest, 1.0=best, and
                        default is 0.5)
  --maxlen=MAXLEN       Maximum string length in characters, 0 means unlimited
                        (default: 300)
  --verbose             Verbose mode
  --debug               Debug mode

I think the reason is that Windows doesn't support the #! shebang in scripts; you might have to create a batch script for this to work.
For example, a hachoir-metadata.bat file in the same directory as hachoir-metadata with something like this works for me:

@python %~dp0hachoir-metadata

Or a hachoir-metadata.bat file like this, which doesn't need the hachoir-metadata script:

@python -c "__import__('hachoir.metadata.main').metadata.main.main()"

Q: Extracting files from Win32 Cabinet Self-Extractor?

I'm attempting to extract the Microsoft Core Fonts for the Web in Python. From the "How to extract a windows cabinet file in python" Stack Overflow question, I learned about hachoir.

hachoir happily parses the self-extracting exe file, and I was able to extract something (/section_rsrc stream) that is happily accepted by cabextract. However, hachoir won't parse it:

>>> cab = createParser('rsrc.cab')
[warn] Skip parser 'CabFile': Invalid magic

Stripping all data before the CAB header using a hex editor, I can get hachoir to parse the CAB file. Does hachoir offer the means to extract this CAB file, without leading/trailing data? Or is that something that I need to look up in a Microsoft specification document?

Same question for extracting the files from the CAB file; does hachoir offer the abstraction level to do this?

Thanks!
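
A workaround sketch that sidesteps hachoir entirely: locate the CAB header magic in the self-extracting exe and slice it out. The CAB header stores the total cabinet size at offset 8 (little-endian uint32), which also drops the trailing data. File names here are placeholders:

import struct

with open("fonts.exe", "rb") as f:       # the SFX executable
    data = f.read()

start = data.find(b"MSCF")               # CFHEADER signature
if start == -1:
    raise ValueError("no CAB header found")

# cbCabinet: total size of the cabinet file, at offset 8 of the header
(cab_size,) = struct.unpack_from("<I", data, start + 8)

with open("rsrc.cab", "wb") as f:
    f.write(data[start:start + cab_size])

For actually unpacking the files, hachoir parses the CAB structure but is not an extractor; handing the trimmed rsrc.cab to cabextract, as you already did, is likely the simplest route.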

example code

Is there any example code for the hachoir Python library?
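
A minimal end-to-end sketch, using only the calls that appear elsewhere in this tracker (the file name is a placeholder):

from hachoir.parser import createParser
from hachoir.metadata import extractMetadata

parser = createParser("example.mp4")
if parser:                                # createParser returns None on error
    metadata = extractMetadata(parser)
    if metadata:
        for line in metadata.exportPlaintext():
            print(line)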

Error running on macOS

I'm running:
Python 3.7.0 (default, Jul 23 2018, 20:22:55)
macOS 10.13.6 (17G65)
bash

I did:
pip3 install hachoir-wx
(which was successful)

But running it I get:

  File "/usr/local/bin/hachoir-wx", line 27
    print "%s version %s" % (PACKAGE, VERSION)
                        ^
SyntaxError: invalid syntax

Any ideas? Thanks

[bug] MemoryErrors when using subfile

Error output:

$ python3 -m hachoir.subfile DSN-CTL-V23R01.exe 
[+] Start search on 6475848 bytes (6.2 MB)

[+] File at 0 size=57344 (56.0 KB): Microsoft Windows Portable Executable: Intel 80386, Windows GUI
[!] Memory error!

[+] End of search -- offset=524288 (512.0 KB)
Total time: 676 ms -- global rate: 756.7 KB/sec

Can be reproduced with https://dsn-ctl.fr/DSN-CTL-V23R01.exe

$ wget https://dsn-ctl.fr/DSN-CTL-V23R01.exe
--2023-07-04 13:19:30--  https://dsn-ctl.fr/DSN-CTL-V23R01.exe
Resolving dsn-ctl.fr (dsn-ctl.fr)... 85.236.158.186
Connecting to dsn-ctl.fr (dsn-ctl.fr)|85.236.158.186|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6475848 (6.2M) [application/x-msdownload]
Saving to: ‘DSN-CTL-V23R01.exe’

DSN-CTL-V23R01.exe          100%[==========================================>]   6.18M  3.71MB/s    in 1.7s    

2023-07-04 13:19:32 (3.71 MB/s) - ‘DSN-CTL-V23R01.exe’ saved [6475848/6475848]

$ shasum -a256 DSN-CTL-V23R01.exe 
b4e855f92c4ae8cec77b9ccaf8b6e0cf53134eb47f5e668980e20afdc149d99f  DSN-CTL-V23R01.exe

Host info:

$ uname -rvm
6.2.6-76060206-generic #202303130630~1685473338~22.04~995127e SMP PREEMPT_DYNAMIC Tue M x86_64

Python version:

$ python3 -V
Python 3.10.6

Hachoir version:

$ pip show hachoir
Name: hachoir
Version: 3.2.0
Summary: Package of Hachoir parsers used to open binary files
Home-page: http://hachoir.readthedocs.io/
Author: Hachoir team (see AUTHORS file)
Author-email: 
License: GNU GPL v2
Location: /home/agrajag9/.local/lib/python3.10/site-packages
Requires: 
Required-by: 

Extracting MP4File metadata seems to load the entire file

When extracting simple info like width and height from an MP4File, hachoir seems to parse the entire file before yielding the info.
Dimensions, and probably duration too, should be easily accessible at the start of the file.
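
A hedged workaround to try: extractMetadata accepts a quality argument mirroring the CLI's --quality option; whether a low quality actually short-circuits the MP4 parse is an assumption to verify:

from hachoir.parser import createParser
from hachoir.metadata import extractMetadata

with createParser("video.mp4") as parser:            # placeholder file name
    metadata = extractMetadata(parser, quality=0.0)  # 0.0 = fastest
print(metadata.get("width"), metadata.get("height"))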

Is this the new official repo?

Some posts on Stack Overflow suggested using hachoir-wx.
Those were linked to Bitbucket: https://bitbucket.org/haypo/hachoir/wiki/hachoir-wx - this repository is down.
Then, according to https://directory.fsf.org/wiki/Hachoir_project-_hachoir_wx, there was a website http://www.hachoir.org/wiki/hachoir-wx that is also down.
So is this now the official repo, or is this some fork project?

I tried to run hachoir-wx and nothing happened after pythonw started. May I suggest adding at least some print output when wxPython is not installed, so the user knows to install it?

Cannot display FAT structure on mtools-generated image

Hachoir (from Debian 3.1.0+dfsg-3) can parse the structure of a FAT image created by mkfs.msdos (showing the first 0x200 bytes as Boot, then both FAT copies, and the root directory), but on an image created by mtools (specifically mformat -i "test-mtools.img" ::.) it reports a MasterBootRecord partition table followed by only RawBytes, whereas those should be identified as a FAT.

Could the presence of a non-empty partition table be causing two different dissectors to be used?

CI broken: test_random_stream() fails

Hum, previously I configured Travis CI to only send email notifications to me. It seems like @nneonneo broke a test but didn't get a notification.

@nneonneo: can you please try to run "tox" to run tests before pushing a change?

According to git bisect, the regression was introduced by commit b0306e6.

Parse FLAC file failed

Ubuntu 16.04
Python 3.8

Flac file: https://files.catbox.moe/gafg3k.flac
file_hash: 138ae53711c6ec55ee88c9e8f54c846e469649c1bc16d5011786b1d70d143828

In [21]: with hachoir.parser.createParser("./gafg3k.flac") as parser:
    ...:      result = hachoir.metadata.extractMetadata(parser)
    ...:

[warn] [/metadata] Duplicate field name Key 'stream_info' already exists
...
...
...
[warn] [/metadata] Duplicate field name Key 'stream_info' already exists
---------------------------------------------------------------------------
UniqKeyError                              Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/hachoir/field/generic_field_set.py in _addField(self, field)
    193         try:
--> 194             self._fields.append(field._name, field)
    195         except UniqKeyError as err:

/usr/local/lib/python3.8/dist-packages/hachoir/core/dict.py in append(self, key, value)
     66         if key in self._index:
---> 67             raise UniqKeyError("Key '%s' already exists" % key)
     68         self._index[key] = len(self._value_list)^C

UniqKeyError: Key 'stream_info' already exists

During handling of the above exception, another exception occurred:

KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-21-a7157485c236> in <module>
      1 with hachoir.parser.createParser("./gafg3k.flac") as parser:
----> 2      result = hachoir.metadata.extractMetadata(parser)

KeyboardInterrupt


Update minimum Python version?

The README currently says the library supports Python 3.3+ (end of life 2017-09-29). If the minimum is raised to at least 3.5, we could make use of type annotations, which can simplify docstrings and make the library more IDE-friendly. 3.6 would be a minor improvement over 3.5, since f-strings offer a less verbose alternative to the % formatting currently used in several places; see the comparison below.
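
For instance, reusing the line quoted in the macOS report above (values are examples):

PACKAGE, VERSION = "hachoir", "3.3.0"
print("%s version %s" % (PACKAGE, VERSION))   # current % style
print(f"{PACKAGE} version {VERSION}")         # 3.6+ f-string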

Timezone information in creation_date metadata

Hi, I've noticed that the datetime object returned by extractMetadata(parser).get('creation_date') does not contain timezone information, and I was not able to find it anywhere else in the extracted metadata. Is this information actually stored in the movie file but ignored by the library?
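
A quick way to confirm the observation (the file name is a placeholder):

from hachoir.parser import createParser
from hachoir.metadata import extractMetadata

with createParser("movie.mp4") as parser:
    creation = extractMetadata(parser).get("creation_date")
print(creation, creation.tzinfo)   # tzinfo is None: the datetime is naive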

guess.py:createParser() failure leaves FileInputStream open (Version 3.1.2)

Calling guess.py:createParser() with an invalid file type leaves an open FileInputStream object.
Code is:

     stream = FileInputStream(filename, real_filename, tags=tags)
     return guessParser(stream)

Probably should be something like:

    stream = FileInputStream(filename, real_filename, tags=tags)
    guess = guessParser(stream)
    if not guess:
        stream.close()
    return guess

Specify `extras_require` in setup.py

Currently hachoir-urwid and maybe other utilities have unstated external dependencies. For hachoir-urwid this is a problem because doing a plain pip install urwid will install version 2.x which seems to be incompatible (see #34). If we pass something like

extras_require={
    'urwid': [
        'urwid==1.3.1',
    ],
}

to the setup method in setup.py, then users could pip install hachoir[urwid] and be assured they get a tried and tested version.

Saving sections broken with hachoir-urwid

To reproduce:

  1. Execute hachoir-urwid to point to some binary
  2. Highlight a section and press C-e to save it
  3. Enter filename and press enter
  4. Program terminates, leaving an empty file and the traceback:
Traceback (most recent call last):
  File "cmd.py", line 4, in <module>
    main()
  File "/home/chrahunt/.local/lib/python3.5/site-packages/hachoir/urwid_ui.py", line 831, in main
    "display_value": values.display_value,
  File "/home/chrahunt/.local/lib/python3.5/site-packages/hachoir/urwid_ui.py", line 740, in exploreFieldSet
    ui.run_wrapper(run)
  File "/usr/local/lib/python3.5/dist-packages/urwid/display_common.py", line 763, in run_wrapper
    return fn()
  File "/home/chrahunt/.local/lib/python3.5/site-packages/hachoir/urwid_ui.py", line 663, in run
    e = top.keypress(size, e)
  File "/usr/local/lib/python3.5/dist-packages/urwid/container.py", line 1116, in keypress
    return self.footer.keypress((maxcol,),key)
  File "/usr/local/lib/python3.5/dist-packages/urwid/container.py", line 1587, in keypress
    key = self.focus.keypress(tsize, key)
  File "/home/chrahunt/.local/lib/python3.5/site-packages/hachoir/urwid_ui.py", line 533, in keypress
    self._done(self.get_edit_text())
  File "/home/chrahunt/.local/lib/python3.5/site-packages/hachoir/urwid_ui.py", line 224, in <lambda>
    raise NeedInput(lambda path: self.save_field(path, key == 'ctrl e'),
  File "/home/chrahunt/.local/lib/python3.5/site-packages/hachoir/urwid_ui.py", line 388, in save_field
    copyfileobj(stream.file(), os.fdopen(fd, 'wb'))
  File "/usr/lib/python3.5/shutil.py", line 73, in copyfileobj
    buf = fsrc.read(length)
  File "/home/chrahunt/.local/lib/python3.5/site-packages/hachoir/stream/input.py", line 85, in read
    if size is None or None < self._size < pos + size:
TypeError: unorderable types: NoneType() < int()

pointing to here.

Empty strings?

At string_field.py:142, we find the following:

             if not (1 <= nbytes <= 0xffff):

Empty strings occur often in real files; think, for example, of empty strings for constants (where I first encountered this issue). Although it seems odd to have a zero-length field, Hachoir seems to deal with this fine.

Any objections to me changing the line to

            if not (0 <= nbytes <= 0xffff):

?

pip install hachoir[urwid] fails on "recent" setuptools versions

hachoir[urwid] requires urwid==1.3.1 (released on 2015-11-02)
This produces distutils.errors.DistutilsSetupError: use_2to3 is invalid.
use_2to3 was indeed removed from setuptools 58 (released on 2021-09-05)

Software environment:

  • KDE neon 6.0 (based on Ubuntu 22.04)
  • Python 3.10.12

Commands run:

$ python3 -m venv env-hachoir
$ cd env-hachoir
$ . bin/activate
$ pip install hachoir
Successfully installed hachoir-3.3.0
$ pip install hachoir[urwid]
Requirement already satisfied: hachoir[urwid] in ./lib/python3.10/site-packages (3.3.0)
Collecting urwid==1.3.1 (from hachoir[urwid])
  Downloading urwid-1.3.1.tar.gz (588 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  × python setup.py egg_info did not run successfully.
    exit code: 1
      Traceback (most recent call last):
      (...)
      File "/home/(...)/env-hachoir/lib/python3.10/site-packages/setuptools/dist.py", line 139, in invalid_unless_false
          raise DistutilsSetupError(f"{attr} is invalid.")
      distutils.errors.DistutilsSetupError: use_2to3 is invalid.
(...)

A workaround (or solution) is to require a more recent (or most recent) version of urwid.

hachoir-urwid works fine for me with the latest urwid (2.6.7).

How do I add new fields in the metadata?

As titled: is there any way to add new fields in the file metadata? The example below does not work:

# ....
parser = createParser(file_content)
md5_sum = md5(file_content.getbuffer()).hexdigest()
metadata = extractMetadata(parser)
if metadata:
    md5_field = MissingField("md5", str(md5_sum))
    metadata.add(md5_field)
    metadata = self._list_to_dict(metadata.exportPlaintext())
    return metadata

Warnings do not appear when using urwid 2.x

When running hachoir-urwid with urwid 2.0.1, any warnings that typically appear at the bottom of the interface (below the tab toolbar) instead appear as blank lines. I have highlighted these lines in my terminal and tried to copy them but they do not appear to contain any text.

I am using the latest version of hachoir from master.

Running hachoir-urwid tests/files/mev.64bit.big.elf:

With urwid 1.3.1: (screenshot: warning text visible below the tab toolbar)

With urwid 2.0.1: (screenshot: blank lines where the warnings should appear)

time for a new version?

It's been more than a year since the last release, and there have been significant fixes since then, for example b547efa. Maybe it's time for a new release?

hachoir-metadata returns "Company" field named as "NumWords"

I have file "o_gosmonitor_.doc" as example it includes name of organization "ИВЦ Минприроды" (russian) and when I use hachoir-metadata against this file I see

But it affects any .doc file
`
λ hachoir-metadata o_gosmonitor_.doc
Metadata:

  • Author: ADronova
  • Nb page: 7
  • Creation date: 2015-03-25 08:01:00
  • Last modification: 2015-03-25 08:03:00
  • Producer: Microsoft Office Word
  • Comment: NumWords: ИВЦ Минприроды
  • Comment: Keywords: 152
  • Comment: Comments: 43
  • Comment: Thumbnail: 21516
  • Comment: 23: 786432
  • Comment: LastPrinted: False
  • Comment: NumCharacters: False
  • Comment: Security: False
  • Comment: 22: False
  • Comment: Encrypted: False
  • Comment: Template: Normal
  • Comment: RevisionNumber: 1
  • Comment: TotalEditingTime: 0:02:00
  • Comment: NumWords: 3217
  • Comment: NumCharacters: 18342
  • MIME type: application/msword
  • Endianness: Little endian

Py2 compatible versions removed from pypi?

1.3.3 was the latest version that worked on Python 2, and I have been using it for years with thousands of users.
Did someone remove the hachoir-core, hachoir-parser, and hachoir-metadata packages from PyPI for a reason?

3.x is obviously Python 3 only.

Detection failure of embedded RAR files created with RAR 5.0 archive format

Hello,

It seems that I cannot successfully detect embedded RAR files created using the RAR 5.0 archive format. After a brief review of rar.py, I believe this is because of the slight difference in the file format used by the newer archiver.

Newer versions of WinRAR use a slightly different file magic for RAR files; a detection sketch covering both variants follows at the end of this report.

; RAR archive version 1.50 onwards
52 61 72 21 1A 07 00
; RAR archive version 5.0 onwards
52 61 72 21 1A 07 01 00

Quick testing...

; Embedded RAR inside SFX file successfully detected. Created with RAR archiver < 5.0
0002fff0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00030000  52 61 72 21 1a 07 00 cf  90 73 00 00 0d 00 00 00  |Rar!.....s......|
00030010  00 00 00 00 08 b9 7a 00  80 23 00 a8 00 00 00 38  |......z..#.....8|

; Embedded RAR inside SFX file NOT detected. Created with RAR archiver >= 5.0
00072bf0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00072c00  52 61 72 21 1a 07 01 00  2c 10 d9 3c 0b 01 05 07  |Rar!....,..<....|
00072c10  00 06 01 01 e4 ce 81 00  6a c4 39 9b 13 03 02 83  |........j.9.....|

To test independently, create a self extracting RAR file with an archiver version >= 5.0. The latest version of WinRAR (software version 5.50) uses this by default now.

   1. WinRAR and command line RAR use RAR 5.0 archive format by default.
      You can change it to RAR 4.x compatible format with "RAR4" option
      in archiving dialog or -ma4 command line switch.
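
For reference, the detection sketch mentioned above, covering both magics (plain Python, not hachoir's actual parser code; the function name is mine):

RAR4_MAGIC = b"Rar!\x1a\x07\x00"       # RAR 1.50 through 4.x
RAR5_MAGIC = b"Rar!\x1a\x07\x01\x00"   # RAR 5.0 onwards

def find_embedded_rar(data: bytes) -> int:
    """Return the offset of the first embedded RAR header, or -1."""
    hits = [o for o in (data.find(RAR4_MAGIC), data.find(RAR5_MAGIC))
            if o != -1]
    return min(hits, default=-1)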

[Feature Request / Enhancement] Extract metadata from MTS

Command:
hachoir-metadata /..../00014.MTS

Expected result:
Metadata extracted correctly

Actual result:
[err!] [<MPEG_TS>] Hachoir can't extract metadata, but is able to parse: /..../00014.MTS

Sorry for the poor feature request.

Truncated jpeg memory use

A truncated jpeg can have a JpegImageData field with no terminator, which is created without a known size.
Because the size isn't known, the corrupted JpegImageData must be parsed in full to calculate its size when the field is added to its parent JpegFile during JpegFile parsing. This forces simple operations that don't care about the JpegImageData, like checking whether a field with a given name is in the JpegFile, to parse the corrupted JpegImageData fully. Parsing the corrupted section can blow up the memory use of the parser as it tries to parse the entire rest of the file in small chunks.

An example file that causes this issue can be found here: https://github.com/CybercentreCanada/assemblyline-service-characterize/issues/12.
This jpeg truncated to 500 000 bytes consumes approximately 1 GB of memory parsing JpegHuffmanUnits until it reaches the end of the file and errors. This happens when extracting metadata, or when checking for any field name that isn't in the jpeg.
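
A sketch of reproducing the blow-up (the path is a placeholder for a truncated sample, and "comment" is an illustrative field name; any membership test on an absent field triggers the same full walk):

from hachoir.parser import createParser

parser = createParser("truncated.jpg")
# The unsized JpegImageData must be parsed to the end of the file
# before hachoir can answer the membership test:
"comment" in parser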

Compatibility PY2

Hello,

Given that only a few commits would be needed to fix this, please reconsider maintaining compatibility with Python 2, as there are many Python 2 installs that would benefit from hachoir updates.

Thank you.

Incorrect website data on https://pypi.python.org/pypi/hachoir-metadata/1.3.3

The PyPI entry for https://pypi.python.org/pypi/hachoir-metadata/1.3.3 shows the Bitbucket link as http://bitbucket.org/haypo/hachoir/wiki/hachoir-metadata, which leads to a 404.

I don't know if it's an out-of-date project that was mimicking your hachoir-metadata or if it's your PyPI entry with the wrong site. Either way, I figured I'd let you know about it.

Edit: I found the same bug on http://hachoir3.readthedocs.io/. Perhaps it's a deprecated Bitbucket account then?
