
hachoir's Issues

hachoir-metadata parses and outputs incorrect duration metadata for ogv files

For some reason, hachoir-metadata has been producing bad duration output for ogv files. I've been testing with the "Computer Chronicles" collection on archive.org. To reproduce:

$ wget https://archive.org/download/CC517_commodore_64/CC517_commodore_64.ogv

from hachoir import parser as hachoir_parser
from hachoir import metadata as hachoir_metadata

video_file = 'CC517_commodore_64.ogv'
parser = hachoir_parser.createParser(video_file, video_file)
hachoir_metadata.extractMetadata(parser).exportPlaintext()

['Common:', '- Title: Commodore 64', '- Duration: 1 min 14 sec 423 ms', '- Location: http://www.archive.org/details/CC517_commodore_64', '- Copyright: http://creativecommons.org/licenses/by-nc-nd/2.0/', '- Producer: Xiph.Org libTheora I 20081020 3 2 1', '- MIME type: video/theora', '- Endianness: Little endian', 'Video:', '- Image width: 400 pixels', '- Image height: 300 pixels', '- Pixel format: 4:2:0', '- Compression: Theora', '- Frame rate: 30.0 fps', '- Comment: Quality: 0', '- Format version: Theora version 3.2 (revision 1)', 'Audio:', '- Channel: stereo', '- Sample rate: 44.1 kHz', '- Compression: Vorbis', '- Format version: Vorbis version 0']

hachoir-metadata reports that the duration is 1 minute and 14 seconds, but if you open the file in VLC and watch it, it's actually 28 minutes and 31 seconds.

Using Python 3.8.2 and hachoir 3.1.1.

Please release 3.1.3 (or 3.1.3a1)

I am aware that this project is unmaintained, but my project depends on the fixes introduced following #65. I was planning on simply using a URL in my dependency specification (https://github.com/vstinner/hachoir/archive/8000dbeb9aad587e8dc5be8202796cdfb67f899e.zip), but PyPI will not accept such a dependency.

I was hoping you could release a new version of hachoir including these fixes, if you can find the time. Alternatively, I can fork the project and publish it as e.g. hachoir-reloaded. I'd rather not fork though, if that can be avoided.

Rename Hachoir3 project to Hachoir?

The original Python2-only "Hachoir" project hosted on Bitbucket didn't get much love:

  • last commit: 2015-10-26
  • latest hachoir-core release (hachoir-core 1.3.3): 2010-02-26

I propose to rename Hachoir3 to Hachoir.

Hopefully, releases of the old Python2-only Hachoir project can co-exist, since they were published under different names (hachoir-core, hachoir-metadata, etc.).

Support NTFS USA stream patching

For the NTFS parser I recently learned about the "update sequence array", which is a set of binary patches applied to specific offsets in the stream. Essentially, suppose you have a USA value of "05 00, 6F 20". Then, later in the file, you might see

68 65 6C 6C 05 00 77 6F 72 6C 64

(with the 05 00 at a specific offset - 510 bytes from the start of a 512-byte sector). You are supposed to apply the patch here, to fix it up into

68 65 6C 6C 6F 20 77 6F 72 6C 64

The patches are always at predictable addresses, but in general they could land in the middle of any field set, causing massive breakage when the NTFS parser attempts to parse the USA substitute values instead of the intended bytes.

Is there a way I could temporarily patch the stream reader to return the correct bytes, or is there some other option for properly handling this kind of weakly context-dependent stream patching?
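
For reference, the fixup mechanics themselves are simple; below is a minimal sketch of applying USA fixups to a raw record, independent of hachoir's stream classes (the function name, signature, and error handling are my own):

def apply_usa_fixups(record: bytes, usa_offset: int, usa_count: int,
                     sector_size: int = 512) -> bytes:
    """Undo NTFS update sequence array fixups on a multi-sector record.

    Entry 0 of the USA is the update sequence number (USN); entries
    1..usa_count-1 hold the original bytes that were overwritten at the
    last two bytes of each sector.
    """
    usa = record[usa_offset:usa_offset + 2 * usa_count]
    usn = usa[0:2]
    fixed = bytearray(record)
    for i in range(1, usa_count):
        end = i * sector_size                  # one past the end of sector i-1
        if fixed[end - 2:end] != usn:
            raise ValueError("USN mismatch: record is corrupt")
        fixed[end - 2:end] = usa[2 * i:2 * i + 2]
    return bytes(fixed)

With the example above, the USN is 05 00 and the stored original bytes are 6F 20, so the last two bytes of the sector (05 00) are replaced by 6F 20. The open question remains where in hachoir's stream pipeline such a transformation could be hooked.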

hachoir-grep: TypeError: decoding str is not supported

$ hachoir-grep foo tests/files/*.png
Traceback (most recent call last):
  File "/home/jwilk/.local/bin/hachoir-grep", line 8, in <module>
    sys.exit(main())
  File "/home/jwilk/.local/lib/python3.8/site-packages/hachoir/grep.py", line 183, in main
    values, pattern, filenames = parseOptions()
  File "/home/jwilk/.local/lib/python3.8/site-packages/hachoir/grep.py", line 66, in parseOptions
    pattern = str(arguments[0], "ascii")
TypeError: decoding str is not supported

(originally reported by Samuel Thibault in https://bugs.debian.org/969914)
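
The str() call only accepts an encoding argument when decoding bytes; under Python 3 the command-line argument is already str, so this looks like a Python 2 leftover. A plausible fix in parseOptions() (a sketch; I haven't checked whether any caller still passes bytes):

# hachoir/grep.py, parseOptions()
pattern = arguments[0]
if isinstance(pattern, bytes):        # defensive: only decode real bytes
    pattern = pattern.decode("ascii")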

Cannot parse 32 bit bpp tga images

version 3.0a3

Tried parsing and reading metadata of a TGA image, and received the following warning:

[warn] Skip parser 'TargaFile': Unknown bits/pixel value 32

Get markers out of PNG file using hachoir

I am working on a project where I need all the tags present in PNG/JPEG and other similar file formats. I found that the hachoir-urwid utility does the job: when I open a file with hachoir-urwid, it gives all the headers/tags/data and annotates at every level.

My problem is that it gives the output in an interactive manner; I have to press Enter at every '+' to expand the inner details further.

  1. file:twitter_32x32.png: PNG picture: 32x32x24 (414 bytes)
    0) id= "\x89PNG\r\n\x1a\n": PNG identifier ('\x89PNG\r\n\x1A\n') (8 bytes)

Can someone please help me out? I am unable to find a utility that directly prints the complete output to stdout instead of an interactive screen.
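
A non-interactive alternative is to walk the field tree yourself. A minimal sketch, assuming the name/description/is_field_set attributes that hachoir fields expose (file name taken from the example above):

from hachoir.parser import createParser

def dump_fields(field_set, indent=0):
    # Print every field, recursing into sub-field sets, without the urwid UI.
    for field in field_set:
        print("  " * indent + "%s: %s" % (field.name, field.description))
        if field.is_field_set:
            dump_fields(field, indent + 1)

parser = createParser("twitter_32x32.png")
dump_fields(parser)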

Packaging scripts to be available automatically on install?

There are several useful scripts in the root of the repository, like hachoir-urwid. It would be cool if these were packaged so that they are available whenever hachoir is installed (as described here); see the sketch after the list below. Ideally:

  1. pip install hachoir3
  2. hachoir-urwid
  3. it works
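
A sketch of what this could look like with setuptools console_scripts entry points (the module paths are assumptions based on the tracebacks elsewhere in this tracker):

from setuptools import setup

setup(
    # ... name, version, packages, etc.
    entry_points={
        "console_scripts": [
            "hachoir-metadata = hachoir.metadata.main:main",
            "hachoir-urwid = hachoir.urwid_ui:main",
        ],
    },
)

pip then generates native wrappers for each entry point at install time, which would also sidestep the Windows shebang problem described further down.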

file is not closed

hachoir/parser/guess.py

def createParser(filename, real_filename=None, tags=None):
    """
    Create a parser from a file or returns None on error.

    Options:
    - file (str|io.IOBase): Input file name or
        a byte io.IOBase stream  ;
    - real_filename (str): Real file name.
    """
    if not tags:
        tags = []
    stream = FileInputStream(filename, real_filename, tags=tags)
    guess = guessParser(stream)
    if guess is None:
        stream.close()
    return guess
You should return the stream along with the guess, and let the caller close the stream.
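
Until the API changes, one mitigation is that parsers returned by createParser() can be used as context managers, which closes the underlying stream on exit (as the FLAC report below does). A minimal sketch with a placeholder file name:

from hachoir.parser import createParser
from hachoir.metadata import extractMetadata

with createParser("sample.bin") as parser:   # "sample.bin" is a placeholder
    metadata = extractMetadata(parser)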

[Feature Request] Provide "Builder" API

A Builder API allowing a user to construct binary data according to a defined set of FieldSets.

Taking an example from the docs:
https://hachoir.readthedocs.io/en/latest/developer.html#parser-with-sub-field-sets

The user should be able to create the value of the data variable roughly as follows:

stream = BuilderByteStream()
creatable = MyFormat(stream)
creatable["signature"].value = b"MYF"
creatable["count"].value = len(pointlist)
for subfieldset, point in zip(creatable["point"], pointlist):
    subfieldset["letter"].value = point["letter"]
    subfieldset["code"].value = point["code"]
data = stream.to_bytes()

How to exhaust memory with one reputably sourced image

Using hachoir commit 5b9e05a on Windows 10 x64.

Steps...

  1. Save the code below as test.py
  2. Download the only image in the BACKGROUND section from the reputable source
    https://fanart.tv/series/331821/the-looming-tower/
    (direct image link is https://fanart.tv/api/download.php?type=download&image=88183&section=1)
  3. Save the image file to the same folder as the test code as fanart.jpg (the jpg image CRC is BA866C09)
  4. Run python -V (output: Python 3.7.4, 32 bit)
  5. Run python.exe test.py

Result:

An infinite loop exhausts memory until Python crashes with MemoryError.

[warn] [/exif/content/ifd[0]] [Autofix] Fix parser error: stop parser, found unparsed segment: start 1408, length 8, found unparsed segment: start 1480, length 8, found unparsed segment: start 1552, length 8, found unparsed segment: start 1712, length 16
[warn] [/exif/content/ifd[1]] [Autofix] Fix parser error: stop parser, found unparsed segment: start 1408, length 8, found unparsed segment: start 1480, length 8, found unparsed segment: start 1552, length 8, found unparsed segment: start 1712, length 16
[warn] [/exif/content/ifd[2]] [Autofix] Fix parser error: stop parser, found unparsed segment: start 1408, length 8, found unparsed segment: start 1480, length 8, found unparsed segment: start 1552, length 8, found unparsed segment: start 1712, length 16
...
and so on...
...
[warn] [/exif/content/ifd[231]] [Autofix] Fix parser error: stop parser, found unparsed segment: start 1408, length 8, found unparsed segment: start 1480, length 8, found unparsed segment: start 1552, length 8, found unparsed segment: start 1712, length 16
...
and so on...

Test code

The test.py script is located alongside the cloned hachoir3 folder (or whatever the clone is named).

import os
import sys
  
HACHOIR_CLONE_PATH = 'hachoir3'  
  
sys.path.insert(1, os.path.abspath(os.path.join(os.path.dirname(__file__),
                                   HACHOIR_CLONE_PATH)))  
  
from hachoir import parser
from hachoir import metadata

path = os.path.abspath(os.path.join(os.path.dirname(__file__), 'fanart.jpg'))

try:
    parser = parser.createParser(path)
    metadata = metadata.extractMetadata(parser)
except Exception as e:
    print('Unable to extract metadata %r' % e)

scripts don't work directly on Windows

λ hachoir-metadata
'hachoir-metadata' is not recognized as an internal or external command,
operable program or batch file.

but it's in PATH

λ which hachoir-metadata
/c/Users/JayXon/AppData/Local/Programs/Python/Python36/Scripts/hachoir-metadata

I have to do this:

λ python C:/Users/JayXon/AppData/Local/Programs/Python/Python36/Scripts/hachoir-metadata
Usage: hachoir-metadata [options] files

Options:
  -h, --help            show this help message and exit
  --type                Only display file type (description)
  --mime                Only display MIME type
  --level=LEVEL         Quantity of information to display from 1 to 9 (9 is
                        the maximum)
  --raw                 Raw output
  --bench               Run benchmark
  --force-parser=FORCE_PARSER
                        List all parsers then exit
  --parser-list         List all parsers then exit
  --profiler            Run profiler
  --version             Display version and exit
  --quality=QUALITY     Information quality (0.0=fastest, 1.0=best, and
                        default is 0.5)
  --maxlen=MAXLEN       Maximum string length in characters, 0 means unlimited
                        (default: 300)
  --verbose             Verbose mode
  --debug               Debug mode

I think the reason is that Windows doesn't support the #! shebang in scripts; you might have to create a batch script for this to work.
For example, a hachoir-metadata.bat file in the same directory as hachoir-metadata with something like this works for me:

@python %~dp0hachoir-metadata

Or a hachoir-metadata.bat file like this, which doesn't need the hachoir-metadata script:

@python -c "__import__('hachoir.metadata.main').metadata.main.main()"

Q: Extracting files from Win32 Cabinet Self-Extractor?

I'm attempting to extract the Microsoft Core Fonts for the Web in Python. From the "How to extract a windows cabinet file in python" Stack Overflow question, I learned about hachoir.

hachoir happily parses the self-extracting exe file, and I was able to extract something (/section_rsrc stream) that is happily accepted by cabextract. However, hachoir won't parse it:

>>> cab = createParser('rsrc.cab')
[warn] Skip parser 'CabFile': Invalid magic

Stripping all data before the CAB header using a hex editor, I can get hachoir to parse the CAB file. Does hachoir offer the means to extract this CAB file, without leading/trailing data? Or is that something that I need to look up in a Microsoft specification document?

Same question for extracting the files from the CAB file; does hachoir offer the abstraction level to do this?

Thanks!
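
A workaround sketch that sidesteps hachoir entirely: locate the CAB header magic in the self-extracting exe and slice it out. The CAB header stores the total cabinet size at offset 8 (little-endian uint32), which also drops the trailing data. File names here are placeholders:

import struct

with open("fonts.exe", "rb") as f:       # the SFX executable
    data = f.read()

start = data.find(b"MSCF")               # CFHEADER signature
if start == -1:
    raise ValueError("no CAB header found")

# cbCabinet: total size of the cabinet file, at offset 8 of the header
(cab_size,) = struct.unpack_from("<I", data, start + 8)

with open("rsrc.cab", "wb") as f:
    f.write(data[start:start + cab_size])

For actually unpacking the files, hachoir parses the CAB structure but is not an extractor; handing the trimmed rsrc.cab to cabextract, as you already did, is likely the simplest route.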

example code

Is there any example code for the hachoir Python library?
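
A minimal end-to-end sketch, using only the calls that appear elsewhere in this tracker (the file name is a placeholder):

from hachoir.parser import createParser
from hachoir.metadata import extractMetadata

parser = createParser("example.mp4")
if parser:                                # createParser returns None on error
    metadata = extractMetadata(parser)
    if metadata:
        for line in metadata.exportPlaintext():
            print(line)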

Error running on macOS

I'm running:
Python 3.7.0 (default, Jul 23 2018, 20:22:55)
macOS 10.13.6 (17G65)
bash

I did:
pip3 install hachoir-wx
(which was successful)

But running it I get:

  File "/usr/local/bin/hachoir-wx", line 27
    print "%s version %s" % (PACKAGE, VERSION)
                        ^
SyntaxError: invalid syntax

Any ideas? Thanks

[bug] MemoryErrors when using subfile

Error output:

$ python3 -m hachoir.subfile DSN-CTL-V23R01.exe 
[+] Start search on 6475848 bytes (6.2 MB)

[+] File at 0 size=57344 (56.0 KB): Microsoft Windows Portable Executable: Intel 80386, Windows GUI
[!] Memory error!

[+] End of search -- offset=524288 (512.0 KB)
Total time: 676 ms -- global rate: 756.7 KB/sec

Can be reproduced with https://dsn-ctl.fr/DSN-CTL-V23R01.exe

$ wget https://dsn-ctl.fr/DSN-CTL-V23R01.exe
--2023-07-04 13:19:30--  https://dsn-ctl.fr/DSN-CTL-V23R01.exe
Resolving dsn-ctl.fr (dsn-ctl.fr)... 85.236.158.186
Connecting to dsn-ctl.fr (dsn-ctl.fr)|85.236.158.186|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6475848 (6.2M) [application/x-msdownload]
Saving to: ‘DSN-CTL-V23R01.exe’

DSN-CTL-V23R01.exe          100%[==========================================>]   6.18M  3.71MB/s    in 1.7s    

2023-07-04 13:19:32 (3.71 MB/s) - ‘DSN-CTL-V23R01.exe’ saved [6475848/6475848]

$ shasum -a256 DSN-CTL-V23R01.exe 
b4e855f92c4ae8cec77b9ccaf8b6e0cf53134eb47f5e668980e20afdc149d99f  DSN-CTL-V23R01.exe

Host info:

$ uname -rvm
6.2.6-76060206-generic #202303130630~1685473338~22.04~995127e SMP PREEMPT_DYNAMIC Tue M x86_64

Python version:

$ python3 -V
Python 3.10.6

Hachoir version:

$ pip show hachoir
Name: hachoir
Version: 3.2.0
Summary: Package of Hachoir parsers used to open binary files
Home-page: http://hachoir.readthedocs.io/
Author: Hachoir team (see AUTHORS file)
Author-email: 
License: GNU GPL v2
Location: /home/agrajag9/.local/lib/python3.10/site-packages
Requires: 
Required-by: 

Extracting MP4File metadata seems to load the entire file

When extracting simple info like width and height from an MP4File, hachoir seems to parse the entire file before yielding the info.
Dimensions, and probably duration too, should be easily accessible at the start of the file.
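
A hedged workaround to try: extractMetadata accepts a quality argument mirroring the CLI's --quality option; whether a low quality actually short-circuits the MP4 parse is an assumption to verify:

from hachoir.parser import createParser
from hachoir.metadata import extractMetadata

with createParser("video.mp4") as parser:            # placeholder file name
    metadata = extractMetadata(parser, quality=0.0)  # 0.0 = fastest
print(metadata.get("width"), metadata.get("height"))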

Is this the new official repo?

Some posts on Stack Overflow suggested using hachoir-wx.
Those were linked to Bitbucket: https://bitbucket.org/haypo/hachoir/wiki/hachoir-wx - this repository is down.
Then, according to https://directory.fsf.org/wiki/Hachoir_project-_hachoir_wx, there was a website http://www.hachoir.org/wiki/hachoir-wx that is also down.
So is this now the official repo, or is this some fork project?

I tried to run hachoir-wx and nothing happened after pythonw started. May I suggest adding at least some print output when wxPython is not installed, so the user knows to install it?

Cannot display FAT structure on mtools-generated image

Hachoir (from Debian 3.1.0+dfsg-3) can parse the structure of a FAT image created by mkfs.msdos (showing the first 0x200 bytes as Boot, then both FAT copies, and the root directory), but on an image created by mtools (specifically mformat -i "test-mtools.img" ::.) it reports a MasterBootRecord partition table followed by only RawBytes, whereas those should be identified as a FAT.

Could the presence of a non-empty partition table be causing two different dissectors to be used?

CI broken: test_random_stream() fails

Hum, previously I configured Travis CI to only send email notifications to me. It seems like @nneonneo broke a test but didn't get a notification.

@nneonneo: can you please try to run "tox" to run tests before pushing a change?

According to git bisect, the regression was introduced by commit b0306e6.

Parse FLAC file failed

Ubuntu 16.04
Python 3.8

Flac file: https://files.catbox.moe/gafg3k.flac
file_hash: 138ae53711c6ec55ee88c9e8f54c846e469649c1bc16d5011786b1d70d143828

In [21]: with hachoir.parser.createParser("./gafg3k.flac") as parser:
    ...:      result = hachoir.metadata.extractMetadata(parser)
    ...:

[warn] [/metadata] Duplicate field name Key 'stream_info' already exists
...
...
...
[warn] [/metadata] Duplicate field name Key 'stream_info' already exists
---------------------------------------------------------------------------
UniqKeyError                              Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/hachoir/field/generic_field_set.py in _addField(self, field)
    193         try:
--> 194             self._fields.append(field._name, field)
    195         except UniqKeyError as err:

/usr/local/lib/python3.8/dist-packages/hachoir/core/dict.py in append(self, key, value)
     66         if key in self._index:
---> 67             raise UniqKeyError("Key '%s' already exists" % key)
     68         self._index[key] = len(self._value_list)^C

UniqKeyError: Key 'stream_info' already exists

During handling of the above exception, another exception occurred:

KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-21-a7157485c236> in <module>
      1 with hachoir.parser.createParser("./gafg3k.flac") as parser:
----> 2      result = hachoir.metadata.extractMetadata(parser)

KeyboardInterrupt


Update minimum Python version?

The README currently says the library supports Python 3.3+ (end of life 2017-09-29). If the minimum is raised to at least 3.5, we could make use of type annotations, which can simplify docstrings and make the library more IDE-friendly. 3.6 would be a minor improvement over 3.5, since f-strings offer a less verbose alternative to the % formatting currently used in several places; see the comparison below.
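
For instance, reusing the line quoted in the macOS report above (values are examples):

PACKAGE, VERSION = "hachoir", "3.3.0"
print("%s version %s" % (PACKAGE, VERSION))   # current % style
print(f"{PACKAGE} version {VERSION}")         # 3.6+ f-string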

Timezone information in creation_date metadata

Hi, I've noticed that the datetime object returned by extractMetadata(parser).get('creation_date') does not contain timezone information, and I was not able to find it anywhere else in the extracted metadata. Is this information actually stored in the movie file but ignored by the library?
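
A quick way to confirm the observation (the file name is a placeholder):

from hachoir.parser import createParser
from hachoir.metadata import extractMetadata

with createParser("movie.mp4") as parser:
    creation = extractMetadata(parser).get("creation_date")
print(creation, creation.tzinfo)   # tzinfo is None: the datetime is naive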

guess.py:createParser() failure leaves FileInputStream open (Version 3.1.2)

Calling guess.py:createParser() with an invalid file type leaves an open FileInputStream object.
Code is:

     stream = FileInputStream(filename, real_filename, tags=tags)
     return guessParser(stream)

Probably should be something like:

    stream = FileInputStream(filename, real_filename, tags=tags)
    guess = guessParser(stream)
    if not guess:
        stream.close()
    return guess

Specify `extras_require` in setup.py

Currently hachoir-urwid and maybe other utilities have unstated external dependencies. For hachoir-urwid this is a problem because doing a plain pip install urwid will install version 2.x which seems to be incompatible (see #34). If we pass something like

extras_require={
    'urwid': [
        'urwid==1.3.1',
    ],
}

to the setup method in setup.py, then users could pip install hachoir[urwid] and be assured they get a tried and tested version.

Saving sections broken with hachoir-urwid

To reproduce:

  1. Execute hachoir-urwid to point to some binary
  2. Highlight a section and press C-e to save it
  3. Enter filename and press enter
  4. Program terminates, leaving an empty file and the traceback:
Traceback (most recent call last):
  File "cmd.py", line 4, in <module>
    main()
  File "/home/chrahunt/.local/lib/python3.5/site-packages/hachoir/urwid_ui.py", line 831, in main
    "display_value": values.display_value,
  File "/home/chrahunt/.local/lib/python3.5/site-packages/hachoir/urwid_ui.py", line 740, in exploreFieldSet
    ui.run_wrapper(run)
  File "/usr/local/lib/python3.5/dist-packages/urwid/display_common.py", line 763, in run_wrapper
    return fn()
  File "/home/chrahunt/.local/lib/python3.5/site-packages/hachoir/urwid_ui.py", line 663, in run
    e = top.keypress(size, e)
  File "/usr/local/lib/python3.5/dist-packages/urwid/container.py", line 1116, in keypress
    return self.footer.keypress((maxcol,),key)
  File "/usr/local/lib/python3.5/dist-packages/urwid/container.py", line 1587, in keypress
    key = self.focus.keypress(tsize, key)
  File "/home/chrahunt/.local/lib/python3.5/site-packages/hachoir/urwid_ui.py", line 533, in keypress
    self._done(self.get_edit_text())
  File "/home/chrahunt/.local/lib/python3.5/site-packages/hachoir/urwid_ui.py", line 224, in <lambda>
    raise NeedInput(lambda path: self.save_field(path, key == 'ctrl e'),
  File "/home/chrahunt/.local/lib/python3.5/site-packages/hachoir/urwid_ui.py", line 388, in save_field
    copyfileobj(stream.file(), os.fdopen(fd, 'wb'))
  File "/usr/lib/python3.5/shutil.py", line 73, in copyfileobj
    buf = fsrc.read(length)
  File "/home/chrahunt/.local/lib/python3.5/site-packages/hachoir/stream/input.py", line 85, in read
    if size is None or None < self._size < pos + size:
TypeError: unorderable types: NoneType() < int()

pointing to here.

Empty strings?

At string_field.py:142, we find the following:

             if not (1 <= nbytes <= 0xffff):

Empty strings occur often in real files; think, for example, of empty strings for constants (where I first encountered this issue). Although it seems odd to have a zero-length field, Hachoir seems to deal with this fine.

Any objections to me changing the line to

            if not (0 <= nbytes <= 0xffff):

?

pip install hachoir[urwid] fails on "recent" setuptools versions

hachoir[urwid] requires urwid==1.3.1 (released on 2015-11-02)
This produces distutils.errors.DistutilsSetupError: use_2to3 is invalid.
use_2to3 was indeed removed from setuptools 58 (released on 2021-09-05)

Software environment:

  • KDE neon 6.0 (based on Ubuntu 22.04)
  • Python 3.10.12

Commands run:

$ python3 -m venv env-hachoir
$ cd env-hachoir
$ . bin/activate
$ pip install hachoir
Successfully installed hachoir-3.3.0
$ pip install hachoir[urwid]
Requirement already satisfied: hachoir[urwid] in ./lib/python3.10/site-packages (3.3.0)
Collecting urwid==1.3.1 (from hachoir[urwid])
  Downloading urwid-1.3.1.tar.gz (588 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  × python setup.py egg_info did not run successfully.
    exit code: 1
      Traceback (most recent call last):
      (...)
      File "/home/(...)/env-hachoir/lib/python3.10/site-packages/setuptools/dist.py", line 139, in invalid_unless_false
          raise DistutilsSetupError(f"{attr} is invalid.")
      distutils.errors.DistutilsSetupError: use_2to3 is invalid.
(...)

A workaround (or solution) is to require a more recent (or most recent) version of urwid.

hachoir-urwid works fine for me with the latest urwid (2.6.7).

How do I add new fields in the metadata?

As titled: is there any way to add new fields in the file metadata? The example below does not work:

# ....
parser = createParser(file_content)
md5_sum = md5(file_content.getbuffer()).hexdigest()
metadata = extractMetadata(parser)
if metadata:
    md5_field = MissingField("md5", str(md5_sum))
    metadata.add(md5_field)
    metadata = self._list_to_dict(metadata.exportPlaintext())
    return metadata

Warnings do not appear when using urwid 2.x

When running hachoir-urwid with urwid 2.0.1, any warnings that typically appear at the bottom of the interface (below the tab toolbar) instead appear as blank lines. I have highlighted these lines in my terminal and tried to copy them but they do not appear to contain any text.

I am using the latest version of hachoir from master.

Running hachoir-urwid tests/files/mev.64bit.big.elf:

With urwid 1.3.1: (screenshot: warning text visible below the tab toolbar)

With urwid 2.0.1: (screenshot: blank lines where the warnings should appear)

time for a new version?

It's been more than a year since the last release, and there have been significant fixes since then, for example b547efa. Maybe it's time for a new release?

hachoir-metadata returns "Company" field named as "NumWords"

I have file "o_gosmonitor_.doc" as example it includes name of organization "ИВЦ Минприроды" (russian) and when I use hachoir-metadata against this file I see

But it affects any .doc file
`
λ hachoir-metadata o_gosmonitor_.doc
Metadata:

  • Author: ADronova
  • Nb page: 7
  • Creation date: 2015-03-25 08:01:00
  • Last modification: 2015-03-25 08:03:00
  • Producer: Microsoft Office Word
  • Comment: NumWords: ИВЦ Минприроды
  • Comment: Keywords: 152
  • Comment: Comments: 43
  • Comment: Thumbnail: 21516
  • Comment: 23: 786432
  • Comment: LastPrinted: False
  • Comment: NumCharacters: False
  • Comment: Security: False
  • Comment: 22: False
  • Comment: Encrypted: False
  • Comment: Template: Normal
  • Comment: RevisionNumber: 1
  • Comment: TotalEditingTime: 0:02:00
  • Comment: NumWords: 3217
  • Comment: NumCharacters: 18342
  • MIME type: application/msword
  • Endianness: Little endian

Py2 compatible versions removed from pypi?

1.3.3 was the latest version that worked on Python 2, and I have been using it for years with thousands of users.
Did someone remove the hachoir-core, hachoir-parser, and hachoir-metadata packages from PyPI for a reason?

3.x is obviously Python 3 only.

Detection failure of embedded RAR files created with RAR 5.0 archive format

Hello,

It seems that I cannot successfully detect embedded RAR files created using the RAR 5.0 archive format. After a brief review of rar.py, I believe this is because of the slight difference in the file format used by the newer archiver.

Newer versions of WinRAR use a slightly different file magic for RAR files; a detection sketch covering both variants follows at the end of this report.

; RAR archive version 1.50 onwards
52 61 72 21 1A 07 00
; RAR archive version 5.0 onwards
52 61 72 21 1A 07 01 00

Quick testing...

; Embedded RAR inside SFX file successfully detected. Created with RAR archiver < 5.0
0002fff0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00030000  52 61 72 21 1a 07 00 cf  90 73 00 00 0d 00 00 00  |Rar!.....s......|
00030010  00 00 00 00 08 b9 7a 00  80 23 00 a8 00 00 00 38  |......z..#.....8|

; Embedded RAR inside SFX file NOT detected. Created with RAR archiver >= 5.0
00072bf0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00072c00  52 61 72 21 1a 07 01 00  2c 10 d9 3c 0b 01 05 07  |Rar!....,..<....|
00072c10  00 06 01 01 e4 ce 81 00  6a c4 39 9b 13 03 02 83  |........j.9.....|

To test independently, create a self extracting RAR file with an archiver version >= 5.0. The latest version of WinRAR (software version 5.50) uses this by default now.

   1. WinRAR and command line RAR use RAR 5.0 archive format by default.
      You can change it to RAR 4.x compatible format with "RAR4" option
      in archiving dialog or -ma4 command line switch.
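
For reference, the detection sketch mentioned above, covering both magics (plain Python, not hachoir's actual parser code; the function name is mine):

RAR4_MAGIC = b"Rar!\x1a\x07\x00"       # RAR 1.50 through 4.x
RAR5_MAGIC = b"Rar!\x1a\x07\x01\x00"   # RAR 5.0 onwards

def find_embedded_rar(data: bytes) -> int:
    """Return the offset of the first embedded RAR header, or -1."""
    hits = [o for o in (data.find(RAR4_MAGIC), data.find(RAR5_MAGIC))
            if o != -1]
    return min(hits, default=-1)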

[Feature Request / Enhancement] Extract metadata from MTS

Command:
hachoir-metadata /..../00014.MTS

Expected result:
Metadata extracted correctly

Actual result:
[err!] [<MPEG_TS>] Hachoir can't extract metadata, but is able to parse: /..../00014.MTS

Sorry for the poor feature request.

Truncated jpeg memory use

A truncated jpeg can have a JpegImageData field with no terminator, which is created without a known size.
Because the size isn't known, the corrupted JpegImageData must be parsed in full to calculate its size when the field is added to its parent JpegFile during JpegFile parsing. This forces simple operations that don't care about the JpegImageData, like checking whether a field with a given name is in the JpegFile, to parse the corrupted JpegImageData fully. Parsing the corrupted section can blow up the memory use of the parser as it tries to parse the entire rest of the file in small chunks.

An example file that causes this issue can be found here: https://github.com/CybercentreCanada/assemblyline-service-characterize/issues/12.
This jpeg truncated to 500 000 bytes consumes approximately 1 GB of memory parsing JpegHuffmanUnits until it reaches the end of the file and errors. This happens when extracting metadata, or when checking for any field name that isn't in the jpeg.
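
A sketch of reproducing the blow-up (the path is a placeholder for a truncated sample, and "comment" is an illustrative field name; any membership test on an absent field triggers the same full walk):

from hachoir.parser import createParser

parser = createParser("truncated.jpg")
# The unsized JpegImageData must be parsed to the end of the file
# before hachoir can answer the membership test:
"comment" in parser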

Compatibility PY2

Hello,

Given that only a few commits would be needed to fix this, please reconsider maintaining compatibility with Python 2, as there are many Python 2 installs that would benefit from hachoir updates.

Thank you.

Incorrect website data on https://pypi.python.org/pypi/hachoir-metadata/1.3.3

The PyPI entry for https://pypi.python.org/pypi/hachoir-metadata/1.3.3 shows the Bitbucket link as http://bitbucket.org/haypo/hachoir/wiki/hachoir-metadata, which leads to a 404.

I don't know if it's an out-of-date project that was mimicking your hachoir-metadata or if it's your PyPI entry with the wrong site. Either way, I figured I'd let you know about it.

Edit: I found the same bug on http://hachoir3.readthedocs.io/. Perhaps it's a deprecated Bitbucket account then?
