The puremagic from cdgriffith

Remove unsupported Python stuff

Since puremagic only supports Python 3.5 and up, I think you can remove:
https://github.com/cdgriffith/puremagic/blob/master/setup.py#L25

And:
https://github.com/cdgriffith/puremagic/blob/master/setup.py#L35

Create setup.py and integrate with PyPI

JPEG XS Two mime types

I was having a look around the various JPEG X* flavours and came across https://en.wikipedia.org/wiki/JPEG_XS which is both a still image and video codec.

Just to be awkward, they use the same fingerprint (0xFF10 FF50) for both image and video, but then give it two mime types: image/jxsc and video/jxsv.

What would be the best approach to handle this? Two entries in the .json, one for each type? I'm not sure which other formats do this, but I reckon they are out there.
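The "two entries" idea can be pictured with a small sketch. This is not puremagic's actual data layout: the table and function names are hypothetical, and only the signature and mime types come from the issue above. One signature maps to a list of candidates, and the caller gets all of them back.

```python
# Hypothetical table: one signature, several candidate mime types.
SIGNATURES = {
    bytes.fromhex("ff10ff50"): ["image/jxsc", "video/jxsv"],
}


def candidate_mimes(data: bytes) -> list:
    """Return every mime type whose signature starts the data."""
    matches = []
    for signature, mimes in SIGNATURES.items():
        if data.startswith(signature):
            matches.extend(mimes)
    return matches
```

Returning every candidate leaves the final choice (for example by file extension) to the caller instead of silently picking one.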

mimetype from stream

Are you interested in a pull-request with something like this?

import os
from io import BytesIO

import puremagic


def mimetype_from_stream(stream, filename=None):
    """
    Reads in stream, attempts to identify content based
    off magic number and will return the mimetype.
    If filename is provided it will be used in the computation.
    """
    assert isinstance(stream, BytesIO)
    head, foot = _stream_details(stream)
    # The extension, when known, is passed along as a hint.
    ext = puremagic.ext_from_filename(filename) if filename else None
    # Relies on puremagic's private helpers, which may change between releases.
    return puremagic.main._magic(head, foot, True, ext)


def _stream_details(stream):
    """Grab the start and end of the stream."""
    assert isinstance(stream, BytesIO)
    max_head, max_foot = puremagic.main._max_lengths()
    head = stream.read(max_head)
    try:
        # Jump to max_foot bytes before the end of the stream.
        stream.seek(-max_foot, os.SEEK_END)
    except OSError:
        # The stream is shorter than max_foot: read it from the beginning.
        stream.seek(0)
    foot = stream.read()
    return head, foot

I will mimic your from_string function more in the pull-request. In my case I only need the mime-type, not the extension.

Integrate advanced file scanning techniques

Better identify common files. Such as opening .docx/.pptx/.xlsx and viewing the XML file to figure out which exactly they are.
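A minimal sketch of what such a check could look like, using only the standard library. It assumes the conventional main part names (word/document.xml, xl/workbook.xml, ppt/presentation.xml); a stricter check would read [Content_Types].xml instead. The function name is hypothetical and not part of puremagic.

```python
import io
import zipfile

# Conventional main-part paths for the three common Office Open XML types.
OOXML_MARKERS = {
    "word/document.xml": ".docx",
    "xl/workbook.xml": ".xlsx",
    "ppt/presentation.xml": ".pptx",
}


def ooxml_extension(data: bytes):
    """Return '.docx', '.xlsx' or '.pptx', or None if undetermined."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        return None  # not a zip container at all
    for marker, extension in OOXML_MARKERS.items():
        if marker in names:
            return extension
    return None
```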

Missing mime types for some headers

Some of the headers in magic_data lack mime-type, for example:
https://github.com/cdgriffith/puremagic/blob/master/puremagic/magic_data.json#L69
This causes from_string(str, mime=True) to return empty string instead of the mime-type in some cases, eg JPEG image.

Is there a specific reason for this, or is the dataset just incomplete?

Confidence/Selection logic question

Hi,

I just found PureMagic and am trying to use it to identify whether a file my script receives is ELF or not. I am using a test ELF binary, and instead of getting back "ELF executable" as I would expect, I am getting ".AppImage".

I ran readelf against the file and here are the results:

$ readelf -h elf_hello/chello
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Intel 80386
  Version:                           0x1
  Entry point address:               0x80482f0
  Start of program headers:          52 (bytes into file)
  Start of section headers:          1904 (bytes into file)
  Flags:                             0x0
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         7
  Size of section headers:           40 (bytes)
  Number of section headers:         27
  Section header string table index: 26

$ python3 -m puremagic elf_hello/chello
'elf_hello/chello' : .AppImage

I also dug into the magic_data.json file and found that those two file types share most of the same bytes:

"454c46", 1, ".AppImage"
"7f454c46", 0, "", "", "ELF executable"

After doing some more digging, it looks like puremagic finds both options but always returns the AppImage entry.

These are the 2 results from the confidence function:

PureMagicWithConfidence(byte_match=b'ELF', offset=1, extension='.AppImage', mime_type='application/x-iso9660-appimage', name='AppImage application bundle', confidence=0.9)
PureMagicWithConfidence(byte_match=b'\x7fELF', offset=0, extension='', mime_type='', name='ELF executable', confidence=0.9)

I admit that I may be totally blind and not seeing where the logic decides which one to choose. I'd get it if it looked at the file extension, saw that there wasn't one, and chose the ELF executable over the AppImage, but it looks like it is a toss-up when the confidence level is the same...

Thanks in advance for any insight, suggestions, etc :)
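One possible tie-break rule, sketched under assumptions: this is not what puremagic does, and the Match tuple is a simplified stand-in for PureMagicWithConfidence. When confidence is equal, prefer the longer byte match, then the lower offset.

```python
from collections import namedtuple

# Simplified stand-in for puremagic's result tuple (fields reduced for brevity).
Match = namedtuple("Match", "byte_match offset extension name confidence")


def best_match(matches):
    """Pick highest confidence; break ties by longer byte match, then lower offset."""
    return max(matches, key=lambda m: (m.confidence, len(m.byte_match), -m.offset))


candidates = [
    Match(b"ELF", 1, ".AppImage", "AppImage application bundle", 0.9),
    Match(b"\x7fELF", 0, "", "ELF executable", 0.9),
]
print(best_match(candidates).name)  # ELF executable
```

With this rule the four-byte match at offset 0 wins over the three-byte match at offset 1, regardless of the order in which the entries are found.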

Price-matching other repos for more file support

Take a look at these Python repos to see what puremagic can add to cover more formats.

https://github.com/floyernick/fleep-py/blob/master/fleep/data.json (193 stars)
https://github.com/h2non/filetype.py/tree/master/filetype/types (134 stars)
https://github.com/openpreserve/fido/blob/master/fido/conf/format_extensions.xml (84 stars)
https://github.com/omriher/Whatype/blob/master/whatype/magics.csv (12 stars)
https://github.com/schlerp/pyfsig/blob/master/src/pyfsig/file_signatures.py (9 stars)
https://github.com/7h3rAm/cigma/blob/master/cigma/magicbytes.json (1 star)

If we include non-Python repos:

Find something as quick yet safer than eval()

Instead of opening a separate data file and eval-ing it, find a way to store that information that is just as fast, just as easy, and safer.
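Two standard-library options come to mind; the sketch below shows both on a made-up data string. Relative speed is not measured here, so whether either is "just as fast" remains an open question.

```python
import ast
import json

# A literal-only data string, as might be stored in a separate data file.
raw = '[["7f454c46", 0, "", "", "ELF executable"]]'

# json.loads only parses JSON; it never executes code.
from_json = json.loads(raw)

# ast.literal_eval accepts Python literals (lists, tuples, bytes, ...) but not calls.
from_literal = ast.literal_eval(raw)

rejected = False
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError:
    rejected = True  # arbitrary expressions are refused
```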

How to handle two sets of bytes for matching improvements?

Hi there,

I'm looking for a Python package to help identify weird and wonderful files inside various scripts. I had seen fleep, but that appears to be dead. Puremagic looks to offer the functionality I need.

One job is for handling Amiga .iff files in an image conversion script. Having a quick look, it's nice to see .iff getting some love:

puremagic/puremagic/magic_data.json, line 1084 in ff042db:

["464f524d", 0, "", "application/x-iff", "IFF file"],

But in Amiga land that .iff FORM header is used for many things (see Wikipedia: List_of_file_signatures).

Is there a way to improve mapping and confidence by adding additional matching strings such as ILBM, ACBM, etc.? I'm happy to help with a PR if it can be done.
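For illustration, an IFF FORM file carries its sub-type in bytes 8 to 12, right after the 4-byte big-endian chunk size. A minimal sketch of reading it (the function name is hypothetical and not part of puremagic):

```python
def iff_form_type(data: bytes):
    """Return the 4-character FORM type (e.g. 'ILBM'), or None if not an IFF FORM."""
    if len(data) < 12 or data[:4] != b"FORM":
        return None
    # Bytes 4-8 hold the big-endian chunk size; bytes 8-12 hold the form type.
    return data[8:12].decode("ascii", errors="replace")
```

A second match on this field is what would separate ILBM from ACBM and the other FORM-based formats.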

Adding JPEG-XL Support

From https://github.com/ImageMagick/jpeg-xl/blob/main/doc/format_overview.md

A JPEG-XL file starts with either:
0xFF0A for a raw JPEG-XL codestream
0x0000000C 4A584C20 0D0A870A for an ISOBMFF-based container

Checking the files at https://jpegxl.info/art/2021-04_jon.html, I can confirm that 0xFF0A is used for raw codestreams.
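The two signatures above translate directly into a prefix check. A small sketch, with hypothetical names:

```python
JXL_CODESTREAM = bytes.fromhex("ff0a")
JXL_CONTAINER = bytes.fromhex("0000000c4a584c200d0a870a")


def jpeg_xl_kind(data: bytes):
    """Return 'container', 'codestream', or None if neither signature matches."""
    if data.startswith(JXL_CONTAINER):
        return "container"
    if data.startswith(JXL_CODESTREAM):
        return "codestream"
    return None
```

The two-byte codestream signature is short, so on its own it would deserve a lower confidence than the twelve-byte container signature.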

same (mp3) file, different name ... different output: mp3 versus koz

Make a copy:
sander@brixit:~/git/puremagic$ cp test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
Verify it's there with same size:

sander@brixit:~/git/puremagic$ ll test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
-rw-rw-r-- 1 sander sander 26989 jun 11 10:36 test/resources/audio/testblabla.bla
-rw-rw-r-- 1 sander sander 26989 jun 11 10:35 test/resources/audio/test.mp3

... and same contents:

sander@brixit:~/git/puremagic$ md5sum test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
3de8d656af21a836f2ba4f2949feb77c  test/resources/audio/test.mp3
3de8d656af21a836f2ba4f2949feb77c  test/resources/audio/testblabla.bla

... but puremagic says the first one is mp3 and the second is ... koz?

sander@brixit:~/git/puremagic$ python3 -m puremagic test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
'test/resources/audio/test.mp3' : .mp3
'test/resources/audio/testblabla.bla' : .koz

Is this wanted behaviour, or a bug?

PS: Linux's file command correctly reports both as mp3:

sander@brixit:~/git/puremagic$ file  test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
test/resources/audio/test.mp3:       Audio file with ID3 version 2.3.0, contains:MPEG ADTS, layer III, v1, 128 kbps, 44.1 kHz, JntStereo
test/resources/audio/testblabla.bla: Audio file with ID3 version 2.3.0, contains:MPEG ADTS, layer III, v1, 128 kbps, 44.1 kHz, JntStereo

Add mime type lookup for exts

SVG images not recognized

SVG images return mime_type='video/x-ms-asf'

Is it possible to use filehandles / bytestream?

I would love to do something like this:

import puremagic

with open(file_path, "rb") as fh:
    ext = puremagic.from_file_handler(fh)

Especially the bytestream support might be nice in case the file is not / cannot / should not be stored on the disk (e.g. AWS Lambda)
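Until something like that exists, a thin wrapper can be sketched. The helper below is hypothetical and not a puremagic function; it takes the identifying callable as a parameter, so puremagic.from_string (which accepts bytes) could be passed in.

```python
import io


def identify_from_handle(fh, identify, max_bytes=-1):
    """Read bytes from an open binary handle and pass them to `identify`.

    `identify` is any callable taking bytes, for example puremagic.from_string.
    """
    start = fh.tell() if fh.seekable() else None
    data = fh.read(max_bytes)  # -1 reads to the end of the stream
    if start is not None:
        fh.seek(start)  # leave the handle where the caller had it
    return identify(data)
```

Limiting max_bytes keeps memory use down but would hide any signature that sits at the end of the file.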

.epub listed as "INI Config file" in magic_data.json

A file with an .epub extension is extremely likely to be the very popular zip+xml+html based e-book format rather than any form of INI configuration file. A cursory search turns up no reference to .epub being a configuration file name on any system.

As it is listed directly after .ini in the data file, it was likely a copy/paste error.

The line should probably read as:

        ["", 0, ".epub", "application/epub+zip", "electronic book document"]
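Beyond the extension, an EPUB can also be recognized by content: the container is a zip archive holding a "mimetype" entry whose content is application/epub+zip. A minimal sketch of that check (the function name is hypothetical and this is not how puremagic currently works):

```python
import io
import zipfile


def is_epub(data: bytes) -> bool:
    """True if the data is a zip archive with an EPUB 'mimetype' entry."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return archive.read("mimetype").strip() == b"application/epub+zip"
    except (zipfile.BadZipFile, KeyError):
        # Not a zip at all, or a zip without a 'mimetype' entry.
        return False
```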

RuntimeWarning when running package

I am getting this warning in Python 3.6 and 3.7 beta when running `python3 -m puremagic`:

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py:125: 
RuntimeWarning: 'puremagic.__main__' found in sys.modules after import of package 
'puremagic', but prior to execution of 'puremagic.__main__'; this may result in 
unpredictable behaviour
  warn(RuntimeWarning(msg))

New archivers support: Brotli, LZ4 and ZStd

A few new compression algorithms and formats have been published recently and are likely to become quite popular:
Brotli from Google: supported by all browsers and standardized in its own RFC, but it has no magic signature.
LZ4, which is ultra fast and lightweight.
ZStandard from Facebook, which is already widely supported by many systems, including the Linux kernel. Its mime type is application/zstd and its magic is \x28\xB5\x2F\xFD
Could you add support for them? If so, please also add their tar versions (tzst, tlz4, tbr).
The more programs support them, the easier it will be to migrate to these compressors.
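As a rough sketch of the signature checks involved: the Zstandard magic is the one quoted above, while the LZ4 value is an assumption not taken from the issue (the LZ4 frame format magic, 0x184D2204 stored little-endian). Brotli is left out because it has no signature to match.

```python
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"
# Assumption, not from the issue: LZ4 frame format magic number.
LZ4_FRAME_MAGIC = b"\x04\x22\x4d\x18"


def compression_format(data: bytes):
    """Return 'zstd', 'lz4', or None based on the leading bytes."""
    if data.startswith(ZSTD_MAGIC):
        return "zstd"
    if data.startswith(LZ4_FRAME_MAGIC):
        return "lz4"
    return None
```

The tar variants (tzst, tlz4) would share these leading bytes with the plain compressed files, so telling them apart would need the file extension or decompression of the first block.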

Webp image mime type is empty

Hello,

I encountered a discrepancy when running a test with the following code:

import puremagic

print(puremagic.from_file("test/resources/images/test.webp", mime=True)) # prints "image/webp"
with open("test/resources/images/test.webp", "rb") as f:
    print(puremagic.from_string(f.read(), mime=True))  # prints ""

In comparison, the python-magic library outputs "image/webp" for both the from_file and from_buffer functions.

I am uncertain whether this difference in behavior is intentional.

Thank you!

Multi-part checks with negative offset for second match

I'm going through various files for some formats, and in some cases I could increase the confidence because the file has a solid position for both a header and a footer. Would it be possible for the multi-part matching to accept a negative byte offset, counted from the end of the file?

This would be handy as in one case I have a file that is clashing with another match at 0.8, adding the footer as an additional match should push past this to give a solid first match. I believe this would help with things like #37 .svg matching confidence as well.

Example entry for multi-part-headers:

"4352454D" : [
    ["444f4e4500000000", -8, ".ctm", "", "CreamTracker module"]
]
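The matching itself could look something like the sketch below. It assumes the header sits at offset 0 and that a negative offset counts back from the end of the data; the function name is hypothetical, and the byte values are those of the example entry.

```python
def matches_header_and_footer(data, header, footer, footer_offset):
    """Check a header at offset 0 plus a footer at a negative offset from the end."""
    if not data.startswith(header):
        return False
    start = len(data) + footer_offset  # footer_offset is negative
    if start < 0:
        return False  # data is too short to contain the footer
    return data[start:start + len(footer)] == footer


header = bytes.fromhex("4352454D")          # b"CREM"
footer = bytes.fromhex("444f4e4500000000")  # b"DONE" followed by four NUL bytes
```

A file that passes both checks could then be given a higher confidence than one that matches the header alone.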

missing mime type for webp

For example, this is a webp image, download it to test.webp:

This is the difference between puremagic and magic:

In [22]: import puremagic

In [23]: puremagic.from_file("test.webp", mime=True)
Out[23]: ''

In [24]: import magic

In [25]: magic.from_file("test.webp", mime=True)
Out[25]: 'image/webp'

It seems that the mime type is missing, but if I remove mime=True, I can get the webp extension:

In [26]: puremagic.from_file("test.webp")
Out[26]: '.webp'

Some common filetypes are not detected

puremagic seems to fail to detect some very common file types, such as text files (.py, .txt, .md).

$ file changelog.txt
changelog.txt: ASCII English text

$ python3.6 -m puremagic ./changelog.txt
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py:125: 
RuntimeWarning: 'puremagic.__main__' found in sys.modules after import of package 
'puremagic', but prior to execution of 'puremagic.__main__'; this may result in 
unpredictable behaviour
  warn(RuntimeWarning(msg))
'./changelog.txt' : could not be Identified

$ python3.6 -m puremagic -m ./changelog.txt
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py:125: 
RuntimeWarning: 'puremagic.__main__' found in sys.modules after import of package 
'puremagic', but prior to execution of 'puremagic.__main__'; this may result in 
unpredictable behaviour
  warn(RuntimeWarning(msg))
'./changelog.txt' : could not be Identified
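Plain text has no magic number, so signature matching alone cannot catch it. A rough heuristic sketch (not what the file command does, and not part of puremagic; the function name is hypothetical):

```python
def looks_like_text(data: bytes) -> bool:
    """Heuristic: non-empty, no NUL bytes, and decodes cleanly as UTF-8."""
    if not data or b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

If only a sample of the file is checked, the sample can end in the middle of a multi-byte character and be rejected, so a real implementation would need to allow for that.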
