puremagic's People

Contributors

cdgriffith, reduxionist

puremagic's Issues

JPEG XS Two mime types

I was having a look around the various JPEG X* flavours and came across https://en.wikipedia.org/wiki/JPEG_XS which is both a still image and video codec.

Just to be awkward they use the same fingerprint 0xFF10 FF50 for both image and video but then give it two mime types image/jxsc and video/jxsv.

What would be the best approach to handle this? Two entries in the .json one for each type? I'm not sure of other formats that would do this but I reckon they are out there.
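One possible shape for this, as a minimal sketch rather than puremagic's actual data model: let one signature map to several candidate entries and let the caller reorder them with a hint. The `SIGNATURES` table, the `Candidate` tuple, the `candidates_for` helper and the `.jxs` extension are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    extension: str
    mime_type: str
    name: str


# Hypothetical table: one signature, two entries (not puremagic's real format).
SIGNATURES = {
    bytes.fromhex("ff10ff50"): [
        Candidate(".jxs", "image/jxsc", "JPEG XS still image"),
        Candidate(".jxs", "video/jxsv", "JPEG XS video"),
    ],
}


def candidates_for(data: bytes, prefer_major: Optional[str] = None) -> List[Candidate]:
    """Return every entry whose signature starts the data.

    If prefer_major is given (for example "video"), entries with that
    MIME major type are moved to the front; nothing is dropped.
    """
    found = [
        entry
        for signature, entries in SIGNATURES.items()
        if data.startswith(signature)
        for entry in entries
    ]
    if prefer_major:
        # sort() is stable, so entries keep their table order within each group.
        found.sort(key=lambda c: not c.mime_type.startswith(prefer_major + "/"))
    return found
```

With this layout the ambiguity stays visible to the caller instead of being resolved silently inside the lookup.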

mimetype from stream

Are you interested in a pull-request with something like this?

import os
from io import BytesIO

import puremagic


def mimetype_from_stream(stream, filename=None):
    """
    Reads in stream, attempts to identify content based
    off magic number and will return the mimetype.
    If filename is provided it will be used in the computation.
    """
    assert isinstance(stream, BytesIO)
    head, foot = _stream_details(stream)
    ext = puremagic.ext_from_filename(filename) if filename else None
    # _magic and _max_lengths are private puremagic helpers
    return puremagic.main._magic(head, foot, True, ext)


def _stream_details(stream):
    """Grab the start and end of the stream"""
    assert isinstance(stream, BytesIO)
    max_head, max_foot = puremagic.main._max_lengths()
    head = stream.read(max_head)
    try:
        # Jump to the last max_foot bytes of the stream
        stream.seek(-max_foot, os.SEEK_END)
    except OSError:
        # Could not seek that far back, so read from the start instead
        stream.seek(0)
    foot = stream.read()
    return head, foot

I will mimic your from_string function more in the pull-request. In my case I only need the mime-type, not the extension.
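For any seekable binary stream (not only `BytesIO`), the head/foot read could look roughly like the sketch below. It does not call puremagic at all; `read_head_and_foot` and its `max_head`/`max_foot` parameters are hypothetical names, and the caller would still have to pass the two byte strings to whatever matching function is available.

```python
import io
import os
from typing import BinaryIO, Tuple


def read_head_and_foot(stream: BinaryIO, max_head: int, max_foot: int) -> Tuple[bytes, bytes]:
    """Read up to max_head bytes from the start and max_foot bytes from the end.

    The stream must be seekable; its original position is restored afterwards.
    """
    original = stream.tell()
    try:
        stream.seek(0, os.SEEK_END)
        size = stream.tell()
        stream.seek(0)
        head = stream.read(max_head)
        # Clamp so that short streams never seek before the start.
        stream.seek(max(size - max_foot, 0))
        foot = stream.read(max_foot)
    finally:
        stream.seek(original)
    return head, foot


head, foot = read_head_and_foot(io.BytesIO(b"0123456789"), 4, 3)
```

Computing the size first avoids relying on how a particular stream type reacts to a negative seek past the start.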

Confidence/Selection logic question

Hi,

I just found PureMagic and am trying to use it to identify whether a file my script receives is ELF or not. I am using a test ELF binary, and instead of getting back "ELF executable" as I would expect, I am getting ".AppImage".

I ran readelf against the file and here are the results:

$ readelf -h elf_hello/chello
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Intel 80386
  Version:                           0x1
  Entry point address:               0x80482f0
  Start of program headers:          52 (bytes into file)
  Start of section headers:          1904 (bytes into file)
  Flags:                             0x0
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         7
  Size of section headers:           40 (bytes)
  Number of section headers:         27
  Section header string table index: 26

$ python3 -m puremagic elf_hello/chello
'elf_hello/chello' : .AppImage

I also dug into the magic_data.json file and found out that those 2 file types share a lot of the same bytes:

"454c46", 1, ".AppImage"
"7f454c46", 0, "", "", "ELF executable"

After doing some more digging, it looks like puremagic finds both options but always returns the AppImage entry.

These are the 2 results from the confidence function:

PureMagicWithConfidence(byte_match=b'ELF', offset=1, extension='.AppImage', mime_type='application/x-iso9660-appimage', name='AppImage application bundle', confidence=0.9)
PureMagicWithConfidence(byte_match=b'\x7fELF', offset=0, extension='', mime_type='', name='ELF executable', confidence=0.9)

I admit that I may be totally blind and not seeing where the logic decides which one to choose. I'd get it if it looked at the file extension, saw that there wasn't one, and chose the ELF executable over the AppImage, but it looks like it is a toss-up when the confidence level is the same...

Thanks in advance for any insight, suggestions, etc :)
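To make the question concrete, here is one tie-break rule that would pick the ELF entry. This is only a sketch of a possible policy, not a description of what puremagic does: the `Match` tuple and `best_match` function are hypothetical, and the rule (longer signature wins, then the match closer to offset 0) is an assumption.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    byte_match: bytes
    offset: int
    extension: str
    name: str
    confidence: float


def best_match(matches: List[Match]) -> Match:
    """Pick the highest confidence; on a tie prefer the longer signature,
    then the match that sits nearer the start of the file."""
    return max(matches, key=lambda m: (m.confidence, len(m.byte_match), -m.offset))


results = [
    Match(b"ELF", 1, ".AppImage", "AppImage application bundle", 0.9),
    Match(b"\x7fELF", 0, "", "ELF executable", 0.9),
]
print(best_match(results).name)  # ELF executable
```

A four-byte match at offset 0 is stronger evidence than a three-byte match at offset 1, which is why the sketch ranks on signature length before anything else.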

Price-matching other repos for more file support

Take a look at these Python repos to see what puremagic can add to cover more formats.

If we include non-Python repos:

How to handle two sets of bytes for matching improvements?

Hi there,

I'm looking for a python package to help identify weird and wonderful files inside various scripts. I had seen fleep but that appears to be dead. Puremagic looks to offer the same functionality for what I want it for.

One job is for handling Amiga .iff files in an image conversion script. Having a quick look, it's nice to see .iff getting some love:

["464f524d", 0, "", "application/x-iff", "IFF file"],

But in Amiga land that .iff FORM header is used for many things (see Wikipedia: List_of_file_signatures).

Is there a way to help improve mapping and confidence by adding additional matching strings such as ILBM ACBM etc..? I'm happy to help with a PR if it can be done.
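As a rough illustration of the second match: in an IFF file the bytes `FORM` are followed by a 4-byte big-endian length and then a 4-byte form type such as `ILBM`. The sketch below reads that type directly; `iff_form_type` and the `FORM_TYPES` table are hypothetical names, not part of puremagic, and the table lists only a few types.

```python
from typing import Optional

# A few well-known FORM types; illustrative, not exhaustive.
FORM_TYPES = {
    "ILBM": "IFF Interleaved Bitmap image",
    "ACBM": "Amiga Contiguous Bitmap",
    "8SVX": "IFF 8-bit sampled voice",
}


def iff_form_type(data: bytes) -> Optional[str]:
    """Return the 4-character FORM type (e.g. "ILBM"), or None if not an IFF FORM."""
    if len(data) < 12 or data[:4] != b"FORM":
        return None
    # Bytes 4-8 hold the big-endian chunk length; bytes 8-12 hold the form type.
    return data[8:12].decode("ascii", errors="replace")


sample = b"FORM" + (4).to_bytes(4, "big") + b"ILBM"
print(FORM_TYPES.get(iff_form_type(sample), "IFF file"))  # IFF Interleaved Bitmap image
```

In signature-table terms this is a second match at offset 8, conditional on the first match at offset 0.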

same (mp3) file, different name ... different output: mp3 versus koz

Make a copy:
sander@brixit:~/git/puremagic$ cp test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
Verify it's there with same size:

sander@brixit:~/git/puremagic$ ll test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
-rw-rw-r-- 1 sander sander 26989 jun 11 10:36 test/resources/audio/testblabla.bla
-rw-rw-r-- 1 sander sander 26989 jun 11 10:35 test/resources/audio/test.mp3

... and same contents:

sander@brixit:~/git/puremagic$ md5sum test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
3de8d656af21a836f2ba4f2949feb77c  test/resources/audio/test.mp3
3de8d656af21a836f2ba4f2949feb77c  test/resources/audio/testblabla.bla

... but puremagic says the first one is mp3 and the second is ... koz?

sander@brixit:~/git/puremagic$ python3 -m puremagic test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
'test/resources/audio/test.mp3' : .mp3
'test/resources/audio/testblabla.bla' : .koz

Is this wanted behaviour, or a bug?

PS: Linux' file reports it correctly as mp3:

sander@brixit:~/git/puremagic$ file  test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
test/resources/audio/test.mp3:       Audio file with ID3 version 2.3.0, contains:MPEG ADTS, layer III, v1, 128 kbps, 44.1 kHz, JntStereo
test/resources/audio/testblabla.bla: Audio file with ID3 version 2.3.0, contains:MPEG ADTS, layer III, v1, 128 kbps, 44.1 kHz, JntStereo

Is it possible to use filehandles / bytestream?

I would love to do something like this:

import puremagic

with open(file_path, "rb") as fh:
    ext = puremagic.from_file_handler(fh)

Especially the bytestream support might be nice in case the file is not / cannot / should not be stored on the disk (e.g. AWS Lambda)

.epub listed as "INI Config file" in magic_data.json

A file with an .epub extension is extremely likely to be the zip+xml+html based, and extremely popular, e-book format rather than any form of INI configuration file. I cannot find any reference to .epub being a configuration file name for any system with a cursory search.

As it is listed directly after .ini in the data file, it was likely a copy/paste error.

The line should probably read as:

        ["", 0, ".epub", "application/epub+zip", "electronic book document"]
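Since the entry has an empty signature, the extension alone decides the match. If a content check were ever wanted, an EPUB can be recognised as a ZIP archive whose first entry is a file called `mimetype` containing `application/epub+zip`. The sketch below uses only the standard library; `looks_like_epub` is a hypothetical helper, not an existing puremagic function.

```python
import io
import zipfile

EPUB_MIME = b"application/epub+zip"


def looks_like_epub(data: bytes) -> bool:
    """True if data is a ZIP whose first entry is a "mimetype" file naming EPUB."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = archive.namelist()
            if not names or names[0] != "mimetype":
                return False
            return archive.read("mimetype").strip() == EPUB_MIME
    except zipfile.BadZipFile:
        # Not a ZIP archive at all, for example an INI file.
        return False
```

This needs the whole file in memory, so it is a heavier check than a plain magic-number lookup.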

RuntimeWarning when running package

I am getting this warning in Python 3.6 and 3.7 beta when running `python3 -m puremagic`:

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py:125: 
RuntimeWarning: 'puremagic.__main__' found in sys.modules after import of package 
'puremagic', but prior to execution of 'puremagic.__main__'; this may result in 
unpredictable behaviour
  warn(RuntimeWarning(msg))

New archivers support: Brotli, LZ4 and ZStd

A few new compression algorithms and formats have been published recently and are likely to become quite popular:
Brotli from Google: supported by all browsers and standardized with its own RFC, but it has no magic signature.
LZ4, which is ultra fast and lightweight.
Zstandard from Facebook, which is already widely supported by a lot of systems, including the Linux kernel. Its MIME type is application/zstd and its magic is \x28\xB5\x2F\xFD.
Could you add support for them? If so, please also add their tar variants (tzst, tlz4, tbr).
The more programs support them, the easier it will be to migrate to these compressors.
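For the two formats that do have a signature, the check is a plain prefix match. A minimal sketch, assuming the LZ4 frame format (magic number 0x184D2204, stored little-endian); `sniff_compression` is a hypothetical helper and the returned labels are arbitrary:

```python
from typing import Optional

ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # Zstandard frame, as quoted above
LZ4_MAGIC = b"\x04\x22\x4d\x18"   # LZ4 frame format magic number


def sniff_compression(data: bytes) -> Optional[str]:
    """Return "zstd" or "lz4" if data starts with that frame magic, else None."""
    if data.startswith(ZSTD_MAGIC):
        return "zstd"
    if data.startswith(LZ4_MAGIC):
        return "lz4"
    # Brotli has no magic signature, so it cannot be detected this way.
    return None
```

Brotli would have to be recognised by extension only.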

Webp image mime type is empty

Hello,

I encountered a discrepancy when running a test with the following code:

import puremagic

print(puremagic.from_file("test/resources/images/test.webp", mime=True)) # prints "image/webp"
with open("test/resources/images/test.webp", "rb") as f:
    print(puremagic.from_string(f.read(), mime=True))  # prints ""

In comparison, the python-magic library outputs "image/webp" for both the from_file and from_buffer functions.

I am uncertain whether this difference in behavior is intentional.

Thank you!

Multi-part checks with negative offset for second match

I'm going through various files for some formats, and in some cases I could increase the confidence because the file has both a fixed position for a header and a fixed position for a footer. Would it be possible for the multi-part entries to accept a negative byte offset, so a match can be made relative to the end of the file?

This would be handy as in one case I have a file that is clashing with another match at 0.8, adding the footer as an additional match should push past this to give a solid first match. I believe this would help with things like #37 .svg matching confidence as well.

Example entry for multi-part-headers:

"4352454D" : [
    ["444f4e4500000000", -8, ".ctm", "", "CreamTracker module"]			
]
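The matching itself would be small. A sketch of how a negative offset could be interpreted, assuming it counts back from the end of the data the way Python slicing does; `matches_at` and `is_creamtracker` are hypothetical names, not existing puremagic functions:

```python
def matches_at(data: bytes, signature: bytes, offset: int) -> bool:
    """Check signature at offset; a negative offset counts from the end."""
    start = offset if offset >= 0 else len(data) + offset
    if start < 0:
        # The data is shorter than the requested distance from the end.
        return False
    return data[start:start + len(signature)] == signature


HEADER = bytes.fromhex("4352454d")          # "CREM" at offset 0
FOOTER = bytes.fromhex("444f4e4500000000")  # "DONE" + four NUL bytes at offset -8


def is_creamtracker(data: bytes) -> bool:
    """Require both the header match and the footer match."""
    return matches_at(data, HEADER, 0) and matches_at(data, FOOTER, -8)
```

In practice only the last few bytes of the file need to be read for the footer check, not the whole file.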

missing mime type for webp

For example, this is a webp image, download it to test.webp:

This is the difference between puremagic and magic:

In [22]: import puremagic

In [23]: puremagic.from_file("test.webp", mime=True)
Out[23]: ''

In [24]: import magic

In [25]: magic.from_file("test.webp", mime=True)
Out[25]: 'image/webp'

It seems that the mime type is missing, but if I remove mime=True, I can get the webp extension:

In [26]: puremagic.from_file("test.webp")
Out[26]: '.webp'
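For reference, a WebP file is a RIFF container: `RIFF` at offset 0, a 4-byte size, then `WEBP` at offset 8. The sketch below checks that layout independently of puremagic; `is_webp` and `mime_for` are hypothetical helpers, shown only to illustrate what a content-based fallback returning `image/webp` might look like.

```python
def is_webp(data: bytes) -> bool:
    """True if data has the RIFF container header with a WEBP form type."""
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WEBP"


def mime_for(data: bytes) -> str:
    """Return "image/webp" for WebP data, otherwise an empty string."""
    return "image/webp" if is_webp(data) else ""
```

The second check at offset 8 matters because other formats, such as WAV and AVI, share the same `RIFF` prefix.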

Some common filetypes are not detected

puremagic seems to fail to detect some very common file types, such as text files (.py, .txt, .md).

$ file changelog.txt
changelog.txt: ASCII English text

$ python3.6 -m puremagic ./changelog.txt
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py:125: 
RuntimeWarning: 'puremagic.__main__' found in sys.modules after import of package 
'puremagic', but prior to execution of 'puremagic.__main__'; this may result in 
unpredictable behaviour
  warn(RuntimeWarning(msg))
'./changelog.txt' : could not be Identified

$ python3.6 -m puremagic -m ./changelog.txt
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py:125: 
RuntimeWarning: 'puremagic.__main__' found in sys.modules after import of package 
'puremagic', but prior to execution of 'puremagic.__main__'; this may result in 
unpredictable behaviour
  warn(RuntimeWarning(msg))
'./changelog.txt' : could not be Identified
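Plain text has no magic number, so a signature table alone cannot match it. A common fallback is a heuristic on the content, roughly like the sketch below. `looks_like_text` is a hypothetical helper and the rule (no NUL bytes, decodes as UTF-8) is an assumption; it is not how `file` or puremagic decide.

```python
def looks_like_text(data: bytes) -> bool:
    """Heuristic: non-empty, no NUL bytes, and valid UTF-8 (which includes ASCII)."""
    if not data or b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

If only the first part of a file is sampled, a multi-byte character cut off at the end of the sample would make this return False, so a real implementation would need to tolerate a truncated tail.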
