cdgriffith / puremagic
Pure python implementation of identifying files based off their magic numbers
License: MIT License
Since puremagic only supports Python 3.5 and up, I think you can remove:
https://github.com/cdgriffith/puremagic/blob/master/setup.py#L25
And:
https://github.com/cdgriffith/puremagic/blob/master/setup.py#L35
I was having a look around the various JPEG X* flavours and came across https://en.wikipedia.org/wiki/JPEG_XS which is both a still image and video codec.
Just to be awkward, they use the same fingerprint 0xFF10 FF50 for both image and video, but then give it two MIME types: image/jxsc and video/jxsv.
What would be the best approach to handle this? Two entries in the .json, one for each type? I don't know of other formats that do this, but I reckon they are out there.
Are you interested in a pull-request with something like this?
import os
from io import BytesIO

import puremagic


def mimetype_from_stream(stream, filename=None):
    """
    Reads in stream, attempts to identify content based
    off magic number and will return the mimetype.
    If filename is provided it will be used in the computation.
    """
    assert isinstance(stream, BytesIO)
    head, foot = _stream_details(stream)
    ext = puremagic.ext_from_filename(filename) if filename else None
    # Relies on puremagic's private _magic helper, as in this proposal.
    return puremagic.main._magic(head, foot, True, ext)


def _stream_details(stream):
    """Grab the start and end of the stream."""
    assert isinstance(stream, BytesIO)
    max_head, max_foot = puremagic.main._max_lengths()
    head = stream.read(max_head)
    try:
        # Jump to max_foot bytes before the end of the stream.
        stream.seek(-max_foot, os.SEEK_END)
    except OSError:
        # The stream is shorter than max_foot: use all of it as the footer.
        stream.seek(0)
    foot = stream.read()
    return head, foot
I will mimic your from_string function more closely in the pull request. In my case I only need the MIME type, not the extension.
Better identify common files, for example by opening .docx/.pptx/.xlsx archives and inspecting the XML inside to figure out exactly which format they are.
Some of the headers in magic_data lack mime-type, for example:
https://github.com/cdgriffith/puremagic/blob/master/puremagic/magic_data.json#L69
This causes from_string(str, mime=True) to return an empty string instead of the MIME type in some cases, e.g. for a JPEG image.
Is there a specific reason for this, or is the dataset just incomplete?
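Until the dataset is filled in, a caller could work around empty MIME types by falling back to the standard library's mimetypes table. The sketch below assumes the extension and MIME type have already been obtained from puremagic; mime_with_fallback is a hypothetical helper, not a puremagic function.

```python
import mimetypes


def mime_with_fallback(extension: str, mime_type: str) -> str:
    """Return mime_type, or guess one from the extension when it is empty."""
    if mime_type:
        return mime_type
    if not extension:
        return ""
    # guess_type() works on file names, so prepend a dummy stem.
    guessed, _encoding = mimetypes.guess_type("file" + extension)
    return guessed or ""
```

The fallback is only as good as the detected extension, so it does not replace adding the missing MIME types to the data file.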
Hi,
I just found puremagic and am trying to use it to identify whether a file my script receives is ELF or not. I am using a test ELF binary, and instead of getting back "ELF executable" as I would expect, I am getting ".AppImage".
I ran readelf against the file and here are the results:
$ readelf -h elf_hello/chello
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: Intel 80386
Version: 0x1
Entry point address: 0x80482f0
Start of program headers: 52 (bytes into file)
Start of section headers: 1904 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 7
Size of section headers: 40 (bytes)
Number of section headers: 27
Section header string table index: 26
$ python3 -m puremagic elf_hello/chello
'elf_hello/chello' : .AppImage
I also dug into the magic_data.json file and found that those two file types share most of the same bytes:
"454c46", 1, ".AppImage"
"7f454c46", 0, "", "", "ELF executable"
After some more digging, it looks like puremagic finds both options but always returns the AppImage entry.
These are the two results from the confidence function:
PureMagicWithConfidence(byte_match=b'ELF', offset=1, extension='.AppImage', mime_type='application/x-iso9660-appimage', name='AppImage application bundle', confidence=0.9)
PureMagicWithConfidence(byte_match=b'\x7fELF', offset=0, extension='', mime_type='', name='ELF executable', confidence=0.9)
I admit that I may be totally blind and not seeing where the logic decides which one to choose. I'd get it if it looked at the file extension, saw that there wasn't one, and chose the ELF executable over the AppImage, but it looks like it is a toss-up when the confidence level is the same...
Thanks in advance for any insight, suggestions, etc :)
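One possible tie-break for equal-confidence matches like the two above is to prefer the longer signature, then the lower offset. This is only a sketch of that idea and not puremagic's actual selection logic; the Match tuple simply mirrors the field names shown in the PureMagicWithConfidence output.

```python
from collections import namedtuple

# Field names copied from the PureMagicWithConfidence output quoted above.
Match = namedtuple(
    "Match", "byte_match offset extension mime_type name confidence"
)


def rank_matches(matches):
    """Order by confidence, then by longer signature, then by lower offset."""
    return sorted(
        matches,
        key=lambda m: (-m.confidence, -len(m.byte_match), m.offset),
    )


candidates = [
    Match(b"ELF", 1, ".AppImage", "application/x-iso9660-appimage",
          "AppImage application bundle", 0.9),
    Match(b"\x7fELF", 0, "", "", "ELF executable", 0.9),
]
best = rank_matches(candidates)[0]
print(best.name)
```

With this ordering the four-byte \x7fELF signature wins over the three-byte ELF one, so the plain ELF entry is returned first.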
Take a look at these Python repos to see what puremagic can add to cover more formats.
If we include non-Python repos:
Instead of opening a separate data file and eval-ing it, find a way to store that information that is just as fast, safer, and just as easy.
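One candidate is plain JSON, which the standard library parses without executing anything in the data. The sketch below assumes rows in the [signature, offset, extension, mime, name] layout; SAMPLE and load_signatures are hypothetical names used for illustration.

```python
import json

# Hypothetical sample in the [signature, offset, extension, mime, name] layout.
SAMPLE = '[["7f454c46", 0, "", "", "ELF executable"]]'


def load_signatures(text: str):
    """Parse signature rows from JSON; no code in the data is ever executed."""
    rows = json.loads(text)
    return [
        (bytes.fromhex(signature), offset, extension, mime, name)
        for signature, offset, extension, mime, name in rows
    ]


signatures = load_signatures(SAMPLE)
```

Storing signatures as hex strings keeps the file valid JSON, at the cost of one bytes.fromhex() call per row at load time.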
Hi there,
I'm looking for a python package to help identify weird and wonderful files inside various scripts. I had seen fleep but that appears to be dead. Puremagic looks to offer the same functionality for what I want it for.
One job is for handling Amiga .iff files in an image conversion script. Having a quick look, it's nice to see .iff getting some love:
puremagic/puremagic/magic_data.json
Line 1084 in ff042db
But in Amiga land that .iff FORM header is used for many things (see Wikipedia: List_of_file_signatures).
Is there a way to help improve mapping and confidence by adding additional matching strings such as ILBM, ACBM, etc.? I'm happy to help with a PR if it can be done.
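For illustration, the extra matching string sits at a fixed place in an IFF file: "FORM", a 4-byte big-endian length, then the 4-character FORM type. A rough sketch of reading it follows; iff_form_type is a hypothetical helper, not something puremagic provides.

```python
def iff_form_type(data: bytes) -> str:
    """Return the 4-character FORM type (e.g. 'ILBM'), or '' if not an IFF FORM."""
    if len(data) < 12 or data[:4] != b"FORM":
        return ""
    # Bytes 4-7 hold the big-endian chunk length; bytes 8-11 hold the FORM type.
    return data[8:12].decode("ascii", errors="replace")


# Minimal fake header: "FORM", a length of 4, and the ILBM type tag.
sample = b"FORM" + (4).to_bytes(4, "big") + b"ILBM"
```

In signature terms this amounts to a second match at offset 8, which is what a multi-part entry would need to express.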
From https://github.com/ImageMagick/jpeg-xl/blob/main/doc/format_overview.md
JPEG-XL starts with either:
0xFF0A for a raw JPEG-XL codestream
0x0000000C 4A584C20 0D0A870A for an ISOBMFF-based container
Checking the files at https://jpegxl.info/art/2021-04_jon.html, I can confirm that 0xFF0A is used for raw codestreams.
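The two signatures above can be checked with a simple prefix test. This is a sketch only; jpeg_xl_kind and the constant names are made up for the example.

```python
# Signatures quoted above from the JPEG-XL format overview.
JXL_CODESTREAM = bytes.fromhex("FF0A")
JXL_CONTAINER = bytes.fromhex("0000000C4A584C200D0A870A")


def jpeg_xl_kind(data: bytes) -> str:
    """Return 'container', 'codestream', or '' if neither signature matches."""
    if data.startswith(JXL_CONTAINER):
        return "container"
    if data.startswith(JXL_CODESTREAM):
        return "codestream"
    return ""
```

The two-byte codestream signature is short, so on its own it is a weak match and probably deserves a lower confidence than the twelve-byte container signature.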
same (mp3) file, different name ... different output
Make a copy:
sander@brixit:~/git/puremagic$ cp test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
Verify it's there with same size:
sander@brixit:~/git/puremagic$ ll test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
-rw-rw-r-- 1 sander sander 26989 jun 11 10:36 test/resources/audio/testblabla.bla
-rw-rw-r-- 1 sander sander 26989 jun 11 10:35 test/resources/audio/test.mp3
... and same contents:
sander@brixit:~/git/puremagic$ md5sum test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
3de8d656af21a836f2ba4f2949feb77c test/resources/audio/test.mp3
3de8d656af21a836f2ba4f2949feb77c test/resources/audio/testblabla.bla
... but puremagic says the first one is mp3 and the second is ... koz?
sander@brixit:~/git/puremagic$ python3 -m puremagic test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
'test/resources/audio/test.mp3' : .mp3
'test/resources/audio/testblabla.bla' : .koz
Is this intended behaviour, or a bug?
PS: Linux's file command reports it correctly as mp3:
sander@brixit:~/git/puremagic$ file test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
test/resources/audio/test.mp3: Audio file with ID3 version 2.3.0, contains:MPEG ADTS, layer III, v1, 128 kbps, 44.1 kHz, JntStereo
test/resources/audio/testblabla.bla: Audio file with ID3 version 2.3.0, contains:MPEG ADTS, layer III, v1, 128 kbps, 44.1 kHz, JntStereo
SVG images return mime_type='video/x-ms-asf'
I would love to do something like this:
import puremagic
with open(file_path, "rb") as fh:
    ext = puremagic.from_file_handler(fh)
Bytestream support in particular would be useful when the file is not, cannot be, or should not be stored on disk (e.g. on AWS Lambda).
A file with an .epub extension is extremely likely to be the zip+xml+html based, and extremely popular, e-book format rather than any form of INI configuration file. I cannot find any reference to .epub being a configuration file name for any system with a cursory search.
As it is listed directly after .ini in the data file, it was likely a copy/paste error.
The line should probably read as:
["", 0, ".epub", "application/epub+zip", "electronic book document"]
I am getting this warning in Python 3.6 and 3.7 beta when running "python3 -m puremagic":
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py:125:
RuntimeWarning: 'puremagic.__main__' found in sys.modules after import of package
'puremagic', but prior to execution of 'puremagic.__main__'; this may result in
unpredictable behaviour
warn(RuntimeWarning(msg))
A few new compression algorithms and formats have been published recently and are likely to become quite popular:
Brotli from Google: supported by all browsers and standardized in an RFC, but it doesn't have a magic signature.
LZ4, which is ultra fast and lightweight.
ZStandard from Facebook, which is already widely supported by a lot of systems, including the Linux kernel. Its MIME type is application/zstd and its magic is \x28\xB5\x2F\xFD
Could you add support for them? If yes, please also add their tar versions (tzst, tlz4, tbr).
The more programs support them, the easier it will be to migrate to these compressors.
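A sketch of what such detection could look like. The Zstandard magic and MIME type come from the request above; the LZ4 frame magic (0x184D2204, little-endian) and the .zst/.lz4 extensions are assumptions added here, and identify_compression is a hypothetical name. Brotli is left out because, as noted, it has no magic signature.

```python
# Zstandard frame magic, as quoted in the request above.
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"
# LZ4 frame magic (0x184D2204 stored little-endian); taken from the LZ4 frame
# format, not from the text above.
LZ4_FRAME_MAGIC = b"\x04\x22\x4d\x18"


def identify_compression(data: bytes):
    """Return (extension, mime_type) for a known magic, else ('', '')."""
    if data.startswith(ZSTD_MAGIC):
        return ".zst", "application/zstd"
    if data.startswith(LZ4_FRAME_MAGIC):
        # No MIME type is given in the request, so it is left empty here.
        return ".lz4", ""
    return "", ""
```

Telling .tzst from .zst would require decompressing the start of the stream to look for a tar header, which a pure magic-number check cannot do.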
Hello,
I encountered a discrepancy when running a test with the following code:
import puremagic
print(puremagic.from_file("test/resources/images/test.webp", mime=True))  # prints "image/webp"
with open("test/resources/images/test.webp", "rb") as f:
    print(puremagic.from_string(f.read(), mime=True))  # prints ""
In comparison, the python-magic library outputs "image/webp" for both the from_file and from_buffer functions.
I am uncertain whether this difference in behavior is intentional.
Thank you!
I'm going through files for various formats, and in some cases I could increase the confidence because the file has a fixed position for both a header and a footer. Would it be possible for the multi-part matching to accept a negative byte offset, counted from the end of the file?
This would be handy: in one case I have a file that clashes with another match at 0.8, and adding the footer as an additional match should push it past that to give a solid first match. I believe this would also help with things like the .svg matching confidence in #37.
Example entry for multi-part-headers:
"4352454D" : [
["444f4e4500000000", -8, ".ctm", "", "CreamTracker module"]
]
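A minimal sketch of how a negative offset could be interpreted, assuming it counts back from the end of the data. matches_at is a hypothetical helper and not puremagic's matching code; the header and footer bytes are taken from the example entry above.

```python
def matches_at(data: bytes, signature: bytes, offset: int) -> bool:
    """Check signature at offset; a negative offset counts from the end."""
    start = offset if offset >= 0 else len(data) + offset
    if start < 0:
        # The data is shorter than the requested distance from the end.
        return False
    return data[start:start + len(signature)] == signature


HEADER = bytes.fromhex("4352454D")          # "CREM"
FOOTER = bytes.fromhex("444f4e4500000000")  # "DONE" + four NUL bytes

# Fake file: header, some padding, then the footer in the last 8 bytes.
sample = HEADER + b"\x00" * 16 + FOOTER
```

Converting the negative offset to an absolute start avoids the slicing pitfall where data[-8:0] is empty when the signature ends exactly at the end of the file.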
For example, this is a webp image; download it to test.webp:
This is the difference between puremagic and magic:
In [22]: import puremagic
In [23]: puremagic.from_file("test.webp", mime=True)
Out[23]: ''
In [24]: import magic
In [25]: magic.from_file("test.webp", mime=True)
Out[25]: 'image/webp'
It seems that the MIME type is missing, but if I remove mime=True, I can get the webp extension:
In [26]: puremagic.from_file("test.webp")
Out[26]: '.webp'
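As a stopgap while the MIME type is missing from the data, the WebP container can be recognised directly: it is a RIFF file with "WEBP" at byte offset 8. The helper below is a sketch with hypothetical names and does not use puremagic.

```python
def webp_mime(data: bytes) -> str:
    """Return 'image/webp' for a RIFF/WEBP header, else ''."""
    # Bytes 0-3 are "RIFF", bytes 4-7 a little-endian size, bytes 8-11 "WEBP".
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return ""
```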
puremagic seems to fail to detect some very common file types, like text files (.py, .txt, .md).
$ file changelog.txt
changelog.txt: ASCII English text
$ python3.6 -m puremagic ./changelog.txt
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py:125:
RuntimeWarning: 'puremagic.__main__' found in sys.modules after import of package
'puremagic', but prior to execution of 'puremagic.__main__'; this may result in
unpredictable behaviour
warn(RuntimeWarning(msg))
'./changelog.txt' : could not be Identified
$ python3.6 -m puremagic -m ./changelog.txt
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py:125:
RuntimeWarning: 'puremagic.__main__' found in sys.modules after import of package
'puremagic', but prior to execution of 'puremagic.__main__'; this may result in
unpredictable behaviour
warn(RuntimeWarning(msg))
'./changelog.txt' : could not be Identified
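Plain text has no magic number, so any detection has to be a content heuristic. One rough sketch, assuming UTF-8 (or ASCII) input and treating NUL bytes as a sign of binary data; looks_like_text is a hypothetical helper and not part of puremagic.

```python
def looks_like_text(data: bytes) -> bool:
    """Heuristic: non-empty, no NUL bytes, and the sample decodes as UTF-8."""
    if not data or b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

If only the first bytes of a file are sampled, a multi-byte character cut at the boundary would make this return False, so a real implementation would need to tolerate a truncated tail.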