tonyvalenti / mime-detective-clarkis117


This project forked from ofthelit/mime-detective


Mime type detector for files, byte arrays, and streams, .NET Standard Fork

License: MIT License

C# 98.64% PowerShell 1.14% Shell 0.23%


mime-detective-clarkis117's Issues

Word / Excel 97-2003 files may be detected as MSDOC

Word / Excel 97-2003 files may be detected as the MSDOC type rather than WORD or EXCEL.
I ran into this by creating a blank Excel file in Excel 2007 and saving it as XLS (see blank.zip).

From what I could check, Office 97-2003 file signatures rely on "subheaders", and there may be several of them without clear documentation. As a result, the library detects such files as the MSDOC type.

I would therefore suggest:

- Rename the MSDOC type to MS_OFFICE to be more accurate
- Add the list of known MS Office extensions (at least doc, ppt, xls), or the list as defined here

Something like:
// OLECF - Object Linking and Embedding (OLE) Compound File (CF)
// Compound Binary File format by Microsoft, used by Microsoft Office 97-2003 applications (Word, PowerPoint, Excel, Wizard)
public static readonly FileType MS_OFFICE = new FileType(new byte?[] { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 }, "doc,ppt,xls", "application/octet-stream");

Since the type appears after WORD and EXCEL types, the detection would first match based on subheaders and default to this one if the subheader does not match.
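
The "specific type first, generic container last" ordering described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `OleDetection`, `Detect`, and `MatchesAt` are hypothetical names, and the Word subheader value and its 512-byte offset are assumptions taken from public signature tables, so real files may not match them.

```csharp
public static class OleDetection
{
    // OLECF header bytes quoted in the issue above.
    static readonly byte[] OlecfHeader = { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Assumption: a Word subheader value and offset as listed in public
    // signature tables. Not every Word file is guaranteed to carry it.
    static readonly byte[] WordSubheader = { 0xEC, 0xA5, 0xC1, 0x00 };
    const int SubheaderOffset = 512;

    // Returns "WORD" when the subheader matches, "MS_OFFICE" when only the
    // container header matches, and null when the data is not OLECF at all.
    public static string Detect(byte[] data)
    {
        if (!MatchesAt(data, 0, OlecfHeader)) return null;
        if (MatchesAt(data, SubheaderOffset, WordSubheader)) return "WORD";
        return "MS_OFFICE"; // generic fallback, checked last
    }

    // True when `magic` appears in `data` starting at `offset`.
    static bool MatchesAt(byte[] data, int offset, byte[] magic)
    {
        if (data.Length < offset + magic.Length) return false;
        for (int i = 0; i < magic.Length; i++)
        {
            if (data[offset + i] != magic[i]) return false;
        }
        return true;
    }
}
```

With this shape, a file whose subheader is unknown still gets a usable answer (the generic Office type) instead of a wrong specific one.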

Massive Upgrade

Hi @clarkis117 -
I have recently done a complete rewrite of Mime Detective and added the ability to detect over 14,000 different file types.

I'm interested in publishing my update and, with your permission, taking over maintenance of the nuget package. Can you contact me to discuss?

When ShouldDisposeStream is set to false, consider resetting the stream position to 0

We found out today that files we had analyzed (using the stream extension GetFileTypeAsync) were corrupted when saved, because we had not reset the stream position before writing the stream to a file.

I would therefore suggest resetting the stream position to 0 when the stream is not disposed, or at least stating this behavior explicitly in comments/wiki/docs.
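
Until the library handles this, a caller-side workaround could look like the sketch below. `StreamRewind` and `InspectAndRewind` are hypothetical names, and the `inspect` delegate is a placeholder for whatever detection call consumes the stream. Note one deliberate variation: the sketch restores the position the stream had before inspection rather than always seeking to 0.

```csharp
using System;
using System.IO;

public static class StreamRewind
{
    // Runs a stream-consuming inspection, then restores the position the
    // stream had before the call. `inspect` is a placeholder for a detection
    // call; it is not part of the library's API.
    public static T InspectAndRewind<T>(Stream stream, Func<Stream, T> inspect)
    {
        long start = stream.CanSeek ? stream.Position : 0;
        try
        {
            return inspect(stream);
        }
        finally
        {
            // Non-seekable or already closed streams cannot be rewound.
            if (stream.CanSeek)
            {
                stream.Position = start;
            }
        }
    }
}
```

A caller would wrap the detection call, for example `StreamRewind.InspectAndRewind(stream, s => s.ReadByte())`, and can then save the stream to disk from its original position.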

Expose option to keep stream open

mimeTypes.GetFileType(() => fileData, stream, shouldDisposeStream: false)

Detecting .doc and .xls file using the StreamExtensions or ByteArrayExtensions

When loading an old Microsoft Office file into a stream or byte array, the FileType returned by GetFileType() is doc,ppt,xls, not the expected xls or doc.

For example, both tests fail:

    [Fact]
    public void CanReadExcelFileFromByteArray()
    {
        var result = File.ReadAllBytes("./data/Documents/XlsExcel2007.xls").GetFileType();

        Assert.NotNull(result);
        Assert.Equal(MimeTypes.EXCEL, result);
    }

    [Fact]
    public void CanReadExcelFileFromStream()
    {
        using (FileStream stream = File.Open("./data/Documents/XlsExcel2007.xls", FileMode.Open))
        {
            var result = stream.GetFileType(false, true);

            Assert.NotNull(result);
            Assert.Equal(MimeTypes.EXCEL, result);
        }
    }

Any ideas how to get this to work without having to save the files to disk and then loading them using the FileInfo object?

mp4

I have used your tool very successfully; however, my code returns null when an mp4 is tested
(the filePath extension is mp4):
NSUrl videoFileURL = NSUrl.FromString(filePath);
Uri uri = new Uri(videoFileURL.ToString());
StreamReader streamReader = new StreamReader(filePath);
FileInfo fileInfo = new FileInfo(filePath);

Stream stream = streamReader.BaseStream;
FileType fileType = MimeDetective.FileInfoExtensions.GetFileType(fileInfo);

(The fileType here is null.)
This code works fine for .mov files (QuickTime) but fails with mp4. Please let me know if I am using it incorrectly for this type of file or if there is actually an issue.
Thanks!

PDF File Detected as plain/text

Hi,

I have tried to detect a 'pdf' file but I got 'txt' as a response

IFormFileCollection invoices
       
foreach (var invoice in invoices)
{
    using (var ms = invoice.OpenReadStream())
    {
        FileType fileType = ms.GetFileType();
    }
}

The file's bytes start with 25 50 44 46 ("%PDF"), so it should be recognized as a PDF.

The PDF file starts with these bytes:

25 50 44 46 2d 31 2e 37 0a 25 ef bf bd ef bf bd ef bf bd ef bf bd 0a 31 20 30 20 6f 62 6a 0a 3c 3c 2f 54 79 70 65 2f 43 61 74 61 6c 6f 67 2f 50 61 67 65 73 20 32 20 30 20 52 2f 4c 61 6e 67 28 74 72 2d 54 52 29 20 2f 53 74 72 75 63 74 54 72 65 65 52 6f 6f 74 20 31 30 20 30 20 52 2f 4d 61 72 6b 49 6e 66 6f 3c 3c 2f 4d 61 72 6b 65 64 20 74 72 75 65 3e 3e 2f 4d 65 74 61 64 61 74 61 20 32 31 20 30 20 52 2f 56 69 65 77 65 72 50 72 65 66 65 72 65 6e 63 65 73 20 32 32 20 30 20 52 3e 3e 0a 65 6e 64 6f 62 6a 0a 32 20 30 20 6f 62 6a 0a 3c 3c 2f 54 79 70 65 2f 50 61 67 65 73 2f 43 6f 75 6e 74 20 31 2f 4b 69 64 73 5b 20 33 20 30 20 52 5d 20 3e

.Net Core Project

net451 support

net451 support would be nice to have, because many companies are still limited to this framework.

XML File detected as plain/text

Calling GetFileType(this byte[] bytes) returns "plain/text" instead of "application/xml".

Add Xunit Based Unit Tests

There should be some sort of unit tests with local files included in the unit tests project

Empty ZIP files not recognized

Empty ZIP files are a special kind of ZIP. The header differs.

File contents: 50 4b 05 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Calling GetFileType(this byte[] bytes) returns null.

$ file -i ZIPempty.zip
ZIPempty.zip: application/zip; charset=binary
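
A check that accepts both kinds of ZIP header could be sketched as below. The names `ZipSignature`, `LooksLikeZip`, and `StartsWith` are hypothetical and not part of the library. The sketch relies on the fact that an empty archive consists solely of the end-of-central-directory record, which begins with 50 4B 05 06 instead of the usual 50 4B 03 04.

```csharp
public static class ZipSignature
{
    // "PK\x03\x04": local file header, starts an archive with at least one entry.
    static readonly byte[] LocalFileHeader = { 0x50, 0x4B, 0x03, 0x04 };

    // "PK\x05\x06": end of central directory record. An empty archive
    // contains nothing but this record.
    static readonly byte[] EndOfCentralDirectory = { 0x50, 0x4B, 0x05, 0x06 };

    public static bool LooksLikeZip(byte[] data)
    {
        return StartsWith(data, LocalFileHeader)
            || StartsWith(data, EndOfCentralDirectory);
    }

    // True when `data` begins with `magic`; short inputs simply do not match.
    static bool StartsWith(byte[] data, byte[] magic)
    {
        if (data.Length < magic.Length) return false;
        for (int i = 0; i < magic.Length; i++)
        {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }
}
```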

Throwing exception when it cannot determine the file type.

Methods which return FileType from (MimeTypes.cs) usually have this comment:

	/// <summary>
	/// Read header of a file and depending on the information in the header
	/// return object FileType.
	/// *Return null in case when the file type is not identified.*
	/// Throws Application exception if the file can not be read or does not exist
	/// </summary>

But because of this line, https://github.com/clarkis117/Mime-Detective/blob/master/src/Mime-Detective/MimeTypes.cs#L314, the statement about the return value isn't true.

ZipArchive disposes stream

Input stream will be disposed after calling GetFileType().

Regards!

PDF files detected as plain/text

I faced the same issue #12 as reported in the original repository.

The proposed change was to run the file signature detection first, and then plain text detection.
That would definitely fix the issue and be more reliable.

Would you consider making this change?

If so, it might then be possible to improve the plain text detection to detect the file encoding.
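
The proposed ordering can be illustrated with a small sketch. It is not the library's code: `DetectionOrder`, `Detect`, and `LooksLikeText` are hypothetical names, and the text heuristic is deliberately crude (it accepts any byte at or above 0x20 except DEL, plus tab, LF, and CR). The point is only that a PDF header passes such a text check, so the signature check has to run first.

```csharp
public static class DetectionOrder
{
    // "%PDF"
    static readonly byte[] PdfMagic = { 0x25, 0x50, 0x44, 0x46 };

    // Signature check first, plain-text heuristic second.
    // Returns "pdf", "txt", or null when nothing matches.
    public static string Detect(byte[] data)
    {
        if (StartsWith(data, PdfMagic)) return "pdf";
        if (data.Length > 0 && LooksLikeText(data)) return "txt";
        return null;
    }

    // Crude heuristic (an assumption for this sketch): no control bytes
    // other than tab, line feed, and carriage return.
    static bool LooksLikeText(byte[] data)
    {
        foreach (byte b in data)
        {
            bool printable = b >= 0x20 && b != 0x7F;
            bool whitespace = b == 0x09 || b == 0x0A || b == 0x0D;
            if (!printable && !whitespace) return false;
        }
        return true;
    }

    static bool StartsWith(byte[] data, byte[] magic)
    {
        if (data.Length < magic.Length) return false;
        for (int i = 0; i < magic.Length; i++)
        {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }
}
```

If the two checks in `Detect` were swapped, a header such as "%PDF-1.7" followed by a line feed would be reported as text, which is the behavior described in this issue.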

NullReferenceException reading Excel: 0.0.6-alpha1

Detecting an Excel file from a byte array throws a NullReferenceException at MimeDetective.MimeTypes.FindZipType(ReadResult& readResult) in \src\Mime-Detective\MimeTypes.cs:line 314


var fileData = File.ReadAllBytes(filePath);
var type = fileData.GetFileType();

test.xlsx

"End of Central Directory record could not be found."- 0.0.6-beta1

Hi @clarkis117,

I've updated the NuGet package to version 0.0.6-beta1 and now I'm getting the error "End of Central Directory record could not be found" with a byte array of an Excel file.

Sample of code: File.ReadAllBytes(filePath).GetFileType()

Sample of file: test.xlsx

TargetFramework: .NET Core 2.0

Could you help me with this?

Additionally, do you have any expected date for the final nuget release of the 0.0.6?

Thank you,

Add more framework targets

Your current targets are:
netstandard1.3;net45

Would it be possible to also add net471 and netstandard2.0?

I tried including your latest beta nuget into my net471 project and got some unexpected warnings.
Warning Found conflicts between different versions of "System.Net.Http" that could not be resolved. These reference conflicts are listed in the build log when log verbosity is set to detailed.
Dependencies System.Buffers and System.Xml.XmlSerializer both got included but they shouldn't?

MimeDetective LearnMimeType reads every other byte of a header

It appears that public static FileType LearnMimeType(FileInfo first, FileInfo second, string mimeType, int maxHeaderSize = 12, int minMatches = 2, int maxNonMatch = 3) in MimeDetective.cs checks only every other byte for matches and adds those matches to the header of the new file type it returns.

Is this intentional?

Additionally, I'm having a hard time figuring out the purpose of this method at all. Is it used to find the common header of a known file type/MIME extension, to find the MIME type of an unknown file by comparing it to a known one, or something else?

Any thoughts on when 0.0.6 final will be published?

The latest version really looks nice. I know that this is a personal project, but do you have any idea when you will reach 0.0.6 final?

What remains to be done? If you need some help, maybe add issues so the community can help you out.

Keep up the good work!

System.NullReferenceException - 0.0.6 beta 2

Hi @clarkis117,

Sorry for the spam, but I'm having another issue with txt files in the version 0.0.6 beta 2.

Exception: System.NullReferenceException: 'Object reference not set to an instance of an object.'
MimeDetective.ByteArrayExtensions.GetFileType(...) returned null.

Use case:
File.ReadAllBytes(filePath).GetFileType()

Sample of file: test.txt

TargetFramework: .NET Core 2.0

Thank you,

Add Async Support

open issue for adding TAP based Async Support

Add custom FileType

How can I add a custom file type?

I tried the following, but without success.

static MyClass()
{
    MimeAnalyzers.PrimaryAnalyzer.Insert(EPS);
}

static void Check() 
{
    var type = uploadedFile.InputStream.GetFileType(); // returns null but should return the EPS file. 
}

// https://www.garykessler.net/library/file_sigs.html
private static readonly FileType EPS = new FileType(
    new byte?[] { 0x25, 0x21, 0x50, 0x53, 0x2D, 0x41, 0x64, 0x6F,
        0x62, 0x65, 0x2D, 0x33, 0x2E, 0x30, 0x20, 0x45,
        0x50, 0x53, 0x46, 0x2D, 0x33, 0x20, 0x30 },
    "eps",
    "application/postscript");

I checked the file with a hex viewer, and the magic bytes configured above are correct.

Anything I'm missing?

Csv detected as wav

Hi,

Is csv supported? I've tried to detect a csv, but the file is detected as wav.

Thanks,

Does not work with files under 560 B.

The MIME type detection crashes (throws an ArgumentException) on files smaller than 560 bytes, e.g. tiny plain text files.
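
One way to avoid this class of failure is to read at most the header size and trim the buffer to what was actually available. The sketch below is an illustration under assumptions: `HeaderReader` and `ReadHeader` are hypothetical names, and 560 is taken from this issue rather than confirmed from the library's source.

```csharp
using System;
using System.IO;

public static class HeaderReader
{
    // Assumption: 560 is the header size implied by this issue.
    public const int MaxHeaderSize = 560;

    // Reads up to MaxHeaderSize bytes and returns only the bytes that were
    // actually read, so short files yield a short array instead of an error.
    public static byte[] ReadHeader(Stream stream)
    {
        var buffer = new byte[MaxHeaderSize];
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read <= 0) break; // end of stream reached early
            total += read;
        }

        var header = new byte[total];
        Array.Copy(buffer, header, total);
        return header;
    }
}
```

Signature comparisons downstream then need to check the header length before indexing into it, so a 3-byte file simply fails to match rather than throwing.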

Csv detected as wav

Hi,