mime-detective-clarkis117's People

Contributors

andersonpimentel, clarkis117, joana-osorio, muraad, ofthelit, pydez, sandrock, wewebber

mime-detective-clarkis117's Issues

Word / Excel 97-2003 files may be detected as MSDOC

Word / Excel 97-2003 files may be detected as MSDOC type and not WORD or EXCEL.
I hit the issue by creating a blank Excel file in Excel 2007 and saving it as XLS (see blank.zip).

From what I could check, Office 97-2003 file signatures are based on "subheaders", and there may be several of them without clear documentation. When none of the known subheaders matches, the library detects the file as the MSDOC type.

I would therefore suggest:

  • Rename MSDOC type as MS_OFFICE to be more accurate
  • Add the list of known MS office extensions (at least doc,ppt,xls)
    • or the list as defined here

Something like:
// OLECF - Object Linking and Embedding (OLE) Compound File (CF)
// Compound Binary File format by Microsoft, used by Microsoft Office 97-2003 applications (Word, PowerPoint, Excel, Wizard)
public static readonly FileType MS_OFFICE = new FileType(new byte?[] { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 }, "doc,ppt,xls", "application/octet-stream");

Since the type appears after WORD and EXCEL types, the detection would first match based on subheaders and default to this one if the subheader does not match.
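
A minimal sketch of that ordering, for illustration only. The class `OleDetectionSketch` and its helpers are hypothetical and not part of the library, and the Word and Excel subheader bytes at offset 512 are commonly cited values that are assumed here rather than taken from the library's own definitions:

```csharp
using System;

// Hypothetical helper, not part of Mime-Detective.
public static class OleDetectionSketch
{
    // OLE compound file header, as in the FileType proposed above.
    static readonly byte[] OleHeader = { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Assumed subheaders at offset 512 (commonly cited values, not verified against the library).
    const int SubheaderOffset = 512;
    static readonly byte[] WordSubheader = { 0xEC, 0xA5, 0xC1, 0x00 };
    static readonly byte[] ExcelSubheader = { 0x09, 0x08, 0x10, 0x00, 0x00, 0x06, 0x05, 0x00 };

    // True when 'expected' appears in 'data' starting at 'offset'.
    static bool HasBytesAt(byte[] data, int offset, byte[] expected)
    {
        if (data.Length < offset + expected.Length) return false;
        for (int i = 0; i < expected.Length; i++)
        {
            if (data[offset + i] != expected[i]) return false;
        }
        return true;
    }

    // Returns "doc" or "xls" when a subheader matches, the generic "doc,ppt,xls"
    // when only the OLE header matches, and null when the data is not an OLE file.
    public static string Detect(byte[] data)
    {
        if (!HasBytesAt(data, 0, OleHeader)) return null;
        if (HasBytesAt(data, SubheaderOffset, WordSubheader)) return "doc";
        if (HasBytesAt(data, SubheaderOffset, ExcelSubheader)) return "xls";
        return "doc,ppt,xls"; // fallback: subheader unknown
    }
}
```

With this order, a file whose subheader is not in the list still gets a non-null result instead of being missed.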

Massive Upgrade

Hi @clarkis117 -
I have recently done a complete rewrite of Mime Detective and added the ability to detect over 14,000 different file types.

I'm interested in publishing my update and, with your permission, taking over maintenance of the nuget package. Can you contact me to discuss?

Detecting .doc and .xls files using the StreamExtensions or ByteArrayExtensions

When loading an old Microsoft Office file into a stream or byte array, the MimeType returned by GetFileType() is doc,ppt,xls, not the expected xls or doc.

For example, both of these tests fail:

    [Fact]
    public void CanReadExcelFileFromByteArray()
    {
        var result = File.ReadAllBytes("./data/Documents/XlsExcel2007.xls").GetFileType();

        Assert.NotNull(result);
        Assert.Equal(MimeTypes.EXCEL, result);
    }

    [Fact]
    public void CanReadExcelFileFromStream()
    {
        using (FileStream stream = File.Open("./data/Documents/XlsExcel2007.xls", FileMode.Open))
        {
            var result = stream.GetFileType(false, true);

            Assert.NotNull(result);
            Assert.Equal(MimeTypes.EXCEL, result);
        }
    }

Any ideas how to get this to work without having to save the files to disk and then loading them using the FileInfo object?

mp4

I have used your tool very successfully. However, the following code returns null when an mp4 file is tested
(the filePath extension is mp4):

    NSUrl videoFileURL = NSUrl.FromString(filePath);
    Uri uri = new Uri(videoFileURL.ToString());
    StreamReader streamReader = new StreamReader(filePath);
    FileInfo fileInfo = new FileInfo(filePath);

    Stream stream = streamReader.BaseStream;
    FileType fileType = MimeDetective.FileInfoExtensions.GetFileType(fileInfo);

(The fileType here is null.)
This code works fine for .mov files (QuickTime) but fails with mp4. Please let me know if I am using it incorrectly for this type of file or if there is actually an issue.
Thanks!

PDF File Detected as plain/text

Hi,

I have tried to detect a 'pdf' file but I got 'txt' as a response

IFormFileCollection invoices
       
foreach (var invoice in invoices)
{
    using (var ms = invoice.OpenReadStream())
    {
        FileType fileType = ms.GetFileType();
    }
}

The file's bytes start with 25 50 44 46 (%PDF), so it should be recognized as a PDF.

The first bytes of the PDF file are:

25 50 44 46 2d 31 2e 37 0a 25 ef bf bd ef bf bd ef bf bd ef bf bd 0a 31 20 30 20 6f 62 6a 0a 3c 3c 2f 54 79 70 65 2f 43 61 74 61 6c 6f 67 2f 50 61 67 65 73 20 32 20 30 20 52 2f 4c 61 6e 67 28 74 72 2d 54 52 29 20 2f 53 74 72 75 63 74 54 72 65 65 52 6f 6f 74 20 31 30 20 30 20 52 2f 4d 61 72 6b 49 6e 66 6f 3c 3c 2f 4d 61 72 6b 65 64 20 74 72 75 65 3e 3e 2f 4d 65 74 61 64 61 74 61 20 32 31 20 30 20 52 2f 56 69 65 77 65 72 50 72 65 66 65 72 65 6e 63 65 73 20 32 32 20 30 20 52 3e 3e 0a 65 6e 64 6f 62 6a 0a 32 20 30 20 6f 62 6a 0a 3c 3c 2f 54 79 70 65 2f 50 61 67 65 73 2f 43 6f 75 6e 74 20 31 2f 4b 69 64 73 5b 20 33 20 30 20 52 5d 20 3e

This is a .NET Core project.

net451 support

net451 support would be nice to have, because many companies are still limited to this framework.

Empty ZIP files not recognized

Empty ZIP files are a special kind of ZIP. The header differs.

File contents: 50 4b 05 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Calling GetFileType(this byte[] bytes) returns null.

$ file -i ZIPempty.zip
ZIPempty.zip: application/zip; charset=binary
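
For illustration, a small sketch of a check that accepts the empty-archive header as well. The class and method names are hypothetical and not part of the library; the three signatures are the standard ZIP record markers (local file header, end of central directory, spanned archive):

```csharp
using System;

// Hypothetical helper, not part of Mime-Detective.
public static class ZipSignatureSketch
{
    // True when the data starts with one of the known ZIP record signatures.
    public static bool LooksLikeZip(byte[] data)
    {
        if (data == null || data.Length < 4) return false;
        if (data[0] != 0x50 || data[1] != 0x4B) return false; // "PK"

        return (data[2] == 0x03 && data[3] == 0x04)  // local file header (non-empty archive)
            || (data[2] == 0x05 && data[3] == 0x06)  // end of central directory (empty archive)
            || (data[2] == 0x07 && data[3] == 0x08); // spanned archive marker
    }
}
```

The 22-byte file shown above starts with 50 4b 05 06, so it falls into the second branch.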

Throwing an exception when it cannot determine the file type

Methods in MimeTypes.cs that return FileType usually have this comment:

	/// <summary>
	/// Read header of a file and depending on the information in the header
	/// return object FileType.
	/// *Return null in case when the file type is not identified.*
	/// Throws Application exception if the file can not be read or does not exist
	/// </summary>

But because of this line, https://github.com/clarkis117/Mime-Detective/blob/master/src/Mime-Detective/MimeTypes.cs#L314, the statement about the return value isn't true.

PDF files detected as plain/text

I faced the same issue #12 as reported in the original repository.

The proposed change was to run the file signature detection first, and then plain text detection.
That would definitely fix the issue and be more reliable.

Would you consider making this change?

If so, it might then be possible to improve the plain text detection to detect the file encoding.
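
To make the proposed order concrete, here is a rough sketch. It is not the library's code: the class `SignatureFirstSketch` is hypothetical, only the PDF signature is checked, and the plain text test is a deliberately crude assumption (non-empty data without NUL bytes):

```csharp
using System;

// Hypothetical helper, not part of Mime-Detective.
public static class SignatureFirstSketch
{
    static readonly byte[] PdfMagic = { 0x25, 0x50, 0x44, 0x46 }; // "%PDF"

    static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    // Crude stand-in for plain text detection: non-empty and free of NUL bytes.
    static bool LooksLikeText(byte[] data)
    {
        if (data.Length == 0) return false;
        foreach (byte b in data)
        {
            if (b == 0) return false;
        }
        return true;
    }

    // Signature check runs first; the text heuristic is only a fallback.
    public static string Detect(byte[] data)
    {
        if (StartsWith(data, PdfMagic)) return "pdf";
        if (LooksLikeText(data)) return "txt";
        return null;
    }
}
```

A PDF header such as 25 50 44 46 2d 31 2e 37 also passes the text heuristic, which is why the order of the two checks matters.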

NullReferenceException reading Excel: 0.0.6-alpha1

Detecting an Excel file from a byte array throws a NullReferenceException at MimeDetective.MimeTypes.FindZipType(ReadResult& readResult) in \src\Mime-Detective\MimeTypes.cs:line 314


var fileData = File.ReadAllBytes(filePath);
var type = fileData.GetFileType(); 

test.xlsx

"End of Central Directory record could not be found."- 0.0.6-beta1

Hi @clarkis117,

I've updated the NuGet package to version 0.0.6-beta1 and now I'm getting the error "End of Central Directory record could not be found" with a byte array of an Excel file.

Sample of code: File.ReadAllBytes(filePath).GetFileType()

Sample of file: test.xlsx

TargetFramework: .NET Core 2.0

Could you help me with this?

Additionally, do you have an expected date for the final NuGet release of 0.0.6?

Thank you,

Add more framework targets

Your current targets are:
netstandard1.3;net45

Would it be possible to also add net471 and netstandard2.0?

I tried including your latest beta NuGet package in my net471 project and got some unexpected warnings:
Warning Found conflicts between different versions of "System.Net.Http" that could not be resolved. These reference conflicts are listed in the build log when log verbosity is set to detailed.
The dependencies System.Buffers and System.Xml.XmlSerializer both got included, but I don't think they should be?

MimeDetective LearnMimeType reads every other byte of a header

It appears that public static FileType LearnMimeType(FileInfo first, FileInfo second, string mimeType, int maxHeaderSize = 12, int minMatches = 2, int maxNonMatch = 3) in MimeDetective.cs checks only every other byte for matches and adds those matches to the header of the new file type it returns.

Is this intentional?

Additionally, I'm having a hard time figuring out the purpose of this method. Is it used to find the common header of a known file type/MIME extension, or to find the MIME type of an unknown file by comparing it to a file of a known MIME type, or something else?
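
For comparison, this is roughly what I would expect a byte-by-byte version to look like. It is only a sketch under my own assumptions, not the library's LearnMimeType: the class and method names are hypothetical, it works on byte arrays rather than FileInfo, and it ignores minMatches and maxNonMatch:

```csharp
using System;

// Hypothetical helper, not part of Mime-Detective.
public static class HeaderLearningSketch
{
    // Compares the first maxHeaderSize bytes of two samples position by position.
    // A position keeps the shared byte when both samples agree and null otherwise.
    public static byte?[] CommonHeader(byte[] first, byte[] second, int maxHeaderSize = 12)
    {
        int length = Math.Min(maxHeaderSize, Math.Min(first.Length, second.Length));
        var header = new byte?[length];

        for (int i = 0; i < length; i++) // every byte, not every other byte
        {
            header[i] = first[i] == second[i] ? first[i] : (byte?)null;
        }

        return header;
    }
}
```

Called with two samples of the same format, the result would be a byte?[] pattern with null as a wildcard, the same shape the FileType constructor takes in the snippets elsewhere in these issues.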

Any thoughts on when 0.0.6 final will be published?

The latest version really looks nice. I know that this is a personal project, but do you have any idea when you will reach 0.0.6 final?

What remains to be done? If you need some help, maybe add issues so the community can help you out.

Keep up the good work!

System.NullReferenceException - 0.0.6 beta 2

Hi @clarkis117,

Sorry for the spam, but I'm having another issue with txt files in the version 0.0.6 beta 2.

Exception: System.NullReferenceException: 'Object reference not set to an instance of an object.'
MimeDetective.ByteArrayExtensions.GetFileType(...) returned null.

Use case:
File.ReadAllBytes(filePath).GetFileType()

Sample of file: test.txt

TargetFramework: .NET Core 2.0

Thank you,

Add custom FileType

How can I add a custom file type?

I tried the following, but without success.

static MyClass()
{
    MimeAnalyzers.PrimaryAnalyzer.Insert(EPS);
}

static void Check() 
{
    var type = uploadedFile.InputStream.GetFileType(); // returns null but should return the EPS file. 
}

// https://www.garykessler.net/library/file_sigs.html
private static readonly FileType EPS = new FileType(
    new byte?[] { 0x25, 0x21, 0x50, 0x53, 0x2D, 0x41, 0x64, 0x6F,
        0x62, 0x65, 0x2D, 0x33, 0x2E, 0x30, 0x20, 0x45,
        0x50, 0x53, 0x46, 0x2D, 0x33, 0x20, 0x30 },
    "eps",
    "application/postscript");

I checked the file with a hex viewer and the magic bytes are correctly configured above.

Anything I'm missing?

Csv detected as wav

Hi,

Is csv supported? I've tried to detect a csv, but the file is detected as wav.

Thanks,
