
Comments (9)

sezonis commented on May 24, 2024

What use case are you trying to solve that requires a time-location accuracy of more than 500 ms?

I am chasing the (admittedly somewhat unreasonable) dream of removing ads / intros / outros / recaps / undesirable repetitive content completely, while simultaneously retaining all desired content, automatically/programmatically with no user input beyond a folder full of similar content and 2 files to fingerprint.

This is for youtube right? Yea, that was pretty much what I was using it for. I ended up just using image recognition along with this library to get it "more accurate". Sure, it uses more resources but it gets the job done.

As for accuracy, I think it's very complicated to get it "pixel perfect" (like near 0ms) because I'd have to assume that the sound waves are always "similar" in some sense, so the algorithm has to make sure it's a match. This library isn't the only library with this issue as well, there are a few others, and I don't think there's a simple way to make it always accurate. Like, it was able to get the audio for me at 0ms with my tricks, by giving it a slightly earlier sound and it matched it at least with a 50ms delay, instead of 500ish. But of course, with one small change in the sound (even one so tiny only a program can see it), the delay goes back up. If you want a real foolproof solution, then I suggest you use this library in conjunction with something else, to ensure it is accurate.

from soundfingerprinting.

AddictedCS commented on May 24, 2024

startsAtSecond and secondsToProcess

@jack4455667788 the issue #207 had a broader effect, and in case you were using these parameters in conjunction with MediaType.Audio | MediaType.Video during fingerprinting or query I suggest you upgrade to v8.24.0 where it was fixed, and see if you get better results.

It would be wonderful if this could be done, and I don't really understand why it can't.

Intuitively, you can think of it as a discretization problem: the challenge of transforming a signal (audio in this case) into a set of discrete fingerprints that approximate it. There is a resolution that defines a fingerprint (i.e., 128x32), which approximates about 1.48 seconds of audio signal. These fingerprints are generated using a certain stride, a step between consecutive fingerprints. By default, the stride is 512 samples during fingerprinting (92 ms) and a random value in the range [256, 512] samples during query (46 to 92 ms); these values are defined in the Configs class.
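To put numbers on the stride values above, here is a minimal sketch of the arithmetic. It assumes the audio is processed at 5512 Hz, a sample rate that is not stated in this thread, and `StrideMath` with its members is a hypothetical helper written for illustration, not part of SoundFingerprinting. The fingerprint count is an approximation that ignores the length of the fingerprint itself.

```csharp
using System;

// Hypothetical helper for illustration; not part of SoundFingerprinting.
public static class StrideMath
{
    // Assumption: audio is downsampled to 5512 Hz before fingerprinting.
    public const int SampleRate = 5512;

    // Converts a stride expressed in samples to milliseconds.
    public static double StrideToMilliseconds(int strideInSamples) =>
        strideInSamples * 1000.0 / SampleRate;

    // Approximate number of fingerprints generated for a clip of the given length,
    // counting one fingerprint per stride and ignoring the fingerprint's own length.
    public static long FingerprintsPerClip(double seconds, int strideInSamples) =>
        (long)(seconds * SampleRate) / strideInSamples;
}
```

Under that assumption, a stride of 512 samples is roughly 92.9 ms, 256 samples is roughly 46.4 ms, and a single sample is roughly 0.18 ms. Shrinking the stride narrows the gap between candidate alignments, but the fingerprint count grows in inverse proportion: one minute of audio yields about 645 fingerprints at a stride of 512 and 330,720 at a stride of 1.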

You can decrease the stride between consecutive fingerprints during fingerprinting (say to an extreme case of 1 sample) to increase the chances of having a perfectly aligned fingerprint during query time, but this will substantially increase the footprint of your model service that stores these fingerprints (generating 512x more fingerprints):

var hashes = await FingerprintCommandBuilder.Instance
    .BuildFingerprintCommand()
    .From(pathToFile)
    .WithFingerprintConfig(cfg =>
    {
        // a stride of 1 creates a new fingerprint at every sample (~0.18 ms apart)
        cfg.Audio.Stride = new IncrementalStaticStride(incrementBy: 1);
        return cfg;
    })
    .UsingServices(audioService)
    .Hash();

This is still not a good solution because even with perfectly aligned signals, you can have distortions generated by encoding/aliasing that will prevent perfect matches. The default values have been empirically defined to maximize recall and precision while minimizing the audio signal's footprint.

Now to the problem of cutting the ads to the precise frame. SoundFingerprinting.Emy contains a strategy that can help you with your problem. There is an experimental class named EdgeSearchStrategy which looks for edges in a video file.

How it works: once you identify a match, you can run a second analysis over the video looking for edges (i.e., black frames and scene changes) around the area where you expect the content to have started/ended. This implies you need access to the matched content (for example, if you are matching over streaming content, you need to generate a file from the streaming match that covers the area where the match happened).

var StartEndEdgeSearchLocationDelta = 3;

// audio object of type QueryResult
var optimalLength = audio?.BestMatch.Track.Length;

// this file has to cover the area of the audio.BestMatch
// also it is recommended to extend the area of the match by StartEndEdgeSearchLocationDelta
// as an example, if your match happened at 09:30:00 till 09:30:30 (hh:mm:ss), then extend the area of the analyzed content by 3 seconds at start/end location 09:29:57 till 09:30:33 (totally extending the match by 6 seconds)
var extendedMediaFile = "path to streaming content that matched";

var edgeSearchStrategy = new EdgeSearchStrategy(new NLogLoggerFactory());
var edgeSearchConfig = new EdgeSearchConfig(
    new BlackFramesFilterConfiguration { Threshold = 32, Amount = 94 },
    SceneChangeThreshold: 0.4,
    OptimalLength: optimalLength,
    StartsAtHint: StartEndEdgeSearchLocationDelta,
    EndsAtHint: StartEndEdgeSearchLocationDelta + optimalLength);
var mediaSegment = edgeSearchStrategy.FindMediaSegmentClosestToOptimalLength(extendedMediaFile, edgeSearchConfig);

if (mediaSegment != null)
{
    // better edges have been found
}
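The window arithmetic described in the code comments above can be sketched as follows. `MatchWindow` and `Extend` are hypothetical names used for illustration and are not part of SoundFingerprinting or SoundFingerprinting.Emy. The sketch only computes the times to cut and the start hint relative to the beginning of the cut file; producing the file itself is left to whatever tool you use.

```csharp
using System;

// Hypothetical helper for illustration; not part of any library.
public static class MatchWindow
{
    // Extends a matched interval by deltaSeconds on both sides, clamping the start at zero.
    // StartsAtHint is where the original match begins, measured in seconds
    // from the start of the extended window.
    public static (TimeSpan Start, TimeSpan End, double StartsAtHint) Extend(
        TimeSpan matchStart, TimeSpan matchEnd, double deltaSeconds)
    {
        var delta = TimeSpan.FromSeconds(deltaSeconds);
        var start = matchStart > delta ? matchStart - delta : TimeSpan.Zero;
        var end = matchEnd + delta;
        return (start, end, (matchStart - start).TotalSeconds);
    }
}
```

For the example in the comment, `MatchWindow.Extend(new TimeSpan(9, 30, 0), new TimeSpan(9, 30, 30), 3)` gives a window from 09:29:57 to 09:30:33 with a start hint of 3 seconds. When the match sits closer than 3 seconds to the beginning of the stream, the start is clamped to zero and the hint shrinks accordingly.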

Keep in mind this is an experimental API, and you need FFmpeg installed to use it https://github.com/AddictedCS/soundfingerprinting/wiki/Audio-Services.

Let me know if any of the above helped.


AddictedCS commented on May 24, 2024

What use case are you trying to solve that requires a time-location accuracy of more than 500 ms?


AddictedCS commented on May 24, 2024

The referenced commit 762a9e2 fixes issue #207; I'm adding this comment to avoid confusion.


jack4455667788 commented on May 24, 2024

What use case are you trying to solve that requires a time-location accuracy of more than 500 ms?

I am chasing the (admittedly somewhat unreasonable) dream of removing ads / intros / outros / recaps / undesirable repetitive content completely, while simultaneously retaining all desired content, automatically/programmatically with no user input beyond a folder full of similar content and 2 files to fingerprint.

It would be wonderful if this could be done, and I don't really understand why it can't.

Assuming this was a "brick wall" I have already implemented a manual user interface to scan fwd/back through the video/audio frames (around the query result match suggested time) so the user can specify the precise points where the undesirable content begins and ends - but it would be much preferred if that were not necessary.


jack4455667788 commented on May 24, 2024

This is for youtube right?

My hope is that it might be for most everything that has repeating content to be removed.

I ended up just using image recognition along with this library to get it "more accurate". Sure, it uses more resources but it gets the job done.

I was attempting to do that with this library (as it does video fingerprinting as well) but I didn't get very far (it didn't recognize the common content across the two sub-clips AND it appeared to be fingerprinting the entire video instead of just the segment specified by startsAtSecond and secondsToProcess). The real trouble is that without first knowing, precisely, where the first frame of the content to remove is, image recognition doesn't really help. My hope is for all this analysis and comparison to be done programmatically and require no user input.

Out of curiosity, what did you end up using for the image recognition? And was it able to compare two different video streams and find the common frames between them (ideally with frame perfect accuracy)?

This library isn't the only library with this issue as well

I've noticed! I've only tried a few others so far, but they have the same inaccuracy issue.

Like, it was able to get the audio for me at 0ms with my tricks, by giving it a slightly earlier sound and it matched it at least with a 50ms delay, instead of 500ish.

I'm doing something similar, and often the delay isn't so bad - but I want it to be 0. If it isn't the hashing algorithm itself, I think the random stride may be involved in the inconsistent results - but I am hoping to understand the problem better in any case.

If you want a real foolproof solution, then I suggest you use this library in conjunction with something else, to ensure it is accurate.

Thanks for the tips! I'm open to any suggestions you might have regarding the "something else"!


nicko88 commented on May 24, 2024

FWIW I try to use this software for timing accuracy in order to sync external systems with media playback (using real-time matching). The more accurate timing the better in my case.


AddictedCS commented on May 24, 2024

Hey @jack4455667788, did anything from the above message help in solving your issue?

