sed_eval's People

Contributors: toni-heittola
sed_eval's Issues

EventBasedMetrics documentation is misleading

The documentation for EventBasedMetrics is misleading.

It says:

t_collar : float (0,]

    Time collar used when evaluating validity of the onset and offset, in seconds. Default value 0.2

percentage_of_length : float in [0, 1]

    Second condition, percentage of the length within which the estimated offset has to be in order to be consider valid estimation. Default value 0.5

The documentation for percentage_of_length suggests that it is an "AND" condition, not an "OR" condition. However, the code shows that the maximum of the two tolerances is used, which is also what mir_eval does:

        # Detect field naming style used and validate onset
        if 'event_offset' in reference_event and 'event_offset' in estimated_event:
            annotated_length = reference_event['event_offset'] - reference_event['event_onset']

            return math.fabs(reference_event['event_offset'] - estimated_event['event_offset']) <= max(t_collar, percentage_of_length * annotated_length)

        elif 'offset' in reference_event and 'offset' in estimated_event:
            annotated_length = reference_event['offset'] - reference_event['onset']

            return math.fabs(reference_event['offset'] - estimated_event['offset']) <= max(t_collar, percentage_of_length * annotated_length)

I would suggest adapting the EventBasedMetrics documentation along the lines of mir_eval's documentation for offset_ratio and offset_min_tolerance to make this behaviour clearer.
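
For concreteness, a hypothetical worked example of the current behaviour: with the defaults t_collar = 0.2 and percentage_of_length = 0.5, a reference event that is 2.0 s long gets an offset tolerance of max(0.2, 0.5 * 2.0) = 1.0 s, so the estimated offset only has to fall within the larger of the two bounds, not within both.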

Confusion matrix implemented?

Hi! First of all, thanks for this nice toolbox, very helpful! I was just wondering whether you had implemented a confusion matrix somewhere? I would be particularly interested in having one for event-based metrics.

thanks
Dorian
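
There does not appear to be a confusion matrix in the toolbox itself. As a stopgap, a segment-level confusion matrix can be assembled outside sed_eval; a minimal sketch (hypothetical, assuming each segment is reduced to a single dominant label, with 'none' for silent segments):

    from sklearn.metrics import confusion_matrix

    # Hypothetical per-segment labels; 'none' marks segments with no active event.
    labels = ['speech', 'dog', 'none']
    ref_per_segment = ['speech', 'none', 'dog', 'dog']   # reference, one label per segment
    est_per_segment = ['speech', 'dog', 'dog', 'none']   # system output, one label per segment

    cm = confusion_matrix(ref_per_segment, est_per_segment, labels=labels)
    print(cm)  # rows = reference classes, columns = estimated classes

For event-based metrics a confusion matrix is less obviously defined, since unmatched events have no counterpart class to be confused with.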

evaluated_length_seconds and evaluated_files in the SegmentBasedMetrics and EventBasedMetrics classes are not reset to 0 by the reset() function

The reset() function is supposed to initialize all internal state, so shouldn't these variables also be set to zero?

In the SegmentBasedMetrics class, the suggested reset() would be:

def reset(self):
    """Reset internal state
    """
    self.evaluated_length_seconds = 0
    self.evaluated_files = 0

    self.overall = {
        'Ntp': 0.0,
        'Ntn': 0.0,
        'Nfp': 0.0,
        'Nfn': 0.0,
        'Nref': 0.0,
        'Nsys': 0.0,
        'ER': 0.0,
        'S': 0.0,
        'D': 0.0,
        'I': 0.0,
    }

    self.class_wise = {}
    for class_label in self.event_label_list:
        self.class_wise[class_label] = {
            'Ntp': 0.0,
            'Ntn': 0.0,
            'Nfp': 0.0,
            'Nfn': 0.0,
            'Nref': 0.0,
            'Nsys': 0.0,
        }

    return self

Thank you for your hard work!

What's the difference between accuracy in sed_eval's metrics and binary_accuracy in TensorFlow?

Great library!
I have used this library for a while, but I still can't tell the difference between the segment-based accuracy (with the segment length set to 1 frame) used in sed_eval and binary accuracy in TensorFlow. My understanding is that 1-frame segment-based accuracy should equal binary accuracy in TensorFlow, since both compute the accuracy per frame. However, the two values are not equal for my model, and I don't know why.

The accuracy I mean is

accuracy = ( TP + TN ) / ( TP + TN + FP + FN )

Binary accuracy in TensorFlow (Keras) is

def binary_accuracy(y_true, y_pred):
    return K.mean(K.equal(y_true, K.round(y_pred)), axis=-1)

The two formulas above should mean the same thing. My output is frame-by-frame, and each frame has num_of_class scalars representing the presence of each class.
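
For what it's worth, on aligned, already-binarized activity matrices the two formulas do agree, which suggests any discrepancy comes from thresholding, segmentation, or frame alignment rather than from the formula itself. A quick numpy sketch on hypothetical data:

    import numpy as np

    y_true = np.random.randint(0, 2, size=(100, 5))        # frames x classes, binary reference
    y_pred = (np.random.rand(100, 5) > 0.5).astype(int)    # already-binarized predictions

    # Keras-style binary accuracy: mean over classes per frame, then mean over frames
    keras_style = np.mean(np.mean(y_true == y_pred, axis=-1))

    # Accuracy from pooled counts: (TP + TN) / (TP + TN + FP + FN)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    pooled = (tp + tn) / (tp + tn + fp + fn)

    print(keras_style, pooled)  # identical for data like this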

Please tell me whether I have misunderstood something. Thanks!

sed_eval.io.load_event_list does not support .csv

Hi Toni,

I found that the sed_eval.io.load_event_list function works well for .txt files, but when I change the suffix to .csv the events are parsed incorrectly. Do you have any ideas? Many thanks!

Qiuqiang
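
A possible workaround until .csv files are handled (a minimal sketch, not the library's documented API) is to parse the comma-separated file by hand into a list of event dictionaries, using field names the evaluators already understand:

    import csv

    def load_csv_event_list(path):
        """Hypothetical helper: read onset, offset, event_label columns from a headerless CSV."""
        events = []
        with open(path, newline='') as f:
            for row in csv.reader(f):
                if not row:
                    continue
                events.append({
                    'event_onset': float(row[0]),
                    'event_offset': float(row[1]),
                    'event_label': row[2].strip(),
                })
        return events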

Unified output types?

There's some inconsistency in the types of the computed metrics: error rate is float, but f-score is numpy.float64. Nsubs is float, but Ntp is numpy.float64. Should they maybe be unified?

F1 metric is NaN when it should be 0

In a particular test, my system failed to produce any event for one of the reference classes (C). As expected, its recall is 0 and its precision is NaN for that class (see below). Yet the F1 score for that class should, IMHO, be 0: yes, 2*P*R / (P+R) is formally 0/0, but at a higher level it is a complete miss, so the denominator is really ε.

In turn, the class-wise average should take class C into account as 0, rather than as NaN as it does currently, and should be the average of A and C instead of just A.

  Class-wise metrics
  ======================================
    Event label  | Nref    Nsys  | F        Pre      Rec    | ER       Del      Ins    | Sens     Spec     Bacc     Acc     
    ------------ | -----   ----- | ------   ------   ------ | ------   ------   ------ | ------   ------   ------   ------  
    A            | 37      30    | 74.6%    83.3%    67.6%  | 0.46     0.32     0.14   | 67.6%    97.2%    82.4%    92.1%   
    B            | 0       0     | nan%     nan%     nan%   | 0.00     0.00     0.00   | 0.0%     100.0%   50.0%    100.0%  
    C            | 33      0     | nan%     nan%     0.0%   | 1.00     1.00     0.00   | 0.0%     100.0%   50.0%    84.7%   
    D            | 0       0     | nan%     nan%     nan%   | 0.00     0.00     0.00   | 0.0%     100.0%   50.0%    100.0%  
  Class-wise average metrics (macro-average)
  ======================================
  F-measure
    F-measure (F1)                  : 74.63 %
    Precision                       : 83.33 %
    Recall                          : 33.78 %
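
For reference, a NaN-safe class-wise F1 along the lines argued above (a sketch of the proposed behaviour, not sed_eval's current code) could look like:

    def f_measure(ntp, nref, nsys):
        """Return 0.0 instead of NaN when a class is a complete miss."""
        precision = ntp / nsys if nsys > 0 else 0.0
        recall = ntp / nref if nref > 0 else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

Classes with neither reference nor system events (B and D above) could still be left out of the macro-average, so that it becomes the average over A and C.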

SegmentBasedMetrics constructor doesn't type check

The SegmentBasedMetrics constructor takes a list of valid event labels, which must be of type list. Passing a numpy.ndarray works during construction, but causes the code to crash when calling evaluate() with a not-so-easy-to-parse error. It would be helpful if the constructor checked the types of its two input arguments (a list and a float > 0) and raised errors if they are incorrect.
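
Until such checks exist, the simplest workaround is to convert to the expected built-in types before constructing the metrics object (the labels below are just an illustration):

    import numpy as np
    import sed_eval

    labels = np.array(['speech', 'dog'])   # e.g. labels coming out of a numpy pipeline

    metrics = sed_eval.sound_event.SegmentBasedMetrics(
        event_label_list=labels.tolist(),  # plain Python list of str
        time_resolution=1.0                # plain float > 0
    )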

Examples in docs for evaluating results directly in python without loading from files

Currently the only example (that I could find) in the documentation for using sed_eval in Python assumes the reference and estimated events are saved to disk as lab files. It would be helpful to have an example showing how to use sed_eval to compare reference/estimate lists that live in memory directly, including the expected data format (EventList?). Thanks!
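
A minimal sketch of what such an example might look like, assuming plain dictionaries with event_onset/event_offset/event_label keys are accepted (the field-name handling in the event matching code suggests they are):

    import sed_eval

    reference = [
        {'event_onset': 0.5, 'event_offset': 2.0, 'event_label': 'speech'},
        {'event_onset': 3.0, 'event_offset': 4.5, 'event_label': 'dog'},
    ]
    estimated = [
        {'event_onset': 0.6, 'event_offset': 2.1, 'event_label': 'speech'},
    ]

    metrics = sed_eval.sound_event.SegmentBasedMetrics(
        event_label_list=['speech', 'dog'],
        time_resolution=1.0,
    )
    metrics.evaluate(reference_event_list=reference, estimated_event_list=estimated)
    print(metrics.results_overall_metrics())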

KeyError: 'file'

Hey there,
I just noticed that during evaluation of scene classification results, the following error happens:

Traceback (most recent call last):
  File "../../../../system/evaluate17.py", line 36, in <module>
    file_pair['estimated_scene_list'])
  File ".local/lib/python3.6/site-packages/sed_eval/scene.py", line 159, in evaluate
    if estimated_item['file'] == reference_item['file']:
KeyError: 'file'

I checked the source of .load() in dcase_util and it seems that the field has been renamed from file to filename. A simple fix would be to just replace the name.

Or did I misunderstand some of the usage of this script?
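
Until the naming is reconciled, one workaround (a hypothetical helper, not part of either library) is to copy the filename field back into file in both scene lists before calling evaluate():

    def restore_file_field(scene_list):
        """Copy 'filename' into the 'file' field expected by sed_eval.scene."""
        for item in scene_list:
            if 'file' not in item and 'filename' in item:
                item['file'] = item['filename']
        return scene_list

    # e.g. reference_scene_list = restore_file_field(reference_scene_list)
    #      estimated_scene_list = restore_file_field(estimated_scene_list)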

dcase_util dependency?

Hey all, thanks for providing this package!

I noticed that there is now a dependence on dcase_util, even though only two pieces of that package are used in sed_eval: MetaDataContainer and FancyStringifier. I understand that these packages are developed together, but I wonder if it's worth refactoring the design a bit to reverse the direction of the dependency? That is, make dcase_util depend on sed_eval instead?

I bring this up because dcase_util itself has a rather heavy dependency chain, while sed_eval's is comparatively light. As far as I can tell, there's nothing in sed_eval that requires any audio or signal processing, but dcase_util brings over a load of otherwise unused dependencies (librosa, youtube-dl, etc.). More to the point, there are many contexts outside of DCASE where sed_eval could be useful, so it would be beneficial to keep the footprint as small as possible.

Graph-based matching for optimal (and correct) reference to estimate event matching

Just found out about this library, great work! (was about to start writing basically the same but decided to have one final look to see if there's anything out there already - glad I found it).

For the event-based metrics the evaluate function matches events by iterating over all reference and estimated events in a nested loop. I have two questions about this:

  1. Doesn't this mean that the same estimated event can be matched against more than one reference event? I'm not sure that's a desired behavior - presumably every reference event should be matched against at most one estimated event and vice versa?
  2. Assuming the goal of (1) is indeed to match every reference event against at most one estimated event (and vice versa), then using a nested loop means the matching is greedy and not necessarily optimal (i.e. it might not find the optimal pairing of matching events), leading to an underestimation of performance. This is exactly the case for note transcription, which is basically the same problem: matching pairs of events based on onset, offset and label (pitch) criteria. To get around this in mir_eval we match notes using bipartite graph matching, which is guaranteed to find the optimal pairing of reference and estimated events; the same approach is used everywhere else in mir_eval where two sets of events need to be matched into pairs (a sketch is given after this list).

Let me know what you think, cheers!
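
For illustration, here is a sketch of optimal one-to-one matching via bipartite graph matching (a hypothetical helper built on scipy, simplified to an onset-and-label criterion; not sed_eval's current code):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_events(reference, estimated, t_collar=0.2):
        """Return (ref_idx, est_idx) pairs with each event used at most once."""
        # Boolean "hit" matrix: True where an estimated event is a valid match.
        hits = np.zeros((len(reference), len(estimated)), dtype=bool)
        for i, ref in enumerate(reference):
            for j, est in enumerate(estimated):
                onset_ok = abs(ref['event_onset'] - est['event_onset']) <= t_collar
                label_ok = ref['event_label'] == est['event_label']
                hits[i, j] = onset_ok and label_ok
        # Maximise the number of matched pairs (minimise negated hits).
        row, col = linear_sum_assignment(-hits.astype(float))
        return [(i, j) for i, j in zip(row, col) if hits[i, j]]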

Process to generate the event list as a CSV formatted text file

Hi @toni-heittola,
I am working on the SED task for the first time and got stuck generating the event list as a CSV formatted text file for the evaluation on the development and evaluation datasets of DCASE 2017 tasks 2 and 3. I can get the test predictions, but I am not sure how to convert these frame-wise predictions into time intervals (onset and offset) for the event list CSV file. Kindly point me to any reference where I can learn this process.

Stay Safe
Best Regards
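
One common approach is to binarize the frame-wise predictions and scan each class's activity for onset/offset transitions. A minimal sketch (hypothetical helper, assuming binary predictions of shape frames x classes and a fixed hop size in seconds):

    import csv
    import numpy as np

    def frames_to_event_csv(predictions, class_labels, hop_seconds, out_path):
        """Write (onset, offset, event_label) rows derived from binary frame activity."""
        predictions = np.asarray(predictions)
        with open(out_path, 'w', newline='') as f:
            writer = csv.writer(f)
            for c, label in enumerate(class_labels):
                # Pad with zeros so events touching the clip edges still produce transitions.
                activity = np.concatenate(([0], predictions[:, c].astype(int), [0]))
                changes = np.diff(activity)
                onsets = np.where(changes == 1)[0]
                offsets = np.where(changes == -1)[0]
                for onset, offset in zip(onsets, offsets):
                    writer.writerow([onset * hop_seconds, offset * hop_seconds, label])

Depending on the exact submission format, a filename column or a tab delimiter may also be required, and some post-processing (e.g. median filtering or a minimum event length) is usually applied before extracting the events.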
