
audioset_tagging_cnn's People

Contributors

dopc, qiuqiangkong


audioset_tagging_cnn's Issues

SED is unavailable for some models

Even though we can use sound_event_detection with the model "Cnn14_DecisionLevelMax_mAP=0.385.pth" via the command:
python pytorch/inference.py sound_event_detection --model_type="Cnn14_DecisionLevelMax" --checkpoint_path="models\Cnn14_DecisionLevelMax_mAP=0.385.pth" --audio_path="examples/R9_ZSCveAHg_7s.wav" --cuda

The models "MobileNetV1_mAP=0.389.pth" and "Wavegram_Logmel_Cnn14_mAP=0.439.pth" does not work with command :
python pytorch/inference.py sound_event_detection --model_type="Wavegram_Logmel_Cnn14" --checkpoint_path="models\Wavegram_Logmel_Cnn14_mAP=0.439.pth" --audio_path="examples/R9_ZSCveAHg_7s.wav" --cuda

Indeed, 'framewise_output' is not returned by the model, which raises the error:
Traceback (most recent call last):
  File "pytorch/inference.py", line 202, in <module>
    sound_event_detection(args)
  File "pytorch/inference.py", line 132, in sound_event_detection
    framewise_output = batch_output_dict['framewise_output'].data.cpu().numpy()[0]
KeyError: 'framewise_output'
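
A minimal defensive sketch (not part of the repo's inference.py): only the DecisionLevel* checkpoints return a 'framewise_output' key, while clip-level tagging models such as MobileNetV1 and Wavegram_Logmel_Cnn14 only return 'clipwise_output' and 'embedding', so a clearer failure message can be raised up front.

def get_framewise_output(batch_output_dict):
    # Hypothetical helper: fail clearly when the loaded checkpoint is a
    # clip-level tagging model without framewise output.
    if 'framewise_output' not in batch_output_dict:
        raise KeyError(
            "This model has no 'framewise_output'; use a Cnn14_DecisionLevel* "
            "checkpoint (e.g. Cnn14_DecisionLevelMax) for sound_event_detection.")
    return batch_output_dict['framewise_output'].data.cpu().numpy()[0]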

KeyError: 'framewise_output'

Hi,

When I run
CUDA_VISIBLE_DEVICES=0 python3 pytorch/inference.py sound_event_detection --model_type=$MODEL_TYPE --checkpoint_path=$CHECKPOINT_PATH --audio_path="resources/7061-6-0-0.wav" --cuda
I got an error saying

Traceback (most recent call last):
  File "pytorch/inference.py", line 202, in <module>
    sound_event_detection(args)
  File "pytorch/inference.py", line 132, in sound_event_detection
    framewise_output = batch_output_dict['framewise_output'].data.cpu().numpy()[0]

Then if I print batch_output_dict I see that the keys are dict_keys(['clipwise_output', 'embedding']). Am I missing something?

Thanks

Code to plot "log spectrogram" + class probabilities

Hi, I wanted to ask if you could please provide the code for your visualization. I would really like to reproduce your plot with other audio files.

In detail: the visualization of sound event detection, with the log spectrogram at the top and the class probabilities at the bottom. The image can be found in resources/sed_R9_ZSCveAHg_7s.png.

That would be really great!
Thanks in advance.
Lydia
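
A minimal sketch of such a two-panel figure (log-mel spectrogram above, framewise class probabilities below), assuming a framewise_output array of shape (time_steps, classes_num) from one of the DecisionLevel* models and the labels list from utils/config.py; the STFT settings below mirror the 32 kHz models (n_fft=1024, hop=320, 64 mel bins) but are assumptions, not the repo's plotting code.

import librosa
import matplotlib.pyplot as plt
import numpy as np

def plot_sed(audio_path, framewise_output, labels, out_path='sed_plot.png'):
    waveform, sr = librosa.load(audio_path, sr=32000, mono=True)
    mel = librosa.feature.melspectrogram(
        y=waveform, sr=sr, n_fft=1024, hop_length=320, n_mels=64,
        fmin=50, fmax=14000)
    log_mel = librosa.power_to_db(mel)

    # Pick the 5 classes with the highest peak framewise probability.
    top_idxes = np.argsort(np.max(framewise_output, axis=0))[::-1][:5]

    fig, axs = plt.subplots(2, 1, figsize=(10, 6))
    axs[0].imshow(log_mel, origin='lower', aspect='auto', cmap='jet')
    axs[0].set_ylabel('Mel bins')
    axs[0].set_title('Log-mel spectrogram')

    for idx in top_idxes:
        axs[1].plot(framewise_output[:, idx], label=labels[idx])
    axs[1].set_xlabel('Frames')
    axs[1].set_ylabel('Probability')
    axs[1].set_ylim(0, 1.)
    axs[1].legend(loc='upper right')

    plt.tight_layout()
    plt.savefig(out_path)
    print('Saved figure to {}'.format(out_path))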

Shape doesn't match when inferencing Cnn14_16k model

Great work, and thanks for sharing!

When I run this command according to the README:

python pytorch/inference.py audio_tagging --sample_rate=16000 --window_size=512 --hop_size=160 --mel_bins=64 --fmin=50 --fmax=8000 --model_type="Cnn14_16k" --checkpoint_path="Cnn14_16k_mAP=0.438.pth" --audio_path='resources/R9_ZSCveAHg_7s.mp3'

I get this error:

Traceback (most recent call last):
  File "pytorch/inference.py", line 201, in <module>
    audio_tagging(args)
  File "pytorch/inference.py", line 42, in audio_tagging
    model.load_state_dict(checkpoint['model'])
  File "/home/zongbowen/anaconda2/envs/tensorflow/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1045, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Cnn14_16k:
	size mismatch for spectrogram_extractor.stft.conv_real.weight: copying a param with shape torch.Size([257, 1, 512]) from checkpoint, the shape in current model is torch.Size([129, 1, 256]).
	size mismatch for spectrogram_extractor.stft.conv_imag.weight: copying a param with shape torch.Size([257, 1, 512]) from checkpoint, the shape in current model is torch.Size([129, 1, 256]).
	size mismatch for logmel_extractor.melW: copying a param with shape torch.Size([257, 64]) from checkpoint, the shape in current model is torch.Size([129, 64]).
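
A small sketch of what these shapes suggest, based only on the error message: the checkpoint's STFT kernels have window_size // 2 + 1 = 257 frequency bins (window_size 512, the documented 16 kHz setting), while the freshly built model has 129 bins (window_size 256), so the 16 kHz arguments apparently did not reach the model constructor. This is an interpretation, not a confirmed root cause.

for window_size in (512, 256):
    freq_bins = window_size // 2 + 1
    print('window_size={} -> stft weight shape {}'.format(
        window_size, (freq_bins, 1, window_size)))
# window_size=512 -> (257, 1, 512): shape stored in the Cnn14_16k checkpoint
# window_size=256 -> (129, 1, 256): shape of the model built at inference time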

How can I download the dataset?

Hi, when I run runme.sh, I get many errors like this: sh: 1: youtube-dl: not found

Could you tell me another way to download this large dataset?

Thank you!
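
A possible workaround sketch (not part of this repo): call yt-dlp, a maintained fork of youtube-dl with the same basic interface, from Python. Trimming each clip to its labelled 10-second segment would still need a separate ffmpeg step.

import subprocess

def download_audio(youtube_id, out_dir='downloads'):
    # Hypothetical helper using yt-dlp's -x (extract audio), --audio-format and
    # -o (output template) options; assumes yt-dlp and ffmpeg are installed.
    url = 'https://www.youtube.com/watch?v={}'.format(youtube_id)
    cmd = ['yt-dlp', '-x', '--audio-format', 'wav',
           '-o', '{}/%(id)s.%(ext)s'.format(out_dir), url]
    subprocess.run(cmd, check=True)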

Transfer learning for a few classes

Hey, thanks for the great work.

I want to fine-tune your pre-trained models on fewer than 527 classes.
Can you please guide me?

I have run finetune_template.

GPU number: 1
Load pretrained model successfully!
Process finished with exit code 0

That's the only output.

I also tried to train from scratch with just 2 classes, but I got several errors because of indexing. I just followed runme.sh for training from scratch.

Thx
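
For reference, a minimal transfer-learning sketch, assuming the repo's pytorch/models.py is importable and a 32 kHz Cnn14 checkpoint. It is not the repo's finetune_template, just the same general idea of keeping the pretrained trunk and swapping the 527-way head for a smaller one.

import torch
import torch.nn as nn
from models import Cnn14   # assumes the repo's pytorch/ directory is on sys.path

class TransferCnn14(nn.Module):
    """Hypothetical wrapper: reuse the pretrained Cnn14 trunk, replace the head."""
    def __init__(self, checkpoint_path, classes_num_new, freeze_base=False):
        super().__init__()
        # Settings of the released 32 kHz checkpoints.
        self.base = Cnn14(sample_rate=32000, window_size=1024, hop_size=320,
                          mel_bins=64, fmin=50, fmax=14000, classes_num=527)
        checkpoint = torch.load(checkpoint_path, map_location='cpu')
        self.base.load_state_dict(checkpoint['model'])
        if freeze_base:
            for p in self.base.parameters():
                p.requires_grad = False
        # New head on top of the 2048-d clip embedding.
        self.fc_transfer = nn.Linear(2048, classes_num_new)

    def forward(self, waveform):
        embedding = self.base(waveform)['embedding']   # (batch, 2048)
        return self.fc_transfer(embedding)             # raw logits for the new classes

Training then proceeds with a sigmoid + BCE objective for multi-label targets, or a softmax + cross-entropy objective for single-label targets.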

Last dropout is disconnected from fc_audioset layer

It appears that, in all CNN models, the last dropout, i.e., embedding = F.dropout(x, p=0.5, training=self.training), is actually disconnected from the output linear layer, i.e., self.fc_audioset(x).
Indeed, the forward method of these models reads:

x = F.relu_(self.fc1(x))
embedding = F.dropout(x, p=0.5, training=self.training)
clipwise_output = torch.sigmoid(self.fc_audioset(x))

By reading the arXiv paper, it seems that the last dropout should have instead connected the 2048-embedding layer to the 527-output layer. Indeed, the paper reads:

"Dropout [38] is applied after each downsampling operation and fully connected layers to prevent systems from overfitting."

Therefore, I expected to see the following:

x = F.relu_(self.fc1(x))
embedding = F.dropout(x, p=0.5, training=self.training)
clipwise_output = torch.sigmoid(self.fc_audioset(embedding))

Am I missing something?

Thank you,
Alessandro

batchnorm1d doesn't seem to be used in attention block

In class AttBlock(nn.Module) the __init__ has

self.bn_att = nn.BatchNorm1d(n_out)

but the forward doesn't seem to be using it.

Also, temperature variable does not seem to be used.

Can these be removed without affecting the learning?

Assertion error and low MAP on bal/eval set

  1. I am getting an assertion error while running your script to create the hdf5 files. It occurs in the float32_to_int16() conversion. Here is a simplified version:
def float32_to_int16(x):
    assert np.max(np.abs(x)) <= 1.
    return (x * 32767.).astype(np.int16)

aud, sr = librosa.core.load(wav_files[0], sr=32000, mono=True)
aud = float32_to_int16(aud)

print (np.max(np.abs(aud)))
>>> 1.0048816

Some of my audio files are out of range. If I comment out the assertion then everything works. Will it be correct to remove the assertion?

  2. I am also getting low mAP scores on the balanced set and evaluation set when using your trained models.
The ResNet38 
bal set :: 0.52
eval set :: 0.37

CNN10 
bal set :: 0.48
eval set :: 0.32 

Do you think the above issue has anything to do with it? I mean, I prepared the data by commenting out the assertion.
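
Regarding point 1, a hedged alternative to deleting the assertion is to clip the decoded audio, since samples from lossy sources can slightly exceed [-1, 1], as in the 1.0048816 peak above.

import numpy as np

def float32_to_int16(x):
    # Clip instead of asserting: slight overshoot beyond [-1, 1] is mapped to
    # the int16 extremes rather than aborting the packing script.
    x = np.clip(x, -1.0, 1.0)
    return (x * 32767.).astype(np.int16)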

Why transpose(1, 3) before BatchNorm?

May I ask why you do transpose(1, 3) before the BN? Is it intended to do batch normalization for each frequency bin, and if so, what is the advantage of this? Thanks.

x = x.transpose(1, 3)
x = self.bn0(x)
x = x.transpose(1, 3)
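
A small sketch of what this pattern does, assuming the (batch, 1, time_steps, mel_bins) layout used in these models and bn0 = nn.BatchNorm2d(mel_bins): the transpose moves the mel-bin axis into the channel position, so the batch norm learns one scale/shift pair per frequency bin.

import torch
import torch.nn as nn

x = torch.randn(4, 1, 1001, 64)    # (batch, 1, time_steps, mel_bins)
bn0 = nn.BatchNorm2d(64)           # one affine pair per mel bin

x = x.transpose(1, 3)              # -> (batch, mel_bins, time_steps, 1)
x = bn0(x)                         # normalize each frequency bin over batch and time
x = x.transpose(1, 3)              # -> back to (batch, 1, time_steps, mel_bins)
print(x.shape)                     # torch.Size([4, 1, 1001, 64])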

Can't download the AudioSet

When using youtube-dl to download AudioSet, it returns an exception:
OSError: ERROR: Unable to download webpage: HTTP Error 429: Too Many Requests (caused by <HTTPError 429: 'Too Many Requests'>); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

It seems that YouTube has already banned my IP. Do you have any suggestions?

Variable Length Sequences

Hi,
How can I use your CNN14 network with batches of input audio sequences of variable lengths? Also, is there a recommended length for audio input to the pretrained Cnn14_16k_mAP=0.438.pth?
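
One common approach, sketched below as an assumption rather than the repo's prescribed method: pad or truncate every waveform to a fixed number of samples before stacking into a batch (the released checkpoints were trained on 10-second AudioSet clips, so clip_samples = sample_rate * 10 is a natural choice).

import numpy as np

def pad_or_truncate(waveform, clip_samples):
    """Pad with zeros or truncate so every clip has the same length,
    allowing variable-length audio to be stacked into one batch."""
    if len(waveform) >= clip_samples:
        return waveform[:clip_samples]
    return np.concatenate(
        [waveform, np.zeros(clip_samples - len(waveform), dtype=waveform.dtype)])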

Binarizing output values

Hi Qiuqiang,

I would like to know the best way to binarize the linear predicted probabilities such that:

  • 0 : audio label is absent
  • 1: audio label is present

If you have any suggestions for this binarization issue, it would be great to hear them.

And one more question about clipwise_output: as I understood from the paper, the linear probability value for each label shows the presence of that audio label in the input audio, and the probability value doesn't depend on the duration for which the label occurs, whether very short or long. Am I right?

It would be great to get your answers to the above-mentioned questions.

Anar Sultani
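
A minimal thresholding sketch; the threshold value is a free parameter, and per-class thresholds tuned on a validation set (e.g. maximizing per-class F1) are a common refinement rather than something prescribed by the paper.

import numpy as np

def binarize(clipwise_output, threshold=0.3):
    """clipwise_output: (classes_num,) array of probabilities in [0, 1].
    Returns 1 where the label is considered present, 0 otherwise."""
    return (clipwise_output >= threshold).astype(int)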

Pretrained Cnn14 16kHz wrong shape errors

After downloading Cnn14_16k_mAP=0.438.pth and following these instructions:

MODEL_TYPE="Cnn14_16k"
CHECKPOINT_PATH="Cnn14_16k_mAP=0.438.pth"   # Trained by a later version of code, achieves higher mAP than the paper.
CUDA_VISIBLE_DEVICES=0 python3 pytorch/inference.py audio_tagging --sample_rate=16000 --window_size=512 --hop_size=160 --mel_bins=64 --fmin=50 --fmax=8000 --model_type=$MODEL_TYPE --checkpoint_path=$CHECKPOINT_PATH --audio_path='resources/R9_ZSCveAHg_7s.wav' --cuda

I get the following error:

Traceback (most recent call last):
  File "pytorch/inference.py", line 201, in <module>
    audio_tagging(args)
  File "pytorch/inference.py", line 42, in audio_tagging
    model.load_state_dict(checkpoint['model'])
  File "/home/*user*/anaconda3/envs/onseilake/lib/python3.7/site-packages/torch/nn/modules/module.py", line 847, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Cnn14_16k:
	size mismatch for spectrogram_extractor.stft.conv_real.weight: copying a param with shape torch.Size([257, 1, 512]) from checkpoint, the shape in current model is torch.Size([129, 1, 256]).
	size mismatch for spectrogram_extractor.stft.conv_imag.weight: copying a param with shape torch.Size([257, 1, 512]) from checkpoint, the shape in current model is torch.Size([129, 1, 256]).
	size mismatch for logmel_extractor.melW: copying a param with shape torch.Size([257, 64]) from checkpoint, the shape in current model is torch.Size([129, 64]).

Thank you for open sourcing everything!

Change License to Reflect Proper Authors

Hello, I am interested in leveraging the great work you folks have done here. However, the current MIT License appears to be just a copy of the one used for the AngularJS project, and thus doesn't reflect that the copyright holders are the authors of the associated paper: Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, Mark D. Plumbley. Updating this would be greatly appreciated!

Is it fully compatible with mixed precision?

Hello,

Thank you for sharing the weights and experiments from your papers; it is very good work and very helpful.

I am experimenting with your Wavegram_Logmel_Cnn14 model on a custom dataset, and I have seen an issue when using mixed precision in PyTorch 1.6 with the LogmelFilterBank layer. I sometimes get NaN values in the forward output of this layer, which later produce NaN values in the loss.
I was wondering if you have an idea why? I do not have this issue when I am not using mixed precision.
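
A possible workaround sketch, assuming training under torch.cuda.amp: keep the waveform-to-log-mel front end in float32 by locally disabling autocast, since taking the log of a half-precision spectrogram can produce -inf/NaN, and let only the CNN trunk run in mixed precision. This is a guess at the cause, not a confirmed fix.

import torch

def extract_features_fp32(spectrogram_extractor, logmel_extractor, waveform):
    # Hypothetical helper: run the spectrogram/log-mel layers outside autocast.
    with torch.cuda.amp.autocast(enabled=False):
        x = spectrogram_extractor(waveform.float())
        x = logmel_extractor(x)
    return x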

ERROR - code is too big

When I run panns-reference with CPU, it shows "ERROR - code is too big".

Is panns-reference only available on GPU? Why does this error occur when using the CPU?

numba.decorators ModuleNotFoundError

Hi, I got this error when trying to run pytorch/inference.py.
I installed the required packages by running pip install -r requirements.txt

Below is the traceback:

Traceback (most recent call last):
  File "pytorch/inference.py", line 6, in <module>
    import librosa
  File "/usr/local/lib/python3.7/dist-packages/librosa/__init__.py", line 12, in <module>
    from . import core
  File "/usr/local/lib/python3.7/dist-packages/librosa/core/__init__.py", line 109, in <module>
    from .time_frequency import *  # pylint: disable=wildcard-import
  File "/usr/local/lib/python3.7/dist-packages/librosa/core/time_frequency.py", line 10, in <module>
    from ..util.exceptions import ParameterError
  File "/usr/local/lib/python3.7/dist-packages/librosa/util/__init__.py", line 71, in <module>
    from . import decorators
  File "/usr/local/lib/python3.7/dist-packages/librosa/util/decorators.py", line 9, in <module>
    from numba.decorators import jit as optional_jit
ModuleNotFoundError: No module named 'numba.decorators'

Convert the sound event detection predictions into CSV (pandas format)

def plot_sound_event_detection_result(framewise_output):
    """Visualization of sound event detection result.

    Args:
      framewise_output: (time_steps, classes_num)
    """
    out_fig_path = 'results/sed_result.png'
    os.makedirs(os.path.dirname(out_fig_path), exist_ok=True)

    classwise_output = np.max(framewise_output, axis=0)  # (classes_num,)

    idxes = np.argsort(classwise_output)[::-1]
    idxes = idxes[0:5]

    ix_to_lb = {i: label for i, label in enumerate(labels)}
    lines = []
    for idx in idxes:
        line, = plt.plot(framewise_output[:, idx], label=ix_to_lb[idx])
        lines.append(line)

    plt.legend(handles=lines)
    plt.xlabel('Frames')
    plt.ylabel('Probability')
    plt.ylim(0, 1.)
    plt.savefig(out_fig_path)
    print('Save fig to {}'.format(out_fig_path))

How can I convert this into pandas format (timestamp, class_name), i.e., which classes are predicted at which particular time?
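
A minimal sketch of such a conversion, assuming framewise_output with shape (time_steps, classes_num), the labels list from utils/config.py, and 100 frames per second (sample_rate 32000 / hop_size 320); the threshold is an arbitrary choice.

import pandas as pd

def framewise_to_dataframe(framewise_output, labels, frames_per_second=100,
                           threshold=0.3):
    # Long-format table: one row per (time, class) whose probability passes
    # the threshold.
    time_steps, classes_num = framewise_output.shape
    rows = []
    for t in range(time_steps):
        for k in range(classes_num):
            prob = framewise_output[t, k]
            if prob >= threshold:
                rows.append({'timestamp': t / frames_per_second,
                             'class_name': labels[k],
                             'probability': float(prob)})
    return pd.DataFrame(rows, columns=['timestamp', 'class_name', 'probability'])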

Could i change the input size of wav files?

Hi there.
I have a question about the input size of wav files.
So, I'm doing some work on a transfer learning task based on your pretrained model.
In config.py, you set

sample_rate = 32000
clip_samples = sample_rate * 10     # Audio clips are 10-second

I'm wondering: could I change these two numbers?
If I change them, does it mean I can't use your pretrained model for the next steps?

Confusion about Finetune

Hi bro,
When I use the model for fine-tuning, the training task is classifying gun types. I tried changing the learning rate and the number of epochs, and the results were bad. Then I used a simple VGG16 structure, and it achieved good results. Could you please help clear up my confusion? Many thanks!

Feeding long audio data vs second-by-second or smaller chunks

Dear authors,

Thanks for the great work!

I would like to ask whether there is any potential difference between feeding in audio that is typically 20-90 seconds long versus slicing it into chunks or running second-by-second predictions. I fed the CNN14 model audio that is typically 20-90 seconds long, and after getting the linear predicted probabilities I checked feature importance; it was almost 0 for all the audio labels.
After binarizing them with threshold=0.3, it was clear that support was extremely low for 525/527 labels (all except Speech and Music).

Now I am thinking that feeding the model second-by-second audio may increase the accuracy, because with second-by-second data each instance has a chance to be monophonic, which may lead to better results.

I would like to know your opinion about the above-mentioned thoughts if possible.

Best Regards
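
For experimenting with this, a chunking sketch: model_predict is a placeholder for whatever returns a (classes_num,) probability vector for one clip, chunk_seconds=10 matches the clip length the models were trained on, and max-pooling over chunks is just one possible aggregation.

import numpy as np

def predict_long_audio(waveform, sample_rate, model_predict, chunk_seconds=10):
    chunk_samples = sample_rate * chunk_seconds
    probs = []
    for start in range(0, max(len(waveform), 1), chunk_samples):
        chunk = waveform[start:start + chunk_samples]
        if len(chunk) < chunk_samples:   # zero-pad the last chunk
            chunk = np.pad(chunk, (0, chunk_samples - len(chunk)))
        probs.append(model_predict(chunk))
    return np.max(np.stack(probs), axis=0)   # (classes_num,) pooled over chunks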

What's the input size of the CNN?

Hello,

I tried to print the input size at each layer, taking the Cnn14 model code as an example:

  1. Use librosa.load to load the audio wav: [1, 32000]
  2. spectrogram_extractor: [1, 1, 1001, 513]
  3. logmel_extractor: [1, 1, 1001, 64]

I have three questions:

  1. Different audio files have different lengths; for example, some audio may be [1, 32000] while others may be [1, 294198], so they have different sizes after the spectrogram_extractor. How can you input tensors of different sizes into the CNN? Or do you reshape them to the same size?
  2. How do you input a (1001, 64) tensor (where width and height differ) into the CNN?
  3. I tested your model, and the accuracy is really high. I tried extracting audio features using MFCCs and training AudioSet on a VGGNet, but the accuracy is about 50%. So how did you improve your model's accuracy?

Looking forward to your reply. Thank you.
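
On question 1, a small sketch of how a CNN can accept spectrograms with a variable time axis, assuming the Cnn14-style pooling before the fully connected layers (mean over mel bins, then max+mean over time), which yields a fixed-size vector regardless of clip length.

import torch
import torch.nn as nn

conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)   # stand-in for the conv blocks

for time_steps in (1001, 9194):                    # e.g. ~10 s vs ~92 s of audio
    x = torch.randn(1, 1, time_steps, 64)          # (batch, 1, time, mel_bins)
    h = conv(x)                                    # time axis stays variable
    h = torch.mean(h, dim=3)                       # average over mel bins
    h = torch.max(h, dim=2).values + torch.mean(h, dim=2)   # pool over time
    print(h.shape)                                 # always torch.Size([1, 8])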

class_labels_indices.csv is missing

Hey guys!

Thanks for sharing the code, but when running the inference, this error pops up:

Traceback (most recent call last):
  File "pytorch/inference_template.py", line 26, in <module>
    import config
  File "/Users/admin/Desktop/audioset_tagging_cnn/pytorch/../utils/config.py", line 8, in <module>
    with open('metadata/class_labels_indices.csv', 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'metadata/class_labels_indices.csv'

Could you please add that file?

Thx a lot,
Max

The procedure

Hi, I was confused about the procedure of the experiment even though I had looked through the README.md. Could you list the steps of the experiment? Thanks a lot.

Other Pretrained Models

Really amazing stuff here.
Can you provide other pretrained models too, such as MobileNets, for audio tagging?
Thank you

Get embedding not classification

Is there an implementation of this anywhere that can be used to output embeddings of audio using any of the pretrained models, rather than classifications, so we could use these embeddings to train our own classifiers (e.g., random forests)? Similar to how you can easily get a 128-dimensional embedding using VGGish.
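
A minimal sketch of pulling the 2048-dimensional clip embedding that the models already return, assuming the repo's pytorch/models.py is importable and a 32 kHz Cnn14 checkpoint; the checkpoint and audio paths below are illustrative.

import librosa
import torch
from models import Cnn14   # assumes the repo's pytorch/ directory is on sys.path

model = Cnn14(sample_rate=32000, window_size=1024, hop_size=320,
              mel_bins=64, fmin=50, fmax=14000, classes_num=527)
checkpoint = torch.load('Cnn14_mAP=0.431.pth', map_location='cpu')
model.load_state_dict(checkpoint['model'])
model.eval()

waveform, _ = librosa.load('examples/R9_ZSCveAHg_7s.wav', sr=32000, mono=True)
with torch.no_grad():
    output_dict = model(torch.from_numpy(waveform).unsqueeze(0))
embedding = output_dict['embedding']   # shape (1, 2048)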

Literature pointers for better understanding the `Cnn14_DecisionLevelAtt` model

Hello, Thanks for the awesome repo.

I am new to the audio & SED domain. I have been using your architecture for one of the recent Kaggle competitions and getting decent results. Therefore, I would like to better understand the details of Cnn14_DecisionLevelAtt.

I have read the PANNs paper, but it mostly focuses on the CNN feature extractor part. I am interested in understanding why things are done the way they are for the Cnn14_DecisionLevelAtt model (basically everything besides the CNN feature extractor). Can you point me to some write-ups that explain this?

Thanks
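
Not a write-up, but a stripped-down sketch of the decision-level attention idea behind such models: each frame gets a class-wise probability and a learned attention weight, and the clipwise output is the attention-weighted sum over frames. Details of the actual AttBlock (clamping, initialization, the configurable nonlinearity) are omitted.

import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, in_features, classes_num):
        super().__init__()
        self.att = nn.Conv1d(in_features, classes_num, kernel_size=1)
        self.cla = nn.Conv1d(in_features, classes_num, kernel_size=1)

    def forward(self, x):                              # x: (batch, in_features, time)
        norm_att = torch.softmax(self.att(x), dim=-1)  # attention weights over time
        cla = torch.sigmoid(self.cla(x))               # framewise class probabilities
        clipwise = torch.sum(norm_att * cla, dim=-1)   # (batch, classes_num)
        return clipwise, cla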

Cannot reproduce the MobileNetV1 audio tagging result from the PANNs paper; are there any tricks when training?

Hello, thanks for providing the source code and training data.
I downloaded the AudioSet data from the Baidu network disk you provided and trained the MobileNetV1 model from scratch, following the steps you mentioned in "Train PANNs from scratch". But the problem is, I cannot reproduce the training result you provided (MobileNetV1_mAP=0.389.pth).
When my training reaches iteration 234000, the loss is still 1.1358, and the validate bal mAP is 0.005 and the validate test mAP is 0.005. It seems that the two mAPs never change and the model does not converge.
Would you please give me some guidance? Are there any tricks when training the model?

Looking forward to your reply. Thank you!
