cvondrick / soundnet
SoundNet: Learning Sound Representations from Unlabeled Video. NIPS 2016
Home Page: http://projects.csail.mit.edu/soundnet/
License: MIT License
I am trying to run the original extract_predictions code to get the category classification.
Env: Torch7, CUDA 10.0, cuDNN 7501
I ran into the following problem:
Linux command: list=/home/kzhang3256/soundnet/forSoundNet/data.txt th extract_predictions.lua
Results:
{
force : 0
write : 0
model : "models/soundnet8_final.t7"
list : "/home/kzhang3256/soundnet/forSoundNet/data.txt"
}
Loading network: models/soundnet8_final.t7
Network:
nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> output]
(1): cudnn.SpatialConvolution(1 -> 16, 1x64, 1,2, 0,32)
(2): nn.SpatialBatchNormalization (4D) (16)
(3): cudnn.ReLU
(4): cudnn.SpatialMaxPooling(1x8, 1,8)
(5): cudnn.SpatialConvolution(16 -> 32, 1x32, 1,2, 0,16)
(6): nn.SpatialBatchNormalization (4D) (32)
(7): cudnn.ReLU
(8): cudnn.SpatialMaxPooling(1x8, 1,8)
(9): cudnn.SpatialConvolution(32 -> 64, 1x16, 1,2, 0,8)
(10): nn.SpatialBatchNormalization (4D) (64)
(11): cudnn.ReLU
(12): cudnn.SpatialConvolution(64 -> 128, 1x8, 1,2, 0,4)
(13): nn.SpatialBatchNormalization (4D) (128)
(14): cudnn.ReLU
(15): cudnn.SpatialConvolution(128 -> 256, 1x4, 1,2, 0,2)
(16): nn.SpatialBatchNormalization (4D) (256)
(17): cudnn.ReLU
(18): cudnn.SpatialMaxPooling(1x4, 1,4)
(19): cudnn.SpatialConvolution(256 -> 512, 1x4, 1,2, 0,2)
(20): nn.SpatialBatchNormalization (4D) (512)
(21): cudnn.ReLU
(22): cudnn.SpatialConvolution(512 -> 1024, 1x4, 1,2, 0,2)
(23): nn.SpatialBatchNormalization (4D) (1024)
(24): cudnn.ReLU
(25): nn.ConcatTable {
input
|-> (1): cudnn.SpatialConvolution(1024 -> 1000, 1x8, 1,2)
|-> (2): cudnn.SpatialConvolution(1024 -> 401, 1x8, 1,2)
... -> output
}
(26): nn.MapTable {
cudnn.SpatialSoftMax
}
}
/home/kzhang3256/torch/install/bin/luajit: .../kzhang3256/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
/home/kzhang3256/torch/install/share/lua/5.1/cudnn/init.lua:58: Error in CuDNN: CUDNN_STATUS_NOT_SUPPORTED (cudnnSetFilterNdDescriptor)
stack traceback:
[C]: in function 'error'
/home/kzhang3256/torch/install/share/lua/5.1/cudnn/init.lua:58: in function 'errcheck'
...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:45: in function 'resetWeightDescriptors'
...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:358: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:357>
[C]: in function 'xpcall'
.../kzhang3256/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
...kzhang3256/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
extract_predictions.lua:74: in main chunk
[C]: in function 'dofile'
...3256/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x555d0ebb7610
WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
.../kzhang3256/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
...kzhang3256/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
extract_predictions.lua:74: in main chunk
[C]: in function 'dofile'
...3256/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x555d0ebb7610
Does anyone know what this error means: "Error in CuDNN: CUDNN_STATUS_NOT_SUPPORTED (cudnnSetFilterNdDescriptor)"?
Thanks!
Dear all,
I have downloaded the model, but when I run demo.lua to load it, I get this error:
/home/parallels/torch/install/bin/luajit: /home/parallels/torch/install/share/lua/5.1/torch/File.lua:343: unknown Torch class <torch.CudaTensor>
stack traceback:
[C]: in function 'error'
/home/parallels/torch/install/share/lua/5.1/torch/File.lua:343: in function 'readObject'
/home/parallels/torch/install/share/lua/5.1/torch/File.lua:369: in function 'readObject'
/home/parallels/torch/install/share/lua/5.1/nn/Module.lua:192: in function 'read'
/home/parallels/torch/install/share/lua/5.1/torch/File.lua:351: in function 'readObject'
/home/parallels/torch/install/share/lua/5.1/torch/File.lua:409: in function 'load'
demo.lua:15: in main chunk
[C]: in function 'dofile'
...lels/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50
I think this is because the model was trained on a GPU, but I must use it in a CPU-only environment. How can I convert the model to CPU, or could you provide a CPU model?
Thanks very much.
Hi, on the SoundNet website I can see the MP3s are available to download, but I do not see their corresponding class probabilities.
Are they not available yet? Otherwise could you point me where I can find them?
Thanks
Right now I am simply working inside a virtual machine, so I have no GPU. I just want to use the pretrained model to get some results. Can I simply use the CPU instead of the GPU? If so, what should I do?
Hi, thanks for releasing this code.
I want to evaluate SoundNet with another dataset.
I have created text files for training and testing that contain a column for the full path to the WAV files and another column for their class. All the audio files are the same length.
I have modified the eval_dcase.lua script to read these text files and to expect the duration of my files.
When I run it, I get the following error:
/home/jdieza15/torch/install/bin/luajit: /home/jdieza15/torch/install/share/lua/5.1/nn/Container.lua:67:
In 25 module of nn.Sequential:
In 1 module of nn.ConcatTable:
/home/jdieza15/torch/install/share/lua/5.1/cudnn/init.lua:162: Error in CuDNN: CUDNN_STATUS_BAD_PARAM (cudnnGetConvolutionNdForwardOutputDim)
stack traceback:
[C]: in function 'error'
/home/jdieza15/torch/install/share/lua/5.1/cudnn/init.lua:162: in function 'errcheck'
...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:140: in function 'createIODescriptors'
...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:188: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:186>
[C]: in function 'xpcall'
/home/jdieza15/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
.../jdieza15/torch/install/share/lua/5.1/nn/ConcatTable.lua:11: in function <.../jdieza15/torch/install/share/lua/5.1/nn/ConcatTable.lua:9>
[C]: in function 'xpcall'
/home/jdieza15/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
...e/jdieza15/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
eval_urbansound8k.lua:64: in function 'read_dataset'
eval_urbansound8k.lua:115: in main chunk
[C]: in function 'dofile'
...za15/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50
WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
/home/jdieza15/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
...e/jdieza15/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
eval_urbansound8k.lua:64: in function 'read_dataset'
eval_urbansound8k.lua:115: in main chunk
[C]: in function 'dofile'
...za15/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50
This is the code in line 64:
net:forward(snd:view(1,1,-1,1):cuda())
I have never used Lua before, so I do not know how to interpret this error. Is it related to my CuDNN installation, or am I doing something wrong when running the code?
Thanks
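Since the failing call is cudnnGetConvolutionNdForwardOutputDim, one plausible culprit (an assumption, not a diagnosis from the repo) is that an input clip is too short: once the temporal dimension shrinks below a layer's kernel size, the computed output size becomes non-positive and cuDNN rejects it with CUDNN_STATUS_BAD_PARAM. A quick way to check is to trace the time axis through the layer shapes shown in the network printout in the first issue above:

```python
# Rough sanity check (not code from the repo): compute the temporal output
# size of each SoundNet-8 layer for a given input length, using the
# (kernel, stride, pad) values read off the printed network. A non-positive
# size at any layer is exactly what makes cuDNN fail when computing the
# forward output dimensions.

LAYERS = [
    ("conv1", 64, 2, 32), ("pool1", 8, 8, 0),
    ("conv2", 32, 2, 16), ("pool2", 8, 8, 0),
    ("conv3", 16, 2, 8),
    ("conv4", 8, 2, 4),
    ("conv5", 4, 2, 2), ("pool5", 4, 4, 0),
    ("conv6", 4, 2, 2),
    ("conv7", 4, 2, 2),
    ("conv8", 8, 2, 0),   # the final 1x8 conv in the ConcatTable head
]

def output_lengths(n_samples):
    """Trace the temporal dimension through every layer."""
    sizes = {}
    n = n_samples
    for name, k, s, p in LAYERS:
        n = (n + 2 * p - k) // s + 1   # standard conv/pool output formula
        sizes[name] = n
    return sizes

if __name__ == "__main__":
    # 1476864 samples is the clip length mentioned in another issue below;
    # the conv7 feature should come out at 46 time steps.
    print(output_lengths(1476864))
```

If conv8 (or any earlier layer) comes out zero or negative for one of your files, that file is too short for the network as configured.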
Hi, thanks for your nice paper. I have a question: in the paper you say the number of frames per video is variable. How do you fuse the CNN outputs from different frames so that the final output has a constant length? Do you just compute the average, or something else? Thank you very much.
Sorry to disturb you. When I use the pretrained SoundNet for speech emotion recognition, I have some questions. Could you please give me a hand? Thanks.
Question 1:
wav, sr = torchaudio.load(path) reads the audio samples; they are then preprocessed with wav.unsqueeze(1).unsqueeze(-1).repeat(1,1,8,1).
What are the requirements on the audio sample rate? Must it be 22050? Are there other restrictions?
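For what it's worth, here is a minimal preprocessing sketch based on my reading of the original repo; the 22050 Hz sample rate and the [-256, 256] value scaling are assumptions to verify against the repo's data loader, and linear_resample is a naive illustrative helper, not a production resampler:

```python
# Assumed preprocessing (verify against the repo): SoundNet appears to expect
# mono audio at 22050 Hz with sample values scaled to roughly [-256, 256]
# rather than the usual [-1, 1] float range.

def linear_resample(wav, sr_in, sr_out):
    """Naive linear-interpolation resampling; fine for a sanity check,
    but use a proper resampler (librosa/torchaudio) for real work."""
    if sr_in == sr_out or len(wav) < 2:
        return list(wav)
    n_out = int(round(len(wav) * sr_out / sr_in))
    out = []
    for i in range(n_out):
        pos = i * (len(wav) - 1) / (n_out - 1)   # fractional source index
        lo = int(pos)
        hi = min(lo + 1, len(wav) - 1)
        frac = pos - lo
        out.append(wav[lo] * (1 - frac) + wav[hi] * frac)
    return out

def preprocess(wav, sr):
    """Resample to 22050 Hz and rescale [-1, 1] floats to [-256, 256]."""
    wav = linear_resample(wav, sr, 22050)
    return [x * 256.0 for x in wav]
```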
Question 2:
The last layer is nn.Conv2d(1024, 401, kernel_size=(8, 1), stride=(2, 1)), used to extract speech features.
The feature size varies with the length of the audio; what exactly does it depend on? I want to use the features for audio classification. How do I get a constant-dimension feature vector for all of my audio files?
As you mentioned elsewhere, an audio file with 1476864 samples produces a feature of dimension [1x1024x46x1], while a file with 2199168 samples produces [1x1024x68x1]. In [1x1024x46x1], 1 is the batch and 1024 the output channels; what does the 46 represent, and what does the last dimension 1 represent?
Question 3:
How do I get a constant-dimension feature vector for both files? Finally, when I try to classify, what do I need to do with the output features of shape (1, 401, feature, 1) so that I can use them in the final classification task? Which flattening method is better: (batch, channel_out * feature), averaging over the channels, or some other method?
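One common recipe (my suggestion, not something the authors prescribe) is to average the [1, C, T, 1] feature over the variable time axis T, which yields a C-dimensional vector regardless of clip length; max pooling over T is a popular alternative:

```python
# Global average pooling over the time axis: collapse [1, C, T, 1] into a
# fixed C-dimensional vector, independent of the audio length T.

def global_average_pool(feat):
    """feat: nested list shaped [1][C][T][1] -> list of C floats."""
    channels = feat[0]
    return [sum(t[0] for t in ch) / len(ch) for ch in channels]

# Tiny example: batch 1, C=2 channels, T=3 time steps.
feat = [[[[1.0], [2.0], [3.0]],
         [[4.0], [5.0], [6.0]]]]
vec = global_average_pool(feat)   # one value per channel: [2.0, 5.0]
```

Because the average is over T, two clips of different lengths both map to a vector of length C, which can then feed any fixed-input classifier.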
PS
I am new to audio and DL, so sorry for asking basic questions.
Thanks
best
I simply ran demo.lua and it ended with:
lua: cannot open <models/soundnet8_final.lua> in mode r at /home/yangshuo/torch/pkg/torch/lib/TH/THDiskFile.c:670
I searched https://projects.csail.mit.edu/soundnet/ and the git repo, but I still could not find soundnet8_final.lua.
Where can I find it?
@cvondrick @yusufaytar
Thank you very much for sharing this code.
I am new to audio. I was trying to extract features from my audio files. The feature size varies depending on the length of the audio; what does it depend on? I want to use the features for classification. How do I get a constant-dimension feature vector for all of my audio files?
For example, an audio file with 1476864 samples produces a feature of dimension [1x1024x46x1], while a file with 2199168 samples produces [1x1024x68x1]. How do I get a constant-dimension feature vector for both files?
How do I have to modify the sound signal to apply the net in a sliding-window fashion along the temporal direction?
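A sliding-window approach can be sketched like this (sliding_windows is a hypothetical helper, not part of the repo): cut the waveform into fixed-length, possibly overlapping chunks, run the network on each chunk, then aggregate the per-chunk features, for example by averaging them:

```python
# Split a waveform into fixed-length windows with a given hop size; each
# window can then be fed to the network independently.

def sliding_windows(wav, win_len, hop):
    """Return fixed-length chunks of the signal; a final partial chunk
    shorter than win_len is dropped."""
    return [wav[i:i + win_len]
            for i in range(0, len(wav) - win_len + 1, hop)]

windows = sliding_windows(list(range(10)), win_len=4, hop=3)
# windows[0] is [0, 1, 2, 3], windows[1] is [3, 4, 5, 6], ...
```

With hop < win_len the windows overlap; with hop == win_len they tile the signal without overlap. Fixed win_len guarantees every window yields a feature of the same shape.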
Hello, thank you very much for sharing the code and the dataset.
I am new to audio-visual learning and want to reproduce the results.
While reading the code, I found that I do not have the file "label_text_file" used in main_train.lua.
If I want to train the model myself, I need the raw MP3s (359 GB) and the image features (88 GB), but I don't know how to get the label_text_file.
If anyone knows, or has re-trained the model themselves, please give me some advice or share your experience.
Thank you very much.
Hi, I'd like to ask for some tips that would be generally applicable to video/image deep learning; I have only worked on music-related projects so far. Some (if not all) of my questions might seem dumb ;)
Thanks!
I have downloaded the training data from the demo website (https://projects.csail.mit.edu/soundnet/), and was trying to run the main_train.lua script. But I always get an out-of-memory error at the following line:
optim.adam(fx, parameters, optimState)
The same thing happens even if I run main_train_small.lua.
I am using 120 GB of CPU memory and 4.7 GB of GPU memory. Do I need more?
Hello @cvondrick ,
It seems that audio_simple, the dataset variable defined in main_finetune.lua, is not valid for data.lua to load the input audios. It shows the following error:
/home/yclin/distro/install/bin/luajit: /home/yclin/Workspace/soundnet/data/data.lua:24: Unknown dataset: audio_simple
I also tried replacing audio_simple with donkey_audio and donkey_audio_labeled, but neither of them works.
Would you please have a look at the finetune section of the README?
Hi, thanks for making this great implementation!
I tried to extract features from a sound using the pretrained models and got results like:
sky: 43.56%
stage, indoor: 5.46%
amusement park: 5.24%
spotlight: 16.74%
fountain: 12.33%
traffic light: 5.76%
I want to get the category labels, but I don't understand how to convert them from the HDF5 format.
Could you please explain step by step how to get the category labels?
Any help would be appreciated.
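In case it helps, the ranking step itself is straightforward once you have the probability vector and a list of category names (the labels below are illustrative only; the repo's categories file and the exact HDF5 dataset path are assumptions to check, e.g. reading the vector with h5py):

```python
# Map a probability vector back to label strings and return the top-k.
# The probability vector itself would come from the exported HDF5 file
# (e.g. via h5py -- dataset path is an assumption, verify it yourself).

def top_k_labels(probs, labels, k=3):
    """Pair each probability with its label and return the k largest."""
    ranked = sorted(zip(labels, probs), key=lambda p: p[1], reverse=True)
    return ranked[:k]

labels = ["sky", "stage, indoor", "amusement park"]   # illustrative only
probs = [0.4356, 0.0546, 0.0524]
print(top_k_labels(probs, labels, k=2))
```

The category names ship separately from the features, so the only requirement is that the label list is in the same order as the output dimensions.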
Hi guys, to be honest I found this repo very frustrating to work with. For anyone else searching, the dataset link is here: https://projects.csail.mit.edu/soundnet/
munender@cseproj149:~/code_space/soundnet$ list=data.txt th extract_feat.lua
/users/gpu/munender/src/torch/install/bin/lua: ...ender/src/torch/install/share/lua/5.1/trepl/init.lua:389: ...unender/src/torch/install/share/lua/5.1/hdf5/ffi.lua:56: expected align(#) on line 687
stack traceback:
[C]: in function 'error'
...ender/src/torch/install/share/lua/5.1/trepl/init.lua:389: in function 'require'
extract_feat.lua:4: in main chunk
[C]: in function 'dofile'
.../torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: ?
What are the layer number inputs for other CNN layers?
Hi, when I run torch.load('soundnet8_final.t7') with Python 3.6 and PyTorch 0.4.0, I got this error.
Do you know what's going on?
Thank you~
Hi,
I used the trained model to extract features from MP3s, but all of the 1000-dimensional features are negative. Is this normal?
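Negative values are not necessarily a bug: if the features were extracted before the final cudnn.SpatialSoftMax (module 26 in the network printout in the first issue above), they are raw scores and can be negative. Applying a softmax turns them into non-negative probabilities; a minimal sketch:

```python
import math

# Numerically stable softmax: raw scores (which may all be negative) become
# probabilities in (0, 1) that sum to 1.

def softmax(scores):
    m = max(scores)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([-3.0, -1.0, -2.0])
# every entry is positive and the entries sum to 1,
# even though every input score was negative
```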
Dear @cvondrick ,
Thank you very much for making this code and the dataset publicly available.
I had a strange experience unzipping the frames folder: it has been 2 days and it is still extracting frames.
I simply used tar -xvzf frames_public.tar.gz
Have you had a similar experience, or is the problem on my side?
Thank you very much.
Hi, I couldn't find which pre-trained network you used for the Places CNN, either in the paper or on the website. Where does it come from?