bill9800 / speech_separation

Include some core functions and model to handle speech separation

License: MIT License

Python 100.00%

speech_separation's Issues

About test loss

When I run AV_MODER_V2, I get an error: "Unknown loss function:loss_func". I can't understand it.

Hi there,
I'm a bit confused. Your diagram shows that you get 75 × 1 × 1 × 1792 face embeddings from FaceNet, but the original paper uses 1 × 1 × 1024 face embeddings, taken from the FaceNet layer called "avg pool". In the code (pretrain_load_test.py) it looks like you are also using the "avg pool" layer. So why 1792?
Greetings :)

About the mkdir function in AVHandler.py

I think the function does not need the "cd" command. Correct me if I am wrong.
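
One way to avoid the "cd" entirely is to build the target path in Python and create it directly. The sketch below is not the code from AVHandler.py: ensure_dir is a hypothetical helper, and it assumes the function only needs to create dir_name inside loc.

```python
import os


def ensure_dir(dir_name, loc=""):
    """Create dir_name inside loc (or the current directory) and return its path.

    Hypothetical helper for illustration; not the repository's mkdir.
    """
    path = os.path.join(loc, dir_name) if loc else dir_name
    # exist_ok=True makes the call safe to repeat on every run.
    os.makedirs(path, exist_ok=True)
    return path
```

Because no shell is involved, the same call behaves the same way on Linux, macOS and Windows.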

Operands error

@bill9800 I'm getting the following error while evaluating the audio-visual model:
Traceback (most recent call last):
  File "AV_model_eval.py", line 92, in <module>
    T = utils.fast_istft(F,power=False)
  File "/home/lenovo/Downloads/speech_separation/lib/utils.py", line 75, in fast_istft
    data = istft(real_imag_shrink(data))
  File "/home/lenovo/Downloads/speech_separation/lib/utils.py", line 30, in istft
    Total[start:end] = Total[start:end] + data[i, :] * windows
ValueError: operands could not be broadcast together with shapes (257,) (512,)

How to solve this?
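
The two shapes in the error are consistent with a 512-point FFT: a one-sided spectrum has 512 / 2 + 1 = 257 bins, while the synthesis window has 512 samples. So data[i, :] appears to still be a 257-bin spectrum frame, not a 512-sample time frame, at the point where it is multiplied by the window. The sketch below is not the code in lib/utils.py. It is a minimal, unnormalized overlap-add inverse that converts each frame back to 512 samples first, and fft_size=512, hop=160 and the Hann window are assumptions.

```python
import numpy as np


def overlap_add_istft(spec, fft_size=512, hop=160):
    """Invert a one-sided complex STFT of shape (frames, fft_size // 2 + 1).

    Illustrative sketch only: no window normalization is applied.
    """
    bins = fft_size // 2 + 1
    if spec.shape[1] != bins:
        raise ValueError("expected %d bins per frame, got %d" % (bins, spec.shape[1]))
    window = np.hanning(fft_size)
    # irfft turns each 257-bin frame into a 512-sample time-domain frame.
    frames = np.fft.irfft(spec, n=fft_size, axis=1)
    total = np.zeros(hop * (spec.shape[0] - 1) + fft_size)
    for i in range(frames.shape[0]):
        start = i * hop
        total[start:start + fft_size] += frames[i] * window
    return total
```

If the input reaching istft has a different layout (for example real and imaginary parts stacked on a last axis), it would need to be combined into a complex array of shape (frames, 257) before this step.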

Question about pre-trained AV Model

Thanks for sharing the code. Because of our limited hardware, I have been working on this model for a long time, but I still don't get good results. I would like to compare my results with yours. Would you mind sharing a pre-trained model? It would help me a lot in continuing my work.
Thanks a lot!

Problem while evaluating the audio_only model

@bill9800 Sir, when I evaluate the audio_only model, the pred folder is filled with generated .wav files, but all of them are silent: there is no voice in any of them. Why is that?

How would you infer on a downloaded video?

I have trained my own model and have been able to use the model_v2/AV_model_eval.py to test against testing data.

Have you been able to infer on a video that the system has not seen before?

What steps are needed to prepare a new video for inference?

Evaluation or Pre-Trained AV Model?

Hey @bill9800, thanks for sharing this repo. We've been working on some versions of this model since the AVSpeech dataset was published. Do you have any evaluation metrics, or do you mind sharing a pre-trained v2 model? We've been playing with a much narrower model trained on embeddings with 128 features, and it would be great to understand the differences a bit.

Thanks again!

Could you share the pretrained model? I don't have a powerful GPU to train this model. Thank you!

No improvement of model after 5 epochs

Trained using:

2 GPUS
batch size=2
number audio/video files=50k
epochs=5

I infer on a video that the model has not seen.

The output contains 2 audio files that sound very similar to each other. I cannot distinguish voices between each audio file.

Has anyone got a working model that seems to be doing better than mine?

Indexing issue in model/lib/model_AV_new.py

I have created a pull request addressing the indexing issue (#17). At run time Keras adds another dimension at axis 0, which becomes the batch axis. Hence the sliced() function slices through the second-to-last dimension, i.e. the embedding, and not people_num.
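
The effect can be reproduced with plain NumPy arrays. In this sketch the shapes (75, 1, 1792, people_num) are assumptions based on the face-embedding sizes mentioned in these issues, and the two slice functions are illustrative stand-ins, not the repository's sliced().

```python
import numpy as np

people_num = 2
emb_dim = 1792  # assumed embedding width; any value shows the same effect

single = np.zeros((75, 1, emb_dim, people_num))  # one sample, no batch axis
batch = np.zeros((4,) + single.shape)            # what the layer sees at run time


def slice_person_no_batch(x, index):
    # Written as if the tensor had no batch axis.
    return x[:, :, :, index]


def slice_person(x, index):
    # Accounts for the batch axis that Keras prepends at axis 0.
    return x[:, :, :, :, index]


print(slice_person_no_batch(batch, 0).shape)  # indexes the embedding axis
print(slice_person(batch, 0).shape)           # indexes the people axis
```

On the batched array the first function removes the embedding axis and keeps people_num, which is the behaviour the pull request describes.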

Question about the loss function

The cost function is defined as follows in your code:
def audio_discriminate_loss2(gamma=0.1,beta = 2*0.1,num_speaker=2):
    def loss_func(S_true,S_pred,gamma=gamma,beta=beta,num_speaker=num_speaker):
        sum_mtr = K.zeros_like(S_true[:,:,:,:,0])
        for i in range(num_speaker):
            sum_mtr += K.square(S_true[:,:,:,:,i]-S_pred[:,:,:,:,i])
            for j in range(num_speaker):
                if i != j:
                    sum_mtr -= gamma*(K.square(S_true[:,:,:,:,i]-S_pred[:,:,:,:,j]))
        for i in range(num_speaker):
            for j in range(i+1,num_speaker):
                #sum_mtr -= beta*K.square(S_pred[:,:,:,i]-S_pred[:,:,:,j])
                #sum_mtr += beta*K.square(S_true[:,:,:,:,i]-S_true[:,:,:,:,j])
                pass
        #sum = K.sum(K.maximum(K.flatten(sum_mtr),0))
        loss = K.mean(K.flatten(sum_mtr))
        return loss
    return loss_func
However, I do not understand the meaning of this part:
            for j in range(num_speaker):
                if i != j:
                    sum_mtr -= gamma*(K.square(S_true[:,:,:,:,i]-S_pred[:,:,:,:,j]))
I guess you want to use the permutation invariant training (PIT) loss, but PIT is not defined like that. What is the meaning of this part?
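
Read literally, the quoted code is not a permutation invariant loss: it always pairs S_true[..., i] with S_pred[..., i], and the gamma term subtracts a fraction of the squared error between each true source and the other speaker's prediction, so predictions that stay far from the wrong speaker are rewarded. Whether that was the author's intent is not stated here. The plain-Python sketch below contrasts the two ideas on toy vectors. The names mse, discriminative_loss and pit_loss are illustrative, and lists of floats stand in for the Keras tensors.

```python
from itertools import permutations


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def discriminative_loss(true, pred, gamma=0.1):
    """Fixed pairing i -> i, minus gamma times the error against the other speakers."""
    n = len(true)
    total = 0.0
    for i in range(n):
        total += mse(true[i], pred[i])
        for j in range(n):
            if i != j:
                total -= gamma * mse(true[i], pred[j])
    return total


def pit_loss(true, pred):
    """Permutation invariant loss: best total error over all speaker orderings."""
    n = len(true)
    return min(
        sum(mse(true[i], pred[p[i]]) for i in range(n))
        for p in permutations(range(n))
    )


true = [[1.0, 0.0], [0.0, 1.0]]
swapped = [true[1], true[0]]  # perfect separation, speakers in the other order
print(discriminative_loss(true, swapped), pit_loss(true, swapped))
```

With the outputs swapped, the PIT loss is zero because some ordering matches exactly, while the fixed-pairing loss treats the swap as an error.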

Not actually an issue, but does it work?

videos downloader

No function m_link in avhandler.py

trained

Question about the training loss

@bill9800 Hi, it's really amazing work. Thanks for sharing the code. However, I have a question about the training loss. I trained for 9 epochs (dataset of about 30000 videos, batch size = 2).
My training loss started at about 0.45. After 9 epochs it is about 0.18 and no longer decreases. Is that normal? What did your training loss look like?
I am looking forward to your reply!

Problem when continuing training from the pretrained model

@bill9800 It is really amazing work! Thanks a lot for sharing the code. However, I run into a problem when I try to continue training from the pretrained model, as follows:
Traceback (most recent call last):
  File "/home/yyh/pycharm-2018.3.4/helpers/pydev/pydevd.py", line 1741, in <module>
    main()
  File "/home/yyh/pycharm-2018.3.4/helpers/pydev/pydevd.py", line 1735, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/home/yyh/pycharm-2018.3.4/helpers/pydev/pydevd.py", line 1135, in run
    pydev_imports.execfile(file, globals, locals) # execute the script
  File "/home/yyh/pycharm-2018.3.4/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/yyh/xym/speech_separation-master/model/model_v2/AV_train.py", line 74, in <module>
    AV_model = load_model(latest_file,custom_objects={"tf": tf})
  File "/home/yyh/anaconda3/envs/xym/lib/python3.6/site-packages/keras/engine/saving.py", line 289, in load_model
    sample_weight_mode=sample_weight_mode)
  File "/home/yyh/anaconda3/envs/xym/lib/python3.6/site-packages/keras/engine/training.py", line 139, in compile
    loss_function = losses.get(loss)
  File "/home/yyh/anaconda3/envs/xym/lib/python3.6/site-packages/keras/losses.py", line 133, in get
    return deserialize(identifier)
  File "/home/yyh/anaconda3/envs/xym/lib/python3.6/site-packages/keras/losses.py", line 114, in deserialize
    printable_module_name='loss function')
  File "/home/yyh/anaconda3/envs/xym/lib/python3.6/site-packages/keras/utils/generic_utils.py", line 165, in deserialize_keras_object
    ':' + function_name)
ValueError: Unknown loss function:loss_func

My environment is tensorflow-gpu==1.8.0, keras==2.2.2, python=3.6.
Could you give me some advice?
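
Keras records a custom loss in the saved model by its function name, and load_model can only rebuild it if that name appears in custom_objects. The traceback shows that only "tf" was passed, so the name loss_func cannot be resolved. The sketch below assumes the model was compiled with the closure returned by audio_discriminate_loss2 (an assumption, since the training script is not shown). That factory is replaced by a stand-in so the snippet runs without Keras, and build_custom_objects and load_for_training are hypothetical helpers.

```python
def audio_discriminate_loss2(*args, **kwargs):
    # Stand-in for the repository's loss factory; import the real one in practice.
    def loss_func(S_true, S_pred):
        raise NotImplementedError("stand-in only")
    return loss_func


def build_custom_objects(tf_module=None):
    """Map the name Keras saved ("loss_func") to a callable with that name."""
    loss = audio_discriminate_loss2()
    objects = {loss.__name__: loss}
    if tf_module is not None:
        objects["tf"] = tf_module
    return objects


def load_for_training(path):
    # Imported lazily so the helpers above can be inspected without Keras installed.
    import tensorflow as tf
    from keras.models import load_model
    return load_model(path, custom_objects=build_custom_objects(tf))
```

If the model is only needed for prediction, passing compile=False to load_model skips rebuilding the loss altogether.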

The filename, directory name, or volume label syntax is incorrect. The system cannot find the path specified.

I am trying to download the datasets using audio_downloader.py, but I get an error in the download function of AVHandler.py: "The filename, directory name, or volume label syntax is incorrect. The system cannot find the path specified." Any help would be appreciated.
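
That message is printed by the Windows command interpreter, not by Python, which suggests that a shell command string (for example one that chains "cd" with another command in Unix style) is being handed to cmd.exe. This is a guess, since the command built in AVHandler.py is not shown here. A portable sketch that passes the working directory through cwd and the command as a list instead. The helper run_in_dir is hypothetical.

```python
import os
import subprocess
import sys


def run_in_dir(args, directory):
    """Run a command (given as a list) inside directory, creating it if needed."""
    os.makedirs(directory, exist_ok=True)
    completed = subprocess.run(
        args,
        cwd=directory,            # replaces "cd <dir>" in a shell string
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return completed.returncode, completed.stdout


if __name__ == "__main__":
    # Example command only: print the directory the child process runs in.
    code, out = run_in_dir(
        [sys.executable, "-c", "import os; print(os.getcwd())"], "audio_train"
    )
    print(code, out.strip())
```

Passing a list avoids shell-specific separators and quoting, so the same call works under cmd.exe and under a Unix shell.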

Hello, I want to know why my mix data's shape is [411,257,2] and not [298,257,2], please

The audio produced by the STFT in AO_model does not have shape 298*257*2. Why is the number in the first dimension different?
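
The first dimension of an STFT is the number of frames, which grows with the number of audio samples, while 257 depends only on the FFT size. The arithmetic below is a rough sketch assuming a 512-point FFT and a hop of 160 samples (both assumptions). The function samples_for_frames is a hypothetical helper that estimates how many samples a given frame count covers.

```python
def samples_for_frames(frames, fft_size=512, hop=160):
    """Approximate number of audio samples spanned by `frames` STFT frames."""
    return hop * (frames - 1) + fft_size


for frames in (298, 411):
    n = samples_for_frames(frames)
    # Duration in seconds at two common sample rates.
    print(frames, n, round(n / 16000, 2), round(n / 22050, 2))
```

Under these assumptions 298 frames is about 3 s at 16 kHz, while 411 frames is about 4.1 s at 16 kHz or about 3 s at 22,050 Hz. One plausible cause, offered only as a guess, is that the audio was not resampled to the expected rate or not trimmed to the expected length before the STFT.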

Cannot load the pretrained model

Thanks for your efforts. I have an issue and a couple of questions.
I've followed the steps to preprocess the data and then downloaded your pretrained h5 model from Google Drive (https://drive.google.com/file/d/1GfTtnisfnRluUf-V1FQzCWe8_BG5tNYI/view?usp=drivesdk). After that I tried to run the evaluation script of model_v2 (using Python 2.7 and 3.5), but the code produces a segmentation fault when calling the load_model function.
Do you have any suggestions?
What is your python version?
What are your Keras and tensorflow versions?
How many gigabytes of VRAM are required to load the model at evaluation time?
How long does it take to process (i.e. feed forward) a single 3-second segment on your GPU?
Do you have any sample outputs (wav files that have been generated by your pretrained model) to share with us?

Thank you very much :D

Pre-trained_model_v1() not working?

Hello, I have a question.
While evaluating the audio_only model with the AOmodel-2p-001-0.00000.h5 file, the pred folder is filled with generated .wav files, but all of them are silent: there is no voice in them. Why is that?

IndexError in AV_model_eval.py at parse_X_data()

Hi there,
When I try to predict, I get this error:

line 41 in speech_separation/model/model_v2/AV_model_eval.py:

face_embs[1, :, :, :, i] = np.load(face_path + single_idxs[i] + "_face_emb.npy")

IndexError: index 1 is out of bounds for axis 0 with size 1

face_embs has shape (1,75,1,1972,2)
i can be 0 or 1 in my case
np.load(face_path + single_idxs[i] + "_face_emb.npy") has shape (75,1,1972)

What's wrong here? Do we need to change line 41 from face_embs[1, :, :, :, i] to face_embs[0, :, :, :, i]?

Greetings:)
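
The error itself is what NumPy raises for any axis of size 1: the only valid index on that axis is 0 (or -1). With the shapes given in the issue, writing to index 0 fits. Whether that matches the author's intent for line 41 is for the maintainer to confirm. A small reproduction, in which emb_dim is a placeholder width and np.full stands in for the np.load call:

```python
import numpy as np

emb_dim = 8      # placeholder; the real embeddings are much wider
people_num = 2
face_embs = np.zeros((1, 75, 1, emb_dim, people_num))

for i in range(people_num):
    single = np.full((75, 1, emb_dim), i + 1.0)  # stands in for np.load(...)
    face_embs[0, :, :, :, i] = single            # index 0 is the only batch slot


def first_axis_ok(index):
    """Return True if `index` is valid on axis 0 of face_embs."""
    try:
        face_embs[index, :, :, :, 0]
        return True
    except IndexError:
        return False


print(first_axis_ok(0), first_axis_ok(1))
```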

trained

Problem while trying to run AV_train.py file

@bill9800 Thanks for doing this amazing project and sharing the code. While running it, I hit a problem when training the model (/model_v2/AV_train.py), as follows:

Epoch 1/100
Traceback (most recent call last):
  File "AV_train.py", line 107, in <module>
    initial_epoch=initial_epoch
  File "/home/avicky/env/lib/python3.7/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/avicky/env/lib/python3.7/site-packages/keras/engine/training.py", line 1732, in fit_generator
    initial_epoch=initial_epoch)
  File "/home/avicky/env/lib/python3.7/site-packages/keras/engine/training_generator.py", line 185, in fit_generator
    generator_output = next(output_generator)
  File "/home/avicky/env/lib/python3.7/site-packages/keras/utils/data_utils.py", line 625, in get
    six.reraise(*sys.exc_info())
  File "/home/avicky/env/lib/python3.7/site-packages/six.py", line 696, in reraise
    raise value
  File "/home/avicky/env/lib/python3.7/site-packages/keras/utils/data_utils.py", line 610, in get
    inputs = future.get(timeout=30)
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/avicky/env/lib/python3.7/site-packages/keras/utils/data_utils.py", line 406, in get_index
    return _SHARED_SEQUENCES[uid][i]
  File "../lib/MyGenerator.py", line 84, in __getitem__
    [X1, X2], y = self.__data_generation(filename_temp)
  File "../lib/MyGenerator.py", line 106, in __data_generation
    y[i, :, :, :, j] = np.load(self.database_dir_path+'audio/AV_model_database/crm/' + info[j + 1])
  File "/home/avicky/env/lib/python3.7/site-packages/numpy/lib/npyio.py", line 428, in load
    fid = open(os_fspath(file), "rb")
FileNotFoundError: [Errno 2] No such file or directory: '../../data/audio/AV_model_database/crm/mix_face_emb.npy'

I ran the code on a small amount of data to check for errors. I have attached photos of the data files generated for training. Could you please check whether they were generated correctly, and help me solve the problem above so that I can continue? Thank you!
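
The missing path ends in mix_face_emb.npy inside the crm folder, which suggests that the entry the generator read from the dataset listing is not the file name it expected at that position. A small pre-flight check can surface such entries before training starts. The sketch assumes one whitespace-separated record per line, with cRM file names in the columns after the first (inferred from info[j + 1] in the traceback, not confirmed). The function missing_crm_files is a hypothetical helper.

```python
import os


def missing_crm_files(lines, crm_dir, people_num=2):
    """Return (line_number, file_name) pairs whose cRM file does not exist."""
    missing = []
    for line_no, line in enumerate(lines, start=1):
        info = line.strip().split()
        if not info:
            continue  # skip blank lines
        # Columns 1..people_num are assumed to hold the cRM file names.
        for name in info[1:1 + people_num]:
            if not os.path.isfile(os.path.join(crm_dir, name)):
                missing.append((line_no, name))
    return missing
```

Running it over the training listing and printing the first few results shows whether the listing or the generated files are out of step.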