
emotion2vec's Issues

Info about checkpoint file

Hi @ddlBoJack,

Could you please share some information about the checkpoint file linked in the README? Is it the best-performing model so far?

Also, does the train.py file provided for IEMOCAP use frame-level or utterance-level features?

Thanks,

KeyError: 'text' when inferring with iic/emotion2vec_plus_large model in FunASR

  1. Description:

I encountered an issue while performing inference using the iic/emotion2vec_plus_large model with FunASR. Here's the traceback of the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/lianqi/anaconda3/envs/funasr/lib/python3.11/site-packages/funasr/auto/auto_model.py", line 253, in generate
    model = self.model if model is None else model
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lianqi/anaconda3/envs/funasr/lib/python3.11/site-packages/funasr/auto/auto_model.py", line 471, in inference_with_vad
    )
      
KeyError: 'text'
  2. Code Used:
from funasr import AutoModel
import librosa
import soundfile as sf
model_emotion = AutoModel(model="iic/emotion2vec_plus_base", model_revision="master",
                          vad_model="fsmn-vad", vad_model_revision="v2.0.4",
                          max_single_segment_time=19000,
                          )

y, sr = librosa.load(wav_file)  # wav_file: path to the input audio file
y_16k = librosa.resample(y, orig_sr=sr, target_sr=16000)
sf.write("./temp.wav", y_16k, 16000, subtype='PCM_24')
res_emotion = model_emotion.generate("./temp.wav", output_dir="./outputs", granularity="utterance", extract_embedding=True)
print(res_emotion)
  3. Complete Console Information:
>>> model_emotion = AutoModel(model="iic/emotion2vec_plus_base", model_revision="master",
...                           vad_model="fsmn-vad", vad_model_revision="v2.0.4",
...                           max_single_segment_time=1000,
...                           )
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.blocks.0.0.weight, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.blocks.0.0.bias, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.blocks.1.0.weight, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.blocks.1.0.bias, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.blocks.2.0.weight, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.blocks.2.0.bias, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.blocks.3.0.weight, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.blocks.3.0.bias, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.proj.weight, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.proj.bias, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
2024-07-02 17:45:00,793 - modelscope - INFO - Use user-specified model revision: v2.0.4
>>> 
>>> res_emotion = model_emotion.generate("./temp.wav", output_dir="./outputs", granularity="utterance", extract_embedding=True)
rtf_avg: 2.022: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:34<00:00, 34.72s/it]
rtf_avg: 2.878:   0%|▍                                                                                                                           | 1/261 [00:01<06:36,  1.53s/it]
rtf_avg: 1.028:   1%|▋                                                                                                                           | 1/191 [00:01<03:44,  1.18s/it]
rtf_avg: 0.613:   1%|▊                                                                                                                           | 1/154 [00:00<02:28,  1.03it/s]
rtf_avg: 0.423:   1%|▉                                                                                                                           | 1/131 [00:00<01:47,  1.21it/s]
rtf_avg: 0.317:   1%|█                                                                                                                           | 1/113 [00:00<01:22,  1.37it/s]
rtf_avg: 0.246:   1%|█▏                                                                                                                          | 1/102 [00:00<01:06,  1.53it/s]
rtf_avg: 0.209:   1%|█▎                                                                                                                           | 1/94 [00:00<00:57,  1.61it/s]
rtf_avg: 0.183:   1%|█▍                                                                                                                           | 1/84 [00:00<00:49,  1.69it/s]
rtf_avg: 0.159:   1%|█▋                                                                                                                           | 1/75 [00:00<00:42,  1.75it/s]
rtf_avg: 0.138:   1%|█▊                                                                                                                           | 1/69 [00:00<00:37,  1.80it/s]
rtf_avg: 0.115:   2%|██                                                                                                                           | 1/62 [00:00<00:31,  1.96it/s]
rtf_avg: 0.104:   2%|██▏                                                                                                                          | 1/56 [00:00<00:28,  1.95it/s]
rtf_avg: 0.090:   2%|██▍                                                                                                                          | 1/51 [00:00<00:24,  2.04it/s]
rtf_avg: 0.080:   2%|██▋                                                                                                                          | 1/47 [00:00<00:21,  2.09it/s]
rtf_avg: 0.075:   2%|██▊                                                                                                                          | 1/44 [00:00<00:20,  2.05it/s]
rtf_avg: 0.068:   2%|███▏                                                                                                                         | 1/40 [00:00<00:18,  2.12it/s]
rtf_avg: 0.063:   3%|███▍                                                                                                                         | 1/36 [00:00<00:16,  2.10it/s]
rtf_avg: 0.058:   3%|███▊                                                                                                                         | 1/33 [00:00<00:15,  2.07it/s]
rtf_avg: 0.050:   3%|████▎                                                                                                                        | 1/29 [00:00<00:13,  2.12it/s]
rtf_avg: 0.045:   4%|████▊                                                                                                                        | 1/26 [00:00<00:11,  2.09it/s]
rtf_avg: 0.040:   4%|█████▍                                                                                                                       | 1/23 [00:00<00:10,  2.08it/s]
rtf_avg: 0.036:   5%|██████▎                                                                                                                      | 1/20 [00:00<00:09,  2.05it/s]
rtf_avg: 0.034:   6%|███████▎                                                                                                                     | 1/17 [00:00<00:08,  1.92it/s]
rtf_avg: 0.031:   7%|████████▎                                                                                                                    | 1/15 [00:00<00:07,  1.80it/s]
rtf_avg: 0.025:  10%|████████████▌                                                                                                                | 1/10 [00:00<00:05,  1.80it/s]
rtf_avg: 0.023:  12%|███████████████▊                                                                                                              | 1/8 [00:00<00:05,  1.32it/s]
  0%|                                                                                                                                                      | 0/1 [01:13<?, ?it/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/lianqi/anaconda3/envs/funasr/lib/python3.11/site-packages/funasr/auto/auto_model.py", line 253, in generate
    model = self.model if model is None else model
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lianqi/anaconda3/envs/funasr/lib/python3.11/site-packages/funasr/auto/auto_model.py", line 471, in inference_with_vad
    )
      
KeyError: 'text'
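A possible workaround, offered only as a sketch under assumptions: the error appears to come from FunASR's VAD post-processing path, which expects ASR-style 'text' output, so running the VAD model separately and feeding each speech segment to the emotion model directly avoids that code path. Variable names and the raw-array input are illustrative; if your FunASR version only accepts file paths, write each segment to a temporary wav first.

import librosa
from funasr import AutoModel

# Separate models: fsmn-vad for segmentation, emotion2vec for classification.
vad_model = AutoModel(model="fsmn-vad", model_revision="v2.0.4")
emo_model = AutoModel(model="iic/emotion2vec_plus_base", model_revision="master")

y, _ = librosa.load("./temp.wav", sr=16000)

# fsmn-vad returns segments as [[start_ms, end_ms], ...]
segments = vad_model.generate(input="./temp.wav")[0]["value"]

results = []
for start_ms, end_ms in segments:
    chunk = y[int(start_ms * 16):int(end_ms * 16)]  # 16 samples per ms at 16 kHz
    results.append(emo_model.generate(chunk, granularity="utterance",
                                      extract_embedding=False))
print(results)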

Emotion2Vec Pretraining code

Thank you for your contribution; your work is truly amazing. However, I would like to pretrain emotion2vec myself. Could you provide the pretraining source code or offer any suggestions?

OOM while processing IEMOCAP dataset

I was trying to create the IEMOCAP embeddings on my own, but my GPU with 8 GB of memory ran out of CUDA memory. How much GPU memory do I need to process this?
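A minimal sketch of one way to keep memory bounded, assuming the FunASR AutoModel interface and 16 kHz input (the file name is illustrative): extract features one file at a time and move each result off the GPU immediately.

import torch
from funasr import AutoModel

model = AutoModel(model="iic/emotion2vec_base")  # or a locally downloaded checkpoint

all_feats = {}
with torch.no_grad():
    for wav in ["Ses01F_impro01_F000.wav"]:  # hypothetical IEMOCAP file names
        res = model.generate(wav, granularity="frame", extract_embedding=True)
        all_feats[wav] = torch.as_tensor(res[0]["feats"]).cpu()  # keep features on CPU
        torch.cuda.empty_cache()  # release cached GPU blocks between files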

The WeChat group QR code has expired again

Actually, I have a use case: I need to slice long audio and compute emotion classification probabilities per slice, for example one result every 5 seconds. However, the current pipeline API is wrapped too rigidly to support this; it only computes a single globally averaged result. It would be great if the pipeline interface accepted an additional slice-length argument, so that the returned probability vector gained an extra time dimension.
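A rough sketch of one way to approximate this today, without changes to the pipeline (the 5-second window, the file name, the raw-array input, and the 'scores' field are assumptions; if your pipeline version only accepts file paths, write each window to a temporary wav first):

import librosa
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_pipeline = pipeline(task=Tasks.emotion_recognition,
                              model="iic/emotion2vec_plus_large")

y, sr = librosa.load("long_audio.wav", sr=16000)
window = 5 * sr  # 5-second slices

scores_over_time = []
for i in range(0, len(y), window):
    rec = inference_pipeline(y[i:i + window], granularity="utterance",
                             extract_embedding=False)
    scores_over_time.append(rec[0]["scores"])  # assumed per-class probability vector per slice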

About feature layer

Thank you for sharing your nice work!

In the script emotion2vec_extract_features.sh, I noticed that features are extracted from the last layer.
Have you tried extracting features from other layers as well?
I'm just curious if this approach is based on empirical insight.

Batching model inference

Hi,

Thank you for the great work you've done on this model! Is there any way to batch the model using funasr? I've been trying to batch with padding and set the padding_mask to mask out the unused frames, but I'm not getting the same results as when I run inference sequentially.

Here's a sample of the code I'm using. I've tried a number of different argument configurations: there are several mask parameters, and it seems that mask refers to the masked-prediction pretraining scheme, while padding_mask is the attention mask? I'm not sure, though, because there's no documentation. Any guidance would be appreciated.

import torch
from torch.nn.utils.rnn import pad_sequence

from funasr import AutoModel
from funasr.utils.load_utils import load_audio_text_image_video

model = AutoModel(model="iic/emotion2vec_plus_large").model
model.eval()
model.to("cuda")

padding_value = -1

# Audios is a list of audio tensors resampled to 16kHz
x = load_audio_text_image_video(audios)
x = [torch.nn.functional.layer_norm(x_, x_.shape).squeeze() for x_ in x]
masked_x = pad_sequence(x, batch_first=True, padding_value=padding_value)
mask = masked_x == padding_value

out = model.extract_features(masked_x, mask=False, padding_mask=mask, remove_extra_tokens=True)
out_mask = out["padding_mask"]
feats = out["x"]

feats[out_mask] = 0
print(feats.sum(dim=1) / (~out_mask).sum(dim=1).unsqueeze(-1))
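For comparison, a sequential reference run (a sketch under the same assumptions as the snippet above, reusing its model and the list x of layer-normed tensors) can help show where the batched and unbatched paths diverge, since each utterance is processed alone and no padding is involved:

import torch

seq_feats = []
with torch.no_grad():
    for x_ in x:
        out = model.extract_features(x_.unsqueeze(0).to("cuda"),
                                     padding_mask=None,
                                     mask=False,
                                     remove_extra_tokens=True)
        # Mean-pool frame features for one utterance at a time.
        seq_feats.append(out["x"].squeeze(0).mean(dim=0))
print(torch.stack(seq_feats))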

What is Emo-262?

What is the Emo-262 dataset? Did your group collect it, and will it be made available to the public? How can I get it?

Side note: the dataset name LSSED is misspelled as LSED in the Table 2 caption; you may want to fix this in the paper.

Finetuning

Could you please share the script used to train the network on the upstream task? I want to fine-tune the model.

Thanks!

Inference

Thank you for providing the code!
I am a novice in the field of SER. I have trained the downstream model using the provided train.npy, train.lengths, and train.emo files, but I'm unsure how to use the resulting model to infer categories for the features in train.npy.
I noticed that the provided train.npy has shape (1253877, 768). As I understand it, this represents 1,253,877 samples with 768-dimensional features each. I would like to classify these samples using the trained model. How can I achieve this?
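A minimal sketch of one way to do this, assuming train.npy holds frame-level features and train.lengths gives the number of frames per utterance (the 4-class head and the checkpoint path are hypothetical, not the repository's exact API): group frames by utterance, mean-pool each group, and feed the pooled vector to the trained classifier.

import numpy as np
import torch

feats = np.load("train.npy")                      # (total_frames, 768) frame-level features
lengths = np.loadtxt("train.lengths", dtype=int)  # assumed: one frame count per utterance

clf = torch.nn.Linear(768, 4)                          # hypothetical 4-class linear head
clf.load_state_dict(torch.load("downstream_head.pt"))  # hypothetical trained weights
clf.eval()

preds, offset = [], 0
with torch.no_grad():
    for n in lengths:
        utt = torch.from_numpy(feats[offset:offset + n]).float().mean(dim=0)
        preds.append(clf(utt).argmax().item())
        offset += n
print(preds[:10])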

About platform

I want to know whether emotion2vec can run on an ARM server.

fine-tuning pre train model

Hi, thank you very much for your work.

I want to build some interesting follow-up work on top of yours.
I have not found anything related to fine-tuning this model on ModelScope or GitHub.
Could you please guide me on how to fine-tune and retrain your model?

many thanks

About reproducing data2vec2 results

When loading the data2vec2 model using fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt_path]), an error occurred: KeyError: '_name'. Could you please tell me how to solve this model-loading problem?

Wechat Group application

Hello! One of my recent works used emotion2vec. Could I join the group chat to communicate with you? My WeChat can be reached via my profile picture (a QR code); if you are not busy, you can add me by scanning it. Thank you very much.

Optimal segment length

Hello!

Thank you for such a nice work!

I am performing speaker diarization with pyannote and want to run emotion detection on the audio segments I get from the diarization model. The segments have different lengths, and I'm sure I'll have to do some kind of splitting, because very long segments (around 200 seconds) cause CUDA OOM. I'm wondering, though, what the optimal segment length is for the emotion2vec_plus_large model: 3 seconds, 15 seconds, or something else? (A rough splitting sketch follows below.)

Thank you!
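A rough sketch of capping diarization segments at a fixed maximum length before passing them to emotion2vec (the 15-second window, the file name, and the raw-array input are assumptions to tune, not an official recommendation):

import librosa
from funasr import AutoModel

model = AutoModel(model="iic/emotion2vec_plus_large")

y, sr = librosa.load("meeting.wav", sr=16000)  # hypothetical diarized recording
max_len_s = 15                                 # tunable cap per inference call

def emotions_for_segment(start_s, end_s):
    """Split one diarization segment into windows of at most max_len_s seconds."""
    results, t = [], start_s
    while t < end_s:
        chunk = y[int(t * sr):int(min(t + max_len_s, end_s) * sr)]
        results.append(model.generate(chunk, granularity="utterance",
                                      extract_embedding=False))
        t += max_len_s
    return results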

_MISSING_TYPE

omegaconf.errors.ValidationError: Object of unsupported type: '_MISSING_TYPE'
full_key:
reference_type=None
object_type=None
Is this due to a software package conflict? I can't solve this problem.

WeChat group

Hello, could you please update the WeChat group QR code?

Request for test and dev files

Dear Authors,

You have only shared the train.npy, train.lengths, train.emo in the iemocap_downstream folder.
Would you mind also sharing the test and dev versions of these files? That would make testing your models more convenient.

Thank you in advance.

Best regards,
Aaron

MaxRetryError

When trying to run prediction using the fine-tuned models through ModelScope:

inference_pipeline = pipeline(
    task=Tasks.emotion_recognition,
    model="iic/emotion2vec_plus_large")  # Alternative: iic/emotion2vec_plus_seed, iic/emotion2vec_plus_base, iic/emotion2vec_plus_large and iic/emotion2vec_base_finetuned

rec_result = inference_pipeline('https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav', output_dir="./outputs", granularity="utterance", extract_embedding=False)
print(rec_result)

I run into this error:
The above exception was the direct cause of the following exception:

MaxRetryError                             Traceback (most recent call last)
Cell In[40], line 7
      1 '''
      2 Using the emotion representation model
      3 rec_result only contains {'feats'}
...
--> 515     raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    517 log.debug("Incremented Retry for (url='%s'): %r", url, new_retry)
    519 return new_retry

MaxRetryError: None: Max retries exceeded with url: https://www.modelscope.cn/api/v1/models/iic/emotion2vec_plus_large/repo?Revision=master&FilePath=emotion2vec+data.png (Caused by HTTPError('404 Client Error: Not Found for url: https://www.modelscope.cn/api/v1/models/iic/emotion2vec_plus_large/repo?Revision=master&FilePath=emotion2vec+data.png'))
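A possible workaround, purely as a sketch under assumptions: since the 404 is for a file inside the model repository, downloading the whole repository once and pointing the pipeline at the local copy may sidestep the failing per-file request.

from modelscope import snapshot_download
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Download the full model repository once, then run the pipeline from disk.
local_dir = snapshot_download("iic/emotion2vec_plus_large")
inference_pipeline = pipeline(task=Tasks.emotion_recognition, model=local_dir)

rec_result = inference_pipeline(
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav",
    granularity="utterance", extract_embedding=False)
print(rec_result)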

utterance embedding

How are utterance embeddings obtained? Are they derived from the frame-level features through convolution or pooling?
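For illustration only (an assumption about one common choice, not a confirmation of what emotion2vec does internally), temporal mean pooling of frame-level features would look like this:

import numpy as np

frame_feats = np.random.randn(250, 768)  # (T frames, 768-dim frame-level features)
utt_emb = frame_feats.mean(axis=0)       # simple mean pooling over time -> (768,)
print(utt_emb.shape)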

RAVDESS test result

I tested the provided emotion2vec_large on RAVDESS Speech and RAVDESS Song. The weighted accuracy on RAVDESS Speech is 87%, similar to the result in the paper, but the result on RAVDESS Song is 64%, which is very different from the paper. Is there any difference in how the two datasets are tested? I don't know why this happens.

The performance of utterance-level features is poor.

Hello, I'm new to this field. I'd like to ask why I got a poor result when using the utterance-level features you provided for emotion recognition; the WA was only a little over 60%. I also use only a linear layer as the base model.
I am looking forward to your answer, thank you.

Two key models in finetune without annotated data

Thank you very much for open-sourcing such a good pretrained emotion model.

I saw the following description on ModelScope:
First, emotion2vec is fine-tuned on academic speech emotion recognition datasets; then 150,000 hours of Chinese and English data are labeled, and the data whose text emotion matches the speech emotion with high confidence are selected.
Could you open-source the text emotion model and the speech emotion model trained on the academic datasets? I would like to train a 3-class model based on this approach.

Thanks!
