
Comments (10)

zszheng147 commented on July 21, 2024

Thanks for your question. The IEMOCAP dataset consists of 5531 samples (after removing labels that are not needed), and the train.lengths file indicates the varying lengths of these samples. Each sample comprises a sequence of 768-dimensional feature vectors, where the total number of vectors across all samples is 1253877.
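In other words, train.npy is a single flat (1253877, 768) matrix and train.lengths records how many of those rows belong to each sample. A minimal sketch of how the two files fit together, assuming the concatenated layout described above (adjust the paths to your own):

import numpy as np

# train.npy stacks the frame-level features of every sample; train.lengths has
# one integer per sample giving its number of 768-dim vectors.
feats = np.load('path/to/feats/train.npy')        # shape: (1253877, 768)
lengths = np.loadtxt('path/to/feats/train.lengths', dtype=int)
assert feats.shape[0] == lengths.sum()

# Split the flat matrix back into one (length_i, 768) array per sample.
samples = np.split(feats, np.cumsum(lengths)[:-1], axis=0)
print(len(samples), samples[0].shape)             # 5531 samples; the first is e.g. (97, 768)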

To perform inference on these samples using your pre-trained model, you can

  1. Load your model: model = torch.load(model_path).to(device).
  2. Prepare the test_loader and label_dict as in training.
  3. Run inference: test_wa, test_ua, test_f1 = validate_and_test(model, test_loader, device, num_classes=len(label_dict)).

The code I mentioned is also present in main.py. In fact, once training is completed, you should get a classification result. If you only need the classification label, you can look into the validate_and_test function in misc.py, where you can easily obtain an emotion label.

outputs = model(feats, speech_padding_mask)
_, predicted = torch.max(outputs.data, 1)
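If you want the emotion name rather than the class index, here is a hedged sketch building on the snippet above. It assumes label_dict maps label names to integer indices (which num_classes=len(label_dict) suggests); invert the mapping if yours goes the other way, and feats / speech_padding_mask come from a test batch as above:

import torch

# Invert label_dict (assumed: name -> index) so names can be looked up by index.
idx_to_label = {idx: name for name, idx in label_dict.items()}

model.eval()
with torch.no_grad():
    outputs = model(feats, speech_padding_mask)
    _, predicted = torch.max(outputs.data, 1)

predicted_labels = [idx_to_label[i.item()] for i in predicted]
print(predicted_labels)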


Efr0nd commented on July 21, 2024

Thank you for your reply!
In my understanding, the method you mentioned involves batch classifying the entire test set. I'd like to know how to classify a specific sample or a new sample individually.

For example, if the first line of train.lengths is 97, as I understand it, this means the first sample has 97 feature vectors, each of 768 dimensions. The downstream model expects input of 768 dimensions. How can I input the first sample into the model to get the correct classification?

I've written some simple code: I passed the entire (97, 768) tensor into the model, and it ran successfully. Is this correct?

import numpy as np
import torch

# Take the first sample: the first 97 rows of the flat feature matrix, shape (97, 768).
data = np.load('path/to/feats/train.npy')
sample = data[:97, :]
input_tensor = torch.tensor(sample, dtype=torch.float32).to('cuda')

# Load the trained downstream model and classify the sample.
model = torch.load('path/to/save_dir/20240110_175202/checkpoint.pt')
model.to('cuda')
model.eval()
with torch.no_grad():
    output = model(input_tensor)
probabilities = torch.nn.functional.softmax(output, dim=1)
predicted_class = torch.argmax(probabilities, dim=1)[0].item()
print(f"Predicted class for sample i: {predicted_class}")


Efr0nd commented on July 21, 2024

I encountered an error while executing the command:
bash scripts/emotion2vec_extract_features.sh $fairseq_root $manifest_path $model_path $checkpoint_path $feat_path.
I got emotion2vec_base from https://www.modelscope.cn/damo/emotion2vec_base.git
And my own paths are:

IEMOCAP_ROOT=./path/to/IEMOCAP_full_release
manifest_path=./path/to/manifest
checkpoint_path=./path/to/emotion2vec_base/emotion2vec_base.pt
feat_path=./path/to/feats
fairseq_root=https://github.com/pytorch/fairseq
model_path=./path/to/emotion2vec_base

The error is:

Traceback (most recent call last):
  File "scripts/emotion2vec_speech_features.py", line 123, in <module>
    main()
  File "scripts/emotion2vec_speech_features.py", line 111, in main
    generator, num = get_iterator(args)
  File "scripts/emotion2vec_speech_features.py", line 84, in get_iterator
    reader = Emotion2vecFeatureReader(
  File "scripts/emotion2vec_speech_features.py", line 48, in __init__
    model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([checkpoint])
  File "/root/miniconda3/lib/python3.8/site-packages/fairseq/checkpoint_utils.py", line 436, in load_model_ensemble_and_task
    task = tasks.setup_task(cfg.task)
  File "/root/miniconda3/lib/python3.8/site-packages/fairseq/tasks/__init__.py", line 42, in setup_task
    assert (
AssertionError: Could not infer task type from {'_name': 'emotion2vec_pretraining', 'data': '/mnt/lustre/sjtu/home/zym22/data/emotion_recognition/manifest', 'labels': None, 'multi_corpus_keys': None, 'multi_corpus_sampling_weights': None, 'binarized_dataset': False, 'sample_rate': 16000, 'normalize': True, 'enable_padding': False, 'max_sample_size': 320000, 'min_sample_size': 8000, 'num_batch_buckets': 0, 'tpu': False, 'text_compression_level': 'none', 'rebuild_batches': True, 'precompute_mask_config': {'feature_encoder_spec': '[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] + [(512,2,2)]', 'mask_prob': 0.5, 'mask_prob_adjust': 0.05, 'mask_length': 5, 'inverse_mask': False, 'mask_dropout': 0.0, 'clone_batch': 8, 'expand_adjacent': False, 'non_overlapping': False}, 'post_save_script': None, 'subsample': 1.0, 'seed': 1, 'sort_indices_mutiple_corpora': True, 'batch_sample_multiple_corpora': False}. Available argparse tasks: dict_keys(['multilingual_masked_lm', 'translation', 'multilingual_language_modeling', 'audio_pretraining', 'speech_to_text', 'text_to_speech', 'frm_text_to_speech', 'speech_unit_modeling', 'denoising', 'multilingual_denoising', 'speech_to_speech', 'sentence_prediction', 'cross_lingual_lm', 'sentence_ranking', 'audio_finetuning', 'translation_multi_simple_epoch', 'language_modeling', 'translation_from_pretrained_xlm', 'sentence_prediction_adapters', 'translation_lev', 'masked_lm', 'hubert_pretraining', 'multilingual_translation', 'online_backtranslation', 'simul_speech_to_text', 'simul_text_to_text', 'semisupervised_translation', 'legacy_masked_lm', 'translation_from_pretrained_bart', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt']). Available hydra tasks: dict_keys(['translation', 'multilingual_language_modeling', 'audio_pretraining', 'speech_unit_modeling', 'sentence_prediction', 'audio_finetuning', 'language_modeling', 'translation_from_pretrained_xlm', 'sentence_prediction_adapters', 'translation_lev', 'masked_lm', 'hubert_pretraining', 'simul_text_to_text', 'dummy_lm', 'dummy_masked_lm'])

Is this an issue with the fairseq version? My current version is 0.12.2.


zszheng147 commented on July 21, 2024

Regarding your question about passing the entire (97, 768) tensor of the first sample directly into the downstream model:

Yes, it's correct.
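To pick out an arbitrary sample i rather than hard-coding the first 97 rows, here is a hedged variant that uses train.lengths (paths and the checkpoint name follow your example):

import numpy as np
import torch

# Sample i occupies rows offsets[i]:offsets[i+1] of the flat feature matrix.
lengths = np.loadtxt('path/to/feats/train.lengths', dtype=int)
offsets = np.concatenate(([0], np.cumsum(lengths)))

i = 0                                              # index of the sample to classify
data = np.load('path/to/feats/train.npy')
sample_i = data[offsets[i]:offsets[i + 1], :]      # shape: (lengths[i], 768)

model = torch.load('path/to/save_dir/20240110_175202/checkpoint.pt').to('cuda').eval()
with torch.no_grad():
    output = model(torch.tensor(sample_i, dtype=torch.float32).to('cuda'))
print(output.argmax(dim=1)[0].item())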


zszheng147 commented on July 21, 2024

Regarding the AssertionError ("Could not infer task type from {'_name': 'emotion2vec_pretraining', ...}") raised by scripts/emotion2vec_extract_features.sh, and whether it is a fairseq version issue:

Hi. There's nothing wrong with the fairseq version. Could you please check or pull the latest code? We have provided guidance on how to perform inference in it.
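For context, the AssertionError means the installed fairseq cannot find a task named emotion2vec_pretraining, i.e. the repo's custom task is not registered in your environment. A hedged sketch of how a user-defined fairseq task is typically registered before loading such a checkpoint (the user_dir value below is a placeholder, not a path confirmed by this repo; point it at the directory that defines the task in the code you pulled):

import argparse
from fairseq import checkpoint_utils, utils

# Register the custom task/model definitions before touching the checkpoint.
utils.import_user_module(argparse.Namespace(user_dir='./upstream'))  # placeholder path

model, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ['./path/to/emotion2vec_base/emotion2vec_base.pt']
)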


Efr0nd commented on July 21, 2024

Thank you very much for your response! It has been very helpful to me. I have already resolved the issue of model inference.

Another question is about the train.lengths file. In the file, the first number is 97 and the second is 68. Does this mean that the first audio produced 97 feature vectors of 768 dimensions each, while the second audio produced 68? If so, why does the upstream model extract a different number of feature vectors from different audio files?

When I use the code

from funasr import AutoModel
model = AutoModel(model="./emotion2vec_base")
res = model(input="./emotion2vec_base/example/test.wav", output_dir="./outputs")
print(res)

from https://www.modelscope.cn/damo/emotion2vec_base.git to extract features from my own recorded audio, I found that the npy file in the outputs folder contains only a single 768-dimensional feature vector, not 97 or 68, just one. Can you help me understand why this is happening?


LauraGPT commented on July 21, 2024

Regarding why the npy file written to the outputs folder contains only a single 768-dimensional vector:

You could set the granularity like this:

from funasr import AutoModel

model = AutoModel(model="./emotion2vec_base")
wav_file = f"{model.model_path}/example/test.wav"
res = model.generate(wav_file, output_dir="./outputs", granularity="utterance")
print(res)

But you should re-install funasr by:

git clone -b main --single-branch https://github.com/alibaba-damo-academy/FunASR.git && cd FunASR
pip install -e ./


ddlBoJack commented on July 21, 2024

Hi, you can control this with granularity="utterance" or granularity="frame". Note that granularity="utterance" is simply a temporal pooling of granularity="frame".
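For intuition, a hedged sketch of that relationship, assuming granularity="frame" dumps a (T, 768) matrix to the output npy file (the file name below is an assumption based on the example wav):

import numpy as np

# Frame-level output: one 768-dim vector per frame, shape (T, 768).
frame_feats = np.load('./outputs/test.npy')   # assumed name of the frame-level dump
# The utterance-level vector is (roughly) the temporal mean of the frame-level ones.
utt_feat = frame_feats.mean(axis=0)           # shape: (768,)
print(frame_feats.shape, utt_feat.shape)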


Efr0nd commented on July 21, 2024

Thank you very much for your patient responses! I have resolved the issue.


Efr0nd commented on July 21, 2024

I have an additional question. Are there currently any well-performing downstream models for sentiment classification? The simple downstream models provided by your project already yield good classification results, but I would like to make some improvements to achieve a more comprehensive speech emotion recognition system. Could you provide me with some materials or tips?

