
Comments (10)

zszheng147 commented on July 21, 2024

Thanks for your question. The IEMOCAP dataset consists of 5531 samples (after removing labels that are not needed), and the train.lengths file indicates the varying lengths of these samples. Each sample comprises a sequence of 768-dimensional feature vectors, where the total number of vectors across all samples is 1253877.
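In other words, train.npy is a single flat (1253877, 768) matrix and train.lengths records how many of those rows belong to each sample. A minimal sketch of how the two files fit together, assuming the concatenated layout described above (adjust the paths to your own):

import numpy as np

# train.npy stacks the frame-level features of every sample; train.lengths has
# one integer per sample giving its number of 768-dim vectors.
feats = np.load('path/to/feats/train.npy')        # shape: (1253877, 768)
lengths = np.loadtxt('path/to/feats/train.lengths', dtype=int)
assert feats.shape[0] == lengths.sum()

# Split the flat matrix back into one (length_i, 768) array per sample.
samples = np.split(feats, np.cumsum(lengths)[:-1], axis=0)
print(len(samples), samples[0].shape)             # 5531 samples; the first is e.g. (97, 768)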

To perform inference on these samples using your pre-trained model, you can

  1. Load your model: model = torch.load(model_path).to(device).
  2. Prepare the test_loader and label_dict as in training.
  3. Run inference: test_wa, test_ua, test_f1 = validate_and_test(model, test_loader, device, num_classes=len(label_dict)).

The code I mentioned is also present in main.py. In fact, once training is completed, you should get a classification result. If you only need the classification label, you can look into the validate_and_test function in misc.py, where you can easily obtain an emotion label.

outputs = model(feats, speech_padding_mask)
_, predicted = torch.max(outputs.data, 1)
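If you want the emotion name rather than the class index, here is a hedged sketch building on the snippet above. It assumes label_dict maps label names to integer indices (which num_classes=len(label_dict) suggests); invert the mapping if yours goes the other way, and feats / speech_padding_mask come from a test batch as above:

import torch

# Invert label_dict (assumed: name -> index) so names can be looked up by index.
idx_to_label = {idx: name for name, idx in label_dict.items()}

model.eval()
with torch.no_grad():
    outputs = model(feats, speech_padding_mask)
    _, predicted = torch.max(outputs.data, 1)

predicted_labels = [idx_to_label[i.item()] for i in predicted]
print(predicted_labels)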


Efr0nd commented on July 21, 2024

Thank you for your reply!
In my understanding, the method you mentioned involves batch classifying the entire test set. I'd like to know how to classify a specific sample or a new sample individually.

For example, if the first line of train.lengths is 97, as I understand it, this means the first sample has 97 feature vectors, each of 768 dimensions. The downstream model expects input of 768 dimensions. How can I input the first sample into the model to get the correct classification?

I've written some simple code: I passed the entire (97, 768) tensor into the model, and it ran successfully. Is this correct?

import numpy as np
import torch

# Take the first sample: the first 97 rows of the flat feature matrix, shape (97, 768).
data = np.load('path/to/feats/train.npy')
sample = data[:97, :]
input_tensor = torch.tensor(sample, dtype=torch.float32).to('cuda')

# Load the trained downstream model and classify the sample.
model = torch.load('path/to/save_dir/20240110_175202/checkpoint.pt')
model.to('cuda')
model.eval()
with torch.no_grad():
    output = model(input_tensor)
probabilities = torch.nn.functional.softmax(output, dim=1)
predicted_class = torch.argmax(probabilities, dim=1)[0].item()
print(f"Predicted class for sample i: {predicted_class}")


Efr0nd commented on July 21, 2024

I encountered an error while executing the command:
bash scripts/emotion2vec_extract_features.sh $fairseq_root $manifest_path $model_path $checkpoint_path $feat_path.
I got emotion2vec_base from https://www.modelscope.cn/damo/emotion2vec_base.git
And my own paths are:

IEMOCAP_ROOT=./path/to/IEMOCAP_full_release
manifest_path=./path/to/manifest
checkpoint_path=./path/to/emotion2vec_base/emotion2vec_base.pt
feat_path=./path/to/feats
fairseq_root=https://github.com/pytorch/fairseq
model_path=./path/to/emotion2vec_base

The error is:

Traceback (most recent call last):
  File "scripts/emotion2vec_speech_features.py", line 123, in <module>
    main()
  File "scripts/emotion2vec_speech_features.py", line 111, in main
    generator, num = get_iterator(args)
  File "scripts/emotion2vec_speech_features.py", line 84, in get_iterator
    reader = Emotion2vecFeatureReader(
  File "scripts/emotion2vec_speech_features.py", line 48, in __init__
    model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([checkpoint])
  File "/root/miniconda3/lib/python3.8/site-packages/fairseq/checkpoint_utils.py", line 436, in load_model_ensemble_and_task
    task = tasks.setup_task(cfg.task)
  File "/root/miniconda3/lib/python3.8/site-packages/fairseq/tasks/__init__.py", line 42, in setup_task
    assert (
AssertionError: Could not infer task type from {'_name': 'emotion2vec_pretraining', 'data': '/mnt/lustre/sjtu/home/zym22/data/emotion_recognition/manifest', 'labels': None, 'multi_corpus_keys': None, 'multi_corpus_sampling_weights': None, 'binarized_dataset': False, 'sample_rate': 16000, 'normalize': True, 'enable_padding': False, 'max_sample_size': 320000, 'min_sample_size': 8000, 'num_batch_buckets': 0, 'tpu': False, 'text_compression_level': 'none', 'rebuild_batches': True, 'precompute_mask_config': {'feature_encoder_spec': '[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] + [(512,2,2)]', 'mask_prob': 0.5, 'mask_prob_adjust': 0.05, 'mask_length': 5, 'inverse_mask': False, 'mask_dropout': 0.0, 'clone_batch': 8, 'expand_adjacent': False, 'non_overlapping': False}, 'post_save_script': None, 'subsample': 1.0, 'seed': 1, 'sort_indices_mutiple_corpora': True, 'batch_sample_multiple_corpora': False}. Available argparse tasks: dict_keys(['multilingual_masked_lm', 'translation', 'multilingual_language_modeling', 'audio_pretraining', 'speech_to_text', 'text_to_speech', 'frm_text_to_speech', 'speech_unit_modeling', 'denoising', 'multilingual_denoising', 'speech_to_speech', 'sentence_prediction', 'cross_lingual_lm', 'sentence_ranking', 'audio_finetuning', 'translation_multi_simple_epoch', 'language_modeling', 'translation_from_pretrained_xlm', 'sentence_prediction_adapters', 'translation_lev', 'masked_lm', 'hubert_pretraining', 'multilingual_translation', 'online_backtranslation', 'simul_speech_to_text', 'simul_text_to_text', 'semisupervised_translation', 'legacy_masked_lm', 'translation_from_pretrained_bart', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt']). Available hydra tasks: dict_keys(['translation', 'multilingual_language_modeling', 'audio_pretraining', 'speech_unit_modeling', 'sentence_prediction', 'audio_finetuning', 'language_modeling', 'translation_from_pretrained_xlm', 'sentence_prediction_adapters', 'translation_lev', 'masked_lm', 'hubert_pretraining', 'simul_text_to_text', 'dummy_lm', 'dummy_masked_lm'])

Is this an issue with the fairseq version? My current version is 0.12.2.


zszheng147 commented on July 21, 2024

Regarding your question about passing the entire (97, 768) tensor of the first sample directly into the downstream model:

Yes, it's correct.
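To pick out an arbitrary sample i rather than hard-coding the first 97 rows, here is a hedged variant that uses train.lengths (paths and the checkpoint name follow your example):

import numpy as np
import torch

# Sample i occupies rows offsets[i]:offsets[i+1] of the flat feature matrix.
lengths = np.loadtxt('path/to/feats/train.lengths', dtype=int)
offsets = np.concatenate(([0], np.cumsum(lengths)))

i = 0                                              # index of the sample to classify
data = np.load('path/to/feats/train.npy')
sample_i = data[offsets[i]:offsets[i + 1], :]      # shape: (lengths[i], 768)

model = torch.load('path/to/save_dir/20240110_175202/checkpoint.pt').to('cuda').eval()
with torch.no_grad():
    output = model(torch.tensor(sample_i, dtype=torch.float32).to('cuda'))
print(output.argmax(dim=1)[0].item())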


zszheng147 commented on July 21, 2024

Regarding the AssertionError ("Could not infer task type from {'_name': 'emotion2vec_pretraining', ...}") raised by scripts/emotion2vec_extract_features.sh, and whether it is a fairseq version issue:

Hi. There's nothing wrong with the fairseq version. Could you please check or pull the latest code? We have provided guidance on how to perform inference in it.
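For context, the AssertionError means the installed fairseq cannot find a task named emotion2vec_pretraining, i.e. the repo's custom task is not registered in your environment. A hedged sketch of how a user-defined fairseq task is typically registered before loading such a checkpoint (the user_dir value below is a placeholder, not a path confirmed by this repo; point it at the directory that defines the task in the code you pulled):

import argparse
from fairseq import checkpoint_utils, utils

# Register the custom task/model definitions before touching the checkpoint.
utils.import_user_module(argparse.Namespace(user_dir='./upstream'))  # placeholder path

model, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ['./path/to/emotion2vec_base/emotion2vec_base.pt']
)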


Efr0nd commented on July 21, 2024

Thank you very much for your response! It has been very helpful to me. I have already resolved the issue of model inference.

Another question is about the train.lengths file. In the file, the first number is 97 and the second is 68. Does this mean that the first audio produced 97 feature vectors of 768 dimensions each, while the second audio produced 68? If so, why does the upstream model extract a different number of feature vectors from different audio files?

When I use the code

from funasr import AutoModel
model = AutoModel(model="./emotion2vec_base")
res = model(input="./emotion2vec_base/example/test.wav", output_dir="./outputs")
print(res)

from https://www.modelscope.cn/damo/emotion2vec_base.git to extract features from my own recorded audio, I found that the npy file in the outputs folder contains only a single 768-dimensional feature vector, not 97 or 68, just one. Can you help me understand why this is happening?


LauraGPT commented on July 21, 2024

Regarding why the npy file written to the outputs folder contains only a single 768-dimensional vector:

You could set the granularity like this:

from funasr import AutoModel

model = AutoModel(model="./emotion2vec_base")
wav_file = f"{model.model_path}/example/test.wav"
res = model.generate(wav_file, output_dir="./outputs", granularity="utterance")
print(res)

But you should re-install funasr by:

git clone -b main --single-branch https://github.com/alibaba-damo-academy/FunASR.git && cd FunASR
pip install -e ./


ddlBoJack commented on July 21, 2024

Hi, you can control this with granularity="utterance" or granularity="frame". Note that granularity="utterance" is simply a temporal pooling of granularity="frame".
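For intuition, a hedged sketch of that relationship, assuming granularity="frame" dumps a (T, 768) matrix to the output npy file (the file name below is an assumption based on the example wav):

import numpy as np

# Frame-level output: one 768-dim vector per frame, shape (T, 768).
frame_feats = np.load('./outputs/test.npy')   # assumed name of the frame-level dump
# The utterance-level vector is (roughly) the temporal mean of the frame-level ones.
utt_feat = frame_feats.mean(axis=0)           # shape: (768,)
print(frame_feats.shape, utt_feat.shape)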


Efr0nd commented on July 21, 2024

Thank you very much for your patient responses! I have resolved the issue.


Efr0nd commented on July 21, 2024

I have an additional question. Are there currently any well-performing downstream models for sentiment classification? The simple downstream models provided by your project already yield good classification results, but I would like to make some improvements to achieve a more comprehensive speech emotion recognition system. Could you provide me with some materials or tips?

