Comments (10)
Thanks for your question. The IEMOCAP dataset consists of 5531 samples (after removing labels that are not needed), and the train.lengths file indicates the varying lengths of these samples. Each sample comprises a sequence of 768-dimensional feature vectors, where the total number of vectors across all samples is 1253877.
To perform inference on these samples using your pre-trained model, you can:
- Load your model: model = torch.load(model_path).to(device)
- Prepare test_loader and label_dict as in training.
- Run inference: test_wa, test_ua, test_f1 = validate_and_test(model, test_loader, device, num_classes=len(label_dict))
The code I mentioned is also present in main.py. In fact, once training is completed, you should be able to get a classification result. If you only need the classification label, you can look into the validate_and_test function in the misc.py file, where you can easily obtain an emotion label:
outputs = model(feats, speech_padding_mask)
_, predicted = torch.max(outputs.data, 1)
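The two lines above can be wrapped for a single utterance roughly as follows. This is a minimal sketch, not the repository's code: TinyClassifier is a hypothetical stand-in for the trained downstream model, predict_one is a hypothetical helper, and the four-class output and the mask convention (True marks padded frames) are assumptions.

```python
import torch
import torch.nn as nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in for the downstream model (not the repository's class)."""

    def __init__(self, input_dim=768, num_classes=4):
        super().__init__()
        self.proj = nn.Linear(input_dim, num_classes)

    def forward(self, feats, padding_mask=None):
        # feats: (batch, frames, input_dim)
        # padding_mask: (batch, frames), True where a frame is padding (assumed convention)
        if padding_mask is not None:
            keep = (~padding_mask).unsqueeze(-1).float()
            pooled = (feats * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)
        else:
            pooled = feats.mean(dim=1)
        return self.proj(pooled)


def predict_one(model, sample):
    """Classify one utterance given its (frames, input_dim) feature tensor."""
    model.eval()
    feats = sample.unsqueeze(0)  # add a batch dimension: (1, frames, input_dim)
    # A single utterance has no padded frames, so the mask is all False.
    mask = torch.zeros(feats.shape[:2], dtype=torch.bool, device=feats.device)
    with torch.no_grad():
        outputs = model(feats, mask)
    _, predicted = torch.max(outputs, 1)
    return int(predicted.item())


model = TinyClassifier()
label_index = predict_one(model, torch.randn(97, 768))
print(label_index)
```

The returned index would still need to be mapped back to an emotion name through whatever label_dict was used in training.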
from emotion2vec.
Thank you for your reply!
In my understanding, the method you mentioned involves batch classifying the entire test set. I'd like to know how to classify a specific sample or a new sample individually.
For example, if the first line of train.lengths is 97, as I understand it, this means the first sample has 97 feature vectors, each of 768 dimensions. The downstream model expects input of 768 dimensions. How can I input the first sample into the model to get the correct classification?
I've written some simple code: I passed the entire (97, 768) tensor into the model, and it ran successfully. Is this correct?
import numpy as np
import torch

data = np.load('path/to/feats/train.npy')
sample = data[:97, :]  # first sample: 97 frames of 768 dimensions
input_tensor = torch.from_numpy(sample).float().to('cuda')
model = torch.load('path/to/save_dir/20240110_175202/checkpoint.pt')
model.to('cuda')
output = model(input_tensor)
probabilities = torch.nn.functional.softmax(output, dim=1)
predicted_class = torch.argmax(probabilities, dim=1)[0].item()
print(f"Predicted class for sample i: {predicted_class}")
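To pick out a sample other than the first, the per-sample lengths can be turned into row offsets. The sketch below assumes that train.npy stores all samples concatenated along the first axis in the same order as train.lengths, and that train.lengths holds one integer per line; sample_bounds is a hypothetical helper, and the third length (120) is made up for illustration.

```python
import numpy as np


def sample_bounds(lengths, index):
    """Return the (start, end) row range of sample `index` in the concatenated array."""
    # offsets[i] is the first row of sample i; offsets[i + 1] is one past its last row.
    offsets = np.concatenate(([0], np.cumsum(lengths)))
    return int(offsets[index]), int(offsets[index + 1])


# Toy data standing in for train.lengths and train.npy.
lengths = [97, 68, 120]
feats = np.zeros((sum(lengths), 768), dtype=np.float32)

start, end = sample_bounds(lengths, 1)  # second sample
second_sample = feats[start:end]
print(start, end, second_sample.shape)
```

With the real files, lengths could be read with something like `[int(line) for line in open('path/to/feats/train.lengths')]`, again assuming one integer per line.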
from emotion2vec.
I encountered an error while executing the command:
bash scripts/emotion2vec_extract_features.sh $fairseq_root $manifest_path $model_path $checkpoint_path $feat_path
I got emotion2vec_base from https://www.modelscope.cn/damo/emotion2vec_base.git
And my own path is:
IEMOCAP_ROOT=./path/to/IEMOCAP_full_release
manifest_path=./path/to/manifest
checkpoint_path=./path/to/emotion2vec_base/emotion2vec_base.pt
feat_path=./path/to/feats
fairseq_root=https://github.com/pytorch/fairseq
model_path=./path/to/emotion2vec_base
The error is:
Traceback (most recent call last):
File "scripts/emotion2vec_speech_features.py", line 123, in <module>
main()
File "scripts/emotion2vec_speech_features.py", line 111, in main
generator, num = get_iterator(args)
File "scripts/emotion2vec_speech_features.py", line 84, in get_iterator
reader = Emotion2vecFeatureReader(
File "scripts/emotion2vec_speech_features.py", line 48, in __init__
model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([checkpoint])
File "/root/miniconda3/lib/python3.8/site-packages/fairseq/checkpoint_utils.py", line 436, in load_model_ensemble_and_task
task = tasks.setup_task(cfg.task)
File "/root/miniconda3/lib/python3.8/site-packages/fairseq/tasks/__init__.py", line 42, in setup_task
assert (
AssertionError: Could not infer task type from {'_name': 'emotion2vec_pretraining', 'data': '/mnt/lustre/sjtu/home/zym22/data/emotion_recognition/manifest', 'labels': None, 'multi_corpus_keys': None, 'multi_corpus_sampling_weights': None, 'binarized_dataset': False, 'sample_rate': 16000, 'normalize': True, 'enable_padding': False, 'max_sample_size': 320000, 'min_sample_size': 8000, 'num_batch_buckets': 0, 'tpu': False, 'text_compression_level': 'none', 'rebuild_batches': True, 'precompute_mask_config': {'feature_encoder_spec': '[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] + [(512,2,2)]', 'mask_prob': 0.5, 'mask_prob_adjust': 0.05, 'mask_length': 5, 'inverse_mask': False, 'mask_dropout': 0.0, 'clone_batch': 8, 'expand_adjacent': False, 'non_overlapping': False}, 'post_save_script': None, 'subsample': 1.0, 'seed': 1, 'sort_indices_mutiple_corpora': True, 'batch_sample_multiple_corpora': False}. Available argparse tasks: dict_keys(['multilingual_masked_lm', 'translation', 'multilingual_language_modeling', 'audio_pretraining', 'speech_to_text', 'text_to_speech', 'frm_text_to_speech', 'speech_unit_modeling', 'denoising', 'multilingual_denoising', 'speech_to_speech', 'sentence_prediction', 'cross_lingual_lm', 'sentence_ranking', 'audio_finetuning', 'translation_multi_simple_epoch', 'language_modeling', 'translation_from_pretrained_xlm', 'sentence_prediction_adapters', 'translation_lev', 'masked_lm', 'hubert_pretraining', 'multilingual_translation', 'online_backtranslation', 'simul_speech_to_text', 'simul_text_to_text', 'semisupervised_translation', 'legacy_masked_lm', 'translation_from_pretrained_bart', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt']). 
Available hydra tasks: dict_keys(['translation', 'multilingual_language_modeling', 'audio_pretraining', 'speech_unit_modeling', 'sentence_prediction', 'audio_finetuning', 'language_modeling', 'translation_from_pretrained_xlm', 'sentence_prediction_adapters', 'translation_lev', 'masked_lm', 'hubert_pretraining', 'simul_text_to_text', 'dummy_lm', 'dummy_masked_lm'])
Is this an issue with the fairseq version? My current version is 0.12.2.
from emotion2vec.
Yes, it's correct.
from emotion2vec.
Hi. There's nothing wrong with the fairseq version. Could you please check or pull the latest code? We have provided guidance on how to perform inference in it.
from emotion2vec.
Thank you very much for your response! It has been very helpful to me. I have already resolved the issue of model inference.
Another question concerns the file train.lengths. In the file, the first number is 97 and the second is 68. Does this mean that the first audio clip produced 97 feature vectors of 768 dimensions each, while the second produced 68? If so, why does the upstream model extract a different number of feature vectors from different audio files?
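One likely explanation is that the number of frames scales with the audio duration. The sketch below estimates the frame count from the number of waveform samples, assuming the convolutional front end follows the feature_encoder_spec printed in the traceback earlier in this thread, that each layer uses no padding, and that the audio is sampled at 16000 Hz. The num_frames helper is hypothetical, not part of the repository.

```python
# (channels, kernel, stride) per layer, copied from the config in the traceback.
FEATURE_ENCODER_SPEC = [(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512, 2, 2)] + [(512, 2, 2)]


def num_frames(num_samples, spec=FEATURE_ENCODER_SPEC):
    """Estimate how many feature frames a waveform of `num_samples` samples yields."""
    n = num_samples
    for _channels, kernel, stride in spec:
        # Output length of an unpadded 1-D convolution.
        n = (n - kernel) // stride + 1
    return n


for seconds in (1.0, 2.0, 4.0):
    print(seconds, num_frames(int(seconds * 16000)))
```

Under these assumptions the total stride is 320 samples, so one second of audio gives 49 frames, and a sample with 97 frames would correspond to roughly two seconds of audio.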
When I use the code
from funasr import AutoModel
model = AutoModel(model="./emotion2vec_base")
res = model(input="./emotion2vec_base/example/test.wav", output_dir="./outputs")
print(res)
at https://www.modelscope.cn/damo/emotion2vec_base.git to extract features from my recorded audio, I found that the npy files obtained from the outputs folder contain only one 768-dimensional feature vector, not 97 or 68, just one. Can you help me understand why this is happening?
from emotion2vec.
You could set the granularity as:
from funasr import AutoModel
model = AutoModel(model="./emotion2vec_base")
wav_file = f"{model.model_path}/example/test.wav"
res = model.generate(wav_file, output_dir="./outputs", granularity="utterance")
print(res)
But you should re-install funasr by:
git clone -b main --single-branch https://github.com/alibaba-damo-academy/FunASR.git && cd FunASR
pip install -e ./
from emotion2vec.
Hi, you can control this with granularity="utterance" or granularity="frame". Note that granularity="utterance" is simply a temporal pooling of granularity="frame".
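As a rough illustration of that relationship, frame-level features can be reduced to a single utterance-level vector by pooling over the time axis. The sketch assumes mean pooling, which the thread does not confirm (it only says "temporal pooling"); utterance_from_frames is a hypothetical helper and the random array stands in for real frame-level output.

```python
import numpy as np


def utterance_from_frames(frame_feats):
    """Average a (frames, dim) array over time to get one (dim,) vector."""
    return frame_feats.mean(axis=0)


# Stand-in for frame-level features of one utterance: 97 frames of 768 dimensions.
frame_feats = np.random.rand(97, 768).astype(np.float32)
utt_feat = utterance_from_frames(frame_feats)
print(frame_feats.shape, utt_feat.shape)
```

This would explain why the utterance-level .npy file holds a single 768-dimensional vector while train.lengths reports 97 or 68 frames per sample.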
from emotion2vec.
Thank you very much for your patient responses! I have resolved the issue.
from emotion2vec.
I have an additional question. Are there currently any well-performing downstream models for sentiment classification? The simple downstream models provided by your project already yield good classification results, but I would like to make some improvements to achieve a more comprehensive speech emotion recognition system. Could you provide me with some materials or tips?
from emotion2vec.