
allosaurus's Introduction

Allosaurus


Allosaurus is a pretrained universal phone recognizer. It can be used to recognize phones in more than 2000 languages.

This tool is based on our ICASSP 2020 work Universal Phone Recognition with a Multilingual Allophone System

Architecture

Get Started

Install

Allosaurus is available from pip

pip install allosaurus

You can also clone this repository and install

python setup.py install

Quick start

The basic usage is pretty simple: your input is a wav audio file and the output is a sequence of phones.

python -m allosaurus.run  -i <audio>

For example, you can try using the attached sample file in this repository. Guess what's in this audio file :)

python -m allosaurus.run -i sample.wav
æ l u s ɔ ɹ s

You can also use allosaurus directly in python

from allosaurus.app import read_recognizer

# load your model
model = read_recognizer()

# run inference -> æ l u s ɔ ɹ s
model.recognize('sample.wav')

For full features and details, please refer to the following sections.

Inference

The command line interface is as follows:

python -m allosaurus.run [--lang <language name>] [--model <model name>] [--device_id <gpu_id>] [--output <output_file>] [--topk <int>] -i <audio file/directory>

It will recognize the narrow phones in the audio file(s). Only the input argument is mandatory; the other options can be ignored. Please refer to the following sections for their details.

There is also a simple python interface as follows:

from allosaurus.app import read_recognizer

# load your model by the <model name>, will use 'latest' if left empty
model = read_recognizer(model)

# run inference on <audio_file> with <lang>, lang will be 'ipa' if left empty
model.recognize(audio_file, lang)

The details of the arguments in both interfaces are as follows:

Input

The input can be a single file or a directory containing multiple audio files.

If the input is a single file, it will output only the phone sequence; if the input is a directory, it will output both the file name and the phone sequence, with results sorted by file name.

The audio file(s) should be in the following format:

  • It should be a wav file. If the audio is not in wav format, please convert it to wav using sox or ffmpeg in advance (see the example after this list).

  • The sampling rate can be arbitrary; we will automatically resample the audio based on each model's requirements.

  • We assume the audio is mono-channel.
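If you need to convert other formats first, the following ffmpeg command is one way to do it (a sketch; the filenames are placeholders, and the explicit 16 kHz rate is optional since allosaurus resamples automatically):

# convert any audio to a mono 16 kHz wav file
ffmpeg -i input.mp3 -ac 1 -ar 16000 output.wav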

Output

By default, the output goes to stdout (i.e. all results are printed to the terminal).

If you specify a file as the output, then all output will be directed to that file.
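For example, the following command writes the results to a file instead of the terminal (result.txt is a placeholder path):

python -m allosaurus.run -i sample.wav --output result.txt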

Language

The lang option is the language id. It specifies the phone inventory you want to use. The default option is ipa, which tells the recognizer to use the entire inventory (around 230 phones).

Generally, specifying the language inventory can improve your recognition accuracy.
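For example, to recognize the sample file with the English inventory (language ids are described below):

python -m allosaurus.run --lang eng -i sample.wav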

You can check the full language list with the following command. The number of available languages is around 2000.

python -m allosaurus.bin.list_lang

To check a language's inventory, you can use the following command

python -m allosaurus.bin.list_phone [--lang <language name>]

For example,

# to get English phone inventory
# ['a', 'aː', 'b', 'd', 'd̠', 'e', 'eː', 'e̞', 'f', 'h', 'i', 'iː', 'j', 'k', 'kʰ', 'l', 'm', 'n', 'o', 'oː', 'p', 'pʰ', 'r', 's', 't', 'tʰ', 't̠', 'u', 'uː', 'v', 'w', 'x', 'z', 'æ', 'ð', 'øː', 'ŋ', 'ɐ', 'ɐː', 'ɑ', 'ɑː', 'ɒ', 'ɒː', 'ɔ', 'ɔː', 'ɘ', 'ə', 'əː', 'ɛ', 'ɛː', 'ɜː', 'ɡ', 'ɪ', 'ɪ̯', 'ɯ', 'ɵː', 'ɹ', 'ɻ', 'ʃ', 'ʉ', 'ʉː', 'ʊ', 'ʌ', 'ʍ', 'ʒ', 'ʔ', 'θ']
python -m allosaurus.bin.list_phone --lang eng

# you can also skip lang option to get all inventory
#['I', 'a', 'aː', 'ã', 'ă', 'b', 'bʲ', 'bʲj', 'bʷ', 'bʼ', 'bː', 'b̞', 'b̤', 'b̥', 'c', 'd', 'dʒ', 'dʲ', 'dː', 'd̚', 'd̥', 'd̪', 'd̯', 'd͡z', 'd͡ʑ', 'd͡ʒ', 'd͡ʒː', 'd͡ʒ̤', 'e', 'eː', 'e̞', 'f', 'fʲ', 'fʷ', 'fː', 'g', 'gʲ', 'gʲj', 'gʷ', 'gː', 'h', 'hʷ', 'i', 'ij', 'iː', 'i̞', 'i̥', 'i̯', 'j', 'k', 'kx', 'kʰ', 'kʲ', 'kʲj', 'kʷ', 'kʷʼ', 'kʼ', 'kː', 'k̟ʲ', 'k̟̚', 'k͡p̚', 'l', 'lʲ', 'lː', 'l̪', 'm', 'mʲ', 'mʲj', 'mʷ', 'mː', 'n', 'nj', 'nʲ', 'nː', 'n̪', 'n̺', 'o', 'oː', 'o̞', 'o̥', 'p', 'pf', 'pʰ', 'pʲ', 'pʲj', 'pʷ', 'pʷʼ', 'pʼ', 'pː', 'p̚', 'q', 'r', 'rː', 's', 'sʲ', 'sʼ', 'sː', 's̪', 't', 'ts', 'tsʰ', 'tɕ', 'tɕʰ', 'tʂ', 'tʂʰ', 'tʃ', 'tʰ', 'tʲ', 'tʷʼ', 'tʼ', 'tː', 't̚', 't̪', 't̪ʰ', 't̪̚', 't͡s', 't͡sʼ', 't͡ɕ', 't͡ɬ', 't͡ʃ', 't͡ʃʲ', 't͡ʃʼ', 't͡ʃː', 'u', 'uə', 'uː', 'u͡w', 'v', 'vʲ', 'vʷ', 'vː', 'v̞', 'v̞ʲ', 'w', 'x', 'x̟ʲ', 'y', 'z', 'zj', 'zʲ', 'z̪', 'ä', 'æ', 'ç', 'çj', 'ð', 'ø', 'ŋ', 'ŋ̟', 'ŋ͡m', 'œ', 'œ̃', 'ɐ', 'ɐ̞', 'ɑ', 'ɑ̱', 'ɒ', 'ɓ', 'ɔ', 'ɔ̃', 'ɕ', 'ɕː', 'ɖ̤', 'ɗ', 'ə', 'ɛ', 'ɛ̃', 'ɟ', 'ɡ', 'ɡʲ', 'ɡ̤', 'ɡ̥', 'ɣ', 'ɣj', 'ɤ', 'ɤɐ̞', 'ɤ̆', 'ɥ', 'ɦ', 'ɨ', 'ɪ', 'ɫ', 'ɯ', 'ɯ̟', 'ɯ̥', 'ɰ', 'ɱ', 'ɲ', 'ɳ', 'ɴ', 'ɵ', 'ɸ', 'ɹ', 'ɹ̩', 'ɻ', 'ɻ̩', 'ɽ', 'ɾ', 'ɾj', 'ɾʲ', 'ɾ̠', 'ʀ', 'ʁ', 'ʁ̝', 'ʂ', 'ʃ', 'ʃʲː', 'ʃ͡ɣ', 'ʈ', 'ʉ̞', 'ʊ', 'ʋ', 'ʋʲ', 'ʌ', 'ʎ', 'ʏ', 'ʐ', 'ʑ', 'ʒ', 'ʒ͡ɣ', 'ʔ', 'ʝ', 'ː', 'β', 'β̞', 'θ', 'χ', 'ә', 'ḁ']
python -m allosaurus.bin.list_phone

Model

The model option selects the model used for inference. The default option is latest, which points to the latest model you have downloaded. The latest model will be downloaded automatically during your first inference if you do not have any local models.

We intend to train new models and release them continuously. An update might include both the acoustic model binary and the phone inventory. Typically, a model's name indicates its training date, so a higher model id can usually be expected to perform better.

To download a new model, you can run the following command.

python -m allosaurus.bin.download_model -m <model>

If you do not know the model name, you can just use latest as the model name and it will automatically download the latest model.

Note that updating to a new model will not delete the original models. All models are stored under the pretrained directory where allosaurus is installed. You might want to pin your model to get consistent results within one experiment.
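For example, in Python you can pin a specific model by passing its name to read_recognizer (uni2005 is the universal model listed below):

from allosaurus.app import read_recognizer

# load a fixed model instead of 'latest' for reproducible results
model = read_recognizer('uni2005')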

To see which models are available in your local environment, you can check with the following command

python -m allosaurus.bin.list_model

To delete a model, you can use the following command. This might be useful when you are fine-tuning your models mentioned later.

python -m allosaurus.bin.remove_model

The currently available models are the following.

Language Independent Model (Universal Model)

The universal models predict language-independent phones and cover many languages. This is the default model allosaurus will download and use. If you cannot find your language among the language-dependent models, please use this universal model instead.

Model Target Language Description
uni2005 Universal This is the latest model (previously named 200529)

Language Dependent Model

We are planning to deliver language-dependent models for some widely used languages. The models here are trained specifically on the target language and should perform much better than the universal model for that language. These models are not downloaded automatically; please use the download_model command above to download them, and use the --model flag during inference.

Model Target Language Description
eng2102 English (eng) English only model

Device

device_id controls which device runs the inference.

By default, device_id will be -1, which indicates the model will only use CPUs.

However, if you have a GPU, you can use it for inference by setting device_id to a single GPU id. (Note that multi-GPU inference is not supported.)
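For example, to run inference on the first GPU:

python -m allosaurus.run --device_id 0 -i sample.wav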

Timestamp

You can retrieve an approximate timestamp for each recognized phone by using the timestamp argument.

python -m allosaurus.run --timestamp=True -i sample.wav 
0.210 0.045 æ
0.390 0.045 l
0.450 0.045 u
0.540 0.045 s
0.630 0.045 ɔ
0.720 0.045 ɹ
0.870 0.045 s

The format of each line is start_timestamp duration phone, where start_timestamp and duration are given in seconds.

Note that the current timestamp is only an approximation. It is provided by the CTC model, which might not be accurate in some cases due to its nature.

The same interface is also available in python as follows:

model = read_recognizer()
model.recognize('./sample.wav', timestamp=True)
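If you need the timestamps programmatically, the following is a minimal parsing sketch; it assumes recognize returns the same start_timestamp duration phone lines that the command line prints:

result = model.recognize('./sample.wav', timestamp=True)

# split each "start duration phone" line into typed fields
for line in result.strip().splitlines():
    start, duration, phone = line.split()
    print(float(start), float(duration), phone)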

Top K

Sometimes generating more phones might be helpful. Specifying the topk argument will generate k phones at each emitting frame. The default is 1.

# default topk is 1
python -m allosaurus.run -i sample.wav
æ l u s ɔ ɹ s

# output top 5 probable phones at emitting frame, "|" is used to delimit frames (no delimiter when topk=1)
# probability is attached for each phone, the left most phone is the most probable phone 
# <blk> is blank which can be ignored.
python -m allosaurus.run -i sample.wav --topk=5
æ (0.577) ɛ (0.128) ɒ (0.103) a (0.045) ə (0.021) | l (0.754) l̪ (0.196) lː (0.018) ʁ (0.007) ʀ (0.006) | u (0.233) ɨ (0.218) uː (0.104) ɤ (0.070) ɪ (0.066) | s (0.301) <blk> (0.298) z (0.118) s̪ (0.084) sː (0.046) | ɔ (0.454) ɑ (0.251) <blk> (0.105) ɹ̩ (0.062) uə (0.035) | ɹ (0.867) ɾ (0.067) <blk> (0.024) l̪ (0.018) r (0.015) | s (0.740) z (0.191) s̪ (0.039) zʲ (0.009) sː (0.003)
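If you want to post-process this output, the following is a minimal parsing sketch, assuming the phone (probability) format with | between frames shown above:

output = 'æ (0.577) ɛ (0.128) | l (0.754) l̪ (0.196)'  # shortened example line

frames = []
for frame in output.split('|'):
    tokens = frame.split()
    # tokens alternate between a phone and its probability in parentheses
    frames.append([(tokens[i], float(tokens[i + 1].strip('()')))
                   for i in range(0, len(tokens), 2)])

# frames[0] -> [('æ', 0.577), ('ɛ', 0.128)]
print(frames)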

Phone Emission

You can tell the model to emit more or fewer phones by changing the --emit or -e argument.

# default emit is 1.0
python -m allosaurus.run -i sample.wav 
æ l u s ɔ ɹ s

# emit more phones when emit > 1
python -m allosaurus.run -e 1.2 -i sample.wav 
æ l u s f h ɔ ɹ s

# emit less phones when emit < 1
python -m allosaurus.run -e 0.8 -i sample.wav 
æ l u ɹ s

Inventory Customization

The default phone inventory might not be the inventory you would like to use, so we provide several commands here for you to customize your own inventory.

We have mentioned that you can check your current (default) inventory with the following command.

python -m allosaurus.bin.list_phone --lang <language name>

The current phone inventory can be dumped into a file

# dump the phone file
python -m allosaurus.bin.write_phone --lang <language name> --output <a path to save this file>

If you take a look at the file, the format is simple: each line contains a single phone. For example, the following is the English file

a
aː
b
d
...

You can customize this file to add or delete the IPA phones you would like. Each line should contain only one IPA phone without any spaces. It might be easier to debug later if the phones are sorted, but this is not required.
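If you prefer to edit the dumped file programmatically, the following is a minimal sketch (eng_phones.txt and the specific edits are only an illustration):

# read the dumped inventory, one phone per line
phones = [line.strip() for line in open('eng_phones.txt', encoding='utf-8') if line.strip()]

# example edits: drop a phone you never expect, add one that is missing
phones = [p for p in phones if p != 'x']
if 'ʍ' not in phones:
    phones.append('ʍ')

# write the customized inventory back out, sorted for easier debugging
with open('eng_phones_custom.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(sorted(phones)) + '\n')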

Next, update your model's inventory by the following command

python -m allosaurus.bin.update_phone --lang <language name> --input <the file you customized>

The file will then be registered with your model. Run the list_phone command again and you should see that it now uses your updated inventory

python -m allosaurus.bin.list_phone --lang <language name>

Now, if you run the inference again, you will see that the results also reflect your updated inventory.

Even after your update, you can easily switch back to the original inventory with the following command. In this case, your updated file will be deleted.

python -m allosaurus.bin.restore_phone --lang <language name>

Prior Customization

You can also change the results by adjusting the prior probability for each phone. This can help you reduce the unwanted phones or increase the wanted phones.

For example, in the sample file, we get the output

æ l u s ɔ ɹ s

Suppose you think the first phone is wrong and would like to reduce its probability; you can create a new file prior.txt as follows

æ -10.0

The file can contain multiple lines, and each line contains information for one phone. The first field is the target phone and the second field is the log-based score used to adjust its probability: a positive score boosts its prediction, and a negative score suppresses it. In this case, we get a new result

python -m allosaurus.run -i=sample.wav --lang=eng --prior=prior.txt 
ɛ l u s ɔ ɹ s

where you can see æ is suppressed and another vowel, ɛ, replaces it.

Another application of the prior is to change the total number of output phones. You might want more or fewer phone outputs. In this case, you can change the score for <blk>, which corresponds to the silence phone.

A positive <blk> score adds more silence and therefore decreases the number of outputs; similarly, a negative <blk> score increases the outputs. The following examples illustrate this.
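For reference, the prior.txt used in each run below is assumed to contain only a single <blk> line, for example

<blk> 1.0

for the first run, and <blk> -1.0 for the second.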


# <blk> 1.0
python -m allosaurus.run -i=sample.wav --lang=eng --prior=prior.txt 
æ l u ɔ ɹ s

# <blk> -1.0
$ python -m allosaurus.run -i=sample.wav --lang=eng --prior=prior.txt 
æ l u s f ɔ ɹ s

The first example reduces one phone and the second example adds a new phone.

Fine-Tuning

We notice that the pretrained models might not be accurate enough for some languages, so we also provide a fine-tuning tool that allows users to further improve their model by adapting it to their data. Currently, fine-tuning is limited to one language at a time.

Prepare

To fine-tune on your data, you need to prepare audio files and their transcriptions. First, create a data directory (the name can be arbitrary); inside it, create a train directory and a validate directory. The train directory will contain your training set, and the validate directory will contain your validation set.

Each directory should contain the following two files:

  • wave: a file associating each utterance with its corresponding audio
  • text: a file associating each utterance with its phones
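For example, the expected layout looks like this (the top-level name data is arbitrary):

data/
    train/
        wave
        text
    validate/
        wave
        text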

wave

wave is a txt file mapping each utterance to your wav files. Each line should be prepared as follows:

utt_id /path/to/your/audio.wav

Here utt_id denotes the utterance id; it can be an arbitrary string as long as it is unique within your dataset. audio.wav is your wav file as mentioned above; it should be mono-channel wav, but the sampling rate can be arbitrary (the tool will automatically resample if necessary). The delimiter used here is a space.

To get the best fine-tuning results, each audio file should not be very long. We recommend keeping each utterance shorter than 10 seconds.

text

text is another txt file mapping each utterance to its transcription. Each line should be prepared as follows

utt_id phone1 phone2 ...

Here utt_id is again the utterance id and should match the corresponding wav file. The phone sequence after the utterance id is the phonetic transcription of the wav file. The phones here should be restricted to the phone inventory of your target language; please make sure all your phones are already registered for the target language by using the list_phone command.
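As a concrete (hypothetical) example, a matching wave entry and text entry for one English utterance could look like this:

# one line of the wave file
utt_001 /path/to/audio/utt_001.wav

# the matching line of the text file (phones from the eng inventory)
utt_001 h ə l oː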

Feature Extraction

Next, we will extract features from both the wave file and the text file. We assume that you have already prepared the wave and text files in BOTH the train directory and the validate directory.

Audio Feature

To prepare the audio features, run the following command on both your train directory and validate directory.

# command to prepare audio features
python -m allosaurus.bin.prep_feat --model=some_pretrained_model --path=/path/to/your/directory (train or validate)

The path should point to the train or the validate directory, and the model should point to your target pretrained model. If unspecified, it will use the latest model. This command generates three files: feat.scp, feat.ark, and shape.

  • The first is a file indexing each utterance to an offset in the second file.

  • The second is a binary file containing all the audio features.

  • The third contains the feature dimension information.

If you are curious, the scp and ark formats are standard file formats used in Kaldi.

Text Feature

To prepare the text features, run the following command again on both your train directory and validate directory.

# command to prepare token
python -m allosaurus.bin.prep_token --model=<some_pretrained_model> --lang=<your_target_language_id> --path=/path/to/your/directory (train or validate)

The path and model should be the same as in the previous command. The lang is the 3-character ISO language id of this dataset. Note that you should already have verified that the phone inventory of this language id contains all of your phone transcriptions; otherwise, the extraction here might fail.

After this command, it will generate a file called token which maps each utterance to the phone id sequences.
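Putting the two steps together, a typical preparation run might look like the following (the paths and the eng language id are placeholders):

# extract audio features for both splits
python -m allosaurus.bin.prep_feat --model=latest --path=/path/to/data/train
python -m allosaurus.bin.prep_feat --model=latest --path=/path/to/data/validate

# extract phone id sequences for both splits
python -m allosaurus.bin.prep_token --model=latest --lang=eng --path=/path/to/data/train
python -m allosaurus.bin.prep_token --model=latest --lang=eng --path=/path/to/data/validate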

Training

Next, we can start fine-tuning our model with the dataset we just prepared. The fine-tuning command is very simple.

# command to fine_tune your data
python -m allosaurus.bin.adapt_model --pretrained_model=<pretrained_model> --new_model=<your_new_model> --path=/path/to/your/data/directory --lang=<your_target_language_id> --device_id=<device_id> --epoch=<epoch>

There are a couple of other optional arguments available here, but we describe only the required ones.

  • pretrained_model should be the same model you specified before in the prep_token and prep_feat.

  • new_model can be an arbitrary model name (it might be easier to manage if you give each model the same date format as the pretrained models, i.e. YYMMDD)

  • The path should be pointing to the parent directory of your train and validate directories.

  • The lang is the language id you specified in prep_token

  • device_id is the GPU id for fine-tuning; if you do not have any GPU, use -1 as the device_id. Multi-GPU training is not supported.

  • epoch is the number of training epochs
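For example, a CPU-only fine-tuning run over the hypothetical data directory above could look like this (all values are placeholders):

python -m allosaurus.bin.adapt_model --pretrained_model=latest --new_model=eng_finetune_210101 --path=/path/to/data --lang=eng --device_id=-1 --epoch=10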

During training, it will show information such as the loss and phone error rate for both your training set and validation set. After each epoch, the model is evaluated on the validation set, and the checkpoint is saved if its validation phone error rate is better than the previous best. After the specified number of epochs has finished, the fine-tuning process ends and the new model will be available.

Testing

After training, the new model should be available in your model list. Use the list_model command to check that your new model is now available.

# command to check all your models
python -m allosaurus.bin.list_model

If it is available, then this new model can be used in the same way as any other pretrained model. Just run the inference with your new model.

python -m allosaurus.run --lang <language id> --model <your new model> --device_id <gpu_id> -i <audio>

Acknowledgements

This work uses part of the following codes and inventories. In particular, we heavily used AlloVera and Phoible to build this model's phone inventory.

Reference

Please cite the following paper if you use this code in your work.

If you have any advice or suggestions, please feel free to send me an email (xinjianl [at] cs.cmu.edu) or submit an issue in this repo. Thanks!

@inproceedings{li2020universal,
  title={Universal phone recognition with a multilingual allophone system},
  author={Li, Xinjian and Dalmia, Siddharth and Li, Juncheng and Lee, Matthew and Littell, Patrick and Yao, Jiali and Anastasopoulos, Antonios and Mortensen, David R and Neubig, Graham and Black, Alan W and Metze, Florian},
  booktitle={ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={8249--8253},
  year={2020},
  organization={IEEE}
}

allosaurus's People

Contributors

ajd12342, kormoczi, raotnameh, saikrishnarallabandi, steveway, willstott101, xinjli, zaidsheikh


allosaurus's Issues

Suggestions for README

Maybe you could add two things to the README:

  • A link to the arXiv version of the paper at the top of the README
  • A link to the dictate.app online interface

Segmentation fault

Hi, I tried to install with pip install allosaurus and tried to run python -m allosaurus.run -i <path>/cmu_us_slt_arctic/wav/arctic_b0340.wav where I try to transcribe a 16kHz wave file from the CMU ARCTIC dataset. I got the following results:

$ python -m allosaurus.run -i <path>/cmu_us_slt_arctic/wav/arctic_b0340.wav
ð i z k w ɪ k l ɪ tʰ ə l dʒ o j z ʌ v h ɹ̩ z w ɹ̩ s ɔ ɹ s ə z ʌ v dʒ o j t ə h ɪ m
Segmentation fault (コアダンプ)

The last Japanese word means "core dump". Has anyone encountered this issue before?
FYI, I am using python 3.7.7 and my torch version is 1.5.1+cu101.

Issue with shapes alignment

Hello! I was having an issue with fine-tuning the model. This is the error message I'm getting :
[screenshot of the error message omitted]
I'm not sure how to proceed. Any insight would be greatly appreciated, thank you!

Deterministic output

I noticed that there is some variability in the output from call to call. For example, I just ran the same 15 second sample 10 times and the output contained varying numbers of phones:

[197, 198, 200, 199, 196, 195, 203, 195, 198, 197]

Is it possible to configure/modify the code slightly to generate deterministic results? I'm not sure, but I suspect this has something to do with Torch.

device-id argument doesn't work for different GPUs

For inference, the command is
python -m allosaurus.run [--lang <language name>] [--model <model name>] [--device_id <gpu_id>] -i <audio>

However, specifying any device ID other than 0 (like say 1) still runs the inference on GPU 0.

Currently, the following code works to run inference on a GPU other than 0, but I think the intention of the device_id argument was to specify GPU ID as well.
CUDA_VISIBLE_DEVICES=<gpu_id> python -m allosaurus.run [--lang <language name>] [--model <model name>] [--device_id 0] -i <audio>

Runtime Error

Thanks for the sharing the codes. During running, I encountered the following runtime error:

Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/jiawen/Google Drive/WorkSpace/github/allosaurus/allosaurus/run.py", line 61, in <module>
    phones = recognizer.recognize(args.input, args.lang, args.topk)
  File "/Users/jiawen/Google Drive/WorkSpace/github/allosaurus/allosaurus/app.py", line 69, in recognize
    tensor_batch_lprobs = self.am(tensor_batch_feat, tensor_batch_feat_len)
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/jiawen/Google Drive/WorkSpace/github/allosaurus/allosaurus/am/allosaurus_torch.py", line 88, in forward
    hidden_pack_sequence, _ = self.blstm_layer(pack_sequence)
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 573, in forward
    self.num_layers, self.dropout, self.training, self.bidirectional)
RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 'mat1' in call to _th_addmm

I used Python 3.7 and torch 1.5. It seems to be a package version problem; could you please list all your package versions?
Thanks!

Unable to use the model

Hello, Thank you for the great repo.

I am unable to run it. Could you please help me fix the issue? I tried the following commands.

I am using Miniconda:

  1. Install allosaurus
    pip install allosaurus

  2. Run inference

(deepspeech_v0.7.4) [email protected]@wika:~$ python -m allosaurus.run -i deM23-44.wav
Traceback (most recent call last):
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/allosaurus/run.py", line 21, in <module>
    if len(get_all_models()) == 0:
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/allosaurus/model.py", line 13, in get_all_models
    assert len(models) > 0, "No models are available, you can maually download a model with download command or just run inference to download the latest one automatically"
AssertionError: No models are available, you can maually download a model with download command or just run inference to download the latest one automatically
  3. Download the model
(deepspeech_v0.7.4) [email protected]@wika:~$ python -m allosaurus.bin.download_model -m latest
downloading model  latest
from:  https://www.pyspeech.com/static/model/recognition/allosaurus/latest.tar.gz
to:    /home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/allosaurus/bin/pretrained
please wait...
(deepspeech_v0.7.4) [email protected]@wika:~$
  4. Run inference
(deepspeech_v0.7.4) [email protected]@wika:~$ python -m allosaurus.run -i deM23-44.wav
Traceback (most recent call last):
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/allosaurus/run.py", line 21, in <module>
    if len(get_all_models()) == 0:
  File "/home/LTLab.lan/agarwal/miniconda3/envs/deepspeech_v0.7.4/lib/python3.6/site-packages/allosaurus/model.py", line 13, in get_all_models
    assert len(models) > 0, "No models are available, you can maually download a model with download command or just run inference to download the latest one automatically"
AssertionError: No models are available, you can maually download a model with download command or just run inference to download the latest one automatically

audio file size limit?

Is there any limit to the size of the audio file? I tried some files with 6 minutes of data; they processed but didn't give an output.

assert wave_path.exists()

Hi, so I'm running into what should be a simple problem, but I simply can't figure out what I'm doing wrong.

I run the following command
python -m allosaurus.bin.prep_feat --path='C:\Users\maria\Allo\train'

I wanted to test with just a few samples to make sure I have everything working before using the complete dataset, but I'm stuck on this stage.

The wave txt file for the train directory contains

utt_1 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs5.wav
utt_2 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs6.wav
utt_3 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs7.wav
utt_4 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs8.wav
utt_5 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs9.wav
utt_6 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs10.wav
utt_7 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs12.wav
utt_8 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs14.wav
utt_9 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs15.wav
utt_10 C:\Users\maria\Allo\koyi\crdo-KKT_CONVERSATION_CONVERSATIONs16.wav

but I keep getting the error

Traceback (most recent call last):
File "C:\Users\maria\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\maria\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\maria\AppData\Local\Programs\Python\Python39\lib\site-packages\allosaurus\bin\prep_feat.py", line 57, in
assert wave_path.exists(), "the path directory should contain a wave file, please check README.md for details"
AssertionError: the path directory should contain a wave file, please check README.md for details

Is my wave file just formatted incorrectly and that's why I keep getting an error about no wav files existing? Is my command line argument the reason? Thank you for any help.

Support Speaker Diarization

Hello,
As you can see here I've started integrating this project into Papagayo-NG:
morevnaproject-org/papagayo-ng#49
The first results from my tests seem to be very promising.
Especially the new timestamp feature is helping a lot with that.

Is it possible to add some speaker separation to this?
Papagayo-NG itself allows several speakers for one audio file.
If we could recognize which parts are spoken by a separate speaker then that would make this a really nice solution for even
more animators.
I've taken a look at the topic, and it seems to be quite complex.
If this could be integrated to Allosaurus then that would be awesome of course.
If not there would be ways to get this into Papagayo-NG, we could do a separate pass over the audio.
I've taken a look and pyAudioAnalysis seems to already do that.
But that would be a big dependency addition.

using 'eval' instead of 'distance' in trainer.py

Hi
Thank you for allosaurus :)
I'm trying to train a model and I've got this error message
AttributeError: module 'editdistance' has no attribute 'distance'
so I replaced 'distance' with 'eval' in trainer.py and it works well
I'm using python v3.7

Incomplete phone inventory for iso gup

Description:

The phone inventory for Kunwinjku (iso gup) is incomplete. The output of python -m allosaurus.list_phone --lang gup is:

['a', 'e', 'i', 'j', 'l', 'm', 'n', 'o', 'r', 'u', 'w', 'ŋ', 'ɭ', 'ɳ', 'ɻ', 'ʔ']

However, Phoible lists the complete inventory as:

allophone description_name
m m Gunwinggu (PH 883)
i ɪ i Gunwinggu (PH 883)
j j Gunwinggu (PH 883)
u ʊ u Gunwinggu (PH 883)
a ʌ ai au a Gunwinggu (PH 883)
w w Gunwinggu (PH 883)
n n Gunwinggu (PH 883)
l l Gunwinggu (PH 883)
b p pʰ b Gunwinggu (PH 883)
ŋ ŋ Gunwinggu (PH 883)
e ɛ æ e Gunwinggu (PH 883)
o ɔ ɒ o Gunwinggu (PH 883)
ɡ k kʰ ɡ Gunwinggu (PH 883)
r r Gunwinggu (PH 883)
ɲ ɲ Gunwinggu (PH 883)
ʔ ʔ Gunwinggu (PH 883)
d̪ t̪ t̪ʰ d̪ Gunwinggu (PH 883)
ɳ ɳ Gunwinggu (PH 883)
ɭ ɭ Gunwinggu (PH 883)
ɻ ɻ Gunwinggu (PH 883)
ɖ ɖ Gunwinggu (PH 883)
ɽ ɽ Gunwinggu (PH 883)
ʎ ʎ Gunwinggu (PH 883)
dʲ tʲ tʲʰ dʲ Gunwinggu (PH 883)

https://phoible.org/inventories/view/883

Expected behavior

I would expect the allosaurus model inventory for iso gup to be:

['a', 'e', 'i', 'j', 'l', 'm', 'n', 'o', 'r', 'u', 'w', 'ŋ', 'ɭ', 'ɳ', 'ɻ', 'ʔ', 'ɪ', 'ʊ', 'ʌ', 'ai', 'au',  'b', 'p', 'pʰ', 'ɛ','æ', 'ɔ', 'ɒ', 'ɡ', 'k', 'kʰ', 'ɲ', 'd̪', 't̪', 't̪ʰ', 'ɖ', 'ɽ', 'ʎ', 'dʲ', 'tʲ', 'tʲʰ']

Phone times?

Would it be straightforward to modify Allosaurus to return the approximate times of the recognized phones?

Also, I’m a novice in this area, but for what it’s worth, very impressive tool!

Optimizing for Latency

Have the authors considered any approaches to reduce the latency of this approach?

Would be interested to understand if any avenues have been pursued (e.g., distilling into a more performant architecture)

Thanks!

How the different phonemes sounds exactly? (Preparation for fine-tuning...)

Hi,

When I use allosaurus with the eng2102 model on an English wav file, the results look quite good (although there is one issue: if there is no silence at the beginning of the wav file, some phonemes from the beginning of the speech will be missing; I am still testing this, and maybe later I will start a separate issue on this topic).

But when I use the universal model for a Hungarian wav file, the results are not so good (of course, I know it is not a very well known language ;-)).
So I would like to fine-tune the model. But for this, I need to create the text files about the phonemes of the sentences. As it is stated in the doc, the phones here should be restricted to the phone inventory of my target language.
The phone inventory for the Hungarian language is the following:
aː b bː c d dː d̠ d̪ d̪ː d̻ eː f fː h hː i iː j jː k kː l lː l̪ l̪ː m mː n nː n̪ n̪ː o oː p pː r rː r̪ r̪ː s sː s̪ s̻ t tː t̠ t̪ t̪ː t̻ u uː v vː w y yː z zː z̪ z̻ æ ø øː ɑ ɒ ɔ ɛ ɟ ɡ ɡː ɲ ɲː ɾ ʃ ʃː ʒ ʒː ʝ ʝː
But there are some phonemes I cannot recognize.
Here is the explanation for the IPA signs for the Hungarian language:
https://hu.wikipedia.org/wiki/IPA_magyar_nyelvre
(unfortunately, it is in Hungarian, but the IPA signs are easy to find...)
Can you help me to understand this, or give me a link to any document, describing these phonemes?

Thanks!

Incorrect command for downloading model

A minor error:
In the README, the command specified for downloading a model is
python -m allosaurus.download <model>
However, the following is what actually works:
python -m allosaurus.download -m <model>

Can't download models

Hi!

I am trying to download the English model by running:

python -m allosaurus.bin.download_model -m eng2102

and I get the following error:

downloading model  eng2102
from:  https://www.pyspeech.com/static/model/recognition/allosaurus/eng2102.tar.gz
to:    /home/j/miniconda3/lib/python3.8/site-packages/allosaurus/pretrained
please wait...
Error: could not download the model

Same goes for the other model.
Is there a way to get them?

Thanks,

Allosaurus function to perform phoneme recognition without having to run the library as an executable

Hi,

Currently, the only way to perform phoneme recognition with allosaurus is to run a command in a cli type interface with the following structure python -m allosaurus.run [--lang <language name>] [--model <model name>] [--device_id <gpu_id>] [--output <output_file>] -i <audio file/directory>.

It would be great if there was a function within the library that can also do something similar for example
from allosaurus.app import read_recognizer, speech_recognizer ... phoneme_seq = speech_recognizer.recognize(model_name, speech_wav_file, other_config) ...

Issue with using dependencies numpy with numba and panphon

Hi, when I try to run the package, I get an error with panphon stating that numpy needs to be greater than 1.20.2, but if I upgrade numpy, I get an error stating that numba only works with numpy between 1.17 to 1.20

EDIT: Had to upgrade numba

Change the location for the downloaded model

Hi,

It would be great if there was support for modifying the location in which the latest model (and other model versions) would be downloaded into. For example python -m allosaurus.run [--lang <language name>] [--model <model name>] [--device_id <gpu_id>] [--output <output_file>] -i <audio file/directory> -mp <custom directory to save models to>

Checking the full language list raised IndexError

While I'm checking the full language list with python -m allosaurus.bin.list_lan, the executable raised IndexError. The following is the error message:

Traceback (most recent call last):
  File "/home/yist/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/yist/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/yist/.pyenv/versions/3.8.5/lib/python3.8/site-packages/allosaurus/bin/list_lang.py", line 13, in <module>
    model_path = get_model_path(args.model)
  File "/home/yist/.pyenv/versions/3.8.5/lib/python3.8/site-packages/allosaurus/model.py", line 27, in get_model_path
    resolved_model_name = resolve_model_name(model_name)
  File "/home/yist/.pyenv/versions/3.8.5/lib/python3.8/site-packages/allosaurus/model.py", line 76, in resolve_model_name
    return models[0].name
IndexError: list index out of range

support for python 3.10

I don't know why, but it does not work in python 3.10. Is it going to have support for 3.10 in the future?

Loss for recognizing a part of audio

Hi,

I'm trying to recognize the audio file below by lang_id='jpn', emit=1, timestamp=True, but there is nothing to be generated during 7.290~13.170s which includes about two audio clips:
drive link of the audio file

Could you please have a look at this?

By the way, I found that the generated duration seems always to be 0.045s? Could you please give some tips for optimizing it, like considering the connection between two phones, or Vowels and Consonants?

Thank you

Maximum size for inventory customization?

Hi
Is there a maximum size for inventory customization?
It seems not to work if I have more than 230 phonemes,
and I keep getting this error: assert max_domain_idx != -1

Thank you so much, and I really appreciate you taking the time to reply to me.

How was the training data processed?

Hello,

We're trying to evaluate allosaurus for a pronunciation trainer, but currently the results fluctuate a bit too much for it to be reliable. Are there any tips you have to get more consistent results? How was the training data recorded, and was it processed in some way (compressor, noise reduction, etc.)? With this information we could adjust our input data and might get better results.

Peter

Update phones in a language

I just checked and it seems the phone list for yor and pcm are incorrect. How can I update this and potentially retrain the model so it can predict the appropriate phoneme sequence?

system not deterministic

Hello. I have faced this issue a few times. It seems that the system is not deterministic. After running the model several times on the same audio file, sometimes a phone or two are replaced. The replacement seems to happen between the 1st and 2nd most likely phones when the probability of the top-1 is low.

urllib.error.URLError: <urlopen error [Errno 60] Operation timed out>

Hi:

It seems like your pre-trained model link is dead

from:  https://www.pyspeech.com/static/model/recognition/allosaurus/latest.tar.gz
to:    /python_path/lib/python3.7/site-packages/allosaurus-0.4.2-py3.7.egg/allosaurus/pretrained
please wait...
Error: could not download the model
Traceback (most recent call last):
  File "/python_path/lib/python3.7/urllib/request.py", line 1350, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/python_path/lib/python3.7/http/client.py", line 1277, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/python_path/lib/python3.7/http/client.py", line 1323, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/python_path/lib/python3.7/http/client.py", line 1272, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/python_path/lib/python3.7/http/client.py", line 1032, in _send_output
    self.send(msg)
  File "/python_path/lib/python3.7/http/client.py", line 972, in send
    self.connect()
  File "/python_path/lib/python3.7/http/client.py", line 1439, in connect
    super().connect()
  File "/python_path/lib/python3.7/http/client.py", line 944, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "/python_path/lib/python3.7/socket.py", line 728, in create_connection
    raise err
  File "/python_path/lib/python3.7/socket.py", line 716, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 60] Operation timed out

my python version: 3.7.9
allosaurus version: commit a11771dd4aa16b5162e9aae6238a58bbcac430e5

Phoneme Boundaries

Hi,

Thank you for putting up the code open-source.

I have a question, is it somehow possible to add phoneme boundaries for each word recognized.

For example:

Transcript for a wav file (german): schau mal hin ist das dorf noch nicht zu sehen
Phonemes Recognized: | ʃ a ʊ h m a l h ɪ n ɪ s t d a s d ɔ ə f n ɔ x n ɪ x t s u z e h ə n
Phonemes with word boundaries: * | ʃ a ʊ h* m a l * h ɪ n* * ɪ s t* * d a s* * d ɔ ə f* * n ɔ x* * n ɪ x t* * s u* * z e h ə n*

Not sure if I am missing something.

Thank you.

Any explanation on feature window re-ordering?

Hi, I'm looking at shrinking the processing window down from the entire audio file at once.

Could you shed any light on this line?

feature = np.concatenate((np.roll(feature, 1, axis=0), feature, np.roll(feature, -1, axis=0)), axis=1)

Why does it use np.roll to move the frames to the front, and the end as well as joining all 3 together to widen the sample?

I'd spread out the one-liner as below to try and figure it out.

    rollup   = np.roll(feature, 1, axis=0)  # make last feature first
    rolldown = np.roll(feature, -1, axis=0) # make first feature last

    combined = np.concatenate((rollup, feature, rolldown), axis=1) # join all feature on second axis
    windowed = combined[::3, ] # removes features with overlapping samples

    return windowed

It seems to make all the overlapping features into 1 deeper sample and then drops all the overlaps by getting every 3rd item. But why the np.roll ?

Realtime? (low-latency streaming inference)

Thanks for allosaurus, my experiments with it have been fruitful so far. Very impressive work!

I'm curious about whether the architecture of this package is suitable for operating on streaming audio at a reasonably low-latency?

I haven't dug much further than what I needed to load a file with pydub and get some output, and am happy to dig further. I thought it could be a good idea to start a conversation about this, perhaps the system and models are totally unsuitable for real-time, or perhaps it might just require a bit of engineering effort from me.

Thanks in advance

WASM support?

Hello,

This is a really amazing project. Is there some way to make it run directly on a website? Without going to the server? Via wasm or similar

Peter

Input wav file as a BytesIO object not working

Hi,
I wanted to read the speech wav from a BytesIO object, but it does not work because of the assert in line 65 of app.py (the filename should have a ".wav" extension). I have tried giving a filename to the BytesIO object, but it did not help either (I do not really understand why). If I comment out the above-mentioned line, then everything works well, but I would like a more appropriate / robust solution that does not require modifying the original code.
Do you have any suggestion or advice?
Thanks!

allosaurus results for Persian language

Hi, I'm trying to use allosaurus for Persian language but the results are not accurate at all!

here is an example:
model.recognize("source.wav", "pes")
returned result is:
f l a l ə m a l n ɪ k a m a n a ŋ t b a ʃ p uː x t ɔ l t b a ɪ s t ɔ n ə
but it should be like:
s a l ə m m a n m ɪ t a v ə n a m f a r s ɪ s uː x b a t ɔ o n a m

The source.wav has been attached.

What should I do? How can I improve the results?

Progress Information possible

Hi,
So this is working quite well now in Papagayo-NG.
But I wanted to know if it is possible to get progress information while it is recognizing.
Because if the input files are larger it could take a while.
If not then I will likely test slicing the input files into smaller segments based on silence gaps if possible and running them in series.
So I can then show an approximate progress status.
But the slicing might likely change the result of the recognizer.

Phone distance metric

Thanks for all your work on allosaurus. It's a really great resource!

For comparing the similarity of two phonetic sequences right now I've been using simple jaccard distance, but it would be nice to use a distance metric that would be sensitive to the fact that phonemes are not equally similar. Can you recommend a resource that would allow for this kind of distance metric?

Thank you!

recording best practice to get best result ?

Hello
First, thanks a lot for making your work so easily available.

I'm trying to make a software to help my friends improve their French pronunciation by doing the following things :

  1. put a french sentence to read (for which I have the IPA and a native recording )
  2. let them read aloud this sentence
  3. transcribe their recording to IPA using allosaurus
  4. compare with the expected IPA and point out mistakes

I've started by playing with allosaurus to check if it can correctly transcribe me (a French native) pronouncing some simple words, but it seems to have some trouble doing so (the result is quite approximate). I've added -l fra, which seems to improve the accuracy slightly, but not by much.

Are there any best practices regarding recording to get the best results? Is there some other way I can improve the accuracy for French? (I'm a software engineer with good knowledge of Python but not that much of machine learning.)

Thanks a lot for any pointers you can give me.

Support for custom phoneme symbols beyond IPA

Hi,

Not sure if this is on the roadmap, but it would be super cool to have a way to provide a custom set of symbols to represent phonemes and their mappings to phones. Probably a function/layer to support IPA to custom phoneme set mapping should be sufficient for this requirement.

Does allosaurus handle mixed speech and non-speech data?

Hi, Thanks again for a great program

I tried to run allosaurus on approximately a 15 minute TED talk and got the following error. From the same talk, I extracted a 5 second speech excerpt, and allosaurus seemed to work. Did allosaurus crash because the TED talk starts with about 12 seconds of music? Here's the error message:

python -m allosaurus.run -i ~/datasets/tedlium3-wav/NaliniNadkarni_2009.wav
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/dsitaram/Py3.7venv/allosaurus/lib/python3.7/site-packages/allosaurus/run.py", line 59, in <module>
    phones = recognizer.recognize(args.input, args.lang)
  File "/home/dsitaram/Py3.7venv/allosaurus/lib/python3.7/site-packages/allosaurus/app.py", line 56, in recognize
    tensor_batch_lprobs = self.am(tensor_batch_feat, tensor_batch_feat_len)
  File "/home/dsitaram/Py3.7venv/allosaurus/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dsitaram/Py3.7venv/allosaurus/lib/python3.7/site-packages/allosaurus/am/allosaurus_torch.py", line 88, in forward
    hidden_pack_sequence, _ = self.blstm_layer(pack_sequence)
  File "/home/dsitaram/Py3.7venv/allosaurus/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dsitaram/Py3.7venv/allosaurus/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 580, in forward
    self.num_layers, self.dropout, self.training, self.bidirectional)
RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 'mat1' in call to _th_addmm

Timestamps for phones?

It would be wonderful to optionally be able to retrieve the timestamps for the phonemes. Is that possible?

[edit: I see this suggestion #20 would it be possible to add this option to the code?]

Prior.txt file path

Hi, thank you for the nice work! I would like to know where should I put the prior.txt file for Prior Customization ? Thanks !

Phone duration is always 0.045

No matter what, phone duration is 0.045 that doesn't sound right. Even if I say something like "Ooooooooh yeeeeees"

4.080 0.045 iː
4.320 0.045 tʲ
4.410 0.045 iː

Not able to transcribe simple word what in English

The issue

I am currently trying to use Allosaurus to help a Speech Language Pathologist perform transcriptions, but I am having issues getting the application to recognize the word what, let alone longer WAV files with more complex sentences in them. Attached is the WAV file. The output I get from Allosaurus is:

~/Downloads❯ python -m allosaurus.run -i what.wav --model eng2102 --lang eng

~/Downloads❯ 

I even installed the eng2102 model.

~/Downloads❯ python -m allosaurus.bin.list_model
Available Models
- uni2005 (default)
~/Downloads❯ python -m allosaurus.bin.download_model -m eng2102
downloading model  eng2102
from:  https://github.com/xinjli/allosaurus/releases/download/v1.0/eng2102.tar.gz
to:    /home/filbot/.local/lib/python3.9/site-packages/allosaurus/pretrained
please wait...
~/Downloads❯ python -m allosaurus.bin.list_model               
Available Models
- uni2005 (default)
- eng2102

It was recorded using a Tascam DR-40X using WAV 32bit then transferred over to a Pop!_OS Linux System.

Python Version

~/Downloads❯ python -V
Python 3.9.7

Pop!_OS Version

~/Downloads❯ neofetch
filbot@pop-os
-------------
OS: Pop!_OS 21.10 x86_64
Host: Oryx Pro oryp6
Kernel: 5.15.23-76051523-generic
Uptime: 1 hour, 40 mins
Packages: 2857 (dpkg), 90 (flatpak)
Shell: zsh 5.8
Resolution: 1920x1080
DE: GNOME 40.5
WM: Mutter
WM Theme: Pop
Theme: Pop-dark [GTK2/3]
Icons: Pop [GTK2/3]
Terminal: gnome-terminal
CPU: Intel i7-10875H (16) @ 5.100GHz
GPU: Intel CometLake-H GT2 [UHD Graphics]
Memory: 3052MiB / 31977MiB

what.wav file.
what.wav.zip

The question

I feel like I'm not doing something correctly. Do I need to train allosaurus to listen for English sounds as well? I expect to see something similar to wʌt

Build issue

Hi,

Looks like a great program.

However, I was having trouble building allosaurus. I am on Ubuntu 16.04 and when I do 'pip install allosaurus' I get

Building wheel for llvmlite (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: /usr/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-r4C97N/llvmlite/setup.py'"'"'; file='"'"'/tmp/pip-install-r4C97N/llvmlite/s
etup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_whee
l -d /tmp/pip-wheel-AzDXdO
cwd: /tmp/pip-install-r4C97N/llvmlite/
Complete output (7 lines):
running bdist_wheel
/usr/bin/python /tmp/pip-install-r4C97N/llvmlite/ffi/build.py
File "/tmp/pip-install-r4C97N/llvmlite/ffi/build.py", line 122
raise ValueError(msg.format(_ver_check_skip)) from e
^
SyntaxError: invalid syntax
error: command '/usr/bin/python' failed with exit status 1

ERROR: Failed building wheel for llvmlite

On the web, there were suggestions to use 'python -m pip ...' and also to try installing llvm. I did both, but it didn't help.

Appreciate any help

pip download

The pip download for allosaurus shows that it downloads successfully in the terminal, however allosaurus does not show up as a known module when I import it in my coding environment. What is the fix for this?

Cannot open 32 bit floating audio file

Hi,

It seems like the wave package does not support 32-bit floating encoding. Here is the error message:

Traceback (most recent call last):
  File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/site-packages/allosaurus/run.py", line 71, in <module>
    phones = recognizer.recognize(args.input, args.lang, args.topk, args.emit, args.timestamp)
  File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/site-packages/allosaurus/app.py", line 63, in recognize
    audio = read_audio(filename)
  File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/site-packages/allosaurus/audio.py", line 17, in read_audio
    wf = wave.open(filename)
  File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/wave.py", line 510, in open
    return Wave_read(f)
  File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/wave.py", line 164, in __init__
    self.initfp(f)
  File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/wave.py", line 144, in initfp
    self._read_fmt_chunk(chunk)
  File "/home/jamfly/miniconda2/envs/sb/lib/python3.8/wave.py", line 269, in _read_fmt_chunk
    raise Error('unknown format: %r' % (wFormatTag,))
wave.Error: unknown format: 3

Could we try to use torchaudio instead of the wave to open files?

Thank you
