
tarteel-ml's People

Contributors

aymenq, dependabot[bot], hmzh-khn, karim-53, murtraja, omerasif-itu, piraka9011


tarteel-ml's Issues

Create a train-test-validation split for the recordings by verse.

Split the verses of the Qur'an 60-20-20 by verse. All recordings of a given verse will be in the same set.

Note: There should be two copies of this split. One of them should be by ayah, and the other should be by unique ayah (i.e. identical ayat should be lumped together in one ayah-set).
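A minimal sketch of the verse-level split, assuming a pandas DataFrame of recordings with hypothetical surah_num and ayah_num columns (the real column names may differ; the unique-ayah variant would first map identical ayah texts to a single key):

import pandas as pd

def split_by_verse(df, seed=42):
    """Assign every recording of a verse to the same train/val/test set."""
    verses = df[['surah_num', 'ayah_num']].drop_duplicates()
    shuffled = verses.sample(frac=1, random_state=seed)  # shuffle verses, not recordings
    n = len(shuffled)
    train = shuffled[:int(0.6 * n)]
    val = shuffled[int(0.6 * n):int(0.8 * n)]
    test = shuffled[int(0.8 * n):]
    # Inner-join back so each recording inherits its verse's split.
    return [df.merge(part, on=['surah_num', 'ayah_num']) for part in (train, val, test)]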

Same number of audio files with or without surah [-s] argument

Description:
The same audio data is downloaded with or without the surah [-s] argument.

$ python3 download.py --use-cache --log CRITICAL
Audio Files:   0%|                                                   | 14/20565 [01:06<26:52:56,  4.71s/it]
$ python3 download.py -s 1 --use-cache --log CRITICAL
Audio Files:   0%|                                                               | 0/20565 [00:00<?, ?it/s]

Is it normal to have the same amount of data in both cases?

Please advise. Regards.
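One quick way to narrow this down is to check the cached CSV directly; if the per-surah row count differs from the total, the bug is likely that the progress-bar total is taken from the unfiltered row count. A sketch, with surah_num as a guess at the column name:

import pandas as pd

df = pd.read_csv('.cache/csv/local.csv')
print('total rows:', len(df))
print('surah 1 rows:', (df['surah_num'] == 1).sum())  # 'surah_num' is hypothetical; check the CSV header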

Create a Dockerfile to support the repo on multiple OSes, like Windows

As-salāmu alaykum. I was trying to run this project on Windows, but it gives me errors at every step. First, it could not find the specific versions of the modules pinned in your requirements.txt file, so I manually installed the latest versions instead. That opens the possibility of errors in later steps, and indeed I am now getting an error at the very next step:
python download.py -s 1

Is it possible for you to create a Docker image, so we can run the project on Windows without any hassle? That would likely help a lot of people. Thank you. JazākAllāhu khayrā.
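For reference, a minimal sketch of the kind of Dockerfile this would need, assuming requirements.txt installs cleanly on Linux and download.py as the entry point (dependencies with native extensions may need extra system packages beyond gcc):

FROM python:3.7-slim
WORKDIR /app
COPY requirements.txt .
# Some dependencies build native extensions; gcc covers the common case.
RUN apt-get update && apt-get install -y --no-install-recommends gcc && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENTRYPOINT ["python", "download.py"]

It could then be run on any OS with, e.g., docker build -t tarteel-ml . followed by docker run tarteel-ml -s 1.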

How to generate a language model?

I am planning to train DeepSpeech on the Korean language. Could you provide some guidelines on how I can create a language model?
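Not specific to this repo, but Mozilla DeepSpeech builds its language model with KenLM. A minimal sketch, assuming a corpus.txt with one Korean transcript per line:

lmplz -o 3 < corpus.txt > lm.arpa
build_binary lm.arpa lm.binary

Older DeepSpeech releases additionally need a trie built from the alphabet and the binary LM with the generate_trie tool shipped in the native client.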

Getting "Invalid wave header found"

I cloned the Tarteel-ML repo and started downloading the CSV for Al-Fatihah as described in the wiki, but I got errors.
All I did was:

python3 download.py -s 1
Downloading CSV from https://d2sf46268wowyo.cloudfront.net/datasets/tarteel_v1.0.csv
Done downloading CSV.
Invalid wave header found .audio/s1/a4/1_4_2787081723.wav , removing.
Invalid wave header found .audio/s1/a3/recording_FO94M2V.wav , removing.
Invalid wave header found .audio/s1/a1/1_1_3752224010.wav , removing.
Invalid wave header found .audio/s1/a6/1_6_4035742518.wav , removing.
Invalid wave header found .audio/s1/a2/1_2_4115400297.wav , removing.
Audio file .audio/s1/a5/1_5_4027410949.wav does not have speech according to VAD. Removing.
Audio file .audio/s1/a7/1_7_456658554.wav does not have speech according to VAD. Removing.
Invalid wave header found .audio/s1/a2/1_2_2198883921.wav , removing.
Invalid wave header found .audio/s1/a6/1_6_3964846100.wav , removing.
Audio file .audio/s1/a4/1_4_3355668251.wav does not have speech according to VAD. Removing.
Invalid wave header found .audio/s1/a1/1_1_526190118.wav , removing.
Invalid wave header found .audio/s1/a6/1_6_4081852003.wav , removing.
Invalid wave header found .audio/s1/a6/1_6_1540864270.wav , removing.
Audio file .audio/s1/a1/1_1_1740606045.wav does not have speech according to VAD. Removing.
Invalid wave header found .audio/s1/a3/1_3_1486618514.wav , removing.
Invalid wave header found .audio/s1/a5/1_5_1812842962.wav , removing.
Invalid wave header found .audio/s1/a2/1_2_2615458200.wav , removing.
Invalid wave header found .audio/s1/a3/1_3_2791353353.wav , removing.
Invalid wave header found .audio/s1/a3/1_3_619438520.wav , removing.
Audio file .audio/s1/a4/1_4_3452859238_pbxN174.wav does not have speech according to VAD. Removing.
Invalid wave header found .audio/s1/a7/1_7_3771589469.wav , removing.
Invalid wave header found .audio/s1/a6/1_6_1220528519.wav , removing.
Audio file .audio/s1/a3/1_3_409467092_jRelUpO.wav does not have speech according to VAD. Removing.

Python 3.7.1
Ubuntu 16.04 (Linux)
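Note that these messages mean download.py is discarding corrupt or silent files rather than failing outright. A minimal sketch of the kind of header check involved (not the repo's exact code):

import wave

def has_valid_wave_header(path):
    """True if the file parses as a WAV file with at least one frame."""
    try:
        with wave.open(path, 'rb') as f:
            return f.getnframes() > 0
    except (wave.Error, EOFError):
        return False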

Train a model for gender prediction

Based on conversations with @abdulhaim and others, we have realized that it is important to know the gender of the person reciting each recording in order to protect gender-based privacy during evaluation (see https://github.com/Tarteel-io/tarteel.io/issues/179).

However, only a small fraction of the recordings have a gender associated with them, because providing demographic information is optional. We can potentially overcome this issue by training a gender-identification model to provide tentative gender labels for our recordings.
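A minimal sketch of that approach, with random placeholders standing in for real per-recording features (e.g. mean MFCC vectors) and hypothetical 0/1 labels:

import numpy as np
from sklearn.linear_model import LogisticRegression

X_labeled = np.random.randn(200, 13)        # placeholder 13-dim feature vectors
y_labeled = np.random.randint(0, 2, 200)    # hypothetical gender labels from opt-in metadata
X_unlabeled = np.random.randn(1000, 13)

clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
tentative = clf.predict(X_unlabeled)
confidence = clf.predict_proba(X_unlabeled).max(axis=1)  # keep only high-confidence labels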

Conda requirements are OSX specific

Conda requirements are currently OSX-specific. Either a simpler requirements file should be created by hand (without unnecessarily pinning each individual dependency), or a separate requirements file should be made for Linux systems.
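One possible fix, assuming a reasonably recent conda (>= 4.7.12): export only the explicitly requested packages, which drops the OSX-specific build strings:

conda env export --from-history > environment.yml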

The files for the Fatihah overfitting experiment

As-salāmu ʿalaykum,

I am currently experimenting with a Qur'an tutor for Surat Al-Ikhlas, the same idea as Surat Al-Fatihah but with a different audio set, recorded by well-known reciters.
It seems that I have a problem with the preprocessing step.

Could I get the files that were used for the experiment, so I can compare them with mine?
train_src.txt, train_tgt.txt, val_src.txt, val_tgt.txt

Thank you

`dataset_csv_url`: A ghost argument?

Issue: Exception

In download.py, line 103 throws an exception:

'Namespace' object has no attribute 'dataset_csv_url'

Changing it to csv_url fixes it.

Possible cause:
Argument mismatch?

line 26 : parser.add_argument('--csv-url', type=str, default=TARTEEL_V1_CSV_URL)

line 103 : download_csv_dataset(args.dataset_csv_url, path_to_dataset_csv)
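Since argparse stores --csv-url as args.csv_url, the fix is presumably just:

download_csv_dataset(args.csv_url, path_to_dataset_csv)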

How to Add/Edit Wiki pages?

I want to add documentation on the concept of MFCC coefficients. I also noticed some typos in the existing wiki pages.

How can I add or edit Wiki pages?

Update tutorials

Salām,

I have spent a day trying, unsuccessfully, to run the first ML model.
The problem is that there are different tutorials across the repo:

  • README.md proposes running the following:
download.py: Download the Tarteel dataset
create_train_test_split.py: Create train/test/validation split csv files.
generate_alphabet|vocabulary.py: Generate all unique letters/ayahs in the Quran in a text file.
generate_csv_deepspeech.py: Create a CSV file for training with DeepSpeech.

But I am stuck at generate_csv_deepspeech.py,
and I don't know the purpose of each piece of generated data...

  • The wiki refers to .py files that were deleted long ago:

Navigate into the audio_preprocessing directory and run python generate_features.py

  • The wiki and CONTRIBUTING.md both explain how to set up the repo, which is redundant.

I suggest that

  1. We update README.md with the minimum instructions needed to run the simplest ML model (I would need your help with that, please).
  2. We keep using CONTRIBUTING.md to explain how to set up the repo, and bring the related content over from the wiki, since contributors can easily modify README.md but not the wiki, as explained in #51.

How to pass a live audio stream to the ML model

Hello

My plan is to make a mobile application that corrects the user's recitation of Surat Al-Ikhlas.
I saw your Tarteel application and it is fantastic ❤️. Thank you for your work.

I already built the speech-to-text model with OpenNMT, but I am wondering how you pass a live audio stream from the microphone for recognition. My plan was, after training the model, to set up a REST server and then build a GUI, but I am stuck on the audio streaming.

I appreciate your help.
Thank you
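Not Tarteel's implementation, but a minimal sketch of one common approach: read fixed-size chunks from the microphone with PyAudio and hand each to a recognize() function (hypothetical here) that buffers and feeds your model or REST endpoint:

import pyaudio

CHUNK, RATE = 1024, 16000
pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)
try:
    while True:
        data = stream.read(CHUNK)  # raw 16-bit PCM bytes
        recognize(data)            # hypothetical: send to your model / REST server
finally:
    stream.stop_stream()
    stream.close()
    pa.terminate()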

Missing audio_preprocessing directory

Where can I find the audio_preprocessing directory?
Following the wiki steps, after step 2 ("Navigate into the audio_preprocessing directory and run python generate_features.py -f mfcc -s 1 --local_download_dir "../.audio" --output_dir "../.outputs" to generate the MFCC coefficients"), I was unable to find that audio_preprocessing directory.

Downloading Surah Al-Fatihah alone took a long time

I was trying to download and preprocess Al-Fatihah. Here are my commands:

git clone https://github.com/Tarteel-io/Tarteel-ML.git
cd Tarteel-ML/
git cherry-pick 624c46b
conda env create -f environment.yml
conda activate tarteel-ml
python download.py -s 1

I applied this commit to fix the invalid wave header issue and make the download work. However, it took a long time for one short surah! Is this normal?


Getting "IndexError: too many indices for array"

I made an audio dataset with 10 Qur'an readers per ayah, so I have 10 audio files for each ayah, in WAV format at a 32000 Hz sample rate. I also passed all the audio files through the audio-checking function in your download script, so I now have 61382 audio files (10 per ayah).
I then tried to run the Sequence-to-Sequence Model in Keras script, where I prepared everything the same way you built your system:

  • Data/one-hot.pkl
  • .outputs/mfcc

Your script uses:

def build_dataset(local_coefs_dir='../.outputs/mfcc', surahs=[1], n=100):

I changed it to:

def build_dataset(local_coefs_dir='../.outputs/mfcc', surahs=[2], n=100):

But I get this error:

"IndexError: too many indices for array"

while executing the function convert_list_of_arrays_to_padded_array, at this line:

padded_array[a, :r, :c] = arr

These are the values stored in memory:

shape (1361, 13)
max_shape [13459]
padded_array (100, 13459)
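The reported values suggest the cause: max_shape has a single entry, so padded_array is allocated 2-D as (100, 13459), and the three-index assignment padded_array[a, :r, :c] then has too many indices. A sketch of a padding helper that allocates 3-D (not the repo's exact code):

import numpy as np

def pad_arrays(arrays, n=100):
    """Zero-pad a list of 2-D (frames, coeffs) arrays into one 3-D batch array."""
    max_rows = max(a.shape[0] for a in arrays)
    max_cols = max(a.shape[1] for a in arrays)
    padded = np.zeros((n, max_rows, max_cols))
    for i, arr in enumerate(arrays[:n]):
        r, c = arr.shape
        padded[i, :r, :c] = arr
    return padded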

Logging exception due to invalid arguments

Currently this exception is thrown when downloading files: python3 download.py

--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.6/logging/__init__.py", line 994, in emit
    msg = self.format(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 840, in format
    return fmt.format(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 577, in format
    record.message = record.getMessage()
  File "/usr/lib/python3.6/logging/__init__.py", line 338, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "download.py", line 103, in <module>
    download_csv_dataset(args.csv_url, path_to_dataset_csv)
  File "download.py", line 50, in download_csv_dataset
    logging.info("Downloading CSV from ", csv_url, " to ", dataset_csv_path, ".")
Message: 'Downloading CSV from '
Arguments: ('https://tarteel-frontend-static.s3-us-west-2.amazonaws.com/datasets/tarteel_v1.0.csv', ' to ', '.cache/csv/local.csv', '.')

It is due to this line:
logging.info("Downloading CSV from ", csv_url, " to ", dataset_csv_path, ".")
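The stdlib logger applies %-style formatting to its extra arguments, so the call presumably needs to be:

logging.info("Downloading CSV from %s to %s.", csv_url, dataset_csv_path)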

Error when trying the "Fatihah overfitting experiment" with Surat Al-Ikhlas

Hello team Tarteel, I would like to thank you for your hard work.

I am currently experimenting with a Qur'an tutor for Surat Al-Ikhlas, the same idea as Surat Al-Fatihah but with a different audio set, recorded by well-known reciters.

I prepared all the files for training, but I face a problem in the training phase.

I run this command:

!python /content/OpenNMT-py/train.py -model_type audio -enc_rnn_size 512 -dec_rnn_size 512 -audio_enc_pooling 1,2 -dropout 0 -enc_layers 2 -dec_layers 1 -rnn_type LSTM -data /content/OpenNMT-py/data/speech/demo -save_model demo-model -global_attention mlp -gpu_ranks 0 -batch_size 8 -optim adam -max_grad_norm 100 -learning_rate 0.0003 -learning_rate_decay 0.8 -train_steps 2000

The error is:

[2020-03-04 21:03:57,891 INFO]  * tgt vocab size = 15
[2020-03-04 21:03:57,892 INFO] Building model...
[2020-03-04 21:04:02,067 INFO] NMTModel(
  (encoder): AudioEncoder(
    (W): Linear(in_features=512, out_features=512, bias=False)
    (batchnorm_0): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (rnn_0): LSTM(161, 512)
    (pool_0): MaxPool1d(kernel_size=1, stride=1, padding=0, dilation=1, ceil_mode=False)
    (rnn_1): LSTM(512, 512)
    (pool_1): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (batchnorm_1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(15, 500, padding_idx=1)
        )
      )
    )
    (dropout): Dropout(p=0.0, inplace=False)
    (rnn): StackedLSTM(
      (dropout): Dropout(p=0.0, inplace=False)
      (layers): ModuleList(
        (0): LSTMCell(1012, 512)
      )
    )
    (attn): GlobalAttention(
      (linear_context): Linear(in_features=512, out_features=512, bias=False)
      (linear_query): Linear(in_features=512, out_features=512, bias=True)
      (v): Linear(in_features=512, out_features=1, bias=False)
      (linear_out): Linear(in_features=1024, out_features=512, bias=True)
    )
  )
  (generator): Sequential(
    (0): Linear(in_features=512, out_features=15, bias=True)
    (1): Cast()
    (2): LogSoftmax()
  )
)
[2020-03-04 21:04:02,067 INFO] encoder: 3747840
[2020-03-04 21:04:02,067 INFO] decoder: 4190555
[2020-03-04 21:04:02,067 INFO] * number of parameters: 7938395
[2020-03-04 21:04:02,068 INFO] Starting training on GPU: [0]
[2020-03-04 21:04:02,068 INFO] Start training loop and validate every 10000 steps...
[2020-03-04 21:04:02,069 INFO] Loading dataset from /content/OpenNMT-py/data/speech/demo.train.0.pt
[2020-03-04 21:04:02,070 INFO] number of examples: 15
Traceback (most recent call last):
  File "/content/OpenNMT-py/train.py", line 6, in <module>
    main()
  File "/content/OpenNMT-py/onmt/bin/train.py", line 204, in main
    train(opt)
  File "/content/OpenNMT-py/onmt/bin/train.py", line 88, in train
    single_main(opt, 0)
  File "/content/OpenNMT-py/onmt/train_single.py", line 143, in main
    valid_steps=opt.valid_steps)
  File "/content/OpenNMT-py/onmt/trainer.py", line 244, in train
    report_stats)
  File "/content/OpenNMT-py/onmt/trainer.py", line 365, in _gradient_accumulation
    with_align=self.with_align)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/OpenNMT-py/onmt/models/model.py", line 45, in forward
    enc_state, memory_bank, lengths = self.encoder(src, lengths)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/OpenNMT-py/onmt/encoders/audio_encoder.py", line 119, in forward
    memory_bank = pool(memory_bank)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/pooling.py", line 76, in forward
    self.return_indices)
  File "/usr/local/lib/python3.6/dist-packages/torch/_jit_internal.py", line 181, in fn
    return if_false(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 457, in _max_pool1d
    input, kernel_size, stride, padding, dilation, ceil_mode)
RuntimeError: Given input size: (7x1x1). Calculated output size: (7x1x0). Output size is too small

I know that the problem is in the pooling size, but I don't know how to fix it.
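The traceback gives a hint: with -audio_enc_pooling 1,2, the second encoder layer applies MaxPool1d with kernel size 2, and an input whose time dimension has already shrunk to 1 then pools to length 0. One possible workaround is to disable pooling in that layer by passing

-audio_enc_pooling 1,1

while keeping the rest of the command unchanged, or to filter out extremely short clips before training.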

pip or conda?

Hi,

I am confused:

  • CONTRIBUTING.md asks me to install dependencies using conda and environment.yml (updated in 2019), which pins numpy=1.15.4

  • On the other hand, if I follow the tutorial in the README.md file, I will use pip to install requirements.txt (updated in 2020) and thus numpy==1.18.2

Please let me know which I should use.
In my experience, pip is preferable, as it is more stable and more up to date.

And maybe my first contribution would be to fix that :)

Complete MFCC and Filter Bank script.

MFCC and mel-frequency filter banks are two of the most common features to pass into deep neural networks. Complete a script that calculates these values and verify that it works using MATLAB (or another piece of software).
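A minimal sketch of such a script using python_speech_features (one common choice; librosa would also work), whose output could then be compared against MATLAB's:

import scipy.io.wavfile as wav
from python_speech_features import mfcc, logfbank

rate, signal = wav.read("recording.wav")  # hypothetical input file
mfcc_feat = mfcc(signal, samplerate=rate, numcep=13)      # (num_frames, 13) MFCCs
fbank_feat = logfbank(signal, samplerate=rate, nfilt=26)  # (num_frames, 26) log filter banks
print(mfcc_feat.shape, fbank_feat.shape)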
