harvard-edge / multilingual_kws
Few-shot Keyword Spotting in Any Language and Multilingual Spoken Word Corpus
24 of the English keyword directories in the 16 kHz split contain no wav files:

import os
from pathlib import Path

import tqdm

uhohs = []
mswc_16khz = Path("/media/mark/hyperion/mswc/16khz_wav/en/clips")
keywords = list(sorted(os.listdir(mswc_16khz)))
print(len(keywords))
for keyword in tqdm.tqdm(keywords):
    keyword_samples = list(sorted((mswc_16khz / keyword).glob("*.wav")))
    if len(keyword_samples) == 0:  # no samples at all for this keyword
        uhohs.append(keyword)
print(len(uhohs))
>>> 24
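Until the archives are fixed, a possible downstream workaround (a sketch; silently skipping empty keyword directories is my assumption, not a maintainer recommendation):

# build the keyword list while skipping directories with no wav files
keywords = [
    k for k in sorted(os.listdir(mswc_16khz))
    if any((mswc_16khz / k).glob("*.wav"))
]
print(len(keywords))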
Thanks!
First of all, thanks a lot for this work! It is incredibly useful for training strong models for keyword spotting.
I would like to train with the same data as mentioned in the paper: where can I download this data, or where can I download a list of the files used for training/validation/testing, etc. (see the image below for the dataset I am looking for)?
For example, the newest version of the dataset on mlcommons.org now has more than 340k keywords.
So is there such an overview somewhere? I couldn't find it on the mlcommons website or in this repo, but maybe I missed it somewhere.
Impacts certain languages more heavily than others (French, Kinyarwanda, ...)
Add a version.txt containing version 1.0 for all .tar.gz files.
Similar to Common Voice or Speech Commands.
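A minimal sketch of one way to do this when building the archives (the helper name is hypothetical, and writing version.txt at the archive root is an assumption):

import io
import tarfile
import time

def build_archive_with_version(out_path, src_dir, version="1.0"):
    # create a .tar.gz of src_dir that also carries a version.txt at its root
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(src_dir, arcname=".")
        payload = f"{version}\n".encode()
        info = tarfile.TarInfo(name="version.txt")
        info.size = len(payload)
        info.mtime = int(time.time())
        tar.addfile(info, io.BytesIO(payload))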
Given two transcripts, 1. [hello is a common greeting] and 2. [she said, “hello”], without punctuation filtering we would treat [hello] and [“hello”] as separate words.
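For illustration, a minimal sketch of such filtering (the regex and function name are my own, not the repo's clean_and_filter):

import re

def strip_punctuation(word):
    # drop leading/trailing punctuation so [hello] and [“hello”] count as one word
    return re.sub(r"^\W+|\W+$", "", word)

words = 'she said, “hello”'.split()
print([strip_punctuation(w) for w in words])  # ['she', 'said', 'hello']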
Really great job on kicking off the wordcount feature Tejas! Excited to see you making progress so fast. Some suggestions on next steps:
- count words per .tsv file (after they have been normalized with clean_and_filter). Let me know if you have questions about this.
- use a __main__ guard (link)
- you don't need sys.argv since you're using argparse
- format the code with black (https://github.com/psf/black)
- capitalized names are fine for .ipynb files, but I think python files should be lowercased (eventually we will move several of these functions into a library)

Again, great job!! Let me know if you have any questions or if these suggestions don't make sense.
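For illustration, a minimal sketch of the __main__-guard-plus-argparse pattern these suggestions point at (the flag names and counting logic are hypothetical, not the actual wordcount script):

import argparse

def count_words(tsv_path):
    # hypothetical stand-in for the real wordcount logic
    counts = {}
    with open(tsv_path, encoding="utf-8") as f:
        for line in f:
            for word in line.split("\t")[-1].split():
                counts[word] = counts.get(word, 0) + 1
    return counts

def main():
    # argparse reads sys.argv itself, so no manual sys.argv handling is needed
    parser = argparse.ArgumentParser(description="count words in a transcript TSV")
    parser.add_argument("tsv", help="path to the .tsv transcript file")
    args = parser.parse_args()
    print(len(count_words(args.tsv)))

if __name__ == "__main__":
    main()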
Most of our current alignments are for Common Voice 3/4, so re-running the alignments should create a lot more data.
Low priority as of now.
Hi, I get "ERROR: Cannot find key: --keyword" when I run:
docker run --gpus all -p 8080:8080 --rm -u $(id -u):$(id -g) -it \
-v $(pwd):/demo_data \
mkws \
--keyword mask \
--modelpath /demo_data/xfer_epochs_4_bs_64_nbs_2_val_acc_1.00_target_mask \
--groundtruth /demo_data/mask_groundtruth_labels.txt \
--wav /demo_data/mask_stream.wav \
--transcript /demo_data/mask_full_transcript.json \
--visualizer
Also, is there an example of the required files (groundtruth, transcript)?
Thanks!
In German, 'null' (zero) is being converted to NaN by pandas when it is the only word present in the transcript (due to single-word-target-segments data).
One option is to use na_filter=False when reading Common Voice TSVs (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html); however, we should also first check for truly missing values in the sentence transcription column.
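A minimal sketch of that option (the file name and the 'sentence' column follow Common Voice conventions; the empty-string check is one way to find truly missing values):

import pandas as pd

# na_filter=False keeps the literal string "null" instead of parsing it as NaN
df = pd.read_csv("validated.tsv", sep="\t", na_filter=False)

# with NaN parsing disabled, truly missing transcripts show up as empty strings
missing = df[df["sentence"].str.strip() == ""]
print(f"{len(missing)} rows have an empty sentence column")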
When running the intro tutorial notebook in the docker container for tensorflow/tensorflow:latest-gpu-jupyter, the umap library can't be installed because numba only works on numpy <= 1.20.
Installing umap in colab currently works but this might cause issues soon. We might want to move the umap visualization to a separate notebook.
Hi!
Very interesting work! I would like to know if it is possible to test this using the microphone stream as input.
both for mp3s and wav files
Hi!
After creating a conda environment using the provided environment.yml file, followed by additionally installing TensorFlow 2.9.0 as mentioned in the Dockerfile, I tried to run the Jupyter notebook's cells (put together in a main.py file). When calling the transfer_learning.transfer_learn function, I observed the following error:
File "main.py", line 152, in <module>
main()
File "main.py", line 104, in main
_, model, _ = transfer_learning.transfer_learn(
File "/path/to/cioflanc/miniconda3/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 302, in wrapper
return func(*args, **kwargs)
File "/path/to/cioflanc/few_shot_kws/multilingual_kws/multilingual_kws/embedding/transfer_learning.py", line 76, in transfer_learn
init_train_ds = audio_dataset.init_single_target(
File "/path/to/cioflanc/few_shot_kws/multilingual_kws/multilingual_kws/embedding/input_data.py", line 467, in init_single_target
waveform_ds = waveform_ds.map(self.augment, num_parallel_calls=AUTOTUNE)
File "/path/to/cioflanc/miniconda3/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1697, in map
return ParallelMapDataset(
File "/path/to/cioflanc/miniconda3/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 4080, in __init__
self._map_func = StructuredFunctionWrapper(
File "/path/to/cioflanc/miniconda3/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 3371, in __init__
self._function = wrapper_fn.get_concrete_function()
File "/path/to/cioflanc/miniconda3/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2938, in get_concrete_function
graph_function = self._get_concrete_function_garbage_collected(
File "/path/to/cioflanc/miniconda3/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2906, in _get_concrete_function_garbage_collected
graph_function, args, kwargs = self._maybe_define_function(args, kwargs)
File "/path/to/cioflanc/miniconda3/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3213, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/path/to/cioflanc/miniconda3/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3065, in _create_graph_function
func_graph_module.func_graph_from_py_func(
File "/path/to/cioflanc/miniconda3/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 986, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/path/to/cioflanc/miniconda3/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 3364, in wrapper_fn
ret = _wrapper_helper(*args)
File "/path/to/cioflanc/miniconda3/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 3299, in _wrapper_helper
ret = autograph.tf_convert(func, ag_ctx)(*nested_args)
File "/path/to/cioflanc/miniconda3/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 302, in wrapper
return func(*args, **kwargs)
File "/path/to/cioflanc/few_shot_kws/multilingual_kws/multilingual_kws/embedding/input_data.py", line 290, in augment
self.random_timeshift(audio) if self.max_time_shift_samples > 0 else audio
File "/path/to/cioflanc/few_shot_kws/multilingual_kws/multilingual_kws/embedding/input_data.py", line 261, in random_timeshift
if time_shift_amount > 0:
File "/path/to/cioflanc/miniconda3/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 877, in __bool__
self._disallow_bool_casting()
File "/path/to/cioflanc/miniconda3/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 483, in _disallow_bool_casting
self._disallow_when_autograph_disabled(
File "/path/to/cioflanc/miniconda3/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 467, in _disallow_when_autograph_disabled
raise errors.OperatorNotAllowedInGraphError(
tensorflow.python.framework.errors_impl.OperatorNotAllowedInGraphError: using a `tf.Tensor` as a Python `bool` is not allowed: AutoGraph is disabled in this function. Try decorating it directly with @tf.function.
Have you noticed this behaviour before? Do you have any suggestions?
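In case it helps: the trace shows a plain Python if on a tf.Tensor inside random_timeshift while AutoGraph is disabled. A common fix is to keep the branch in-graph with tf.cond; below is a minimal sketch (the function body is my reconstruction, not the repo's actual implementation):

import tensorflow as tf

def random_timeshift(audio, max_shift):
    # the shift is a tensor, so `if time_shift_amount > 0:` fails in graph mode;
    # tf.cond keeps both branches inside the graph instead of casting to bool
    shift = tf.random.uniform([], -max_shift, max_shift, dtype=tf.int32)
    return tf.cond(
        shift > 0,
        # shift right: drop the tail, zero-pad the front
        lambda: tf.concat([tf.zeros([shift], audio.dtype), audio[:-shift]], axis=0),
        # shift left (or no-op): drop the head, zero-pad the tail
        lambda: tf.concat([audio[-shift:], tf.zeros([-shift], audio.dtype)], axis=0),
    )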
I'm unable to train a working monolingual embedding model. Using the provided script (train_monolingual_embedding.py) with the top 165 English words yields the following results at the end of training:
loss: 0.7145 - accuracy: 0.7711 - val_loss: 7.6774 - val_accuracy: 0.0586
Based on the paper, I was expecting validation accuracy in the 70s. Is it dependent on choosing the "right" words?
Could you please post a tutorial or maybe some of the missing files (e.g. train_files.txt, val_files.txt, test_files.txt, commands.txt) for reproducing the embedding?
I also notice that the file references seem to be to Common Voice rather than MSWC. I'm using the English clips download from MSWC, which I'm assuming are the same. I've converted these to 16 kHz, 16-bit wav files using pydub, which I guess is ffmpeg under the hood.
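For reference, a conversion along those lines with pydub looks like the sketch below (the paths are placeholders, and the source format depends on the MSWC release):

from pathlib import Path
from pydub import AudioSegment

src = Path("mswc/en/clips/hello/example_clip.mp3")  # placeholder path
out = src.with_suffix(".wav")

audio = AudioSegment.from_file(src)  # pydub shells out to ffmpeg
audio = audio.set_frame_rate(16000)  # 16 kHz
audio = audio.set_sample_width(2)    # 16-bit samples
audio = audio.set_channels(1)        # mono
audio.export(out, format="wav")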
Is multilingual_context_73_0.8011 trained on the full MSWC dataset, or just on English and Spanish?
If so, potentially remove it for the challenge specification.
Hello,
I've read the article about the script, and it seems great! But I'm facing problems using it for the first time: what files are needed to build the embedding model? How can I input the keywords to search for?
Any help would be welcome.