
precise-wakeword-model-maker's Issues

Move more of the configuration parameters to JSON files

There are already many parameters in data_prep_user_configuration.json and data_prep_system_configuration.json; however, there are still plenty of hard-coded parameters floating around in the data prep scripts. It would probably be a good idea to move them into the JSON files and document them in the README.
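A hedged sketch of what that consolidation might look like (the helper and the example key are hypothetical; only the two JSON file names come from the project):

```python
# hypothetical helper: one place that merges the two JSON config files,
# so hard-coded values can migrate into them over time
import json

def load_config():
    with open("data_prep_system_configuration.json") as f:
        config = json.load(f)
    with open("data_prep_user_configuration.json") as f:
        config.update(json.load(f))  # user settings override system defaults
    # a value that today is hard coded in a script could then become:
    # bitrate = config.get("mp3_bitrate", "64k")   # "mp3_bitrate" is invented here
    return config
```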

project structure improvements

I plan on working on this once I have a full successful run-through, though I think documenting the extent and rationale of the changes would be valuable, to make sure they are in line with both the end goal of the project and the expectations of current users.

directory changes and rationale:

  1. move the .py contents of the project dir to src/, src/data_prep, or src/model_maker
    • the reason for this one (along with some of the filename changes) is easier navigation of the source code and identification of the relevant components of the application when working on a specific problem. It also cleans up the top-level directory, so that the only things in it relate to the project as a whole (the README, the setup scripts, requirements.txt, etc.).
  2. create a tmp directory for all by-products of the process
    • this is mainly beneficial for navigation and for clearing application state (rm -rf project/tmp/*). It also makes it easy for the program to clean up after itself or, if the system tmp is used, for the system to wipe those files on reboot.
  3. create an out dir for end-result files/dirs
    • mainly to signal to the user that this is where the finished model lives and that these are the desired end products of the run (don't nuke it along with everything else)
  4. create a test directory for all testing functionality
    • because this follows standard practice, there may be tools for auto-generating some tests. Also, mirroring names like src/whatever/file_name_test.py makes it easier to see what is currently being tested and what isn't.
  5. potentially create a config directory for the JSON files
    • if there is any configuration that is semi-permanent (for example, settings you always want, not just per wake word or per run), this is generally how it is handled. Compared to the above changes, however, this isn't too critical.

Together, these changes also make it easier to write .gitignore rules that don't involve listing a bunch of file names that may change at some point.
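As a minimal sketch of the payoff, assuming the tmp/ and out/ directories proposed above (all names illustrative):

```python
# derive the shared locations once instead of hard coding strings everywhere
from pathlib import Path

PROJECT_ROOT = Path(__file__).resolve().parents[1]  # e.g. src/ -> project root
TMP_DIR = PROJECT_ROOT / "tmp"   # safe to wipe: rm -rf tmp/*
OUT_DIR = PROJECT_ROOT / "out"   # finished models only; never auto-deleted

for d in (TMP_DIR, OUT_DIR):
    d.mkdir(exist_ok=True)
```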

file changes, from easiest to hardest

  1. change wake_word_data_prep_ide.py to src/whatever/__main__.py
    • this basically signals to developers that it is the entry point of the program, and makes it easier to identify where to start when trying to follow the code's execution flow.
  2. change wake_word_data_prep_classes.py to utility.py or multiple something_utils.py files
    • signals that this is for things separate from the core logic of the program, primarily the plumbing. Something like dir_util or fs_util would be a great way to signal that this is where to look if something goes wrong with the creation of directories.
  3. add docstrings and type hints to all functions related to domain logic (see the first sketch after this list)
    • this is useful for anyone using an IDE or a beefy editor like VS Code, because the purpose of a function shows up on hover. That makes it easier for people to debug non-logic issues (such as a wrong parameter type) in the NLP functionality without having to understand exactly what the function does. It also makes it easier for us to identify logic issues (the user called the function correctly but the end result is incorrect), and to spot where something is incorrectly produced when one function chains several functions together.
  4. move from print to logging (see the second sketch after this list)
    • there are tools for making refactoring changes such as this somewhat automatically, and it has multiple benefits, such as filtering output and identifying issues whose root cause is too far up for the terminal to scroll to. However, this will still be more involved than the above changes, so we will probably save it for later.
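For item 3, a sketch of the docstring-plus-type-hints style being proposed (the function itself is hypothetical, not something in the codebase):

```python
def trim_silence(wav_path: str, threshold_db: float = -40.0) -> str:
    """Trim leading and trailing silence from a wav file.

    Args:
        wav_path: path to the input wav file.
        threshold_db: levels below this are treated as silence.

    Returns:
        Path to the trimmed copy of the file.
    """
    ...  # body omitted; the signature and docstring are the point
```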

The majority of these are fairly low-hanging fruit that make it easier for people to get up and running, both in using the code and in contributing back to the project.
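For item 4, a minimal print-to-logging sketch (the logger name and messages are made up):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("data_prep")

count = 12  # stand-in value
# before: print(f"converted {count} files")
log.info("converted %d files", count)  # filterable by level, easy to redirect to a file
log.debug("per-file detail that would otherwise flood the terminal")
```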

Testing script(s)

The data prep flows perform a lot of steps, and it can be very complicated to find bugs. It would be great to create testing scripts to make sure all of the required functionality works. One problem with that is the data requirements.

I would like to avoid putting a bunch of wav files in a directory just for testing, but I don't see a simpler way to test each step.
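One possible way around committing audio, sketched here with pytest: synthesize short throwaway wav files on the fly (the fixture name and tone parameters are invented):

```python
import math
import struct
import wave

import pytest

@pytest.fixture
def wav_dir(tmp_path):
    """Create a few one-second 440 Hz wavs in a pytest-managed temp dir."""
    for i in range(3):
        with wave.open(str(tmp_path / f"sample_{i}.wav"), "wb") as w:
            w.setnchannels(1)
            w.setsampwidth(2)       # 16-bit samples
            w.setframerate(16000)
            w.writeframes(b"".join(
                struct.pack("<h", int(32767 * math.sin(2 * math.pi * 440 * t / 16000)))
                for t in range(16000)
            ))
    return tmp_path
```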

The big cleaning

The scripts here really grew out of the needs of the models, written by someone who is not a professional software developer. The goal of data prep, as well as of the whole wake word project in this phase, is to create prototypes of data recipes that help developers build tools empowering their users to create their own production-quality wake words.

It would be great once the dust has settled a bit to go through the code and clean it up as much as possible to make it more understandable to developers.

Disable TensorFlow warnings

The output when running the flows can be a lot, and that can lead to people overlooking important information because they are totally spammed with warnings. It would be great to find a way to disable all of these warnings and clean up the output.
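For what it's worth, these are the usual knobs (exact behavior varies across TensorFlow versions):

```python
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"  # must be set before importing tensorflow

import tensorflow as tf
tf.get_logger().setLevel("ERROR")  # silence Python-side WARNING messages
```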

add documentation

Of the project structure improvements, this is the most important thing left. We need to add docstrings to the code (and type hints, but that is another issue), both at the function level and the file level; this mainly serves to document the inputs, outputs, and purpose of each function. A good heuristic is that the more domain knowledge is required to implement a function, the more necessary it is to document it, which lowers how much contributors have to learn in order to fix or implement something.

Given that the module is now split into multiple files whose functionality is mostly self-contained, we should also add docstrings at the top of each file to document what that particular file is responsible for, and/or otherwise mention this in some form of documentation aimed specifically at contributors and developers.
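A sketch of what a file-level docstring could look like (the file name and responsibilities are hypothetical):

```python
"""audio_conversion.py -- wav/mp3 conversion helpers for the data prep flows.

Responsible for: converting and resampling audio between the formats the
flows need. Not responsible for: any model or training logic.
"""
```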

TTS generator feature

Integrate TTS Wakeword Generator as an optional step '0'.
Feature branch

As a user, I want an option to generate TTS wakeword data, so that I can train a model with either exclusively TTS generated data or TTS generated data and data from a wakeword collection.

  • Testing TTS Wakeword Generator
  • Move all TTS Wakeword Generator scripts into TTS_generator/
  • Move all config files into config/ and update the code paths
  • Make a separate requirements file, TTS_generation_requirements.txt
  • Refactor TTS Wakeword Generator
  • Create TTS Wakeword 'flow' script to walk user through TTS generation
  • Add menu item in cli.py

Refactor dialog

There are a lot of print statements (and f-string print statements) in the CLI. It has been suggested to pull them out and handle them via a dialog JSON file. And it's always fun to handle dialog with slotting (i.e. NLG).

  • Write a class to handle dialog
  • Move dialog from the flows into a JSON file
  • Test
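A minimal sketch of such a class, assuming a dialog JSON like {"greet": "Hello {name}, ready to record '{wake_word}'?"} (all names invented):

```python
import json

class Dialog:
    def __init__(self, path="dialog.json"):
        with open(path) as f:
            self.templates = json.load(f)

    def get(self, key, **slots):
        # slotting: fill the named placeholders in the template
        return self.templates[key].format(**slots)

# dialog = Dialog()
# print(dialog.get("greet", name="Alice", wake_word="hey computer"))
```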

add tmp or cache for converted files

This is a bit different from my original plan for tmp. When I was testing changes, the one operation that consistently took the longest amount of time was converting the wavs I had to mp3. We could store the results in a tmp directory, which is wiped between reboots, in order to avoid converting them each time the project is run. If someone is testing or making multiple models using the same extra training data, then successive runs will be much faster, without cluttering their filesystem with multiple copies of data that have to be manually deleted.
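A hedged sketch of the cache check (the cache location and the convert callable are placeholders):

```python
import hashlib
import tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.gettempdir()) / "precise_wav_cache"  # wiped on reboot on most systems
CACHE_DIR.mkdir(exist_ok=True)

def cached_conversion(src: Path, convert) -> Path:
    # key on the content hash so renamed-but-identical files still hit the cache
    digest = hashlib.sha256(src.read_bytes()).hexdigest()[:16]
    target = CACHE_DIR / f"{digest}.mp3"
    if not target.exists():
        convert(src, target)  # `convert` is whatever currently does the wav -> mp3 work
    return target
```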

what do you think of avoiding model directories altogether?

I sort of referenced this idea in #15, but I figured I'd make it a separate issue, since I'm not sure it fits with the intended use of the model directories in your workflow as an NLP guy.

I think it would be possible to avoid copying files altogether, and only create a directory for the finalized results. This should speed up the process of creating and testing models against each other, because they would all use references to the original files in the data directories, avoiding reads and writes to the filesystem.
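As a toy sketch, the model record would hold paths rather than copies (the names and data dir are invented):

```python
from pathlib import Path

model = {
    "name": "hey_computer_v1",   # hypothetical model name
    "train_files": [],           # references into the original data dirs; nothing copied
    "test_files": [],
}
model["train_files"].extend(sorted(Path("wakeword_data").glob("*.wav")))
```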

add tests for the data_prep stuff

This is more of a down-the-road issue, but it is also a good way for someone to familiarize themselves with the codebase. The ideal is that, once the project is further along, you shouldn't need to do a full rerun in order to verify that changes to a function or module don't break existing code or change the expected behavior of anything built on top of the project.

refactor to use a model class

This is something I was planning to do once my schedule relented, but I may as well document the planned changes and rationale, in case anyone else would like to get to this before I do.

When I originally refactored the project to use an out directory, I intentionally prepended "out/" to a lot of directory calls, as I was trying to get a feel for which parts of the codebase were producing output to the filesystem, but it was only meant as a stopgap measure. The plan is to create a model class to be used in place of the dictionary currently used to represent models.

The goal of the model class is to sit as a layer of abstraction between the file handling and model data (such as accuracy) on one side and the code handling model logic on the other. Take, for example, the function split_incremental_results: instead of specifying the paths and copying the directories to the training and test directories, you would do something like:

self.models[model].add_test_data(list_of_files)

The goal of this isn't abstraction for the sake of abstraction, but making the pathing mostly derived rather than hard-coded. Since all the models have the same directory layout, you could potentially feed in only the output directory and the model name at instantiation and have the pathing handled from there (with getters for file paths). It also lets you conveniently access model properties: comparators can be implemented in Python using dunder methods, so getting the best model could be as simple as best_model = max(self.models).

This also makes it easy to change things so that files aren't moved or copied at all, only stored as a list internally inside the model.
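A rough sketch of the class (Python 3.9+ for the list[Path] annotations; attribute and method names are illustrative, not final):

```python
from pathlib import Path

class Model:
    def __init__(self, out_dir: Path, name: str):
        self.name = name
        self.dir = out_dir / name   # pathing derived once, not hard coded everywhere
        self.accuracy = 0.0         # model data lives alongside the paths
        self.test_files: list[Path] = []

    @property
    def weights_path(self) -> Path:
        # a getter instead of "out/" + string concatenation scattered around
        return self.dir / f"{self.name}.net"

    def add_test_data(self, files: list[Path]) -> None:
        # store references only; nothing is moved or copied on the filesystem
        self.test_files.extend(files)

    def __lt__(self, other: "Model") -> bool:
        # dunder comparator, so best_model = max(models.values()) just works
        return self.accuracy < other.accuracy
```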

command for training the precise model

Hi, thank you for your great work.
I think your good work needs a tutorial.
After making the data, what is the command for training the Precise model?
I really appreciate your help.

Experiment: Improve batch sizing?

The batch size should ideally scale with the number of files, at least at the beginning of training the first models.

For example, in the first flow the number of files will be very small when training the experiment models, as this is before any noise generation or incremental training has been performed. Reducing the batch sizes here is better.

However, once the noise files have been generated in flow 2 and before the first training occurs, the batch size can be slightly increased to account for all of the additional files.

Perhaps it is better to simply set the batch size slightly lower to accommodate the small number of files at the beginning, keep it statically at that same batch size, and scale the number of epochs instead.

Experiment
Perform several runs with scaled batch sizes and with a fixed lower batch size and increased epochs, to determine which method yields better models.
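If the scaled approach wins, the rule could be as simple as this sketch (the ratio and bounds are placeholders to be tuned):

```python
def scaled_batch_size(n_files: int, lo: int = 8, hi: int = 64) -> int:
    # batch size grows with the dataset (about n_files / 10), clamped to a sane range
    return max(lo, min(hi, n_files // 10))

# e.g. 40 files -> 8, 500 files -> 50, 5000 files -> 64
```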

Implement and test new analytics feature

The basic code for this is in model_analytics.py.

Issue
The analytics currently reported on the model are taken from the last epoch of training; however, this is not always the chosen epoch (e.g. when using precise-train with -sb, the save-best-model parameter).

In addition, other metrics should be recorded, such as:

  • the number of wake word and not-wake-word files used for training
  • the two types of 'best' model (the epoch with the lowest loss, and the epoch with the lowest val_loss)

The selection of the model should depend on whether loss or val_loss is being used to select the best epoch.

  • refactor model_analytics.py into a proper class used in PreciseModelingOperations
  • add a report on the number of files for wake and not-wake
  • test model selection over several runs
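A sketch of picking the reporting epoch consistently with -sb, assuming a Keras-style history dict of per-epoch metric lists:

```python
def best_epoch(history: dict, metric: str = "val_loss") -> int:
    # index of the epoch with the lowest loss or val_loss
    values = history[metric]
    return min(range(len(values)), key=values.__getitem__)

# history = {"loss": [0.9, 0.4, 0.5], "val_loss": [1.0, 0.6, 0.7]}
# best_epoch(history)          -> 1  (lowest val_loss)
# best_epoch(history, "loss")  -> 1  (lowest loss)
```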
