secretsauceai / secret_sauce_ai
Secret Sauce AI: a coordinated community of tech-minded AI enthusiasts
License: Apache License 2.0
Complete the prototype deliverables:
Each step usually consists of training several (at least 5) models, using both the same and other collected data to reduce bias, and evaluating their results to ensure that the models aren't just performing well by chance.
As the title implies, I would like to hear more about the advantages of using the Lily bus over Hermes.
This is the area of voice assistants I know the least about! 🙂
Also:
We want to be able to benchmark NLU data sets and then refine their entries to improve the quality of the data sets.
Please see cleaning the NLU dataset and macro NLU data refinement.
The related milestones are in NLU Engine Prototype Benchmarks repo.
Make a notebook similar to this one:
https://github.com/AmateurAcademic/interview-code-examples
How can grammatical agreement best be achieved along with good slotting? This isn't just for English, so remember that some languages have gendered articles, suffix/prefix changes, cases, dual forms, etc.
An easy case for this is the plural form of an entity in an NLG response in English:
utterance: 'turn off the living room lights'
response: 'The living room lights are now off' (entity: living room lights, grammar: plural)
For example, in Mycroft the Home Assistant skill can by default give a response that says:
'The living room lights is now off' (besides basic slotting, there is no real NLG engine in Mycroft). Now imagine if there are 3 or 4 forms of the definite or indefinite article depending on the entity, Latin grammar constructions like the accusative, dative, or genitive, etc. (the closest thing in English: who and whom).
One thing I played around with is using either an API (such as Google's) or running something like LanguageTool to perform a grammar check of the generated response before the TTS speaks it or the user reads it. Support for self-hosting a grammar checker for languages other than English wasn't really possible at the time (I couldn't find the grammar rules for LanguageTool in other languages). Perhaps it is now?
Using the above example, the response 'The living room lights is now off' would be automatically corrected to 'The living room lights are now off', without having to code this.
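A hedged sketch of what that automatic check could look like, using the language_tool_python wrapper (the package choice and the exact calls are my assumptions, not something already wired into any of these projects):

# Sketch only: grammar-check an NLG response before the TTS speaks it.
# Assumes the language_tool_python package (a LanguageTool wrapper) is installed;
# it can also point at a self-hosted LanguageTool server instead of the public one.
import language_tool_python

tool = language_tool_python.LanguageTool("en-US")

def check_response(response: str) -> str:
    # apply LanguageTool's suggested corrections, if any
    return tool.correct(response)

print(check_response("The living room lights is now off"))
# should come out as something like: "The living room lights are now off"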
There are also models based on architectures such as BERT (which can even use DistilBERT), e.g. GECToR, that can handle grammar correction (I have not yet benchmarked these solutions specifically). Perhaps that is also a potential solution.
In addition, if a response doesn't exist for a specific language, you could have it translated automatically using similar tools (e.g. Google Translate or DeepL), further reducing the amount of work.
The advantages of this approach are clear: reduced complexity in creating responses that must have grammatical agreement, and no pseudo-code NLG engines for grammar.
Note: This solution has not yet been implemented
When it comes to Lily's NLG, Project Fluent is used to solve grammatical agreement. From the linked examples, I still haven't seen how slotting of the entities is performed, but perhaps this is a good solution.
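For what it's worth, here is a hedged sketch of how slotting plus number agreement can look in Fluent when called from Python with fluent.runtime; the message ID and setup are my own assumptions, not Lily's actual configuration:

# Sketch, assuming the fluent.runtime package; not taken from Lily's code.
from fluent.runtime import FluentBundle, FluentResource

ftl = """
lights-status =
    The { $entity } { $count ->
        [one] light is
       *[other] lights are
    } now { $state }.
"""

bundle = FluentBundle(["en-US"], use_isolating=False)
bundle.add_resource(FluentResource(ftl))

message = bundle.get_message("lights-status")
text, errors = bundle.format_pattern(
    message.value, {"entity": "living room", "count": 2, "state": "off"}
)
print(text)  # The living room lights are now off.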
Accelerated Text looks absolutely insane. It uses Grammatical Framework, a functional programming language for multilingual grammar applications, at its core, wraps that in a nice GUI, and runs with Docker, which is also pretty sweet. I really badly want to check this thing out.
While not a total solution, the language plugins from Neon look interesting, especially for automatic translation. Although none of these repos seems to have a README or a link to further documentation.
In this notebook: https://github.com/AmateurAcademic/interview-code-examples
the intent classifier uses strictly TF-IDF. Could further feature engineering, such as stemming/lemmatization, or also using word2vec, improve results?
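As a rough illustration (not taken from the notebook), this is roughly what plugging lemmatization in front of the TF-IDF step could look like with scikit-learn and NLTK; the toy data and pipeline choices are assumptions:

# Minimal sketch: lemmatize tokens before TF-IDF in an intent classifier.
import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

nltk.download("wordnet", quiet=True)  # lemmatizer data
lemmatizer = WordNetLemmatizer()

def lemma_tokenizer(text):
    # crude whitespace tokenization, then lemmatize each token
    return [lemmatizer.lemmatize(token) for token in text.lower().split()]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(tokenizer=lemma_tokenizer)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# hypothetical toy data; the real comparison would use the NLU benchmark data set
utterances = ["turn off the living room lights", "what's the weather tomorrow"]
intents = ["iot_lights_off", "weather_query"]
pipeline.fit(utterances, intents)
print(pipeline.predict(["switch the kitchen lights off"]))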
Video response to the 'open can' fail video on TikTok: this is just a fun little project.
It would be cool to build a better wakeword for the trash can and post it as a video (will probably put it up on YouTube and embed it into the model maker README, and @sheosi will post it on TikTok for the LOLs).
Scrape all video sources for wakeword and not-wakeword audio (there is more than one video, so there is plenty of wakeword audio and wakeword fails)
Run Precise Wakeword Model Maker for ‘open can’ to make the wakeword model
Do a small video demo showing it working and not falsely waking up; will probably just use precise-listen as a visual aid (the video style will be a bit like how that Khaby dude does lifehacks)
In your MVP, you write:
- wakeword can be immediately followed by a root-level command ("computer, lights off", without waiting for feedback after wakeword)
There is a limit to how much you can cut down the delay between the wake word engine processing the audio and finding the wake word. This can be improved with TFLite; however, there will still be a pause (maximum delay for uncompressed TFLite on a raspi4 aarch64: 0.1359s, but that's purely running the engine in a test, not the whole setup, which makes it slower and noticeable).
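For reference, a minimal sketch of how that engine-only number can be measured (assumes tflite_runtime is installed and a converted .tflite precise model is available; the model path and input handling are assumptions):

# Time a single TFLite inference, ignoring audio capture and the rest of the setup.
import time
import numpy as np
from tflite_runtime.interpreter import Interpreter

interpreter = Interpreter(model_path="hey-jarvis.tflite")  # hypothetical model file
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# dummy feature window shaped like the model's input, just to time invoke()
features = np.zeros(input_details[0]["shape"], dtype=np.float32)

start = time.perf_counter()
interpreter.set_tensor(input_details[0]["index"], features)
interpreter.invoke()
score = interpreter.get_tensor(output_details[0]["index"])
print(f"engine-only latency: {time.perf_counter() - start:.4f}s, score: {score}")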
@joshua had a great idea: create separate summaries of the documentation, more to onboard the casual person who might stumble upon the project, that link to deeper documentation (i.e. directly in the README).
This is more of a coding problem than anything else. I am out of my depth here; once the wake word prototype deliverables are done, I will probably look into this. I wonder what already exists in this respect?
To a limited degree, Rhasspy seems to offer the closest thing I have seen to this modular design:
I haven't really looked into this much at all. Being able to use intent matchers with skills from other systems would solve the ecosystem problem.
In order for the wake word to work on Android we need:
Seeing as the community is swarming towards microservices (because of several benefits: easier dependency management, easy concurrency, sandboxing, ...), I think that designing a common protocol has benefits that could potentially revolutionize our community, mostly:
Now that we know the "why?", the "what?" should be discussed; in other terms, which features should be included. Here are some, but consider this an open list:
Of course there might be more out there, anyone with a suggestion feel free to comment.
Finally, there's the "how?", which is decisive, as it will determine the robustness and performance of our solution. Some questions about how to build this:
TODO
TODO
The official install instructions for mycroft-precise involve downloading a tarball full of mystical binaries in order to run a model. This tarball is not reproducible, uses old Python, and is mostly redundant and plain annoying.
This to-do is related to the deprecated to-do about building a new binary.
The tarball exists primarily because ARM builds of TensorFlow are not provided officially. However, there are wheels available on:
bitsy-ai github: https://github.com/bitsy-ai/tensorflow-arm-bin
maybe somewhere else
In this task we clean up mycroft-precise and make sure it's easy to install using standard tools (pip & apt). Ideally, as the result, one can just run pip install ${some_url} to install precise and a minimal set of dependencies needed to run a model.
TBD: how to provide the list of additional dependencies (if any) for training the model.
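One possible answer to that TBD, as a sketch only: expose the training dependencies as a pip "extra", so the runtime install stays minimal. The package metadata and dependency lists below are assumptions, not the actual mycroft-precise packaging.

# setup.py sketch: runtime deps by default, training deps behind an extra.
from setuptools import setup, find_packages

setup(
    name="mycroft-precise",
    version="0.0.0",
    packages=find_packages(),
    # minimal set needed to run a model
    install_requires=[
        "numpy",
        "tflite-runtime; platform_machine == 'aarch64'",
    ],
    # heavier dependencies only needed for training
    extras_require={
        "train": ["tensorflow"],
    },
)

# runtime only:          pip install mycroft-precise
# with training extras:  pip install "mycroft-precise[train]"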
General open questions have their own item in the program.
They are currently sorted on the AI Program Kanban
How do the NLU engines perform?
We are interested in measuring resource utilization vs correctness, scalability, performance per domain, and performance per language.
This might be a good data set for performing this analysis. The utterances seem close to reality; it contains enough variation, most of the labeling is pretty good, and it has roughly 25k utterances. Perhaps translate these into other languages as a rough analysis for other languages?
For each intent matching system, we want to find out:
(This is a work in progress)
This will help people decide which intent matcher will perform best for them, given the domains they require for intent matching, how scalable the engine needs to be, the device they want to run it on, and the language.
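A very rough sketch of the kind of per-domain measurement loop this could use; match_intent() and the data format are hypothetical placeholders for whichever engine is being wrapped:

import time
from collections import defaultdict

def benchmark_engine(match_intent, labelled_utterances):
    """labelled_utterances: list of (utterance, domain, true_intent) tuples."""
    stats = defaultdict(lambda: {"correct": 0, "total": 0, "latency": 0.0})
    for utterance, domain, true_intent in labelled_utterances:
        start = time.perf_counter()
        predicted = match_intent(utterance)  # engine-specific wrapper
        stats[domain]["latency"] += time.perf_counter() - start
        stats[domain]["total"] += 1
        stats[domain]["correct"] += int(predicted == true_intent)
    return {
        domain: {
            "accuracy": s["correct"] / s["total"],
            "avg_latency_s": s["latency"] / s["total"],
        }
        for domain, s in stats.items()
    }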
@Tadashi-Hikari has some interesting insights into the challenges of developing a voice assistant to run natively on Android.
Most voice assistants that run on Android are just a client, they don't run natively. Building a voice assistant that can run completely natively on Android is a very interesting topic. I personally would love to know more about this in hopes the community can collectively help solve some of the challenges.
So far, as I understand (and my understanding of Android development is very limited):
What other challenges are there? How can we solve these problems?
TODO
TODO
If I wanted to train a TTS with my own recordings and then deploy it with Larynx TTS, how exactly would I do that? I couldn’t find a good tutorial online. Maybe I missed something.
Perhaps @JarbasAI or @NeonDaniel have done something similar, or happen to have good resources for this?
Hetzner?
Should we use Jitsi or something similar?
Phase two of the Wakeword Project
Each step usually consists of training several (at least 5) models, using both the same and other collected data to reduce bias, and evaluating their results to ensure that the models aren't just performing well by chance.
@NeonDaniel pointed out that you can run DeepSpeech in real time on a raspi4. I actually didn't know it would run. (That's amazing btw, thanks for that!)
This opens up a good question:
How well does it perform (resources, latency, quality)?
It would be interesting to benchmark this for a list of prerecorded utterances and measure the latency, resource usage, and 'correctness' of transcription.
Another interesting side question: how well does it perform in noisy situations? A person can really train a wake word to work in pretty noisy environments; however, that won't be so effective if the transcription breaks down in the same environments. It might be a good idea to add noise to the background of a subset of this data set.
Number of recorded utterances?
Perhaps 1000?
It wouldn't be a benchmark if it wasn't measured against something else.
It might also be interesting to measure a model from Silero. Should Kaldi or anything else be considered?
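A hedged sketch of what such a benchmark loop could look like; transcribe() is a hypothetical wrapper around whichever engine is under test (DeepSpeech, Silero, Kaldi, ...), and jiwer is assumed for the word error rate:

import time
import jiwer

def benchmark_stt(transcribe, samples):
    """samples: list of (wav_path, reference_text) pairs."""
    total_latency, hypotheses, references = 0.0, [], []
    for wav_path, reference in samples:
        start = time.perf_counter()
        hypothesis = transcribe(wav_path)  # engine-specific call
        total_latency += time.perf_counter() - start
        hypotheses.append(hypothesis)
        references.append(reference)
    # resource usage (CPU/RAM) could be sampled with psutil in the same loop
    return {
        "avg_latency_s": total_latency / len(samples),
        "wer": jiwer.wer(references, hypotheses),
    }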
You are here because you want your own, personal AI assistant (voice assistant, chat bot, home automation, etc.). What is your MVP? Add this as a doc, tag 'Personal MVP' as the type. You can take a look at my crappy rough draft for Jarvis to get some ideas.
Phase two of the Wakeword Project
Each step usually consists of training several (at least 5) models, using both the same and other collected data to reduce bias, and evaluating their results to ensure that the models aren't just performing well by chance.
Make a notebook similar to this one:
https://github.com/AmateurAcademic/interview-code-examples
One of the limits of NLP is good annotated data. This is also true for tinyML, where a user might want to add their own examples to train and/or test an ML model. How can a user easily collect and annotate text data?
Here are two examples of open source text annotation software to check out:
The NLU-NLG project requires a rough outline containing:
@equi brought up an interesting question about using a lighter ASR that, instead of transcribing everything, is based around specific commands.
'Quick Commands'
Use multiple wake words. However(!) this does not work with slots (tags), i.e. datetimes and other entities. Therefore this isn't a satisfactory solution.
Use something like Kaldi with grammars...
https://github.com/daanzu/kaldi-active-grammar
TODO
TODO
The Wakeword project is focused on all aspects of users creating and using wakewords to trigger actions, such as ASR transcription, for voice-enabled solutions.
Make a notebook similar to this one, but using DistilBERT:
https://github.com/AmateurAcademic/interview-code-examples
@Ashit-cloud you said you wanted to try this one out, right?
The NLU-NLG project is focused on all aspects of NLU (natural language understanding: intent matching and entity extraction) and NLG (natural language generation) engines.
TODO: What are the KPIs (benchmarks) for this project to be complete?
We know Mimic can be run on a raspi4 in 'real time'; we also know that Tacotron(2) probably will never run in real time on a raspi4 (or perhaps it could?), so what does that leave us with?
Has anyone tried the Silero TTS models?
In this notebook: https://github.com/AmateurAcademic/interview-code-examples
the CRFs do use several features; however, it might be interesting to further improve the extraction by using lemmas/stemming or other features. Do the features always have to be on the word level?
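Not from the linked notebook, but as a sketch of what features beyond the raw word could look like in the usual dict-per-token CRF style (the NLTK lemmatizer and the exact feature names are assumptions):

# Sub-word and lemma-level features for CRF-based entity extraction.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

def token_features(tokens, i):
    word = tokens[i]
    features = {
        "word.lower": word.lower(),
        "word.lemma": lemmatizer.lemmatize(word.lower()),  # lemma-level feature
        "suffix3": word[-3:],       # sub-word (character-level) feature
        "is_digit": word.isdigit(),
        "is_title": word.istitle(),
    }
    if i > 0:
        features["prev.lower"] = tokens[i - 1].lower()  # context feature
    else:
        features["BOS"] = True
    return features

# usage: X = [[token_features(utt, i) for i in range(len(utt))] for utt in tokenized_utterances]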
Although Notion is cool and all, it really isn't free (also it isn't open source). For an open source project, I don't really want to pay for such a service (but it was great to have some ideas and templates).
Having better silence detection would aid in chopping up audio files containing wake word information to reduce false positives.
Currently, to make sure individual files capture only aspects of the wake word recordings, I chop them by n + 2, where n is the number of syllables in the wake word. This works; however, it misses a lot more combinations of sounds (i.e. 'Jarvis' in 'hey Jarvis' would not be completely contained).
I tried some experiments with silence removal myself based on this stackoverflow question. However, the threshold must be manually provided, and I couldn't find a satisfactory one; perhaps a dynamic threshold is needed?
Here is an interesting code snippet to check if it works better.
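Separately from that snippet, here is a hedged sketch of one dynamic-threshold idea using pydub, where the silence threshold is set relative to each file's own average loudness instead of a hard-coded value (the dB offset and timings are guesses that would need tuning):

from pydub import AudioSegment
from pydub.silence import split_on_silence

def chop_on_silence(wav_path, db_offset=16):
    audio = AudioSegment.from_wav(wav_path)
    # dynamic threshold: relative to this file's average loudness (dBFS)
    return split_on_silence(
        audio,
        min_silence_len=200,                    # ms; tune for speech gaps
        silence_thresh=audio.dBFS - db_offset,
        keep_silence=50,                        # keep a little padding around chunks
    )

for i, chunk in enumerate(chop_on_silence("hey-jarvis-sample.wav")):  # hypothetical file
    chunk.export(f"chunk_{i}.wav", format="wav")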
However, I think for now the easiest solution is to add a feature to the wake word recording Python script to let people add in such stuff themselves. This level of recording (such as using 'Jarvis' as not-wake audio) was impossible on earlier models, before the data generation methods were perfected.
This is the easiest and most viable solution. But it would be cool to be able to chop up audio files automatically for syllables and even more complex sounds in the future.
I want my wake word 'hey Jarvis' to work, but not just 'Jarvis' on its own. Therefore, when prompted for extra input on not-wake-words, I add in 'Jarvis' with 2 recordings (one for training, one for test, which will be generated further anyway).
You have mentioned interest in creating a Rust wake word engine that can run Precise models. I am curious as to the advantages of this.
Blockers
@sheosi is working on this.
Also move contents of README into Program Overview wiki page