secretsauceai / secret_sauce_ai
Secret Sauce AI: a coordinated community of tech-minded AI enthusiasts
License: Apache License 2.0
Complete the prototype deliverables:
Each step usually consists of training several (at least 5) models, using both the same and other collected data to reduce bias, and evaluating their results to ensure that the models aren't just performing well by chance.
As the title implies, I would like to hear more about the advantages of using the Lily bus over Hermes.
This is the area of voice assistants I know the least about! 🙂
Also:
We want to be able to benchmark NLU data sets and then refine their entries to improve the quality of the data sets.
Please see cleaning the NLU dataset and macro NLU data refinement.
The related milestones are in NLU Engine Prototype Benchmarks repo.
Make a notebook similar to this one:
https://github.com/AmateurAcademic/interview-code-examples
How can grammatical agreement best be achieved along with good slotting? This isn't just for English, so remember that some languages have gendered articles, suffix/prefix changes, cases, dual forms, etc.
An easy case for this is the plural form of an entity in an NLG response in English:
utterance: 'turn off the living room lights'
response: 'The living room lights are now off' (entity: living room lights, grammar: plural)
For example, in Mycroft the Home Assistant skill can by default give a response that says:
'The living room lights is now off' (besides basic slotting, there is no real NLG engine in Mycroft). Now imagine if there are 3 or 4 forms of the definite or indefinite article depending on the entity, Latin grammar constructions like the accusative, dative, or genitive, etc. (the closest thing in English: who and whom).
One thing I played around with is using either an API (such as Google's) or running something like LanguageTool to perform a grammar check of the generated response before the TTS speaks it or the user reads it. Support for self-hosting a grammar checker for languages other than English wasn't really possible at the time (I couldn't find the grammar rules for LanguageTool in other languages). Perhaps it is now?
Using the above example, the response 'The living room lights is now off' would be automatically corrected to 'The living room lights are now off', without having to code this.
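A hedged sketch of what that automatic check could look like, using the language_tool_python wrapper (the package choice and the exact calls are my assumptions, not something already wired into any of these projects):

# Sketch only: grammar-check an NLG response before the TTS speaks it.
# Assumes the language_tool_python package (a LanguageTool wrapper) is installed;
# it can also point at a self-hosted LanguageTool server instead of the public one.
import language_tool_python

tool = language_tool_python.LanguageTool("en-US")

def check_response(response: str) -> str:
    # apply LanguageTool's suggested corrections, if any
    return tool.correct(response)

print(check_response("The living room lights is now off"))
# should come out as something like: "The living room lights are now off"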
There are also models based on architectures such as BERT (which can even use DistilBERT), e.g. GECToR, that can handle grammar correction (I have not yet benchmarked these solutions specifically). Perhaps that is also a potential solution.
In addition, if a response doesn't exist for a specific language, you could have it translated automatically using similar tools (e.g. Google Translate or DeepL), further reducing the amount of work.
The advantages of this approach are clear: reduced complexity in creating responses that must have grammatical agreement, and no pseudo-code NLG engines for grammar.
Note: This solution has not yet been implemented
When it comes to Lily's NLG, Project Fluent is used to solve grammatical agreement. From the linked examples, I still haven't seen how slotting of the entities is performed, but perhaps this is a good solution.
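For what it's worth, here is a hedged sketch of how slotting plus number agreement can look in Fluent when called from Python with fluent.runtime; the message ID and setup are my own assumptions, not Lily's actual configuration:

# Sketch, assuming the fluent.runtime package; not taken from Lily's code.
from fluent.runtime import FluentBundle, FluentResource

ftl = """
lights-status =
    The { $entity } { $count ->
        [one] light is
       *[other] lights are
    } now { $state }.
"""

bundle = FluentBundle(["en-US"], use_isolating=False)
bundle.add_resource(FluentResource(ftl))

message = bundle.get_message("lights-status")
text, errors = bundle.format_pattern(
    message.value, {"entity": "living room", "count": 2, "state": "off"}
)
print(text)  # The living room lights are now off.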
Accelerated Text looks absolutely insane. It uses Grammatical Framework, a functional programming language for multilingual grammar applications, at its core, wraps that in a nice GUI, and runs with Docker, which is also pretty sweet. I really badly want to check this thing out.
While not a total solution, the language plugins from Neon look interesting, especially for automatic translation. Although none of these repos seems to have a README or a link to further documentation.
In this notebook: https://github.com/AmateurAcademic/interview-code-examples
the intent classifier uses strictly TF-IDF. Could further feature engineering, such as stemming/lemmatization, or also using word2vec, improve results?
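As a rough illustration (not taken from the notebook), this is roughly what plugging lemmatization in front of the TF-IDF step could look like with scikit-learn and NLTK; the toy data and pipeline choices are assumptions:

# Minimal sketch: lemmatize tokens before TF-IDF in an intent classifier.
import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

nltk.download("wordnet", quiet=True)  # lemmatizer data
lemmatizer = WordNetLemmatizer()

def lemma_tokenizer(text):
    # crude whitespace tokenization, then lemmatize each token
    return [lemmatizer.lemmatize(token) for token in text.lower().split()]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(tokenizer=lemma_tokenizer)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# hypothetical toy data; the real comparison would use the NLU benchmark data set
utterances = ["turn off the living room lights", "what's the weather tomorrow"]
intents = ["iot_lights_off", "weather_query"]
pipeline.fit(utterances, intents)
print(pipeline.predict(["switch the kitchen lights off"]))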
Video response to the 'open can' fail video on TikTok: this is just a fun little project.
It would be cool to build a better wakeword for the trash can and post it as a video (will probably put it up on YouTube and embed it into the model maker README, and @sheosi will post it on TikTok for the LOLs).
Scrape all video sources for wakeword and not-wakeword audio (there is more than one video, so there is plenty of wakeword audio and wakeword fails)
Run Precise Wakeword Model Maker for ‘open can’ to make the wakeword model
Do a small video demo showing it working and not falsely waking up; will probably just use precise-listen as a visual aid (the video style will be a bit like how that Khaby dude does lifehacks)
In your MVP, you write:
- wakeword can be immediately followed by a root-level command ("computer, lights off", without waiting for feedback after wakeword)
There is a limit to how much you can cut down the delay between the wake word engine processing the audio and finding the wake word. This can be improved with TFLite; however, there will still be a pause (maximum delay for uncompressed TFLite on a raspi4 aarch64: 0.1359s, but that's purely running the engine in a test, not the whole setup, which makes it slower and noticeable).
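For reference, a minimal sketch of how that engine-only number can be measured (assumes tflite_runtime is installed and a converted .tflite precise model is available; the model path and input handling are assumptions):

# Time a single TFLite inference, ignoring audio capture and the rest of the setup.
import time
import numpy as np
from tflite_runtime.interpreter import Interpreter

interpreter = Interpreter(model_path="hey-jarvis.tflite")  # hypothetical model file
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# dummy feature window shaped like the model's input, just to time invoke()
features = np.zeros(input_details[0]["shape"], dtype=np.float32)

start = time.perf_counter()
interpreter.set_tensor(input_details[0]["index"], features)
interpreter.invoke()
score = interpreter.get_tensor(output_details[0]["index"])
print(f"engine-only latency: {time.perf_counter() - start:.4f}s, score: {score}")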
@joshua had a great idea: create separate summaries of the documentation, more to onboard the casual person who might stumble upon the project, that link to deeper documentation (i.e. directly in the README).
This is more of a coding problem than anything else. I am out of my depth here; once the wake word prototype deliverables are done, I will probably look into this. I wonder what already exists in this respect?
To a limited degree, Rhasspy seems to offer the closest thing I have seen to this modular design:
I haven't really looked into this much at all. Being able to use intent matchers with skills from other systems would solve the ecosystem problem.
In order for the wake word to work on Android we need:
Seeing as the community is swarming towards microservices (because of several benefits: easier dependency management, easy concurrency, sandboxing, ...), I think that designing a common protocol has benefits that could potentially revolutionize our community, mostly:
Now that we know the "why?", the "what?" should be discussed; in other terms, which features should be included. Here are some, but consider this an open list:
Of course there might be more out there, anyone with a suggestion feel free to comment.
Finally, there's the "how?", which is decisive, as it will determine the robustness and performance of our solution. Some questions about how to build this:
TODO
TODO
The official install instructions for mycroft-precise involve downloading a tarball full of mystical binaries in order to run a model. This tarball is not reproducible, uses old Python, and is mostly redundant and plain annoying.
This to-do is related to the deprecated to-do about building a new binary.
The tarball exists primarily because ARM builds of TensorFlow are not provided officially. However, there are wheels available on:
bitsy-ai github: https://github.com/bitsy-ai/tensorflow-arm-bin
maybe somewhere else
In this task we clean up mycroft-precise and make sure it's easy to install using standard tools (pip & apt). Ideally, as the result, one can just run pip install ${some_url} to install precise and a minimal set of dependencies needed to run a model.
TBD: how to provide the list of additional dependencies (if any) for training the model.
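One possible answer to that TBD, as a sketch only: expose the training dependencies as a pip "extra", so the runtime install stays minimal. The package metadata and dependency lists below are assumptions, not the actual mycroft-precise packaging.

# setup.py sketch: runtime deps by default, training deps behind an extra.
from setuptools import setup, find_packages

setup(
    name="mycroft-precise",
    version="0.0.0",
    packages=find_packages(),
    # minimal set needed to run a model
    install_requires=[
        "numpy",
        "tflite-runtime; platform_machine == 'aarch64'",
    ],
    # heavier dependencies only needed for training
    extras_require={
        "train": ["tensorflow"],
    },
)

# runtime only:          pip install mycroft-precise
# with training extras:  pip install "mycroft-precise[train]"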
General open questions have their own item in the program.
They are currently sorted on the AI Program Kanban
How do the NLU engines perform?
We are interested in measuring resource utilization vs correctness, scalability, performance per domain, and performance per language.
This might be a good data set for performing this analysis. The utterances seem close to reality; it contains enough variation, most of the labeling is pretty good, and it has roughly 25k utterances. Perhaps translate these into other languages as a rough analysis for other languages?
For each intent matching system, we want to find out:
(This is a work in progress)
This will help people decide which intent matcher will perform best for them, given the domains they require for intent matching, how scalable the engine needs to be, the device they want to run it on, and the language.
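A very rough sketch of the kind of per-domain measurement loop this could use; match_intent() and the data format are hypothetical placeholders for whichever engine is being wrapped:

import time
from collections import defaultdict

def benchmark_engine(match_intent, labelled_utterances):
    """labelled_utterances: list of (utterance, domain, true_intent) tuples."""
    stats = defaultdict(lambda: {"correct": 0, "total": 0, "latency": 0.0})
    for utterance, domain, true_intent in labelled_utterances:
        start = time.perf_counter()
        predicted = match_intent(utterance)  # engine-specific wrapper
        stats[domain]["latency"] += time.perf_counter() - start
        stats[domain]["total"] += 1
        stats[domain]["correct"] += int(predicted == true_intent)
    return {
        domain: {
            "accuracy": s["correct"] / s["total"],
            "avg_latency_s": s["latency"] / s["total"],
        }
        for domain, s in stats.items()
    }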
@Tadashi-Hikari has some interesting insights into the challenges of developing a voice assistant to run natively on Android.
Most voice assistants that run on Android are just a client, they don't run natively. Building a voice assistant that can run completely natively on Android is a very interesting topic. I personally would love to know more about this in hopes the community can collectively help solve some of the challenges.
So far, as I understand (and my understanding of Android development is very limited):
What other challenges are there? How can we solve these problems?
TODO
TODO
If I wanted to train a TTS with my own recordings and then deploy it with Larynx TTS, how exactly would I do that? I couldn’t find a good tutorial online. Maybe I missed something.
Perhaps @JarbasAI or @NeonDaniel have done something similar, or happen to have good resources for this?
Hetzner?
Should we use Jitsi or something similar?
Phase two of the Wakeword Project
Each step usually consists of training several (at least 5) models, using both the same and other collected data to reduce bias, and evaluating their results to ensure that the models aren't just performing well by chance.
@NeonDaniel pointed out that you can run DeepSpeech in real time on a raspi4. I actually didn't know it would run. (That's amazing btw, thanks for that!)
This opens up a good question:
How well does it perform (resources, latency, quality)?
It would be interesting to benchmark this for a list of prerecorded utterances and measure the latency, resource usage, and 'correctness' of transcription.
Another interesting side question: how well does it perform in noisy situations? A person can really train a wake word to work in pretty noisy environments; however, that won't be so effective if the transcription breaks down in the same environments. It might be a good idea to add noise to the background of a subset of this data set.
Number of recorded utterances?
Perhaps 1000?
It wouldn't be a benchmark if it wasn't measured against something else.
It might also be interesting to measure a model from Silero. Should Kaldi or anything else be considered?
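A hedged sketch of what such a benchmark loop could look like; transcribe() is a hypothetical wrapper around whichever engine is under test (DeepSpeech, Silero, Kaldi, ...), and jiwer is assumed for the word error rate:

import time
import jiwer

def benchmark_stt(transcribe, samples):
    """samples: list of (wav_path, reference_text) pairs."""
    total_latency, hypotheses, references = 0.0, [], []
    for wav_path, reference in samples:
        start = time.perf_counter()
        hypothesis = transcribe(wav_path)  # engine-specific call
        total_latency += time.perf_counter() - start
        hypotheses.append(hypothesis)
        references.append(reference)
    # resource usage (CPU/RAM) could be sampled with psutil in the same loop
    return {
        "avg_latency_s": total_latency / len(samples),
        "wer": jiwer.wer(references, hypotheses),
    }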
You are here because you want your own, personal AI assistant (voice assistant, chat bot, home automation, etc.). What is your MVP? Add this as a doc, tag 'Personal MVP' as the type. You can take a look at my crappy rough draft for Jarvis to get some ideas.
Phase two of the Wakeword Project
Each step usually consists of training several (at least 5) models, using both the same and other collected data to reduce bias, and evaluating their results to ensure that the models aren't just performing well by chance.
Make a notebook similar to this one:
https://github.com/AmateurAcademic/interview-code-examples
One of the limits of NLP is good annotated data. This is also true for tinyML, where a user might want to add their own examples to train and/or test an ML model. How can a user easily collect and annotate text data?
Here are two examples of open source text annotation software to check out:
The NLU-NLG project requires a rough outline containing:
@equi brought up an interesting question about using a lighter ASR that, instead of transcribing everything, is based around specific commands.
'Quick Commands'
Use multiple wake words. However(!) this does not work with slots (tags), i.e. datetimes and other entities. Therefore this isn't a satisfactory solution.
Use something like Kaldi with grammars...
https://github.com/daanzu/kaldi-active-grammar
TODO
TODO
The Wakeword project is focused on all aspects of users creating and using wakewords to trigger actions, such as ASR transcription, for voice-enabled solutions.
Make a notebook similar to this one, but using DistilBERT:
https://github.com/AmateurAcademic/interview-code-examples
@Ashit-cloud you said you wanted to try this one out, right?
The NLU-NLG project is focused on all aspects of NLU (natural language understanding: intent matching and entity extraction) and NLG (natural language generation) engines.
TODO: What are the KPIs (benchmarks) for this project to be complete?
We know Mimic can be run on a raspi4 in 'real time'; we also know that Tacotron(2) probably will never run in real time on a raspi4 (or perhaps it could?), so what does that leave us with?
Has anyone tried the Silero TTS models?
In this notebook: https://github.com/AmateurAcademic/interview-code-examples
the CRFs do use several features; however, it might be interesting to further improve the extraction by using lemmas/stemming or other features. Do the features always have to be on the word level?
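Not from the linked notebook, but as a sketch of what features beyond the raw word could look like in the usual dict-per-token CRF style (the NLTK lemmatizer and the exact feature names are assumptions):

# Sub-word and lemma-level features for CRF-based entity extraction.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

def token_features(tokens, i):
    word = tokens[i]
    features = {
        "word.lower": word.lower(),
        "word.lemma": lemmatizer.lemmatize(word.lower()),  # lemma-level feature
        "suffix3": word[-3:],       # sub-word (character-level) feature
        "is_digit": word.isdigit(),
        "is_title": word.istitle(),
    }
    if i > 0:
        features["prev.lower"] = tokens[i - 1].lower()  # context feature
    else:
        features["BOS"] = True
    return features

# usage: X = [[token_features(utt, i) for i in range(len(utt))] for utt in tokenized_utterances]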
Although Notion is cool and all, it really isn't free (also it isn't open source). For an open source project, I don't really want to pay for such a service (but it was great to have some ideas and templates).
Having better silence detection would aid in chopping up audio files containing wake word information to reduce false positives.
Currently, to make sure individual files capture only aspects of the wake word recordings, I chop them by n + 2, where n is the number of syllables in the wake word. This works; however, it misses a lot more combinations of sounds (i.e. 'Jarvis' in 'hey Jarvis' would not be completely contained).
I tried some experiments with silence removal myself based on this stackoverflow question. However, the threshold must be manually provided, and I couldn't find a satisfactory one; perhaps a dynamic threshold is needed?
Here is an interesting code snippet to check if it works better.
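Separately from that snippet, here is a hedged sketch of one dynamic-threshold idea using pydub, where the silence threshold is set relative to each file's own average loudness instead of a hard-coded value (the dB offset and timings are guesses that would need tuning):

from pydub import AudioSegment
from pydub.silence import split_on_silence

def chop_on_silence(wav_path, db_offset=16):
    audio = AudioSegment.from_wav(wav_path)
    # dynamic threshold: relative to this file's average loudness (dBFS)
    return split_on_silence(
        audio,
        min_silence_len=200,                    # ms; tune for speech gaps
        silence_thresh=audio.dBFS - db_offset,
        keep_silence=50,                        # keep a little padding around chunks
    )

for i, chunk in enumerate(chop_on_silence("hey-jarvis-sample.wav")):  # hypothetical file
    chunk.export(f"chunk_{i}.wav", format="wav")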
However, I think for now the easiest solution is to add a feature to the wake word recording Python script to let people add in such stuff themselves. This level of recording (such as using 'Jarvis' as not-wake audio) was impossible on earlier models, before the data generation methods were perfected.
This is the easiest and most viable solution. But it would be cool to be able to chop up audio files automatically for syllables and even more complex sounds in the future.
I want my wake word 'hey Jarvis' to work, but not just 'Jarvis' on its own. Therefore, when prompted for extra input on not-wake-words, I add in 'Jarvis' with 2 recordings (one for training, one for test, which will be generated further anyway).
You have mentioned interest in creating a Rust wake word engine that can run Precise models. I am curious as to the advantages of this.
Blockers
@sheosi is working on this.
Also move contents of README into Program Overview wiki page