Comments (10)
We have finally been able to start work on V5 using this data, among others.
from silero-vad.
Hi!
This is definitely an interesting area to cover for v5; we had not really thought about it explicitly before!
You see, we viewed VAD as speech / noised speech separation from everything else (silence, mild noise, music).
This poses a question of separating speech from extremely noisy backgrounds, if I understand correctly. Or when there always is noise and only sometimes speech.
However, for noise-only signals, I've been getting a consistent 2-3x worse result from v4 w.r.t. v3
Could this be due to v4's encoder shrinking w.r.t. v3's (see number of params below from JIT)? Or should this be more of a training-data issue?
We simply did not optimize for this metric, so it is more or less random.
But our data construction prefers mild noise and more or less clean speech.
In a nutshell, we simply did not optimize for this scenario.
Did you also observe this behaviour on non-speech-only data?
We have observed that our VAD does not behave well on very loud noise.
Do these numbers make sense at all? Am I doing something wrong? If so, I'd appreciate some directions.
The numbers being measured are the sigmoid outputs of both models' forward method (returned early from the get_speech_timestamps() utility), with a threshold of 0.5 and a window size of 1536 samples.
Yes, this makes sense.
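For readers who want to reproduce this kind of frame-level measurement, here is a minimal sketch of the procedure described above: split the audio into 1536-sample windows, get one speech probability per window, and count how many windows cross the 0.5 threshold. The `predict` callable is a hypothetical stand-in for a model's forward call (it is not part of silero-vad), and the zero-padding of the last window and the `>=` comparison are assumptions.

```python
from typing import Callable, List, Sequence


def window_probs(
    audio: Sequence[float],
    predict: Callable[[List[float]], float],
    window_size: int = 1536,
) -> List[float]:
    """Return one speech probability per fixed-size window.

    `predict` is a hypothetical stand-in for the model's forward call
    and is assumed to return an already-sigmoided probability.
    """
    probs = []
    for start in range(0, len(audio), window_size):
        chunk = list(audio[start:start + window_size])
        if len(chunk) < window_size:
            # Assumption: zero-pad the trailing partial window.
            chunk += [0.0] * (window_size - len(chunk))
        probs.append(predict(chunk))
    return probs


def false_trigger_rate(probs: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of windows flagged as speech.

    On noise-only audio every flagged window is a false positive,
    so lower is better.
    """
    if not probs:
        return 0.0
    return sum(p >= threshold for p in probs) / len(probs)
```

Running both model versions through the same `window_probs` call on the same noise-only files and comparing their `false_trigger_rate` values would give the kind of ratio discussed in this thread.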
There are a lot of gimmicks in the get_speech_timestamps() method to make speech detection more robust.
We will try to (i) replicate your metrics, (ii) see whether applying more of the above method improves the results, and (iii) adopt the task long-term.
The good news also is that we got a bit of support for our project, so it will enjoy some attention in the near future with regard to customization, generalization and flexibility.
from silero-vad.
@dgoryeo I'm not sure what to tell you. I don't use Python for silero v3/v4 anymore, just the onnxruntime C API. If I were you, I guess I would start by checking out an older repository revision from before the v4 update: https://github.com/snakers4/silero-vad/tree/v3.1
from silero-vad.
To be solved with a V5 release.
from silero-vad.
The new VAD version was released just now - #2 (comment)
It was designed with this issue in mind and performance on noise-only data was significantly improved - https://github.com/snakers4/silero-vad/wiki/Quality-Metrics
When designing for this task we used your conclusions and ideas, so many thanks for this ticket.
Can you please re-run your tests and, if the issue persists, open a new issue referring to this one?
Many thanks!
from silero-vad.
Hello!
Thank you for your response, @snakers4.
This poses a question of separating speech from extremely noisy backgrounds, if I understand correctly. Or when there always is noise and only sometimes speech.
Yes, it is not exactly "detecting speech", but "not triggering on non-speech" instead. What I had in mind is slightly related to the latter. Think of the idle periods of an ASR-based dictation application in which the VAD is always on: to my mind, v4 would trigger, say, twice as often as v3 on background noises (such as a dog barking), which in turn might leave the ASR exposed. For IoT applications, on the other hand, it also means unnecessarily calling a more power-hungry system more frequently.
We simply did not optimize for this metric, so it is more or less random.
Ok, got it!
There are a lot of gimmicks in the get_speech_timestamps() method to make speech detection more robust.
In fact, I only used the windowing and forward call from get_speech_timestamps(), and evaluated only the model output posteriors after the binarization step, not the timestamps. Perhaps I should continue the tests at the segment level (e.g., the best model should have the lowest total duration of wrongly detected speech segments on noise-only data), even though I believe v4 would still behave worse, though perhaps not by the same 2-3x margin.
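The segment-level metric proposed here could be sketched as follows. This is only an illustration: it assumes segments are given as a list of dicts with `start` and `end` in samples and a 16 kHz sampling rate, and the function name is hypothetical.

```python
from typing import Dict, List


def false_speech_seconds(
    segments: List[Dict[str, int]],
    sampling_rate: int = 16000,
) -> float:
    """Total duration, in seconds, of detected speech segments.

    On noise-only audio every detected segment is wrong, so the model
    with the lowest total is the better one for this scenario.
    Assumes `start` and `end` are sample indices.
    """
    total_samples = sum(seg["end"] - seg["start"] for seg in segments)
    return total_samples / sampling_rate
```

Comparing this total for v3 and v4 on the same noise-only files would show whether the gap seen at the posterior level survives the post-processing.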
In any case, while waiting for (and looking forward to) v5, if you would be so kind as to report on your attempts to replicate the numbers in that table, I'd be happy to hear about them!
from silero-vad.
This sounds like something related to my experience as well. After using v4 for a while I had to go back to v3. While overall speech detection seemed a bit better in v4 and more precise near word boundaries, it exhibits a consistent tendency toward false positives: long stretches of non-speech (1-2 minutes) at the beginning and end of audio files are mistakenly flagged as speech. For my uses this isn't worth a minor accuracy increase; I can simply increase the padding between speech segments.
Now I'm not ruling out a mistake in my code, and I have never tested it formally, but subjectively it seems like it might be related to this issue.
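The padding workaround mentioned above can be illustrated with a small sketch. This is not the library's implementation; the function name and the segment format (dicts with `start` and `end` in samples) are assumptions made for the example. Each segment is widened by a fixed number of samples on both sides, and segments that touch or overlap after widening are merged.

```python
from typing import Dict, List


def pad_and_merge(
    segments: List[Dict[str, int]],
    pad: int,
    total_len: int,
) -> List[Dict[str, int]]:
    """Widen each segment by `pad` samples per side and merge overlaps.

    Boundaries are clamped to [0, total_len]. Hypothetical helper,
    written only to illustrate the idea of extra padding.
    """
    merged: List[Dict[str, int]] = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        start = max(0, seg["start"] - pad)
        end = min(total_len, seg["end"] + pad)
        if merged and start <= merged[-1]["end"]:
            # Overlaps or touches the previous padded segment: extend it.
            merged[-1]["end"] = max(merged[-1]["end"], end)
        else:
            merged.append({"start": start, "end": end})
    return merged
```

A larger `pad` trades boundary precision for fewer clipped words, which is the trade-off described in this comment.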
from silero-vad.
@IntendedConsequence, just a quick novice question: how does one invoke the v3 model? Thanks.
from silero-vad.
@snakers4 Can we fine-tune the VAD on our own data? We have in-house segmented data and would like to ask whether it is possible to fine-tune this model.
I am not able to find any fine-tuning code in this repo.
from silero-vad.