Hi, I would like to know what is the technical process happening "behind the scene

Technical Explanation of Desktop Application about honk HOT 7 CLOSED

castorini commented on August 22, 2024

Technical Explanation of Desktop Application

from honk.

Comments (7)

daemon commented on August 22, 2024

Hi,

speech_demo.py segments the speech in overlapping windows. Posteriors aren't smoothed, instead being compared to a threshold at every timestep; i.e., if the output probability is less than some min_keyword_prob, then the label is treated as negative (see line 110 in server.py). It would be interesting to see how posterior smoothing compares.
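For illustration, the per-timestep thresholding described above can be sketched next to a simple smoothed variant. This is only a sketch: `min_keyword_prob` is the name from the comment, while `threshold_labels`, `moving_average`, the window width, and the probabilities are hypothetical and not taken from server.py.

```python
from typing import List, Sequence


def threshold_labels(probs: Sequence[float], min_keyword_prob: float) -> List[bool]:
    """Per-timestep decision: a probability below the threshold counts as negative."""
    return [p >= min_keyword_prob for p in probs]


def moving_average(probs: Sequence[float], width: int) -> List[float]:
    """Hypothetical smoothing: mean of the last `width` posteriors (fewer at the start)."""
    smoothed = []
    for i in range(len(probs)):
        window = probs[max(0, i - width + 1): i + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Made-up keyword probabilities, one per timestep.
probs = [0.1, 0.9, 0.2, 0.8, 0.9, 0.9]

# Unsmoothed: the isolated spike at timestep 1 is accepted as a detection.
raw = threshold_labels(probs, min_keyword_prob=0.6)

# Smoothed first: the isolated spike is averaged away, the sustained run survives.
smooth = threshold_labels(moving_average(probs, width=3), min_keyword_prob=0.6)
```

With these made-up numbers, smoothing suppresses the single-timestep spike but also delays nothing in the sustained run, which is the kind of trade-off a comparison would need to measure on real audio.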

[...] the wav file was basically converted into a single image, which led eventually to a single prediction.

Yep, you're correct that train.py does that. It's originally designed for the Google Speech Commands dataset, which has audio clips of only one second in length.

waltergenchi commented on August 22, 2024

Great, thanks!
What is the window and overlapping size?
I can't find the parameter in the speech_demo.py file.

daemon commented on August 22, 2024

Oops, scratch that -- speech_demo.py doesn't stride the windows at all. server.py contains the code for striding. With the default parameters, windows aren't overlapped, since the audio is sent in chunks of one second. To get overlapping windows, you'll need to increase the chunk size or the total number of chunks sent in the demo application code. The stride can be adjusted in server.py.

waltergenchi commented on August 22, 2024

In server.py, line 85, the default stride_size=500 (in milliseconds) and in the speech_demo.py the default chunk_size=1000.
From that, I understood that there actually is overlap between windows (of size 1 second, i.e. 1000 milliseconds).
Am I missing something?

daemon commented on August 22, 2024

The definition of stride is

def stride(array, stride_size, window_size):
    i = 0
    while i + window_size <= len(array):
        yield array[i:i + window_size]
        i += stride_size

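The generator yields (len(array) - window_size) // stride_size + 1 windows. Thus, no overlapping occurs if (len(array) - window_size) // stride_size is less than 1, i.e. if only a single window fits in the array.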
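To make the window count concrete, here is the generator above run on two input lengths. This is a sketch under assumptions: one list element stands for one millisecond purely for illustration, the values 500 and 1000 are the defaults quoted earlier in the thread, and `window_starts` is a hypothetical helper, not part of server.py.

```python
def stride(array, stride_size, window_size):
    # Same generator as quoted above.
    i = 0
    while i + window_size <= len(array):
        yield array[i:i + window_size]
        i += stride_size


def window_starts(length, stride_size, window_size):
    """Start offsets of the windows that stride() yields for an input of `length`."""
    data = list(range(length))  # element value == its own offset
    return [w[0] for w in stride(data, stride_size, window_size)]


# One 1000 ms chunk: only a single window fits, so nothing can overlap.
one_chunk = window_starts(1000, stride_size=500, window_size=1000)

# Two seconds of audio: three windows, each sharing 500 ms with its neighbour.
two_seconds = window_starts(2000, stride_size=500, window_size=1000)
```

The number of windows is `(length - window_size) // stride_size + 1` whenever at least one window fits, which is why sending more audio per request is what makes the 500 ms stride take effect.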

riatzukiza commented on August 22, 2024

So does the listen endpoint accept a WAV file or a PCM buffer? I'm trying to use the HTTP server from a new client. I saw that there is some compression going on in there?

All I know is that when I sent a one-second-long WAV file (compressed and base64-encoded), I got results, but they were not as good as the demo.

Would the sample rate also affect this? The microphone defaults to 44100 Hz.

daemon commented on August 22, 2024

Yes, you're correct about the compression/b64 encoding. It was simply a hack to get the demo working in a short amount of time -- using WebSockets would have been a much better way to do it. The sample rate must be 16 kHz, so if you have only 44.1 kHz audio, you'll need to downsample first.
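As a rough illustration of the downsampling step, the sketch below converts 44.1 kHz samples to 16 kHz using plain linear interpolation. It is an assumption-laden example, not code from honk: `resample_linear` is a hypothetical helper, and it applies no anti-aliasing low-pass filter, so a proper polyphase resampler (for example `scipy.signal.resample_poly(x, 160, 441)`) would likely give cleaner audio.

```python
from typing import List, Sequence


def resample_linear(samples: Sequence[float], src_rate: int, dst_rate: int) -> List[float]:
    """Naive resampler: linearly interpolate between neighbouring input samples.

    No low-pass filtering is done, so content above dst_rate / 2 will alias.
    """
    if not samples:
        return []
    n_out = len(samples) * dst_rate // src_rate
    out = []
    for k in range(n_out):
        pos = k * src_rate / dst_rate          # fractional position in the input
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, len(samples) - 1)]  # clamp at the last sample
        out.append(samples[i] * (1 - frac) + nxt * frac)
    return out


# One second of silence recorded at 44.1 kHz becomes 16000 samples at 16 kHz.
one_second = [0.0] * 44100
resampled = resample_linear(one_second, 44100, 16000)
```

The resampled buffer would then go through whatever encoding the client already applies before it is sent to the server.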
