Comments (42)

basicasicmatrix commented on August 25, 2024

I wonder if a good fit for GPU support might be a Vosk Custom Backend for Nvidia's Triton Inference Server?

Triton is BSD-3 licensed, with some good groundwork already done by Nvidia's team for Kaldi inference. A Vosk twist on this could bring a lot of value! :-)

GaetanLepage commented on August 25, 2024

OK, thank you very much for the quick answer.
I will give it a try :)

nshmyrev commented on August 25, 2024

@nshmyrev Is it possible to add a method to allow using the GPU in the C# NuGet package?

Yes, sure, it is 2 lines ;)

sskorol commented on August 25, 2024

@vonguyen1982 you should build Vosk with GPU support on your own. Published versions don't use CUDA. But if you use Docker, you can check some prebuilt images for arm64/amd64.

nshmyrev commented on August 25, 2024

Hi

Kaldi decoders are not supposed to work with the GPU this way. Decoding is CPU-bound due to the search; you can't gain much from using a GPU.

The only way you can make it fast is to use the batched GPU decoder, but that is another story.

What is the goal you are actually trying to achieve?

hairyone commented on August 25, 2024

I added a function to init the GPU

void KaldiRecognizer::InitGpu() {
    kaldi::CuDevice::Instantiate().SelectGpuId("yes");
    kaldi::CuDevice::Instantiate().AllowMultithreading();
}

Which is called here ...

async def recognize(websocket, path):
    rec = KaldiRecognizer(model)
    rec.InitGpu()
    while True:
        message = await websocket.recv()
        response, stop = await loop.run_in_executor(pool, process_chunk, rec, message)
        await websocket.send(response)
        if stop: break
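
For context, here is a minimal sketch of the pieces this excerpt assumes but does not show (model, pool, loop, and process_chunk). The module name comes from the makefile below; the assumption that the SWIG wrapper exports Model and KaldiRecognizer, and the shape of process_chunk, are guesses:

import asyncio
import concurrent.futures

# SWIG module built by the makefile below (assumed exports)
from kaldi_recognizer import Model, KaldiRecognizer

model = Model("model")
pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)
loop = asyncio.get_event_loop()

def process_chunk(rec, message):
    # Feed one chunk of audio and return (result_json, stop_flag); the real
    # helper also decides when the stream has finished.
    if rec.AcceptWaveform(message):
        return rec.Result(), False
    return rec.PartialResult(), False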

and here is the amended makefile

KALDI_ROOT ?= $(HOME)/kaldi

CXX := g++

ATLASLIBS := /usr/lib/libatlas.so.3 /usr/lib/libf77blas.so.3 /usr/lib/libcblas.so.3 /usr/lib/liblapack_atlas.so.3

KALDI_FLAGS := \
    -DKALDI_DOUBLEPRECISION=0 -DHAVE_POSIX_MEMALIGN \
    -Wno-sign-compare -Wno-unused-local-typedefs -Winit-self \
    -DHAVE_EXECINFO_H=1 -rdynamic -DHAVE_CXXABI_H -DHAVE_ATLAS \
    -I$(KALDI_ROOT)/tools/ATLAS/include \
    -I$(KALDI_ROOT)/tools/openfst/include -I$(KALDI_ROOT)/src

CUDA_FLAGS := \
    -DHAVE_CUDA=1 -I/usr/local/cuda/include

CXXFLAGS := -std=c++11 -g -Wall -DPIC -fPIC $(KALDI_FLAGS) $(CUDA_FLAGS) `pkg-config --cflags python3`

KALDI_LIBS = \
    -rdynamic -Wl,-rpath=$(KALDI_ROOT)/tools/openfst/lib \
    $(KALDI_ROOT)/src/online2/kaldi-online2.a \
    $(KALDI_ROOT)/src/decoder/kaldi-decoder.a \
    $(KALDI_ROOT)/src/ivector/kaldi-ivector.a \
    $(KALDI_ROOT)/src/gmm/kaldi-gmm.a \
    $(KALDI_ROOT)/src/nnet3/kaldi-nnet3.a \
    $(KALDI_ROOT)/src/tree/kaldi-tree.a \
    $(KALDI_ROOT)/src/feat/kaldi-feat.a \
    $(KALDI_ROOT)/src/lat/kaldi-lat.a \
    $(KALDI_ROOT)/src/hmm/kaldi-hmm.a \
    $(KALDI_ROOT)/src/transform/kaldi-transform.a \
    $(KALDI_ROOT)/src/cudamatrix/kaldi-cudamatrix.a \
    $(KALDI_ROOT)/src/matrix/kaldi-matrix.a \
    $(KALDI_ROOT)/src/fstext/kaldi-fstext.a \
    $(KALDI_ROOT)/src/util/kaldi-util.a \
    $(KALDI_ROOT)/src/base/kaldi-base.a \
    -L $(KALDI_ROOT)/tools/openfst/lib -lfst \
    $(ATLASLIBS) \
    `pkg-config --libs python3` \
    -lm -lpthread

CUDA_LIBS := \
    -Wl,-rpath=/usr/local/cuda/lib64 \
    -Wl,-rpath=/usr/lib/x86_64-linux-gnu \
    -L /usr/local/cuda/lib64 \
    -L /usr/lib/x86_64-linux-gnu \
    -lcublas -lcusparse -lcudart -lcurand -lcufft -lnvToolsExt -lcusolver

all: _kaldi_recognizer.so

_kaldi_recognizer.so: kaldi_recognizer_wrap.cc kaldi_recognizer.cc model.cc
    $(CXX) $(CXXFLAGS) -shared -o $@ kaldi_recognizer.cc model.cc kaldi_recognizer_wrap.cc $(KALDI_LIBS) $(CUDA_LIBS)

kaldi_recognizer_wrap.cc: kaldi_recognizer.i
    swig -threads -python -c++ -o kaldi_recognizer_wrap.cc kaldi_recognizer.i

clean:
    $(RM) *.so kaldi_recognizer_wrap.cc *.o *.pyc kaldi_recognizer.py

hairyone commented on August 25, 2024

I was told by a colleague that compiling Kaldi with GPU support would run the matrix operations in parallel on the GPU, yielding a marginal performance improvement, or at least taking some of the load off the CPUs.

hairyone commented on August 25, 2024

Just for completeness, here is my Dockerfile:

# FROM ubuntu:16.04
# FROM debian:9.8
# FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
# FROM nvidia/cuda:9.0-devel-ubuntu16.04
# FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04
  FROM nvidia/cuda:10.2-devel-ubuntu16.04

################################################################################
# get the packages we need
################################################################################

RUN apt-get update \
&&  apt-get install -y --no-install-recommends \
       g++ make automake autoconf bzip2 unzip wget libtool git subversion \
       sox python2.7 python3 python3-dev python3-websockets pkg-config \
       zlib1g-dev patch libatlas-dev libxml2 ca-certificates swig \
       libatlas3-base vim \
&&  rm -rf /var/lib/apt/lists/*

################################################################################
# install cuda
################################################################################
# ARG CUDA_VERSION=cuda_8.0.61_375.26_linux-run
# ARG CUDA_VERSION=cuda_10.0.130_410.48_linux.run
# ARG CUDA_VERSION=cuda_10.1.243_418.87.00_linux.run
# ARG CUDA_VERSION=cuda_10.2.89_440.33.01_linux.run
# ADD ${CUDA_VERSION} /opt/cuda/${CUDA_VERSION}
# RUN cd /opt/cuda \
# &&  sh ${CUDA_VERSION} --silent --toolkit --samples \
# &&  rm ${CUDA_VERSION}

# ADD NVIDIA_CUDA-8.0_Samples /root/NVIDIA_CUDA-8.0_Samples

################################################################################
# compile and install kaldi
################################################################################
ADD kaldi-master /opt/kaldi

RUN cd /opt/kaldi \
&&  cd /opt/kaldi/tools \
&&  make -j $(nproc) \
&&  cd /opt/kaldi/src \
&&  ./configure --mathlib=ATLAS --shared \
&&  make depend -j $(nproc) \
&&  make -j $(nproc) online2 \
&&  find /opt/kaldi -name "*.o" | xargs rm

################################################################################
# compile the kaldi_recogniser shared library with python binding
################################################################################
ADD kaldi-websocket-python-master /opt/kaldi-websocket

RUN cd /opt/kaldi-websocket \
&&  KALDI_ROOT=/opt/kaldi make

# &&  cd /opt/kaldi/src \
# &&  make clean

################################################################################
# install the language model
################################################################################
ADD model-en-f1 /opt/kaldi-en/model

################################################################################
# server config
################################################################################
EXPOSE 2700
WORKDIR /opt/kaldi-websocket
CMD [ "python3", "./asr_server.py", "/opt/kaldi-en/model" ]

nshmyrev commented on August 25, 2024

It is not going to work this way, because the search takes >60% of the time on the CPU, and the GPU will just be waiting for the CPU to finish.

You need to wait till kaldi-asr/kaldi#3568 lands in Kaldi; it is currently a work in progress.

If you need faster processing, it is more straightforward to tune the beams, compile with MKL, and use a smaller model.

hairyone commented on August 25, 2024

It is not going to work this way, because the search takes >60% of the time on the CPU, and the GPU will just be waiting for the CPU to finish.

You need to wait till kaldi-asr/kaldi#3568 lands in Kaldi; it is currently a work in progress.

If you need faster processing, it is more straightforward to tune the beams, compile with MKL, and use a smaller model.

Thanks for the response, Nick :)

Without my realising it, you have probably worked with my colleague Nazim. His comment was that he had added GPU support to the kaldi-gstreamer implementation, and he did see a difference.

I prefer your implementation to the kaldi-gstreamer one, so I wanted to add GPU support so I could compare them side by side for a similar number of streams.

I was told that compiling Kaldi with GPU support was largely transparent to the user, in the sense that certain matrix operations would be moved to the GPU, so I am surprised at the error above.

nshmyrev commented on August 25, 2024

I see, cool! Greetings to Nazim!

Well, the issue you see is due to multithreaded access, I suppose: there are multiple worker threads, and the device needs some locking. GStreamer uses processes, so it is easier to access the memory.

I would start with the CUDA_LAUNCH_BLOCKING environment variable; it will probably fix the concurrency issue.

I might look at this issue a little later.
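
For reference, CUDA_LAUNCH_BLOCKING=1 forces kernel launches to run synchronously, so a failure is reported at the call that caused it rather than at some later API call. A minimal way to set it from inside the server script, assuming it runs before the first CUDA call:

import os

# Must be set before CUDA is initialised; makes kernel launches synchronous
# so errors surface where they actually happen.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"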

hairyone commented on August 25, 2024

Thanks, Nick.

I will pass on a hello to Nazim.

Regarding the thread access: at the moment I am only sending a single stream, in which case I would not expect to see any concurrency issues. Nevertheless, I will test with CUDA_LAUNCH_BLOCKING and report back.

Again, thanks for responding so promptly. If you do decide to investigate and you need anything from me, please let me know.

nshmyrev commented on August 25, 2024

There is still a worker pool with many threads, and they can work simultaneously; see the Python code. I suspect that is the case here. I'll let you know.

hairyone commented on August 25, 2024

When I ran with CUDA_LAUNCH_BLOCKING there was more error info:

server --min-active=200 --max-active=6000 --beam=13.0 --lattice-beam=6.0 --acoustic-scale=1.0 --frame-subsampling-factor=3 --endpoint.silence-phones=1:2:3:4:5:6:7:8:9:10 --endpoint.rule2.min-trailing-silence=0.5 --endpoint.rule3.min-trailing-silence=1.0 --endpoint.rule4.min-trailing-silence=2.0
LOG (server[5.5]:Model():model.cc:47) Sample rate is 8000
LOG (server[5.5]:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (server[5.5]:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (server[5.5]:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 1 orphan nodes.
LOG (server[5.5]:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 2 orphan components.
LOG (server[5.5]:Collapse():nnet-utils.cc:1472) Added 1 components, removed 2
LOG (server[5.5]:CompileLooped():nnet-compile-looped.cc:345) Spent 0.0146899 seconds in looped compilation.
WARNING (server[5.5]:SelectGpuId():cu-device.cc:228) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG (server[5.5]:SelectGpuIdAuto():cu-device.cc:408) Selecting from 1 GPUs
LOG (server[5.5]:SelectGpuIdAuto():cu-device.cc:423) cudaSetDevice(0): GeForce GTX 1070 free:8022M, used:97M, total:8119M, free/total:0.987992
LOG (server[5.5]:SelectGpuIdAuto():cu-device.cc:471) Device: 0, mem_ratio: 0.987992
LOG (server[5.5]:SelectGpuId():cu-device.cc:352) Trying to select device: 0
LOG (server[5.5]:SelectGpuIdAuto():cu-device.cc:481) Success selecting device 0 free mem ratio: 0.987992
LOG (server[5.5]:FinalizeActiveGpu():cu-device.cc:308) The active GPU is [0]: GeForce GTX 1070  free:7834M, used:285M, total:8119M, free/total:0.964838 version 6.1
ERROR (server[5.5]:CopyRows():cu-matrix.cc:2691) cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaGetLastError()'

[ Stack-Trace: ]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::MessageLogger::LogMessage() const+0x82c) [0x7f3e6f2492de]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x2e) [0x7f3e6ee35f3c]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::CuMatrixBase<float>::CopyRows(kaldi::CuMatrixBase<float> const&, kaldi::CuArrayBase<int> const&)+0x251) [0x7f3e6f145e5b]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::NnetComputer::ExecuteCommand()+0xb1f) [0x7f3e6ef88ab1]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::NnetComputer::Run()+0x18a) [0x7f3e6ef89582]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::DecodableNnetLoopedOnlineBase::AdvanceChunk()+0x4a8) [0x7f3e6ef98d74]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::DecodableAmNnetLoopedOnline::LogLikelihood(int, int)+0x51) [0x7f3e6ef99047]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl<float> >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > >, kaldi::decoder::BackpointerToken>::ProcessEmitting(kaldi::DecodableInterface*)+0x22b) [0x7f3e6ee9f173]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl<float> >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > >, kaldi::decoder::BackpointerToken>::AdvanceDecoding(kaldi::DecodableInterface*, int)+0x97) [0x7f3e6ee9f575]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >, kaldi::decoder::BackpointerToken>::AdvanceDecoding(kaldi::DecodableInterface*, int)+0x74) [0x7f3e6eea00c8]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::SingleUtteranceNnet3DecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > > >::AdvanceDecoding()+0x19) [0x7f3e6ee7ee8f]
/opt/kaldi-websocket/_kaldi_recognizer.so(KaldiRecognizer::AcceptWaveform(char const*, int)+0x10a) [0x7f3e6ee34d8a]
/opt/kaldi-websocket/_kaldi_recognizer.so(+0x2c29c4) [0x7f3e6ee6d9c4]
python3(PyCFunction_Call+0x4f) [0x4e12df]
python3(PyEval_EvalFrameEx+0x614) [0x530b94]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3537]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_EvalFrameEx+0x24a2) [0x532a22]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3537]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_EvalFrameEx+0x24a2) [0x532a22]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3423]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3() [0x4f08be]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_CallObjectWithKeywords+0x30) [0x525d00]
python3() [0x626bb2]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f3e7325e6ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f3e72f9441d]

WARNING (server[5.5]:ExecuteCommand():nnet-compute.cc:436) Printing some background info since error was detected
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:437) matrix m1(50, 40), m2(3, 100), m3(46, 220), m4(46, 1024), m5(15, 4096), m6(15, 1024), m7(13, 3072), m8(13, 1024), m9(11, 3072), m10(11, 1024), m11(9, 3072), m12(9, 1024), m13(7, 3072), m14(7, 1024), m15(7, 1024), m16(7, 8850), m17(21, 40), m18(1, 100), m19(21, 220), m20(21, 1024), m21(7, 4096), m22(7, 3072), m23(7, 3072), m24(7, 3072), m25(7, 3072), m26(7, 1024), m27(7, 1024), m28(7, 8850), m29(21, 40), m30(1, 100), m31(21, 220), m32(21, 1024), m33(7, 4096), m34(7, 3072), m35(7, 3072), m36(7, 3072), m37(7, 3072), m38(7, 1024), m39(7, 1024), m40(7, 8850), m41(21, 40), m42(1, 100)
# The following show how matrices correspond to network-nodes and
# cindex-ids.  Format is: matrix = <node-id>.[value|deriv][ <list-of-cindex-ids> ]
# where a cindex-id is written as (n,t[,x]) but ranges of t values are compressed
# so we write (n, tfirst:tlast).
m1 == value: input[(0,-17:32)]
m2 == value: ivector[(0,-21), (0,0), (0,21)]
m3 == value: Tdnn_0_affine_input[(0,-16:29)]
m4 == value: Tdnn_0_affine[(0,-16:29)]
m5 == value: Tdnn_1_affine_input[(0,-15), (0,-12), (0,-9), (0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18), (0,21), (0,24), (0,27)]
m6 == value: Tdnn_1_affine[(0,-15), (0,-12), (0,-9), (0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18), (0,21), (0,24), (0,27)]
m7 == value: Tdnn_2_affine_input[(0,-12), (0,-9), (0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18), (0,21), (0,24)]
m8 == value: Tdnn_2_affine[(0,-12), (0,-9), (0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18), (0,21), (0,24)]
m9 == value: Tdnn_3_affine_input[(0,-9), (0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18), (0,21)]
m10 == value: Tdnn_3_affine[(0,-9), (0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18), (0,21)]
m11 == value: Tdnn_4_affine_input[(0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18)]
m12 == value: Tdnn_4_affine[(0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18)]
m13 == value: Tdnn_5_affine_input[(0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18)]
m14 == value: Tdnn_5_affine[(0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18)]
m15 == value: Tdnn_pre_final_chain_affine[(0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18)]
m16 == value: Final_affine[(0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18)]
m17 == value: input[(0,33:53)]
m18 == value: ivector[(0,42)]
m19 == value: Tdnn_0_affine_input[(0,30:50)]
m20 == value: Tdnn_0_affine[(0,30:50)]
m21 == value: Tdnn_1_affine_input[(0,30), (0,33), (0,36), (0,39), (0,42), (0,45), (0,48)]
m22 == value: Tdnn_2_affine_input[(0,27), (0,30), (0,33), (0,36), (0,39), (0,42), (0,45)]
m23 == value: Tdnn_3_affine_input[(0,24), (0,27), (0,30), (0,33), (0,36), (0,39), (0,42)]
m24 == value: Tdnn_4_affine_input[(0,21), (0,24), (0,27), (0,30), (0,33), (0,36), (0,39)]
m25 == value: Tdnn_5_affine_input[(0,21), (0,24), (0,27), (0,30), (0,33), (0,36), (0,39)]
m26 == value: Tdnn_5_affine[(0,21), (0,24), (0,27), (0,30), (0,33), (0,36), (0,39)]
m27 == value: Tdnn_pre_final_chain_affine[(0,21), (0,24), (0,27), (0,30), (0,33), (0,36), (0,39)]
m28 == value: Final_affine[(0,21), (0,24), (0,27), (0,30), (0,33), (0,36), (0,39)]
m29 == value: input[(0,54:74)]
m30 == value: ivector[(0,63)]
m31 == value: Tdnn_0_affine_input[(0,51:71)]
m32 == value: Tdnn_0_affine[(0,51:71)]
m33 == value: Tdnn_1_affine_input[(0,51), (0,54), (0,57), (0,60), (0,63), (0,66), (0,69)]
m34 == value: Tdnn_2_affine_input[(0,48), (0,51), (0,54), (0,57), (0,60), (0,63), (0,66)]
m35 == value: Tdnn_3_affine_input[(0,45), (0,48), (0,51), (0,54), (0,57), (0,60), (0,63)]
m36 == value: Tdnn_4_affine_input[(0,42), (0,45), (0,48), (0,51), (0,54), (0,57), (0,60)]
m37 == value: Tdnn_5_affine_input[(0,42), (0,45), (0,48), (0,51), (0,54), (0,57), (0,60)]
m38 == value: Tdnn_5_affine[(0,42), (0,45), (0,48), (0,51), (0,54), (0,57), (0,60)]
m39 == value: Tdnn_pre_final_chain_affine[(0,42), (0,45), (0,48), (0,51), (0,54), (0,57), (0,60)]
m40 == value: Final_affine[(0,42), (0,45), (0,48), (0,51), (0,54), (0,57), (0,60)]
m41 == value: input[(0,75:95)]
m42 == value: ivector[(0,84)]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c0: m1 = user input [for node: 'input']
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c1: m2 = user input [for node: 'ivector']
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c2: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c3: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c4: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c5: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c6: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c7: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c8: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c9: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c10: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c11: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c12: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c13: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c14: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c15: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c16: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c17: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c18: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c19: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c20: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c21: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c22: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c23: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c24: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c25: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c26: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c27: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c28: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c29: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c30: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c31: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c32: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c33: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c34: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c35: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c36: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c37: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c38: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c39: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c40: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c41: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c42: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c43: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c44: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c45: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c46: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c47: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c48: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c49: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c50: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c51: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c52: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c53: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c54: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c55: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c56: [no-op-permanent]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c57: m3 = undefined(46,220)
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c58: m3(0:45, 0:39) = m1(0:45, 0:39)
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c59: m3(0:45, 40:79) = m1(1:46, 0:39)
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c60: m3(0:45, 80:119) = m1(2:47, 0:39)
ERROR (server[5.5]:ExecuteCommand():nnet-compute.cc:443) Error running command c61: m3(0:45, 120:219).CopyRows(1, m2[0x16, 1x21, 2x9])

[ Stack-Trace: ]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::MessageLogger::LogMessage() const+0x82c) [0x7f3e6f2492de]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x2e) [0x7f3e6ee35f3c]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::NnetComputer::ExecuteCommand()+0x13d1) [0x7f3e6ef89363]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::NnetComputer::Run()+0x18a) [0x7f3e6ef89582]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::DecodableNnetLoopedOnlineBase::AdvanceChunk()+0x4a8) [0x7f3e6ef98d74]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::DecodableAmNnetLoopedOnline::LogLikelihood(int, int)+0x51) [0x7f3e6ef99047]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl<float> >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > >, kaldi::decoder::BackpointerToken>::ProcessEmitting(kaldi::DecodableInterface*)+0x22b) [0x7f3e6ee9f173]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl<float> >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > >, kaldi::decoder::BackpointerToken>::AdvanceDecoding(kaldi::DecodableInterface*, int)+0x97) [0x7f3e6ee9f575]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >, kaldi::decoder::BackpointerToken>::AdvanceDecoding(kaldi::DecodableInterface*, int)+0x74) [0x7f3e6eea00c8]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::SingleUtteranceNnet3DecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > > >::AdvanceDecoding()+0x19) [0x7f3e6ee7ee8f]
/opt/kaldi-websocket/_kaldi_recognizer.so(KaldiRecognizer::AcceptWaveform(char const*, int)+0x10a) [0x7f3e6ee34d8a]
/opt/kaldi-websocket/_kaldi_recognizer.so(+0x2c29c4) [0x7f3e6ee6d9c4]
python3(PyCFunction_Call+0x4f) [0x4e12df]
python3(PyEval_EvalFrameEx+0x614) [0x530b94]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3537]
python3() [0x4e3537]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_EvalFrameEx+0x24a2) [0x532a22]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3537]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_EvalFrameEx+0x24a2) [0x532a22]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3423]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3() [0x4f08be]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_CallObjectWithKeywords+0x30) [0x525d00]
python3() [0x626bb2]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f3e7325e6ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f3e72f9441d]

terminate called after throwing an instance of 'kaldi::KaldiFatalError'
  what():  kaldi::KaldiFatalError

hairyone commented on August 25, 2024

It is not going to work this way, because the search takes >60% of the time on the CPU, and the GPU will just be waiting for the CPU to finish.

You need to wait till kaldi-asr/kaldi#3568 lands in Kaldi; it is currently a work in progress.

If you need faster processing, it is more straightforward to tune the beams, compile with MKL, and use a smaller model.

Here's what Nazim said :)

Yes, the GPU will wait, but what is wrong with that I don't understand.
It is just doing some of the computation to help.
And the idea is to send a lot of requests to the GPU from different streams.
If matrix multiplication is done on the CPU in, let's say, 1 second, but on the GPU in 0.1 seconds, you will do it in 0.9 seconds less time.
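
(For intuition, Amdahl's law bounds what this offload can buy: if a fraction s of the runtime is the search that stays on the CPU, and the remaining 1-s is accelerated by a factor k on the GPU, the overall speedup is

    S = 1 / (s + (1 - s)/k)  <=  1/s

so with s ≈ 0.6, as stated earlier in the thread, even an infinitely fast GPU gives at most about 1/0.6 ≈ 1.67x.)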

nshmyrev commented on August 25, 2024

This code is probably missing the fix described in https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/

YunzhaoLu commented on August 25, 2024

It can work this way:

void KaldiRecognizer::InitGpu() {
    kaldi::CuDevice::Instantiate().SelectGpuId("yes");
    kaldi::CuDevice::Instantiate().AllowMultithreading();
}

I think the problem is here:

async def recognize(websocket, path):
    rec = KaldiRecognizer(model)
    rec.InitGpu()
    while True:
        message = await websocket.recv()
        response, stop = await loop.run_in_executor(pool, process_chunk, rec, message)
        await websocket.send(response)
        if stop: break

I had a problem before because InitGpu might be called multiple times. I just make sure that InitGpu is called only once!

hairyone commented on August 25, 2024

As suggested, I amended the code so that the GPU initialisation is called just once, before the server is started.

async def recognize(websocket, path):
    rec = KaldiRecognizer(model)
    while True:
        message = await websocket.recv()
        response, stop = await loop.run_in_executor(pool, process_chunk, rec, message)
        await websocket.send(response)
        if stop: break

gpu = Gpu()
gpu.Init()

start_server = websockets.serve(
    recognize, '0.0.0.0', 2700)

loop.run_until_complete(start_server)
loop.run_forever()

The Gpu wrapper itself is implemented as:

#include "gpu.h"

Gpu::Gpu() { }

void Gpu::Init() {
    kaldi::CuDevice::Instantiate().SelectGpuId("yes");
    kaldi::CuDevice::Instantiate().AllowMultithreading();
}

Gpu::~Gpu() { }

However I still get the error below:

server --min-active=200 --max-active=6000 --beam=13.0 --lattice-beam=6.0 --acoustic-scale=1.0 --frame-subsampling-factor=3 --endpoint.silence-phones=1:2:3:4:5:6:7:8:9:10 --endpoint.rule2.min-trailing-silence=0.5 --endpoint.rule3.min-trailing-silence=1.0 --endpoint.rule4.min-trailing-silence=2.0
LOG (server[5.5]:Model():model.cc:47) Sample rate is 8000
LOG (server[5.5]:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (server[5.5]:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (server[5.5]:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 1 orphan nodes.
LOG (server[5.5]:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 2 orphan components.
LOG (server[5.5]:Collapse():nnet-utils.cc:1472) Added 1 components, removed 2
LOG (server[5.5]:CompileLooped():nnet-compile-looped.cc:345) Spent 0.0140841 seconds in looped compilation.
WARNING (server[5.5]:SelectGpuId():cu-device.cc:228) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG (server[5.5]:SelectGpuIdAuto():cu-device.cc:408) Selecting from 1 GPUs
LOG (server[5.5]:SelectGpuIdAuto():cu-device.cc:423) cudaSetDevice(0): GeForce GTX 1070 free:8022M, used:97M, total:8119M, free/total:0.987992
LOG (server[5.5]:SelectGpuIdAuto():cu-device.cc:471) Device: 0, mem_ratio: 0.987992
LOG (server[5.5]:SelectGpuId():cu-device.cc:352) Trying to select device: 0
LOG (server[5.5]:SelectGpuIdAuto():cu-device.cc:481) Success selecting device 0 free mem ratio: 0.987992
LOG (server[5.5]:FinalizeActiveGpu():cu-device.cc:308) The active GPU is [0]: GeForce GTX 1070  free:7834M, used:285M, total:8119M, free/total:0.964838 version 6.1
ERROR (server[5.5]:CopyRows():cu-matrix.cc:2691) cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaGetLastError()'

[ Stack-Trace: ]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::MessageLogger::LogMessage() const+0x82c) [0x7f91b4683282]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x2e) [0x7f91b426f1cc]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::CuMatrixBase<float>::CopyRows(kaldi::CuMatrixBase<float> const&, kaldi::CuArrayBase<int> const&)+0x251) [0x7f91b457fdff]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::NnetComputer::ExecuteCommand()+0xb1f) [0x7f91b43c2a55]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::NnetComputer::Run()+0x18a) [0x7f91b43c3526]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::DecodableNnetLoopedOnlineBase::AdvanceChunk()+0x4a8) [0x7f91b43d2d18]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::DecodableAmNnetLoopedOnline::LogLikelihood(int, int)+0x51) [0x7f91b43d2feb]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl<float> >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > >, kaldi::decoder::BackpointerToken>::ProcessEmitting(kaldi::DecodableInterface*)+0x22b) [0x7f91b42d9117]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl<float> >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > >, kaldi::decoder::BackpointerToken>::AdvanceDecoding(kaldi::DecodableInterface*, int)+0x97) [0x7f91b42d9519]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >, kaldi::decoder::BackpointerToken>::AdvanceDecoding(kaldi::DecodableInterface*, int)+0x74) [0x7f91b42da06c]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::SingleUtteranceNnet3DecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > > >::AdvanceDecoding()+0x19) [0x7f91b42b8e33]
/opt/kaldi-websocket/_kaldi_recognizer.so(KaldiRecognizer::AcceptWaveform(char const*, int)+0x10a) [0x7f91b426e01a]
/opt/kaldi-websocket/_kaldi_recognizer.so(+0x2c353f) [0x7f91b42a753f]
python3(PyCFunction_Call+0x4f) [0x4e12df]
python3(PyEval_EvalFrameEx+0x614) [0x530b94]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3537]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_EvalFrameEx+0x24a2) [0x532a22]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3537]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_EvalFrameEx+0x24a2) [0x532a22]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3423]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3() [0x4f08be]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_CallObjectWithKeywords+0x30) [0x525d00]
python3() [0x626bb2]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f91b86996ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f91b83cf41d]

WARNING (server[5.5]:ExecuteCommand():nnet-compute.cc:436) Printing some background info since error was detected
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:437) matrix m1(50, 40), m2(3, 100), m3(46, 220), m4(46, 1024), m5(15, 4096), m6(15, 1024), m7(13, 3072), m8(13, 1024), m9(11, 3072), m10(11, 1024), m11(9, 3072), m12(9, 1024), m13(7, 3072), m14(7, 1024), m15(7, 1024), m16(7, 8850), m17(21, 40), m18(1, 100), m19(21, 220), m20(21, 1024), m21(7, 4096), m22(7, 3072), m23(7, 3072), m24(7, 3072), m25(7, 3072), m26(7, 1024), m27(7, 1024), m28(7, 8850), m29(21, 40), m30(1, 100), m31(21, 220), m32(21, 1024), m33(7, 4096), m34(7, 3072), m35(7, 3072), m36(7, 3072), m37(7, 3072), m38(7, 1024), m39(7, 1024), m40(7, 8850), m41(21, 40), m42(1, 100)
# The following show how matrices correspond to network-nodes and
# cindex-ids.  Format is: matrix = <node-id>.[value|deriv][ <list-of-cindex-ids> ]
# where a cindex-id is written as (n,t[,x]) but ranges of t values are compressed
# so we write (n, tfirst:tlast).
m1 == value: input[(0,-17:32)]
m2 == value: ivector[(0,-21), (0,0), (0,21)]
m3 == value: Tdnn_0_affine_input[(0,-16:29)]
m4 == value: Tdnn_0_affine[(0,-16:29)]
m5 == value: Tdnn_1_affine_input[(0,-15), (0,-12), (0,-9), (0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18), (0,21), (0,24), (0,27)]
m6 == value: Tdnn_1_affine[(0,-15), (0,-12), (0,-9), (0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18), (0,21), (0,24), (0,27)]
m7 == value: Tdnn_2_affine_input[(0,-12), (0,-9), (0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18), (0,21), (0,24)]
m8 == value: Tdnn_2_affine[(0,-12), (0,-9), (0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18), (0,21), (0,24)]
m9 == value: Tdnn_3_affine_input[(0,-9), (0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18), (0,21)]
m10 == value: Tdnn_3_affine[(0,-9), (0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18), (0,21)]
m11 == value: Tdnn_4_affine_input[(0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18)]
m12 == value: Tdnn_4_affine[(0,-6), (0,-3), (0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18)]
m13 == value: Tdnn_5_affine_input[(0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18)]
m14 == value: Tdnn_5_affine[(0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18)]
m15 == value: Tdnn_pre_final_chain_affine[(0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18)]
m16 == value: Final_affine[(0,0), (0,3), (0,6), (0,9), (0,12), (0,15), (0,18)]
m17 == value: input[(0,33:53)]
m18 == value: ivector[(0,42)]
m19 == value: Tdnn_0_affine_input[(0,30:50)]
m20 == value: Tdnn_0_affine[(0,30:50)]
m21 == value: Tdnn_1_affine_input[(0,30), (0,33), (0,36), (0,39), (0,42), (0,45), (0,48)]
m22 == value: Tdnn_2_affine_input[(0,27), (0,30), (0,33), (0,36), (0,39), (0,42), (0,45)]
m23 == value: Tdnn_3_affine_input[(0,24), (0,27), (0,30), (0,33), (0,36), (0,39), (0,42)]
m24 == value: Tdnn_4_affine_input[(0,21), (0,24), (0,27), (0,30), (0,33), (0,36), (0,39)]
m25 == value: Tdnn_5_affine_input[(0,21), (0,24), (0,27), (0,30), (0,33), (0,36), (0,39)]
m26 == value: Tdnn_5_affine[(0,21), (0,24), (0,27), (0,30), (0,33), (0,36), (0,39)]
m27 == value: Tdnn_pre_final_chain_affine[(0,21), (0,24), (0,27), (0,30), (0,33), (0,36), (0,39)]
m28 == value: Final_affine[(0,21), (0,24), (0,27), (0,30), (0,33), (0,36), (0,39)]
m29 == value: input[(0,54:74)]
m30 == value: ivector[(0,63)]
m31 == value: Tdnn_0_affine_input[(0,51:71)]
m32 == value: Tdnn_0_affine[(0,51:71)]
m33 == value: Tdnn_1_affine_input[(0,51), (0,54), (0,57), (0,60), (0,63), (0,66), (0,69)]
m34 == value: Tdnn_2_affine_input[(0,48), (0,51), (0,54), (0,57), (0,60), (0,63), (0,66)]
m35 == value: Tdnn_3_affine_input[(0,45), (0,48), (0,51), (0,54), (0,57), (0,60), (0,63)]
m36 == value: Tdnn_4_affine_input[(0,42), (0,45), (0,48), (0,51), (0,54), (0,57), (0,60)]
m37 == value: Tdnn_5_affine_input[(0,42), (0,45), (0,48), (0,51), (0,54), (0,57), (0,60)]
m38 == value: Tdnn_5_affine[(0,42), (0,45), (0,48), (0,51), (0,54), (0,57), (0,60)]
m39 == value: Tdnn_pre_final_chain_affine[(0,42), (0,45), (0,48), (0,51), (0,54), (0,57), (0,60)]
m40 == value: Final_affine[(0,42), (0,45), (0,48), (0,51), (0,54), (0,57), (0,60)]
m41 == value: input[(0,75:95)]
m42 == value: ivector[(0,84)]

LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c0: m1 = user input [for node: 'input']
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c1: m2 = user input [for node: 'ivector']
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c2: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c3: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c4: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c5: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c6: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c7: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c8: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c9: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c10: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c11: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c12: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c13: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c14: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c15: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c16: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c17: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c18: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c19: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c20: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c21: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c22: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c23: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c24: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c25: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c26: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c27: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c28: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c29: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c30: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c31: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c32: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c33: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c34: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c35: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c36: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c37: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c38: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c39: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c40: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c41: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c42: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c43: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c44: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c45: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c46: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c47: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c48: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c49: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c50: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c51: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c52: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c53: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c54: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c55: [no-op]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c56: [no-op-permanent]
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c57: m3 = undefined(46,220)
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c58: m3(0:45, 0:39) = m1(0:45, 0:39)
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c59: m3(0:45, 40:79) = m1(1:46, 0:39)
LOG (server[5.5]:ExecuteCommand():nnet-compute.cc:439) c60: m3(0:45, 80:119) = m1(2:47, 0:39)
ERROR (server[5.5]:ExecuteCommand():nnet-compute.cc:443) Error running command c61: m3(0:45, 120:219).CopyRows(1, m2[0x16, 1x21, 2x9])

[ Stack-Trace: ]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::MessageLogger::LogMessage() const+0x82c) [0x7f91b4683282]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x2e) [0x7f91b426f1cc]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::NnetComputer::ExecuteCommand()+0x13d1) [0x7f91b43c3307]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::NnetComputer::Run()+0x18a) [0x7f91b43c3526]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::DecodableNnetLoopedOnlineBase::AdvanceChunk()+0x4a8) [0x7f91b43d2d18]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::DecodableAmNnetLoopedOnline::LogLikelihood(int, int)+0x51) [0x7f91b43d2feb]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl<float> >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > >, kaldi::decoder::BackpointerToken>::ProcessEmitting(kaldi::DecodableInterface*)+0x22b) [0x7f91b42d9117]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl<float> >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > >, kaldi::decoder::BackpointerToken>::AdvanceDecoding(kaldi::DecodableInterface*, int)+0x97) [0x7f91b42d9519]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >, kaldi::decoder::BackpointerToken>::AdvanceDecoding(kaldi::DecodableInterface*, int)+0x74) [0x7f91b42da06c]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::SingleUtteranceNnet3DecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > > >::AdvanceDecoding()+0x19) [0x7f91b42b8e33]
/opt/kaldi-websocket/_kaldi_recognizer.so(KaldiRecognizer::AcceptWaveform(char const*, int)+0x10a) [0x7f91b426e01a]
/opt/kaldi-websocket/_kaldi_recognizer.so(+0x2c353f) [0x7f91b42a753f]
python3(PyCFunction_Call+0x4f) [0x4e12df]
python3(PyEval_EvalFrameEx+0x614) [0x530b94]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3537]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_EvalFrameEx+0x24a2) [0x532a22]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3537]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_EvalFrameEx+0x24a2) [0x532a22]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3423]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3() [0x4f08be]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_CallObjectWithKeywords+0x30) [0x525d00]
python3() [0x626bb2]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f91b86996ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f91b83cf41d]

terminate called after throwing an instance of 'kaldi::KaldiFatalError'
  what():  kaldi::KaldiFatalError

hairyone commented on August 25, 2024

Could this problem be related to the pointer to the model in memory not being accessible to the GPU?

nshmyrev commented on August 25, 2024

@hairyone I've just pushed https://github.com/alphacep/kaldi-websocket-python/tree/gpu which should make it work, please test. It requires Python 3.7, by the way.

hairyone commented on August 25, 2024

@nshmyrev Thanks!!

I will try it out and let you know how I get on.

nshmyrev commented on August 25, 2024

One day we will need to support full GPU decoding, but not now.

adamreed90 commented on August 25, 2024

Is this still something that is planned to be brought into master?

nshmyrev commented on August 25, 2024

Is this still something that is planned to be brought into master?

We have it in our plans, of course, but no immediate ones; we will be busy with other things.

dgxlsir commented on August 25, 2024

When I compile your app inside a Docker container without GPU support, everything works fine.

I have made a few changes so that Kaldi is compiled with GPU support, and I am running the application inside a Docker container with NVIDIA GPU support.

But when I run the GPU version I get the error below. Do you have any idea what the problem might be?

server --min-active=200 --max-active=6000 --beam=13.0 --lattice-beam=6.0 --acoustic-scale=1.0 --frame-subsampling-factor=3 --endpoint.silence-phones=1:2:3:4:5:6:7:8:9:10 --endpoint.rule2.min-trailing-silence=0.5 --endpoint.rule3.min-trailing-silence=1.0 --endpoint.rule4.min-trailing-silence=2.0
LOG (server[5.5]:Model():model.cc:47) Sample rate is 8000
LOG (server[5.5]:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (server[5.5]:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (server[5.5]:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 1 orphan nodes.
LOG (server[5.5]:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 2 orphan components.
LOG (server[5.5]:Collapse():nnet-utils.cc:1472) Added 1 components, removed 2
LOG (server[5.5]:CompileLooped():nnet-compile-looped.cc:345) Spent 0.0133111 seconds in looped compilation.
WARNING (server[5.5]:SelectGpuId():cu-device.cc:228) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG (server[5.5]:SelectGpuIdAuto():cu-device.cc:408) Selecting from 1 GPUs
LOG (server[5.5]:SelectGpuIdAuto():cu-device.cc:423) cudaSetDevice(0): GeForce GTX 1070 free:8022M, used:97M, total:8119M, free/total:0.987992
LOG (server[5.5]:SelectGpuIdAuto():cu-device.cc:471) Device: 0, mem_ratio: 0.987992
LOG (server[5.5]:SelectGpuId():cu-device.cc:352) Trying to select device: 0
LOG (server[5.5]:SelectGpuIdAuto():cu-device.cc:481) Success selecting device 0 free mem ratio: 0.987992
LOG (server[5.5]:FinalizeActiveGpu():cu-device.cc:308) The active GPU is [0]: GeForce GTX 1070  free:7834M, used:285M, total:8119M, free/total:0.964838 version 6.1
ERROR (server[5.5]:CopyToMat():cu-matrix.cc:464) cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaMemcpy2DAsync(dst->Data(), dst_pitch, this->data_, src_pitch, width, this->num_rows_, cudaMemcpyDeviceToHost, cudaStreamPerThread)'

[ Stack-Trace: ]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::MessageLogger::LogMessage() const+0x82c) [0x7f852b86c2de]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x2e) [0x7f852b458f3c]
/opt/kaldi-websocket/_kaldi_recognizer.so(void kaldi::CuMatrixBase<float>::CopyToMat<float>(kaldi::MatrixBase<float>*, kaldi::MatrixTransposeType) const+0x1ea) [0x7f852b78bb52]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::CuMatrix<float>::Swap(kaldi::Matrix<float>*)+0x12f) [0x7f852b78cba5]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::Matrix<float>::Swap(kaldi::CuMatrix<float>*)+0x12) [0x7f852b78cc24]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::DecodableNnetLoopedOnlineBase::AdvanceChunk()+0x5a3) [0x7f852b5bbe6f]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::nnet3::DecodableAmNnetLoopedOnline::LogLikelihood(int, int)+0x51) [0x7f852b5bc047]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl<float> >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > >, kaldi::decoder::BackpointerToken>::ProcessEmitting(kaldi::DecodableInterface*)+0x22b) [0x7f852b4c2173]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl<float> >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl<float> > > > >, kaldi::decoder::BackpointerToken>::AdvanceDecoding(kaldi::DecodableInterface*, int)+0x97) [0x7f852b4c2575]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::LatticeFasterDecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >, kaldi::decoder::BackpointerToken>::AdvanceDecoding(kaldi::DecodableInterface*, int)+0x74) [0x7f852b4c30c8]
/opt/kaldi-websocket/_kaldi_recognizer.so(kaldi::SingleUtteranceNnet3DecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > > >::AdvanceDecoding()+0x19) [0x7f852b4a1e8f]
/opt/kaldi-websocket/_kaldi_recognizer.so(KaldiRecognizer::AcceptWaveform(char const*, int)+0x10a) [0x7f852b457d8a]
/opt/kaldi-websocket/_kaldi_recognizer.so(+0x2c29c4) [0x7f852b4909c4]
python3(PyCFunction_Call+0x4f) [0x4e12df]
python3(PyEval_EvalFrameEx+0x614) [0x530b94]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3537]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_EvalFrameEx+0x24a2) [0x532a22]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3537]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_EvalFrameEx+0x24a2) [0x532a22]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
python3(PyEval_EvalCodeEx+0x13b) [0x53a81b]
python3() [0x4e3423]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3() [0x4f08be]
python3(PyObject_Call+0x47) [0x5c3bd7]
python3(PyEval_CallObjectWithKeywords+0x30) [0x525d00]
python3() [0x626bb2]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f852f8816ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f852f5b741d]

terminate called after throwing an instance of 'kaldi::KaldiFatalError'

Hello, I am doing the same thing as you, using the GPU to decode my model (nnet3-chain). Have you done it?
I think if we use the GPU for the Viterbi decoding it must be faster than the CPU. Can you give me some advice? Thank you very much!

nshmyrev commented on August 25, 2024

@dgxlsir this looks the same as https://groups.google.com/forum/embed/#!topic/kaldi-help/zSNKoD5OHeU; it might be an issue with the CUDA version (too old/too new).

nshmyrev commented on August 25, 2024

@basicasicmatrix thanks, very useful link

sskorol commented on August 25, 2024

FYI, I've built two Docker images with GPU support for Jetson Xavier and Nano, based on Vosk 0.3.17. They work well so far.

GaetanLepage commented on August 25, 2024

Hello!
I am currently using Vosk for a research project and was wondering whether GPU support would be available anytime soon.

Thank you guys anyway for your nice work :)

sskorol commented on August 25, 2024

@GaetanLepage hi, GPU itself is supported. You just need to build Kaldi/Vosk with a special flag. You can check this merged PR for details: https://github.com/alphacep/vosk-api/pull/436/files

vonguyen1982 commented on August 25, 2024

How should I test the GPU with NuGet for C#?

sskorol commented on August 25, 2024

@vonguyen1982 I don't see corresponding methods that activate the GPU in the NuGet package. But you can create a PR with appropriate updates. Basically, you need to use this Python code as a reference and create the same API here and here. Then rebuild everything with the HAVE_CUDA flag and call GpuInit in the main thread of your app, when it has just started. For multithreaded code, you also have to use GpuThreadInit. You can check the difference between the two here.
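
A minimal sketch of that call pattern, using the Python API mentioned above as the reference (the model path, sample rate, and worker count here are made up):

from concurrent.futures import ThreadPoolExecutor
from vosk import Model, KaldiRecognizer, GpuInit, GpuThreadInit

GpuInit()  # once, in the main thread, right at startup

model = Model("model")

def transcribe(chunk):
    # chunk: raw PCM bytes; one recognizer per task for simplicity
    rec = KaldiRecognizer(model, 8000)
    rec.AcceptWaveform(chunk)
    return rec.FinalResult()

# The initializer runs GpuThreadInit once in every worker thread before decoding.
pool = ThreadPoolExecutor(max_workers=4, initializer=GpuThreadInit)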

GaetanLepage commented on August 25, 2024

@sskorol Thanks to the Dockerfile, the PR, and the Vosk instructions, I was able to make it work on my GPU!

I have two remaining questions/issues:

  • If I use it in several Python processes (using the multiprocessing library) I get some ASSERTION_FAILED (VoskAPI:IsComputeExclusive():cu-device.cc:362) Assertion failed: (cudaSuccess == cudaDeviceSynchronize()) errors.
    I called GpuInit() in my main application (before creating the processes) and GpuThreadInit() in each thread.
  • Is it possible for it to work on several GPUs? If so, what should I do?

Thanks once again for your help!

nshmyrev commented on August 25, 2024

I called GpuInit() in my main application (before creating the processes) and GpuThreadInit() in each thread.

You should be using threads then, not processes. Or, if you still want to use processes, you can call GpuInit inside every process after the fork.
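
A sketch of the process-based variant, with GpuInit called in each worker after the fork (everything apart from the vosk calls is made up):

from multiprocessing import Pool
from vosk import Model, KaldiRecognizer, GpuInit

model = None

def worker_init():
    GpuInit()               # once per process, after the fork
    global model
    model = Model("model")  # each process loads its own copy

def transcribe(chunk):
    rec = KaldiRecognizer(model, 8000)
    rec.AcceptWaveform(chunk)
    return rec.FinalResult()

pool = Pool(processes=2, initializer=worker_init)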

GaetanLepage commented on August 25, 2024

@nshmyrev: I indeed tested both:

  • Keeping multiprocessing and calling GpuInit() in each process
  • Using multithreading and calling GpuInit() (from the main app) and GpuThreadInit() in each thread

Performance is similar with both.
Is there a way I could use several GPUs at the same time?

nshmyrev commented on August 25, 2024

Performance is similar with both.

Ok, and what is the problem then?

Is there a way I could use several GPUs at the same time?

If you really want to use the GPU at full speed, you need to use the Kaldi CUDA decoders, not Vosk. I wrote that above.

GaetanLepage commented on August 25, 2024

Ok, and what is the problem then?

Oh, nothing, I just wanted to remark that, on my system, running 16 parallel jobs is more or less equivalent to running the system in GPU mode. It was only an observation :)

If you really want to use the GPU at full speed, you need to use the Kaldi CUDA decoders, not Vosk. I wrote that above.

All right! I will keep this in mind if I happen to need more of a speedup.

For now, the convenience and usability of Vosk are really helping me! Thank you very much for developing this great tool!

vonguyen1982 commented on August 25, 2024

@nshmyrev Is it possible to add a method to allow using the GPU in the C# NuGet package?

sskorol commented on August 25, 2024

@vonguyen1982 a fix is already in master. You can check this PR for details: #514

vonguyen1982 commented on August 25, 2024

Got it. Thanks.
I am using the latest NuGet package with .NET Core 5.0.
I call Vosk.Vosk.GpuInit() and Vosk.Vosk.GpuThreadInit(), but I did not see any GPU usage. Do I need a specific model, or which version of CUDA should I install? Thanks.

vonguyen1982 commented on August 25, 2024

@nshmyrev At the moment, the way I use the NuGet package with C# is very simple, because I don't need to manage a Vosk server. I wonder if we can keep it that simple when working with the GPU, rather than my having to build Vosk on my own, use a Docker image, etc. Is that possible?

nshmyrev commented on August 25, 2024

Now we have it working. We might need to consider two Docker images: one for simple streaming, another for batch.
