
larynx's Introduction

Larynx has been succeeded by Piper!

This repository is no longer actively developed.


Larynx

🎥 DEMO VIDEO

Offline end-to-end text to speech system using gruut and onnx (see Architecture below). There are 50 voices available across 9 languages.

curl https://raw.githubusercontent.com/rhasspy/larynx/master/docker/larynx-server \
    > ~/bin/larynx-server && chmod 755 ~/bin/larynx-server
larynx-server

Visit http://localhost:5002 for the test page. See http://localhost:5002/openapi/ for HTTP endpoint documentation.

Larynx screenshot

Supports a subset of SSML that can use multiple voices and languages!

<speak>
  The 1st thing to remember is that 9 languages are supported in Larynx TTS as of 10/19/2021 at 10:39am.

  <voice name="harvard">
    <s>
      The current voice can be changed!
    </s>
  </voice>

  <voice name="northern_english_male">
    <s>Breaks are possible</s>
    <break time="0.5s" />
    <s>between sentences.</s>
  </voice>

  <s lang="en">
    One language is never enough
  </s>
  <s lang="de">
   Eine Sprache ist niemals genug
  </s>
  <s lang="sw">
    Lugha moja haitoshi
  </s>
</speak>
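
To synthesize the document above from the command line, pass it with the --ssml flag (a sketch; example.ssml is a hypothetical file holding the SSML above, and -v sets the default voice that <voice> tags override):

larynx --ssml -v harvard "$(cat example.ssml)" > output.wav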

Larynx's goals are:

  • "Good enough" synthesis to avoid using a cloud service
  • Faster than realtime performance on a Raspberry Pi 4 (with low quality vocoder)
  • Broad language support (9 languages)
  • Voices trained purely from public datasets

You can use Larynx as a command-line tool, as a text to speech HTTP server, or as a drop-in MaryTTS replacement (see the sections below).

Samples

Listen to voice samples from all of the pre-trained voices.


Docker Installation

Pre-built Docker images are available for the following platforms:

  • linux/amd64 - desktop/laptop/server
  • linux/arm64 - Raspberry Pi 64-bit
  • linux/arm/v7 - Raspberry Pi 32-bit

These images include a single English voice, but many more can be downloaded from within the web interface.

The larynx and larynx-server shell scripts wrap the Docker images, allowing you to use Larynx as a command-line tool.

To manually run the Larynx web server in Docker:

docker run \
    -it \
    -p 5002:5002 \
    -e "HOME=${HOME}" \
    -v "$HOME:${HOME}" \
    -v /usr/share/ca-certificates:/usr/share/ca-certificates \
    -v /etc/ssl/certs:/etc/ssl/certs \
    -w "${PWD}" \
    --user "$(id -u):$(id -g)" \
    rhasspy/larynx

Downloaded voices will be stored in ${HOME}/.local/share/larynx.

Visit http://localhost:5002 for the test page. See http://localhost:5002/openapi/ for HTTP endpoint documentation.
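
As a sketch of calling the server over HTTP with curl: the /api/tts endpoint name and its text and voice parameters are assumptions here, so consult the OpenAPI page above for the authoritative interface:

curl -G 'http://localhost:5002/api/tts' \
    --data-urlencode 'text=Welcome to the world of speech synthesis!' \
    --data-urlencode 'voice=en' \
    --output welcome.wav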

Debian Installation

Pre-built Debian packages for bullseye are available for download with the name larynx-tts_<VERSION>_<ARCH>.deb, where ARCH is one of amd64 (most desktops and laptops), armhf (32-bit Raspberry Pi), or arm64 (64-bit Raspberry Pi).

Example installation on a typical desktop:

sudo apt install ./larynx-tts_<VERSION>_amd64.deb

From there, you may run the larynx command or larynx-server to start the web server (http://localhost:5002).

Python Installation

You may need to install the following dependencies (besides Python 3.7+):

sudo apt-get install libopenblas-base libgomp1 libatomic1

On 32-bit ARM systems (Raspberry Pi), you will also need:

sudo apt-get install libatlas3-base libgfortran5

Next, create a Python virtual environment:

python3 -m venv larynx_venv
source larynx_venv/bin/activate

pip3 install --upgrade pip
pip3 install --upgrade wheel setuptools

Next, install larynx:

pip3 install -f 'https://synesthesiam.github.io/prebuilt-apps/' -f 'https://download.pytorch.org/whl/cpu/torch_stable.html' larynx

Then run the larynx command, or larynx-server to start the web server. You may also execute the Python modules directly with python3 -m larynx and python3 -m larynx.server.

Voice/Vocoder Download

Voices and vocoders are automatically downloaded when used on the command-line or in the web server. You can also manually download each voice. Extract them to ${HOME}/.local/share/larynx/voices so that the directory structure follows the pattern ${HOME}/.local/share/larynx/voices/<language>/<voice>.
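
A manual installation might look like the following sketch, where en-us_harvard-glow_tts.tar.gz is a hypothetical archive name following the <language>_<voice>-glow_tts.tar.gz pattern of the release assets:

mkdir -p "${HOME}/.local/share/larynx/voices"
tar -C "${HOME}/.local/share/larynx/voices" -xzf en-us_harvard-glow_tts.tar.gz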


Command-Line Interface

Larynx has a flexible command-line interface, available with:

  • The larynx script for Docker
  • The larynx command from the Debian package
  • larynx or python3 -m larynx for Python installations

Basic Synthesis

larynx -v <VOICE> "<TEXT>" > output.wav

where <VOICE> is a language name (en, de, etc) or a voice name (ljspeech, thorsten, etc). <TEXT> may contain multiple sentences, which will be combined in the final output WAV file. They can also be split into separate WAV files (see Multiple WAV Output below).

To adjust the quality of the output, use -q <QUALITY> where <QUALITY> is "high" (slowest), "medium", or "low" (fastest).
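
For example, using the harvard voice from the SSML example above with the fastest vocoder:

larynx -v harvard -q low 'The birch canoe slid on the smooth planks.' > test.wav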

SSML Synthesis

larynx --ssml -v <VOICE> "<SSML>" > output.wav

where <SSML> is valid SSML. Not all features are supported; for example:

  • Breaks (pauses) can only occur between sentences and can only be specified in seconds or milliseconds
  • Voices can only be referenced by name
  • Custom lexicons are not yet supported (you can use <phoneme ph="...">, however)

If your SSML contains <mark> tags, add --mark-file <FILE> to the command-line. As the marks are encountered (between sentences), their names will be written on separate lines to the file.
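
A minimal sketch of mark handling with an inline SSML document:

larynx --ssml -v en --mark-file marks.txt \
    '<speak><s>First sentence.</s><mark name="here" /><s>Second sentence.</s></speak>' \
    > output.wav

After synthesis, marks.txt should contain a single line: here.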

CUDA Accelerated Synthesis

The --cuda flag will make use of a GPU if it's available to PyTorch:

larynx --cuda 'This is spoken on the GPU.' > output.wav

Adding the --half flag will enable half-precision inference, which is often faster:

larynx --cuda --half 'This is spoken on the GPU even faster.' > output.wav

For CUDA acceleration to work, your voice must contain a PyTorch checkpoint file (generator.pth). Older Larynx voices did not have these, so you may need to re-download your voices.

Long Texts

If your text is very long, and you would like to listen to it as it's being synthesized, use the --raw-stream option:

larynx -v en --raw-stream < long.txt | aplay -r 22050 -c 1 -f S16_LE

Each input line will be synthesized and written to standard output as raw 16-bit, 22050 Hz mono PCM. By default, 5 sentences will be kept in an output queue, with synthesis blocking only when the queue is full. You can adjust this value with --raw-stream-queue-size. Additionally, you can adjust --max-thread-workers to change how many threads are available for synthesis.
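
Putting these options together, an illustrative invocation (the queue and thread values are arbitrary examples):

larynx -v en --raw-stream --raw-stream-queue-size 10 --max-thread-workers 4 \
    < long.txt | aplay -r 22050 -c 1 -f S16_LE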

If your long text is fixed-width with blank lines separating paragraphs like those from Project Gutenberg, use the --process-on-blank-line option so that sentences will not be broken at line boundaries. For example, you can listen to "Alice in Wonderland" like this:

curl --output - 'https://www.gutenberg.org/files/11/11-0.txt' | \
    larynx -v ek --raw-stream --process-on-blank-line | aplay -r 22050 -c 1 -f S16_LE

Multiple WAV Output

With --output-dir set to a directory, Larynx will output a separate WAV file for each sentence:

larynx -v en 'Test 1. Test 2.' --output-dir /path/to/wavs

By default, each WAV file will be named using the (slightly modified) text of the sentence. You can have WAV files named using a timestamp instead with --output-naming time. For full control of the output naming, the --csv command-line flag indicates that each sentence is of the form id|text where id will be the name of the WAV file.

cat << EOF |
s01|The birch canoe slid on the smooth planks.
s02|Glue the sheet to the dark blue background.
s03|It's easy to tell the depth of a well.
s04|These days a chicken leg is a rare dish.
s05|Rice is often served in round bowls.
s06|The juice of lemons makes fine punch.
s07|The box was thrown beside the parked truck.
s08|The hogs were fed chopped corn and garbage.
s09|Four hours of steady work faced us.
s10|Large size in stockings is hard to sell.
EOF
  larynx --csv --voice en --output-dir /path/to/wavs

Interactive Mode

With no text input and no output directory, Larynx will switch into interactive mode. After entering a sentence, it will be played with --play-command (default is play from SoX).

larynx -v en
Reading text from stdin...
Hello world!<ENTER>

Use CTRL+D or CTRL+C to exit.

GlowTTS Settings

The GlowTTS voices support two additional parameters:

  • --noise-scale - determines the speaker volatility during synthesis (0-1, default is 0.667)
  • --length-scale - makes the voice speak more slowly (> 1) or more quickly (< 1)

Vocoder Settings

  • --denoiser-strength - runs the denoiser if > 0; a small value like 0.005 is a good place to start.
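
Combining these settings, an illustrative invocation (the values are arbitrary examples):

larynx -v en --length-scale 1.2 --noise-scale 0.333 --denoiser-strength 0.005 \
    'This is spoken a bit more slowly, with less variation.' > output.wav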

List Voices and Vocoders

larynx --list

MaryTTS Compatible API

To use Larynx as a drop-in replacement for a MaryTTS server (e.g., for use with Home Assistant), run:

docker run \
    -it \
    -p 59125:5002 \
    -e "HOME=${HOME}" \
    -v "$HOME:${HOME}" \
    -v /usr/share/ca-certificates:/usr/share/ca-certificates \
    -v /etc/ssl/certs:/etc/ssl/certs \
    -w "${PWD}" \
    --user "$(id -u):$(id -g)" \
    rhasspy/larynx

The /process HTTP endpoint should now work for voices formatted as <LANG> or <VOICE>, e.g. en or harvard.

You can specify the vocoder quality by adding ;<QUALITY> to the MaryTTS voice where QUALITY is "high", "medium", or "low".

For example: en;low will use the lowest quality (but fastest) vocoder. This is usually necessary to get decent performance on a Raspberry Pi.
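
As a sketch of a /process request with curl: the INPUT_TEXT and VOICE parameter names follow the MaryTTS convention; which other MaryTTS parameters Larynx honors is not documented here:

curl -G 'http://localhost:59125/process' \
    --data-urlencode 'INPUT_TEXT=The quick brown fox jumps over the lazy dog.' \
    --data-urlencode 'VOICE=en;low' \
    --output fox.wav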


SSML

A subset of SSML is supported (use --ssml):

  • <speak> - wrap around SSML text
    • lang - set language for document
  • <s> - sentence (disables automatic sentence breaking)
    • lang - set language for sentence
  • <w> / <token> - word (disables automatic tokenization)
  • <voice name="..."> - set voice of inner text
    • voice - name or language of voice
  • <say-as interpret-as=""> - force interpretation of inner text
    • interpret-as - one of "spell-out", "date", "number", "time", or "currency" (see the example after this list)
    • format - way to format text depending on interpret-as
      • number - one of "cardinal", "ordinal", "digits", "year"
      • date - string with "d" (cardinal day), "o" (ordinal day), "m" (month), or "y" (year)
  • <break time=""> - Pause for given amount of time
    • time - seconds ("123s") or milliseconds ("123ms")
  • <mark name=""> - User-defined mark (written to --mark-file or part of TextToSpeechResult)
    • name - name of mark
  • <sub alias=""> - substitute alias for inner text
  • <phoneme ph="..."> - supply phonemes for inner text
    • ph - phonemes for each word of inner text, separated by whitespace
  • <lexicon id="..."> - inline pronunciation lexicon
    • id - unique id of lexicon (used in <lookup ref="...">)
    • One or more <lexeme> child elements with:
      • <grapheme role="...">WORD</grapheme> - word text (with optional role; see Word Roles below)
      • <phoneme>P H O N E M E S</phoneme> - word pronunciation (phonemes separated by whitespace)
  • <lookup ref="..."> - use inline pronunciation lexicon for child elements
    • ref - id from a <lexicon id="...">
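
For instance, a sketch using <say-as> from the command line (flags as documented above; the sentence is illustrative):

larynx --ssml -v en \
    '<speak>This is the <say-as interpret-as="number" format="ordinal">5</say-as> example.</speak>' \
    > output.wav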

Word Roles

During phonemization, word roles are used to disambiguate pronunciations. Unless manually specified, a word's role is derived from its part of speech tag as gruut:<TAG>. For initialisms and spell-out, the role gruut:letter is used to indicate that e.g., "a" should be spoken as /eɪ/ instead of /ə/.

For en-us, the following additional roles are available from the part-of-speech tagger:

  • gruut:CD - number
  • gruut:DT - determiner
  • gruut:IN - preposition or subordinating conjunction
  • gruut:JJ - adjective
  • gruut:NN - noun
  • gruut:PRP - personal pronoun
  • gruut:RB - adverb
  • gruut:VB - verb
  • gruut:VBD - verb (past tense)
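
Roles can be attached to words in SSML, as in the inline lexicon example below. A sketch using the "read" heteronym (whether the en-us lexicon actually distinguishes these two entries is an assumption):

larynx --ssml -v en \
    '<speak>I <w role="gruut:VB">read</w> today what you <w role="gruut:VBD">read</w> yesterday.</speak>' \
    > read.wav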

Inline Lexicons

Inline pronunciation lexicons are supported via the <lexicon> and <lookup> tags. gruut diverges slightly from the SSML standard here by only allowing lexicons to be defined within the SSML document itself. Additionally, the id attribute of the <lexicon> element can be left off to indicate a "default" inline lexicon that does not require a corresponding <lookup> tag.

For example, the following document will yield three different pronunciations for the word "tomato":

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <lexicon xml:id="test" alphabet="ipa">
    <lexeme>
      <grapheme>
        tomato
      </grapheme>
      <phoneme>
        <!-- Individual phonemes are separated by whitespace -->
        t ə m ˈɑ t oʊ
      </phoneme>
    </lexeme>
    <lexeme>
      <grapheme role="fake-role">
        tomato
      </grapheme>
      <phoneme>
        <!-- Made up pronunciation for fake word role -->
        t ə m ˈi t oʊ
      </phoneme>
    </lexeme>
  </lexicon>

  <w>tomato</w>
  <lookup ref="test">
    <w>tomato</w>
    <w role="fake-role">tomato</w>
  </lookup>
</speak>

The first "tomato" will be looked up in the U.S. English lexicon (/t ə m ˈeɪ t oʊ/). Within the <lookup> tag's scope, the second and third "tomato" words will be looked up in the inline lexicon. The third "tomato" word has a role attached (selecting a made up pronunciation in this case).

Even further from the SSML standard, gruut allows you to leave off the <lexicon> id entirely. With no id, a <lookup> tag is no longer needed, allowing you to override the pronunciation of any word in the document:

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <!-- No id means change all words without a lookup -->
  <lexicon>
    <lexeme>
      <grapheme>
        tomato
      </grapheme>
      <phoneme>
        t ə m ˈɑ t oʊ
      </phoneme>
    </lexeme>
  </lexicon>

  <w>tomato</w>
</speak>

This will yield a pronunciation of /t ə m ˈɑ t oʊ/ for all instances of "tomato" in the document (unless they have a <lookup>).


Text to Speech Models

Vocoders

  • Hi-Fi GAN
    • Universal large (slowest)
    • VCTK "small"
    • VCTK "medium" (fastest)

Benchmarks

The following benchmarks were run on:

  • Core i7-8750H (amd64)
  • Raspberry Pi 4 (aarch64)
  • Raspberry Pi 3 (armv7l)

Multiple runs were done at each quality level, with the first run discarded so that the model file cache was warm.

The RTF (real-time factor) is computed as the time taken to synthesize audio divided by the duration of the synthesized audio. An RTF less than 1 indicates that audio was synthesized faster than real-time; for example, an RTF of 0.25 means that 10 seconds of audio took 2.5 seconds to synthesize.

Platform  Quality  RTF
--------  -------  -----
amd64     high      0.25
amd64     medium    0.06
amd64     low       0.05
aarch64   high      4.28
aarch64   medium    1.82
aarch64   low       0.56
armv7l    high     16.83
armv7l    medium    7.16
armv7l    low       2.22

See the benchmarking scripts in scripts/ for more details.


Architecture

Larynx breaks text to speech into 4 distinct steps:

  1. Text to IPA phonemes (gruut)
  2. Phonemes to ids (phonemes.txt file from voice)
  3. Phoneme ids to mel spectrograms (glow-tts)
  4. Mel spectrograms to waveforms (hifi-gan)

Larynx architecture

Voices are trained on phoneme ids and mel spectrograms. For each language, the voice with the most data available was used as a base model and fine-tuned.

larynx's People

Contributors

fquirin, synesthesiam


larynx's Issues

Package `larynx-tts_0.5.0_amd64.deb` installs but fails to run on older systems

Problem

The package larynx-tts_0.5.0_amd64.deb installs on Elementary OS 5.1 (which is based on Ubuntu 18.04 LTS, which is in turn based on Debian buster/sid), but the supplied python3 binary/larynx script fails to run due to an issue related to libc versioning.

$ larynx --help
python3: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.28' not found (required by python3)

Workaround

I'd recently encountered this issue with another project so was able to work around the issue in the interim by extracting a package with a later version of libc and helping things find what they were looking for. *waves hands here*

Cause

Anyway, as far as I'm aware, this issue occurs because the Larynx package is built on a machine with a more recent libc version than the one installed locally.

Which I think is confirmed by this line in the docker config:

FROM debian:buster-slim as python37

Options for resolving issue

In terms of "resolving" the issue:

  • Ideally the package could be built on an older base system docker image so older machines could still run it successfully. (As I understand it, I think the only libc version changes are related to some optimisations but I don't know if they impact Larynx's performance.)
  • Alternatively the package could be configured with version information that would prevent installation on older, incompatible systems, unless manually overridden.

I'll admit I didn't really expect the Larynx package to ship its own Python binary instead of depending on system packages but I assume that's to ensure compatibility with compiled extensions?

Appreciation

Despite this issue I was able to get up and running with Larynx after applying the workaround and overall am very happy with the initial resulting output.

Thanks for all the work you've put into the project, I'm really excited about the potential that high quality, free & open source offline text to speech technology brings with it!

Thanks!

Real-time factor: calculation

I know the metric real-time factor (RTF) from STT (or ASR) systems. An RTF of 0.5 would mean that 1 second of audio is recognized in 0.5 seconds.

I would expect a similar logic for TTS systems. But the numbers reported in larynx's debug output as Real-time factor seem to be 1/RTF. This is confusing, isn't it?

SSL error when downloading new tts

Steps to reproduce:

  1. Run larynx-server on NixOS with Docker
  2. Attempt to download a tts

Full error output:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/.venv/lib/python3.7/site-packages/quart/app.py", line 1827, in full_dispatch_request
    result = await self.dispatch_request(request_context)
  File "/app/.venv/lib/python3.7/site-packages/quart/app.py", line 1875, in dispatch_request
    return await handler(**request_.view_args)
  File "/app/larynx/server.py", line 667, in api_download
    tts_model_dir = download_voice(voice_name, voices_dirs[0], url)
  File "/app/larynx/utils.py", line 78, in download_voice
    response = urllib.request.urlopen(link)
  File "/usr/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.7/urllib/request.py", line 563, in error
    result = self._call_chain(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 755, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/lib/python3.7/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.7/urllib/request.py", line 543, in _open
    '_open', req)
  File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 1367, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/usr/lib/python3.7/urllib/request.py", line 1326, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)>

Liaison in French

In French sometimes two words sound like one
DEBUG:larynx:Words for 'oui, c'est un': ['oui', ',', "c'est", 'un']
DEBUG:larynx:Phonemes for 'c'est un': ['#', 's', 'e', 't', '#', 'œ̃', '#', '‖', '‖']
't' was lost in the output wav and phonemes should be something like this
DEBUG:larynx:Phonemes for 'c'est un': ['#', 's', 'e', 't', 'œ̃', '#', '‖', '‖']

DEBUG:larynx:Words for 'ce n'est pas un': ['ce', "n'est", 'pas', 'un']
DEBUG:larynx:Phonemes for 'ce n'est pas un': ['#', 's', 'e', 'ə', '#', 'n', 'ɛ', '#', 'p', 'a', '#', 'œ̃', '#', '‖', '‖']
the output wav was OK ('z' was added) but I think phonemes should be something like this
DEBUG:larynx:Phonemes for 'ce n'est pas un': ['#', 's', 'e', 'ə', '#', 'n', 'ɛ', '#', 'p', 'a', 'z', 'œ̃', '#', '‖', '‖']

Sound was lost in french word rez-de-chaussée

Trying to get audio for the French word rez-de-chaussée. Here's the command line:

cat << EOF |
fr|rez-de-chaussée.
EOF
/usr/local/bin/larynx \
    --debug \
    --csv \
    --glow-tts /path/fr-fr/siwis-glow_tts \
    --hifi-gan /path/hifi_gan/universal_large \
    --output-dir /mnt/d/99/voices/ \
    --language fr-fr \
    --denoiser-strength 0.001

Debug data:
DEBUG:larynx:Words for 'rez-de-chaussée': ['rez-de-chaussée']
DEBUG:larynx:Phonemes for 'rez-de-chaussée': ['#', 'ʁ', 'e', 'd', 'ʃ', 'o', 's', 'e', '#', '‖', '‖']
The phonemes are OK for this word, but there is no 'd' sound in the output audio.

Required versions for python and pip

I have a working setup on a recent linux box (with python 3.8). But now I have to use an older computer (python 3.5, pip 8.1.1) and I run into trouble:

 Using cached https://files.pythonhosted.org/packages/f8/4d/a2.../larynx-0.3.1.tar.gz
 Complete output from command python setup.py egg_info:
 Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-build-sbulg1mn/larynx/setup.py", line 13
    long_description: str = ""
                    ^
 SyntaxError: invalid syntax

What are the minimum versions required by larynx at the moment?

Using larynx as a module in python code

I find this project cool and useful, but I have a question in mind. Is it possible to use it like pyttsx3, as a TTS engine in code? If yes, how?

Mac onnxruntime cannot import name 'get_all_providers'

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/onnxruntime/capi/_pybind_state.py:14: UserWarning: Cannot load onnxruntime.capi. Error: 'dlopen(/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_pybind11_state.so, 2): Library not loaded: /usr/local/opt/libomp/lib/libomp.dylib
Referenced from: /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_pybind11_state.so
Reason: image not found'.
warnings.warn("Cannot load onnxruntime.capi. Error: '{0}'.".format(str(e)))
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 183, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 109, in _get_module_details
__import__(pkg_name)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/larynx-0.3.0-py3.7.egg/larynx/__init__.py", line 9, in <module>
import onnxruntime
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/onnxruntime/__init__.py", line 13, in <module>
from onnxruntime.capi._pybind_state import get_all_providers, get_available_providers, get_device, set_seed,
ImportError: cannot import name 'get_all_providers' from 'onnxruntime.capi._pybind_state' (/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/onnxruntime/capi/_pybind_state.py)

With onnx == 1.7.0 it's okay.

Adding support for a Windows SAPI5 implementation

Hey there developers! I found this repo by exploring, and I'd like to make some requests.
Firstly: releasing a Windows SAPI5 version of the TTS engine, compatible with all the voices that are available, with the necessary encoders integrated to ensure fast and responsive synthesis. Details below.
I am a blind person who uses a screen reader to use the computer. Blind people like me require a responsive speech synthesizer so we can receive the requested information without any unnecessary delays, and quite a popular portion of us require very fast speech output that doesn't result in weird voice artifacts such as those produced by natural-sounding TTS voices. If I were ignorant of the hard work this requires, I would ask you to make an NVDA add-on containing the synthesizer along with a way to download the voices, but a more mainstream, Windows-integrated option like SAPI5 would perhaps be a little easier?
Anyway, I know that this project is aimed at Raspberry Pi/command-line usage, but the currently available voices attracted someone like me who would benefit from it for, say, daily usage. I look forward to your response. This is just a request from me; if it can't be done, it can't be done. So thanks, and have a good time.

Version/tag mismatch when downloading voices for 1.0.0 release

Looks like the GitHub version tag is 1.0 but the code is looking for 1.0.0. The assets exist on GitHub with 1.0 in the path, but I'm getting this error when trying to download voices from the web interface:

larynx.utils.VoiceDownloadError: Failed to download voice en-us_kathleen-glow_tts from http://github.com/rhasspy/larynx/releases/download/v1.0.0/en-us_kathleen-glow_tts.tar.gz: HTTP Error 404: Not Found

missing liaison in phonemes but wrong one heard

DEBUG:larynx:Words for 'avec ton amour': ['avec', 'ton', 'amour']
DEBUG:larynx:Phonemes for 'avec ton amour': ['#', 'a', 'v', 'ɛ', 'k', '#', 't', 'ɔ̃', '#', 'a', 'm', 'u', 'ʁ', '#', '‖', '‖']
DEBUG:larynx:Running text to speech model (GlowTextToSpeech)

I can hear "ton zamour" :)

MaryTTS emulation and Home Assistant

I'm having trouble setting up the MaryTTS component in Home Assistant to work with Larynx. In particular, there are several parameters that can be defined in yaml. The docs give this example:

tts:
  - platform: marytts
    host: "localhost"
    port: 59125
    codec: "WAVE_FILE"
    voice: "cmu-slt-hsmm"
    language: "en_US"
    effect:
      Volume: "amount:2.0;"

Larynx is up and running and I can generate speech via localhost:59125. I'd like to use a specific voice and quality setting with Home Assistant's TTS. I tried setting the following:

...
    voice: "harvard-glow_tts"
    language: "en_us"
...

But Home Assistant's log shows an error saying that "en_us" is not a valid language ("en_US" is, though).

What are the correct parameters necessary to use a specific voice? And would it be possible to use an effect key to set the voice quality (high, medium, low)?

Letter t and p

In the Russian words "Установите" and "зарядку", the pronunciation of the letters "т" and "р" is not good.

Release v0.4.0 contains only a single German larynx-tts-voice (Thorsten)

All others seem to be missing the onnx model:

larynx-tts-voice-de-de-eva-k-glow-tts_0.4.0_all.deb
383 KB
larynx-tts-voice-de-de-karlsson-glow-tts_0.4.0_all.deb
387 KB
larynx-tts-voice-de-de-pavoque-glow-tts_0.4.0_all.deb
393 KB
larynx-tts-voice-de-de-rebecca-braunert-plunkett-glow-tts_0.4.0_all.deb
367 KB
larynx-tts-voice-de-de-thorsten-glow-tts_0.4.0_all.deb
102 MB

Server with HTTPS ?

Hi,

This is a very useful project!
Very easy to install with the provided .deb file.

Could you please explain in the README how to run the server with HTTPS on Debian?

For example with a certificate generated with Let's Encrypt.

Thanks

Siwis avec + sa wrong phonemes

example
DEBUG:gruut.phonemize:Loading lexicon from /usr/lib/larynx-tts/gruut/fr-fr/lexicon.db
DEBUG:larynx:Words for 'avec sa mauvaise vue': ['avec', 'sa', 'mauvaise', 'vue']
DEBUG:larynx:Phonemes for 'avec sa mauvaise vue': ['#', 'a', 'v', 'ɛ', 'k', '#', 'ɛ', 's', 'a', '#', 'm', 'ɔ', 'v', 'ɛ', 'z', '#', 'v', 'y', '#', '‖', '‖']

there should not be an 'ɛ' before 's', 'a'

Longer pause

I'd like to be able to manually add a 2-second pause between different paragraphs of text. Is there a way to do this?

About the use of this software

@synesthesiam

Hi, I have encountered this software for the first time and have now cloned it locally. How do I use it?
I didn't understand the README file; it only describes the usage, without detailed information on how to deploy and use it.
I want to use it in Python. How do I deploy it? What are the specific steps?

siwis : sound not so accurate

DEBUG:larynx:Words for 'ce fait est avéré': ['ce', 'fait', 'est', 'avéré']
DEBUG:larynx:Phonemes for 'ce fait est avéré': ['#', 's', 'e', 'ə', '#', 'f', 'ɛ', '#', 'ɛ', '#', 'a', 'v', 'e', 'ʁ', 'e', '#', '‖', '‖']
DEBUG:larynx:Running text to speech model (GlowTextToSpeech)

I can hear a 'd' sound after 'f', 'ɛ'.
I suppose that's because in such a context, there may or may not be a liaison 't'.

CUDA Does not appear to be working in docker container

Running the latest docker container with the nvidia container runtime, nvidia-smi returns and shows the graphics card as available and ready.

image

You can run larynx from the command line inside of the container without error.
image

But as soon as you pass the cuda flag in

^C(.venv) root@larynx-dd4858485-t9dj2:/home/larynx/app/larynx# python -m larynx --cuda
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/larynx/app/larynx/__main__.py", line 750, in <module>
    main()
  File "/home/larynx/app/larynx/__main__.py", line 66, in main
    import torch
ModuleNotFoundError: No module named 'torch'

Similar errors occur if you attempt to start the container with the cuda flag as an additional argument.

By executing into the container and using the venv that exists I was able to install torch and then run the command.

image

I believe the build container has an issue here: https://github.com/rhasspy/larynx/blob/master/Dockerfile#L42. My knowledge of Python is limited, but it appears that the intent is to use a precompiled version of torch that you are providing, and it does not appear to actually be making it into the container.

Siwis : wrong phonemes for "de"

DEBUG:larynx:Words for 'de fait': ['de', 'fait']
DEBUG:larynx:Phonemes for 'de fait': ['#', 'd', 'a', 'm', '#', 'f', 'ɛ', '#', '‖', '‖']

About the improvement of README

@synesthesiam

First of all, thank you for the detailed introduction in the README. Is there an online version of this software? If there were a link to the online version at the beginning of the README, it would improve the user experience.

Exclude tests in setuptools.find_packages

I'm currently packaging larynx + deps for Arch Linux and I encountered an issue with larynx: setuptools includes the tests in the package, which does not play well with Arch Linux packaging - see rhasspy/phonemes2ids#1 for details.

So I propose to also exclude the tests for larynx.

"Python installation" method fails on musl-based Linux

The method "Python Installation" installs the current and all old versions of larynx (7: 1.0.3 down to 0.3.0) and gruut in this step:

pip3 install -f 'https://synesthesiam.github.io/prebuilt-apps/' -f 'https://download.pytorch.org/whl/cpu/torch_stable.html' larynx

Same for the simpler command:

pip3 install larynx

Of course, this ends in massive version conflicts. The problem first occurred after version 1.0.0.
My python3 is version 3.9.7.

Dot (.) stops synthesis

I am new to Larynx, so maybe my question can be answered easily and quickly, but I couldn't find anything to fix it.

Whenever a dot character is encountered, synthesis ends. I don't even need multiple sentences; if it encounters something like X (feat. Y) it just says X feat. I am using Larynx over opentts in Home Assistant, but this can easily be replicated in the GUI as well. So how exactly can I fix this? And maybe for later, how exactly can I synthesize multiple sentences? Thank you very much in advance, the voices are superb!

Dutch extra "t" sounds

ik ga naar huis is pronounced as: ik ga naar huist
ik ga naar de bakker is pronounced as: ik ga naar de bakkert
jij moet opstaan is pronounced as: jij moet opstaant

Siwis good training on bad prompts

In Siwis, the talent rarely respects the pronunciation of verbs in the conditional mode; for example, she says "il tirait" instead of "il tirerait", despite the correct phonemes:

DEBUG:larynx:Words for 'il tirerait le premier.': ['il', 'tirerait', 'le', 'premier', '.']
DEBUG:larynx:Phonemes for 'il tirerait le premier.': ['#', 'i', 'l', '#', 't', 'i', 'ʁ', 'ə', 'ʁ', 'ɛ', '#', 'l', 'ə', '#', 'p', 'ʁ', 'ə', 'm', 'j

I can hear "il tirait le premier".

Cannot redirect audio output to file with --raw-stream

When I try to redirect larynx output to a .wav file from the shell, the file produced is corrupted; when I play the output by adding | aplay to the same command, it plays flawlessly.

larynx -v cmu_jmk -q high --raw-stream < /mnt/hgfs/HostSharedFolder/text/text.txt > test.wav

Am I missing something?
Following the information given in the wiki, the command larynx -v cmu_jmk -q high "Test text." > test.wav works as expected, so it seems there's an issue with the --raw-stream specifier and output redirection. Could you please help?

SSML file not processing under --ssml flag

Testing both Larynx and Larynx.server install via pip3 in a venv. All dependencies are satisfied. Fedora 34 all up to date.

Using the example SSML in a file TTS-SSML_test.txt:
larynx.server --> input contents of file into input box and run. SSML checkbox unchecked or checked = voice recognizing ssml cmds and not reading them

Using larynx from cmd line:
$ python3 -m larynx -v southern_english_female-glow_tts < TTS-SSML_test.txt
reads whole file including all the SSML statements

$ python3 -m larynx --ssml -v southern_english_female-glow_tts < TTS-SSML_test.txt
errors:
Traceback (most recent call last):
File "/TextToSpeech/venv/lib64/python3.9/site-packages/gruut/text_processor.py", line 479, in process
root_element = etree.fromstring(text)
File "/usr/lib64/python3.9/xml/etree/ElementTree.py", line 1348, in XML
return parser.close()
xml.etree.ElementTree.ParseError: no element found: line 1, column 7

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib64/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/TextToSpeech/venv/lib64/python3.9/site-packages/larynx/main.py", line 720, in
main()
File "/TextToSpeech/venv/lib64/python3.9/site-packages/larynx/main.py", line 294, in main
for result_idx, result in enumerate(tts_results):
File "/TextToSpeech/venv/lib64/python3.9/site-packages/larynx/init.py", line 71, in text_to_speech
for sentence in gruut.sentences(
File "/TextToSpeech/venv/lib64/python3.9/site-packages/gruut/init.py", line 79, in sentences
graph, root = text_processor(text, lang=lang, ssml=ssml, **process_args)
File "/TextToSpeech/venv/lib64/python3.9/site-packages/gruut/text_processor.py", line 432, in call
return self.process(*args, **kwargs)
File "/TextToSpeech/venv/lib64/python3.9/site-packages/gruut/text_processor.py", line 483, in process
root_element = etree.fromstring(f"{text}")
File "/usr/lib64/python3.9/xml/etree/ElementTree.py", line 1348, in XML
return parser.close()
xml.etree.ElementTree.ParseError: no element found: line 1, column 22

Also tried piping the file in via cat:
cat TTS-SSML_test.txt | python3 -m larynx --ssml -v southern_english_female-glow_tts
Same error
Produces audio file without the --ssml flag, but as above includes all the SSML statements

I've been through the documentation page and tried the examples to narrow this down. There is nothing specific about using an SSML file to produce the audio. The non-SSML examples all work on my workstation.

I would like to get this working for a small project that produces training audio files of Shorin-Ryu Karate Yakusokus for my black belt test practice.

Thanks,

MaryTTS API interface is not 100% compatible

Hi Michael,

congratulations for your Larynx v1.0 release 🥳 . Great work, as usual 🙂.

I've been trying to use Larynx with the new SEPIA v0.24.0 client since it has an option now to use MaryTTS compatible TTS systems directly, but encountered some issues:

  • The /voices endpoint is not delivering information in the same format. The MaryTTS API response is: [voice] [language] [gender] [tech=hmm], but Larynx is giving [language]/[voice]. Since I'm automatically parsing the string, it currently fails to get the right language.
  • The /voices endpoint will show all voices including the ones that haven't been downloaded yet.
  • The Larynx quality parameter is not accessible.

The last point is not really a MaryTTS compatibility issue, but it would be great to get each voice as 'low, medium, high' variation from the 'voices' endpoint, so the user could actually choose them from the list.

I believe the Larynx MaryTTS endpoints are mostly for Home-Assistant support and I'm not sure how HA is parsing the voices list (maybe it doesn't parse it at all or just uses the whole string), but it would be great to get the original format from the /voices endpoint. Would you be willing to make these changes? 😇 😁

OpenAPI page broken

I am getting HTTP 500 returned when I go to http://localhost:5002/openapi/ - The browser page says "Fetch error undefined /openapi/swagger.json"

On the command line, I tried find /usr/local/python3/ -name '*swagger*' and only got results for the swagger_ui package in site-packages.

length-scale is working incorrectly

The --length-scale parameter is working incorrectly compared with its description. The speaker speaks slower when the parameter is > 1 and faster when < 1, but according to the description it should be:
--length-scale - makes the voice speaker slower (< 1) or faster (> 1)

Keyboard Shortcut

Hey! Just wondering if it is possible to implement a keyboard shortcut functionality?

'denoiser_strength' referenced before assignment [waveglow]

cat << EOF |
leçon|leçon
garçon|garçon
EOF
/usr/local/bin/larynx --csv --glow-tts /mnt/d/99/voices/fr-fr/siwis-glow_tts --waveglow /mnt/d/99/voices/waveglow/wn_256 --output-dir /mnt/d/99/fr_sw/ --language fr-fr --denoiser-strength 0.001
Traceback (most recent call last):
File "/usr/local/bin/larynx", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/dist-packages/larynx/__main__.py", line 185, in main
for text_idx, (text, audio) in enumerate(text_and_audios):
File "/usr/local/lib/python3.7/dist-packages/larynx/__init__.py", line 146, in text_to_speech
audio = future.result()
File "/usr/lib/python3.7/concurrent/futures/_base.py", line 435, in result
return self.__get_result()
File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.7/dist-packages/larynx/__init__.py", line 185, in _sentence_task
audio = vocoder_model.mels_to_audio(mels, settings=vocoder_settings)
File "/usr/local/lib/python3.7/dist-packages/larynx/waveglow.py", line 59, in mels_to_audio
if denoiser_strength > 0:
UnboundLocalError: local variable 'denoiser_strength' referenced before assignment

New languages need a link

Thanks @synesthesiam for this excellent tool.

I followed the 'Python installation' method (on Ubuntu 20.10) and added the language de-de via python3 -m gruut de-de download . Before I could use the new language, I had to add a link in ~/.local/lib/python3.8/site-packages/gruut/data/ to ~/.config/gruut/de-de; otherwise the new language was not found.

Integration for accessibility on linux

On Linux, blind and visually impaired people use their computers via the screen reader Orca. It reads the contents of the screen out loud to the user. It works with speech-dispatcher, which has generic module files where we could add an integration for larynx relatively easily. To be able to use these natural-sounding voices with Orca, we need to write such a module file. We also need to achieve a very small delay between sending the text to the engine and playing the WAV, because otherwise the system will not feel fluent. But if we achieve this, we will bring accessibility on Linux to the next level.

Ideas for lipsync and visemes?

First, love the project !

I have a robotic and virtual agent project that I'm trying to get as close to real-time response as possible.
I use the following to generate speech:
python3 fastVoice.py | larynx -v ek --interactive --ssml --raw-stream --cuda --half --max-thread-workers 8 --stdin-format lines --process-on-blank-line| aplay -r 22050 -c 1 -f S16_LE
Where fastVoice.py just dumps the SSML from a socket onto stdin (remember to flush properly ...)
fastVoice.txt

All works very well. Audio generally starts <1s from receiving the message. The question is how to get a phoneme-viseme sequence synced with the audio output.
I can manage to get level-0-ish lipsync by looking at the amplitude of the audio output, but that gives enough info for just the jaw, not the visemes of the lips.

Do you have any ideas/pointers on how to maintain the responsiveness of "--raw-stream" while getting real-time matching info to generate the matching visemes?

Soften the stop of voice at start of break?

How would I "soften" the end of the sentence at break? Jarring abrupt stop to voice at each break start.

I was thinking of moving to a Japanese voice for the Japanese words, then back to English for the movement directions. Abrupt change ups could make it painful to listen to.

Maybe when you add more of the SSML set, there will be enhanced control to tackle this.

Available benchmarks?

Not an issue

I am looking at using Larynx in my rhasspy implementation and was wondering about benchmarks before I go ahead and run some tests myself. I am interested in using one or two select voices at medium quality, and I wanted to pick the one with the quickest synthesis. Just by randomly testing a couple of voices, I see noticeable differences between voices for the same options and piece of text, so there are differences. But has anyone put together some benchmarks to compare the voices?

Also, on a related note, are there any benchmarks of the installation methods? My current method of installation is the Docker container, then calling a GET request, converting the binary response to a .wav file, and playing the WAV file (all in Python 3 on a Raspberry Pi 4, 64-bit). But has anyone noticed differences in speed between the Docker vs. Debian vs. Python 3 installations?

Any other language besides ljspeech en doesn't work

Hi everyone, awesome job with your TTS module!
I have a few problems getting it to work with foreign languages.
I tried with siwis and the 'it' voice; no error comes out, it just doesn't play anything, while ljspeech works correctly.

Tried with the latest larynx version on Ubuntu 18.04 (amd64), I'm always using it on CLI.

Problems pronouncing times and dates

It looks like the English and German voices fail to pronounce dates and times like this (only ones I've tested):

English: 4/23/2021, 5:02:54 PM
German: 23.4.2021, 17:02:51

I know this is a widely discussed problem in the TTS field and not so easy to solve, but maybe there is some smart Python library that does the work ;-). A small script using regular expressions could be a start, but to make this work for every language there would have to be some ML-based procedure, I guess.

Maybe you are already working on something? ^^
