grammatek / simaromur

Icelandic TTS (text-to-speech) service for Android

License: Apache License 2.0

Java 96.05% CMake 0.23% C++ 3.54% Shell 0.18%

thrax g2p openfst android tts

Símarómur

This project provides an Icelandic TTS application for the Android TTS service. The current state of the project is production-ready.

The app is available on the Google Play Store.

Voices

Símarómur provides access to neural network on-device voices that are bundled via assets.

Currently, there is one male voice available, named Steinn. This voice is not only highly intelligible but also possesses a pleasant and engaging tone, making it a versatile, general-purpose option that sets the standard for Icelandic on-device text-to-speech (TTS) technology. It is well-suited for reading both short and lengthy texts, providing a consistent listening experience.

We are currently developing a multi-speaker model that will include a female voice, slated for future release.

User Normalization Dictionary

Users can add normalization entries to accommodate alternative pronunciations of words or tokens. These alternative pronunciations take precedence over the built-in normalization rules, applying the specified replacements for any such terms found in the text being read.

To simplify usage, replacements can be made at the grapheme level without the need to understand or use regular expression syntax. Users can immediately hear how the entered term and its replacement sound with the current voice by using play buttons.

By default, the user normalization dictionary starts empty. At present, importing or exporting the dictionary is not supported.

Text Normalization & G2P

Icelandic text normalization is performed before the text enters G2P. Local voice G2P is rule-based and is implemented using the C++ frameworks Thrax & OpenFST, which are accessed via JNI.

New since version 2.x

Flite voices and the former neural network voices have been deprecated. Flite voices are now obsolete, and we use neural network voices exclusively. The Flite project is barely maintained, and the runtime performance of the neural network voices is rapidly closing in on that of the Flite voices. We can achieve 25x realtime speed with the neural network model on a Pixel 6 phone.

The neural network model is based on VITS and trained via Piper TTS.

Build Prerequisites

This project uses our own versions of OpenFST & Thrax, which carry the fixes needed to build for Android on their android branch. Please build & install these first, before compiling Símarómur.

Using prebuilt libraries from GitHub releases

For our CI jobs, we have already prebuilt all dependent libraries and published them as GitHub release assets on their respective project sites. You can take advantage of these and install them locally inside your project directory via the following procedure:

Set environment variables for the release versions to use, e.g.:

export OPENFST_TAG=1.8.1-android
export THRAX_TAG=1.3.6-android

Then run this script:

.github/scripts/dl_3rdparty.sh

This should download and extract all necessary binaries to the sub-directory 3rdparty/ndk.

Configuration & Build

Fetch the voice assets subdirectory via

git submodule update --init

Then create the file local.properties if it doesn't already exist and add the variable 3rdparty.dir for the installed OpenFST/Thrax libraries, e.g.

3rdparty.dir=/Users/fred/install-android

or, in case you have downloaded our releases via dl_3rdparty.sh, point this variable to the directory simaromur/3rdparty/ndk inside your project, e.g.:

3rdparty.dir=/Users/fred/projects/simaromur/3rdparty/ndk

It might also be necessary to adapt/uncomment the variable ndkVersion inside app/build.gradle, depending on your installed NDK version. Then build the project inside Android Studio.

Contributing

You can contribute to this project by forking it, creating a branch and opening a new pull request.

License

Acknowledgements

We use the third-party library Sonic for audio speed and pitch manipulation. Sonic is Copyright 2010, 2011 by Bill Cox and is licensed under the Apache License. Símarómur uses adapted versions of Thrax and OpenFST for G2P. These are also licensed under the Apache License. Furthermore, we use OpenNLP for tokenization and sentence splitting. OpenNLP is licensed under the Apache License.

simaromur's Issues

E-mail feedback

Implement an activity to send feedback e-mails to Grammatek. Add a button on the main app screen to access this activity.

App crashes nondeterministically when switching between views

The following crash can be observed when randomly switching views inside the app:

2021-04-30 15:09:20.988 6887-6913/com.grammatek.simaromur V/Flite_Native_JNI_Service: Java_com_grammatek_simaromur_NativeFliteTTS_nativeDestroy
2021-04-30 15:09:20.988 6887-6913/com.grammatek.simaromur I/Flite_Native_Engine: TtsEngine::shutdown
2021-04-30 15:09:20.988 6887-6913/com.grammatek.simaromur I/Flite_Native_Engine: Voices::~Voices Deleting voice list
2021-04-30 15:09:20.988 6887-6913/com.grammatek.simaromur I/Flite_Native_Engine: Voices::~Voices voice list deleted
2021-04-30 15:09:20.988 6887-6913/com.grammatek.simaromur I/Flite_Native_Engine: Unloading generic clustergen voice.
2021-04-30 15:09:20.989 6887-6913/com.grammatek.simaromur V/Flite_Native_JNI_Service: Java_com_grammatek_simaromur_NativeFliteTTS_nativeDestroy
2021-04-30 15:09:20.989 6887-6913/com.grammatek.simaromur I/scudo: Scudo ERROR: invalid chunk state when deallocating address 0xc304e5f0
2021-04-30 15:09:20.990 6887-6913/com.grammatek.simaromur A/libc: Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid 6913 (FinalizerDaemon), pid 6887 (matek.simaromur)

When using the debugger on the Android emulator, the following line of code is reported as the culprit:

With that backtrace:

Get list of beta testers

Implement CMU g2p for Flite voice

For the Flite voice, we need the CMU phoneme set to be implemented.

Use streaming response from Network-API

Instead of waiting for the complete audio to be returned, we should read the audio in chunks and feed it to the audio queue that is used for the on-device voices. The benefit would be less delay for longer text passages.

"No text found" response plays with higher pitch

I have a debug version at #694bb62 built and installed on a Mi 2 with Android 10. When I have "Select to speak" active and select an area containing no text I get a synthesized response saying "Enginn texti fannst á þessum stað" (if using Icelandic). Using Símarómur this response always plays at a slightly higher pitch than everything else.

Log if I select an area that contains some text:

06-24 17:41:08.943  3784  3863 I Simaromur_Java_TTSService: onSynthesizeText
06-24 17:41:08.943  3784  3863 V Simaromur_Java_TTSService: onSynthesizeText: (isl/ISL/Dóra), voice: Dóra
06-24 17:41:08.943  3784  3863 E Simaromur_Java_TTSService: onSynthesizeText: Loaded voice (Alfur) and given voice (Dóra) differ ?!
06-24 17:41:08.943  3784  3863 V Simaromur_AppRepository: getCachedVoices
06-24 17:41:09.044  3784  3863 V Simaromur_Java_TTSService: TalkBackonSynthesizeText:  => normalized =>TalkBakk .
06-24 17:41:09.044  3784  3863 V Simaromur_TiroSpeakController: streamAudio: request: SpeakRequest{Engine='standard', LanguageCode='is-IS', OutputFormat='pcm', SampleRate='16000', Text='TalkBakk .', TextType='text', VoiceId='Alfur'}
06-24 17:41:09.522  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: Tiro API returned: 29720 bytes
06-24 17:41:09.522  3784  3784 I Simaromur_AppRepository: Applying pitch 1.6, speed 1.0
06-24 17:41:09.568  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: offset = 0
06-24 17:41:09.569  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: offset = 8192
06-24 17:41:09.569  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: offset = 16384
06-24 17:41:09.593  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: offset = 24576
06-24 17:41:09.601  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: consumed 29774 bytes
06-24 17:41:10.654  3784  3863 I Simaromur_Java_TTSService: onSynthesizeText
06-24 17:41:10.654  3784  3863 V Simaromur_Java_TTSService: onSynthesizeText: (isl/ISL/Dóra), voice: Dóra
06-24 17:41:10.654  3784  3863 E Simaromur_Java_TTSService: onSynthesizeText: Loaded voice (Alfur) and given voice (Dóra) differ ?!
06-24 17:41:10.655  3784  3863 V Simaromur_AppRepository: getCachedVoices
06-24 17:41:10.754  3784  3863 V Simaromur_Java_TTSService: Off / Lesa upp atriði á skjánumonSynthesizeText:  => normalized =>Off skástrik Lesa upp atriði á skjánum .
06-24 17:41:10.754  3784  3863 V Simaromur_TiroSpeakController: streamAudio: request: SpeakRequest{Engine='standard', LanguageCode='is-IS', OutputFormat='pcm', SampleRate='16000', Text='Off skástrik Lesa upp atriði á skjánum .', TextType='text', VoiceId='Alfur'}
06-24 17:41:11.415  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: Tiro API returned: 75788 bytes
06-24 17:41:11.416  3784  3784 I Simaromur_AppRepository: Applying pitch 1.6, speed 1.0
06-24 17:41:11.484  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: offset = 0
06-24 17:41:11.484  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: offset = 8192
06-24 17:41:11.485  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: offset = 16384
06-24 17:41:11.502  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: offset = 24576
06-24 17:41:11.505  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: offset = 32768
06-24 17:41:11.785  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: offset = 40960
06-24 17:41:12.065  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: offset = 49152
06-24 17:41:12.345  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: offset = 57344
06-24 17:41:12.625  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: offset = 65536
06-24 17:41:12.905  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: offset = 73728
06-24 17:41:13.045  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: consumed 75852 bytes
06-24 17:41:13.935  3784  3863 I Simaromur_Java_TTSService: onSynthesizeText
06-24 17:41:13.935  3784  3863 V Simaromur_Java_TTSService: onSynthesizeText: (isl/ISL/Dóra), voice: Dóra
06-24 17:41:13.935  3784  3863 E Simaromur_Java_TTSService: onSynthesizeText: Loaded voice (Alfur) and given voice (Dóra) differ ?!
06-24 17:41:13.935  3784  3863 V Simaromur_AppRepository: getCachedVoices
06-24 17:41:13.935  3784  3863 V Simaromur_Java_TTSService: onSynthesizeText:  => normalized =>
06-24 17:41:13.935  3784  3863 I Simaromur_Java_TTSService: onSynthesizeText: finished

Log if I select an area containing no text:

06-24 17:43:13.664  3784  3863 I Simaromur_Java_TTSService: onSynthesizeText
06-24 17:43:13.664  3784  3863 V Simaromur_Java_TTSService: onSynthesizeText: (isl/ISL/Dóra), voice: Dóra
06-24 17:43:13.664  3784  3863 E Simaromur_Java_TTSService: onSynthesizeText: Loaded voice (Alfur) and given voice (Dóra) differ ?!
06-24 17:43:13.664  3784  3863 V Simaromur_AppRepository: getCachedVoices
06-24 17:43:13.760  3784  3863 V Simaromur_Java_TTSService: Enginn texti fannst á þessum stað.onSynthesizeText:  => normalized =>Enginn texti fannst á þessum stað .
06-24 17:43:13.761  3784  3863 V Simaromur_TiroSpeakController: streamAudio: request: SpeakRequest{Engine='standard', LanguageCode='is-IS', OutputFormat='pcm', SampleRate='16000', Text='Enginn texti fannst á þessum stað .', TextType='text', VoiceId='Alfur'}
06-24 17:43:14.377  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: Tiro API returned: 60556 bytes
06-24 17:43:14.377  3784  3784 I Simaromur_AppRepository: Applying pitch 1.92, speed 1.0
06-24 17:43:14.426  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: offset = 0
06-24 17:43:14.426  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: offset = 8192
06-24 17:43:14.426  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: offset = 16384
06-24 17:43:14.439  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: offset = 24576
06-24 17:43:14.442  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: offset = 32768
06-24 17:43:14.739  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: offset = 40960
06-24 17:43:15.019  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: offset = 49152
06-24 17:43:15.299  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: offset = 57344
06-24 17:43:15.579  3784  3784 V Simaromur_AppRepository: TiroTtsObserver: consumed 60710 bytes

The first one has pitch 1.6 and the second has pitch 1.92. If I set the pitch to 1.0, the second will be 1.2, so the pitch seems to get multiplied by 1.2 (1.6 × 1.2 = 1.92).

New app logos

We should replace all current app logos with our own. This is also a prerequisite for publishing it to the Play store.

Use double buffering for local voices

We should use the device's CPU to compute the next utterance while the current one is being spoken. To detect the end of an utterance, we can use the same heuristics that are currently in use. There should be a queue for the speak task that decouples the current utterance from the processing. This queue should have size 1, and the consumer (the one doing the processing and feeding the audio samples to the TTS service) should block while waiting on the queue. We should also examine whether this approach could be used for network voices, although the delay there is much smaller.

The challenge is to signal to the TTS service that we have consumed the current utterance although we haven't. We use the synthCb.audioAvailable() call to feed the TTS service with audio. We could, for example, play a dummy silence for the first utterance and call synthCb.done() directly afterwards, so that the next utterance arrives and the sequence of utterances keeps coming. Then we would feed the TTS service via synthCb.audioAvailable() with the actually computed utterance n-1 while the next utterance is being prepared. We repeat this until we get an end-of-utterance signal from the TTS service. For the second-to-last utterance we don't call synthCb.done(); instead, we wait for the last utterance to be computed, call synthCb.audioAvailable() with the last computed utterances, and only then call synthCb.done(), so that the number of calls to synthCb.done() equals the number of executed callbacks.

This has to be thoroughly tested, with all possible combinations and error situations in between.
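The size-1 queue with a blocking consumer could be sketched roughly as follows. This is a minimal illustration, not Símarómur code: the class DoubleBufferSketch, the method run and the synthesize function are hypothetical names, and the Android callback sequencing around synthCb.audioAvailable() and synthCb.done() described above is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

// Hypothetical sketch; none of these names come from the Símarómur code base.
class DoubleBufferSketch {
    // Sentinel object that tells the consumer no more utterances will arrive.
    private static final byte[] END = new byte[0];

    /**
     * Synthesizes utterances on a producer thread while the calling thread
     * consumes ("plays") them. The queue holds at most one finished utterance,
     * so the producer is at most one utterance ahead of playback.
     */
    static List<byte[]> run(List<String> utterances, Function<String, byte[]> synthesize) {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(1);
        Thread producer = new Thread(() -> {
            try {
                for (String text : utterances) {
                    queue.put(synthesize.apply(text)); // blocks while the slot is occupied
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        List<byte[]> played = new ArrayList<>();
        try {
            // take() blocks until the next utterance has been computed
            for (byte[] audio = queue.take(); audio != END; audio = queue.take()) {
                played.add(audio); // stand-in for feeding audio to the TTS service
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for audio", e);
        }
        return played;
    }
}
```

Error handling is omitted here: if synthesize throws, the producer thread ends without enqueuing the sentinel and the consumer would wait forever, which is one of the error situations that would need a test.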

Voice not playing in info screen

The voice cannot be played in the info screen. The API returns the audio, but MediaPlayer is stopped immediately.

Prepare frontend for newer neural network voices

The newer network voices (espnet based) that are currently prepared for mobile inferencing use slightly different symbols than the current on-device voices. The frontend has to be adapted (G2P) to match the training data.

Add speechmark handling

Network API supports speechmarks; Android service supports speechmarks; we should also support speechmarks

First start handling

Think about what should be presented to the user when the app is started for the first time.

E.g.:

Toast message
- some explanation about how the data is used
- a link to the Google Play entry of the Accessibility Toolbox, so that selecting text in any app for TTS is possible
- general workflow info about how to activate TTS

~~Also we should add a shortcut for the TTS settings dialog on the TTSManager activity screen~~ (done)

Google Accessibility Client stops in the middle of reading

It seems that empty texts are sent even in the middle of a reading session. In that case, nothing currently moves forward and the TTS client waits endlessly for some TTS action.

Make sample rate of device voice a parameter of the voice itself instead of hard-coding

As came up in a discussion here, it's better to not rely on the voice models being generated with a fixed 22050 Hz sample rate, but to add the used sample rate as metadata in the file voice-info.json of the simaromur-voices submodule.

Add running unit tests on CI

As the title says: don't only build the application as apk, but run the unit tests already.

Don't play voice in case we already get a new request

Users reported to us that, when using TalkBack, they like to navigate as fast as possible to certain GUI elements. Currently, doing so queues up texts that are always read in full, resulting in a long response time.

We should stop playing a voice immediately, as soon as a new speak request is received.
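One possible way to get this "newest request wins" behaviour is a shared ticket counter that playback checks between audio chunks. The sketch below assumes that audio is written in chunks; LatestRequestGate and its methods are hypothetical names and not part of the app.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; class and method names are not taken from Símarómur.
class LatestRequestGate {
    private final AtomicLong latest = new AtomicLong();

    /** Registers a new speak request and returns its ticket; older tickets become stale. */
    long newRequest() {
        return latest.incrementAndGet();
    }

    /** True while no newer request has arrived since this ticket was issued. */
    boolean isCurrent(long ticket) {
        return latest.get() == ticket;
    }

    /**
     * Plays audio chunk by chunk and stops as soon as the ticket is stale.
     * Returns the number of chunks that were actually played.
     */
    int play(long ticket, int chunkCount, Runnable playChunk) {
        int played = 0;
        while (played < chunkCount && isCurrent(ticket)) {
            playChunk.run(); // stand-in for writing one audio buffer
            played++;
        }
        return played;
    }
}
```

With this approach, a request that is superseded stops after the chunk that is currently being written, so the maximum extra delay is one chunk.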

Try Flite phoneme voice from RU

We already have a demo voice deliverable. We should try it out on the device and make the runtime at least work together with this voice.

Show splash screen at startup while initializing

Show some sort of "busy" activity while the app is starting up. Currently, only a white screen is shown, and the TTSManager activity appears after roughly 3.5 seconds.

Add user synonym entries dictionary

We got an idea from one of our users, who uses word synonyms in the Ivona app to improve the pronunciation of utterances the app doesn't yet know, or where the user wants to deviate from what the app decides.

Add pronunciation dictionary with alternative pronunciation

Pronunciation of English words on screen

Do another round of adding missing English words and app names to the dictionary. For example "Talkback" is missing. Go systematically through the UI and transcribe missing words.

Normalization issues

This is a collection of normalization issues we have found with the application.

Replacement error
- ("Það er rúmlega 93 þús km"), normalized ("Það er rúmlega níu þrír þúsund níu þrír einns km .")
km: the unit is not replaced by the correct Icelandic word
- ("Það er rúmlega 93 þús km"), normalized ("Það er rúmlega níu þrír þúsund níu þrír einns km .")
( ): parentheses are not taken into account when spoken. No pause is made, which sounds strange
- ("Nær allur austurhluti landsins er á ungversku sléttunni (pússtunni). "), normalized ("Nær allur austurhluti landsins er á ungversku sléttunni ( pússtunni )")
Issues with the TTS accessibility app from Google, which we should perhaps document somewhere:
- ²: superscripts are not spoken, e.g. for km², km is fed first and then a non-superscript 2, as 2 consecutive TTS requests
- all external links (e.g. a link from a word to another Wikipedia article inside a Wikipedia article) are requested via a separate TTS request instead of being spoken fluently

Get the Beta Test workflow defined

2022

Is it 2022 already? Update all copyright notices.

Firebase crash reports

Firebase reports these issues:

 Fatal Exception: java.lang.NullPointerException
Attempt to invoke interface method 'void com.grammatek.simaromur.network.tiro.VoiceController$VoiceObserver.error(java.lang.String)' on a null object reference
com.grammatek.simaromur.network.tiro.VoiceController.onFailure (VoiceController.java:139)
retrofit2.ExecutorCallAdapterFactory$ExecutorCallbackCall$1$2.run (ExecutorCallAdapterFactory.java:79)
android.os.Handler.handleCallback (Handler.java:938)

 Fatal Exception: java.lang.NullPointerException
Attempt to invoke interface method 'void c.c.a.g0.b.b(java.lang.String)' on a null object reference
com.grammatek.simaromur.network.tiro.SpeakController.onFailure (SpeakController.java:145)

These have a similar cause: it seems that in SpeakController and VoiceController the method onStop() is called first and the callback onFailure() later. The latter refers to the given callback, which has already been deinitialized in onStop().
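A defensive pattern for this ordering is to read the observer field once and skip the notification if it is already null. The following is only an illustration of that pattern under assumed names (GuardedController, Observer); it is not the actual SpeakController or VoiceController code.

```java
// Hypothetical sketch; it mirrors the onStop()/onFailure() ordering described above,
// but the names are illustrative and not the actual Símarómur classes.
class GuardedController {
    interface Observer {
        void error(String message);
    }

    // volatile, because onStop() and onFailure() may run on different threads
    private volatile Observer observer;

    GuardedController(Observer observer) {
        this.observer = observer;
    }

    /** Drops the observer; callbacks arriving later must not touch it. */
    void onStop() {
        observer = null;
    }

    /** Returns true if the error was delivered, false if it was dropped. */
    boolean onFailure(String message) {
        Observer current = observer; // read the field exactly once
        if (current == null) {
            return false; // already stopped: nothing to notify
        }
        current.error(message);
        return true;
    }
}
```

Copying the field into a local variable matters: checking the field for null and then dereferencing the field again would still leave a window in which onStop() can clear it.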

Prepare JUnit based testing

Things that can be tested individually, e.g. without the GUI, should be tested via JUnit tests. Find out how to mock the context objects that are needed inside our classes.

Upload App to Google Play for testing

Selected voice sometimes changes

This has been noticed in v1.2.1, but probably is also a problem in earlier versions. The voice Álfur hratt was selected, but after a few utterances, it changed suddenly to another voice.

Fix 2 crashes

Google console reports 2 new crashes.

Crash in AppRepository::getCachedVoices() that results from returning uninitialized mAllCachedVoices member.
Crash in callback provided via ConnectionCheck::registerNetworkCallback() which doesn't check return value of connectivityManager.getNetworkCapabilities(network)

Double entries when API voice name changes

Since the Tiro API changed parameters of some voices, the VoiceInfo activity now shows double entries for these updated voices, and the DB accordingly contains double entries as well.

examine the index uniqueness constraint for the used fields
delete API entries that don't exist anymore; if such an entry is the current default voice, make another voice the default voice

After reinstallation: race condition between previously selected voice and Símarómur "default" voice

After a reinstall of the app, Dóra is often used instead of Álfur if the latter was selected previously. Only reselecting Álfur fixes the problem. There seems to be a race condition between the incoming result of the network voice query, the update of the DB and the queries coming from the TTS client engine.

Examine branch 'abn-syllab-stress'

This branch is currently in limbo. Prepare a PR / delete ?

Add voice sub-activity in VoiceManager activity

When pressing a voice name, a new activity should be started

show voice type (network, local)
if network voice, show if available (API/network)
voice audio demo via play button - this should not use the Android TTS service, but a direct "short-cut"

Rearrange voice selection handling

Currently, we return many voices for the Icelandic locale, "abducting" the locale property variant for the voice name. The user has to explicitly choose the voice by selecting the locale, instead of choosing it via the voice settings.

We should rather make the voice selectable inside the app and only return the selected voice as the "favorite" voice of is_IS. If the user opens the voice settings inside the TTS system settings, we should refer them to these in-app settings. We should only add more voices for the locale if we really support different locales.

Use setContentDescription() for all views for better accessibility

Via View#setContentDescription(CharSequence contentDescription), screen readers can read the title of a view. We should add a content description to all views.

Add privacy notice

Add a privacy note to the Info screen explaining what we do (and don't do) with the text read via the TTS service.

DateConversion fails on some devices

Firebase shows the following backtrace:

sun.util.calendar.BaseCalendar.getCalendarDateFromFixedDate (BaseCalendar.java:453)
java.util.GregorianCalendar.computeFields (GregorianCalendar.java:2411)
java.util.GregorianCalendar.computeTime (GregorianCalendar.java:2813)
java.util.Calendar.updateTime (Calendar.java:3402)
java.util.Calendar.getTimeInMillis (Calendar.java:1761)
java.util.Calendar.getTime (Calendar.java:1734)
java.text.SimpleDateFormat.parseInternal (SimpleDateFormat.java:1832)
java.text.SimpleDateFormat.parse (SimpleDateFormat.java:1726)
java.text.DateFormat.parse (DateFormat.java:360)
com.grammatek.simaromur.db.TimestampConverter.fromTimestampString (TimestampConverter.java:26)
com.grammatek.simaromur.db.AppDataDao_Impl$4.call (AppDataDao_Impl.java:272)
com.grammatek.simaromur.db.AppDataDao_Impl$4.call (AppDataDao_Impl.java:227)
androidx.room.RoomTrackingLiveData$1.run (RoomTrackingLiveData.java:90)
java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1167)
java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:641)
java.lang.Thread.run (Thread.java:919)

This happened on one device. The solution is to catch the exception and return the current date instead.
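The proposed fallback could look roughly like the sketch below. The class name SafeTimestampParser and the date pattern are assumptions made for illustration; the real TimestampConverter and the format stored in the database may differ.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical sketch; the pattern below is an assumption, not the app's stored format.
class SafeTimestampParser {
    private static final String PATTERN = "yyyy-MM-dd HH:mm:ss";

    /** Parses the timestamp, falling back to the current date if parsing fails. */
    static Date fromTimestampString(String value) {
        if (value == null) {
            return new Date();
        }
        try {
            // A new formatter per call, because SimpleDateFormat is not thread-safe.
            return new SimpleDateFormat(PATTERN, Locale.ROOT).parse(value);
        } catch (ParseException | RuntimeException e) {
            return new Date(); // fall back to "now", as proposed above
        }
    }
}
```

Catching RuntimeException in addition to ParseException is a deliberate choice here, since the reported trace ends inside the calendar code rather than in the parser's own error path.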

Normalization/G2P issue/todo collection

g2p

Álfur P => au l v Y r p_h a h c I
Mörk Ómars á EM: => m 9 r_0 k ou: m a r_0 s au: E: m : (=> ":" at end)
til áramóta 2021/2022 => t_h I: l au: r a m ou t a §sp
Improve Símarómur => I m_0 p r O v E s i: m a r ou m Y r §sp
start and end every utterance with a pause (§sp / pau)
Do custom g2p handling of some problematic letters when read as a single utterance (important for keyboard reading)

Other

Should correctly produce English/Icelandic phonemes for all Símarómur strings that could be read by VoiceOver if set to English locale

New

1.800.000 kr via network voice => 1 komma 8 0 0 0 0 0 krónur
original ("Anníe Mist Þórisdóttir er svolítið stríðin og Björgvin Karl Guðmundsson fékk að kynnast því á dögunum.INSTAGRAM/@anniethorisdottir")
- normalized ("anníe mist þórisdóttir er svolítið stríðin og björgvin karl guðmundsson fékk að kynnast því á dögunum.instagram@anniethorisdottir .")
- phonemes ("a n i E m I s t T ou: r I s t ou h t I r E r s v O: l i t I D s t r i: D I n O: G p j 9 r k v I n k_h a r t l_0 k v Y D m Y n t s O n f j E h k a: D c_h I n a s t T v i: au: Rewrite failed")
original ("Bühl er liðsfélagi Glódísar Perlu Viggósdóttur, Karólínu Leu Vilhjálmsdóttur og Cecilíu Rán Rúnarsdóttur hjá Bayern München.")
- normalized ("buhl er liðsfélagi glódísar perlu viggósdóttur , karólínu leu vilhjálmsdóttur og kekilíu rán rúnarsdóttur hjá bayern munkhen .")
- phonemes ("p Y l_0 E r l I D s f j E l ai j I k l ou: t i s a r p_h E r t l Y v I k ou s t ou h t Y r §sp k_h a: r ou l i n Y l E: Y v I lC au l m s t ou h t Y r O: G c_h E: c I l i j Y r au: n r u: n a r_0 s t ou h t Y r C au: p ai j E r n m u n_0 k h E n ") - missing space between "l C"
original ("4. ágúst 2022 06:42")
- normalized ("fjórir. ágúst tvö þúsund tuttugu og tvö núll sex fjörutíu og tvö .")
original ("Link in bio 💅🏼")
- normalized ("link in bio �� .")
- phonemes ("l i N_0 k I n p I: O Rewrite failed ")
original ("Aðeins er eftir um einn metri áður en hraunið fer að renna út úr dölunum.ELDFJALLAFRÆÐI OG NÁTTÚRUVÁRHÓPUR HÍ")
- normalized ("aðeins er eftir um einn metri áður en hraunið fer að renna út úr d ö l u n u m punktur eldfjallafræði og n á t t ú r u v á r h ó p u r h í .")
- phonemes ("a: D ei n s E r E f t I r Y m ei t n_0 m E: t r I au: D Y r E n r_0 9i: n I D f E: r a: D r E n a u: t u: r t j E: 9: ai: Y: E n Y: E m p_h u n_0 t Y r E l t f j a t l a v r ai D I O: G E n au: t_h j E: t_h j E: u: E r Y: v a f au: E r h au: ou: p_h j E: Y: E r h au: i: ")
original ("www.visir")
- normalized ("w w w . v i s i r ")
- phonemes ("t_h v 9: f a l_0 t v a f t_h v 9: f a l_0 t v a f t_h v 9: f a l_0 t v a f §sp v a f I: E s I: E r")

Handle normalization version changes in UtteranceCache handling

Is there anywhere the possibility to findItemByTextAndFrontendVersion? If the current frontend version == cacheItem frontend version, we don't have to perform text pre-processing, isn't that correct?
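As a rough illustration of the lookup asked about here, the sketch below keys cached items by text and only returns a hit when the stored frontend version matches. Only the method name findItemByTextAndFrontendVersion is taken from the question; the Item fields and the in-memory map are assumptions and do not reflect the real UtteranceCache storage.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; only the lookup method name comes from the question above.
class UtteranceCacheSketch {
    static final class Item {
        final String text;
        final String normalized;
        final String frontendVersion;

        Item(String text, String normalized, String frontendVersion) {
            this.text = text;
            this.normalized = normalized;
            this.frontendVersion = frontendVersion;
        }
    }

    private final Map<String, Item> itemsByText = new HashMap<>();

    void put(Item item) {
        itemsByText.put(item.text, item);
    }

    /** Returns the cached item only if it was produced by the given frontend version. */
    Optional<Item> findItemByTextAndFrontendVersion(String text, String frontendVersion) {
        Item item = itemsByText.get(text);
        if (item == null || !item.frontendVersion.equals(frontendVersion)) {
            return Optional.empty(); // unknown text or stale pre-processing
        }
        return Optional.of(item);
    }
}
```

A hit means the stored normalization can be reused as is; a miss means the text has to go through pre-processing again and the entry should be replaced.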

Originally posted by @bnika in #102 (comment)

Analyze & fix crash for empty returned voice

Firebase shows a crash for beta3.3 for empty returned voice.

same happened if some GSON & Retrofit2 symbols were dexed. Make sure, we are not running into this problem again
add a defensive null check at the pointed out place

Startup is slow

When the app is started for the first time, there is a long delay until the TTSManager screen is shown: it takes more than 3 seconds. We should analyze whether we can throttle the voice query network requests if we already have current voice info for the service inside the DB.

define voice query interval time (e.g. 30 minutes)
don't query voice list from server, if interval time hasn't expired
use DB for satisfying the voice availability request
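The interval check in the list above could be sketched like this. VoiceQueryThrottle and shouldQueryServer are hypothetical names, and the sketch keeps the last query time in memory only; to speed up a cold start, the timestamp would have to be persisted, for example in the DB.

```java
// Hypothetical sketch; names are illustrative and not part of the app.
class VoiceQueryThrottle {
    private final long intervalMillis;
    private long lastQueryMillis;
    private boolean queried;

    VoiceQueryThrottle(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /**
     * Returns true if the server should be queried now and records the query time.
     * Returns false while the interval has not expired; the caller then answers
     * the request from the DB instead.
     */
    synchronized boolean shouldQueryServer(long nowMillis) {
        if (queried && nowMillis - lastQueryMillis < intervalMillis) {
            return false;
        }
        queried = true;
        lastQueryMillis = nowMillis;
        return true;
    }
}
```

Passing the current time in as a parameter keeps the class easy to unit test; in the app it would typically be called with System.currentTimeMillis().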

Update external dependencies

External libraries/frameworks like torch-mobile, room, coordinatorlayout, etc. are available in newer versions. Update those for which it makes sense.

Use Icelandic localisation

The UI is currently in English

Honor the currently set locale in case it's Icelandic

Handle network and/or TTS service unavailability

When the network and/or the TTS service are not available, the app behaves nondeterministically.

if network is not available and TTSService is triggered
- play a distinct message telling the user to connect to the internet (in case only network voices are available, which is currently always the case)
- show a toast with the same message, which the user has to confirm by pressing 'ok'
if the network service is not available, play a distinct message that the service is "currently unavailable, please try again later"
define whether we should play this for every supported voice, or maybe only via Dóra
prerecord the messages and integrate them into the app
also show the message inside the Símarómur app, if it's activated
add the network status to the info screen?

Add Tiro network voice API

Add classes that access the Tiro web service.

detect web service availability
send text, receive audio
play received audio

Debundle network voices

With the current 2 neural network voices Álfur + Díljá inside the app assets, we are close to the maximum app size that Google Play accepts. As we have a few more on-device voices in the pipeline, we need to make these available as downloads.

We should bundle these together with meta data and sample .wav files and make them available as voice releases on simaromur_voices. Any new version of a voice should be detected by Símarómur and marked as updatable, so that the user can update an already downloaded voice.

In the end, the app should just contain the bare minimum assets. Voices should be downloaded to Android phone external storage and used from there.

Remove superfluous POS-Tagger model

Currently, we have 2 POS tagger models: is-pos-reduced-maxent.bin used by NormalizationManager and is-pos-maxent.bin used by FrontendManager.

Use only one of these and remove the other
Make the POS tagger available by a getter method to the other instance where it's needed

Implement audio caching

To get down the latency of the most used utterances, e.g. for navigation items and titles, we should cache the generated audio up to a certain amount of items.

To be somewhat efficient, we should use a hash with the voice name, the voice version, the text itself as the filename basename and save it with the audio type as suffix
For PCM data, we should rather convert this immediately to a wav, because otherwise we will not know later, what exactly was the format used
Instead of passing around data buffers, we should pass around the filename of the just generated/downloaded audio
- use a temp-file for downloaded audio and after all data is downloaded, rename it to the final path
limit the amount of cached files to a sensible threshold (e.g. 1000 entries)
- delete the oldest items ?
- delete most unused items ?
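The file naming and the entry limit from the list above could be sketched as follows. This is only one possible design: SHA-256 as the hash and least-recently-used eviction are choices made for this illustration (the list leaves the eviction policy open), and AudioCacheSketch, fileName and lruIndex are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; hash algorithm and eviction policy are assumptions.
class AudioCacheSketch {
    /** Builds "<sha256 hex>.<suffix>" from voice name, voice version and text. */
    static String fileName(String voiceName, String voiceVersion, String text, String suffix) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            // The separator keeps ("ab", "c") and ("a", "bc") from producing the same key.
            String key = String.join("\n", voiceName, voiceVersion, text);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(key.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex + "." + suffix;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    /** Index of cached files that evicts the least recently used entry beyond maxEntries. */
    static <V> Map<String, V> lruIndex(int maxEntries) {
        // accessOrder = true makes get() count as a "use" of the entry
        return new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > maxEntries;
            }
        };
    }
}
```

In a real cache, evicting an index entry would also have to delete the corresponding file; that step is left out of the sketch.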

Download Voice Data activity is triggered

Firebase shows a crash for one device in the following area:

It seems, the DownloadVoice activity has been triggered. This must be a consequence of a return value from Símarómur, indicating that the Voice Data is not yet returned. The crash itself is obvious, because this code hasn't been tested in combination with network-only voices. It's planned to touch the code, when we have the on-device voices ready.

The crash happened just once for the user; it is presumably a first-initialization bug.

Download of Álfur hratt does not work

Attempts to download Álfur hratt run into "DownloadVoiceManager: Server Voice description not available" Error.
Not a permanent error, repeated attempts work (after reloading the app from AndroidStudio, first in debug mode and then delete all data on device and load again via "run").

Examine asset file copying

Currently, some assets are copied over to external storage at startup of the app. This is unnecessary and increases startup time. Examine which of these can be replaced by direct usage of the asset files.