readalongs / studio

Audiobook alignment for Indigenous languages

Home Page: https://readalongs.github.io/Studio/

License: Other

Python 98.00% Dockerfile 0.34% CSS 0.42% HTML 1.20% Procfile 0.04%

studio's Issues

Set root directory on coveralls

Currently coveralls makes you go way deep down into the directory structure just to see the relevant files:

There's a way to set the root directory so that it does away with all those files and also lets you see each file's coverage. Right now it just says:

The file "/home/travis/build/dhdaines/ReadAlong-Studio/readalongs/align.py" isn't available on github. Either it's been removed, or the repo root directory needs to be updated.

I think the solution is here and that if you're logged in as the owner (@dhdaines) you should see somewhere to change the root.

TokenizerLibrary initializing all mappings

The TokenizerLibrary class (readalongs/text/tokenize_xml.py) is currently initialized with every single available mapping. See its constructor:

    def __init__(self):
        self.tokenizers = {None: DefaultTokenizer()}
        for x in MAPPINGS_AVAILABLE:
            mapping = Mapping(in_lang=x["in_lang"], out_lang=x["out_lang"])
            tokenizer_key = self.get_tokenizer_key(x["in_lang"], x["out_lang"])
            self.tokenizers[tokenizer_key] = Tokenizer(mapping)

We should instead load only the mappings needed on the path from the input language to eng-arpabet.
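A minimal sketch of lazy loading, assuming a factory callable stands in for Tokenizer(Mapping(in_lang=..., out_lang=...)). The class name LazyTokenizerLibrary, the make_tokenizer parameter and the stand-in factory are all hypothetical; none of them is the actual Studio code.

```python
from typing import Callable, Dict, List, Tuple


class LazyTokenizerLibrary:
    """Build a tokenizer only the first time its language pair is requested."""

    def __init__(self, make_tokenizer: Callable[[str, str], object]):
        # make_tokenizer is a hypothetical factory; in Studio it would wrap
        # something like Tokenizer(Mapping(in_lang=..., out_lang=...)).
        self._make_tokenizer = make_tokenizer
        self._tokenizers: Dict[Tuple[str, str], object] = {}

    def get(self, in_lang: str, out_lang: str) -> object:
        # Keying on the (in_lang, out_lang) pair keeps mappings that share
        # an input language from overwriting each other.
        key = (in_lang, out_lang)
        if key not in self._tokenizers:
            self._tokenizers[key] = self._make_tokenizer(in_lang, out_lang)
        return self._tokenizers[key]

    def loaded_keys(self) -> List[Tuple[str, str]]:
        return sorted(self._tokenizers)


# Stand-in factory that records which mappings were actually built.
built = []


def fake_factory(in_lang, out_lang):
    built.append((in_lang, out_lang))
    return f"tokenizer:{in_lang}->{out_lang}"


library = LazyTokenizerLibrary(fake_factory)
library.get("fra", "fra-ipa")
library.get("fra", "fra-ipa")  # second call reuses the cached tokenizer
```

With this shape, only the mappings actually requested along the conversion path are ever constructed.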

DNA audio and silence splitting interaction

Marc found a situation where silence splitting made the first word span half the DNA range at the beginning of a file. Notice in the data below that [0,4576] is DNA audio, but the first word in the SMIL file starts exactly at 4576/2=2.288s.

Config:

{
        "do-not-align": {
                "method": "remove",
                "segments": [{
                                "begin": 0,
                                "end": 4576
                        },
                        {
                                "begin": 11726,
                                "end": 26267
                        }
                ]
        }
}

Command:
readalongs align --config ./s0387_intro2.json --debug --save-temps --force-overwrite --language iku s0387_intro2.xml s0387_intro2.mp3 s0387_intro2.config 2> s0387_intro2.out.config

Data files: Eric received the data files to reproduce this by e-mail from Marc on 2021-04-19.

Output:

<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">
    <body>
        <par id="par-t0b0d0p0s0w0">
            <text src="s0387_intro2.config.xml#t0b0d0p0s0w0"/>
            <audio src="s0387_intro2.config.mp3" clipBegin="2.288" clipEnd="5.016"/>
        </par>
        <par id="par-t0b0d0p0s0w1">
            <text src="s0387_intro2.config.xml#t0b0d0p0s0w1"/>
            <audio src="s0387_intro2.config.mp3" clipBegin="5.016" clipEnd="5.526"/>
        </par>
        <par id="par-t0b0d0p0s0w2">
            <text src="s0387_intro2.config.xml#t0b0d0p0s0w2"/>
            <audio src="s0387_intro2.config.mp3" clipBegin="5.526" clipEnd="6.806"/>
        </par>
        <par id="par-t0b0d0p0s0w3">
            <text src="s0387_intro2.config.xml#t0b0d0p0s0w3"/>
            <audio src="s0387_intro2.config.mp3" clipBegin="6.806" clipEnd="8.201"/>
        </par>
        <par id="par-t0b0d0p0s0w4">
            <text src="s0387_intro2.config.xml#t0b0d0p0s0w4"/>
            <audio src="s0387_intro2.config.mp3" clipBegin="8.201" clipEnd="9.301"/>
        </par>
        <par id="par-t0b0d0p0s0w5">
            <text src="s0387_intro2.config.xml#t0b0d0p0s0w5"/>
            <audio src="s0387_intro2.config.mp3" clipBegin="9.301" clipEnd="10.126"/>
        </par>
        <par id="par-t0b0d0p0s0w6">
            <text src="s0387_intro2.config.xml#t0b0d0p0s0w6"/>
            <audio src="s0387_intro2.config.mp3" clipBegin="10.126" clipEnd="10.606"/>
        </par>
        <par id="par-t0b0d0p0s0w7">
            <text src="s0387_intro2.config.xml#t0b0d0p0s0w7"/>
            <audio src="s0387_intro2.config.mp3" clipBegin="10.606" clipEnd="11.716"/>
        </par>
    </body>
</smil>
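A rough way to flag this kind of output automatically, assuming the SMIL clip times and the do-not-align segments are both expressed on the original audio's timeline (seconds in the SMIL, milliseconds in the config). The function name words_starting_in_dna is hypothetical and the sample below is a trimmed copy of the output above.

```python
import xml.etree.ElementTree as ET

SMIL_NS = "{http://www.w3.org/ns/SMIL}"


def words_starting_in_dna(smil_text, dna_segments_ms):
    """Return (par id, clipBegin in ms) for words that begin inside a DNA segment."""
    root = ET.fromstring(smil_text)
    flagged = []
    for par in root.iter(SMIL_NS + "par"):
        audio = par.find(SMIL_NS + "audio")
        begin_ms = round(float(audio.get("clipBegin")) * 1000)
        for segment in dna_segments_ms:
            if segment["begin"] <= begin_ms < segment["end"]:
                flagged.append((par.get("id"), begin_ms))
    return flagged


sample = """<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0"><body>
<par id="par-t0b0d0p0s0w0"><text src="x.xml#t0b0d0p0s0w0"/>
<audio src="x.mp3" clipBegin="2.288" clipEnd="5.016"/></par>
<par id="par-t0b0d0p0s0w1"><text src="x.xml#t0b0d0p0s0w1"/>
<audio src="x.mp3" clipBegin="5.016" clipEnd="5.526"/></par>
</body></smil>"""

dna = [{"begin": 0, "end": 4576}, {"begin": 11726, "end": 26267}]
flagged = words_starting_in_dna(sample, dna)
```

On this sample only the first word is flagged, since 2288 ms falls inside [0, 4576).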

Adopt consistent code formatting strategy

Some issues are coming up in code reviews around stylistic choices (e.g., single quotes vs double quotes). We should use a consistent formatter to reduce style noise in commits.

Add citation

We should add a citation for the readalongs paper.

Refactor the representation of a word inside Studio

Currently, in readalongs/align.py we use dicts to store words, typically with a start, end and then id or text. This is confusing and difficult to use and document correctly. See discussion starting with "Orthogonal to this PR" in #119

Implementation to consider: reimplement a Word class using @dataclass and have the .text attribute calculated (and cached) on use, so that we just have one Word object type with the different query attributes we need, thus making it a lot more intuitive to both document and use.
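One possible shape for such a class, as a sketch only. The field names follow the dict keys mentioned above (id, start, end); the lookup_text callable is a hypothetical stand-in for however Studio would resolve a word id to its text.

```python
from dataclasses import dataclass, field
from functools import cached_property
from typing import Callable


@dataclass
class Word:
    id: str
    start: float
    end: float
    # Hypothetical resolver from word id to text; excluded from repr/eq.
    lookup_text: Callable[[str], str] = field(repr=False, compare=False)

    @cached_property
    def text(self) -> str:
        # Computed on first access, then stored on the instance.
        return self.lookup_text(self.id)


calls = []


def lookup(word_id):
    calls.append(word_id)
    return {"t0b0d0p0s0w0": "Bonjour"}[word_id]


word = Word(id="t0b0d0p0s0w0", start=2.288, end=5.016, lookup_text=lookup)
first = word.text
second = word.text  # served from the cache, lookup() is not called again
```

functools.cached_property needs Python 3.8 or later and an instance __dict__, so this sketch would not combine with slots=True.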

Consider splicing the model out of the repo and PyPI release

The last release was a 50 MB bundle because it now includes the cmusphinx model as part of the release.

This is OK, but we cannot let the bundle get much bigger, especially since PyPI has a size limit of 60 MB by default.

See https://www.dampfkraft.com/code/distributing-large-files-with-pypi.html for a potential solution, like having the model as a downloadable file on the GitHub release instead of inside the PyPI bundle.

Studio doesn't work properly with NFC g2p mappings

See #36

If a mapping doesn't use NFD normalization, then alignment fails for some languages.

Allow exported readalongs to be standalone (offline-safe)

We want to generate readalongs that can be used entirely standalone, without requiring a stable internet connection.

We want to get the latest and greatest features but also have a stable component, so we have to figure out how to give creators of readalongs the choice between a stable version and the latest version.

Why stable, and offline-friendly?

bad/expensive mobile internet connections
bad or non-existent WiFi when demoing
the third-party servers hosting fonts/JS suddenly disappear 😱
have a standalone component you can give to a community that will work for as long as possible

Allow for different Unicode normalization standards

Currently, ReadAlong-Studio requires NFD. This creates problems with the g2p module because mappings in that module can use NFC, NFD, NFKC or NFKD. ReadAlong-Studio should just look at the mapping of any given language it's working with and use its declared standard.
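A small sketch of normalizing to whatever form a mapping declares, using only the standard library. The function name normalize_for_mapping and the idea of passing the declared form as a plain string are assumptions; treating "none" as "leave the text alone" is also an assumption about how a mapping might opt out.

```python
import unicodedata

VALID_FORMS = {"NFC", "NFD", "NFKC", "NFKD"}


def normalize_for_mapping(text: str, norm_form: str = "NFD") -> str:
    """Normalize text to the form declared by a mapping (hypothetical helper)."""
    if norm_form.lower() == "none":
        return text  # assumption: the mapping asks for no normalization
    form = norm_form.upper()
    if form not in VALID_FORMS:
        raise ValueError(f"Unknown Unicode normalization form: {norm_form!r}")
    return unicodedata.normalize(form, text)


# "É" as one precomposed code point vs. "E" plus a combining acute accent.
composed = "\u00c9ric"
decomposed = "E\u0301ric"
same_after_nfd = normalize_for_mapping(composed, "NFD") == decomposed
same_after_nfc = normalize_for_mapping(decomposed, "NFC") == composed
```

The point is that the text and the mapping rules must be normalized the same way before matching, whichever form that is.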

Update the English g2p

This was brought up in PR https://github.com/dhdaines/ReadAlong-Studio/pull/8 - we might want to replace the English g2p system we use. g2p_en was proposed, others?

Use LexiconG2P for English input

This is an interim solution, but we should have some way of dealing with English input.

Ignore superfluous blank lines in readalongs prepare

When a plain text file starts with one or more blank lines, blank pages are inserted before the first page of the RA.
When there are 3+ consecutive blank lines mid-text, blank pages are inserted in the middle.
When there are 2+ blank lines at the end of the plain text, blank pages are inserted after the RA.

While we can ask the user to remove such extraneous blank lines, it would be better UX with fewer gotchas to ignore them with these rules:

ignore all blank lines before the first non-blank line
ignore all blank lines after the last non-blank line
consider any sequence of 3+ blank lines the same as a sequence of 2 blank lines, i.e., a single page break.

In short, never create an empty <div type="page"></div> element in the output.
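The rules above can be sketched as follows. This assumes that a single blank line separates paragraphs and that two or more blank lines separate pages; split_pages is a hypothetical helper, not the actual readalongs prepare code.

```python
def split_pages(text: str):
    """Split plain text into pages -> paragraphs -> lines, never yielding an empty page."""
    lines = [line.rstrip() for line in text.splitlines()]
    # Rules 1 and 2: drop blank lines before the first and after the last text line.
    while lines and not lines[0]:
        lines.pop(0)
    while lines and not lines[-1]:
        lines.pop()
    if not lines:
        return []

    pages = [[[]]]  # one page holding one (still empty) paragraph
    blank_run = 0
    for line in lines:
        if not line:
            blank_run += 1
            continue
        if blank_run >= 2:
            # Rule 3: any run of 2+ blank lines is a single page break.
            pages.append([[]])
        elif blank_run == 1:
            pages[-1].append([])  # paragraph break within the current page
        blank_run = 0
        pages[-1][-1].append(line)
    return pages


pages = split_pages("\n\nA\nB\n\nC\n\n\n\nD\n\n\n")
```

Because a page is only opened when a non-blank line follows, no empty page can be produced.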

readalongs align -f doesn't work now that the output is a directory

With the change of output for readalongs align from a bunch of files to a directory containing them, the -f option no longer works:

$ readalongs align -l fra -s -f -i in.txt in.mp3 out
INFO - Server initialized for eventlet.
Usage: readalongs align [OPTIONS] INPUTFILE WAVFILE OUTPUT_BASE

Error: Output folder 'out' already exists

Expected behaviour: with -f, the output folder is allowed to exist already, and files therein should get overwritten.
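A sketch of the expected check, assuming the CLI passes the -f flag down as a boolean. The function name prepare_output_dir and the exact error messages are hypothetical.

```python
import os


def prepare_output_dir(output_dir: str, force_overwrite: bool = False) -> str:
    """Create the output folder, tolerating an existing one only with -f."""
    if os.path.exists(output_dir):
        if not os.path.isdir(output_dir):
            raise FileExistsError(
                f"Output path '{output_dir}' exists and is not a directory"
            )
        if not force_overwrite:
            raise FileExistsError(
                f"Output folder '{output_dir}' already exists; use -f to overwrite"
            )
    # exist_ok=True makes this a no-op when the folder is already there.
    os.makedirs(output_dir, exist_ok=True)
    return output_dir
```

Files inside the folder would then simply be opened for writing, which overwrites any previous copies.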

Alignment from txt test not passing

This issue is twofold. One, the testAlignText test is not passing. Two, Travis is giving a false positive.

NamedTemporaryFiles must be closed before being read on Windows

This issue is specific to Windows; the problem does not occur on *nix systems, because of how files are opened. On Windows, a write file handle has an exclusive lock, preventing a read file handle from being opened on the same file before the write handle is closed. As a consequence, the way we use NamedTemporaryFile in readalongs/align.py methods create_input_xml() and create_input_tei() is broken on Windows.

For cross-platform compatibility, this is the required workflow, which I know is a bit of a pain:

tempfile = NamedTemporaryFile(..., delete=False)
write to tempfile
close tempfile (which would cause the file to be deleted if we had used delete=True)
open tempfile.name for reading
track tempfile.name wherever we still need it, so we can delete it once we're actually done with it.

Yuck, I know.
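The workflow above, sketched with the standard library only. The helper name write_then_read is hypothetical; in Studio the read step would be the XML parse rather than a plain read.

```python
import os
from tempfile import NamedTemporaryFile


def write_then_read(content: str) -> str:
    """Write to a temp file, close it, reopen it by name, then clean up."""
    tmp = NamedTemporaryFile(
        mode="w", suffix=".xml", delete=False, encoding="utf-8"
    )
    try:
        tmp.write(content)
        tmp.close()  # release the write handle before anyone reads the file
        with open(tmp.name, encoding="utf-8") as reader:
            return reader.read()
    finally:
        tmp.close()  # harmless if the file is already closed
        os.unlink(tmp.name)  # delete=False means cleanup is our job


round_trip = write_then_read("<TEI/>")
```

Note that name is an attribute of the temporary file object, not a method.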

Fixing this should solve issue #20, which I believe is another symptom of the same problem.

To reproduce, on Windows only:

cd test
python test_force_align.py

Observed behaviour:

cannot open temp file that was just created
warning message complaining about unclosed handle
test suite fails

g2p should be able to be outsourced to an api

We should be able to specify that g2p for certain languages will come from particular RESTful endpoints. The endpoints will be documented in an OAS 3.0 spec.

Maybe something like readalongs align sample.xml sample.wav output -g2p eng=https://www.sample.com/api/v1/g2p
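A sketch of how such an option value could be parsed, assuming each value has the form LANG=URL as in the example command above. The function name parse_g2p_endpoints is hypothetical, and no request is made to the endpoint here.

```python
from typing import Dict, Iterable
from urllib.parse import urlparse


def parse_g2p_endpoints(specs: Iterable[str]) -> Dict[str, str]:
    """Turn ["eng=https://..."] into {"eng": "https://..."} with basic validation."""
    endpoints = {}
    for spec in specs:
        # Split on the first "=" only, so query strings in the URL survive.
        lang, sep, url = spec.partition("=")
        parsed = urlparse(url)
        if (
            not sep
            or not lang
            or parsed.scheme not in ("http", "https")
            or not parsed.netloc
        ):
            raise ValueError(f"Expected LANG=URL, got {spec!r}")
        endpoints[lang] = url
    return endpoints


endpoints = parse_g2p_endpoints(["eng=https://www.sample.com/api/v1/g2p"])
```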

Add the ability to convert from single file HTML to multi-file

Given a single-file HTML readalong, we need to be able to extract the XML, SMIL, audio and image files.

Use case: you receive a readalong by e-mail, and you want to inspect it, and maybe apply manual corrections to the alignment, or add pictures, or do any other changes to it.
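A sketch of the extraction step, assuming the single-file bundle embeds its assets as base64 data URIs. That is an assumption about the bundle format, and extract_embedded_files is a hypothetical helper; mapping each payload back to a file name is left out.

```python
import base64
import re

# Matches data:<mime>;base64,<payload>
DATA_URI = re.compile(
    r"data:(?P<mime>[\w.+-]+/[\w.+-]+);base64,(?P<payload>[A-Za-z0-9+/=]+)"
)


def extract_embedded_files(html: str):
    """Return a list of (mime type, decoded bytes) for every embedded data URI."""
    return [
        (match["mime"], base64.b64decode(match["payload"]))
        for match in DATA_URI.finditer(html)
    ]
```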

silence and dna text error

I created a file with silences and some DNA text, and I get a KeyError with a stack trace trying to align it.

To reproduce: readalongs align data/ej-fra-dna-silence.xml data/ej-fra.m4a sil-dna

data/ej-fra-dna-silence.xml:

<?xml version='1.0' encoding='utf-8'?>
<TEI>
    <!-- To exclude any element from alignment, add the do-not-align="true" attribute to
         it, e.g., <p do-not-align="true">...</p>, or
         <s>Some text <foo do-not-align="true">do not align this</foo> more text</s> -->
    <text xml:lang="fra">
        <body>
            <div type="page">
                <p>
                    <s><silence dur="1"/>Bonjour.</s>
                    <s>Je m'appelle Éric Joanis.</s>
                    <s>Je suis <silence dur="1.382s"></silence> programmeur au sein <silence dur="500ms"></silence> de l'équipe des technologies pour les langues autochtones au CNRC.</s>
                </p>
            </div>
            <div type="page">
		<anchor time="28.6s"/>
                <p do-not-align="true">
                    <s>J'ai fait une bonne partie de ma carrière en traduction automatique statistique, mais maintenant cette approche est déclassée par l'apprentissage profond.</s>
                    <s>En ce moment je travaille à l'alignement du hansard du Nunavut pour produire un corpus bilingue anglais-inuktitut.</s>
                    <s>Ce corpus permettra d'entraîner la TA, neuronale ou statistique, ainsi que d'autres applications de traitement du langage naturel.</s>
                </p>
		<anchor time="50.2s"/>
                <p>
                    <s>En parallèle, j'aide à écrire des tests pour rendre le ReadAlong-Studio plus robuste.</s>
                </p>
            </div>
        </body>
    </text>
</TEI>

Traceback:

Traceback (most recent call last):
  File "C:\Users\joanise\RAS\ras-env\Scripts\readalongs-script.py", line 11, in <module>
    load_entry_point('readalongs', 'console_scripts', 'readalongs')()
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\flask\cli.py", line 596, in main
    return super().main(*args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\flask\cli.py", line 440, in decorator
    return __ctx.invoke(f, *args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "c:\users\joanise\ras\studio\readalongs\cli.py", line 267, in align
    verbose_g2p_warnings=kwargs["g2p_verbose"],
  File "c:\users\joanise\ras\studio\readalongs\align.py", line 452, in align_audio
    words_dict[el.attrib["id"]]["end"] * 1000
KeyError: 't0b0d1p0s0w0'

[With Solution] Outdated Docker File Regarding Werkzeug Package

BUG reproduce:
If you follow the Docker instruction under the Usage guidance:

git clone ...
cd ...
docker build . --tag=readalong-studio

You will see an error as:

ImportError: cannot import name 'ContextVar' from 'werkzeug.local' (/usr/local/lib/python3.7/dist-packages/werkzeug/local.py)

Potential Cause:
This is due to the Werkzeug==0.16.0 requirement in the Dockerfile. And that version of Werkzeug will not work with the latest Flask framework.

Potential Solution:
One workaround is to comment out the related part in the Dockerfile

#RUN python3 -m pip uninstall -y Werkzeug
#RUN python3 -m pip install Werkzeug==0.16.0

Issues:

I haven't tested whether the code with the new Werkzeug version (2.0.1 currently) will work with other functionalities.
However, I wonder whether the Werkzeug package is even necessary, as I don't find it in requirements.txt and the pip readalongs version is working fine.

Question on usage

Hi,

I am currently looking for a way/solution to be able to read a-long my ebooks.
I have an audiobook for each of my ebooks.

My question now is, (since I am not a programmer myself and I am not sure if I understand this correctly) can this software be used to basically create a "whisper sync" like feature, to be able to read a-long in the same way?

best,
gelsas

Word Tokenizer not picking the right mapping

The TokenizerLibrary class in readalongs/text/tokenize_xml.py overwrites tokenizers with the same input language. So, for example if there is a mapping from git to git-ipa and then a mapping from git to git-apa, the git->git-ipa one is overwritten by the git-apa one. I put a bandaid on this here: a3c414d but this won't work if we want to tokenize/align through anything other than an ipa mapping. It also requires out_lang to end in -ipa. This should be more robust.

EPub not working

After aligning and then running readalongs epub output.smil output.epub, I try to read the epub file using calibre (v3.39.1) and get the following error:

calibre, version 3.39.1
ERROR: Could not open e-book: Failed to read book, /Users/pinea/Calibre Library/Unknown/output (5)/output - Unknown.epub click "Show Details" for more information

Traceback (most recent call last):
  File "site-packages/calibre/utils/ipc/simple_worker.py", line 289, in main
  File "site-packages/calibre/ebooks/oeb/iterator/book.py", line 65, in extract_book
  File "site-packages/calibre/customize/conversion.py", line 244, in __call__
  File "site-packages/calibre/ebooks/conversion/plugins/epub_input.py", line 344, in convert
ValueError: No valid entries in the spine of this EPUB

readalongs -i without -l gives stack trace; ditto readalongs with noise-only output

Reproduce:

$ readalongs align -i data/ej-fra.txt data/ej-fra.m4a delme -f
...
  File "c:\users\joanise\ras\studio\readalongs\align.py", line 387, in align_audio
    final_end = end
UnboundLocalError: local variable 'end' referenced before assignment

Should output CLI error: "missing -l option".

Seems related:

$ readalongs align data/ej-fra.xml data/noise_at_1500.mp3 delme -f
...
   File "c:\users\joanise\ras\studio\readalongs\align.py", line 387, in align_audio
    final_end = end
UnboundLocalError: local variable 'end' referenced before assignment

Should output: "no non-noise segments found" or something like it.
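Both traces presumably come from a loop over segments that never runs, so end is never assigned. A sketch of a guard, assuming segments arrive as a list of dicts with start and end keys; NoSegmentsError and final_end_time are hypothetical names.

```python
class NoSegmentsError(Exception):
    """Raised when alignment produced nothing to take an end time from."""


def final_end_time(segments):
    """Return the end time of the last segment, or fail with a readable message."""
    if not segments:
        raise NoSegmentsError(
            "No non-noise segments found; check the -l option, "
            "the input text and the input audio."
        )
    return segments[-1]["end"]


end = final_end_time([{"start": 0.0, "end": 1.5}, {"start": 1.5, "end": 2.0}])
```

The CLI layer could catch this exception and print the message instead of a stack trace.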

Make output-base create a directory

Dealing with filename conflicts is tougher if the files are just exported to the current directory. output-base should be a valid path to a place where a directory with that name could be created. So instead of files being created like output-base.wav and output-base.smil, we should create output-base/output-base.wav and output-base/output-base.smil.

Studio will require g2p >= v0.5.20210514, when dev.g2p-cascade is merged in.

In requirements.txt, we just say g2p>=0.5.*, which has been convenient and useful, but now we have a breaking change that renders this inadequate. In branch dev.g2p-cascade, I now depend on is_arpabet() and other code introduced in g2p v0.5.20210514. How can we declare this requirement without breaking our way of doing locally installed sandboxes?

Can I just say g2p>=0.5.20210514, or do we need to bump g2p to 0.6?
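For what it's worth, release numbers under PEP 440 compare numerically component by component, so 0.5.20210514 sorts after any earlier 0.5.x date. The sketch below illustrates that ordering with plain tuples; it is a simplification (no pre-releases, no padding of short versions), and both function names are hypothetical.

```python
def version_tuple(version: str):
    """Parse a purely numeric dotted version into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies_minimum(installed: str, minimum: str) -> bool:
    # Tuples compare element by element, like numeric release segments.
    return version_tuple(installed) >= version_tuple(minimum)


MINIMUM_G2P = "0.5.20210514"
old_ok = satisfies_minimum("0.5.20200101", MINIMUM_G2P)
same_ok = satisfies_minimum("0.5.20210514", MINIMUM_G2P)
next_minor_ok = satisfies_minimum("0.6.0", MINIMUM_G2P)
```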

Thanks to Marc Tessier for noticing this issue.

Step 3 in UI doesn't run unless logs are shown

From Del: "not sure .. if I am misreading the code but the web interface of read-aslong-studio does not actually call the align method in step 3 after https://github.com/ReadAlongs/Studio/blob/master/readalongs/views.py#L175 I do not see the args being used .."

`readalongs -h` confusing output: flask, angular (?) and readalongs CLI conflated

When running readalongs -h, the output combines and conflates CLI elements added by readalongs itself with those added by flask and, I believe, angular.

The --version option appears twice - the first comes from flask, the second from readalongs:

  --version   Show the flask version
  --version   Show the version and exit.

As for the five commands shown,

align and epub are defined in readalongs/cli.py.
routes appears to come from angular - is it relevant? Is the output readalongs routes produces meaningful?
run: I'm not sure where this comes from, since I cannot find the string 'Runs a development server.' anywhere in the code base, but we use it, so it's clearly relevant. But where does it come from?
shell: again, not sure where this is defined. I've never used it, what is it for exactly?

Studio fails to align when a word is "eaten up" by g2p.

Currently, readalongs align fails when a word is converted to an empty string by the g2p module.

Error message:

ERROR - Alignment produced a different number of segments and tokens, please examine dictionary and input audio and text.

To reproduce this error, check out 76faf18 in g2p or any commit before the problem with "s" disappearing is fixed in French g2p, go to OpenSamples, and run:

readalongs align -i -s -f -l fra UDHR-Librivox/human_rights_un_frn-preamble.txt UDHR-Librivox/human_rights_un_frn_ezwa_64kb-preamble.mp3 output/UDHR-fra-preamble

The error in this specific example is due to word <w>s</w> (the 330th token in UDHR-fra-preamble.tokenized.xml, on line 37) turning into an empty string because of my g2p rule erasing word-final "s" including in this case where the whole word is "s". As a consequence, file UDHR-fra-preamble.dict skips from token t0b0d0p10s0w42 to t0b0d0p10s0w44, bypassing empty token t0b0d0p10s0w43, causing a mismatch between the number of tokens and dictionary entries.

Eventually, I'll fix the French g2p to not swallow "s", but Studio needs to handle this case gracefully. Options:

Consider it an error and output a meaningful message telling the user to edit the g2p. This is not a great option for general users who might not know how to edit the g2p, though.
Fix the aligner code to align the whole text anyway, cleanly skipping over (or otherwise handling) the word with an empty phonetic representation.
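A sketch of the second option, assuming words reach the aligner as (id, pronunciation) pairs. The function name split_pronounceable and the sample pronunciations are made up for illustration; only the token ids come from the example above.

```python
def split_pronounceable(words):
    """Separate words the aligner can use from those g2p reduced to nothing."""
    keep, skipped = [], []
    for word_id, pronunciation in words:
        if pronunciation.strip():
            keep.append((word_id, pronunciation))
        else:
            skipped.append(word_id)
    return keep, skipped


words = [
    ("t0b0d0p10s0w42", "D AH"),  # placeholder pronunciation
    ("t0b0d0p10s0w43", ""),      # the word "s", emptied by the g2p rule
    ("t0b0d0p10s0w44", "L AH"),  # placeholder pronunciation
]
keep, skipped = split_pronounceable(words)
```

The dictionary and the token sequence would both be built from keep, so their counts stay in step, and the ids in skipped can be reported as a warning.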

Praat and VTT output seem to use wrong sentence segmentation

When I run this command in test/:

 readalongs align -l fra -s -f -i data/ej-fra.txt data/ej-fra.m4a delme -t  -C

The .xml file is fine, but the .eaf, .TextGrid, and _sentences.* files treat each sentence on the first page as a sentence, but each word on the second page as an individual sentence.

The correct output should logically consider the same units as sentences in the .xml file and the TextGrid and cc/subtitle files.

image file(s) should be copied into the OUTPUT_BASE folder.

When creating a ReadAlong using a config.json file with images, the image(s) specified should be copied automatically into the OUTPUT_BASE folder (or inside the assets folder???).

example config.json:

{ "images": { "0": "0.jpg" } }
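A sketch of the copy step, assuming the image entries are local paths relative to the folder containing config.json. The function name copy_images is hypothetical, and since the target location is still an open question, the subfolder is left as a parameter.

```python
import os
import shutil


def copy_images(config: dict, config_dir: str, output_dir: str, subdir: str = "assets"):
    """Copy every image named in config["images"] under output_dir/subdir."""
    target = os.path.join(output_dir, subdir) if subdir else output_dir
    os.makedirs(target, exist_ok=True)
    copied = []
    for _page, filename in config.get("images", {}).items():
        source = os.path.join(config_dir, filename)
        destination = os.path.join(target, os.path.basename(filename))
        shutil.copyfile(source, destination)
        copied.append(destination)
    return copied
```

Passing subdir="" would place the images directly in the OUTPUT_BASE folder instead.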

Capitalization

Very minor issue to follow the Canadian Translation Bureau guidelines for capitalizing the word 'Indigenous' in the repo description. Currently: "Audiobook alignment for North American indigenous languages"

Capitalize the singular and plural forms of the nouns Status Indian, Registered Indian, Non-Status Indian and Treaty Indian, as well as the adjectives Indigenous and Aboriginal, when they refer to Indigenous people in Canada.

Indigenous person (one individual)
- Example: Any Indigenous person in Alberta is eligible under this program.
Indigenous persons, Indigenous people (more than one person)
- Example: The conference could not have succeeded without the help of almost a
  thousand Indigenous people from all over Saskatchewan.
Indigenous peoples (two or more Indigenous groups)
- Example: Representatives from three Indigenous peoples were present.

Include public data for e2e tests

We should record public data from @littell @roedoejet @joanise @finguist.

This is because currently our e2e tests require data that is private and belongs to communities.

Afterwards, we should add the e2e test suite to be run by Travis.

Define our `.ras` file format with a DTD

Studio should output and read .ras files rather than just .xml.
The file format needs an actual DTD defining it.

Refactor docstrings to Google standard format

This was brought up in PR https://github.com/dhdaines/ReadAlong-Studio/pull/8 - I think we should adopt some kind of standard for docstrings, for the sake of documentation and integration with Sphinx as well as just general consistency. The numpy standard was proposed.

Issue when aligning files

I got this error when trying to align https://creeliteracy.org/wp-content/uploads/2020/07/Cover.m4a (from https://creeliteracy.org/2020/07/31/covid-safety-reminder-solomon-ratt-y-dialect/) in the studio.

The studio UI needs some work... it said this completed successfully and gave me a blank readalong widget :/

Here's the temp data:

readalongs-temp.zip

CompletedProcess(args=['readalongs', 'align', '--force-overwrite', '--save-temps', '--text-grid', '--text-input', '--language', 'crk', '/var/folders/s1/y4p2fc9d1c9bv3nfjhgpvwch0000gq/T/tmp1u_lhndi/text.txt', '/var/folders/s1/y4p2fc9d1c9bv3nfjhgpvwch0000gq/T/tmp1u_lhndi/sol.wav', '/var/folders/s1/y4p2fc9d1c9bv3nfjhgpvwch0000gq/T/tmp1u_lhndi/aligned1596638807'], returncode=1, stdout=b'', stderr=b'INFO - Server initialized for eventlet.\nINFO - Words (<w>) not present; tokenizing\nTraceback (most recent call last):\n File "/Users/santoseadmin/Work/Studio/venv/bin/readalongs", line 11, in <module>\n load_entry_point(\'readalongs\', \'console_scripts\', \'readalongs\')()\n File "/Users/santoseadmin/Work/Studio/venv/lib/python3.8/site-packages/click/core.py", line 829, in __call__\n return self.main(*args, **kwargs)\n File "/Users/santoseadmin/Work/Studio/venv/lib/python3.8/site-packages/flask/cli.py", line 557, in main\n return super(FlaskGroup, self).main(*args, **kwargs)\n File "/Users/santoseadmin/Work/Studio/venv/lib/python3.8/site-packages/click/core.py", line 782, in main\n rv = self.invoke(ctx)\n File "/Users/santoseadmin/Work/Studio/venv/lib/python3.8/site-packages/click/core.py", line 1259, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File "/Users/santoseadmin/Work/Studio/venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File "/Users/santoseadmin/Work/Studio/venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke\n return callback(*args, **kwargs)\n File "/Users/santoseadmin/Work/Studio/venv/lib/python3.8/site-packages/click/decorators.py", line 21, in new_func\n return f(get_current_context(), *args, **kwargs)\n File "/Users/santoseadmin/Work/Studio/venv/lib/python3.8/site-packages/flask/cli.py", line 412, in decorator\n return __ctx.invoke(f, *args, **kwargs)\n File "/Users/santoseadmin/Work/Studio/venv/lib/python3.8/site-packages/click/core.py", line 
610, in invoke\n return callback(*args, **kwargs)\n File "/Users/santoseadmin/Work/Studio/readalongs/cli.py", line 217, in align\n results = align_audio(\n File "/Users/santoseadmin/Work/Studio/readalongs/align.py", line 123, in align_audio\n xml = convert_xml(xml)\n File "/Users/santoseadmin/Work/Studio/readalongs/text/convert_xml.py", line 208, in convert_xml\n convert_words(xml_copy, word_unit, output_orthography)\n File "/Users/santoseadmin/Work/Studio/readalongs/text/convert_xml.py", line 157, in convert_words\n all_indices = compose_tiers(indices)\n File "/Users/santoseadmin/Work/Studio/readalongs/text/util.py", line 290, in compose_tiers\n reduced_indices = compose_indices(tiers[0], tiers[1])\n File "/Users/santoseadmin/Work/Studio/readalongs/text/util.py", line 278, in compose_indices\n if i2_idx in i2_dict and i2_dict[i2_idx] > highest_i2_found:\nTypeError: \'>\' not supported between instances of \'NoneType\' and \'int\'\n')

Here's that traceback from readalongs align

INFO - Server initialized for eventlet.
INFO - Words (<w>) not present; tokenizing
Traceback (most recent call last):
 File "/Users/santoseadmin/Work/Studio/venv/bin/readalongs", line 11, in <module>
 load_entry_point('readalongs', 'console_scripts', 'readalongs')()
 File "/Users/santoseadmin/Work/Studio/venv/lib/python3.8/site-packages/click/core.py", line 829, in __call__
 return self.main(*args, **kwargs)
 File "/Users/santoseadmin/Work/Studio/venv/lib/python3.8/site-packages/flask/cli.py", line 557, in main
 return super(FlaskGroup, self).main(*args, **kwargs)
 File "/Users/santoseadmin/Work/Studio/venv/lib/python3.8/site-packages/click/core.py", line 782, in main
 rv = self.invoke(ctx)
 File "/Users/santoseadmin/Work/Studio/venv/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
 return _process_result(sub_ctx.command.invoke(sub_ctx))
 File "/Users/santoseadmin/Work/Studio/venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
 return ctx.invoke(self.callback, **ctx.params)
 File "/Users/santoseadmin/Work/Studio/venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
 return callback(*args, **kwargs)
 File "/Users/santoseadmin/Work/Studio/venv/lib/python3.8/site-packages/click/decorators.py", line 21, in new_func
 return f(get_current_context(), *args, **kwargs)
 File "/Users/santoseadmin/Work/Studio/venv/lib/python3.8/site-packages/flask/cli.py", line 412, in decorator
 return __ctx.invoke(f, *args, **kwargs)
 File "/Users/santoseadmin/Work/Studio/venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
 return callback(*args, **kwargs)
 File "/Users/santoseadmin/Work/Studio/readalongs/cli.py", line 217, in align
 results = align_audio(
 File "/Users/santoseadmin/Work/Studio/readalongs/align.py", line 123, in align_audio
 xml = convert_xml(xml)
 File "/Users/santoseadmin/Work/Studio/readalongs/text/convert_xml.py", line 208, in convert_xml
 convert_words(xml_copy, word_unit, output_orthography)
 File "/Users/santoseadmin/Work/Studio/readalongs/text/convert_xml.py", line 157, in convert_words
 all_indices = compose_tiers(indices)
 File "/Users/santoseadmin/Work/Studio/readalongs/text/util.py", line 290, in compose_tiers
 reduced_indices = compose_indices(tiers[0], tiers[1])
 File "/Users/santoseadmin/Work/Studio/readalongs/text/util.py", line 278, in compose_indices
 if i2_idx in i2_dict and i2_dict[i2_idx] > highest_i2_found:
TypeError: '>' not supported between instances of 'NoneType' and 'int'

ReadTheDocs will not build

ReadTheDocs cannot install our package, and the build fails because the autodocumentation tools depend on it. The root cause is that PocketSphinx fails to pip install.

We could use conda, but we have a number of dependencies that are also not on conda, and we're not currently packaging for conda, so there's some significant work to be done here.
We could try and fix the pip installation of readalongs, or create a minimal installation without all the audio stuff that crashes it.
We could spin off our own instance of rtd or generate html locally and push somewhere...but...ugh
We could abandon autodocumentation...also ugh

Basically there are no great options as far as I'm concerned, but we should try and figure out a solution one way or another.

fails on Windows when trying to create a temporary file

Command:

readalongs align -l alq -i Adjidamo-no-intro.txt Adjidamo-no-intro.mp3 delme4

Trace:

Traceback (most recent call last):
  File "C:\Users\joanise\RAS\ras-env\Scripts\readalongs-script.py", line 11, in <module>
    load_entry_point('readalongs', 'console_scripts', 'readalongs')()
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\flask\cli.py", line 557, in main
    return super(FlaskGroup, self).main(*args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 717, in main
    rv = self.invoke(ctx)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\flask\cli.py", line 412, in decorator
    return __ctx.invoke(f, *args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "c:\users\joanise\ras\readalong-studio\readalongs\cli.py", line 108, in align
    if kwargs['save_temps'] else None))
  File "c:\users\joanise\ras\readalong-studio\readalongs\align.py", line 98, in align_audio
    xml = etree.parse(xml_path).getroot()
  File "src\lxml\etree.pyx", line 3424, in lxml.etree.parse
  File "src\lxml\parser.pxi", line 1840, in lxml.etree._parseDocument
  File "src\lxml\parser.pxi", line 1866, in lxml.etree._parseDocumentFromURL
  File "src\lxml\parser.pxi", line 1770, in lxml.etree._parseDocFromFile
  File "src\lxml\parser.pxi", line 1163, in lxml.etree._BaseParser._parseDocFromFile
  File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 638, in lxml.etree._raiseParseError
OSError: Error reading file 'C:\Users\joanise\AppData\Local\Temp\readalongs_xml_ypb25lr6.xml': failed to load external entity "C:\Users\joanise\AppData\Local\Temp\readalongs_xml_ypb25lr6.xml"
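
The trace shows lxml failing to open a temporary file that Studio has just created. One plausible cause, which the trace alone does not confirm, is that the file is still held open by `tempfile.NamedTemporaryFile` when it is reopened by name; Python's documentation notes that this does not work on Windows. A minimal sketch of a workaround under that assumption follows. `write_temp_xml` is a hypothetical helper, and the standard-library `xml.etree.ElementTree` stands in for lxml:

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_temp_xml(xml_text):
    """Write xml_text to a temp file, close it, and return its path.

    Hypothetical helper: delete=False lets us close the handle before
    another reader opens the file by name; the caller removes the file.
    """
    handle = tempfile.NamedTemporaryFile(
        mode="w",
        encoding="utf-8",
        prefix="readalongs_xml_",
        suffix=".xml",
        delete=False,
    )
    try:
        handle.write(xml_text)
    finally:
        handle.close()  # release the handle so the file can be reopened by name
    return handle.name


path = write_temp_xml("<document><s>hello</s></document>")
try:
    root = ET.parse(path).getroot()
    print(root.tag)
finally:
    os.remove(path)  # clean up, since delete=False leaves the file behind
```

The trade-off is that cleanup becomes the caller's job, which is why the usage above removes the file in a `finally` block.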

CI publish recipe has deprecated actions and code

These issues noticed in g2p will affect Studio in the same way:

Our "Determine tag" code gets this warning:
Warning: The set-output command is deprecated and will be disabled soon. Please upgrade to using Environment Files. For more information see: https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/

mathieudutour/github-tag-action, at the version we currently use, gets the set-output warning and the node12 warning. Update to @v6?

https://github.com/mathieudutour/github-tag-action also says to use ncipollo/release-action@v1 instead of actions/create-release.

actions/create-release is obsolete and no longer maintained (https://github.com/actions/create-release). Use ncipollo/release-action instead?

Generate single-file HTML from multi-file RA

Given the default multi-file output from readalongs align, possibly with manual corrections to the alignments in the SMIL or other manual changes, we want to be able to create the single-file HTML bundle.

Use case: -o html does not allow the user to make any corrections to their readalong. If they need to fix something but still want the result as a single-file HTML, we need to provide a tool to create the bundle post hoc.

ePub compliance

Currently the ePubs need a bit of work to function properly on iOS, probably because we are not quite producing valid EPUB 2.0 or 3.0 yet (I think 3.0 is supported, so we should target that).

I'll try to get this going in the next few days (making an issue here to track the work, etc)

images when using URL not working

Trying to add an image to a RA using a URL configured in a JSON config file.

Example JSON file used:

{
    "images":
        { "0": "https://www.btb.termiumplus.gc.ca/images/termium-wet.png"}
}

I ran it like this:
readalongs align --config 1.json --text-input 1.Welcome.txt 1.Welcome.mp3 1

NOTE: I did get this good warning message when producing the RA at the end.

WARNING - Please make sure https://www.btb.termiumplus.gc.ca/images/termium-wet.png is accessible to clients using your read-along.

When we look at my local HTTP server access logs, we can see that the RA tried to hit the server with a GET and got a 404. Technically, we should not be seeing that hit at all: my browser should have requested the image from "www.btb.termiumplus.gc.ca" instead. Also notice that it tried to use the "/assets" folder. The warning message was correct in saying "Please make sure XXXX is accessible to clients using your read-along".

10.0.2.2 - - [19/Aug/2021 10:46:29] code 404, message File not found
10.0.2.2 - - [19/Aug/2021 10:46:29] "GET /assets/https://www.btb.termiumplus.gc.ca/images/termium-wet.png HTTP/1.1" 404 -

(Also, technically this should be a ReadAlong-Web-Component bug, I think.)
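
The access log suggests that the image value is being prefixed with the assets folder even when it is already an absolute URL. The intended resolution could look something like the sketch below. It is written in Python purely for illustration (the real logic would live in the web component), and both `resolve_image_src` and the `assets` default are assumptions:

```python
from urllib.parse import urlparse


def resolve_image_src(image, assets_dir="assets"):
    """Return the location to request for an image entry from the config.

    Hypothetical helper: absolute http(s) URLs are used as-is; anything
    else is treated as a file name relative to the assets folder.
    """
    parsed = urlparse(image)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return image
    return f"{assets_dir}/{image}"


print(resolve_image_src("https://www.btb.termiumplus.gc.ca/images/termium-wet.png"))
print(resolve_image_src("page1.png"))
```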

Mapping inventory should be lower-case if mapping is case insensitive

Currently there is at least one method (maybe more, this needs checking) which doesn't apply lower-casing to the mapping inventory.

Take the following at the beginning of is_word_character in readalongs/text/tokenize_xml.py:

    def is_word_character(self, c):
        if not self.case_sensitive:
            c = c.lower()
        if c in self.inventory:
            return True

The inventory hasn't been lower-cased at this point, so this method doesn't do what's expected (i.e., return True when the inventory contains the character in upper case).
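
One way to make the check behave as expected would be to lower-case the inventory once, when the tokenizer is built. A minimal sketch, using a simplified stand-in class (`SketchTokenizer` is hypothetical and is not the real tokenizer class in tokenize_xml.py):

```python
class SketchTokenizer:
    """Simplified, hypothetical stand-in for the tokenizer in the issue."""

    def __init__(self, inventory, case_sensitive=False):
        self.case_sensitive = case_sensitive
        if case_sensitive:
            self.inventory = set(inventory)
        else:
            # Normalize the inventory once so lookups compare like with like.
            self.inventory = {item.lower() for item in inventory}

    def is_word_character(self, c):
        if not self.case_sensitive:
            c = c.lower()
        return c in self.inventory


tok = SketchTokenizer(["A", "b", "TS"])
print(tok.is_word_character("a"), tok.is_word_character("ts"))
```

Any other method that consults the inventory would then see the same normalized form, which addresses the "maybe more" part of the report as well.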

CI/CD failed

It looks like the release is being created but the version bump is not updating to the master branch. See https://github.com/ReadAlongs/Studio/runs/3885903392

multiple removed DNA segments are not calculated correctly

When there are multiple DNA segments in the config.json file and the method is "remove", the correction after the first DNA segment is incorrect.

To reproduce:
dna-config.json:

{
    "do-not-align":
        {
        "method": "remove",
        "segments":
            [
                {   "begin": 1000,     "end": 1100    },
                {   "begin": 3700,    "end": 3900   }
            ]
        }
}

The command

readalongs align -c dna-config.json data/ej-fra.xml data/ej-fra.m4a delme

outputs an alignment in the .smil file at (3.040 : 3.770) and one at (3.770 : 4.000). 3.770 s is inside [3700ms, 3900ms) where it should not have been allowed to be.

The problem is that calculate_adjustment() should shift the timestamp it's adjusting for every previous dna segment it has tallied.
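
The cumulative shift described above can be sketched as follows. This is not the actual `calculate_adjustment()` code: the function name, its signature, and the assumption that segments do not overlap are all illustrative.

```python
def calculate_adjustment_sketch(timestamp_ms, dna_segments):
    """Map a timestamp in the DNA-removed audio back to the original audio.

    dna_segments: list of {"begin": ms, "end": ms} dicts, assumed not to overlap.
    """
    adjusted = timestamp_ms
    for seg in sorted(dna_segments, key=lambda s: s["begin"]):
        # Compare against the already-shifted timestamp, so that every
        # earlier segment is accounted for before testing the next one.
        if adjusted >= seg["begin"]:
            adjusted += seg["end"] - seg["begin"]
        else:
            break
    return adjusted


segments = [{"begin": 1000, "end": 1100}, {"begin": 3700, "end": 3900}]
for t in (500, 1000, 3670):
    print(t, "->", calculate_adjustment_sketch(t, segments))
```

With the two segments from the config above, a timestamp of 3670 ms in the trimmed audio first becomes 3770 ms after the first segment is added back, and only then is it compared with the second segment, which moves it to 3970 ms, outside [3700ms, 3900ms).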

Improve alignment and error messaging/debugging

These are a few ideas following a meeting between myself, @joanise and @littell.

The problem is that, currently, when alignment fails, it's not clear exactly why the alignment failed. It's often due to errors in transcription, or false starts from the speaker. This issue is to improve ReadAlong-Studio's ability to handle these occurrences, but also to provide helpful insight as to where in the document the alignment is least certain.

Some possible things to add

Watch log perplexity of certain transitions and record any sudden increases
Add low-probability transitions for silence or 'garbage' (this seems to be happening already)
Dynamic adjustment of beam - strict at first, but relax after certain threshold
Anomaly detection with outliers from the duration model

We should also be thinking about how to interact with the user when these issues come up. There should be a view in the ReadAlong-Studio web app for debugging that highlights areas where the beam was adjusted or where sudden changes of log perplexity occurred.
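
As one illustration of the duration-outlier idea listed above, the sketch below flags words whose per-character duration is far from the median. It does not use a trained duration model: the median absolute deviation is swapped in as a simple stand-in, and the function name, threshold, and sample timings are all made up for the example.

```python
from statistics import median


def flag_duration_outliers(words, threshold=5.0):
    """Return the words whose per-character duration is unusually far from the median.

    words: list of (text, start_seconds, end_seconds) tuples.
    Uses the median absolute deviation (MAD) as a robust spread estimate.
    """
    rates = [(end - start) / max(len(text), 1) for text, start, end in words]
    centre = median(rates)
    mad = median(abs(r - centre) for r in rates)
    if mad == 0:
        return []  # spread estimate is degenerate; skip flagging in this sketch
    return [
        text
        for (text, _, _), rate in zip(words, rates)
        if abs(rate - centre) / mad > threshold
    ]


# Invented alignment: the last word absorbs three seconds of audio.
aligned = [
    ("bonjour", 0.0, 0.5),
    ("à", 0.5, 0.6),
    ("tous", 0.6, 0.9),
    ("le", 0.9, 1.0),
    ("monde", 1.0, 4.0),
]
flagged = flag_duration_outliers(aligned)
print(flagged)
```

Words flagged this way could be the ones highlighted in the debugging view.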

Mappings from g2p with 'none' normalization do not work

For example, Danish, whose norm_form value is 'none', throws a ValueError:

  File "/Users/pinea/ReadAlong-Studio/readalongs/text/convert_xml.py", line 176, in convert_xml
    convert_words(xml_copy, word_unit, output_orthography)
  File "/Users/pinea/ReadAlong-Studio/readalongs/text/convert_xml.py", line 132, in convert_words
    word.text = ud.normalize(norm_form, word.text)
ValueError: invalid normalization form
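
The error comes from passing 'none' straight to `unicodedata.normalize`, which only accepts NFC, NFD, NFKC and NFKD. A possible guard is sketched below; `normalize_text` is a hypothetical helper, and treating 'none' as "leave the text unchanged" is an assumption about the intended meaning:

```python
import unicodedata as ud

VALID_NORM_FORMS = {"NFC", "NFD", "NFKC", "NFKD"}


def normalize_text(norm_form, text):
    """Normalize text unless the mapping asks for no normalization.

    Hypothetical helper: 'none' (in any case) or None means "leave as is".
    """
    if norm_form is None or norm_form.lower() == "none":
        return text
    form = norm_form.upper()
    if form not in VALID_NORM_FORMS:
        raise ValueError(f"unsupported normalization form: {norm_form!r}")
    return ud.normalize(form, text)


decomposed = "e\u0301"  # 'e' followed by a combining acute accent
print(len(normalize_text("none", decomposed)), len(normalize_text("NFC", decomposed)))
```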

UnboundLocalError exception aligning UDHR fra

On branch Studio: dev.g2p g2p: master OpenSamples: master, all up to date as of now

cd OpenSamples
readalongs align -i -s -f -l fra UDHR-Librivox/human_rights_un_frn-preamble.txt UDHR-Librivox/human_rights_un_frn_ezwa_64kb-preamble.mp3 output/UDHR-fra-preamble

outputs:

INFO - Server initialized for eventlet.
INFO - Words (<w>) not present; tokenizing
Traceback (most recent call last):
  File "C:\Users\joanise\RAS\ras-env\Scripts\readalongs-script.py", line 11, in <module>
    load_entry_point('readalongs', 'console_scripts', 'readalongs')()
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\flask\cli.py", line 557, in main
    return super(FlaskGroup, self).main(*args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 717, in main
    rv = self.invoke(ctx)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\flask\cli.py", line 412, in decorator
    return __ctx.invoke(f, *args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "c:\users\joanise\ras\studio\readalongs\cli.py", line 115, in align
    if kwargs['save_temps'] else None))
  File "c:\users\joanise\ras\studio\readalongs\align.py", line 109, in align_audio
    xml = convert_xml(xml)
  File "c:\users\joanise\ras\studio\readalongs\text\convert_xml.py", line 194, in convert_xml
    convert_words(xml_copy, word_unit, output_orthography)
  File "c:\users\joanise\ras\studio\readalongs\text\convert_xml.py", line 143, in convert_words
    all_indices = compose_tiers(indices)
  File "c:\users\joanise\ras\studio\readalongs\text\util.py", line 271, in compose_tiers
    reduced_indices = compose_indices(tiers[0], tiers[1])
  File "c:\users\joanise\ras\studio\readalongs\text\util.py", line 256, in compose_indices
    results.append((i1_in, highest_i2_found))
UnboundLocalError: local variable 'highest_i2_found' referenced before assignment

The tokenizer does not correctly handle tce

When a language is mapped via a hop, like tce -> tce-equiv -> tce-ipa, the tokenizer fails to find the right mapping and uses the DefaultTokenizer instead.

echo "ts'e ch’e ghw'nj sih" > tce1.txt
readalongs prepare -l tce tce1.txt tce1.xml
readalongs tokenize tce1.xml tce1.tok.xml
grep '<w>' tce1.tok.xml
    <s><w>ts</w>'<w>e</w> <w>ch</w>’<w>e</w> <w>ghw</w>'<w>nj</w> <w>sih</w></s>

The correct output should have been:

    <s><w>ts'e</w> <w>ch’e</w> <w>ghw'nj</w> <w>sih</w></s>

Contrast with win, which has just a win -> win-ipa mapping and also uses the apostrophe as a letter:

readalongs prepare -l win tce1.txt win1.xml
readalongs tokenize win1.xml win1.tok.xml
grep '<w>' win1.tok.xml
    <s><w>ts'e</w> <w>ch</w>’<w>e</w> <w>ghw'nj</w> <w>sih</w></s>

which handles ' as a letter, as it should, though not ’ since win does not map it as an equiv.

As I am moving the tokenizer into g2p, this problem remains unresolved. I will not solve it right away. I will add unit test cases to validate it, but comment them out since they will fail for now.
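
To illustrate what handling the hop would mean, here is a toy tokenizer that collects word characters from every mapping along the path instead of looking for a single direct mapping. The inventories below are invented and far smaller than the real g2p mappings, and none of these names come from the actual code:

```python
# Hypothetical, heavily simplified mapping inventories (input side only).
MAPPING_INPUTS = {
    ("tce", "tce-equiv"): ["’"],
    ("tce-equiv", "tce-ipa"): ["'", "ts'", "ch'", "ghw'"],
}


def word_characters(path):
    """Collect every character used as mapping input along the path."""
    chars = set()
    for hop in zip(path, path[1:]):
        for item in MAPPING_INPUTS[hop]:
            chars.update(item.lower())
    return chars


def tokenize(text, path):
    """Split text into (is_word, chunk) pairs."""
    extra = word_characters(path)
    tokens = []
    for c in text:
        is_word = c.isalnum() or c.lower() in extra
        if tokens and tokens[-1][0] == is_word:
            tokens[-1] = (is_word, tokens[-1][1] + c)  # extend the current chunk
        else:
            tokens.append((is_word, c))
    return tokens


tokens = tokenize("ts'e ch’e ghw'nj sih", ["tce", "tce-equiv", "tce-ipa"])
words = [chunk for is_word, chunk in tokens if is_word]
print(words)
```

Dropping the first hop from the path in this toy version makes ’ a word separator again, which mirrors the win behaviour shown above.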

Error message for incorrect argument ordering

If you accidentally provide an audio file as the first argument to readalongs align and a text file after, it tries to read the mp3 file as a text file and gives the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 45: invalid start byte

We should handle this differently.
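
One option would be to catch the decoding failure where the text input is read and turn it into a message that points at the likely cause. A rough sketch follows; `read_text_input` is a hypothetical helper rather than part of the readalongs CLI, and the wording of the message is only a suggestion:

```python
import os
import tempfile


def read_text_input(path):
    """Read a text or XML input, raising a friendlier error for binary files.

    Hypothetical helper, not part of the readalongs CLI.
    """
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except UnicodeDecodeError as e:
        raise ValueError(
            f"Cannot read {path!r} as UTF-8 text. "
            "Did you swap the text and audio arguments? "
            "The text file comes first, then the audio file."
        ) from e


# Demonstration with a few bytes that are not valid UTF-8, standing in for an mp3.
with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as fake_audio:
    fake_audio.write(b"\xff\xfb\x90\x00")
try:
    read_text_input(fake_audio.name)
except ValueError as err:
    message = str(err)
    print(message)
finally:
    os.remove(fake_audio.name)
```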
