
greenkey-asrtoolkit's Introduction

FINOS - Archived

NOTE! This project is archived due to lack of activity; you can still use this software, although this is not advised, as it is not actively maintained. If you're interested in restoring activity on this repository, please email [email protected]


The GreenKey ASRToolkit provides tools for file conversion and ASR corpora organization. These tools are intended to simplify the workflow for building, customizing, and analyzing ASR models, and are useful for scientists, engineers, and other technologists working in speech recognition.

File formats supported

File formats have format-specific handlers in asrtoolkit/data_handlers. The scripts convert_transcript and wer support stm, srt, vtt, txt, and GreenKey json formatted transcripts. A custom html format is also available, though this should not be considered a stable format for long term storage as it is subject to change without notice.

convert_transcript

usage: convert_transcript [-h] input_file output_file

convert a single transcript from one text file format to another

positional arguments:
  input_file   input file
  output_file  output file

optional arguments:
  -h, --help   show this help message and exit

This tool allows for easy conversion among file formats listed above.

Note: Attributes of a segment object not present in a parsed file retain their default values.

  • For example, a segment object is created for each line of an STM file
  • each is initialized with the following default values, which are not encoded in STM files: formatted_text=''; confidence=1.0
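
For example, to convert an SRT subtitle file to an STM transcript (the filenames below are hypothetical, and the source and target formats are assumed to be inferred from the file extensions):

convert_transcript interview.srt interview.stm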

wer

usage: wer [-h] [--char-level] [--ignore-nsns]
           reference_file transcript_file

Compares a reference and transcript file and calculates word error rate (WER)
between these two files

positional arguments:
  reference_file   reference "truth" file
  transcript_file  transcript possibly containing errors

optional arguments:
  -h, --help       show this help message and exit
  --char-level     calculate character error rate instead of word error rate
  --ignore-nsns    ignore non silence noises like um, uh, etc.

This tool allows for easy comparison of reference and hypothesis transcripts in any format listed above.
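
For example, to score a hypothesis transcript against a reference, optionally ignoring non-silence noises (the filenames are hypothetical):

wer reference.stm hypothesis.stm
wer --ignore-nsns reference.stm hypothesis.stm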

clean_formatting

usage: clean_formatting.py [-h] files [files ...]

cleans input *.txt files and outputs *_cleaned.txt

positional arguments:
  files       list of input files

optional arguments:
  -h, --help  show this help message and exit

This script standardizes how abbreviations, numbers, and other formatted text are expressed so that ASR engines can easily use these files as training or testing data. Standardizing the formatting of output is essential for reproducible measurements of ASR accuracy.
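
For example, cleaning a raw transcript (hypothetical filename):

clean_formatting raw_transcript.txt

Per the usage above, this produces raw_transcript_cleaned.txt.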

split_audio_file

usage: split_audio_file [-h] [--target-dir TARGET_DIR] audio_file transcript

Split an audio file using valid segments from a transcript file. For this
utility, transcript files must contain start/stop times.

positional arguments:
  audio_file            input audio file
  transcript            transcript

optional arguments:
  -h, --help            show this help message and exit
  --target-dir TARGET_DIR
                        Path to target directory
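
For example, to cut an audio file into per-segment clips under a target directory (paths hypothetical; the transcript must contain start/stop times, e.g. an STM file):

split_audio_file --target-dir segments call_recording.wav call_recording.stm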

prepare_audio_corpora

usage: prepare_audio_corpora [-h] [--target-dir TARGET_DIR]
                             corpora [corpora ...]

Copy and organize specified corpora into a target directory. Training,
testing, and development sets will be created automatically if not already
defined.

positional arguments:
  corpora               Name of one or more directories in directory this
                        script is run

optional arguments:
  -h, --help            show this help message and exit
  --target-dir TARGET_DIR
                        Path to target directory

This script scrapes a list of directories for paired STM and SPH files. If train, test, and dev folders are present, these labels are used for the output folder. By default, a target directory of 'input-data' will be created. Note that filenames with hyphens will be sanitized to underscores and that audio files will be forced to single channel, 16 kHz, signed PCM format. If two channels are present, only the first will be used.
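
For example, to gather two corpora from the current directory into the default input-data layout (directory names hypothetical):

prepare_audio_corpora corpus_one corpus_two

Pass --target-dir to write the organized corpora somewhere other than input-data.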

degrade_audio_file

usage: degrade_audio_file input_file1.wav input_file2.wav

Degrade audio files to 8 kHz format similar to G711 codec

This script reduces the audio quality of input audio files so that acoustic models can learn features from telephony audio that uses the G711 codec.

extract_excel_spreadsheets

Note that the use of this function requires the separate installation of pandas. This can be done via pip install pandas.

usage: extract_excel_spreadsheets.py [-h] [--input-folder INPUT_FOLDER]
                                     [--output-corpus OUTPUT_CORPUS]

convert a folder of excel spreadsheets to a corpus of text files

optional arguments:
  -h, --help            show this help message and exit
  --input-folder INPUT_FOLDER
                        input folder of excel spreadsheets ending in .xls or
                        .xlsx
  --output-corpus OUTPUT_CORPUS
                        output folder for storing text corpus
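
For example, after installing pandas, a folder of spreadsheets could be converted as follows (folder names hypothetical):

extract_excel_spreadsheets.py --input-folder earnings_spreadsheets --output-corpus earnings_corpus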

align_json

This aligns a gk hypothesis json file with a reference text file for creating forced alignment STM files for training new ASR models. Note that this function requires the installation of a few extra packages:

python3 -m pip install spacy textacy https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz#egg=en_core_web_sm

usage: align_json.py [-h] input_json ref output_filename

align a gk json file against a reference text file

positional arguments:
  input_json       input gk json file
  ref              reference text file
  output_filename  output_filename

optional arguments:
  -h, --help       show this help message and exit
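
For example, to align a GreenKey hypothesis JSON file against a reference transcript and write the resulting forced-alignment STM file (filenames hypothetical):

align_json.py hypothesis.json reference.txt aligned.stm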

Requirements

  • Python >= 3.6.1 with pip

Contributing

Code of Conduct

Please make sure you read and observe our Code of Conduct.

Pull Request process

  1. Fork it
  2. Create your feature branch (git checkout -b feature/fooBar)
  3. Commit your changes (git commit -am 'Add some fooBar')
  4. Push to the branch (git push origin feature/fooBar)
  5. Create a new Pull Request

NOTE: Commits and pull requests to FINOS repositories will only be accepted from those contributors with an active, executed Individual Contributor License Agreement (ICLA) with FINOS OR who are covered under an existing and active Corporate Contribution License Agreement (CCLA) executed with FINOS. Commits from individuals not covered under an ICLA or CCLA will be flagged and blocked by the FINOS Clabot tool. Please note that some CCLAs require individuals/employees to be explicitly named on the CCLA.

Need an ICLA? Unsure if you are covered under an existing CCLA? Email [email protected]

Authors

License

The code in this repository is distributed under the Apache License, Version 2.0.

Copyright 2020 GreenKey Technologies

greenkey-asrtoolkit's People

Contributors

ageojo, agitana, finos-admin, mgoldey, tschady, tshastry


greenkey-asrtoolkit's Issues

Audio corpora prep should allow for WAV files and mix multichannel audio

The current audio corpora prep seems to only work on SPH files. In addition, the current description says this:

 Note that filenames with hyphens will be sanitized to underscores and that audio files will be forced to single channel, 16 kHz, signed PCM format. If two channels are present, only the first will be used.

Many corpora come in WAV files instead of SPH files, and many also have two unmixed channels that need to be mixed to properly account for all audio.

Request: styled HTML outputs

Is your feature request related to a problem? Please describe.
We need to provide clean HTML output that can be served in front end applications

Describe the solution you'd like
We should be able to return html output as a string or as a file

clean up should gracefully handle phone numbers

Is your feature request related to a problem? Please describe.
Phone numbers in transcripts are mapped to long series of numerals

Describe the solution you'd like
They should be mapped to spelled-out single digits

Examples:

1-317-222-222 should map to the series of numbers 'one', 'three', 'one', 'seven', etc.

use `csv` to make the excel spreadsheet reader in `asrtoolkit` functional

The excel spreadsheet reader in asrtoolkit (as written) requires pandas, which is dependent upon numpy. The latter module was removed from requirements.txt on 1/28/19 to address test failures on circleci (cause undetermined). This script is rarely used internally and does not affect the rest of the codebase, but it is used by clients.

To Do:

Fix the excel spreadsheet reader in asrtoolkit using a library other than pandas. One option would be to use the csv module with its excel dialect.

clean_formatting should remove noise tags

Is your feature request related to a problem? Please describe.
Noise tags [noise] or <noise> are not removed by the text clean up routines.

Describe the solution you'd like
These should be removed using a regex that matches A-Za-z words in square brackets or angle brackets.

wer script does not error with garbage files

Describe the bug
The wer script accepts pretty much anything as input and will spit out an arbitrary WER. This makes it impossible to distinguish a genuine problem with the wer script from a simple filename error. I would suggest adding input validation.

To Reproduce
Steps to reproduce the behavior:

  1. Run the wer script with any invalid input
  2. Observe a lack of error and an output WER
> wer asdf asd
WER: 0.000%
> wer ^ ^
WER: 0.000%
> wer aaaaaaaaaaaaaaa eeeeeeeeeeeeeeeeeee
WER: 100.000%
> wer --ignore-nsns I-like to-eat-garbage
WER: 150.000%
> wer seg_data/gkt_corpora_earnings_test_AAPL2017q1TranscriptMturkTest_seg_0001. seg_data/gkt_corpora_earnings_test_AAPL2017q1TranscriptMturkTest_seg_0001.st
WER: 11.111%

Expected behavior
I would expect the wer script to error when given inputs that aren't valid (i.e. unexpected file formats, nonexistent files, etc.).

Desktop (please complete the following information):

  • OS: Ubuntu 16.04.5
  • Python version 3.5.2+

wer rate is 400%

Hi, I use Python 3.5.3 :: Anaconda custom (x86_64), and when I ran the command wer answer.txt stt.txt, the result was a WER of 400%.

[Screenshot: 2019-12-05 5:52 PM]

Another coworker tried this with the same files and the same command; his Python version was 3.6.8, and he got a result that seemed normal, around WER 49.485%. His computer runs Ubuntu and I use macOS.

Other coworkers (one on Ubuntu, one on Windows) also tried and got the same 400% WER result as me. Can you help me figure out why the calculation result differs between me and the other person even though we used the same files and the same command?

Thank you!!!

Request to archive repository due to lack of activity

Given the lack of activity on this repo, I propose archiving it.

If and when activity can be restored, the repository can easily be unarchived.

We will wait until Friday the 25th of November to collect feedback from the community, then we will proceed with the archival.

Archiving the repo will consist of:

validate_stm should be exposed at the command line

Describe the solution you'd like
validate_stm should be a command line tool

Describe alternatives you've considered
Users can invoke this by writing python code

Additional context
Requested by users of asrtoolkit

Potential typo in align_json.py

Describe the bug
Line 58 of align_json.py: https://github.com/finos/greenkey-asrtoolkit/blob/master/asrtoolkit/align_json.py

Should this be "align_json(", not "align_gk_json("? I don't see align_gk_json anywhere else in the repo.


commas without trailing spaces are not handled cleanly

Describe the bug
Given a string such as "a,b", clean_formatting currently just removes the comma. We should replace it with a space in case bad data has entered a data processing pipeline.

To Reproduce
Steps to reproduce the behavior:

  1. echo a,b > foo.txt
  2. clean_formatting foo.txt
  3. cat foo.txt
ab

Expected behavior

a b

some silence tags are not ignored by our WER util

Describe the bug
Occasionally, <#s> can be output by some decoders to indicate silence

Expected behavior
We should amend the regex matches for noise tags to incorporate this specific tag.

Additional context
This may be somewhat unclear if we group more tagged non-speech events in the same variable, so we may want to rename the variable.

JSON output starttime and endtime secs need to be numeric, not strings

Is your feature request related to a problem? Please describe.
GK JSON schema has been better defined since asrtoolkit first output to it.

Describe the solution you'd like
They should be floats

Describe alternatives you've considered
We could maintain both or switch to strings

Additional context
GK schema last edited by @burrows
