aws-transcription-parser's Introduction

Automated parsing of Amazon-Transcribe-annotated episode transcripts

This script automatically parses the JSON output of Amazon Transcribe jobs for podcast episodes and writes the result as HTML.

Compatible with Python 2.7 and Python 3+.

Populate an input file called input.txt with one line per episode. Each line holds two semicolon-separated fields: the name of the Amazon Transcribe output JSON file, and a comma-separated, ordered list of speakers.

For example, the following input.txt file processes episode_1.json, episode_2.json and episode_3.json in turn. Within each file, the listed speaker names replace the automatically generated placeholders spk_0, spk_1 and spk_2, in the order given on that line. Each output HTML file is named after the jobName in the corresponding input JSON file.

episode_1.json;speaker_1,speaker_2
episode_2.json;speaker_2,speaker_3
episode_3.json;speaker_1,speaker_2,speaker_3
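
The line format above can be illustrated with a minimal sketch. The function names parse_input_line and placeholder_map are hypothetical and are not taken from process_aws_output.py; the sketch only assumes the format described above (file name, a semicolon, then comma-separated names mapped in order to spk_0, spk_1, ...).

```python
def parse_input_line(line):
    """Split one 'file.json;name_a,name_b' line into (filename, [names]).

    Hypothetical helper: illustrates the input.txt format only.
    """
    fname, _, speakers = line.strip().partition(";")
    names = [s.strip() for s in speakers.split(",") if s.strip()]
    return fname.strip(), names


def placeholder_map(names):
    """Map placeholders spk_0, spk_1, ... to the given names, in order."""
    return {"spk_%d" % i: name for i, name in enumerate(names)}


fname, names = parse_input_line("episode_2.json;speaker_2,speaker_3")
mapping = placeholder_map(names)
```

Here mapping pairs spk_0 with speaker_2 and spk_1 with speaker_3, which is why the order of names on each line matters.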

Once you've created your input.txt file and placed it in the same directory as process_aws_output.py, run the script with Python:

$ python process_aws_output.py
SUCCESS!

The SUCCESS! message indicates that all HTML output files have been written to that same directory.

Please don't hesitate to ask questions or request changes or improvements via the Issues section.

If you're feeling generous, donations are welcome:

BTC: 1QFNgTV3GQby8uv3mXwLKBHAgKUEenSREd

ETH: 0xa7350d9fb3c6193759b587bb984f0dfe3568c8ed

LTC: LW3SNJ61CXUfRQTpehpDfV7vv1iVdLh9En

ADA: DdzFFzCqrhtBbS7o5LQ3u1ZxFVz3Q6b2bQ86FEYanf6UsRgK6D3So4grpZEHPXcitQWEuRfnAA7jzi3xmj9Md6kng2UiVn4QLxEsAefK

BCH: 1QFNgTV3GQby8uv3mXwLKBHAgKUEenSREd

aws-transcription-parser's Issues

list index out of range

I have a JSON file from Amazon with two speakers, but I get the error below:
Traceback (most recent call last):
  File "process_aws_output.py", line 114, in <module>
    run('input.txt')
  File "process_aws_output.py", line 109, in run
    parse_raw_transcription(fname, speakers)
  File "process_aws_output.py", line 96, in parse_raw_transcription
    html = build_html(lines, end_times, job_name, speaker_names)
  File "process_aws_output.py", line 81, in build_html
    html, current_speaker = _update_speaker(html, range_speaker)
  File "process_aws_output.py", line 31, in update_speaker
    speaker = speaker_map[int(speaker_name[-1])]
IndexError: list index out of range
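
The failing expression, speaker_map[int(speaker_name[-1])], indexes the list of names with the last digit of the placeholder. One plausible cause, offered as an assumption and not a confirmed diagnosis, is that the JSON contains more speaker placeholders than there are names on the matching input.txt line. A defensive lookup could be sketched as follows; resolve_speaker is a hypothetical helper, not part of the script.

```python
def resolve_speaker(speaker_label, speaker_map):
    """Return the name for a placeholder such as 'spk_1'.

    Hypothetical helper: falls back to the raw placeholder when no
    matching name was supplied, instead of raising IndexError.
    """
    try:
        # Use the whole numeric suffix, so 'spk_10' is index 10, not 0.
        index = int(speaker_label.rsplit("_", 1)[-1])
    except ValueError:
        return speaker_label
    if 0 <= index < len(speaker_map):
        return speaker_map[index]
    return speaker_label
```

With two names supplied, resolve_speaker("spk_2", names) returns "spk_2" unchanged, which makes the missing name visible in the output without stopping the run.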

Speaker labels

Hi,

This script assumes there are going to be at least two speakers and that the speakers are labeled. I have a .json file where only one person is speaking. I don't see any speaker labels, so I get an error when I run the program. Any suggestions?
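
One possible approach, sketched here under assumptions, is to check whether the JSON contains speaker labels before parsing, and fall back to the plain transcript when it does not. The sketch assumes the usual Amazon Transcribe output layout, where results.speaker_labels.segments is present only when speaker identification was enabled for the job and results.transcripts holds the full text. Both function names are hypothetical and do not exist in process_aws_output.py.

```python
import json


def has_speaker_labels(transcribe_output):
    """True if the parsed JSON carries at least one speaker segment."""
    results = transcribe_output.get("results", {})
    segments = results.get("speaker_labels", {}).get("segments", [])
    return len(segments) > 0


def plain_transcript(transcribe_output):
    """Join the unlabeled transcript text as a single-speaker fallback."""
    transcripts = transcribe_output.get("results", {}).get("transcripts", [])
    return " ".join(t.get("transcript", "") for t in transcripts)


# Tiny stand-in for a single-speaker job without speaker labels.
sample = json.loads(
    '{"jobName": "demo", "results": {"transcripts": [{"transcript": "Hello."}]}}'
)
text = None if has_speaker_labels(sample) else plain_transcript(sample)
```

For a single-speaker recording, the fallback text could then be written to HTML under one name taken from input.txt.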

crypto-jeronimo / aws-transcription-parser
