Code Monkey home page Code Monkey logo

speech_to_text_data_pipeline's Introduction

Speech_to_text_data_pipeline

image

Table of content

Overview

The purpose of this weekโ€™s challenge is to build a data engineering pipeline that allows recording millions of Amharic and Swahili speakers reading digital texts in-app and web platforms. There are a number of large text corpora we will use We will design and build a robust, large scale, fault tolerant, highly available Kafka cluster that can be used to post a sentence and receive an audio file. By the end of this project, we will produce a tool that can be deployed to process posting and receiving text and audio files from and into a data lake, apply transformation in a distributed manner, and load it into a warehouse in a suitable format to train a speech-t0-text model.

Install

git clone https://github.com/Reiten-10Academy/Speech_to_text_data_pipeline
cd Speech_to_text_data_pipeline
pip install -r requirements.txt

Data

Data can be found here

Pipeline

flow of data is shown with the arrows, and the order of execution is shown with the numbers attached to the bottom of the arrows.

image

  • 1: Load original dataset as a csv to from unprocessed folder cleaning and selecting script to be processed by spark
  • 2: Load a csv file containing id and text column to interim folder from cleaning script
  • 3: Load cleaned data set from interim folder in s3 bucket to producer script that sends one row of data every X seconds to kafka topic
  • 4: Send one row of data (sentence and Id) to kafka every X seconds.
  • 5: Request for a sentence is sent out to a react frontend
  • 6: The GET request is transfered from the react frontend to flask api
  • 7: A kafka consumer requests to load latest sentence added to kafka topic
  • 8: A kafka Topic responds back by sending a sentence and its id to the consumer
  • 9: A flask api responds to the GET request and sends the sentence and id to the react frontend
  • 10: The react frontend sends the loaded sentence to the user screen
  • 11: The User screen records an audio and sends the audio along with the sentence to the react frontend
  • 12: The react frontend sends a POST request to the flask api by putting the sentence, id, and audio as a body of the message
  • 13: The flask api will rename the audio with the sentence id and sends it to an s3 bucket and put it in "unprocessed" folder and creates a new column of data that holds the url of this audio file besides the sentence and id column. Finally, it sends this metadata to a kafka topic through a kafka producer.
  • 14: A kafak topic sends rows containing id, sentence, and URL information to a text loader script that holds a kafka consumer.
  • 15: A text loader script will consume all rows of data in kafka and put them in the interim folder
  • 16: A audio cleaner script that runs a pyspark code will load audios and meta data from the unprocessed folder and the interim folder respectively and performs some final cleaning and preparation
  • 17: The Audio cleaner script will move the cleaned audio to a folder called "audio" inside the "processed" folder and put the metadata containing information about the audio file in the "processed" folder

description

 Amharic news text classification dataset with baseline performance dataset: 

folders

  • backend: a flask server and a bunch of python scripts that process data in pipeline
  • frontend: a react application.
  • extra: contains, notebooks, docs, and other development and testing files.

Authors

  • ๐Ÿ‘ค Biniyam Belayneh
  • ๐Ÿ‘ค Meron Abate
  • ๐Ÿ‘ค Tewodros Kaderaleh
  • ๐Ÿ‘ค Gezahegne Wondachew
  • ๐Ÿ‘ค Hewan Mulu
  • ๐Ÿ‘ค Titus Wachira
  • ๐Ÿ‘ค Amal Abdallah

Show your support

Give a โญ if you like this project!

speech_to_text_data_pipeline's People

Contributors

benbel376 avatar programmingoperative avatar gezish avatar hewanm avatar meriab21 avatar teddyk251 avatar amulah-98 avatar

speech_to_text_data_pipeline's Issues

front-end

  • create text displayer
  • create audio recorder
  • create audio submitter

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.