swc_parser

This repository contains a simple script that turns the Spoken Wikipedia Corpus into dataframes and JSONs. I created this repo because the downloaded data comes in an XML format that is not easily human-readable. I might turn this into a package, since there are a couple of dependencies.

About the script

Dependencies

  • click
  • pandas
  • xmltodict

parser.py works from the top-level directory (i.e. PATH/TO/spoken_wikipedia_corpus/), traverses all the directories, and creates JSONs (in case you want to munge the data differently from how I did it here) as well as dataframes containing the columns "term", "start_time", "end_time", and "phonemes".

These JSONs and dataframes are saved in the same folder as the aligned.swc file for each Wikipedia article, as aligned.swc.json and aligned.swc.df respectively.
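For orientation, the traversal boils down to something like the sketch below. This is not the actual parser.py; the dataframe-building step is omitted here because it depends on the schema details described further down.

import json
import os
import xmltodict

def parse_files(top_dir):
    # Walk every article folder under the corpus root.
    for root, _, files in os.walk(top_dir):
        if 'aligned.swc' not in files:
            continue
        path = os.path.join(root, 'aligned.swc')
        with open(path, encoding='utf-8') as f:
            doc = xmltodict.parse(f.read())
        # Dump the raw dict as JSON, for munging the data your own way.
        with open(path + '.json', 'w', encoding='utf-8') as f:
            json.dump(doc, f)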

Use

To use the script, simply call python parser.py with the additional parameter -fd to specify the directory where your files are, and it will automatically traverse all the directories.
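For example, with PATH/TO/spoken_wikipedia_corpus standing in for your own download location:

python parser.py -fd PATH/TO/spoken_wikipedia_corpus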

If you have downloaded the repo from GitHub, run pip install -e . from within the folder. Then, from Python, you can do this:

from swc_parser import parser
parser.parse_files('PATH/TO/spoken_wikipedia_corpus')

and voilà!

About the dataframe schema

start_time and end_time are taken directly from the XML schema, as is the term being pronounced, which is stored in the XML as t (or possibly n). It is called term here for transparency.

The phonemes column contains a JSON blob of phone durations for each word, where those existed in the annotation file. Many words have word durations but no phone durations, and vice versa. Additionally, many words have no durations or timestamps at all, presumably because the forced aligner could not align them.
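As a minimal sketch of getting at those phone durations, assuming the dataframes were pickled and the phonemes column holds a JSON string (with missing values where no phone-level annotation exists):

import json
import pandas as pd

df = pd.read_pickle('PATH/TO/article/aligned.swc.df')  # serialization format assumed
# Parse the JSON blob only for rows that actually have phone-level annotation.
phones = df['phonemes'].dropna().map(json.loads)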

Because it is easy to lose track of the documents themselves (and indeed, even the sentences), I have also uploaded my personal version of the data (swc_word_durations.csv), a tab-delimited text file with two columns beyond those in the per-folder dataframes: id, an index of the article the data come from, in alphabetical order, and duration, computed as end_time minus start_time. Additionally, the index (which loads as a column X in R, or as a regular index in pandas) keeps track of the location of each word or character in the document being read aloud.
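To load that file, note that it is tab-delimited despite the .csv extension; something like:

import pandas as pd

# Tab-delimited despite the extension; the first column is the positional
# index described above (loaded as column X in R).
words = pd.read_csv('swc_word_durations.csv', sep='\t', index_col=0)
# duration is end_time minus start_time, so e.g. mean word duration per article:
print(words.groupby('id')['duration'].mean())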

UPDATE 11/30/17

Arne Köhn of SWC wrote up a solid justification of why they used XML rather than JSON, and of how to parse it with standard tooling such as XPath. Give it a read :)
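If you would rather work with the XML directly, here is a hedged sketch using the standard library's ElementTree and its limited XPath support; the element name (n) and attribute names (start, end, pronunciation) reflect my reading of the schema and may need adjusting against the real SWC format:

import xml.etree.ElementTree as ET

tree = ET.parse('PATH/TO/article/aligned.swc')
# XPath-style query for token elements; names are assumptions about the schema.
for n in tree.getroot().iterfind('.//n'):
    start, end = n.get('start'), n.get('end')
    if start is not None and end is not None:
        print(n.get('pronunciation'), start, end)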
