lingvo-copy's People

Contributors

aaroey, anjaligopi, anmolgulati, bcaine, bignamehyp, cclauss, chaiko, drpngx, freeblee, galv, jngiam, jonathanasdf, laurentes, lingvo-bot, lnzsy, mrhuke, odashi, orhanf, protoget, rohan-anil, ronw, rprabhavalkar, rsuderman, tonybruguier, tsainath, weihan3, will-cromar, yzhang87, zffchen78, zh794390558

lingvo-copy's Issues

Escape data IDs with wildcards

Spark cannot load Google Cloud Storage files whose paths contain wildcard characters (e.g., gs://the-peoples-speech-west-europe/archive_org/Nov_6_2020/ALL_CAPTIONED_DATA/1961DoctorBloodsCoffinWKieronMoore/[1961]Doctor Blood's Coffin w Kieron Moore.mp3). Characters like "[" and "]" trigger the problem, but others may as well.

Reproducer:

spark.read.format("binaryFile").load("gs://the-peoples-speech-west-europe/archive_org/Nov_6_2020/ALL_CAPTIONED_DATA/1961DoctorBloodsCoffinWKieronMoore/[1961]Doctor Blood's Coffin w Kieron Moore.mp3")

Running this gives an error.

Related issue (although it doesn't talk about Spark itself):
https://stackoverflow.com/questions/42087510/gsutil-ls-returns-error-contains-wildcard/42146769
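
One possible fix is sketched below: escape glob metacharacters in the data IDs before they become object names, so downstream Spark and gsutil reads never see raw wildcards. This is only a sketch; the percent-encoding scheme and the escape_data_id helper are assumptions, not an agreed-upon convention.

GLOB_CHARS = "*?[]{}"

def escape_data_id(data_id: str) -> str:
    # Percent-encode any character that glob matching treats as special.
    return "".join(
        f"%{ord(c):02X}" if c in GLOB_CHARS else c
        for c in data_id
    )

escape_data_id("[1961]Doctor Blood's Coffin w Kieron Moore.mp3")
# -> "%5B1961%5DDoctor Blood's Coffin w Kieron Moore.mp3"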

Provide a way to request takedown of particular data

It is possible that some of our data is mislabeled as CC-BY or CC-0. In addition, it is possible that a creator may not have intended for their work to belong in a machine learning dataset.

We should provide a way to make a "take down" request.

Let's do a manual process for now.

A simple [email protected] email would suffice. The instructions should make clear that the requester needs to include the "primary key" for the data in question.

On the backend, we typically have two stores of truth: The audio data and transcript stored in a BLOB store, and the metadata, stored in a much more structured format.

Spark SQL should be able to easily delete the metadata for a given record. gsutil rm works for deleting the blob data.
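
A minimal sketch of the manual process, assuming a SparkSession named spark is in scope, the metadata lives in an ACID table (e.g. Delta) named metadata with an identifier primary-key column, and the blob layout matches the bucket paths above; all of those names are assumptions.

import subprocess

identifier = "1961DoctorBloodsCoffinWKieronMoore"  # primary key supplied in the takedown request

# Delete the metadata row. DELETE FROM needs an ACID table format such as Delta;
# with plain parquet we would filter the rows out and rewrite instead.
spark.sql(f"DELETE FROM metadata WHERE identifier = '{identifier}'")

# Delete the audio/transcript blobs for that record.
subprocess.run(
    ["gsutil", "rm", "-r",
     "gs://the-peoples-speech-west-europe/archive_org/Nov_6_2020/"
     f"ALL_CAPTIONED_DATA/{identifier}/"],
    check=True)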

Build a flow to get the creators of CC-BY content

Look at the "license" or "licenseurl" of all of our original items. If it is one of "CC-BY {1,2,3,4}", then we are obligated to provide credit to the original creators of the work in order to comply with the CC-BY license. Almost all of our sources have an "uploader" or "creator" ID in their schema.

We need to output a SQL table that looks like the following:

CREATE TABLE credits(
author_or_uploader TEXT,
name_of_work TEXT,
source INT, -- foreign key into a "sources" table. That would be "librivox", "archive.org", "vimeo", etc.
original_license INT -- this could be a foreign key into a "licenses" table, or we could simply denormalize and list the TEXT of the license in this case. It doesn't matter. Note that CC-BY 1.0, CC-BY 2.0, CC-BY 3.0, and CC-BY 4.0 are all legally distinct licenses
);

We could then simply dump this table as a CSV file in our data distribution to downloaders of the data.
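
A rough PySpark sketch of how this could be produced, assuming the item metadata is already available as a DataFrame named items with creator, uploader, title, source, and licenseurl columns (those names, and the output path, are assumptions):

from pyspark.sql import functions as F

# CC-BY 1.0 through 4.0; CC-BY-SA, CC-BY-NC, etc. are deliberately excluded.
CC_BY_PATTERN = r"creativecommons\.org/licenses/by/[1-4]\.0"

credits = (
    items
    .filter(F.col("licenseurl").rlike(CC_BY_PATTERN))
    .select(
        F.coalesce(F.col("creator"), F.col("uploader")).alias("author_or_uploader"),
        F.col("title").alias("name_of_work"),
        "source",
        # Denormalized: keep the license text/URL directly, per the comment above.
        F.col("licenseurl").alias("original_license"),
    )
)

# Dump as a single CSV file for inclusion in the data distribution.
credits.coalesce(1).write.option("header", "true").csv(
    "gs://the-peoples-speech-west-europe/credits_csv")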

Build a sampling frequency classifier.

Our data comes from a variety of sources, all at different frequencies.

Right now our ETL pipeline converts it all to 16kHz 16-bit signed waveforms.

archive.org has a "bitrate" field, but it is an arbitrary StringType, with no schema. For example, a 44.1kHz audio may have a bitrate field set to null, "", "44.1KHz", "44.1 KHz", "44100Hz", or "44100 Hz".

So clearly we cannot rely on that for archive.org. In general, data comes from sources other than archive.org, so it doesn't make sense to try to make a fuzzy matcher anyway.

The "soxi" utility allows us to inspect sampling rate of any particular input file. This should be fairly reliable.

However, it's possible that the source of the audio may have upsampled the data. For example, an audio file may be 44.1 kHz, but have been recorded on equipment that only captured a 22.05 kHz sampling rate.

It may be worthwhile, therefore, to try to detect the "original" sampling rate of an audio. A straightforward way to do this may be to take the FFT of each audio file, get the noise power at particular frequency bands, and use hand-coded decision rules to classify into the categories 8 kHz, 16 kHz, 22.05 kHz, 44.1 kHz, and 48 kHz (I don't think there are any other meaningful sampling rates in audio land). This may fail, however, if some files are unusually quiet. This is an instance of https://en.wikipedia.org/wiki/Ordinal_regression, though I'm not quite sure machine learning is the right approach here.
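
A minimal sketch of that heuristic, assuming files are read at their stored rate with the soundfile library; the -55 dB threshold is a made-up placeholder, and unusually quiet recordings will still fool it:

import numpy as np
import soundfile as sf

CANDIDATE_RATES = [8000, 16000, 22050, 44100, 48000]

def estimate_original_rate(path: str, threshold_db: float = -55.0) -> int:
    samples, stored_rate = sf.read(path)
    if samples.ndim > 1:
        samples = samples.mean(axis=1)  # mix down to mono
    power = np.abs(np.fft.rfft(samples)) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / stored_rate)
    total = power.sum() + 1e-12
    for rate in CANDIDATE_RATES:
        if rate >= stored_rate:
            break
        # Fraction of energy above this candidate rate's Nyquist frequency.
        above = power[freqs > rate / 2].sum() / total
        if 10 * np.log10(above + 1e-12) < threshold_db:
            return rate  # band above Nyquist is empty: likely upsampled from here
    return stored_rate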

Gender and Accent Recognition Brainstorm

Once a performant forced alignment pipeline is done, my next thought is how to add gender and accent recognition.

First of all, I will assume that each segment output by the forced aligner contains the voice of only one speaker. The forced alignment system depends upon voice activity detection to find silent regions of the audio, so this is a decent assumption, except in the case of interjections (where one speaker speaks over another without waiting for the first speaker to stop).

It is straightforward to consider these supervised learning tasks, but I have other ideas.

Librivox and Common Voice both contain gender labels. As I understand it, Common Voice also contains either region or accent metadata for speakers. It may also be possible to detect, e.g., Spanish-accented English by using Spanish spoken by native speakers, even if the region or accent metadata is poor. Finally, Common Voice and Librivox both contain "clean" audio. It is probably worthwhile to use SpecAugment with whatever training process we use in order to help generalization when applied to archive.org data (I've seen data where it is raining, etc.).

We can use locality-sensitive hashing on the "ivector" of each training data audio segment (we ignore the text transcript). An ivector is a fixed-length vector summarizing an arbitrary-length piece of audio. It is essentially the mean of a Gaussian distribution. We would use locality-sensitive hashing to put the gender- and accent-labeled data's ivectors into hash buckets.

You can use Euclidean distance or Mahalanobis distance for the distance metric. It's important that all ivectors (which are the means of multivariate Gaussians) have the same covariance for these metrics. I'm not 100% certain, but it seems like you could simply normalize each spectrogram's mean to 0 and its covariance to the identity matrix for this. Worth double checking. Otherwise, https://en.wikipedia.org/wiki/Bhattacharyya_distance may work.

For unlabeled data, we would also hash the ivectors into the same hash buckets. We would classify an unlabeled ivector by assigning it to the majority accent or gender in that bucket.
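
A minimal sketch of the random-hyperplane LSH plus majority-vote idea, assuming ivectors are already extracted as fixed-length numpy arrays (the extractor itself, and all names here, are assumptions):

from collections import Counter, defaultdict
import numpy as np

class IvectorLSH:
    def __init__(self, dim: int, n_bits: int = 16, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Each random hyperplane contributes one bit of the bucket key.
        self.planes = rng.standard_normal((n_bits, dim))
        self.buckets = defaultdict(list)

    def _key(self, ivector):
        return tuple((self.planes @ ivector > 0).astype(int))

    def add(self, ivector, label):
        # Called with gender- or accent-labeled Librivox / Common Voice ivectors.
        self.buckets[self._key(ivector)].append(label)

    def classify(self, ivector):
        labels = self.buckets.get(self._key(ivector), [])
        if not labels:
            return None  # empty bucket: possibly an unusual accent or speaker
        return Counter(labels).most_common(1)[0][0]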

What is distinct about this from a discriminative or supervised method is that it has an "open world" assumption. For example, we could have a large number of buckets. If there are no or few labeled ivectors in a particular bucket, that suggests this may be an unusual accent or way of speaking compared to the rest of the dataset. For example, it could be someone with a speech impediment or children's speech, which are probably worth capturing in their own right.

Finally, if you get ivectors, you could conceivably visualize them using t-SNE or whatever is newer or cooler than it nowadays.

Experiment with Speaker Diarization

From manual inspection, we have interviews, university lectures, plays, spoken histories, city hall proceedings, and other forms of labeled audio where there are multiple speakers.

This is interesting because it is well-known that ASR systems struggle when there are multiple speakers.

It would be good to get a handle on how many speakers are in each of our audio tracks, as well as when they are speaking. I'm not quite sure how good the state of the art is for this subfield (known as speaker diarization). People did this via k-means clustering in the past, where each speaker is a cluster, but k (the number of speakers) is a hyperparameter. I'm not sure whether there are better mechanisms today for automatically proposing the number of speakers.
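
For reference, a sketch of that classical k-means baseline, assuming per-window speaker embeddings (e.g. ivectors) are already computed; scikit-learn's silhouette score is just one simple way to propose k:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def diarize(embeddings: np.ndarray, max_speakers: int = 8):
    """embeddings: (n_windows, dim) array, one row per audio window."""
    best_k, best_score = 1, -1.0
    best_labels = np.zeros(len(embeddings), dtype=int)
    for k in range(2, min(max_speakers, len(embeddings) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels  # proposed speaker count and per-window speaker ids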

This is very open-ended, and I'm out of my depth here.
