
cv-dataset's Introduction

Common Voice

This is the web app for Mozilla Common Voice, a platform for collecting speech donations in order to create public domain datasets for training voice recognition-related tools.

Upcoming releases

| Type | Release cadence | More info |
| --- | --- | --- |
| Platform code & sentences | Monthly, or as needed | Release notes |
| Dataset | Quarterly | Dataset metadata |

Quick links

How to contribute

🎉 First off, thanks for taking the time to contribute! This project would not be possible without people like you. 🎉

There are many ways to get involved with Common Voice - you don't have to know how to code to contribute!

  • To add or correct translations of the web interface, please use Mozilla's localization platform, Pontoon. Please note that we do not accept direct pull requests for localization content.
  • For information on how to add or edit sentences in Common Voice, see SENTENCES.md
  • For instructions on setting up a local development environment, see DEVELOPMENT.md
  • For information on how to add a new language to Common Voice, see LANGUAGE.md
  • For information on how to get in contact with existing language communities, see COMMUNITIES.md

For more general guidance on building your own language community using Mozilla voice tools, please refer to the Mozilla Voice Community Playbook.

Discussion

For general discussion (feedback, ideas, random musings), head to our Discourse Category.

For bug reports or specific feature requests, please use the GitHub issue tracker.

For live chat, join us on Matrix.

Licensing and content source

This repository is released under the Mozilla Public License (MPL) 2.0.

The majority of the sentence text in /server/data comes either directly from user submissions to our Sentence Collector or from Wikipedia, scraped using our extractor tool, and is released under the CC0 public domain Creative Commons license.

Any files that follow the pattern europarl-VERSION-LANG.txt (such as europarl-v7-de.txt) were extracted, with our thanks, from the Europarl Corpus, which features transcripts of proceedings in the European Parliament.

Citation

If you use the data in published academic work, we would appreciate it if you cited the following article:

  • Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M. and Weber, G. (2020) "Common Voice: A Massively-Multilingual Speech Corpus". Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pp. 4211–4215.

The BibTeX entry is:

@inproceedings{commonvoice:2020,
  author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
  title = {Common Voice: A Massively-Multilingual Speech Corpus},
  booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
  pages = {4211--4215},
  year = 2020
}

Cross Browser Testing

This project is tested with BrowserStack.


cv-dataset's Issues

Can a native Chinese speaker help explain what each of the *.tsv files means?

validated contains a list of all clips that have received two or more validations where up_votes > down_votes

invalidated contains a list of all clips that have received two or more validations where down_votes > up_votes, or clips that have received three or more validations where down_votes = up_votes

other contains a list of all clips that have not received sufficient validations to determine their status

I do not understand these definitions.
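The quoted rules can be sketched as a small function (vote thresholds exactly as quoted above; an illustration, not the platform's actual code):

```python
def bucket(up_votes: int, down_votes: int) -> str:
    """Classify a clip by the vote rules quoted above:
    validated/invalidated require at least two validations,
    and a tie only becomes invalidated from three votes up."""
    total = up_votes + down_votes
    if total >= 2 and up_votes > down_votes:
        return "validated"
    if (total >= 2 and down_votes > up_votes) or (
        total >= 3 and down_votes == up_votes
    ):
        return "invalidated"
    return "other"  # not enough validations to determine status
```

So a clip with one up-vote and one down-vote still sits in `other`; it only moves to `invalidated` once a third, tie-breaking-eligible validation arrives.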

Feature Request: More digits for percentage values.

Age and gender percentage values carry only two significant digits, like 0.12. Because these values are rounded, summing them will not give exactly 100% most of the time.

It would be very nice to have two more significant digits in these fields, like 0.1234 for 12.34%, so that one can sum the values and do the rounding oneself.
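The effect can be seen with a toy example (hypothetical split values, not real dataset figures):

```python
# Three hypothetical age splits that sum to exactly 1.0.
raw = {"twenties": 0.334, "thirties": 0.333, "": 0.333}

two_digits = {k: round(v, 2) for k, v in raw.items()}   # as published today
four_digits = {k: round(v, 4) for k, v in raw.items()}  # as requested

# With two digits each value becomes 0.33, so the sum drifts to 0.99;
# with four digits the sum stays at 1.0 (up to float noise).
```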

Bug: accent splits are not shown in dataset JSON summary after release 7

Description of bug:

Since the inception of the Accent functionality, dataset release summary JSON files have contained a summary of the accent splits, see for example:

https://github.com/common-voice/cv-dataset/blob/main/datasets/cv-corpus-7.0-2021-07-21.json

"date": "2021-07-21",
  "locales": {
    "en": {
      "buckets": {
        "dev": 16284,
        "invalidated": 220015,
        "other": 220176,
        "reported": 2732,
        "test": 16284,
        "train": 759975,
        "validated": 1425784
      },
      "reportedSentences": 2679,
      "duration": 9493711987,
      "clips": 1865975,
      "splits": {
        "accent": {
          "": 0.51,
          "canada": 0.03,
          "england": 0.08,
          "us": 0.23,
          "indian": 0.07,
          "australia": 0.03,
          "malaysia": 0,
          "newzealand": 0.01,
          "african": 0.01,
          "ireland": 0.01,
          "philippines": 0,
          "singapore": 0,
          "scotland": 0.02,
          "hongkong": 0,
          "bermuda": 0,
          "southatlandtic": 0,
          "wales": 0,
          "other": 0.01
        },

By contrast, v8 and v9 of the release do not contain accent splits:

 "locales": {
        "en": {
            "duration": 10390463635,
            "buckets": {
                "dev": 16326,
                "invalidated": 239065,
                "other": 251332,
                "reported": 3558,
                "test": 16326,
                "train": 864448,
                "validated": 1530385
            },
            "reportedSentences": 3500,
            "clips": 2020782,
            "splits": {
                "accent": {
                    "": 1
                },
                "age": {
                    "": 0.37,
                    "twenties": 0.24,
                    "sixties": 0.04,
                    "thirties": 0.13,
                    "teens": 0.06,
                    "seventies": 0.01,
                    "fourties": 0.1,
                    "fifties": 0.04,
                    "eighties": 0,
                    "nineties": 0
                },
                "gender": {
                    "": 0.37,
                    "male": 0.46,
                    "female": 0.16,
                    "other": 0.02
                }
            },
            "users": 79398,
            "size": 75356163484,
            "checksum": "8b82525e6adb8382e28eabfed1beeedd3f315c1d3cdf7445a3ff33743f42025d",
            "avgDurationSecs": 5.142,
            "validDurationSecs": 7868938.703,
            "totalHrs": 2886.23,
            "validHrs": 2185.81
        },

This may be due to the self-identified accent functionality implemented some months ago: there are now many self-entered accents and no agreed method of grouping them. However, accent split data would be very useful for equity, diversity and inclusion measures.

Kind regards,
Kathy

Feature request: Summary data of each language including rows with metadata, gender, age, accent distribution

Firstly, a huge thanks to the team for all the effort that goes into Common Voice; it is hugely appreciated.

If possible, I would like summary data for each language for which a dataset is released, showing:

  • Unique rows
  • Unique contributors
  • Rows with metadata (#)
  • Rows with metadata (%)
  • Approximate hours with metadata
  • Row count by genders
  • Row count by age ranges
  • Row count by accents

This allows a researcher to understand easily how much of a language dataset has metadata, and what the metadata distribution looks like. Some of this is already in the JSON files in this directory, but this is a different "slice and dice" of the summary data.

I have a Python script that calculates these from the validated.tsv file of a language's dataset, and I am happy to share it.

Kind regards,
Kathy

Wrong checksums for Common Voice Corpus 13.0

Hello. I usually verify checksums after downloading, and up to Common Voice Corpus 12.0 this worked without problems. For Common Voice Corpus 13.0, however, I suspect the published checksums are wrong: the downloads complete without issues, but the checksums don't match. I don't have the resources or time to check more datasets, but I can provide a few (I suppose all checksums for this version were calculated incorrectly):

  • CV13-German -> wrong checksum (can't provide the values)

DATASET -> PROVIDED_CHECKSUM -> REAL_CHECKSUM
CV13-Icelandic -> 48db6e809f5b6eb0c00b077e6b736aeeee5d544ee3f2fdd059244da88926c040 -> 33e4c68fe2b4501f358a4762487f9c2b9d8c509a304a3288d31fd24ba6e3c451
CV13-Danish -> 6c85261bcf8dffe5c06ad29c82760cda5cd1fdc7d9c1c99b6285a425f11d105e -> 5b39bb325b76043a57b8735621dd6c8b68b615d49903c18bfa9cb4b783df01af
CV13-Occitan -> e241c12159ac7b3d880f41d5e91d804775da188a3ac413c775341eef3406001b -> 59480c122de507e4f8ce94120a726ea042040c0750ceaf16d9d708c485a6288d

Feature request: CSV

A CSV with just the most important data (locale, totalHrs, validHrs and the number of voices) would be great.

Error: Version 15 summary data does not contain nested objects for splits (age, gender) and buckets (validation)

Description of error

In the newly released v15 JSON file, the splits object and the buckets object are missing for each locale object. These contain data on the age, gender and validation splits for each locale. I need this data to produce the metadata-coverage visualisation for Common Voice. The v15 metadata coverage is done, but it shows errors where the splits objects are missing.

Outcome sought

Could the JSON file please be re-generated with the splits object and buckets object for each locale?

Example of error

Example of locale data for en language in v15 JSON:

"en": {
      "duration": 244618020,
      "reportedSentences": 140,
      "clips": 40553,
      "users": 750,
      "size": 1371012179,
      "checksum": "794f4ea6c6bab3731d54cf7ce3d67996cf1ba7c0d92cbd338c36636a9716047a",
      "avgDurationSecs": 5.2,
      "validDurationSecs": 171002.31,
      "totalHrs": 67.95,
      "validHrs": 47.5
    }

The same locale data for en in v14 JSON:

 "en": {
      "buckets": {
        "dev": 16380,
        "invalidated": 272017,
        "other": 279585,
        "reported": 6445,
        "test": 16380,
        "train": 1046685,
        "validated": 1724421
      },
      "duration": 11802687079,
      "reportedSentences": 6368,
      "clips": 2276023,
      "splits": {
        "accent": {
          "": 1
        },
        "age": {
          "": 0.37,
          "twenties": 0.24,
          "sixties": 0.04,
          "thirties": 0.14,
          "teens": 0.06,
          "seventies": 0.01,
          "fourties": 0.09,
          "fifties": 0.05,
          "eighties": 0,
          "nineties": 0
        },
        "gender": {
          "": 0.37,
          "male": 0.45,
          "female": 0.16,
          "other": 0.02
        }
      },
      "users": 88154,
      "size": 83555475656,
      "checksum": "6e88c7460090c5a6ca7f02a8525bd669e7bc509a47b2f1974977d5065a054507",
      "avgDurationSecs": 5.186,
      "validDurationSecs": 8942265.283,
      "totalHrs": 3278.52,
      "validHrs": 2483.96
    },

Feature request: Datasets with only validated recordings

I've posted this already in the main repo, but seeing #26 here makes me think this might be the more appropriate place to request it.

When downloading datasets, one must download the whole set (or a delta) including all sentences and recordings, whether validated or not, even if the user only needs the validated data. This consumes a lot of bandwidth, time and disk space, and it is not environmentally friendly either.

Offering the option to just download the part of the dataset with validated recordings would save a lot of time and make the data more accessible to more people. Being able to download only the tsv files would also be a good addition, but this is already addressed in #26.

I don't know how complex it would be to implement this, but I feel this would be a very useful quality of life feature, so I hope it is taken into consideration.

Thanks for your work in this amazing project in any case!

FEATURE REQUEST: Make the `.tsv` files that are part of a downloaded dataset available separately

User story

  • As a researcher, I frequently create data visualisations based on the validated.tsv file of a language / release. Currently the only way to obtain this file is to download the whole dataset or delta.

I want to be able to get just the .tsv files related to a release, without downloading the clips, so that I can do faster data visualisations.

Acceptance criteria

  • The files

    • clip_durations.tsv
    • invalidated.tsv
    • other.tsv
    • reported.tsv
    • validated.tsv

are available

  • for each language in the CV corpus (about 103 at time of writing)
  • for each version
  • including delta releases

from the CV datasets download page, in the same way as we currently download the .tar.gz formatted datasets.

Bug: Discrepancy for locale "eo" in v10.0 dataset

Valid hours are larger than total hours...

        "eo": {
            "duration": 6740710,
            "buckets": {
                "dev": 14907,
                "invalidated": 127293,
                "other": 135058,
                "reported": 2127,
                "test": 14907,
                "train": 143988,
                "validated": 848511
            },
            "reportedSentences": 2126,
            "clips": 1110862,
            "splits": {
                "accent": {
                    "": 1
                },
                "age": {
                    "twenties": 0.56,
                    "thirties": 0.12,
                    "": 0.2,
                    "fourties": 0.04,
                    "fifties": 0.02,
                    "seventies": 0,
                    "teens": 0.05,
                    "sixties": 0,
                    "eighties": 0
                },
                "gender": {
                    "male": 0.69,
                    "": 0.2,
                    "female": 0.11,
                    "other": 0
                }
            },
            "users": 1541,
            "size": 40260737095,
            "checksum": "2179bad54bb2b69cd12964bc2f6533b9538b7a3f943f9e65f8f9a463796fd901",
            "avgDurationSecs": 6.068,
            "validDurationSecs": 5148764,
            "totalHrs": 1430.21,
            "validHrs": 1872.42
        },
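Worth noting: for the quoted entry, validDurationSecs / 3600 ≈ 1430.21, which matches the listed totalHrs rather than validHrs, so the two hour fields look swapped. A sketch of a consistency check (the invariant validHrs == validDurationSecs / 3600 <= totalHrs is inferred from other releases, not documented):

```python
def check_hours(locale_info: dict, tolerance: float = 0.05) -> bool:
    """Return True when a locale entry passes an assumed invariant:
    validHrs should equal validDurationSecs converted to hours
    (within a relative tolerance) and never exceed totalHrs."""
    derived = locale_info["validDurationSecs"] / 3600
    return (
        abs(locale_info["validHrs"] - derived) <= tolerance * max(derived, 1)
        and locale_info["validHrs"] <= locale_info["totalHrs"]
    )

# The quoted "eo" entry fails this check: 5148764 s / 3600 = 1430.21 h,
# i.e. the listed totalHrs, while validHrs is the larger 1872.42.
eo = {"validDurationSecs": 5148764, "totalHrs": 1430.21, "validHrs": 1872.42}
```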

Possibility to release tar file with just the additional data

Hi,

Thank you for providing these datasets! It's really helpful.

With every version, would it be possible to also release a variant containing only the additional data added since the previous version?

It's becoming quite expensive to work on servers, as untarring such large files is really costly.

Thanks

Minor Bug in Text Corpus calculations

This happened in the v17.0 data and only for the cnh locale. Somewhere a minus 1 is applied (apparently to drop a header line), but it yields a negative value when there is no data and hence no header line. The Laiholh (Hakha Chin) locale has no unvalidated sentences, and its unvalidated_sentences.tsv file is completely empty.

    "cnh": {
      ...
      "validatedSentences": 5218,
      "unvalidatedSentences": -1,
      ...
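A sketch of the likely fix, subtracting the header only when one actually exists (a hypothetical helper, not the project's code):

```python
def count_data_lines(tsv_text: str) -> int:
    """Count data rows in a TSV, guarding against the empty-file
    case described above: never return a negative count just
    because a header line was assumed."""
    lines = [ln for ln in tsv_text.splitlines() if ln.strip()]
    return max(0, len(lines) - 1)
```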

FEATURE REQUEST: Please add `duration` as a metadata item that is included in the `*.tsv` files with a release

User story

As a researcher, I want to be able to run analysis on the duration of clips in Common Voice for various languages and purposes, for example to use Common Voice as an evaluation corpus for speech recognition engines.

Currently, the clip duration is available as an aggregate statistic in the JSON files provided by this (cv-dataset) repository. However, it is not available as a per-clip metadata item in, e.g., the validated.tsv file.

As a researcher, this means I must do additional processing to calculate the duration of each .mp3 clip, which is frustrating when this information is already known in advance.

Acceptance criteria

  • duration in seconds is included for each data item in each *.tsv of each dataset

A small request for column & field naming

  • clip_durations.tsv contains two columns: clip and duration[ms]
  • "non-binary": 0,

duration[ms] and non-binary can be a nuisance when manipulating data. It would be better to use standardized names, such as duration_ms and non_binary, which are compatible with any system (e.g. database columns or variable names in any programming language).

I use and manipulate all of this data extensively, and names like these sometimes force me to write special-case code.
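The kind of special-case normalization consumers end up writing looks roughly like this (an illustrative helper):

```python
import re

def normalize_column(name: str) -> str:
    """Map awkward field names such as `duration[ms]` or
    `non-binary` to snake_case identifiers (`duration_ms`,
    `non_binary`) that are safe as database columns or variables."""
    name = re.sub(r"[\[\]\-\s]+", "_", name.strip())
    return re.sub(r"_+", "_", name).strip("_").lower()
```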

.tsv files not found

Hi, I downloaded the Common Voice Corpus 1 dataset for English from the Mozilla Common Voice website. The downloaded archive only contained a single file called 'en', with no extension that Windows could recognize.

Is there a way to extract all the .tsv files?

Download format is .tar instead of .tar.gz

Hi

When I downloaded the dataset, I got a .tar file. I had to look around before finding this repository and renaming the file to .tar.gz to make it work.

Please correct this, or mention it on the site.

Thanks
