
cv-dataset's Introduction

Common Voice

This is the web app for Mozilla Common Voice, a platform for collecting speech donations in order to create public domain datasets for training voice recognition-related tools.

Upcoming releases

| Type | Release cadence | More info |
| --- | --- | --- |
| Platform code & sentences | Monthly, or as needed | Release notes |
| Dataset | Quarterly | Dataset metadata |

Quick links

How to contribute

🎉 First off, thanks for taking the time to contribute! This project would not be possible without people like you. 🎉

There are many ways to get involved with Common Voice - you don't have to know how to code to contribute!

  • To add or correct translations of the web interface, please use Mozilla's localization platform, Pontoon. Please note that we do not accept direct pull requests for localization content.
  • For information on how to add or edit sentences in Common Voice, see SENTENCES.md
  • For instructions on setting up a local development environment, see DEVELOPMENT.md
  • For information on how to add a new language to Common Voice, see LANGUAGE.md
  • For information on how to get in contact with existing language communities, see COMMUNITIES.md

For more general guidance on building your own language community using Mozilla voice tools, please refer to the Mozilla Voice Community Playbook.

Discussion

For general discussion (feedback, ideas, random musings), head to our Discourse Category.

For bug reports or specific feature requests, please use the GitHub issue tracker.

For live chat, join us on Matrix.

Licensing and content source

This repository is released under the Mozilla Public License (MPL) 2.0.

The majority of the sentence text in /server/data comes either directly from user submissions to our Sentence Collector or from Wikipedia, scraped using our extractor tool, and is released under the CC0 public domain Creative Commons license.

Any files that follow the pattern europarl-VERSION-LANG.txt (such as europarl-v7-de.txt) were extracted, with our thanks, from the Europarl Corpus, which features transcripts of proceedings in the European Parliament.

Citation

If you use the data in published academic work, we would appreciate it if you cited the following article:

  • Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M. and Weber, G. (2020) "Common Voice: A Massively-Multilingual Speech Corpus". Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pp. 4211–4215.

The BibTeX entry is:

@inproceedings{commonvoice:2020,
  author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
  title = {Common Voice: A Massively-Multilingual Speech Corpus},
  booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
  pages = {4211--4215},
  year = 2020
}

Cross Browser Testing

This project is tested with BrowserStack.


cv-dataset's Issues

Can a native Chinese speaker help explain what each of the *.tsv files means?

validated contains a list of all clips that have received two or more validations where up_votes > down_votes

invalidated contains a list of all clips that have received two or more validations where down_votes > up_votes, or clips that have received three or more validations where down_votes = up_votes

other contains a list of all clips that have not received sufficient validations to determine their status

I do not understand these definitions.
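The quoted rules can be sketched as a small function (vote thresholds exactly as quoted above; an illustration, not the platform's actual code):

```python
def bucket(up_votes: int, down_votes: int) -> str:
    """Classify a clip by the vote rules quoted above:
    validated/invalidated require at least two validations,
    and a tie only becomes invalidated from three votes up."""
    total = up_votes + down_votes
    if total >= 2 and up_votes > down_votes:
        return "validated"
    if (total >= 2 and down_votes > up_votes) or (
        total >= 3 and down_votes == up_votes
    ):
        return "invalidated"
    return "other"  # not enough validations to determine status
```

So a clip with one up-vote and one down-vote still sits in `other`; it only moves to `invalidated` once a third, tie-breaking-eligible validation arrives.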

Feature Request: More digits for percentage values.

Age and gender percentage values carry only two significant digits, like 0.12. Because these values are rounded, summing them will not give exactly 100% most of the time.

It would be very nice to have two more significant digits in these fields, like 0.1234 for 12.34%, so that one can sum the values and do the rounding oneself.
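The effect can be seen with a toy example (hypothetical split values, not real dataset figures):

```python
# Three hypothetical age splits that sum to exactly 1.0.
raw = {"twenties": 0.334, "thirties": 0.333, "": 0.333}

two_digits = {k: round(v, 2) for k, v in raw.items()}   # as published today
four_digits = {k: round(v, 4) for k, v in raw.items()}  # as requested

# With two digits each value becomes 0.33, so the sum drifts to 0.99;
# with four digits the sum stays at 1.0 (up to float noise).
```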

Bug: accent splits are not shown in dataset JSON summary after release 7

Description of bug:

Since the inception of the Accent functionality, dataset release summary JSON files have contained a summary of the accent splits, see for example:

https://github.com/common-voice/cv-dataset/blob/main/datasets/cv-corpus-7.0-2021-07-21.json

"date": "2021-07-21",
  "locales": {
    "en": {
      "buckets": {
        "dev": 16284,
        "invalidated": 220015,
        "other": 220176,
        "reported": 2732,
        "test": 16284,
        "train": 759975,
        "validated": 1425784
      },
      "reportedSentences": 2679,
      "duration": 9493711987,
      "clips": 1865975,
      "splits": {
        "accent": {
          "": 0.51,
          "canada": 0.03,
          "england": 0.08,
          "us": 0.23,
          "indian": 0.07,
          "australia": 0.03,
          "malaysia": 0,
          "newzealand": 0.01,
          "african": 0.01,
          "ireland": 0.01,
          "philippines": 0,
          "singapore": 0,
          "scotland": 0.02,
          "hongkong": 0,
          "bermuda": 0,
          "southatlandtic": 0,
          "wales": 0,
          "other": 0.01
        },

By contrast, v8 and v9 of the release do not contain accent splits:

 "locales": {
        "en": {
            "duration": 10390463635,
            "buckets": {
                "dev": 16326,
                "invalidated": 239065,
                "other": 251332,
                "reported": 3558,
                "test": 16326,
                "train": 864448,
                "validated": 1530385
            },
            "reportedSentences": 3500,
            "clips": 2020782,
            "splits": {
                "accent": {
                    "": 1
                },
                "age": {
                    "": 0.37,
                    "twenties": 0.24,
                    "sixties": 0.04,
                    "thirties": 0.13,
                    "teens": 0.06,
                    "seventies": 0.01,
                    "fourties": 0.1,
                    "fifties": 0.04,
                    "eighties": 0,
                    "nineties": 0
                },
                "gender": {
                    "": 0.37,
                    "male": 0.46,
                    "female": 0.16,
                    "other": 0.02
                }
            },
            "users": 79398,
            "size": 75356163484,
            "checksum": "8b82525e6adb8382e28eabfed1beeedd3f315c1d3cdf7445a3ff33743f42025d",
            "avgDurationSecs": 5.142,
            "validDurationSecs": 7868938.703,
            "totalHrs": 2886.23,
            "validHrs": 2185.81
        },

This may be due to the self-identified accent functionality implemented some months ago: there are now many self-entered accents and no agreed method of grouping them. However, accent split data would be very useful for equity, diversity and inclusion measures.

Kind regards,
Kathy

Feature request: Summary data of each language including rows with metadata, gender, age, accent distribution

Firstly, a huge thanks to the team for all the effort that goes into Common Voice; it is hugely appreciated.

If possible, I would like summary data for each language for which a dataset is released, showing:

  • Unique rows
  • Unique contributors
  • Rows with metadata (#)
  • Rows with metadata (%)
  • Approximate hours with metadata
  • Row count by genders
  • Row count by age ranges
  • Row count by accents

This allows a researcher to understand easily how much of a language dataset has metadata, and what the metadata distribution looks like. Some of this is already in the JSON files in this directory, but this is a different "slice and dice" of the summary data.

I have a Python script that calculates these from the validated.tsv file of a language's dataset, and I am happy to share it.

Kind regards,
Kathy

Wrong checksums for Common Voice Corpus 13.0

Hello. I usually verify checksums after downloading, and up to Common Voice Corpus 12.0 this worked without problems. For Common Voice Corpus 13.0, however, I suspect the published checksums are wrong: the downloads complete without issues, but the checksums don't match. I don't have the resources or time to check more datasets, but I can provide a few (I suppose all checksums for this version were calculated incorrectly):

  • CV13-German -> wrong checksum (can't provide the values)

DATASET -> PROVIDED_CHECKSUM -> REAL_CHECKSUM
CV13-Icelandic -> 48db6e809f5b6eb0c00b077e6b736aeeee5d544ee3f2fdd059244da88926c040 -> 33e4c68fe2b4501f358a4762487f9c2b9d8c509a304a3288d31fd24ba6e3c451
CV13-Danish -> 6c85261bcf8dffe5c06ad29c82760cda5cd1fdc7d9c1c99b6285a425f11d105e -> 5b39bb325b76043a57b8735621dd6c8b68b615d49903c18bfa9cb4b783df01af
CV13-Occitan -> e241c12159ac7b3d880f41d5e91d804775da188a3ac413c775341eef3406001b -> 59480c122de507e4f8ce94120a726ea042040c0750ceaf16d9d708c485a6288d

Feature request: CSV

A CSV with just the most important data (locale, totalHrs, validHrs and the number of voices) would be great.

Error: Version 15 summary data does not contain nested objects for splits (age, gender) and buckets (validation)

Description of error

In the newly released v15 JSON file, the splits object and the buckets object are missing for each locale object. These contain data on the age, gender and validation splits for each locale. I need this data to produce the metadata-coverage visualisation for Common Voice. The v15 metadata coverage is done, but it shows errors where the splits objects are missing.

Outcome sought

Could the JSON file please be re-generated with the splits object and buckets object for each locale?

Example of error

Example of locale data for en language in v15 JSON:

"en": {
      "duration": 244618020,
      "reportedSentences": 140,
      "clips": 40553,
      "users": 750,
      "size": 1371012179,
      "checksum": "794f4ea6c6bab3731d54cf7ce3d67996cf1ba7c0d92cbd338c36636a9716047a",
      "avgDurationSecs": 5.2,
      "validDurationSecs": 171002.31,
      "totalHrs": 67.95,
      "validHrs": 47.5
    }

The same locale data for en in v14 JSON:

 "en": {
      "buckets": {
        "dev": 16380,
        "invalidated": 272017,
        "other": 279585,
        "reported": 6445,
        "test": 16380,
        "train": 1046685,
        "validated": 1724421
      },
      "duration": 11802687079,
      "reportedSentences": 6368,
      "clips": 2276023,
      "splits": {
        "accent": {
          "": 1
        },
        "age": {
          "": 0.37,
          "twenties": 0.24,
          "sixties": 0.04,
          "thirties": 0.14,
          "teens": 0.06,
          "seventies": 0.01,
          "fourties": 0.09,
          "fifties": 0.05,
          "eighties": 0,
          "nineties": 0
        },
        "gender": {
          "": 0.37,
          "male": 0.45,
          "female": 0.16,
          "other": 0.02
        }
      },
      "users": 88154,
      "size": 83555475656,
      "checksum": "6e88c7460090c5a6ca7f02a8525bd669e7bc509a47b2f1974977d5065a054507",
      "avgDurationSecs": 5.186,
      "validDurationSecs": 8942265.283,
      "totalHrs": 3278.52,
      "validHrs": 2483.96
    },

Feature request: Datasets with only validated recordings

I've posted this already in the main repo, but seeing #26 here makes me think this might be the more appropriate place to request it.

When downloading datasets, one must download the whole set (or a delta) including all sentences and recordings, whether validated or not, even if the user only needs the validated data. This consumes a lot of bandwidth, time and disk space, and it is not environmentally friendly either.

Offering the option to just download the part of the dataset with validated recordings would save a lot of time and make the data more accessible to more people. Being able to download only the tsv files would also be a good addition, but this is already addressed in #26.

I don't know how complex it would be to implement this, but I feel this would be a very useful quality of life feature, so I hope it is taken into consideration.

Thanks for your work in this amazing project in any case!

FEATURE REQUEST: Make the `.tsv` files that are part of a downloaded dataset available separately

User story

  • As a researcher, I frequently create data visualisations based on the validated.tsv file of a language / release. Currently the only way to obtain this file is to download the whole dataset or delta.

I want to be able to get just the .tsv files related to a release, without downloading the clips, so that I can do faster data visualisations.

Acceptance criteria

  • The files

    • clip_durations.tsv
    • invalidated.tsv
    • other.tsv
    • reported.tsv
    • validated.tsv

are available

  • for each language in the CV corpus (about 103 at time of writing)
  • for each version
  • including delta releases

from the CV datasets download page, in the same way as we currently download the .tar.gz formatted datasets.

Bug: Discrepancy for locale "eo" in v10.0 dataset

Valid hours are larger than total hours...

        "eo": {
            "duration": 6740710,
            "buckets": {
                "dev": 14907,
                "invalidated": 127293,
                "other": 135058,
                "reported": 2127,
                "test": 14907,
                "train": 143988,
                "validated": 848511
            },
            "reportedSentences": 2126,
            "clips": 1110862,
            "splits": {
                "accent": {
                    "": 1
                },
                "age": {
                    "twenties": 0.56,
                    "thirties": 0.12,
                    "": 0.2,
                    "fourties": 0.04,
                    "fifties": 0.02,
                    "seventies": 0,
                    "teens": 0.05,
                    "sixties": 0,
                    "eighties": 0
                },
                "gender": {
                    "male": 0.69,
                    "": 0.2,
                    "female": 0.11,
                    "other": 0
                }
            },
            "users": 1541,
            "size": 40260737095,
            "checksum": "2179bad54bb2b69cd12964bc2f6533b9538b7a3f943f9e65f8f9a463796fd901",
            "avgDurationSecs": 6.068,
            "validDurationSecs": 5148764,
            "totalHrs": 1430.21,
            "validHrs": 1872.42
        },
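Worth noting: for the quoted entry, validDurationSecs / 3600 ≈ 1430.21, which matches the listed totalHrs rather than validHrs, so the two hour fields look swapped. A sketch of a consistency check (the invariant validHrs == validDurationSecs / 3600 <= totalHrs is inferred from other releases, not documented):

```python
def check_hours(locale_info: dict, tolerance: float = 0.05) -> bool:
    """Return True when a locale entry passes an assumed invariant:
    validHrs should equal validDurationSecs converted to hours
    (within a relative tolerance) and never exceed totalHrs."""
    derived = locale_info["validDurationSecs"] / 3600
    return (
        abs(locale_info["validHrs"] - derived) <= tolerance * max(derived, 1)
        and locale_info["validHrs"] <= locale_info["totalHrs"]
    )

# The quoted "eo" entry fails this check: 5148764 s / 3600 = 1430.21 h,
# i.e. the listed totalHrs, while validHrs is the larger 1872.42.
eo = {"validDurationSecs": 5148764, "totalHrs": 1430.21, "validHrs": 1872.42}
```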

Possibility to release tar file with just the additional data

Hi,

Thank you for providing these datasets! It's really helpful.

With every version, would it be possible to also release a variant containing only the additional data added since the previous version?

It's becoming quite expensive to work on servers, as untarring such large files is really costly.

Thanks

Minor Bug in Text Corpus calculations

This happened in the v17.0 data and only for the cnh locale. Somewhere a minus 1 is applied (apparently to drop a header line), but it yields a negative value when there is no data and hence no header line. The Laiholh (Hakha Chin) locale has no unvalidated sentences, and its unvalidated_sentences.tsv file is completely empty.

    "cnh": {
      ...
      "validatedSentences": 5218,
      "unvalidatedSentences": -1,
      ...
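A sketch of the likely fix, subtracting the header only when one actually exists (a hypothetical helper, not the project's code):

```python
def count_data_lines(tsv_text: str) -> int:
    """Count data rows in a TSV, guarding against the empty-file
    case described above: never return a negative count just
    because a header line was assumed."""
    lines = [ln for ln in tsv_text.splitlines() if ln.strip()]
    return max(0, len(lines) - 1)
```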

FEATURE REQUEST: Please add `duration` as a metadata item that is included in the `*.tsv` files with a release

User story

As a researcher, I want to be able to run analysis on the duration of clips in Common Voice for various languages and purposes, for example to use Common Voice as an evaluation corpus for speech recognition engines.

Currently, the clip duration is available as an aggregate statistic in the JSON files provided by this (cv-dataset) repository. However, it is not available as a per-clip metadata item in, e.g., the validated.tsv file.

As a researcher, this means I must do additional processing to calculate the duration of each .mp3 clip, which is frustrating when this information is already known in advance.

Acceptance criteria

  • duration in seconds is included for each data item in each *.tsv of each dataset

A small request for column & field naming

  • clip_durations.tsv contains two columns: clip and duration[ms]
  • "non-binary": 0,

duration[ms] and non-binary can be a nuisance when manipulating data. It would be better to use standardized names, such as duration_ms and non_binary, which are compatible with any system (e.g. database columns or variable names in any programming language).

I use and manipulate all of this data extensively, and names like these sometimes force me to write special-case code.
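The kind of special-case normalization consumers end up writing looks roughly like this (an illustrative helper):

```python
import re

def normalize_column(name: str) -> str:
    """Map awkward field names such as `duration[ms]` or
    `non-binary` to snake_case identifiers (`duration_ms`,
    `non_binary`) that are safe as database columns or variables."""
    name = re.sub(r"[\[\]\-\s]+", "_", name.strip())
    return re.sub(r"_+", "_", name).strip("_").lower()
```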

.tsv files not found

Hi, I downloaded the Common Voice Corpus 1 dataset for English from the Mozilla Common Voice website. The downloaded archive only contained a single file called 'en', with no extension that Windows could recognize.

Is there a way to extract all the .tsv files?

Download format is .tar instead of .tar.gz

Hi

When I downloaded the dataset, I got a .tar file. I had to look around before finding this repository and renaming the file to .tar.gz to make it work.

Please correct this, or mention it on the site.

Thanks
