Comments (10)
MLS from Facebook includes Italian:
279.43 hours
from deepspeech-italian-model.
@nshmyrev WOW, thank you for this Christmas present!!
I wrote a script to import the MLS files into the MITADS-Speech datasets:
https://dl.fbaipublicfiles.com/mls/mls_italian.tar.gz (14.3G compressed)
It converts the .flac audio files to 16 kHz WAV and runs some checks.
In sample tests, the audio is of good quality and the transcripts are clean.
All clips are between 10 and 20 seconds long (as specified in the paper).
If my script works correctly, the clips of 15 seconds or less (that were also successfully resampled)
total 159.23h.
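The conversion and duration check described above could be sketched roughly as follows. This is not the actual import script: the function names are hypothetical, `ffmpeg` is assumed to be installed for the FLAC to WAV step, and converting to mono (`-ac 1`) is an assumption that the comment does not state.

```python
import wave
from pathlib import Path

TARGET_RATE = 16_000   # Hz, the sample rate mentioned in the comment
MAX_SECONDS = 15.0     # clips longer than this are left out of the total


def ffmpeg_command(src: Path, dst: Path) -> list[str]:
    """Build (but do not run) an ffmpeg call converting src to 16 kHz mono WAV."""
    # Assumption: mono output; the original comment only mentions 16 kHz.
    return ["ffmpeg", "-y", "-i", str(src),
            "-ar", str(TARGET_RATE), "-ac", "1", str(dst)]


def wav_info(path: Path) -> tuple[int, float]:
    """Return (sample_rate, duration_in_seconds) of a WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnframes() / rate


def usable_hours(wav_dir: Path) -> float:
    """Sum, in hours, the clips that are at the target rate and short enough."""
    total_seconds = 0.0
    for path in sorted(wav_dir.rglob("*.wav")):
        rate, seconds = wav_info(path)
        if rate == TARGET_RATE and seconds <= MAX_SECONDS:
            total_seconds += seconds
    return total_seconds / 3600.0
```

The command list returned by `ffmpeg_command` can be passed to `subprocess.run(..., check=True)`; `usable_hours` is then run on the output directory to get a figure comparable to the total reported above.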
The textual corpus on which the speech dataset is based includes ancient works such as:
works of Giovanni Francesco Straparola (1400),
Divina Commedia (and others) by Dante Alighieri (1300)
works of Luigi Pirandello (1900)
In some cases the sentences contain obsolete forms and terms that a person using speech technologies today is unlikely to pronounce.
If we need to filter clips by author or work, this information is present in the FLAC audio files.
I think we should avoid this kind of ancient work (except Pirandello), as we did in MITADS itself, for example.
https://arxiv.org/pdf/2101.00390.pdf
VoxPopuli: the largest open unlabelled speech dataset, totaling 100K hours in 23 languages from the European Parliament.
It also contains 1.8K hours of transcribed speeches in 16 languages.
They will release the corpus at https://github.com/facebookresearch/voxpopuli under an open license.
Europarl-ST, a multilingual corpus for speech translation of parliamentary debates.
A total of 64.18h of Italian audio clips.
By approximate calculation, about 30% of these (20h?) have a transcript in Italian.
Most clips last between 1 and 2 minutes.
https://arxiv.org/pdf/1911.03167.pdf
https://www.mllp.upv.es/europarl-st/v1.1.tar.gz (20G)
New Speech Dataset: Multilingual TEDx by OpenSLR
https://www.openslr.org/100/
The corpus comprises audio recordings and transcripts from TEDx Talks in 8 languages
Italian: about 123 Hours
Licence: Creative Commons Attribution-NonCommercial-NoDerivs 4.0
https://www.ted.com/about/our-organization/our-policies-terms/ted-talks-usage-policy
UPDATE:
The audio clips are long, ranging from 4-5 minutes to 25 minutes.
This means the dataset is not directly usable for training with DeepSpeech.
For the Multilingual TEDx dataset:
the segments.txt file lists the audio segments of each clip, that is, the text alignment with audio timestamps.
With this file the audio clips could be segmented easily, but I think we cannot redistribute the dataset because of the license.
Unfortunately DeepSpeech doesn't have a powerful toolkit for splitting audio with an aligner or VAD, as Kaldi and others do.
Could we try this: https://espnet.github.io/espnet/apis/espnet_bin.html#asr-align-py ?
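As a rough illustration of the segmentation idea above, here is a minimal sketch using only the Python standard library. It assumes each line of the segments file has the Kaldi-style form `<segment-id> <recording-id> <start-sec> <end-sec>` and that the recordings have already been converted to WAV files named `<recording-id>.wav`. Both are assumptions to check against the actual dataset, and all function names are hypothetical.

```python
import wave
from pathlib import Path


def read_segments(path: Path) -> list[tuple[str, str, float, float]]:
    """Parse lines assumed to be: <segment-id> <recording-id> <start> <end>."""
    segments = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        seg_id, rec_id, start, end = line.split()
        segments.append((seg_id, rec_id, float(start), float(end)))
    return segments


def cut_segment(src_wav: Path, dst_wav: Path, start: float, end: float) -> None:
    """Copy the frames between start and end (in seconds) into a new WAV file."""
    with wave.open(str(src_wav), "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        src.setpos(round(start * rate))
        frames = src.readframes(round((end - start) * rate))
    with wave.open(str(dst_wav), "wb") as dst:
        dst.setparams(params)     # frame count in the header is fixed on close
        dst.writeframes(frames)


def split_all(segments_file: Path, wav_dir: Path, out_dir: Path) -> int:
    """Cut every listed segment out of its recording; return the number written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for seg_id, rec_id, start, end in read_segments(segments_file):
        cut_segment(wav_dir / f"{rec_id}.wav", out_dir / f"{seg_id}.wav", start, end)
        count += 1
    return count
```

This only slices on the timestamps already present in the file; it does not perform any alignment or voice activity detection of its own.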
You can import the mTEDx dataset with corcua:
https://gitlab.com/Jaco-Assistant/corcua
Splitting the files will take some time (~2 days), but if you extend the script with parallelization, it should run much faster :)
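The parallelization suggested above is not part of anything shown in this thread, and corcua's own API is not used here. As a generic sketch, a per-file worker (hypothetical, supplied by the caller) could be spread over a pool like this:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_parallel(worker: Callable[[T], R], items: Iterable[T],
                 max_workers: int = 8) -> list[R]:
    """Apply worker to every item with a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, items))
```

Threads help when each worker mostly waits on disk I/O or on an external process such as ffmpeg. If the splitting is CPU-bound pure Python, `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface and is probably the better fit.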
Dear DanBmh, thank you so much! I will update the table above