LIST OF AUDIO+TEXT DATASETS about deepspeech-italian-model HOT 10 OPEN

nefastosaturo commented on June 2, 2024

LIST OF AUDIO+TEXT DATASETS

from deepspeech-italian-model.

Comments (10)

nshmyrev commented on June 2, 2024

MLS from Facebook includes Italian:

http://openslr.org/94/

279.43 hours

from deepspeech-italian-model.

nefastosaturo commented on June 2, 2024

@nshmyrev WOW, thank you for this Christmas present!!

from deepspeech-italian-model.

eziolotta commented on June 2, 2024

I wrote a script to import the MLS files into the MITADS-Speech datasets.

https://dl.fbaipublicfiles.com/mls/mls_italian.tar.gz (14.3G compressed)

The script converts the .flac audio files to 16 kHz WAV and runs some checks.

In spot checks, the audio is of good quality and the transcripts are clean.
All clips are between 10 and 20 seconds long (as specified in the paper).

If my script works correctly, the clips of 15 seconds or less (that were also successfully resampled)
total 159.23 h.
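
The conversion and length filter described above could look roughly like the sketch below. It assumes `ffmpeg` is installed and that clip durations are already known in seconds. The function names, the mono downmix, and the output layout are illustrative assumptions, not the actual MITADS-Speech import script.

```python
from pathlib import Path
from typing import List

TARGET_RATE = 16000   # 16 kHz, as mentioned in the comment above
MAX_SECONDS = 15.0    # keep only clips of 15 seconds or less


def ffmpeg_command(src: Path, dst_dir: Path) -> List[str]:
    """Build an ffmpeg call that converts one FLAC file to 16 kHz WAV."""
    dst = dst_dir / (src.stem + ".wav")
    # "-ac 1" (mono) is an assumption; the thread only mentions 16 kHz WAV.
    return ["ffmpeg", "-y", "-i", str(src),
            "-ar", str(TARGET_RATE), "-ac", "1", str(dst)]


def keep_clip(duration_seconds: float) -> bool:
    """True for non-empty clips that are at most MAX_SECONDS long."""
    return 0.0 < duration_seconds <= MAX_SECONDS


def total_hours(durations: List[float]) -> float:
    """Sum the durations of the kept clips and convert seconds to hours."""
    return sum(d for d in durations if keep_clip(d)) / 3600.0
```

Each command can then be executed with `subprocess.run(cmd, check=True)`, and `total_hours` gives a figure comparable to the total reported above.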

The textual corpus on which the speech dataset is based includes old works such as:
works of Giovanni Francesco Straparola (1400),
Divina Commedia (and others) by Dante Alighieri (1300),
works of Luigi Pirandello (1900)

In some cases the sentences contain obsolete forms and terms that a person using speech technologies today is unlikely to pronounce.
If we need to filter clips by author or work, this information is present in the FLAC audio files.
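
If the author and work of each clip have already been read out of the audio metadata, the filtering could be as simple as the sketch below. The `author` and `work` keys, the sample records, and the excluded-author list are all hypothetical; the thread does not specify how the metadata is laid out.

```python
from typing import Dict, Iterable, List

# Hypothetical blocklist: authors whose language is considered too archaic.
EXCLUDED_AUTHORS = {"dante alighieri", "giovanni francesco straparola"}


def filter_clips(clips: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the clips whose author is not on the blocklist.

    Each clip is a dict with the hypothetical keys "path", "author" and "work".
    Clips without an author are kept, since there is nothing to match on.
    """
    kept = []
    for clip in clips:
        author = clip.get("author", "").strip().lower()
        if author not in EXCLUDED_AUTHORS:
            kept.append(clip)
    return kept


# Made-up sample records, for illustration only.
clips = [
    {"path": "a.flac", "author": "Dante Alighieri", "work": "Divina Commedia"},
    {"path": "b.flac", "author": "Luigi Pirandello", "work": "(sample work)"},
]
print([c["path"] for c in filter_clips(clips)])  # prints ['b.flac']
```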

from deepspeech-italian-model.

Mte90 commented on June 2, 2024

I think we should avoid this kind of ancient work (except Pirandello), as we did in MITADS itself, for example.

from deepspeech-italian-model.

eziolotta commented on June 2, 2024

https://arxiv.org/pdf/2101.00390.pdf

VoxPopuli: the largest open unlabelled speech dataset, totaling 100K hours in 23 languages from the European Parliament.
It also contains 1.8K hours of transcribed speeches in 16 languages.

They will release the corpus at https://github.com/facebookresearch/voxpopuli under an open license.

from deepspeech-italian-model.

eziolotta commented on June 2, 2024

Europarl-ST, a multilingual corpus for speech translation of parliamentary debates.

It has a total of 64.18 h of Italian audio clips.
By approximate calculation, about 30% of these (20 h?) have a transcript in Italian.
Most clips are between 1 and 2 minutes long.

https://arxiv.org/pdf/1911.03167.pdf

https://www.mllp.upv.es/europarl-st/v1.1.tar.gz (20G)

from deepspeech-italian-model.

eziolotta commented on June 2, 2024

New Speech Dataset: Multilingual TEDx by OpenSLR
https://www.openslr.org/100/

The corpus comprises audio recordings and transcripts from TEDx Talks in 8 languages.
Italian: about 123 hours

Licence: Creative Commons Attribution-NonCommercial-NoDerivs 4.0
https://www.ted.com/about/our-organization/our-policies-terms/ted-talks-usage-policy

UPDATE:
The audio clips range in length from about 4-5 minutes to 25 minutes.
This means they are not directly usable for training with DeepSpeech.

from deepspeech-italian-model.

eziolotta commented on June 2, 2024

For the Multilingual TEDx dataset:
the segments.txt file lists the audio segments of each clip, that is, the transcript aligned to the audio with timestamps.
Using this file, the audio clips could be segmented easily, but I think we cannot redistribute the dataset because of the license.
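
A rough sketch of that segmentation, assuming each line of the segments file follows the Kaldi-style layout `<segment-id> <recording-id> <start-seconds> <end-seconds>` and that `ffmpeg` is available. The layout, the helper names, and the output settings are assumptions, not something confirmed in this thread.

```python
from typing import Iterable, List, NamedTuple


class Segment(NamedTuple):
    segment_id: str
    recording_id: str
    start: float  # seconds from the beginning of the recording
    end: float    # seconds from the beginning of the recording


def parse_segments(lines: Iterable[str]) -> List[Segment]:
    """Parse lines assumed to hold: segment-id recording-id start end."""
    segments = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank lines and lines without four fields
        seg_id, rec_id, start, end = parts
        segments.append(Segment(seg_id, rec_id, float(start), float(end)))
    return segments


def cut_command(seg: Segment, src_audio: str, dst_audio: str) -> List[str]:
    """Build an ffmpeg call that extracts one segment as 16 kHz mono WAV."""
    duration = seg.end - seg.start
    return ["ffmpeg", "-y", "-i", src_audio,
            "-ss", f"{seg.start:.3f}", "-t", f"{duration:.3f}",
            "-ar", "16000", "-ac", "1", dst_audio]
```

Each command can be run with `subprocess.run(cmd, check=True)`; the resulting clips would stay local, given the license concern above.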

Unfortunately, DeepSpeech doesn't have a powerful toolkit for splitting audio with an aligner or VAD, as Kaldi and others do.
Could we try this: https://espnet.github.io/espnet/apis/espnet_bin.html#asr-align-py ?

from deepspeech-italian-model.

DanBmh commented on June 2, 2024

You can import the mTEDx dataset with corcua:
https://gitlab.com/Jaco-Assistant/corcua

Splitting the files will take some time (~2 days), but if you extend the script with parallelization, it should run much faster :)
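
One generic way to add such parallelization is sketched below. This is not corcua's code: it assumes the splitting work can be expressed as a list of independent external commands (for example, one `ffmpeg` call per segment), and the helper names are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def run_one(command: Sequence[str]) -> int:
    """Run one external command and return its exit code."""
    return subprocess.run(list(command)).returncode


def run_parallel(commands: List[Sequence[str]],
                 workers: int = 4,
                 runner: Callable[[Sequence[str]], int] = run_one) -> List[int]:
    """Run the commands concurrently; results come back in input order."""
    # Threads are enough here: the heavy work happens in the child processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(runner, commands))
```

The `runner` parameter exists only so the scheduling can be exercised without launching real processes; in normal use the default `run_one` is kept and `workers` is set to roughly the number of CPU cores.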

from deepspeech-italian-model.

nefastosaturo commented on June 2, 2024

Dear DanBmh, thank you so much! I will update the table above

from deepspeech-italian-model.
