I originally thought that this issue is only specific to the zh-hk locale, but later r

Support bulk-ban or bulk-remove sentences about common-voice HOT 4 OPEN

laubonghaudoi commented on June 26, 2024

Support bulk-ban or bulk-remove sentences

from common-voice.

Comments (4)

jessicarose commented on June 26, 2024

I think this is a really useful discussion and mirrors both concerns we've heard from other language community members and internal discussions. I'll be bringing this into planning meetings next week and will be able to come back to you with more information as team discussions expand on this and we do some research into technical explorations. Thank you so much for flagging this, it's an incredibly useful issue at a really useful time for the team and I appreciate you both raising it.

HarikalarKutusu commented on June 26, 2024

I think those sentences with "???" inside were introduced by an encoding bug when the old sentence collector database was incorporated into CV. It is unfortunately irreversible, and in some cases (western languages whose alphabet is mostly Latin/ASCII with some Unicode additions) they still get recorded, because the human brain can deduce the missing characters. But for eastern languages it is quite a problem.

See:
#4048
#4138

There is one attempt to remove them from the released corpus, but it is not merged yet:
common-voice/CorporaCreator#127

It might also be caused by wrong encoding in other inclusion methods of course.
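
To illustrate the kind of damage described above, here is a minimal sketch in Python. It assumes the bug behaved like a lossy conversion that replaces every unencodable character with "?"; the actual cause in the sentence collector import is not confirmed here. The function names and the threshold of two consecutive question marks are hypothetical choices for illustration, not part of Common Voice or CorporaCreator.

```python
import re

# Assumption: a run of two or more "?" marks a sentence as suspicious.
# A single "?" is left alone because it is ordinary punctuation.
SUSPICIOUS_RUN = re.compile(r"\?{2,}")


def simulate_lossy_encoding(sentence: str) -> str:
    """Round-trip text through ASCII, turning each unencodable character into '?'."""
    return sentence.encode("ascii", errors="replace").decode("ascii")


def looks_corrupted(sentence: str) -> bool:
    """Return True if the sentence contains a suspicious run of question marks."""
    return SUSPICIOUS_RUN.search(sentence) is not None


def split_sentences(sentences):
    """Split sentences into (kept, flagged) lists, preserving input order."""
    kept, flagged = [], []
    for sentence in sentences:
        (flagged if looks_corrupted(sentence) else kept).append(sentence)
    return kept, flagged
```

Under this assumption a mostly Latin word loses only a character or two and stays guessable, while a sentence written entirely in CJK characters collapses into a run of question marks. A filter like this would only catch the second case, so single damaged characters would still need manual review.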

irvin commented on June 26, 2024

(Adding some background info) The wiki dump for zh-cn came from the really early days, when we needed a work-in-progress STT model for a language other than English and needed to build a text corpus fast so that the recording firms contracted from China had something to record.

At that time each sentence was recorded only once, so fetching from Wikipedia seemed to be the only way to get hundreds of thousands of sentences in a really short time.

We tried hard to adjust the parameters to raise the quality, and this was the best we could get at the time.

irvin commented on June 26, 2024

For bulk-remove,

As a core contributor to both the nan-TW and zh-tw corpora, I think this is a very necessary tool for us if we want to ensure the quality of the text corpus and the CV database.

Before the collector was published on the official site, we proofread all sentences before they went online, but nowadays it's totally out of our control: everyone can add sentences, and we have no way to evaluate them beforehand.

We have more or less given up on ensuring quality by now, so it would be much appreciated if we could have this to do QC in some way.
