Describe the bug: These sentences are far too repetitive.

[BUG] Retire toponym corpus from Catalan Common Voice about common-voice HOT 7 OPEN

c-armentano commented on July 18, 2024

[BUG] Retire toponym corpus from Catalan Common Voice

from common-voice.

Comments (7)

jessicarose commented on July 18, 2024

Thanks so much for getting in touch with this issue.

When you say that the sentences in this file have been "recorded more than once", do you mean that identical sentences are being shown to Catalan voice contributors, or that the sentences as listed are repetitive for contributors because they repeat the same sentence structures with only the place names changing from sentence to sentence?

c-armentano commented on July 18, 2024

Thank you for your question.
Some of these sentences have been shown to Catalan voice contributors (and recorded) more than once (up to 4 times in v.16). I see that others (about 1100) have never been recorded, but since they are so similar (same sentence structures with only the place names changing), we see no value in recording them.
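
A minimal sketch of how the "recorded more than once" figure could be checked, assuming a tab-separated clips file with a `sentence` column (the column name, the function names, and the sample rows are assumptions for illustration, not taken from the actual Catalan release):

```python
import csv
import io
from collections import Counter


def recordings_per_sentence(tsv_file):
    """Count how many clip rows exist for each sentence text."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row["sentence"] for row in reader)


def recorded_at_least(counts, min_count=2):
    """Keep only sentences recorded at least `min_count` times."""
    return {text: n for text, n in counts.items() if n >= min_count}


# Invented sample data standing in for a real clips file.
sample = io.StringIO(
    "path\tsentence\n"
    "clip_1.mp3\tAnem a Girona.\n"
    "clip_2.mp3\tAnem a Girona.\n"
    "clip_3.mp3\tAnem a Lleida.\n"
)

counts = recordings_per_sentence(sample)
print(recorded_at_least(counts))
```

With a real file, `sample` would be replaced by `open(path, encoding="utf-8", newline="")`.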

jessicarose commented on July 18, 2024

Apologies for the delay in responding. Getting sentences back out of the validated text corpus is exceptionally challenging from a technical perspective and would have to wait behind feature work and bug fixes for our team.

The fastest fix for rebalancing the Catalan dataset would be to dilute these sentences with fresh bulk uploads of Catalan sentences, which would give speakers and the dataset a more varied pool of sentences to draw from. We've seen language communities have great success with CC0 books and texts, copyright-free government or cultural writings, and community-driven writing challenges. Could this be a faster way to rebalance the text corpus, keep recording interesting for contributors, and create more useful data for dataset consumers?

c-armentano commented on July 18, 2024

Thank you for your response.

What we are asking for is not to remove them from the validated dataset, only to prevent them from being proposed to speakers to be read. Is there any way to get it?

Regarding adding more sentences to the corpus, we are working on it. We hope to have more soon, since we are committed to achieving a varied and reliable corpus.

Kind regards

HarikalarKutusu commented on July 18, 2024

"Is there any way to get it?"

@c-armentano, AFAIK there is one way. The is_used field in the sentences table controls it. If it is set to 0 (false), the sentence will not be shown for new recordings. On the other hand, you need to collect the sentence_ids of all these sentences and make a PR changing the database.

If they are synthetically generated, like "We are going to [place]", they can also be manipulated using the sentence field.
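
As a rough sketch of collecting the IDs of templated sentences by matching on the sentence text: the column names `sentence_id` and `sentence` follow the fields named in this thread, but the regex templates, the function name, and the sample rows below are invented for illustration and would need to be replaced with the real toponym templates.

```python
import csv
import io
import re

# Hypothetical templates; the real corpus patterns would go here.
TEMPLATE_PATTERNS = [
    re.compile(r"^Anem a .+\.$"),
    re.compile(r"^Visc a .+\.$"),
]


def find_template_sentence_ids(tsv_file, patterns):
    """Return the sentence_id of every row whose text matches a template."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    matched = []
    for row in reader:
        if any(p.match(row["sentence"]) for p in patterns):
            matched.append(row["sentence_id"])
    return matched


# Invented sample rows standing in for a sentence dump.
sample = io.StringIO(
    "sentence_id\tsentence\tis_used\n"
    "a1\tAnem a Girona.\t1\n"
    "b2\tAvui fa molt bon dia.\t1\n"
    "c3\tVisc a Tarragona.\t1\n"
)

print(find_template_sentence_ids(sample, TEMPLATE_PATTERNS))
```

Anchoring each pattern with `^` and `$` keeps ordinary sentences that merely contain a place name from being caught; the matched list should still be reviewed by hand before any request to disable sentences.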

c-armentano commented on July 18, 2024

Hi @HarikalarKutusu, thank you for your answer.

I understand that you are referring to the is_used field in validated_sentences.tsv.

What is the primary use of this field? In what other situations is it set to 0?

I think I could identify the sentences that follow this pattern, modify the tsv, and open the PR. Would it be accepted by Common Voice? Are we sure it wouldn't create other problems?

HarikalarKutusu commented on July 18, 2024

AFAIK, Common Voice does not want to delete old recordings unless it is absolutely necessary. This field is in the database and controls whether the sentence will be shown to users for new recordings. By default it is 1 (true), meaning people can record it.

I don't know the extent of its usage, but it can be used to disable sentences that somehow passed the 2+1 voting algorithm, whether because of typos, a questionable license, or being generated sentences. I think, though, that it will need a collaborative effort from members of the community to request such a change. Please read the following two links about the French text corpus, which had similar issues:

#3785
#3786

On the other hand, at that time the process was based on .txt files, and we PR'd those .txt files. That is the old process. Last year, the process was changed to importing sentences directly into the DB.

The .tsv files are just a dump of a database view, so modifying them will not affect the database.
I don't know of any example of bulk-disabling sentences (using sentence_ids), so perhaps a special-purpose script should be written for this.
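
One possible shape for such a special-purpose script is sketched below. It only generates SQL text for review and does not touch any database. The table name `sentences` and the column `is_used` come from this thread; the key column name `id`, the assumption that IDs are lowercase hexadecimal strings, and the function name are guesses that would have to be checked against the real schema and migration process.

```python
import re

# Assumption: sentence IDs are lowercase hex strings. Rejecting anything
# else also keeps unexpected characters out of the generated SQL.
HEX_ID = re.compile(r"^[0-9a-f]+$")


def build_disable_statements(sentence_ids, batch_size=500):
    """Build batched UPDATE statements that set is_used = 0."""
    unique_ids = sorted(set(sentence_ids))
    for sid in unique_ids:
        if not HEX_ID.match(sid):
            raise ValueError(f"unexpected sentence id: {sid!r}")

    statements = []
    for start in range(0, len(unique_ids), batch_size):
        batch = unique_ids[start:start + batch_size]
        id_list = ", ".join(f"'{sid}'" for sid in batch)
        # `id` is a hypothetical key column name.
        statements.append(
            f"UPDATE sentences SET is_used = 0 WHERE id IN ({id_list});"
        )
    return statements


for statement in build_disable_statements(["ab12", "cd34", "ab12"]):
    print(statement)
```

Batching keeps each statement to a reviewable size, and deduplicating and sorting the IDs makes the generated output stable between runs.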

I think we need to ping @jessicarose, @moz-dfeller, @moz-rotimib, and/or @ftyers for further discussion.
