
Comments (7)

jessicarose avatar jessicarose commented on July 18, 2024

Thanks so much for getting in touch with this issue.

When you say that the sentences in this file have been "recorded more than once", do you mean that identical sentences are being shown to Catalan voice contributors, or that the sentences as listed are repetitive for contributors because they repeat the same sentence structures with only the place names changing from sentence to sentence?

from common-voice.

c-armentano avatar c-armentano commented on July 18, 2024

Thank you for your question.
Some of these sentences have been shown to Catalan voice contributors (and recorded) more than once (up to 4 times in v.16). I see that some others (about 1100) have never been recorded, but since they are too similar (same sentence structures with only the place names changing), we don't see the value in recording them.


jessicarose avatar jessicarose commented on July 18, 2024

Apologies for the delay in responding. Getting sentences back out of the validated text corpus is exceptionally challenging from a technical perspective and would have to wait behind feature work and bug fixes for our team.

The fastest fix for rebalancing the Catalan dataset would be to dilute these sentences with fresh bulk uploads of Catalan sentences, which would give speakers and the dataset a more varied pool of sentences to draw from. We've seen language communities have great success with CC0 books and texts, copyright-free government or cultural writings, and community-driven writing challenges. Could this be a faster way to rebalance the text corpus, keep it interesting for contributors, and create more useful data for dataset consumers?


c-armentano avatar c-armentano commented on July 18, 2024

Thank you for your response.

What we are asking for is not to remove them from the validated dataset, only to prevent them from being proposed to speakers to be read. Is there any way to get it?

Regarding adding more sentences to the corpus, we are working on it. We hope to have more soon, since we are committed to achieving a varied and reliable corpus.

Kind regards


HarikalarKutusu avatar HarikalarKutusu commented on July 18, 2024

> Is there any way to get it?

@c-armentano, AFAIK there is one way. The is_used field in the sentences table controls it. If it is set to 0 (false), the sentence will not be shown for new recordings. However, you would need to collect the sentence_id values of all these sentences and make a PR changing the database.

If they are synthetically generated, like "We are going to [place]", they can also be manipulated using the sentence field.


c-armentano avatar c-armentano commented on July 18, 2024

Hi @HarikalarKutusu, thank you for your answer.

I understand that you are referring to the is_used field of the validated_sentences.tsv.

What would be the primary use of this field, and in what other situations is it set to 0?

I think I could identify the sentences that follow this pattern, modify the tsv, and open the PR. Would it be accepted by Common Voice? Are we sure it wouldn't create other problems?
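For illustration, identifying the sentences that follow such a pattern could look roughly like the sketch below. It is only a sketch under stated assumptions: the column names `sentence_id` and `sentence` are assumed for validated_sentences.tsv, the helper name `find_templated_ids` is hypothetical, and the sample rows and regex are made up rather than taken from the Catalan corpus.

```python
import csv
import io
import re


def find_templated_ids(tsv_text, pattern):
    """Return the sentence_id of every row whose sentence fully matches the regex."""
    regex = re.compile(pattern)
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [row["sentence_id"] for row in reader if regex.fullmatch(row["sentence"])]


# Made-up sample data; column names are an assumption, not a confirmed schema.
sample = (
    "sentence_id\tsentence\tis_used\n"
    "a1\tWe are going to Girona.\t1\n"
    "b2\tThe weather is lovely today.\t1\n"
    "c3\tWe are going to Lleida.\t1\n"
)

# Hypothetical template: a fixed frame with a single-word place name.
ids = find_templated_ids(sample, r"We are going to \w+\.")
print(ids)
```

With a real file, the text would come from reading the tsv with UTF-8 encoding, and the regex would need to be adapted to the actual sentence frames.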


HarikalarKutusu avatar HarikalarKutusu commented on July 18, 2024

AFAIK, Common Voice does not want to delete old recordings unless it is absolutely necessary. This field is in the database and controls whether the sentence will be shown to users for new recordings. By default it is 1 (True), meaning people can record it.

I don't know the extent of its usage, but it can be used to disable sentences that somehow passed the 2+1 voting algorithm, whether because of typos, a questionable license, or generated sentences. But I think it will need a collaborative effort from members of the community to request such a change. Please read the following two links about the French text corpus, which had similar issues:

#3785
#3786

On the other hand, at that time the process was based on .txt files: we PR'd those .txt files. That is the old process. Last year, it was changed to importing sentences directly into the DB.

The .tsv files are just a dump of a database view, so modifying them will not affect the database.
I don't know of any example of bulk-disabling sentences (using sentence_id values), so perhaps a special-purpose script should be written for this.

I think we need to ping @jessicarose, @moz-dfeller, @moz-rotimib, and/or @ftyers for further discussion.
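As a rough idea of what such a special-purpose script might do, here is a minimal sketch. Everything schema-related is an assumption: the table name `sentences` and the `is_used` flag come from this thread, the `sentence_id` column name and the helper `disable_sentences` are hypothetical, and an in-memory SQLite database stands in for the real Common Voice database so the example can run on its own.

```python
import sqlite3


def disable_sentences(conn, sentence_ids, batch_size=500):
    """Set is_used = 0 for the given ids, in batches; return rows changed."""
    ids = list(sentence_ids)
    changed = 0
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        # One "?" placeholder per id keeps the query parameterized.
        placeholders = ",".join("?" * len(batch))
        cur = conn.execute(
            f"UPDATE sentences SET is_used = 0 WHERE sentence_id IN ({placeholders})",
            batch,
        )
        changed += cur.rowcount
    conn.commit()
    return changed


# Stand-in database with an assumed schema and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentences ("
    "sentence_id TEXT PRIMARY KEY, sentence TEXT, is_used INTEGER DEFAULT 1)"
)
conn.executemany(
    "INSERT INTO sentences (sentence_id, sentence) VALUES (?, ?)",
    [
        ("a1", "We are going to Girona."),
        ("b2", "The weather is lovely today."),
        ("c3", "We are going to Lleida."),
    ],
)

changed = disable_sentences(conn, ["a1", "c3"])
remaining = [
    row[0]
    for row in conn.execute(
        "SELECT sentence_id FROM sentences WHERE is_used = 1 ORDER BY sentence_id"
    )
]
print(changed, remaining)
```

Only the flag is changed, so existing recordings stay untouched, which matches the goal of keeping old clips while stopping new ones. Batching keeps each statement small when the id list runs to a thousand or more entries.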
