Comments (7)
Thanks so much for getting in touch with this issue.
When you say that the sentences in this file have been "recorded more than once", do you mean that identical sentences are being shown to Catalan voice contributors, or that the sentences as listed are repetitive for contributors because they repeat the same sentence structures with only the place names changing from sentence to sentence?
from common-voice.
Thank you for your question.
Some of these sentences have been shown to Catalan voice contributors (and recorded) more than once (up to 4 times in v.16). I see that some others (about 1100) have never been recorded, but since they are too similar (same sentence structures with only the place names changing) we don't see the value in recording them.
from common-voice.
Apologies for the delay in responding. Getting sentences back out of the validated text corpus is exceptionally challenging from a technical perspective and would have to wait behind feature work and bug fixes for our team.
The fastest fix for re-balancing the Catalan dataset would be to dilute these sentences with fresh uploads of bulk Catalan sentences, giving speakers and the dataset a more varied pool of sentences to draw from. We've seen language communities have great success with CC0 books and texts, copyright-free government or cultural writings, and community-driven writing challenges. Could this be a faster way to rebalance the text corpus, keep it interesting for contributors, and create more useful data for dataset consumers?
from common-voice.
Thank you for your response.
What we are asking for is not to remove them from the validated dataset, only to prevent them from being proposed to speakers to be read. Is there any way to get it?
Regarding adding more sentences to the corpus, we are working on it. We hope to have more soon, since we are committed to achieving a varied and reliable corpus.
Kind regards
from common-voice.
Is there any way to get it?
@c-armentano, AFAIK there is one way. The is_used field in the sentences table controls it. If it is set to 0 (false), the sentence will not be shown for new recordings. On the other hand, you would need to collect the sentence_ids of all these sentences and make a PR changing the database.
If they are synthetically generated, like "We are going to [place]", they can also be manipulated using the sentence field.
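To illustrate how such a flag behaves, here is a minimal sketch using an in-memory SQLite database. Only the table name `sentences` and the field `is_used` come from this discussion; the `id` and `text` columns, the sample rows, and the helper `sentences_to_record` are assumptions made for the example, and the real Common Voice schema and database engine may differ.

```python
import sqlite3

# Stand-in for the real database. The schema is an assumption for illustration;
# only the table name `sentences` and the `is_used` flag come from the thread.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentences ("
    " id TEXT PRIMARY KEY,"
    " text TEXT NOT NULL,"
    " is_used INTEGER NOT NULL DEFAULT 1)"  # 1 means the sentence can be recorded
)
conn.executemany(
    "INSERT INTO sentences (id, text) VALUES (?, ?)",
    [
        ("s1", "Avui fa molt bon temps."),
        ("s2", "El municipi es troba a Girona."),
        ("s3", "La biblioteca tanca a les vuit."),
    ],
)

# Disable one sentence for new recordings without deleting the row.
conn.execute("UPDATE sentences SET is_used = 0 WHERE id = ?", ("s2",))


def sentences_to_record(connection: sqlite3.Connection) -> list[str]:
    """Return the ids of sentences that may still be shown to speakers."""
    rows = connection.execute(
        "SELECT id FROM sentences WHERE is_used = 1 ORDER BY id"
    )
    return [row[0] for row in rows]


available = sentences_to_record(conn)
total_rows = conn.execute("SELECT COUNT(*) FROM sentences").fetchone()[0]
```

The disabled sentence stays in the table, so anything already recorded for it is untouched. It is only excluded from the pool offered for new recordings.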
from common-voice.
Hi @HarikalarKutusu, thank you for your answer.
I understand that you are referring to the is_used field in validated_sentences.tsv.
What is the primary use of this field, and in what other situations is it set to 0?
I think I could identify the sentences that follow this pattern, modify the tsv and do the PR. Would it be accepted by Common Voice? Are we sure it wouldn't create other problems?
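As a rough sketch of how the pattern-following sentences might be identified, the snippet below groups sentences by a "skeleton" in which every capitalised word after the first is replaced by a placeholder. The column names `sentence_id` and `sentence` are assumed to match the TSV header, the sample rows are invented, and the heuristic itself (capitalised word = place name) is only an approximation that would need manual review.

```python
import csv
import io
from collections import defaultdict

PLACEHOLDER = "<NAME>"


def skeleton(sentence: str) -> str:
    """Replace each capitalised word after the first with a placeholder."""
    result: list[str] = []
    for index, word in enumerate(sentence.split()):
        if index > 0 and word[:1].isupper():
            # Collapse runs of consecutive capitalised words into one placeholder.
            if not result or result[-1] != PLACEHOLDER:
                result.append(PLACEHOLDER)
        else:
            result.append(word)
    return " ".join(result)


def find_templated(rows, min_count: int = 3) -> dict[str, list[str]]:
    """Map each skeleton shared by at least min_count sentences to their ids."""
    groups: dict[str, list[str]] = defaultdict(list)
    for row in rows:
        groups[skeleton(row["sentence"])].append(row["sentence_id"])
    return {key: ids for key, ids in groups.items() if len(ids) >= min_count}


# Invented sample data standing in for validated_sentences.tsv.
sample_tsv = (
    "sentence_id\tsentence\n"
    "s1\tEl municipi es troba a Girona.\n"
    "s2\tEl municipi es troba a Lleida.\n"
    "s3\tEl municipi es troba a Tarragona.\n"
    "s4\tAvui fa molt bon temps.\n"
)
reader = csv.DictReader(
    io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE
)
templated = find_templated(reader)
```

For a real file, `io.StringIO(sample_tsv)` would be replaced by `open(path, encoding="utf-8", newline="")`. The resulting lists of ids are candidates only, and would still need a human check before being proposed for disabling.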
from common-voice.
AFAIK, Common Voice does not want to delete old recordings unless it is absolutely necessary. This field is in the database and controls whether the sentence will be shown to users for new recordings. By default it is 1 (True), meaning people can record it.
I don't know the extent of its usage, but it can be used to disable sentences that somehow passed the 2+1 voting algorithm, whether because of typos, a questionable license, or generated sentences. But I think it will need a collaborative effort from members of the community to request such a change. Please read the following two links about French text-corpus with similar issues:
On the other hand, at that time the process was based on .txt files, which is the old process, where we PR'd those .txt files. Last year, the process was changed to importing them directly into the DB.
The .tsv files are just a dump of a database view, modifying them will not affect the database.
I don't know any example for bulk-disabling sentences (using sentence_id's), so perhaps a special purpose script should be written for this.
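A special-purpose script of this kind could look roughly like the sketch below, which only builds batched, parameterised UPDATE statements and does not touch any database. Everything here is hypothetical: the function name `build_disable_statements`, the key column `id`, the batch size, and the `%s` placeholder style (common in MySQL drivers) are assumptions, not the project's actual migration mechanism.

```python
from typing import Iterable


def build_disable_statements(
    sentence_ids: Iterable[str],
    batch_size: int = 500,
    placeholder: str = "%s",
) -> list[tuple[str, tuple[str, ...]]]:
    """Build (sql, params) pairs that set is_used = 0 in batches.

    The key column name `id` is an assumption; adjust it to the real schema.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    # De-duplicate and sort so the output is stable and reviewable.
    unique_ids = sorted(set(sentence_ids))

    statements: list[tuple[str, tuple[str, ...]]] = []
    for start in range(0, len(unique_ids), batch_size):
        batch = tuple(unique_ids[start:start + batch_size])
        placeholders = ", ".join([placeholder] * len(batch))
        sql = f"UPDATE sentences SET is_used = 0 WHERE id IN ({placeholders})"
        statements.append((sql, batch))
    return statements


# Example with invented ids and a tiny batch size to show the batching.
statements = build_disable_statements(["b", "a", "a", "c"], batch_size=2)
```

Keeping the ids as bound parameters rather than formatting them into the SQL string avoids quoting problems, and small batches keep each statement short enough to review.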
I think we need to ping @jessicarose, @moz-dfeller, @moz-rotimib, and/or @ftyers for further discussion.
from common-voice.