Comments (4)
I think this is a really useful discussion and mirrors both concerns we've heard from other language community members and internal discussions. I'll be bringing this into planning meetings next week and will be able to come back to you with more information as team discussions expand on this and we do some research into technical explorations. Thank you so much for flagging this, it's an incredibly useful issue at a really useful time for the team and I appreciate you both raising it.
from common-voice.
I think the sentences with "???" in them were introduced by an encoding bug when the old sentence collector database was incorporated into CV. Unfortunately the damage is irreversible. In some cases (Western languages whose alphabet is mostly Latin/ASCII with some Unicode additions) the sentences still get recorded, because the human brain can deduce the missing characters. For Eastern languages, though, it is quite a problem.
There is one attempt to remove them from the released corpus, but it is not merged yet:
common-voice/CorporaCreator#127
Of course, it might also be caused by wrong encoding in other import methods.
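To illustrate the kind of filtering discussed above, here is a minimal sketch that flags sentences which look encoding-damaged. It is not the approach taken in the CorporaCreator pull request; the pattern, the threshold of two consecutive question marks, and the function names are all assumptions made for illustration, and the heuristic will also flag legitimate text that happens to contain repeated question marks.

```python
import re

# Assumption: a run of two or more ASCII '?' or a U+FFFD replacement
# character suggests that the original characters were lost in conversion.
DAMAGE_PATTERN = re.compile(r"\?{2,}|\ufffd")


def is_probably_damaged(sentence: str) -> bool:
    """Return True if the sentence matches the (hypothetical) damage pattern."""
    return DAMAGE_PATTERN.search(sentence) is not None


def split_sentences(sentences):
    """Split sentences into (clean, damaged) lists, preserving input order."""
    clean, damaged = [], []
    for sentence in sentences:
        if is_probably_damaged(sentence):
            damaged.append(sentence)
        else:
            clean.append(sentence)
    return clean, damaged


if __name__ == "__main__":
    sample = ["今天天气很好。", "???? is a ??? city", "Is this fine?"]
    clean, damaged = split_sentences(sample)
    print("clean:", clean)
    print("damaged:", damaged)
```

A single trailing question mark is left alone, so ordinary questions pass through. The flagged list would still need human review before anything is removed from a released corpus.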
(Adding some background info) The zh-cn wiki dump dates from the very early days, when we needed a work-in-progress STT model for a language besides English and had to build a text corpus quickly so that the contracted recording firms from China could start recording.
At that time each sentence was recorded only once, so fetching text from Wikipedia seemed to be the only way to get hundreds of thousands of sentences in a very short time.
We tried hard to adjust the parameters to raise the quality, and this was the best we could get at the time.
Regarding bulk removal:
As a core contributor to both the nan-TW and zh-tw corpora, I consider this a very necessary tool if we want to ensure the quality of the text corpus and the CV database.
Before the collector was published on the official site, we proofread all sentences before they went online. Nowadays it is totally out of our control: everyone can add sentences, and we have no way to evaluate them beforehand.
We have more or less given up on ensuring quality now, so it would be much appreciated if we could have this as a way to do QC.