
Comments (4)

jessicarose commented on June 26, 2024

I think this is a really useful discussion, and it mirrors both concerns we've heard from other language community members and our internal discussions. I'll be bringing this into planning meetings next week and will come back to you with more information as team discussions expand on this and we do some research into technical explorations. Thank you so much for flagging this. It's an incredibly useful issue at a really useful time for the team, and I appreciate you both raising it.

from common-voice.

HarikalarKutusu commented on June 26, 2024

I think the sentences containing "???" were introduced by an encoding bug when the old sentence collector database was incorporated into CV. Unfortunately, the damage is irreversible. In some cases (western languages whose alphabet is mostly Latin/ASCII with some Unicode additions) these sentences still get recorded, because the human brain can deduce the missing characters. But for eastern languages it is quite a problem.

See:
#4048
#4138

There is one attempt to remove them from the released corpus, but it has not been merged yet:
common-voice/CorporaCreator#127

Of course, it might also be caused by wrong encoding in other inclusion methods.
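To illustrate the kind of check being discussed, here is a minimal sketch of how such sentences could be flagged. It is not the code from the CorporaCreator pull request. The function names and the detection rule are assumptions made for illustration: a sentence is treated as damaged if it contains a run of two or more ASCII question marks or the Unicode replacement character U+FFFD. That rule would also flag legitimate text such as "What??", so the flagged list would still need human review.

```python
import re

# Assumption (hypothetical rule): encoding damage shows up as a run of two or
# more ASCII "?" characters, or as the Unicode replacement character U+FFFD.
_DAMAGED = re.compile(r"\?{2,}|\ufffd")


def looks_damaged(sentence: str) -> bool:
    """Return True if the sentence matches the assumed damage pattern."""
    return bool(_DAMAGED.search(sentence))


def split_sentences(sentences):
    """Split an iterable of sentences into (clean, damaged) lists."""
    clean, damaged = [], []
    for sentence in sentences:
        if looks_damaged(sentence):
            damaged.append(sentence)
        else:
            clean.append(sentence)
    return clean, damaged


if __name__ == "__main__":
    sample = ["How are you?", "This is ??? broken", "caf\ufffd"]
    clean, damaged = split_sentences(sample)
    print("clean:", clean)
    print("damaged:", damaged)
```

A single "?" is deliberately not flagged, since ordinary questions end with one.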


irvin commented on June 26, 2024

(Adding some background info) The wiki dump for zh-cn came from the really early days, when we needed a work-in-progress STT model besides English and had to build a text corpus fast so that contracted recording firms from China could record it.

At that time each sentence was recorded only once, so fetching from Wikipedia seemed to be the only way to get hundreds of thousands of sentences in a really short time.

We tried hard to adjust the parameters to raise the quality, and this was the best we could get at the time.


irvin commented on June 26, 2024

Regarding bulk removal:

As a core contributor to both the nan-TW and zh-tw corpora, I see this as a very necessary tool for us if we want to ensure the quality of the text corpus and the CV database.

Before the collector was published on the official site, we proofread all sentences before they went online. Nowadays it's totally out of our control: anyone can add sentences, and we have no way to evaluate them beforehand.

We have more or less given up on ensuring quality now, so it would be much appreciated if we could have this to do QC in some way.

