
Comments (7)

dan-zeman commented on August 18, 2024

There is already a Sindhi dataset in the UD GitHub by @mazharaliabro. It has never been released, primarily because it does not have dependencies. But it has 675 sentences / 6863 tokens with UPOS tags and some features. I suppose someone could use it to train a tagger and apply it to the new data. Whether its tokenization is compatible with the new data should be checked first.
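One rough way to sanity-check tokenization compatibility is to compare how the two datasets treat punctuation. The sketch below is only an illustration, assuming both files are in CoNLL-U format; the function names `read_conllu_forms` and `mixed_punct_ratio` are made up for this example, and the heuristic (share of tokens that mix punctuation with other characters) is one possible signal, not an established test.

```python
import unicodedata


def read_conllu_forms(text):
    """Return a list of sentences, each a list of FORM strings."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if not cols[0].isdigit():
            continue
        current.append(cols[1])
    if current:
        sentences.append(current)
    return sentences


def mixed_punct_ratio(sentences):
    """Share of tokens that mix punctuation with non-punctuation characters."""
    tokens = [tok for sent in sentences for tok in sent]

    def is_mixed(tok):
        flags = [unicodedata.category(ch).startswith("P") for ch in tok]
        return any(flags) and not all(flags)

    return sum(is_mixed(t) for t in tokens) / len(tokens) if tokens else 0.0
```

If the existing dataset gives a ratio near zero and the new data gives a much higher one, the new data probably leaves punctuation attached to words, and a tagger trained on the old data would see unfamiliar token shapes.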

Regarding dependencies, I imagine that a parser based on XLM-RoBERTa (it seems to contain Sindhi) and a mixture of existing UD treebanks (in the spirit of UDify) could produce something that the annotators could use as a starting point.

With inexperienced annotators it may be even more advisable to implement a language-specific validator that will check patterns that the universal validator cannot check.
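As a minimal sketch of what such a language-specific check might look like, assuming CoNLL-U input: the rule table `ALLOWED_FEATS` below is a hypothetical placeholder, not actual Sindhi guidelines, and `validate` is an illustrative name rather than part of any existing UD tool.

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string such as 'Case=Nom|Number=Sing' into a dict."""
    if feats == "_":
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


# Hypothetical rules for illustration only; real rules would come from
# the language's own annotation guidelines.
ALLOWED_FEATS = {
    "NOUN": {"Case", "Gender", "Number"},
    "PUNCT": set(),
}


def validate(conllu_text):
    """Yield (line_number, message) for every token line that breaks a rule."""
    for lineno, line in enumerate(conllu_text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            yield lineno, f"expected 10 columns, found {len(cols)}"
            continue
        if not cols[0].isdigit():
            continue  # multiword-token range or empty node
        upos, feats = cols[3], parse_feats(cols[5])
        allowed = ALLOWED_FEATS.get(upos)
        if allowed is None:
            continue  # no language-specific rule for this tag
        for name in sorted(set(feats) - allowed):
            yield lineno, f"feature {name} not expected on {upos}"
```

Running `list(validate(text))` on an annotated file gives the annotators a list of line numbers to revisit; the rule table can grow as the guidelines settle.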

from docs.

dan-zeman commented on August 18, 2024

> starting a new corpus? Is there a guide for doing so?

Yes, there is this. But every language is special, and there are huge differences in what resources already exist and could potentially be used.


meesumalam commented on August 18, 2024

@AngledLuffa I am working on UD for the Saraiki language, which is closely related to Sindhi.

I am a PhD student in computational linguistics at Indiana University, and would be happy to share my thoughts on this project. Thanks.


AngledLuffa commented on August 18, 2024

@dan-zeman Thank you for the link and the suggested starting point. I would worry about how much Sindhi data is really in XLM - looking over other multilingual transformers which include Sindhi, they generally have very little raw text. The idea of knowledge transfer from an existing language is an interesting one.

We had noticed the unfinished Sindhi dataset. I'm not sure how finished the UPOS tagging & featurization is currently expected to be. Depending on how much we want to use it, there may already be enough to start a tagger. Not having dependencies will be a bit of a limitation at first, I would expect.

@meesumalam Thank you for the suggestion. Would it make sense to connect you directly with @muteeurahman? I am curious what you've found in terms of raw text for annotating or building language models, especially if you've come across such data in Sindhi. There is a limited amount of data in Common Crawl or Wikipedia for Sindhi, and I would expect even less for Saraiki (I don't see it listed in the OSCAR version of CC, for example).


meesumalam commented on August 18, 2024

Right, Saraiki doesn't have much data compared to Sindhi.

You can reach me at [email protected] for further discussion on the topic.

Thanks


muteeurahman commented on August 18, 2024


meesumalam commented on August 18, 2024
