Comments (7)
There is already a Sindhi dataset in the UD GitHub by @mazharaliabro. It has never been released, primarily because it does not have dependencies. But it has 675 sentences / 6863 tokens with UPOS tags and some features. I suppose someone could use it to train a tagger and apply it to the new data. It should be checked whether the tokenization is compatible.
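One rough way to compare tokenization conventions between the existing treebank and the new data is to measure how often punctuation stays attached to a word. This is only a sketch of a heuristic, not an established UD tool: the function names are made up for illustration, and a similar rate in both datasets does not prove the tokenizations are compatible.

```python
import unicodedata


def read_forms(conllu_text):
    """Collect the FORM column of ordinary tokens from CoNLL-U text."""
    forms = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword token ranges (1-2) and empty nodes (1.1).
        if len(cols) == 10 and cols[0].isdigit():
            forms.append(cols[1])
    return forms


def is_punct(ch):
    """True if the character is in a Unicode punctuation category."""
    return unicodedata.category(ch).startswith("P")


def mixed_token_rate(forms):
    """Share of tokens that mix punctuation with other characters.

    Hypothetical heuristic: a large gap between two datasets suggests
    that they split punctuation from words differently.
    """
    forms = list(forms)
    if not forms:
        return 0.0
    mixed = sum(
        1
        for form in forms
        if any(is_punct(c) for c in form) and not all(is_punct(c) for c in form)
    )
    return mixed / len(forms)
```

Running `mixed_token_rate(read_forms(text))` on both datasets and comparing the two numbers gives a first impression before any tagger is trained.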
Regarding dependencies, I imagine that a parser based on XLM-RoBERTa (it seems to contain Sindhi) and a mixture of existing UD treebanks (in the spirit of Udify) could produce something that the annotators could use.
With inexperienced annotators it may be even more advisable to implement a language-specific validator that will check patterns that the universal validator cannot check.
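A language-specific validator of this kind could start as a small script over the CoNLL-U file. The following is a minimal sketch, not the official UD validator: the function names are invented here, and the two rules are hypothetical placeholders that would have to be replaced with patterns that actually hold for Sindhi.

```python
def read_sentences(conllu_text):
    """Yield each sentence as a list of 10-column token rows."""
    sentence = []
    for line in conllu_text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            if len(cols) == 10:
                sentence.append(cols)
    if sentence:
        yield sentence


def check_token(cols):
    """Apply placeholder rules to one token; return a list of messages."""
    upos, feats, deprel = cols[3], cols[5], cols[7]
    problems = []
    # Hypothetical rule 1: adpositions should not carry Tense.
    if upos == "ADP" and "Tense=" in feats:
        problems.append("ADP carries a Tense feature")
    # Hypothetical rule 2: adpositions attach as 'case'.
    # '_' is tolerated because the data may not have dependencies yet.
    if upos == "ADP" and deprel.split(":")[0] not in ("case", "_"):
        problems.append("ADP attached as '%s' instead of 'case'" % deprel)
    return problems


def validate(conllu_text):
    """Return (sentence number, token ID, message) for every violation."""
    report = []
    for sent_no, sentence in enumerate(read_sentences(conllu_text), start=1):
        for cols in sentence:
            # Skip multiword token ranges (1-2) and empty nodes (1.1).
            if not cols[0].isdigit():
                continue
            for problem in check_token(cols):
                report.append((sent_no, cols[0], problem))
    return report
```

Keeping each rule as a separate check in `check_token` makes it easy for annotators to add or drop patterns as the guidelines for the language settle.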
from docs.
starting a new corpus? Is there a guide for doing so?
Yes, there is this. But every language is special and there are huge differences in what resources already exist and can be potentially used.
@AngledLuffa I am working on UD for the Saraiki language, which is closely related to Sindhi.
I am a PhD student in computational linguistics at Indiana University, and would be happy to share my thoughts on this project. Thanks.
@dan-zeman Thank you for the link and the suggested starting point. I would worry about how much Sindhi data is really in XLM - looking over other multilingual transformers which include Sindhi, they generally have very little raw text. The idea of knowledge transfer from an existing language is an interesting one.
We had noticed the unfinished Sindhi dataset. I'm not sure what the current expectation is in terms of how finished we think the UPOS tagging & featurization is. Depending on how much we want to use it, there may already be enough to start a tagger. Not having dependencies will be a bit of a limitation at first, I would expect.
@meesumalam Thank you for the suggestion. Would it make sense to connect you directly with @muteeurahman? I am curious what you've found in terms of raw text for annotating or building language models, especially if you've come across such data in Sindhi. There is a limited amount of data in Common Crawl or Wikipedia for Sindhi, and I would expect even less for Saraiki (I don't see it listed in the OSCAR version of CC, for example).
Right, Saraiki doesn't have as much data as Sindhi.
You can reach me at [email protected] for further discussion on the topic.
Thanks