I am wondering, what advice is there for starting a new corpus? Is there a guide for


Suggestions for annotating a new corpus (UniversalDependencies/docs, issue #993)

dan-zeman commented on August 18, 2024

There is already a Sindhi dataset in the UD GitHub by @mazharaliabro. It has never been released, primarily because it does not have dependencies. But it contains 675 sentences / 6863 tokens with UPOS tags and some features. I suppose someone could use it to train a tagger and apply it to the new data. It should be checked whether the tokenization is compatible.
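The tokenization check mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions: the sample data is a made-up placeholder (not Sindhi), `read_conllu` and `tokenization_mismatches` are hypothetical helper names, and `str.split` stands in for whatever tokenizer would be applied to the new data.

```python
from collections import Counter

def read_conllu(text):
    """Parse CoNLL-U text into a list of {'text': str, 'tokens': [(form, upos)]}."""
    sentences = []
    current = {"text": None, "tokens": []}
    for line in text.splitlines() + [""]:  # the trailing "" flushes the last sentence
        if not line.strip():
            if current["tokens"]:
                sentences.append(current)
            current = {"text": None, "tokens": []}
        elif line.startswith("# text = "):
            current["text"] = line[len("# text = "):]
        elif line.startswith("#"):
            continue
        else:
            cols = line.split("\t")
            # Plain integer IDs only: skips multiword ranges (1-2) and empty nodes (1.1).
            if cols[0].isdigit():
                current["tokens"].append((cols[1], cols[3]))
    return sentences

def tokenization_mismatches(sentences, tokenize):
    """Indices of sentences where tokenize(raw text) differs from the treebank forms."""
    return [
        i for i, s in enumerate(sentences)
        if s["text"] is not None
        and tokenize(s["text"]) != [form for form, _ in s["tokens"]]
    ]

# Placeholder data for illustration only; not taken from the Sindhi dataset.
SAMPLE = "\n".join([
    "# text = aa bb.",
    "1\taa\t_\tNOUN\t_\t_\t_\t_\t_\t_",
    "2\tbb\t_\tVERB\t_\t_\t_\t_\t_\tSpaceAfter=No",
    "3\t.\t_\tPUNCT\t_\t_\t_\t_\t_\t_",
    "",
    "# text = cc dd",
    "1\tcc\t_\tPRON\t_\t_\t_\t_\t_\t_",
    "2\tdd\t_\tVERB\t_\t_\t_\t_\t_\t_",
])

sentences = read_conllu(SAMPLE)
upos_counts = Counter(upos for s in sentences for _, upos in s["tokens"])
# str.split is an assumed stand-in for the tokenizer used on the new data.
mismatches = tokenization_mismatches(sentences, str.split)
print(len(sentences), sum(upos_counts.values()), mismatches)
```

Sentences listed in `mismatches` are the ones where the stand-in tokenizer disagrees with the treebank, which is where a tagger trained on the old data would likely see unfamiliar token boundaries.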

Regarding dependencies, I imagine that a parser based on XLM-RoBERTa (which seems to include Sindhi) and trained on a mixture of existing UD treebanks (in the spirit of UDify) could produce something that the annotators could use.

With inexperienced annotators, it may be even more advisable to implement a language-specific validator that checks patterns the universal validator cannot check.
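A language-specific check of this kind might look like the sketch below. It is not part of the official UD validator. The rule itself (an `ADP` attached as `case` should follow its head) is an assumption chosen purely for illustration, and `parse_sentence` and `check_case_follows_head` are hypothetical names.

```python
def parse_sentence(block):
    """Parse one CoNLL-U sentence block into a list of token dicts."""
    tokens = []
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():  # skip multiword ranges and empty nodes
            continue
        tokens.append({
            "id": int(cols[0]),
            "form": cols[1],
            "upos": cols[3],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    return tokens

def check_case_follows_head(tokens):
    """Hypothetical rule: flag ADP tokens attached as 'case' that precede their head."""
    errors = []
    for tok in tokens:
        if tok["upos"] == "ADP" and tok["deprel"] == "case" and tok["id"] < tok["head"]:
            errors.append(
                f"token {tok['id']} ({tok['form']}): case-marking ADP precedes its head {tok['head']}"
            )
    return errors

# Placeholder sentence for illustration only.
BLOCK = "\n".join([
    "# text = pp nn qq",
    "1\tpp\t_\tADP\t_\t_\t2\tcase\t_\t_",
    "2\tnn\t_\tNOUN\t_\t_\t0\troot\t_\t_",
    "3\tqq\t_\tADP\t_\t_\t2\tcase\t_\t_",
])

errors = check_case_follows_head(parse_sentence(BLOCK))
for message in errors:
    print(message)
```

Each rule is a plain function from a token list to a list of messages, so further language-specific checks could be added and run in sequence over every sentence.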

from docs.

dan-zeman commented on August 18, 2024

starting a new corpus? Is there a guide for doing so?

Yes, there is this. But every language is special and there are huge differences in what resources already exist and can be potentially used.


meesumalam commented on August 18, 2024

@AngledLuffa I am working on UD for the Saraiki language, which is closely related to Sindhi.

I am a PhD student in computational linguistics at Indiana University, and would be happy to share my thoughts on this project. Thanks.


AngledLuffa commented on August 18, 2024

@dan-zeman Thank you for the link and the suggested starting point. I would worry about how much Sindhi data is really in XLM - looking over other multilingual transformers which include Sindhi, they generally have very little raw text. The idea of knowledge transfer from an existing language is an interesting one.

We had noticed the unfinished Sindhi dataset. I'm not sure how finished the UPOS tagging and featurization are expected to be. Depending on how much we want to use it, there may already be enough to start a tagger. Not having dependencies will be a bit of a limitation at first, I would expect.

@meesumalam Thank you for the suggestion. Would it make sense to connect you directly with @muteeurahman? I am curious what you've found in terms of raw text for annotating or building language models, especially if you've come across such data in Sindhi. There is a limited amount of data in the Common Crawl or Wikipedia for Sindhi, and I would expect even less for Saraiki (I don't see it listed in the OSCAR version of CC, for example).


meesumalam commented on August 18, 2024

Right, Saraiki doesn't have much data compared to Sindhi.

You can reach me at [email protected] for further discussion on the topic.

Thanks


muteeurahman commented on August 18, 2024

@meesumalam Thanks for your suggestions. As Saraiki is closely related to Sindhi, we may run into similar issues, such as crossing (nonprojective) dependencies and feature complexities due to pronominal suffixes. Once we reach these problems, we will have interesting discussions about them. Dr. Tafseer told me about someone working on Saraiki dependencies; most probably he was talking about you.
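For reference, crossing dependencies can be spotted mechanically once heads are annotated. The following is a minimal sketch, assuming heads are given as a list where `heads[i]` is the head of token `i + 1` and `0` marks the root (treated as an artificial node at position 0). The function name `crossing_arcs` and the example trees are made up for illustration.

```python
from itertools import combinations

def crossing_arcs(heads):
    """Return pairs of arc spans that cross; an empty list means the tree is projective.

    heads[i] is the head of token i + 1, with 0 standing for the artificial root.
    """
    # Represent every arc as an ordered (left, right) span of positions.
    spans = sorted(tuple(sorted((head, dep))) for dep, head in enumerate(heads, start=1))
    crossings = []
    for (a1, b1), (a2, b2) in combinations(spans, 2):
        # After sorting, a1 <= a2, so two arcs cross exactly when they strictly interleave.
        if a1 < a2 < b1 < b2:
            crossings.append(((a1, b1), (a2, b2)))
    return crossings

projective = crossing_arcs([2, 0, 2])         # token 2 is the root; tokens 1 and 3 attach to it
nonprojective = crossing_arcs([3, 4, 0, 3])   # token 2 attaches to 4 across the arc between 1 and 3
print(projective)
print(nonprojective)
```

Running such a check over a draft treebank gives a quick count of nonprojective sentences, which may help decide which constructions deserve a guideline discussion first.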



meesumalam commented on August 18, 2024

Yes, I had a conversation with Dr. Tafseer and told him about my UD Saraiki work. I think it would be great if we could have a meeting via Zoom to discuss and decide on the complex structures of Sindhi. Thanks, Meesum

