Comments (36)

sebastianruder commented on May 19, 2024

Just to be clear: awesome-nlp should stay awesome, so we shouldn't remove anything from here for now, and awesome-nlp should still be the place where libraries, tools, etc. are collected.
As @NirantK mentions, anything with reported results and standard evaluation setups can be added to nlpprogress.

from awesome-nlp.

NirantK commented on May 19, 2024

I will be working on this issue for a brief duration.

I'd love to assist you @the-ethan-hunt if you are up to take the lead on this.

NirantK commented on May 19, 2024

Welcome @arpitabatra to the thread. She did her thesis on Hindi Text Processing. Some of the cool stuff she mentioned:

Thanks for the search @the-ethan-hunt, they look good. Let's go a little wide in the beginning and then we can trim down. Sounds good?

NirantK commented on May 19, 2024

Hey @arpitabatra @the-ethan-hunt, please go ahead and raise one PR for Hindi datasets (excluding the work on POS, stemming, etc.) as soon as you have some time?

The work isn't quite sufficient to get started in terms of tools, but I think we should share the datasets at least, as we've done for Spanish.

the-ethan-hunt commented on May 19, 2024

@NirantK , I would be happy to be assisted by you! But the pond is too large and the fish too small. 😅

NirantK commented on May 19, 2024

the-ethan-hunt commented on May 19, 2024

You are right!

the-ethan-hunt commented on May 19, 2024

@NirantK , a simple GitHub search is leading nowhere. Any leads to start this? 😅

NirantK commented on May 19, 2024

the-ethan-hunt commented on May 19, 2024

@NirantK, I went old school and discovered two good papers worth mentioning in this list:

  • A POS tagger and chunker system for Hindi language using Maximum Entropy Markov Model. Here is the paper
  • A lightweight stemmer for Hindi link

IMHO, there has been negligible research conducted for NLP in Tamil, Telugu, Marathi and other Indian languages.

the-ethan-hunt commented on May 19, 2024

@NirantK , sure!
Thanks for the stuff Arpita! And welcome to awesome-nlp! 😄

arpitabatra commented on May 19, 2024

POS tagging related papers.

  • Morphological Richness Offsets Resource Demand - Experiences in Constructing a POS Tagger for Hindi (link)
  • Building Feature Rich POS Tagger for Morphologically Rich Languages: Experiences in Hindi (link)
  • Hindi POS Tagger Using Naive Stemming: Harnessing Morphological Information Without Extensive Linguistic Knowledge (link)

arpitabatra commented on May 19, 2024

@NirantK and @the-ethan-hunt: shall we explore some papers that are not only statistics-based but also use some linguistic cues? Since large datasets are unavailable for Hindi, it will be difficult to train the models.

NirantK commented on May 19, 2024

Sure @arpitabatra, that's a good insight. We should definitely look into those. Please do help us around that.

Sidenote: If there are any glaring holes in Hindi Text Processing, please mention them as well, they can become research avenues for people after us. We can note and detail those challenges in a separate repository/markdown file as well.

@the-ethan-hunt, I think we both can focus on Gujarati/other Indic languages as @arpitabatra has been kind enough to share her expertise with us on Hindi. What do you think?

Edit: I've added Gujarati to the task list above, keeping in mind the comment by @the-ethan-hunt stating that, prima facie, no good work was found for Tamil, Telugu and Bengali.

the-ethan-hunt commented on May 19, 2024

@NirantK, yes, I agree with you. And while tinkering around, I found this. There are huge treebanks for several languages (both Indian and foreign).
Should I make a PR for this?

NirantK commented on May 19, 2024

Hey, @the-ethan-hunt that's a good find.

Let's link to Hindi specific work for now.

Maybe we need to look into more tooling, datasets and academic work beyond treebanks, POS taggers and actually compile the best from what is out there?

If I were starting to look into Hindi NLP, the above list of work is not even 20-30% of what I'd need to get started.

the-ethan-hunt commented on May 19, 2024

@NirantK, there is also this treebank prepared by several American universities. The point is that both of the treebanks mentioned are annotated; this would save linguistics experts and NLP scientists from spending their time annotating their own corpora.

NirantK commented on May 19, 2024

That is already in the list from @arpitabatra :)

the-ethan-hunt commented on May 19, 2024

@NirantK, any points where we can start working? Like the Universal Dependencies thing? 😄

NirantK commented on May 19, 2024

@the-ethan-hunt
Why do we need data to work in NLP?

  • Dictionaries and WordNets are useful for syntactic tasks
  • A large text corpus is useful for a lot of tasks, such as text classification, text embeddings, and so on

Then, in terms of data, we need the following:

  • Dictionaries, e.g. Gujarati to English and vice versa
  • A large news corpus similar to the CNN or DailyMail ones

I hope this helps us streamline our efforts. I will look for a large news corpus; if none is available, I will at least list a few major websites that we can use to generate that dataset.

the-ethan-hunt commented on May 19, 2024

@NirantK , regarding the shift of the language section to NLP-Progress as discussed in this thread, should I raise PRs for new resources here or at NLP-Progress?
Does it sound alright, @sebastianruder ? 😅

NirantK commented on May 19, 2024

@the-ethan-hunt

If there are performance numbers available, or high user trust in that lib, raise it directly at NLP-Progress.
If not, raise it here for now.

There is a lot of work which does not have results. E.g. datasets, Python libs in Arabic/Hindi etc.
They are good enough for programmers quite often. We can discuss and sort those edge cases out.
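
As a rough illustration only, the rule of thumb above could be sketched in Python as follows. The `Resource` class, its field names, and `target_repo` are hypothetical names made up for this sketch; the thread does not define how "high user trust" would be measured, so it is modelled here as a plain flag.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """A candidate entry (hypothetical structure, for illustration only)."""
    name: str
    has_reported_results: bool = False  # performance numbers are available
    high_user_trust: bool = False       # assumption: judged manually by reviewers


def target_repo(resource: Resource) -> str:
    """Return the repository where a PR for this resource should be raised."""
    if resource.has_reported_results or resource.high_user_trust:
        return "NLP-Progress"
    # Datasets and libraries without results stay here for now.
    return "awesome-nlp"
```

For example, `target_repo(Resource("some Hindi dataset"))` falls through to `"awesome-nlp"`, which matches the edge cases (datasets, Python libs without results) mentioned above.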

Shashi456 commented on May 19, 2024

@NirantK I think this library can be added as a tool for Indic languages:
http://anoopkunchukuttan.github.io/indic_nlp_library/

NirantK commented on May 19, 2024

Shashi456 commented on May 19, 2024

Also @NirantK, I think the ACL 2018 highlights by Sebastian Ruder should be added to the research trends and summaries.

NirantK commented on May 19, 2024

@Shashi456 please raise an MR for Ruder's highlights with a one-line explanation and we'll review it?

Shashi456 commented on May 19, 2024

@NirantK do you know of any Indic libraries other than that? I've been scouring the internet for some but have found none satisfactory.

NirantK commented on May 19, 2024

guillaume-chevalier commented on May 19, 2024

Hello @NirantK, I made this project for clustering/topic extraction: https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA

It also contains a tutorial explaining the architecture: https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA/blob/master/Stemming-words-from-multiple-languages.ipynb

It also has unit tests.

All those languages are supported:

  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Hungarian
  • Italian
  • Norwegian
  • Porter
  • Portuguese
  • Romanian
  • Russian
  • Spanish
  • Swedish
  • Turkish

I was hesitating whether or not to add a new section, such as "Many languages". My question is: what would you do? Where would you add this?

Thank you!

NirantK commented on May 19, 2024

@guillaume-chevalier that should go under Libraries -> Python. Please raise a PR. Great to see a multi-lingual clustering toolkit!

NeroCube commented on May 19, 2024

Hi @NirantK, I can support a Traditional Chinese translation and I am currently working on it.
This is a really awesome repo; I hope more people can see it.

NirantK commented on May 19, 2024

Wow, this is fantastic @NeroCube! Can you please raise a PR with your translation?

goru001 commented on May 19, 2024

@NirantK Do you think adding links to the repos NLP for Hindi, NLP for Punjabi, NLP for Sanskrit, NLP for Gujarati, NLP for Kannada, NLP for Malayalam, NLP for Nepali, NLP for Odia, NLP for Marathi, NLP for Bengali, NLP for Tamil, and NLP for Urdu under the Indic Languages section would be helpful? All these repos contain language models, classifiers and tokenizers, along with the datasets used to train the models, for their respective languages, and are being used in iNLTK.

NirantK commented on May 19, 2024

We already have iNLTK, which in turn links to all of the above.

Maybe not add all of them? This might get spammy.

goru001 commented on May 19, 2024

Yes okay! That seems right! Thanks!

NirantK commented on May 19, 2024

Thank you everyone who has contributed to the multiple languages work, here on Awesome-NLP. While we continue to welcome the contributions along similar lines, we have some sort of coverage now.

I'm closing this issue for now. We will open new issues to encourage specific languages.
