Comments (16)

avinashvarna commented on June 5, 2024

@vinayakumarr This is a different thread/repo, but the trained and exported model for https://github.com/avinashvarna/sanskrit_nmt is already in the git repo, as mentioned in the README, along with detailed steps on using the model - https://github.com/avinashvarna/sanskrit_nmt/tree/master/sandhi_split/transformer_small_vocab

from dcs_experiments.

avinashvarna commented on June 5, 2024

Namaste Sebastian,

The raw data is available on the DCS website and has been scraped as part of this project: https://github.com/sanskrit-coders/dcs-scraper/. If you are using Python, I have created a Python wrapper here: https://github.com/avinashvarna/dcs_wrapper, which you can install with pip install dcs_wrapper (https://pypi.python.org/pypi/dcs-wrapper).
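
As a hedged illustration of the wrapper idea, iterating over DCS-style sentence records might look like the sketch below. The class and method names here are stand-ins for illustration only, not the real dcs_wrapper API; see that package's README for the actual interface.

```python
# Stand-in sketch of a DCS-style corpus reader. `Sentence` and `FakeDCS`
# are hypothetical names, NOT the dcs_wrapper API.
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class Sentence:
    text: str            # sandhied surface text
    analyses: List[str]  # per-word grammatical annotations

class FakeDCS:
    """Toy corpus reader that yields annotated sentences."""
    def __init__(self, records: List[Sentence]):
        self._records = records

    def iter_sentences(self) -> Iterator[Sentence]:
        yield from self._records

corpus = FakeDCS([Sentence("rAmAn paSyati", ["rAma [a. p. m.]", "paS [3. s.]"])])
for s in corpus.iter_sentences():
    print(s.text, "->", s.analyses)
```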

How is the training going? Have you been able to train the model from the sandhi-splitting project?

avinashvarna commented on June 5, 2024

Would you mind posting what the expected splits of the two sentences are, so that I can compare them with what the sanskrit_parser project would currently output?

If you can share the trained models, then sure, I would love to play with them. As you point out, the accuracy will probably improve with more data and more training. As far as I am aware, GRETIL does not provide segmentation/analysis. Is that correct?

Training these complex models with several stacked layers of recurrent networks is indeed compute- and memory-intensive. You could try some cloud-based services to see if they work better.

avinashvarna commented on June 5, 2024

Namaste Sebastian,

Thanks for sharing the model. I will play around with it, but just looking at the file size, you are possibly right that the model has quite a lot of parameters and would require a lot of computation and memory to train. I wonder whether it can be trained well on such a small dataset (the number of parameters may be much larger than the available data), but I have not looked at it in detail yet.

Glad to hear that you have made good progress using the openNMT toolkit. Unfortunately, the public DCS database does not directly give us sentence -> sandhi-split data. I had thought about this previously and the only way I could think of to get this using the available data is to use the grammatical annotations (e.g. rAma [a. p. m.]) to infer the form in the sentence (rAmAn). This would require using some library that can be used to lookup the form given the root and annotation. I never got around to implementing that, but it might be possible using some existing APIs/databases (e.g. INRIA/UoHyd). I will give it some more thought and see if there are any other options, but for now, I am not aware of this data being available publicly. You could always contact Prof. Hellwig and see if he has this data and whether he would be willing to share it.
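
The reverse lookup described above can be sketched as a table keyed by (stem, annotation). This is a minimal illustration only: the paradigm covers a few toy entries, the tag strings mimic the example above rather than the actual DCS tag set, and a real implementation would query a morphological database such as the INRIA or UoHyd tools.

```python
# Minimal sketch of inferring the surface form from stem + annotation.
# Table and tag strings are illustrative, not the actual DCS tag set.
DECLENSION = {
    ("rAma", "a. p. m."): "rAmAn",   # accusative plural masculine
    ("rAma", "n. s. m."): "rAmaH",   # nominative singular masculine
    ("rAma", "i. s. m."): "rAmeNa",  # instrumental singular masculine
}

def surface_form(stem: str, tag: str) -> str:
    """Look up the inflected form for a stem and grammatical tag."""
    try:
        return DECLENSION[(stem, tag)]
    except KeyError:
        raise KeyError(f"no inflection known for {stem} [{tag}]") from None

print(surface_form("rAma", "a. p. m."))  # -> rAmAn
```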

avinashvarna commented on June 5, 2024

Namaste Sebastian,

Please do let me know if you get the split data from Prof. Hellwig. It would be a very useful resource for training these kinds of models. I spent some time over the past week investigating what it would take to reverse-engineer the sandhi split, and did some experiments. Due to the way the data is annotated in the DCS, there are cases where it may not be possible to directly figure out what the right form is without a lot of extra effort/manual review. It would be good to avoid this if not necessary.

I think training a network to directly output the stems and morphological information is a valid approach, since that may be the ultimate goal; after all, this is what POS-tagging networks do. If you have any results from the training, please do share them. Did you create a custom model, or are you using one of the built-in openNMT models such as nmt_small/nmt_medium?
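
For concreteness, here is a minimal sketch of laying out such sentence-to-stems+tags pairs in the two-file source/target format that openNMT-style toolkits consume. The angle-bracket tag notation and file names are my own assumptions for illustration, not a requirement of any toolkit.

```python
# Sketch: write parallel training files for a seq2seq model that maps a
# sandhied sentence directly to stems plus morphological tags.
# The <tag> notation and file names are illustrative assumptions.
import os
import tempfile

pairs = [
    ("rAmAnpaSyati", "rAma<ac.pl.m.> paS<3.sg.>"),
]

def write_parallel(pairs, src_path, tgt_path):
    with open(src_path, "w") as src, open(tgt_path, "w") as tgt:
        for sentence, analysis in pairs:
            src.write(" ".join(sentence) + "\n")  # character-level source
            tgt.write(analysis + "\n")            # stem+tag tokens as target

out_dir = tempfile.mkdtemp()
src_file = os.path.join(out_dir, "train.src")
tgt_file = os.path.join(out_dir, "train.tgt")
write_parallel(pairs, src_file, tgt_file)
print(open(src_file).read().strip())  # -> r A m A n p a S y a t i
```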

Regarding using sandhi-joins on random data for training, my opinion would be that it would have limited use. Even if the network learns everything perfectly, it would just learn the rules of splitting sandhis between arbitrary words, but we already have different libraries to do that. The difficulty is that sandhi splitting can have multiple valid results, and the goal is to choose the most probable split based on the language model. If the network is trained on "real" data, it can learn how to use contextual information to generate the most likely split, which the basic sandhi splitting libraries currently don't do.
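
The ambiguity point can be made concrete with a toy reverse-sandhi sketch (the rule table is a tiny illustrative subset, not a complete rule set): undoing even a single vowel sandhi already yields several phonologically valid splits, and only a model of context can rank them.

```python
# Toy illustration of sandhi ambiguity. Reverse rules map a surface vowel
# to the word-final/word-initial pairs that could have produced it.
# This is a tiny illustrative subset, not a complete rule set.
REVERSE_RULES = {
    "e": [("a", "i"), ("a", "I")],  # a + i -> e, a + I -> e
    "o": [("a", "u"), ("a", "U")],  # a + u -> o, a + U -> o
}

def candidate_splits(surface: str):
    """Enumerate naive two-word splits of `surface` by undoing one vowel sandhi."""
    splits = []
    for i, ch in enumerate(surface):
        for left, right in REVERSE_RULES.get(ch, []):
            splits.append((surface[:i] + left, right + surface[i + 1:]))
    return splits

# "gaNeza" (gaNa + Iza) also yields the spurious split "gaNa" + "iza":
print(candidate_splits("gaNeza"))
```

Both candidates are valid applications of the rule; picking "gaNa" + "Iza" over "gaNa" + "iza" is exactly where a language model trained on real data helps.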

sebastian-nehrdich commented on June 5, 2024

Namaste Avinash,

Sorry for my slow reply; I was on a meditation retreat over the past few days.
I contacted Dr. Hellwig (to be precise, as far as I know he does not hold a professorship yet), and his reply gave no indication that he is willing to release this data in an open-source way (yet). However, he said he is himself working on a similar segmenter and plans to publish it in a paper in the near future. There might also be some opportunity for cooperation between him and people working at the University of Hamburg, but I suspect that in that case the data would remain closed within the academic circle. I can understand it, because a lot of hard work has gone into it.
So for the time being, what we can do for the open-source projects is limited to reverse engineering, I think. And maybe thinking about ways to generate data. I agree that random generation of joined sandhi data is of limited use, because the network would just learn the rules in that case. But I can imagine that training the network on the limited set of data we have, seeing where it fails, trying to understand those failures, and then generating data to cover those cases would be a not-too-unrealistic way to tackle this problem. One big issue in my eyes with the data used by Vikas Reddy et al. is the small vocabulary size. Only 7 MB of largely classical (mainstream) Sanskrit is prone to problems when applied to less classical texts. They claim 90% on their test data (by the way, their paper is out now: https://arxiv.org/abs/1802.06185), and I believe that, but when I run it on my own test data (largely scholastic Buddhist Sanskrit from the 6th century), my subjective impression is that precision is lower.
Curiously, the problem is not even wrong sandhi splits, which work fine, but that sometimes whole words get exchanged for wrong ones (which I think is due to the small vocabulary size). However, I do not yet understand neural networks well enough to explain these problems.
Regarding my own attempts: I am going to build a new stemmer based on Dr. Hellwig's data, this time using Keras as the framework. As soon as it works, I will upload code and results here. I think it is promising; the roughly 60 MB that we have as sentence pairs is quite OK.
I can continue my work only after next week, because I am currently travelling and don't have access to a machine with a GPU. So right now I am planning the code, and next week I will see how it performs.
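
The train / inspect-failures / generate-covering-data loop sketched above could look roughly like this. The three helper functions are trivial placeholders standing in for whatever toolkit (openNMT, Keras, ...) is actually used; only the loop structure is the point.

```python
# Sketch of the error-driven augmentation loop: train, collect failures on
# held-out data, synthesize targeted examples, retrain. The helpers below
# are trivial placeholders, not a real NMT toolkit.

def train(pairs):
    # placeholder "model": a lookup table memorizing the training pairs
    return dict(pairs)

def predict(model, src):
    return model.get(src, "")

def synthesize_like(failure):
    # placeholder: in reality, generate new pairs covering the failure pattern
    return failure

def augmentation_loop(train_pairs, dev_pairs, rounds=3):
    model = train(train_pairs)
    for _ in range(rounds):
        failures = [(s, r) for s, r in dev_pairs if predict(model, s) != r]
        if not failures:
            break
        train_pairs = train_pairs + [synthesize_like(f) for f in failures]
        model = train(train_pairs)
    return model

model = augmentation_loop([("rAmo", "rAmaH")], [("gaNezaH", "gaNa IzaH")])
print(predict(model, "gaNezaH"))
```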

vinayakumarr commented on June 5, 2024

@sebastian-nehrdich
@avinashvarna

Namaste Avinash and Sebastian,

I am a beginner; my detailed profile is available at http://vinayakumarr.github.io. I like to learn new things, and I have just started working on the same project. I tried to run https://github.com/cvikasreddy/skt,
but it takes too much time even to run one epoch. It would be helpful if you could share the trained models for https://github.com/cvikasreddy/skt and https://github.com/avinashvarna/sanskrit_nmt.

vinayakumarr commented on June 5, 2024

Yes, you are correct, Avinash.

Sebastian mentioned the trained model for skt. If you have it, please share it.

avinashvarna commented on June 5, 2024

I was surprised that a bidirectional RNN would require fewer computational resources than the transformer, since the transformer was designed to address the latency bottlenecks of bidirectional RNNs. However, I saw from the paper that the transformer was run with default parameters and no attempt was made to optimize the training efficiency. I suspect that the default model is too big for the task at hand. The model here - https://github.com/avinashvarna/sanskrit_nmt/tree/master/sandhi_split/transformer_small_vocab - is also a smaller version of the transformer, since the default parameters were overkill.
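
To make the "too big" intuition concrete, here is a rough back-of-the-envelope sketch. The hyperparameter values are illustrative contrasts with the base Transformer defaults, not the actual settings of the linked transformer_small_vocab model, and the parameter count ignores biases, layer norms, and projection sharing.

```python
# Illustrative reduced-Transformer hyperparameters vs. the base defaults.
# These are NOT the exact settings of the linked model.
SMALL_TRANSFORMER = {
    "num_layers": 2,        # base Transformer uses 6
    "hidden_size": 256,     # base uses 512
    "ffn_inner_dim": 1024,  # base uses 2048
    "num_heads": 4,         # base uses 8
    "dropout": 0.1,
}

def approx_params(cfg, vocab_size=128):
    """Very rough parameter count: attention + FFN weights, plus embeddings.

    Ignores biases, layer norms, and output projection sharing.
    """
    d, f, n = cfg["hidden_size"], cfg["ffn_inner_dim"], cfg["num_layers"]
    per_layer = 4 * d * d + 2 * d * f              # Q,K,V,O projections + FFN in/out
    return 2 * n * per_layer + 2 * vocab_size * d  # encoder + decoder + embeddings

print(approx_params(SMALL_TRANSFORMER))  # a few million, vs. tens of millions for the base model
```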

Since I have the setup ready, I was planning to use the data from https://github.com/OliverHellwig/sanskrit/blob/master/papers/2018emnlp/ to retrain and compare, but I see that the accuracy evaluation is built into the model training and there is no easy script to compare the results of different models. It would have been nice if there were a separate script, similar to sacrebleu (https://pypi.org/project/sacrebleu/), that would use a reference to compute the accuracy of a model's output. @sebastian-nehrdich, do you happen to have such a script handy?
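
In the absence of such a script, a minimal stand-alone comparison in the spirit of sacrebleu might look like the sketch below: exact-match sentence accuracy over line-aligned hypothesis and reference files. The one-sentence-per-line format and the CLI shape are assumptions.

```python
# Minimal sketch of a sacrebleu-style evaluation script: compare a model's
# output against a reference, one sentence per line, and report the
# exact-match sentence accuracy. File format and CLI shape are assumptions.
import sys

def sentence_accuracy(hyp_lines, ref_lines):
    """Fraction of hypothesis lines that exactly match the reference."""
    if len(hyp_lines) != len(ref_lines):
        raise ValueError("hypothesis and reference differ in length")
    matches = sum(h.strip() == r.strip() for h, r in zip(hyp_lines, ref_lines))
    return matches / len(ref_lines)

if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as hyp, open(sys.argv[2]) as ref:
        print(f"accuracy: {sentence_accuracy(hyp.readlines(), ref.readlines()):.4f}")
```

Invoked, say, as `python accuracy.py model.out reference.txt` (hypothetical file names), it would print a single accuracy figure, making different models directly comparable.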

avinashvarna commented on June 5, 2024

On a side note, getting good-quality GPUs in the cloud is not too difficult nowadays. GCE offers free credits for new users, and GPU backends are available in Colab. E.g. see here

P.S. I am not affiliated with Google :)
