Namaste, I wonder if the raw data of the dcs used for this experimen

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Namaste Sebastian The raw data is available on the <a href="http://k

Hey Avinash,Here is the model:<a href="https://wetransfer.com/download

raw data of the dcs?,about avinashvarna/dcs_experiments

Comments (16)

avinashvarna commented on June 5, 2024 1

@vinayakumarr This is a different thread/repo, but the trained and exported model for https://github.com/avinashvarna/sanskrit_nmt is already in the git repo, as mentioned in the README, along with detailed steps on using the model - https://github.com/avinashvarna/sanskrit_nmt/tree/master/sandhi_split/transformer_small_vocab

from dcs_experiments.

avinashvarna commented on June 5, 2024

Namaste Sebastian

The raw data is available on the DCS website and has been scraped as part of this project - https://github.com/sanskrit-coders/dcs-scraper/. If you are using Python, I have created a python wrapper here - https://github.com/avinashvarna/dcs_wrapper which you can install using pip install dcs_wrapper (https://pypi.python.org/pypi/dcs-wrapper)

How is the training going? Have you been able to train the model from the sandhi splitting project ?

from dcs_experiments.

sebastian-nehrdich commented on June 5, 2024

Namaste Avinash, Thank you very much for the link and the description to the data! I received the models from vikas because I was not able to compute them myself in reasonable time. If you are interested I can share them with you, I also have the results of the tests. I think performance is already very very good, one problem though is that if the segmenter is running on new unseen data it mostly get things right (I guess ~70% of the cases), but if it goes wrong it sometimes really badly wrong, for example: --- dharmanairātmyajñānād api jñeyāvaraṇapratipakṣatvāt jñeyāvaraṇaṃ prahīyate dharma nairātma yajñānāt api jñeya āvaraṇa pratipakṣatvāt jñeya āvaraṇīyate atha vā dharmapudgalābhiniviṣṬāś cittamātraṃ yathābhūtaṃ na jānantīty ato atha vā dharma apuda laniviṣṬā citta mātram yathā bhūtam na jānan iti ataḥ --- (this are just extreme cases, from my impression the majority of the sentences are split quite nicely.) I guess this problem is because sentencepiece was set up only with an 8k dictionary, increase to 32k will propably help. And also sentencepiece must be run with a huge collection of sanskrit etexts (like gretil) and not just on DCS data in order to make the model split better on unseen data. Finally I think that Vikas as far as I can see only used some 7mb of training data from the DCS (7mb each unsplit/split sentences) and I think we should be able to use more, at least DCS provides more than this. I just bought a computer because my laptop is not fast enough to do serious deep learning. The problem is that my GPU seems to not have enough RAM to work on this. With Vikas model I am always running out of memory. I already tried to reduce the batch size, but thats not yet helping at the moment. I wonder how to solve this problem, but then i am also very new to deep learning and it takes time to understand how this is working. Avinash Varna writes:

…

Namaste Sebastian The raw data is available on the [DCS website](http://kjc-sv013.kjc.uni-heidelberg.de/dcs/index.php) and has been scraped as part of this project - https://github.com/sanskrit-coders/dcs-scraper/. If you are using Python, I have created a python wrapper here - https://github.com/avinashvarna/dcs_wrapper which you can install using `pip install dcs_wrapper` (https://pypi.python.org/pypi/dcs-wrapper) How is the training going? Have you been able to train the model from the sandhi splitting project ?

-- Sebastian Nehrdich Hamburg, Germany

from dcs_experiments.

avinashvarna commented on June 5, 2024

Would you mind posting what the expected splits of the two sentences are, so that I can compare them with what the sanskrit_parser project would currently output?

If you can share the trained models, then sure, I would love to play with them. As you point out, the accuracy will probably improve with more data and more training. As far as I am aware, GRETIL does not provide segmentation/analysis. Is that correct?

Training these complex models with several stacked layers of recurrent networks is indeed compute and memory intensive. You could try using some cloud based services to see if they work better.

from dcs_experiments.

sebastian-nehrdich commented on June 5, 2024

Hey Avinash, Here is the model: https://wetransfer.com/downloads/2cc239a71f4bd1a12133843d11287aa920180217073307/4fc960915133706470af0efd7f7d484220180217073307/398393 However I fail to train this model my PC, unless one has a supercomputer it does not seem to be possible to train such a large network at home. I am currently looking into this: https://github.com/OpenNMT/OpenNMT-tf I think that it should be possible to train such a sequence to sequence mapping by using not two different languages but the sentences and their respective word roots (same setup you did within the word2vec trials you uploaded on github). Together with preprocessing the data with sentencepiece I have the feeling that this can turn out to be quite powerful. Regarding the code of Vikas, I have to admit I do not understand much of it but it just seems to be way too heavy at the moment. When I train it on CPU one epoch takes roughly 5 hours... With best wishes and let's stay in contact! Sebastian

…

On Fri, Feb 16, 2018 at 8:21 PM, Avinash Varna ***@***.***> wrote: Would you mind posting what the expected splits of the two sentences are, so that I can compare them with what the sanskrit_parser project would currently output? If you can share the trained models, then sure, I would love to play with them. As you point out, the accuracy will probably improve with more data and more training. As far as I am aware, GRETIL does not provide segmentation/analysis. Is that correct? Training these complex models with several stacked layers of recurrent networks is indeed compute and memory intensive. You could try using some cloud based services to see if they work better. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#1 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AR4vmO3HdF2IIJ_qZLjaKYj9ted4vkbtks5tVimtgaJpZM4SGtgt> .

from dcs_experiments.

sebastian-nehrdich commented on June 5, 2024

Hey Avinash, I hope this second message is not bothering too much. I have been rather successful with OpenNMT and within a few hours training on the data that Vikas used I got some usable result. Now I wonder if we can get more unsplit<>split data pairs than just the 7mb that Vikas used. I looked into the dcs_wrapper (thanks for the great and useful script!). While it gives us some 60mb of sentence<>root-pairs (which is also great and is certainly worth to train on), I do not see any way to extract unsplit<>split data. Or am I missing something? I imagine training on some more MB could increase the precision by a magnitude! With best wishes and thank you so far, Sebastian On Sat, Feb 17, 2018 at 5:14 AM, Sebastian Nehrdich <[email protected]

…

wrote: Hey Avinash, Here is the model: https://wetransfer.com/downloads/2cc239a71f4bd1a12133843d11287a a920180217073307/4fc960915133706470af0efd7f7d484220180217073307/398393 However I fail to train this model my PC, unless one has a supercomputer it does not seem to be possible to train such a large network at home. I am currently looking into this: https://github.com/OpenNMT/OpenNMT-tf I think that it should be possible to train such a sequence to sequence mapping by using not two different languages but the sentences and their respective word roots (same setup you did within the word2vec trials you uploaded on github). Together with preprocessing the data with sentencepiece I have the feeling that this can turn out to be quite powerful. Regarding the code of Vikas, I have to admit I do not understand much of it but it just seems to be way too heavy at the moment. When I train it on CPU one epoch takes roughly 5 hours... With best wishes and let's stay in contact! Sebastian On Fri, Feb 16, 2018 at 8:21 PM, Avinash Varna ***@***.***> wrote: > Would you mind posting what the expected splits of the two sentences are, > so that I can compare them with what the sanskrit_parser project would > currently output? > > If you can share the trained models, then sure, I would love to play with > them. As you point out, the accuracy will probably improve with more data > and more training. As far as I am aware, GRETIL does not provide > segmentation/analysis. Is that correct? > > Training these complex models with several stacked layers of recurrent > networks is indeed compute and memory intensive. You could try using some > cloud based services to see if they work better. > > — > You are receiving this because you authored the thread. > Reply to this email directly, view it on GitHub > <#1 (comment)>, > or mute the thread > <https://github.com/notifications/unsubscribe-auth/AR4vmO3HdF2IIJ_qZLjaKYj9ted4vkbtks5tVimtgaJpZM4SGtgt> > . >

from dcs_experiments.

avinashvarna commented on June 5, 2024

Namaste Sebastian,

Thanks for sharing the model. I will play around with it, but just looking at the file-size, you are possibly right that the model has quite a lot of parameters and would require a lot of computation and memory for training. I wonder whether it can be suitably trained using such a small dataset (the number of parameters may be much larger than the data available), but I have not looked at it in detail yet.

Glad to hear that you have made good progress using the openNMT toolkit. Unfortunately, the public DCS database does not directly give us sentence -> sandhi-split data. I had thought about this previously and the only way I could think of to get this using the available data is to use the grammatical annotations (e.g. rAma [a. p. m.]) to infer the form in the sentence (rAmAn). This would require using some library that can be used to lookup the form given the root and annotation. I never got around to implementing that, but it might be possible using some existing APIs/databases (e.g. INRIA/UoHyd). I will give it some more thought and see if there are any other options, but for now, I am not aware of this data being available publicly. You could always contact Prof. Hellwig and see if he has this data and whether he would be willing to share it.

from dcs_experiments.

sebastian-nehrdich commented on June 5, 2024

Namaste Avinash, Thank you for the reply. Yesterday I trained the openNMT-based code on the same data for about 8h and got the loss down somewhere to 0.03, I think if one continues to train that model the result will be quite similar to that of Vikas. I gonna contact Oliver Hellwig, in case he has the data and can share it that would be the best. Yes reverse-engineering the sandhi based on the DCS data seems to be not impossible but I imagine this to be extremely tedious work. Currently out of curiosity I am training a network directly with the stems as labels, intuitively I guess this is not how it should be done but since we have no better data at the moment I thought to at least give it a try. I am even thinking of generating some fake-sandhi-joins out of random data (since we have all the data from inria and code to process sandhi joining, it would not be a big deal to just randomly generate such data). But then I am not sure if that would be very helpful to our problem, what do you think?

…

On Sat, Feb 17, 2018 at 9:30 PM, Avinash Varna ***@***.***> wrote: Namaste Sebastian, Thanks for sharing the model. I will play around with it, but just looking at the file-size, you are possibly right that the model has quite a lot of parameters and would require a lot of computation and memory for training. I wonder whether it can be suitably trained using such a small dataset (the number of parameters may be much larger than the data available), but I have not looked at it in detail yet. Glad to hear that you have made good progress using the openNMT toolkit. Unfortunately, the public DCS database does not directly give us sentence -> sandhi-split data. I had thought about this previously and the only way I could think of to get this using the available data is to use the grammatical annotations (e.g. rAma [a. p. m.]) to infer the form in the sentence (rAmAn). This would require using some library that can be used to lookup the form given the root and annotation. I never got around to implementing that, but it might be possible using some existing APIs/databases (e.g. INRIA/UoHyd). I will give it some more thought and see if there are any other options, but for now, I am not aware of this data being available publicly. You could always contact Prof. Hellwig and see if he has this data and whether he would be willing to share it. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#1 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AR4vmG-gs0Qac5Z_vAz74k-fj0KxkitKks5tV4tWgaJpZM4SGtgt> .

from dcs_experiments.

avinashvarna commented on June 5, 2024

Namaste Sebastian,

Please do let me know if you get the split data from Prof. Hellwig. It would be a very useful resource for training these kinds of models. I spent some time over the past week investigating what it would take to reverse-engineer the sandhi split, and did some experiments. Due to the way the data is annotated in the DCS, there are cases where it may not be possible to directly figure out what the right form is without a lot of extra effort/manual review. It would be good to avoid this if not necessary.

I think training a network directly to output the stems and morphological information is a valid approach, since that may be the ultimate goal. After all, this is what POS/tagging networks would do. If you have any results on the training, please do share them. Did you create a custom model, or are you using one of the built-in openNMT models such as nmt_small/nmt_medium?

Regarding using sandhi-joins on random data for training, my opinion would be that it would have limited use. Even if the network learns everything perfectly, it would just learn the rules of splitting sandhis between arbitrary words, but we already have different libraries to do that. The difficulty is that sandhi splitting can have multiple valid results, and the goal is to choose the most probable split based on the language model. If the network is trained on "real" data, it can learn how to use contextual information to generate the most likely split, which the basic sandhi splitting libraries currently don't do.

from dcs_experiments.

sebastian-nehrdich commented on June 5, 2024

Namaste Avinash,

Sorry for my slow reply, I was on a meditation retreat in the past days.
I contacted Dr. Hellwig (just to be precise, as far as my knowledge goes he does not hold the position of a professor yet) and his reply does not show indication that he is willing to release this data in an open source way (yet). However he said he is himself working on a similar segmenter and plans to publish this in a paper in the near future. There might be also some opportunity of cooperation between him and people working at the University of Hamburg, but I guess that in this case the data will remain closed within the academic circle. I can understand it because a lot of hard work has gone into it.
So for the time being on the regarding the open source projects what we can do is limited to reverse engineering I think. And maybe thinking about ways to generate data. I agree that random generation of joined sandhi data is of limited use because the network will just learn the rules in that case. But I can imagine that training the network on the limited set of data that we have, see where it fails -> try to understand the fails and then generate some data to cover these cases would be a not too unrealistic way to tackle this problem. One big issue in my eyes with the data used by Vikas Reddy etc. is the small vocab size. Only 7mb of largely classical (mainstream) Sanskrit is prone to problems if applied on less classical texts. They claim 90% on their test data (by the way their paper is out now: https://arxiv.org/abs/1802.06185) and I believe that, but when I run it on my own test data (largely scholastic buddhist sanskrit from the 6th century) I have the subjective impression that precision is lower.
Curiously the problem is not even wrong split of sandhi, that works fine, but sometimes whole words get exchanged with wrong ones (which I think is due to the small vocab size). However I do not understand neuronal networks well enough yet to explain for these problems.
Regarding my own atempts: I am going to build a new stemmer based on Dr. Hellwigs data and this time using keras as the framework. As soon as that works I will upload code & results here. I think it is promising, some 60mb or so that we have as sentence pairs is quite ok.
I can continue my work only after next week because I am currently travelling and don't have access to a machine with a GPU. So right now I am planning the code and then see next week how it performs.

from dcs_experiments.

vinayakumarr commented on June 5, 2024

@sebastian-nehrdich
@avinashvarna

Namaste Avinash and Namaste sebastian

I am a beginner, detailed profile available http://vinayakumarr.github.io. I like to learn new things and i just started working on the same project. I have tried to run the https://github.com/cvikasreddy/skt
It is taking too much time even for running one epoch. It would be helpful if you can share the trained model for https://github.com/cvikasreddy/skt and https://github.com/avinashvarna/sanskrit_nmt

from dcs_experiments.

vinayakumarr commented on June 5, 2024

Yes you are correct Avinash

Sebastian has mentioned regarding the trained model of skt. If you have please share it.

from dcs_experiments.

sebastian-nehrdich commented on June 5, 2024

@vinayakumarr I do have the trained model for https://github.com/cvikasreddy/skt, but I would like to mention that both perfomance-wise as well as quality-wise the approach of the paper by Oliver Hellwig and me is to be preferred (I think you already discovered it judging the mails you wrote Oliver and me): https://github.com/OliverHellwig/sanskrit/blob/master/papers/2018emnlp/sandhi-rnn-hellwig-nehrdich.pdf For both the models cvikasreddy as well as the transformer one will need to have something like a Maxwell Titan X in order to work seriously with it. So even if the trained models are availabe, one will face the problem that inference (especially in the case of the transformer) will require huge amounts of resources as well. However the code of Oliver is able to give comparable and in many cases better results will still being able to run on a small GPU or maybe even CPU (i haven't tried the later option yet). Just to give a figure, memory consumption might be 10 times less than the transformer, not to talk about the time-consumption on inference. I think Oliver's code is about 30-60 times faster on a Maxwell Titan X than the transformer. Regarding quality, I think it is the best we can get at this time (due to the large training set of 500,000 sentence-pairs, which is also publicly available on github). 2018年11月21日(水) 4:01 Vinayakumar R <[email protected]>:

…

Yes you are correct Avinash Sebastian has mentioned regarding the trained model of skt. If you have please share it. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AR4vmCf8BaKUNi527oFBIb_dSBPodhZGks5uxFGTgaJpZM4SGtgt> .

from dcs_experiments.

avinashvarna commented on June 5, 2024

I was surprised that a bidirectional RNN would require less computational resources than the transformer, since the transformer was designed to improve on the latency bottlenecks of the bidirectional RNN. However, I saw from the paper that the transformer was run with default parameters and no attempt was made to optimize the training efficiency. I suspect that the default model is too big for the task at hand. The model here - https://github.com/avinashvarna/sanskrit_nmt/tree/master/sandhi_split/transformer_small_vocab is also a smaller version of the transformer, since the default parameters were an overkill.

Since I have the setup ready, I was planning to use the data from https://github.com/OliverHellwig/sanskrit/blob/master/papers/2018emnlp/ to retrain and compare, but I see that the accuracy evaluation is built into the model training and there is no easy script to compare the results from different models. It would have been nice if there was a separate script, similar to (sacrebleu)[https://pypi.org/project/sacrebleu/] that would use a reference to compute the accuracy of the output of a model. @sebastian-nehrdich , do you happen to have such a script handy?

from dcs_experiments.

avinashvarna commented on June 5, 2024

On a side-note, getting good quality GPUs in the cloud is not too difficult nowadays. GCE offers free credit for new users, and there are GPU backends available in colabs. E.g. see here

P.S. I am not affiliated with Google :)

from dcs_experiments.

sebastian-nehrdich commented on June 5, 2024

Dear Avinash Varna, Such a script exists, unfortunately I cannot boot my workstation at the moment and thus only have an older backup of the script from my external harddrive. have a look at this: https://pastebin.com/39RVssnG you can compare two files with it, first should be the output of the model and the second the gold-file if I remember it correctly. Also I think this script has a bug somewhere around the part where Precision/Recall are calculated (we fixed it later, but I don't remember clearly what went wrong atm), but the overall accuracy should come out right. I did try a lot to tune the transformer, but there are some tricky issues. It is for example not easy to reduce the vocab-size to something smaller than 5k because there is a bug in the sentencepiece of the transformer. I am sure it is possible to bypass that by using googles sentencepiece, but at the time of us working on it there was not enough time to try this out. Running it on char-level did not produce usable results. I am sure that a reasonable small vocab-size, for example 1k or 500, could bring very good results. 2018年12月16日(日) 6:19 Avinash Varna <[email protected]>:

…

On a side-note, getting good quality GPUs in the cloud is not too difficult nowadays. GCE offers free credit for new users, and there are GPU backends available in colabs. E.g. see here <https://medium.com/deep-learning-turkey/google-colab-free-gpu-tutorial-e113627b9f5d> P.S. I am not affiliated with Google :) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AR4vmIr_IUdb3H3gOSojC3VV4tmdhQTCks5u5Wd8gaJpZM4SGtgt> .

from dcs_experiments.

raw data of the dcs? about dcs_experiments HOT 16 OPEN

Comments (16)

Related Issues (1)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent