stas's People

Contributors

xingxingzhang, xssstory

stas's Issues

Many Problems Encountered During Execution

Hello author, I recently came across your paper and have been attempting to reproduce your process. However, I have encountered numerous unexpected problems and hope that you can assist me.

I do not have much personal experience in this area.

After downloading the model, I proceeded to the Evaluation step, but ran into problems when trying to run "bash extract_nyt.sh".

The main problem I am facing now is that execution produces many errors related to the 'data' folder. Could you please tell me specifically what the folder should contain, especially the required txt files and dataset? (I only found one txt file there.)

Furthermore, some of the code relies on APIs that have since been deprecated, so the relevant library has to be downgraded for it to run. (I apologize for forgetting which library and the exact error message; I believe the error occurred in ensemble.py.)

I understand that my environment differs from yours, as I am running your code on Windows 10 with Cygwin64, which may cause some of these problems.

However, I have reviewed many resources and still do not understand what the 'data' folder should contain.

Could you please guide me? Thank you.

Dataset Preparation

It is mentioned in the data preprocessing step that every article/summary is split by the <S_SEP> token.
Did you mean like this:

<S_SEP>sentence 1<S_SEP>sentence 2<S_SEP>

It would have been better if you had provided example data.
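For concreteness, here is a minimal Python sketch of the format I am assuming: one article per line, with sentences joined by the <S_SEP> token. The spacing and the leading/trailing separators are exactly the part I am unsure about.

```python
# Sketch of the line format I am assuming for a *.article file:
# one article per line, sentences joined by <S_SEP>.
# Whether leading/trailing <S_SEP> tokens are required is my open question.
sentences = ["sentence 1", "sentence 2", "sentence 3"]

line = " <S_SEP> ".join(sentences)
print(line)  # sentence 1 <S_SEP> sentence 2 <S_SEP> sentence 3

with open("train.article", "w", encoding="utf-8") as f:
    f.write(line + "\n")
```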

Questions about Training Program and Input Data.

Hello, I have reached the preparation stage for training, and the program seems to be running normally now. Thank you very much for your assistance.

Currently, I am preparing to train on other languages and have found a dictionary on my own.

However, I have not been able to work out how your software reads its dictionary; in particular, I am not sure where the tokens are read from.
The file 'roberta-base-vocab.json' does not specify the input for the vectors, so I suspect it cannot be modified independently.
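For reference, this is how I am reading the file. As far as I can tell it is only a token-to-id mapping, like the standard RoBERTa BPE vocabulary, and contains no vectors; my assumption is that the embedding vectors live in the model checkpoint instead.

```python
import json

# roberta-base-vocab.json appears to be a plain token -> integer id mapping
# (the standard RoBERTa/GPT-2 BPE vocab layout); it contains no vectors.
with open("roberta-base-vocab.json", encoding="utf-8") as f:
    vocab = json.load(f)

print(len(vocab))          # vocabulary size
print(vocab.get("hello"))  # id of a token, or None if it is not in the vocab
```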

Also, I will have multiple articles to train on, and I would like to ask whether the program can read texts from a folder. It seems possible from the code, but I want to confirm that I have not misunderstood.

I am planning to train on my own data starting from the pre-trained model you provided, but my texts currently do not have summaries. Can I still train on them?

(Thank you again for your help.)

Order of files in ensemble results and discrepancy in number of lines.

Hi, thanks for the great work.
I have a question about the ensemble results. There are PacSum results and your results (STAS), and each F or T represents a sentence in some file. Why do the PacSum lines contain more labels than the STAS lines? For example, for the first file the PacSum results are:
Predicted Labels: T T F F T F F F F F F F F F F F F F F F F F F F F F F F F F
while the STAS results are:
Predicted Labels: T T F T F F F F F F F F F F F F F

One more question: is there a way to tell which file each line represents? Is there an order you used?

Thanks a lot!

Guidance on using STAS for inference

Hello, is it possible to provide some guidance (either here or in the README) on how to use the model for summarization, using either the provided checkpoints or one's own checkpoints generated during training?

I assume one would have to load the checkpoints using torch.load() and initialize a model using the respective encoder class, right?
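Something along these lines is what I have in mind. This is only a sketch under my own assumptions: the checkpoint key names follow the usual fairseq layout, and the encoder class and task object are hypothetical placeholders, not names from this repo.

```python
import torch

# Inspect the checkpoint first; fairseq-style checkpoints typically store
# the weights under a "model" key alongside the training args.
state = torch.load("checkpoint65.pt", map_location="cpu")
print(state.keys())  # e.g. dict_keys(['args', 'model', ...]) -- my assumption

# Hypothetical: build the encoder and load the weights.
# model = RespectiveEncoderClass.build_model(state["args"], task)
# model.load_state_dict(state["model"])
# model.eval()
```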

There is an issue with setup.py.

When I tried to run setup.py directly from cmd, I encountered the error message "UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 969: illegal multibyte sequence".

To fix this, change "with open('README.md') as f:" to "with open('README.md', encoding='utf-8') as f:" on line 16, so that the README is read as UTF-8 instead of the Windows default codec.
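Concretely, the change on line 16 would look like this (the surrounding assignment is my guess at the context; only the open() call matters):

```python
# Before: uses the Windows default codec (GBK on a Chinese locale), which
# fails on non-GBK bytes in README.md.
with open('README.md') as f:
    long_description = f.read()

# After: read the file explicitly as UTF-8.
with open('README.md', encoding='utf-8') as f:
    long_description = f.read()
```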

Loading Pre-trained Models

As mentioned in our previous communication, I am preparing for training.
I would like to express my gratitude for your guidance.

My training objective is cross-lingual training.

Currently, I have modified the get-data-bpe.sh script, adding '--target-lang $tgtdict' and '--joined-dictionary' to set the training dictionary. A line of the dictionary looks like this: 2 0.302642 -0.312781 ... ...
I would like to confirm with you that it is correct to place my prepared data in the train.article, valid.article, and test.article files, where each line represents one article.
Is this the correct setup?

Next, I modified train_nyt.sh to try loading the pre-trained model checkpoint65.pt that you provided. Although the train.py help output mentions an option for loading a model ('--restore-file'), when I issued the command it reported that the option does not exist.
It seems that train_nyt.sh does not support this kind of continued training and instead starts from the initial RoBERTa pre-training.
Could you therefore explain how you continue training from a checkpoint? Could you provide an example command?

Thank you very much for your guidance and assistance! Your support is crucial to my training progress and problem-solving. I truly appreciate your patience and expertise.
Thank you!

The training data

The dataset is not available.
Could you please provide a link to download the data?

Dataset Preparation for STAS

It is mentioned in the readme.txt file that we should end up with only 6 files, i.e. train.article, train.summary, valid.article, valid.summary, test.article, and test.summary. But we have 92k files for the CNN dataset. I have already separated the summaries for all of these files.

How should we split the data into the 6 files mentioned above, i.e. train.article, train.summary, valid.article, valid.summary, test.article, and test.summary? Are we supposed to merge the individual files? If so, how? What is the format or the criterion?
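To make the question concrete, here is a Python sketch of what I imagine the merge would look like. Everything here is an assumption on my part: the 'articles' and 'summaries' directories, the 90/5/5 split, and joining the sentences of each document with <S_SEP>.

```python
import os
import random

# Hypothetical layout: articles/ and summaries/ hold one document per file,
# one sentence per line, with matching file names.
files = sorted(os.listdir("articles"))
random.seed(0)
random.shuffle(files)

n = len(files)
splits = {
    "train": files[:int(0.9 * n)],
    "valid": files[int(0.9 * n):int(0.95 * n)],
    "test":  files[int(0.95 * n):],
}

for split, names in splits.items():
    with open(f"{split}.article", "w", encoding="utf-8") as fa, \
         open(f"{split}.summary", "w", encoding="utf-8") as fs:
        for name in names:
            with open(os.path.join("articles", name), encoding="utf-8") as f:
                article_sents = [l.strip() for l in f if l.strip()]
            with open(os.path.join("summaries", name), encoding="utf-8") as f:
                summary_sents = [l.strip() for l in f if l.strip()]
            # One document per line, sentences joined by <S_SEP>.
            fa.write(" <S_SEP> ".join(article_sents) + "\n")
            fs.write(" <S_SEP> ".join(summary_sents) + "\n")
```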

Thanks in advance.
