stas's People

Contributors

xingxingzhang, xssstory

stas's Issues

Many Problems Encountered During Execution

Hello author, I recently came across your paper and have been attempting to reproduce your process. However, I have encountered numerous unexpected problems and hope that you can assist me.

I do not have much personal experience in this area.

After downloading the model, I proceeded to the Evaluation step, but ran into problems when trying to run "bash extract_nyt.sh".

The main problem I am facing now is that execution produces many errors related to the 'data' folder. Could you please tell me specifically what the folder should contain, especially the required txt files and dataset? (I only found one txt file there.)

Furthermore, some of the code relies on APIs that have since been deprecated, so the relevant library has to be downgraded for it to run. (I apologize for forgetting which library and the exact error message; I believe the error occurred in ensemble.py.)

I understand that my environment differs from yours, as I am running your code on Windows 10 with Cygwin64, which may cause some of these problems.

However, I have reviewed many resources and still do not understand what the 'data' folder should contain.

Could you please guide me? Thank you.

Dataset Preparation

It is mentioned in the data preprocessing step that every article/summary is split by the <S_SEP> token.
Did you mean like this:

<S_SEP>sentence 1<S_SEP>sentence 2<S_SEP>

It would have been better if you had provided example data.
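For concreteness, here is a minimal Python sketch of the format I am assuming: one article per line, with sentences joined by the <S_SEP> token. The spacing and the leading/trailing separators are exactly the part I am unsure about.

```python
# Sketch of the line format I am assuming for a *.article file:
# one article per line, sentences joined by <S_SEP>.
# Whether leading/trailing <S_SEP> tokens are required is my open question.
sentences = ["sentence 1", "sentence 2", "sentence 3"]

line = " <S_SEP> ".join(sentences)
print(line)  # sentence 1 <S_SEP> sentence 2 <S_SEP> sentence 3

with open("train.article", "w", encoding="utf-8") as f:
    f.write(line + "\n")
```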

Questions about Training Program and Input Data.

Hello, I have reached the preparation stage for training, and the program seems to be running normally now. Thank you very much for your assistance.

Currently, I am preparing to train on other languages and have found a dictionary on my own.

However, I have not been able to work out how your software reads its dictionary; in particular, I am not sure where the tokens are read from.
The file 'roberta-base-vocab.json' does not specify the input for the vectors, so I suspect it cannot be modified independently.
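For reference, this is how I am reading the file. As far as I can tell it is only a token-to-id mapping, like the standard RoBERTa BPE vocabulary, and contains no vectors; my assumption is that the embedding vectors live in the model checkpoint instead.

```python
import json

# roberta-base-vocab.json appears to be a plain token -> integer id mapping
# (the standard RoBERTa/GPT-2 BPE vocab layout); it contains no vectors.
with open("roberta-base-vocab.json", encoding="utf-8") as f:
    vocab = json.load(f)

print(len(vocab))          # vocabulary size
print(vocab.get("hello"))  # id of a token, or None if it is not in the vocab
```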

Also, I will have multiple articles to train on, and I would like to ask whether the program can read texts from a folder. It seems possible from the code, but I want to confirm that I have not misunderstood.

I am planning to train on my own data starting from the pre-trained model you provided, but my texts currently do not have summaries. Can I still train on them?

(Thank you again for your help.)

Order of files in ensemble results and discrepancy in number of lines.

Hi, thanks for the great work.
I have a question about the ensemble results. There are PacSum results and your results (STAS), and each F or T represents a sentence in some file. Why do the PacSum lines contain more labels than the STAS lines? For example, for the first file the PacSum results are:
Predicted Labels: T T F F T F F F F F F F F F F F F F F F F F F F F F F F F F
while the STAS results are:
Predicted Labels: T T F T F F F F F F F F F F F F F

One more question: is there a way to tell which file each line represents? Is there an order you used?

Thanks a lot!

Guidance on using STAS for inference

Hello, is it possible to provide some guidance (either here or in the README) on how to use the model for summarization, using either the provided checkpoints or one's own checkpoints generated during training?

I assume one would have to load the checkpoints using torch.load() and initialize a model using the respective encoder class, right?
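Something along these lines is what I have in mind. This is only a sketch under my own assumptions: the checkpoint key names follow the usual fairseq layout, and the encoder class and task object are hypothetical placeholders, not names from this repo.

```python
import torch

# Inspect the checkpoint first; fairseq-style checkpoints typically store
# the weights under a "model" key alongside the training args.
state = torch.load("checkpoint65.pt", map_location="cpu")
print(state.keys())  # e.g. dict_keys(['args', 'model', ...]) -- my assumption

# Hypothetical: build the encoder and load the weights.
# model = RespectiveEncoderClass.build_model(state["args"], task)
# model.load_state_dict(state["model"])
# model.eval()
```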

There is an issue with setup.py.

When I tried to run setup.py directly from cmd, I encountered the error message "UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 969: illegal multibyte sequence".

To fix this, change "with open('README.md') as f:" to "with open('README.md', encoding='utf-8') as f:" on line 16, so that the README is read as UTF-8 instead of the Windows default codec.
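Concretely, the change on line 16 would look like this (the surrounding assignment is my guess at the context; only the open() call matters):

```python
# Before: uses the Windows default codec (GBK on a Chinese locale), which
# fails on non-GBK bytes in README.md.
with open('README.md') as f:
    long_description = f.read()

# After: read the file explicitly as UTF-8.
with open('README.md', encoding='utf-8') as f:
    long_description = f.read()
```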

Loading Pre-trained Models

As mentioned in our previous communication, I am preparing for training.
I would like to express my gratitude for your guidance.

My training objective is cross-lingual training.

Currently, I have modified the get-data-bpe.sh script, adding '--target-lang $tgtdict' and '--joined-dictionary' to set the training dictionary. A line of the dictionary looks like this: 2 0.302642 -0.312781 ... ...
I would like to confirm with you that it is correct to place my prepared data in the train.article, valid.article, and test.article files, where each line represents one article.
Is this the correct setup?

Next, I modified train_nyt.sh to try loading the pre-trained model checkpoint65.pt that you provided. Although the train.py help output mentions an option for loading a model ('--restore-file'), when I issued the command it reported that the option does not exist.
It seems that train_nyt.sh does not support this kind of continued training and instead starts from the initial RoBERTa pre-training.
Could you therefore explain how you continue training from a checkpoint? Could you provide an example command?

Thank you very much for your guidance and assistance! Your support is crucial to my training progress and problem-solving. I truly appreciate your patience and expertise.
Thank you!

The training data

The dataset is not available.
Could you please provide a link to download the data?

Dataset Preparation for STAS

It is mentioned in the readme.txt file that we should end up with only 6 files, i.e. train.article, train.summary, valid.article, valid.summary, test.article, and test.summary. But we have 92k files for the CNN dataset. I have already separated the summaries for all of these files.

How should we split the data into the 6 files mentioned above, i.e. train.article, train.summary, valid.article, valid.summary, test.article, and test.summary? Are we supposed to merge the individual files? If so, how? What is the format or the criterion?
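To make the question concrete, here is a Python sketch of what I imagine the merge would look like. Everything here is an assumption on my part: the 'articles' and 'summaries' directories, the 90/5/5 split, and joining the sentences of each document with <S_SEP>.

```python
import os
import random

# Hypothetical layout: articles/ and summaries/ hold one document per file,
# one sentence per line, with matching file names.
files = sorted(os.listdir("articles"))
random.seed(0)
random.shuffle(files)

n = len(files)
splits = {
    "train": files[:int(0.9 * n)],
    "valid": files[int(0.9 * n):int(0.95 * n)],
    "test":  files[int(0.95 * n):],
}

for split, names in splits.items():
    with open(f"{split}.article", "w", encoding="utf-8") as fa, \
         open(f"{split}.summary", "w", encoding="utf-8") as fs:
        for name in names:
            with open(os.path.join("articles", name), encoding="utf-8") as f:
                article_sents = [l.strip() for l in f if l.strip()]
            with open(os.path.join("summaries", name), encoding="utf-8") as f:
                summary_sents = [l.strip() for l in f if l.strip()]
            # One document per line, sentences joined by <S_SEP>.
            fa.write(" <S_SEP> ".join(article_sents) + "\n")
            fs.write(" <S_SEP> ".join(summary_sents) + "\n")
```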

Thanks in advance.
