argument-mining's People

Contributors

fededagos, namiyousef, olicm0601, valerief412


argument-mining's Issues

TUDarmstadt training difficulty

A model was trained on the TUDarmstadt dataset with the following parameters:

JOB PARAMETERS:

MODEL_NAME: google/bigbird-roberta-base
MAX_LENGTH: 1024

DATASET: TUDarmstadt
STRATEGY: standard_io

EPOCHS: 60
BATCH_SIZE: 8
VERBOSE: 2
SAVE_FREQ: 20

INFERENCE RESULTS:

macro_f1: 0.0
macro_f1 with nan: 0.0

DETAILED RESULTS:

label           f1
O               0.0
I-MajorClaim    0.0
I-Claim         0.0
I-Premise       0.0

The loss starts at 0.486 and across all epochs only goes down to 0.441! This is not a huge improvement, and it is likely why the F1 scores are not that great. We need to look into this a bit more...

Find references / justifications for project

  • Find references that justify / give a background on the history of labelling strategies for argument mining (or even span detection / entity recognition). E.g. Why are we interested in looking at the different labelling strategies?

  • Find references for how people in literature typically go back from tokens to words. Do they always use the first subtoken prediction? Are there pros and cons to using average / max?

F1 score evaluation minor bug

This is an easy fix, just adding it here so I don't forget. When batching to calculate F1 you cannot just average the F1s. You need to total the TP, FN and FP, and then compute F1 on the aggregates.
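
For reference, a minimal sketch of the fix (the function name and signature are hypothetical, not the actual code in evaluation.py): accumulate TP, FP and FN per label across batches and only compute F1 on the totals at the end.

import numpy as np

def batched_macro_f1(y_true_batches, y_pred_batches, labels):
    """Total TP/FP/FN per label across batches, then compute F1 on the aggregates."""
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}
    for y_true, y_pred in zip(y_true_batches, y_pred_batches):
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        for label in labels:
            tp[label] += int(np.sum((y_pred == label) & (y_true == label)))
            fp[label] += int(np.sum((y_pred == label) & (y_true != label)))
            fn[label] += int(np.sum((y_pred != label) & (y_true == label)))
    f1s = {}
    for label in labels:
        denominator = 2 * tp[label] + fp[label] + fn[label]
        f1s[label] = 2 * tp[label] / denominator if denominator else 0.0
    return f1s, sum(f1s.values()) / len(f1s)  # per-label F1s and macro F1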

Model Training on Colab

Train Models on Colab

This issue is to document how you can get started with training, saving, loading and running inference for any transformer-based model with the available datasets and/or processors.

Set up authentication to your GitHub

Since we are working with a private repository at the moment, it is not possible to easily install this package. There are 3 ways of running our code on Colab:

  • copy-paste all of the relevant code into Colab: this is not recommended because it makes the notebooks really long and unmaintainable. It makes versioning almost impossible, and every change will require a full refactoring again.
  • zip the package and unzip it on Colab: I've tried this before, and though it works most of the time it can sometimes be a bit confusing, with zip files getting misplaced, misnamed, etc., making it difficult to know why something went wrong
  • install the package as a private repository: this is very similar to running pip install for any other package, except that we need to authenticate before running it. This basically means that within Colab, we'd be running a private pip install to install the package directly from the develop branch of argument-mining. Thus, any new changes that we make and push can automatically be loaded by running the private install again.

Since our code should be ready for testing, we are opting for the third method. If you want to develop the code on Colab, then please reach out to me in private and I can help with setting that up.

Now, in order to authenticate, you will need a GitHub access token. Follow these instructions to create an access token. Save this access token in a .json file called github_config.json that has the following format:

{
   "username": "namiyousef",
   "access_token": "YOURACCESSTOKEN"
}

Make sure that the username is MY username and NOT yours. This is because we will be installing a repository that is in my name.
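
For reference, a hedged sketch of what the private install could look like inside a Colab cell, assuming the github_config.json is already accessible (the file path below is a placeholder; the actual cell in End-to-end_GPU.ipynb may differ):

import json
import subprocess

# Load the GitHub credentials saved earlier (placeholder path: adjust to where you stored the file)
with open('/content/drive/MyDrive/github_config.json') as f:
    config = json.load(f)

# Install the package directly from the develop branch of the private repository
url = (
    f"git+https://{config['username']}:{config['access_token']}"
    "@github.com/namiyousef/argument-mining.git@develop"
)
subprocess.run(['pip', 'install', url], check=True)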

Add relevant data to Google Drive

Now, in order to access data, we need it to be accessible from within Colab. We can do this by storing things in Google Drive. Make sure that you store any data that you need within your personal Google Drive. In particular, store the github_config.json there and ONLY there. This is because you do NOT want it accessible to other people: anyone who obtains it will be able to access your account.

Now, in terms of data (e.g. data for the project), I mentioned above that you can store it in your own drive. This is OK, but since we already have a shared drive (https://drive.google.com/drive/folders/1XaMWpeoSq04BkVGt16aS9Gk7PBjMtirS) you can also store data there. Just make sure that you don't overwrite anything and that each folder has a readme.txt file explaining what is in it, so that we don't get lost.

In order to be able to access this shared folder programmatically, you will need to add it as a shortcut to your Google Drive. You can do this by right-clicking the shared folder and then clicking 'Add shortcut to Drive'.
[Screenshot: adding the shared folder as a shortcut to Google Drive]

You will now be able to access the shared folder programmatically from within Colab.
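
For completeness, mounting your Drive from within Colab uses the standard google.colab helper; the shortcut then shows up under MyDrive (the folder name below is an assumption):

from google.colab import drive

# Mount your Google Drive; Colab will ask you to authorise access
drive.mount('/content/drive')

# The shared folder shortcut is then reachable under MyDrive, e.g. (assumed name):
# /content/drive/MyDrive/argument-mining-data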

Open the repository and run models!

Now, open Colab in your browser. You will be faced with a default screen for selecting notebooks. Navigate to the GitHub tab and check the 'include private repos' checkbox. This will prompt you to log in to your GitHub account and authenticate. Then, from the repositories dropdown, find namiyousef/argument-mining and select the develop branch. Once you have done this, find the notebook at the path experiments/yousef/End-to-end_GPU.ipynb and select it.

[Screenshot: selecting the End-to-end_GPU.ipynb notebook from the GitHub tab in Colab]

In this notebook, configure the paths as appropriate (you will need to modify some other path variables along the way). Once you have done this, you will be able to run the notebook successfully.

Note

When you are done using the notebook, you can save a copy in your personal drive. You can also push it to GitHub, but please use a different path than experiments/yousef/ because that will change the notebook and I currently have it set up to work with my directories. I would recommend that you push the notebook to GitHub under experiments/{your_name}/{file_name} so that you can have it configured to how you want to use it, and also so you can have versioning on it.

Alternatively, you can save a copy in your personal drive (if you do this, the authentication might fail the next time you try to run it, so try to stick to GitHub wherever possible).

Report: datasets

This issue is to monitor the datasets for the purpose of report writing.

  • Make sure that we discuss the limitations of the data that we are using to test, e.g. very small, no other similar datasets, etc.

  • The following documents are helpful for the Darmstadt dataset: paper link, and annotation guidelines

Create working baseline

Not sure what to use, ideally would go for a bi-LSTM. Any thoughts?

A working bi-LSTM model has been created, but it needs to be integrated with the rest of the module and made more efficient

Determine performance parameters to use to compare experiments

So far we've based our evaluation schemes on how Kaggle expected us to evaluate. What else can we measure? Can we find references to support our decisions?

  • Think about using different thresholds when calculating the macro F1 score
  • Think about the utility of evaluating performance on the tensors?

Refactor code and add unittests

Checklist for version 0.2.0:

Unittests

  • Unittests for base processor logic
  • Unittests for specific processors using Mocks
  • Unittests for util functions
  • Integration tests for API
  • Integration tests for end-to-end training

Refactoring

  • Add docstrings to all files
  • Refactor config.py and improve hardcoded label maps
  • Refactor email functions into separate logging module (or project entirely)
  • Consider API as separate application?
  • Refactor data.py and remove archive datasets, add deprecation warnings
  • Refactor cluster run.py into main module

Create labelling schemes

This issue is to monitor the creation of methods to create the relevant data labelling schemes.

Success Criteria:

  • Functions that take the input text and give the corresponding outputs

  • Methods in the dataset class that allow us to apply the labels to subtokens

Curate Darmstadt dataset for our project

Previously we worked with the Kaggle PERSUADE dataset. Until we have permission to use it, we should refrain from doing so.

Alternative datasets that look similar are hard to come by.

Here is a dataset we can use, the Darmstadt dataset: https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2422
NOTE: the prepared data has been uploaded to this link here: https://drive.google.com/drive/u/0/folders/1kV_DXsvNDgtyV6suPyS6FfwhoRT-wxeY

This ticket is to document datasets that we can use and any associated processing that might be needed.

Things to pay attention to:

  • Is the dataset valid for what we are trying to do? How was it annotated?

  • Can we enhance the dataset and increase it in size? Will the small size affect our research?

  • Do we have other datasets that might be slightly similar but we could use?

Select models to run experiments on

Success Criteria:

  • Examples of models we can use to train, with a selected set of ideal hyperparameters to try

  • Should include references to papers as well

  • This should consider models with different tokenisers as well, related to #18

  • difference between roberta-base, bigbird-roberta-base, bigbird-roberta-large

Labels for CLS, SEP and PAD as well as X

  • Do CLS and SEP need to have separate labels from your training labels?
    Yes, these should be labelled as -100.

  • Does PAD need to have a separate label?
    No, this should be labelled as -100.

  • Does the attention mask basically ignore the effect of those things with attention mask 0, or is it still important?
    Yes, it does. You should set the attention mask to zero for PAD, but not for CLS and SEP, because they can contain important information about the training items. The idea is:
    for PAD: don't attend and don't compute loss
    for CLS/SEP: attend but don't compute loss
    There is a thread on the Hugging Face forums about this.

Using -100 for the CLS and SEP tokens is required because PyTorch's cross-entropy loss ignores targets with the value -100 by default. You could apply the same label to the subtokens as well if you wanted to ignore them.
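
A minimal sketch of why -100 works, using plain PyTorch (not tied to our training code): CrossEntropyLoss ignores any target equal to ignore_index, which defaults to -100, so CLS/SEP/PAD positions labelled -100 contribute nothing to the loss.

import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()  # ignore_index defaults to -100

# 5 token positions, 3 classes: [CLS], two real tokens, [SEP], [PAD]
logits = torch.randn(5, 3)
labels = torch.tensor([-100, 0, 2, -100, -100])  # only positions 1 and 2 are scored

loss = loss_fn(logits, labels)  # loss computed over the two non-ignored positions only

# attention mask: 0 for PAD (don't attend), 1 for CLS/SEP and real tokens (attend)
attention_mask = torch.tensor([1, 1, 1, 1, 0])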

predictionString starts at 0 not 1

First word should always be 0, NOT 1.
This bug exists in multiple places in the code, so you need to look into this.

As an improvement, set the start index as a global var in the config.py file

Places to look:
predStr, predictionString, range, etc.
Look into the Datasets, DataProcessors and functions in data.py and evaluation.py
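
A minimal sketch of the proposed improvement (PREDICTION_STRING_START_INDEX is a hypothetical name, not necessarily what ends up in config.py): define the start index once and build every prediction string from it.

# config.py (hypothetical global)
PREDICTION_STRING_START_INDEX = 0

# wherever prediction strings are built, e.g. in data.py
def get_prediction_string(num_words, start=PREDICTION_STRING_START_INDEX):
    """Word indices for a passage: the first word is 0, NOT 1."""
    return list(range(start, start + num_words))

assert get_prediction_string(3) == [0, 1, 2]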

Train models on the TUDarmstadt dataset

As a start, we will try the following experiments:

  • Training TUDarmstadt on bio, bieo and io. This will likely be done on Google Colab. The objective here is to debug the training script to make sure that training and inference are working correctly. This is tangentially related to #39 to see model performance during training
  • Training PersuadeProcessor on bio, bieo and io. This will be done using a combination of the cluster, to make sure that it will be ready for full training runs, and Google Colab, again for debugging purposes to see whether we get good results or not
  • For the above two tests, we will try using the RoBERTa and BigBird models. We will likely fix max_length=1024 across our experiments
  • Once the above are complete and we are confident that they are working, we will agree on epochs, batch_size, etc. and then run experiments for all the configurations that we need. This would be in the order of models x labelling schemes x agg strategies (see the sketch below)
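
A minimal sketch of that experiment grid (the aggregation strategy names are assumptions; the model and scheme names are taken from the issues above):

from itertools import product

models = ['roberta-base', 'google/bigbird-roberta-base']
labelling_schemes = ['io', 'bio', 'bieo']
agg_strategies = ['first', 'mean', 'max']  # assumed names for the token-to-word aggregation

# models x labelling schemes x agg strategies
experiments = [
    {'model_name': m, 'strategy': s, 'agg_strategy': a, 'max_length': 1024}
    for m, s, a in product(models, labelling_schemes, agg_strategies)
]
print(len(experiments))  # 18 configurations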

Dataset enhancement

Look into using the Trainer class from Hugging Face to streamline the train/validation/test process

Labelling schemes

As of right now, the BigBird model (loaded using AutoModelForTokenClassification) takes in inputs encoded with AutoTokenizer. When the model is trained, e.g. model(**inputs, labels=labels), the labels tensor must be the same size as the tensors in inputs. Does this always have to be the case?

Example

If I have a sentence "I am Yousef Nami" the corresponding labels (for standard NER) should be: ["O", "O", "B-PERSON", "I-PERSON"].

However, after tokenisation, the sentence becomes: ['[CLS]', '▁I', '▁am', '▁Y', 'ous', 'ef', '▁N', 'ami', '[SEP]'] and so BigBird expects the output to be something like this: ['O', 'O', 'O', 'B-PERSON', 'B-PERSON' OR 'I-PERSON', 'B-PERSON' OR 'I-PERSON', 'I-PERSON', 'I-PERSON', 'O'].

We need to answer the following:

  • Does the target variable size always have to match the embedding size? If so, why?
  • Which is the correct way of representing the target variables corresponding to tokenised entities, e.g. does ['Yousef', 'B-PERSON'] become [['▁Y', 'ous', 'ef'], ['B-PERSON', 'B-PERSON', 'B-PERSON']] or [['▁Y', 'ous', 'ef'], ['B-PERSON', 'I-PERSON', 'I-PERSON']]? (See the sketch after this list.)
  • Do the [CLS] and [SEP] variables turn into 'O'? What effect does this have on the classification?
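
A minimal sketch of the two common alignment options (illustrative only; this is not necessarily how our processors do it). It assumes the word_ids() output of a Hugging Face fast tokenizer called with is_split_into_words=True, where None marks special tokens: either propagate the word label to every subtoken (turning B- into I- after the first subtoken), or keep only the first subtoken's label and ignore the rest with -100.

def align_labels(word_ids, word_labels, label_all_subtokens=True):
    """Map word-level labels onto subtokens; -100 marks positions ignored by the loss."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)  # special tokens: [CLS], [SEP], [PAD]
        elif word_id != previous_word_id:
            aligned.append(word_labels[word_id])  # first subtoken keeps the word label
        elif label_all_subtokens:
            aligned.append(word_labels[word_id].replace('B-', 'I-'))  # continuation subtoken
        else:
            aligned.append(-100)  # ignore continuation subtokens entirely
        previous_word_id = word_id
    return aligned

# "I am Yousef Nami" -> ['[CLS]', '▁I', '▁am', '▁Y', 'ous', 'ef', '▁N', 'ami', '[SEP]']
word_ids = [None, 0, 1, 2, 2, 2, 3, 3, None]
word_labels = ['O', 'O', 'B-PERSON', 'I-PERSON']
print(align_labels(word_ids, word_labels))
# [-100, 'O', 'O', 'B-PERSON', 'I-PERSON', 'I-PERSON', 'I-PERSON', 'I-PERSON', -100]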

Create a working dataloader

Create a working dataloader that takes input text and prepares it for the model.

Criteria:

  • Needs to be parametrised so you can easily modify preprocessing settings
  • Needs to be fast for loading and processing data

Extend PersuadeProcessor to allow adversarial examples

  • Looked into creating a preprocessing stage using the prediction string... this does not work at all and there is a 26% discrepancy with the real data. Most of it is due to missed punctuation, but some of it is more severe
  • The next step should look more closely into using discourse_start and discourse_end, see the percentage error rate with that, and check whether the error examples can be fixed by hardcoding

TUDarmstadt get_tts

Requires an implicit path to train-test-split.csv, which means that you cannot use this method if you are doing .from_json. A hotfix for this should be easy, but moving forward we should look into #41 to make the design more robust

The hotfix is just to copy-paste the train-test-split.csv to the relevant location. Changing from bug to enhancement since the code isn't broken.

Re-visit DataProcessor design

The current data processor was created quickly, and as a result a number of shortcuts were made. This makes it a bit annoying to use from a user-experience POV.

Can you re-think how this should work?

API Endpoint Documentation

Getting started

Make sure you are in a venv with argminer installed. Run argminer-api from the command line. This should print two links. Open the first one in your browser and add /ui at the end of it. You should now see a page containing the API endpoints:
[Screenshot: the API endpoints page at /ui]

You are now ready to begin!

GET

health_check

Just checks if the API is alive

model_info

Returns metadata on a set of preselected models that we trained.

POST

evaluate

This endpoint allows you to test any Hugging Face model on the task of argument mining.

  • model_name: name of any publicly available model on HuggingFace
  • strategy: labelling strategy, see #13 for more information
  • agg_strategy: how to aggregate from tokens back to words, see #13 for more information
  • stategy_level: how to apply labelling strategy, see #13 for more information
  • max_length: maximum length of the tensors after tokenization, i.e. max_length = 2 + num_tokens + padding. The 2 refers to the CLS and SEP tokens inherent to transformer models; these will always be there. Padding is only added if 2 + num_tokens is less than max_length
    Note: max_length might conflict with the model of choice
  • batch_size: this is the batch size used when running inference. Depending on your memory you may have to adjust this
  • label_map: this is a list of all of the labels in the sample dataset that you are providing. You must always add 'Other' for argminer 0.1.0. This may change in the future.
    Example: [Other, Claim, MajorClaim, Premise], as in the AAE dataset
  • body: this is the sample data that you are providing to the model. It must be in the form arr[arr[string]]. The first array dimension refers to a 'document'; the second refers to an actual instance of text together with its associated label, e.g. "label::sentence", with "::" being the delimiter between the two. So, for example, a two-document input would look like the following:
[
# document 1
["Claim:: This is the first sentence of document 1 classified as claim", "MajorClaim:: THIS is the second sentence of document 1 classified as MajorClaim", "Premise:: this is the third sentence of document 1" ],
# document 2
["Claim:: this is the first sentence of document 2", "Other:: this is the second sentence of document 2. Document 2 does not have a third sentence like document 1."],
]
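
A hedged sketch of calling the endpoint from Python with requests, assuming the API is running locally (the address is a placeholder: use the first link printed by argminer-api; the /ui page shows the authoritative request schema):

import requests

API_URL = 'http://localhost:8080'  # placeholder: use the first link printed by argminer-api

payload = {
    'model_name': 'google/bigbird-roberta-base',
    'strategy': 'standard_io',
    'agg_strategy': 'first',  # assumed value; see #13 for the available options
    'max_length': 1024,
    'batch_size': 8,
    'label_map': ['Other', 'Claim', 'MajorClaim', 'Premise'],
    'body': [
        ["Claim:: This is the first sentence of document 1 classified as claim",
         "Premise:: this is the second sentence of document 1"],
        ["Other:: this is the only sentence of document 2"],
    ],
    # plus the remaining parameters documented above (e.g. the strategy level)
}

response = requests.post(f'{API_URL}/evaluate', json=payload)
print(response.status_code, response.json())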

predict

Predict labels on a given text using one of the models we provide.

Bug during training

When the model is training, except for the first input, the outputs of the model are always the same class.
This needs to be looked into.

  • Try testing with Kaggle notebooks

This turned out to be a problem with the learning rate

Create Unified Dataloading Scheme

Requirements for input data

After reading from files and processing, the data should be in the following format, ready to be ingested as part of a PyTorch dataset:

  • 'text' column: this should include the entire passage text
  • 'labels' column: this should include the labels in string format as a list, e.g. ['O', 'B-Claim', ...] (see the sketch below)
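
A minimal sketch of the intermediate format, assuming a pandas DataFrame is used (the example passages and labels are made up):

import pandas as pd

df = pd.DataFrame({
    'text': [
        'This is the full passage text of the first document.',
        'And this is the full passage text of the second document.',
    ],
    'labels': [
        ['O', 'O', 'O', 'B-Claim', 'I-Claim', 'I-Claim', 'O', 'O', 'O', 'O'],
        ['O', 'O', 'O', 'O', 'O', 'B-Premise', 'I-Premise', 'I-Premise', 'O', 'O', 'O'],
    ],
})
# one row per document; len(labels) == number of words in text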

Next features to add:

  • Add standardised intermediate input
  • Complete train test split module
  • Add saving / loading from intermediate stages

Introduction and Minutes

Introduction

This is an issue for documenting our meetings and making a note of the decisions we make.

Project workflow

  • We'll be using GitHub for managing code and versioning of different files

Meetings

  • We'll be meeting weekly at 14:30.

Improve inference speed

Inference is a bit slow, can you find out why and how it can be improved?

  • Mostly a bottleneck on the prediction side. When using a GPU the bottleneck moves to the pandas operations, but this was not a huge issue. Closing.
