argument-mining's People

Contributors

fededagos, namiyousef, olicm0601, valerief412


argument-mining's Issues

TUDarmstadt training difficulty

A model was trained on the TUDarmstadt dataset with the following parameters:

JOB PARAMETERS:

MODEL_NAME: google/bigbird-roberta-base
MAX_LENGTH: 1024

DATASET: TUDarmstadt
STRATEGY: standard_io

EPOCHS: 60
BATCH_SIZE: 8
VERBOSE: 2
SAVE_FREQ: 20

INFERENCE RESULTS:

macro_f1: 0.0
macro_f1 with nan: 0.0

DETAILED RESULTS:

label           f1
O               0.0
I-MajorClaim    0.0
I-Claim         0.0
I-Premise       0.0

The loss starts at 0.486 and across all epochs only goes down to 0.441! This is not a huge improvement, and it is likely why the F1 scores are not that great. We need to look into this a bit more...

Find references / justifications for project

  • Find references that justify / give a background on the history of labelling strategies for argument mining (or even span detection / entity recognition). E.g. Why are we interested in looking at the different labelling strategies?

  • Find references for how people in literature typically go back from tokens to words. Do they always use the first subtoken prediction? Are there pros and cons to using average / max?

F1 score evaluation minor bug

This is an easy fix, just adding it here so I don't forget. When batching to calculate F1 you cannot just average the F1s. You need to total the TP, FN and FP, and then compute F1 on the aggregates.
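
For reference, a minimal sketch of the fix (the function name and signature are hypothetical, not the actual code in evaluation.py): accumulate TP, FP and FN per label across batches and only compute F1 on the totals at the end.

import numpy as np

def batched_macro_f1(y_true_batches, y_pred_batches, labels):
    """Total TP/FP/FN per label across batches, then compute F1 on the aggregates."""
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}
    for y_true, y_pred in zip(y_true_batches, y_pred_batches):
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        for label in labels:
            tp[label] += int(np.sum((y_pred == label) & (y_true == label)))
            fp[label] += int(np.sum((y_pred == label) & (y_true != label)))
            fn[label] += int(np.sum((y_pred != label) & (y_true == label)))
    f1s = {}
    for label in labels:
        denominator = 2 * tp[label] + fp[label] + fn[label]
        f1s[label] = 2 * tp[label] / denominator if denominator else 0.0
    return f1s, sum(f1s.values()) / len(f1s)  # per-label F1s and macro F1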

Model Training on Colab

Train Models on Colab

This issue is to document how you can get started with training, saving, loading and running inference for any transformer-based model with the available datasets and/or processors.

Set up authentication to your GitHub

Since we are working with a private repository at the moment, it is not possible to easily install this package. There are 3 ways of running our code on Colab:

  • copy-paste all of the relevant code into Colab: this is not recommended because it makes the notebooks really long and unmaintainable. It makes versioning almost impossible, and every change will require a full refactoring again.
  • zip the package and unzip it on Colab: I've tried this before, and though it works most of the time it can sometimes be a bit confusing, with zip files getting misplaced, misnamed, etc., making it difficult to know why something went wrong
  • install the package as a private repository: this is very similar to running pip install for any other package, except that we need to authenticate before running it. This basically means that within Colab, we'd be running a private pip install to install the package directly from the develop branch of argument-mining. Thus, any new changes that we make and push can automatically be loaded by running the private install again.

Since our code should be ready for testing, we are opting for the third method. If you want to develop the code on Colab, then please reach out to me in private and I can help with setting that up.

Now, in order to authenticate, you will need a GitHub access token. Follow these instructions to create an access token. Save this access token in a .json file called github_config.json that has the following format:

{
   "username": "namiyousef",
   "access_token": "YOURACCESSTOKEN"
}

Make sure that the username is MY username and NOT yours. This is because we will be installing a repository that is in my name.
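
For reference, a hedged sketch of what the private install could look like inside a Colab cell, assuming the github_config.json is already accessible (the file path below is a placeholder; the actual cell in End-to-end_GPU.ipynb may differ):

import json
import subprocess

# Load the GitHub credentials saved earlier (placeholder path: adjust to where you stored the file)
with open('/content/drive/MyDrive/github_config.json') as f:
    config = json.load(f)

# Install the package directly from the develop branch of the private repository
url = (
    f"git+https://{config['username']}:{config['access_token']}"
    "@github.com/namiyousef/argument-mining.git@develop"
)
subprocess.run(['pip', 'install', url], check=True)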

Add relevant data to Google Drive

Now, in order to access data, we need it to be accessible from within Colab. We can do this by storing things in Google Drive. Make sure that you store any data that you need within your personal Google Drive. In particular, store the github_config.json there and ONLY there. This is because you do NOT want it accessible to other people: anyone who obtains it will be able to access your account.

Now, in terms of data (e.g. data for the project), I mentioned above that you can store it in your own drive. This is OK, but since we already have a shared drive (https://drive.google.com/drive/folders/1XaMWpeoSq04BkVGt16aS9Gk7PBjMtirS) you can also store data there. Just make sure that you don't overwrite anything and that each folder has a readme.txt file explaining what is in it, so that we don't get lost.

In order to be able to access this shared folder programmatically, you will need to add it as a shortcut to your Google Drive. You can do this by right-clicking the shared folder and then clicking 'Add shortcut to Drive'.
[Screenshot: adding the shared folder as a shortcut to Google Drive]

You will now be able to access the shared folder programmatically from within Colab.
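
For completeness, mounting your Drive from within Colab uses the standard google.colab helper; the shortcut then shows up under MyDrive (the folder name below is an assumption):

from google.colab import drive

# Mount your Google Drive; Colab will ask you to authorise access
drive.mount('/content/drive')

# The shared folder shortcut is then reachable under MyDrive, e.g. (assumed name):
# /content/drive/MyDrive/argument-mining-data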

Open the repository and run models!

Now, open Colab in your browser. You will be faced with a default screen for selecting notebooks. Navigate to the GitHub tab and check the 'include private repos' checkbox. This will prompt you to log in to your GitHub account and authenticate. Then, from the repositories dropdown, find namiyousef/argument-mining and select the develop branch. Once you have done this, find the notebook at the path experiments/yousef/End-to-end_GPU.ipynb and select it.

[Screenshot: selecting the End-to-end_GPU.ipynb notebook from the GitHub tab in Colab]

In this notebook, configure the paths as appropriate (you will need to modify some other path variables along the way). Once you have done this, you will be able to run the notebook successfully.

Note

When you are done using the notebook, you can save a copy in your personal drive. You can also push it to GitHub, but please use a different path than experiments/yousef/ because that will change the notebook and I currently have it set up to work with my directories. I would recommend that you push the notebook to GitHub under experiments/{your_name}/{file_name} so that you can have it configured to how you want to use it, and also so you can have versioning on it.

Alternatively, you can save a copy in your personal drive (if you do this, the authentication might fail the next time you try to run it, so try to stick to GitHub wherever possible).

Report: datasets

This issue is to monitor the datasets for the purpose of report writing.

  • Make sure that we discuss the limitations of the data that we are using to test, e.g. very small, no other similar datasets, etc.

  • The following documents are helpful for the Darmstadt dataset: paper link, and annotation guidelines

Create working baseline

Not sure what to use, ideally would go for a bi-LSTM. Any thoughts?

A working bi-LSTM model has been created, but it needs to be integrated with the rest of the module and made more efficient

Determine performance parameters to use to compare experiments

So far we've based our evaluation schemes on how Kaggle expected us to evaluate. What else can we measure? Can we find references to support our decisions?

  • Think about using different thresholds when calculating the macro F1 score
  • Think about the utility of evaluating performance on the tensors?

Refactor code and add unittests

Checklist for version 0.2.0:

Unittests

  • Unittests for base processor logic
  • Unittests for specific processors using Mocks
  • Unittests for util functions
  • Integration tests for API
  • Integration tests for end-to-end training

Refactoring

  • Add docstrings to all files
  • Refactor config.py and improve hardcoded label maps
  • Refactor email functions into separate logging module (or project entirely)
  • Consider API as separate application?
  • Refactor data.py and remove archive datasets, add deprecation warnings
  • Refactor cluster run.py into main module

Create labelling schemes

This issue is to monitor the creation of methods to create the relevant data labelling schemes.

Success Criteria:

  • Functions that take the input text and give the corresponding outputs

  • Methods in the dataset class that allow us to apply the labels to subtokens

Curate Darmstadt dataset for our project

Previously we worked with the Kaggle PERSUADE dataset. Until we have permission to use it, we should refrain from doing so.

Alternative datasets that look similar are hard to come by.

Here is a dataset we can use, the Darmstadt dataset: https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2422
NOTE: the prepared data has been uploaded to this link here: https://drive.google.com/drive/u/0/folders/1kV_DXsvNDgtyV6suPyS6FfwhoRT-wxeY

This ticket is to document datasets that we can use and any associated processing that might be needed.

Things to pay attention to:

  • Is the dataset valid for what we are trying to do? How was it annotated?

  • Can we enhance the dataset and increase it in size? Will the small size affect our research?

  • Do we have other datasets that might be slightly similar but we could use?

Select models to run experiments on

Success Criteria:

  • Examples of models we can use to train, with a selected set of ideal hyperparameters to try

  • Should include references to papers as well

  • This should consider models with different tokenisers as well, related to #18

  • difference between roberta-base, bigbird-roberta-base, bigbird-roberta-large

Labels for CLS, SEP and PAD as well as X

  • Do CLS and SEP need to have separate labels from your training labels?
    Yes, these should be labelled as -100.

  • Does PAD need to have a separate label?
    No, this should be labelled as -100.

  • Does the attention mask basically ignore the effect of those things with attention mask 0, or is it still important?
    Yes, it does. You should set the attention mask to zero for PAD, but not for CLS and SEP, because they can contain important information about the training items. The idea is:
    for PAD: don't attend and don't compute loss
    for CLS/SEP: attend but don't compute loss
    There is a thread on the Hugging Face forums about this.

Using -100 for the CLS and SEP tokens is required because PyTorch's cross-entropy loss ignores targets with the value -100 by default. You could apply the same label to the subtokens as well if you wanted to ignore them.
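
A minimal sketch of why -100 works, using plain PyTorch (not tied to our training code): CrossEntropyLoss ignores any target equal to ignore_index, which defaults to -100, so CLS/SEP/PAD positions labelled -100 contribute nothing to the loss.

import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()  # ignore_index defaults to -100

# 5 token positions, 3 classes: [CLS], two real tokens, [SEP], [PAD]
logits = torch.randn(5, 3)
labels = torch.tensor([-100, 0, 2, -100, -100])  # only positions 1 and 2 are scored

loss = loss_fn(logits, labels)  # loss computed over the two non-ignored positions only

# attention mask: 0 for PAD (don't attend), 1 for CLS/SEP and real tokens (attend)
attention_mask = torch.tensor([1, 1, 1, 1, 0])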

predictionString starts at 0 not 1

First word should always be 0, NOT 1.
This bug exists in multiple places in the code, so you need to look into this.

As an improvement, set the start index as a global var in the config.py file

Places to look:
predStr, predictionString, range, etc.
Look into the Datasets, DataProcessors and functions in data.py and evaluation.py
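
A minimal sketch of the proposed improvement (PREDICTION_STRING_START_INDEX is a hypothetical name, not necessarily what ends up in config.py): define the start index once and build every prediction string from it.

# config.py (hypothetical global)
PREDICTION_STRING_START_INDEX = 0

# wherever prediction strings are built, e.g. in data.py
def get_prediction_string(num_words, start=PREDICTION_STRING_START_INDEX):
    """Word indices for a passage: the first word is 0, NOT 1."""
    return list(range(start, start + num_words))

assert get_prediction_string(3) == [0, 1, 2]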

Train models on the TUDarmstadt dataset

As a start, we will try the following experiments:

  • Training TUDarmstadt on bio, bieo and io. This will likely be done on Google Colab. The objective here is to debug the training script to make sure that training and inference are working correctly. This is tangentially related to #39 to see model performance during training
  • Training PersuadeProcessor on bio, bieo and io. This will be done using a combination of the cluster, to make sure that it will be ready for full training runs, and Google Colab, again for debugging purposes to see whether we get good results or not
  • For the above two tests, we will try using the RoBERTa and BigBird models. We will likely fix max_length=1024 across our experiments
  • Once the above are complete and we are confident that they are working, we will agree on epochs, batch_size, etc. and then run experiments for all the configurations that we need. This would be in the order of models x labelling schemes x agg strategies (see the sketch below)
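
A minimal sketch of that experiment grid (the aggregation strategy names are assumptions; the model and scheme names are taken from the issues above):

from itertools import product

models = ['roberta-base', 'google/bigbird-roberta-base']
labelling_schemes = ['io', 'bio', 'bieo']
agg_strategies = ['first', 'mean', 'max']  # assumed names for the token-to-word aggregation

# models x labelling schemes x agg strategies
experiments = [
    {'model_name': m, 'strategy': s, 'agg_strategy': a, 'max_length': 1024}
    for m, s, a in product(models, labelling_schemes, agg_strategies)
]
print(len(experiments))  # 18 configurations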

Dataset enhancement

Look into using the Trainer class from Hugging Face to streamline the train/validation/test process

Labelling schemes

As of right now, the BigBird model (loaded using AutoModelForTokenClassification) takes in inputs encoded with AutoTokenizer. When the model is trained, e.g. model(**inputs, labels=labels), the labels tensor must be the same size as the tensors in inputs. Does this always have to be the case?

Example

If I have a sentence "I am Yousef Nami" the corresponding labels (for standard NER) should be: ["O", "O", "B-PERSON", "I-PERSON"].

However, after tokenisation, the sentence becomes: ['[CLS]', '▁I', '▁am', '▁Y', 'ous', 'ef', '▁N', 'ami', '[SEP]'] and so BigBird expects the output to be something like this: ['O', 'O', 'O', 'B-PERSON', 'B-PERSON' OR 'I-PERSON', 'B-PERSON' OR 'I-PERSON', 'I-PERSON', 'I-PERSON', 'O'].

We need to answer the following:

  • Does the target variable size always have to match the embedding size? If so, why?
  • Which is the correct way of representing the target variables corresponding to tokenised entities, e.g. does ['Yousef', 'B-PERSON'] become [['▁Y', 'ous', 'ef'], ['B-PERSON', 'B-PERSON', 'B-PERSON']] or [['▁Y', 'ous', 'ef'], ['B-PERSON', 'I-PERSON', 'I-PERSON']]? (See the sketch after this list.)
  • Do the [CLS] and [SEP] variables turn into 'O'? What effect does this have on the classification?
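
A minimal sketch of the two common alignment options (illustrative only; this is not necessarily how our processors do it). It assumes the word_ids() output of a Hugging Face fast tokenizer called with is_split_into_words=True, where None marks special tokens: either propagate the word label to every subtoken (turning B- into I- after the first subtoken), or keep only the first subtoken's label and ignore the rest with -100.

def align_labels(word_ids, word_labels, label_all_subtokens=True):
    """Map word-level labels onto subtokens; -100 marks positions ignored by the loss."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)  # special tokens: [CLS], [SEP], [PAD]
        elif word_id != previous_word_id:
            aligned.append(word_labels[word_id])  # first subtoken keeps the word label
        elif label_all_subtokens:
            aligned.append(word_labels[word_id].replace('B-', 'I-'))  # continuation subtoken
        else:
            aligned.append(-100)  # ignore continuation subtokens entirely
        previous_word_id = word_id
    return aligned

# "I am Yousef Nami" -> ['[CLS]', '▁I', '▁am', '▁Y', 'ous', 'ef', '▁N', 'ami', '[SEP]']
word_ids = [None, 0, 1, 2, 2, 2, 3, 3, None]
word_labels = ['O', 'O', 'B-PERSON', 'I-PERSON']
print(align_labels(word_ids, word_labels))
# [-100, 'O', 'O', 'B-PERSON', 'I-PERSON', 'I-PERSON', 'I-PERSON', 'I-PERSON', -100]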

Create a working dataloader

Create a working dataloader that takes input text and prepares it for the model.

Criteria:

  • Needs to be parametrised so you can easily modify preprocessing settings
  • Needs to be fast for loading and processing data

Extend PersuadeProcessor to allow adversarial examples

  • Looked into creating a preprocessing stage using the prediction string... this does not work at all and there is a 26% discrepancy with the real data. Most of it is due to missed punctuation, but some of it is more severe
  • The next step should look more closely into using discourse_start and discourse_end, see the percentage error rate with that, and check whether the error examples can be fixed by hardcoding

TUDarmstadt get_tts

Requires an implicit path to train-test-split.csv, which means that you cannot use this method if you are doing .from_json. A hotfix for this should be easy, but moving forward we should look into #41 to make the design more robust

The hotfix is just to copy-paste the train-test-split.csv to the relevant location. Changing from bug to enhancement since the code isn't broken.

Re-visit DataProcessor design

The current data processor was created quickly, and as a result a number of shortcuts were made. This makes it a bit annoying to use from a user-experience POV.

Can you re-think how this should work?

API Endpoint Documentation

Getting started

Make sure you are in a venv with argminer installed. Run argminer-api from the command line. This should print two links. Open the first one in your browser and add /ui at the end of it. You should now see a page containing the API endpoints:
[Screenshot: the API endpoints page at /ui]

You are now ready to begin!

GET

health_check

Just checks if the API is alive

model_info

Returns metadata on a set of preselected models that we trained.

POST

evaluate

This endpoint allows you to test any Hugging Face model on the task of argument mining.

  • model_name: name of any publicly available model on HuggingFace
  • strategy: labelling strategy, see #13 for more information
  • agg_strategy: how to aggregate from tokens back to words, see #13 for more information
  • stategy_level: how to apply labelling strategy, see #13 for more information
  • max_length: maximum length of the tensors after tokenization, i.e. max_length = 2 + num_tokens + padding. The 2 refers to the CLS and SEP tokens inherent to transformer models; these will always be there. Padding is only added if 2 + num_tokens is less than max_length
    Note: max_length might conflict with the model of choice
  • batch_size: this is the batch size used when running inference. Depending on your memory you may have to adjust this
  • label_map: this is a list of all of the labels in the sample dataset that you are providing. You must always add 'Other' for argminer 0.1.0. This may change in the future.
    Example: [Other, Claim, MajorClaim, Premise], as in the AAE dataset
  • body: this is the sample data that you are providing to the model. It must be in the form arr[arr[string]]. The first array dimension refers to a 'document'; the second refers to an actual instance of text together with its associated label, e.g. "label::sentence", with "::" being the delimiter between the two. So, for example, a two-document input would look like the following:
[
# document 1
["Claim:: This is the first sentence of document 1 classified as claim", "MajorClaim:: THIS is the second sentence of document 1 classified as MajorClaim", "Premise:: this is the third sentence of document 1" ],
# document 2
["Claim:: this is the first sentence of document 2", "Other:: this is the second sentence of document 2. Document 2 does not have a third sentence like document 1."],
]
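
A hedged sketch of calling the endpoint from Python with requests, assuming the API is running locally (the address is a placeholder: use the first link printed by argminer-api; the /ui page shows the authoritative request schema):

import requests

API_URL = 'http://localhost:8080'  # placeholder: use the first link printed by argminer-api

payload = {
    'model_name': 'google/bigbird-roberta-base',
    'strategy': 'standard_io',
    'agg_strategy': 'first',  # assumed value; see #13 for the available options
    'max_length': 1024,
    'batch_size': 8,
    'label_map': ['Other', 'Claim', 'MajorClaim', 'Premise'],
    'body': [
        ["Claim:: This is the first sentence of document 1 classified as claim",
         "Premise:: this is the second sentence of document 1"],
        ["Other:: this is the only sentence of document 2"],
    ],
    # plus the remaining parameters documented above (e.g. the strategy level)
}

response = requests.post(f'{API_URL}/evaluate', json=payload)
print(response.status_code, response.json())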

predict

Predict labels on a given text using one of the models we provide.

Bug during training

When the model is training, except for the first input, the outputs of the model are always the same class.
This needs to be looked into.

  • Try testing with Kaggle notebooks

This turned out to be a problem with the learning rate

Create Unified Dataloading Scheme

Requirements for input data

After reading from files and processing, the data should be in the following format, ready to be ingested as part of a PyTorch dataset:

  • 'text' column: this should include the entire passage text
  • 'labels' column: this should include the labels in string format as a list, e.g. ['O', 'B-Claim', ...] (see the sketch below)
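
A minimal sketch of the intermediate format, assuming a pandas DataFrame is used (the example passages and labels are made up):

import pandas as pd

df = pd.DataFrame({
    'text': [
        'This is the full passage text of the first document.',
        'And this is the full passage text of the second document.',
    ],
    'labels': [
        ['O', 'O', 'O', 'B-Claim', 'I-Claim', 'I-Claim', 'O', 'O', 'O', 'O'],
        ['O', 'O', 'O', 'O', 'O', 'B-Premise', 'I-Premise', 'I-Premise', 'O', 'O', 'O'],
    ],
})
# one row per document; len(labels) == number of words in text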

Next features to add:

  • Add standardised intermediate input
  • Complete train test split module
  • Add saving / loading from intermediate stages

Introduction and Minutes

Introduction

This is an issue for documenting our meetings and making a note of the decisions we make.

Project workflow

  • We'll be using GitHub for managing code and versioning of different files

Meetings

  • We'll be meeting weekly at 14:30.

Improve inference speed

Inference is a bit slow, can you find out why and how it can be improved?

  • Mostly a bottleneck on the prediction side. When using a GPU the bottleneck moves to the pandas operations, but this was not a huge issue. Closing.
