Current Projects:
- ⚡ In-N-Out: FastAPI app that provides a universal read-write interface
- ⚡ In-N-Out-Clients: library containing common read-write clients
- ⚡ Mix-N-Match: Polars library for efficient, user-friendly data processing
Repository for our NLP project. The name will be changed once we decide on a project.
License: MIT License
The model was trained on the Darmstadt dataset with the following parameters:
MODEL_NAME: google/bigbird-roberta-base
MAX_LENGTH: 1024
DATASET: TUDarmstadt
STRATEGY: standard_io
EPOCHS: 60
BATCH_SIZE: 8
VERBOSE: 2
SAVE_FREQ: 20
macro_f1: 0.0
macro_f1 with nan: 0.0

| label        | f1  |
|--------------|-----|
| O            | 0.0 |
| I-MajorClaim | 0.0 |
| I-Claim      | 0.0 |
| I-Premise    | 0.0 |
The loss starts at 0.486 and across all epochs only goes down to 0.441! This is not a huge improvement, and it is likely why the F1 scores are not that great. We need to look into this a bit more...
Find references that justify / give background on the history of labelling strategies for argument mining (or even span detection / entity recognition), e.g. why are we interested in looking at the different labelling strategies?
Find references for how the literature typically maps predictions back from tokens to words. Do they always use the first subtoken's prediction? Are there pros and cons to using the average / max?
Success Criteria:
Add support to print metrics while the models are training
This is an easy fix, just adding it so I don't forget. When batching to calculate F1 you cannot just average the F1s; you need to total the TP, FN and FP, and then find F1 on the aggregates.
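A minimal sketch of that aggregation (the helper names are hypothetical, not from the codebase):

```python
from collections import Counter

import torch

def update_counts(counts: Counter, preds: torch.Tensor, labels: torch.Tensor, num_classes: int) -> None:
    """Accumulate per-class TP / FP / FN counts over one batch."""
    for cls in range(num_classes):
        counts[(cls, "tp")] += ((preds == cls) & (labels == cls)).sum().item()
        counts[(cls, "fp")] += ((preds == cls) & (labels != cls)).sum().item()
        counts[(cls, "fn")] += ((preds != cls) & (labels == cls)).sum().item()

def macro_f1(counts: Counter, num_classes: int) -> float:
    """Macro F1 from the aggregated totals, never from averaged per-batch F1s."""
    f1s = []
    for cls in range(num_classes):
        tp, fp, fn = counts[(cls, "tp")], counts[(cls, "fp")], counts[(cls, "fn")]
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / num_classes

# Usage: counts = Counter(); call update_counts(...) once per batch,
# then compute macro_f1(counts, num_classes) once at the end of the epoch.
```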
This issue is to document how you can get started with training, saving, loading and also running inference for any transformer based model and available datasets and/or processors.
Since we are working with a private repository at the moment, it is not possible to easily install this package. There are 3 ways of running our code on Colab; the third is a private pip install. This works like pip install for any other package, except that we need to authenticate before running it. This basically means that within Colab, we'd be running a private pip install to install the package directly from the develop branch of argument-mining. Thus, any new changes that we make and push can automatically be loaded by running the private install again. Since our code should be ready for testing, we are opting for the third method. If you want to develop the code on Colab, then please reach out to me in private and I can help with setting that up.
Now, in order to authenticate, you will need a GitHub access token. Follow these instructions to create an access token. Save this access token in a .json file called github_config.json that has the following format:
```json
{
    "username": "namiyousef",
    "access_token": "YOURACCESSTOKEN"
}
```
Make sure that the username is MY username and NOT yours. This is because we will be installing a repository that is in my name.
Now, in order to access data, we need it accessible from within Colab. We can do this by storing things in Google Drive. Make sure that you store any data that you need within your personal Google Drive. In particular, store github_config.json there and ONLY there. This is because you do NOT want it accessible to other people, since anyone who can access it will be able to access your account.
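Putting the two together, a sketch of the authenticated install from a Colab cell (the drive mount path is Colab's default; the repo URL follows from the repository and branch named below):

```python
import json

from google.colab import drive

# Mount Google Drive so github_config.json can be read from your personal drive
drive.mount('/content/drive')

with open('/content/drive/MyDrive/github_config.json') as f:
    config = json.load(f)

username = config['username']
token = config['access_token']

# Private pip install from the develop branch of argument-mining.
# The "!" runs a shell command in Colab; {name} interpolates Python variables.
!pip install git+https://{username}:{token}@github.com/namiyousef/argument-mining.git@develop
```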
Now, in terms of data (e.g. data for the project), I mentioned above that you can store it in your own drive. This is OK, but since we already have a shared drive (https://drive.google.com/drive/folders/1XaMWpeoSq04BkVGt16aS9Gk7PBjMtirS) you can also store data there. Just make sure that you don't overwrite anything and that each folder has a readme.txt file explaining what is in it, so that we don't get lost.
In order to be able to access this shared folder programmatically, you will need to add it as a shortcut to your Google Drive. You can do this by right-clicking the shared folder and then clicking 'Add shortcut to Drive'.
You will now be able to access the shared folder programmatically from within Colab.
Now, open Colab in your browser. You will be faced with a default screen for selecting notebooks. Navigate to the GitHub tab and check the 'include private repos' checkbox. This will prompt you to log in to GitHub and authenticate. Then, from the repositories dropdown, find namiyousef/argument-mining and select the develop branch. Once you have done this, find the notebook at the path experiments/yousef/End-to-end_GPU.ipynb and select it.
In this notebook, configure the paths as appropriate (you will need to modify some other path variables along the way). Once you have done this, you will be able to run the notebook successfully.
When you are done using the notebook, you can save a copy in your personal drive. You can also push it to GitHub, but please use a different path than experiments/yousef/, because that would change the notebook and I currently have it set up to work with my directories. I would recommend that you push the notebook to GitHub under experiments/{your_name}/{file_name}, so that you can configure it however you want to use it, and also so that you have versioning on it.
Alternatively, you can save a copy in your personal drive (if you do this, the authentication might fail the next time you try to run it, so try to stick to GitHub wherever possible).
This issue is to monitor the datasets for the purpose of report writing.
Make sure that we discuss the limitations of the data that we are using to test, e.g. very small, no other similar datasets, etc.
The following documents are helpful for the Darmstadt dataset: the paper and the annotation guidelines.
Not sure what to use; ideally I would go for a BiLSTM. Any thoughts?
A working BiLSTM model has been created, but it needs to be integrated with the rest of the module and made more efficient.
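For reference, a minimal BiLSTM token classifier could look like the sketch below (dimensions and names are illustrative, not the integrated module):

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Minimal BiLSTM for token-level classification (e.g. O / I-MajorClaim / I-Claim / I-Premise)."""

    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 256, num_labels: int = 4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)  # 2x for the two directions

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(input_ids)  # (batch, seq_len, embed_dim)
        hidden, _ = self.lstm(embedded)       # (batch, seq_len, 2 * hidden_dim)
        return self.classifier(hidden)        # (batch, seq_len, num_labels) logits per token
```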
So far we've based our evaluation schemes on how Kaggle expected us to evaluate. What else can we measure? Can we find references to support our decisions?
This issue is to monitor the creation of methods to create the relevant data labelling schemes.
Success Criteria:
Functions that take the input text and give the corresponding outputs
Methods in the dataset class that allow us to apply the labels to subtokens
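As an illustration of the first criterion, a labelling function for the standard_io strategy used above might look like this (a sketch; the strategy names and helper are assumptions, not the actual module API):

```python
def apply_strategy(word_labels: list[str], strategy: str = "standard_io") -> list[str]:
    """Map word-level labels to a labelling scheme (hypothetical helper).

    With 'standard_io', every word inside a span gets an I- prefix and everything
    else stays 'O'. A BIO variant would additionally mark span starts with B-.
    """
    if strategy == "standard_io":
        return ["O" if label == "O" else f"I-{label.split('-', 1)[-1]}" for label in word_labels]
    if strategy == "standard_bio":
        out, prev = [], "O"
        for label in word_labels:
            entity = label.split("-", 1)[-1]
            if label == "O":
                out.append("O")
                prev = "O"
            else:
                out.append(f"I-{entity}" if prev == entity else f"B-{entity}")
                prev = entity
        return out
    raise ValueError(f"Unknown strategy: {strategy}")

# e.g. apply_strategy(["O", "Claim", "Claim", "Premise"])
# -> ['O', 'I-Claim', 'I-Claim', 'I-Premise']
```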
Previously we worked with the Kaggle PERSUADE dataset. Until we have permission to use it, we should refrain from doing so at all costs.
Alternative datasets that look similar to it are hard to come by.
Here is a dataset we can use, the Darmstadt dataset: https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2422
NOTE: the prepared data has been uploaded to this link here: https://drive.google.com/drive/u/0/folders/1kV_DXsvNDgtyV6suPyS6FfwhoRT-wxeY
This ticket is to document datasets that we can use and any associated processing that might be needed.
Things to pay attention to:
Is the dataset valid for what we are trying to do? How was it annotated?
Can we enhance the dataset and increase it in size? Will the small size affect our research?
Do we have other datasets that might be slightly similar but we could use?
Success Criteria:
Examples of models we can use to train, with a selected set of ideal hyperparameters to try
Should include references to papers as well
This should consider models with different tokenisers as well, related to #18
difference between roberta-base, bigbird-roberta-base, bigbird-roberta-large
Do CLS and SEP need to have separate labels than your training labels?
Yes, these should be labelled as -100.
Does PAD need to have a separate label?
No, this should be labelled as -100.
Does the attention mask basically ignore the effect of tokens with attention mask 0, or are they still important?
Yes, it does. You should set the attention mask to zero for PAD, but not for CLS and SEP, because they can contain important information about the training items. The idea is:
- for PAD: don't attend and don't compute loss
- for CLS/SEP: attend but don't compute loss
There is a thread on huggingface forums on this.
Using -100 for the CLS and SEP tokens is a requirement because of the cross-entropy function in PyTorch, which ignores targets of -100 by default. You might apply this to the subtokens as well if you wanted to ignore them.
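A small sketch showing why -100 works (it is the default ignore_index of PyTorch's cross-entropy loss):

```python
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()  # ignore_index defaults to -100

logits = torch.randn(6, 4)                          # 6 tokens, 4 labels
labels = torch.tensor([-100, 2, 1, 1, -100, -100])  # CLS ... SEP / PAD labelled -100

loss = loss_fn(logits, labels)  # positions labelled -100 contribute nothing to the loss
```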
The first word index should always be 0, NOT 1.
This bug exists in multiple places in the code, so you need to look into it.
As an improvement, set the start index as a global variable in the config.py file.
Places to look:
predStr, predictionString, range, etc.
Look into the Datasets, DataProcessors and functions in data.py and evaluation.py.
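For example, the fix plus the proposed config global could look like this (names hypothetical):

```python
# config.py (proposed)
START_INDEX = 0  # word indices are zero-based everywhere

# illustrative fix for a predictionString-style helper (name hypothetical)
def get_prediction_string(text: str) -> str:
    """Space-separated word indices for a span; the first word is START_INDEX, not 1."""
    num_words = len(text.split())
    return " ".join(str(i) for i in range(START_INDEX, START_INDEX + num_words))
```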
Extend training logic to work with multiple GPUs
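A minimal first step could be torch.nn.DataParallel, sketched below (DistributedDataParallel would be the more scalable route):

```python
import torch

# model: any nn.Module that currently trains on a single GPU
if torch.cuda.device_count() > 1:
    # Replicates the model across GPUs and splits each batch between them
    model = torch.nn.DataParallel(model)
model.to("cuda")
```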
Yihong suggested that we look into the difference that different tokenizers have on performance; however, it is not clear what experiment we need to run for this.
This issue is for us to monitor that.
As a start, we will try the following experiments:
Look into using the Trainer class from HuggingFace to streamline the train/validation and test process.
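A rough sketch of what that could look like, using the training parameters listed earlier (the dataset variables are assumed placeholders):

```python
from transformers import AutoModelForTokenClassification, Trainer, TrainingArguments

model = AutoModelForTokenClassification.from_pretrained(
    "google/bigbird-roberta-base", num_labels=4
)

args = TrainingArguments(
    output_dir="outputs",
    num_train_epochs=60,
    per_device_train_batch_size=8,
    evaluation_strategy="epoch",  # run validation metrics every epoch
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # assumed: datasets built by our processors
    eval_dataset=val_dataset,
)
trainer.train()
```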
Interactive jobs:
There is some file-reading bug due to encoded characters on Windows.
Low priority, but it should be looked into.
Create functions for enhancing the dataset, for example spell checking or removing misspelt words.
As of right now, the BigBird model (loaded using AutoModelForTokenClassification) takes in inputs encoded using AutoTokenizer. When the model is trained, e.g. model(**inputs, labels=labels), the size of labels must be the same as the size of the tensors in inputs. Does this always have to be the case?
If I have a sentence "I am Yousef Nami", the corresponding labels (for standard NER) should be: ["O", "O", "B-PERSON", "I-PERSON"]. However, after tokenisation the sentence becomes: ['[CLS]', '▁I', '▁am', '▁Y', 'ous', 'ef', '▁N', 'ami', '[SEP]'], and so BigBird expects the output to be something like: ['O', 'O', 'O', 'B-PERSON', 'B-PERSON' OR 'I-PERSON', 'B-PERSON' OR 'I-PERSON', 'B-PERSON', 'I-PERSON', 'O'].
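For reference, fast tokenizers expose word_ids(), which makes the mechanical alignment easy; which label each subtoken should get is exactly the open question that follows. A sketch that simply repeats the word's label:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base")

words = ["I", "am", "Yousef", "Nami"]
word_labels = ["O", "O", "B-PERSON", "I-PERSON"]

encoding = tokenizer(words, is_split_into_words=True)

aligned = []
for word_id in encoding.word_ids():
    if word_id is None:
        aligned.append(-100)                  # [CLS] / [SEP]: ignored by the loss
    else:
        aligned.append(word_labels[word_id])  # here: repeat the word's label on every subtoken

# len(aligned) == len(encoding["input_ids"]), so the labels match the model inputs
```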
We need to answer the following:
- Should ['Yousef', 'B-PERSON'] become [['▁Y', 'ous', 'ef'], ['B-PERSON', 'B-PERSON', 'B-PERSON']] or [['▁Y', 'ous', 'ef'], ['B-PERSON', 'I-PERSON', 'I-PERSON']]?
- Should the [CLS] and [SEP] tokens turn into 'O'? What effect does this have on the classification?

Summary
Create a working dataloader that takes input text and prepares it for the model.
Criteria:
Create metrics other than Multi F1 score to measure model performance
Requires an implicit path to train-test-split.csv, which means that you cannot use this method if you are doing .from_json. A hotfix for this should be easy, but moving forward we should look into #41 to make the design more robust.
The hotfix is just to copy-paste the train-test-split.csv to the relevant location. Changing from bug to enhancement since the code isn't broken.
The current data processor was created quickly, and as a result a number of shortcuts were made. This makes it a bit annoying to use from a user-experience POV.
Can you re-think how this should work?
Make sure you are in a venv with argminer installed. Run argminer-api from the command line. This should return two links. Open the address of the first one in your browser, and then add /ui at the end of it. You should now see a page containing the API endpoints. You are now ready to begin!
Just checks if the API is alive
Returns metadata on a set of preselected models that we trained.
This endpoint allows you to test any huggingface model for the task of argument mining.
Note: max_length might conflict with the model of choice.
Example labels: [Other, Claim, MajorClaim, Premise], as in the AAE dataset.
```python
[
    # document 1
    ["Claim:: This is the first sentence of document 1 classified as claim",
     "MajorClaim:: THIS is the second sentence of document 1 classified as MajorClaim",
     "Premise:: this is the third sentence of document 1"],
    # document 2
    ["Claim:: this is the first sentence of document 2",
     "Other:: this is the second sentence of document 2. Document 2 does not have a third sentence like document 1."],
]
```
Predict labels on a given text using one of the models we provide.
When the model is training, except for the first input, the outputs of the model are always the same class.
This needs to be looked into.
Update: this turned out to be a problem with the learning rate.
After reading from files and processing, the data should be in the following format, ready to be ingested as part of a PyTorch dataset:
Next features to add:
Create cross-val scheme that can account for stratified folds
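A possible starting point with scikit-learn (a sketch; the document-level stratification key is an assumption):

```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# doc_ids / strat_labels are assumptions: one entry per document, with e.g.
# the dominant argument label per document as the stratification key
for fold, (train_idx, val_idx) in enumerate(skf.split(doc_ids, strat_labels)):
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val documents")
```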
This is an issue for documenting our meetings and making a note of the decisions we make.
Inference is a bit slow, can you find out why and how it can be improved?