wri-dssg-omdena / policy-data-analyzer Goto Github PK

Building a model to recognize incentives for landscape restoration in environmental policies from Latin America, the US and India. Bringing NLP to the world of policy analysis through an extensible framework that includes scraping, preprocessing, active learning and text analysis pipelines.

License: Other

Makefile 0.01% Jupyter Notebook 91.93% Python 1.22% R 0.08% CSS 5.10% JavaScript 0.02% HTML 1.61% Shell 0.01% Dockerfile 0.02%

nlp sbert sentence-transformers huggingface machine-learning text-classification document-classification scraping policy environmental

policy-data-analyzer's Issues

Code refactoring for the data augmentation notebook.

One important estep in the project pipeline is to find a batch of prelabeled sentences that can be easily curated manually to be later used for model fine-tuning.
In a first step this was done in a notebook where different strategies where evaluated in different experimental setups.
Now this code should be cleaned and refactored to be integrated in the final pipeline.

join policy tags on the ecolex sample to the full ecolex dataset

Explore Sentence Transformer fine tuning

In our own branches, we can try:

Try different train/test data split and see which one is more accurate
Try different sentence transformer models
Compare fine tuning strategies if applicable (such as https://mccormickml.com/2019/07/22/BERT-fine-tuning/, if different at all)

Identify best set of keywords and search terms to find relevant documents from Ecolex

After the text has been extracted from the policy PDFs:

Find word embeddings suitable for Spanish documents, or any type of Spanish language model
Use keyword analysis/topic modeling to gather insights from the text and improve further searches
If possible, come up with a "similarity" or "distance" metric among relevant documents for easier filtering from non-relevant

SBERT for classification

Find a way of using SBERT for label prediction without using cosine similarity - i.e Find another mapping function from sentence embedding to label

data loading refactoring and new functions

This is an issue to improve the data loading tools. You can list your changes here:

Rename the function to load json from "load_file" to "load_json" in the src/utils.py
Add a funtion to list file names from a directory

format tagged sentences in excel to json

#27 Adapt sBert input to new format

Simplify the sBert classes by eliminating preprocessing steps
Adapt the input to new standards

Explore query strategies with sBERT

We have a initial setup of sBERT to be able to get sentence embeddings and then find the cosine similarity between two sentences.
This allows for using this setup as a search engine to look for sentences which are similar to a certain query.
In this issue we want to analyse the output of the search as we use different query approaches.
There is a more sophisticated approach in https://towardsdatascience.com/building-a-search-engine-with-bert-and-tensorflow-c6fdc0186c8a that we will also explore to see if the performance improves. We will set a new branch Antyukhov-search-engine

Explore CrossEncoders

As specified in this issue: UKPLab/sentence-transformers#350 (comment)

The authors of SBERT recommend using CrossEncoders for sentence classification.

Create a general evaluator for the models

Script that:

Takes in as input the results from a given model run (as a JSON file containing sentences and their labels) and a dataset of labeled sentences
Compares the differences between the model outputs and the ground truth
Has different metrics (cosine similarity, accuracy, precision-recall curve, etc.)

Fix variability issue

Improve preprocessing component

The current method that we are using to split sentences yields a great amount of wrongly splitt sentences.
We need to improve it so as to have a good final version when we want to use the fine-tunned transformers.

Run zero-shot learning experiments

Goal: Try to label more sentences automatically using an unsupervised approach + some labeled data

Specifically, try ideas described in https://joeddav.github.io/blog/2020/05/29/ZSL.html

Latent embedding approach (in blog)
Zero-shot topic classification: https://huggingface.co/zero-shot/

Organize new folders in repo

After migrating all the code from Omdena's repo, we should make sure that all the folders and naming conventions are in accordance to Patrick Ball's workflow structure (see https://hrdag.org/2016/06/14/the-task-is-a-quantum-of-workflow/)

Build Makefile to easily reproduce the project

Logistic regression with Spanish pre-trained embeddings

Using links provided by @Bcjg23, we can test these Spanish pre-trained embeddings:
- https://github.com/dccuchile/spanish-word-embeddings
- https://crscardellino.github.io/SBWCE/
Build a simple logistic regression model with the policy metadata as features
Apply cross-validation and optimize parameters

Refactoring of assisted labeling notebook

Edit README

Make changes to the README so that it contains updated information about the architecture, results and description of the project. The end goal is to spread the link to this repo as much as possible, and we need to have a good and presentable description of the project.

Text summarization with GPT3

Implement a spellchecker in the preprocessing pipeline

More detail in trello

Automate hyperparameter optimization

Look for information on automatic hyperparameter tuning optimization and its viability for our project
Define hyperparameters to be optimized for
Test new methods like population-based optimization or bayesian optimization

Bayesian Optimization explained

Conceptually: https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f

Population Based Optimization

Concept + comparison with Bayesian: https://medium.com/distributed-computing-with-ray/hyperparameter-optimization-for-transformers-a-guide-c4e32c6c989b

Huggingface + Ray + W&B implementation of hyperparameter tuning

Blog detailing actual process: https://huggingface.co/blog/ray-tune

Build the scrapper (for Ecolex)

Optimize PDF2Text Pipeline

Analysis of the internal structure of legal documents

Extract text from Cristina's documents

Read pdf files directly from OneDrive or from zip file
Use OCR if needed for extracting text from image pdf files
Optional: Structure data into single file (or database) for future reading

Scraping new documents from El Salvador and From Chile

Spiders for US legislation

Weights & Biases experiments

Setup for Weights & Biases for our notebook and further experiments

We can use wandb for hyperparameter tuning and most importantly keeping track of experiments. With W&B free hosted service we can set this up efficiently.

Since the team version of wandb is paid (30 day free trial is there however), we will use the following project

https://wandb.ai/ramanshsharma/WRI

Please find the API key to write to the public project in Slack.

Goals

Create a shared project on weights & biases for the team to work on.
Set up training and validation accuracy/loss, weighted/macro F1 score plotting
Set up automatic hyperparameter tuning

Helpful links

Implement a spellchecker in pre-processing pipeline

Fix training loops and sentence transformer

Evaluation code should be refactored to only take care of calculating results, and storing of results should be done in a different area
Evaluation should be done on validation set, not test set
Add method to evaluate on test set

An example structure for the english documents would be:

/english_documents/raw_pdf/: Original/raw documents
/english_documents/text_files/: Text file version of the documents
- /english_documents/text_files/new/: New documents ready to be processed (read)
- /english_documents/text_files/processed/: Processed documents that have already gone through sentence extraction (write)
/english_documents/sentences/: JSON file containing sentences per documents (read AND write)
/english_documents/assisted_labeling/: Excel/CSV files for the assisted labeling part (read AND write)
/english_documents/metadata/: CSV files containing metadata for each country (file names, title of document, etc.) (write)
/english_documents/abbreviations/: Text files containing common abbreviations for each language (read)
Extra separate files:
- /english_documents/english_queries.xlsx: Queries (Excel) (read)

There are more databases to add, such as the one for actual embeddings (if needed) and the highlights per each document. Since we haven't created them yet, these are not necessary to create links to.

wri-dssg-omdena / policy-data-analyzer Goto Github PK

policy-data-analyzer's People

Contributors

Stargazers

Watchers

Forkers

policy-data-analyzer's Issues

Setup for Weights & Biases for our notebook and further experiments

Goals

Recommend Projects

Recommend Topics

Recommend Org