conditioned-speech-gen

Semantically Conditioned Language Models for Political Text Generation

We develop a fully automated, scalable 3-step pipeline for generating high-quality synthetic text corpora with semantically conditioned language models. When the pipeline is applied to a political speech corpus, a strong classifier struggles to discriminate real from synthetic texts, which demonstrates the effectiveness of our pipeline.

Figure: 3-step high-quality synthetic text generation pipeline

  • An example synthetic generation from the political domain:

"Mr. President, I speak about the need for hate crimes legislation. On May 1, 2003, Senator Kennedy and I introduced the Local Law Enforcement Enhancement Act, a bill that would add new categories to current hate crimes law, sending a signal that violence of any kind is unacceptable in our society. On September 3, 2001, three men were found guilty of beating and shooting two women in their downtown Atlanta apartment for allegedly being gay and mentally retarded. Crimes motivated by race, gender, religion, sexual orientation, disability or national origin are routinely covered by State and local governments, as are"

Reproducing results on the ETH Euler cluster

Install Miniconda

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh

Close the current terminal and open a new one.

Set Up Conda Environment, Load Modules, Activate Conda Environment

conda env create -f environment.yml
module load gcc/8.2.0 python_gpu/3.9.9 eth_proxy
conda activate cond_text_gen_project

Note

You do not need to run every step below. However, make sure that you still make the path changes described in each step.

  • If you have access to the 'processed_df_train.csv' and 'processed_df_valid.csv' files, skip Step 1.
  • If you have access to a pretrained checkpoint such as '18aug_k2t_gpt2medium_maxseqlen256_batch8_8_lr2e5_epoch2.pt', skip Step 2.
  • If all 'shard0, ..., shard9' subfolders in the 'data/' folder contain a 'keywords.txt' file, skip Step 3. (This is currently the case.)
  • If all 'shard0, ..., shard9' subfolders in the 'results/' folder contain generated outputs such as 'results/shard0/finetunedgptmed_lr2e5_epoch2/Result_w_5.0_nBeams_1_nGenSent_128_nWordsPerSent_1_topP_0.9_WC_Guar_True_glove_maxSENTENCES.txt', skip Step 4. (This is currently the case.)
  • If you have access to 'quality_fake_df.csv', skip Step 5.

To obtain the files above, go to https://polybox.ethz.ch/index.php/s/qbGQzAyefAjS13N. Access is only available internally.
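The skip rules above can be checked mechanically. The following is a minimal sketch, not part of the repository: the function name skippable_steps and its parameters are hypothetical, and it assumes that any '.pt' file in the checkpoint directory counts as a usable checkpoint and that generated outputs end in 'SENTENCES.txt', as in the example file name above.

```python
from pathlib import Path


def _has_match(directory, pattern):
    """True if `directory` exists and contains at least one file matching `pattern`."""
    return directory.is_dir() and any(directory.glob(pattern))


def skippable_steps(data_dir, checkpoint_dir, repo_dir,
                    results_subfolder="finetunedgptmed_lr2e5_epoch2",
                    keyword_file="keywords.txt", total_shard=10):
    """Return the step numbers (1-5) whose outputs already exist on disk."""
    data_dir, checkpoint_dir, repo_dir = Path(data_dir), Path(checkpoint_dir), Path(repo_dir)
    shards = [f"shard{i}" for i in range(total_shard)]
    checks = {
        # Step 1: both preprocessed CSV files are present.
        1: all((data_dir / name).is_file()
               for name in ("processed_df_train.csv", "processed_df_valid.csv")),
        # Step 2: assumption: any .pt file counts as a fine-tuned checkpoint.
        2: _has_match(checkpoint_dir, "*.pt"),
        # Step 3: every data/shard<i>/ folder holds a keyword file.
        3: all((repo_dir / "data" / s / keyword_file).is_file() for s in shards),
        # Step 4: every results/shard<i>/<subfolder>/ holds a generated output.
        4: all(_has_match(repo_dir / "results" / s / results_subfolder, "*SENTENCES.txt")
               for s in shards),
        # Step 5: the compiled high-quality dataset exists.
        5: (data_dir / "quality_fake_df.csv").is_file(),
    }
    return [step for step, done in checks.items() if done]
```

For example, skippable_steps('/cluster/scratch/{eth_username}/nlp_lss_datasets', '/cluster/scratch/{eth_username}/nlp_lss_checkpoints', '.') returns the list of steps that can be skipped when run from the repository root. The path changes in each step are still required.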

Step 1: Preprocess Corpus

We assume that you have access to the raw corpus directory indicated in line 8 of process_corpus.py.

Go to process_corpus.py, and change line 9 to your save_dir (e.g. '/cluster/scratch/{eth_username}/nlp_lss_datasets'). If you have access to the 'processed_df_train.csv' and 'processed_df_valid.csv' files, put them under this directory.

Then run

python process_corpus.py

Step 2: Fine-tune GPT-2

Go to config.yml.

Change data_path (e.g. '/cluster/scratch/{eth_username}/nlp_lss_datasets').

Change checkpoint_dir (e.g. '/cluster/scratch/{eth_username}/nlp_lss_checkpoints'). If you have access to a fine-tuned checkpoint, put it under this directory.

Change the experiment name and other hyperparameters as needed.

Then run on GPU with

bsub -n 4 -W 23:59 -o euler_message -R "rusage[mem=4096, ngpus_excl_p=1]" -R "select[gpu_model0==NVIDIAGeForceRTX2080Ti]" python run.py --config config.yml

Step 3: Extract Keywords

Go to extract_keywords.py.

Change line 8 to your gensim data directory (e.g. os.environ['GENSIM_DATA_DIR']='/cluster/scratch/{eth_username}/nlp_lss_datasets').

Change line 9 to the directory containing 'processed_df_valid.csv' (e.g. '/cluster/scratch/{eth_username}/nlp_lss_datasets').

Change line 10 to your preferred keyword_file_name (e.g. 'valid_df_keywords_20k.txt').

Note total_shard in line 17. Make sure that the 'results/' folder has subfolders named shard0, shard1, ..., shard{total_shard-1}. (This is currently the case.)
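If the shard subfolders are missing, they can be created with a few lines of Python. This is a minimal sketch: make_shard_dirs is a hypothetical helper, not a function in the repository, and the value 10 for total_shard is an assumption based on the shard0, ..., shard9 folders mentioned above.

```python
import os


def make_shard_dirs(base_dir, total_shard):
    """Create base_dir/shard0 ... base_dir/shard{total_shard-1} and return their paths."""
    paths = []
    for i in range(total_shard):
        path = os.path.join(base_dir, f"shard{i}")
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        paths.append(path)
    return paths

# Example (assumed value total_shard=10, run from the repository root):
#   make_shard_dirs("results", 10)
```

Because exist_ok=True is set, calling the helper on folders that already exist leaves their contents untouched.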

Then

python extract_keywords.py

Step 4: Generate Fake Speech with K2T

Go to perplexity.py, and change line 5 to your cache_dir (e.g. '/cluster/scratch/{eth_username}/huggingface_cache'). The cache and gensim data directories need to be changed because storage space in the $HOME directory is limited.

Go to utility_gpt.py.

Change line 9 to your gensim data directory (e.g. os.environ['GENSIM_DATA_DIR']='/cluster/scratch/{eth_username}/nlp_lss_datasets'). It should be the same as in Step 3.

Change line 10 to your cache_dir (e.g. '/cluster/scratch/{eth_username}/huggingface_cache'). It should be the same as in perplexity.py.

Change line 11 to your converter_table_path (e.g. '/cluster/scratch/{eth_username}/nlp_lss_datasets/converter_table_glove'). The converter table holds GloVe word vectors for each token in the GPT-2 vocabulary.

Go to k2t.py, and change the model checkpoint path in line 744 (e.g. '/cluster/scratch/{eth_username}/nlp_lss_checkpoints/18aug_k2t_gpt2medium_maxseqlen256_batch8_8_lr2e5_epoch2.pt').
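Since the same scratch locations recur across several files, it can help to derive them from one username. The sketch below is illustrative only: scratch_paths is a hypothetical helper, and the directory names are simply the examples given above. The repository itself hard-codes these values at the line numbers listed.

```python
import os


def scratch_paths(eth_username):
    """Example scratch locations used in this README, derived from one username."""
    scratch = f"/cluster/scratch/{eth_username}"
    return {
        "gensim_data_dir": f"{scratch}/nlp_lss_datasets",
        "cache_dir": f"{scratch}/huggingface_cache",
        "converter_table_path": f"{scratch}/nlp_lss_datasets/converter_table_glove",
        "checkpoint_dir": f"{scratch}/nlp_lss_checkpoints",
    }


def configure_gensim(paths):
    """Point gensim at the scratch directory.

    Assumption: gensim's downloader reads GENSIM_DATA_DIR when it is imported,
    so this should be called before any gensim import.
    """
    os.environ["GENSIM_DATA_DIR"] = paths["gensim_data_dir"]
```

Calling scratch_paths with your own username and printing the result gives the values to paste into perplexity.py, utility_gpt.py, and k2t.py.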

Then run on GPU with

bsub -n 4 -W 23:59 -o euler_message -R "rusage[mem=4096, ngpus_excl_p=1]" -R "select[gpu_model0==NVIDIAGeForceRTX2080Ti]" python k2t.py -file_name=data/shard0/keywords.txt -results_subfolder=finetunedgptmed_lr2e5_epoch2 -do_guarantee=True -n_generated_sentences=120
bsub -n 4 -W 23:59 -o euler_message -R "rusage[mem=4096, ngpus_excl_p=1]" -R "select[gpu_model0==NVIDIAGeForceRTX2080Ti]" python k2t.py -file_name=data/shard1/keywords.txt -results_subfolder=finetunedgptmed_lr2e5_epoch2 -do_guarantee=True -n_generated_sentences=120
bsub -n 4 -W 23:59 -o euler_message -R "rusage[mem=4096, ngpus_excl_p=1]" -R "select[gpu_model0==NVIDIAGeForceRTX2080Ti]" python k2t.py -file_name=data/shard2/keywords.txt -results_subfolder=finetunedgptmed_lr2e5_epoch2 -do_guarantee=True -n_generated_sentences=120
bsub -n 4 -W 23:59 -o euler_message -R "rusage[mem=4096, ngpus_excl_p=1]" -R "select[gpu_model0==NVIDIAGeForceRTX2080Ti]" python k2t.py -file_name=data/shard3/keywords.txt -results_subfolder=finetunedgptmed_lr2e5_epoch2 -do_guarantee=True -n_generated_sentences=120
bsub -n 4 -W 23:59 -o euler_message -R "rusage[mem=4096, ngpus_excl_p=1]" -R "select[gpu_model0==NVIDIAGeForceRTX2080Ti]" python k2t.py -file_name=data/shard4/keywords.txt -results_subfolder=finetunedgptmed_lr2e5_epoch2 -do_guarantee=True -n_generated_sentences=120
bsub -n 4 -W 23:59 -o euler_message -R "rusage[mem=4096, ngpus_excl_p=1]" -R "select[gpu_model0==NVIDIAGeForceRTX2080Ti]" python k2t.py -file_name=data/shard5/keywords.txt -results_subfolder=finetunedgptmed_lr2e5_epoch2 -do_guarantee=True -n_generated_sentences=120
bsub -n 4 -W 23:59 -o euler_message -R "rusage[mem=4096, ngpus_excl_p=1]" -R "select[gpu_model0==NVIDIAGeForceRTX2080Ti]" python k2t.py -file_name=data/shard6/keywords.txt -results_subfolder=finetunedgptmed_lr2e5_epoch2 -do_guarantee=True -n_generated_sentences=120
bsub -n 4 -W 23:59 -o euler_message -R "rusage[mem=4096, ngpus_excl_p=1]" -R "select[gpu_model0==NVIDIAGeForceRTX2080Ti]" python k2t.py -file_name=data/shard7/keywords.txt -results_subfolder=finetunedgptmed_lr2e5_epoch2 -do_guarantee=True -n_generated_sentences=120
bsub -n 4 -W 23:59 -o euler_message -R "rusage[mem=4096, ngpus_excl_p=1]" -R "select[gpu_model0==NVIDIAGeForceRTX2080Ti]" python k2t.py -file_name=data/shard8/keywords.txt -results_subfolder=finetunedgptmed_lr2e5_epoch2 -do_guarantee=True -n_generated_sentences=120
bsub -n 4 -W 23:59 -o euler_message -R "rusage[mem=4096, ngpus_excl_p=1]" -R "select[gpu_model0==NVIDIAGeForceRTX2080Ti]" python k2t.py -file_name=data/shard9/keywords.txt -results_subfolder=finetunedgptmed_lr2e5_epoch2 -do_guarantee=True -n_generated_sentences=120
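The ten submissions above differ only in the shard index, so they can also be generated in a loop. The following is a minimal sketch under that assumption: build_bsub_command and submit_all are hypothetical helpers, not part of the repository, and the flags are copied from the commands above.

```python
import shlex
import subprocess


def build_bsub_command(shard, results_subfolder="finetunedgptmed_lr2e5_epoch2",
                       n_generated_sentences=120):
    """Build the bsub argument list for one shard, mirroring the commands above."""
    return [
        "bsub", "-n", "4", "-W", "23:59", "-o", "euler_message",
        "-R", "rusage[mem=4096, ngpus_excl_p=1]",
        "-R", "select[gpu_model0==NVIDIAGeForceRTX2080Ti]",
        "python", "k2t.py",
        f"-file_name=data/shard{shard}/keywords.txt",
        f"-results_subfolder={results_subfolder}",
        "-do_guarantee=True",
        f"-n_generated_sentences={n_generated_sentences}",
    ]


def submit_all(total_shard=10, dry_run=True):
    """Print (dry_run=True) or submit (dry_run=False) one job per shard."""
    for shard in range(total_shard):
        cmd = build_bsub_command(shard)
        if dry_run:
            print(shlex.join(cmd))  # shell-quoted command, for inspection
        else:
            subprocess.run(cmd, check=True)
```

Running submit_all() prints the commands without submitting anything; submit_all(dry_run=False) submits them on a machine where bsub is available.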

Possible hyperparameters are as follows:

-file_name: path to the extracted keyword file of a shard. (currently, no change needed)
-results_subfolder: name of the subfolder under 'results/' where the generations are saved. (currently, no change needed)
-n_generated_sentences: sentence length.
-do_guarantee: whether to guarantee that the given keywords appear in the generated text.
-top_p: nucleus sampling parameter.
-weight: shift strength $\lambda_0$.
-det_BS: deterministic beam search.
-task: should not be changed.
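To give an intuition for -top_p: nucleus sampling keeps only the smallest set of most likely tokens whose cumulative probability reaches top_p, and samples from that set. The sketch below is a generic, standalone illustration of the idea; it is not the code used in k2t.py, and the function names are hypothetical.

```python
import random


def nucleus(probs, top_p):
    """Indices of the smallest set of most likely tokens whose total probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return kept


def sample_top_p(probs, top_p, rng=random):
    """Sample one token index from the nucleus, proportionally to its probability."""
    kept = nucleus(probs, top_p)
    weights = [probs[i] for i in kept]  # random.choices treats weights as relative
    return rng.choices(kept, weights=weights, k=1)[0]
```

With probabilities [0.5, 0.3, 0.15, 0.05] and top_p = 0.9, the nucleus contains the first three tokens, so the least likely token is never sampled. Lower values of top_p make generations more conservative.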

Step 5: Train Real vs. Fake BERT Classifier

Go to train_fake_classifier.py.

Change data_dir in line 15 to the directory where 'processed_df_valid.csv' is saved (e.g. '/cluster/scratch/{eth_username}/nlp_lss_datasets').

Change results_path in line 40 to the full path of your 'results/' folder (e.g. '/cluster/home/{eth_username}/conditioned_speech_gen/results').

Change folder_name in line 41 to the name you passed to '-results_subfolder' in Step 4 (e.g. 'finetunedgptmed_lr2e5_epoch2').

Change the experiment name in line 42 to the file name you find under 'results/folder_name/your_experiment_settingsSENTENCES.txt' (e.g. 'Result_w_5.0_nBeams_1_nGenSent_128_nWordsPerSent_1_topP_0.9_WC_glove_maxSENTENCES.txt').

Create a temporary .py file with the following content, run it only once, and then delete it (data_dir is the same as in line 15):

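import os
import pandas as pd

# Same directory as data_dir in line 15 of train_fake_classifier.py; replace {eth_username}.
data_dir = '/cluster/scratch/{eth_username}/nlp_lss_datasets'
# Write an empty CSV that contains only the two expected column names.
quality_fake_df = pd.DataFrame(columns=['speech', 'perplexity'])
quality_fake_df.to_csv(os.path.join(data_dir, 'quality_fake_df.csv'), index=False)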

This file is your high-quality synthetic dataset: it collects only the fake samples that fooled the BERT classifier across different experiment runs. Create it only once, before you start experimenting, because running the snippet again overwrites the file.
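To avoid overwriting the file by accident, the initialization can refuse to run when the file already exists. This is a minimal sketch of such a variant, not the repository's code: init_quality_fake_csv is a hypothetical helper, and it uses the standard-library csv module instead of pandas to write the same two-column header.

```python
import csv
import os


def init_quality_fake_csv(data_dir, filename="quality_fake_df.csv"):
    """Create an empty CSV with 'speech' and 'perplexity' columns unless it already exists.

    Returns True if the file was created and False if it was left untouched.
    """
    path = os.path.join(data_dir, filename)
    if os.path.exists(path):
        return False  # keep samples collected in earlier experiment runs
    with open(path, "w", newline="") as f:
        csv.writer(f).writerow(["speech", "perplexity"])
    return True
```

With this variant, the call is safe to repeat: a second call returns False and leaves previously collected samples in place.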

Then run on GPU with

bsub -n 4 -W 23:59 -o euler_message -R "rusage[mem=4096, ngpus_excl_p=1]" -R "select[gpu_model0==NVIDIAGeForceRTX2080Ti]" python train_fake_classifier.py

Step 6: Evaluate Compiled High Quality Dataset with BERT Classifier

Go to train_quality_evaluator.py.

Change data_dir in line 15 to the directory where 'processed_df_valid.csv' is saved (e.g. '/cluster/scratch/{eth_username}/nlp_lss_datasets'). If you have access to the 'processed_df_valid.csv' and 'quality_fake_df.csv' files, put them under this directory.

Then run on GPU with

bsub -n 4 -W 23:59 -o euler_message -R "rusage[mem=4096, ngpus_excl_p=1]" -R "select[gpu_model0==NVIDIAGeForceRTX2080Ti]" python train_quality_evaluator.py

Note: This repository borrows and modifies code from https://github.com/dapascual/K2T (utility_gpt.py, k2t.py, perplexity.py, encode_keywords.py).

Contributors

gozsoy
