Configuration and code accompanying Detecting Persuasion with spaCy.
Persuasion techniques are shortcuts in the argumentation process, for example appealing to the audience's emotions or using logical fallacies to influence it. This project creates a spaCy pipeline with a `SpanCategorizer` to detect and classify the spans of a text in which persuasion techniques are used.
Notes:
- No pre-processing is performed on the data other than conversion to `spacy` binary format.
- Default configuration files are used for the small, large, and transformer models.
- After each model is trained and evaluated, the created model is removed. For the purpose of the associated article, we are interested in the metrics, not the models themselves.
- For the article describing this project, the `suggester` configuration was changed manually to switch between a maximum of 16-grams and a maximum of 32-grams.
- The evaluation output of the different trained models (JSON format) is processed by `report.py` to allow for comparison.
- A GPU is used only for the transformer models.
- A `suggester` configuration with a maximum of 32-grams and a transformer model runs out of memory on an 8 GB GPU. In the configuration provided here, batch sizes are tweaked to make it run, at a cost of some twenty percent in accuracy.
- On a 6-core CPU, a 32-grams configuration with a transformer model took some 14 hours to run.
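To give a feel for why the larger n-gram limit is so much heavier, the sketch below counts the candidate spans an n-gram suggester would propose for a single document. It assumes the suggester proposes every contiguous token span from 1 token up to the maximum size; the function name and the 100-token document length are illustrative only and are not taken from the project. The number of candidates is probably only one of several factors behind the memory and runtime figures above.

```python
def count_ngram_candidates(doc_length: int, max_size: int, min_size: int = 1) -> int:
    """Count contiguous token spans whose length is between min_size and max_size."""
    total = 0
    for size in range(min_size, max_size + 1):
        # A document of doc_length tokens contains doc_length - size + 1 spans
        # of exactly this size (or none, if the document is shorter than size).
        total += max(doc_length - size + 1, 0)
    return total


# Illustrative document of 100 tokens (an assumption, not a dataset statistic).
for max_size in (16, 32):
    print(max_size, count_ngram_candidates(doc_length=100, max_size=max_size))
# 16 -> 1480 candidate spans, 32 -> 2704 candidate spans
```

Every candidate span has to be scored by the model, so moving from 16-grams to 32-grams nearly doubles the work per 100-token document in this simplified count.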
Python code is used to:
- create the corpus in `spacy` format from the original dataset.
- extract data from the generated metrics files for reporting.
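The reporting step can be pictured roughly as follows. This is a minimal sketch and not the project's `report.py`: the metric keys (`spans_sc_p`, `spans_sc_r`, `spans_sc_f`), the model labels, and the numeric values are assumptions made up for illustration, and the real metrics files may be structured differently.

```python
import csv
import io
import json

# Hypothetical metric keys; the names in the real metrics files may differ.
FIELDS = ["spans_sc_p", "spans_sc_r", "spans_sc_f"]


def metrics_to_rows(named_metrics: dict) -> list:
    """Turn {model label: JSON string} into one flat row per model."""
    rows = []
    for model_name, raw_json in named_metrics.items():
        metrics = json.loads(raw_json)
        row = {"model": model_name}
        for field in FIELDS:
            row[field] = metrics.get(field)  # None if the key is missing
        rows.append(row)
    return rows


def rows_to_csv(rows: list) -> str:
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["model"] + FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder values for illustration only; these are not results from the project.
example = {
    "sm_16": '{"spans_sc_p": 0.5, "spans_sc_r": 0.25, "spans_sc_f": 0.33}',
    "sm_32": '{"spans_sc_p": 0.4, "spans_sc_r": 0.3}',
}
print(rows_to_csv(metrics_to_rows(example)))
```

Collecting one row per model and configuration keeps the CSV easy to load into a notebook for side-by-side comparison.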
The `project.yml` defines the data assets required by the project, as well as the available commands and workflows. For details, see the spaCy projects documentation.
The following commands are defined by the project. They can be executed using `spacy project run [name]`. Commands are only re-run if their inputs have changed.
| Command | Description |
| --- | --- |
| `corpus` | Convert the data to spaCy's format |
| `train_sm` | Train and evaluate 'sm' model for 16-grams and 32-grams configurations |
| `train_lg` | Train and evaluate 'lg' model for 16-grams and 32-grams configurations |
| `train_trf` | Train and evaluate 'trf' model for 16-grams and 32-grams configurations |
| `report` | Convert metrics of the different trained models to CSV for reading into notebook |
| `clean` | Remove intermediate files |
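The rule that commands are only re-run when their inputs have changed can be illustrated with a small checksum comparison. This is a simplified sketch of the general idea, not spaCy's actual implementation; the function names, the in-memory `lock` dictionary, and the file name are all hypothetical.

```python
import hashlib


def checksum(content: bytes) -> str:
    """Return a hex digest that changes whenever the content changes."""
    return hashlib.sha256(content).hexdigest()


def needs_rerun(command: str, inputs: dict, lock: dict) -> bool:
    """Return True if the command's inputs differ from the recorded state."""
    current = {name: checksum(content) for name, content in inputs.items()}
    return lock.get(command) != current


def record_run(command: str, inputs: dict, lock: dict) -> None:
    """Store the input checksums after a successful run."""
    lock[command] = {name: checksum(content) for name, content in inputs.items()}


lock = {}
inputs = {"corpus/train.spacy": b"first version"}  # hypothetical file name and content

print(needs_rerun("train_sm", inputs, lock))  # True: the command has never run
record_run("train_sm", inputs, lock)
print(needs_rerun("train_sm", inputs, lock))  # False: inputs are unchanged
inputs["corpus/train.spacy"] = b"second version"
print(needs_rerun("train_sm", inputs, lock))  # True: an input has changed
```

In this picture, re-running `corpus` with new data would change the checksum of the training file and so trigger the training commands again on the next run.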
The following workflows are defined by the project. They can be executed using `spacy project run [name]` and will run the specified commands in order. Commands are only re-run if their inputs have changed.
| Workflow | Steps |
| --- | --- |
| `all` | `corpus` → `train_sm` → `train_lg` → `train_trf` → `report` |
The following assets are defined by the project. They can be fetched by running `spacy project assets` in the project directory.
| File | Source | Description |
| --- | --- | --- |
| `assets` | Git | Dev dataset from SemEval2021 Task-6 'Detection of Persuasive Techniques in Texts and Images' |