This repository contains the code developed for the paper:
"Entropy in Legal Language" by Roland Friedrich, Mauro Luzzatto, Elliott Ash (2020), Proceedings of the 2020 Natural Legal Language Processing (NLLP) Workshop, 24 August 2020
The paper introduces a novel method to measure word ambiguity, i.e. local word entropy, in a corpus, based on a word2vec model. The code was developed to investigate word ambiguity in the written opinions of the U.S. Supreme Court (SCOTUS) and the German Bundesgerichtshof (BGH), representative courts of the common-law and civil-law systems, respectively.
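In brief, and in our notation rather than the paper's: for a center word $w$, the trained word2vec model yields a probability distribution over possible context words, and the local word entropy of $w$ is the Shannon entropy of that distribution,

$$
H(w) = -\sum_{c \in V} p(c \mid w)\,\log p(c \mid w),
$$

where $V$ is the vocabulary and $p(c \mid w)$ is the model's predicted probability of context word $c$ given center word $w$. The pipeline below computes exactly these quantities.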
Download the GitHub repository:

```
git clone https://github.com/MauroLuzzatto/legal-entropy
```
Run the makefile to install all Python modules needed to run the code:

```
make init
```
Or install the Python requirements and the spaCy language models manually:

```
pip install -r requirements.txt
python -m spacy download en_core_web_sm
python -m spacy download de_core_news_sm
```
After the installation, run the code as follows:

- Define the corpora to be processed in `corpus_setup.py`
- Define the corpora to be evaluated in `experiment_setup.py`
- Run `TextPreprocessing.py`
- Run `ModelTraining.py`
- Run `EntropyEvaluation.py`
- Run `EntropyVisualization.py`
The code is structured in five parts:
In the experiment setup, the relevant corpora are loaded and the type of experiment is defined.

- `corpus_setup.py`: defines the corpora that should be loaded and preprocessed
- `experiment_setup.py`: defines the experiments that should be conducted
- `config.ini`: defines the main path where the results should be saved (see the sketch below)
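A minimal sketch of how such a config file is typically consumed; the section and key names below are illustrative assumptions, not taken from the repository's `config.ini`:

```python
from configparser import ConfigParser

# Hypothetical sketch: read the results path from config.ini.
# The section/key names ("paths", "main_path") are illustrative only.
config = ConfigParser()
config.read("config.ini")
results_path = config["paths"]["main_path"]  # directory where results are saved
print(results_path)
```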
In a first step, the text is preprocessed and cleaned: the corpus is split into a set of cleaned (e.g. lowercased, lemmatized) sentences. This also includes the creation of bigrams and trigrams using gensim, as sketched after this list.

- `TextPreprocessing.py`: main class for the text preprocessing and cleaning
- `preprocessing.py`: contains helper functions for the data preparation
- `n_grams.json`: defines the threshold and min_count of words for the bigram and trigram creation
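A minimal sketch of bigram/trigram detection with gensim's `Phrases`; the `min_count` and `threshold` arguments correspond to the keys described for `n_grams.json`, but the toy sentences and values here are our own, not the repository's:

```python
from gensim.models.phrases import Phrases, Phraser

# Toy corpus: lists of already-tokenized, cleaned sentences
sentences = [
    ["the", "supreme", "court", "ruled"],
    ["the", "supreme", "court", "decided"],
    ["the", "supreme", "court", "held"],
]

# Detect frequent token pairs, then frequent pairs over the bigrammed corpus
bigram = Phraser(Phrases(sentences, min_count=1, threshold=1))
trigram = Phraser(Phrases(bigram[sentences], min_count=1, threshold=1))

# Tokens that co-occur often enough are joined with "_" into single tokens
print(trigram[bigram[sentences[0]]])
```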
After the text preprocessing, the word2vec model is trained using a defined set of hyperparameters.

- `ModelTraining.py`: main class for the word2vec model training; the hyperparameters are defined in the json file (see the sketch below)
- `hyperparemeters.json`: defines the hyperparameters for the word2vec model training
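A minimal sketch of training from a hyperparameter file; the keys shown in the comment are standard gensim 4.x parameters and may differ from the exact set used in the repository's json file:

```python
import json
from gensim.models import Word2Vec

# Load hyperparameters, e.g. {"vector_size": 100, "window": 5, "min_count": 1, "sg": 1}
with open("hyperparemeters.json") as f:
    params = json.load(f)

# Toy corpus stands in for the preprocessed sentences from the previous step
sentences = [["the", "court", "held"], ["the", "court", "ruled"]]
model = Word2Vec(sentences=sentences, **params)
model.save("word2vec.model")
```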
The trained word2vec models are used to calculate the conditional probability of each center word's context words. Based on this probability distribution, the entropy is calculated on the word level (local word entropy), as sketched below.

- `EntropyEvaluation.py`: main class used for the entropy calculation based on the predicted context word probabilities
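A sketch of the idea under our reading, not the exact code in `EntropyEvaluation.py`: score the center word's input vector against the model's output embeddings, softmax the scores to obtain $p(c \mid w)$, and take the Shannon entropy of the result:

```python
import numpy as np
from gensim.models import Word2Vec

model = Word2Vec.load("word2vec.model")

def local_word_entropy(model, word):
    v = model.wv[word]              # input vector of the center word
    scores = model.syn1neg @ v      # output-embedding scores (negative-sampling weights)
    p = np.exp(scores - scores.max())
    p /= p.sum()                    # softmax -> p(context | center)
    return float(-np.sum(p * np.log(p + 1e-12)))  # Shannon entropy in nats

print(local_word_entropy(model, "court"))
```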
Finally, the calculated word entropies are visualized on the corpus level, as sketched below.

- `Visualization.py`: functions for visualizing the results
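A hypothetical corpus-level view, assuming the pipeline yields one entropy value per word and corpus; the entropy lists below are placeholders, not results from the paper:

```python
import matplotlib.pyplot as plt

# Placeholder values standing in for the pipeline's per-word entropies
scotus_entropies = [4.1, 4.5, 3.9, 5.0, 4.3]
bgh_entropies = [3.8, 4.0, 4.2, 3.6, 3.9]

# Overlay the two entropy distributions for a corpus-level comparison
plt.hist(scotus_entropies, bins=10, alpha=0.5, label="SCOTUS")
plt.hist(bgh_entropies, bins=10, alpha=0.5, label="BGH")
plt.xlabel("local word entropy")
plt.ylabel("number of words")
plt.legend()
plt.show()
```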
- Mauro Luzzatto - Maurol