Code and data for "A Multifactorial Approach to Constituent Orderings"
CoNLL-U files
Gold-standard: Universal Dependencies
Larger files: CoNLL 2017 Shared Task - Automatically Annotated Raw Texts and Word Embeddings
Word embeddings
python3 code/ud_pp.py --input PATH_TO_UD_DATA --output OUTPUT_PATH
For each language, the code above generates (1) Language_pp.csv, (2) Language_words.txt, and (3) Language_tuples.txt.
./word_count.sh
(Modify the directory inside the shell script as needed.)
Run this for each language; it generates a Language_wc file.
python3 code/freq.py --path PATH_TO_Language_words --language FULL_LANGUAGE_NAME --code LANGUAGE_CODE
E.g. python3 code/freq.py --path data/ --language English --code en
python3 code/hd.py --input PATH_TO_LARGER_FILES --output OUTPUT_PATH --language FULL_LANGUAGE_NAME (e.g. English)
Run this for each language; it generates Language_pairs_all.txt.
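Since hd.py has to be run once per language, a small driver can save typing. The following is a minimal sketch, not part of the repository: the language list and both paths are hypothetical placeholders, and only the flags shown in the command above are used. It prints the commands by default and executes them only when dry_run is set to False.

```python
import subprocess

# Hypothetical values: adjust the language list and both paths to your setup.
LANGUAGES = ["English", "German"]
INPUT_PATH = "PATH_TO_LARGER_FILES"
OUTPUT_PATH = "OUTPUT_PATH"


def build_hd_command(language, input_path=INPUT_PATH, output_path=OUTPUT_PATH):
    """Return the argument list for one hd.py run, using only the documented flags."""
    return [
        "python3", "code/hd.py",
        "--input", input_path,
        "--output", output_path,
        "--language", language,
    ]


def run_all(languages=LANGUAGES, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for language in languages:
        command = build_hd_command(language)
        print(" ".join(command))
        if not dry_run:
            # check=True stops the loop if hd.py exits with an error.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all()
```

If the larger files live in a separate directory per language, pass a different input_path to build_hd_command for each one.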
cat PAIR_FILE | sort | uniq -c | sort -rn > OUTPUT_FILE
E.g. cat English_pairs_all.txt | sort | uniq -c | sort -rn > English_jc
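For readers less familiar with the shell pipeline above, it counts how often each identical line occurs and lists the most frequent lines first. A rough Python equivalent is sketched below; the function names are illustrative and not part of the repository, and the order of lines with equal counts may differ from what sort -rn produces.

```python
from collections import Counter


def count_pairs(lines):
    """Count identical lines and return (count, line) tuples, most frequent first."""
    counts = Counter(line.rstrip("\n") for line in lines)
    # Sort by descending count; break ties alphabetically for a stable result.
    return sorted(
        ((n, pair) for pair, n in counts.items()),
        key=lambda item: (-item[0], item[1]),
    )


def write_counts(pair_file, output_file):
    """Write 'count line' rows, similar in layout to the output of uniq -c."""
    with open(pair_file, encoding="utf-8") as src, \
            open(output_file, "w", encoding="utf-8") as dst:
        for n, pair in count_pairs(src):
            dst.write(f"{n:7d} {pair}\n")


if __name__ == "__main__":
    print(count_pairs(["a b\n", "c d\n", "a b\n"]))  # [(2, 'a b'), (1, 'c d')]
```

Calling write_counts("English_pairs_all.txt", "English_jc") would then stand in for the English example above.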
Again, taking English as an example:
join -j 1 <(sort English_words.txt) <(sort cc.en.300.vec) > English_em
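The join command keeps only those embedding lines whose first field (the word) also appears in the word list. The sketch below shows the same idea in Python without the sorting requirement. It assumes that English_words.txt holds one word per line and that the .vec file is space-separated with the word first; the function name is illustrative. Unlike join, it preserves the order of the .vec file and emits each matching line once even if a word is repeated in the word list.

```python
def filter_embeddings(words, vec_lines):
    """Yield embedding lines whose first field is in the word list."""
    vocabulary = {w.strip() for w in words if w.strip()}
    for line in vec_lines:
        fields = line.rstrip().split(" ")
        # The .vec header line ("<count> <dim>") is skipped unless its
        # first field happens to be in the vocabulary.
        if fields[0] in vocabulary:
            yield " ".join(fields)


def write_filtered(words_file, vec_file, output_file):
    """Stream the .vec file and write only the rows for known words."""
    with open(words_file, encoding="utf-8") as wf:
        words = wf.readlines()
    with open(vec_file, encoding="utf-8") as vf, \
            open(output_file, "w", encoding="utf-8") as out:
        for row in filter_embeddings(words, vf):
            out.write(row + "\n")
```

For the English example, this would be called as write_filtered("English_words.txt", "cc.en.300.vec", "English_em").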
Follow Gulordava et al. (2018)
python3 code/context.py --data PATH_TO_TRAIN/DEV/TEST --model MODEL_NAME --pp PATH_TO_Language_pp.csv --language FULL_LANGUAGE_NAME
E.g. python3 code/context.py --data model/ --model en.pt --pp data/ --language English
python3 code/factors.py --pp PATH_TO_Language_pp.csv --em PATH_TO_fastText_embeddings --regress OUTPUT_PATH_TO_Regression_Data --language FULL_LANGUAGE_NAME
E.g. python3 code/factors.py --pp data/ --em data/cc.en.300.vec --language English
This generates Language_regression.csv for each language.
See code/analysis.R for details.