This folder contains (i) the code for and (ii) the appendix to the paper 'Complexity/informativeness trade-off in the domain of indefinite pronouns' by Milica Denić, Shane Steinert-Threlkeld and Jakub Szymanik, to appear in Proceedings of Semantics and Linguistic Theory 2020.
The appendix file is Negative_indefinites_40_lang_data.pdf.
In the remainder, we describe the code.
- Python 2.7.14
- Required Python packages are in requirements.txt.
- R 3.5.2
- Required R packages are: datasets 3.5.2, dplyr 0.7.8, ggplot2 3.1.0, graphics 3.5.2, grDevices 3.5.2, methods 3.5.2, minpack.lm 1.2.1, plyr 1.8.4, stats 3.5.2, tidyr 0.8.2, utils 3.5.2, rlist 0.4.6, stringr 1.3.1
Python and R scripts are in the folder src. The CSV files that the scripts read and the CSV files that they generate are in the folder data.
-
- Description: It extracts the prior probability distribution over flavors from the annotated corpus of Beekhuizen et al.'s (2017) study and stores it in the file Beekhuizen_priors.csv.
- Dependencies: beekhuizen_full_set.csv
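As a rough illustration of what extracting a prior over flavors can look like, the sketch below turns a list of flavor annotations into relative frequencies. It assumes that the prior is simply the normalized count of each flavor. The labels are made up for the example, and the column layout of beekhuizen_full_set.csv is not described here, so reading the file is left out.

```python
from collections import Counter


def flavor_priors(flavor_labels):
    """Return the relative frequency of each flavor label.

    Assumption: the prior is the normalized count of each flavor.
    """
    counts = Counter(flavor_labels)
    total = sum(counts.values())
    # float() keeps the division exact under Python 2 as well as Python 3.
    return {flavor: float(n) / total for flavor, n in counts.items()}


# Hypothetical annotations, not taken from the actual corpus.
labels = ["specific known", "negative", "negative", "free choice"]
priors = flavor_priors(labels)
```

The resulting dictionary maps each flavor to a probability, and the probabilities sum to 1.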
-
- Description: It generates the minimum-length feature-based descriptions of all logically possible indefinite pronouns (in terms of which combination of flavors they can express) and stores them in minimum-desc-indef.csv file.
-
- Description: Definitions of a series of useful functions for Experiments 1 and 2.
- Dependencies: Beekhuizen_priors.csv
-
- Description: It imports the data file with Haspelmath's 40 natural languages, generates 10 000 artificial languages used in Experiment 1, computes the communicative cost and complexity of these languages and stores them in all_complexity_cost_exp1.csv. Finally, it performs synonymy matching and stores the matched languages in syn_matched_exp1.csv.
- Dependencies: Indefinites_functions.R, languages_real_40_updated.csv, minimum-desc-indef.csv
-
- Description: It generates 10 000 artificial languages used in Experiment 2 (5000 Haspel-ok and 5000 Not Haspel-ok languages), computes the communicative cost and complexity of these languages and stores them in all_complexity_cost_exp2.csv. Finally, it performs synonymy matching and stores the matched languages in syn_matched_exp2.csv.
- Dependencies: Indefinites_functions.R, minimum-desc-indef.csv
-
- Description: It runs an evolutionary algorithm selecting for Pareto optimal languages for 100 generations. It stores the complexity and communicative cost measures of the final generation in finalgencostcom.csv. Finally, it selects dominant languages in terms of complexity and communicative cost from the final generation and the languages used in Experiment 1 and stores them in pareto_dominant.csv.
- Dependencies: all_complexity_cost_exp1.csv, minimum-desc-indef.csv, allitems.csv (a file with all logically possible items), Beekhuizen_priors.csv
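The selection of dominant languages described above can be pictured with the following minimal sketch. It assumes each language is reduced to a (complexity, communicative cost) pair and that lower is better on both measures. The function names and the example values are hypothetical and are not taken from the scripts in src.

```python
def dominates(a, b):
    """True if a is at least as good as b on both measures and strictly
    better on at least one. Points are (complexity, cost); lower is better."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def pareto_dominant(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (complexity, cost) values for four languages.
langs = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0)]
front = pareto_dominant(langs)
```

In this example the language at (3.0, 4.0) is dropped because (2.0, 3.0) is better on both measures. The pairwise comparison is quadratic in the number of languages, which is acceptable for a sketch but may be slow for large populations.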
-
- Description: It imports the data file with dominant languages in terms of complexity and communicative cost and, based on them, estimates the Pareto frontier for indefinite pronouns. It plots the languages of Experiment 1 and the languages of Experiment 2 with respect to the frontier. It computes the minimum Euclidean distances of the languages of Experiments 1 and 2 from the Pareto frontier, and stores them in natural_distances_pareto.csv, artificial_distances_pareto.csv, Haspok_distances_pareto.csv and Haspnotok_distances_pareto.csv. Finally, it establishes that (i) natural languages are closer to the frontier than artificial languages; and (ii) languages which satisfy Haspelmath's universals are closer to the frontier than languages which do not satisfy them.
- Dependencies: Indefinites_functions.R, syn_matched_exp1.csv, syn_matched_exp2.csv, pareto_dominant.csv
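The distance computation mentioned above can be sketched as follows. The sketch is written in Python for illustration only and assumes that the estimated frontier is available as a finite list of sampled (complexity, cost) points. How the frontier is actually estimated and sampled in the analysis script is not specified here, and all names and values below are hypothetical.

```python
import math


def min_distance_to_frontier(point, frontier_points):
    """Smallest Euclidean distance from one language to any sampled
    frontier point. Both arguments use (complexity, cost) coordinates."""
    return min(
        math.hypot(point[0] - fx, point[1] - fy)
        for fx, fy in frontier_points
    )


# Hypothetical frontier samples and one hypothetical language.
frontier = [(0.0, 0.0), (3.0, 1.0)]
language = (3.0, 4.0)
distance = min_distance_to_frontier(language, frontier)
```

With a densely sampled frontier this approximates the distance to the underlying curve; a coarse sample would overestimate it.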