geb_text_class_2020

Fast, scalable, and automated identification of articles for biodiversity and macroecological datasets

Cornford et al., 2020


This directory contains the code and data needed to reproduce the analyses in the above-titled paper, along with demo code for quickly fitting and applying text-classification models, given csv files of relevant, irrelevant and unclassified texts.
Before running any code, create and activate the conda Python environment specified in "python_env.yml" using "$ conda env create -f python_env.yml" followed by "$ source activate python_env" or "$ conda activate python_env". To check that the environment installed successfully, run "$ conda list", which should list the installed packages/modules.

/demo
	/Code
	Here, you can find the code needed to train and save logistic regression and convolutional neural network classifiers (train_save_classifier_demo.py). You can also apply saved classifiers to unclassified texts (load_apply_classifer_demo.py).

	Training requires two input files, the relevant and the irrelevant texts, to be specified within the Python code file.
	Application requires one input file, the unclassified texts, to be specified within the Python code.
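
	Below is a minimal, illustrative sketch of the logistic-regression route, assuming csv files with "Title" and "Abstract" columns; the file names, paths and parameters are placeholders rather than the demo's actual settings, so consult train_save_classifier_demo.py for the real pipeline.

	# Sketch: train and save a logistic-regression text classifier
	# (file names and parameters are illustrative placeholders).
	import pandas as pd
	import joblib
	from sklearn.feature_extraction.text import TfidfVectorizer
	from sklearn.linear_model import LogisticRegression

	# Load labelled texts (csv files with "Title" and "Abstract" columns assumed)
	relevant = pd.read_csv("../Data/relevant.csv")
	irrelevant = pd.read_csv("../Data/irrelevant.csv")

	# Combine title and abstract into one text field per document
	def to_text(df):
	    return (df["Title"].fillna("") + " " + df["Abstract"].fillna("")).tolist()

	texts = to_text(relevant) + to_text(irrelevant)
	labels = [1] * len(relevant) + [0] * len(irrelevant)

	# Vectorize the texts and fit the classifier
	vectorizer = TfidfVectorizer(max_features=10000)
	clf = LogisticRegression(max_iter=1000).fit(vectorizer.fit_transform(texts), labels)

	# Save both model and vectorizer for later application
	joblib.dump(clf, "../Results/lr_model.joblib")
	joblib.dump(vectorizer, "../Results/lr_vectorizer.joblib")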

	/Data
	This contains the additional data required by the code files, including examples of positives, negatives and unclassifieds (in csv format, with columns including "Title" and "Abstract"), in addition to stop-words (NLTK_stop_words.txt). GloVe embeddings can be downloaded from http://nlp.stanford.edu/data/wordvecs/glove.6B.zip.
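
	GloVe files are plain text, one word per line followed by its vector values. A minimal sketch of loading them into a dictionary, assuming the 100-dimensional file from the archive above:

	# Sketch: load GloVe embeddings into a word -> vector dictionary
	import numpy as np

	embeddings = {}
	with open("glove.6B.100d.txt", encoding="utf-8") as f:
	    for line in f:
	        parts = line.rstrip().split(" ")
	        embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

	print(len(embeddings), "words loaded")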

	/Results
	Model files will be saved here, as will classified texts.
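
	As a rough illustration of the apply step (the real logic lives in load_apply_classifer_demo.py), assuming a model and vectorizer saved as in the training sketch above and an unclassified csv with the same columns:

	# Sketch: apply a saved classifier to unclassified texts
	# (paths and column names are illustrative placeholders).
	import pandas as pd
	import joblib

	clf = joblib.load("../Results/lr_model.joblib")
	vectorizer = joblib.load("../Results/lr_vectorizer.joblib")

	unclassified = pd.read_csv("../Data/unclassified.csv")
	texts = unclassified["Title"].fillna("") + " " + unclassified["Abstract"].fillna("")

	# Predicted probability that each text is relevant
	unclassified["p_relevant"] = clf.predict_proba(vectorizer.transform(texts))[:, 1]
	unclassified.to_csv("../Results/classified_texts.csv", index=False)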



/workflow
	/Code
	Using the code and associated data/results here, the analyses presented in our paper can be reproduced.

	The run_workflow.sh file outlines the overall workflow and, if run, will conduct all of the analyses (calling the code files below).

	Code for preparing texts (prep_all_texts.py), fitting all combinations of classifiers (train_classifiers.py, train_predcits_classifiers.py), and analysing classifier performance (model_analysis_lpi.R, model_analysis_predicts.R) is included (Methods 2.2).
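
	For orientation, the preparation step amounts to cleaning and tokenising each text before vectorization; the sketch below shows the general idea (lower-casing, keeping alphabetic tokens, removing stop words) and is not the exact procedure in prep_all_texts.py.

	# Sketch: basic text cleaning prior to vectorization
	import re

	# NLTK_stop_words.txt is the stop-word list shipped in /Data
	with open("NLTK_stop_words.txt") as f:
	    stop_words = set(w.strip() for w in f)

	def prep_text(text):
	    # Lower-case, keep alphabetic tokens, drop stop words
	    tokens = re.findall(r"[a-z]+", text.lower())
	    return " ".join(t for t in tokens if t not in stop_words)

	print(prep_text("The Effects of Land-use Change on Bird Populations"))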

	Real-world benefits of applying the best-performing text classifiers associated with the LPI and PREDICTS data are assessed in "search_analysis.R" (Methods 2.3).

	Code to test the consequences of altering the number of training documents used and the impact of including true negatives is provided (man_class_process.R, lr_npos_augment_fit.py, lr_npos_augment_anaylsis.R; Methods 2.4).

	We extract and visualise the feature weights of the best LR model with lr_feature_extraction.py and word_clouds.R (Methods 2.5).
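
	The idea behind the extraction step is simply to pair the fitted model's coefficients with the vectorizer's vocabulary; a hedged sketch follows (model/vectorizer file names are placeholders, and get_feature_names_out assumes scikit-learn >= 1.0).

	# Sketch: rank terms by their logistic-regression weights
	import numpy as np
	import joblib

	clf = joblib.load("lr_model.joblib")
	vectorizer = joblib.load("lr_vectorizer.joblib")

	terms = np.array(vectorizer.get_feature_names_out())
	order = np.argsort(clf.coef_[0])  # ascending: most negative weights first

	print("Most 'irrelevant' terms:", terms[order[:10]])
	print("Most 'relevant' terms:", terms[order[-10:]])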

	Additional analyses, as found in the Appendices to our paper, include determining the effect of using different sets of pseudo-negatives in the training data (resample_models.py, resamp_analysis.R, Appendix S2) and the potential consequences of the GloVe embeddings not being available for certain, potentially important, words (glove_text_coverage.py, lra_pred_glove_stop.R, Appendix S3).
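
	As a rough sketch of the coverage question glove_text_coverage.py addresses, one can count what fraction of a corpus's word types have a GloVe vector (the file name and tokenisation here are illustrative assumptions):

	# Sketch: fraction of a corpus's vocabulary covered by GloVe
	import re

	# The embedded word is the first token on each line of the GloVe file
	with open("glove.6B.100d.txt", encoding="utf-8") as f:
	    glove_vocab = set(line.split(" ", 1)[0] for line in f)

	def coverage(texts):
	    vocab = set()
	    for t in texts:
	        vocab.update(re.findall(r"[a-z]+", t.lower()))
	    return len(vocab & glove_vocab) / len(vocab)

	print(coverage(["Population trends in vertebrate taxa",
	                "Land-use impacts on biodiversity"]))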


	/Data
		/Text
		Consists of the texts needed to train the classifiers.
		Note: due to file size limits, the negative texts are provided in 3 separate csv files. These should be combined into a single file before use, e.g. with pd.concat([neg1, neg2, neg3]); see the sketch below.
		The GloVe embeddings are also omitted due to file size but can be found at: http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
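
		A minimal sketch of recombining the negatives (file names are illustrative):

		# Sketch: merge the split negative-text csv files into one
		import pandas as pd

		negs = [pd.read_csv(f) for f in ("neg1.csv", "neg2.csv", "neg3.csv")]
		pd.concat(negs, ignore_index=True).to_csv("negatives.csv", index=False)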
	
	Also found in the data directory are the NLTK stop words.	


	/Results
	All results files are those produced/used in our actual analysis.

	<dataset>_man_class... files are used/produced when analysing the effect of training-set size and of including true negatives on classifier performance.

		/Models
			/LR
			This directory stores fitted logistic classifiers and the associated text vectorizers.
			Those already in the directory correspond to the LR A models from our manuscript.

		/Model_metrics
			/LR
			Various results files relevant to fitting and assessing logistic models.
			
			/NN
			Results files relevant to fitting and assessing neural network models.
		
		/Scopus_lpi
		Files containing the manually classified texts associated with targeted literature searches.
		
		/WoK_predicts
		Files containing the manually classified texts associated with targeted literature searches.
		
		/Figs
		If the analysis code files are run (in the order specified in run_workflow.sh), this directory will contain the results figures used in our manuscript.
